Transcript pps

Slide 1

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NP-complete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell and I. Witten, Morgan Kaufmann Publishers, 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course

It is a mix of algorithms for:
  • data compression
  • data indexing
  • data streaming (and sketching)
  • data searching
  • data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big DATA ⇒ Big PC ?

We have three types of algorithms:

  T1(n) = n,   T2(n) = n^2,   T3(n) = 2^n

... and assume that 1 step = 1 time unit

How many input data n each algorithm may process within t time units?

  n1 = t,   n2 = √t,   n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?

  n1 = k * t,   n2 = √k * √t,   n3 = log2 (kt) = log2 k + log2 t

A new scenario
  • Data are more available than ever before

  n ➜ ∞  ... is more than a theoretical assumption

  • The RAM model is too simple
      Step cost is Ω(1) time
      Not just MIN #steps…

[Figure: the memory hierarchy — CPU registers, L1/L2 cache, RAM, HD, net]

  Cache:  few MBs,   some nanosecs,     few words fetched
  RAM:    few GBs,   tens of nanosecs,  some words fetched
  Disk:   few TBs,   few millisecs,     B = 32K page
  Net:    many TBs,  even secs,         packets

You should be “??-aware programmers”

I/O-conscious Algorithms
[Figure: a hard disk — track, read/write head, read/write arm, magnetic surface]

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses  [0.3–0.4 (Hennessy-Patterson)]
C = cost of an I/O  [10^5 – 10^6 (Hennessy-Patterson)]

If N = (1+f)M, then the D-avg cost per step is:
  C * p * f/(1+f)
This is at least 10^4 * f/(1+f)

If we fetch B ≈ 4KB in time C, and the algorithm uses all of them:
  (1/B) * (p * f/(1+f) * C)  ≈  30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
[Figure: a hard disk — track, read/write head, read/write arm, magnetic surface]

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
[Figure: the same memory hierarchy — CPU registers, L1/L2 cache (few MBs, some nanosecs),
 RAM (few GBs, tens of nanosecs), HD (few TBs, few millisecs, B = 32K page),
 net (many TBs, even secs, packets)]

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
Running time for increasing input size n:

  n      4K    8K    16K    32K    128K   256K   512K   1M
  n^3    22s   3m    26m    3.5h   28h    --     --     --
  n^2    0     0     0      1s     26s    106s   7m     28m

An optimal solution
We assume every subsum≠0

[Figure: right before the optimum window the running sums stay < 0; within the optimum
 window they stay > 0]

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm

  sum = 0; max = -1;
  For i = 1,...,n do
    If (sum + A[i] ≤ 0) then sum = 0;
    else { sum += A[i]; max = MAX{max, sum}; }

Note:
  • sum < 0 right before OPT starts;
  • sum > 0 within OPT
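In Python, a minimal sketch of this scan (the function name is just for illustration):

  def max_subarray_sum(A):
      # one left-to-right scan, as in the algorithm above
      total, best = 0, -1
      for x in A:
          if total + x <= 0:
              total = 0            # the current window cannot profitably start earlier
          else:
              total += x
              best = max(best, total)
      return best

  print(max_subarray_sum([2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]))   # 12 (= 6+1-2+4+3)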

Toy problem #2 : sorting


How to sort tuples (objects) on disk

[Figure: memory containing the tuples, and an array A of pointers to them]

Key observation:
  • Array A is an "array of pointers to objects"
  • For each object-to-object comparison A[i] vs A[j]:
      2 random accesses to memory locations A[i] and A[j]

  MergeSort ⇒ Θ(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB
  • n insertions ⇒ data get distributed arbitrarily !!!

[Figure: B-tree internal nodes, B-tree leaves ("tuple pointers"), and the tuples on disk]

What about listing the tuples in order?

Possibly 10^9 random I/Os = 10^9 * 5ms ≈ 2 months

Binary Merge-Sort

Merge-Sort(A,i,j)
01  if (i < j) then
02      m = (i+j)/2;              // Divide
03      Merge-Sort(A,i,m);        // Conquer
04      Merge-Sort(A,m+1,j);
05      Merge(A,i,m,j)            // Combine

Cost of Mergesort on large data

Take Wikipedia in Italian, compute word frequencies:
  • n = 10^9 tuples ⇒ a few GBs
  • Typical disk (Seagate Cheetah 150GB): seek time ~5ms

Analysis of mergesort on disk:
  • It is an indirect sort: Θ(n log2 n) random I/Os
  • [5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...
2 passes (R/W)

Merge-Sort Recursion Tree

[Figure: the recursion tree of binary Merge-Sort on a small integer array, with
 log2 N levels of merging]

If the run-size is larger than B (i.e. after the first step!!), fetching all of it
in memory for merging does not help.

How do we deploy the disk/mem features?

With internal memory M: N/M runs, each sorted in internal memory (no I/Os)
— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort

The key is to balance run-size and #runs to merge.
Sort N items with main-memory M and disk-pages B:
  • Pass 1: produce (N/M) sorted runs.
  • Pass i: merge X ≤ M/B runs  ⇒  log_{M/B} (N/M) passes

[Figure: X input buffers and one output buffer of B items in main memory, streaming
 the runs from disk and writing the merged run back to disk]

Multiway Merging

[Figure: X = M/B input buffers Bf1,...,BfX (one per run, each with a pointer p1,...,pX
 to its current item) and one output buffer Bfo]

  • At each step output min(Bf1[p1], Bf2[p2], …, BfX[pX]) to Bfo
  • Fetch the next page of run i when pi = B (its buffer is exhausted)
  • Flush Bfo to the merged run (out file) when it is full, until EOF
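A small Python sketch of the min-selection at the core of multiway merging — here the runs
are plain in-memory lists and heapq stands in for the buffer machinery, so this only
illustrates the merging logic, not the I/O:

  import heapq

  def multiway_merge(runs):
      # runs: list of already-sorted sequences (stand-ins for the sorted disk runs)
      heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
      heapq.heapify(heap)
      out = []                       # stand-in for the output buffer / merged run
      while heap:
          x, i, j = heapq.heappop(heap)
          out.append(x)
          if j + 1 < len(runs[i]):   # fetch the next item of run i
              heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
      return out

  print(multiway_merge([[1, 5, 13, 19], [7, 9], [4, 15], [3, 8, 12, 17], [6, 11]]))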

Cost of Multi-way Merge-Sort

  • Number of passes = log_{M/B} #runs ≤ log_{M/B} (N/M)
  • Optimal cost = Θ((N/B) log_{M/B} (N/M)) I/Os

In practice
  • M/B ≈ 1000  ⇒  #passes = log_{M/B} (N/M) ≈ 1
  • One multiway merge ⇒ 2 passes = few mins
    (tuning depends on disk features)

⇒ Large fan-out (M/B) decreases #passes
⇒ Compression would decrease the cost of a pass!

Can compression help?

Goal: enlarge M and reduce N
  • #passes = O(log_{M/B} N/M)
  • Cost of a pass = O(N/B)

Part of Vitter's paper addresses related issues:
  • Disk Striping: sorting easily on D disks
  • Distribution sort: top-down sorting
  • Lower Bounds: how far we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm

  • Use a pair of variables <X,C>
  • For each item s of the stream,
      if (X==s) then C++
      else { C--; if (C==0) { X=s; C=1; } }
  • Return X;
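In Python, a sketch of the standard variant of this two-variable scan (Boyer–Moore majority
vote); it returns a correct answer only when some item really occurs more than N/2 times:

  def majority_candidate(stream):
      # one pass, two variables: candidate X and counter C
      X, C = None, 0
      for s in stream:
          if C == 0:
              X, C = s, 1
          elif X == s:
              C += 1
          else:
              C -= 1
      return X

  print(majority_candidate("bacccdcbaaaccbccc"))   # 'c'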

Proof

If X≠y at the end, then every one of y's occurrences has a "negative" mate.
Hence these mates should be ≥ #occ(y).
But then 2 * #occ(y) ≤ N, contradicting #occ(y) > N/2.
(Problems arise if the mode occurs ≤ N/2 times.)

Toy problem #4: Indexing


Consider the following TREC collection:
  • N = 6 * 10^9 chars  ⇒  size = 6GB
  • n = 10^6 documents
  • TotT = 10^9 total terms (avg term length is 6 chars)
  • t = 5 * 10^5 distinct terms

What kind of data structure do we build to support word-based searches?

Solution 1: Term-Doc matrix
t = 500K terms, n = 1 million documents  (entry = 1 if the play contains the word, 0 otherwise)

              Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
  Antony             1                1             0          0       0        1
  Brutus             1                1             0          1       0        0
  Caesar             1                1             0          1       1        1
  Calpurnia          0                1             0          0       0        0
  Cleopatra          1                0             0          0       0        0
  mercy              1                0             1          1       1        1
  worser             1                0             1          1       1        0

Space is 500Gb !

Solution 2: Inverted index

  Brutus     →  2 4 8 16 32 64 128
  Calpurnia  →  2 3 5 8 13 21 34
  Caesar     →  13 16

We can still do better: i.e. 30–50% of the original text

1. Typically use about 12 bytes per posting
2. We have 10^9 total terms ⇒ at least 12Gb space
3. Compressing the 6Gb documents gets 1.5Gb data
A better index, but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (losslessly) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM into fewer bits?
NO: they are 2^n, but we have fewer compressed messages:

  ∑_{i=1}^{n-1} 2^i  =  2^n - 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the self-information of s is:

  i(s) = log2 (1/p(s)) = -log2 p(s)

Lower probability ⇒ higher information

Entropy is the weighted average of i(s):

  H(S) = ∑_{s∈S} p(s) * log2 (1/p(s))   bits
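As a quick sanity check, a few lines of Python computing self-information and entropy for a
made-up distribution (the same one used later in the Huffman running example):

  import math

  def self_info(p):
      return -math.log2(p)                          # i(s) = -log2 p(s)

  def entropy(probs):
      return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

  probs = [0.1, 0.2, 0.2, 0.5]
  print([round(self_info(p), 3) for p in probs])    # [3.322, 2.322, 2.322, 1.0]
  print(round(entropy(probs), 3))                   # 1.761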

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into its codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie:

  [Figure: binary trie with leaves a (0), b (100), c (101), d (11)]

Average Length
For a code C with codeword lengths L[s], the average length is defined as

  La(C) = ∑_{s∈S} p(s) * L[s]

We say that a prefix code C is optimal if for all prefix codes C',  La(C) ≤ La(C')

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal uniquely decodable code, there exists a prefix
code with the same codeword lengths, and thus the same (optimal) average length. And vice versa…

Theorem (golden rule). If C is an optimal prefix code for a source with probabilities {p1, …, pn}
then  pi < pj  ⇒  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any
uniquely decodable code C, we have

  H(S) ≤ La(C)

Theorem (upper bound, Shannon). For any probability distribution, there exists
a prefix code C such that

  La(C) ≤ H(S) + 1

(The Shannon code assigns to s a codeword of ⌈log2 1/p(s)⌉ bits.)

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

[Figure: Huffman tree — a(.1) and b(.2) merge into (.3); (.3) and c(.2) merge into (.5);
 (.5) and d(.5) merge into the root (1)]

a=000, b=001, c=01, d=1

There are 2^(n-1) "equivalent" Huffman trees

What about ties (and thus, tree depth) ?
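A compact Python sketch of the construction on this example; heapq repeatedly extracts the
two least-probable subtrees, and ties are broken arbitrarily, so the result is one of the
equivalent optimal trees (same codeword lengths as the slide's code):

  import heapq, itertools

  def huffman_codes(probs):
      # probs: dict symbol -> probability; returns dict symbol -> codeword (bit string)
      counter = itertools.count()                   # tie-breaker for equal probabilities
      heap = [(p, next(counter), {s: ""}) for s, p in probs.items()]
      heapq.heapify(heap)
      while len(heap) > 1:
          p1, _, c1 = heapq.heappop(heap)           # two least-probable subtrees
          p2, _, c2 = heapq.heappop(heap)
          merged = {s: "0" + w for s, w in c1.items()}
          merged.update({s: "1" + w for s, w in c2.items()})
          heapq.heappush(heap, (p1 + p2, next(counter), merged))
      return heap[0][2]

  print(huffman_codes({"a": .1, "b": .2, "c": .2, "d": .5}))
  # {'d': '0', 'c': '10', 'a': '110', 'b': '111'} — same lengths as a=000, b=001, c=01, d=1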

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to the symbol to be encoded.
Decoding: Start at the root and take the branch for each bit received.
When at a leaf, output its symbol and return to the root.

  abc...     ⇒  00000101 ...
  101001...  ⇒  dcb ...

[Figure: the Huffman tree above, used for both encoding and decoding]

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store, for any level L:
  • firstcode[L]   (= 00.....0 for the deepest level)
  • Symbol[L,i], for each i in level L

This is ≤ h^2 + |S| log |S| bits

Canonical Huffman: Encoding
[Figure: canonical codeword assignment over levels 1–5]

Canonical Huffman: Decoding
  firstcode[1] = 2
  firstcode[2] = 1
  firstcode[3] = 1
  firstcode[4] = 2
  firstcode[5] = 0

  T = ...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. Its self-information is

  -log2(.999) ≈ .00144 bits

If we were to send 1000 such symbols we might hope to use 1000 * .00144 ≈ 1.44 bits.
Using Huffman, we take at least 1 bit per symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
  ⇒ 1 extra bit per macro-symbol = 1/k extra bits per symbol
  ⇒ Larger model to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:
  • The model takes |S|^k * (k * log |S|) + h^2 bits   (where h might be |S|)
  • It is H0(S^L) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
[Figure: the word-based Huffman tree (fan-out 128) for T = "bzip or not bzip";
 leaves are the words "bzip", "or", "not" and the space. Each codeword is a sequence
 of bytes carrying 7 bits of the Huffman code, and the tagging bit marks the first
 byte of every codeword in C(T) (byte-aligned codewords).]

CGrep and other ideas...

P = bzip = 1a 0b

[Figure: GREP is run directly over C(T), T = "bzip or not bzip": the (tagged, byte-aligned)
 codeword of P is compared against the codewords of C(T), answering yes/no at each one]

Speed ≈ Compression ratio

You find this under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary = { bzip, not, or, space }

P = bzip = 1a 0b

[Figure: the codeword of P is searched directly in C(S), S = "bzip or not bzip";
 every byte-aligned codeword of C(S) is compared against it (yes/no)]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
[Figure: a text T and a pattern P, with the occurrences of P in T highlighted]

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which arithmetic and bit-operations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errs, whose running
time is efficient with high probability.

We will consider a binary alphabet (i.e., T ∈ {0,1}^n).

Arithmetic replaces Comparisons

Strings are also numbers, H: strings → numbers.
Let s be a string of length m:

  H(s) = ∑_{i=1}^{m} 2^{m-i} * s[i]

Example: P = 0101
  H(P) = 2^3*0 + 2^2*1 + 2^1*0 + 2^0*1 = 5

s = s'  if and only if  H(s) = H(s')

Definition:
let Tr denote the m-length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons

We can compute H(Tr) from H(Tr-1):

  H(Tr) = 2 * H(Tr-1) - 2^m * T[r-1] + T[r+m-1]

T = 10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

  H(T1) = H(1011) = 11
  H(T2) = H(0110) = 2*11 - 2^4*1 + 0 = 22 - 16 + 0 = 6 = H(0110)

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P = 101111,  q = 7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally, reducing mod q at each step:
  1
  1*2 (mod 7) + 0  =  2
  2*2 (mod 7) + 1  =  5
  5*2 (mod 7) + 1  =  4
  4*2 (mod 7) + 1  =  2
  2*2 (mod 7) + 1  =  5  =  Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1), since
  2^m (mod q) = 2 * (2^{m-1} (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm

  • Choose a positive integer I
  • Pick a random prime q less than or equal to I, and compute P's fingerprint Hq(P)
  • For each position r in T, compute Hq(Tr) and test whether it equals Hq(P).
    If the numbers are equal, either
      - declare a probable match (randomized algorithm), or
      - check and declare a definite match (deterministic algorithm)

  • Running time: excluding verification, O(n+m)
  • The randomized algorithm is correct w.h.p.
  • The deterministic algorithm has expected running time O(n+m)

Proof on the board
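A minimal Python sketch of the fingerprint scan on a binary text; here q is a fixed large
prime rather than a random one, and every fingerprint hit is verified, so it behaves like
the deterministic variant:

  def karp_rabin(T, P, q=2**31 - 1):
      # T, P: strings over {0,1}; returns the 1-based starting positions of occurrences
      n, m = len(T), len(P)
      if m > n:
          return []
      hp = ht = 0
      for i in range(m):                       # Hq(P) and Hq(T1) by Horner's rule
          hp = (2 * hp + int(P[i])) % q
          ht = (2 * ht + int(T[i])) % q
      top = pow(2, m - 1, q)                   # 2^(m-1) mod q, for the rolling update
      occ = []
      for r in range(n - m + 1):
          if hp == ht and T[r:r + m] == P:     # verify to rule out false matches
              occ.append(r + 1)
          if r + m < n:                        # Hq(T_{r+1}) from Hq(T_r)
              ht = (2 * (ht - int(T[r]) * top) + int(T[r + m])) % q
      return occ

  print(karp_rabin("10110101", "0101"))        # [5]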

Problem 1: Solution
Dictionary = { bzip, not, or, space }

P = bzip = 1a 0b

[Figure: the codeword of P is searched in C(S), S = "bzip or not bzip"; every
 byte-aligned codeword of C(S) is compared against it (yes/no)]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

[Figure: the m-by-n matrix M for T = california and P = for. Column j records which
 prefixes of P end at T[j]: e.g. M(1,5)=1 since T[5]=f, M(2,6)=1 since T[5,6]=fo,
 and M(3,7)=1 since T[5,7]=for.]

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M

We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one.
We define the m-length binary vector U(x) for each character x in the alphabet:
U(x) is set to 1 for the positions in P where character x appears.

Example: P = abaac
  U(a) = (1,0,1,1,0)
  U(b) = (0,1,0,0,0)
  U(c) = (0,0,0,0,1)

How to construct M

  • Initialize column 0 of M to all zeros
  • For j > 0, the j-th column is obtained by

      M(j) = BitShift( M(j-1) ) & U( T[j] )

For i > 1, entry M(i,j) = 1 iff
  (1) the first i-1 characters of P match the i-1 characters of T ending at character j-1
      ⇔ M(i-1,j-1) = 1
  (2) P[i] = T[j]  ⇔  the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) into the i-th position;
ANDing with the i-th bit of U(T[j]) establishes whether both conditions hold.
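A short Python sketch of the Shift-And scan, using a machine integer as the current column
(bit i-1 stands for row i); the strings below are just the running example:

  def shift_and(T, P):
      # returns the end positions (1-based) of the occurrences of P in T
      m = len(P)
      U = {}                                   # U[c]: bitmask of the positions of c in P
      for i, c in enumerate(P):
          U[c] = U.get(c, 0) | (1 << i)
      full = 1 << (m - 1)                      # the bit corresponding to row m
      M, occ = 0, []
      for j, c in enumerate(T, start=1):
          M = ((M << 1) | 1) & U.get(c, 0)     # M(j) = BitShift(M(j-1)) & U(T[j])
          if M & full:                         # row m is set: a full match ends at j
              occ.append(j)
      return occ

  print(shift_and("xabxabaaca", "abaac"))      # [9]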

An example (j = 1, 2, 3, ..., 9)

T = xabxabaaca,  P = abaac

[Figure: the columns of M computed step by step with M(j) = BitShift(M(j-1)) & U(T[j]).
 For j=1 (T[1]=x) the column is all zeros; for j=2 (T[2]=a) only M(1,2)=1;
 for j=3 (T[3]=b) only M(2,3)=1; ...; for j=9 (T[9]=c) M(5,9)=1, i.e. an occurrence
 of P ends at position 9.]

Shift-And method: Complexity

  • If m ≤ w, any column and any vector U() fit in a memory word
      ⇒ any step requires O(1) time.
  • If m > w, any column and any vector U() can be divided into m/w memory words
      ⇒ any step requires O(m/w) time.
  • Overall O(n(1+m/w)+m) time.
  • Thus, it is very fast when the pattern length is close to the word size
    — very often in practice. Recall that w = 64 bits in modern architectures.

Some simple extensions

We want to allow the pattern to contain special symbols, like [a-f] classes of chars.

P = [a-b]baac
  U(a) = (1,0,1,1,0)
  U(b) = (1,1,0,0,0)
  U(c) = (0,0,0,0,1)

What about '?', '[^…]' (not)?

Problem 1: Another solution
Dictionary = { bzip, not, or, space }

P = bzip = 1a 0b

[Figure: the codeword of P = bzip (1a 0b) is matched against C(S),
 S = "bzip or not bzip", answering yes/no at each candidate position]

Speed ≈ Compression ratio

Problem 2
Dictionary = { bzip, not, or, space }

Given a pattern P, find all the occurrences in S of all terms containing P as a substring.

P = o

[Figure: the dictionary terms containing P are located first; their codewords
 (not = 1g 0g 0a, or = 1g 0a 0b) are then searched in C(S), S = "bzip or not bzip"]

Speed ≈ Compression ratio? No! Why?
A scan of C(S) for each term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].

[Figure: a text T with occurrences of the patterns P1 and P2 highlighted]

 Naïve solution
   • Use an (optimal) exact-matching algorithm, searching for each pattern of P in T
   • Complexity: O(nl + m) time — not good with many patterns

 Optimal solution due to Aho and Corasick
   • Complexity: O(n + l + m) time

A simple extension of Shift-And

  • S is the concatenation of the patterns in P
  • R is a bitmap of length m:
      R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of the Shift-And method searching for S:
  • For any symbol c, U'(c) = U(c) AND R
      ⇒ U'(c)[i] = 1 iff S[i] = c and it is the first symbol of a pattern
  • For any step j:
      compute M(j), then M(j) OR U'(T[j]). Why?
      ⇒ set to 1 the first bit of each pattern that starts with T[j]
      Check if there are occurrences ending in j. How?

Problem 3
Dictionary = { bzip, not, or, space }

Given a pattern P, find all the occurrences in S of all terms containing
P as a substring, allowing at most k mismatches.

P = bot,  k = 2

[Figure: the dictionary codewords are matched against C(S), S = "bzip or not bzip",
 allowing up to k mismatching characters]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal: given k, find all the occurrences of P in T with up to k mismatches.

We define the matrix M^l to be an m-by-n binary matrix such that:
  M^l(i,j) = 1 iff there are no more than l mismatches between the
  first i characters of P and the i characters of T ending at character j.

What is M^0?
How does M^k solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing M^l: case 1

The first i-1 characters of P match a substring of T ending at j-1,
with at most l mismatches, and the next pair of characters in P and T are equal.

  BitShift( M^l(j-1) ) & U( T[j] )

Computing M^l: case 2

The first i-1 characters of P match a substring of T ending at j-1,
with at most l-1 mismatches.

  BitShift( M^{l-1}(j-1) )

Computing M^l

  • We compute M^l for all l = 0, … , k.
  • For each j compute M^0(j), M^1(j), … , M^k(j)
  • For all l initialize M^l(0) to the zero vector.

In order to compute M^l(j), we observe that there is a match iff case 1 or case 2 holds:

  M^l(j) = [ BitShift( M^l(j-1) ) & U(T[j]) ]  OR  BitShift( M^{l-1}(j-1) )

Example: M^1

T = xabxabaaca,  P = abaad

[Figure: the matrices M^0 and M^1 for this example, column by column. M^0 records the
 exact prefix matches; M^1 additionally allows one mismatch, e.g. M^1(5,9)=1 because
 P = abaad matches T[5,9] = abaac with a single mismatch in the last character.]

How much do we pay?

  • The running time is O(kn(1+m/w)).
  • Again, the method is practically efficient for small m.
  • Only O(k) columns of M are needed at any given time; hence the space
    used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary = { bzip, not, or, space }

Given a pattern P, find all the occurrences in S of all terms containing
P as a substring, allowing k mismatches.

P = bot,  k = 2

[Figure: the k-mismatch Shift-And is run over C(S), S = "bzip or not bzip";
 e.g. the codeword of "not" (not = 1g 0g 0a) is reported]

Agrep: more sophisticated operations

The Shift-And method can solve other ops.

The edit distance between two strings p and s is d(p,s) = minimum number of
operations needed to transform p into s via three ops:
  • Insertion: insert a symbol in p
  • Deletion: delete a symbol from p
  • Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions
  Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding

  γ(x)  =  000...0  x-in-binary        (Length-1 zeros, then x in binary)

  • x > 0 and Length = ⌊log2 x⌋ + 1
  • e.g., 9 is represented as <000,1001>
  • The γ-code for x takes 2⌊log2 x⌋ + 1 bits  (i.e. a factor of 2 from optimal)
  • Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers

It is a prefix-free encoding…

Given the following sequence of γ-coded integers, reconstruct the original sequence:

  0001000001100110000011101100111

  ⇒  8, 6, 3, 59, 7
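A tiny Python sketch of γ-encoding and decoding, which can be used to check the exercise above:

  def gamma_encode(x):
      # x > 0: (Length-1) zeros followed by x in binary
      b = bin(x)[2:]
      return "0" * (len(b) - 1) + b

  def gamma_decode(bits):
      out, i = [], 0
      while i < len(bits):
          z = 0
          while bits[i] == "0":              # count the leading zeros = Length-1
              z += 1
              i += 1
          out.append(int(bits[i:i + z + 1], 2))
          i += z + 1
      return out

  print(gamma_encode(9))                                        # 0001001
  print(gamma_decode("0001000001100110000011101100111"))        # [8, 6, 3, 59, 7]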

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).

Recall that: |γ(i)| ≤ 2 * log i + 1
How good is this approach wrt Huffman?  Compression ratio ≤ 2 * H0(s) + 1

Key fact:
  1 ≥ ∑_{i=1,...,x} pi ≥ x * px   ⇒   x ≤ 1/px

How good is it?
Encode the integers via γ-coding: |γ(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):

  ∑_{i=1,..,|S|} pi * |γ(i)|  ≤  ∑_{i=1,..,|S|} pi * [ 2 * log (1/pi) + 1 ]  =  2 * H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding

Byte-aligned and tagged Huffman
  • 128-ary Huffman tree
  • First bit of the first byte is tagged
  • Configurations on 7 bits: just those of Huffman

End-tagged dense code
  • The rank r is mapped to the r-th binary sequence on 7*k bits
  • First bit of the last byte is tagged

Surprising changes
  • It is a prefix-code
  • Better compression: it uses all 7-bit configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2

A new concept: Continuers vs Stoppers
  • Previously we used: s = c = 128

The main idea is:
  • s + c = 256 (we are playing with 8 bits)
  • Thus s items are encoded with 1 byte
  • And s*c with 2 bytes, s*c^2 on 3 bytes, ...

An example
  • 5000 distinct words
  • ETDC encodes 128 + 128^2 = 16512 words on 2 bytes
  • The (230,26)-dense code encodes 230 + 230*26 = 6210 on 2 bytes,
    hence more words on 1 byte — and thus better if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.

  • Brute-force approach
  • Binary search: on real distributions, there seems to be one unique minimum
      K_s = max codeword length
      F_s^k = cumulative probability of the symbols whose |cw| ≤ k

Experiments: (s,c)-DC is quite interesting…
  Search is 6% faster than byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer sequence, that can then be var-length coded.

  • Start with the list of symbols L = [a,b,c,d,…]
  • For each input symbol s
      1) output the position of s in L
      2) move s to the front of L

There is a memory: it exploits temporal locality, and it is dynamic.

Properties:
  X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n^2 log n),  MTF = O(n log n) + n^2

Not much worse than Huffman
...but it may be far better
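A short Python sketch of the MTF transform (the starting list below is just the alphabet of
the toy input):

  def mtf_encode(text, alphabet):
      L = list(alphabet)                 # the MTF list, e.g. ['a','b','c','d']
      out = []
      for s in text:
          i = L.index(s)                 # position of s in L (0-based here)
          out.append(i)
          L.pop(i)                       # move s to the front
          L.insert(0, s)
      return out

  print(mtf_encode("aaaabbbbcccc", "abcd"))   # [0, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0]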

MTF: how good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2 * log i + 1
Put S in front and consider the cost of encoding:

  O(|S| log |S|) + ∑_{x=1}^{|S|} ∑_{i=2}^{n_x} γ( p_i^x - p_{i-1}^x )

By Jensen's inequality:

  ≤ O(|S| log |S|) + ∑_{x=1}^{|S|} n_x * [ 2 * log (N / n_x) + 1 ]
  = O(|S| log |S|) + N * [ 2 * H0(X) + 1 ]

  ⇒ La[mtf] ≤ 2 * H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and separators) as symbols to be encoded.

How to keep the MTF-list efficiently:
  • Search tree
      - leaves contain the symbols, ordered as in the MTF-list
      - nodes contain the size of their descending subtree
  • Hash Table
      - key is a symbol
      - data is a pointer to the corresponding tree leaf

Each tree operation takes O(log |S|) time
Total is O(n log |S|), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
  abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings ⇒ just the numbers and one initial bit

There is a memory
Properties:
  • Exploits spatial locality, and it is a dynamic code
  • X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = Θ(n^2 log n)  >  Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive):

  f(i) = ∑_{j=1}^{i-1} p(j)

e.g.  a = .2, b = .5, c = .3   ⇒   f(a) = .0, f(b) = .2, f(c) = .7

[Figure: the unit interval [0,1) split into a = [0,.2), b = [.2,.7), c = [.7,1.0)]

The interval for a particular symbol will be called the symbol interval
(e.g. for b it is [.2,.7)).

Arithmetic Coding: Encoding Example
Coding the message sequence: bac

[Figure: the interval is narrowed symbol by symbol:
   start [0,1);  b ⇒ [.2,.7);  a ⇒ [.2,.3);  c ⇒ [.27,.3)]

The final sequence interval is [.27,.3)
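A few lines of Python reproducing this interval narrowing (probabilities and message are
those of the example; a real coder would also emit bits and renormalize, which is omitted here):

  def sequence_interval(msg, p, f):
      # returns the final interval [l, l+s) for the message
      l, s = 0.0, 1.0
      for c in msg:
          l = l + s * f[c]       # l_i = l_{i-1} + s_{i-1} * f(c_i)
          s = s * p[c]           # s_i = s_{i-1} * p(c_i)
      return l, s

  p = {"a": .2, "b": .5, "c": .3}
  f = {"a": .0, "b": .2, "c": .7}
  l, s = sequence_interval("bac", p, f)
  print(l, l + s)                # 0.27 0.3  (up to floating-point rounding)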

Arithmetic Coding
To code a sequence of symbols c with probabilities p[c] use the following:

  l_0 = 0      l_i = l_{i-1} + s_{i-1} * f(c_i)
  s_0 = 1      s_i = s_{i-1} * p(c_i)

f[c] is the cumulative prob. up to symbol c (not included).
The final interval size is

  s_n = ∏_{i=1}^{n} p(c_i)

The interval for a message sequence will be called the sequence interval.

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:

[Figure: .49 falls in b's interval [.2,.7); within it, .49 falls again in b's
 sub-interval [.3,.55); within that, it falls in c's sub-interval [.475,.55)]

The message is bbc.

Representing a real number
Binary fractional representation:

  .75 = .11      1/3 = .0101...      11/16 = .1011

Algorithm:
  1. x = 2 * x
  2. If x < 1 output 0
  3. else x = x - 1; output 1

So how about just using the shortest binary fractional number in the sequence interval?
  e.g. [0,.33) = .01     [.33,.66) = .1     [.66,1) = .11

Representing a code interval
Binary fractional numbers can be viewed as intervals by considering all completions:

  number   min      max      interval
  .11      .110     .111     [.75, 1.0)
  .101     .1010    .1011    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval is
contained in the sequence interval (a dyadic number).

  e.g. sequence interval [.61, .79)  ⊇  code interval (.101) = [.625, .75)

Can use L + s/2 truncated to 1 + ⌈log (1/s)⌉ bits

Bound on Arithmetic length
Note that -log s + 1 = log (2/s)

Theorem: For a text of length n, the Arithmetic encoder generates at most
  1 + ⌈log (1/s)⌉
  = 1 + ⌈log ∏_i (1/p_i)⌉
  ≤ 2 + ∑_{j=1,n} log (1/p_j)
  = 2 + ∑_{k=1,|S|} n*p_k * log (1/p_k)
  = 2 + n H0   bits

In practice it is nH0 + 0.02 n bits, because of rounding.

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever the sequence interval falls into the top,
bottom or middle half, expand the interval
by a factor of 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
  Output 1 followed by m 0s;  m = 0
  Message interval is expanded by 2

If u < R/2 then (bottom half)
  Output 0 followed by m 1s;  m = 0
  Message interval is expanded by 2

If l ≥ R/4 and u < 3R/4 then (middle half)
  Increment m
  Message interval is expanded by 2

All other cases: just continue...

You find this at

Arithmetic ToolBox
As a state machine: given the current interval (L,s), the distribution (p1,....,p|S|)
and the next symbol c, the ATB returns the new interval (L',s'):

  (L,s) --[ c, (p1,....,p|S|) ]--> (L',s')

Therefore, even the distribution can change over time.

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox

  (L,s) --[ s, p[s|context] ]--> (L',s')      where s = c or esc

Encoder and Decoder must know the protocol for selecting the same conditional
probability distribution (PPM-variant).

PPM: Example Contexts  (k=2)

String = ACCBACCACBA B

  Context (empty):  A=4  B=2  C=5  $=3

  Context A:   C=3  $=1
  Context B:   A=2  $=1
  Context C:   A=1  B=2  C=2  $=3

  Context AC:  B=1  C=2  $=2
  Context BA:  C=1  $=1
  Context CA:  C=1  $=1
  Context CB:  A=2  $=1
  Context CC:  A=1  B=1  $=2
You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit frequency estimation

LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) for n → ∞ !!

LZ77

  a a c a a c a b c a b a b a c
  Dictionary ???    Cursor (all substrings starting here)

  <2,3,c>

Algorithm's step:
  • Output <d, len, c>, where
      d   = distance of the copied string wrt the current position
      len = length of the longest match
      c   = next char in the text beyond the longest match
  • Advance by len + 1

A buffer "window" has fixed length and moves.

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
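A tiny Python sketch of this decoding loop, including the overlapping-copy case above;
triples are (d, len, c) as in the slides:

  def lz77_decode(triples):
      out = []
      for d, length, c in triples:
          cursor = len(out)
          for i in range(length):              # works even when length > d (overlap)
              out.append(out[cursor - d + i])
          out.append(c)
      return "".join(out)

  print(lz77_decode([(0, 0, "a"), (0, 0, "b"), (0, 0, "c"), (0, 0, "d"), (2, 9, "e")]))
  # abcdcdcdcdcdce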

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
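A short Python sketch of LZW encoding; to reproduce the numbers of the example, the single
characters are seeded with the slides' toy codes (a=112, b=113, c=114) and new entries start
at 256, instead of the full 256-entry ASCII dictionary:

  def lzw_encode(text, first_code=112):
      # dictionary seeded with the single chars of the input (slides' toy numbering)
      dic = {c: first_code + i for i, c in enumerate(sorted(set(text)))}
      next_code, w, out = 256, "", []          # new entries start after code 255
      for c in text:
          if w + c in dic:
              w = w + c                        # keep extending the current match S
          else:
              out.append(dic[w])               # emit code(S), add Sc to the dictionary
              dic[w + c] = next_code
              next_code += 1
              w = c
      out.append(dic[w])                       # flush the last pending match
      return out

  print(lzw_encode("aabaacababacb"))
  # [112, 112, 113, 256, 114, 257, 261, 114, 113]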

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform (1994)
Let us be given a text T = mississippi#

All cyclic rotations of T:
  mississippi#
  ississippi#m
  ssissippi#mi
  sissippi#mis
  issippi#miss
  ssippi#missi
  sippi#missis
  ippi#mississ
  ppi#mississi
  pi#mississip
  i#mississipp
  #mississippi

Sort the rows (F = first column, L = last column):
  F                 L
  #  mississipp  i
  i  #mississip  p
  i  ppi#missis  s
  i  ssippi#mis  s
  i  ssissippi#  m
  m  ississippi  #
  p  i#mississi  p
  p  pi#mississ  i
  s  ippi#missi  s
  s  issippi#mi  s
  s  sippi#miss  i
  s  sissippi#m  i

  L = BWT(T) = ipssm#pissii

A famous example

Much
longer...

A useful tool: the L → F mapping

[Figure: the same sorted-rotation matrix, with the F and L columns highlighted]

How do we map L's chars onto F's chars?
... we need to distinguish equal chars in F...

Take two equal chars of L and rotate their rows rightward by one position:
they keep the same relative order !!

The BWT is invertible

[Figure: the sorted-rotation matrix again, with the LF mapping drawn between L and F]

Two key properties:
  1. The LF-array maps L's chars to F's chars
  2. L[i] precedes F[i] in T

Reconstruct T backward:   T = .... i p p i #

InvertBWT(L)
  Compute LF[0,n-1];
  r = 0; i = n;
  while (i > 0) {
    T[i] = L[r];
    r = LF[r]; i--;
  }
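A minimal Python sketch of the transform and of the LF-based inversion above (the rotation
sort is the naive quadratic one, which is fine for a toy string):

  def bwt(T):
      # T must end with a unique smallest sentinel, e.g. '#'
      rot = sorted(T[i:] + T[:i] for i in range(len(T)))
      return "".join(r[-1] for r in rot)

  def inv_bwt(L, sentinel="#"):
      n = len(L)
      # LF[r] = row of F holding the char L[r] (stable rank among equal chars)
      order = sorted(range(n), key=lambda r: (L[r], r))
      LF = [0] * n
      for f_row, r in enumerate(order):
          LF[r] = f_row
      T = [""] * n
      r = L.index(sentinel)            # the row of the matrix that is T itself
      for i in range(n - 1, -1, -1):   # L[r] precedes F[r] in T: rebuild T backward
          T[i] = L[r]
          r = LF[r]
      return "".join(T)

  L = bwt("mississippi#")
  print(L)                             # ipssm#pissii
  print(inv_bwt(L))                    # mississippi#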

How to compute the BWT ?

  SA    sorted rotations (BWT matrix)    L
  12    #mississippi                     i
  11    i#mississipp                     p
   8    ippi#mississ                     s
   5    issippi#miss                     s
   2    ississippi#m                     m
   1    mississippi#                     #
  10    pi#mississip                     p
   9    ppi#mississi                     i
   7    sippi#missis                     s
   4    sissippi#mis                     s
   6    ssippi#missi                     i
   3    ssissippi#mi                     i

We said that: L[i] precedes F[i] in T.  E.g. L[3] = T[7].
Given SA and T, we have L[i] = T[SA[i]-1].

How to construct SA from T ?

Input: T = mississippi#

  SA
  12   #
  11   i#
   8   ippi#
   5   issippi#
   2   ississippi#
   1   mississippi#
  10   pi#
   9   ppi#
   7   sippi#
   4   sissippi#
   6   ssippi#
   3   ssissippi#

Elegant but inefficient: sort the suffixes by comparing them char by char.

Obvious inefficiencies:
  • Θ(n^2 log n) time in the worst case
  • Θ(n log n) cache misses or I/O faults

Many algorithms exist, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


The largest artifact ever conceived by humankind



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…

  • Physical network graph
      V = routers, E = communication links
  • The "cosine" graph (undirected, weighted)
      V = static web pages, E = semantic distance between pages
  • Query-Log graph (bipartite, weighted)
      V = queries and URLs, E = (q,u) if u is a result for q and has been
      clicked by some user who issued q
  • Social graph (undirected, unweighted)
      V = users, E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pr that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution

[Figure: in-degree distributions from the Altavista crawl (1999) and the WebBase crawl
 (2001); the indegree follows a power-law distribution]

  Pr[ in-degree(u) = k ]  ∝  1 / k^α ,   α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pr that a node has x links is 1/x^α, α ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph

[Figure: adjacency matrix of a web crawl with 21 million pages and 150 million links]

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph

Uncompressed adjacency list  vs  adjacency list with compressed gaps (locality):
  Successor list S(x) = {s1-x, s2-s1-1, ..., sk-s(k-1)-1}
  (negative entries get a special encoding)

Copy-lists (similarity), with possibly limited reference chains:
  • Each bit of y's copy-list tells whether the corresponding successor of the
    reference x is also a successor of y
  • The reference index is chosen in [0,W] so as to give the best compression

Copy-blocks = RLE(Copy-list), i.e. RLE on the bit sequences:
  • The first copy block is 0 if the copy list starts with 0
  • The last block is omitted (we know the length…)
  • The length is decremented by one for all blocks

This is a Java and C++ lib (≈3 bits/edge)

Extra-nodes: Compressing Intervals

Exploit consecutivity in the extra-nodes:
  • Intervals: use their left extreme and length
  • Interval length: decremented by Lmin = 2
  • Residuals: differences between consecutive residuals, or wrt the source

Example encodings:
  0    = (15-15)*2       (positive)
  2    = (23-19)-2       (jump >= 2)
  600  = (316-16)*2
  3    = |13-15|*2-1     (negative)
  3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background

[Figure: a sender transmits data to a receiver over a network link; the receiver may
 already hold some knowledge about the data]

 Network links are getting faster and faster, but
 many clients are still connected by fairly slow links (mobile?)
 people wish to send more and more data

How can we make this transparent to the user?

Two standard techniques

  • caching: "avoid sending the same object again"
      - done on the basis of objects
      - only works if objects are completely unchanged
      - How about objects that are slightly changed?

  • compression: "remove redundancy in transmitted data"  (overhead)
      - avoid repeated substrings in data
      - can be extended to the history of past transmissions
      - What if the sender has never seen the data at the receiver?

Types of Techniques

  • Common knowledge between sender & receiver
      - Unstructured file: delta compression
  • "Partial" knowledge
      - Unstructured files: file synchronization
      - Record-based data: set reconciliation

Formalization

  • Delta compression  [diff, zdelta, REBL,…]
      - Compress file f deploying file f'
      - Compress a group of files
      - Speed up web access by sending differences between the requested
        page and the ones available in cache

  • File synchronization  [rsync, zsync]
      - Client updates old file f_old with f_new available on a server
      - Mirroring, Shared Crawling, Content Distribution Networks

  • Set reconciliation
      - Client updates structured old file f_old with f_new available on a server
      - Update of contacts or appointments, intersect inverted lists in a P2P search engine

Z-delta compression  (one-to-one)

Problem: We have two files f_known and f_new, and the goal is to compute a file
f_d of minimum size such that f_new can be derived from f_known and f_d.

  • Assume that block moves and copies are allowed
  • Find an optimal covering set of f_new based on f_known
  • The LZ77-scheme provides an efficient, optimal solution:
      f_known is the "previously encoded text"; compress f_known·f_new starting from f_new
  • zdelta is one of the best implementations

             Emacs size    Emacs time
  uncompr    27Mb          ---
  gzip       8Mb           35 secs
  zdelta     1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: a pair of proxies located on the two sides of the slow link
use a proprietary protocol to increase performance over this link.

[Figure: Client ⇄ (slow link, delta-encoding) ⇄ Proxy ⇄ (fast link) ⇄ web;
 both sides hold the reference page, only the delta crosses the slow link]

Use zdelta to reduce traffic:
  • Old version available at both proxies
  • Restricted to pages already visited (30% hits), URL-prefix match
  • Small cache
Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

0

620

2

20

123

2000

20
220

3

1
5

space

time

uncompr

30Mb

---

tgz

20%

linear

THIS

8%

quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly: n^2 edge calculations (zdelta executions).
We wish to exploit some pruning approach:

  • Collection analysis: cluster the files that appear similar and are thus good
    candidates for zdelta-compression. Build a sparse weighted graph G'_F containing
    only edges between those pairs of files.
  • Assign weights: estimate appropriate edge weights for G'_F, thus saving zdelta
    executions. Nonetheless, strictly n^2 time.

             space    time
  uncompr    260Mb    ---
  tgz        12%      2 mins
  THIS       8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem

[Figure: the Client holds f_old and sends a request; the Server holds f_new and sends
 back an update]

  • The client wants to update an out-dated file
  • The server has the new file but does not know the old file
  • Update without sending the entire f_new (using similarity)
  • rsync: file synch tool, distributed with Linux

Delta compression is a sort of "local" synch, since the server has both copies of the files.

The rsync algorithm

[Figure: the Client sends the hashes of the blocks of f_old; the Server matches them
 against f_new and sends back the encoded file (block copies + literals)]

  • simple, widely used, single roundtrip
  • optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
  • choice of block size problematic (default: max{700, √n} bytes)
  • not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

           gcc size    emacs size
  total    27288       27326
  gzip     7563        8577
  zdelta   227         1431
  rsync    964         4452

Compressed size in KB (slightly outdated numbers)

Factor of 3-5 gap between rsync and zdelta !!

A new framework: zsync

  • The server sends the hashes (unlike the client in rsync), the client checks them
  • The server deploys the common f_ref to compress the new f_tar (rsync compresses just it)

A multi-round protocol
  • k blocks of n/k elements, log(n/k) levels
  • If the distance is k, then on each level ≤ k hashes do not find a match in the other file
  • The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts

Pattern P occurs at position i of T
  iff  P is a prefix of the i-th suffix of T (i.e. T[i,N])

[Figure: P aligned against T at position i, covering a prefix of T[i,N]]

Occurrences of P in T = all suffixes of T having P as a prefix

  P = si,  T = mississippi  ⇒  occurrences at positions 4, 7

SUF(T) = sorted set of suffixes of T

Reduction: from substring search to prefix search.

The Suffix Tree

T# = mississippi#
      1 2 3 4 5 6 7 8 9 10 11 12

[Figure: the suffix tree of T#; edges are labeled with substrings (e.g. "mississippi#",
 "i", "s", "si", "ssi", "ppi#", "pi#", "#"), and each leaf stores the starting position
 (1..12) of its suffix]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

T = mississippi#         (storing SUF(T) explicitly would take Θ(N^2) space)

  SA    SUF(T)
  12    #
  11    i#
   8    ippi#
   5    issippi#
   2    ississippi#
   1    mississippi#
  10    pi#
   9    ppi#
   7    sippi#
   4    sissippi#
   6    ssippi#
   3    ssissippi#

Each SA entry is a suffix pointer.   P = si

Suffix Array space:
  • SA: Θ(N log2 N) bits
  • Text T: N chars
  ⇒ In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step

[Figure: binary search over SA for P = si on T = mississippi#; each probed suffix
 tells whether P is smaller or larger than it]

Suffix Array search:
  • O(log2 N) binary-search steps
  • Each step takes O(p) char comparisons
  ⇒ overall, O(p log2 N) time
    [improvable to O(p + log2 N) (Manber-Myers '90), and to alphabet-aware bounds (Cole et al. '06)]
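A small Python sketch of the (naive) suffix-array construction and of the binary search for
the suffixes prefixed by P:

  def suffix_array(T):
      # naive construction: sort the suffix starting positions by the suffixes themselves
      return sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])

  def sa_search(T, SA, P):
      # binary search for the contiguous range of suffixes having P as a prefix
      n, p = len(SA), len(P)
      lo, hi = 0, n
      while lo < hi:                        # leftmost suffix >= P
          mid = (lo + hi) // 2
          if T[SA[mid] - 1:] < P:
              lo = mid + 1
          else:
              hi = mid
      occ = []
      while lo < n and T[SA[lo] - 1:SA[lo] - 1 + p] == P:
          occ.append(SA[lo])                # suffixes prefixed by P are contiguous
          lo += 1
      return occ

  T = "mississippi#"
  SA = suffix_array(T)
  print(SA)                                 # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
  print(sorted(sa_search(T, SA, "si")))     # [4, 7]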

Locating the occurrences

[Figure: the binary search isolates the SA interval of suffixes prefixed by P = si
 (here occ = 2, i.e. positions 4 and 7 of T = mississippi#); its endpoints can be
 found by searching for the smallest and largest extensions of P]

Suffix Array search: O(p + log2 N + occ) time

Suffix Trays: O(p + log2 |S| + occ)     [Cole et al., '06]
String B-tree                           [Ferragina-Grossi, '95]
Self-adjusting Suffix Arrays            [Ciriani et al., '02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

[Figure: T = mississippi# with its SA (12,11,8,5,2,1,10,9,7,4,6,3) and the Lcp array of
 adjacent suffixes; e.g. the Lcp between issippi# and ississippi# is 4]

  • How long is the common prefix between T[i,...] and T[j,...]?
      Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
  • Does there exist a repeated substring of length ≥ L?
      Search for Lcp[i] ≥ L.
  • Does there exist a substring of length ≥ L occurring ≥ C times?
      Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.


But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with i-th bit of U(T[j]) to estabilish if both are true

An example j=1
1 2 3 4 5 6 7 8 9 10

n
m

1

12345

1

0

P=abaac

2

0

3

0

4

0

5

0

T=xabxabaaca

0
 
0
U (x)  0
 
0
 
0

2

3

4

5

6

7

1 
 
0
BitShift ( M ( 0 )) & U (T [1])   0  &
 
0
 
0

8

9

0 0
   
0 0
0  0
   
0 0
   
0 0

1
0

An example j=2
1 2 3 4 5 6 7 8 9 10

n
m

1

2

12345

1

0

1

P=abaac

2

0

0

3

0

0

4

0

0

5

0

0

T=xabxabaaca

1 
 
0
U (a )   1 
 
1 
 
0

3

4

5

6

7

1 
 
0
BitShift ( M (1)) & U (T [ 2 ])   0  &
 
0
 
0

8

9

1  1 
   
0 0
1    0 
   
1   0 
   
0 0

1
0

An example j=3
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

12345

1

0

1

0

P=abaac

2

0

0

1

3

0

0

0

4

0

0

0

5

0

0

0

T=xabxabaaca

0
 
1 
U (b )   0 
 
0
 
0

4

5

6

7

1 
 
1 
BitShift ( M ( 2 )) & U (T [ 3 ])   0  &
 
0
 
0

8

9

0 0
   
1  1 
0  0
   
0 0
   
0 0

1
0

An example j=9
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

4

5

6

7

8

9

12345

1

0

1

0

0

1

0

1

1

0

P=abaac

2

0

0

1

0

0

1

0

0

0

3

0

0

0

0

0

0

1

0

0

4

0

0

0

0

0

0

0

1

0

5

0

0

0

0

0

0

0

0

1

T=xabxabaaca

0
 
0
U (c )   0 
 
0
 
1 

1 
 
1 
BitShift ( M ( 8 )) & U (T [ 9 ])   0  &
 
0
 
1 

0 0
   
0 0
0  0
   
0 0
   
1  1 

1
0

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M ( j - 1))  U (T [ j ])
l

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M

l -1

( j - 1))

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1
1 2 3 4 5 6 7 8 910

M1=

T=xabxabaaca
P=
abaad

1

2

3

4

5

6

7

8

9

1
0

1

1

1

1

1

1

1

1

1

1

1

2

0

0

1

0

0

1

0

1

1

0

3

0

0

0

1

0

0

1

0

0

1

4

0

0

0

0

1

0

0

1

0

0

5

0

0

0

0

0

0

0

0

1

0

M0=

1

2

3

4

5

6

7

8

9

1
0

1

0

1

0

0

1

0

1

1

0

1

2

0

0

1

0

0

1

0

0

0

0

3

0

0

0

0

0

0

1

0

0

0

4

0

0

0

0

0

0

0

1

0

0

5

0

0

0

0

0

0

0

0

0

0

How much do we pay?





The running time is O(kn(1+m/w)
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Delection: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
0 000 ...........0 x in b in ary
Length-1


x > 0 and Length = log2 x +1
e.g., 9 represented as <000,1001>.



g-code for x takes 2 log2 x +1 bits

(ie. factor of 2 from optimal)



Optimal for Pr(x) = 1/2x2, and i.i.d integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8

6

3

59

7

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How much good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
1 ≥Si=1,...,x pi ≥ x * px  x ≤ 1/px

How good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):



p i  | g ( i ) |

i  1 ,..., S

This is:



i  1 ,.., S

p i  [ 2 * log

1

 1]

pi

 2 * H 0 ( X ) 1
No much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1n 2n 3n… nn  Huff = O(n2 log n), MTF = O(n log n) + n2

No much worse than Huffman
...but it may be far better

MTF: how good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
Put S in the front and consider the cost of encoding:
S

O ( S log S ) 



nx

g(p - p )

x 1 i  2

x

x

i

i -1

S

By Jensen’s:

 O ( S log S ) 

n
x 1

x

[ 2 * log

N

 1]

nx

 O ( S log S )  N * [ 2 * H 0 ( X )  1]
L a [ mtf ]  2 * H 0 ( X )  O (1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1n 2n 3n… nn 

There is a memory

Huff(X) = Q(n2 log n) > Rle(X) = Q(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.
1.0

c = .3

0.7

i -1

f (i ) 



p( j)

j 1

b = .5
0.2
0.0

f(a) = .0, f(b) = .2, f(c) = .7
a = .2

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
1.0

0.7
c = .3

0.7

c = .3
0.55

0.0

a = .2

0.3
0.2

c = .3

0.27
b = .5

b = .5
0.2

0.3

a = .2

b = .5
0.22

a = .2

0.2

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l 0  0 l i  l i -1  s i -1 * f c i 

s0  1

s i  s i -1 * p c i 

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

n

sn 

 p c 
i

i 1

The interval for a message sequence will be called the
sequence interval

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
1.0

0.7
c = .3

0.7
0.49

b = .5

c = .3

0.55
0.49

0.2

0.0

0.55

b = .5

0.3
a = .2

The message is bbc.

0.2

0.49
0.475

c = .3

b = .5
0.35

a = .2

0.3

a = .2

Representing a real number
Binary fractional representation:
. 75

 . 11

1/ 3

 . 01 01

11 / 16

 . 1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
m in

m ax

interval

.11

.110

.111

[.75 ,1.0 )

.101

.1010

.1011

[.625 , .75 )

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
Context
Empty

Counts
A
B
C
$

=
=
=
=

4
2
5
3

Context
A
B
C

Counts
C
$
A
$
A
B
C
$

=
=
=
=
=
=
=
=

3
1
2
1
1
2
2
3

Context
AC

BA
CA
CB
CC

String = ACCBACCACBA B

k=2

Counts
B
C
$
C
$
C
$
A
$
A
B
$

=
=
=
=
=
=
=
=
=
=
=
=

1
2
2
1
1
1
1
2
1
1
1
2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Q(n2 log n) time in the worst-case
• Q(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in - degree ( u )  k ] 

a  2.1

1

k

a

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides and efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
Emacs size

Emacs time

uncompr

27Mb

---

gzip

8Mb

35 secs

zdelta

1.5Mb

42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

0

620

2

20

123

2000

20
220

3

1
5

space

time

uncompr

30Mb

---

tgz

20%

linear

THIS

8%

quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
space

time

uncompr

260Mb

---

tgz

12%

2 mins

THIS

8%

16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

gcc size
total
27288
gzip
7563
zdelta
227
rsync
964

emacs size
27326
8577
1431
4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
#

0

s

i

1
ssi

ppi#

4

#

1

p

i

3

2

1

ppi#

6

pi#

7

8
5

T# = mississippi#
2 4 6 8 10

si

ppi#

i#

ppi#

11

mississippi#

12

2

1 10

9

4

3

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
5

Q(N2) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Q(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does it exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does it exist a substring of length ≥ L occurring ≥ C times ?
• Search for Lcp[i,i+C-2] whose entries are ≥ L


Slide 3


Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
[Figure: magnetic disk; tracks, read/write head and arm, magnetic surface]

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
[Figure: memory hierarchy; CPU registers and L1/L2 caches (few MBs, some nanosecs, few words fetched), RAM (few GBs, tens of nanosecs), disk (few TBs, few millisecs, B = 32K page), network (many TBs, even secs, packets)]

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock and its daily performance over time, find the time window in which it achieved the best "market performance".
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
Running time on inputs of growing size:

  n      4K    8K   16K   32K    128K   256K   512K   1M
  n^3    22s   3m   26m   3.5h   28h    --     --     --
  n^2    0     0    0     1s     26s    106s   7m     28m

An optimal solution
We assume every subsum≠0

[Figure: A split into a prefix with negative sum (< 0) followed by the Optimum window (> 0)]

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm
  sum = 0; max = -1;
  for i = 1,...,n do
    if (sum + A[i] ≤ 0) then sum = 0;
    else { sum += A[i]; max = MAX{max, sum}; }

Note:
• sum ≤ 0 right before OPT starts (so it gets reset to 0 there);
• sum > 0 within OPT.
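A runnable version of the scan above (a sketch; with the slide's initialization max = -1, an all-nonpositive array reports -1, i.e. "no window found"):

```python
def max_subarray(A):
    best = -1          # as in the slide: max = -1 (no window found yet)
    running = 0
    for x in A:
        if running + x <= 0:
            running = 0          # the optimum cannot start inside a non-positive prefix
        else:
            running += x
            best = max(best, running)
    return best

print(max_subarray([2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]))   # 12 = 6+1-2+4+3
```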

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort  Θ(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 10^9 random I/Os = 10^9 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01  if (i < j) then
02      m = (i+j)/2;             // Divide
03      Merge-Sort(A,i,m);       // Conquer
04      Merge-Sort(A,m+1,j);
05      Merge(A,i,m,j)           // Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:
  n = 10^9 tuples  few GBs
  Typical Disk (Seagate Cheetah 150GB): seek time ~5ms

Analysis of mergesort on disk:
  It is an indirect sort: Θ(n log2 n) random I/Os
  [5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
[Figure: the merge-sort recursion tree over the input items. How do we deploy the disk/memory features? The runs of size ≤ M, i.e. the lowest levels of the tree, can be sorted entirely in internal memory.]

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge.
Sort N items with main-memory M and disk-pages B:
  Pass 1: Produce (N/M) sorted runs.
  Pass i: merge X = M/B runs at a time   log_{M/B}(N/M) merging passes
[Figure: X = M/B input runs on disk are streamed through main-memory buffers of B items each and merged into one output run, written back to disk]

Multiway Merging
[Figure: multiway merging; one input buffer Bf_i with cursor p_i per run (Run 1 ... Run X = M/B) and one output buffer Bf_o. At each step output min(Bf_1[p_1], Bf_2[p_2], ..., Bf_X[p_X]); fetch the next page of run i when p_i reaches B; flush Bf_o to the merged output run when it is full, until EOF.]
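A sketch of the multiway-merge step using a min-heap over the current heads of the X runs (here runs are plain Python iterables standing in for on-disk runs; page buffering and I/O are left implicit):

```python
import heapq

def multiway_merge(runs):
    """Merge X already-sorted runs into one sorted output stream."""
    heap = []
    iters = [iter(r) for r in runs]
    for i, it in enumerate(iters):                 # prime the heap with each run's head
        first = next(it, None)
        if first is not None:
            heapq.heappush(heap, (first, i))
    while heap:
        value, i = heapq.heappop(heap)             # overall minimum among the X heads
        yield value
        nxt = next(iters[i], None)                 # refill from the same run
        if nxt is not None:
            heapq.heappush(heap, (nxt, i))

print(list(multiway_merge([[1, 5, 9], [2, 6, 10], [3, 4, 12]])))
# [1, 2, 3, 4, 5, 6, 9, 10, 12]
```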

Cost of Multi-way Merge-Sort


Number of passes = log_{M/B} #runs  log_{M/B} (N/M)

 Optimal cost = Θ( (N/B) log_{M/B} (N/M) ) I/Os

In practice:
  M/B ≈ 1000   #passes = log_{M/B} (N/M) ≈ 1
  One multiway merge  2 passes (R/W) = few mins
  (tuning depends on disk features)

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Can compression help?

Goal: enlarge M and reduce N
  #passes = O( log_{M/B} (N/M) )
  Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables <X,C> (candidate and counter), with C = 0 initially.
For each item s of the stream:
    if (C == 0)       { X = s; C = 1; }
    else if (X == s)  C++;
    else              C--;
Return X;

Proof
If the returned X ≠ y, then every one of y's occurrences has a "negative" mate (a distinct item that cancelled it). Hence the mates are ≥ #occ(y), so 2 * #occ(y) ≤ N, contradicting #occ(y) > N/2.
(Problems arise if the most frequent item occurs ≤ N/2 times: the returned X may then be arbitrary.)
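A runnable version of the majority-vote scan (a sketch; a second verification pass would be needed whenever no item is guaranteed to exceed N/2):

```python
def majority_candidate(stream):
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1
        elif X == s:
            C += 1
        else:
            C -= 1
    return X

A = "bacccdcbaaaccbccc"            # the slide's stream: c occurs 9 times out of 17
print(majority_candidate(A))       # 'c'
```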

Toy problem #4: Indexing


Consider the following TREC collection:







  N = 6 * 10^9  total size = 6GB
  n = 10^6 documents
  TotT = 10^9 term occurrences (avg term length is 6 chars)
  t = 5 * 10^5 distinct terms

What kind of data structure should we build to support word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million documents (columns), t = 500K terms (rows); entry is 1 if the play contains the word, 0 otherwise:

  term        Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
  Antony             1                1             0          0       0        1
  Brutus             1                1             0          1       0        0
  Caesar             1                1             0          1       1        1
  Calpurnia          0                1             0          0       0        0
  Cleopatra          1                0             0          0       0        0
  mercy              1                0             1          1       1        1
  worser             1                0             1          1       1        0

Space is 500Gb !

Solution 2: Inverted index
  Brutus      2  4  8  16  32  64  128
  Calpurnia   2  3  5  8  13  21  34
  Caesar      13 16

We can still do better: i.e. 30÷50% of the original text

1. Typically each posting uses about 12 bytes
2. We have 10^9 total terms  at least 12GB of space
3. Compressing the 6GB of documents gets 1.5GB of data
Better index, but yet it is > 10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string is (losslessly) compressed, some other must expand.
Take all messages of length n. Is it possible to compress ALL of them into fewer bits ?
NO: they are 2^n, but we have fewer compressed messages...

∑_{i=1}^{n-1} 2^i = 2^n - 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H(S) = ∑_{s∈S} p(s) · log2 (1/p(s))   bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
[Figure: the binary trie of the code; a is the leaf reached by 0, b by 100, c by 101, d by 11]

Average Length
For a code C with codeword length L[s], the
average length is defined as
La(C) = ∑_{s∈S} p(s) · L[s]

We say that a prefix code C is optimal if for all prefix codes C', La(C) ≤ La(C')

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal uniquely decodable code, there exists a prefix code with the same codeword lengths, and thus the same optimal average length. And vice versa...

Theorem (golden rule). If C is an optimal prefix code for a source with probabilities {p1, ..., pn},
then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

[Figure: Huffman tree; a(.1) and b(.2) merge into (.3), which merges with c(.2) into (.5), which merges with d(.5) into the root (1)]

a = 000, b = 001, c = 01, d = 1

There are 2^(n-1) "equivalent" Huffman trees

What about ties (and thus, tree depth) ?
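A compact sketch of Huffman's construction with a min-heap (ties broken arbitrarily; as noted above, tie-breaking affects tree depth but not optimality, so the codeword lengths match the slide's code even if the bits differ):

```python
import heapq

def huffman_codes(probs):
    """probs: dict symbol -> probability. Returns dict symbol -> codeword string."""
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)      # the two least probable (sub)trees
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

print(huffman_codes({"a": .1, "b": .2, "c": .2, "d": .5}))
# {'d': '0', 'c': '10', 'a': '110', 'b': '111'}: same lengths as a=000, b=001, c=01, d=1
```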

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
Example (code a=000, b=001, c=01, d=1):    abc...  00000101        101001...  dcb

[Figure: the Huffman tree of the running example, used both to emit root-to-leaf paths and to decode bit by bit]

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store, for every level L of the tree:
 firstcode[L], the first (smallest) codeword of length L (of the form 00.....0 at the deepest level)
 Symbol[L,i], for each i in level L

This is ≤ h^2 + |S| log |S| bits

Canonical Huffman
Encoding

[Figure: a canonical Huffman code over levels 1..5; on each level the codewords are consecutive binary numbers starting at firstcode[L]]

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k  ∞ !!

In practice, we have:
 The model takes |S|^k * (k * log |S|) + h^2 bits (where h might be |S|)
 It holds that H0(S^L) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
[Figure: a fan-out-128 Huffman tree whose leaves are the words of T = "bzip or not bzip"; each codeword is a sequence of bytes, 7 bits of Huffman code per byte plus 1 tag bit marking the first byte of a codeword]

CGrep and other ideas...
P = bzip = 1a 0b

[Figure: GREP is run directly on the compressed text C(T) of T = "bzip or not bzip"; the tagged, byte-aligned codeword of the pattern is matched against C(T), answering yes/no at each codeword boundary]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary = { bzip, not, or, space },   P = bzip = 1a 0b,   S = "bzip or not bzip"

[Figure: the scan of the compressed C(S), matching P's tagged codeword against each codeword and answering yes/no]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
[Figure: the pattern P slid along the text T and compared at a candidate position]
 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which arithmetic and bit operations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet (i.e., T ∈ {0,1}^n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H(s) = ∑_{i=1}^{m} 2^{m-i} · s[i]

Example: P = 0101   H(P) = 2^3·0 + 2^2·1 + 2^1·0 + 2^0·1 = 5

s = s' if and only if H(s) = H(s')

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H(T_r) = 2·H(T_{r-1}) - 2^m·T(r-1) + T(r+m-1)

Example: T = 10110101,  T1 = 1011,  T2 = 0110
  H(T1) = H(1011) = 11
  H(T2) = 2·11 - 2^4·1 + 0 = 22 - 16 = 6 = H(0110)

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
  1·2 (mod 7) + 0 = 2
  2·2 (mod 7) + 1 = 5
  5·2 (mod 7) + 1 = 11 mod 7 = 4
  4·2 (mod 7) + 1 = 9 mod 7 = 2
  2·2 (mod 7) + 1 = 5  = Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1), since
  2^m (mod q) = 2·( 2^{m-1} (mod q) ) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
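A sketch of the randomized (Monte Carlo) variant over a binary text, with one fixed prime q just for simplicity; a real implementation would pick the prime at random as described above, and possibly verify each probable match:

```python
def karp_rabin(T, P, q=2**31 - 1):
    """Return the 0-based positions r where Hq(T_r) == Hq(P): probable matches."""
    n, m = len(T), len(P)
    if m > n:
        return []
    hP = hT = 0
    top = pow(2, m - 1, q)                 # 2^(m-1) mod q, used to drop the leaving bit
    for i in range(m):                     # fingerprints of P and of the first window
        hP = (2 * hP + P[i]) % q
        hT = (2 * hT + T[i]) % q
    out = [0] if hT == hP else []
    for r in range(1, n - m + 1):          # slide the window: drop T[r-1], append T[r+m-1]
        hT = (2 * (hT - top * T[r - 1]) + T[r + m - 1]) % q
        if hT == hP:
            out.append(r)                  # probable match (verify it for the Las Vegas version)
    return out

T = [1, 0, 1, 1, 0, 1, 0, 1]
print(karp_rabin(T, [0, 1, 0, 1]))         # [4]  i.e. position 5 in 1-based counting
```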

Problem 1: Solution
Dictionary = { bzip, not, or, space },   P = bzip = 1a 0b,   S = "bzip or not bzip"

[Figure: the scan of the compressed C(S), matching P's tagged codeword against each codeword and answering yes/no]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

[Figure: the m-by-n matrix M for T = california and P = for; column j records which prefixes of P end at T[j], e.g. M(3,7) = 1 since P = for matches T[5..7]]

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to compute the j-th column of M from the (j-1)-th one.
We define the m-length binary vector U(x) for each character x in the alphabet: U(x) is set to 1 at the positions in P where character x appears.

Example: P = abaac
  U(a) = (1,0,1,1,0)    U(b) = (0,1,0,0,0)    U(c) = (0,0,0,0,1)

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with i-th bit of U(T[j]) to estabilish if both are true

An example (j = 1, 2, 3, ..., 9)

T = xabxabaaca,   P = abaac

[Figure: the columns M(1), M(2), ..., M(9) computed one after the other via M(j) = BitShift(M(j-1)) & U(T[j]). For instance M(1) = 0 since T[1] = x does not occur in P; M(2) has a 1 in position 1 (prefix "a" matches); M(3) has a 1 in position 2 (prefix "ab" matches); at j = 9 the last bit M(5,9) = 1, i.e. P = abaac occurs in T ending at position 9.]

Shift-And method: Complexity








If m ≤ w, any column and any vector U() fit in a memory word.
  Any step requires O(1) time.
If m > w, any column and any vector U() can be divided into ⌈m/w⌉ memory words.
  Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when the pattern length is close to the word size.
  Very often the case in practice. Recall that w = 64 bits in modern architectures.
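A compact Shift-And sketch; Python integers play the role of the machine word (so the m ≤ w restriction disappears, at the cost of the O(m/w) factor being hidden inside big-integer operations):

```python
def shift_and(T, P):
    m = len(P)
    U = {}                               # U[c] has bit i set iff P[i+1] == c
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M, last = 0, 1 << (m - 1)
    occ = []
    for j, c in enumerate(T):
        # BitShift: shift the previous column up by one and set the first bit to 1
        M = ((M << 1) | 1) & U.get(c, 0)
        if M & last:                     # the m-th bit is set: P ends at position j
            occ.append(j - m + 1)        # 0-based starting position
    return occ

print(shift_and("xabxabaaca", "abaac"))  # [4]: the occurrence starting at position 5 (1-based)
```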

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
  U(a) = (1,0,1,1,0)    U(b) = (1,1,0,0,0)    U(c) = (0,0,0,0,1)
  (position 1 now accepts both a and b)

What about '?', '[^...]' (not) ?

Problem 1: Another solution

Dictionary = { bzip, not, or, space },   P = bzip = 1a 0b,   S = "bzip or not bzip"

[Figure: the same scan of the compressed C(S), now driven by the Shift-And machinery over the byte-aligned codewords]

Speed ≈ Compression ratio

Problem 2
Dictionary = { bzip, not, or, space },   S = "bzip or not bzip"

Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring.

Example: P = o    both terms containing "o" must be searched for in C(S):
  not = 1g 0g 0a
  or  = 1g 0a 0b

[Figure: the scan of the compressed C(S) for those codewords]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: the patterns P1, P2, ... aligned at their occurrences inside T]
 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

 S is the concatenation of the patterns in P
 R is a bitmap of length m: R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of the Shift-And method searching for S:
 For any symbol c, U'(c) = U(c) AND R
   U'(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
 For any step j,
   compute M(j)
   then M(j) = M(j) OR U'(T[j]). Why?  It sets to 1 the first bit of each pattern that starts with T[j].
   Check if there are occurrences ending in j. How?  Look at the bits of M(j) at the last symbol of each pattern.

Problem 3
Dictionary = { bzip, not, or, space },   S = "bzip or not bzip"

Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring, allowing at most k mismatches.

Example: P = bot, k = 2   (e.g. "not" matches P with 1 mismatch)

[Figure: the scan of the compressed C(S) for the matching terms]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal: given k, find all the occurrences of P in T with up to k mismatches.
We define the matrix M^l to be an m by n binary matrix such that:

M^l(i,j) = 1 iff there are no more than l mismatches between the first i characters of P and the i characters of T ending at character j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

[Figure: P aligned against T ending at position j; at most l mismatches occur among the first i-1 characters, and P[i] = T[j]]

  BitShift( M^l(j-1) ) & U( T[j] )

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

[Figure: P aligned against T ending at position j; at most l-1 mismatches occur among the first i-1 characters, and position i is allowed to mismatch]

  BitShift( M^{l-1}(j-1) )

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M^l(j) = [ BitShift( M^l(j-1) ) & U(T[j]) ]  OR  BitShift( M^{l-1}(j-1) )

Example M1
T = xabxabaaca,   P = abaad

[Figure: the matrices M^0 and M^1, computed column by column; the 1s in the last row of M^1 mark the ending positions of the occurrences of P in T with at most 1 mismatch]

How much do we pay?





The running time is O( k n (1 + m/w) ).
Again, the method is practically efficient for small m.
Still, only O(k) columns of M are needed at any given time. Hence, the space used by the algorithm is O(k) memory words.
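A sketch of the k-mismatch Shift-And following the recurrence above (Python integers again stand in for machine words; M[l] holds the current column of M^l):

```python
def shift_and_k_mismatches(T, P, k):
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    last = 1 << (m - 1)
    M = [0] * (k + 1)                        # M[l] = current column of M^l
    occ = []
    for j, c in enumerate(T):
        prev = M[:]                          # the columns at position j-1
        for l in range(k + 1):
            col = ((prev[l] << 1) | 1) & U.get(c, 0)       # extend with an exact char match
            if l > 0:
                col |= (prev[l - 1] << 1) | 1              # or spend one extra mismatch here
            M[l] = col
        if M[k] & last:
            occ.append(j - m + 1)            # P matches with <= k mismatches ending at j
    return occ

print(shift_and_k_mismatches("aatatccacaa", "atcgaa", 2))
# [3]: the occurrence with 2 mismatches starting at position 4 (1-based)
```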

Problem 3: Solution
Dictionary = { bzip, not, or, space },   S = "bzip or not bzip",   P = bot, k = 2

[Figure: the scan of the compressed C(S); the term not = 1g 0g 0a is reported, since it matches P = bot with 1 ≤ k mismatches]

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is d(p,s) = the minimum number of operations needed to transform p into s via three ops:
 Insertion: insert a symbol in p
 Deletion: delete a symbol from p
 Substitution: change a symbol of p into a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding

  g(x) = 0^(Length-1) followed by x in binary,  where x > 0 and Length = ⌊log2 x⌋ + 1

  e.g., 9 is represented as <000,1001>.

  The g-code for x takes 2⌊log2 x⌋ + 1 bits   (i.e. a factor of 2 from optimal)

  Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers

It is a prefix-free encoding...


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

  8, 6, 3, 59, 7
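A small sketch of g-encoding and decoding as bit strings, enough to check the exercise above:

```python
def gamma_encode(x):
    b = bin(x)[2:]                      # x > 0; binary representation, Length = len(b) bits
    return "0" * (len(b) - 1) + b       # Length-1 zeros, then x in binary

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":           # count the unary prefix
            zeros += 1
            i += 1
        out.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return out

print(gamma_encode(9))                                        # 0001001
print(gamma_decode("0001000001100110000011101100111"))        # [8, 6, 3, 59, 7]
```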

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log2 i + 1
How good is this approach wrt Huffman?  Compression ratio ≤ 2 * H0(s) + 1

Key fact:   1 ≥ ∑_{i=1,...,x} pi ≥ x * px    x ≤ 1/px

How good is it ?
The cost of the encoding is (recall i ≤ 1/pi):

  ∑_{i=1,...,|S|} pi * |g(i)|  ≤  ∑_{i=1,...,|S|} pi * [ 2 * log2 (1/pi) + 1 ]  =  2 * H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2

A new concept: Continuers vs Stoppers
 Previously we used: s = c = 128

The main idea is:
 s + c = 256 (we are playing with 8 bits)
 Thus s items are encoded with 1 byte
 And s*c with 2 bytes, s*c^2 with 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 128^2 = 16512 words on at most 2 bytes
A (230,26)-dense code encodes 230 + 230*26 = 6210 words on at most 2 bytes, hence more words on 1 byte, and thus better if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory

Properties: it exploits temporal locality, and it is dynamic

  X = 1^n 2^n 3^n ... n^n    Huff = O(n^2 log n) bits,  MTF = O(n log n) + n^2 bits

Not much worse than Huffman ...but it may be far better

MTF: how good is it ?
Encode the integers via g-coding:  |g(i)| ≤ 2 * log2 i + 1
Put S in front of the sequence and consider the cost of encoding:

  O(|S| log |S|) + ∑_{x=1,...,|S|} ∑_{i=2,...,n_x} |g( p_i^x - p_{i-1}^x )|

By Jensen's inequality:

  ≤ O(|S| log |S|) + ∑_{x=1,...,|S|} n_x * [ 2 * log2 (N / n_x) + 1 ]
  = O(|S| log |S|) + N * [ 2 * H0(X) + 1 ]

  hence La[mtf] ≤ 2 * H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and separators) as the symbols to be encoded.

How to maintain the MTF-list efficiently:
 Search tree
   Leaves contain the symbols, ordered as in the MTF-list
   Nodes contain the size of their descending subtree
 Hash Table
   key is a symbol
   data is a pointer to the corresponding tree leaf

Each tree operation takes O(log |S|) time
Total is O(n log |S|), where n = #symbols to be compressed
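A minimal MTF encoder/decoder using a plain Python list for the symbol list (so each step costs O(|S|) rather than the O(log |S|) of the tree+hash scheme above; fine for illustration):

```python
def mtf_encode(text, alphabet):
    L = list(alphabet)
    out = []
    for s in text:
        i = L.index(s)            # position of s in the current list (0-based)
        out.append(i)
        L.insert(0, L.pop(i))     # move s to the front
    return out

def mtf_decode(codes, alphabet):
    L = list(alphabet)
    out = []
    for i in codes:
        s = L[i]
        out.append(s)
        L.insert(0, L.pop(i))
    return "".join(out)

codes = mtf_encode("aaabbbbccc", "abc")
print(codes)                       # [0, 0, 0, 1, 0, 0, 0, 2, 0, 0]: runs become runs of 0
print(mtf_decode(codes, "abc"))    # aaabbbbccc
```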

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just the run lengths and one starting bit

Properties: it exploits spatial locality, and it is a dynamic code (there is a memory)

  X = 1^n 2^n 3^n ... n^n    Huff(X) = Θ(n^2 log n)  >  Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.  p(a) = .2, p(b) = .5, p(c) = .3

  f(i) = ∑_{j<i} p(j)      f(a) = .0, f(b) = .2, f(c) = .7

[Figure: the unit interval [0,1) partitioned as a = [0,.2), b = [.2,.7), c = [.7,1.0)]
The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
[Figure: starting from [0,1), after b the interval is [.2,.7); after a it becomes [.2,.3); after c it becomes [.27,.3)]
The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l 0  0 l i  l i -1  s i -1 * f c i 

s0  1

s i  s i -1 * p c i 

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

n

sn 

 p c 
i

i 1

The interval for a message sequence will be called the
sequence interval

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval
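A sketch of the interval computation just defined (encoder side only, with exact fractions to avoid floating-point issues; picking a dyadic number inside the final interval is discussed in the next slides):

```python
from fractions import Fraction as F

p = {"a": F(2, 10), "b": F(5, 10), "c": F(3, 10)}
f = {"a": F(0), "b": F(2, 10), "c": F(7, 10)}     # cumulative prob, symbol excluded

def sequence_interval(msg):
    l, s = F(0), F(1)
    for c in msg:
        l, s = l + s * f[c], s * p[c]             # l_i = l_{i-1}+s_{i-1}*f[c], s_i = s_{i-1}*p[c]
    return l, l + s                               # the sequence interval [l, l+s)

lo, hi = sequence_interval("bac")
print(lo, hi)                                     # 27/100 3/10, i.e. [.27, .3)
```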

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
[Figure: .49 falls in b's interval [.2,.7); within it, .49 falls again in b's sub-interval [.3,.55); within that, it falls in c's sub-interval [.475,.55)]

The message is bbc.

Representing a real number
Binary fractional representation:
  .75 = .11      1/3 = .010101...      11/16 = .1011
Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
  code    min       max        interval
  .11     .110...   .111...    [.75, 1.0)
  .101    .1010...  .1011...   [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

[Figure: a sequence interval [.61, .79) containing the code interval of .101, which is [.625, .75)]

Can use l + s/2 truncated to 1 + ⌈log2 (1/s)⌉ bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine:

[Figure: the ATB in state (L,s); feeding symbol c with distribution (p1,....,pS) moves it to state (L',s'), the sub-interval of (L,s) assigned to c]
Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
[Figure: PPM feeds the ATB with p[s | context], where s is either the next character c or the escape symbol esc; the ATB moves from state (L,s) to (L',s')]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts    (String = ACCBACCACBA B,  k = 2)

  Context: Empty        A = 4   B = 2   C = 5   $ = 3

  Context: A            C = 3   $ = 1
  Context: B            A = 2   $ = 1
  Context: C            A = 1   B = 2   C = 2   $ = 3

  Context: AC           B = 1   C = 2   $ = 2
  Context: BA           C = 1   $ = 1
  Context: CA           C = 1   $ = 1
  Context: CB           A = 2   $ = 1
  Context: CC           A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
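A sketch of LZW compression and decompression; the dictionary is initialized with the characters actually occurring rather than the 256 ASCII entries of the slide, just to keep the example small, and the SSc special case mentioned above is handled in the decoder:

```python
def lzw_encode(text):
    dic = {c: i for i, c in enumerate(sorted(set(text)))}   # single-char entries
    out, S = [], ""
    for c in text:
        if S + c in dic:
            S += c                          # extend the current match
        else:
            out.append(dic[S])
            dic[S + c] = len(dic)           # add Sc to the dictionary
            S = c
    out.append(dic[S])
    return out, sorted(set(text))

def lzw_decode(codes, alphabet):
    dic = {i: c for i, c in enumerate(alphabet)}
    prev = dic[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in dic:
            entry = dic[code]
        else:                                # the SSc special case: code not yet in dic
            entry = prev + prev[0]
        out.append(entry)
        dic[len(dic)] = prev + entry[0]      # the decoder is one step behind the coder
        prev = entry
    return "".join(out)

codes, alpha = lzw_encode("aabaacabababa")
print(codes)                                          # [0, 0, 1, 3, 2, 4, 8, 5]
print(lzw_decode(codes, alpha) == "aabaacabababa")    # True
```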

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us be given a text T = mississippi#    (Burrows-Wheeler, 1994)

Write down all the cyclic rotations of T:
  mississippi#   ississippi#m   ssissippi#mi   sissippi#mis   issippi#miss   ssippi#missi
  sippi#missis   ippi#mississ   ppi#mississi   pi#mississip   i#mississipp   #mississippi

Sort the rows; F and L are the first and the last column of the sorted matrix:

  F              L
  #  mississipp  i
  i  #mississip  p
  i  ppi#missis  s
  i  ssippi#mis  s
  i  ssissippi#  m
  m  ississippi  #
  p  i#mississi  p
  p  pi#mississ  i
  s  ippi#missi  s
  s  issippi#mi  s
  s  sippi#miss  i
  s  sissippi#m  i

L is the output of the transform.

A famous example

Much
longer...

A useful tool: the L  F mapping

[Figure: the sorted BWT matrix again, with its F and L columns; the middle of each row is unknown to the decoder]

How do we map L's chars onto F's chars ? ... We need to distinguish equal chars in F...
Take two equal chars of L and rotate their rows rightward by one: they keep the same relative order !!

The BWT is invertible
[Figure: the sorted matrix again, with its F and L columns]

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}

How to compute the BWT ?
SA = [12 11 8 5 2 1 10 9 7 4 6 3]

[Figure: the rows of the BWT matrix, in sorted order, correspond exactly to the suffixes listed by SA; the L column can be read off directly from SA]
We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Q(n2 log n) time in the worst-case
• Q(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in - degree ( u )  k ] 

a  2.1

1

k

a

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/x^a, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides and efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
          Emacs size   Emacs time
uncompr   27Mb         ---
gzip      8Mb          35 secs
zdelta    1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: the weighted graph over the files plus the dummy node; the min branching picks, for each file, either its gzip-coding (edge from the dummy node) or a zdelta wrt another file]

          space   time
uncompr   30Mb    ---
tgz       20%     linear
THIS      8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly: n^2 edge calculations (zdelta executions).

We wish to exploit some pruning approach:
 Collection analysis: cluster the files that appear similar and thus are good candidates for zdelta-compression. Build a sparse weighted graph G'_F containing only edges between those pairs of files.
 Assign weights: estimate appropriate edge weights for G'_F, thus saving zdelta executions. Nonetheless, strictly n^2 time.

          space   time
uncompr   260Mb   ---
tgz       12%     2 mins
THIS      8%      16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
[Figure: the Client sends the block hashes of its f_old to the Server; the Server replies with the encoded f_new (copy commands for matching blocks + literal bytes)]

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

          gcc size   emacs size
total     27288      27326
gzip      7563       8577
zdelta    227        1431
rsync     964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits



Slide 4

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses   [0.3 ÷ 0.4 (Hennessy-Patterson)]
C = cost of an I/O                [10^5 ÷ 10^6 (Hennessy-Patterson)]

If N = (1+f)M, then the disk-average cost per step is:
    C * p * f/(1+f)
This is at least 10^4 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and the algorithm uses all of it:
    (1/B) * (p * f/(1+f) * C)  ≈  30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU (registers) → L1 → L2 → RAM → HD → net

Cache: Few Mbs, some nanosecs, few words fetched
RAM:   Few Gbs, tens of nanosecs, some words fetched
HD:    Few Tbs, few millisecs, B = 32K page
net:   Many Tbs, even secs, packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
  n      4K    8K    16K   32K    128K   256K   512K   1M
  n^3    22s   3m    26m   3.5h   28h    --     --     --
  n^2    0     0     0     1s     26s    106s   7m     28m

An optimal solution
We assume every subsum ≠ 0

[Figure: A is split into a prefix whose subsums are < 0, followed by the optimum window,
whose subsums are > 0.]

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm
  sum = 0; max = -1;
  for i = 1,...,n do
    if (sum + A[i] ≤ 0) then sum = 0;
    else { sum += A[i]; max = MAX{max, sum}; }

Note:
• sum < 0 when OPT starts;
• sum > 0 within OPT
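
A runnable sketch of this one-pass scan (Kadane-style, assuming — as above — that the
optimum is non-empty):

def max_subarray(A):
    best, cur = float("-inf"), 0
    start, best_range = 0, (0, 0)
    for i, x in enumerate(A):
        if cur + x <= 0:
            cur, start = 0, i + 1          # restart: a prefix with non-positive sum never helps
        else:
            cur += x
            if cur > best:
                best, best_range = cur, (start, i)
    return best, best_range

print(max_subarray([2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]))   # -> (12, (2, 6)) i.e. 6+1-2+4+3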

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort → Θ(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 10^9 random I/Os = 10^9 * 5ms ≈ 2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02   m = (i+j)/2;            // Divide
03   Merge-Sort(A,i,m);      // Conquer
04   Merge-Sort(A,m+1,j);
05   Merge(A,i,m,j)          // Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:
  n = 10^9 tuples → a few Gbs
  Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:
  It is an indirect sort: Θ(n log2 n) random I/Os
  [5ms] * n log2 n ≈ 1.5 years
  In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
[Figure: recursion tree of binary Merge-Sort on a sample array; each level merges sorted
runs of doubling size. How do we deploy the disk/memory features? Runs of size M fit in
internal memory.]
N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort

The key is to balance run-size and #runs to merge.
Sort N items with main-memory M and disk-pages B:
  Pass 1: produce N/M sorted runs.
  Pass i: merge X = M/B runs at a time → log_{M/B}(N/M) merge passes.

[Figure: X input buffers (one per run) and one output buffer, each of B items, kept in
main memory; the runs and the merged output reside on disk.]

Multiway Merging
[Figure: buffers Bf1..Bfx with cursors p1..pX over the current pages of runs 1..X = M/B;
at each step the minimum of Bf1[p1], Bf2[p2], …, Bfx[pX] is moved to the output buffer
Bfo; the next page of run i is fetched when pi = B, and Bfo is flushed to the merged
output file when full, until EOF.]
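
For illustration, one merge pass over X sorted runs can be sketched with a heap over the
run heads (in the external-memory version each run is read through a B-page buffer and
the output buffer is flushed to disk when full):

import heapq

def multiway_merge(runs):                 # runs: list of sorted sequences
    return list(heapq.merge(*runs))       # min-heap over the current head of each run

runs = [[1, 5, 9], [2, 4, 10], [3, 6, 7, 8]]
print(multiway_merge(runs))               # -> [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]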

Cost of Multi-way Merge-Sort

Number of passes = log_{M/B}(#runs) ≈ log_{M/B}(N/M)
Optimal cost = Θ((N/B) * log_{M/B}(N/M)) I/Os

In practice
  M/B ≈ 1000 → #passes = log_{M/B}(N/M) ≈ 1
  One multiway merge → 2 passes (R/W) = few mins
  Tuning depends on disk features

→ Large fan-out (M/B) decreases #passes
→ Compression would decrease the cost of a pass!

May compression help?

Goal: enlarge M and reduce N
  #passes = O(log_{M/B}(N/M))
  Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:
  Disk Striping: sorting easily on D disks
  Distribution sort: top-down sorting
  Lower Bounds: how low we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm
  Use a pair of variables <X, C>, initialized with the first item (X = A[1], C = 1)
  For each subsequent item s of the stream:
    if (X == s) then C++
    else { C--; if (C == 0) { X = s; C = 1; } }
  Return X;
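
A sketch of this majority-vote scan (Boyer–Moore voting, with the initialization made
explicit):

def majority_candidate(stream):
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1
        elif X == s:
            C += 1
        else:
            C -= 1
    return X            # must be verified with a second pass if a majority is not guaranteed

A = "bacccdcbaaaccbccc"
print(majority_candidate(A))   # -> 'c' (it occurs 9 times out of 17 > N/2)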

Proof
The problematic case is X ≠ y at the end, although y occurs > N/2 times.
If X ≠ y, then every one of y’s occurrences has a distinct “negative” mate (a decrement
caused by a different item). Hence the mates are ≥ #occ(y), so 2 * #occ(y) ≤ N,
contradicting #occ(y) > N/2.

Toy problem #4: Indexing


Consider the following TREC collection:
  N = 6 * 10^9 chars, i.e. size ≈ 6Gb
  n = 10^6 documents
  TotT = 10^9 term occurrences (avg term length is 6 chars)
  t = 5 * 10^5 distinct terms

What kind of data structure do we build to support word-based searches ?

Solution 1: Term-Doc matrix   (t = 500K terms  ×  n = 1 million docs)

            Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony             1                1             0          0        0        1
Brutus             1                1             0          1        0        0
Caesar             1                1             0          1        1        1
Calpurnia          0                1             0          0        0        0
Cleopatra          1                0             0          0        0        0
mercy              1                0             1          1        1        1
worser             1                0             1          1        1        0

1 if the play contains the word, 0 otherwise.
Space is 500Gb !

Solution 2: Inverted index

Brutus    →  2  4  8  16  32  64  128
Calpurnia →  1  2  3  5  8  13  21  34
Caesar    →  13  16

We can still do better: i.e. 30÷50% of the original text.

1. Typically each posting uses about 12 bytes
2. We have 10^9 total term occurrences → at least 12Gb of space
3. Compressing the 6Gb of documents gets 1.5Gb of data
A better index than the matrix, but it is still >10 times the (compressed) text !!!!
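
A toy sketch of an inverted index whose postings are stored as d-gaps, the representation
that the compressors discussed later exploit (names and data are illustrative):

from collections import defaultdict

def build_index(docs):                        # docs: dict docid -> text
    index = defaultdict(list)
    for docid in sorted(docs):
        for term in set(docs[docid].split()):
            index[term].append(docid)
    return index

def gaps(postings):                           # [2, 4, 8, 16] -> [2, 2, 4, 8]
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

docs = {1: "brutus killed caesar", 2: "caesar was ambitious", 3: "brutus was honourable"}
idx = build_index(docs)
print(idx["brutus"], gaps(idx["brutus"]))     # -> [1, 3] [1, 2]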

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in fewer bits ?
NO: they are 2^n, but we have fewer compressed messages:

    Σ_{i=1,…,n-1} 2^i  =  2^n - 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the self-information of s is:

    i(s) = log2 (1/p(s)) = -log2 p(s)

Lower probability → higher information.

Entropy is the weighted average of i(s):

    H(S) = Σ_{s∈S} p(s) * log2 (1/p(s))   bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g. a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie:

          .
        0/ \1
        a   .
          0/ \1
          .   d
        0/ \1
        b   c

Average Length
For a code C with codeword lengths L[s], the average length is defined as

    La(C) = Σ_{s∈S} p(s) * L[s]

We say that a prefix code C is optimal if for all prefix codes C’, La(C) ≤ La(C’).

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal uniquely decodable code, there exists a prefix
code with the same codeword lengths, and thus the same (optimal) average length.
And vice versa…

Theorem (golden rule). If C is an optimal prefix code for a source with probabilities
{p1, …, pn}, then pi < pj ⇒ L[si] ≥ L[sj].

Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any uniquely
decodable code C, we have

    H(S) ≤ La(C)

Theorem (upper bound, Shannon). For any probability distribution, there exists a prefix
code C such that

    La(C) ≤ H(S) + 1

(The Shannon code, which gives symbol s a codeword of ⌈log2 1/p(s)⌉ bits, achieves it.)

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

        (1)
       0/  \1
     (.5)   d(.5)
    0/   \1
  (.3)    c(.2)
 0/   \1
a(.1)  b(.2)

a = 000, b = 001, c = 01, d = 1
There are 2^(n-1) “equivalent” Huffman trees.

What about ties (and thus, tree depth) ?
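
A compact sketch of the heap-based construction (ties are broken arbitrarily, which is
exactly why many equivalent trees exist):

import heapq

def huffman_codes(probs):                          # probs: dict symbol -> probability
    heap = [(p, [s]) for s, p in probs.items()]    # (weight, symbols in the subtree)
    code = {s: "" for s in probs}
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, left = heapq.heappop(heap)
        p1, right = heapq.heappop(heap)
        for s in left:  code[s] = "0" + code[s]    # prepend the branch bit
        for s in right: code[s] = "1" + code[s]
        heapq.heappush(heap, (p0 + p1, left + right))
    return code

print(huffman_codes({"a": .1, "b": .2, "c": .2, "d": .5}))
# -> {'a': '010', 'b': '011', 'c': '00', 'd': '1'}: lengths 3,3,2,1, equivalent to the slide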

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store, for any level L:
  firstcode[L]  (the numeric value of the first codeword of length L)
  Symbol[L,i], for each i in level L

This is ≤ h^2 + |S| log |S| bits

Canonical Huffman: Encoding
[Figure: canonical codeword assignment over the levels 1…5 of the tree]

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. Its self-information is

    -log2(.999) ≈ .00144 bits

If we were to send 1000 such symbols we might hope to use 1000 * .00144 ≈ 1.44 bits.

Using Huffman, we take at least 1 bit per symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
  → 1 extra bit per macro-symbol = 1/k extra bits per symbol
  → Larger model to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:
  The model takes |S|^k * (k * log |S|) + h^2 bits  (where h might be |S|)
  It is H0(S^L) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
[Figure: word-based Huffman tree with fan-out 128 over the words of T = "bzip or not bzip".
Each codeword is a sequence of bytes; the first bit of each byte is a tag (1 for the first
byte of a codeword, 0 for a continuation byte), leaving 7 bits for the Huffman symbol,
e.g. bzip → 1a 0b.]

CGrep and other ideas...
P= bzip = 1a 0b

[Figure: GREP run directly on C(T), T = "bzip or not bzip": the tagged, byte-aligned
codeword of P = bzip (1a 0b) is searched in the compressed byte stream; the tag bits
prevent false matches that start in the middle of another codeword (yes at the two bzip
codewords, no elsewhere).]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary = {bzip, not, or, space},  P = bzip = 1a 0b,  S = "bzip or not bzip"
Find all the occurrences of the dictionary term P in S, working directly on the
compressed text C(S).

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
[Figure: the pattern P is slid along the text T, window by window.]

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which arithmetic and bit-operations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errs, whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons

Strings are also numbers, H: strings → numbers.
Let s be a string of length m:

    H(s) = Σ_{i=1,…,m} 2^(m-i) * s[i]

Example: P = 0101,  H(P) = 2^3*0 + 2^2*1 + 2^1*0 + 2^0*1 = 5
s = s’ if and only if H(s) = H(s’)

Definition:
let Tr denote the m-length substring of T starting at
position r (i.e., Tr = T[r, r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons

We can compute H(Tr) from H(Tr-1):

    H(Tr) = 2 * H(Tr-1) - 2^m * T[r-1] + T[r+m-1]

T = 10110101,  m = 4
T1 = 1011,  T2 = 0110
H(T1) = H(1011) = 11
H(T2) = 2*11 - 2^4 * 1 + 0 = 22 - 16 = 6 = H(0110)

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P = 101111,  q = 7
H(P) = 47,   Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally (Horner + mod):
    1*2 + 0 = 2            (mod 7)
    2*2 + 1 = 5            (mod 7)
    5*2 + 1 = 11 ≡ 4       (mod 7)
    4*2 + 1 = 9  ≡ 2       (mod 7)
    2*2 + 1 = 5            (mod 7)   →   5 = Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1), using
    2^m (mod q) = 2 * (2^(m-1) (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
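
A minimal sketch of the fingerprint scan over a binary text, verifying every fingerprint
match (the "deterministic" variant); the modulus below is just an example prime:

def karp_rabin(T, P, q=2_147_483_647):
    n, m = len(T), len(P)
    if m > n: return []
    pow_m = pow(2, m, q)                         # 2^m mod q, used by the rolling update
    hP = hT = 0
    for i in range(m):
        hP = (2 * hP + int(P[i])) % q
        hT = (2 * hT + int(T[i])) % q
    occ = []
    for r in range(n - m + 1):
        if hP == hT and T[r:r+m] == P:           # verify to rule out false matches
            occ.append(r)
        if r + m < n:                            # slide the window: drop T[r], add T[r+m]
            hT = (2 * hT - pow_m * int(T[r]) + int(T[r + m])) % q
    return occ

print(karp_rabin("10110101", "0101"))            # -> [4] (0-based; the slides' T5)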

Problem 1: Solution
Dictionary = {bzip, not, or, space},  P = bzip = 1a 0b,  S = "bzip or not bzip"
Encode P with the same word-based compressor used for S and scan C(S) with an exact
string-matching algorithm: the tag bits keep the scan aligned on codeword boundaries, so
matches are reported only at true word occurrences (yes at the two bzip codewords, no
elsewhere).

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M (m×n):
         c a l i f o r n i a
    j =  1 2 3 4 5 6 7 8 9 10
    f    0 0 0 0 1 0 0 0 0 0
    fo   0 0 0 0 0 1 0 0 0 0
    for  0 0 0 0 0 0 1 0 0 0

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit bit-parallelism to
compute the j-th column of M from the (j-1)-th one.
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example: P = abaac
    U(a) = (1,0,1,1,0)     U(b) = (0,1,0,0,0)     U(c) = (0,0,0,0,1)

How to construct M

Initialize column 0 of M to all zeros.
For j > 0, the j-th column is obtained as

    M(j) = BitShift( M(j-1) ) & U( T[j] )

For i > 1, entry M(i,j) = 1 iff
  (1) the first i-1 characters of P match the i-1 characters of T ending at j-1,
      i.e. M(i-1, j-1) = 1, and
  (2) P[i] = T[j], i.e. the i-th bit of U(T[j]) = 1.

BitShift moves bit M(i-1, j-1) into the i-th position; ANDing it with the i-th bit of
U(T[j]) establishes whether both conditions hold.
An example (P = abaac, T = xabxabaaca)

j=1: T[1]=x, U(x) = (0,0,0,0,0)  →  M(1) = (0,0,0,0,0)
j=2: T[2]=a  →  M(2) = BitShift(M(1)) & U(a) = (1,0,0,0,0)
j=3: T[3]=b  →  M(3) = BitShift(M(2)) & U(b) = (0,1,0,0,0)
…
j=9: T[9]=c  →  M(9) = BitShift(M(8)) & U(c) = (0,0,0,0,1)  →  an occurrence of P ends at 9

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
    U(a) = (1,0,1,1,0)     U(b) = (1,1,0,0,0)     U(c) = (0,0,0,0,1)

What about ‘?’, ‘[^…]’ (not) ?

Problem 1: Another solution
Dictionary = {bzip, not, or, space},  P = bzip = 1a 0b,  S = "bzip or not bzip"
Search the tagged codeword of P directly in C(S) with the Shift-And method just
described, treating the bytes of C(S) as the text symbols.

Speed ≈ Compression ratio

Problem 2
Given a pattern P, find all the occurrences in S of all dictionary terms containing P
as a substring.
Dictionary = {bzip, not, or, space},  S = "bzip or not bzip",  e.g. P = o
  or  = 1g 0a 0b
  not = 1g 0g 0a

Speed ≈ Compression ratio? No! Why?
A scan of C(S) is needed for each term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: a text T with the occurrences of the patterns P1 and P2 highlighted.]

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

S is the concatenation of the patterns in P.
R is a bitmap of length m:
  R[i] = 1 iff S[i] is the first symbol of a pattern.

Use a variant of the Shift-And method searching for S:
  For any symbol c, U’(c) = U(c) AND R
    → U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
  For any step j:
    compute M(j), then set M(j) = M(j) OR U’(T[j]). Why?
    → it sets to 1 the first bit of each pattern that starts with T[j]
    Check whether some occurrence ends at j. How?

Problem 3
Given a pattern P, find all the occurrences in S of all dictionary terms containing P
as a substring, allowing at most k mismatches.
Dictionary = {bzip, not, or, space},  S = "bzip or not bzip",  P = bot,  k = 2

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
the first i characters of P match the i characters of T ending at
position j, with no more than l mismatches.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1

The first i-1 characters of P match a substring of T ending at j-1 with at most l
mismatches, and the next pair of characters P[i] and T[j] are equal:

    BitShift( Ml(j-1) ) & U( T[j] )

Computing Ml: case 2

The first i-1 characters of P match a substring of T ending at j-1 with at most l-1
mismatches (then P[i] and T[j] are allowed to mismatch):

    BitShift( Ml-1(j-1) )

Computing Ml

We compute Ml for all l = 0,…,k; for each j we compute M(j), M1(j), …, Mk(j),
initializing each Ml(0) to the zero vector. Combining the two cases:

    Ml(j) = [ BitShift( Ml(j-1) ) & U(T[j]) ]  OR  BitShift( Ml-1(j-1) )

Example M1   (P = abaad, T = xabxabaaca)

       j =  1 2 3 4 5 6 7 8 9 10
  M1 =      1 1 1 1 1 1 1 1 1 1
            0 0 1 0 0 1 0 1 1 0
            0 0 0 1 0 0 1 0 0 1
            0 0 0 0 1 0 0 1 0 0
            0 0 0 0 0 0 0 0 1 0   ← abaad matches T[5..9] = abaac with 1 mismatch

  M0 =      0 1 0 0 1 0 1 1 0 1
            0 0 1 0 0 1 0 0 0 0
            0 0 0 0 0 0 1 0 0 0
            0 0 0 0 0 0 0 1 0 0
            0 0 0 0 0 0 0 0 0 0

How much do we pay?





The running time is O(k n (1 + m/w)).
Again, the method is practically efficient for small m.
Only O(k) columns of M are needed at any given time, hence the space used by the
algorithm is O(k) memory words (for m ≤ w).

Problem 3: Solution
Dictionary = {bzip, not, or, space},  S = "bzip or not bzip",  P = bot,  k = 2
Run the Shift-And method with errors (agrep, next slides) over the dictionary terms:
with P = bot and k = 2 the term not = 1g 0g 0a is reported, and its codeword is then
searched for in C(S).

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding

    γ(x) = 0^(Length-1) followed by x in binary,   where x > 0 and Length = ⌊log2 x⌋ + 1

e.g., 9 is represented as <000, 1001>.
The γ-code for x takes 2⌊log2 x⌋ + 1 bits  (i.e. a factor of 2 from optimal).
Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers.

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

    →  8, 6, 3, 59, 7
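
A sketch of γ-encoding/decoding as defined above (bit strings are used for clarity):

def gamma_encode(x):                                 # x >= 1
    b = bin(x)[2:]                                   # binary representation, no leading zeros
    return "0" * (len(b) - 1) + b

def gamma_decode_stream(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0": z += 1; i += 1         # count the unary length prefix
        out.append(int(bits[i:i + z + 1], 2))        # read z+1 bits of the value
        i += z + 1
    return out

print(gamma_encode(9))                                           # -> '0001001'
print(gamma_decode_stream("0001000001100110000011101100111"))   # -> [8, 6, 3, 59, 7]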

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).

Recall that |γ(i)| ≤ 2 * log2 i + 1.
How good is this approach wrt Huffman?  Compression ratio ≤ 2 * H0(S) + 1
Key fact:   1 ≥ Σ_{i=1,…,x} pi ≥ x * px   →   x ≤ 1/px

How good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2 * log2 i + 1.
The cost of the encoding is (recall i ≤ 1/pi):

    Σ_{i=1,…,|S|} pi * |γ(i)|  ≤  Σ_{i=1,…,|S|} pi * [ 2 * log2(1/pi) + 1 ]  =  2 * H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + …

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2

A new concept: Continuers vs Stoppers
  Previously we used s = c = 128

The main idea is:
  s + c = 256 (we are playing with 8 bits)
  Thus s items are encoded with 1 byte,
  s*c with 2 bytes, s*c^2 with 3 bytes, ...

An example
  5000 distinct words
  ETDC encodes 128 + 128^2 = 16512 words on at most 2 bytes
  A (230,26)-dense code encodes 230 + 230*26 = 6210 words on at most 2 bytes,
  hence more words on 1 byte — better if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC is quite interesting…
Search is 6% faster than byte-aligned Huffword.

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:
  There is a memory: it exploits temporal locality, and it is dynamic
  X = 1^n 2^n 3^n … n^n  →  Huff = O(n^2 log n) bits, MTF = O(n log n) + n^2 bits

Not much worse than Huffman
...but it may be far better
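
A sketch of MTF encoding (1-based positions, as in the slides); decoding mirrors the
same list updates:

def mtf_encode(text, alphabet):
    L = list(alphabet)
    out = []
    for s in text:
        i = L.index(s)
        out.append(i + 1)                 # position of s in the current list
        L.insert(0, L.pop(i))             # move s to the front
    return out

print(mtf_encode("abbbaacccca", "abcd"))  # -> [1, 2, 1, 1, 2, 1, 3, 1, 1, 1, 2]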

MTF: how good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2 * log2 i + 1.
Put S in front of the sequence and consider the cost of encoding: a symbol x occurring
n_x times at positions p_1 < p_2 < … < p_{n_x} costs at most

    O(|S| log |S|)  +  Σ_{x∈S} Σ_{i=2,…,n_x} |γ(p_i - p_{i-1})|

By Jensen’s inequality this is

    ≤ O(|S| log |S|) + Σ_{x∈S} n_x * [ 2 * log2(N/n_x) + 1 ]
    = O(|S| log |S|) + N * [ 2 * H0(X) + 1 ]

Hence La[mtf] ≤ 2 * H0(X) + O(1) bits per symbol.

MTF: higher compression
To achieve higher compression we consider words (and separators) as the symbols to be
encoded.
How to maintain the MTF-list efficiently:
  Search tree
    Leaves contain the symbols, ordered as in the MTF-list
    Nodes contain the size of their descending subtree
  Hash Table
    key is a symbol
    data is a pointer to the corresponding tree leaf

Each tree operation takes O(log |S|) time.
Total is O(n log |S|), where n = #symbols to be compressed.

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings → just the run lengths and one starting bit.
Properties:
  There is a memory: it exploits spatial locality, and it is a dynamic code
  X = 1^n 2^n 3^n … n^n  →  Huff(X) = Θ(n^2 log n)  >  Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol s an interval range from 0 (inclusive) to 1 (exclusive), of width
p(s) and starting at the cumulative probability

    f(s) = Σ_{s' < s} p(s')

e.g.  p(a) = .2, p(b) = .5, p(c) = .3   →   f(a) = .0, f(b) = .2, f(c) = .7
i.e.  a ↦ [0, .2),  b ↦ [.2, .7),  c ↦ [.7, 1.0)

The interval for a particular symbol will be called the symbol interval
(e.g. for b it is [.2,.7)).

Arithmetic Coding: Encoding Example
Coding the message sequence: bac

    start with [0, 1);  b → [.2, .7);  then a → [.2, .3);  then c → [.27, .3)

The final sequence interval is [.27, .3).

Arithmetic Coding
To code a sequence of symbols c1 c2 … cn with probabilities p[c], use the following:

    l0 = 0,   li = li-1 + si-1 * f[ci]
    s0 = 1,   si = si-1 * p[ci]

f[c] is the cumulative probability up to symbol c (not included).
The final interval size is

    sn = Π_{i=1,…,n} p[ci]

The interval for a message sequence will be called the sequence interval.
Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval
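
A floating-point sketch of the interval computation above (real coders use the
integer/scaling version described below; names are illustrative):

def sequence_interval(msg, p):
    symbols = sorted(p)                        # fixed symbol order
    f, acc = {}, 0.0
    for s in symbols:
        f[s] = acc; acc += p[s]                # f[s] = cumulative prob. of smaller symbols
    l, size = 0.0, 1.0
    for c in msg:
        l, size = l + size * f[c], size * p[c]
    return l, l + size                         # the sequence interval [l, l+s)

print(sequence_interval("bac", {"a": .2, "b": .5, "c": .3}))   # -> ~(0.27, 0.3)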

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:

    .49 ∈ [.2, .7)  → b;   within it, .49 ∈ [.3, .55)  → b;   within it, .49 ∈ [.475, .55)  → c

The message is bbc.

Representing a real number
Binary fractional representation:

    .75 = .11       1/3 = .010101…       11/16 = .1011

Algorithm
  1. x = 2 * x
  2. if x < 1, output 0
  3. else x = x - 1; output 1

So how about just using the shortest binary fractional representation within the
sequence interval?
e.g.  [0, .33) → .01      [.33, .66) → .1      [.66, 1) → .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions.

    code    min      max      interval
    .11     .110…    .111…    [.75, 1.0)
    .101    .1010…   .1011…   [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval is contained
in the sequence interval (a dyadic number).

    Sequence interval [.61, .79)  ⊇  code interval of .101 = [.625, .75)

Can use l + s/2 truncated to 1 + ⌈log2(1/s)⌉ bits.

Bound on the Arithmetic-coding length: note that -log2 s + 1 = log2(2/s).

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

    1 + ⌈log2(1/s)⌉ = 1 + ⌈log2 Π_{i=1,…,n} (1/pi)⌉
                    ≤ 2 + Σ_{i=1,…,n} log2(1/pi)
                    = 2 + Σ_{k=1,…,|S|} n*pk * log2(1/pk)
                    = 2 + n*H0   bits

In practice ≈ n*H0 + 0.02*n bits, because of rounding.

Integer Arithmetic Coding
Problem: operations on arbitrary-precision real numbers are expensive.
Key ideas of the integer version:
  Keep integers in the range [0..R) where R = 2^k
  Use rounding to generate the integer interval
  Whenever the sequence interval falls into the top, bottom or middle half, expand the
  interval by a factor 2

Integer Arithmetic is an approximation.

Integer Arithmetic (scaling)
If l ≥ R/2 (top half):
  Output 1 followed by m 0s;  m = 0;  the message interval is expanded by 2
If u < R/2 (bottom half):
  Output 0 followed by m 1s;  m = 0;  the message interval is expanded by 2
If l ≥ R/4 and u < 3R/4 (middle half):
  Increment m;  the message interval is expanded by 2
In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine: the ATB maps the current interval (L,s) and the next symbol c
(with distribution p1,…,p|S|) to the new interval (L',s'):

    (L, s), c   --ATB-->   (L', s')

Therefore, even the distribution can change over time.

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
The ATB is driven by the conditional distribution p[ s | context ], where s is either a
real symbol c or the escape symbol esc:

    (L, s)   --ATB-->   (L', s')

Encoder and Decoder must know the protocol for selecting the same conditional
probability distribution (PPM-variant).

PPM: Example Contexts   (String = ACCBACCACBA, next char B;  k = 2)

  Context empty:   A=4  B=2  C=5  $=3

  Context A:    C=3  $=1
  Context B:    A=2  $=1
  Context C:    A=1  B=2  C=2  $=3

  Context AC:   B=1  C=2  $=2
  Context BA:   C=1  $=1
  Context CA:   C=1  $=1
  Context CB:   A=2  $=1
  Context CC:   A=1  B=1  $=2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
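
A toy LZ77 encoder emitting (distance, length, next-char) triples over a sliding window,
reproducing the example above (this is not the gzip/LZSS format):

def lz77_encode(s, W=6):
    i, out = 0, []
    while i < len(s):
        best_d = best_len = 0
        for j in range(max(0, i - W), i):                 # candidate copy sources in the window
            l = 0
            while i + l < len(s) - 1 and s[j + l] == s[i + l]:
                l += 1                                    # overlap with the lookahead is allowed
            if l > best_len:
                best_d, best_len = i - j, l
        out.append((best_d, best_len, s[i + best_len]))
        i += best_len + 1
    return out

print(lz77_encode("aacaacabcabaaac"))
# -> [(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')], as in the windowed example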

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
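
A toy LZW encoder following this scheme (the decoder's one-step-behind special case is
not shown); the symbol ids mimic the slides:

def lzw_encode(s):
    dict_ = {"a": 112, "b": 113, "c": 114}            # toy "ascii" ids as in the slides
    next_id = 256
    out, cur = [], ""
    for c in s:
        if cur + c in dict_:
            cur += c                                   # extend the longest match S
        else:
            out.append(dict_[cur])
            dict_[cur + c] = next_id; next_id += 1     # add Sc to the dictionary
            cur = c
    out.append(dict_[cur])
    return out

print(lzw_encode("aabaacababacb"))
# -> [112, 112, 113, 256, 114, 257, 261, 114, 113]; the dictionary gains 256=aa … 263=cb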

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: the L → F mapping
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
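
A sketch of the forward transform (via sorted rotations, fine for toy inputs) and of the
inversion through the LF-mapping described above:

def bwt(T):                                   # T must end with a unique smallest char '#'
    rot = sorted(T[i:] + T[:i] for i in range(len(T)))
    return "".join(r[-1] for r in rot)        # L = last column of the sorted rotations

def ibwt(L, marker="#"):
    n = len(L)
    order = sorted(range(n), key=lambda i: (L[i], i))   # stable: F-row j holds char L[order[j]]
    LF = [0] * n
    for f_row, l_row in enumerate(order):
        LF[l_row] = f_row
    r, T = L.index(marker), []                # the row ending with the marker is T itself
    for _ in range(n):
        T.append(L[r])                        # L[r] precedes F[r] in T: walk T backwards
        r = LF[r]
    return "".join(reversed(T))

L = bwt("mississippi#")
print(L)             # -> ipssm#pissii
print(ibwt(L))       # -> mississippi#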

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n^2 log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus γ(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution
Altavista crawl (1999) and WebBase crawl (2001): the indegree follows a power-law
distribution

    Pr[ in-degree(u) = k ]  ∝  1 / k^a,   a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:
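
A simplified sketch of the gap encoding of a successor list (not the exact WebGraph
format): gaps are small thanks to locality, and the first gap — taken w.r.t. the source
node x — may be negative, hence the signed-to-unsigned mapping:

def encode_successors(x, succ):                 # succ: sorted successors of node x
    gaps = [succ[0] - x] + [b - a - 1 for a, b in zip(succ, succ[1:])]
    def to_unsigned(v):                         # 0,1,-1,2,-2,... -> 0,2,1,4,3,...
        return 2 * v if v >= 0 else 2 * (-v) - 1
    return [to_unsigned(gaps[0])] + gaps[1:]

print(encode_successors(15, [13, 15, 16, 17, 18, 19, 23, 24, 203]))
# -> [3, 1, 0, 0, 0, 0, 3, 0, 178]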

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77 scheme provides an efficient, optimal solution




fknown is the “previously encoded text”: compress the concatenation fknown·fnew starting from fnew

zdelta is one of the best implementations
            Emacs size    Emacs time
uncompr     27Mb          ---
gzip        8Mb           35 secs
zdelta      1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: weighted graph over the files, plus a dummy node connected to all of them;
edge weights are the zdelta sizes, dummy-edge weights the gzip sizes.]

            space    time
uncompr     30Mb     ---
tgz         20%      linear
THIS        8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
            space    time
uncompr     260Mb    ---
tgz         12%      2 mins
THIS        8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

            gcc size    emacs size
total       27288       27326
gzip        7563        8577
zdelta      227         1431
rsync       964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends hashes (unlike the client in rsync), and the client checks them.
The server deploys the common fref to compress the new ftar (rsync compresses just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k log n log(n/k)) bits.

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Figure: the suffix tree of T# = mississippi#, with one leaf per suffix labelled by its
starting position 1..12 and edge labels such as i, s, si, ssi, ppi#, pi#, #.]



Slide 5

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 10^4 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and the algorithm uses all of it:
(1/B) * (p * f/(1+f) * C) ≈ 30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
[Slide figure: the memory hierarchy again - CPU registers, L1/L2 cache (few Mbs, some nanosecs, few words fetched), RAM (few Gbs, tens of nanosecs, some words fetched), disk (few Tbs, few millisecs, B = 32K page), network (many Tbs, even secs, packets).]

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
Empirical running times for input size n:

  n      4K    8K    16K   32K    128K   256K   512K   1M
  n^3    22s   3m    26m   3.5h   28h    --     --     --
  n^2    0     0     0     1s     26s    106s   7m     28m

An optimal solution
We assume every subsum ≠ 0.

[Slide figure: A is split into a region of negative prefix sums (<0) followed by the region (>0) that contains the optimum window.]

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm
  sum = 0; max = -1;
  For i = 1,...,n do
    If (sum + A[i] ≤ 0) then sum = 0;
    else { sum += A[i]; max = MAX{max, sum}; }

Note:
• Sum < 0 when OPT starts;
• Sum > 0 within OPT
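A runnable sketch of this scan (assumed Python; like the slide, it presumes the optimum sum is positive):

def max_subarray_sum(A):
    best = float("-inf")     # "max" in the slide
    cur = 0                  # "sum" in the slide
    for x in A:
        if cur + x <= 0:
            cur = 0          # a window ending here cannot help: restart
        else:
            cur += x
            best = max(best, cur)
    return best

A = [2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]
print(max_subarray_sum(A))   # 12, achieved by the window 6 1 -2 4 3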

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort  Θ(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 10^9 random I/Os = 10^9 * 5ms ≈ 2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
  if (i < j) then
    m = (i+j)/2;            // Divide
    Merge-Sort(A,i,m);      // Conquer
    Merge-Sort(A,m+1,j);
    Merge(A,i,m,j)          // Combine

Cost of Mergesort on large data

Take Wikipedia in Italian, compute word freq:
  n = 10^9 tuples ⇒ few Gbs
  Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:
  It is an indirect sort: Θ(n log2 n) random I/Os
  [5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

[Slide figure: the log2 N levels of the mergesort recursion tree over a sample input; at each level, pairs of sorted runs are merged into longer runs.]

If the run-size is larger than B (i.e. after the first step!!),
fetching all of it in memory for merging does not help.

How do we deploy the disk/mem features ?

With internal memory M: N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort

The key is to balance run-size and #runs to merge.
Sort N items with main-memory M and disk-pages B:

  Pass 1: produce N/M sorted runs.
  Pass i: merge X ≤ M/B runs  ⇒  log_{M/B} N/M passes

[Slide figure: X input buffers (one per run) and one output buffer, each of B items, kept in main memory; runs are streamed from disk through the input buffers and the merged run is streamed back to disk through the output buffer.]

Multiway Merging

[Slide figure: each run keeps its current page in a buffer Bf_i with a cursor p_i; at every step the minimum among Bf_1[p_1], ..., Bf_X[p_X] is moved to the output buffer Bf_o. A buffer Bf_i is refilled from disk when p_i = B, and Bf_o is flushed to the merged output run when full; X = M/B.]
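An in-memory stand-in for the merging step (assumed Python; a heap plays the role of "take the min among the X buffer heads"):

import heapq

def multiway_merge(runs):
    """Merge X already-sorted runs into one sorted output."""
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        val, i, pos = heapq.heappop(heap)   # min among the current run heads
        out.append(val)
        if pos + 1 < len(runs[i]):          # "fetch" the next item of run i
            heapq.heappush(heap, (runs[i][pos + 1], i, pos + 1))
    return out

runs = [[1, 5, 13, 19], [7, 9], [4, 15], [3, 8, 12, 17], [6, 11]]
print(multiway_merge(runs))   # [1, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 15, 17, 19]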

Cost of Multi-way Merge-Sort

Number of passes = log_{M/B} #runs = log_{M/B} N/M
Optimal cost = Θ((N/B) log_{M/B} N/M) I/Os

In practice
  M/B ≈ 1000  ⇒  #passes = log_{M/B} N/M ≈ 1
  One multiway merge ⇒ 2 passes = few mins   (tuning depends on disk features)

• Large fan-out (M/B) decreases #passes
• Compression would decrease the cost of a pass!

Can compression help?

Goal: enlarge M and reduce N

  #passes = O(log_{M/B} N/M)
  Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:

  Disk Striping: sorting easily on D disks
  Distribution sort: top-down sorting
  Lower Bounds: how low we can go

Toy problem #3: Top-freq elements

Goal: top queries over a stream of N items (S large).
Math Problem: find the item y whose frequency is > N/2,
using the smallest space (i.e. the mode, if it occurs > N/2 times).

A = b a c c c d c b a a a c c b c c c

Algorithm (majority voting)
  Use a pair of variables: X (candidate) and C (counter), with C = 0 initially.
  For each item s of the stream:
    if (X == s) then C++
    else { C--; if (C <= 0) { X = s; C = 1; } }
  Return X;

Proof
(The answer can be wrong if the most frequent item occurs ≤ N/2 times.)

If the algorithm returned X ≠ y, then every one of y’s occurrences was cancelled by
a “negative” mate (a distinct non-y item). Hence the mates are ≥ #occ(y), so
N ≥ 2 * #occ(y), contradicting #occ(y) > N/2.
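A sketch of the voting scan, plus the verification pass one adds when the majority guarantee may not hold (assumed Python):

def majority_candidate(stream):
    X, C = None, 0
    for s in stream:
        if s == X:
            C += 1
        else:
            C -= 1
            if C <= 0:
                X, C = s, 1
    return X        # guaranteed correct only if some item occurs > N/2 times

A = list("bacccdcbaaaccbccc")
y = majority_candidate(A)
print(y, A.count(y) > len(A) / 2)   # c True  (a second pass verifies the candidate)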

Toy problem #4: Indexing

Consider the following TREC collection:
  N = 6 * 10^9 chars,  size = 6Gb
  n = 10^6 documents
  TotT = 10^9 total terms (avg term length is 6 chars)
  t = 5 * 10^5 distinct terms

What kind of data structure should we build to support word-based searches ?

Solution 1: Term-Doc matrix        (n = 1 million docs, t = 500K terms)

              Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
  Antony            1                1              0          0       0        1
  Brutus            1                1              0          1       0        0
  Caesar            1                1              0          1       1        1
  Calpurnia         0                1              0          0       0        0
  Cleopatra         1                0              0          0       0        0
  mercy             1                0              1          1       1        1
  worser            1                0              1          1       1        0

  (1 if the play contains the word, 0 otherwise)

Space is 500Gb !

Solution 2: Inverted index

[Slide figure: a dictionary with the terms Brutus, Calpurnia and Caesar, each pointing to the sorted list of the documents containing it (its posting list).]

We can still do better: 30÷50% of the original text

1. Typically about 12 bytes are used per posting
2. We have 10^9 total terms  ⇒  at least 12Gb space
3. Compressing the 6Gb of documents gets 1.5Gb of data
Better index, but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM into fewer bits ?
NO, they are 2^n but we have fewer compressed messages:

  ∑_{i=1}^{n-1} 2^i = 2^n - 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:

  i(s) = log2 (1/p(s)) = - log2 p(s)

Lower probability  ⇒  higher information

Entropy is the weighted average of i(s):

  H(S) = ∑_{s∈S} p(s) * log2 (1/p(s))   bits
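As a tiny aid, a sketch (assumed Python) that computes H(S) from a probability table:

import math

def entropy(probs):
    """H(S) = sum of p(s) * log2(1/p(s)), in bits per symbol."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

print(entropy([0.1, 0.2, 0.2, 0.5]))   # ≈ 1.76 bits (the Huffman running example below)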

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into its codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
[Slide figure: the binary trie of the code, left edges labelled 0 and right edges 1; a hangs off the root’s 0-edge, and b = 100, c = 101, d = 11 are the other leaves.]

Average Length
For a code C with codeword length L[s], the
average length is defined as

  La(C) = ∑_{s∈S} p(s) * L[s]

We say that a prefix code C is optimal if for
all prefix codes C’, La(C) ≤ La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, there exists a prefix
code with the same codeword lengths and thus the
same optimal average length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}
then pi < pj  ⇒  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

  H(S) ≤ La(C)

Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

  La(C) ≤ H(S) + 1

(the Shannon code takes ⌈log2 1/p⌉ bits per symbol)

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

[Slide figure: the Huffman tree built bottom-up - a(.1) and b(.2) merge into a node of weight .3, which merges with c(.2) into .5, which merges with d(.5) into the root (1); 0 labels one edge of each pair and 1 the other.]

a = 000, b = 001, c = 01, d = 1
There are 2^(n-1) “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
  abc...  →  000 001 01 ...  =  00000101...
  101001...  →  d c b ...

[Slide figure: the same Huffman tree as above, used for encoding (root-to-leaf paths) and decoding (follow bits from the root, emit the leaf symbol, restart).]
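A compact sketch (assumed Python, heap-based, not the slides' construction) that builds a Huffman code for the running example:

import heapq
from itertools import count

def huffman_codes(probs):
    # probs: dict symbol -> probability; returns dict symbol -> codeword (a bit string)
    tick = count()                                   # tie-breaker: trees are never compared
    heap = [(p, next(tick), s) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, t1 = heapq.heappop(heap)              # the two least-probable trees...
        p2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(tick), (t1, t2)))   # ...are merged
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"
    walk(heap[0][2], "")
    return codes

print(huffman_codes({"a": .1, "b": .2, "c": .2, "d": .5}))
# {'d': '0', 'c': '10', 'a': '110', 'b': '111'}: equivalent, up to ties, to the
# slide's a=000, b=001, c=01, d=1 (same optimal average length of 1.8 bits/symbol)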

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store, for any level L:
  firstcode[L]   (at the deepest level it is 00.....0)
  Symbol[L,i], for each i in level L

This is ≤ h^2 + |S| log |S| bits

Canonical Huffman: Encoding

[Slide figure: the canonical tree with levels 1-5 and the symbols assigned left-to-right on each level.]

Canonical Huffman: Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
  - log2(.999) ≈ .00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
  1 extra bit per macro-symbol = 1/k extra bits per symbol
  Larger model to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:
  The model takes |S|^k * (k * log |S|) + h^2 bits   (where h might be |S|)
  It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
[Slide figure: a word-based Huffman tree over the words of T = “bzip or not bzip” (bzip, or, not, space); each codeword is a sequence of bytes, and the first bit of the first byte is tagged so that codeword beginnings can be recognized inside C(T). The compressed text C(T) is shown as the concatenation of the byte-aligned codewords.]

CGrep and other ideas...
P= bzip = 1a 0b

[Slide figure: compressed pattern matching (CGrep) - the pattern P = bzip is encoded with the same word-based code and then searched directly, byte-aligned, inside C(T) for T = “bzip or not bzip”; the tag bit rules out false matches inside codewords.]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet (i.e., T ∈ {0,1}^n).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

  H(s) = ∑_{i=1}^{m} 2^{m-i} * s[i]

P = 0101
H(P) = 2^3*0 + 2^2*1 + 2^1*0 + 2^0*1 = 5
s = s’ if and only if H(s) = H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

  H(T_r) = 2 * H(T_{r-1}) - 2^m * T[r-1] + T[r+m-1]

T = 10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H(T1) = H(1011) = 11
H(T2) = H(0110) = 2*11 - 2^4*1 + 0 = 22 - 16 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P = 101111,  q = 7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally (Horner’s rule, reducing mod q at each step):
  1
  1*2 + 0 = 2              (mod 7)
  2*2 + 1 = 5              (mod 7)
  5*2 + 1 = 11 = 4         (mod 7)
  4*2 + 1 = 9  = 2         (mod 7)
  2*2 + 1 = 5  = Hq(P)     (mod 7)

We can still compute Hq(Tr) from Hq(Tr-1):
  2^m (mod q) = 2 * (2^{m-1} (mod q)) (mod q)

Intermediate values are also small!  (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
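A sketch of the whole scan (assumed Python; here q is a fixed prime rather than a randomly chosen one, and every hash match is verified, as in the deterministic variant):

def karp_rabin(T, P, q=2_147_483_647):           # q: an assumed prime modulus
    n, m = len(T), len(P)
    if m > n:
        return []
    top = pow(2, m - 1, q)                       # 2^(m-1) mod q, to drop the old bit
    hp = ht = 0
    for i in range(m):                           # Horner: h = 2*h + bit (mod q)
        hp = (2 * hp + int(P[i])) % q
        ht = (2 * ht + int(T[i])) % q
    occ = []
    for r in range(n - m + 1):
        if hp == ht and T[r:r + m] == P:         # verify, ruling out false matches
            occ.append(r)
        if r + m < n:                            # slide the window by one position
            ht = (2 * (ht - top * int(T[r])) + int(T[r + m])) % q
    return occ

print(karp_rabin("10110101", "0101"))            # [4]: 0-based position 4 (1-based 5, as above)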

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

[Slide figure: the 3×10 matrix M for T = california and P = for; the only 1-entries are M(1,5), M(2,6), M(3,7), tracking the matches of f, fo, for ending at positions 5, 6, 7 of T.]

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit bit-parallelism to
compute the j-th column of M from the (j-1)-th one.
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
  U(a) = (1,0,1,1,0)ᵀ   U(b) = (0,1,0,0,0)ᵀ   U(c) = (0,0,0,0,1)ᵀ

How to construct M

Initialize column 0 of M to all zeros.
For j > 0, the j-th column is obtained as

  M(j) = BitShift( M(j-1) ) & U( T[j] )

For i > 1, entry M(i,j) = 1 iff
  (1) the first i-1 characters of P match the i-1 characters of T ending at position j-1
      ⇔ M(i-1,j-1) = 1
  (2) P[i] = T[j]  ⇔  the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) into the i-th position;
AND-ing it with the i-th bit of U(T[j]) establishes whether both conditions hold.

An example (j = 1, 2, 3, ..., 9)

[Slide figures: the columns of M computed one at a time for P = abaac and T = xabxabaaca, using M(j) = BitShift(M(j-1)) & U(T[j]); at j = 9 the 5th bit of the column becomes 1, signalling an occurrence of P ending at position 9 of T.]

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.
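A bit-parallel sketch of the scan (assumed Python; its unbounded integers stand in for the m-bit machine word, with bit i-1 playing the role of row i):

def shift_and(T, P):
    m = len(P)
    U = {}                                   # U[c]: bit i-1 set iff P[i] = c
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M, occ = 0, []
    for j, c in enumerate(T):
        M = ((M << 1) | 1) & U.get(c, 0)     # BitShift(M) & U(T[j])
        if M & (1 << (m - 1)):               # last bit set: P ends at position j
            occ.append(j - m + 1)
    return occ

print(shift_and("xabxabaaca", "abaac"))      # [4]: occurrence starting at 0-based 4 (1-based 5)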

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
  U(a) = (1,0,1,1,0)ᵀ   U(b) = (1,1,0,0,0)ᵀ   U(c) = (0,0,0,0,1)ᵀ

What about ‘?’, ‘[^…]’ (not) ?

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of length m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff the first i characters of P match the i characters
of T ending at position j with no more than l mismatches.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

[Slide figure: the first i-1 characters of P aligned, with ≤ l mismatches, against the substring of T ending at j-1; then P[i] = T[j].]

  BitShift( M^l(j-1) ) & U( T[j] )

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

[Slide figure: the first i-1 characters of P aligned, with ≤ l-1 mismatches, against the substring of T ending at j-1; position i of P is charged one extra mismatch.]

  BitShift( M^{l-1}(j-1) )

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1

[Slide figure: the matrices M0 and M1 for P = abaad and T = xabxabaaca; row 5 of M1 has a 1 in column 9, i.e. P occurs ending at position 9 of T with at most 1 mismatch.]

How much do we pay?

The running time is O( k n (1 + m/w) ).
Again, the method is practically efficient for small m.
Only O(k) columns of M are needed at any given time, hence the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding

  γ(x) = 0 0 ... 0 (Length-1 zeros)  followed by  x in binary

  x > 0 and Length = ⌊log2 x⌋ + 1
  e.g., 9 is represented as <000, 1001>.

  The γ-code for x takes 2⌊log2 x⌋ + 1 bits
  (i.e. a factor of 2 from optimal)

  It is optimal for Pr(x) = 1/(2x^2), and i.i.d. integers

It is a prefix-free encoding…

Given the following sequence of γ-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

  →  8, 6, 3, 59, 7
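A small sketch (assumed Python) of γ-encoding/decoding; it also checks the exercise above:

def gamma_encode(x):
    b = bin(x)[2:]                      # x > 0, written in binary
    return "0" * (len(b) - 1) + b       # Length-1 zeros, then x in binary

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":           # count the leading zeros = Length-1
            z += 1
            i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out

print(gamma_encode(9))                                      # 0001001
print(gamma_decode("0001000001100110000011101100111"))      # [8, 6, 3, 59, 7]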

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).

Recall that: |γ(i)| ≤ 2 * log2 i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
  1 ≥ ∑_{i=1,...,x} pi ≥ x * px  ⇒  x ≤ 1/px

How good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2 * log2 i + 1
The cost of the encoding is (recall i ≤ 1/pi):

  ∑_{i=1,...,|S|} pi * |γ(i)|  ≤  ∑_{i=1,...,|S|} pi * [ 2 * log2 (1/pi) + 1 ]  =  2 * H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2

A new concept: Continuers vs Stoppers
  Previously we used: s = c = 128

The main idea is:
  s + c = 256 (we are playing with 8 bits)
  Thus s items are encoded with 1 byte
  And s*c with 2 bytes, s*c^2 with 3 bytes, ...

An example




  5000 distinct words
  ETDC encodes 128 + 128^2 = 16512 words with ≤ 2 bytes
  A (230,26)-dense code encodes 230 + 230*26 = 6210 words with ≤ 2 bytes,
  hence more words get 1 byte, and thus it wins if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded

  Start with the list of symbols L = [a,b,c,d,…]
  For each input symbol s
    1) output the position of s in L
    2) move s to the front of L

There is a memory: it exploits temporal locality, and it is dynamic.
  X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n^2 log n) bits, MTF = O(n log n) + n^2 bits

Not much worse than Huffman ... but it may be far better.
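A direct sketch (assumed Python) of the MTF transform and its inverse:

def mtf_encode(s, alphabet):
    L = list(alphabet)
    out = []
    for c in s:
        i = L.index(c)             # 1) position of c in the list (0-based here)
        out.append(i)
        L.insert(0, L.pop(i))      # 2) move c to the front
    return out

def mtf_decode(codes, alphabet):
    L = list(alphabet)
    out = []
    for i in codes:
        c = L[i]
        out.append(c)
        L.insert(0, L.pop(i))
    return "".join(out)

codes = mtf_encode("aaabbbccc", "abc")
print(codes)                       # [0, 0, 0, 1, 0, 0, 2, 0, 0]: runs become runs of 0
print(mtf_decode(codes, "abc"))    # aaabbbccc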

MTF: how good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2 * log2 i + 1
Put S in front and consider the cost of encoding:

  O(|S| log |S|) + ∑_{x=1}^{|S|} ∑_{i=2}^{n_x} |γ( p_i^x - p_{i-1}^x )|

By Jensen’s inequality:

  ≤ O(|S| log |S|) + ∑_{x=1}^{|S|} n_x * [ 2 * log2 (N / n_x) + 1 ]
  = O(|S| log |S|) + N * [ 2 * H0(X) + 1 ]

  ⇒  La[mtf] ≤ 2 * H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
  abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings ⇒ just the run lengths and one starting bit
Properties: there is a memory; it exploits spatial locality, and it is a dynamic code
  X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = Θ(n^2 log n) > Rle(X) = Θ(n log n)
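A tiny sketch (assumed Python) of the run-length transform used above:

from itertools import groupby

def rle(s):
    return [(c, sum(1 for _ in g)) for c, g in groupby(s)]

print(rle("abbbaacccca"))   # [('a', 1), ('b', 3), ('a', 2), ('c', 4), ('a', 1)]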

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive), e.g.:

  a = .2  →  [0.0, 0.2)
  b = .5  →  [0.2, 0.7)
  c = .3  →  [0.7, 1.0)

  f(i) = ∑_{j=1}^{i-1} p(j)      i.e.  f(a) = .0, f(b) = .2, f(c) = .7

The interval for a particular symbol will be called
the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac

  start:         [0.0, 1.0)
  after b (.5):  [0.2, 0.7)
  after a (.2):  [0.2, 0.3)
  after c (.3):  [0.27, 0.3)

The final sequence interval is [.27, .3)

Arithmetic Coding
To code a sequence of symbols c_1 c_2 ... c_n with probabilities p[c], use:

  l_0 = 0        l_i = l_{i-1} + s_{i-1} * f[c_i]
  s_0 = 1        s_i = s_{i-1} * p[c_i]

f[c] is the cumulative probability up to symbol c (not included).
The final interval size is

  s_n = ∏_{i=1}^{n} p[c_i]

The interval for a message sequence will be called the sequence interval.
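A sketch of the interval computation (assumed Python, with real numbers for clarity; a real coder uses the integer renormalization described later):

p = {"a": 0.2, "b": 0.5, "c": 0.3}
f = {"a": 0.0, "b": 0.2, "c": 0.7}          # cumulative probability, symbol excluded

def sequence_interval(msg):
    l, s = 0.0, 1.0
    for c in msg:
        l, s = l + s * f[c], s * p[c]       # l_i and s_i as above
    return l, l + s

def decode(x, n):
    """Recover an n-symbol message from any number x inside its sequence interval."""
    out = []
    for _ in range(n):
        c = max((sym for sym in p if f[sym] <= x), key=lambda sym: f[sym])
        out.append(c)
        x = (x - f[c]) / p[c]               # rescale and continue inside c's symbol interval
    return "".join(out)

print(sequence_interval("bac"))             # ≈ (0.27, 0.3): the sequence interval [.27, .3)
print(decode(0.49, 3))                      # bbc, as in the decoding example below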

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:

  .49 ∈ [0.2, 0.7)   → b,   rescale: (.49 - .2)/.5 = .58
  .58 ∈ [0.2, 0.7)   → b,   rescale: (.58 - .2)/.5 = .76
  .76 ∈ [0.7, 1.0)   → c

The message is bbc.

Representing a real number
Binary fractional representation:
  .75   = .11
  1/3   = .010101...
  11/16 = .1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
          min       max       interval
  .11     .110...   .111...   [.75, 1.0)
  .101    .1010...  .1011...  [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

  1 + ⌈log2 (1/s)⌉
  = 1 + ⌈log2 ∏_{i=1,n} (1/p_i)⌉
  ≤ 2 + ∑_{i=1,n} log2 (1/p_i)
  = 2 + ∑_{k=1,|S|} n*p_k * log2 (1/p_k)
  = 2 + n H0   bits

(in practice nH0 + 0.02 n bits, because of rounding)

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts          (String = ACCBACCACBAB,  k = 2)

[Slide table: the counts kept for the empty context (A = 4, B = 2, C = 5, $ = 3), for the order-1 contexts A, B, C, and for the order-2 contexts AC, BA, CA, CB, CC, each with its escape count $.]

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
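A sketch of the decoder (assumed Python), including the left-to-right copy that handles the overlapping case:

def lz77_decode(triples):
    out = []
    for d, length, c in triples:
        start = len(out) - d
        for i in range(length):            # left-to-right copy handles length > d
            out.append(out[start + i])
        out.append(c)
    return "".join(out)

# the window example above:
print(lz77_decode([(0,0,"a"), (1,1,"c"), (3,4,"b"), (3,3,"a"), (1,2,"c")]))
# aacaacabcabaaac
# the overlapping case: seen = abcd, then (2,9,e)
print(lz77_decode([(0,0,"a"), (0,0,"b"), (0,0,"c"), (0,0,"d"), (2,9,"e")]))
# abcdcdcdcdcdce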

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
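A self-contained sketch (assumed Python, quadratic and for illustration only) of the BWT and of the backward reconstruction via the LF-mapping:

def bwt(T):                                   # T must end with a unique smallest char '#'
    rotations = sorted(T[i:] + T[:i] for i in range(len(T)))
    return "".join(row[-1] for row in rotations)

def inverse_bwt(L):
    # LF[i]: row of the sorted-rotations matrix holding, as first char, the occurrence L[i]
    order = sorted(range(len(L)), key=lambda i: (L[i], i))   # stable sort of L gives F
    LF = [0] * len(L)
    for f_pos, l_pos in enumerate(order):
        LF[l_pos] = f_pos
    r, out = L.index("#"), []                 # the row equal to T itself (it ends with '#')
    for _ in range(len(L)):
        out.append(L[r])                      # L[r] precedes F[r] in T: walk T backwards
        r = LF[r]
    return "".join(reversed(out))

L = bwt("mississippi#")
print(L)                   # ipssm#pissii
print(inverse_bwt(L))      # mississippi#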

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n^2 log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pr that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
  Pr[ in-degree(u) = k ]  ∝  1 / k^a,   a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pr that a node has x links is 1/x^a, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77-scheme provides an efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
            Emacs size   Emacs time
  uncompr   27Mb         ---
  gzip      8Mb          35 secs
  zdelta    1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Slide figure: a small weighted graph over the files plus the dummy node 0; the min branching picks, for each file, the cheapest reference (the dummy edge means "gzip the file alone").]

            space   time
  uncompr   30Mb    ---
  tgz       20%     linear
  THIS      8%      quadratic   (improvement)

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n^2 time

            space   time
  uncompr   260Mb   ---
  tgz       12%     2 mins
  THIS      8%      16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

           gcc size   emacs size
  total    27288      27326
  gzip     7563       8577
  zdelta   227        1431
  rsync    964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Slide figure: the suffix tree of T# = mississippi#, with edge labels such as mississippi#, i, s, si, ssi, ppi#, pi#, #, and leaves storing the 12 starting positions of the suffixes.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
5

Θ(N^2) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char comparisons
⇒ overall, O(p log2 N) time
(improvable to O(p + log2 N) [Manber-Myers, ’90] and to O(p + log2 |S|) [Cole et al, ’06])
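A sketch of the indirected binary search (assumed Python, with a naive SA construction for illustration):

def build_sa(T):
    return sorted(range(len(T)), key=lambda i: T[i:])   # naive, for illustration only

def search(T, SA, P):
    """Return the SA range [lo, hi) of the suffixes having P as a prefix."""
    def lower(strict):
        lo, hi = 0, len(SA)
        while lo < hi:
            mid = (lo + hi) // 2
            s = T[SA[mid]:SA[mid] + len(P)]              # O(p) chars compared per step
            if s < P or (strict and s == P):
                lo = mid + 1
            else:
                hi = mid
        return lo
    return lower(False), lower(True)

T = "mississippi#"
SA = build_sa(T)
lo, hi = search(T, SA, "si")
print(hi - lo, sorted(SA[lo:hi]))   # 2 [3, 6]: occurrences at 0-based 3 and 6 (1-based 4 and 7)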

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does it exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does it exist a substring of length ≥ L occurring ≥ C times ?
• Search for Lcp[i,i+C-2] whose entries are ≥ L


Slide 6

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 10^4 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and the algorithm uses all of them:
(1/B) * (p * f/(1+f) * C) ≈ 30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
[figure: magnetic disk — tracks, read/write head and arm, magnetic surface]

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
[figure: the memory hierarchy — CPU registers → L1/L2 cache (few Mbs, some nanosecs, few words fetched) → RAM (few Gbs, tens of nanosecs, some words fetched) → HD (few Tbs, few millisecs, B = 32K pages) → network (many Tbs, even secs, packets)]

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock and its daily performance over time, find the time window in which it achieved the best “market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
n     |  4K    8K    16K    32K    128K    256K    512K    1M
n^3   |  22s   3m    26m    3.5h   28h     --      --      --
n^2   |  0     0     0      1s     26s     106s    7m      28m

An optimal solution
We assume every partial sum ≠ 0

[figure: A split into a prefix of sum < 0 followed by the Optimum window of sum > 0]

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum = 0; max = -1;
for i = 1, ..., n do
  if (sum + A[i] ≤ 0) then sum = 0;
  else sum += A[i]; max = MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT
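A minimal runnable sketch of this scan (Python; 0-based indexing and the function name are illustrative assumptions — it follows the slide’s rule of resetting the running sum when it would drop to ≤ 0):

```python
def max_subarray_sum(A):
    # One left-to-right pass: reset the running sum when it drops to <= 0,
    # otherwise extend the current window and remember the best sum seen.
    best, running = float("-inf"), 0
    for x in A:
        if running + x <= 0:
            running = 0          # the optimum cannot start inside a <=0 prefix
        else:
            running += x
            best = max(best, running)
    return best                  # (stays -inf if every partial sum is <= 0)

print(max_subarray_sum([2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]))  # 12
```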

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort → Θ(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 10^9 random I/Os = 10^9 * 5ms ≈ 2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02   m = (i+j)/2;           // Divide
03   Merge-Sort(A,i,m);     // Conquer
04   Merge-Sort(A,m+1,j);   // Conquer
05   Merge(A,i,m,j)         // Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n = 10^9 tuples → few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Θ(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N levels

If the run-size is larger than B (i.e. after the first step!!), fetching all of it in memory for merging does not help.

[figure: binary merge-sort recursion tree over a toy key set — the leaves are runs of M items, each sorted in internal memory, and every level above merges pairs of runs on disk; caption: “How do we deploy the disk/mem features?”]

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X = M/B runs at a time  ⇒  log_{M/B}(N/M) passes

[figure: X = M/B input buffers plus one output buffer — main memory buffers of B items; runs stream in from disk, are merged, and the output buffer is flushed back to disk]

Multiway Merging
[figure: multiway merging — input buffers Bf1 … Bfx (x = M/B), each with a cursor p_i on the current page of its run; repeatedly output min(Bf1[p1], Bf2[p2], …, Bfx[pX]) into the output buffer Bfo; fetch the next page of run i when p_i = B, flush Bfo to the merged output run when it is full, stop at EOF]
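A minimal sketch of the merging step itself (Python; runs are in-memory iterables and the page granularity B is abstracted away — an assumption; the point is the repeated extraction of the minimum head among the X runs):

```python
import heapq

def multiway_merge(runs):
    """Merge already-sorted runs into one sorted output stream."""
    iters = [iter(r) for r in runs]
    heap = []
    for i, it in enumerate(iters):
        first = next(it, None)
        if first is not None:
            heap.append((first, i))
    heapq.heapify(heap)                 # holds the current head of each run
    while heap:
        val, i = heapq.heappop(heap)    # min(Bf1[p1], ..., Bfx[pX])
        yield val
        nxt = next(iters[i], None)      # advance the cursor of run i
        if nxt is not None:
            heapq.heappush(heap, (nxt, i))

print(list(multiway_merge([[1, 5, 9], [2, 3, 8], [4, 6, 7]])))  # [1..9]
```

With X = M/B runs merged per pass, log_{M/B}(N/M) passes suffice, which is the cost bound on the next slide.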

Cost of Multi-way Merge-Sort


Number of passes = log_{M/B}(#runs) ≈ log_{M/B}(N/M)

Optimal cost = Θ( (N/B) · log_{M/B}(N/M) ) I/Os

In practice

M/B ≈ 1000  ⇒  #passes = log_{M/B}(N/M) ≈ 1
One multiway merge  ⇒  2 passes (R/W) = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

May compression help?

Goal: enlarge M and reduce N

#passes = O( log_{M/B}(N/M) )
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables <X,C>, initialized as X = first item, C = 1
For each subsequent item s of the stream,

  if (X == s) then C++
  else { C--; if (C == 0) { X = s; C = 1; } }

Return X;
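A minimal sketch of this majority-vote scan (Python; the equivalent “reset the candidate when the counter hits zero” formulation, with an optional verification pass — names are illustrative):

```python
def majority_candidate(stream):
    # One pass, O(1) space: keep a candidate X and a counter C.
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1          # adopt a new candidate
        elif X == s:
            C += 1               # the candidate gains a vote
        else:
            C -= 1               # a non-candidate cancels one vote
    return X                     # the majority item, if one exists

A = list("bacccdcbaaaccbccc")
X = majority_candidate(A)
assert A.count(X) > len(A) / 2   # optional second pass verifies the candidate
print(X)                         # c
```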

Proof
(Problems arise if the top frequency is ≤ N/2: the returned candidate may then be wrong.)

If X ≠ y at the end, then every occurrence of y was cancelled by a distinct “negative” mate, so the mates are ≥ #occ(y). As a result 2 · #occ(y) ≤ N, contradicting #occ(y) > N/2.

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 10^9 chars, size = 6Gb
n = 10^6 documents
TotT = 10^9 total terms (avg term length is 6 chars)
t = 5 * 10^5 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million documents (columns), t = 500K terms (rows); entry is 1 if the play contains the word, 0 otherwise.

            Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony             1                1             0          0        0        1
Brutus             1                1             0          1        0        0
Caesar             1                1             0          1        1        1
Calpurnia          0                1             0          0        0        0
Cleopatra          1                0             0          0        0        0
mercy              1                0             1          1        1        1
worser             1                0             1          1        1        0

Space is 500Gb !

Solution 2: Inverted index
Brutus    →  2  4  8  16  32  64  128
Calpurnia →  1  2  3  5  8  13  21  34
Caesar    →  13  16

We can still do better: i.e. 30-50% of the original text

1. Typically about 12 bytes per posting
2. We have 10^9 total terms  ⇒  at least 12Gb space
3. Compressing the 6Gb of documents gets 1.5Gb of data
Better index, but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL of them in fewer bits ?
NO: they are 2^n, but the shorter compressed messages are fewer:

∑_{i=1}^{n-1} 2^i = 2^n − 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
[figure: the binary trie of the code — a hangs off edge 0; b = 100, c = 101, d = 11]

Average Length
For a code C with codeword length L[s], the
average length is defined as
La(C) = ∑_{s∈S} p(s) · L[s]

We say that a prefix code C is optimal if for all prefix codes C’, La(C) ≤ La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal uniquely decodable code, there exists a prefix code with the same codeword lengths, and thus the same optimal average length. And vice versa…

Theorem (golden rule). If C is an optimal prefix code for a source with probabilities {p1, …, pn},
then pi < pj ⇒ L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

[figure: Huffman tree — a(.1) and b(.2) merge into (.3); (.3) and c(.2) merge into (.5); (.5) and d(.5) merge into (1); left/right edges labeled 0/1]

a=000, b=001, c=01, d=1
There are 2^{n-1} “equivalent” Huffman trees (one per choice of the 0/1 labels at the n−1 internal nodes)

What about ties (and thus, tree depth) ?
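A minimal heap-based sketch of the construction (Python; the symbol/probability pairs are those of the running example, everything else is an illustrative assumption — the codeword lengths 3,3,2,1 match the slide, the actual bits are one of the equivalent trees):

```python
import heapq
from itertools import count

def huffman_codes(probs):
    """probs: dict symbol -> probability. Returns dict symbol -> codeword string."""
    tick = count()                          # tie-breaker so the heap never compares trees
    heap = [(p, next(tick), sym) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)   # the two least-probable subtrees...
        p2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(tick), (left, right)))  # ...are merged
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes

print(huffman_codes({"a": .1, "b": .2, "c": .2, "d": .5}))
```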

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
Example:  abc…  →  000 001 01 … = 00000101…        101001…  →  d c b …

[figure: the same Huffman tree, traversed root-to-leaf while encoding and decoding]

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store, for every level L:
 firstcode[L]  (the first codeword of length L; at the deepest level it is 00……0)
 Symbol[L,i], for each i in level L

This is ≤ h² + |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. Its self-information is
− log2(.999) ≈ .00144 bits

If we were to send 1000 such symbols we might hope to use 1000 * .00144 ≈ 1.44 bits.

Using Huffman, we take at least 1 bit per symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:
 the model takes |S|^k · (k · log |S|) + h² bits (where h might be |S|)
 it holds H0(S^L) ≤ L · Hk(S) + O(k · log |S|), for each k ≤ L

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
[figure: the word-based Huffman tree with fan-out 128 over the words of T = “bzip or not bzip” (bzip, or, not, space); each codeword is a byte-aligned sequence of 7-bit symbols, with the first bit of each byte used as a tag — the figure shows the resulting compressed text C(T)]

CGrep and other ideas...
P= bzip = 1a 0b

[figure: GREP over the compressed text — the byte-aligned, tagged codewords let the encoded pattern be matched directly inside C(T), T = “bzip or not bzip”, marking each codeword yes/no]
Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

[figure: the dictionary {bzip, not, or, space} with its word-based codewords; the codeword of P = bzip (1a 0b) is searched directly inside C(S), S = “bzip or not bzip”, marking each codeword yes/no]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H(s) = ∑_{i=1}^{m} 2^{m−i} · s[i]

P = 0101
H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5
s = s’ if and only if H(s) = H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H(T_r) = 2·H(T_{r−1}) − 2^m·T(r−1) + T(r+m−1)

T = 10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H(T1) = H(1011) = 11
H(T2) = H(0110) = 2·11 − 2⁴·1 + 0 = 22 − 16 = 6 = H(0110)

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
(1·2 + 0) mod 7 = 2
(2·2 + 1) mod 7 = 5
(5·2 + 1) mod 7 = 4
(4·2 + 1) mod 7 = 2
(2·2 + 1) mod 7 = 5  =  Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1):
2^m (mod q) = 2·( 2^{m-1} (mod q) ) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
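A hedged sketch of the algorithm (Python; it uses base 256 over characters rather than the binary alphabet of the slides, and a fixed Mersenne prime instead of a random one — both are assumptions; matches are verified, so no false matches are reported):

```python
def karp_rabin(T, P, q=2**31 - 1):
    """All positions r (0-based) where P occurs in T, via rolling fingerprints mod q."""
    n, m = len(T), len(P)
    if m == 0 or m > n:
        return []
    base = 256
    top = pow(base, m - 1, q)                  # base^(m-1) mod q, to drop the leading char
    hP = hT = 0
    for i in range(m):
        hP = (hP * base + ord(P[i])) % q
        hT = (hT * base + ord(T[i])) % q
    out = []
    for r in range(n - m + 1):
        if hP == hT and T[r:r + m] == P:       # verification step: deterministic answer
            out.append(r)
        if r + m < n:                          # roll: remove T[r], append T[r+m]
            hT = ((hT - ord(T[r]) * top) * base + ord(T[r + m])) % q
    return out

print(karp_rabin("10110101", "0101"))          # [4]  (1-based position 5, as in the slides)
```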

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

[figure: the dictionary {bzip, not, or, space} with its codewords; the codeword of P = bzip (1a 0b) is scanned against C(S), S = “bzip or not bzip”, marking each codeword yes/no]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

[figure: the m×n matrix M for T = california and P = for — the only 1-entries are M(1,5), M(2,6), M(3,7): f, fo, for match ending at positions 5, 6, 7; the 1 in the last row of column 7 reports the occurrence]

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size (e.g., 32 or 64 bits). We’ll assume m ≤ w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit bit-parallelism to compute the j-th column of M from the (j−1)-th one.
We define the m-length binary vector U(x) for each character x in the alphabet: U(x) has a 1 in the positions where x appears in P.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



 Initialize column 0 of M to all zeros
 For j > 0, the j-th column is obtained by

M(j) = BitShift( M(j−1) ) & U( T[j] )

 For i > 1, entry M(i,j) = 1 iff
   (1) the first i−1 characters of P match the i−1 characters of T ending at position j−1   ⇔ M(i−1, j−1) = 1
   (2) P[i] = T[j]   ⇔ the i-th bit of U(T[j]) is 1

 BitShift moves bit M(i−1, j−1) into the i-th position
 ANDing it with the i-th bit of U(T[j]) establishes whether both conditions hold
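A minimal bit-parallel sketch (Python; a Python integer plays the role of the memory word, so the m ≤ w restriction disappears — function and variable names are illustrative):

```python
def shift_and(T, P):
    """Exact matching: bit i-1 of the integer M encodes entry M(i, j) of the column."""
    m = len(P)
    U = {}
    for i, c in enumerate(P):          # U[c] has bit i set iff P[i+1] == c
        U[c] = U.get(c, 0) | (1 << i)
    M, last, occ = 0, 1 << (m - 1), []
    for j, c in enumerate(T):
        # BitShift: shift the previous column by one and set the first bit to 1, then AND with U(T[j])
        M = ((M << 1) | 1) & U.get(c, 0)
        if M & last:
            occ.append(j - m + 1)      # occurrence ending at position j (0-based start)
    return occ

print(shift_and("california", "for"))    # [4]
print(shift_and("xabxabaaca", "abaac"))  # [4]
```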

[worked examples omitted: the columns M(1), M(2), M(3) and M(9) are computed for T = xabxabaaca and P = abaac via M(j) = BitShift(M(j−1)) & U(T[j]); at j = 9 the last bit of the column is 1, reporting the occurrence of P ending at position 9]

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: Another solution
Dictionary

P = bzip = 1a 0b

[figure: as before — the dictionary {bzip, not, or, space} and C(S), S = “bzip or not bzip”; the codeword of P is matched against C(S), marking each codeword yes/no]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

P = o

[figure: the dictionary {bzip, not, or, space}; the terms containing “o” are not = 1g 0g 0a and or = 1g 0a 0b, and each of their codewords is searched in C(S), S = “bzip or not bzip”]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

 S is the concatenation of the patterns in P
 R is a bitmap of length m:
   R[i] = 1 iff S[i] is the first symbol of a pattern

 Use a variant of the Shift-And method searching for S:
   For any symbol c, U’(c) = U(c) AND R
     U’(c)[i] = 1 iff S[i] = c and it is the first symbol of a pattern
   For any step j,
     compute M(j)
     then M(j) OR U’(T[j]). Why?
       It sets to 1 the first bit of each pattern that starts with T[j]
     Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

P = bot,  k = 2

[figure: the same dictionary {bzip, not, or, space} with its codewords, and C(S) for S = “bzip or not bzip”]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

M^l(i,j) = 1 iff there are no more than l mismatches between the first i characters of P and the i characters of T ending at position j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

[figure: the first i−1 characters of P aligned with T ending at position j−1, with ≤ l mismatches, followed by a matching pair P[i] = T[j]]

BitShift( M^l(j−1) ) & U( T[j] )

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

[figure: the first i−1 characters of P aligned with T ending at position j−1, with ≤ l−1 mismatches, followed by one extra (possibly mismatching) pair P[i], T[j]]

BitShift( M^{l−1}(j−1) )

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1
[worked example omitted: the 5×10 matrices M^0 and M^1 for T = xabxabaaca and P = abaad; M^1(5,9) = 1, i.e. P occurs ending at position 9 with at most one mismatch]
How much do we pay?





The running time is O( k · n · (1 + m/w) ).
Again, the method is practically efficient for small m.
Still, only O(k) columns of M are needed at any given time; hence the space used by the algorithm is O(k) memory words.
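A sketch extending the previous Shift-And code to k mismatches, following the recurrence above (Python; same assumptions as before):

```python
def shift_and_k_mismatches(T, P, k):
    """Bit-parallel k-mismatch search: M[l] holds the current column of matrix M^l."""
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M, last, occ = [0] * (k + 1), 1 << (m - 1), []
    for j, c in enumerate(T):
        prev = M[:]                    # columns j-1, for every l
        M[0] = ((prev[0] << 1) | 1) & U.get(c, 0)
        for l in range(1, k + 1):
            # case 1: extend a <= l alignment with a matching char
            # case 2: extend a <= l-1 alignment with one mismatch
            M[l] = (((prev[l] << 1) | 1) & U.get(c, 0)) | ((prev[l - 1] << 1) | 1)
        if M[k] & last:
            occ.append(j - m + 1)      # occurrence ending at j with <= k mismatches
    return occ

print(shift_and_k_mismatches("xabxabaaca", "abaad", 1))  # [4]
```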

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

P = bot,  k = 2

[figure: the dictionary term matching bot within 2 mismatches is not = 1g 0g 0a; its codeword is then searched in C(S), S = “bzip or not bzip”]

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol of p into a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding

γ(x) = (Length − 1) zeroes, followed by x in binary,   where x > 0 and Length = ⌊log2 x⌋ + 1
e.g., 9 is represented as <000, 1001>.

γ-code for x takes 2⌊log2 x⌋ + 1 bits   (i.e. a factor of 2 from optimal)

 Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…

 Given the following sequence of γ-coded integers, reconstruct the original sequence:

0001000001100110000011101100111

8   6   3   59   7
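A minimal sketch of γ-encoding/decoding (Python; the code is handled as a string of ‘0’/‘1’ characters for clarity — a real coder would pack bits; names are illustrative):

```python
def gamma_encode(x):
    """gamma(x): (Length-1) zeroes followed by x in binary, x > 0."""
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":          # count the leading zeroes = Length - 1
            z += 1; i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out

code = "".join(gamma_encode(x) for x in [8, 6, 3, 59, 7])
print(code)                  # 0001000001100110000011101100111
print(gamma_decode(code))    # [8, 6, 3, 59, 7]
```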

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).

Recall that: |γ(i)| ≤ 2·log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2·H0(S) + 1
Key fact:
1 ≥ ∑_{j=1,...,i} pj ≥ i · pi   ⇒   i ≤ 1/pi

How good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1
The cost of the encoding is (recall i ≤ 1/pi):

∑_{i=1,...,|S|} pi · |γ(i)|  ≤  ∑_{i=1,...,|S|} pi · [ 2·log(1/pi) + 1 ]  =  2·H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + …

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n² log n) bits, MTF = O(n log n) + n² bits

Not much worse than Huffman
...but it may be far better
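A minimal sketch of MTF encoding/decoding with a plain list (Python; the list update is O(|S|) per symbol — the search-tree implementation of a later slide brings this down to O(log |S|); names are illustrative):

```python
def mtf_encode(text, alphabet):
    L = list(alphabet)                 # current symbol list, front = most recently seen
    out = []
    for s in text:
        i = L.index(s)                 # position of s in L (0-based here)
        out.append(i)
        L.pop(i); L.insert(0, s)       # move s to the front
    return out

def mtf_decode(codes, alphabet):
    L, out = list(alphabet), []
    for i in codes:
        s = L[i]
        out.append(s)
        L.pop(i); L.insert(0, s)
    return "".join(out)

ab = "abcdefghijklmnopqrstuvwxyz"
codes = mtf_encode("mississippi", ab)
print(codes)                           # repeated symbols produce small integers
print(mtf_decode(codes, ab))           # mississippi
```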

MTF: how good is it ?
Encode the integers via γ-coding:  |γ(i)| ≤ 2·log i + 1
Put S at the front of the list and consider the cost of encoding:

O(|S| log |S|)  +  ∑_{x=1}^{|S|} ∑_{i≥2} | γ( p^x_i − p^x_{i−1} ) |

where p^x_1 < p^x_2 < … are the positions of the n_x occurrences of symbol x.

By Jensen’s inequality:

≤  O(|S| log |S|)  +  ∑_{x=1}^{|S|} n_x · [ 2·log(N/n_x) + 1 ]
=  O(|S| log |S|)  +  N · [ 2·H0(X) + 1 ]

⇒   La[mtf] ≤ 2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)

There is a memory

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.  p(a) = .2, p(b) = .5, p(c) = .3

f(i) = ∑_{j=1}^{i−1} p(j)        f(a) = .0, f(b) = .2, f(c) = .7

[figure: the unit interval [0,1) split into a = [.0,.2), b = [.2,.7), c = [.7,1.0)]

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
[figure: coding bac — start with [0,1); b narrows it to [.2,.7); a narrows it to [.2,.3); c narrows it to [.27,.3)]

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l_0 = 0      l_i = l_{i−1} + s_{i−1} · f[c_i]
s_0 = 1      s_i = s_{i−1} · p[c_i]

f[c] is the cumulative probability up to symbol c (not included)

Final interval size:   s_n = ∏_{i=1}^{n} p[c_i]

The interval for a message sequence will be called the
sequence interval
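A minimal sketch computing the sequence interval with exact fractions (Python; the probabilities are those of the running example, and the function name is illustrative):

```python
from fractions import Fraction as F

p = {"a": F(2, 10), "b": F(5, 10), "c": F(3, 10)}   # symbol probabilities
f = {"a": F(0), "b": F(2, 10), "c": F(7, 10)}       # cumulative prob up to the symbol

def sequence_interval(msg):
    l, s = F(0), F(1)                  # l_0 = 0, s_0 = 1
    for c in msg:
        l = l + s * f[c]               # l_i = l_{i-1} + s_{i-1} * f[c_i]
        s = s * p[c]                   # s_i = s_{i-1} * p[c_i]
    return l, s

l, s = sequence_interval("bac")
print(float(l), float(l + s))          # 0.27 0.3  ->  the interval [.27, .3)
```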

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
[figure: decoding — .49 ∈ [.2,.7) ⇒ b; within that interval, .49 ∈ [.3,.55) ⇒ b; within that, .49 ∈ [.475,.55) ⇒ c]

The message is bbc.

Representing a real number
Binary fractional representation:
.75 = .11        1/3 = .01 01 01 …        11/16 = .1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
           min       max       interval
.11        .110…     .111…     [.75, 1.0)
.101       .1010…    .1011…    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
String = ACCBACCACBA B,   k = 2

Context (order 0)    Counts
(empty)              A = 4,  B = 2,  C = 5,  $ = 3

Context (order 1)    Counts
A                    C = 3,  $ = 1
B                    A = 2,  $ = 1
C                    A = 1,  B = 2,  C = 2,  $ = 3

Context (order 2)    Counts
AC                   B = 1,  C = 2,  $ = 2
BA                   C = 1,  $ = 1
CA                   C = 1,  $ = 1
CB                   A = 2,  $ = 1
CC                   A = 1,  B = 1,  $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
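A minimal sketch of the LZW encoder (Python; the dictionary is initialized with the characters actually present rather than the 256 ASCII entries, and ids start at 0 — both assumptions; the decoder, not shown, rebuilds the same dictionary one step behind and handles the SSc case specially):

```python
def lzw_encode(text):
    dict_ = {c: i for i, c in enumerate(sorted(set(text)))}   # single-char entries
    out, S = [], ""
    for c in text:
        if S + c in dict_:
            S += c                         # extend the current match
        else:
            out.append(dict_[S])           # emit the id of the longest match S
            dict_[S + c] = len(dict_)      # add Sc to the dictionary
            S = c
    out.append(dict_[S])
    return out

# Parses the slide's string as a, a, b, aa, c, ab, aba, c, b
print(lzw_encode("aabaacababacb"))         # [0, 0, 1, 3, 2, 4, 8, 2, 1]
```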

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform (1994)
Let us be given a text T = mississippi#

mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows:

F                 L
#  mississipp     i
i  #mississip     p
i  ppi#missis     s
i  ssippi#mis     s
i  ssissippi#     m
m  ississippi     #
p  i#mississi     p
p  pi#mississ     i
s  ippi#missi     s
s  issippi#mi     s
s  sippi#miss     i
s  sissippi#m     i

A famous example

Much
longer...

A useful tool: the L → F mapping

[the same sorted-rotations matrix as above: F and L are known, the middle is unknown]

How do we map L’s chars onto F’s chars ?
... we need to distinguish equal chars in F ...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
[the same matrix again: F (the sorted chars of T) and L (the BWT) are known, the middle is unknown]

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
  Compute LF[0,n-1];
  T[n] = #; r = 0; i = n-1;     // row 0 of the sorted matrix is the one starting with #
  while (i > 0) {
    T[i] = L[r];
    r = LF[r]; i--;
  }

How to compute the BWT ?
SA      BWT matrix      L
12      #mississipp     i
11      i#mississip     p
 8      ippi#missis     s
 5      issippi#mis     s
 2      ississippi#     m
 1      mississippi     #
10      pi#mississi     p
 9      ppi#mississ     i
 7      sippi#missi     s
 4      sissippi#mi     s
 6      ssippi#miss     i
 3      ssissippi#m     i

We said that: L[i] precedes F[i] in T

L[3] = T[7]
Given SA and T, we have L[i] = T[ SA[i] − 1 ]
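A minimal sketch tying the two facts together (Python; the suffix-array construction is the “elegant but inefficient” sorting of suffixes from the next slide, used here only to keep the example self-contained):

```python
def suffix_array(T):
    # Plain sorting of suffixes: simple, but Theta(n^2 log n) in the worst case.
    return sorted(range(len(T)), key=lambda i: T[i:])

def bwt_from_sa(T, SA):
    # L[i] = T[SA[i] - 1]; for SA[i] = 0 Python's T[-1] wraps to the final '#'.
    return "".join(T[s - 1] for s in SA)

T = "mississippi#"
SA = suffix_array(T)
print([s + 1 for s in SA])   # 1-based: [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(bwt_from_sa(T, SA))    # ipssm#pissii
```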

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pr that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution

Altavista crawl, 1999 — WebBase crawl, 2001
Indegree follows a power-law distribution:

Pr[ in-degree(u) = k ]  ∝  1 / k^α ,    α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pr that a node has x links is 1/x^α, α ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77-scheme provides an efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
            Emacs size    Emacs time
uncompr     27Mb          ---
gzip        8Mb           35 secs
zdelta      1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[figure: example weighted graph GF with a dummy node; edge weights (e.g. 20, 123, 220, 620, 2000) are zdelta sizes, and the min branching picks, for every file, its cheapest reference]

            space    time
uncompr     30Mb     ---
tgz         20%      linear
THIS        8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strictly n² time.

            space    time
uncompr     260Mb    ---
tgz         12%      2 mins
THIS        8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

            gcc size    emacs size
total       27288       27326
gzip        7563        8577
zdelta      227         1431
rsync       964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[figure: the suffix tree of T# = mississippi# — internal edges labeled #, i, ssi, si, p, pi#, ppi#, mississippi#, …; the 12 leaves store the starting positions of the suffixes]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
5

Θ(N²) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
⇒ In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
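A minimal sketch of the indirected binary search (Python; each comparison inspects at most p characters of a suffix, hence O(p log2 N) overall — names and the slicing-based comparison are illustrative):

```python
def sa_search(T, SA, P):
    """Return the SA range [lo, hi) of suffixes having P as a prefix."""
    n, m = len(SA), len(P)
    lo, hi = 0, n
    while lo < hi:                            # leftmost suffix >= P
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + m] < P:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    lo, hi = start, n
    while lo < hi:                            # leftmost suffix not starting with P
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + m] <= P:
            lo = mid + 1
        else:
            hi = mid
    return start, lo                          # occurrences: T[SA[i]] for i in [start, lo)

T = "mississippi#"
SA = sorted(range(len(T)), key=lambda i: T[i:])
s, e = sa_search(T, SA, "si")
print([SA[i] + 1 for i in range(s, e)])       # [7, 4]  (1-based starting positions)
```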

Locating the occurrences
occ = 2

T = mississippi#

[figure: binary-searching for the range of suffixes prefixed by si — delimited by si# and si$, with # < Σ < $ — returns the two consecutive SA entries pointing to positions 7 (sippi…) and 4 (sissippi…)]

Suffix Array search
• O(p + log2 N + occ) time
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Is there a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Is there a substring of length ≥ L occurring ≥ C times ?
• Search for a window Lcp[i, i+C-2] whose entries are all ≥ L


Slide 7

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
Running times of the brute-force solutions for increasing input size n:

  n        4K    8K    16K   32K    128K   256K   512K   1M
  n^3      22s   3m    26m   3.5h   28h    --     --     --
  n^2      0     0     0     1s     26s    106s   7m     28m

An optimal solution
We assume every subsum ≠ 0.

[Figure: A is split into a part summing < 0 followed by a part summing > 0 that contains the Optimum.]

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm
  sum = 0; max = -1;
  for i = 1,...,n do
    if (sum + A[i] ≤ 0) then sum = 0;
    else sum += A[i]; max = MAX{max, sum};

Note:
• Sum < 0 right before OPT starts;
• Sum > 0 within OPT
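A runnable Python sketch of the scan above (the array values are the slide's example; names are illustrative):

def max_subarray_sum(A):
    # one left-to-right scan: reset the running sum when it drops to <= 0,
    # keep the best sum seen so far (assumes, as above, every subsum != 0)
    best, run = 0, 0
    for x in A:
        if run + x <= 0:
            run = 0
        else:
            run += x
            best = max(best, run)
    return best

A = [2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]
print(max_subarray_sum(A))   # 12, from the subarray 6 1 -2 4 3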

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 10^9 random I/Os = 10^9 * 5ms ≈ 2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
  if (i < j) then
    m = (i+j)/2;           // Divide
    Merge-Sort(A,i,m);     // Conquer
    Merge-Sort(A,m+1,j);
    Merge(A,i,m,j)         // Combine

Cost of Mergesort on large data




Take Wikipedia in Italian and compute the word frequencies:
  n = 10^9 tuples ⇒ a few GBs
  Typical disk (Seagate Cheetah 150GB): seek time ≈ 5ms

Analysis of mergesort on disk:
  It is an indirect sort: Θ(n log2 n) random I/Os
  ⇒ [5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
[Figure: the merge-sort recursion tree on the N items — runs of sorted numbers are merged pairwise, level by level. How do we deploy the disk/memory features?]
N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: produce (N/M) sorted runs of size M each.
Pass i > 1: merge X = M/B runs at a time ⇒ log_{M/B}(N/M) merging passes
[Figure: X input buffers (INPUT 1 ... INPUT X, one per run) and one OUTPUT buffer, each of B items, kept in main memory; runs are read from disk and the merged run is written back to disk.]

Multiway Merging
[Figure: each run feeds a buffer Bf1,...,Bfx (X = M/B); at every step the minimum among Bf1[p1], Bf2[p2], ..., Bfx[pX] is moved to the output buffer Bfo. A buffer Bfi is refilled from its run (Fetch) when pi = B, and Bfo is flushed to the merged output run when it is full, until EOF.]
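A tiny in-memory illustration of the X-way merge with a heap (on disk each run would of course be read page by page; the runs here are illustrative):

import heapq

def multiway_merge(runs):
    # pop the minimum among the heads of the X runs, then push the next
    # element of the run it came from (mirrors min(Bf1[p1],...,Bfx[pX]))
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        val, i, j = heapq.heappop(heap)
        out.append(val)
        if j + 1 < len(runs[i]):
            heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
    return out

print(multiway_merge([[1, 5, 13], [2, 7, 9], [3, 4, 8]]))   # [1, 2, 3, 4, 5, 7, 8, 9, 13]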

Cost of Multi-way Merge-Sort


Number of passes = log_{M/B} #runs ≈ log_{M/B}(N/M)

Optimal cost = Θ((N/B) log_{M/B}(N/M)) I/Os

In practice
  M/B ≈ 1000 ⇒ #passes = log_{M/B}(N/M) ≈ 1
  One multiway merge ⇒ 2 passes = a few minutes
  (tuning depends on the disk features)

⇒ A large fan-out (M/B) decreases the number of passes
⇒ Compression would decrease the cost of a pass!

Can compression help?

Goal: enlarge M and reduce N
  #passes = O(log_{M/B}(N/M))
  Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: top queries over a stream of N items (alphabet S large).
Math problem: find the item y whose frequency is > N/2, using the smallest possible space
(i.e. the mode, when it occurs > N/2 times).

A=b a c c c d c b a a a c c b c c c
.

Algorithm
  Use a pair of variables <X, C>
  For each item s of the stream:
    if (X == s) then C++
    else { C--; if (C == 0) { X = s; C = 1; } }
  Return X;

Proof

The answer can be wrong only if y occurs ≤ N/2 times.
Indeed, if X ≠ y at the end, then every one of y's occurrences has a distinct "negative" mate (an occurrence that cancelled it); hence the mates are ≥ #occ(y), so 2 * #occ(y) ≤ N — contradicting #occ(y) > N/2.
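A minimal Python sketch of the two-variable scan (the stream is a list here, just for illustration):

def majority_candidate(stream):
    # X = current candidate, C = its counter; X is the majority element
    # whenever one item occurs more than N/2 times
    X, C = None, 0
    for s in stream:
        if s == X:
            C += 1
        else:
            C -= 1
            if C <= 0:          # adopt a new candidate
                X, C = s, 1
    return X

A = ['b','a','c','c','c','d','c','b','a','a','a','c','c','b','c','c','c']
print(majority_candidate(A))    # 'c'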

Toy problem #4: Indexing


Consider the following TREC collection:
  N = 6 * 10^9 characters ⇒ size ≈ 6GB
  n = 10^6 documents
  TotT = 10^9 word occurrences (avg term length is 6 chars)
  t = 5 * 10^5 distinct terms

What kind of data structure should we build to support word-based searches?

Solution 1: Term-Doc matrix
n = 1 million documents, t = 500K terms. Entry (term, play) is 1 if the play contains the word, 0 otherwise:

              Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
  Antony            1                1              0          0       0        1
  Brutus            1                1              0          1       0        0
  Caesar            1                1              0          1       1        1
  Calpurnia         0                1              0          0       0        0
  Cleopatra         1                0              0          0       0        0
  mercy             1                0              1          1       1        1
  worser            1                0              1          1       1        0

Space is 500GB!

Solution 2: Inverted index
[Figure: each term (Brutus, Calpurnia, Caesar) points to its posting list, i.e. the sorted sequence of doc-ids containing it. We can still do better: i.e. 30÷50% of the original text.]

1. Typically a posting takes about 12 bytes
2. We have 10^9 total term occurrences ⇒ at least 12GB of space
3. Compressing the 6GB of documents gets ≈ 1.5GB of data
Better index, but it is still >10 times the (compressed) text!!!!
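A toy Python sketch of building such an inverted index (the documents, ids and the whitespace tokenization are illustrative):

from collections import defaultdict

def build_inverted_index(docs):
    # docs: dict doc_id -> text; returns term -> sorted posting list of doc ids
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "Brutus killed Caesar", 2: "Caesar and Calpurnia", 4: "Brutus again"}
idx = build_inverted_index(docs)
print(idx["brutus"])                                       # [1, 4]
# an AND query is the intersection of two posting lists
print(sorted(set(idx["brutus"]) & set(idx["caesar"])))     # [1]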

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string is (losslessly) compressed, some other must expand.
Take all messages of length n. Is it possible to compress ALL of them into fewer bits?
NO: they are 2^n, but the shorter compressed messages are fewer:

  Σ_{i=1}^{n-1} 2^i = 2^n − 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
  i(s) = log2 (1 / p(s)) = −log2 p(s)

Lower probability ⇒ higher information.

Entropy is the weighted average of i(s):

  H(S) = Σ_{s∈S} p(s) · log2 (1 / p(s))   bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
[Figure: the binary trie of the code — left edges are 0, right edges are 1; the leaves are a (path 0), b (path 100), c (path 101), d (path 11).]

Average Length
For a code C with codeword length L[s], the
average length is defined as
  La(C) = Σ_{s∈S} p(s) · L[s]

We say that a prefix code C is optimal if, for all prefix codes C', La(C) ≤ La(C').

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal uniquely decodable code there exists a prefix code with the same codeword lengths, and thus the same optimal average length. And vice versa...

Theorem (golden rule). If C is an optimal prefix code for a source with probabilities {p1, …, pn},
then pi < pj ⇒ L[si] ≥ L[sj].

Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any uniquely decodable code C, we have

  H(S) ≤ La(C)

Theorem (upper bound, Shannon). For any probability distribution there exists a prefix code C such that

  La(C) ≤ H(S) + 1

(the Shannon code, which assigns ⌈log2 1/p(s)⌉ bits to symbol s)

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
[Figure: build the tree by repeatedly merging the two least probable nodes — a(.1)+b(.2) → (.3); (.3)+c(.2) → (.5); (.5)+d(.5) → (1) — labelling one edge 0 and the other 1 at each merge.]

a=000, b=001, c=01, d=1
There are 2^(n-1) "equivalent" Huffman trees.

What about ties (and thus, tree depth)?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
  abc... → 00000101          101001... → dcb

[Figure: the Huffman tree of the running example, traversed root-to-leaf while decoding.]
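A compact Python sketch of Huffman tree construction and codeword assignment (probabilities from the running example; the heap-based merging is one of several equivalent implementations):

import heapq

def huffman_codes(probs):
    # repeatedly merge the two least probable subtrees; prepend 0/1 to their codes
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(sorted(probs.items()))]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

codes = huffman_codes({"a": .1, "b": .2, "c": .2, "d": .5})
print(codes)   # {'d': '0', 'c': '10', 'a': '110', 'b': '111'} — same lengths as a=000, b=001, c=01, d=1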

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store, for any level L of the tree:
  firstcode[L]  (= 00.....0 at the deepest level)
  Symbol[L,i], for each i in level L

This is ≤ h^2 + |S| log |S| bits.

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
  −log2(.999) ≈ .00144 bits

If we were to send 1000 such symbols we might hope to use 1000 * .00144 ≈ 1.44 bits.

Using Huffman, we take at least 1 bit per symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
  ⇒ ≤ 1 extra bit per macro-symbol = 1/k extra bits per symbol
  ⇒ but a larger model must be transmitted

Shannon took infinite sequences, i.e. k → ∞ !!

In practice, we have:
  the model takes |S|^k (k * log |S|) + h^2 bits (where h might be |S|)
  and H0(S^L) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
[Figure: byte-aligned tagged Huffman — 7 bits of Huffman code plus 1 tag bit per byte — illustrated on the word-based tree and on the compressed form C(T) of T = "bzip or not bzip".]

CGrep and other ideas...
[Figure: compressed matching — the codeword of P = bzip (here 1a 0b) is searched directly in C(T) for T = "bzip or not bzip"; the tag bits let the scanner synchronize on codeword boundaries, so candidate alignments are accepted (yes) or rejected (no) without decompressing.]
Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary = { bzip, not, or, space };  P = bzip, whose codeword is 1a 0b;  S = "bzip or not bzip".

[Figure: the codeword of P is searched directly in the compressed text C(S), checking the tag bits to accept (yes) or reject (no) each candidate alignment.]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
[Figure: the pattern P slides over the text T looking for aligned matches.]
 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

  H(s) = Σ_{i=1}^{m} 2^(m−i) · s[i]

  P = 0101  ⇒  H(P) = 2^3·0 + 2^2·1 + 2^1·0 + 2^0·1 = 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

  H(T_r) = 2·H(T_{r−1}) − 2^m·T[r−1] + T[r+m−1]

  T = 10110101,  T_1 = 1011,  T_2 = 0110
  H(T_1) = H(1011) = 11
  H(T_2) = H(0110) = 2·11 − 2^4·1 + 0 = 22 − 16 + 0 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally, reading P's bits left to right:

  1
  1·2 (mod 7) + 0 = 2
  2·2 (mod 7) + 1 = 5
  5·2 (mod 7) + 1 = 4
  4·2 (mod 7) + 1 = 2
  2·2 (mod 7) + 1 = 5  =  Hq(P)

We can still compute Hq(T_r) from Hq(T_{r−1}), since 2^m (mod q) = 2·(2^(m−1) (mod q)) (mod q).

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
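A short Python sketch of the Karp-Rabin scan on a binary text (the prime q is fixed here only for illustration; a real implementation picks a random prime ≤ I, and this version always verifies candidates, i.e. the deterministic variant):

def karp_rabin(T, P, q=2**31 - 1):
    n, m = len(T), len(P)
    if m > n:
        return []
    top = pow(2, m - 1, q)                   # 2^(m-1) mod q, used to drop the leading bit
    hp = ht = 0
    for i in range(m):                       # fingerprints of P and of T_1
        hp = (2 * hp + int(P[i])) % q
        ht = (2 * ht + int(T[i])) % q
    occ = []
    for r in range(n - m + 1):
        if hp == ht and T[r:r + m] == P:     # check to rule out false matches
            occ.append(r + 1)                # 1-based, as in the slides
        if r + m < n:                        # Hq(T_{r+1}) from Hq(T_r)
            ht = (2 * (ht - int(T[r]) * top) + int(T[r + m])) % q
    return occ

print(karp_rabin("10110101", "0101"))        # [5]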

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

[Figure: the 3×10 matrix M for T = california and P = for. The only 1-entries are M(1,5), M(2,6), M(3,7): 'f' matches at position 5, 'fo' ends at position 6, and the full pattern 'for' ends at position 7.]
How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit bit-parallelism to compute the j-th column of M from the (j−1)-th one.
We define the m-length binary vector U(x) for each character x of the alphabet: U(x) is set to 1 at the positions where x appears in P.

Example: P = abaac
  U(a) = (1,0,1,1,0)   U(b) = (0,1,0,0,0)   U(c) = (0,0,0,0,1)

How to construct M



Initialize column 0 of M to all zeros
For j > 0, the j-th column is obtained as

  M(j) = BitShift( M(j−1) ) & U( T[j] )

Indeed, for i > 1, entry M(i,j) = 1 iff
  (1) the first i−1 characters of P match the i−1 characters of T ending at position j−1, i.e. M(i−1, j−1) = 1;
  (2) P[i] = T[j], i.e. the i-th bit of U(T[j]) = 1.

BitShift moves bit M(i−1, j−1) into the i-th position; ANDing it with the i-th bit of U(T[j]) establishes whether both conditions hold.
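A bit-parallel Python sketch of the whole scan (Python integers stand in for the machine word; bit i−1 of the mask corresponds to row i of M):

def shift_and(T, P):
    # U[c] has bit i set iff P[i+1] == c;  M <- ((M << 1) | 1) & U[T[j]]
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M, occ = 0, []
    for j, c in enumerate(T):
        M = ((M << 1) | 1) & U.get(c, 0)
        if M & (1 << (m - 1)):               # row m is set: P ends at position j+1
            occ.append(j - m + 2)            # 1-based starting position
    return occ

print(shift_and("california", "for"))        # [5]
print(shift_and("xabxabaaca", "abaac"))      # [5]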

[Worked examples for T = xabxabaaca and P = abaac, columns j = 1, 2, 3 and 9: each new column is M(j) = BitShift(M(j−1)) & U(T[j]); at j = 9 the 5-th bit of the column becomes 1, signalling an occurrence of P ending at position 9 (i.e. starting at position 5).]

Shift-And method: Complexity








If m ≤ w, any column and any vector U() fit in one memory word
  ⇒ each step requires O(1) time.
If m > w, any column and any vector U() can be split into ⌈m/w⌉ memory words
  ⇒ each step requires O(m/w) time.
Overall O(n(1 + m/w) + m) time.
Thus it is very fast when the pattern length is close to the word size
  — very often the case in practice; recall that w = 64 bits on modern architectures.

Some simple extensions


We want to allow the pattern to contain special symbols, like the character class [a-f].

Example: P = [a-b]baac
  U(a) = (1,0,1,1,0)   U(b) = (1,1,0,0,0)   U(c) = (0,0,0,0,1)

What about '?' and '[^…]' (negation)?

Problem 1: Another solution

Dictionary = { bzip, not, or, space };  P = bzip = 1a 0b;  S = "bzip or not bzip".

[Figure: the Shift-And scan run directly over the compressed text C(S), accepting (yes) or rejecting (no) the candidate codeword alignments.]

Speed ≈ Compression ratio

Problem 2
Dictionary = { bzip, not, or, space };  P = o;  S = "bzip or not bzip".
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring (here both "or" = 1g 0a 0b and "not" = 1g 0g 0a contain 'o').

[Figure: the occurrences are located by scanning the compressed text C(S).]

Speed ≈ Compression ratio? No! Why? Because we need one scan of C(S) for each dictionary term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: the patterns P1 and P2 are searched simultaneously over the text T.]
 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

  S is the concatenation of the patterns in P
  R is a bitmap of length m, with R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of the Shift-And method, searching for S:
  For any symbol c, define U'(c) = U(c) AND R
    ⇒ U'(c)[i] = 1 iff S[i] = c and it is the first symbol of a pattern
  For any step j:
    compute M(j), then OR it with U'(T[j]). Why?
    ⇒ this sets to 1 the first bit of each pattern that starts with T[j]
    Check if there are occurrences ending in j. How?

Problem 3
Dictionary = { bzip, not, or, space };  P = bot, k = 2;  S = "bzip or not bzip".
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring while allowing at most k mismatches.

[Figure: the candidate terms are checked on the compressed text C(S).]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing M^k

We compute M^l for all l = 0, …, k; for each j we compute M^0(j), M^1(j), …, M^k(j), initializing every M^l(0) to the zero vector. In order to compute M^l(j), we observe that entry (i,j) is 1 iff one of two cases holds.

Computing M^l: case 1
The first i−1 characters of P match a substring of T ending at j−1 with at most l mismatches, and the next pair of characters in P and T are equal:
  contribution  BitShift( M^l(j−1) ) & U( T[j] )

Computing M^l: case 2
The first i−1 characters of P match a substring of T ending at j−1 with at most l−1 mismatches (so one mismatch can be spent on position i):
  contribution  BitShift( M^(l−1)(j−1) )

Computing M^l

Putting the two cases together (for all l = 0,…,k, with every M^l(0) initialized to the zero vector):

  M^l(j)  =  [ BitShift( M^l(j−1) ) & U( T[j] ) ]  OR  BitShift( M^(l−1)(j−1) )
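A bit-parallel Python sketch of this recurrence, keeping one bitmask per mismatch level (T and P are those of the example below; names are illustrative):

def shift_and_k_mismatches(T, P, k):
    # M[l] holds the current column of M^l:
    # M^l(j) = [BitShift(M^l(j-1)) & U(T[j])]  OR  BitShift(M^(l-1)(j-1))
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M = [0] * (k + 1)
    occ = []
    for j, c in enumerate(T):
        old = M[:]                                   # all columns at j-1
        M[0] = ((old[0] << 1) | 1) & U.get(c, 0)
        for l in range(1, k + 1):
            M[l] = ((((old[l] << 1) | 1) & U.get(c, 0))
                    | ((old[l - 1] << 1) | 1))
        if M[k] & (1 << (m - 1)):                    # <= k mismatches ending at j+1
            occ.append(j - m + 2)                    # 1-based starting position
    return occ

print(shift_and_k_mismatches("xabxabaaca", "abaad", 1))   # [5]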

Example
[Figure: the matrices M^0 and M^1 for T = xabxabaaca and P = abaad. In M^1 the bottom row (row 5) has a single 1, in column 9: the substring T[5,9] = abaac matches P = abaad with one mismatch.]

How much do we pay?





The running time is O(k·n·(1 + m/w)).
Again, the method is efficient in practice for small m.
Moreover, only O(k) columns of M are needed at any given time, hence the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary = { bzip, not, or, space };  P = bot, k = 2;  S = "bzip or not bzip".

[Figure: running the k-mismatch scan on the compressed text C(S) finds the term "not" (= 1g 0g 0a), which matches P = bot within k = 2 mismatches.]

Agrep: more sophisticated operations


The Shift-And method can be adapted to other operations.

The edit distance between two strings p and s is d(p,s) = the minimum number of operations needed to transform p into s via three ops:
  Insertion: insert a symbol in p
  Deletion: delete a symbol from p
  Substitution: change a symbol of p into a different one

Example: d(ananas, banane) = 3

Search by regular expressions
  Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding

  γ(x) = 0 0 ... 0 (Length − 1 zeros)  followed by  x in binary

  x > 0 and Length = ⌊log2 x⌋ + 1
  e.g., 9 is represented as <000, 1001>.

  The γ-code of x takes 2⌊log2 x⌋ + 1 bits (i.e. a factor of 2 from optimal).

  It is optimal for Pr(x) = 1/(2x^2) and i.i.d. integers.

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8

6

3

59

7
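A small Python sketch of γ-encoding and decoding (the bitstring above is used as the test input):

def gamma_encode(x):
    # x > 0: (Length - 1) zeros, then x in binary (which starts with a 1)
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":               # count the unary prefix
            zeros += 1
            i += 1
        out.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return out

print(gamma_encode(9))                                   # 0001001
print(gamma_decode("0001000001100110000011101100111"))   # [8, 6, 3, 59, 7]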

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).

Recall that |γ(i)| ≤ 2·log i + 1. How good is this approach wrt Huffman?
Compression ratio ≤ 2·H0(S) + 1.

Key fact:  1 ≥ Σ_{i=1,...,x} pi ≥ x·px  ⇒  x ≤ 1/px

The cost of the encoding is (recall i ≤ 1/pi):

  Σ_{i=1,...,|S|} pi · |γ(i)|  ≤  Σ_{i=1,...,|S|} pi · [ 2·log(1/pi) + 1 ]  =  2·H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ...

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 128^2 = 16512 words on at most 2 bytes
A (230,26)-dense code encodes 230 + 230·26 = 6210 words on at most 2 bytes, hence more words on 1 byte — and thus, if the distribution is skewed, it compresses better...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:
  It exploits temporal locality, and it is a dynamic code.
  X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n^2 log n) bits, MTF = O(n log n) + n^2 bits.
  Not much worse than Huffman... but it may be far better.
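A minimal Python sketch of the MTF encoder (positions are emitted 1-based, to be γ/var-length coded afterwards; the alphabet and input are illustrative):

def mtf_encode(text, alphabet):
    L = list(alphabet)
    out = []
    for s in text:
        pos = L.index(s)           # 0-based position of s in the current list
        out.append(pos + 1)        # emit the 1-based rank
        L.insert(0, L.pop(pos))    # move s to the front
    return out

print(mtf_encode("aaabbbccc", "abc"))   # [1, 1, 1, 2, 1, 1, 3, 1, 1]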

MTF: how good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1.
Put S in front of the sequence and consider the cost of encoding:

  O(|S| log |S|) + Σ_{x=1}^{|S|} Σ_{i=2}^{n_x} γ( p_{x,i} − p_{x,i−1} )

By Jensen's inequality this is

  ≤ O(|S| log |S|) + Σ_{x=1}^{|S|} n_x · [ 2·log(N/n_x) + 1 ]
  = O(|S| log |S|) + N · [ 2·H0(X) + 1 ]

Hence La[mtf] ≤ 2·H0(X) + O(1).

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




It exploits spatial locality, and it is a dynamic code. There is a memory.
  X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = Θ(n^2 log n)  >  Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.  p(a) = .2,  p(b) = .5,  p(c) = .3
      f(a) = .0,  f(b) = .2,  f(c) = .7        where  f(i) = Σ_{j<i} p(j)

so a gets the interval [0, .2), b gets [.2, .7), c gets [.7, 1.0).
The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7)).

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
Start from [0,1). After b the interval shrinks to [.2,.7); after a, to [.2,.3); after c, to [.27,.3).
The final sequence interval is [.27,.3).

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
  l_0 = 0      l_i = l_{i−1} + s_{i−1} · f[c_i]
  s_0 = 1      s_i = s_{i−1} · p[c_i]

f[c] is the cumulative probability up to symbol c (not included).
The final interval size is

  s_n = Π_{i=1}^{n} p[c_i]

The interval for a message sequence will be called the sequence interval.
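A tiny Python sketch of this interval computation (plain floats, only for illustration; real coders use the integer version described later):

def sequence_interval(msg, p, f):
    # l_i = l_{i-1} + s_{i-1} * f[c_i],  s_i = s_{i-1} * p[c_i]
    l, s = 0.0, 1.0
    for c in msg:
        l = l + s * f[c]
        s = s * p[c]
    return l, s            # the sequence interval is [l, l+s)

p = {"a": .2, "b": .5, "c": .3}
f = {"a": .0, "b": .2, "c": .7}
l, s = sequence_interval("bac", p, f)
print(l, l + s)            # ≈ 0.27 and 0.30, i.e. the interval [.27, .3)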

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing that the message has length 3:
.49 falls in [.2,.7) ⇒ b; rescaling inside that interval, .49 again falls in b's sub-interval ⇒ b; one more step gives c.
The message is bbc.

Representing a real number
Binary fractional representation:
  .75 = .11        1/3 = .010101...        11/16 = .1011

Algorithm
  1. x = 2·x
  2. if x < 1, output 0
  3. else x = x − 1; output 1

So how about just using the shortest binary fractional representation inside the sequence interval?
  e.g.  [0,.33) → .01      [.33,.66) → .1      [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
           min       max       interval
  .11      .110      .111      [.75, 1.0)
  .101     .1010     .1011     [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that −log s + 1 = log(2/s).

Theorem. For a text of length n, the arithmetic encoder generates at most
  1 + ⌈log(1/s)⌉ = 1 + ⌈log Π_i (1/p_i)⌉
  ≤ 2 + Σ_{i=1,…,n} log(1/p_i)
  = 2 + Σ_{k=1,…,|S|} n·p_k · log(1/p_k)
  = 2 + n·H0   bits

In practice it takes ≈ n·H0 + 0.02·n bits, because of rounding.

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine: the ATB keeps the current interval (L,s); given the probability distribution (p1,....,pS) and the next symbol c, it moves to the new interval (L',s') inside [L, L+s).

Therefore, even the distribution can change over time.

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts        String = ACCBACCACBA B,  k = 2

  Context (empty):   A = 4   B = 2   C = 5   $ = 3

  Context A:   C = 3   $ = 1
  Context B:   A = 2   $ = 1
  Context C:   A = 1   B = 2   C = 2   $ = 3

  Context AC:  B = 1   C = 2   $ = 2
  Context BA:  C = 1   $ = 1
  Context CA:  C = 1   $ = 1
  Context CB:  A = 2   $ = 1
  Context CC:  A = 1   B = 1   $ = 2
You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
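A runnable Python version of this decoder (triples use the (d, len, c) convention of these slides; the test input is the windowed example above):

def lz77_decode(triples):
    out = []
    for d, length, c in triples:
        start = len(out) - d
        for i in range(length):              # works also when length > d (overlap)
            out.append(out[start + i])
        out.append(c)
    return "".join(out)

print(lz77_decode([(0, 0, "a"), (1, 1, "c"), (3, 4, "b"), (3, 3, "a"), (1, 2, "c")]))
# aacaacabcabaaac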

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
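A Python sketch of the LZW coder (a toy 3-letter initial dictionary instead of the 256 ASCII entries, so the emitted ids are small integers rather than the 112/256... codes of the example):

def lzw_encode(text, alphabet):
    # dictionary initialized with the single symbols; emit only the id of the
    # longest match S, but still add S + c to the dictionary
    dic = {c: i for i, c in enumerate(alphabet)}
    S, out = "", []
    for c in text:
        if S + c in dic:
            S = S + c
        else:
            out.append(dic[S])
            dic[S + c] = len(dic)            # new entry
            S = c
    if S:
        out.append(dic[S])                   # flush the last match
    return out

print(lzw_encode("aabaacababacb", "abc"))
# [0, 0, 1, 3, 2, 4, 8, 2, 1]  i.e.  a, a, b, aa, c, ab, aba, c, b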

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
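A toy Python sketch of the forward and inverse transform (quadratic construction, fine only for short strings; real implementations build the suffix array instead):

def bwt(T):
    # last column of the sorted rotations; T must end with the unique marker '#'
    n = len(T)
    rows = sorted(T[i:] + T[:i] for i in range(n))
    return "".join(row[-1] for row in rows)

def ibwt(L):
    n = len(L)
    # a stable sort of L gives F and, at the same time, the LF-mapping
    order = sorted(range(n), key=lambda i: (L[i], i))
    LF = [0] * n
    for f_row, l_row in enumerate(order):
        LF[l_row] = f_row
    T = [""] * n
    T[n - 1] = "#"                 # row 0 starts with the end-marker
    r = 0
    for i in range(n - 2, -1, -1):
        T[i] = L[r]                # L[r] precedes F[r] in T
        r = LF[r]
    return "".join(T)

print(bwt("mississippi#"))         # ipssm#pissii
print(ibwt("ipssm#pissii"))        # mississippi#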

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n^2 log n) time in the worst case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
  Pr[ in-degree(u) = k ]  ∝  1 / k^α,   with α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides and efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
            Emacs size    Emacs time
  uncompr     27Mb           ---
  gzip         8Mb          35 secs
  zdelta      1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: a small weighted graph over the files plus the dummy node 0; edge weights are the zdelta sizes (gzip sizes on the dummy edges); the min branching selects the cheapest reference for every file.]

            space     time
  uncompr    30Mb      ---
  tgz        20%      linear
  THIS        8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
space

time

uncompr

260Mb

---

tgz

12%

2 mins

THIS

8%

16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

           gcc size    emacs size
  total     27288        27326
  gzip       7563         8577
  zdelta      227         1431
  rsync       964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elements each, log(n/k) levels.
If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k · log n · log(n/k)) bits.

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Figure: the suffix tree of T# = mississippi#, with edge labels (#, i, s, si, p, ssi, ppi#, pi#, i#, mississippi#, ...) and the leaves storing the 12 starting positions of the suffixes.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
5

Θ(N^2) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
⇒ In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
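A toy Python sketch of suffix-array construction and of the indirected binary search (O(N^2 log N) construction, fine only for short texts; a real index uses a linear-time builder):

def suffix_array(T):
    # 1-based starting positions of the suffixes of T, in lexicographic order
    return sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])

def occurrences(T, SA, P):
    # binary search on SA; each comparison looks at (at most) |P| chars of a suffix
    def pref(k):
        i = SA[k] - 1
        return T[i:i + len(P)]
    lo, hi = 0, len(SA)
    while lo < hi:                     # leftmost suffix whose first |P| chars are >= P
        mid = (lo + hi) // 2
        if pref(mid) < P:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(SA)
    while lo < hi:                     # just past the block of suffixes starting with P
        mid = (lo + hi) // 2
        if pref(mid) <= P:
            lo = mid + 1
        else:
            hi = mid
    return sorted(SA[first:lo])

T = "mississippi#"
SA = suffix_array(T)
print(SA)                              # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(occurrences(T, SA, "si"))        # [4, 7]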

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
• How long is the common prefix between T[i,...] and T[j,...]?
  ⇒ It is the minimum of the subarray Lcp[h,k−1], where SA[h]=i and SA[k]=j.
• Is there a repeated substring of length ≥ L?
  ⇒ Search for an entry Lcp[i] ≥ L.
• Is there a substring of length ≥ L occurring ≥ C times?
  ⇒ Search for a window Lcp[i,i+C−2] whose entries are all ≥ L.


Slide 8

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Running times of the two obvious solutions (input size vs. time):

  size   4K   8K   16K  32K   128K  256K  512K  1M
  n³     22s  3m   26m  3.5h  28h   --    --    --
  n²     0    0    0    1s    26s   106s  7m    28m

An optimal solution
We assume every subsum ≠ 0

[Figure: A is split into maximal runs with negative (<0) and positive (>0) sums; the Optimum starts right after a negative run]

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm

  sum = 0; max = -1;
  For i = 1,...,n do
      if (sum + A[i] ≤ 0) then sum = 0;
      else sum += A[i]; max = MAX{max, sum};

Note:
• Sum < 0 when OPT starts;
• Sum > 0 within OPT
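A runnable version of this scan, as a sketch in Python (variable names are mine, not from the slides):

def max_subarray_sum(A):
    # One left-to-right scan: reset the running sum when it would drop to <= 0,
    # otherwise extend the current window and track the best sum seen so far.
    best, running = float("-inf"), 0
    for x in A:
        if running + x <= 0:
            running = 0
        else:
            running += x
            best = max(best, running)
    return best

# The slides' array: the best window is 6+1-2+4+3 = 12.
print(max_subarray_sum([2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]))   # 12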

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort: Θ(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 10^9 random I/Os = 10^9 × 5ms ➜ 2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
  if (i < j) then
      m = (i+j)/2;           // Divide
      Merge-Sort(A,i,m);     // Conquer
      Merge-Sort(A,m+1,j);
      Merge(A,i,m,j)         // Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word frequencies:
  n = 10^9 tuples ➜ few GBs
  Typical disk (Seagate Cheetah 150GB): seek time ≈ 5ms

Analysis of mergesort on disk:
  It is an indirect sort: Θ(n log₂ n) random I/Os
  [5ms] × n log₂ n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
[Figure: merge-sort recursion tree — log₂ N levels of runs being pairwise merged; only the bottom levels fit in memory M]

How do we deploy the disk/mem features ?

N/M runs, each sorted in internal memory (no I/Os)

➜ I/O-cost for merging is ≈ 2 (N/B) log₂ (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main memory M and disk pages B:

  Pass 1: produce (N/M) sorted runs.
  Pass i: merge X ≤ M/B runs at a time ➜ log_{M/B}(N/M) passes

[Figure: X input buffers (one per run) and one output buffer, each of B items, sit in main memory between the input disk and the output disk]

Multiway Merging

[Figure: X = M/B runs merged in one pass. Each run i feeds an input buffer Bf_i with a pointer p_i to its current page; an output buffer Bf_o with pointer p_o collects the merged run. At each step output min(Bf₁[p₁], Bf₂[p₂], …, Bf_X[p_X]); fetch a new page when p_i = B, flush Bf_o when it is full, stop at EOF.]
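A toy sketch of this X-way merge in Python (plain in-memory lists stand in for the disk runs and the B-sized buffers; names are mine):

import heapq

def multiway_merge(runs):
    # One "current element" per sorted run sits in a min-heap, mimicking
    # min(Bf1[p1], ..., BfX[pX]); pop the minimum, refill from that run.
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    merged = []
    while heap:
        val, i, pos = heapq.heappop(heap)
        merged.append(val)
        if pos + 1 < len(runs[i]):
            heapq.heappush(heap, (runs[i][pos + 1], i, pos + 1))
    return merged

print(multiway_merge([[1, 5, 13], [7, 9, 19], [3, 4, 8]]))  # [1, 3, 4, 5, 7, 8, 9, 13, 19]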

Cost of Multi-way Merge-Sort


Number of passes = log_{M/B} #runs ≈ log_{M/B} (N/M)

Optimal cost = Θ((N/B) log_{M/B} (N/M)) I/Os

In practice

  M/B ≈ 1000 ➜ #passes = log_{M/B} (N/M) ≈ 1
  One multiway merge ➜ 2 passes = few mins
  (tuning depends on disk features)

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Can compression help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm

  Use a pair of variables <X,C>  (initialize X to the first item, C = 1)
  For each subsequent item s of the stream:
      if (X == s) then C++
      else { C--; if (C == 0) { X = s; C = 1; } }
  Return X;

Proof (sketch)

(Problems can arise only if y occurs ≤ N/2 times.)

If the algorithm returned X ≠ y, then every one of y's
occurrences was cancelled by a distinct "negative" mate,
hence these mates are ≥ #occ(y) in number.
But #occ(y) > N/2, so 2 · #occ(y) > N: occurrences plus
mates would exceed the stream length — impossible.
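A sketch of this one-pass algorithm in Python (the stream is the one from the slide):

def majority_candidate(stream):
    # One candidate X and one counter C: pairwise cancellations cannot
    # eliminate an item occurring more than N/2 times.
    it = iter(stream)
    X, C = next(it), 1
    for s in it:
        if s == X:
            C += 1
        else:
            C -= 1
            if C == 0:
                X, C = s, 1
    return X

A = "bacccdcbaaaccbccc"          # the slide's stream, N = 17
print(majority_candidate(A))     # 'c', which occurs 9 > N/2 times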

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 × 10^9 chars ➜ size = 6GB
n = 10^6 documents
TotT = 10^9 total term occurrences (avg term length is 6 chars)
t = 5 × 10^5 distinct terms

What kind of data structure should we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million documents, t = 500K terms
(entry = 1 if the play contains the word, 0 otherwise)

             Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
  Antony            1                1             0          0       0        1
  Brutus            1                1             0          1       0        0
  Caesar            1                1             0          1       1        1
  Calpurnia         0                1             0          0       0        0
  Cleopatra         1                0             0          0       0        0
  mercy             1                0             1          1       1        1
  worser            1                0             1          1       1        0

Space is 500Gb !

Solution 2: Inverted index

  Brutus    ➜  2 4 8 16 32 64 128
  Calpurnia ➜  1 2 3 5 8 13 21 34
  Caesar    ➜  13 16

We can still do better: i.e. 30÷50% of the original text

1. Each posting typically uses about 12 bytes
2. We have 10^9 total terms ➜ at least 12Gb space
3. Compressing the 6Gb of documents gets 1.5Gb of data
Better index, but it is still >10 times the text !!!!
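A toy inverted-index builder in Python, only to make the idea concrete (the two documents and their ids are invented for the example):

def build_inverted_index(docs):
    # docs: {doc_id: text}; returns term -> sorted list of doc ids (the postings).
    index = {}
    for doc_id in sorted(docs):
        for term in sorted(set(docs[doc_id].lower().split())):
            index.setdefault(term, []).append(doc_id)
    return index

docs = {1: "bzip or not bzip", 2: "to be or not to be"}
index = build_inverted_index(docs)
print(index["or"], index["bzip"])   # [1, 2] [1]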

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (losslessly) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM into fewer bits ?
NO, they are 2^n but the shorter messages are fewer:

  Σ_{i=1,…,n-1} 2^i = 2^n - 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:

  i(s) = log₂ (1/p(s)) = - log₂ p(s)

Lower probability ➜ higher information

Entropy is the weighted average of i(s):

  H(S) = Σ_{s∈S} p(s) · log₂ (1/p(s))   bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g. a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie:

[Figure: binary trie with a at the leaf reached by 0, b and c below 10 (100, 101), and d at 11]

Average Length
For a code C with codeword lengths L[s], the
average length is defined as

  La(C) = Σ_{s∈S} p(s) · L[s]

We say that a prefix code C is optimal if for
all prefix codes C', La(C) ≤ La(C')

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, there exists a prefix
code with the same codeword lengths, and thus the
same optimal average length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}
then pi < pj ⇒ L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

  H(S) ≤ La(C)

Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

  La(C) ≤ H(S) + 1

(Shannon code: symbol s takes ⌈log₂ 1/p(s)⌉ bits)

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

[Figure: Huffman tree — a(.1) and b(.2) merge into (.3); (.3) and c(.2) merge into (.5); (.5) and d(.5) merge into the root (1)]

a = 000, b = 001, c = 01, d = 1
There are 2^(n-1) "equivalent" Huffman trees

What about ties (and thus, tree depth) ?
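A compact heap-based Huffman construction in Python, as a sketch (ties are broken arbitrarily, so it returns one of the equivalent trees):

import heapq

def huffman_codes(probs):
    # Repeatedly merge the two least-probable subtrees; the counter keeps
    # heap entries comparable and breaks ties arbitrarily.
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(sorted(probs.items()))]
    heapq.heapify(heap)
    nxt = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, nxt, merged))
        nxt += 1
    return heap[0][2]

codes = huffman_codes({"a": .1, "b": .2, "c": .2, "d": .5})
print(codes)   # codeword lengths match the slide (|a|=|b|=3, |c|=2, |d|=1); the bits may differ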

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
  abc…     ➜ 000 001 01 = 00000101
  101001…  ➜ d c b …

[Figure: the same Huffman tree as above, used for encoding and decoding]

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:
  firstcode[L]   (the first codeword on level L)
  Symbol[L,i], for each i in level L

This is ≤ h² + |S| log |S| bits   (h = tree height)

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is

  -log₂(.999) ≈ .00144

If we were to send 1000 such symbols we
might hope to use 1000 × .00144 ≈ 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 ➜ 1 extra bit per macro-symbol = 1/k extra bits per symbol
 ➜ Larger model to be transmitted

Shannon took infinite sequences, and k ➜ ∞ !!

In practice, we have:
  the model takes |S|^k · (k · log |S|) + h² bits   (where h might be |S|)
  and H0(SL) ≤ L · Hk(S) + O(k · log |S|), for each k ≤ L

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
[Figure: the word-based Huffman tree with fan-out 128 ("huffman with tagging"): every codeword is a sequence of bytes, 7 bits of Huffman code per byte plus 1 tag bit marking the first byte of a codeword; example text T = "bzip or not bzip" and its byte-aligned codewords C(T)]

CGrep and other ideas...
P = bzip = 1a 0b

[Figure: GREP over the compressed text C(T), T = "bzip or not bzip": thanks to the tagged bytes the scan stays aligned on codeword boundaries, answering yes/no at each candidate]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary: bzip, not, or, space

P = bzip = 1a 0b

[Figure: the tagged word-based Huffman code of S = "bzip or not bzip"; the query is answered by scanning the compressed text C(S)]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
  [Figure: a text T (e.g. A B C A B D A B …) with the pattern P aligned below one of its occurrences]

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which arithmetic and bit operations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errs, whose running
time is efficient with high probability.

We will consider a binary alphabet (i.e., T ∈ {0,1}^n).

Arithmetic replaces Comparisons

Strings are also numbers, H: strings → numbers.
Let s be a string of length m:

  H(s) = Σ_{i=1,…,m} 2^(m-i) · s[i]

P = 0101
H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5
s = s' if and only if H(s) = H(s')

Definition:
let Tr denote the m-length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons

We can compute H(Tr) from H(Tr-1):

  H(Tr) = 2·H(Tr-1) - 2^m·T[r-1] + T[r+m-1]

T = 10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

  H(T1) = H(1011) = 11
  H(T2) = 2·11 - 2⁴·1 + 0 = 22 - 16 + 0 = 6 = H(0110)

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P = 101111,  q = 7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
  1·2 + 0 = 2
  2·2 + 1 = 5
  5·2 + 1 = 11 ≡ 4 (mod 7)
  4·2 + 1 = 9 ≡ 2 (mod 7)
  2·2 + 1 = 5
  ➜ Hq(P) = 5

We can still compute Hq(Tr) from Hq(Tr-1):
  2^m (mod q) = 2·(2^(m-1) (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
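A Karp-Rabin sketch in Python over a binary text (here q is a fixed prime rather than a random one, and every fingerprint hit is verified, so no false matches are reported):

def karp_rabin(T, P, q=101):
    # Rolling fingerprint Hq(s) = H(s) mod q, with H(s) = sum s[i]*2^(m-i).
    n, m = len(T), len(P)
    if m > n:
        return []
    top = pow(2, m - 1, q)                      # 2^(m-1) mod q
    hp = ht = 0
    for i in range(m):
        hp = (2 * hp + int(P[i])) % q
        ht = (2 * ht + int(T[i])) % q
    occ = []
    for r in range(n - m + 1):
        if ht == hp and T[r:r + m] == P:        # verify, to rule out false matches
            occ.append(r)
        if r + m < n:
            ht = (2 * (ht - int(T[r]) * top) + int(T[r + m])) % q
    return occ

print(karp_rabin("10110101", "0101"))   # [4]  (0-based; the slides' position 5)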

Problem 1: Solution
Dictionary: bzip, not, or, space

P = bzip = 1a 0b

[Figure: scan C(S), S = "bzip or not bzip", comparing byte-aligned codewords against the codeword of P; the tag bits mark codeword beginnings, so each candidate is answered yes/no]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M

We want to exploit the bit-parallelism to
compute the j-th column of M from the (j-1)-th one.
We define the m-length binary vector U(x) for each
character x in the alphabet: U(x) is set to 1 for the
positions in P where character x appears.

Example: P = abaac
  U(a) = (1,0,1,1,0)ᵗ   U(b) = (0,1,0,0,0)ᵗ   U(c) = (0,0,0,0,1)ᵗ
How to construct M

Initialize column 0 of M to all zeros.
For j > 0, the j-th column is obtained by

  M(j) = BitShift( M(j-1) ) & U( T[j] )

For i > 1, entry M(i,j) = 1 iff
  (1) the first i-1 characters of P match the i-1 characters of T
      ending at character j-1   ⇔ M(i-1,j-1) = 1
  (2) P[i] = T[j]               ⇔ the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) into the i-th position;
AND this with the i-th bit of U(T[j]) to establish if both are true.
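A bit-parallel Shift-And sketch in Python (an integer plays the role of an m-bit column of M; bit i-1 of the word stands for row i):

def shift_and(T, P):
    # Build U(x): bit i-1 is set iff P[i] == x (1-based i, as in the slides).
    m = len(P)
    U = {}
    for i, x in enumerate(P):
        U[x] = U.get(x, 0) | (1 << i)
    occ, col, last = [], 0, 1 << (m - 1)
    for j, c in enumerate(T):
        # M(j) = BitShift(M(j-1)) & U(T[j]); the "| 1" sets the new first bit.
        col = ((col << 1) | 1) & U.get(c, 0)
        if col & last:                 # row m is 1: an occurrence of P ends at j
            occ.append(j - m + 1)      # 0-based starting position
    return occ

print(shift_and("xabxabaaca", "abaac"))   # [4]  (P ends at T[9], as in the example below)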

An example: T = xabxabaaca, P = abaac

  j=1: BitShift(M(0)) & U(x) = (1,0,0,0,0)ᵗ & (0,0,0,0,0)ᵗ = (0,0,0,0,0)ᵗ
  j=2: BitShift(M(1)) & U(a) = (1,0,0,0,0)ᵗ & (1,0,1,1,0)ᵗ = (1,0,0,0,0)ᵗ
  j=3: BitShift(M(2)) & U(b) = (1,1,0,0,0)ᵗ & (0,1,0,0,0)ᵗ = (0,1,0,0,0)ᵗ
  ...
  j=9: BitShift(M(8)) & U(c) = (1,1,0,0,1)ᵗ & (0,0,0,0,1)ᵗ = (0,0,0,0,1)ᵗ

The resulting matrix M (rows i = 1..5, columns j = 1..9):

  i\j   1 2 3 4 5 6 7 8 9
   1    0 1 0 0 1 0 1 1 0
   2    0 0 1 0 0 1 0 0 0
   3    0 0 0 0 0 0 1 0 0
   4    0 0 0 0 0 0 0 1 0
   5    0 0 0 0 0 0 0 0 1

M(5,9) = 1 ➜ an occurrence of P ends at position 9 of T.

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions

We want to allow the pattern to contain special
symbols, like [a-f] classes of chars.

P = [a-b]baac
  U(a) = (1,0,1,1,0)ᵗ   U(b) = (1,1,0,0,0)ᵗ   U(c) = (0,0,0,0,1)ᵗ

What about '?', '[^…]' (not) ?

Problem 1: Another solution

Dictionary: bzip, not, or, space

P = bzip = 1a 0b

[Figure: as before, the byte-aligned codeword of P is searched directly in C(S), S = "bzip or not bzip"]

Speed ≈ Compression ratio

Problem 2

Dictionary: bzip, not, or, space

Given a pattern P, find all the occurrences in S
of all terms containing P as a substring.

P = o   ➜   not = 1g 0g 0a ,  or = 1g 0a 0b

[Figure: the codewords of the matching terms are then searched in C(S), S = "bzip or not bzip"]

Speed ≈ Compression ratio? No! Why?
A scan of C(S) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
  [Figure: a text T with occurrences of the patterns P1 and P2 marked]
 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

S is the concatenation of the patterns in P.
R is a bitmap of length m:
  R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of the Shift-And method searching for S:
  For any symbol c, U'(c) = U(c) AND R
    ➜ U'(c)[i] = 1 iff S[i] = c and it is the first symbol of a pattern
  For any step j,
    ➜ compute M(j)
    ➜ then M(j) OR U'(T[j]). Why?
      It sets to 1 the first bit of each pattern that starts with T[j]
    ➜ Check if there are occurrences ending in j. How?

Problem 3

Dictionary: bzip, not, or, space

Given a pattern P, find all the occurrences in S
of all terms containing P as a substring, allowing
at most k mismatches.

P = bot, k = 2

[Figure: the compressed text C(S), S = "bzip or not bzip", with the tagged word-based Huffman code as before]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
there are no more than l mismatches between the
first i characters of P and the i characters of T
ending at character j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

[Figure: P aligned against T ending at j-1 with ≤ l mismatches, then extended by one matching character]

  BitShift( M^l(j-1) ) & U(T[j])

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

[Figure: P aligned against T ending at j-1 with ≤ l-1 mismatches, then extended by one (possibly mismatching) character]

  BitShift( M^(l-1)(j-1) )

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example: T = xabxabaaca, P = abaad

M1 =
  i\j   1 2 3 4 5 6 7 8 9 10
   1    1 1 1 1 1 1 1 1 1 1
   2    0 0 1 0 0 1 0 1 1 0
   3    0 0 0 1 0 0 1 0 0 1
   4    0 0 0 0 1 0 0 1 0 0
   5    0 0 0 0 0 0 0 0 1 0

M0 =
  i\j   1 2 3 4 5 6 7 8 9 10
   1    0 1 0 0 1 0 1 1 0 1
   2    0 0 1 0 0 1 0 0 0 0
   3    0 0 0 0 0 0 1 0 0 0
   4    0 0 0 0 0 0 0 1 0 0
   5    0 0 0 0 0 0 0 0 0 0

M1(5,9) = 1 ➜ P occurs at T[5,9] with ≤ 1 mismatch.

How much do we pay?





The running time is O(k·n·(1+m/w)).
Again, the method is practically efficient for
small m.
Moreover, only O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution

Dictionary: bzip, not, or, space

Given a pattern P, find all the occurrences in S
of all terms containing P as a substring, allowing
k mismatches.

P = bot, k = 2   ➜   not = 1g 0g 0a matches with ≤ 2 mismatches

[Figure: the codeword of "not" is then searched in the compressed text C(S), S = "bzip or not bzip"]

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code (Elias γ) for integer encoding

  g(x) = 0^(Length-1) followed by x in binary,
  where x > 0 and Length = ⌊log₂ x⌋ + 1
  e.g., 9 is represented as <000, 1001>.

g-code for x takes 2⌊log₂ x⌋ + 1 bits
  (i.e. a factor of 2 from optimal)

Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…

Given the following sequence of g-coded integers,
reconstruct the original sequence:

  0001000001100110000011101100111

  ➜ 8 6 3 59 7
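A small encoder/decoder sketch for this g-code in Python (function names are mine):

def g_encode(x):
    # Length-1 zeros, then x written in binary (Length = floor(log2 x) + 1 bits).
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def g_decode_stream(bits):
    out, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":          # the unary part gives Length-1
            zeros += 1
            i += 1
        out.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return out

print(g_encode(9))                                             # 0001001
print(g_decode_stream("0001000001100110000011101100111"))      # [8, 6, 3, 59, 7]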

Analysis
Sort the pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2·log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2·H0(S) + 1
Key fact:
  1 ≥ Σ_{i=1,…,x} pi ≥ x·px  ➜  x ≤ 1/px

How good is it ?
Encode the integers via g-coding: |g(i)| ≤ 2·log i + 1
The cost of the encoding is (recall i ≤ 1/pi):

  Σ_{i=1,…,|S|} pi · |g(i)| ≤ Σ_{i=1,…,|S|} pi · [2·log(1/pi) + 1] = 2·H0(X) + 1

Not much worse than Huffman,
and improvable to H0(X) + 2 + …

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte,
s·c with 2 bytes, s·c² with 3 bytes, ...

An example

5000 distinct words
ETDC encodes 128 + 128² = 16512 words on ≤ 2 bytes
A (230,26)-dense code encodes 230 + 230·26 = 6210 words on ≤ 2
bytes, hence more words on 1 byte, and thus it is better if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1ⁿ 2ⁿ 3ⁿ … nⁿ  ➜  Huff = O(n² log n), MTF = O(n log n) + n²

No much worse than Huffman
...but it may be far better
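A minimal MTF encoder in Python (a plain list; the search-tree/hash-table machinery of the later slides matters only for large alphabets):

def mtf_encode(text, alphabet):
    # Output the current position (0-based) of each symbol, then move it to front.
    L = list(alphabet)
    out = []
    for ch in text:
        i = L.index(ch)
        out.append(i)
        L.insert(0, L.pop(i))
    return out

print(mtf_encode("aaabbbba", "abc"))   # [0, 0, 0, 1, 0, 0, 0, 1]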

MTF: how good is it ?
Encode the integers via g-coding: |g(i)| ≤ 2·log i + 1
Put S in front and consider the cost of encoding:

  O(|S| log |S|) + Σ_{x=1,…,|S|} Σ_{i=2,…,n_x} |g( p_i^x - p_{i-1}^x )|

(where p_1^x < p_2^x < … are the positions of the occurrences of symbol x)

By Jensen's inequality:

  ≤ O(|S| log |S|) + Σ_{x=1,…,|S|} n_x · [ 2·log(N/n_x) + 1 ]
  = O(|S| log |S|) + N · [ 2·H0(X) + 1 ]

  ➜ La[mtf] ≤ 2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings ➜ just numbers and one bit
Properties:

  Exploit spatial locality, and it is a dynamic code (there is a memory)
  X = 1ⁿ 2ⁿ 3ⁿ … nⁿ ➜ Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive):

  f(i) = Σ_{j=1,…,i-1} p(j)

e.g.  p(a) = .2, p(b) = .5, p(c) = .3
      f(a) = .0, f(b) = .2, f(c) = .7

  [Figure: the unit interval split into a = [0,.2), b = [.2,.7), c = [.7,1)]

The interval for a particular symbol will be called
the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac

  start:           [0, 1)
  after b (.5):    [.2, .7)
  after a (.2):    [.2, .3)
  after c (.3):    [.27, .3)

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c1…cn with probabilities
p[c], use the following:

  l0 = 0,  s0 = 1
  li = l(i-1) + s(i-1) · f[ci]
  si = s(i-1) · p[ci]

f[c] is the cumulative prob. up to symbol c (not included).
The final interval size is

  sn = Π_{i=1,…,n} p[ci]

The interval for a message sequence will be called the
sequence interval
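A tiny Python sketch of these recurrences, computing the sequence interval of "bac" with the probabilities used above:

def sequence_interval(msg, p, f):
    # l_i = l_{i-1} + s_{i-1} * f[c_i];   s_i = s_{i-1} * p[c_i]
    l, s = 0.0, 1.0
    for c in msg:
        l, s = l + s * f[c], s * p[c]
    return l, l + s

p = {"a": .2, "b": .5, "c": .3}
f = {"a": .0, "b": .2, "c": .7}
print(sequence_interval("bac", p, f))   # ≈ (0.27, 0.30), i.e. the interval [.27, .3)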

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:

  .49 ∈ [.2, .7)    ➜ b;  subdivide [.2,.7)
  .49 ∈ [.3, .55)   ➜ b;  subdivide [.3,.55)
  .49 ∈ [.475, .55) ➜ c

The message is bbc.

Representing a real number
Binary fractional representation:

  .75   = .11
  1/3   = .0101…
  11/16 = .1011

Algorithm:
  1. x = 2·x
  2. if x < 1 output 0
  3. else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval? e.g.

  [0,.33) ➜ .01    [.33,.66) ➜ .1    [.66,1) ➜ .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.

  number   min      max      interval
  .11      .110…    .111…    [.75, 1.0)
  .101     .1010…   .1011…   [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

  [Figure: the code interval [.625,.75) of .101 lies inside the sequence interval [.61,.79)]

Can use L + s/2 truncated to 1 + ⌈log (1/s)⌉ bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most

  1 + ⌈log (1/s)⌉ = 1 + ⌈log ∏_{i=1,…,n} (1/p_i)⌉
                  ≤ 2 + ∑_{j=1,…,n} log (1/p_j)
                  = 2 + ∑_{k=1,…,|S|} n·p_k · log (1/p_k)
                  = 2 + n·H0   bits

nH0 + 0.02·n bits in practice, because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
  Output 1 followed by m 0s
  m = 0
  Message interval is expanded by 2

If u < R/2 then (bottom half)
  Output 0 followed by m 1s
  m = 0
  Message interval is expanded by 2

If l ≥ R/4 and u < 3R/4 then (middle half)
  Increment m
  Message interval is expanded by 2

All other cases: just continue...

You find this at

Arithmetic ToolBox
As a state machine: the ATB maps the current interval (L,s) and the
next symbol c, with distribution (p1,…,pS), to the new interval (L',s'):

  (L,s) --c--> (L',s')   with  L' = L + s·f(c)  and  s' = s·p(c)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox

  The ATB is driven by p[ s | context ], where s = c or esc:
  (L,s) --s--> (L',s') as before, but the distribution now depends on the current context.

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts    (String = ACCBACCACBA B,  k = 2)

  Context: Empty        Context (order 1)            Context (order 2)
    A = 4                 A:  C = 3, $ = 1             AC:  B = 1, C = 2, $ = 2
    B = 2                 B:  A = 2, $ = 1             BA:  C = 1, $ = 1
    C = 5                 C:  A = 1, B = 2,            CA:  C = 1, $ = 1
    $ = 3                     C = 2, $ = 3             CB:  A = 2, $ = 1
                                                       CC:  A = 1, B = 1, $ = 2
You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
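A decoder sketch in Python for triples (d, len, c), mirroring the byte-by-byte copy rule above (it also handles the overlapping case len > d):

def lz77_decode(triples):
    out = []
    for d, length, c in triples:
        start = len(out) - d
        for i in range(length):          # char-by-char copy: works when len > d
            out.append(out[start + i])
        out.append(c)
    return "".join(out)

# The triples of the windowed example above:
print(lz77_decode([(0, 0, "a"), (1, 1, "c"), (3, 4, "b"), (3, 3, "a"), (1, 2, "c")]))
# aacaacabcabaaac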

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
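A bare-bones LZW encoder in Python (the dictionary here starts from the distinct characters of the input rather than the 256 ASCII codes, so the emitted ids differ from the slide's; the SSc decoding subtlety is not shown):

def lzw_encode(text):
    # Emit the id of the longest dictionary match S, then add S + next-char.
    dictionary = {ch: i for i, ch in enumerate(sorted(set(text)))}
    S, out = "", []
    for ch in text:
        if S + ch in dictionary:
            S += ch
        else:
            out.append(dictionary[S])
            dictionary[S + ch] = len(dictionary)
            S = ch
    out.append(dictionary[S])
    return out

print(lzw_encode("aabaacababacb"))
# [0, 0, 1, 3, 2, 4, 8, 2, 1] with this toy dictionary (a=0, b=1, c=2; new entries from 3)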

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
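A small construct/invert sketch in Python (naive rotation sort, fine for toy strings; the inversion walks the text forward via the inverse of the LF-array, which is equivalent to the backward InvertBWT above):

def bwt(T):
    # Sort all rotations of T (T must end with a unique smallest char such as '#').
    n = len(T)
    rotations = sorted(T[i:] + T[:i] for i in range(n))
    return "".join(row[-1] for row in rotations)

def inverse_bwt(L):
    # Row j of the sorted matrix ends with L[j] and starts with sorted(L)[j];
    # nxt[j] is the row whose rotation starts one position later in the text.
    n = len(L)
    nxt = sorted(range(n), key=lambda i: (L[i], i))
    F = sorted(L)
    row, out = 0, []               # row 0 is the rotation beginning with '#'
    for _ in range(n):
        row = nxt[row]
        out.append(F[row])
    return "".join(out)

s = "mississippi#"
print(bwt(s))                  # ipssm#pissii
print(inverse_bwt(bwt(s)))     # mississippi#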

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node one can go to any other node via an
undirected path.

Strongly connected components (SCC)

Set of nodes such that from any node one can go to any other node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution

[Plots: Altavista crawl (1999) and WebBase crawl (2001) — the indegree follows a power law distribution]

  Pr[ in-degree(u) = k ] ∝ 1/k^α,   with α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/x^α, α ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:
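A tiny sketch of the gap transform itself (Python; the node id and successors are invented; the folding of negative values to non-negative ones, hinted at above and in the residuals example later, is not shown):

def gaps(x, successors):
    # WebGraph-style: S(x) = {s1-x, s2-s1-1, ..., sk-s(k-1)-1}
    out = [successors[0] - x]
    out += [successors[i] - successors[i - 1] - 1 for i in range(1, len(successors))]
    return out

print(gaps(15, [17, 18, 19, 25, 90]))   # [2, 0, 0, 5, 64]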

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77-scheme provides an efficient, optimal solution

fknown is the "previously encoded text": compress the concatenation fknown·fnew, starting from fnew

zdelta is one of the best implementations
            Emacs size   Emacs time
  uncompr   27Mb         ---
  gzip      8Mb          35 secs
  zdelta    1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: weighted graph over the files plus a dummy node; the min branching picks, for each file, its best reference]

            space   time
  uncompr   30Mb    ---
  tgz       20%     linear
  THIS      8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
            space   time
  uncompr   260Mb   ---
  tgz       12%     2 mins
  THIS      8%      16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

            gcc size   emacs size
  total     27288      27326
  gzip      7563       8577
  zdelta    227        1431
  rsync     964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync just compresses it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level ≤ k hashes fail to find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree

T# = mississippi#
     (positions 1 … 12)

[Figure: the suffix tree of T#; each edge is labeled with a substring (e.g. #, i, si, ssi, ppi#, mississippi#) and each leaf stores the starting position of its suffix]
The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
5

Θ(N²) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Θ(N log₂ N) bits
• Text T: N chars
➜ In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]
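A naive suffix-array build plus binary search in Python, just to make the O(p log₂ N) search concrete (uses bisect's key parameter, Python ≥ 3.10):

import bisect

def suffix_array(T):
    # Naive construction: sort the suffix starting positions by the suffixes themselves.
    return sorted(range(len(T)), key=lambda i: T[i:])

def occurrences(T, SA, P):
    # Binary search for the contiguous SA range of suffixes having P as a prefix.
    key = lambda i: T[i:i + len(P)]
    lo = bisect.bisect_left(SA, P, key=key)
    hi = bisect.bisect_right(SA, P, key=key)
    return sorted(SA[lo:hi])          # starting positions of the occurrences

T = "mississippi#"
SA = suffix_array(T)
print(occurrences(T, SA, "si"))       # [3, 6]  (0-based; the slides' positions 4 and 7)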

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does it exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does it exist a substring of length ≥ L occurring ≥ C times ?
• Search for Lcp[i,i+C-2] whose entries are ≥ L


Slide 9

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting?
Using a well-tuned B-tree library (Berkeley DB):
 n insertions → data get distributed arbitrarily among the B-tree leaves (“tuple pointers”) !!!
What about listing the tuples in order?
 Possibly 10^9 random I/Os = 10^9 * 5ms ≈ 2 months

Binary Merge-Sort

Merge-Sort(A,i,j)
  if (i < j) then
    m = (i+j)/2;          // Divide
    Merge-Sort(A,i,m);    // Conquer
    Merge-Sort(A,m+1,j);
    Merge(A,i,m,j)        // Combine

Cost of Mergesort on large data

 Take Wikipedia in Italian and compute word frequencies:
   n = 10^9 tuples → a few GBs
   Typical disk (Seagate Cheetah 150GB): seek time ≈ 5ms

 Analysis of mergesort on disk:
   It is an indirect sort: Θ(n log2 n) random I/Os
   [5ms] * n log2 n ≈ 1.5 years

In practice it is faster, because of caching...

2 passes (R/W) over the data per level of merging

Merge-Sort Recursion Tree  (log2 N levels)

[Figure: the recursion tree of mergesort over sorted runs of numbers; the bottom levels, covering runs of total size M, are built in internal memory]

If the run-size is larger than B (i.e. after the first step!!), fetching all of it in memory for merging does not help.
How do we deploy the disk/memory features?
 N/M runs, each sorted in internal memory (no I/Os)
 I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort

 The key is to balance run-size and #runs to merge.
 Sort N items with main memory M and disk pages of B items:
   Pass 1: produce N/M sorted runs.
   Pass i: merge X = M/B runs at a time → log_{M/B}(N/M) merge passes

[Figure: X = M/B input buffers and one output buffer, each of B items, in main memory; runs stream from disk through the buffers and back to disk]

Multiway Merging

[Figure: one buffer Bf_i with cursor p_i per run (Run 1 … Run X = M/B), plus an output buffer Bf_o with cursor p_o.
At each step output min(Bf1[p1], Bf2[p2], …, BfX[pX]); fetch the next page of run i when p_i = B; flush Bf_o to the merged output run when it is full, until EOF.]
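As an in-memory illustration of the merging step (my own sketch, with whole runs instead of B-item pages), a min-heap over the current front items of the X runs yields the merged output:

import heapq

def multiway_merge(runs):
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]  # (value, run id, cursor)
    heapq.heapify(heap)
    out = []
    while heap:
        val, i, p = heapq.heappop(heap)
        out.append(val)                      # "flush" to the merged run
        if p + 1 < len(runs[i]):             # "fetch" the next item of run i
            heapq.heappush(heap, (runs[i][p + 1], i, p + 1))
    return out

print(multiway_merge([[1, 5, 13, 19], [7, 9], [3, 4, 8, 15], [6, 11, 12, 17]]))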

Cost of Multi-way Merge-Sort

 Number of passes = log_{M/B} #runs = log_{M/B} (N/M)
 Optimal cost = Θ((N/B) log_{M/B} (N/M)) I/Os

In practice
 M/B ≈ 1000 → #passes = log_{M/B} (N/M) ≈ 1
 One multiway merge → 2 passes = a few minutes (tuning depends on disk features)
 A large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Can compression help?

 Goal: enlarge M and reduce N
   #passes = O(log_{M/B} (N/M))
   Cost of a pass = O(N/B)

Part of Vitter’s paper addresses issues related to:

 Disk striping: sorting easily on D disks
 Distribution sort: top-down sorting
 Lower bounds: how far we can go

Toy problem #3: Top-freq elements

 Goal: top queries over a stream of N items (with a large alphabet Σ).
 Math Problem: find the item y whose frequency is > N/2, using the smallest space (i.e. the mode, provided it occurs > N/2 times).

A = b a c c c d c b a a a c c b c c c

Algorithm

 Use a pair of variables <X,C>, initialized to the first item and C = 1.
 For each subsequent item s of the stream:
   if (X == s) then C++
   else { C--; if (C == 0) { X = s; C = 1; } }
 Return X;
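A minimal Python sketch of this majority-vote scan (my own illustration), plus a second pass that checks the candidate really occurs more than N/2 times:

def majority(stream):
    it = iter(stream)
    X = next(it)          # candidate
    C = 1                 # counter
    for s in it:
        if X == s:
            C += 1
        else:
            C -= 1
            if C == 0:
                X, C = s, 1
    return X

A = list("bacccdcbaaaccbccc")      # the slide's stream
cand = majority(A)
print(cand, A.count(cand) > len(A) / 2)   # -> c True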

Proof (sketch)

Suppose #occ(y) > N/2 but the algorithm returns X ≠ y. Then every occurrence of y has a distinct “negative” mate (a decrement caused by a different item), hence the number of such mates is ≥ #occ(y). But then N ≥ 2 * #occ(y) > N, a contradiction.
(Problems can arise only if no item occurs more than N/2 times.)

Toy problem #4: Indexing

 Consider the following TREC collection:
   N = 6 * 10^9 characters → size = 6GB
   n = 10^6 documents
   TotT = 10^9 term occurrences (avg term length is 6 chars)
   t = 5 * 10^5 distinct terms

What kind of data structure should we build to support word-based searches?

Solution 1: Term-Doc matrix  (t = 500K terms × n = 1 million docs)

            Antony&Cleopatra  JuliusCaesar  TheTempest  Hamlet  Othello  Macbeth
 Antony           1                1            0          0       0        1
 Brutus           1                1            0          1       0        0
 Caesar           1                1            0          1       1        1
 Calpurnia        0                1            0          0       0        0
 Cleopatra        1                0            0          0       0        0
 mercy            1                0            1          1       1        1
 worser           1                0            1          1       1        0

(1 if the play contains the word, 0 otherwise)
Space is 500GB !

Solution 2: Inverted index

[Figure: posting lists of docIDs for the terms Brutus, Calpurnia and Caesar, e.g. 2 4 8 16 32 64 128 / 1 2 3 5 8 13 21 34 / 13 16]

1. Typically each posting takes about 12 bytes
2. We have 10^9 total term occurrences → at least 12GB of space
3. Compressing the 6GB of documents gets ≈ 1.5GB of data
A better index, but it is still >10 times the (compressed) text !!!!
We can still do better, i.e. 30–50% of the original text.

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string is (losslessly) compressed, some other must expand.
Take all messages of length n. Is it possible to compress ALL OF THEM into fewer bits?
NO: there are 2^n of them, but the number of shorter compressed messages is only

  Σ_{i=1}^{n-1} 2^i = 2^n − 2

We need to talk about stochastic sources.

Entropy (Shannon, 1948)
For a set of symbols S, where symbol s has probability p(s), the self-information of s is:

  i(s) = log2 (1/p(s)) = − log2 p(s)

Lower probability → higher information.

Entropy is the weighted average of i(s):

  H(S) = Σ_{s ∈ S} p(s) · log2 (1/p(s))   bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable-length code assigns a bit string (codeword) of variable length to every symbol.
 e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011? (It can be parsed both as “a·d” and as “c·a”.)

A code is uniquely decodable if every encoded sequence can be decomposed into codewords in only one way.

Prefix Codes
A prefix code is a variable-length code in which no codeword is a prefix of another one.
 e.g. a = 0, b = 100, c = 101, d = 11
It can be viewed as a binary trie whose leaves are the symbols (a at 0, b at 100, c at 101, d at 11).

Average Length
For a code C with codeword lengths L[s], the average length is defined as

  La(C) = Σ_{s ∈ S} p(s) · L[s]

We say that a prefix code C is optimal if for all prefix codes C’, La(C) ≤ La(C’).

A property of optimal codes
Theorem (Kraft–McMillan). For any optimal uniquely decodable code, there exists a prefix code with the same codeword lengths, and thus the same (optimal) average length. And vice versa…

Theorem (golden rule). If C is an optimal prefix code for a source with probabilities {p1, …, pn}, then pi < pj ⇒ L[si] ≥ L[sj].

Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any uniquely decodable code C, we have H(S) ≤ La(C).
Theorem (upper bound, Shannon). For any probability distribution, there exists a prefix code C such that La(C) ≤ H(S) + 1.
(The Shannon code assigns ≈ log2 1/p(s) bits to symbol s.)

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

[Figure: the Huffman tree — merge a(.1)+b(.2) into (.3), then (.3)+c(.2) into (.5), then (.5)+d(.5) into (1)]

a = 000, b = 001, c = 01, d = 1
There are 2^(n-1) “equivalent” Huffman trees.

What about ties (and thus, tree depth)?

Encoding and Decoding
Encoding: emit the root-to-leaf path leading to the symbol to be encoded.
Decoding: start at the root and take the branch for each bit received; when at a leaf, output its symbol and return to the root.

 abc... → 00000101
 101001... → dcb

[Figure: the Huffman tree of the running example, with leaves a(.1), b(.2), c(.2), d(.5)]
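For concreteness, here is a minimal Python sketch (my own, not from the slides) that builds a Huffman tree with a binary heap and reads off the codewords; the exact bits depend on tie-breaking, but the codeword lengths match the running example (d:1, c:2, a:3, b:3).

import heapq
from itertools import count

def huffman_codes(probs):
    tick = count()                          # tie-breaker so heap tuples compare safely
    heap = [(p, next(tick), sym) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)   # merge the two least-probable subtrees
        p2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(tick), (left, right)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes

print(huffman_codes({"a": .1, "b": .2, "c": .2, "d": .5}))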

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store, for every level L:
 firstcode[L]  (the numerically smallest codeword on level L, of the form 00.....0 on the deepest level)
 Symbol[L,i], for each symbol i on level L

This takes ≤ h² + |S| log |S| bits.

Canonical Huffman — Encoding

[Figure: a canonical Huffman tree with levels 1..5]

Canonical Huffman — Decoding
 firstcode[1] = 2
 firstcode[2] = 1
 firstcode[3] = 1
 firstcode[4] = 2
 firstcode[5] = 0

T = ...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. Its self-information is
  − log2(.999) ≈ .00144 bits
If we were to send 1000 such symbols we might hope to use 1000 * .00144 ≈ 1.44 bits.
Using Huffman, we take at least 1 bit per symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 ≤ 1 extra bit per macro-symbol = 1/k extra bits per symbol
 but a larger model has to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:
 the model takes |S|^k * (k * log |S|) + h² bits (where h might be |S|)
 it holds H0(S_L) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

Compress + Search ?    [Moura et al, 98]

Compressed text derived from a word-based Huffman code:
 Symbols of the Huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged (1 tagging bit per byte, 7 bits for the Huffman code)

[Figure: the word-based Huffman tree for T = “bzip or not bzip” (symbols: bzip, or, not, space) and the byte-aligned, tagged codewords forming C(T)]

CGrep and other ideas...

P = bzip = 1a 0b

[Figure: compressed pattern matching — the codeword of P is searched directly in C(T), T = “bzip or not bzip”, comparing byte-aligned codewords (yes/no at each position)]

Speed ≈ Compression ratio

You find this under my Software projects.

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary = { bzip, not, or, space }

P = bzip = 1a 0b

[Figure: scan C(S), S = “bzip or not bzip”, comparing the codeword of P against each byte-aligned codeword (yes / no)]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the pattern P[1,m] in the text T[1,n].

[Figure: a pattern P slid over a text T]

 Naïve solution
   For any position i of T, check whether T[i,i+m-1] = P[1,m]
   Complexity: O(nm) time

 (Classical) optimal solutions based on comparisons
   Knuth-Morris-Pratt
   Boyer-Moore
   Complexity: O(n + m) time

Semi-numerical pattern matching

 We show methods in which arithmetic and bit operations replace comparisons.
 We will survey two examples of such methods:
   the random fingerprint method due to Karp and Rabin
   the Shift-And method due to Baeza-Yates and Gonnet

Rabin-Karp Fingerprint

We will use a class of functions from strings to integers in order to obtain:
 an efficient randomized algorithm that makes an error with small probability;
 a randomized algorithm that never errs, whose running time is efficient with high probability.

We will consider a binary alphabet (i.e., T ∈ {0,1}^n).

Arithmetic replaces Comparisons

Strings are also numbers: H: strings → numbers.
Let s be a string of length m:

  H(s) = Σ_{i=1}^{m} 2^{m-i} · s[i]

Example: P = 0101
  H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5
s = s’ if and only if H(s) = H(s’).

Definition: let Tr denote the m-length substring of T starting at position r (i.e., Tr = T[r, r+m-1]).

Arithmetic replaces Comparisons

Exact match = scan T and compare H(Tr) and H(P):
there is an occurrence of P starting at position r of T if and only if H(P) = H(Tr).

T = 10110101, P = 0101, H(P) = 5
 H(T2) = H(0110) = 6 ≠ H(P)
 H(T5) = H(0101) = 5 = H(P)  → Match!

Arithmetic replaces Comparisons

We can compute H(Tr) from H(Tr-1):

  H(Tr) = 2·H(Tr-1) − 2^m·T[r−1] + T[r+m−1]

T = 10110101
 T1 = 1011, T2 = 0110
 H(T1) = H(1011) = 11
 H(T2) = H(0110) = 2·11 − 2⁴·1 + 0 = 22 − 16 = 6

Arithmetic replaces Comparisons

A simple efficient algorithm:
 compute H(P) and H(T1);
 run over T: compute H(Tr) from H(Tr-1) in constant time, and compare (i.e., H(P) = H(Tr)?).

Total running time O(n+m)?
 NO! Why? When m is large it is unreasonable to assume that each arithmetic operation takes O(1) time: the values of H() are m-bit numbers, in general too BIG to fit in a machine word.

IDEA! Let’s use modular arithmetic: for some prime q, the Karp-Rabin fingerprint of a string s is defined by Hq(s) = H(s) (mod q).

An example
P = 101111, q = 7
 H(P) = 47,  Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally (process P’s bits left to right, h ← 2h + bit (mod 7)):
 1·2 + 0 (mod 7) = 2
 2·2 + 1 (mod 7) = 5
 5·2 + 1 (mod 7) = 4
 4·2 + 1 (mod 7) = 2
 2·2 + 1 (mod 7) = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1), since
 2^m (mod q) = 2·(2^{m-1} (mod q)) (mod q)
Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint

How about the comparisons?
 Arithmetic: there is an occurrence of P starting at position r of T if and only if H(P) = H(Tr).
 Modular arithmetic: if there is an occurrence of P starting at position r of T, then Hq(P) = Hq(Tr).
 False match! There are values of q for which the converse is not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!

Our goal will be to choose a modulus q such that
 q is small enough to keep computations efficient (i.e., the Hq() values fit in a machine word);
 q is large enough so that the probability of a false match is kept small.

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
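A minimal Python sketch of the scheme over a binary text (my own illustration: the prime q is fixed here instead of being drawn at random, and every fingerprint hit is verified, so no false matches are reported):

def karp_rabin(T, P, q=2**31 - 1):
    n, m = len(T), len(P)
    if m > n:
        return []
    pow_m = pow(2, m, q)                       # 2^m (mod q), used to drop the leftmost bit
    hp = ht = 0
    for i in range(m):                         # fingerprints of P and of T_1
        hp = (2 * hp + P[i]) % q
        ht = (2 * ht + T[i]) % q
    occ = []
    for r in range(n - m + 1):
        if ht == hp and T[r:r+m] == P:         # verification step
            occ.append(r + 1)                  # 1-based positions, as in the slides
        if r + m < n:                          # roll: H(T_{r+1}) from H(T_r)
            ht = (2 * ht - pow_m * T[r] + T[r + m]) % q
    return occ

T = [1,0,1,1,0,1,0,1]
P = [0,1,0,1]
print(karp_rabin(T, P))   # -> [5], as in the slides' example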

Problem 1: Solution
Dictionary = { bzip, not, or, space }

P = bzip = 1a 0b

[Figure: scan C(S), S = “bzip or not bzip”, comparing the codeword of P against each byte-aligned codeword of C(S) (yes / no at each position)]

Speed ≈ Compression ratio

The Shift-And method

 Define M to be a binary m × n matrix such that:
   M(i,j) = 1 iff the first i characters of P exactly match the i characters of T ending at character j,
   i.e., M(i,j) = 1 iff P[1…i] = T[j-i+1…j].

 Example: T = california and P = for

 [Figure: the 3×10 matrix M of this example; its only 1-entries are M(1,5), M(2,6), M(3,7), corresponding to “f”, “fo”, “for” ending at positions 5, 6, 7 of T]

How does M solve the exact match problem?

How to construct M

 We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one.
 Machines can perform bit and arithmetic operations between two words in constant time.
 Examples:
   And(A,B) is the bit-wise AND between A and B.
   BitShift(A) is the value obtained by shifting A’s bits down by one and setting the first bit to 1
   (e.g. BitShift(0 1 1 0 1) = 1 0 1 1 0, reading the vectors top-down).

Let w be the word size (e.g., 32 or 64 bits). We’ll assume m = w. NOTICE: any column of M fits in a memory word.

How to construct M

 We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one.
 We define the m-length binary vector U(x) for each character x of the alphabet: U(x) is 1 in the positions of P where character x appears.

Example: P = abaac
 U(a) = 1 0 1 1 0
 U(b) = 0 1 0 0 0
 U(c) = 0 0 0 0 1
How to construct M

 Initialize column 0 of M to all zeros.
 For j > 0, the j-th column is obtained as

   M(j) = BitShift(M(j-1)) & U(T[j])

 For i > 1, entry M(i,j) = 1 iff
   (1) the first i-1 characters of P match the i-1 characters of T ending at character j-1  ⇔ M(i-1,j-1) = 1
   (2) P[i] = T[j]  ⇔ the i-th bit of U(T[j]) = 1
 BitShift moves bit M(i-1,j-1) to the i-th position; ANDing it with the i-th bit of U(T[j]) establishes whether both conditions hold.
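A minimal sketch in Python (my own, not from the slides), using integers as the bit-columns — bit i-1 of the integer plays the role of row i of column M(j):

def shift_and(T, P):
    m = len(P)
    U = {}                                   # U[x]: positions of P holding character x
    for i, ch in enumerate(P):
        U[ch] = U.get(ch, 0) | (1 << i)
    M = 0
    occ = []
    for j, ch in enumerate(T, start=1):
        # BitShift: shift up by one and set the first bit, then AND with U(T[j])
        M = ((M << 1) | 1) & U.get(ch, 0)
        if M & (1 << (m - 1)):               # row m is set -> P ends at position j
            occ.append(j - m + 1)
    return occ

print(shift_and("california", "for"))    # -> [5]
print(shift_and("xabxabaaca", "abaac"))  # -> [5]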

An example  (P = abaac, T = xabxabaaca)

[Figure: the columns M(1), M(2), M(3), …, M(9) computed via M(j) = BitShift(M(j-1)) & U(T[j]);
 e.g. M(1) = BitShift(M(0)) & U(x) = 0 0 0 0 0,
      M(2) = BitShift(M(1)) & U(a) = 1 0 0 0 0,
      M(3) = BitShift(M(2)) & U(b) = 0 1 0 0 0,
      …,
      M(9) = BitShift(M(8)) & U(c) = 0 0 0 0 1 — the 5th bit is set, so P occurs ending at position 9.]

Shift-And method: Complexity

 If m ≤ w, any column and any vector U() fit in a memory word → each step requires O(1) time.
 If m > w, any column and any vector U() can be divided into ⌈m/w⌉ memory words → each step requires O(m/w) time.
 Overall O(n(1+m/w)+m) time.
 Thus it is very fast when the pattern length is close to the word size — very often the case in practice; recall that w = 64 bits in modern architectures.

Some simple extensions

We want to allow the pattern to contain special symbols, like the class of characters [a-f].

Example: P = [a-b]baac
 U(a) = 1 0 1 1 0
 U(b) = 1 1 0 0 0
 U(c) = 0 0 0 0 1

What about ‘?’, ‘[^…]’ (negation)?

Problem 1: Another solution
Dictionary = { bzip, not, or, space }

P = bzip = 1a 0b

[Figure: as before, scan C(S), S = “bzip or not bzip”, matching the codeword of P against the byte-aligned codewords (yes / no)]

Speed ≈ Compression ratio

Problem 2
Dictionary = { bzip, not, or, space }

Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring.

P = o  →  the matching terms are:  not = 1g 0g 0a,  or = 1g 0a 0b

[Figure: scan C(S), S = “bzip or not bzip”, for the codewords of the matching terms (yes at the positions of “or” and “not”)]

Speed ≈ Compression ratio? No! Why? One scan of C(S) is needed for each term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1, P2, …, Pl} of total length m, we want to find all the occurrences of those patterns in the text T[1,n].

[Figure: two patterns P1 and P2 matched inside a text T]

 Naïve solution
   Use an (optimal) exact-matching algorithm, searching for each pattern in P
   Complexity: O(nl + m) time — not good with many patterns

 Optimal solution due to Aho and Corasick
   Complexity: O(n + l + m) time

A simple extension of Shift-And

 S is the concatenation of the patterns in P.
 R is a bitmap of length m: R[i] = 1 iff S[i] is the first symbol of a pattern.
 Use a variant of the Shift-And method searching for S:
   For any symbol c, U’(c) = U(c) AND R → U’(c)[i] = 1 iff S[i] = c and it is the first symbol of a pattern.
   For any step j:
     compute M(j), then M(j) OR U’(T[j]). Why? It sets to 1 the first bit of each pattern that starts with T[j].
     Check if there are occurrences ending in j. How? Test the bits of M(j) corresponding to the last symbol of each pattern.

Problem 3
Dictionary = { bzip, not, or, space }

Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring, allowing at most k mismatches.

P = bot, k = 2

[Figure: scan C(S), S = “bzip or not bzip”, for the codewords of the terms matching P with ≤ 2 mismatches]

Agrep: Shift-And method with errors

We extend the Shift-And method to find inexact occurrences of a pattern in a text.
Example: T = aatatccacaa, P = atcgaa
 P appears in T with 2 mismatches starting at position 4; it also occurs with 4 mismatches starting at position 2.

Agrep

 Our current goal: given k, find all the occurrences of P in T with up to k mismatches.
 We define the matrix M^l to be an m × n binary matrix such that:
   M^l(i,j) = 1 iff there are no more than l mismatches between the first i characters of P and the i characters of T ending at character j.
 What is M^0?
 How does M^k solve the k-mismatch problem?

Computing M^k

 We compute M^l for all l = 0, …, k.
 For each j we compute M^0(j), M^1(j), …, M^k(j).
 For all l, initialize M^l(0) to the zero vector.
 In order to compute M^l(j), we observe that there is a match iff one of the two cases below holds.

Computing M^l: case 1

 The first i-1 characters of P match a substring of T ending at j-1 with at most l mismatches, and the next pair of characters in P and T are equal:

   BitShift(M^l(j-1)) & U(T[j])

Computing M^l: case 2

 The first i-1 characters of P match a substring of T ending at j-1 with at most l-1 mismatches (position j may then be a mismatch):

   BitShift(M^{l-1}(j-1))

Computing M^l

 We compute M^l for all l = 0, …, k; for each j we compute M^0(j), M^1(j), …, M^k(j); for all l we initialize M^l(0) to the zero vector.
 Combining the two cases, there is a match iff

   M^l(j) = [ BitShift(M^l(j-1)) & U(T[j]) ]  OR  BitShift(M^{l-1}(j-1))
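A minimal sketch of this k-mismatch recurrence (my own illustration, again with Python integers as bit-columns):

def agrep_mismatches(T, P, k):
    m = len(P)
    U = {}
    for i, ch in enumerate(P):
        U[ch] = U.get(ch, 0) | (1 << i)
    M = [0] * (k + 1)                        # M[l] holds column M^l(j)
    occ = []
    for j, ch in enumerate(T, start=1):
        prev = M[:]                          # the columns at j-1
        M[0] = ((prev[0] << 1) | 1) & U.get(ch, 0)
        for l in range(1, k + 1):
            M[l] = (((prev[l] << 1) | 1) & U.get(ch, 0)) | ((prev[l - 1] << 1) | 1)
        if M[k] & (1 << (m - 1)):            # P ends at j with <= k mismatches
            occ.append(j - m + 1)
    return occ

print(agrep_mismatches("aatatccacaa", "atcgaa", 2))   # -> [4], as in the example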

Example M^1  (T = xabxabaaca, P = abaad)

[Figure: the matrices M^0 and M^1 for this example. The first row of M^1 is all 1s, and M^1(5,9) = 1, i.e. “abaad” matches T[5,9] = “abaac” with one mismatch.]

How much do we pay?

 The running time is O(kn(1+m/w)).
 Again, the method is efficient in practice for small m.
 Only O(k) columns of M are needed at any given time, hence the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary = { bzip, not, or, space }

Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring, allowing k mismatches.

P = bot, k = 2  →  the matching term is  not = 1g 0g 0a

[Figure: scan C(S), S = “bzip or not bzip”, for the codeword of “not” (yes at its position)]

Agrep: more sophisticated operations

The Shift-And method can solve other operations as well.

 The edit distance between two strings p and s is d(p,s) = the minimum number of operations needed to transform p into s via three ops:
   Insertion: insert a symbol in p
   Deletion: delete a symbol from p
   Substitution: change a symbol of p into a different one
 Example: d(ananas, banane) = 3

 Search by regular expressions
   Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding

 γ(x) = 0^(Length−1) followed by x in binary, where x > 0 and Length = ⌊log2 x⌋ + 1
 e.g., 9 is represented as <000, 1001>.
 The γ-code for x takes 2⌊log2 x⌋ + 1 bits (i.e., a factor of 2 from optimal).
 It is optimal for Pr(x) = 1/(2x²) and i.i.d. integers.
 It is a prefix-free encoding…

Exercise: given the following sequence of γ-coded integers, reconstruct the original sequence:
 0001000001100110000011101100111  →  8, 6, 3, 59, 7
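A minimal encode/decode sketch (my own), which reproduces the exercise above:

def gamma_encode(x):
    assert x > 0
    b = bin(x)[2:]                  # x in binary, leading bit is 1
    return "0" * (len(b) - 1) + b   # Length-1 zeros, then the binary digits

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":       # count the unary prefix
            zeros += 1
            i += 1
        out.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return out

code = "".join(gamma_encode(x) for x in (8, 6, 3, 59, 7))
print(code)                  # -> 0001000001100110000011101100111
print(gamma_decode(code))    # -> [8, 6, 3, 59, 7]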

Analysis
Sort the p_i in decreasing order, and encode symbol s_i via the variable-length code γ(i).
Recall that |γ(i)| ≤ 2 log i + 1.
How good is this approach w.r.t. Huffman?
 Compression ratio ≤ 2 * H0(S) + 1
Key fact:
 1 ≥ Σ_{j=1,...,i} p_j ≥ i * p_i  ⇒  i ≤ 1/p_i

How good is it?
The cost of the encoding is (recall i ≤ 1/p_i):

  Σ_{i=1,…,|S|} p_i * |γ(i)| ≤ Σ_{i=1,…,|S|} p_i * [2 * log(1/p_i) + 1] = 2 * H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding

 Byte-aligned and tagged Huffman
   128-ary Huffman tree
   The first bit of the first byte is tagged
   Configurations on 7 bits: just those of Huffman

 End-tagged dense code
   The rank r is mapped to the r-th binary sequence on 7*k bits
   The first bit of the last byte is tagged

Surprising changes:
 It is a prefix code
 Better compression: it uses all the 7-bit configurations

(s,c)-dense codes
The distribution of words is skewed: 1/i^q, where 1 < q < 2.

A new concept: Continuers vs Stoppers
 Previously we used s = c = 128.
The main idea is:
 s + c = 256 (we are playing with 8 bits)
 Thus s items are encoded with 1 byte,
 s*c with 2 bytes, s*c² with 3 bytes, ...

An example

 5000 distinct words
 ETDC encodes 128 + 128² = 16512 words on up to 2 bytes
 A (230,26)-dense code encodes 230 + 230*26 = 6210 words on up to 2 bytes — hence more words on 1 byte, and thus better if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move-to-Front Coding
Transforms a char sequence into an integer sequence, which can then be var-length coded.

 Start with the list of symbols L = [a,b,c,d,…]
 For each input symbol s:
   1) output the position of s in L
   2) move s to the front of L

There is a memory.
Properties:
 It exploits temporal locality, and it is dynamic.
 X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n² log n), MTF = O(n log n) + n² bits
 Not much worse than Huffman ... but it may be far better.
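A minimal encode/decode sketch of MTF (my own; positions are output 0-based here):

def mtf_encode(s, alphabet):
    L = list(alphabet)
    out = []
    for ch in s:
        i = L.index(ch)         # position of ch in the current list
        out.append(i)
        L.insert(0, L.pop(i))   # move it to the front
    return out

def mtf_decode(codes, alphabet):
    L = list(alphabet)
    out = []
    for i in codes:
        ch = L[i]
        out.append(ch)
        L.insert(0, L.pop(i))
    return "".join(out)

codes = mtf_encode("abcddcbamnopponm", "abcdefghijklmnop")
print(codes)                                   # small integers on repeated/recent symbols
print(mtf_decode(codes, "abcdefghijklmnop"))   # round-trips to the input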

MTF: how good is it?
Encode the integers via γ-coding: |γ(i)| ≤ 2 log i + 1.
Put the |S| symbols at the front of the list, and consider the cost of encoding a string X of length N in which symbol x occurs n_x times (p_i^x is the position of the i-th occurrence of x):

  O(|S| log |S|) + Σ_{x ∈ S} Σ_{i=2,…,n_x} |γ(p_i^x − p_{i−1}^x)|

By Jensen’s inequality:

  ≤ O(|S| log |S|) + Σ_{x ∈ S} n_x [2 * log(N/n_x) + 1]
  = O(|S| log |S|) + N * [2 * H0(X) + 1]

Hence La[mtf] ≤ 2 * H0(X) + O(1).

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
 abbbaacccca ⇒ (a,1),(b,3),(a,2),(c,4),(a,1)
In the case of binary strings → just the run lengths and one bit.
There is a memory.
Properties:
 It exploits spatial locality, and it is a dynamic code.
 X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = Θ(n² log n) > Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol an interval range in [0,1):

  f(i) = Σ_{j=1}^{i-1} p(j)

e.g. with p(a) = .2, p(b) = .5, p(c) = .3:
 f(a) = .0, f(b) = .2, f(c) = .7
 a ↦ [0, .2),  b ↦ [.2, .7),  c ↦ [.7, 1.0)

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7)).

Arithmetic Coding: Encoding Example
Coding the message sequence: bac

 start:   [0, 1)
 after b: [.2, .7)
 after a: [.2, .3)
 after c: [.27, .3)

The final sequence interval is [.27, .3).

Arithmetic Coding
To code a sequence of symbols c_1 … c_n with probabilities p[c], use the following:

  l_0 = 0,  l_i = l_{i-1} + s_{i-1} * f[c_i]
  s_0 = 1,  s_i = s_{i-1} * p[c_i]

f[c] is the cumulative probability up to symbol c (not included).
The final interval size is

  s_n = Π_{i=1}^{n} p[c_i]

The interval for a message sequence will be called the sequence interval.
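A minimal sketch of the sequence-interval computation above (my own; floating point, so only for illustration — real coders use the integer version):

def sequence_interval(msg, p, f):
    l, s = 0.0, 1.0
    for c in msg:
        l = l + s * f[c]    # shift the left end into the symbol interval
        s = s * p[c]        # shrink the size by the symbol probability
    return l, l + s         # the sequence interval [l, l+s)

p = {"a": .2, "b": .5, "c": .3}
f = {"a": .0, "b": .2, "c": .7}
print(sequence_interval("bac", p, f))   # -> approximately (0.27, 0.3), as in the example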

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message has length 3:

 .49 ∈ [.2, .7)    → first symbol is b
 .49 ∈ [.3, .55)   → second symbol is b
 .49 ∈ [.475, .55) → third symbol is c

The message is bbc.

Representing a real number
Binary fractional representation:
 .75   = .11
 1/3   = .010101…
 11/16 = .1011

Algorithm (to emit the binary expansion of x ∈ [0,1)):
 1. x = 2*x
 2. if x < 1, output 0
 3. else x = x - 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
 e.g. [0,.33) → .01    [.33,.66) → .1    [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions:

  code   min     max     interval
  .11    .110    .111    [.75, 1.0)
  .101   .1010   .1011   [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval is contained in the sequence interval (a dyadic number).

 e.g. sequence interval [.61, .79), code interval (.101) = [.625, .75)

Can use l + s/2 truncated to 1 + ⌈log(1/s)⌉ bits.

Bound on Arithmetic length: note that −log s + 1 = log(2/s).

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

  1 + ⌈log(1/s)⌉ = 1 + ⌈log Π_i (1/p_i)⌉
  ≤ 2 + Σ_{j=1,n} log(1/p_j)
  = 2 + Σ_{k=1,|S|} n p_k log(1/p_k)
  = 2 + n H0   bits

In practice ≈ nH0 + 0.02 n bits, because of rounding.

Integer Arithmetic Coding
The problem is that operations on arbitrary-precision real numbers are expensive.
Key ideas of the integer version:
 Keep integers in the range [0..R), where R = 2^k
 Use rounding to generate the integer interval
 Whenever the sequence interval falls into the top, bottom or middle half, expand the interval by a factor of 2

Integer arithmetic coding is an approximation.

Integer Arithmetic (scaling):
 If l ≥ R/2 (top half): output 1 followed by m 0s; m = 0; the interval is expanded by 2
 If u < R/2 (bottom half): output 0 followed by m 1s; m = 0; the interval is expanded by 2
 If l ≥ R/4 and u < 3R/4 (middle half): increment m; the interval is expanded by 2
 In all other cases, just continue...

Arithmetic ToolBox
As a state machine: the coder keeps the current interval (L,s) (left end and size); given the next symbol c and the distribution (p1,…,p|S|), it moves to the sub-interval (L’,s’).

 ATB: (L,s) --[ c, (p1,…,p|S|) ]--> (L’,s’)

Therefore, even the distribution can change over time.

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox

 ATB: (L,s) --[ s = c or esc, p[s|context] ]--> (L’,s’)

Encoder and Decoder must know the protocol for selecting the same conditional probability distribution (PPM-variant).

PPM: Example Contexts   (String = ACCBACCACBA, next char = B, k = 2)

 Context ∅ :  A=4  B=2  C=5  $=3
 Context A :  C=3  $=1
 Context B :  A=2  $=1
 Context C :  A=1  B=2  C=2  $=3
 Context AC:  B=1  C=2  $=2
 Context BA:  C=1  $=1
 Context CA:  C=1  $=1
 Context CB:  A=2  $=1
 Context CC:  A=1  B=1  $=2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77

a a c a a c a b c a b a b a c
The Dictionary is the set of all substrings starting to the left of the Cursor; the step shown outputs <2,3,c>.

Algorithm’s step:
 Output <d, len, c>, where
   d = distance of the copied string from the current position
   len = length of the longest match
   c = next char in the text beyond the longest match
 Advance by len + 1

A buffer “window” has fixed length and moves with the cursor.
Example: LZ77 with window (size 6)

 a a c a a c a b c a b a a a c   → (0,0,a)
 a a c a a c a b c a b a a a c   → (1,1,c)
 a a c a a c a b c a b a a a c   → (3,4,b)
 a a c a a c a b c a b a a a c   → (3,3,a)
 a a c a a c a b c a b a a a c   → (1,2,c)

(At each step the longest match is found within the window W, and the triple also carries the next character.)

LZ77 Decoding
The decoder keeps the same dictionary window as the encoder:
 it finds the substring and inserts a copy of it.
What if len > d? (overlap with the text being decompressed)
 E.g. seen = abcd, next codeword is (2,9,e)
 Simply copy starting at the cursor:
   for (i = 0; i < len; i++) out[cursor+i] = out[cursor-d+i]
 Output is correct: abcdcdcdcdcdce
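A minimal decoder sketch for (d, len, c) triples (my own), including the overlapping-copy case just discussed:

def lz77_decode(triples):
    out = []
    for d, length, c in triples:
        start = len(out) - d
        for i in range(length):            # byte-by-byte copy handles len > d
            out.append(out[start + i])
        out.append(c)
    return "".join(out)

print(lz77_decode([(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')]))
# -> 'aacaacabcabaaac', the windowed example above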

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example  (input: a a b a a c a b c a b c b)

 Output   Dict.
 (0,a)    1 = a
 (1,b)    2 = ab
 (1,a)    3 = aa
 (0,c)    4 = c
 (2,c)    5 = abc
 (5,b)    6 = abcb

LZ78: Decoding Example

 Input   Decoded so far          Dict.
 (0,a)   a                       1 = a
 (1,b)   a ab                    2 = ab
 (1,a)   a ab aa                 3 = aa
 (0,c)   a ab aa c               4 = c
 (2,c)   a ab aa c abc           5 = abc
 (5,b)   a ab aa c abc abcb      6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example  (input: a a b a a c a b a b a c b, with a=112, b=113, c=114)

 Output   Dict.
 112      256 = aa
 112      257 = ab
 113      258 = ba
 256      259 = aac
 114      260 = ca
 257      261 = aba
 261      262 = abac
 114      263 = cb

LZW: Decoding Example

 Input   Decoded so far          Dict. (built one step behind the coder)
 112     a
 112     a a                      256 = aa
 113     a a b                    257 = ab
 256     a a b a a                258 = ba
 114     a a b a a c              259 = aac
 257     a a b a a c a b          260 = ca
 261     ? — 261 is not yet in the dictionary (the SSc special case): it is decoded as ab + a = aba
 114     a a b a a c a b a b a c  261 = aba (defined one step later), then 262 = abac
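A minimal LZW sketch (my own, initializing the dictionary with the 256 byte codes rather than the slides' toy codes); it handles the SSc special case during decoding:

def lzw_encode(s):
    dictionary = {chr(i): i for i in range(256)}
    out, w = [], ""
    for ch in s:
        if w + ch in dictionary:
            w += ch
        else:
            out.append(dictionary[w])
            dictionary[w + ch] = len(dictionary)   # add Sc
            w = ch
    if w:
        out.append(dictionary[w])
    return out

def lzw_decode(codes):
    dictionary = {i: chr(i) for i in range(256)}
    w = dictionary[codes[0]]
    out = [w]
    for code in codes[1:]:
        entry = dictionary[code] if code in dictionary else w + w[0]  # SSc case
        out.append(entry)
        dictionary[len(dictionary)] = w + entry[0]   # one step behind the coder
        w = entry
    return "".join(out)

codes = lzw_encode("aabaacababacb")
print(codes, lzw_decode(codes) == "aabaacababacb")   # round-trips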

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Consider the text T = mississippi# and form all of its cyclic rotations:
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows lexicographically (BWT, 1994). F is the first column, L the last column:

 F              L
 # mississipp   i
 i #mississip   p
 i ppi#missis   s
 i ssippi#mis   s
 i ssissippi#   m
 m ississippi   #
 p i#mississi   p
 p pi#mississ   i
 s ippi#missi   s
 s issippi#mi   s
 s sippi#miss   i
 s sissippi#m   i

L = BWT(T) = ipssm#pissii

A famous example: on a much longer text the same clustering effect appears (figure omitted).

A useful tool: the L → F mapping

[Figure: the sorted BWT matrix of the unknown text, with first column F and last column L as above]

How do we map L’s characters onto F’s characters? ... We need to distinguish equal characters in F...
Take two equal characters of L: rotating their rows rightward by one position shows that they keep the same relative order in F !!

The BWT is invertible

[Figure: the sorted matrix of the unknown text, with columns F and L]

Two key properties:
 1. The LF-array maps L’s chars to F’s chars.
 2. L[i] precedes F[i] in T.

Reconstruct T backward:  T = .... i p p i #

InvertBWT(L)
  Compute LF[0,n-1];
  r = 0; i = n;
  while (i > 0) {
    T[i] = L[r];
    r = LF[r]; i--;
  }
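A minimal Python sketch (my own) of BWT construction by sorting rotations, and of its inversion via the LF-mapping:

def bwt(T):
    rotations = sorted(T[i:] + T[:i] for i in range(len(T)))
    return "".join(row[-1] for row in rotations)

def inverse_bwt(L):
    # LF-mapping: equal characters keep their relative order between L and F
    order = sorted(range(len(L)), key=lambda r: (L[r], r))
    LF = [0] * len(L)
    for f_pos, r in enumerate(order):
        LF[r] = f_pos
    out, r = [], 0                      # row 0 of the sorted matrix starts with '#'
    for _ in range(len(L)):
        out.append(L[r])                # L[r] precedes F[r] in T
        r = LF[r]
    text = "".join(reversed(out))       # the rotation of T starting with '#'
    return text[1:] + text[0]           # move the sentinel back to the end

L = bwt("mississippi#")
print(L)                  # -> ipssm#pissii
print(inverse_bwt(L))     # -> mississippi#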

How to compute the BWT?

[Figure: the suffix array SA = 12 11 8 5 2 1 10 9 7 4 6 3 of T = mississippi#, aligned with the BWT matrix rows and with the column L = i p s s m # p i s s i i]

We said that L[i] precedes F[i] in T; for example, L[3] = T[7].
Given SA and T, we have L[i] = T[SA[i] - 1].

How to construct SA from T?    (input: T = mississippi#)

 SA    suffix
 12    #
 11    i#
 8     ippi#
 5     issippi#
 2     ississippi#
 1     mississippi#
 10    pi#
 9     ppi#
 7     sippi#
 4     sissippi#
 6     ssippi#
 3     ssissippi#

Elegant but inefficient. Obvious inefficiencies:
 • Θ(n² log n) time in the worst case
 • Θ(n log n) cache misses or I/O faults
Many better algorithms exist nowadays...

Compressing L seems promising...
Key observation:
 L is locally homogeneous → L is highly compressible

Algorithm Bzip:
 Move-to-Front coding of L
 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33% compression ratio, but it is slower in (de)compression!

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii   (# at position 16)
Mtf-list = [i,m,p,s]
Mtf  = 020030000030030200300300000100000
Mtf' = 030040000040040300400400000200000   (shift by one; Bin(6)=110, Wheeler’s code)
RLE0 = 03141041403141410210   (alphabet of size |S|+1)

Bzip2 output = Arithmetic/Huffman on |S|+1 symbols... plus γ(16), plus the original Mtf-list (i,m,p,s).

You find this in your Linux distribution.

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…

 Physical network graph
   V = routers, E = communication links
 The “cosine” graph (undirected, weighted)
   V = static web pages, E = semantic distance between pages
 Query-Log graph (bipartite, weighted)
   V = queries and URLs, E = (q,u) if u is a result for q and has been clicked by some user who issued q
 Social graph (undirected, unweighted)
   V = users, E = (x,y) if x knows y (facebook, address book, email, ..)

Definition
Directed graph G = (V,E):
 V = URLs, E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN & no OUT)

Key property #1:
 Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution
 Altavista crawl (1999) and WebBase crawl (2001): the indegree follows a power-law distribution

  Pr[in-degree(u) = k] ≈ 1/k^a,  a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E):
 V = URLs, E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN, no OUT)

Three key properties:
 Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1
 Locality: usually most of the hyperlinks point to other URLs on the same host (about 80%)
 Similarity: pages close in lexicographic order tend to share many outgoing lists

[Figure: adjacency matrix of a crawl with 21 million pages and 150 million links; locality shows up as a dense diagonal]

URL-sorting (e.g. Berkeley and Stanford pages grouped together) → URL compression + delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression  (one-to-one)

Problem: we have two files f_known and f_new, and the goal is to compute a file f_d of minimum size such that f_new can be derived from f_known and f_d.
 Assume that block moves and copies are allowed.
 Find an optimal covering set of f_new based on f_known.
 An LZ77-scheme provides an efficient, optimal solution: f_known is the “previously encoded text”, so compress f_known·f_new starting from f_new.
 zdelta is one of the best implementations.

 Emacs experiment:   size     time
  uncompressed       27MB     ---
  gzip               8MB      35 secs
  zdelta             1.5MB    42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: we wish to compress a group of files F.
 Useful on a dynamic collection of web pages, back-ups, …
 Apply pairwise zdelta: find for each f ∈ F a good reference.
 Reduction to the Min Branching problem on DAGs:
   build a weighted graph G_F: nodes = files, weights = zdelta-sizes;
   insert a dummy node connected to all files, whose edge weights are the gzip-coded sizes;
   compute the min branching = directed spanning tree of minimum total cost covering G’s nodes.

[Figure: a small example graph with a dummy node and files 1, 2, 3, 5, with edge weights such as 620, 2000, 220, 123, 20]

 Experiment:     space   time
  uncompressed   30MB    ---
  tgz            20%     linear
  THIS           8%      quadratic

Improvement — what about many-to-one compression? (a group of files)

Problem: constructing G is very costly — n² edge calculations (zdelta executions). We wish to exploit some pruning approach:
 Collection analysis: cluster the files that appear similar and are thus good candidates for zdelta compression; build a sparse weighted graph G’_F containing only edges between those pairs of files.
 Assign weights: estimate appropriate edge weights for G’_F, thus saving zdelta executions. Nonetheless, strictly n² time.

 Experiment:     space   time
  uncompressed   260MB   ---
  tgz            12%     2 mins
  THIS           8%      16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

          gcc size   emacs size
 total    27288      27326
 gzip     7563       8577
 zdelta   227        1431
 rsync    964        4452

Compressed sizes in KB (slightly outdated numbers).
Factor 3–5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree

[Figure: the suffix tree of T# = mississippi# — edges labelled with substrings (e.g. “ssi”, “ppi#”, “si”, “i#”, “mississippi#”), leaves labelled with the starting positions 1..12 of the suffixes]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

(Storing SUF(T) explicitly would take Θ(N²) space; the suffix array keeps only the starting positions.)

 SA    SUF(T)                 T = mississippi#
 12    #
 11    i#
 8     ippi#
 5     issippi#
 2     ississippi#
 1     mississippi#
 10    pi#
 9     ppi#
 7     sippi#
 4     sissippi#
 6     ssippi#
 3     ssissippi#

Suffix Array:
 • SA: Θ(N log2 N) bits
 • Text T: N chars
 → in practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step.

 T = mississippi#,  P = si
 [Figure: binary search over SA = 12 11 8 5 2 1 10 9 7 4 6 3; at each step P is compared with the suffix T[SA[mid],N], and the search moves right if P is larger, left if P is smaller]

Suffix Array search:
 • O(log2 N) binary-search steps
 • each step takes O(p) char comparisons
 → overall, O(p log2 N) time
   [improvable to O(p + log2 N), Manber-Myers ’90; alphabet-aware bounds in Cole et al. ’06]
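A minimal Python sketch (my own; it uses the naive O(n² log n) construction rather than the efficient algorithms cited above) of building SA and locating P by binary search:

def suffix_array(T):
    return sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])   # 1-based positions

def occurrences(T, SA, P):
    n = len(SA)
    lo, hi = 0, n
    while lo < hi:                                   # leftmost suffix >= P
        mid = (lo + hi) // 2
        if T[SA[mid] - 1:] < P:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    lo, hi = start, n
    while lo < hi:                                   # end of the suffixes prefixed by P
        mid = (lo + hi) // 2
        if T[SA[mid] - 1:].startswith(P):
            lo = mid + 1
        else:
            hi = mid
    return sorted(SA[start:lo])                      # starting positions of the occurrences

T = "mississippi#"
SA = suffix_array(T)
print(SA)                          # -> [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(occurrences(T, SA, "si"))    # -> [4, 7]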

Locating the occurrences

[Figure: the two binary searches for P = si over SA of T = mississippi# delimit the contiguous range of suffixes prefixed by “si” (SA entries 7 and 4), so occ = 2 and the occurrences are at positions 4 and 7]

Suffix Array search: O(p + log2 N + occ) time.
 Suffix Trays: O(p + log2 |Σ| + occ)   [Cole et al., ’06]
 String B-tree   [Ferragina-Grossi, ’95]
 Self-adjusting Suffix Arrays   [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest common prefix between suffixes adjacent in SA.

 T = mississippi#,  SA = 12 11 8 5 2 1 10 9 7 4 6 3
 Lcp = 0 0 1 4 0 0 1 0 2 1 3   (e.g. the entry 4 is the lcp of “issippi#” and “ississippi#”)

 • How long is the common prefix between T[i,…] and T[j,…]?
   → the minimum of the subarray Lcp[h,k-1] such that SA[h]=i and SA[k]=j.
 • Does there exist a repeated substring of length ≥ L?
   → search for an entry Lcp[i] ≥ L.
 • Does there exist a substring of length ≥ L occurring ≥ C times?
   → search for a window Lcp[i,i+C-2] whose entries are all ≥ L.
Slide 10

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n = 10^9 tuples ⇒ few Gbs

Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:

It is an indirect sort: Θ(n log_2 n) random I/Os

[5ms] * n log_2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
[Figure: binary merge-sort recursion tree over the input keys, log_2 N levels; the lowest band of subtrees fits within memory M — how do we deploy the disk/mem features?]
N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.

Pass i: merge X ≤ M/B runs  ⇒  log_{M/B}(N/M) passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = log_{M/B} #runs ≈ log_{M/B}(N/M)

Optimal cost = Θ((N/B) * log_{M/B}(N/M)) I/Os

In practice

M/B ≈ 1000  ⇒  #passes = log_{M/B}(N/M) ≈ 1
One multiway merge  ⇒  2 passes = few mins
Tuning depends on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!
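A small Python sketch of one multiway merging pass, kept in memory for illustration (a real external-memory version would read and flush disk pages of B items; the heap over the X run fronts is my own choice of data structure):

import heapq

def multiway_merge(runs):
    # runs: already-sorted sequences, each playing the role of one sorted run on disk
    heap = [(run[0], r, 0) for r, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        key, r, i = heapq.heappop(heap)            # current minimum among the run fronts
        out.append(key)
        if i + 1 < len(runs[r]):
            heapq.heappush(heap, (runs[r][i + 1], r, i + 1))
    return out

print(multiway_merge([[1, 5, 13, 19], [7, 9], [4, 15], [3, 8, 12, 17], [6, 11]]))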

May compression help?

Goal: enlarge M and reduce N

#passes = O(log_{M/B}(N/M))
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how far we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables <X,C>
For each item s of the stream,

  if (X==s) then C++
  else { C--; if (C==0) { X=s; C=1; } }

Return X;
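The same one-pass scan as a Python sketch (the classic majority-vote formulation; names X and C as in the slide):

def majority_candidate(stream):
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1          # adopt a new candidate when the counter is exhausted
        elif X == s:
            C += 1
        else:
            C -= 1
    return X                     # the mode, provided it occurs > N/2 times

A = "b a c c c d c b a a a c c b c c c".split()
print(majority_candidate(A))     # 'c' occurs 9 times out of 17, i.e. > N/2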

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 10^9, size = 6Gb
n = 10^6 documents
TotT = 10^9 (avg term length is 6 chars)
t = 5 * 10^5 distinct terms

What kind of data structure do we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million plays (columns), t = 500K terms (rows)

             Antony&Cleopatra  JuliusCaesar  TheTempest  Hamlet  Othello  Macbeth
Antony              1               1            0          0       0        1
Brutus              1               1            0          1       0        0
Caesar              1               1            0          1       1        1
Calpurnia           0               1            0          0       0        0
Cleopatra           1               0            0          0       0        0
mercy               1               0            1          1       1        1
worser              1               0            1          1       1        0

1 if play contains word, 0 otherwise

Space is 500Gb !

Solution 2: Inverted index
Brutus    → 2 4 8 16 32 64 128
Calpurnia → 1 2 3 5 8 13 21 34
Caesar    → 13 16

We can still do better: i.e. 30-50% of the original text

1. Typically use about 12 bytes
2. We have 10^9 total terms ⇒ at least 12Gb space
3. Compressing the 6Gb documents gets 1.5Gb data
Better index, but it is still >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in fewer bits ?
NO, they are 2^n but we have fewer compressed msgs:

Σ_{i=1}^{n-1} 2^i = 2^n − 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i(s) = log_2 (1/p(s)) = −log_2 p(s)

Lower probability ⇒ higher information

Entropy is the weighted average of i(s):

H(S) = Σ_{s∈S} p(s) · log_2 (1/p(s))   bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
L_a(C) = Σ_{s∈S} p(s) · L[s]

We say that a prefix code C is optimal if for
all prefix codes C’, L_a(C) ≤ L_a(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then p_i < p_j  ⇒  L[s_i] ≥ L[s_j]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H(S) ≤ L_a(C)
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

L_a(C) ≤ H(S) + 1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
[Huffman tree: merge a(.1) + b(.2) → (.3), then (.3) + c(.2) → (.5), then (.5) + d(.5) → (1)]
a=000, b=001, c=01, d=1
There are 2^(n−1) “equivalent” Huffman trees

What about ties (and thus, tree depth) ?
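A compact Python sketch that builds a Huffman code with a heap on the slide's probabilities (illustrative only; ties are broken by insertion order, so the actual codewords may differ from the picture while the lengths stay optimal):

import heapq
from itertools import count

def huffman_codes(probs):
    tick = count()                                     # tie-breaker for equal probabilities
    heap = [(p, next(tick), sym) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:                               # repeatedly merge the two lightest trees
        p1, _, left = heapq.heappop(heap)
        p2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(tick), (left, right)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"
        return codes
    return walk(heap[0][2], "")

print(huffman_codes({"a": .1, "b": .2, "c": .2, "d": .5}))
# one optimal assignment: a and b get 3 bits, c gets 2 bits, d gets 1 bit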

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

  firstcode[L]                         (= 00.....0 for the deepest level)
  Symbol[L,i], for each i in level L

This is ≤ h^2 + |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
−log_2(.999) ≈ .00144

If we were to send 1000 such symbols we
might hope to use 1000 * .00144 ≈ 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
[Figure: tagged word-based Huffman coding of T = “bzip or not bzip”: 7 bits of Huffman code per byte plus 1 tagging bit, giving the byte-aligned codewords of C(T)]

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet (i.e., T ∈ {0,1}^n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H(s) = Σ_{i=1}^{m} 2^{m−i} · s[i]

P=0101
H(P) = 2^3·0 + 2^2·1 + 2^1·0 + 2^0·1 = 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1):

H(T_r) = 2·H(T_{r−1}) − 2^m·T(r−1) + T(r+m−1)

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H(T_1) = H(1011) = 11
H(T_2) = H(0110) = 2·11 − 2^4·1 + 0 = 22 − 16 + 0 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1·2 (mod 7) + 0 = 2
2·2 (mod 7) + 1 = 5
5·2 (mod 7) + 1 = 4
4·2 (mod 7) + 1 = 2
2·2 (mod 7) + 1 = 5
5 (mod 7) = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1):
2^m (mod q) = 2·(2^{m−1} (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
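A Python sketch of the fingerprint scan on a binary text (illustrative: it fixes q = 2^31 − 1 instead of picking a random prime, and verifies every hash hit, so it behaves like the deterministic variant):

def karp_rabin(T, P, q=(1 << 31) - 1):
    n, m = len(T), len(P)
    if m > n:
        return []
    pow_m = pow(2, m - 1, q)                     # 2^(m-1) mod q, used to drop the leading bit
    hp = ht = 0
    for i in range(m):                           # fingerprints of P and of T[1,m]
        hp = (2 * hp + int(P[i])) % q
        ht = (2 * ht + int(T[i])) % q
    occ = []
    for r in range(n - m + 1):
        if hp == ht and T[r:r + m] == P:         # verify to rule out false matches
            occ.append(r + 1)                    # 1-based positions, as in the slides
        if r + m < n:                            # slide the window: drop T[r], append T[r+m]
            ht = (2 * (ht - int(T[r]) * pow_m) + int(T[r + m])) % q
    return occ

print(karp_rabin("10110101", "0101"))            # occurrence at position 5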

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M        c  a  l  i  f  o  r  n  i  a
         1  2  3  4  5  6  7  8  9  10
f  1     0  0  0  0  1  0  0  0  0  0
o  2     0  0  0  0  0  1  0  0  0  0
r  3     0  0  0  0  0  0  1  0  0  0

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size (e.g., 32 or 64 bits). We’ll assume
m ≤ w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the (j−1)-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M(j) = BitShift( M(j−1) ) & U(T[j])


For i > 1, entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1 characters of T
ending at character j-1            ⇔  M(i-1,j-1) = 1
(2) P[i] = T[j]                     ⇔  the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) into the i-th position
AND this with the i-th bit of U(T[j]) to establish if both are true
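A Python sketch of the whole Shift-And scan, packing each column of M into one integer (bit i−1 plays M(i,j)); an illustration of the method, not the slides' exact word-level layout:

def shift_and(T, P):
    m = len(P)
    U = {}                                   # U[c]: bit i-1 set iff P[i] == c
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M, occ = 0, []
    for j, c in enumerate(T, start=1):
        # BitShift: shift the previous column down by one and set the first bit to 1
        M = ((M << 1) | 1) & U.get(c, 0)
        if M & (1 << (m - 1)):               # row m set: P ends at position j
            occ.append(j - m + 1)
    return occ

print(shift_and("xabxabaaca", "abaac"))      # occurrence starting at position 5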

An example j=1

T = xabxabaaca,  P = abaac,  U(x) = (0,0,0,0,0)

M(1) = BitShift(M(0)) & U(T[1]) = (1,0,0,0,0) & (0,0,0,0,0) = (0,0,0,0,0)

An example j=2

T = xabxabaaca,  P = abaac,  U(a) = (1,0,1,1,0)

M(2) = BitShift(M(1)) & U(T[2]) = (1,0,0,0,0) & (1,0,1,1,0) = (1,0,0,0,0)

An example j=3

T = xabxabaaca,  P = abaac,  U(b) = (0,1,0,0,0)

M(3) = BitShift(M(2)) & U(T[3]) = (1,1,0,0,0) & (0,1,0,0,0) = (0,1,0,0,0)

An example j=9

T = xabxabaaca,  P = abaac,  U(c) = (0,0,0,0,1)

M(9) = BitShift(M(8)) & U(T[9]) = (1,1,0,0,1) & (0,0,0,0,1) = (0,0,0,0,1)

The whole matrix M up to column 9:

        1  2  3  4  5  6  7  8  9
  1     0  1  0  0  1  0  1  1  0
  2     0  0  1  0  0  1  0  0  0
  3     0  0  0  0  0  0  1  0  0
  4     0  0  0  0  0  0  0  1  0
  5     0  0  0  0  0  0  0  0  1

The 1 in row m=5 at column 9 signals an occurrence of P ending at position 9.

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

S is the concatenation of the patterns in P
R is a bitmap of length m.

R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of the Shift-And method searching for S

For any symbol c, U’(c) = U(c) AND R
 U’(c)[i] = 1 iff S[i]=c and S[i] is the first symbol of a pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that starts with T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal: given k, find all the occurrences of P
in T with up to k mismatches
We define the matrix M^l to be an m by n binary
matrix, such that:

M^l(i,j) = 1 iff
there are no more than l mismatches between the
first i characters of P and the i characters of T
ending at character j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

BitShift( M^l(j−1) ) & U(T[j])

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

BitShift( M^{l−1}(j−1) )

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M^l(j) = [ BitShift( M^l(j−1) ) & U(T[j]) ]  OR  BitShift( M^{l−1}(j−1) )
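A Python sketch of this k-mismatch recurrence, keeping one bit-mask per error level l = 0..k (again an illustration; bit i−1 of M[l] plays M^l(i,j)):

def agrep_mismatches(T, P, k):
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M = [0] * (k + 1)                          # M[l] is column j-1 of matrix M^l
    occ = []
    for j, c in enumerate(T, start=1):
        prev = M[:]                            # columns j-1, needed for the l-1 term
        for l in range(k + 1):
            match = ((prev[l] << 1) | 1) & U.get(c, 0)
            if l == 0:
                M[l] = match
            else:                              # allow one more mismatch at position j
                M[l] = match | ((prev[l - 1] << 1) | 1)
        if M[k] & (1 << (m - 1)):
            occ.append(j - m + 1)
    return occ

print(agrep_mismatches("aatatccacaa", "atcgaa", 2))   # occurrence with <=2 mismatches at position 4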

Example M1

T = x a b x a b a a c a        P = abaad
    1 2 3 4 5 6 7 8 9 10

M1 =  1:  1 1 1 1 1 1 1 1 1 1
      2:  0 0 1 0 0 1 0 1 1 0
      3:  0 0 0 1 0 0 1 0 0 1
      4:  0 0 0 0 1 0 0 1 0 0
      5:  0 0 0 0 0 0 0 0 1 0

M0 =  1:  0 1 0 0 1 0 1 1 0 1
      2:  0 0 1 0 0 1 0 0 0 0
      3:  0 0 0 0 0 0 1 0 0 0
      4:  0 0 0 0 0 0 0 1 0 0
      5:  0 0 0 0 0 0 0 0 0 0

How much do we pay?





The running time is O(kn(1+m/w)).
Again, the method is practically efficient for small m.
Still, only O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum number of operations
needed to transform p into s via three ops:

Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding

  x > 0 and Length = ⌊log_2 x⌋ + 1
  g(x) = (Length−1) zeroes, followed by x in binary
  e.g., 9 is represented as <000,1001>.

  g-code for x takes 2⌊log_2 x⌋ + 1 bits  (i.e. factor of 2 from optimal)

  Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8   6   3   59   7
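A Python sketch of g-encoding/decoding that reproduces the exercise above (my own helper names):

def gamma_encode(x):                    # x > 0
    b = bin(x)[2:]                      # binary representation, leading 1 included
    return "0" * (len(b) - 1) + b       # Length-1 zeroes, then x in binary

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":           # count the Length-1 zeroes
            z += 1
            i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out

code = "".join(gamma_encode(x) for x in (8, 6, 3, 59, 7))
print(code)                             # 0001000001100110000011101100111
print(gamma_decode(code))               # [8, 6, 3, 59, 7]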

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2 * H_0(s) + 1
Key fact:
1 ≥ Σ_{i=1,...,x} p_i ≥ x * p_x   ⇒   x ≤ 1/p_x

How good is it ?
Encode the integers via g-coding: |g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/p_i):

Σ_{i=1,...,|S|} p_i · |g(i)|  ≤  Σ_{i=1,...,|S|} p_i · [2 * log(1/p_i) + 1]  =  2 * H_0(X) + 1

Not much worse than Huffman,
and improvable to H_0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s*c^2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 128^2 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte, and thus better if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n^2 log n), MTF = O(n log n) + n^2

No much worse than Huffman
...but it may be far better
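A Python sketch of MTF encoding/decoding over an explicit list (1-based positions as in the slide; in practice the output integers would then be g-coded):

def mtf_encode(seq, alphabet):
    L = list(alphabet)
    out = []
    for s in seq:
        pos = L.index(s)                 # 0-based position of s in the current list
        out.append(pos + 1)              # emit the 1-based position
        L.insert(0, L.pop(pos))          # move s to the front
    return out

def mtf_decode(codes, alphabet):
    L = list(alphabet)
    out = []
    for pos in codes:
        s = L.pop(pos - 1)
        out.append(s)
        L.insert(0, s)
    return "".join(out)

codes = mtf_encode("aabbbba", "abcd")
print(codes)                             # [1, 1, 2, 1, 1, 1, 2] -- temporal locality gives small values
print(mtf_decode(codes, "abcd"))         # aabbbba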

MTF: how good is it ?
Encode the integers via g-coding: |g(i)| ≤ 2 * log i + 1
Put S in the front and consider the cost of encoding:

O(|S| log |S|) + Σ_{x=1}^{|S|} Σ_{i=2}^{n_x} |g(p_i^x − p_{i−1}^x)|

By Jensen’s:

≤ O(|S| log |S|) + Σ_{x=1}^{|S|} n_x · [2 * log(N/n_x) + 1]
= O(|S| log |S|) + N * [2 * H_0(X) + 1]

L_a[mtf] ≤ 2 * H_0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings ⇒ just numbers and one bit
Properties:

Exploit spatial locality, and it is a dynamic code
There is a memory

X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = Θ(n^2 log n) > Rle(X) = Θ(n log n)
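A Python sketch of RLE on the slide's example (illustrative helper names):

from itertools import groupby

def rle_encode(s):
    return [(c, len(list(g))) for c, g in groupby(s)]

def rle_decode(pairs):
    return "".join(c * n for c, n in pairs)

pairs = rle_encode("abbbaacccca")
print(pairs)                       # [('a', 1), ('b', 3), ('a', 2), ('c', 4), ('a', 1)]
print(rle_decode(pairs))           # abbbaacccca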

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g. with p(a) = .2, p(b) = .5, p(c) = .3:

f(i) = Σ_{j=1}^{i−1} p(j)   ⇒   f(a) = .0, f(b) = .2, f(c) = .7

so the ranges are  a = [0, .2),  b = [.2, .7),  c = [.7, 1.0)

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
start:        [0, 1)
after ‘b’:    [.2, .7)      (width .5)
after ‘a’:    [.2, .3)      (width .1)
after ‘c’:    [.27, .3)     (width .03)

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l_0 = 0      l_i = l_{i−1} + s_{i−1} * f[c_i]
s_0 = 1      s_i = s_{i−1} * p[c_i]

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

s_n = ∏_{i=1}^{n} p[c_i]

The interval for a message sequence will be called the
sequence interval
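A Python sketch of the recurrences above, reproducing the interval of the bac example (probabilities as in the running example):

def sequence_interval(msg, p, f):
    l, s = 0.0, 1.0                      # l_0 = 0, s_0 = 1
    for c in msg:
        l = l + s * f[c]                 # l_i = l_{i-1} + s_{i-1} * f[c_i]
        s = s * p[c]                     # s_i = s_{i-1} * p[c_i]
    return l, l + s                      # the sequence interval [l_n, l_n + s_n)

p = {"a": .2, "b": .5, "c": .3}
f = {"a": .0, "b": .2, "c": .7}
print(sequence_interval("bac", p, f))    # about (0.27, 0.30)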

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
.49 ∈ [.2, .7)      ⇒  ‘b’
.49 ∈ [.3, .55)     ⇒  ‘b’
.49 ∈ [.475, .55)   ⇒  ‘c’

The message is bbc.

Representing a real number
Binary fractional representation:
.75 = .11       1/3 = .0101...       11/16 = .1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
         min      max        interval
.11      .110     .111...    [.75, 1.0)
.101     .1010    .1011...   [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + ⌈log (1/s)⌉ = 1 + ⌈log ∏_{i=1,n} (1/p_i)⌉
              ≤ 2 + Σ_{i=1,n} log (1/p_i)
              = 2 + Σ_{k=1,|S|} n·p_k·log (1/p_k)
              = 2 + n·H_0   bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l ≥ R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
Order-0 (Empty):   A = 4   B = 2   C = 5   $ = 3

Order-1:   A:  C = 3  $ = 1
           B:  A = 2  $ = 1
           C:  A = 1  B = 2  C = 2  $ = 3

Order-2:   AC:  B = 1  C = 2  $ = 2
           BA:  C = 1  $ = 1
           CA:  C = 1  $ = 1
           CB:  A = 2  $ = 1
           CC:  A = 1  B = 1  $ = 2

String = ACCBACCACBAB,   k = 2
You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
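A Python sketch of the decoder, including the overlapping-copy case just described (triples are (d, len, c); my own helper name):

def lz77_decode(triples):
    out = []
    for d, length, c in triples:
        start = len(out) - d
        for i in range(length):               # byte-by-byte copy handles the case len > d
            out.append(out[start + i])
        out.append(c)                          # the extra character closing the phrase
    return "".join(out)

# the windowed example of the slides
print(lz77_decode([(0, 0, "a"), (1, 1, "c"), (3, 4, "b"), (3, 3, "a"), (1, 2, "c")]))
# aacaacabcabaaac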

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later
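A Python sketch of the LZW coder with the slide's numbering (a = 112, b = 113, c = 114, new entries from 256); it reproduces the outputs of the encoding example, plus the final code for the last pending phrase:

def lzw_encode(text, base={"a": 112, "b": 113, "c": 114}):
    dictionary = dict(base)
    next_id = 256
    out, S = [], ""
    for c in text:
        if S + c in dictionary:
            S += c                            # extend the current phrase
        else:
            out.append(dictionary[S])         # emit the longest match found so far
            dictionary[S + c] = next_id       # add Sc, but do not transmit c
            next_id += 1
            S = c
    if S:
        out.append(dictionary[S])
    return out

print(lzw_encode("aabaacababacb"))
# [112, 112, 113, 256, 114, 257, 261, 114, 113]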

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
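A small Python sketch of both directions: the forward transform sorts all rotations explicitly (the “elegant but inefficient” construction), and the inverse follows the LF-mapping idea of the pseudocode above; it assumes T ends with a unique smallest sentinel '#':

def bwt(T):
    rots = sorted(T[i:] + T[:i] for i in range(len(T)))    # sorted cyclic rotations
    return "".join(row[-1] for row in rots)                 # last column L

def ibwt(L):
    n = len(L)
    order = sorted(range(n), key=lambda i: (L[i], i))        # stable sort of L gives F
    LF = [0] * n
    for k, i in enumerate(order):
        LF[i] = k                                            # row i of L maps to row k of F
    r, out = 0, []
    for _ in range(n):                                       # L[r] precedes F[r] in T: walk backward
        out.append(L[r])
        r = LF[r]
    s = "".join(reversed(out))                               # T rotated so that '#' comes first
    return s[1:] + s[0]                                      # put the sentinel back at the end

L = bwt("mississippi#")
print(L)          # ipssm#pissii
print(ibwt(L))    # mississippi#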

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n^2 log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion pages are available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in-degree(u) = k ]  ∝  1/k^a,    a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/x^a, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 million pages, 150 million links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s_1 − x, s_2 − s_1 − 1, ..., s_k − s_{k−1} − 1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides an efficient, optimal solution

f_known is “previously encoded text”: compress f_known·f_new starting from f_new

zdelta is one of the best implementations
           Emacs size    Emacs time
uncompr    27Mb          ---
gzip       8Mb           35 secs
zdelta     1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: example weighted graph G_F over the files plus the dummy node; the min branching picks, for each file, its best reference (or gzip via the dummy node)]

           space    time
uncompr    30Mb     ---
tgz        20%      linear
THIS       8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n^2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n^2 time
           space    time
uncompr    260Mb    ---
tgz        12%      2 mins
THIS       8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
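A Python sketch of the block-matching idea (deliberately simplified: one MD5 per block and per window, no rolling weak hash and no real protocol, so this only illustrates how block references plus literals describe f_new in terms of f_old):

import hashlib

def block_signatures(old, B):
    # client side: one strong hash per B-byte block of its old file
    return {hashlib.md5(old[i:i + B]).hexdigest(): i // B
            for i in range(0, len(old), B)}

def delta(new, sigs, B):
    # server side: describe f_new as block references into f_old plus literal bytes
    out, i, lit = [], 0, bytearray()
    while i < len(new):
        h = hashlib.md5(new[i:i + B]).hexdigest() if len(new) - i >= B else None
        if h in sigs:
            if lit:
                out.append(("lit", bytes(lit)))
                lit = bytearray()
            out.append(("copy", sigs[h]))
            i += B
        else:
            lit.append(new[i])
            i += 1
    if lit:
        out.append(("lit", bytes(lit)))
    return out

old = b"the quick brown fox jumps over the lazy dog!"
new = b"the quick brown cat jumps over the lazy dog!"
print(delta(new, block_signatures(old, 8), 8))   # mostly "copy" commands, short literals around the change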

Rsync: some experiments

          gcc size    emacs size
total     27288       27326
gzip      7563        8577
zdelta    227         1431
rsync     964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Figure: suffix tree of T# = mississippi# (positions 1..12), with edge labels such as #, i, s, p, si, ssi, ppi#, pi#, i#, mississippi#, ...; the 12 leaves store the suffix starting positions]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
5

Θ(N^2) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Θ(N log_2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
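A Python sketch of the indirected binary search (toy suffix-array construction by sorting suffixes, then two binary searches over SA; 0-based positions):

def suffix_array(T):
    return sorted(range(len(T)), key=lambda i: T[i:])        # toy O(n^2 log n) construction

def sa_search(T, SA, P):
    # two binary searches, each with O(p) char comparisons per step
    def lower(strict):
        lo, hi = 0, len(SA)
        while lo < hi:
            mid = (lo + hi) // 2
            suf = T[SA[mid]:SA[mid] + len(P)]
            if suf < P or (strict and suf == P):
                lo = mid + 1
            else:
                hi = mid
        return lo
    l, r = lower(False), lower(True)
    return sorted(SA[l:r])                                    # starting positions of the occurrences

T = "mississippi#"
SA = suffix_array(T)
print(SA)                      # [11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]
print(sa_search(T, SA, "si"))  # [3, 6] -> 1-based positions 4 and 7, as in the slides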

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does it exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does it exist a substring of length ≥ L occurring ≥ C times ?
• Search for Lcp[i,i+C-2] whose entries are ≥ L


Slide 11

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce N/M sorted runs.
Pass i: merge X ≤ M/B runs at a time  ⇒  log_{M/B}(N/M) merge passes

[Figure: X input buffers (INPUT 1 … INPUT X) and one OUTPUT buffer, each of B items, in main memory; runs stream in from disk and the merged run streams back to disk]

Multiway Merging
Keep one input buffer Bf_i (with a pointer p_i to its current page) per run, plus an output buffer Bf_o:
 output min( Bf_1[p_1], Bf_2[p_2], …, Bf_X[p_X] ) into Bf_o;
 fetch the next page of run i when p_i reaches B;
 flush Bf_o to the merged output run when it is full;
 repeat until EOF on all X = M/B runs.
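
A minimal in-memory sketch of the X-way merging step in Python (a heap over the current run heads); a real external sort would move pages of B items to and from disk, which is only mimicked here.

import heapq

def multiway_merge(runs):
    # runs: X already-sorted iterables (one per sorted run)
    heap = []
    for i, it in enumerate(map(iter, runs)):
        first = next(it, None)
        if first is not None:
            heap.append((first, i, it))
    heapq.heapify(heap)
    while heap:
        val, i, it = heapq.heappop(heap)    # min(Bf_1[p_1], ..., Bf_X[p_X])
        yield val                           # would go into the output buffer Bf_o
        nxt = next(it, None)                # "fetch" the next element of run i
        if nxt is not None:
            heapq.heappush(heap, (nxt, i, it))

print(list(multiway_merge([[1, 5, 9], [2, 3, 8], [4, 6, 7]])))   # [1, 2, ..., 9]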

Cost of Multi-way Merge-Sort


Number of passes = log_{M/B}(#runs) ≤ log_{M/B}(N/M)

Optimal cost = Θ( (N/B) · log_{M/B}(N/M) ) I/Os

In practice

M/B ≈ 1000  ⇒  #passes = log_{M/B}(N/M) ≈ 1
One multiway merge  ⇒  2 passes = few mins

Tuning depends on disk features
 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Can compression help?

Goal: enlarge M and reduce N

#passes = O( log_{M/B}(N/M) )
Cost of a pass = O(N/B)

Part of Vitter’s paper addresses related issues:

Disk Striping: sorting easily on D disks

Distribution sort: top-down sorting

Lower Bounds: how far down we can go

Toy problem #3: Top-freq elements



Goal: top queries over a stream of N items (over a large alphabet).
Math Problem: find the item y whose frequency is > N/2,
using the smallest space (i.e. assuming the mode occurs > N/2 times).

A=b a c c c d c b a a a c c b c c c
.

Algorithm (majority vote)

Use a pair of variables X, C (initially C = 0)
For each item s of the stream:
  if (C == 0) { X = s; C = 1; }
  else if (X == s) C++;
  else C--;
Return X;

Proof

If the returned X ≠ y, then every one of y’s occurrences was
cancelled by a “negative” mate, hence the mates are ≥ #occ(y).
But then #occ(y) + #mates > N, a contradiction since #occ(y) > N/2.
(No guarantee holds if the mode occurs ≤ N/2 times.)
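
A short Python sketch of the majority-vote scan above (names are mine):

def majority_candidate(stream):
    # One pass, two variables; the result is the majority item
    # whenever one item occurs > N/2 times (no guarantee otherwise).
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1
        elif X == s:
            C += 1
        else:
            C -= 1
    return X

print(majority_candidate("bacccdcbaaaccbccc"))   # 'c' (the stream of the slide)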

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 · 10^9  ⇒  size = 6Gb
n = 10^6 documents
TotT = 10^9 (avg term length is 6 chars)
t = 5 · 10^5 distinct terms

What kind of data structure do we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million plays (columns), t = 500K terms (rows); entry is 1 if the play contains the word, 0 otherwise.

            Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony            1                 1             0          0       0        1
Brutus            1                 1             0          1       0        0
Caesar            1                 1             0          1       1        1
Calpurnia         0                 1             0          0       0        0
Cleopatra         1                 0             0          0       0        0
mercy             1                 0             1          1       1        1
worser            1                 0             1          1       1        0

Space is 500Gb !

Solution 2: Inverted index
Brutus    → 2 4 8 16 32 64 128
Calpurnia → 1 2 3 5 8 13 21 34
Caesar    → 13 16

We can still do better: i.e. 30–50% of the original text

1. Typically about 12 bytes per posting
2. We have 10^9 total terms ⇒ at least 12Gb space
3. Compressing the 6Gb documents gets 1.5Gb data
Better index, but it is still >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO: they are 2^n, but we have fewer compressed messages:

∑_{i=1,…,n−1} 2^i  =  2^n − 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i(s) = log2( 1 / p(s) ) = − log2 p(s)

Lower probability  ⇒  higher information

Entropy is the weighted average of i(s):

H(S) = ∑_{s ∈ S} p(s) · log2( 1 / p(s) )    bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La(C) = ∑_{s ∈ S} p(s) · L[s]

We say that a prefix code C is optimal if for
all prefix codes C’, La(C) ≤ La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  ⇒  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H(S) ≤ La(C)

Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La(C) ≤ H(S) + 1

(Shannon’s code assigns ⌈log2 1/p(s)⌉ bits to symbol s.)

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

[Figure: Huffman tree built bottom-up: a(.1) + b(.2) → (.3); (.3) + c(.2) → (.5); (.5) + d(.5) → (1)]

a = 000, b = 001, c = 01, d = 1
There are 2^(n−1) “equivalent” Huffman trees (flip the 0/1 labels of the internal nodes)

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
abc…  →  000 001 01 …            101001…  →  d c b …

[Figure: the Huffman tree above, traversed root-to-leaf for encoding and bit-by-bit from the root for decoding]
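
A compact Python sketch of Huffman-tree construction with a heap, run on the probabilities of the running example; ties are broken arbitrarily, so the actual bits may differ from the slide while the codeword lengths do not.

import heapq

def huffman_codes(probs):
    # probs: {symbol: probability}; returns {symbol: codeword string}
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)          # two least-probable subtrees
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

print(huffman_codes({"a": .1, "b": .2, "c": .2, "d": .5}))
# codeword lengths: a, b -> 3 bits, c -> 2 bits, d -> 1 bit, as in the example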

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store, for any level L:
  firstcode[L]   (the numeric value of the first codeword on level L; on the deepest level it is 00…0)
  Symbol[L,i], for each i in level L

This is ≤ h² + |S| log |S| bits

Canonical Huffman: Encoding

[Figure: codewords assigned level by level, for levels 1–5]

Canonical Huffman: Decoding
firstcode[1] = 2, firstcode[2] = 1, firstcode[3] = 1, firstcode[4] = 2, firstcode[5] = 0

T=...00010...
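
For decoding, the textbook canonical-Huffman loop needs only the firstcode[] array plus the per-level symbol lists; a small sketch in my notation, not necessarily the slide's exact conventions:

def canonical_decode(next_bit, firstcode, symbol):
    # next_bit(): returns the next input bit (0/1)
    # firstcode[l]: numeric value of the first codeword of length l (index 0 unused)
    # symbol[l][i]: i-th symbol among those whose codeword has length l
    v, l = next_bit(), 1
    while v < firstcode[l]:          # codeword not complete yet: go one level deeper
        v = 2 * v + next_bit()
        l += 1
    return symbol[l][v - firstcode[l]]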

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
− log2(.999) ≈ .00144

If we were to send 1000 such symbols we
might hope to use 1000 · .00144 ≈ 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 ≤ 1 extra bit per macro-symbol = 1/k extra bits per symbol
 but a larger model has to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:
 the model takes |S|^k entries of k · log |S| bits, plus h²   (where h might be |S|)
 and H0(S^L) ≤ L · Hk(S) + O(k · log |S|), for each k ≤ L

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
[Figure: the word-based tagged Huffman code for T = “bzip or not bzip”: the symbols are the words (and the space), the tree has fan-out 128, and each codeword is a sequence of bytes whose first byte carries the tag bit, so codewords are byte-aligned; C(T) is the concatenation of the codewords of “bzip”, space, “or”, space, “not”, space, “bzip”]

CGrep and other ideas...
P= bzip = 1a 0b

[Figure: GREP on the compressed text: the pattern P = bzip is encoded with the same word-based code (P = 1a 0b) and its codeword is matched directly against C(T), T = “bzip or not bzip”; each codeword boundary gives a yes/no answer, and the tag bits prevent false matches inside codewords]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary = { bzip, not, or } (plus the space)        P = bzip = 1a 0b

[Figure: S = “bzip or not bzip” and its encoding C(S); the codeword of P is matched against C(S), answering yes/no at each codeword boundary]

S = “bzip or not bzip”

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
[Figure: text T = … A B C A B D A B … with the pattern P = A B aligned at its occurrences]
 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H(s) = ∑_{i=1,…,m} 2^{m−i} · s[i]

P = 0101
H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5
s = s’ if and only if H(s) = H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H(T_r) = 2 · H(T_{r−1}) − 2^m · T(r−1) + T(r+m−1)

T = 10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H(T1) = H(1011) = 11
H(T2) = 2·11 − 2⁴·1 + 0 = 22 − 16 = 6 = H(0110)

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1·2 (mod 7) + 0 = 2
2·2 (mod 7) + 1 = 5
5·2 (mod 7) + 1 = 4
4·2 (mod 7) + 1 = 2
2·2 (mod 7) + 1 = 5
5 (mod 7) = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(Tr−1):
2^m (mod q) = 2 · ( 2^{m−1} (mod q) ) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
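
A compact Python sketch of the Karp-Rabin scan over a binary text; the prime q is assumed given (a fixed large one here), and probable matches are verified, so only true occurrences are reported.

def karp_rabin(T, P, q=2_147_483_647):
    n, m = len(T), len(P)
    if m > n:
        return []
    pow_m = pow(2, m, q)                        # 2^m (mod q), drops the leftmost bit
    hP = hT = 0
    for i in range(m):                          # fingerprints of P and of T_1
        hP = (2 * hP + int(P[i])) % q
        hT = (2 * hT + int(T[i])) % q
    occ = []
    for r in range(n - m + 1):
        if hP == hT and T[r:r+m] == P:          # verify, to rule out false matches
            occ.append(r)
        if r + m < n:                           # roll: Hq(T_{r+1}) from Hq(T_r)
            hT = (2 * hT - pow_m * int(T[r]) + int(T[r + m])) % q
    return occ

print(karp_rabin("10110101", "0101"))           # [4] (0-based, i.e. position 5 of the slides)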

Problem 1: Solution
Dictionary = { bzip, not, or } (plus the space)        P = bzip = 1a 0b

[Figure: the codeword of P is scanned against C(S), S = “bzip or not bzip”; every alignment at a codeword boundary gives a yes/no answer]

S = “bzip or not bzip”

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

[Figure: the m × n matrix M for T = california and P = for; the only column whose last row is 1 is j = 7, i.e. the occurrence of “for” ending at position 7]

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit bit-parallelism to compute the j-th column of M from the (j−1)-th one.
We define the m-length binary vector U(x) for each character x of the alphabet: U(x) has a 1 in
the positions of P where character x appears.

Example: P = abaac
U(a) = (1,0,1,1,0)ᵀ      U(b) = (0,1,0,0,0)ᵀ      U(c) = (0,0,0,0,1)ᵀ

How to construct M

Initialize column 0 of M to all zeros.
For j > 0, the j-th column is obtained as

M(j) = BitShift( M(j−1) ) & U( T[j] )

For i > 1, entry M(i,j) = 1 iff
(1) the first i−1 characters of P match the i−1 characters of T ending at position j−1   ⇔  M(i−1, j−1) = 1
(2) P[i] = T[j]   ⇔  the i-th bit of U(T[j]) = 1

BitShift moves bit M(i−1, j−1) into the i-th position; ANDing it with the i-th bit of U(T[j]) establishes whether both conditions hold.
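
A small Python sketch of the Shift-And scan, using integers as bit-vectors (bit i of the mask plays the role of row i+1 of M); names are mine.

def shift_and(T, P):
    m = len(P)
    U = {}                                      # U[c]: bitmask of the positions of c in P
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    accept = 1 << (m - 1)                       # bit m-1 set  <=>  M(m, j) = 1
    M, occ = 0, []
    for j, c in enumerate(T):
        M = ((M << 1) | 1) & U.get(c, 0)        # BitShift(M(j-1)) & U(T[j])
        if M & accept:
            occ.append(j - m + 1)               # occurrence ending at j (0-based start)
    return occ

print(shift_and("california", "for"))           # [4]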

An example: T = xabxabaaca, P = abaac

[Figure: the columns M(1), M(2), M(3), …, M(9) computed step by step as M(j) = BitShift(M(j−1)) & U(T[j]); at j = 9 the last bit of the column becomes 1, i.e. the occurrence of P ending at position 9 is detected]

Shift-And method: Complexity








If m ≤ w, any column and any vector U() fit in one memory word
  ⇒ each step requires O(1) time.
If m > w, any column and vector U() span ⌈m/w⌉ memory words
  ⇒ each step requires O(m/w) time.
Overall O( n(1 + m/w) + m ) time.
Thus it is very fast when the pattern length is close to the word size,
which is very often the case in practice; recall that w = 64 bits in modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
U(a) = (1,0,1,1,0)ᵀ      U(b) = (1,1,0,0,0)ᵀ      U(c) = (0,0,0,0,1)ᵀ

What about ‘?’, ‘[^…]’ (not) ?

Problem 1: Another solution
Dictionary = { bzip, not, or } (plus the space)        P = bzip = 1a 0b

[Figure: as before, the codeword of P is matched against C(S), S = “bzip or not bzip”; the Shift-And machinery can be run directly over the byte-aligned codewords]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

Dictionary = { bzip, not, or } (plus the space)        P = o

[Figure: the dictionary terms containing P = “o” are “not” and “or”; their codewords are then searched for in C(S), S = “bzip or not bzip”]

not = 1g 0g 0a
or  = 1g 0a 0b

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: a text T with the occurrences of the patterns P1 and P2 highlighted]
 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

S is the concatenation of the patterns in P
R is a bitmask of length m:
  R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of the Shift-And method searching for S:
  For any symbol c, U’(c) = U(c) AND R
    ⇒ U’(c)[i] = 1 iff S[i] = c and it is the first symbol of a pattern
  For any step j:
    compute M(j), then M(j) OR U’(T[j]). Why?
      ⇒ it sets to 1 the first bit of each pattern that starts with T[j]
    check if there are occurrences ending at j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

Dictionary = { bzip, not, or } (plus the space)        P = bot, k = 2

[Figure: C(S) for S = “bzip or not bzip”; the goal is to find the codewords of the dictionary terms that match P with at most k mismatches]

S = “bzip or not bzip”

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
the first i characters of P match the i characters of T ending
at position j with no more than l mismatches.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

[Figure: P[1…i−1] aligned against T[…j−1] with ≤ l mismatches, extended by the equal pair P[i] = T[j]]

BitShift( M^l(j−1) ) & U( T[j] )

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

[Figure: P[1…i−1] aligned against T[…j−1] with ≤ l−1 mismatches, extended by one more (possibly mismatching) character]

BitShift( M^{l−1}(j−1) )

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))
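
A sketch of the k-mismatch Shift-And update in Python, following the recurrence above; M[l] holds column M^l as an integer bitmask, and the names are mine.

def agrep_kmismatch(T, P, k):
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    accept = 1 << (m - 1)
    M = [0] * (k + 1)                           # M[l] = current column M^l
    occ = []
    for j, c in enumerate(T):
        prev = M[:]                             # columns of step j-1
        M[0] = ((prev[0] << 1) | 1) & U.get(c, 0)
        for l in range(1, k + 1):
            # extend with a match (<= l mismatches) OR spend one more mismatch
            M[l] = (((prev[l] << 1) | 1) & U.get(c, 0)) | ((prev[l - 1] << 1) | 1)
        if M[k] & accept:
            occ.append(j - m + 1)               # occurrence with <= k mismatches ending at j
    return occ

print(agrep_kmismatch("aatatccacaa", "atcgaa", 2))   # [3] (0-based, position 4 of the slides)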

Example

T = xabxabaaca,  P = abaad

[Figure: the matrices M0 and M1 for this T and P, computed column by column; for instance the whole P matches T[5…9] = abaac with one mismatch (d vs c), so M1(5,9) = 1]

How much do we pay?





The running time is O( k · n · (1 + m/w) ).
Again, the method is practically efficient for small m.
Only O(k) columns of M are needed at any given time,
hence the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary = { bzip, not, or } (plus the space)        P = bot, k = 2

[Figure: scanning C(S), S = “bzip or not bzip”, the k-mismatch Shift-And reports the dictionary term “not”, since it matches P = bot with ≤ 2 mismatches]

not = 1g 0g 0a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding

γ(x) = 0^(Length−1) · (x in binary),   for x > 0 and Length = ⌊log2 x⌋ + 1
e.g., 9 is represented as <000, 1001>

The γ-code of x takes 2⌊log2 x⌋ + 1 bits   (i.e. a factor of 2 from the optimal)

Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of γ-coded integers, reconstruct the original sequence:

0001000001100110000011101100111

Answer: 8, 6, 3, 59, 7
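
A small Python sketch of γ-encoding and of the stream decoding used in the exercise above (bit strings as Python strings, purely illustrative):

def gamma_encode(x):
    assert x > 0
    b = bin(x)[2:]                      # x in binary, no leading zeros
    return "0" * (len(b) - 1) + b       # Length-1 zeros, then the binary digits

def gamma_decode_stream(bits):
    out, i = [], 0
    while i < len(bits):
        l = 0
        while bits[i] == "0":           # unary part: Length-1
            l += 1
            i += 1
        out.append(int(bits[i:i + l + 1], 2))
        i += l + 1
    return out

print(gamma_encode(9))                                          # 0001001
print(gamma_decode_stream("0001000001100110000011101100111"))   # [8, 6, 3, 59, 7]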

Analysis
Sort the pi in decreasing order, and encode si via the
variable-length code γ(i).

Recall that |γ(i)| ≤ 2·log i + 1
How good is this approach w.r.t. Huffman?
Compression ratio ≤ 2·H0(S) + 1
Key fact:
  1 ≥ ∑_{i=1,…,x} pi ≥ x · px   ⇒   x ≤ 1/px

How good is it ?
The cost of the encoding is (recall i ≤ 1/pi):

∑_{i=1,…,|S|} pi · |γ(i)|  ≤  ∑_{i=1,…,|S|} pi · [ 2·log(1/pi) + 1 ]  =  2·H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + …

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 128² = 16,512 words on up to 2 bytes
A (230,26)-dense code encodes 230 + 230·26 = 6,210 words on up to 2 bytes,
hence more words on 1 byte, and thus it wins if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploits temporal locality, and it is dynamic

X = 1^n 2^n 3^n … n^n   ⇒   Huff = O(n² log n) bits, MTF = O(n log n) + n² bits

Not much worse than Huffman... but it may be far better
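
A minimal MTF encoder in Python (positions are 0-based here):

def mtf_encode(text, alphabet):
    L = list(alphabet)                  # the starting list of symbols
    out = []
    for s in text:
        i = L.index(s)                  # 1) output the position of s in L
        out.append(i)
        L.insert(0, L.pop(i))           # 2) move s to the front of L
    return out

print(mtf_encode("abaac", "abc"))       # [0, 1, 1, 0, 2]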

MTF: how good is it ?
Encode the output integers via γ-coding: |γ(i)| ≤ 2·log i + 1
Put Σ at the front of the list and consider the cost of encoding:

O(|Σ| log |Σ|)  +  ∑_{x=1,…,|Σ|} ∑_{i≥2} |γ( p_x^i − p_x^{i−1} )|

where p_x^i is the position of the i-th occurrence of symbol x. By Jensen’s inequality this is

≤  O(|Σ| log |Σ|)  +  ∑_{x=1,…,|Σ|} n_x · [ 2·log(N/n_x) + 1 ]
=  O(|Σ| log |Σ|)  +  N · [ 2·H0(X) + 1 ]

⇒  La[mtf] ≤ 2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploits spatial locality, and it is a dynamic code
X = 1^n 2^n 3^n … n^n   ⇒   Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)

There is a memory
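
A one-liner-style RLE sketch in Python:

from itertools import groupby

def rle(s):
    # abbbaacccca -> [('a',1), ('b',3), ('a',2), ('c',4), ('a',1)]
    return [(ch, len(list(g))) for ch, g in groupby(s)]

print(rle("abbbaacccca"))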

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.  p(a) = .2, p(b) = .5, p(c) = .3

f(i) = ∑_{j < i} p(j)    ⇒    f(a) = .0, f(b) = .2, f(c) = .7

[Figure: the unit interval [0,1) split into a = [0,.2), b = [.2,.7), c = [.7,1)]

The interval for a particular symbol will be called
the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
[Figure: starting from [0,1), symbol b narrows the interval to [.2,.7);
within it, a narrows it to [.2,.3); within that, c narrows it to [.27,.3)]

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l 0  0 l i  l i -1  s i -1 * f c i 

s0  1

s i  s i -1 * p c i 

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

n

sn 

 p c 
i

i 1

The interval for a message sequence will be called the
sequence interval
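
A toy Python sketch that computes the sequence interval with floating point (fine for short messages; real coders use the integer/scaling version described later):

def sequence_interval(msg, p):
    # p: {symbol: probability}; returns (l, l+s), the sequence interval of msg
    f, acc = {}, 0.0
    for c in sorted(p):                 # cumulative probabilities f[c]
        f[c] = acc
        acc += p[c]
    l, s = 0.0, 1.0
    for c in msg:
        l = l + s * f[c]                # l_i = l_{i-1} + s_{i-1} * f(c_i)
        s = s * p[c]                    # s_i = s_{i-1} * p(c_i)
    return l, l + s

print(sequence_interval("bac", {"a": .2, "b": .5, "c": .3}))   # roughly (0.27, 0.30)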

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
[Figure: .49 falls in b’s interval [.2,.7); within it, again in b’s sub-interval [.3,.55);
within that, in c’s sub-interval [.475,.55)]

The message is bbc.

Representing a real number
Binary fractional representation:
.75 = .11        1/3 = .010101…        11/16 = .1011

Algorithm
1.  x = 2·x
2.  if x < 1 output 0
3.  else x = x − 1; output 1

So how about just using the shortest binary fractional
representation in the sequence interval?
e.g.  [0,.33) → .01      [.33,.66) → .1      [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
         min      max      interval
.11      .110…    .111…    [.75, 1.0)
.101     .1010…   .1011…   [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

[Figure: sequence interval [.61, .79) containing the code interval of .101 = [.625, .75)]

Can use L + s/2 truncated to 1 + ⌈log2(1/s)⌉ bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

1 + ⌈log2(1/s)⌉ = 1 + ⌈log2 ∏_i (1/p_i)⌉
              ≤ 2 + ∑_{i=1,…,n} log2(1/p_i)
              = 2 + ∑_{k=1,…,|Σ|} n·p_k · log2(1/p_k)
              = 2 + n·H0   bits

In practice ≈ n·H0 + 0.02·n bits, because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R = 2^k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts        String = ACCBACCACBA, next symbol = B,  k = 2

Empty context:   A = 4,  B = 2,  C = 5,  $ = 3

Context A:   C = 3, $ = 1
Context B:   A = 2, $ = 1
Context C:   A = 1, B = 2, C = 2, $ = 3

Context AC:  B = 1, C = 2, $ = 2
Context BA:  C = 1, $ = 1
Context CA:  C = 1, $ = 1
Context CB:  A = 2, $ = 1
Context CC:  A = 1, B = 1, $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
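
A minimal LZ77 decoder sketch in Python, with the same char-by-char copy loop as above (so the l > d overlap works):

def lz77_decode(triples):
    out = []
    for d, length, c in triples:
        start = len(out) - d
        for i in range(length):          # char-by-char copy handles the overlap case
            out.append(out[start + i])
        out.append(c)
    return "".join(out)

print(lz77_decode([(0,0,"a"), (1,1,"c"), (3,4,"b"), (3,3,"a"), (1,2,"c")]))
# aacaacabcabaaac  (the windowed example above)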

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
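
A small LZW encoder sketch; for brevity the dictionary is initialized with the symbols actually present instead of the 256 ASCII codes, so the ids differ from the slide's.

def lzw_encode(text):
    dictionary = {c: i for i, c in enumerate(sorted(set(text)))}
    out, S = [], ""
    for c in text:
        if S + c in dictionary:
            S = S + c                              # extend the current match
        else:
            out.append(dictionary[S])              # emit the id of S (no extra char sent)
            dictionary[S + c] = len(dictionary)    # add Sc to the dictionary
            S = c
    if S:
        out.append(dictionary[S])
    return out

print(lzw_encode("aabaacababacb"))                 # [0, 0, 1, 3, 2, 4, 8, 2, 1]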

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Take the text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
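
A compact and deliberately naive Python sketch of the transform and of its inversion via the LF-mapping; sorting all rotations is exactly the "elegant but inefficient" method of the next slide.

def bwt(T):
    # T must end with a unique smallest sentinel, e.g. '#'
    rots = sorted(T[i:] + T[:i] for i in range(len(T)))
    return "".join(row[-1] for row in rots)

def ibwt(L):
    # LF[r]: position in F (= sorted L) of the char occurrence L[r]
    order = sorted(range(len(L)), key=lambda r: L[r])   # stable: equal chars keep order
    LF = [0] * len(L)
    for f, r in enumerate(order):
        LF[r] = f
    out, r = [], 0                      # row 0 is the one starting with '#'
    for _ in range(len(L) - 1):         # rebuild T backward, sentinel excluded
        out.append(L[r])
        r = LF[r]
    return "".join(reversed(out)) + "#"

s = bwt("mississippi#")
print(s, ibwt(s))                       # ipssm#pissii mississippi#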

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pr that a node has x links is ∝ 1/x^α, α ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in-degree(u) = k ]  ∝  1 / k^α ,    α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 million pages, 150 million links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries: a value v ≥ 0 is encoded as 2v, and v < 0 as 2|v| − 1 (as in the residual examples further below).
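
A tiny sketch of the gap transformation of one successor list (node id and list below are made-up numbers, just for illustration):

def gaps(x, successors):
    # successors: the sorted adjacency list of node x
    s = successors
    first = s[0] - x                                  # may be negative
    g = [2 * first if first >= 0 else 2 * abs(first) - 1]
    g += [s[i] - s[i - 1] - 1 for i in range(1, len(s))]
    return g

print(gaps(15, [13, 15, 16, 17, 19, 23]))             # [3, 1, 0, 0, 1, 3]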

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides and efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
            Emacs size   Emacs time
uncompr     27Mb         ---
gzip        8Mb          35 secs
zdelta      1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: weighted graph over the files plus a dummy root; edge weights are zdelta sizes,
dummy edges carry the gzip sizes; the min branching picks the cheapest reference for each file]

            space   time
uncompr     30Mb    ---
tgz         20%     linear
THIS        8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, still Θ(n²) time

            space   time
uncompr     260Mb   ---
tgz         12%     2 mins
THIS        8%      16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

          gcc size   emacs size
total     27288      27326
gzip      7563       8577
zdelta    227        1431
rsync     964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes fail to find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits.

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Figure: the suffix tree of T# = mississippi#, with 12 leaves labelled by the starting positions
of the suffixes and edges labelled by substrings such as #, i, i#, mississippi#, p, pi#, ppi#, s, si, ssi]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
5

Θ(N²) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
  ⇒ In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
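
A short Python sketch of the indirected binary search over SA (it reports the range of suffixes prefixed by P); bisect's key argument (Python >= 3.10) performs the O(p)-time suffix comparisons.

from bisect import bisect_left, bisect_right

def build_sa(T):
    return sorted(range(len(T)), key=lambda i: T[i:])     # toy O(n^2 log n) construction

def occurrences(T, SA, P):
    lo = bisect_left(SA, P, key=lambda i: T[i:i + len(P)])
    hi = bisect_right(SA, P, key=lambda i: T[i:i + len(P)])
    return sorted(SA[lo:hi])                              # starting positions, 0-based

T = "mississippi#"
SA = build_sa(T)
print(occurrences(T, SA, "si"))                           # [3, 6] (positions 4 and 7, 1-based)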

Locating the occurrences
[Figure: the SA of T = mississippi# with the two consecutive entries 7 and 4, whose suffixes
(sippi…, sissippi…) are the ones prefixed by P = si; hence occ = 2]

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does it exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does it exist a substring of length ≥ L occurring ≥ C times ?
• Search for Lcp[i,i+C-2] whose entries are ≥ L


Slide 12

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


 Number of passes = log_{M/B} #runs  ≈  log_{M/B} (N/M)

 Optimal cost = Θ( (N/B) · log_{M/B} (N/M) ) I/Os

In practice
 M/B ≈ 1000  ⇒  #passes = log_{M/B} (N/M) ≈ 1
 One multiway merge  ⇒  2 passes = few mins
 Tuning depends on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Can compression help?

 Goal: enlarge M and reduce N
   #passes = O( log_{M/B} (N/M) )
   Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how far down we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm

 Use a pair of variables <X, C>; initialize X to the first item and C = 1
 For each subsequent item s of the stream,
   if (X == s) then C++
   else { C--; if (C == 0) { X = s; C = 1; } }

 Return X;

Proof (sketch)

 The algorithm may fail only when no item occurs > N/2 times.
 If y occurs > N/2 times but the algorithm ends with X ≠ y, then every
 one of y's occurrences has a distinct "negative" mate that cancelled it.
 Hence these mates are ≥ #occ(y), so 2 · #occ(y) ≤ N,
 contradicting #occ(y) > N/2.
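A short C sketch of this one-pass voting scan on the slide's stream; here the classical form of the update is used (adopt a new candidate when the counter is zero), and the printed message is illustrative.

  #include <stdio.h>

  /* One-pass majority voting; 'c' occurs 9 times out of 17 in the stream,
     so it is a true majority and must be returned. */
  int main(void) {
      const char A[] = {'b','a','c','c','c','d','c','b','a','a',
                        'a','c','c','b','c','c','c'};
      int N = 17;
      char X = 0; int C = 0;
      for (int i = 0; i < N; i++) {
          if (C == 0) { X = A[i]; C = 1; }   /* adopt a new candidate   */
          else if (X == A[i]) C++;           /* one more vote for X     */
          else C--;                          /* cancel one vote of X    */
      }
      printf("candidate = %c\n", X);         /* prints c                */
      return 0;
  }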

Toy problem #4: Indexing


Consider the following TREC collection:







 N = 6 * 10^9 chars, size = 6Gb
 n = 10^6 documents
 TotT = 10^9 terms (avg term length is 6 chars)
 t = 5 * 10^5 distinct terms

What kind of data structure do we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million documents (the plays), t = 500K terms
Entry is 1 if the play contains the word, 0 otherwise

              Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
 Antony              1                1             0          0        0        1
 Brutus              1                1             0          1        0        0
 Caesar              1                1             0          1        1        1
 Calpurnia           0                1             0          0        0        0
 Cleopatra           1                0             0          0        0        0
 mercy               1                0             1          1        1        1
 worser              1                0             1          1        1        0

Space is 500Gb !

Solution 2: Inverted index
 Brutus    →  2  4  8  16  32  64  128
 Calpurnia →  1  2  3  5  8  13  21  34
 Caesar    →  13  16

We can still do better: i.e. 30-50% of the original text

1. Typically we use about 12 bytes per posting
2. We have 10^9 total terms ⇒ at least 12Gb space
3. Compressing the 6Gb of documents gets 1.5Gb of data
   Better index, but it is still > 10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

 ∑_{i=1}^{n−1} 2^i  =  2^n − 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
 i(s) = log2 ( 1 / p(s) ) = − log2 p(s)

Lower probability ⇒ higher information

Entropy is the weighted average of i(s)
 H(S) = ∑_{s∈S} p(s) · log2 ( 1 / p(s) )   bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
[Figure: the binary trie of this code - a is the leaf reached by 0, b by 100, c by 101, d by 11]

Average Length
For a code C with codeword length L[s], the
average length is defined as
 La(C) = ∑_{s∈S} p(s) · L[s]

We say that a prefix code C is optimal if for
all prefix codes C’, La(C) ≤ La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  ⇒  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

 H(S) ≤ La(C)

Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

 La(C) ≤ H(S) + 1

Shannon code
takes ⌈log2 (1/p)⌉ bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

[Figure: Huffman tree construction - merge a(.1) and b(.2) into a node of weight .3; merge it with c(.2) into a node of weight .5; merge that with d(.5) into the root (1)]
a=000, b=001, c=01, d=1
There are 2^(n−1) “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
 Encoding:  abc...  →  000 001 01 ...  =  00000101...
 Decoding:  101001...  →  d c b ...

[Figure: the same Huffman tree as above, with codes a = 000, b = 001, c = 01, d = 1]

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

 firstcode[L]  (the numerically smallest codeword of length L, e.g. 00.....0 at the deepest level)
 Symbol[L,i], for each i in level L

This is ≤ h² + |S| log |S| bits

Canonical Huffman: Encoding / Decoding example (levels 1 ... 5)

 firstcode[1] = 2
 firstcode[2] = 1
 firstcode[3] = 1
 firstcode[4] = 2
 firstcode[5] = 0

 T = ...00010...
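A common decoding loop for canonical codes, sketched here with the firstcode[] values of the example above. It assumes the canonical convention visible in the example (codewords of length L are consecutive integers starting at firstcode[L], and levels with no codewords get a firstcode larger than any L-bit value); the Symbol[L][i] table is not shown in the slides and is only named here.

  #include <stdio.h>

  int main(void) {
      int firstcode[6] = {0, 2, 1, 1, 2, 0};   /* firstcode[1..5] as above      */
      const char *bits = "00010";              /* T = ...00010...               */
      int pos = 0;

      int v = bits[pos++] - '0', L = 1;
      while (v < firstcode[L]) {               /* extend the codeword by 1 bit  */
          v = 2 * v + (bits[pos++] - '0');
          L++;
      }
      printf("codeword of length %d -> Symbol[%d][%d]\n", L, L, v - firstcode[L]);
      /* prints: codeword of length 5 -> Symbol[5][2], consuming exactly 00010  */
      return 0;
  }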

Problem with Huffman Coding
Consider a symbol with probability .999. Its
self information is

 − log2(.999) ≈ .00144 bits

If we were to send 1000 such symbols we
might hope to use 1000 · .00144 ≈ 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:

 The model takes |S|^k · (k · log |S|) + h² bits
 It is H0(SL) ≤ L · Hk(S) + O(k · log |S|), for each k ≤ L

 (where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
[Figure: the word-based Huffman tree for T = "bzip or not bzip" (symbols: bzip, or, not, space) and its tagged, byte-aligned codewords; each byte carries 7 bits of Huffman code plus 1 tag bit marking the first byte of a codeword, e.g. the codeword of "bzip" spans the two bytes 1a 0b]
CGrep and other ideas...
P= bzip = 1a 0b

[Figure: GREP over the compressed text C(T), T = "bzip or not bzip" - the two-byte codeword 1a 0b of P = bzip is searched directly in C(T); tag bits rule out false matches, and the two occurrences of bzip are found (yes/no marks)]

Speed ≈ Compression ratio

You find this at: my Software projects page

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

Dictionary = { bzip, not, or, space }      P = bzip = 1a 0b

[Figure: the tagged codeword of P is searched directly in C(S), S = "bzip or not bzip"; each codeword-aligned byte comparison answers yes/no, and the two occurrences of bzip are reported]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






 An efficient randomized algorithm that makes an error with small probability.
 A randomized algorithm that never errs, whose running time is efficient with high probability.

We will consider a binary alphabet (i.e., T ∈ {0,1}ⁿ).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



 Let s be a string of length m

   H(s) = ∑_{i=1}^{m} 2^{m−i} · s[i]

 P = 0101
 H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5

 s = s’ if and only if H(s) = H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

   H(Tr) = 2·H(Tr−1) − 2^m · T(r−1) + T(r+m−1)

 T = 10110101
 T1 = 1 0 1 1
 T2 = 0 1 1 0

 H(T1) = H(1011) = 11
 H(T2) = H(0110) = 2·11 − 2⁴·1 + 0 = 22 − 16 + 0 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P = 101111,  q = 7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally (Horner's rule, mod 7):
 (1·2 + 0) mod 7 = 2
 (2·2 + 1) mod 7 = 5
 (5·2 + 1) mod 7 = 4
 (4·2 + 1) mod 7 = 2
 (2·2 + 1) mod 7 = 5  = Hq(P)

We can still compute Hq(Tr) from Hq(Tr−1):
 2^m (mod q) = 2 · ( 2^{m−1} (mod q) )  (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
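A compact C sketch of the Karp-Rabin scan on the binary example above, in the "check and declare a definite match" variant; the modulus q = 13 is an arbitrary small prime chosen for illustration, not a value from the slides.

  #include <stdio.h>
  #include <string.h>

  int main(void) {
      const char *T = "10110101", *P = "0101";
      int n = strlen(T), m = strlen(P), q = 13;

      long hp = 0, ht = 0, msb = 1;            /* msb = 2^(m-1) mod q           */
      for (int i = 0; i < m - 1; i++) msb = (msb * 2) % q;
      for (int i = 0; i < m; i++) {            /* fingerprints of P and T_1     */
          hp = (2 * hp + (P[i] - '0')) % q;
          ht = (2 * ht + (T[i] - '0')) % q;
      }
      for (int r = 0; r + m <= n; r++) {
          if (hp == ht && strncmp(T + r, P, m) == 0)  /* verify: no false match */
              printf("match at position %d\n", r + 1); /* prints 5              */
          if (r + m < n)                       /* slide: drop T[r], add T[r+m]  */
              ht = ((ht - (T[r] - '0') * msb % q + q) * 2 + (T[r + m] - '0')) % q;
      }
      return 0;
  }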

Problem 1: Solution
Dictionary

Dictionary = { bzip, not, or, space }      P = bzip = 1a 0b

[Figure: scanning C(S), S = "bzip or not bzip", codeword by codeword, comparing each against 1a 0b; the two occurrences of bzip are reported]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

[Figure: the m × n matrix M for T = california and P = for - row 1 has a single 1 at column 5 (the only 'f'), row 2 at column 6, row 3 at column 7, so column 7 certifies the unique occurrence of "for"]

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






 We want to exploit the bit-parallelism to
 compute the j-th column of M from the (j−1)-th one
 We define the m-length binary vector U(x) for each
 character x in the alphabet. U(x) is set to 1 for the
 positions in P where character x appears.

 Example: P = abaac
 U(a) = (1,0,1,1,0)ᵀ    U(b) = (0,1,0,0,0)ᵀ    U(c) = (0,0,0,0,1)ᵀ

How to construct M



Initialize column 0 of M to all zeros
For j > 0, the j-th column is obtained by

   M(j) = BitShift( M(j−1) ) & U( T[j] )

For i > 1, entry M(i,j) = 1 iff
 (1) the first i−1 characters of P match the i−1 characters of T
     ending at character j−1   ⇔   M(i−1, j−1) = 1
 (2) P[i] = T[j]   ⇔   the i-th bit of U(T[j]) = 1

BitShift moves bit M(i−1, j−1) into the i-th position;
ANDing it with the i-th bit of U(T[j]) establishes whether both conditions hold.

An example (P = abaac, T = xabxabaaca)

[Figure: the columns M(1), M(2), M(3), …, M(9) computed one character of T at a time via M(j) = BitShift(M(j−1)) & U(T[j]); at j = 9 the 5th bit of the column is 1, certifying the occurrence abaac = T[5..9]]

Shift-And method: Complexity








 If m ≤ w, any column and vector U() fit in a memory word.
   ⇒ any step requires O(1) time.
 If m > w, any column and vector U() can be divided into m/w memory words.
   ⇒ any step requires O(m/w) time.
 Overall O( n(1 + m/w) + m ) time.
 Thus, it is very fast when the pattern length is close to the word size,
 which is very often the case in practice. Recall that w = 64 bits in
 modern architectures.
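A self-contained C sketch of the Shift-And scan for m ≤ w = 64, run on the example above. In this word layout (bit i of the word is row i+1 of the column) the slide's BitShift becomes a left shift with the lowest bit set; names and output are illustrative.

  #include <stdio.h>
  #include <stdint.h>
  #include <string.h>

  void shift_and(const char *P, const char *T) {
      int m = strlen(P), n = strlen(T);
      uint64_t U[256] = {0}, M = 0;
      for (int i = 0; i < m; i++)               /* U(x): bit i set iff P[i] == x */
          U[(unsigned char)P[i]] |= 1ULL << i;
      for (int j = 0; j < n; j++) {
          M = ((M << 1) | 1ULL) & U[(unsigned char)T[j]];   /* BitShift + AND   */
          if (M & (1ULL << (m - 1)))            /* last row set: P ends at T[j]  */
              printf("occurrence ending at position %d\n", j + 1);
      }
  }

  int main(void) {
      shift_and("abaac", "xabxabaaca");         /* prints: ... position 9        */
      return 0;
  }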

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac

 U(a) = (1,0,1,1,0)ᵀ    U(b) = (1,1,0,0,0)ᵀ    U(c) = (0,0,0,0,1)ᵀ

What about ‘?’, ‘[^…]’ (not)?

Problem 1: Another solution
Dictionary = { bzip, not, or, space }      P = bzip = 1a 0b

[Figure: the pattern codeword 1a 0b is matched directly against the codewords of C(S), S = "bzip or not bzip"; each codeword-aligned comparison answers yes/no]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

 S is the concatenation of the patterns in P
 R is a bitmap of length m
   R[i] = 1 iff S[i] is the first symbol of a pattern
 Use a variant of the Shift-And method searching for S
   For any symbol c, U’(c) = U(c) AND R
     U’(c)[i] = 1 iff S[i] = c and it is the first symbol of a pattern
   For any step j,
     compute M(j)
     then OR it with U’(T[j]). Why?
       It sets to 1 the first bit of each pattern that starts with T[j]
     Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

Dictionary = { bzip, not, or, space }      P = bot,  k = 2

[Figure: the dictionary terms within k = 2 mismatches of "bot" are identified, and their codewords are searched in C(S), S = "bzip or not bzip"]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

M^l(i,j) = 1 iff there are at most l mismatches between the
first i characters of P and the i characters of T ending at character j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing M^l: case 1

 The first i−1 characters of P match a substring of T ending at j−1,
 with at most l mismatches, and the next pair of characters in P and T are equal:

   BitShift( M^l(j−1) ) & U( T[j] )

Computing M^l: case 2

 The first i−1 characters of P match a substring of T ending at j−1,
 with at most l−1 mismatches:

   BitShift( M^(l−1)(j−1) )

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute M^l(j), we observe that there is a match iff

   M^l(j) = [ BitShift( M^l(j−1) ) & U(T[j]) ]  OR  BitShift( M^(l−1)(j−1) )

Example M1
T = xabxabaaca,   P = abaad

[Figure: the 5×10 matrices M^0 (exact prefix matches) and M^1 (at most one mismatch); e.g. M^1(5,9) = 1, because P = abaad matches T[5..9] = abaac with a single mismatch]

How much do we pay?





 The running time is O( k n (1 + m/w) ).
 Again, the method is practically efficient for small m.
 Only O(k) columns of M are needed at any given time;
 hence, the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

Dictionary = { bzip, not, or, space }      P = bot,  k = 2

[Figure: "not" is within 2 mismatches of bot, so its codeword not = 1g 0g 0a is searched in C(S), S = "bzip or not bzip", and its occurrence is reported]

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







 Insertion: insert a symbol in p
 Deletion: delete a symbol from p
 Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding

   γ(x) = 000.........0  followed by  x in binary
          (Length−1 zeroes)            (Length bits)

 x > 0 and Length = ⌊log2 x⌋ + 1
 e.g., 9 is represented as <000, 1001>.

 γ-code for x takes 2⌊log2 x⌋ + 1 bits
 (i.e. a factor of 2 from optimal)

 Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

→ 8, 6, 3, 59, 7
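A small C sketch of γ-encoding and γ-decoding; bits are handled as characters for clarity, and the function names are illustrative. Decoding the exercise string reproduces the sequence above.

  #include <stdio.h>

  static char *gamma_encode(unsigned x, char *out) {     /* requires x > 0      */
      int L = 0;
      for (unsigned t = x; t > 0; t >>= 1) L++;           /* L = floor(log2 x)+1 */
      for (int i = 0; i < L - 1; i++) *out++ = '0';       /* unary part          */
      for (int i = L - 1; i >= 0; i--)                    /* binary, MSB first   */
          *out++ = ((x >> i) & 1) ? '1' : '0';
      return out;
  }

  static const char *gamma_decode(const char *in, unsigned *x) {
      int L = 1;
      while (*in == '0') { L++; in++; }                   /* count the zeroes    */
      *x = 0;
      for (int i = 0; i < L; i++) *x = (*x << 1) | (unsigned)(*in++ - '0');
      return in;
  }

  int main(void) {
      char buf[64], *end = gamma_encode(9, buf); *end = '\0';
      printf("gamma(9) = %s\n", buf);                     /* prints 0001001      */
      const char *p = "0001000001100110000011101100111";
      while (*p) { unsigned v; p = gamma_decode(p, &v); printf("%u ", v); }
      printf("\n");                                       /* prints 8 6 3 59 7   */
      return 0;
  }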

Analysis
Sort the pi in decreasing order, and encode si via
the variable-length code γ(i).

Recall that: |γ(i)| ≤ 2·log2 i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2·H0(S) + 1
Key fact:
 1 ≥ ∑_{i=1,...,x} pi ≥ x·px   ⇒   x ≤ 1/px

How good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log2 i + 1
The cost of the encoding is (recall i ≤ 1/pi):

 ∑_{i=1,...,|S|} pi · |γ(i)|  ≤  ∑_{i=1,...,|S|} pi · [ 2·log2(1/pi) + 1 ]  =  2·H0(X) + 1

Not much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2

 A new concept: Continuers vs Stoppers
   Previously we used: s = c = 128
 The main idea is:
   s + c = 256 (we are playing with 8 bits)
   Thus s items are encoded with 1 byte
   and s·c with 2 bytes, s·c² on 3 bytes, ...

An example




 5000 distinct words
 ETDC encodes 128 + 128² = 16512 words on at most 2 bytes
 The (230,26)-dense code encodes 230 + 230·26 = 6210 words on at most 2 bytes,
 hence more words on 1 byte, and thus it wins if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

 Exploit temporal locality, and it is dynamic
   X = 1ⁿ 2ⁿ 3ⁿ … nⁿ   ⇒   Huff = O(n² log n),  MTF = O(n log n) + n²

 Not much worse than Huffman
 ... but it may be far better
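A minimal C sketch of MTF over a byte alphabet, keeping the list as a plain array (a search tree or hash table, as discussed later, would make each step logarithmic); input and printed output are illustrative.

  #include <stdio.h>
  #include <string.h>

  void mtf_encode(const unsigned char *in, int n, int *out) {
      unsigned char L[256];
      for (int i = 0; i < 256; i++) L[i] = (unsigned char)i;  /* L = [0,1,2,...] */
      for (int k = 0; k < n; k++) {
          int pos = 0;
          while (L[pos] != in[k]) pos++;       /* position of the symbol in L    */
          out[k] = pos;
          memmove(L + 1, L, pos);              /* shift the prefix down by one   */
          L[0] = in[k];                        /* and move the symbol to front   */
      }
  }

  int main(void) {
      const unsigned char s[] = "aaabbbaaa";
      int out[9];
      mtf_encode(s, 9, out);
      for (int i = 0; i < 9; i++) printf("%d ", out[i]); /* 97 0 0 98 0 0 1 0 0 */
      printf("\n");
      return 0;
  }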

MTF: how good is it ?
Encode the integers via γ-coding:  |γ(i)| ≤ 2·log2 i + 1
Put the alphabet S in front and consider the cost of encoding
(p_{x,i} is the position of the i-th occurrence of symbol x, which occurs n_x times):

 O(|S| log |S|)  +  ∑_{x=1,...,|S|}  ∑_{i=2,...,n_x}  | γ( p_{x,i} − p_{x,i−1} ) |

By Jensen’s inequality:

 ≤  O(|S| log |S|)  +  ∑_{x=1,...,|S|}  n_x · [ 2·log2( N / n_x ) + 1 ]
 =  O(|S| log |S|)  +  N · [ 2·H0(X) + 1 ]

 ⇒  La[mtf]  ≤  2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




 Exploit spatial locality, and it is a dynamic code
 There is a memory
   X = 1ⁿ 2ⁿ 3ⁿ … nⁿ   ⇒   Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)
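A tiny C sketch of run-length encoding on the slide's example; the (symbol, run length) output format mirrors the pairs shown above.

  #include <stdio.h>

  void rle(const char *s) {
      for (int i = 0; s[i]; ) {
          int j = i;
          while (s[j] == s[i]) j++;            /* extend the current run */
          printf("(%c,%d) ", s[i], j - i);
          i = j;
      }
      printf("\n");
  }

  int main(void) {
      rle("abbbaacccca");      /* prints (a,1) (b,3) (a,2) (c,4) (a,1) */
      return 0;
  }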

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.  p(a) = .2,  p(b) = .5,  p(c) = .3

   f(i) = ∑_{j=1,...,i−1} p(j)

   f(a) = .0,  f(b) = .2,  f(c) = .7

[Figure: the unit interval [0,1) partitioned as a = [0,.2), b = [.2,.7), c = [.7,1)]

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
[Figure: coding bac - the symbol b narrows [0,1) to [.2,.7); a narrows it to [.2,.3); c narrows it to [.27,.3)]

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c1 … cn with probabilities p[c] use the following:

 l0 = 0      li = li−1 + si−1 · f[ci]
 s0 = 1      si = si−1 · p[ci]

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

 sn = ∏_{i=1,...,n} p[ci]

The interval for a message sequence will be called the
sequence interval
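A floating-point C sketch of exactly this recurrence on the running example; real coders use the integer/scaling version described later to avoid precision loss, and the symbol table here is the illustrative one above.

  #include <stdio.h>

  int main(void) {
      /* symbol intervals: a = [0,.2), b = [.2,.7), c = [.7,1) */
      double p[3] = {0.2, 0.5, 0.3};
      double f[3] = {0.0, 0.2, 0.7};           /* cumulative prob, symbol excluded */
      const char *msg = "bac";
      double l = 0.0, s = 1.0;                 /* l0 = 0, s0 = 1                   */
      for (int i = 0; msg[i]; i++) {
          int c = msg[i] - 'a';
          l = l + s * f[c];                    /* l_i = l_{i-1} + s_{i-1} * f[c_i] */
          s = s * p[c];                        /* s_i = s_{i-1} * p[c_i]           */
      }
      printf("sequence interval = [%.3f, %.3f)\n", l, l + s);  /* [0.270, 0.300)  */
      return 0;
  }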

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
 .49 lies in b = [.2,.7)                  ⇒ 1st symbol is b
 within [.2,.7), .49 lies in [.3,.55)     ⇒ 2nd symbol is b
 within [.3,.55), .49 lies in [.475,.55)  ⇒ 3rd symbol is c

The message is bbc.
Representing a real number
Binary fractional representation:

 .75 = .11      1/3 = .010101…      11/16 = .1011

Algorithm (emit the binary expansion of x ∈ [0,1)):
 1.  x = 2·x
 2.  if x < 1 output 0
 3.  else x = x − 1; output 1

So how about just using the shortest binary fractional
representation in the sequence interval?
 e.g.  [0,.33) → .01      [.33,.66) → .1      [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
          min       max       interval
 .11      .110…     .111…     [.75, 1.0)
 .101     .1010…    .1011…    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

 1 + ⌈ log2 (1/s) ⌉  =  1 + ⌈ log2 ∏_{i=1,...,n} (1/pi) ⌉
                     ≤  2 + ∑_{i=1,...,n} log2 (1/pi)
                     =  2 + ∑_{k=1,...,|S|} n·pk · log2 (1/pk)
                     =  2 + n·H0    bits

nH0 + 0.02·n bits in practice, because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)

 If l ≥ R/2 then (top half)
   Output 1 followed by m 0s;  m = 0
   Message interval is expanded by 2

 If u < R/2 then (bottom half)
   Output 0 followed by m 1s;  m = 0
   Message interval is expanded by 2

 If l ≥ R/4 and u < 3R/4 then (middle half)
   Increment m
   Message interval is expanded by 2

 In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts           String = ACCBACCACBA (next symbol: B),  k = 2

 Context Empty:   A = 4,  B = 2,  C = 5,  $ = 3

 Context A:   C = 3,  $ = 1
 Context B:   A = 2,  $ = 1
 Context C:   A = 1,  B = 2,  C = 2,  $ = 3

 Context AC:  B = 1,  C = 2,  $ = 2
 Context BA:  C = 1,  $ = 1
 Context CA:  C = 1,  $ = 1
 Context CB:  A = 2,  $ = 1
 Context CC:  A = 1,  B = 1,  $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
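A small C sketch of LZ77 decoding for (d, len, c) triples, reproducing the overlap example above; the four leading literal triples are a hypothetical way of seeding the output with "abcd".

  #include <stdio.h>

  struct triple { int d, len; char c; };

  int main(void) {
      struct triple code[] = { {0,0,'a'}, {0,0,'b'}, {0,0,'c'}, {0,0,'d'},
                               {2,9,'e'} };          /* seen = abcd, then (2,9,e) */
      char out[64]; int cursor = 0;
      for (int t = 0; t < 5; t++) {
          for (int i = 0; i < code[t].len; i++, cursor++)
              out[cursor] = out[cursor - code[t].d]; /* copy, possibly overlapping */
          out[cursor++] = code[t].c;                 /* the extra character        */
      }
      out[cursor] = '\0';
      printf("%s\n", out);        /* prints abcdcdcdcdcdce */
      return 0;
  }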

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
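A self-contained C sketch of InvertBWT on the running example. The LF-array is realized here by a counting pass (C[c] = number of characters smaller than c, plus the stable rank among equal characters), which is one standard way to compute it; the reconstruction then walks backward from row 0, the row starting with the sentinel '#'.

  #include <stdio.h>
  #include <string.h>

  int main(void) {
      const char L[] = "ipssm#pissii";          /* BWT of mississippi#           */
      int n = strlen(L);
      int count[256] = {0}, C[256] = {0}, seen[256] = {0}, LF[64];
      char T[64];

      for (int i = 0; i < n; i++) count[(unsigned char)L[i]]++;
      for (int c = 1; c < 256; c++)             /* C[c] = #chars smaller than c  */
          C[c] = C[c-1] + count[c-1];
      for (int i = 0; i < n; i++) {             /* LF[i]: position of L[i] in F  */
          int c = (unsigned char)L[i];
          LF[i] = C[c] + seen[c]++;
      }

      int r = 0;                                /* row 0 starts with '#'          */
      T[n] = '\0';
      T[n-1] = '#';                             /* the sentinel ends T            */
      for (int i = n - 2; i >= 0; i--) {        /* T[i] = L[r]; r = LF[r]         */
          T[i] = L[r];
          r = LF[r];
      }
      printf("%s\n", T);                        /* prints mississippi#            */
      return 0;
  }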

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]
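A minimal sketch of that identity with 0-based arrays (the row that starts at position 0 wraps around to the final sentinel); the SA values are the ones from the slide.

  #include <stdio.h>

  int main(void) {
      const char *T = "mississippi#";
      int SA[12] = {11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2};
      char L[13];
      for (int i = 0; i < 12; i++)
          L[i] = (SA[i] == 0) ? T[11] : T[SA[i] - 1];   /* L[i] = T[SA[i]-1] */
      L[12] = '\0';
      printf("%s\n", L);          /* prints ipssm#pissii */
      return 0;
  }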

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pr that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution

Altavista crawl, 1999 / WebBase crawl, 2001
Indegree follows a power-law distribution:

 Pr[ in-degree(u) = k ]  ∝  1 / k^α ,   with α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pr that a node has x links is 1/x^α, α ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides and efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
             Emacs size   Emacs time
  uncompr    27Mb         ---
  gzip       8Mb          35 secs
  zdelta     1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: an example weighted graph GF with a dummy node; the min branching selects, for each file, its cheapest compression reference (possibly the dummy node, i.e. plain gzip)]

             space   time
  uncompr    30Mb    ---
  tgz        20%     linear
  THIS       8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F, thus saving
zdelta executions. Nonetheless, strictly n² time.

             space    time
  uncompr    260Mb    ---
  tgz        12%      2 mins
  THIS       8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

             gcc size   emacs size
  total      27288      27326
  gzip       7563       8577
  zdelta     227        1431
  rsync      964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), and the client checks them.
Server deploys the common fref to compress the new ftar (rsync compresses just it).

A multi-round protocol

 k blocks of n/k elems
 log(n/k) levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits.

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Figure: the suffix tree of T# = mississippi# with 12 leaves, one per suffix; edge labels include #, i, s, si, ssi, p, pi#, ppi#, mississippi#, and each leaf stores the starting position of its suffix]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
5

Θ(N²) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
⇒ In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
⇒ overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
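A C sketch of the indirected binary search over the suffix array of the running example; SA holds 0-based suffix starting positions (the slide's values shifted by one), and the printed positions are 1-based to match the slides.

  #include <stdio.h>
  #include <string.h>

  int main(void) {
      const char *T = "mississippi#";
      int SA[12] = {11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2};  /* sorted suffixes */
      const char *P = "si";
      int p = strlen(P), n = 12;

      int lo = 0, hi = n;                      /* first suffix >= P              */
      while (lo < hi) {
          int mid = (lo + hi) / 2;
          if (strncmp(T + SA[mid], P, p) < 0) lo = mid + 1;   /* O(p) per step  */
          else hi = mid;
      }
      /* the suffixes prefixed by P form a contiguous range starting at lo      */
      for (int i = lo; i < n && strncmp(T + SA[i], P, p) == 0; i++)
          printf("occurrence at position %d\n", SA[i] + 1);   /* prints 7 and 4 */
      return 0;
  }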

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does it exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does it exist a substring of length ≥ L occurring ≥ C times ?
• Search for Lcp[i,i+C-2] whose entries are ≥ L


Slide 13

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables <X,C>
For each item s of the stream,
    if (X == s) then C++
    else { C--; if (C == 0) { X = s; C = 1; } }
Return X;

Proof sketch
If X ≠ y at the end, then every one of y's occurrences has a "negative" mate (a distinct item that cancelled it), so the mates are ≥ #occ(y) and N ≥ 2 · #occ(y) — contradicting #occ(y) > N/2.
(The algorithm may return a wrong item if no element occurs > N/2 times.)
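A minimal Python sketch (not from the slides) of the streaming scan above (the majority-vote idea); the returned candidate is guaranteed correct only when some item really occurs more than N/2 times.

def majority_candidate(stream):
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1
        elif X == s:
            C += 1
        else:
            C -= 1
    return X

A = list("bacccdcbaaaccbccc")        # the stream of the slide
print(majority_candidate(A))         # 'c' (9 occurrences out of 17)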

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 × 10^9 chars, size = 6GB
n = 10^6 documents
TotT = 10^9 term occurrences (avg term length is 6 chars)
t = 5 × 10^5 distinct terms

What kind of data structure should we build to support word-based searches?

Solution 1: Term-Doc matrix
n = 1 million documents, t = 500K terms

            Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony             1               1             0          0       0        1
Brutus             1               1             0          1       0        0
Caesar             1               1             0          1       1        1
Calpurnia          0               1             0          0       0        0
Cleopatra          1               0             0          0       0        0
mercy              1               0             1          1       1        1
worser             1               0             1          1       1        0

Entry is 1 if the play contains the word, 0 otherwise.
Space is 500Gb !

Solution 2: Inverted index
[Figure: each term (Brutus, Calpurnia, Caesar) points to its sorted posting list of doc-ids, e.g. 2 4 8 16 32 64 128, and 3 5 8 13 21 34, and 1 13 16.]
We can still do better: i.e. 30–50% of the original text

1. Typically about 12 bytes per posting
2. We have 10^9 total term occurrences ⇒ at least 12GB of space
3. Compressing the 6GB of documents gets 1.5GB of data
A better index, but it is still >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

Σ_{i=1}^{n−1} 2^i = 2^n − 2  <  2^n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i(s) = log₂ (1 / p(s)) = − log₂ p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H(S) = Σ_{s∈S} p(s) · log₂ (1 / p(s))   bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
[Binary trie: a = 0; b = 100, c = 101, d = 11.]

Average Length
For a code C with codeword length L[s], the
average length is defined as
La(C) = Σ_{s∈S} p(s) · L[s]

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H(S) ≤ La(C)
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La(C) ≤ H(S) + 1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
[Huffman tree: merge a(.1) + b(.2) = (.3); then (.3) + c(.2) = (.5); then (.5) + d(.5) = (1).]
a=000, b=001, c=01, d=1
There are 2^(n−1) “equivalent” Huffman trees

What about ties (and thus, tree depth) ?
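A minimal Python sketch (not from the slides) of Huffman's construction with a heap; ties are broken by an arbitrary counter, so the tree may differ from the one above, but the codeword lengths come out the same (3, 3, 2, 1, as in a=000, b=001, c=01, d=1).

import heapq
from itertools import count

def huffman_code(probs):
    tiebreak = count()                        # avoids comparing dicts on equal probabilities
    heap = [(p, next(tiebreak), {s: ""}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)       # the two least probable trees...
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))   # ...are merged
    return heap[0][2]

print(huffman_code({"a": .1, "b": .2, "c": .2, "d": .5}))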

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
[Tree of the running example.]
abc…  →  000 001 01  =  00000101
101001…  →  d c b

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store, for any level L:
  firstcode[L]  (the codeword 00…0 assigned to the first symbol of level L)
  Symbol[L,i], for each i in level L
This takes ≤ h² + |S| log |S| bits

Canonical Huffman — Encoding: levels 1, 2, 3, 4, 5

Canonical Huffman — Decoding:
  firstcode[1]=2, firstcode[2]=1, firstcode[3]=1, firstcode[4]=2, firstcode[5]=0
  T = …00010…

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
− log₂(.999) ≈ .00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:
  The model takes |S|^k · (k · log₂|S|) + h² bits (where h might be |S|)
  It is H0(S^L) ≤ L · Hk(S) + O(k · log₂|S|), for each k ≤ L

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
[Figure: word-based tagged Huffman for T = “bzip or not bzip”. The tree has fan-out 128; each word (bzip, or, not, space) gets a byte-aligned codeword with 7 coding bits plus a tag bit in the first byte, and C(T) is the concatenation of these codewords.]

CGrep and other ideas...
P= bzip = 1a 0b

[Figure: GREP over C(T) for T = “bzip or not bzip”: the codeword of P = bzip (1a 0b) is searched directly, codeword by codeword, in C(T); byte alignment and tagging rule out false matches — yes/no per word.]
Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary: bzip, not, or, space.   P = bzip = 1a 0b.   S = “bzip or not bzip”
[Figure: scan C(S) codeword by codeword and compare each against the codeword of P — yes/no per word.]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
[Figure: the pattern P aligned at its occurrences in the text T (letters A B C A B D A B).]

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H(s) = Σ_{i=1}^{m} 2^{m−i} · s[i]

P = 0101
H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H(T_r) = 2 · H(T_{r−1}) − 2^m · T(r−1) + T(r+m−1)

T = 10110101
T1 = 1011,  T2 = 0110
H(T1) = H(1011) = 11
H(T2) = 2·11 − 2⁴·1 + 0 = 22 − 16 = 6 = H(0110)

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
  1·2 (mod 7) + 0 = 2
  2·2 (mod 7) + 1 = 5
  5·2 (mod 7) + 1 = 4
  4·2 (mod 7) + 1 = 2
  2·2 (mod 7) + 1 = 5
  5 (mod 7) = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(Tr−1):
  2^m (mod q) = 2 · (2^{m−1} (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
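A minimal Python sketch (not from the slides) of the whole scheme over a binary text; here q is a fixed prime, whereas the algorithm above picks it at random, and every fingerprint hit is verified so the output is exact.

def karp_rabin(T, P, q=2**31 - 1):
    n, m = len(T), len(P)
    if m > n:
        return []
    pow_m = pow(2, m, q)                      # 2^m mod q, used to drop the leading bit
    hP = hT = 0
    for i in range(m):
        hP = (2 * hP + P[i]) % q
        hT = (2 * hT + T[i]) % q
    matches = []
    for r in range(n - m + 1):
        if hP == hT and T[r:r + m] == P:      # verification step
            matches.append(r)
        if r + m < n:                         # Hq(T_{r+1}) from Hq(T_r)
            hT = (2 * hT - pow_m * T[r] + T[r + m]) % q
    return matches

print(karp_rabin([1, 0, 1, 1, 0, 1, 0, 1], [0, 1, 0, 1]))   # [4] (0-based; position 5 of the slides)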

Problem 1: Solution
Dictionary: bzip, not, or, space.   P = bzip = 1a 0b.   S = “bzip or not bzip”
[Figure: the scan of C(S) again, comparing whole codewords: only the two codewords of bzip answer yes.]

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

[Figure: the m×n matrix M for T = california, P = for. All entries are 0 except those of the single occurrence of P, namely M(1,5) = M(2,6) = M(3,7) = 1.]
How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to compute the j-th column of M from the (j−1)-th one.
We define the m-length binary vector U(x) for each character x in the alphabet: U(x) is set to 1 for the positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M(j) = BitShift( M(j−1) ) & U( T[j] )


For i > 1, entry M(i,j) = 1 iff
(1) the first i−1 characters of P match the i−1 characters of T ending at character j−1, i.e. M(i−1,j−1) = 1;
(2) P[i] = T[j], i.e. the i-th bit of U(T[j]) = 1.

BitShift moves bit M(i−1,j−1) into the i-th position; ANDing with the i-th bit of U(T[j]) establishes whether both conditions hold.
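A minimal Python sketch (not from the slides) of the construction above, packing each column of M into a Python integer (bit i−1 stands for row i):

def shift_and(T, P):
    m = len(P)
    U = {}                                    # U[c]: bit i-1 set iff P[i] == c
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M, last, occ = 0, 1 << (m - 1), []
    for j, c in enumerate(T):
        M = ((M << 1) | 1) & U.get(c, 0)      # M(j) = BitShift(M(j-1)) & U(T[j])
        if M & last:                          # row m is set: an occurrence ends at j
            occ.append(j - m + 2)             # 1-based starting position
    return occ

print(shift_and("xabxabaaca", "abaac"))       # [5]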

[Worked examples (j = 1, 2, 3, …, 9) for T = xabxabaaca and P = abaac: each column M(j) is obtained as BitShift(M(j−1)) & U(T[j]); at j = 9 the last bit of the column becomes 1, signalling the occurrence of P ending at position 9.]

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
U(a) = (1,0,1,1,0)ᵀ    U(b) = (1,1,0,0,0)ᵀ    U(c) = (0,0,0,0,1)ᵀ

What about ‘?’, ‘[^…]’ (not)?

Problem 1: Another solution
Dictionary: bzip, not, or, space.   P = bzip = 1a 0b.   S = “bzip or not bzip”
[Figure: as before, C(S) is scanned and each codeword is matched against the codeword of P.]
Speed ≈ Compression ratio

Problem 2
Dictionary: bzip, not, or, space.   S = “bzip or not bzip”
Given a pattern P, find all the occurrences in S of all terms containing P as a substring.  Example: P = o; the matching terms are not = 1g 0g 0a and or = 1g 0a 0b.
[Figure: the codewords of the matching terms are then searched in C(S).]
Speed ≈ Compression ratio? No! Why? A scan of C(S) is needed for each term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: the patterns P1 and P2 aligned at their occurrences in the text T.]

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of length m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U′(c) = U(c) AND R
 ⇒ U′(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
For any step j,
  compute M(j)
  then M(j) OR U′(T[j]). Why?
  ⇒ it sets to 1 the first bit of each pattern that starts with T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary: bzip, not, or, space.   S = “bzip or not bzip”
Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing at most k mismatches.  Example: P = bot, k = 2.
[Figure: the compressed text C(S) and the word codewords, as in the previous problems.]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

M^l(i,j) = 1 iff the first i characters of P match the i characters of T ending at character j with no more than l mismatches.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing M^l: case 1

The first i−1 characters of P match a substring of T ending at j−1 with at most l mismatches, and the next pair of characters in P and T are equal:

  BitShift( M^l(j−1) ) & U( T[j] )

Computing M^l: case 2

The first i−1 characters of P match a substring of T ending at j−1 with at most l−1 mismatches (the next pair may then mismatch):

  BitShift( M^{l−1}(j−1) )

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example: T = xabxabaaca, P = abaad, k = 1.
[Table: the matrices M^0 and M^1 (5 rows × 10 columns). M^0 marks the exact partial matches only; in M^1 the bottom-row entry M^1(5,9) = 1, i.e. P occurs ending at position 9 with one mismatch (T[5,9] = abaac vs P = abaad).]

How much do we pay?

The running time is O(k·n·(1 + m/w)).
Again, the method is practically efficient for small m.
Only O(k) columns of M are needed at any given time, hence the space used by the algorithm is O(k) memory words.
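A minimal Python sketch (not from the slides) of the same recurrence, keeping one integer column per allowed number of mismatches l = 0…k:

def shift_and_k_mismatches(T, P, k):
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M = [0] * (k + 1)
    last, occ = 1 << (m - 1), []
    for j, c in enumerate(T):
        prev = M[:]                                        # the columns M^l(j-1)
        for l in range(k + 1):
            M[l] = ((prev[l] << 1) | 1) & U.get(c, 0)      # case 1
            if l > 0:
                M[l] |= (prev[l - 1] << 1) | 1             # case 2
            if M[l] & last:
                occ.append((j - m + 2, l))                 # (1-based start, mismatches allowed)
                break
    return occ

print(shift_and_k_mismatches("xabxabaaca", "abaad", 1))    # [(5, 1)], as in the example above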

Problem 3: Solution
Dictionary: bzip, not, or, space.   S = “bzip or not bzip”.   P = bot, k = 2.
[Figure: the k-mismatch search is run over the dictionary terms; not = 1g 0g 0a matches P = bot within 2 mismatches, and its codeword is then searched in C(S).]

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding
  γ(x) = 00…0 (Length−1 zeros) followed by x in binary,
  where x > 0 and Length = ⌊log₂ x⌋ + 1
  e.g., 9 is represented as <000,1001>.

  The γ-code for x takes 2⌊log₂ x⌋ + 1 bits (i.e. a factor of 2 from optimal).

  It is optimal for Pr(x) = 1/(2x²), and i.i.d. integers.

It is a prefix-free encoding…


Given the following sequence of γ-coded integers, reconstruct the original sequence:

0001000001100110000011101100111

→ 8, 6, 3, 59, 7
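A minimal Python sketch (not from the slides) of γ-encoding and of the decoder used in the exercise above:

def gamma_encode(x):                  # x > 0
    b = bin(x)[2:]                    # x in binary, Length = len(b)
    return "0" * (len(b) - 1) + b     # Length-1 zeros, then x in binary

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":         # unary part: Length-1 zeros
            zeros += 1
            i += 1
        out.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return out

print(gamma_encode(9))                                     # 0001001
print(gamma_decode("0001000001100110000011101100111"))     # [8, 6, 3, 59, 7]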

Analysis
Sort the p_i in decreasing order, and encode s_i via the variable-length code γ(i).

Recall that: |γ(i)| ≤ 2 · log₂ i + 1
How good is this approach wrt Huffman?  Compression ratio ≤ 2 · H0(S) + 1
Key fact:  1 ≥ Σ_{i=1,…,x} p_i ≥ x · p_x  ⇒  x ≤ 1/p_x

How good is it?
Encode the integers via γ-coding: |γ(i)| ≤ 2 · log₂ i + 1
The cost of the encoding is (recall i ≤ 1/p_i):

  Σ_{i=1,…,|S|} p_i · |γ(i)|  ≤  Σ_{i=1,…,|S|} p_i · [2 · log₂(1/p_i) + 1]  =  2 · H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + …

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte,
s·c with 2 bytes, s·c² with 3 bytes, ...

An example
  5000 distinct words
  ETDC encodes 128 + 128² = 16512 words on ≤ 2 bytes
  A (230,26)-dense code encodes 230 + 230·26 = 6210 words on ≤ 2 bytes, hence more words on 1 byte — a win if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1ⁿ 2ⁿ 3ⁿ … nⁿ  ⇒  Huff = Θ(n² log n) bits, MTF = O(n log n) + n² bits

Not much worse than Huffman... but it may be far better
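A minimal Python sketch (not from the slides) of the MTF transform; positions are 1-based as in the slides, and the output integers would then be fed to a variable-length coder such as γ:

def mtf_encode(text, alphabet):
    L = list(alphabet)
    out = []
    for s in text:
        pos = L.index(s) + 1      # 1) output the position of s in L
        out.append(pos)
        L.pop(pos - 1)            # 2) move s to the front of L
        L.insert(0, s)
    return out

print(mtf_encode("aaabbbccc", "abc"))   # [1, 1, 1, 2, 1, 1, 3, 1, 1] — runs become small integers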

MTF: how good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2 · log₂ i + 1
Put the alphabet S at the front, and consider the cost of encoding the gaps between consecutive occurrences of each symbol x (n_x = #occurrences of x, N = sequence length):

  O(|S| log |S|)  +  Σ_{x=1,…,|S|}  Σ_{i=2,…,n_x}  |γ( p_{x,i} − p_{x,i−1} )|

By Jensen's inequality this is at most:

  O(|S| log |S|)  +  Σ_{x=1,…,|S|} n_x · [ 2 · log₂(N/n_x) + 1 ]
  =  O(|S| log |S|)  +  N · [ 2 · H0(X) + 1 ]

⇒  La[mtf] ≤ 2 · H0(X) + O(1) bits per symbol

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1n 2n 3n… nn 

There is a memory

Huff(X) = Q(n2 log n) > Rle(X) = Q(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.  p(a) = .2, p(b) = .5, p(c) = .3

  f(i) = Σ_{j=1}^{i−1} p(j)   ⇒   f(a) = .0, f(b) = .2, f(c) = .7

[Figure: the unit interval [0,1) split into a = [0,.2), b = [.2,.7), c = [.7,1).]

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
[Figure: b narrows [0,1) to [.2,.7); then a narrows it to [.2,.3); then c narrows it to [.27,.3).]
The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
  l_0 = 0      l_i = l_{i−1} + s_{i−1} · f(c_i)
  s_0 = 1      s_i = s_{i−1} · p(c_i)

f(c) is the cumulative prob. up to symbol c (not included).
Final interval size is  s_n = Π_{i=1}^{n} p(c_i)
The interval for a message sequence will be called the
sequence interval
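A minimal Python sketch (not from the slides) that applies the two recurrences above to the running distribution p(a)=.2, p(b)=.5, p(c)=.3:

def sequence_interval(msg, p):
    f, acc = {}, 0.0
    for s in sorted(p):               # f[c] = cumulative probability before c
        f[s] = acc
        acc += p[s]
    l, size = 0.0, 1.0                # l_0 = 0, s_0 = 1
    for c in msg:
        l = l + size * f[c]           # l_i = l_{i-1} + s_{i-1} * f(c_i)
        size = size * p[c]            # s_i = s_{i-1} * p(c_i)
    return l, l + size

print(sequence_interval("bac", {"a": .2, "b": .5, "c": .3}))   # ≈ (0.27, 0.30), i.e. the interval [.27,.3)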

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
[Figure: .49 falls in b = [.2,.7) → output b; rescaling, .49 falls in b again → output b; rescaling once more, it falls in c → output c.]
The message is bbc.

Representing a real number
Binary fractional representation:
  .75 = .11      1/3 = .0101…      11/16 = .1011

Algorithm (emit the binary fractional representation of x ∈ [0,1)):
  1. x = 2·x
  2. if x < 1 output 0
  3. else x = x − 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
  number   min     max     interval
  .11      .110    .111    [.75, 1.0)
  .101     .1010   .1011   [.625, .75)
We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

[Figure: sequence interval [.61, .79); the code interval of .101, i.e. [.625, .75), is contained in it.]

Can use L + s/2 truncated to 1 + log₂(1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
The problem is that operations on arbitrary-precision real numbers are expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine: the Arithmetic ToolBox (ATB) maps the current interval (L,s) and the next symbol c, with distribution (p1,…,p_|S|), to the new interval (L′,s′).
Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
[Figure: the ATB is driven by p[ s | context ], where s = c or esc; each step maps (L,s) to (L′,s′).]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts        String = ACCBACCACBA B,  k = 2

Context: Empty       Contexts of size 1         Contexts of size 2
  A = 4                A: C = 3, $ = 1            AC: B = 1, C = 2, $ = 2
  B = 2                B: A = 2, $ = 1            BA: C = 1, $ = 1
  C = 5                C: A = 1, B = 2,           CA: C = 1, $ = 1
  $ = 3                   C = 2, $ = 3            CB: A = 2, $ = 1
                                                  CC: A = 1, B = 1, $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm's step: output the triple ⟨d, len, c⟩ where
  d   = distance of the copied string wrt the current position
  len = length of the longest match
  c   = next char in the text beyond the longest match
then advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
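A minimal Python sketch (not from the slides) of the LZW encoder; the initial single-char codes (a = 112, …) mimic the toy numbering of the example below and are purely illustrative:

def lzw_encode(text, alphabet):
    D = {c: 112 + i for i, c in enumerate(alphabet)}   # toy initial dictionary
    nxt, out, S = 256, [], ""
    for c in text:
        if S + c in D:
            S = S + c                  # extend the current match
        else:
            out.append(D[S])           # emit the code of S ...
            D[S + c] = nxt             # ... and add Sc to the dictionary
            nxt += 1
            S = c
    out.append(D[S])                   # flush the last match
    return out

print(lzw_encode("aabaacababacb", "abc"))
# [112, 112, 113, 256, 114, 257, 261, 114, 113] — cf. the example below (which stops before the final flush)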

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows                                  (1994)

F                 L
#  mississipp    i
i  #mississip    p
i  ppi#missis    s
i  ssippi#mis    s
i  ssissippi#    m
m  ississippi    #    ← this row is T
p  i#mississi    p
p  pi#mississ    i
s  ippi#missi    s
s  issippi#mi    s
s  sippi#miss    i
s  sissippi#m    i

A famous example

Much
longer...

A useful tool: the L → F mapping
[The sorted-rotation matrix above, with first column F and last column L.]
How do we map L's chars onto F's chars? We need to distinguish equal chars in F...
Take two equal chars of L and rotate their rows rightward by one position: they keep the same relative order !!

The BWT is invertible
[The same rotation matrix, with columns F and L.]
Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
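A minimal Python sketch (not from the slides) of the forward transform (by explicit sorting of the rotations) and of the LF-based inversion above; it assumes T ends with a unique smallest sentinel '#':

def bwt(T):
    rot = sorted(T[i:] + T[:i] for i in range(len(T)))
    return "".join(r[-1] for r in rot)

def inverse_bwt(L):
    # LF[r] = row of F holding L[r]; equal chars keep their relative (stable) order
    order = sorted(range(len(L)), key=lambda r: (L[r], r))
    LF = [0] * len(L)
    for f_row, r in enumerate(order):
        LF[r] = f_row
    r = L.index("#")                  # the row whose last char is '#' is T itself
    out = []
    for _ in range(len(L)):
        out.append(L[r])              # emit T backward, as in InvertBWT
        r = LF[r]
    return "".join(reversed(out))

L = bwt("mississippi#")
print(L)                  # ipssm#pissii
print(inverse_bwt(L))     # mississippi#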

How to compute the BWT ?
SA    BWT matrix                 L
12    #mississipp               i
11    i#mississip               p
 8    ippi#missis               s
 5    issippi#mis               s
 2    ississippi#               m
 1    mississippi               #
10    pi#mississi               p
 9    ppi#mississ               i
 7    sippi#missi               s
 4    sissippi#mi               s
 6    ssippi#miss               i
 3    ssissippi#m               i
We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA    suffix
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#

Elegant but inefficient.     Input: T = mississippi#
Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph: V = routers, E = communication links

The “cosine” graph (undirected, weighted): V = static web pages, E = semantic distance between pages

Query-Log graph (bipartite, weighted): V = queries and URLs, E = (q,u) if u is a result for q and has been clicked by some user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pr that a node has x links is ∝ 1/x^a, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in-degree(u) = k ]  ∝  1 / k^a,   a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pr that a node has x links is ∝ 1/x^a, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list (gaps): S(x) = {s1 − x, s2 − s1 − 1, ..., sk − s_{k−1} − 1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides and efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
          Emacs size   Emacs time
uncompr   27Mb         ---
gzip      8Mb          35 secs
zdelta    1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: the weighted graph G_F over the files, plus a dummy node connected to all; edge weights are zdelta sizes (the dummy's weights are gzip sizes). The min branching picks the cheapest reference for each file.]

          space   time
uncompr   30Mb    ---
tgz       20%     linear
THIS      8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
          space   time
uncompr   260Mb   ---
tgz       12%     2 mins
THIS      8%      16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

gcc size
total
27288
gzip
7563
zdelta
227
rsync
964

emacs size
27326
8577
1431
4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level ≤ k hashes fail to find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

[Figure: P is a prefix of the suffix T[i,N].]
Occurrences of P in T = all suffixes of T having P as a prefix
  e.g. P = si, T = mississippi ⇒ occurrences at positions 4, 7
SUF(T) = sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Figure: the suffix tree of T# = mississippi#. Edges are labelled with substrings (i, s, si, ssi, p, ppi#, pi#, mississippi#, …) and the 12 leaves store the starting positions 1..12 of the suffixes.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position in SA is the lexicographic position of P.

Θ(N²) space if SUF(T) is stored explicitly:

SA    SUF(T)
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#

T = mississippi#   (each SA entry is a suffix pointer)

Suffix Array:
• SA: Θ(N log₂ N) bits
• Text T: N chars
⇒ In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison (2 accesses per step: the SA entry and the text).

[Figure: binary search on SA = 12 11 8 5 2 1 10 9 7 4 6 3 over T = mississippi# for P = si; each probe tells whether P is smaller or larger than the probed suffix.]

Suffix Array search
• O(log₂ N) binary-search steps
• Each step takes O(p) char comparisons
⇒ overall, O(p log₂ N) time
  (improvable to O(p + log₂ N) [Manber-Myers, '90]; see also [Cole et al, '06])
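A minimal Python sketch (not from the slides) of the indirect binary search: the suffix array is built naively by sorting suffixes (fine for a toy text), and two binary searches delimit the range of suffixes having P as a prefix:

def build_sa(T):
    return sorted(range(len(T)), key=lambda i: T[i:])    # simple, not the fast way

def sa_search(T, SA, P):
    lo, hi = 0, len(SA)
    while lo < hi:                                        # leftmost suffix with prefix >= P
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + len(P)] < P: lo = mid + 1
        else: hi = mid
    first, hi = lo, len(SA)
    while lo < hi:                                        # leftmost suffix with prefix > P
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + len(P)] <= P: lo = mid + 1
        else: hi = mid
    return sorted(SA[first:lo])                           # starting positions (0-based)

T = "mississippi#"
SA = build_sa(T)
print([i + 1 for i in SA])                       # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print([i + 1 for i in sa_search(T, SA, "si")])   # [4, 7]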

Locating the occurrences
[Figure: the occurrences of P = si in T = mississippi# are the contiguous SA entries lying between the lexicographic positions of si# and si$: SA[9] = 7 (sippi…) and SA[10] = 4 (sissippi…), hence occ = 2.]

Suffix Array search: O(p + log₂ N + occ) time
Suffix Trays: O(p + log₂ |S| + occ)   [Cole et al., '06]
String B-tree   [Ferragina-Grossi, '95]
Self-adjusting Suffix Arrays   [Ciriani et al., '02]

Text mining
Lcp[1,N−1] = longest-common-prefix between suffixes adjacent in SA

Lcp = 0 0 1 4 0 0 1 0 2 1 3
SA  = 12 11 8 5 2 1 10 9 7 4 6 3        T = mississippi#
(e.g. Lcp = 4 between the adjacent suffixes issippi# and ississippi#)

• How long is the common prefix between T[i,...] and T[j,...] ?
  • Min of the subarray Lcp[h,k−1] s.t. SA[h] = i and SA[k] = j.
• Does there exist a repeated substring of length ≥ L ?
  • Search for Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  • Search for a window Lcp[i,i+C−2] whose entries are all ≥ L.


Slide 14

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix          (t = 500K terms, n = 1 million docs)

            Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony             1                1             0          0       0        1
Brutus             1                1             0          1       0        0
Caesar             1                1             0          1       1        1
Calpurnia          0                1             0          0       0        0
Cleopatra          1                0             0          0       0        0
mercy              1                0             1          1       1        1
worser             1                0             1          1       1        0

Entry = 1 if the play contains the word, 0 otherwise.
Space is 500Gb !

Solution 2: Inverted index

Brutus     →  2 4 8 16 32 64 128
Calpurnia  →  1 2 3 5 8 13 21 34
Caesar     →  13 16

We can still do better, i.e. 30÷50% of the original text:
1. Typically postings use about 12 bytes
2. We have 10^9 total terms ⇒ at least 12Gb space
3. Compressing the 6Gb documents gets 1.5Gb data
A better index, but it is still >10 times the text !!!!
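A tiny sketch of building such an inverted index in memory (document names and tokenization are hypothetical; real indexes store compressed, gap-encoded posting lists on disk).

  from collections import defaultdict

  def build_inverted_index(docs):
      """docs: {doc_id: text}. Returns {term: sorted list of doc_ids containing it}."""
      index = defaultdict(set)
      for doc_id, text in docs.items():
          for term in text.lower().split():
              index[term].add(doc_id)
      return {t: sorted(ids) for t, ids in index.items()}

  docs = {1: "Brutus killed Caesar", 2: "Caesar and Calpurnia", 4: "Brutus again"}
  idx = build_inverted_index(docs)
  print(idx["brutus"])   # [1, 4]
  print(idx["caesar"])   # [1, 2]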

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2^n but we have fewer compressed messages:

  Σ_{i=1}^{n-1} 2^i = 2^n − 2  <  2^n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the self-information of s is:

  i(s) = log2 (1 / p(s)) = − log2 p(s)

Lower probability ⇒ higher information.

Entropy is the weighted average of i(s):

  H(S) = Σ_{s ∈ S} p(s) · log2 (1 / p(s))   bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the average length is defined as

  La(C) = Σ_{s ∈ S} p(s) · L[s]

We say that a prefix code C is optimal if for all prefix codes C’, La(C) ≤ La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal uniquely decodable code, there exists a prefix code with the same codeword lengths and thus the same (optimal) average length. And vice versa…

Theorem (golden rule). If C is an optimal prefix code for a source with probabilities {p1, …, pn}, then pi < pj ⇒ L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any uniquely decodable code C, we have

  H(S) ≤ La(C)

Theorem (upper bound, Shannon). For any probability distribution, there exists a prefix code C such that

  La(C) ≤ H(S) + 1

(Shannon code: the codeword of s takes ⌈log2 1/p(s)⌉ bits)

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

[Figure: Huffman tree built bottom-up: a(.1) and b(.2) merge into (.3); (.3) and c(.2) merge into (.5); (.5) and d(.5) merge into the root (1)]

a=000, b=001, c=01, d=1
There are 2^{n-1} “equivalent” Huffman trees (each internal node can swap its children)

What about ties (and thus, tree depth) ?
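A minimal sketch of building a Huffman code with a priority queue, run on the slide's probabilities; ties are broken arbitrarily, so the codewords may differ from the ones above while having the same expected length.

  import heapq

  def huffman_code(probs):
      """probs: {symbol: probability}. Returns {symbol: codeword string}."""
      heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
      heapq.heapify(heap)
      i = len(heap)
      while len(heap) > 1:
          p1, _, c1 = heapq.heappop(heap)            # the two least-probable trees...
          p2, _, c2 = heapq.heappop(heap)
          merged = {s: "0" + w for s, w in c1.items()}
          merged.update({s: "1" + w for s, w in c2.items()})
          heapq.heappush(heap, (p1 + p2, i, merged)) # ...are merged into one
          i += 1
      return heap[0][2]

  code = huffman_code({"a": .1, "b": .2, "c": .2, "d": .5})
  print(code)   # codeword lengths 3,3,2,1 as above, up to tie-breaking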

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to the symbol to be encoded.
Decoding: Start at the root and take the branch for each bit received. When at a leaf, output its symbol and return to the root.

  abc...   → 00000101
  101001... → dcb

[Figure: the codeword tree of the running example, with leaves a(.1), b(.2), c(.2), d(.5)]

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. Its self-information is

  − log2(.999) ≈ .00144

If we were to send 1000 such symbols we might hope to use 1000 * .00144 ≈ 1.44 bits.

Using Huffman, we take at least 1 bit per symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
  ⇒ 1 extra bit per macro-symbol = 1/k extra bits per symbol
  ⇒ larger model to be transmitted
Shannon took infinite sequences, and k → ∞ !!

In practice, we have:
  The model takes |S|^k (k * log |S|) + h^2 bits  (where h might be |S|)
  It is H0(S^L) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

Compress + Search ?                                   [Moura et al, 98]

Compressed text derived from a word-based Huffman:
  Symbols of the Huffman tree are the words of T
  The Huffman tree has fan-out 128
  Codewords are byte-aligned and tagged: 7 bits of Huffman code per byte, plus 1 tag bit marking codeword boundaries

[Figure: the word-based Huffman tree over the words of T = “bzip or not bzip” (bzip, or, not, space) and the byte-aligned codewords forming the compressed text C(T)]
CGrep and other ideas...

P = bzip = 1a 0b

[Figure: GREP is run directly on the compressed text C(T) of T = “bzip or not bzip”: the codeword of P is compared against each byte-aligned codeword (yes/no), using the tag bits to synchronize on codeword boundaries]

Speed ≈ Compression ratio

You find this at
You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary: bzip, not, or, space
P = bzip = 1a 0b

[Figure: scan the compressed text C(S), S = “bzip or not bzip”, comparing the codeword of P against each byte-aligned codeword (yes/no per word)]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the pattern P[1,m] in the text T[1,n].

[Figure: pattern P = A B sliding over text T = A B C A B D A B]

 Naïve solution
   For any position i of T, check if T[i,i+m-1] = P[1,m]
   Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
   Knuth-Morris-Pratt
   Boyer-Moore
   Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bit operations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with small probability.
A randomized algorithm that never errs, whose running time is efficient with high probability.

We will consider a binary alphabet (i.e., T ∈ {0,1}^n).

Arithmetic replaces Comparisons

Strings are also numbers, H: strings → numbers.
Let s be a string of length m:

  H(s) = Σ_{i=1}^{m} 2^{m−i} · s[i]

  P = 0101
  H(P) = 2^3·0 + 2^2·1 + 2^1·0 + 2^0·1 = 5

s = s’ if and only if H(s) = H(s’)

Definition:
let Tr denote the m-length substring of T starting at position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons

We can compute H(Tr) from H(Tr-1):

  H(Tr) = 2·H(Tr-1) − 2^m·T(r−1) + T(r+m−1)

T = 10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

  H(T1) = H(1011) = 11
  H(T2) = H(0110) = 2·11 − 2^4·1 + 0 = 22 − 16 + 0 = 6 = H(0110)

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P = 101111,  q = 7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally (Horner’s rule, reducing mod 7 at each step):
  1
  1·2 + 0 = 2
  2·2 + 1 = 5
  5·2 + 1 = 11 ≡ 4 (mod 7)
  4·2 + 1 = 9 ≡ 2 (mod 7)
  2·2 + 1 = 5 ≡ Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1), since
  2^m (mod q) = 2·(2^{m-1} (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
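A sketch of the fingerprint scan described above, for a binary (0/1) text; here q is a fixed prime just for illustration, whereas the algorithm above picks it at random, and every fingerprint hit is verified so the output is always correct.

  def karp_rabin(T, P, q=2**31 - 1):
      """Report all 0-based r with T[r:r+m] == P."""
      n, m = len(T), len(P)
      if m > n:
          return []
      two_m = pow(2, m, q)                       # 2^m (mod q), used to drop the leading bit
      hP = hT = 0
      for i in range(m):                         # Horner's rule for Hq(P) and Hq(T_1)
          hP = (2 * hP + P[i]) % q
          hT = (2 * hT + T[i]) % q
      occ = []
      for r in range(n - m + 1):
          if hT == hP and T[r:r+m] == P:         # verify: no false matches reported
              occ.append(r)
          if r + m < n:                          # Hq(T_{r+1}) from Hq(T_r)
              hT = (2 * hT - two_m * T[r] + T[r + m]) % q
      return occ

  T = [1, 0, 1, 1, 0, 1, 0, 1]
  print(karp_rabin(T, [0, 1, 0, 1]))             # [4]  (T_5 = 0101, 0-based position 4)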

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method

Define M to be a binary m by n matrix such that:
  M(i,j) = 1 iff the first i characters of P exactly match the i characters of T ending at character j,
  i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1 … j]

Example: T = california and P = for

            c  a  l  i  f  o  r  n  i  a
            1  2  3  4  5  6  7  8  9  10
  f (1)     0  0  0  0  1  0  0  0  0  0
  o (2)     0  0  0  0  0  1  0  0  0  0
  r (3)     0  0  0  0  0  0  1  0  0  0

How does M solve the exact match problem?

How to construct M

We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one.
  Machines can perform bit and arithmetic operations between two words in constant time.
  Examples:
    And(A,B) is the bit-wise AND between A and B.
    BitShift(A) is the value derived by shifting A’s bits down by one position and setting the first bit to 1, i.e. BitShift((b1,…,bm)) = (1, b1, …, b_{m-1}).

Let w be the word size (e.g., 32 or 64 bits). We’ll assume m ≤ w. NOTICE: any column of M fits in a memory word.

How to construct M

We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one.
We define the m-length binary vector U(x) for each character x in the alphabet: U(x) is set to 1 for the positions in P where character x appears.

Example: P = abaac
  U(a) = (1,0,1,1,0)
  U(b) = (0,1,0,0,0)
  U(c) = (0,0,0,0,1)

How to construct M

Initialize column 0 of M to all zeros.
For j > 0, the j-th column is obtained by

  M(j) = BitShift(M(j-1)) & U(T[j])

For i > 1, entry M(i,j) = 1 iff
  (1) the first i-1 characters of P match the i-1 characters of T ending at character j-1  ⇔  M(i-1,j-1) = 1
  (2) P[i] = T[j]  ⇔  the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) into the i-th position; AND-ing with the i-th bit of U(T[j]) establishes whether both are true.
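The whole method fits in a few lines of Python, using an integer as the bit column M(j) (bit i-1 stands for row i); a hedged sketch of the recurrence above, not the code of the original paper.

  def shift_and(T, P):
      """Return the ending positions (1-based) of the exact occurrences of P in T."""
      m = len(P)
      U = {}                                   # U[x]: bit i set iff P[i+1] == x
      for i, x in enumerate(P):
          U[x] = U.get(x, 0) | (1 << i)
      M = 0                                    # column 0: all zeros
      occ = []
      for j, c in enumerate(T, start=1):
          M = ((M << 1) | 1) & U.get(c, 0)     # M(j) = BitShift(M(j-1)) & U(T[j])
          if M & (1 << (m - 1)):               # row m set: an occurrence ends at j
              occ.append(j)
      return occ

  print(shift_and("xabxabaaca", "abaac"))      # [9]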

An example, j=1
T = x a b x a b a a c a      P = abaac
U(x) = (0,0,0,0,0)

  M(1) = BitShift(M(0)) & U(T[1]) = (1,0,0,0,0) & (0,0,0,0,0) = (0,0,0,0,0)

Column 1 of M is all zeros: no prefix of P matches ending at position 1.

An example, j=2
T = x a b x a b a a c a      P = abaac
U(a) = (1,0,1,1,0)

  M(2) = BitShift(M(1)) & U(T[2]) = (1,0,0,0,0) & (1,0,1,1,0) = (1,0,0,0,0)

Column 2 has a 1 in row 1: the prefix ‘a’ of P matches ending at position 2.

An example, j=3
T = x a b x a b a a c a      P = abaac
U(b) = (0,1,0,0,0)

  M(3) = BitShift(M(2)) & U(T[3]) = (1,1,0,0,0) & (0,1,0,0,0) = (0,1,0,0,0)

Column 3 has a 1 in row 2: the prefix ‘ab’ of P matches ending at position 3.

An example, j=9
T = x a b x a b a a c a      P = abaac
U(c) = (0,0,0,0,1)

M, columns 1..9:
          1  2  3  4  5  6  7  8  9
    1     0  1  0  0  1  0  1  1  0
    2     0  0  1  0  0  1  0  0  0
    3     0  0  0  0  0  0  1  0  0
    4     0  0  0  0  0  0  0  1  0
    5     0  0  0  0  0  0  0  0  1

  M(9) = BitShift(M(8)) & U(T[9]) = (1,1,0,0,1) & (0,0,0,0,1) = (0,0,0,0,1)

M(5,9) = 1 ⇒ an occurrence of P ends at position 9 of T.

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions

We want to allow the pattern to contain special symbols, like [a-f] classes of chars.

P = [a-b]baac
  U(a) = (1,0,1,1,0)
  U(b) = (1,1,0,0,0)
  U(c) = (0,0,0,0,1)

What about ‘?’, ‘[^…]’ (not) ?

Problem 1: Another solution
Dictionary: bzip, not, or, space
P = bzip = 1a 0b

[Figure: the Shift-And scan is run over the compressed text C(S), S = “bzip or not bzip”, matching the codeword of P against the byte-aligned codewords (yes/no per word)]

Speed ≈ Compression ratio

Problem 2
Dictionary: bzip, not, or, space

Given a pattern P, find all the occurrences in S of all terms containing P as a substring.
P = o

[Figure: the dictionary terms containing P are “or” (codeword 1g 0a 0b) and “not” (codeword 1g 0g 0a); the compressed text C(S), S = “bzip or not bzip”, is scanned for each of them]

Speed ≈ Compression ratio? No! Why?
A scan of C(S) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want to find all the occurrences of those patterns in the text T[1,n].

[Figure: patterns P1 and P2 matched simultaneously against text T]

 Naïve solution
   Use an (optimal) Exact Matching Algorithm searching for each pattern in P
   Complexity: O(nl + m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
   Complexity: O(n + l + m) time

A simple extension of Shift-And

S is the concatenation of the patterns in P.
R is a bitmap of length m:
  R[i] = 1 iff S[i] is the first symbol of a pattern.

Use a variant of the Shift-And method searching for S:
  For any symbol c, U’(c) = U(c) AND R
    U’(c)[i] = 1 iff S[i] = c and it is the first symbol of a pattern
  For any step j,
    compute M(j)
    then M(j) OR U’(T[j]). Why?
      It sets to 1 the first bit of each pattern that starts with T[j]
    Check if there are occurrences ending in j. How?

Problem 3
Dictionary: bzip, not, or, space

Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing at most k mismatches.
P = bot,  k = 2

[Figure: the compressed text C(S), S = “bzip or not bzip”, and the codewords of the dictionary terms]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary matrix, such that:

  Ml(i,j) = 1 iff there are no more than l mismatches between the first i characters of P and the i characters of T ending at character j.

What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1

The first i-1 characters of P match a substring of T ending at j-1, with at most l mismatches, and the next pair of characters in P and T are equal:

  BitShift(Ml(j-1)) & U(T[j])

Computing Ml: case 2

The first i-1 characters of P match a substring of T ending at j-1, with at most l-1 mismatches (one more mismatch is spent on position j):

  BitShift(Ml-1(j-1))

Computing Ml

We compute Ml for all l = 0, … , k.
For each j compute M(j), M1(j), … , Mk(j).
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a match iff case 1 or case 2 holds:

  Ml(j) = [ BitShift(Ml(j-1)) & U(T[j]) ]  OR  BitShift(Ml-1(j-1))
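A sketch of the k-mismatch recurrence above, again with integers as bit columns; illustrative code, not agrep's actual implementation.

  def shift_and_k_mismatches(T, P, k):
      """Ending positions (1-based) where P occurs in T with at most k mismatches."""
      m = len(P)
      U = {}
      for i, x in enumerate(P):
          U[x] = U.get(x, 0) | (1 << i)
      M = [0] * (k + 1)                        # M[l] = current column of M^l
      occ = []
      for j, c in enumerate(T, start=1):
          u = U.get(c, 0)
          prev = M[:]                          # columns at position j-1
          M[0] = ((prev[0] << 1) | 1) & u
          for l in range(1, k + 1):
              # case 1: extend with an equal pair; case 2: spend one more mismatch
              M[l] = (((prev[l] << 1) | 1) & u) | ((prev[l - 1] << 1) | 1)
          if M[k] & (1 << (m - 1)):
              occ.append(j)
      return occ

  print(shift_and_k_mismatches("xabxabaaca", "abaad", 1))    # [9]
  print(shift_and_k_mismatches("aatatccacaa", "atcgaa", 2))  # [9] (2 mismatches)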

Example M1
T = x a b x a b a a c a      P = abaad

M0 =        1  2  3  4  5  6  7  8  9  10
      1     0  1  0  0  1  0  1  1  0  1
      2     0  0  1  0  0  1  0  0  0  0
      3     0  0  0  0  0  0  1  0  0  0
      4     0  0  0  0  0  0  0  1  0  0
      5     0  0  0  0  0  0  0  0  0  0

M1 =        1  2  3  4  5  6  7  8  9  10
      1     1  1  1  1  1  1  1  1  1  1
      2     0  0  1  0  0  1  0  1  1  0
      3     0  0  0  1  0  0  1  0  0  1
      4     0  0  0  0  1  0  0  1  0  0
      5     0  0  0  0  0  0  0  0  1  0

M1(5,9) = 1: P = abaad occurs ending at position 9 with at most 1 mismatch.

How much do we pay?

The running time is O(kn(1+m/w)).
Again, the method is practically efficient for small m.
Only O(k) columns of M are needed at any given time. Hence, the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary: bzip, not, or, space

Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing k mismatches.
P = bot,  k = 2

[Figure: “not” (codeword 1g 0g 0a) matches P within k = 2 mismatches, so its occurrences in C(S), S = “bzip or not bzip”, are reported]

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding

  γ(x) = 00…0 followed by x in binary, with (Length − 1) zeroes in front
  x > 0 and Length = ⌊log2 x⌋ + 1
  e.g., 9 is represented as <000, 1001>.

γ-code for x takes 2⌊log2 x⌋ + 1 bits  (i.e. a factor of 2 from optimal)
Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers
It is a prefix-free encoding…

Given the following sequence of γ-coded integers, reconstruct the original sequence:
  0001000001100110000011101100111
  ⇒ 8, 6, 3, 59, 7
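A sketch of γ-encoding/decoding that can be used to check the exercise above.

  def gamma_encode(x):                   # x > 0
      b = bin(x)[2:]                     # binary representation of x
      return "0" * (len(b) - 1) + b      # (Length-1) zeroes, then x in binary

  def gamma_decode_stream(bits):
      out, i = [], 0
      while i < len(bits):
          zeros = 0
          while bits[i] == "0":          # count the unary part
              zeros += 1; i += 1
          out.append(int(bits[i:i + zeros + 1], 2))
          i += zeros + 1
      return out

  print(gamma_encode(9))                                         # 0001001
  print(gamma_decode_stream("0001000001100110000011101100111"))  # [8, 6, 3, 59, 7]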

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).

Recall that: |γ(i)| ≤ 2·log i + 1
How good is this approach wrt Huffman?  Compression ratio ≤ 2·H0(s) + 1
Key fact:  1 ≥ Σ_{i=1..x} pi ≥ x·px  ⇒  x ≤ 1/px

How good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1
The cost of the encoding is (recall i ≤ 1/pi):

  Σ_{i=1..|S|} pi · |γ(i)|  ≤  Σ_{i=1..|S|} pi · [2·log(1/pi) + 1]  =  2·H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
The distribution of words is skewed: 1/i^q, where 1 < q < 2

A new concept: Continuers vs Stoppers
  Previously we used: s = c = 128

The main idea is:
  s + c = 256 (we are playing with 8 bits)
  Thus s items are encoded with 1 byte
  And s·c with 2 bytes, s·c^2 with 3 bytes, ...

An example
  5000 distinct words
  ETDC encodes 128 + 128^2 = 16512 words within 2 bytes
  A (230,26)-dense code encodes 230 + 230·26 = 6210 within 2 bytes, hence more words on 1 byte; thus, if the distribution is skewed, it compresses better...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:
  Exploits temporal locality, and it is dynamic
  X = 1^n 2^n 3^n … n^n  ⇒  Huff = Θ(n^2 log n), MTF = O(n log n) + n^2

Not much worse than Huffman
...but it may be far better
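A sketch of the MTF transform; the list is kept as a plain Python list, so each step costs O(|S|) (the tree/hash solution two slides below brings it down to O(log |S|)).

  def mtf_encode(text, alphabet):
      L = list(alphabet)                 # start with the list of symbols
      out = []
      for s in text:
          pos = L.index(s)               # 1) output the (0-based) position of s in L
          out.append(pos)
          L.pop(pos); L.insert(0, s)     # 2) move s to the front of L
      return out

  print(mtf_encode("aaabbbccc", "abc"))  # [0, 0, 0, 1, 0, 0, 2, 0, 0]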

MTF: how good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1
Put the alphabet S in front of the sequence and consider the cost of encoding: each MTF position is at most the gap between consecutive occurrences of the same symbol x (at positions p_i^x), so the cost is at most

  O(|S| log |S|) + Σ_{x=1..|S|} Σ_{i=2..n_x} |γ(p_i^x − p_{i−1}^x)|

By Jensen’s inequality:

  ≤ O(|S| log |S|) + Σ_{x=1..|S|} n_x · [2·log(N/n_x) + 1]
  = O(|S| log |S|) + N · [2·H0(X) + 1]

⇒ La[mtf] ≤ 2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log |S|) time
Total is O(n log |S|), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploits spatial locality, and it is a dynamic code
  X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = Θ(n^2 log n)  >  Rle(X) = Θ(n log n)

There is a memory
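A one-liner sketch of RLE, on the slide's own example.

  from itertools import groupby

  def rle(s):
      return [(c, len(list(g))) for c, g in groupby(s)]

  print(rle("abbbaacccca"))   # [('a',1), ('b',3), ('a',2), ('c',4), ('a',1)]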

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive):

  f(i) = Σ_{j=1}^{i−1} p(j)

e.g.  p(a) = .2, p(b) = .5, p(c) = .3  ⇒  f(a) = .0, f(b) = .2, f(c) = .7

[Figure: the unit interval [0,1) partitioned into a = [0,.2), b = [.2,.7), c = [.7,1)]

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac

[Figure: nested intervals — b maps [0,1) to [.2,.7); inside it, a maps to [.2,.3); inside that, c maps to [.27,.3)]

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities p[c] use the following:

  l0 = 0        li = li-1 + si-1 * f[ci]
  s0 = 1        si = si-1 * p[ci]

f[c] is the cumulative prob. up to symbol c (not included).
The final interval size is

  sn = Π_{i=1}^{n} p[ci]

The interval for a message sequence will be called the sequence interval.
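A sketch of the interval computation above, reproducing the 'bac' example; floating point is used only for illustration (real coders use the integer version described later).

  def sequence_interval(msg, p, f):
      l, s = 0.0, 1.0                    # l_0 = 0, s_0 = 1
      for c in msg:
          l = l + s * f[c]               # l_i = l_{i-1} + s_{i-1} * f[c_i]
          s = s * p[c]                   # s_i = s_{i-1} * p[c_i]
      return l, s

  p = {"a": .2, "b": .5, "c": .3}
  f = {"a": .0, "b": .2, "c": .7}
  l, s = sequence_interval("bac", p, f)
  print(l, l + s)                        # ≈ 0.27 0.3 : the sequence interval [.27, .3)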

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:

  .49 ∈ [.2,.7)  ⇒ b;   rescale: (.49 − .2)/.5 = .58
  .58 ∈ [.2,.7)  ⇒ b;   rescale: (.58 − .2)/.5 = .76
  .76 ∈ [.7,1)   ⇒ c

The message is bbc.

Representing a real number
Binary fractional representation:

  .75 = .11      1/3 = .010101…      11/16 = .1011

Algorithm
  1. x = 2 * x
  2. If x < 1 output 0
  3. else x = x − 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
e.g.  [0,.33) → .01      [.33,.66) → .1      [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions.

  code    min      max      interval
  .11     .110     .111     [.75, 1.0)
  .101    .1010    .1011    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval is contained in the sequence interval (a dyadic number).

[Figure: sequence interval [.61,.79) containing the code interval of .101 = [.625,.75)]

Can use L + s/2 truncated to 1 + ⌈log (1/s)⌉ bits

Bound on Arithmetic length
Note that −log s + 1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

  1 + ⌈log (1/s)⌉ = 1 + ⌈log Π_i (1/pi)⌉
  ≤ 2 + Σ_{j=1..n} log (1/pj)
  = 2 + Σ_{k=1..|S|} n pk log (1/pk)
  = 2 + n H0   bits

nH0 + 0.02 n bits in practice, because of rounding

Integer Arithmetic Coding
The problem is that operations on arbitrary-precision real numbers are expensive.
Key ideas of the integer version:
  Keep integers in range [0..R) where R = 2^k
  Use rounding to generate integer intervals
  Whenever the sequence interval falls into the top, bottom or middle half, expand the interval by a factor 2

Integer Arithmetic is an approximation.

Integer Arithmetic (scaling)
  If l ≥ R/2 then (top half)
    Output 1 followed by m 0s; m = 0; the message interval is expanded by 2
  If u < R/2 then (bottom half)
    Output 0 followed by m 1s; m = 0; the message interval is expanded by 2
  If l ≥ R/4 and u < 3R/4 then (middle half)
    Increment m; the message interval is expanded by 2
  In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine:

[Figure: the ATB takes the current interval (L,s) and a symbol c with distribution (p1,....,pS), and outputs the refined interval (L’,s’) with L’ = L + s·f(c) and s’ = s·p(c)]

Therefore, even the distribution can change over time.

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox

[Figure: the ATB is driven with p[ s | context ], where s = c or esc; it refines (L,s) into (L’,s’) as before]

Encoder and Decoder must know the protocol for selecting the same conditional probability distribution (PPM-variant).

PPM: Example Contexts                                  k = 2
String = ACCBACCACBA B

Context     Counts
(empty)     A = 4   B = 2   C = 5   $ = 3

Context     Counts
A           C = 3   $ = 1
B           A = 2   $ = 1
C           A = 1   B = 2   C = 2   $ = 3

Context     Counts
AC          B = 1   C = 2   $ = 2
BA          C = 1   $ = 1
CA          C = 1   $ = 1
CB          A = 2   $ = 1
CC          A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
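A sketch of LZ77 decoding from (d, len, c) triples, including the overlapping-copy case just discussed; run here on the triples of the windowed example above.

  def lz77_decode(triples):
      out = []
      for d, length, c in triples:
          start = len(out) - d
          for i in range(length):          # the copy may overlap the part being written
              out.append(out[start + i])
          out.append(c)
      return "".join(out)

  # the example of the previous slides, window size 6
  print(lz77_decode([(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')]))
  # aacaacabcabaaac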

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
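A sketch of the LZW encoder (decoding is symmetric, with the special SSc case handled as noted above); the initial dictionary below is a toy one reproducing the slide's codes a=112, b=113, c=114, cf. the encoding example that follows.

  def lzw_encode(text, initial):
      dictionary = dict(initial)               # string -> code
      next_code = 256
      out, S = [], ""
      for c in text:
          if S + c in dictionary:
              S += c                           # keep extending the current match
          else:
              out.append(dictionary[S])        # emit the longest match...
              dictionary[S + c] = next_code    # ...and add Sc to the dictionary
              next_code += 1
              S = c
      if S:
          out.append(dictionary[S])
      return out

  print(lzw_encode("aabaacababacb", {"a": 112, "b": 113, "c": 114}))
  # [112, 112, 113, 256, 114, 257, 261, 114, 113]  (the last 113 is the final flush)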

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
We are given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example
[Figure: the sorted-rotations matrix of a much longer text — L shows long runs of equal characters]

A useful tool: the L → F mapping
How do we map L’s chars onto F’s chars ?
... Need to distinguish equal chars in F...
Take two equal L’s chars and rotate their rows rightward: they keep the same relative order !!

The BWT is invertible

Two key properties:
1. The LF-array maps L’s chars to F’s chars
2. L[ i ] precedes F[ i ] in T

Reconstruct T backward:   T = .... i ppi #

InvertBWT(L)
  Compute LF[0,n-1];
  r = 0; i = n;
  while (i > 0) {
    T[i] = L[r];
    r = LF[r]; i--;
  }
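A sketch of the forward and inverse transform: the forward one naively sorts the rotations (the quadratic "elegant but inefficient" approach discussed next), the inverse uses the LF-mapping exactly as in InvertBWT.

  def bwt(T):                                   # T must end with a unique '#'
      rotations = sorted(T[i:] + T[:i] for i in range(len(T)))
      return "".join(row[-1] for row in rotations)      # the L column

  def ibwt(L):
      # LF[r] = position in F of the char L[r]; equal chars keep their relative order
      order = sorted(range(len(L)), key=lambda r: (L[r], r))
      LF = [0] * len(L)
      for f_pos, r in enumerate(order):
          LF[r] = f_pos
      chars, r = [], 0                # row 0 starts with '#': L[0] precedes it in T
      for _ in range(len(L)):
          chars.append(L[r])          # T is rebuilt backward, as in InvertBWT
          r = LF[r]
      chars.reverse()                 # now chars = '#' followed by T[1..n-1]
      return "".join(chars[1:] + chars[:1])

  L = bwt("mississippi#")
  print(L)          # ipssm#pissii
  print(ibwt(L))    # mississippi#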

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n^2 log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion pages are available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:
  Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution
  Altavista crawl, 1999 — WebBase Crawl 2001
  Indegree follows a power law distribution:

    Pr[ in-degree(u) = k ]  ∝  1/k^α,   α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:
  Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1
  Locality: usually most of the hyperlinks point to other URLs on the same host (about 80%).
  Similarity: pages close in lexicographic order tend to share many outgoing lists

A Picture of the Web Graph
[Figure: dot-plot of the adjacency matrix, axes i and j]
21 million pages, 150 million links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77-scheme provides an efficient, optimal solution
  fknown is the “previously encoded text”: compress fknown·fnew starting from fnew

zdelta is one of the best implementations:

            Emacs size    Emacs time
uncompr     27Mb          ---
gzip        8Mb           35 secs
zdelta      1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: a pair of proxies located on each side of the slow link use a proprietary protocol to increase performance over this link.

[Figure: Client ↔ client-side proxy — slow link with delta-encoding — server-side Proxy — fast link — web; both proxies keep the reference page, requests and pages flow through]

Use zdelta to reduce traffic:
  Old version available at both proxies
  Restricted to pages already visited (30% hits), URL-prefix match
  Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F
  Useful on a dynamic collection of web pages, back-ups, …

Apply pairwise zdelta: find for each f ∈ F a good reference
Reduction to the Min Branching problem on DAGs:
  Build a weighted graph GF: nodes = files, weights = zdelta-size
  Insert a dummy node connected to all nodes, whose edge weights are the gzip-coding sizes
  Compute the min branching = directed spanning tree of min total cost, covering G’s nodes

[Figure: example graph with dummy node 0 and files 1, 2, 3, 5; edge weights such as 20, 123, 220, 620, 2000]

            space    time
uncompr     30Mb     ---
tgz         20%      linear
THIS        8%       quadratic
Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n^2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving zdelta executions. Nonetheless, strict n^2 time.

            space    time
uncompr     260Mb    ---
tgz         12%      2 mins
THIS        8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

            gcc size    emacs size
total       27288       27326
gzip        7563        8577
zdelta      227         1431
rsync       964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends hashes (unlike the client in rsync), and the clients check them.
The server deploys the common fref to compress the new ftar (rsync compresses just it).

A multi-round protocol
  k blocks of n/k elems
  log n/k levels
  If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
  The communication complexity is O(k lg n lg(n/k)) bits.

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree

T# = mississippi#
      1 2 3 4 5 6 7 8 9 10 11 12

[Figure: the suffix tree of T#. Edges carry labels such as #, i, p, s, si, ssi, i#, pi#, ppi#, mississippi#; the 12 leaves store the starting positions 1..12 of the corresponding suffixes.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.

T = mississippi#        (storing SUF(T) explicitly would take Θ(N^2) space)

SA      SUF(T)
12      #
11      i#
 8      ippi#
 5      issippi#
 2      ississippi#
 1      mississippi#
10      pi#
 9      ppi#
 7      sippi#
 4      sissippi#
 6      ssippi#
 3      ssissippi#

Each SA entry is a suffix pointer into T.

Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
⇒ In practice, a total of 5N bytes
Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
⇒ overall, O(p log2 N) time
  improvable to O(p + log2 N)      [Manber-Myers, ’90]
  and to O(p + log2 |S|)           [Cole et al, ’06]

Locating the occurrences

T = mississippi#        P = si        occ = 2

[Figure: the occurrences of P correspond to a contiguous range of SA, found by binary searching its boundaries (e.g. with the sentinels si# and si$, where # < Σ < $); here the range holds SA = 7 (sippi#) and SA = 4 (sissippi#)]

Suffix Array search
• O(p + log2 N + occ) time
  Suffix Trays: O(p + log2 |S| + occ)      [Cole et al., ‘06]
  String B-tree                            [Ferragina-Grossi, ’95]
  Self-adjusting Suffix Arrays             [Ciriani et al., ’02]
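A sketch of suffix-array construction (naive sort) and of the binary search for P; the suffix strings are materialized only for clarity (quadratic space), real implementations compare directly inside T with O(p) chars per probe.

  from bisect import bisect_left

  def suffix_array(T):
      return sorted(range(1, len(T) + 1), key=lambda i: T[i-1:])   # 1-based positions

  def occurrences(T, SA, P):
      suffixes = [T[i-1:] for i in SA]
      lo = bisect_left(suffixes, P)               # O(p log N) char comparisons overall
      hi = lo
      while hi < len(suffixes) and suffixes[hi].startswith(P):
          hi += 1
      return sorted(SA[lo:hi])

  T = "mississippi#"
  SA = suffix_array(T)
  print(SA)                          # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
  print(occurrences(T, SA, "si"))    # [4, 7]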

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

T = mississippi#

  SA:   12  11   8   5   2   1  10   9   7   4   6   3
  Lcp:       0   1   1   4   0   0   1   0   2   1   3

(e.g. Lcp = 4 for the adjacent suffixes issippi# and ississippi#, starting at positions 5 and 2)

• How long is the common prefix between T[i,...] and T[j,...] ?
  • It is the min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Is there a repeated substring of length ≥ L ?
  • Search for an entry Lcp[i] ≥ L
• Is there a substring of length ≥ L occurring ≥ C times ?
  • Search for a window Lcp[i,i+C-2] whose entries are all ≥ L


Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string is (losslessly) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM into fewer bits?
NO: there are 2^n of them, but fewer compressed messages…

∑_{i=1}^{n−1} 2^i = 2^n − 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H(S) = ∑_{s∈S} p(s) · log₂ (1 / p(s))   bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code is one whose encoded sequences can always be uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie:
[figure: binary trie with a at 0, b at 100, c at 101, d at 11]

Average Length
For a code C with codeword length L[s], the
average length is defined as
La(C) = ∑_{s∈S} p(s) · L[s]

We say that a prefix code C is optimal if, for all prefix codes C', La(C) ≤ La(C').

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal uniquely decodable code, there exists a prefix code with the same codeword lengths, and thus the same optimal average length. And vice versa…

Theorem (golden rule). If C is an optimal prefix code for a source with probabilities {p1, …, pn},
then pi < pj ⇒ L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any uniquely decodable code C, we have H(S) ≤ La(C).
Theorem (upper bound, Shannon). For any probability distribution, there exists a prefix code C such that La(C) ≤ H(S) + 1.
(The Shannon code assigns ⌈log₂ 1/p(s)⌉ bits to symbol s.)

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5
[figure: Huffman tree built bottom-up — merge a(.1)+b(.2) into (.3), merge (.3)+c(.2) into (.5), merge (.5)+d(.5) into (1)]
a = 000, b = 001, c = 01, d = 1
There are 2^(n−1) "equivalent" Huffman trees

What about ties (and thus, tree depth) ?
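A minimal Huffman-construction sketch in Python using a heap (ties are broken by an arbitrary counter, which is exactly why the question above about ties and tree depth matters; the function name is mine):

import heapq

def huffman_codes(probs):
    """probs: dict symbol -> probability. Returns dict symbol -> codeword."""
    # Each heap item: (probability, tie-breaker, {symbol: partial codeword})
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)   # the two least probable trees
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, count, merged))
        count += 1
    return heap[0][2]

print(huffman_codes({"a": .1, "b": .2, "c": .2, "d": .5}))
# codeword lengths: a -> 3, b -> 3, c -> 2, d -> 1 bits, matching the running
# example (the exact bit patterns depend on how ties are broken)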

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
[figure: the Huffman tree of the running example]
abc…  →  000 001 01 = 00000101
101001…  →  d c b

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store, for any level L:
  firstcode[L]   (the numeric value of the first codeword on level L, of the form 00…0 + offset)
  Symbol[L,i], for each i in level L
This takes ≤ h² + |S| log |S| bits.

Canonical Huffman: Encoding
[figure: the canonical codeword tree, levels 1…5]

Canonical Huffman: Decoding
firstcode[1]=2, firstcode[2]=1, firstcode[3]=1, firstcode[4]=2, firstcode[5]=0
T = …00010…
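A sketch of canonical decoding from the firstcode[] array (the firstcode values and Symbol table below are an illustrative small code with lengths 1,2,3,3 — not the 5-level example of the figure):

def canonical_decode(bits, firstcode, symbol):
    """Decode one symbol from an iterator of bits (canonical Huffman, MG-style).

    firstcode[l] = numeric value of the first codeword of length l;
    symbol[l][i] = i-th symbol on level l, in canonical order.
    """
    v = next(bits)
    l = 1
    while v < firstcode[l]:
        v = 2 * v + next(bits)
        l += 1
    return symbol[l][v - firstcode[l]]

# Illustrative canonical code with lengths d:1, c:2, a:3, b:3
firstcode = {1: 1, 2: 1, 3: 0}
symbol = {1: ['d'], 2: ['c'], 3: ['a', 'b']}
stream = iter([0, 0, 1, 0, 1, 1])          # 001 01 1
out = [canonical_decode(stream, firstcode, symbol) for _ in range(3)]
print(out)                                  # ['b', 'c', 'd']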

Problem with Huffman Coding
Consider a symbol with probability .999. Its self-information is
  −log₂(.999) ≈ .00144 bits
If we were to send 1000 such symbols we might hope to use 1000 · .00144 ≈ 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
  ⇒ 1 extra bit per macro-symbol = 1/k extra bits per symbol
  ⇒ larger model to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:
  the model takes |S|^k · (k · log |S|) + h² bits (where h might be |S|)
  it holds H₀(S^L) ≤ L · H_k(S) + O(k · log |S|), for each k ≤ L

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
[figure: word-based Huffman vs. tagged Huffman over T = "bzip or not bzip" — the symbols of the Huffman tree are the words of T, the tree has fan-out 128, codewords are sequences of 7-bit configurations packed into bytes, and the tagging bit marks the first byte of each codeword]

CGrep and other ideas...
P = bzip = 1a 0b
[figure: GREP over the compressed text C(T), T = "bzip or not bzip" — the codeword of P is compared against the byte-aligned, tagged codewords of C(T), yielding yes/no at each codeword boundary]
Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary = { bzip, not, or, space }
P = bzip = 1a 0b
S = "bzip or not bzip"
[figure: scan C(S) codeword by codeword, comparing each codeword against the codeword of P — yes/no at each position]
Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
[figure: the pattern P aligned against the text T at the candidate positions]

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which arithmetic and bit operations replace comparisons.
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet (i.e., T ∈ {0,1}^n).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m:

  H(s) = ∑_{i=1}^{m} 2^{m−i} · s[i]

P = 0101
H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5
s = s'  if and only if  H(s) = H(s')

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T = 10110101,  P = 0101,  H(P) = 5
  H(T2) = H(0110) = 6 ≠ H(P)
  H(T5) = H(0101) = 5 = H(P)   →   Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr−1):

  H(Tr) = 2·H(Tr−1) − 2^m · T(r−1) + T(r+m−1)

T = 10110101
T1 = 1011, T2 = 0110
H(T1) = H(1011) = 11
H(T2) = H(0110) = 2·11 − 2⁴·1 + 0 = 22 − 16 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P = 101111, q = 7
H(P) = 47,  Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally, one bit at a time:
  1·2 (mod 7) + 0 = 2
  2·2 (mod 7) + 1 = 5
  5·2 (mod 7) + 1 = 4
  4·2 (mod 7) + 1 = 2
  2·2 (mod 7) + 1 = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(Tr−1), since
  2^m (mod q) = 2 · (2^{m−1} (mod q)) (mod q)
Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
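A sketch of the Karp-Rabin scan, with the verification step that makes it deterministic (the prime q = 2^31 − 1 and the alphabet size are illustrative choices, not fixed by the slides):

def karp_rabin(T, P, q=(1 << 31) - 1, sigma=256):
    """Return the positions where P occurs in T, verifying candidate matches."""
    n, m = len(T), len(P)
    if m > n:
        return []
    high = pow(sigma, m - 1, q)          # sigma^(m-1) mod q, used to drop the old char
    hp = ht = 0
    for i in range(m):                   # fingerprints of P and of T[0..m-1]
        hp = (hp * sigma + ord(P[i])) % q
        ht = (ht * sigma + ord(T[i])) % q
    occ = []
    for r in range(n - m + 1):
        if hp == ht and T[r:r + m] == P:     # verify, to rule out false matches
            occ.append(r)
        if r < n - m:                        # roll the window: drop T[r], add T[r+m]
            ht = ((ht - ord(T[r]) * high) * sigma + ord(T[r + m])) % q
    return occ

print(karp_rabin("10110101", "0101"))   # [4] (0-based; position 5 in the slide's 1-based indexing)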

Problem 1: Solution
Dictionary = { bzip, not, or, space }
P = bzip = 1a 0b
S = "bzip or not bzip"
[figure: as before, C(S) is scanned codeword by codeword and each codeword is compared with the codeword of P]
Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M is an m × n binary matrix (a column per text position):

        c a l i f o r n i a
  f     0 0 0 0 1 0 0 0 0 0
  o     0 0 0 0 0 1 0 0 0 0
  r     0 0 0 0 0 0 1 0 0 0

How does M solve the exact match problem?
How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit bit-parallelism to compute the j-th column of M from the (j−1)-th one.
We define the m-length binary vector U(x) for each character x in the alphabet: U(x) is set to 1 at the positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros.
For j > 0, the j-th column is obtained as

  M(j) = BitShift(M(j−1)) & U(T[j])

For i > 1, entry M(i,j) = 1 iff
  (1) the first i−1 characters of P match the i−1 characters of T ending at character j−1, i.e. M(i−1, j−1) = 1
  (2) P[i] = T[j], i.e. the i-th bit of U(T[j]) = 1
BitShift moves bit M(i−1, j−1) into the i-th position; ANDing it with the i-th bit of U(T[j]) establishes whether both conditions hold.

An example j=1
T = xabxabaaca, P = abaac   (bits of a column listed for i = 1…5)
M(1) = BitShift(M(0)) & U(T[1]=x) = 10000 & 00000 = 00000

An example j=2
M(2) = BitShift(M(1)) & U(T[2]=a) = 10000 & 10110 = 10000

An example j=3
M(3) = BitShift(M(2)) & U(T[3]=b) = 11000 & 01000 = 01000

An example j=9
M(9) = BitShift(M(8)) & U(T[9]=c) = 11001 & 00001 = 00001
The 5-th (last) bit is 1: P occurs in T ending at position 9.

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.
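A bit-parallel Shift-And sketch in Python (machine integers play the role of the length-m columns; bit i−1 of the mask corresponds to row i of M; the function name is mine):

def shift_and(T, P):
    """Report all ending positions (0-based) of exact occurrences of P in T."""
    m = len(P)
    U = {}                        # U[c]: bitmask with 1s at the positions of c in P
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M, last = 0, 1 << (m - 1)
    occ = []
    for j, c in enumerate(T):
        # BitShift(M) = shift by one and set the first bit, then AND with U(T[j])
        M = ((M << 1) | 1) & U.get(c, 0)
        if M & last:              # row m is set: P ends at position j
            occ.append(j)
    return occ

print(shift_and("xabxabaaca", "abaac"))   # [8], i.e. the occurrence ending at the 9th character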

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: Another solution
Dictionary = { bzip, not, or, space }
P = bzip = 1a 0b
S = "bzip or not bzip"
[figure: as before, the compressed text C(S) is scanned and compared against the codeword of P]
Speed ≈ Compression ratio

Problem 2
Dictionary = { bzip, not, or, space }
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring.
P = o
S = "bzip or not bzip"
The terms containing 'o', and their codewords:   not = 1g 0g 0a,   or = 1g 0a 0b
[figure: C(S) is scanned once per matching term, comparing codewords]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[figure: the text T with occurrences of the patterns P1 and P2 highlighted]

 Naïve solution
 Use an (optimal) exact-matching algorithm, searching for each pattern of P separately
 Complexity: O(n·l + m) time — not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

S is the concatenation of the patterns in P.
R is a bitmap of length m: R[i] = 1 iff S[i] is the first symbol of a pattern.

Use a variant of the Shift-And method searching for S:
 For any symbol c, U'(c) = U(c) AND R
   U'(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
 For any step j,
   compute M(j)
   then M(j) OR U'(T[j]). Why? It sets to 1 the first bit of each pattern that starts with T[j]
   Check if there are occurrences ending in j. How?

Problem 3
Dictionary = { bzip, not, or, space }
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring, allowing at most k mismatches.
P = bot, k = 2
S = "bzip or not bzip"
[figure: the compressed scan of C(S), now with approximate matching against the terms]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal: given k, find all the occurrences of P in T with up to k mismatches.
We define the matrix M^l to be an m by n binary matrix such that:

  M^l(i,j) = 1 iff there are at most l mismatches between the first i characters of P and the i characters of T ending at character j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i−1 characters of P match a substring of T ending at j−1 with at most l mismatches, and the next pair of characters in P and T are equal:

  BitShift(M^l(j−1)) & U(T[j])

Computing Ml: case 2


The first i−1 characters of P match a substring of T ending at j−1 with at most l−1 mismatches (the remaining mismatch is spent on T[j]):

  BitShift(M^{l−1}(j−1))

Computing Ml


We compute M^l for all l = 0, …, k.
For each j compute M^0(j), M^1(j), …, M^k(j).
For all l, initialize M^l(0) to the zero vector.
Combining the two cases, M^l(j) is computed as

  M^l(j) = [ BitShift(M^l(j−1)) & U(T[j]) ]  OR  BitShift(M^{l−1}(j−1))
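A sketch of the k-mismatch extension: one bitmask per error level l, updated with the recurrence above (function name mine; it only handles mismatches, not insertions or deletions):

def agrep_mismatches(T, P, k):
    """Ending positions of occurrences of P in T with at most k mismatches."""
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M = [0] * (k + 1)                     # M[l] = current column of M^l
    last = 1 << (m - 1)
    occ = []
    for j, c in enumerate(T):
        prev = M[:]                       # the columns for position j-1
        M[0] = ((prev[0] << 1) | 1) & U.get(c, 0)          # exact level
        for l in range(1, k + 1):
            # case 1: extend an l-mismatch prefix with a matching character
            # case 2: extend an (l-1)-mismatch prefix, spending one mismatch on T[j]
            M[l] = (((prev[l] << 1) | 1) & U.get(c, 0)) | ((prev[l - 1] << 1) | 1)
        if M[k] & last:
            occ.append(j)
    return occ

print(agrep_mismatches("xabxabaaca", "abaad", 1))   # [8]: abaac vs abaad, one mismatch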

Example M1
T = xabxabaaca, P = abaad
[tables: the 5×10 matrices M⁰ and M¹; in M¹ row 5 has a 1 at column 9, i.e. P occurs ending at position 9 with one mismatch (abaac vs abaad)]

How much do we pay?





The running time is O(k·n·(1 + m/w)).
Again, the method is practically efficient for small m.
Only O(k) columns of M are needed at any given time; hence the space used by the algorithm is O(k) memory words (for m ≤ w).

Problem 3: Solution
Dictionary = { bzip, not, or, space }
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring, allowing k mismatches.
P = bot, k = 2
S = "bzip or not bzip"
Matching term: not = 1g 0g 0a
[figure: the compressed scan of C(S) against the codeword of "not"]

Agrep: more sophisticated operations


The Shift-And method can solve other operations.

The edit distance between two strings p and s is d(p,s) = the minimum number of operations needed to transform p into s via three ops:
  Insertion: insert a symbol in p
  Deletion: delete a symbol from p
  Substitution: change a symbol of p into a different one

Example: d(ananas, banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding:   <0 0 … 0 , x in binary>   (Length−1 zeros, then x in binary)
  x > 0 and Length = ⌊log₂ x⌋ + 1
  e.g., 9 is represented as <000, 1001>

The γ-code for x takes 2⌊log₂ x⌋ + 1 bits (i.e. a factor of 2 from optimal).
It is optimal for Pr(x) = 1/(2x²), and i.i.d. integers.

It is a prefix-free encoding…
Given the following sequence of γ-coded integers, reconstruct the original sequence:

  0001000001100110000011101100111
  →  8, 6, 3, 59, 7
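A γ-encode/decode sketch (bits kept as a Python string for clarity; function names mine):

def gamma_encode(x):
    """gamma(x) for x > 0: (Length-1) zeros followed by the binary of x."""
    b = bin(x)[2:]                 # binary representation, MSB first
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    """Decode a concatenation of gamma codes back to the integer sequence."""
    out, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":      # count the unary prefix
            zeros += 1
            i += 1
        out.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return out

code = "".join(gamma_encode(x) for x in (8, 6, 3, 59, 7))
print(code)                 # 0001000001100110000011101100111, as in the exercise above
print(gamma_decode(code))   # [8, 6, 3, 59, 7]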

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).
Recall that |γ(i)| ≤ 2·log₂ i + 1.
How good is this approach w.r.t. Huffman?   Compression ratio ≤ 2·H₀(s) + 1
Key fact:   1 ≥ ∑_{i=1..x} p_i ≥ x·p_x   ⇒   x ≤ 1/p_x

How good is it?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log₂ i + 1.
The cost of the encoding is (recall i ≤ 1/p_i):

  ∑_{i=1..|S|} p_i · |γ(i)|  ≤  ∑_{i=1..|S|} p_i · [2·log₂(1/p_i) + 1]  =  2·H₀(X) + 1

Not much worse than Huffman, and improvable to H₀(X) + 2 + …

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
The distribution of words is skewed: 1/i^q, where 1 < q < 2.

A new concept: Continuers vs Stoppers
  Previously we used s = c = 128.
The main idea is:
  s + c = 256 (we are playing with 8 bits)
  s items are encoded with 1 byte
  s·c items with 2 bytes, s·c² with 3 bytes, …

An example




5000 distinct words
ETDC encodes 128 + 128² = 16512 words on 2 bytes.
A (230,26)-dense code encodes 230 + 230·26 = 6210 words on 2 bytes, hence more on 1 byte — and thus wins if the distribution is skewed…

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC is quite interesting…
Search is 6% faster than byte-aligned Huffword.

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

It exploits temporal locality, and it is dynamic.
  X = 1ⁿ 2ⁿ 3ⁿ … nⁿ  ⇒  Huff = Θ(n² log n),  MTF = O(n log n) + n²
Not much worse than Huffman… but it may be far better.
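A small MTF sketch (the output integers would then be γ-coded or fed to RLE, as in bzip; the example string is the one used in the RLE slide below):

def mtf_encode(text, alphabet):
    L = list(alphabet)                 # start with the list of symbols
    out = []
    for s in text:
        i = L.index(s)                 # 1) output the position of s in L
        out.append(i)
        L.insert(0, L.pop(i))          # 2) move s to the front of L
    return out

def mtf_decode(codes, alphabet):
    L = list(alphabet)
    out = []
    for i in codes:
        s = L[i]
        out.append(s)
        L.insert(0, L.pop(i))
    return "".join(out)

codes = mtf_encode("abbbaacccca", "abcd")
print(codes)                           # [0, 1, 0, 0, 1, 0, 2, 0, 0, 0, 1]
print(mtf_decode(codes, "abcd"))       # abbbaacccca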

MTF: how good is it?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log₂ i + 1.
Put S in front of the list and consider the cost of encoding:

  O(|S| log |S|) + ∑_{x=1..|S|} ∑_{i=2..n_x} |γ(p_x^i − p_x^{i−1})|

By Jensen's inequality:
  ≤ O(|S| log |S|) + ∑_{x=1..|S|} n_x · [2·log₂(N/n_x) + 1]
  = O(|S| log |S|) + N·[2·H₀(X) + 1]

Hence La[mtf] ≤ 2·H₀(X) + O(1).

MTF: higher compression
To achieve higher compression we consider words (and separators) as the symbols to be encoded.
How to keep the MTF-list efficiently:
  Search tree: leaves contain the symbols, ordered as in the MTF-list; nodes contain the size of their descending subtree.
  Hash table: the key is a symbol, the data is a pointer to the corresponding tree leaf.
Each tree operation takes O(log |S|) time; the total is O(n log |S|), where n = #symbols to be compressed.

Run Length Encoding (RLE)
If spatial locality is very high, then
  abbbaacccca ⇒ (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings ⇒ just the run lengths and one bit.
Properties: it exploits spatial locality, it is a dynamic code, and there is a memory.
  X = 1ⁿ 2ⁿ 3ⁿ … nⁿ  ⇒  Huff(X) = Θ(n² log n) > Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.   a = .2 → [0.0, 0.2),   b = .5 → [0.2, 0.7),   c = .3 → [0.7, 1.0)

  f(i) = ∑_{j=1}^{i−1} p(j),   so f(a) = .0, f(b) = .2, f(c) = .7

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
  start [0.0, 1.0)  →  b: [0.2, 0.7)  →  a: [0.2, 0.3)  →  c: [0.27, 0.3)
The final sequence interval is [.27, .3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l 0  0 l i  l i -1  s i -1 * f c i 

s0  1

s i  s i -1 * p c i 

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

n

sn 

 p c 
i

i 1

The interval for a message sequence will be called the
sequence interval

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
  0.49 ∈ [0.2, 0.7)   → b
  0.49 ∈ [0.3, 0.55)  → b
  0.49 ∈ [0.475, 0.55) → c
The message is bbc.
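A toy floating-point coder for the slide's model, just to make the interval updates explicit (real coders use the integer/scaling version discussed next; function names mine):

P = {"a": .2, "b": .5, "c": .3}
F = {"a": .0, "b": .2, "c": .7}        # cumulative probabilities f(c)

def encode_interval(msg):
    l, s = 0.0, 1.0                    # l_0 = 0, s_0 = 1
    for c in msg:
        l = l + s * F[c]               # l_i = l_{i-1} + s_{i-1} * f(c_i)
        s = s * P[c]                   # s_i = s_{i-1} * p(c_i)
    return l, s                        # the sequence interval [l, l+s)

def decode(x, n):
    msg = []
    for _ in range(n):                 # n = message length, known to the decoder
        for c in "abc":
            if F[c] <= x < F[c] + P[c]:
                msg.append(c)
                x = (x - F[c]) / P[c]  # zoom into the symbol interval
                break
    return "".join(msg)

print(encode_interval("bac"))          # ≈ (0.27, 0.03)  ->  interval [.27, .30)
print(decode(0.49, 3))                 # 'bbc'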

Representing a real number
Binary fractional representation:
  .75 = .11      1/3 = .010101…      11/16 = .1011
Algorithm:
  1. x = 2·x
  2. if x < 1 output 0
  3. else x = x − 1; output 1

So how about just using the shortest binary fractional representation within the sequence interval?
  e.g. [0, .33) → .01      [.33, .66) → .1      [.66, 1) → .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
  number   min      max      interval
  .11      .110     .111     [.75, 1.0)
  .101     .1010    .1011    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

[example: sequence interval [.61, .79); the code interval of .101 is [.625, .75), which is contained in it]

Can use L + s/2 truncated to 1 + ⌈log₂(1/s)⌉ bits.

Bound on Arithmetic length
Note that −log₂ s + 1 = log₂(2/s).

Bound on Length
Theorem: For a text of length n, the arithmetic encoder generates at most
  1 + ⌈log₂(1/s)⌉ = 1 + ⌈log₂ ∏_{i=1..n} (1/p_i)⌉
  ≤ 2 + ∑_{i=1..n} log₂(1/p_i)
  = 2 + ∑_{k=1..|S|} n·p_k·log₂(1/p_k)
  = 2 + n·H₀ bits

In practice ≈ n·H₀ + 0.02·n bits, because of rounding.

Integer Arithmetic Coding
The problem is that operations on arbitrary-precision real numbers are expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 (top half): output 1 followed by m 0s; set m = 0; the interval is expanded by a factor 2.
If u < R/2 (bottom half): output 0 followed by m 1s; set m = 0; the interval is expanded by a factor 2.
If l ≥ R/4 and u < 3R/4 (middle half): increment m; the interval is expanded by a factor 2.
In all other cases, just continue…

You find this at

Arithmetic ToolBox
As a state machine: given the current interval (L,s), a symbol c and its distribution (p1,…,p_|S|), the ATB returns the new interval (L', s').

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
[figure: the ATB state machine driven by the conditional probability p[s | context], where s is a character c or the escape symbol: (L,s) → (L',s')]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts    (String = ACCBACCACBA, k = 2)

Context Empty:   A = 4,  B = 2,  C = 5,  $ = 3
Context A:  C = 3, $ = 1      Context B:  A = 2, $ = 1      Context C:  A = 1, B = 2, C = 2, $ = 3
Context AC: B = 1, C = 2, $ = 2      Context BA: C = 1, $ = 1      Context CA: C = 1, $ = 1
Context CB: A = 2, $ = 1      Context CC: A = 1, B = 1, $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary ??? — Cursor (all substrings starting here)
Example output: <2,3,c>

Algorithm's step:
  output <d, len, c>, where
    d = distance of the copied string w.r.t. the current position
    len = length of the longest match
    c = next char in the text beyond the longest match
  advance by len + 1
A buffer "window" has fixed length and moves.

Example: LZ77 with window
a a c a a c a b c a b a a a c      (window size = 6: longest match within W, plus the next character)
  (0,0,a)  (1,1,c)  (3,4,b)  (3,3,a)  (1,2,c)
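A naive LZ77 sketch that reproduces the triples above (quadratic matching inside a fixed window W; a real implementation hashes substrings, as gzip does; function name mine):

def lz77_encode(T, W=6):
    out, i = [], 0
    while i < len(T):
        best_d, best_len = 0, 0
        start = max(0, i - W)
        for j in range(start, i):                 # candidate copy positions in the window
            l = 0
            # keep at least one char for the "next char" field; overlap with the cursor is allowed
            while i + l < len(T) - 1 and T[j + l] == T[i + l]:
                l += 1
            if l > best_len:
                best_d, best_len = i - j, l
        nxt = T[i + best_len] if i + best_len < len(T) else ""
        out.append((best_d, best_len, nxt))
        i += best_len + 1
    return out

print(lz77_encode("aacaacabcabaaac"))
# [(0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), (3, 3, 'a'), (1, 2, 'c')]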

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later
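An LZW sketch matching the example (the dictionary is seeded with just a, b, c for readability instead of the 256 byte values; the decoder's "one step later" special case is the `w + w[0]` line; function names mine):

def lzw_encode(text, alphabet="abc"):
    dic = {c: i + 112 for i, c in enumerate(alphabet)}   # a=112, b=113, c=114 as in the slide
    nxt, w, out = 256, "", []
    for c in text:
        if w + c in dic:
            w += c
        else:
            out.append(dic[w])
            dic[w + c] = nxt                 # add S·c, but do not emit c
            nxt += 1
            w = c
    out.append(dic[w])
    return out

def lzw_decode(codes, alphabet="abc"):
    dic = {i + 112: c for i, c in enumerate(alphabet)}
    nxt = 256
    w = dic[codes[0]]
    out = [w]
    for k in codes[1:]:
        entry = dic[k] if k in dic else w + w[0]   # special case: code not yet in the dictionary
        out.append(entry)
        dic[nxt] = w + entry[0]                    # the decoder is one step behind
        nxt += 1
        w = entry
    return "".join(out)

codes = lzw_encode("aabaacababacb")
print(codes)                 # [112, 112, 113, 256, 114, 257, 261, 114, 113]
print(lzw_decode(codes))     # aabaacababacb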

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform (1994)
Let us be given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows:

F                 L
#  mississipp  i
i  #mississip  p
i  ppi#missis  s
i  ssippi#mis  s
i  ssissippi#  m
m  ississippi  #
p  i#mississi  p
p  pi#mississ  i
s  ippi#missi  s
s  issippi#mi  s
s  sippi#miss  i
s  sissippi#m  i

F = first column, L = last column (the BWT of T).

A famous example

Much
longer...

A useful tool: the L → F mapping
[the sorted-rotation matrix above, with columns F and L]

How do we map L's characters onto F's characters?
… we need to distinguish equal characters in F…
Take two equal characters in L and rotate their rows rightward by one: they keep the same relative order !!

The BWT is invertible
Two key properties:
1. The LF-array maps L's characters to F's characters.
2. L[i] precedes F[i] in T.
Reconstruct T backward:   T = … i p p i #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}

How to compute the BWT ?
SA = 12 11 8 5 2 1 10 9 7 4 6 3
[the BWT matrix: row i starts at position SA[i] of T; its last column is L = i p s s m # p i s s i i]

We said that L[i] precedes F[i] in T, e.g. L[3] = T[7].
Given SA and T, we have L[i] = T[SA[i] − 1].

How to construct SA from T ?
SA     suffix
12     #
11     i#
 8     ippi#
 5     issippi#
 2     ississippi#
 1     mississippi#
10     pi#
 9     ppi#
 7     sippi#
 4     sissippi#
 6     ssippi#
 3     ssissippi#

Input: T = mississippi#
Elegant but inefficient. Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...
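A tiny BWT / inverse-BWT sketch based on sorting the rotations (fine for examples; real construction goes through the suffix array, as noted above; function names mine):

def bwt(T):
    """T must end with a unique, lexicographically smallest sentinel (here '#')."""
    rotations = sorted(T[i:] + T[:i] for i in range(len(T)))
    return "".join(row[-1] for row in rotations)          # the L column

def ibwt(L):
    """Invert the BWT via the LF mapping (L[i] precedes F[i] in T)."""
    n = len(L)
    order = sorted(range(n), key=lambda i: (L[i], i))     # order[j] = i  <=>  F[j] = L[i]
    LF = [0] * n
    for j, i in enumerate(order):
        LF[i] = j
    r, out = 0, []                                        # row 0 starts with the sentinel
    for _ in range(n):
        out.append(L[r])
        r = LF[r]
    out.reverse()                                         # = T rotated, sentinel first
    return "".join(out[1:] + out[:1])

L = bwt("mississippi#")
print(L)            # ipssm#pissii
print(ibwt(L))      # mississippi#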

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus γ(16), plus the original MTF-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion pages are available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by humankind



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…
  Physical network graph: V = routers, E = communication links
  The "cosine" graph (undirected, weighted): V = static web pages, E = semantic distance between pages
  Query-Log graph (bipartite, weighted): V = queries and URLs, E = (q,u) if u is a result for q and has been clicked by some user who issued q
  Social graph (undirected, unweighted): V = users, E = (x,y) if x knows y (facebook, address book, email, ..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution
In-degree follows a power-law distribution (Altavista crawl 1999, WebBase crawl 2001):
  Pr[in-degree(u) = k] ∝ 1/k^α,   α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
[figure: the adjacency matrix (i,j) of a crawl — 21 million pages, 150 million links]

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques
  caching: "avoid sending the same object again"
    done on the basis of objects
    only works if objects are completely unchanged
    How about objects that are slightly changed?
  compression: "remove redundancy in transmitted data"
    avoid repeated substrings in data (overhead)
    can be extended to the history of past transmissions
    What if the sender has never seen the data at the receiver?

Types of Techniques
  Common knowledge between sender & receiver
    Unstructured file: delta compression   [diff, zdelta, REBL, …]
  "Partial" knowledge
    Unstructured files: file synchronization   [rsync, zsync]
    Record-based data: set reconciliation

Formalization
  Delta compression
    Compress file f deploying file f'
    Compress a group of files
    Speed up web access by sending differences between the requested page and the ones available in cache
  File synchronization
    Client updates an old file f_old with f_new available on a server
    Mirroring, shared crawling, content distribution networks
  Set reconciliation
    Client updates a structured old file f_old with f_new available on a server
    Update of contacts or appointments, intersecting inverted lists in a P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77 scheme provides an efficient, optimal solution:
  f_known is the "previously encoded text"; compress the concatenation f_known·f_new, emitting the output only from f_new onwards

zdelta is one of the best implementations
           Emacs size   Emacs time
uncompr    27Mb         ---
gzip        8Mb         35 secs
zdelta     1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

[figure: client ↔ client-side proxy ↔ (slow link, delta-encoded pages) ↔ server-side proxy ↔ (fast link) ↔ web; both proxies hold the reference page used to delta-encode the requested page]

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[figure: weighted graph over the files plus a dummy node 0 whose edges carry the gzip costs; the min branching selects, for each file, the cheapest incoming edge]
           space    time
uncompr    30Mb     ---
tgz        20%      linear
THIS        8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
           space    time
uncompr    260Mb    ---
tgz        12%      2 mins
THIS        8%      16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
[figure: the client holds f_old, the server holds f_new; the client sends a request, the server answers with an update]
  The client wants to update an out-dated file.
  The server has the new file but does not know the old file.
  Update without sending the entire f_new (exploiting similarity).
  rsync: file-synchronization tool, distributed with Linux.

Delta compression is a sort of "local" synchronization, since the server has both copies of the file.

The rsync algorithm
[figure: the client sends the block hashes of f_old; the server replies with the encoded file (copy/literal instructions) built from f_new]

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
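A much-simplified sketch of the rsync idea: block hashes of f_old are used to find copies while scanning f_new at every offset. Here the block content itself stands in for the 4-byte rolling hash + MD5 pair, and the block size and file contents are illustrative:

def rsync_delta(f_old, f_new, B=4):
    """Encode f_new as ('copy', block_index) / ('literal', char) items,
    using only the (hashes of the) aligned blocks of f_old."""
    blocks = {f_old[i:i + B]: i // B                      # block content -> block index
              for i in range(0, len(f_old) - B + 1, B)}
    out, i = [], 0
    while i < len(f_new):
        chunk = f_new[i:i + B]
        if len(chunk) == B and chunk in blocks:           # block of f_old found at this offset
            out.append(("copy", blocks[chunk]))
            i += B
        else:                                             # no match: ship one literal char
            out.append(("literal", f_new[i]))
            i += 1
    return out

f_old = "the quick brown fox jumps"
f_new = "the quick red fox jumps!"
print(rsync_delta(f_old, f_new))   # mostly 'copy' items, plus a few literals around the changed word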

Rsync: some experiments

          gcc size   emacs size
total     27288      27326
gzip       7563       8577
zdelta      227       1431
rsync       964       4452

Compressed size in KB (slightly outdated numbers).
Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends the hashes (unlike rsync, where the client sends them); the client checks them.
The server deploys the common f_ref to compress the new f_tar (rsync compresses just the literals).

A multi-round protocol
  k blocks of n/k elements, log(n/k) levels
  If the distance is k, then on each level at most k hashes do not find a match in the other file.
  The communication complexity is O(k · lg n · lg(n/k)) bits.

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

[figure: P compared against the suffix T[i,N]]
Occurrences of P in T = all suffixes of T having P as a prefix.
  P = si,  T = mississippi  →  occurrences at positions 4, 7
SUF(T) = sorted set of the suffixes of T

Reduction: from substring search to prefix search (over the set of suffixes).

The Suffix Tree
[figure: the suffix tree of T# = mississippi#, with 12 leaves labelled by the starting positions of the suffixes; e.g. the leaves below the path spelling "si" are 4 and 7]
The Suffix Array
Prop 1. All suffixes in SUF(T) having prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.
(Storing SUF(T) explicitly would require Θ(N²) space.)

SA     SUF(T)
12     #
11     i#
 8     ippi#
 5     issippi#
 2     ississippi#
 1     mississippi#
10     pi#
 9     ppi#
 7     sippi#
 4     sissippi#
 6     ssippi#
 3     ssissippi#

T = mississippi#,   P = si
Suffix Array:
• SA: Θ(N log₂ N) bits
• Text T: N chars
⇒ in practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison (2 accesses per step: the SA entry and the text).
[example: binary search for P = si over the SA of T = mississippi#; move right when P is larger than the middle suffix, left when it is smaller]

Suffix Array search:
• O(log₂ N) binary-search steps
• each step takes O(p) char comparisons
⇒ overall, O(p log₂ N) time   [improvable to O(p + log₂ N), Manber-Myers '90; and to O(p + log₂ |S|), Cole et al. '06]
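A binary-search sketch over the suffix array, returning the starting positions of all occurrences of P (the toy SA construction sorts the suffixes directly; function names mine):

def sa_search(T, SA, P):
    """Indirect binary search on SA: each comparison costs O(|P|) chars.
    Returns the starting positions of all occurrences of P in T."""
    p = len(P)
    lo, hi = 0, len(SA)
    while lo < hi:                                   # leftmost suffix with prefix >= P
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + p] < P:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    lo, hi = first, len(SA)
    while lo < hi:                                   # leftmost suffix with prefix > P
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + p] <= P:
            lo = mid + 1
        else:
            hi = mid
    return [SA[k] for k in range(first, lo)]

T = "mississippi#"
SA = sorted(range(len(T)), key=lambda i: T[i:])      # toy O(N^2 log N) construction
print([i + 1 for i in SA])                           # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print([i + 1 for i in sa_search(T, SA, "si")])       # [7, 4] -> P = si occurs at positions 4 and 7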

Locating the occurrences
[figure: the occurrences of P = si in T = mississippi# are the contiguous SA entries lying between the lexicographic positions of si# and si$, i.e. SA[9] = 7 and SA[10] = 4  ⇒  occ = 2]

Suffix Array search: O(p + log₂ N + occ) time
Suffix Trays: O(p + log₂ |S| + occ)   [Cole et al., '06]
String B-tree   [Ferragina-Grossi, '95]
Self-adjusting Suffix Arrays   [Ciriani et al., '02]

Text mining
Lcp[1,N−1] = longest common prefix between suffixes that are adjacent in SA.
For T = mississippi# and SA = 12 11 8 5 2 1 10 9 7 4 6 3:
  Lcp = 0 1 1 4 0 0 1 0 2 1 3
  (e.g. the Lcp between issippi# and ississippi# is 4)

• How long is the common prefix between T[i,…] and T[j,…]?
  It is the minimum of the subarray Lcp[h, k−1] such that SA[h] = i and SA[k] = j.
• Does there exist a repeated substring of length ≥ L?
  Search for some Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times?
  Search for a window Lcp[i, i+C−2] whose entries are all ≥ L.

Slide 16

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H(S)  =  Σ_{s∈S} p(s) · log₂ ( 1 / p(s) )    bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into its codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
[Figure: the code viewed as a binary trie — a = 0, b = 100, c = 101, d = 11; symbols sit at the leaves]

Average Length
For a code C with codeword length L[s], the
average length is defined as
L_a(C)  =  Σ_{s∈S} p(s) · L[s]

We say that a prefix code C is optimal if for
all prefix codes C’,  L_a(C) ≤ L_a(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, there exists a prefix
code with the same codeword lengths, and thus the
same optimal average length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn},

then p_i < p_j  ⇒  L[s_i] ≥ L[s_j]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
[Figure: the Huffman tree — a(.1) and b(.2) merge into (.3); (.3) and c(.2) merge into (.5); (.5) and d(.5) merge into the root (1)]

a = 000,  b = 001,  c = 01,  d = 1
There are 2^(n−1) “equivalent” Huffman trees

What about ties (and thus, tree depth) ?
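A minimal sketch in Python of the greedy construction (repeatedly merge the two least probable subtrees); tie-breaking is arbitrary, so the bits may differ from the example above while the codeword lengths stay optimal.

import heapq

def huffman_codes(probs):
    """probs: symbol -> probability. Returns symbol -> codeword (a 0/1 string)."""
    heap = [(p, i, s) for i, (s, p) in enumerate(probs.items())]   # i only breaks ties
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        p1, _, t1 = heapq.heappop(heap)      # the two smallest probabilities
        p2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next_id, (t1, t2)))
        next_id += 1
    codes = {}
    def assign(tree, prefix):
        if isinstance(tree, tuple):
            assign(tree[0], prefix + "0")
            assign(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"
    assign(heap[0][2], "")
    return codes

print(huffman_codes({"a": .1, "b": .2, "c": .2, "d": .5}))
# codeword lengths: a, b -> 3 bits, c -> 2 bits, d -> 1 bit, as in the running example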

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
abc…  →  00000101
101001…  →  dcb

[Figure: the same Huffman tree, walked from the root while decoding]

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

  firstcode[L]        (the smallest codeword of length L; = 00…0 at the deepest level)
  Symbol[L,i], for each i in level L

This is ≤ h² + |S| log |S| bits

Canonical Huffman: Encoding

[Figure: the canonical codewords assigned level by level, 1 … 5]

Canonical Huffman: Decoding

firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...
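A minimal sketch in Python of the classical firstcode-based decoding loop (as in Managing Gigabytes); the tiny code used in the demo (a=1, b=01, c=000, d=001) is hypothetical, not the one of the slide.

def canonical_decode_one(next_bit, firstcode, symbol):
    """firstcode[l] = value of the smallest codeword of length l;
    symbol[l][i] = i-th symbol among those of codeword length l, in canonical order."""
    v = next_bit()
    l = 1
    while v < firstcode[l]:        # no codeword of length l is this small: read one more bit
        v = 2 * v + next_bit()
        l += 1
    return symbol[l][v - firstcode[l]]

bits = iter("101000001")
next_bit = lambda: int(next(bits))
firstcode = {1: 1, 2: 1, 3: 0}                 # hypothetical code: a=1, b=01, c=000, d=001
symbol = {1: ["a"], 2: ["b"], 3: ["c", "d"]}
print("".join(canonical_decode_one(next_bit, firstcode, symbol) for _ in range(4)))   # abcd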

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:

  The model takes |S|^k · (k · log |S|) + h² bits

  It is H₀(S^L) ≤ L · H_k(S) + O(k · log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
[Figure: the word-based Huffman tree with fan-out 128 (branching symbols drawn as g, a, b) for T = “bzip or not bzip”; each codeword is a sequence of 7-bit configurations packed into bytes, and the tag bit of the first byte marks the beginning of a codeword]

CGrep and other ideas...
P= bzip = 1a 0b

[Figure: compressed search — the tagged codeword of P = “bzip” is matched directly against C(T), T = “bzip or not bzip”; the tag bits rule out false matches inside other codewords (yes/no marks in the figure)]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

[Figure: the dictionary {bzip, not, or, space} with its codewords, and the compressed text C(S) for S = “bzip or not bzip”; occurrences of P’s codeword are marked yes/no]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
[Figure: the pattern P aligned at its occurrences in the text T]

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bit-operations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errs, whose running
time is efficient with high probability.

We will consider a binary alphabet (i.e., T ∈ {0,1}^n).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H(s)  =  Σ_{i=1,…,m} 2^(m−i) · s[i]

P = 0101
H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5

s = s’ if and only if H(s) = H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

H_q(P) can be computed incrementally:
  (1·2 + 0) mod 7 = 2
  (2·2 + 1) mod 7 = 5
  (5·2 + 1) mod 7 = 4
  (4·2 + 1) mod 7 = 2
  (2·2 + 1) mod 7 = 5  =  H_q(P)

We can still compute H_q(T_r) from H_q(T_{r−1}):
  2^m (mod q) = 2 · (2^(m−1) (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
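A minimal sketch in Python of the fingerprint algorithm above, for the binary-alphabet setting of these slides; the fixed modulus q is only for illustration (the algorithm picks a random prime), and every fingerprint hit is verified, so this corresponds to the deterministic variant.

def karp_rabin(T, P, q=2_147_483_647):
    """Report the 0-based starting positions of P in T (both 0/1 strings)."""
    n, m = len(T), len(P)
    if m > n:
        return []
    hp = ht = 0
    for i in range(m):                            # Horner's rule, all arithmetic mod q
        hp = (2 * hp + int(P[i])) % q
        ht = (2 * ht + int(T[i])) % q
    top = pow(2, m - 1, q)                        # 2^(m-1) mod q, to drop the leftmost bit
    occ = []
    for r in range(n - m + 1):
        if hp == ht and T[r:r + m] == P:          # verify: a probable match becomes a definite one
            occ.append(r)
        if r + m < n:                             # H_q(T_{r+1}) from H_q(T_r)
            ht = (2 * (ht - int(T[r]) * top) + int(T[r + m])) % q
    return occ

print(karp_rabin("10110101", "0101"))   # [4]  (position 5 in the 1-based numbering of the slides)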

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

[Figure: the same dictionary and compressed text as in Problem 1; the codeword of P = “bzip” is searched directly in C(S), and matches are marked yes/no]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M is an m × n matrix; here m = 3 and n = 10:

         c a l i f o r n i a
  f      0 0 0 0 1 0 0 0 0 0
  fo     0 0 0 0 0 1 0 0 0 0
  for    0 0 0 0 0 0 1 0 0 0

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the (j−1)-th one.
We define the m-length binary vector U(x) for each
character x in the alphabet: U(x) is set to 1 at the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, entry M(i,j) = 1 iff
(1) the first i−1 characters of P match the i−1 characters of T
    ending at character j−1          ⇔  M(i−1, j−1) = 1
(2) P[i] = T[j]                      ⇔  the i-th bit of U(T[j]) is 1

BitShift moves bit M(i−1, j−1) into the i-th position;
AND-ing it with the i-th bit of U(T[j]) establishes whether both conditions hold.
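A minimal sketch in Python of the Shift-And method, assuming m ≤ w; a Python integer is used as the bit-vector (bit i−1 stands for row i of the column M(j)).

def shift_and(T, P):
    """Return the 0-based starting positions of P in T."""
    m = len(P)
    U = {}                                   # U[c] has bit i set iff P[i+1] = c
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M, occ = 0, []
    for j, c in enumerate(T):
        M = ((M << 1) | 1) & U.get(c, 0)     # BitShift(previous column) AND U(T[j])
        if M & (1 << (m - 1)):               # last row set: an occurrence ends at position j
            occ.append(j - m + 1)
    return occ

print(shift_and("xabxabaaca", "abaac"))   # [4], i.e. position 5 in the 1-based example that follows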

An example (T = xabxabaaca, P = abaac)

U(x) = (0,0,0,0,0)ᵀ   U(a) = (1,0,1,1,0)ᵀ   U(b) = (0,1,0,0,0)ᵀ   U(c) = (0,0,0,0,1)ᵀ

j=1:  M(1) = BitShift(M(0)) & U(T[1]) = (1,0,0,0,0)ᵀ & U(x) = (0,0,0,0,0)ᵀ
j=2:  M(2) = BitShift(M(1)) & U(T[2]) = (1,0,0,0,0)ᵀ & U(a) = (1,0,0,0,0)ᵀ
j=3:  M(3) = BitShift(M(2)) & U(T[3]) = (1,1,0,0,0)ᵀ & U(b) = (0,1,0,0,0)ᵀ
 …
j=9:  M(9) = BitShift(M(8)) & U(T[9]) = (1,1,0,0,1)ᵀ & U(c) = (0,0,0,0,1)ᵀ

The whole matrix M (rows i = 1…5, columns j = 1…10):

         x a b x a b a a c a
  1      0 1 0 0 1 0 1 1 0 1
  2      0 0 1 0 0 1 0 0 0 0
  3      0 0 0 0 0 0 1 0 0 0
  4      0 0 0 0 0 0 0 1 0 0
  5      0 0 0 0 0 0 0 0 1 0

M(5,9) = 1: an occurrence of P ends at position 9, i.e. it starts at position 5.

Shift-And method: Complexity








If m ≤ w, any column and any vector U() fit in a
memory word.
 ⇒ Any step requires O(1) time.
If m > w, any column and any vector U() can be
divided into m/w memory words.
 ⇒ Any step requires O(m/w) time.
Overall O( n(1 + m/w) + m ) time.
Thus, it is very fast when the pattern length is close
to the word size.
 ⇒ Very often the case in practice. Recall that w = 64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac

U(a) = (1,0,1,1,0)ᵀ     U(b) = (1,1,0,0,0)ᵀ     U(c) = (0,0,0,0,1)ᵀ

What about ‘?’, ‘[^…]’ (not).

Problem 1: Another solution

[Figure: the dictionary {bzip, not, or, space} and C(S) for S = “bzip or not bzip”; the occurrences of P = bzip are located in C(S) with the Shift-And machinery, marking matches yes/no]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

P = o

[Figure: the dictionary {bzip, not, or, space} and C(S) for S = “bzip or not bzip”; the terms containing P are “or” and “not”, and their codewords are then searched in C(S)]

not = 1g 0g 0a
or  = 1g 0a 0b

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: the patterns P1 and P2 aligned at their occurrences in the text T]

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

  S is the concatenation of the patterns in P
  R is a bitmap of length m:
    R[i] = 1 iff S[i] is the first symbol of a pattern

  Use a variant of the Shift-And method searching for S:
    For any symbol c, U’(c) = U(c) AND R
      ⇒ U’(c)[i] = 1 iff S[i] = c and it is the first symbol of a pattern
    For any step j:
      compute M(j)
      then OR it with U’(T[j]). Why?
       ⇒ it sets to 1 the first bit of each pattern that starts with T[j]
      Check whether there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

P = bot    k = 2

[Figure: the dictionary {bzip, not, or, space} and the compressed text C(S) for S = “bzip or not bzip”]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix M^l to be an m by n binary
matrix, such that:

M^l(i,j) = 1  iff  there are no more than l mismatches
between the first i characters of P and the i characters
of T ending at character j.

What is M^0?
How does M^k solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

[Figure: the first i−1 characters of P aligned with the i−1 characters of T ending at position j−1, with at most l mismatches, and P[i] = T[j]]

  BitShift( M^l(j−1) )  &  U(T[j])

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

[Figure: the first i−1 characters of P aligned with the i−1 characters of T ending at position j−1, with at most l−1 mismatches, and P[i] ≠ T[j]]

  BitShift( M^(l−1)(j−1) )

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M^1   (T = xabxabaaca, P = abaad)

[Figure: the matrices M^0 and M^1 for this example. The last row of M^0 is all zeros (no exact occurrence of P), while M^1(5,9) = 1: an occurrence with at most one mismatch ends at position 9]

How much do we pay?





The running time is O( k · n · (1 + m/w) ).
Again, the method is practically efficient for
small m.
Still, only O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

P = bot    k = 2

[Figure: the dictionary {bzip, not, or, space} and C(S) for S = “bzip or not bzip”; the only term matching P = bot with at most 2 mismatches is “not”, whose codeword is then searched in C(S)]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding

  γ(x) = 0^(Length−1) followed by x in binary,
  with x > 0 and Length = ⌊log₂ x⌋ + 1
  e.g., 9 is represented as <000, 1001>.

  The γ-code of x takes 2⌊log₂ x⌋ + 1 bits
  (i.e. a factor of 2 from optimal)

  Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

→  8   6   3   59   7
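A minimal sketch in Python of γ-encoding and of the decoding of a concatenated γ-coded sequence, matching the exercise above.

def gamma_encode(x):
    """γ-code of a positive integer: Length-1 zeros, then x in binary."""
    assert x > 0
    b = bin(x)[2:]                     # binary representation without leading zeros
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    """Decode a string of concatenated γ-codes into the list of integers."""
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":          # count the leading zeros = Length - 1
            z += 1
            i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out

print(gamma_encode(9))                                    # 0001001
print(gamma_decode("0001000001100110000011101100111"))    # [8, 6, 3, 59, 7]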

Analysis
Sort the p_i in decreasing order, and encode s_i via
the variable-length code γ(i).

Recall that: |γ(i)| ≤ 2·log₂ i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2·H₀(S) + 1
Key fact:
1 ≥ Σ_{i=1,…,x} p_i ≥ x · p_x   ⇒   x ≤ 1/p_x

How good is it ?
Encode the integers via γ-coding:
|γ(i)| ≤ 2·log₂ i + 1
The cost of the encoding is (recall i ≤ 1/p_i):

  Σ_{i=1,…,|S|} p_i · |γ(i)|  ≤  Σ_{i=1,…,|S|} p_i · [ 2·log₂(1/p_i) + 1 ]  =  2·H₀(X) + 1

Not much worse than Huffman,
and improvable to H₀(X) + 2 + …

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s·c with 2 bytes, s·c² on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 128² = 16,512 words on 2 bytes
A (230,26)-dense code encodes 230 + 230·26 = 6,210 on 2
bytes, hence more words on 1 byte, and thus it wins if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1n 2n 3n… nn  Huff = O(n2 log n), MTF = O(n log n) + n2

No much worse than Huffman
...but it may be far better
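A minimal sketch in Python of the MTF transform described above (the output integers would then be γ-coded or fed to a statistical coder).

def mtf_encode(text, alphabet):
    """For each symbol output its current position in the list L, then move it to the front."""
    L = list(alphabet)
    out = []
    for s in text:
        pos = L.index(s)       # position of s in L (0-based here)
        out.append(pos)
        L.pop(pos)
        L.insert(0, s)         # move s to the front of L
    return out

print(mtf_encode("aabbbcc", "abc"))   # [0, 0, 1, 0, 0, 2, 0]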

MTF: how good is it ?
Encode the integers via γ-coding:
|γ(i)| ≤ 2·log₂ i + 1
Put S at the front and consider the cost of encoding
(p_x^i is the position of the i-th occurrence of symbol x):

  O(|S| log |S|)  +  Σ_{x=1,…,|S|}  Σ_{i=2,…,n_x}  |γ( p_x^i − p_x^{i−1} )|

By Jensen’s inequality:

  ≤  O(|S| log |S|)  +  Σ_{x=1,…,|S|} n_x · [ 2·log₂(N/n_x) + 1 ]
  =  O(|S| log |S|)  +  N · [ 2·H₀(X) + 1 ]

Hence   L_a[mtf]  ≤  2·H₀(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
(there is a memory)

X = 1^n 2^n 3^n … n^n   ⇒   Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)
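A minimal sketch in Python of the run-length encoder, reproducing the example above.

def rle(s):
    """Run-Length Encoding: a (symbol, run length) pair per maximal run."""
    out, i = [], 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1                      # extend the current run
        out.append((s[i], j - i))
        i = j
    return out

print(rle("abbbaacccca"))   # [('a', 1), ('b', 3), ('a', 2), ('c', 4), ('a', 1)]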

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.   p(a) = .2,  p(b) = .5,  p(c) = .3

  f(i)  =  Σ_{j=1,…,i−1} p(j)

  f(a) = .0,  f(b) = .2,  f(c) = .7

[Figure: the unit interval [0,1) partitioned as a = [.0,.2), b = [.2,.7), c = [.7,1.0)]

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac

  start:     [0, 1)
  after b:   [.2, .7)     (size .5)
  after a:   [.2, .3)     (size .1)
  after c:   [.27, .3)    (size .03)

The final sequence interval is [.27, .3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l 0  0 l i  l i -1  s i -1 * f c i 

s0  1

s i  s i -1 * p c i 

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

n

sn 

 p c 
i

i 1

The interval for a message sequence will be called the
sequence interval
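A minimal sketch in Python of the interval computation just defined (floating point is used only for illustration; real coders use the integer version discussed later).

def sequence_interval(msg, p):
    """Return (l, s): the sequence interval of msg is [l, l+s)."""
    f, acc = {}, 0.0
    for c in sorted(p):                 # cumulative probabilities, f[c] = sum of the preceding ones
        f[c] = acc
        acc += p[c]
    l, s = 0.0, 1.0
    for c in msg:
        l = l + s * f[c]                # l_i = l_{i-1} + s_{i-1} * f[c_i]
        s = s * p[c]                    # s_i = s_{i-1} * p[c_i]
    return l, s

l, s = sequence_interval("bac", {"a": .2, "b": .5, "c": .3})
print(l, l + s)    # ≈ 0.27 0.30, i.e. the interval [.27, .3) of the example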

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:

  .49 ∈ [.2, .7)     →  b
  .49 ∈ [.3, .55)    →  b
  .49 ∈ [.475, .55)  →  c

The message is bbc.

Representing a real number
Binary fractional representation:
  .75    =  .11
  1/3    =  .0101…
  11/16  =  .1011

Algorithm
  1.  x = 2·x
  2.  If x < 1, output 0
  3.  else x = x − 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval?
e.g.   [0,.33) → .01     [.33,.66) → .1     [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.

  code    min      max      interval
  .11     .110     .111     [.75, 1.0)
  .101    .1010    .1011    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most

  1 + ⌈ log₂ (1/s) ⌉  =  1 + ⌈ log₂ Π_{i=1,…,n} (1/p_i) ⌉
                       ≤  2 + Σ_{i=1,…,n} log₂ (1/p_i)
                       =  2 + Σ_{k=1,…,|S|} n·p_k · log₂ (1/p_k)
                       =  2 + n·H₀   bits

In practice it is nH₀ + 0.02·n bits,
because of rounding.

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R = 2^k
Use rounding to generate integer intervals
Whenever the sequence interval falls into the top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

[Figure: the coder as a state machine — from state (L,s), the distribution (p1,…,pS) and the next symbol c produce the new state (L’,s’)]

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox

[Figure: the same ATB state machine, now driven by p[ s | context ], where s = c or esc]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts    (String = ACCBACCACBA B,   k = 2)

  Context Empty:   A = 4    B = 2    C = 5    $ = 3

  Context A:    C = 3    $ = 1
  Context B:    A = 2    $ = 1
  Context C:    A = 1    B = 2    C = 2    $ = 3

  Context AC:   B = 1    C = 2    $ = 2
  Context BA:   C = 1    $ = 1
  Context CA:   C = 1    $ = 1
  Context CB:   A = 2    $ = 1
  Context CC:   A = 1    B = 1    $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:

  Output ⟨d, len, c⟩ where
  d = distance of the copied string wrt the current position
  len = length of the longest match
  c = next char in the text beyond the longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
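A minimal sketch in Python of the LZW encoder just described; for simplicity the dictionary is initialized with the distinct characters of the input rather than with the 256 ASCII entries of the slides.

def lzw_encode(text):
    """Emit dictionary ids only; S+c is added to the dictionary right after emitting S's id."""
    dictionary = {c: i for i, c in enumerate(sorted(set(text)))}
    out, S = [], ""
    for c in text:
        if S + c in dictionary:
            S = S + c                               # extend the current match
        else:
            out.append(dictionary[S])               # longest match S found in the dictionary
            dictionary[S + c] = len(dictionary)     # add Sc with the next free id
            S = c
    out.append(dictionary[S])
    return out

print(lzw_encode("aabaacababacb"))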

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Consider the text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: the L → F mapping

[Figure: the sorted BWT matrix again, with the first column F and the last column L highlighted]

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
[Figure: the same matrix, with columns F and L]

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
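A minimal quadratic-space sketch in Python of the transform and of the LF-based inversion sketched above (real implementations build the suffix array instead of materializing the rotations).

def bwt(T):
    """BWT of T, which must end with a unique smallest character (here '#')."""
    rotations = sorted(T[i:] + T[:i] for i in range(len(T)))
    return "".join(row[-1] for row in rotations)            # the last column L

def ibwt(L):
    """Invert the BWT via the LF-mapping, rebuilding the text backward."""
    n = len(L)
    order = sorted(range(n), key=lambda i: (L[i], i))       # stable: equal chars keep their order
    LF = [0] * n
    for f_row, l_row in enumerate(order):
        LF[l_row] = f_row                                   # L[l_row] occupies row f_row of F
    r, out = 0, []                                          # row 0 of F holds '#'
    for _ in range(n):
        out.append(L[r])                                    # L[r] precedes F[r] in the text
        r = LF[r]
    # out = T[n-2], T[n-3], ..., T[0], '#': reverse the first n-1 chars and re-append '#'
    return "".join(reversed(out[:-1])) + out[-1]

L = bwt("mississippi#")
print(L)           # ipssm#pissii
print(ibwt(L))     # mississippi#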

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in - degree ( u )  k ] 

a  2.1

1

k

a

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s₁−x, s₂−s₁−1, ..., s_k−s_{k−1}−1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77-scheme provides an efficient, optimal solution




f_known is the “previously encoded text”; compress the concatenation f_known · f_new starting from f_new

zdelta is one of the best implementations
            Emacs size    Emacs time
uncompr     27Mb          ---
gzip        8Mb           35 secs
zdelta      1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: the weighted graph over the files plus a dummy node; edge weights are the zdelta sizes, dummy edges carry the gzip sizes, and the min branching picks the cheapest reference for each file]

            space    time
uncompr     30Mb     ---
tgz         20%      linear
THIS        8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
            space    time
uncompr     260Mb    ---
tgz         12%      2 mins
THIS        8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

            gcc size    emacs size
total       27288       27326
gzip        7563        8577
zdelta      227         1431
rsync       964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree

T# = mississippi#

[Figure: the suffix tree of T#; edges are labeled with substrings (e.g. #, i, s, si, ssi, ppi#, pi#, mississippi#, …) and the 12 leaves store the starting positions of the suffixes]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.

Θ(N²) space (if SUF(T) is stored explicitly)

  SA      SUF(T)
  12      #
  11      i#
   8      ippi#
   5      issippi#
   2      ississippi#
   1      mississippi#
  10      pi#
   9      ppi#
   7      sippi#
   4      sissippi#
   6      ssippi#
   3      ssissippi#

T = mississippi#      (each SA entry is a suffix pointer)

P = si
Suffix Array
• SA: Θ(N log₂ N) bits
• Text T: N chars
 ⇒ In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
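A minimal sketch in Python of the indirect binary search just described; the naive suffix-array construction used in the demo is only for illustration (it is exactly the “elegant but inefficient” approach discussed earlier).

def sa_search(T, SA, P):
    """Return the (0-based) starting positions of P in T via binary search on SA."""
    def first(strict):
        lo, hi = 0, len(SA)
        while lo < hi:
            mid = (lo + hi) // 2
            pref = T[SA[mid]:SA[mid] + len(P)]     # O(p) chars compared per step
            if pref < P or (strict and pref == P):
                lo = mid + 1
            else:
                hi = mid
        return lo
    l, r = first(False), first(True)               # the contiguous SA range prefixed by P
    return sorted(SA[l:r])

T = "mississippi#"
SA = sorted(range(len(T)), key=lambda i: T[i:])    # naive construction
print(sa_search(T, SA, "si"))                      # [3, 6], i.e. positions 4 and 7 (1-based)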

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
• Search for a window Lcp[i, i+C−2] whose entries are all ≥ L


Slide 17

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword lengths L[s], the
average length is defined as

	La(C) = ∑_{s ∈ S} p(s) · L[s]

We say that a prefix code C is optimal if for
all prefix codes C’, La(C) ≤ La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, there exists a prefix
code with the same codeword lengths and thus the
same (optimal) average length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn},
then pi < pj  ⇒  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

	H(S) ≤ La(C)

Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

	La(C) ≤ H(S) + 1

(the Shannon code, which takes ⌈log_2 1/p(s)⌉ bits per symbol)

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

[Huffman tree: merge a(.1)+b(.2) → (.3); merge (.3)+c(.2) → (.5); merge (.5)+d(.5) → (1)]

a=000, b=001, c=01, d=1
There are 2^(n-1) “equivalent” Huffman trees (flip the 0/1 labels at each of the n-1 internal nodes)

What about ties (and thus, tree depth) ?
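A minimal C++ sketch of Huffman tree construction with a priority queue, matching the running example above (ties are broken arbitrarily, which is exactly where the “equivalent trees” and tree-depth remarks come from):

#include <queue>
#include <vector>
#include <string>
#include <cstdio>
using namespace std;

struct Node { double p; int sym; Node *l, *r; };   // sym = -1 for internal nodes
struct Cmp { bool operator()(Node* a, Node* b) const { return a->p > b->p; } };

Node* build_huffman(const vector<double>& prob) {
    priority_queue<Node*, vector<Node*>, Cmp> pq;
    for (int i = 0; i < (int)prob.size(); ++i)
        pq.push(new Node{prob[i], i, nullptr, nullptr});
    while (pq.size() > 1) {                         // repeatedly merge the two rarest trees
        Node* a = pq.top(); pq.pop();
        Node* b = pq.top(); pq.pop();
        pq.push(new Node{a->p + b->p, -1, a, b});
    }
    return pq.top();                                // root of the codeword tree
}

void dump(Node* n, string code = "") {              // print root-to-leaf codewords
    if (n->sym >= 0) { printf("symbol %d -> %s\n", n->sym, code.c_str()); return; }
    dump(n->l, code + "0"); dump(n->r, code + "1");
}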

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
	abc...  →  000 001 01  =  00000101
	101001...  →  d c b

[Huffman tree of the running example: a=000, b=001, c=01, d=1]

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store, for every level L of the tree:

 firstcode[L]  (the first codeword on level L; on the deepest level it is 00.....0)
 Symbol[L,i], for each i-th symbol on level L

This takes ≤ h^2 + |S| log |S| bits  (h = height of the tree)

Canonical Huffman
Encoding

[Figure: levels L = 1 … 5 of the canonical tree, each with its firstcode[L] and its symbols]

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...
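A sketch (in C++) of one common firstcode-driven decoding loop, consistent with the firstcode[] values listed above; the Symbol table layout and the bit-stream representation are assumptions made for illustration:

#include <vector>
using namespace std;

// Decode one symbol from a bit stream using a canonical Huffman tree.
// firstcode[L] = numeric value of the first codeword of length L (L = 1..maxlen);
// Symbol[L][i] = i-th symbol on level L; bits[pos] is the next input bit.
int decode_one(const vector<int>& bits, size_t& pos,
               const vector<int>& firstcode, const vector<vector<int>>& Symbol) {
    int v = 0, maxlen = (int)firstcode.size() - 1;
    for (int L = 1; L <= maxlen; ++L) {
        v = 2 * v + bits[pos++];                // append the next bit to the codeword value
        if (v >= firstcode[L])                  // a codeword of length L has been read
            return Symbol[L][v - firstcode[L]];
    }
    return -1;                                  // malformed input
}

For instance, on the bits 0,0,0,1,0 with the firstcode values above, the loop stops at level 5 with v = 2 and returns Symbol[5][2].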

Problem with Huffman Coding
Consider a symbol with probability .999. Its
self information is

	− log_2(.999) ≈ .00144 bits

If we were to send 1000 such symbols we
might hope to use 1000 · .00144 ≈ 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, i.e. k → ∞ !!

In practice, we have:

 The model takes |S|^k · (k · log |S|) + h^2 bits (where h might be |S|)
 It is H_0(S^L) ≤ L · H_k(S) + O(k · log |S|), for each k ≤ L

Compress + Search ?                                     [Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged

[Figure: 128-ary Huffman tree over the words of T = “bzip or not bzip”
(bzip, or, not, space); each codeword is a sequence of 7-bit configurations
packed into bytes, and the tag bit marks the first byte of every codeword.
C(T) is the resulting byte-aligned compressed text.]

CGrep and other ideas...

[Figure: GREP run directly on the compressed text. P = bzip is encoded as
its codeword “1a 0b”; C(T), with T = “bzip or not bzip”, is scanned
byte-aligned, and the tag bits identify the codeword boundaries where the
comparison is attempted (yes/no at each candidate).]
Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary = { bzip, not, or, space }
P = bzip, encoded as 1a 0b

[Figure: scan C(S), S = “bzip or not bzip”, for the codeword of P;
byte-alignment and tagging let each candidate position be checked with a
simple yes/no codeword comparison.]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
	T = … A B C A B D A B …       P = A B
 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching

We show methods in which arithmetic and bit-operations replace character comparisons.
We will survey two examples of such methods:

 The Random Fingerprint method due to Karp and Rabin
 The Shift-And method due to Baeza-Yates and Gonnet

Rabin-Karp Fingerprint

We will use a class of functions from strings to integers in
order to obtain:

 An efficient randomized algorithm that makes an error with
   small probability.
 A randomized algorithm that never errs, whose running
   time is efficient with high probability.

We will consider a binary alphabet (i.e., T ∈ {0,1}^n).

Arithmetic replaces Comparisons

Strings are also numbers, H: strings → numbers.
Let s be a string of length m:

	H(s) = ∑_{i=1}^{m} 2^{m−i} · s[i]

	P = 0101
	H(P) = 2^3·0 + 2^2·1 + 2^1·0 + 2^0·1 = 5

	s = s’  if and only if  H(s) = H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons

We can compute H(Tr) from H(Tr-1):

	H(Tr) = 2 · H(Tr-1) − 2^m · T[r−1] + T[r+m−1]

	T = 10110101,  m = 4
	T1 = 1 0 1 1
	T2 = 0 1 1 0

	H(T1) = H(1011) = 11
	H(T2) = 2·11 − 2^4·1 + 0 = 22 − 16 + 0 = 6 = H(0110)

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
	(0·2 + 1) mod 7 = 1
	(1·2 + 0) mod 7 = 2
	(2·2 + 1) mod 7 = 5
	(5·2 + 1) mod 7 = 4
	(4·2 + 1) mod 7 = 2
	(2·2 + 1) mod 7 = 5  =  Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1), since
	2^m (mod q) = 2 · ( 2^{m-1} (mod q) ) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm

 Choose a positive integer I
 Pick a random prime q ≤ I, and compute P’s fingerprint Hq(P).
 For each position r in T, compute Hq(Tr) and test whether it
   equals Hq(P). If the two values are equal, either
	 declare a probable match (randomized algorithm), or
	 check the characters and declare a definite match (deterministic algorithm).

 Running time: excluding verification, O(n+m).
 The randomized algorithm is correct w.h.p.
 The deterministic algorithm has expected running time O(n+m).

Proof on the board
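A minimal C++ sketch of the Karp-Rabin scan over a binary text, verifying every fingerprint hit (so it never errs); the modulus q is fixed here for illustration, whereas the algorithm above picks it at random:

#include <string>
#include <vector>
using namespace std;

// Report all starting positions of P in T (strings over '0'/'1'), rolling fingerprints mod q.
vector<int> karp_rabin(const string& T, const string& P, long long q = 1000003) {
    int n = T.size(), m = P.size();
    vector<int> occ;
    if (m == 0 || n < m) return occ;
    long long hp = 0, ht = 0, pow_m = 1;          // pow_m = 2^{m-1} mod q
    for (int i = 0; i < m; ++i) {
        hp = (2*hp + (P[i]-'0')) % q;
        ht = (2*ht + (T[i]-'0')) % q;
        if (i < m-1) pow_m = (2*pow_m) % q;
    }
    for (int r = 0; ; ++r) {                      // window T[r, r+m-1]
        if (hp == ht && T.compare(r, m, P) == 0) occ.push_back(r);
        if (r + m >= n) break;
        ht = (2*(ht - (T[r]-'0')*pow_m % q + 2*q) + (T[r+m]-'0')) % q;   // slide the window
    }
    return occ;
}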

Problem 1: Solution
Dictionary = { bzip, not, or, space }
P = bzip, encoded as 1a 0b

[Figure: as before, C(S) with S = “bzip or not bzip” is scanned codeword
by codeword; the tag bits mark codeword boundaries, so P’s codeword is
compared only at valid positions (yes/no at each test).]

Speed ≈ Compression ratio

The Shift-And method

Define M to be a binary m-by-n matrix such that:
	M(i,j) = 1  iff  the first i characters of P exactly match the i
	characters of T ending at character j,
	i.e., M(i,j) = 1  iff  P[1 … i] = T[j-i+1 ... j]

Example: T = california and P = for

	        c a l i f o r n i a
	  f     0 0 0 0 1 0 0 0 0 0
	  fo    0 0 0 0 0 1 0 0 0 0
	  for   0 0 0 0 0 0 1 0 0 0

How does M solve the exact match problem?

How to construct M

We want to exploit bit-parallelism to compute the j-th
column of M from the (j-1)-th one.

 Machines can perform bit and arithmetic operations between
   two words in constant time.
 Examples:
	And(A,B) is the bit-wise and between A and B.
	BitShift(A) is the value derived by shifting A’s bits down by
	one and setting the first bit to 1, e.g.
	BitShift( [0 1 1 0 1]ᵀ ) = [1 0 1 1 0]ᵀ

Let w be the word size (e.g., 32 or 64 bits). We’ll assume
m = w. NOTICE: any column of M fits in a memory word.

How to construct M

 We want to exploit bit-parallelism to compute the j-th
   column of M from the (j-1)-th one.
 We define the m-length binary vector U(x) for each
   character x of the alphabet: U(x) is set to 1 at the
   positions of P where character x appears.

Example: P = abaac

	U(a) = [1 0 1 1 0]ᵀ    U(b) = [0 1 0 0 0]ᵀ    U(c) = [0 0 0 0 1]ᵀ

How to construct M

 Initialize column 0 of M to all zeros
 For j > 0, the j-th column is obtained as

	M(j) = BitShift( M(j-1) )  &  U( T[j] )

For i > 1, entry M(i,j) = 1 iff
 (1) the first i-1 characters of P match the i-1 characters of T
     ending at character j-1    ⇔   M(i-1,j-1) = 1
 (2) P[i] = T[j]                ⇔   the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) into the i-th position;
AND-ing it with the i-th bit of U(T[j]) establishes whether both conditions hold.
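A minimal C++ sketch of Shift-And for m ≤ w, keeping one machine word per column as described above (the lowest bit of the word plays the role of row 1):

#include <string>
#include <vector>
#include <cstdint>
using namespace std;

// Report all ending positions (0-based) of exact occurrences of P in T, m <= 64.
vector<int> shift_and(const string& T, const string& P) {
    int m = P.size();
    uint64_t U[256] = {0};
    for (int i = 0; i < m; ++i) U[(unsigned char)P[i]] |= 1ULL << i;   // the U(x) masks
    uint64_t M = 0, full = 1ULL << (m - 1);
    vector<int> occ;
    for (int j = 0; j < (int)T.size(); ++j) {
        M = ((M << 1) | 1ULL) & U[(unsigned char)T[j]];   // BitShift, then AND with U(T[j])
        if (M & full) occ.push_back(j);                    // bit m set: P ends at position j
    }
    return occ;
}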

An example, j=1      (T = xabxabaaca, P = abaac)

	M(1) = BitShift(M(0)) & U(T[1]) = [1 0 0 0 0]ᵀ & U(x) = [1 0 0 0 0]ᵀ & [0 0 0 0 0]ᵀ = [0 0 0 0 0]ᵀ

An example, j=2

	M(2) = BitShift(M(1)) & U(T[2]) = [1 0 0 0 0]ᵀ & U(a) = [1 0 0 0 0]ᵀ & [1 0 1 1 0]ᵀ = [1 0 0 0 0]ᵀ

An example, j=3

	M(3) = BitShift(M(2)) & U(T[3]) = [1 1 0 0 0]ᵀ & U(b) = [1 1 0 0 0]ᵀ & [0 1 0 0 0]ᵀ = [0 1 0 0 0]ᵀ

An example, j=9

	M(9) = BitShift(M(8)) & U(T[9]) = [1 1 0 0 1]ᵀ & U(c) = [1 1 0 0 1]ᵀ & [0 0 0 0 1]ᵀ = [0 0 0 0 1]ᵀ

The last bit of M(9) is 1: an occurrence of P = abaac ends at position 9 of T.

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions

We want to allow the pattern to contain special
symbols, like the character class [a-f].

Example: P = [a-b]baac

	U(a) = [1 0 1 1 0]ᵀ    U(b) = [1 1 0 0 0]ᵀ    U(c) = [0 0 0 0 1]ᵀ

What about ‘?’, ‘[^…]’ (not) ?

Problem 1: Another solution
Dictionary = { bzip, not, or, space }
P = bzip, encoded as 1a 0b

[Figure: again searching P’s codeword in C(S), S = “bzip or not bzip”;
candidate positions are checked with yes/no tests.]

Speed ≈ Compression ratio

Problem 2
Dictionary = { bzip, not, or, space }

Given a pattern P, find all the occurrences in S
of all dictionary terms containing P as a substring.

	P = o
	not = 1g 0g 0a
	or  = 1g 0a 0b

[Figure: the codewords of all the terms containing “o” are searched in
C(S), S = “bzip or not bzip”.]

Speed ≈ Compression ratio?  No! Why?
A scan of C(S) for each term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].

	[Figure: text T with the occurrences of P1 and P2 highlighted]

 Naïve solution
 Use an (optimal) exact-matching algorithm to search for each
   pattern of P separately
 Complexity: O(nl + m) time — not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

 S is the concatenation of the patterns in P
 R is a bitmap of length m:
	R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of the Shift-And method searching for S:

 For any symbol c, U’(c) = U(c) AND R
	U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
 For any step j:
	compute M(j)
	then set M(j) = M(j) OR U’(T[j]). Why?
	 this sets to 1 the first bit of each pattern that starts with T[j]
	Check if there are occurrences ending in j. How?

Problem 3
Dictionary = { bzip, not, or, space }

Given a pattern P, find all the occurrences in S
of all dictionary terms containing P as a substring,
allowing at most k mismatches.

	P = bot,  k = 2

[Figure: the compressed text C(S), S = “bzip or not bzip”, is scanned for
the codewords of the terms that match P within k mismatches.]

Agrep: Shift-And method with errors

We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
	T = aatatccacaa
	P = atcgaa
P appears in T with 2 mismatches starting at position
4; it also occurs with 4 mismatches starting at
position 2.

	aatatccacaa        aatatccacaa
	   atcgaa           atcgaa

Agrep

Our current goal: given k, find all the occurrences of P
in T with up to k mismatches.
We define the matrix M^l to be an m-by-n binary
matrix, such that:

	M^l(i,j) = 1  iff  there are no more than l mismatches between the
	first i characters of P and the i characters of T ending at position j.

 What is M^0 ?
 How does M^k solve the k-mismatch problem?

Computing M^k

 We compute M^l for all l = 0, … , k.
 For each j we compute M^0(j), M^1(j), … , M^k(j).
 For all l, initialize M^l(0) to the zero vector.
 In order to compute M^l(j), we observe that its i-th bit is 1 iff
   one of the two following cases holds:

Computing M^l: case 1

The first i-1 characters of P match a substring of T ending
at j-1 with at most l mismatches, and the next pair of
characters in P and T are equal:

	BitShift( M^l(j-1) )  &  U( T[j] )

Computing M^l: case 2

The first i-1 characters of P match a substring of T ending
at j-1 with at most l-1 mismatches (and position i is charged as one more mismatch):

	BitShift( M^{l-1}(j-1) )

Computing M^l

 We compute M^l for all l = 0, … , k; for each j we compute M^0(j), M^1(j), … , M^k(j).
 For all l, initialize M^l(0) to the zero vector.
 Combining the two cases:

	M^l(j) = [ BitShift( M^l(j-1) ) & U(T[j]) ]  OR  BitShift( M^{l-1}(j-1) )
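A C++ sketch of the k-mismatch recurrence above, keeping one word per error level (m ≤ 64; the term M^{-1} is simply absent, so level 0 reduces to exact Shift-And):

#include <string>
#include <vector>
#include <cstdint>
using namespace std;

// Report positions j where P occurs in T with at most k mismatches.
vector<int> shift_and_mismatch(const string& T, const string& P, int k) {
    int m = P.size();
    uint64_t U[256] = {0};
    for (int i = 0; i < m; ++i) U[(unsigned char)P[i]] |= 1ULL << i;
    vector<uint64_t> M(k + 1, 0), old(k + 1, 0);
    uint64_t full = 1ULL << (m - 1);
    vector<int> occ;
    for (int j = 0; j < (int)T.size(); ++j) {
        old = M;
        M[0] = ((old[0] << 1) | 1ULL) & U[(unsigned char)T[j]];     // exact prefix matches
        for (int l = 1; l <= k; ++l)                                 // allow one more mismatch
            M[l] = (((old[l] << 1) | 1ULL) & U[(unsigned char)T[j]])
                 | ((old[l-1] << 1) | 1ULL);
        if (M[k] & full) occ.push_back(j);    // P matches with <= k mismatches ending at j
    }
    return occ;
}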

Example: M^1 and M^0 for T = xabxabaaca, P = abaad

	         j = 1 2 3 4 5 6 7 8 9 10
	M^1 =  1     1 1 1 1 1 1 1 1 1 1
	       2     0 0 1 0 0 1 0 1 1 0
	       3     0 0 0 1 0 0 1 0 0 1
	       4     0 0 0 0 1 0 0 1 0 0
	       5     0 0 0 0 0 0 0 0 1 0

	M^0 =  1     0 1 0 0 1 0 1 1 0 1
	       2     0 0 1 0 0 1 0 0 0 0
	       3     0 0 0 0 0 0 1 0 0 0
	       4     0 0 0 0 0 0 0 1 0 0
	       5     0 0 0 0 0 0 0 0 0 0

How much do we pay?

 The running time is O( k·n·(1 + m/w) ).
 Again, the method is practically efficient for small m.
 Still, only O(k) columns of M are needed at any given time.
   Hence, the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary = { bzip, not, or, space }

Given a pattern P, find all the occurrences in S
of all dictionary terms containing P as a substring,
allowing k mismatches.

	P = bot,  k = 2
	not = 1g 0g 0a   (reported)

[Figure: the k-mismatch scan of C(S), S = “bzip or not bzip”, reports
the codeword of “not”.]

Agrep: more sophisticated operations

The Shift-And method can solve other operations too.

The edit distance between two strings p and s is
d(p,s) = minimum number of operations
needed to transform p into s, via three ops:
 Insertion: insert a symbol in p
 Deletion: delete a symbol from p
 Substitution: change a symbol of p into a different one

Example: d(ananas, banane) = 3

Search by regular expressions
 Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding

	γ(x) = 000...0 (Length−1 zeros) followed by x in binary

 x > 0 and Length = ⌊log_2 x⌋ + 1
 e.g., 9 is represented as <000, 1001>.
 The γ-code for x takes 2⌊log_2 x⌋ + 1 bits
   (i.e. a factor of 2 from optimal)
 Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers

It is a prefix-free encoding…

Given the following sequence of γ-coded
integers, reconstruct the original sequence:

	0001000001100110000011101100111

	= 0001000 | 00110 | 011 | 00000111011 | 00111   →   8, 6, 3, 59, 7
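A minimal C++ sketch of γ-encoding and γ-decoding over bit strings, matching the scheme above (a bit-string representation is used only for readability):

#include <string>
#include <vector>
using namespace std;

string gamma_encode(unsigned x) {                  // x > 0
    string bin;
    for (unsigned v = x; v > 0; v >>= 1) bin = char('0' + (v & 1)) + bin;
    return string(bin.size() - 1, '0') + bin;      // (Length-1) zeros, then x in binary
}

vector<unsigned> gamma_decode(const string& bits) {
    vector<unsigned> out;
    size_t i = 0;
    while (i < bits.size()) {
        size_t z = 0;
        while (bits[i + z] == '0') ++z;            // count leading zeros = Length-1
        unsigned x = 0;
        for (size_t j = 0; j < z + 1; ++j) x = 2*x + (bits[i + z + j] - '0');
        out.push_back(x);
        i += 2*z + 1;                              // total codeword length
    }
    return out;
}
// gamma_decode("0001000001100110000011101100111") -> 8 6 3 59 7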

Analysis
Sort the p_i in decreasing order, and encode s_i via
the variable-length code γ(i).

Recall that: |γ(i)| ≤ 2·log_2 i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2·H_0(S) + 1
Key fact:
	1 ≥ ∑_{i=1,...,x} p_i ≥ x · p_x   ⇒   x ≤ 1/p_x

How good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log_2 i + 1.
The cost of the encoding is (recall i ≤ 1/p_i):

	∑_{i=1,…,|S|} p_i · |γ(i)|  ≤  ∑_{i=1,…,|S|} p_i · [ 2·log_2 (1/p_i) + 1 ]  =  2·H_0(S) + 1

Not much worse than Huffman,
and improvable to H_0(S) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example

 5000 distinct words
 ETDC encodes 128 + 128^2 = 16512 words on up to 2 bytes
 A (230,26)-dense code encodes 230 + 230·26 = 6210 words on up to 2
   bytes, hence more words on 1 byte; thus, if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory

Properties:
 Exploits temporal locality, and it is dynamic
 X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n^2 log n) bits,  MTF = O(n log n) + n^2 bits

Not much worse than Huffman
...but it may be far better
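A minimal C++ sketch of Move-to-Front, turning a character sequence into the integer ranks that are then var-length coded (the list update here is linear-time; the search-tree/hash-table organization discussed on a later slide would bring it to O(log |S|) per symbol):

#include <string>
#include <vector>
#include <algorithm>
using namespace std;

// MTF: output, for each input char, its current position (0-based) in the list L,
// then move that char to the front of L.
vector<int> mtf_encode(const string& text, string L /* initial symbol list */) {
    vector<int> out;
    for (char c : text) {
        int pos = find(L.begin(), L.end(), c) - L.begin();
        out.push_back(pos);
        L.erase(L.begin() + pos);
        L.insert(L.begin(), c);            // recently used symbols get small ranks
    }
    return out;
}
// e.g. mtf_encode("ipppssss", "imps") -> 0 2 0 0 3 0 0 0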

MTF: how good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log_2 i + 1.
Put the alphabet S at the front and consider the cost of encoding
(p_x^i = position of the i-th occurrence of symbol x, n_x = its frequency):

	O(|S| log |S|)  +  ∑_{x=1}^{|S|} ∑_{i=2}^{n_x} γ( p_x^i − p_x^{i−1} )

By Jensen’s inequality:

	≤  O(|S| log |S|)  +  ∑_{x=1}^{|S|} n_x · [ 2·log_2 (N/n_x) + 1 ]
	=  O(|S| log |S|)  +  N · [ 2·H_0(X) + 1 ]

	⇒  La[mtf]  ≤  2·H_0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
	abbbaacccca  =>  (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  ⇒  just the run lengths and one starting bit

Properties:
 Exploits spatial locality, and it is a dynamic code (there is a memory)
 X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = Θ(n^2 log n)  >  Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval in the range from 0
(inclusive) to 1 (exclusive), of width p(symbol):

	f(i) = ∑_{j=1}^{i-1} p(j)

e.g.  p(a) = .2, p(b) = .5, p(c) = .3:

	1.0
	       c = .3
	0.7
	       b = .5
	0.2
	       a = .2
	0.0

	f(a) = .0,  f(b) = .2,  f(c) = .7

The interval for a particular symbol will be called
the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac    (p(a)=.2, p(b)=.5, p(c)=.3)

	start:  [0, 1)
	b  →  [.2, .7)      (l = 0 + 1·f(b) = .2,    s = 1·.5  = .5)
	a  →  [.2, .3)      (l = .2 + .5·f(a) = .2,  s = .5·.2 = .1)
	c  →  [.27, .3)     (l = .2 + .1·f(c) = .27, s = .1·.3 = .03)

The final sequence interval is [.27, .3)

Arithmetic Coding
To code a sequence of symbols c_1 c_2 … c_n with probabilities
p[c], use the following:

	l_0 = 0        l_i = l_{i-1} + s_{i-1} · f[c_i]
	s_0 = 1        s_i = s_{i-1} · p[c_i]

f[c] is the cumulative probability up to symbol c (not included).
The final interval size is

	s_n = ∏_{i=1}^{n} p[c_i]

The interval for a message sequence will be called the
sequence interval
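A small C++ sketch of the interval computation above (real-valued for clarity; the integer/scaling version discussed later is what one would actually implement). The symbols and probabilities are those of the running example:

#include <cstdio>
#include <map>
#include <string>
using namespace std;

int main() {
    map<char,double> p = {{'a',.2},{'b',.5},{'c',.3}};
    map<char,double> f = {{'a',.0},{'b',.2},{'c',.7}};   // cumulative probs (symbol excluded)
    double l = 0.0, s = 1.0;
    string msg = "bac";
    for (char c : msg) {              // l_i = l_{i-1} + s_{i-1}*f[c],  s_i = s_{i-1}*p[c]
        l = l + s * f[c];
        s = s * p[c];
    }
    printf("sequence interval = [%.4f, %.4f)\n", l, l + s);   // -> [0.2700, 0.3000)
    return 0;
}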

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message has length 3:

	.49 ∈ [.2, .7)     = the symbol interval of b    →  output b
	within [.2, .7):   .49 ∈ [.3, .55)   (b-subinterval)  →  output b
	within [.3, .55):  .49 ∈ [.475, .55) (c-subinterval)  →  output c

The message is bbc.

Representing a real number
Binary fractional representation:

	.75   = .11
	1/3   = .010101...
	11/16 = .1011

Algorithm (to emit the bits of x ∈ [0,1)):
 1.  x = 2·x
 2.  if x < 1, output 0
 3.  else x = x − 1; output 1

So how about just using the shortest binary
fractional representation lying in the sequence
interval?
e.g.  [0,.33) → .01     [.33,.66) → .1     [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions:

	          min      max      interval
	.11       .110     .111     [.75, 1.0)
	.101      .1010    .1011    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (a dyadic interval).

	Sequence interval: [.61, .79)
	Code interval (.101): [.625, .75)

Can use l + s/2 truncated to 1 + ⌈log_2 (1/s)⌉ bits.

Bound on Arithmetic length
Note that  − log_2 s + 1  =  log_2 (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most

	1 + ⌈log_2 (1/s_n)⌉  =  1 + ⌈log_2 ∏_i (1/p_i)⌉
	                     ≤  2 + ∑_{i=1,…,n} log_2 (1/p_i)
	                     =  2 + ∑_{k=1,…,|S|} n·p_k · log_2 (1/p_k)
	                     =  2 + n·H_0   bits

	(≈ n·H_0 + 0.02·n bits in practice, because of rounding)

Integer Arithmetic Coding
The problem is that operations on arbitrary-precision
real numbers are expensive.
Key ideas of the integer version:
 Keep integers in the range [0..R), where R = 2^k
 Use rounding to generate integer intervals
 Whenever the sequence interval falls into the top,
   bottom or middle half, expand the interval by a factor 2

Integer Arithmetic is an approximation.

Integer Arithmetic (scaling)
 If l ≥ R/2 (top half):
	output 1 followed by m 0s;  m = 0;  the interval is expanded by 2
 If u < R/2 (bottom half):
	output 0 followed by m 1s;  m = 0;  the interval is expanded by 2
 If l ≥ R/4 and u < 3R/4 (middle half):
	increment m;  the interval is expanded by 2
 In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine

[Figure: the ATB maps the current interval (L,s) and the next symbol c,
with distribution (p_1,…,p_|S|), to the new interval (L’,s’), with
L’ = L + s·f[c] and s’ = s·p[c].]

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox

[Figure: the ATB is driven with p[ s | context ], where s is either a real
character c or the escape symbol esc; it maps (L,s) to (L’,s’) as before.]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts                     String = ACCBACCACBA B,   k = 2

	Order 0 (empty context):   A = 4   B = 2   C = 5   $ = 3

	Order 1:   A:   C = 3, $ = 1
	           B:   A = 2, $ = 1
	           C:   A = 1, B = 2, C = 2, $ = 3

	Order 2:   AC:  B = 1, C = 2, $ = 2
	           BA:  C = 1, $ = 1
	           CA:  C = 1, $ = 1
	           CB:  A = 2, $ = 1
	           CC:  A = 1, B = 1, $ = 2
You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!

LZ77
	a a c a a c a b c a b a b a c
	Dictionary = all the substrings starting before the Cursor;  example output: <2,3,c>

Algorithm’s step:
 Output <d, len, c>, where
	d   = distance of the copied string wrt the current position
	len = length of the longest match
	c   = next char in the text beyond the longest match
 Advance by len + 1

A buffer “window” of fixed length slides over the text.

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
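A compact C++ sketch of LZW encoding as described above (ASCII-initialized dictionary; the slide’s numbering of initial codes, e.g. a = 112, is illustrative, and the special SSc decoder case is not shown here):

#include <string>
#include <vector>
#include <map>
using namespace std;

// LZW encoder: emit dictionary ids only; the dictionary starts with all
// single chars (ids 0..255) and grows with S+c after every emitted match S.
vector<int> lzw_encode(const string& text) {
    map<string,int> dict;
    for (int c = 0; c < 256; ++c) dict[string(1, (char)c)] = c;
    int next_id = 256;
    vector<int> out;
    string S;
    for (char c : text) {
        if (dict.count(S + c)) S += c;            // extend the current match
        else {
            out.push_back(dict[S]);               // emit the id of the longest match S
            dict[S + c] = next_id++;              // add S+c to the dictionary
            S = string(1, c);
        }
    }
    if (!S.empty()) out.push_back(dict[S]);
    return out;
}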

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us be given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. The LF-array maps L’s chars to F’s chars
2. L[ i ] precedes F[ i ] in T

Reconstruct T backward:

	T = .... i ppi #

InvertBWT(L)
	Compute LF[0,n-1];
	r = 0; i = n;
	while (i > 0) {
	  T[i] = L[r];
	  r = LF[r]; i--;
	}
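A small C++ sketch of the forward and inverse transform (the forward part uses the naive rotation sort, i.e. the “elegant but inefficient” construction discussed on the following slides; ‘#’ is assumed to be a unique terminator smaller than every other character):

#include <algorithm>
#include <numeric>
#include <string>
#include <vector>
using namespace std;

// Forward BWT: sort all rotations, take the last column.
string bwt(const string& T) {                       // T must end with the unique '#'
    int n = T.size();
    vector<int> rot(n);
    iota(rot.begin(), rot.end(), 0);
    sort(rot.begin(), rot.end(), [&](int a, int b) {
        return T.substr(a) + T.substr(0, a) < T.substr(b) + T.substr(0, b);
    });
    string L;
    for (int r : rot) L += T[(r + n - 1) % n];      // char preceding each rotation start
    return L;
}

// Inverse BWT: nxt[j] = row whose rotation starts one text position after row j's.
string ibwt(const string& L) {
    int n = L.size();
    string F = L; sort(F.begin(), F.end());
    vector<int> nxt(n);
    iota(nxt.begin(), nxt.end(), 0);
    stable_sort(nxt.begin(), nxt.end(), [&](int a, int b) { return L[a] < L[b]; });
    string T;
    int r = 0;                                      // row 0: the rotation starting with '#'
    for (int i = 0; i < n; ++i) { r = nxt[r]; T += F[r]; }
    return T;                                       // ibwt(bwt("mississippi#")) == "mississippi#"
}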

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n^2 log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Prob. that a node has x links is ∝ 1/x^α,  α ≈ 2.1

The In-degree distribution

[Plots: Altavista crawl 1999 and WebBase crawl 2001 — the indegree follows a power-law distribution]

	Pr[ in-degree(u) = k ]  ∝  1 / k^α,    α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Prob. that a node has x links is ∝ 1/x^α,  α ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77-scheme provides an efficient, optimal solution:

	f_known is the “previously encoded text”; compress the concatenation f_known·f_new, starting from f_new

zdelta is one of the best implementations
	           Emacs size   Emacs time
	uncompr    27Mb         ---
	gzip       8Mb          35 secs
	zdelta     1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: weighted graph over the files plus the dummy node 0; edge weights are
zdelta (or gzip) sizes, and the min branching picks the cheapest reference for each file.]

	           space   time
	uncompr    30Mb    ---
	tgz        20%     linear
	THIS       8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta executions. Nonetheless, strictly n^2 time.

	           space    time
	uncompr    260Mb    ---
	tgz        12%      2 mins
	THIS       8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

	          gcc size   emacs size
	total     27288      27326
	gzip      7563       8577
	zdelta    227        1431
	rsync     964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends the hashes (unlike the client in rsync); the client checks them.
The server deploys the common f_ref to compress the new f_tar (rsync compresses just f_tar).

A multi-round protocol

	k blocks of n/k elems,  log(n/k) levels

If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree

	T# = mississippi#
	     1 2 3 4 5 6 7 8 9 10 11 12

[Figure: the suffix tree of T#; edges are labelled with substrings
(e.g. “mississippi#”, “i”, “s”, “si”, “ssi”, “ppi#”, “pi#”, “#”) and each
of the 12 leaves stores the starting position of its suffix.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

	T = mississippi#
	SA          SUF(T)                 (storing SUF(T) explicitly takes Θ(N^2) space)
	12          #
	11          i#
	 8          ippi#
	 5          issippi#
	 2          ississippi#
	 1          mississippi#
	10          pi#
	 9          ppi#
	 7          sippi#
	 4          sissippi#
	 6          ssippi#
	 3          ssissippi#

Suffix Array (the array of suffix pointers, e.g. for searching P = si):
	• SA: Θ(N log_2 N) bits
	• Text T: N chars
	⇒ in practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison.

	P = si,  T = mississippi#
	Compare P against the suffix SA[mid]: if P is larger, recurse on the right half;
	if P is smaller, recurse on the left half (2 accesses per step: SA[mid], then T).

Suffix Array search
	• O(log_2 N) binary-search steps
	• Each step takes O(p) char comparisons
	⇒ overall, O(p log_2 N) time

	O(p + log_2 N)       [Manber-Myers, ’90]
	O(p + log_2 |Σ|)     [Cole et al, ’06]
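A minimal C++ sketch of the indirect binary search above, returning the contiguous range of SA whose suffixes start with P (the suffix array itself is assumed to be already built, e.g. by sorting suffix pointers):

#include <algorithm>
#include <string>
#include <vector>
using namespace std;

// Return [lo, hi): the range of suffix-array positions whose suffixes have prefix P.
pair<int,int> sa_search(const string& T, const vector<int>& SA, const string& P) {
    auto cmp = [&](int suf, const string& pat) {           // compare suffix T[suf..] vs pat
        return T.compare(suf, pat.size(), pat) < 0;        // only the first |P| chars matter
    };
    auto lo = lower_bound(SA.begin(), SA.end(), P, cmp);
    auto hi = lo;
    while (hi != SA.end() && T.compare(*hi, P.size(), P) == 0) ++hi;  // extend to the right
    return { (int)(lo - SA.begin()), (int)(hi - SA.begin()) };
}
// Each comparison costs O(p); occurrences are then read off SA[lo..hi).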

Locating the occurrences

	T = mississippi#,  P = si  →  occ = 2
	The two bounding binary searches (for si# and si$, with # < Σ < $) delimit
	the SA range containing sippi… and sissippi…, i.e. text positions 7 and 4.

Suffix Array search
	• O(p + log_2 N + occ) time

	Suffix Trays: O(p + log_2 |Σ| + occ)       [Cole et al., ’06]
	String B-tree                              [Ferragina-Grossi, ’95]
	Self-adjusting Suffix Arrays               [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

	T = mississippi#
	SA  = 12 11  8  5  2  1 10  9  7  4  6  3
	Lcp =  0  0  1  4  0  0  1  0  2  1  3
	(e.g. the Lcp between issippi… and ississippi… is 4)

• How long is the common prefix between T[i,...] and T[j,...] ?
	• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
	• Search for some Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times ?
	• Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.


Slide 18

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any uniquely decodable code C, we have  H(S) ≤ L_a(C).
Theorem (upper bound, Shannon). For any probability distribution, there exists a prefix code C such that  L_a(C) ≤ H(S) + 1.

Shannon's code, which gives symbol s a codeword of ⌈log_2 (1/p(s))⌉ bits, achieves this upper bound.

Huffman Codes
Invented by Huffman as a class assignment in '50.
Used in most compression algorithms: gzip, bzip, jpeg (as option), fax compression, …

Properties:
 - Generates optimal prefix codes
 - Cheap to encode and decode
 - L_a(Huff) = H if probabilities are powers of 2
 - Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

Huffman repeatedly merges the two least-probable nodes:
  a(.1) + b(.2) → (.3);   (.3) + c(.2) → (.5);   (.5) + d(.5) → (1)
Resulting codewords:  a = 000, b = 001, c = 01, d = 1

There are 2^(n-1) "equivalent" Huffman trees (each internal node may swap its children).
What about ties (and thus, tree depth) ?
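A compact construction sketch in Python using heapq (an index breaks ties, so the actual bits may differ among the 2^(n-1) equivalent trees, but the codeword lengths match the example above):

  import heapq

  def huffman_codes(probs):
      # heap of (probability, tie-breaker, tree); a leaf is a symbol, an internal node a pair
      heap = [(p, i, s) for i, (s, p) in enumerate(probs.items())]
      heapq.heapify(heap)
      cnt = len(heap)
      while len(heap) > 1:
          p1, _, t1 = heapq.heappop(heap)    # the two least probable nodes
          p2, _, t2 = heapq.heappop(heap)
          heapq.heappush(heap, (p1 + p2, cnt, (t1, t2)))
          cnt += 1
      codes = {}
      def walk(tree, prefix=""):
          if isinstance(tree, tuple):
              walk(tree[0], prefix + "0"); walk(tree[1], prefix + "1")
          else:
              codes[tree] = prefix or "0"
      walk(heap[0][2])
      return codes

  print(huffman_codes({'a': .1, 'b': .2, 'c': .2, 'd': .5}))
  # codeword lengths 3,3,2,1 for a,b,c,d — as in the example above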

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
Example (with the tree above):  abc…  →  000 001 01 …  =  00000101…
                                101001…  →  d c b …

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large.
Huffman codes can be made succinct in the representation of the codeword tree, and fast in (de)coding: the Canonical Huffman tree.

We store, for any level L of the tree:
 - firstcode[L] : the value of the first (smallest) codeword of length L, e.g. 00…0
 - Symbol[L,i], for each i in level L

This is ≤ h^2 + |S| log |S| bits, where h is the tree height.

Canonical Huffman
Encoding
(figure: codewords assigned level by level, for levels 1…5)

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...
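A hedged decoding sketch in Python driven by firstcode[]; the firstcode values are the ones on the slide, while the Symbol table below is a made-up placeholder only to make the loop runnable:

  def canonical_decode(bits, firstcode, Symbol):
      # bits: sequence of 0/1; firstcode[l] = value of the first codeword of length l
      # Symbol[l][i] = i-th symbol whose codeword has length l
      out, it = [], iter(bits)
      try:
          while True:
              v, l = next(it), 1
              while v < firstcode[l]:          # codeword not complete yet
                  v = 2 * v + next(it)
                  l += 1
              out.append(Symbol[l][v - firstcode[l]])
      except StopIteration:
          return out

  firstcode = {1: 2, 2: 1, 3: 1, 4: 2, 5: 0}                    # values from the slide
  Symbol = {5: ['s0', 's1', 's2', 's3']}                        # hypothetical table
  print(canonical_decode([0, 0, 0, 1, 0], firstcode, Symbol))   # -> ['s2']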

Problem with Huffman Coding
Consider a symbol with probability .999. Its self information is
  -log_2(.999) ≈ .00144 bits

If we were to send 1000 such symbols we might hope to use 1000 * .00144 ≈ 1.44 bits.

Using Huffman, we take at least 1 bit per symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 - ≤ 1 extra bit per macro-symbol = 1/k extra bits per symbol
 - Larger model to be transmitted

Shannon took infinite sequences, i.e. k → ∞ !!

In practice, we have:
 - The model takes |S|^k · (k · log |S|) + h^2 bits  (where h might be |S|)
 - It is H_0(S^L) ≤ L · H_k(S) + O(k · log |S|), for each k ≤ L

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
(figure: the word-based Huffman tree for T = "bzip or not bzip"; each codeword, e.g. the one of "or", is a sequence of bytes carrying 7 bits of the Huffman code plus 1 tag bit, hence the codewords are byte-aligned and tagged)

CGrep and other ideas...
P = bzip  →  C(P) = 1a 0b   (its byte-aligned, tagged codeword)
Run a GREP-like scan directly over the compressed text C(T), T = "bzip or not bzip", comparing bytes: thanks to the tag bits, every byte-aligned match of C(P) is a true occurrence (yes), and misaligned candidates are rejected (no).

Speed ≈ Compression ratio

You find this at
You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary = { bzip, not, or, space }
P = bzip  →  C(P) = 1a 0b
S = "bzip or not bzip"
Search C(P) directly in the compressed text C(S), byte-aligned (yes/no at each candidate position).

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
(figure: the pattern P slides over the text T, e.g. T = A B C A B D A B, P = A B)

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching
We show methods in which Arithmetic and Bit-operations replace comparisons.
We will survey two examples of such methods:
 - The Random Fingerprint method due to Karp and Rabin
 - The Shift-And method due to Baeza-Yates and Gonnet

Rabin-Karp Fingerprint
We will use a class of functions from strings to integers in order to obtain:
 - An efficient randomized algorithm that makes an error with small probability.
 - A randomized algorithm that never errs, whose running time is efficient with high probability.

We will consider a binary alphabet (i.e., T ∈ {0,1}^n).

Arithmetic replaces Comparisons
Strings are also numbers, H: strings → numbers.
Let s be a string of length m:

  H(s) = ∑_{i=1}^{m} 2^(m-i) · s[i]

Example: P = 0101,  H(P) = 2^3·0 + 2^2·1 + 2^1·0 + 2^0·1 = 5
s = s'  if and only if  H(s) = H(s')

Definition: let T_r denote the m-length substring of T starting at position r (i.e., T_r = T[r, r+m-1]).

Arithmetic replaces Comparisons
Strings are also numbers, H: strings → numbers.
Exact match = scan T and compare H(T_r) and H(P): there is an occurrence of P starting at position r of T if and only if H(P) = H(T_r).

T = 10110101, P = 0101, H(P) = 5
  at position 2:  H(T_2) = H(0110) = 6 ≠ H(P)
  at position 5:  H(T_5) = H(0101) = 5 = H(P)   →  Match!

Arithmetic replaces Comparisons
We can compute H(T_r) from H(T_{r-1}) in constant time:

  H(T_r) = 2 · H(T_{r-1}) - 2^m · T[r-1] + T[r+m-1]

Example (m = 4): T = 10110101,  T_1 = 1011,  T_2 = 0110
  H(T_1) = H(1011) = 11
  H(T_2) = 2·11 - 2^4·1 + 0 = 22 - 16 = 6 = H(0110)

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P = 101111, q = 7
H(P) = 47,  Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally (reducing mod q at every step):
  1
  (1·2) mod 7 + 0 = 2
  (2·2) mod 7 + 1 = 5
  (5·2) mod 7 + 1 = 4
  (4·2) mod 7 + 1 = 2
  (2·2) mod 7 + 1 = 5  =  Hq(P)

We can still compute Hq(T_r) from Hq(T_{r-1}), since
  2^m (mod q) = 2 · (2^(m-1) (mod q)) (mod q)
Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm
 - Choose a positive integer I.
 - Pick a random prime q ≤ I, and compute P's fingerprint Hq(P).
 - For each position r in T, compute Hq(T_r) and test whether it equals Hq(P). If the numbers are equal, either
   - declare a probable match (randomized algorithm), or
   - check and declare a definite match (deterministic algorithm).

Running time: excluding verification, O(n+m).
The randomized algorithm is correct w.h.p.; the deterministic algorithm has expected running time O(n+m).

Proof on the board
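A minimal Karp-Rabin sketch in Python over the binary alphabet, with the toy prime q = 7 used above (a real implementation would pick q at random, as just described; this variant also verifies candidate matches, i.e. the deterministic flavour):

  def karp_rabin(T, P, q=7):
      n, m = len(T), len(P)
      pow_m = pow(2, m, q)                        # 2^m (mod q)
      hP = hT = 0
      for i in range(m):                          # fingerprints of P and of T_1
          hP = (2 * hP + int(P[i])) % q
          hT = (2 * hT + int(T[i])) % q
      matches = []
      for r in range(n - m + 1):
          if hT == hP and T[r:r+m] == P:          # verification step
              matches.append(r + 1)               # 1-based positions, as in the slides
          if r + m < n:                           # roll: drop T[r], add T[r+m]
              hT = (2 * hT - pow_m * int(T[r]) + int(T[r + m])) % q
      return matches

  print(karp_rabin("10110101", "0101"))           # -> [5]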

Problem 1: Solution
Dictionary = { bzip, not, or, space },  P = bzip  →  C(P) = 1a 0b,  S = "bzip or not bzip"
Scan C(S) byte by byte and compare against C(P): because the codewords are byte-aligned and tagged, every aligned byte-match of C(P) is a true occurrence of the word bzip (yes), while the other candidates are rejected (no).

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

(figure: the m x n matrix M for T = california and P = for; its only 1-entries are M(1,5) since T[5] = f = P[1], M(2,6) since T[5,6] = fo, and M(3,7) since T[5,7] = for — the 1 in the last row witnesses the occurrence of P ending at position 7)

How does M solve the exact match problem?

How to construct M
We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one.
Machines can perform bit and arithmetic operations between two words in constant time. Examples:
 - And(A,B) is the bit-wise AND between A and B.
 - BitShift(A) is the value derived by shifting A's bits down by one position and setting the first bit to 1,
   e.g. BitShift([0,1,1,0,1]) = [1,0,1,1,0].

Let w be the word size (e.g., 32 or 64 bits). We'll assume m ≤ w; NOTICE: any column of M fits in a memory word.

How to construct M
We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one.
We define the m-length binary vector U(x) for each character x in the alphabet: U(x)[i] = 1 iff P[i] = x.
Example: P = abaac
  U(a) = [1,0,1,1,0]
  U(b) = [0,1,0,0,0]
  U(c) = [0,0,0,0,1]

How to construct M
 - Initialize column 0 of M to all zeros.
 - For j > 0, the j-th column is obtained as

     M(j) = BitShift( M(j-1) )  &  U( T[j] )

For i > 1, entry M(i,j) = 1 iff
 (1) the first i-1 characters of P match the i-1 characters of T ending at character j-1   ⇔  M(i-1,j-1) = 1
 (2) P[i] = T[j]   ⇔  the i-th bit of U(T[j]) = 1
BitShift moves bit M(i-1,j-1) into the i-th position; ANDing it with the i-th bit of U(T[j]) establishes whether both conditions hold.
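A bit-parallel Shift-And sketch in Python, encoding each column of M as an integer whose bit i-1 is M(i,j) (assuming m ≤ w, as above):

  def shift_and(T, P):
      m = len(P)
      U = {}                                   # U[c] has bit i-1 set iff P[i] = c
      for i, c in enumerate(P):
          U[c] = U.get(c, 0) | (1 << i)
      last = 1 << (m - 1)                      # bit of row m: an occurrence ends here
      M, occ = 0, []
      for j, c in enumerate(T, start=1):
          M = ((M << 1) | 1) & U.get(c, 0)     # M(j) = BitShift(M(j-1)) & U(T[j])
          if M & last:
              occ.append(j)                    # occurrence of P ending at position j
      return occ

  print(shift_and("xabxabaaca", "abaac"))      # -> [9], as in the example below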

An example, j = 1:   T = xabxabaaca,  P = abaac
  M(1) = BitShift(M(0)) & U(T[1]) = [1,0,0,0,0] & U(x) = [1,0,0,0,0] & [0,0,0,0,0] = [0,0,0,0,0]

An example, j = 2:   T = xabxabaaca,  P = abaac
  M(2) = BitShift(M(1)) & U(T[2]) = [1,0,0,0,0] & U(a) = [1,0,0,0,0] & [1,0,1,1,0] = [1,0,0,0,0]

An example, j = 3:   T = xabxabaaca,  P = abaac
  M(3) = BitShift(M(2)) & U(T[3]) = [1,1,0,0,0] & U(b) = [1,1,0,0,0] & [0,1,0,0,0] = [0,1,0,0,0]

An example, j = 9:   T = xabxabaaca,  P = abaac
  M(9) = BitShift(M(8)) & U(T[9]) = [1,1,0,0,1] & U(c) = [1,1,0,0,1] & [0,0,0,0,1] = [0,0,0,0,1]
The 1 in the last position (row m = 5) witnesses the occurrence of P ending at position 9 of T.

Shift-And method: Complexity
 - If m ≤ w, any column and any vector U() fit in a memory word  →  any step requires O(1) time.
 - If m > w, any column and any vector U() can be divided into m/w memory words  →  any step requires O(m/w) time.
 - Overall O(n(1 + m/w) + m) time.
Thus, it is very fast when the pattern length is close to the word size — very often the case in practice. Recall that w = 64 bits in modern architectures.

Some simple extensions
We want to allow the pattern to contain special symbols, like the [a-f] classes of chars.
Example: P = [a-b]baac
  U(a) = [1,0,1,1,0]
  U(b) = [1,1,0,0,0]
  U(c) = [0,0,0,0,1]
What about '?', '[^…]' (not)?

Problem 1: Another solution
Dictionary = { bzip, not, or, space },  P = bzip  →  C(P) = 1a 0b,  S = "bzip or not bzip"
(figure: the same compressed scan of C(S) as in the previous solution, yes/no at each candidate position)

Speed ≈ Compression ratio

Problem 2
Dictionary = { bzip, not, or, space }
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring.
Example: P = o. The terms containing "o" are  not = 1g 0g 0a  and  or = 1g 0a 0b; their codewords are then searched in C(S), S = "bzip or not bzip".

Speed ≈ Compression ratio? No! Why?
A scan of C(S) is needed for each term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
(figure: the patterns P1 and P2 slide together over the text T)

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And
 - S is the concatenation of the patterns in P; R is a bitmap of length m, with R[i] = 1 iff S[i] is the first symbol of a pattern.
 - Use a variant of the Shift-And method searching for S:
   - For any symbol c, U'(c) = U(c) AND R, so that U'(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern.
   - For any step j: compute M(j), then OR it with U'(T[j]). Why? This sets to 1 the first bit of each pattern that starts with T[j].
   - Check if there are occurrences ending in j. How? Look at the bits of M(j) in the positions of the last symbol of each pattern.

Problem 3
Dictionary = { bzip, not, or, space }
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring, allowing at most k mismatches.
Example: P = bot, k = 2;  S = "bzip or not bzip"
(figure: the compressed text C(S) and the dictionary codewords, as in the previous slides)

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep
Our current goal: given k, find all the occurrences of P in T with up to k mismatches.
We define the matrix M^l to be an m by n binary matrix such that:
  M^l(i,j) = 1  iff  there are no more than l mismatches between the first i characters of P and the i characters of T ending at character j.
What is M^0?  How does M^k solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing M^l: case 1
The first i-1 characters of P match a substring of T ending at j-1 with at most l mismatches, and the next pair of characters in P and T are equal:

  BitShift( M^l(j-1) )  &  U( T[j] )

Computing M^l: case 2
The first i-1 characters of P match a substring of T ending at j-1 with at most l-1 mismatches (the characters P[i] and T[j] may then differ):

  BitShift( M^(l-1)(j-1) )

Computing M^l
We compute M^l for all l = 0, …, k: for each j we compute M^0(j), M^1(j), …, M^k(j), after initializing every M^l(0) to the zero vector.
Combining the two cases, there is a match iff

  M^l(j)  =  [ BitShift( M^l(j-1) ) & U(T[j]) ]  |  BitShift( M^(l-1)(j-1) )
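A sketch of this k-mismatch recurrence in Python, with the same bit-parallel column encoding used for the exact Shift-And above:

  def agrep_mismatch(T, P, k):
      m = len(P)
      U = {}
      for i, c in enumerate(P):
          U[c] = U.get(c, 0) | (1 << i)
      last = 1 << (m - 1)
      M = [0] * (k + 1)                          # M[l] = current column of M^l
      occ = []
      for j, c in enumerate(T, start=1):
          prev = M[:]                            # the columns for position j-1
          M[0] = ((prev[0] << 1) | 1) & U.get(c, 0)
          for l in range(1, k + 1):
              # case 1: extend with a match  |  case 2: extend M^(l-1) with a mismatch
              M[l] = (((prev[l] << 1) | 1) & U.get(c, 0)) | ((prev[l - 1] << 1) | 1)
          for l in range(k + 1):
              if M[l] & last:
                  occ.append((j, l))             # occurrence with <= l mismatches ending at j
                  break
      return occ

  print(agrep_mismatch("xabxabaaca", "abaad", 1))   # -> [(9, 1)], see the example below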

Example M1
T = xabxabaaca,  P = abaad

  M1 =        j:  1 2 3 4 5 6 7 8 9 10
       i=1:       1 1 1 1 1 1 1 1 1 1
       i=2:       0 0 1 0 0 1 0 1 1 0
       i=3:       0 0 0 1 0 0 1 0 0 1
       i=4:       0 0 0 0 1 0 0 1 0 0
       i=5:       0 0 0 0 0 0 0 0 1 0

  M0 =        j:  1 2 3 4 5 6 7 8 9 10
       i=1:       0 1 0 0 1 0 1 1 0 1
       i=2:       0 0 1 0 0 1 0 0 0 0
       i=3:       0 0 0 0 0 0 1 0 0 0
       i=4:       0 0 0 0 0 0 0 1 0 0
       i=5:       0 0 0 0 0 0 0 0 0 0

The 1 in M1(5,9) witnesses an occurrence of P with at most 1 mismatch ending at position 9.

How much do we pay?
 - The running time is O(k · n · (1 + m/w)).
 - Again, the method is practically efficient for small m.
 - Only O(k) columns of M are needed at any given time; hence the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary = { bzip, not, or, space };  P = bot, k = 2;  S = "bzip or not bzip"
The only dictionary term within 2 mismatches of P is  not = 1g 0g 0a, whose codeword is then searched in C(S).

Agrep: more sophisticated operations
The Shift-And method can solve other ops:
 - The edit distance between two strings p and s is d(p,s) = minimum number of operations needed to transform p into s via three ops:
   - Insertion: insert a symbol in p
   - Deletion: delete a symbol from p
   - Substitution: change a symbol of p into a different one
   Example: d(ananas, banane) = 3
 - Search by regular expressions, e.g. (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding

  γ(x) = 0^(Length-1) · (x in binary),   where x > 0 and Length = ⌊log2 x⌋ + 1
  e.g., 9 is represented as <000, 1001>.

 - The γ-code for x takes 2⌊log2 x⌋ + 1 bits (i.e. a factor of 2 from optimal)
 - Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers

It is a prefix-free encoding…
Exercise: given the following sequence of γ-coded integers, reconstruct the original sequence:

  0001000001100110000011101100111

Answer: 8, 6, 3, 59, 7
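A tiny γ-decoder sketch in Python, handy to check the exercise above:

  def gamma_decode(bits):
      out, i = [], 0
      while i < len(bits):
          zeros = 0
          while bits[i] == '0':                       # unary part: Length-1 zeros
              zeros += 1
              i += 1
          out.append(int(bits[i:i + zeros + 1], 2))   # binary part: Length bits
          i += zeros + 1
      return out

  print(gamma_decode("0001000001100110000011101100111"))   # -> [8, 6, 3, 59, 7]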

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).
Recall that |γ(i)| ≤ 2·log i + 1.
How good is this approach w.r.t. Huffman?  Compression ratio ≤ 2·H0(s) + 1.
Key fact:  1 ≥ ∑_{j=1..i} pj ≥ i · pi   ⇒   i ≤ 1/pi

How good is it?
The cost of the encoding is (recall i ≤ 1/pi):

  ∑_{i=1..|S|} pi · |γ(i)|  ≤  ∑_{i=1..|S|} pi · [ 2·log(1/pi) + 1 ]  =  2·H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + …

A better encoding
 - Byte-aligned and tagged Huffman:
   - 128-ary Huffman tree
   - The first bit of the first byte is tagged
   - Configurations on 7 bits: just those of Huffman
 - End-tagged dense code (ETDC):
   - The rank r is mapped to the r-th binary sequence on 7·k bits
   - The first bit of the last byte is tagged

Surprising changes:
 - It is a prefix-code
 - Better compression: it uses all 7-bit configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2.
A new concept: Continuers vs Stoppers (previously we used s = c = 128).
The main idea is:
 - s + c = 256 (we are playing with 8 bits)
 - Thus s items are encoded with 1 byte
 - And s·c with 2 bytes, s·c^2 with 3 bytes, ...

An example
 - 5000 distinct words
 - ETDC encodes 128 + 128^2 = 16512 words on up to 2 bytes
 - A (230,26)-dense code encodes 230 + 230·26 = 6210 words on up to 2 bytes, hence more words on 1 byte — and thus it wins if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, with c fixed by s + c = 256.
 - Brute-force approach
 - Binary search: on real distributions, there seems to be one unique minimum
   (Ks = max codeword length;  Fsk = cumulative probability of the symbols whose |cw| ≤ k)

Experiments: (s,c)-DC is quite interesting… search is 6% faster than byte-aligned Huffword.

Streaming compression
Still you need to determine and sort all terms…. Can we do everything in one pass?
 - Move-to-Front (MTF): as a freq-sorting approximator, as a caching strategy, as a compressor
 - Run-Length-Encoding (RLE): FAX compression

Move to Front Coding
Transforms a char sequence into an integer sequence, that can then be var-length coded.
 - Start with the list of symbols L = [a,b,c,d,…]
 - For each input symbol s:
   1) output the position of s in L
   2) move s to the front of L
There is a memory: it exploits temporal locality, and it is dynamic.
Properties:
 - X = 1^n 2^n 3^n … n^n   ⇒   Huff = O(n^2 log n) bits,  MTF = O(n log n) + n^2 bits
Not much worse than Huffman ...but it may be far better.
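A direct list-based MTF sketch in Python (O(|S|) per symbol; the tree/hash organization described two slides below brings this down to O(log |S|)):

  def mtf_encode(text, alphabet):
      L = list(alphabet)                 # MTF list, front = position 0
      out = []
      for s in text:
          pos = L.index(s)               # 1) output the position of s in L
          out.append(pos)
          L.insert(0, L.pop(pos))        # 2) move s to the front of L
      return out

  print(mtf_encode("aaabbbaaa", "abcd"))   # -> [0, 0, 0, 1, 0, 0, 1, 0, 0]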

MTF: how good is it?
Encode the integers via γ-coding:  |γ(i)| ≤ 2·log i + 1.
Put the alphabet S in front of the sequence and consider the cost of encoding; for a symbol x occurring n_x times, each emitted integer is at most the gap between two consecutive positions of x:

  O(|S| log |S|)  +  ∑_{x=1..|S|} ∑_{i=2..n_x} |γ( p_i^x - p_{i-1}^x )|

By Jensen's inequality (the gaps of x sum to at most N):

  ≤  O(|S| log |S|)  +  ∑_{x=1..|S|} n_x · [ 2·log(N/n_x) + 1 ]
  =  O(|S| log |S|)  +  N · [ 2·H0(X) + 1 ]

Hence  La[mtf] ≤ 2·H0(X) + O(1).

MTF: higher compression
To achieve higher compression we consider words (and separators) as the symbols to be encoded.
How to keep the MTF-list efficiently:
 - Search tree: leaves contain the symbols, ordered as in the MTF-list; nodes contain the size of their descending subtree.
 - Hash table: key is a symbol, data is a pointer to the corresponding tree leaf.
Each tree operation takes O(log |S|) time; the total is O(n log |S|), where n = #symbols to be compressed.

Run Length Encoding (RLE)
If spatial locality is very high, then
  abbbaacccca  ⇒  (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings, just the run lengths and one starting bit suffice.
There is a memory. Properties:
 - Exploits spatial locality, and it is a dynamic code
 - X = 1^n 2^n 3^n … n^n   ⇒   Huff(X) = Θ(n^2 log n)  >  Rle(X) = Θ(n log n)
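The corresponding RLE sketch in Python, reproducing the slide's example:

  from itertools import groupby

  def rle_encode(s):
      # one (symbol, run-length) pair per maximal run
      return [(c, len(list(g))) for c, g in groupby(s)]

  print(rle_encode("abbbaacccca"))   # -> [('a',1),('b',3),('a',2),('c',4),('a',1)]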

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive); e.g., with p(a) = .2, p(b) = .5, p(c) = .3:

  a = [0.0, 0.2)    b = [0.2, 0.7)    c = [0.7, 1.0)

where  f(i) = ∑_{j<i} p(j),  hence  f(a) = .0, f(b) = .2, f(c) = .7.
The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7)).

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
  start:    [0.0, 1.0)
  after b:  [0.2, 0.7)      (width .5)
  after a:  [0.2, 0.3)      (width .5 · .2 = .1)
  after c:  [0.27, 0.3)     (width .1 · .3 = .03)
The final sequence interval is [.27, .3).

Arithmetic Coding
To code a sequence of symbols c_1 c_2 … c_n with probabilities p[c], use:

  l_0 = 0,  s_0 = 1
  l_i = l_{i-1} + s_{i-1} · f[c_i]
  s_i = s_{i-1} · p[c_i]

where f[c] is the cumulative probability up to symbol c (not included).
The final interval size is  s_n = ∏_{i=1..n} p[c_i].
The interval for a message sequence will be called the sequence interval.
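A sketch of this sequence-interval computation in Python, reproducing the bac example (floating point is used only for illustration; real coders use the integer version described a few slides below):

  def sequence_interval(msg, p, f):
      l, s = 0.0, 1.0                    # l_0 = 0, s_0 = 1
      for c in msg:
          l = l + s * f[c]               # l_i = l_{i-1} + s_{i-1} * f[c_i]
          s = s * p[c]                   # s_i = s_{i-1} * p[c_i]
      return l, s

  p = {'a': .2, 'b': .5, 'c': .3}
  f = {'a': .0, 'b': .2, 'c': .7}        # cumulative probabilities (symbol excluded)
  l, s = sequence_interval("bac", p, f)
  print(round(l, 4), round(l + s, 4))    # -> 0.27 0.3, i.e. the interval [.27, .3)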

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:
  .49 ∈ [0.2, 0.7)    = interval of b          →  1st symbol b
  .49 ∈ [0.3, 0.55)   = sub-interval of b      →  2nd symbol b
  .49 ∈ [0.475, 0.55) = sub-interval of c      →  3rd symbol c
The message is bbc.

Representing a real number
Binary fractional representation:
  .75 = .11        1/3 = .01010101…        11/16 = .1011

Algorithm (emitting the bits of x ∈ [0,1)):
 1. x = 2·x
 2. if x < 1 output 0
 3. else x = x - 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
  e.g.  [0, .33) → .01      [.33, .66) → .1      [.66, 1) → .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions:

  code    min      max      interval
  .11     .110     .111     [.75, 1.0)
  .101    .1010    .1011    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval is contained in the sequence interval (a dyadic interval).
  Sequence interval: [.61, .79)        Code interval (.101): [.625, .75)
Can use L + s/2, truncated to 1 + ⌈log (1/s)⌉ bits.

Bound on Arithmetic length
Note that -log s + 1 = log(2/s).

Bound on Length
Theorem: for a text of length n, the Arithmetic encoder generates at most
  1 + ⌈log (1/s)⌉  =  1 + ⌈log ∏_i (1/p_i)⌉
  ≤  2 + ∑_{i=1..n} log (1/p_i)
  =  2 + ∑_{k=1..|S|} n·p_k · log (1/p_k)
  =  2 + n·H0   bits

In practice it takes nH0 + 0.02·n bits, because of rounding.

Integer Arithmetic Coding
The problem is that operations on arbitrary-precision real numbers are expensive.
Key ideas of the integer version:
 - Keep integers in the range [0, R), where R = 2^k
 - Use rounding to generate the integer interval
 - Whenever the sequence interval falls into the top, bottom or middle half, expand the interval by a factor of 2
Integer Arithmetic is an approximation.

Integer Arithmetic (scaling)
 - If l ≥ R/2 (top half): output 1 followed by m 0s; m = 0; the message interval is expanded by 2.
 - If u < R/2 (bottom half): output 0 followed by m 1s; m = 0; the message interval is expanded by 2.
 - If l ≥ R/4 and u < 3R/4 (middle half): increment m; the message interval is expanded by 2.
 - In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine: the ATB keeps the current interval (L, s); given the next symbol c and the distribution (p1, …, p|S|), it outputs the new interval (L', s') with L' = L + s·f(c) and s' = s·p(c).
Therefore, even the distribution can change over time.

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
The PPM model feeds the ATB with p[ s | context ], where s is either a real character c or the escape symbol esc; the ATB then maps the current interval (L,s) to (L',s').
Encoder and Decoder must know the protocol for selecting the same conditional probability distribution (PPM-variant).

PPM: Example Contexts   (k = 2,  String = ACCBACCACBA B)

  Context: Empty      Counts: A = 4, B = 2, C = 5, $ = 3
  Context: A          Counts: C = 3, $ = 1
  Context: B          Counts: A = 2, $ = 1
  Context: C          Counts: A = 1, B = 2, C = 2, $ = 3
  Context: AC         Counts: B = 1, C = 2, $ = 2
  Context: BA         Counts: C = 1, $ = 1
  Context: CA         Counts: C = 1, $ = 1
  Context: CB         Counts: A = 2, $ = 1
  Context: CC         Counts: A = 1, B = 1, $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
  a a c a a c a b c a b a b a c
Dictionary = the already-scanned text (all substrings starting before the cursor); example output: <2,3,c>.

Algorithm's step:
 - Output <d, len, c>, where
     d   = distance of the copied string w.r.t. the current position
     len = length of the longest match
     c   = next char in the text beyond the longest match
 - Advance by len + 1

A buffer "window" has fixed length and moves with the cursor.

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character
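A simple (unoptimized, quadratic-scan) LZ77 encoder sketch in Python with a sliding window, reproducing the example above; gzip's hashing tricks from a later slide would speed up the match search:

  def lz77_encode(T, W=6):
      i, out = 0, []
      while i < len(T):
          best_d, best_len = 0, 0
          for j in range(max(0, i - W), i):             # candidate starts in the window
              l = 0
              while i + l < len(T) - 1 and T[j + l] == T[i + l]:
                  l += 1                                # overlap with the text is allowed
              if l > best_len:
                  best_d, best_len = i - j, l
          nxt = T[i + best_len]                         # next char beyond the match
          out.append((best_d, best_len, nxt))
          i += best_len + 1                             # advance by len + 1
      return out

  print(lz77_encode("aacaacabcabaaac"))
  # -> [(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')]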

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
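A sketch of the LZW coding loop in Python, seeded with the slide's toy codes a = 112, b = 113, c = 114 and new entries from 256; the trailing 113 is the flush of the last pending phrase, which the table on the next slide leaves implicit:

  def lzw_encode(T, base):
      D = dict(base)                         # phrase -> code
      nxt = 256
      w, out = "", []
      for c in T:
          if w + c in D:
              w += c                         # extend the current phrase
          else:
              out.append(D[w])               # emit the code of the longest match S
              D[w + c] = nxt                 # add Sc to the dictionary
              nxt += 1
              w = c
      if w:
          out.append(D[w])                   # flush the last phrase
      return out

  print(lzw_encode("aabaacababacb", {'a': 112, 'b': 113, 'c': 114}))
  # -> [112, 112, 113, 256, 114, 257, 261, 114, 113]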

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform (1994)
Consider the text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows:

  F                  L
  #  mississipp  i
  i  #mississip  p
  i  ppi#missis  s
  i  ssippi#mis  s
  i  ssissippi#  m
  m  ississippi  #
  p  i#mississi  p
  p  pi#mississ  i
  s  ippi#missi  s
  s  issippi#mi  s
  s  sippi#miss  i
  s  sissippi#m  i

L = ipssm#pissii  is the BWT of T.

A famous example

Much
longer...

A useful tool: the L → F mapping
(same F and L columns as above; the mapping is initially unknown)
How do we map L's chars onto F's chars? …we need to distinguish equal chars in F…
Take two equal chars of L and rotate their rows rightward by one position: they keep the same relative order in F !!

The BWT is invertible
(same F and L columns as above)
Two key properties:
 1. The LF-array maps L's chars to F's chars.
 2. L[i] precedes F[i] in T.
Reconstruct T backward:   T = .... i p p i #

InvertBWT(L)
  Compute LF[0,n-1];
  r = 0; i = n;
  while (i > 0) {
    T[i] = L[r];
    r = LF[r]; i--;
  }

How to compute the BWT ?
Build the suffix array SA of T: the sorted rows of the BWT matrix correspond to the sorted suffixes.

  T  = mississippi#
  SA = 12 11 8 5 2 1 10 9 7 4 6 3
  L  =  i  p s s m #  p i s s i i

We said that L[i] precedes F[i] in T; e.g. L[3] = T[7].
Given SA and T, we have L[i] = T[SA[i]-1].
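A tiny Python sketch of the forward BWT (by sorting all rotations — the "elegant but inefficient" way discussed next) and of the backward reconstruction via the LF-mapping (assuming the end-marker # is the smallest character):

  def bwt(T):
      rots = sorted(T[i:] + T[:i] for i in range(len(T)))   # sorted rotations
      return "".join(row[-1] for row in rots)               # last column L

  def ibwt(L):
      n = len(L)
      # LF[r] = row of F holding the char L[r]; equal chars keep their relative order
      order = sorted(range(n), key=lambda r: (L[r], r))
      LF = [0] * n
      for f_row, r in enumerate(order):
          LF[r] = f_row
      out, r = [], 0                    # row 0 is the rotation starting with '#'
      for _ in range(n):
          out.append(L[r])              # L[r] precedes F[r] in T: collect T backward
          r = LF[r]
      s = "".join(reversed(out))        # this is T rotated to start with '#'
      return s[1:] + s[0]               # move the end-marker back to the end

  L = bwt("mississippi#")
  print(L)            # -> ipssm#pissii
  print(ibwt(L))      # -> mississippi#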

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient.   Input: T = mississippi#
Obvious inefficiencies:
 • Θ(n^2 log n) time in the worst case
 • Θ(n log n) cache misses or I/O faults
Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii        (# at position 16)
Mtf-list = [i,m,p,s]
Mtf  = 020030000030030200300300000100000
Mtf' = 030040000040040300400400000200000      (symbols shifted up by one; 0-runs encoded via Bin(6)=110, Wheeler's code)
RLE0 = 03141041403141410210                    (alphabet of size |S|+1)

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols... plus γ(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web's Characteristics
Size:
 - 1 trillion pages reported as available (Google, 7/2008)
 - 5-40KB per page  =>  hundreds of terabytes
 - Size grows every day!!
Change:
 - 8% new pages and 25% new links per week
 - Average page lifetime of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…
 - Physical network graph: V = routers, E = communication links
 - The "cosine" graph (undirected, weighted): V = static web pages, E = semantic distance between pages
 - Query-Log graph (bipartite, weighted): V = queries and URLs, E = (q,u) if u is a result for q and has been clicked by some user who issued q
 - Social graph (undirected, unweighted): V = users, E = (x,y) if x knows y (facebook, address book, email, ..)

Definition
Directed graph G = (V,E): V = URLs, E = (u,v) if u has a hyperlink to v.
Isolated URLs are ignored (no IN & no OUT).
Key property — skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1.

The In-degree distribution
Altavista crawl (1999) and WebBase crawl (2001): the indegree follows a power-law distribution

  Pr[ in-degree(u) = k ]  ∝  1 / k^a,   a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E): V = URLs, E = (u,v) if u has a hyperlink to v.
Isolated URLs are ignored (no IN, no OUT).
Three key properties:
 - Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1
 - Locality: usually most of the hyperlinks point to other URLs on the same host (about 80%).
 - Similarity: pages close in lexicographic order tend to share many outgoing lists.

A Picture of the Web Graph
(figure: adjacency-matrix plot of a crawl with 21 million pages and 150 million links; URL-sorting gathers the links near the diagonal, with host blocks such as Berkeley and Stanford clearly visible)

URL compression + Delta encoding

The library WebGraph
 - Uncompressed adjacency list.
 - Adjacency list with compressed gaps (exploits locality):
     Successor list S(x) = { s1 - x, s2 - s1 - 1, ..., sk - s(k-1) - 1 }
     For negative entries: …
 - Copy-lists (exploit similarity), with reference chains possibly limited:
     Each bit of y's copy-list tells whether the corresponding successor of the reference x is also a successor of y;
     the reference index is chosen in [0,W] so as to give the best compression.
 - Copy-blocks = RLE(Copy-list):
     The first copy-block is 0 if the copy-list starts with 0; the last block is omitted (we know the length…);
     the length is decremented by one for all blocks.

This is a Java and C++ lib  (≈3 bits/edge).

Extra-nodes: Compressing Intervals
Adjacency list with copy blocks; consecutivity in the extra-nodes is exploited:
 - Intervals: use their left extreme and length
 - Interval length: decremented by Lmin = 2
 - Residuals: differences between residuals, or w.r.t. the source
Examples from the figure:
  0 = (15-15)*2   (positive)
  2 = (23-19)-2   (jump >= 2)
  600 = (316-16)*2
  3 = |13-15|*2-1  (negative)
  3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques
 - caching: "avoid sending the same object again"
   - done on the basis of objects
   - only works if objects are completely unchanged
   - How about objects that are slightly changed?
 - compression: "remove redundancy in transmitted data"
   - avoid repeated substrings in data
   - can be extended to the history of past transmissions (overhead)
   - What if the sender has never seen the data at the receiver?

Types of Techniques
 - Common knowledge between sender & receiver — unstructured file: delta compression
 - "Partial" knowledge:
   - Unstructured files: file synchronization
   - Record-based data: set reconciliation

Formalization
 - Delta compression   [diff, zdelta, REBL,…]
   - Compress file f deploying file f'
   - Compress a group of files
   - Speed up web access by sending differences between the requested page and the ones available in cache
 - File synchronization   [rsync, zsync]
   - Client updates old file f_old with f_new available on a server
   - Mirroring, Shared Crawling, Content Distribution Networks
 - Set reconciliation
   - Client updates structured old file f_old with f_new available on a server
   - Update of contacts or appointments, intersect inverted lists in a P2P search engine

Z-delta compression   (one-to-one)
Problem: we have two files f_known and f_new, and the goal is to compute a file f_d of minimum size such that f_new can be derived from f_known and f_d.
 - Assume that block moves and copies are allowed
 - Find an optimal covering set of f_new based on f_known
 - The LZ77-scheme provides an efficient, optimal solution: f_known is the "previously encoded text", and we compress f_known·f_new starting from f_new
 - zdelta is one of the best implementations

             Emacs size   Emacs time
  uncompr    27Mb         ---
  gzip       8Mb          35 secs
  zdelta     1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: a pair of proxies located on the two sides of the slow link use a proprietary protocol to increase performance over this link.
(figure: Client ↔ client-side proxy — slow link, delta-encoding — server-side proxy ↔ web, exchanging request / reference / page)
Use zdelta to reduce traffic:
 - Old version available at both proxies
 - Restricted to pages already visited (30% hits), URL-prefix match
 - Small cache

Cluster-based delta compression
Problem: we wish to compress a group of files F — useful on a dynamic collection of web pages, back-ups, …
 - Apply pairwise zdelta: find for each f ∈ F a good reference
 - Reduction to the Min Branching problem on DAGs:
   - Build a weighted graph G_F: nodes = files, weights = zdelta-sizes
   - Insert a dummy node connected to all, whose edge weights are the gzip-sizes
   - Compute the min branching = directed spanning tree of minimum total cost covering G's nodes
(figure: a small example graph with edge weights such as 620, 2000, 220, 123, 20, …)

             space   time
  uncompr    30Mb    ---
  tgz        20%     linear
  THIS       8%      quadratic

Improvement: what about many-to-one compression (a group of files)?

Problem: constructing G is very costly — n^2 edge calculations (zdelta executions). We wish to exploit some pruning approach:
 - Collection analysis: cluster the files that appear similar and are thus good candidates for zdelta-compression; build a sparse weighted graph G'_F containing only the edges between those pairs of files.
 - Assign weights: estimate appropriate edge weights for G'_F, thus saving zdelta executions. Nonetheless, strictly n^2 time.

             space   time
  uncompr    260Mb   ---
  tgz        12%     2 mins
  THIS       8%      16 mins

Algoritmi per IR

File Synchronization

File synch: the problem
(figure: the Client holds f_old, the Server holds f_new; the client sends a request, the server sends back an update)
 - The client wants to update an out-dated file
 - The server has the new file but does not know the old file
 - Update without sending the entire f_new (using similarity)
 - rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch, since the server has both copies of the files.

The rsync algorithm
(figure: the client sends the block hashes of f_old; the server matches them against f_new and returns the encoded file: copy-block references plus literals)

The rsync algorithm (contd)
 - simple, widely used, single roundtrip
 - optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
 - choice of block size problematic (default: max{700, √n} bytes)
 - not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

            gcc size   emacs size
  total      27288      27326
  gzip        7563       8577
  zdelta       227       1431
  rsync        964       4452

Compressed size in KB (slightly outdated numbers).
Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync
 - The server sends the hashes (unlike the client in rsync); the client checks them.
 - The server deploys the common f_ref to compress the new f_tar (rsync compresses just it).

A multi-round protocol
 - Split into k blocks of n/k elements; log(n/k) levels.
 - If the distance is k, then on each level at most k hashes do not find a match in the other file.
 - The communication complexity is O(k · log n · log(n/k)) bits.

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 - set reconciliation is "easier" than file synch [it is record-based] — not perfectly true, but...
 - (figure: recurring minimum for improving the estimate + 2 SBF)

A multi-round protocol
 - k blocks of n/k elements; log(n/k) levels.
 - If the distance is k, then on each level at most k hashes do not find a match in the other file.
 - The communication complexity is O(k · log n · log(n/k)) bits.

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T   iff   P is a prefix of the i-th suffix of T (i.e. T[i,N]).
(figure: P aligned at position i over T, covering a prefix of T[i,N])
Occurrences of P in T = all suffixes of T having P as a prefix.
Example: P = si, T = mississippi  →  occurrences at 4, 7.
SUF(T) = sorted set of suffixes of T.
Reduction: from substring search to prefix search (over the suffixes of T).
The Suffix Tree
T# = mississippi#   (positions 1..12)
(figure: the suffix tree of T#; the edges are labeled with substrings of T# such as #, i, p, si, ssi, ppi#, pi#, mississippi#, …, and each of the 12 leaves stores the starting position of its suffix: 12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3)

The Suffix Array
Prop 1. All suffixes in SUF(T) having prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

T = mississippi#

  SA    SUF(T)
  12    #
  11    i#
   8    ippi#
   5    issippi#
   2    ississippi#
   1    mississippi#
  10    pi#
   9    ppi#
   7    sippi#
   4    sissippi#
   6    ssippi#
   3    ssissippi#

Storing SUF(T) explicitly takes Θ(N^2) space; the suffix array keeps only the suffix pointers:
 • SA: Θ(N log2 N) bits
 • Text T: N chars
 → in practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step.
Example: P = si on T = mississippi#. Compare P with the suffix pointed to by the middle entry of SA; if P is larger, recurse on the right half, if P is smaller, on the left half.

Suffix Array search:
 • O(log2 N) binary-search steps
 • Each step takes O(p) char comparisons
 → overall, O(p log2 N) time
   [improvable to O(p + log2 N), Manber-Myers '90; and to O(p + log2 |S|), Cole et al. '06]
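An indirect binary-search sketch in Python over the suffix array (the construction below is the naive, "elegant but inefficient" one; it searches the left boundary of the suffixes prefixed by P and then collects the contiguous occurrences):

  def sa_build(T):
      return sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])   # naive construction

  def sa_search(T, SA, P):
      lo, hi = 0, len(SA)
      while lo < hi:                          # find the leftmost suffix >= P
          mid = (lo + hi) // 2
          if T[SA[mid] - 1:] < P:
              lo = mid + 1
          else:
              hi = mid
      occ = []
      while lo < len(SA) and T[SA[lo] - 1:].startswith(P):
          occ.append(SA[lo])
          lo += 1
      return occ

  T = "mississippi#"
  SA = sa_build(T)
  print(SA)                        # -> [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
  print(sa_search(T, SA, "si"))    # -> [7, 4], i.e. occ = 2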

Locating the occurrences
Example: P = si, T = mississippi#. Two binary searches (conceptually for si# and si$, with # < Σ < $) delimit the contiguous block of SA containing the suffixes prefixed by si: {7, 4} — i.e. occ = 2 occurrences (sippi…, sissippi…).

Suffix Array search:  O(p + log2 N + occ) time
Suffix Trays:  O(p + log2 |S| + occ)   [Cole et al., '06]
String B-tree   [Ferragina-Grossi, '95]
Self-adjusting Suffix Arrays   [Ciriani et al., '02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA.

  T   = mississippi#
  SA  = 12 11 8 5 2 1 10 9 7 4 6 3
  Lcp =   0  1 1 4 0 0  1 0 2 1 3

Example: the entry 4 is the lcp of issippi# (SA = 5) and ississippi# (SA = 2).
 • How long is the common prefix between T[i,...] and T[j,...]?
   → Min of the subarray Lcp[h,k-1] s.t. SA[h] = i and SA[k] = j.
 • Does there exist a repeated substring of length ≥ L?
   → Search for Lcp[i] ≥ L.
 • Does there exist a substring of length ≥ L occurring ≥ C times?
   → Search for a window Lcp[i, i+C-2] whose entries are all ≥ L.


Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
Using the code above (a=000, b=001, c=01, d=1):
  abc...     -> 000 001 01 ... = 00000101...
  101001...  -> d c b ...

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:
  firstcode[L]   (the first codeword of level L; the leftmost path is 00.....0)
  Symbol[L,i], for each i in level L

This is ≤ h^2 + |S| log |S| bits

Canonical Huffman
Encoding

[Figure: the canonical Huffman tree, levels 1..5]

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...
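A sketch (an assumption: the usual canonical-Huffman decoding loop, not spelled out on the slide) of how firstcode[] and Symbol[][] drive decoding of the bit stream above; the Symbol table below is a hypothetical placeholder:

def canonical_decode_one(bits, firstcode, symbol):
    """Decode one symbol: descend levels until the value read so far
    is >= firstcode[level], then index Symbol[level]."""
    v, level = 0, 0
    for b in bits:
        v = 2 * v + b
        level += 1
        if v >= firstcode[level]:
            return symbol[level][v - firstcode[level]]
    raise ValueError("truncated codeword")

firstcode = {1: 2, 2: 1, 3: 1, 4: 2, 5: 0}    # values from the slide
symbol = {5: ["s0", "s1", "s2", "s3"]}        # placeholder table for level 5
print(canonical_decode_one([0, 0, 0, 1, 0], firstcode, symbol))   # codeword 00010 -> level 5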

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
  1 extra bit per macro-symbol = 1/k extra bits per symbol
  Larger model to be transmitted

Shannon took infinite sequences, and k -> ∞ !!

In practice, we have:

  Model takes |S|^k * (k * log |S|) + h^2 bits   (where h might be |S|)

  It is H_0(S^L) ≤ L * H_k(S) + O(k * log |S|), for each k ≤ L

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
[Figure: the "huffman" tree vs the "tagging" tree for T = "bzip or not bzip"
 (dictionary: bzip, space, or, not). Each codeword byte = 1 tag bit + 7 code bits
 (the 7-bit chunks are drawn as a, b, g); e.g. the byte-aligned codeword of "bzip"
 is 1a 0b (first byte tagged 1, continuation byte tagged 0), and C(T) is the
 concatenation of the words' codewords.]

CGrep and other ideas...

P = bzip = 1a 0b

[Figure: GREP is run directly over the compressed text C(T), T = "bzip or not bzip":
 the tagged, byte-aligned codeword of P is compared against C(T) byte by byte,
 each candidate alignment being marked yes/no -- occurrences are found without
 decompressing T.]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary: bzip, not, or, space

P = bzip = 1a 0b

[Figure: the dictionary's tagged word-based Huffman tree and the compressed text
 C(S), S = "bzip or not bzip"; P's codeword is searched directly in C(S), each
 byte-aligned alignment being answered yes/no.]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T = ... A B C A B D A B ...        P = A B   (aligned at its occurrences)

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

  H(s) = Sum_{i=1}^{m} 2^(m-i) * s[i]

P = 0101
H(P) = 2^3*0 + 2^2*1 + 2^1*0 + 2^0*1 = 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
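A compact sketch of the Karp-Rabin scan described above (binary alphabet, verification included so it never errs); q = 101 is an arbitrary small prime chosen only for the example:

def karp_rabin(T, P, q=101):
    """Report all occurrences of P in T, verifying every fingerprint hit."""
    n, m = len(T), len(P)
    if m > n:
        return []
    pow_m = pow(2, m - 1, q)                      # 2^(m-1) mod q, used by the rolling update
    hp = ht = 0
    for i in range(m):                            # fingerprints of P and of T_1
        hp = (2 * hp + P[i]) % q
        ht = (2 * ht + T[i]) % q
    occ = []
    for r in range(n - m + 1):
        if hp == ht and T[r:r + m] == P:          # explicit check rules out false matches
            occ.append(r + 1)                     # 1-based positions, as in the slides
        if r + m < n:                             # slide the window: drop T[r], add T[r+m]
            ht = (2 * (ht - pow_m * T[r]) + T[r + m]) % q
    return occ

print(karp_rabin([1,0,1,1,0,1,0,1], [0,1,0,1]))   # T=10110101, P=0101 -> [5]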

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M (m x n):
        c  a  l  i  f  o  r  n  i  a
        1  2  3  4  5  6  7  8  9  10
   f    0  0  0  0  1  0  0  0  0  0
   o    0  0  0  0  0  1  0  0  0  0
   r    0  0  0  0  0  0  1  0  0  0

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the (j-1)-th one.
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1 for the
positions in P where character x appears.
Example:
P = abaac
  U(a) = (1,0,1,1,0)^T    U(b) = (0,1,0,0,0)^T    U(c) = (0,0,0,0,1)^T

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with i-th bit of U(T[j]) to estabilish if both are true
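A bit-parallel sketch of the Shift-And scan just described (least-significant bit = position 1 of P), using Python integers as the machine word:

def shift_and(T, P):
    """Return the 1-based end positions of all exact occurrences of P in T."""
    U = {}                                   # U[c] has bit i-1 set iff P[i] = c
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    match_bit = 1 << (len(P) - 1)            # the bit of the last row of M
    M = 0
    occ = []
    for j, c in enumerate(T, start=1):
        M = ((M << 1) | 1) & U.get(c, 0)     # M(j) = BitShift(M(j-1)) & U(T[j])
        if M & match_bit:
            occ.append(j)                    # an occurrence of P ends at position j
    return occ

print(shift_and("xabxabaaca", "abaac"))      # -> [9]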

An example (columns j = 1, 2, 3, ..., 9)

T = xabxabaaca      P = abaac
U(x) = 0,  U(a) = (1,0,1,1,0)^T,  U(b) = (0,1,0,0,0)^T,  U(c) = (0,0,0,0,1)^T

  j=1: M(1) = BitShift(M(0)) & U(x) = (0,0,0,0,0)^T
  j=2: M(2) = BitShift(M(1)) & U(a) = (1,0,0,0,0)^T
  j=3: M(3) = BitShift(M(2)) & U(b) = (0,1,0,0,0)^T
  ...
  j=9: M(9) = BitShift(M(8)) & U(c) has its 5-th bit set to 1
       => P occurs in T ending at position 9

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
  U(a) = (1,0,1,1,0)^T    U(b) = (1,1,0,0,0)^T    U(c) = (0,0,0,0,1)^T

What about '?', '[^…]' (not) ?

Problem 1: Another solution
Dictionary: bzip, not, or, space

P = bzip = 1a 0b

[Figure: the same tagged word-based Huffman coding of S = "bzip or not bzip";
 the occurrences of P's codeword are now located in C(S) with the Shift-And scan,
 each byte-aligned alignment answered yes/no.]

Speed ≈ Compression ratio

Problem 2
Dictionary: bzip, not, or, space

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

P = o

[Figure: C(S) for S = "bzip or not bzip"; the dictionary terms containing P are
 not = 1g 0g 0a  and  or = 1g 0a 0b, and their codewords are searched in C(S).]

Speed ≈ Compression ratio ?  No! Why?
A scan of C(S) is needed for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: text T with the occurrences of P1 and P2 marked]

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

S is the concatenation of the patterns in P
R is a bitmap of length m:
  R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of the Shift-And method searching for S:
  For any symbol c, U'(c) = U(c) AND R
    U'(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
  For any step j:
    compute M(j)
    then M(j) OR U'(T[j]). Why?
      Set to 1 the first bit of each pattern that starts with T[j]
    Check if there are occurrences ending in j. How?

Problem 3
Dictionary: bzip, not, or, space

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

P = bot   k = 2

[Figure: C(S) for S = "bzip or not bzip"; the codewords of the dictionary terms
 are matched against P allowing up to k mismatches.]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal: given k, find all the occurrences of P
in T with up to k mismatches.
We define the matrix M^l to be an m by n binary
matrix, such that:

M^l(i,j) = 1 iff
there are no more than l mismatches between the
first i characters of P and the i characters of T
ending at character j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing M^l: case 1

The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

[Figure: alignment of P[1..i-1] against T[..j-1] with at most l mismatches (*)]

  BitShift( M^l(j-1) ) & U( T[j] )

Computing M^l: case 2

The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

[Figure: alignment of P[1..i-1] against T[..j-1] with at most l-1 mismatches (*);
 a mismatch may occur at the pair (P[i], T[j])]

  BitShift( M^(l-1)(j-1) )

Computing M^l

We compute M^l for all l = 0, ..., k.
For each j compute M(j), M^1(j), ..., M^k(j).
For all l initialize M^l(0) to the zero vector.
In order to compute M^l(j), we observe that there is a
match iff case 1 or case 2 above holds:

  M^l(j) = [ BitShift( M^l(j-1) ) & U( T[j] ) ]  OR  BitShift( M^(l-1)(j-1) )
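A sketch extending the Shift-And code above to k mismatches, following the recurrence just stated (M^0 is the exact-match column):

def shift_and_k_mismatches(T, P, k):
    """Return end positions of occurrences of P in T with at most k mismatches."""
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    match_bit = 1 << (len(P) - 1)
    M = [0] * (k + 1)                                  # M[l] = column M^l(j-1)
    occ = []
    for j, c in enumerate(T, start=1):
        prev = M[:]                                    # keep the (j-1)-th columns
        M[0] = ((prev[0] << 1) | 1) & U.get(c, 0)      # exact matches
        for l in range(1, k + 1):                      # case 1 OR case 2
            M[l] = (((prev[l] << 1) | 1) & U.get(c, 0)) | ((prev[l - 1] << 1) | 1)
        if M[k] & match_bit:
            occ.append(j)
    return occ

print(shift_and_k_mismatches("aatatccacaa", "atcgaa", 2))   # -> [9], the slides' example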

Example

T = xabxabaaca      P = abaad

[Figures: the 5 x 10 matrices M^0 and M^1, columns j = 1..10.
 In M^0 no bit of the last row is ever set (P never occurs exactly);
 in M^1 the entry M^1(5,9) = 1, since P = abaad matches T[5..9] = abaac
 with a single mismatch.]

How much do we pay?

The running time is O( k n (1 + m/w) ).
Again, the method is practically efficient for small m.
Only O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary: bzip, not, or, space

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

P = bot   k = 2

[Figure: C(S) for S = "bzip or not bzip"; matching P = bot against the dictionary
 with up to 2 mismatches selects not = 1g 0g 0a, whose occurrences are then
 located in C(S).]

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = the minimum number of operations
needed to transform p into s via three ops:

  Insertion: insert a symbol in p
  Deletion: delete a symbol from p
  Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code (Elias gamma) for integer encoding:
  write (Length - 1) zeroes, then x in binary:   000...0 x

  x > 0 and Length = floor(log_2 x) + 1
  e.g., 9 is represented as <000, 1001>.

  g-code for x takes 2*floor(log_2 x) + 1 bits
  (i.e. a factor of 2 from optimal)

  Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

  ->  8, 6, 3, 59, 7
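A small sketch of g-encoding and g-decoding, which can be used to check the exercise above:

def gamma_encode(x):
    """Elias gamma code of x > 0: (len-1) zeroes followed by x in binary."""
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    """Decode a concatenation of gamma codes back into the integer sequence."""
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":                  # count the leading zeroes
            z += 1; i += 1
        out.append(int(bits[i:i + z + 1], 2))  # read z+1 bits of binary value
        i += z + 1
    return out

print(gamma_decode("0001000001100110000011101100111"))   # -> [8, 6, 3, 59, 7]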

Analysis
Sort the p_i in decreasing order, and encode s_i via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
  1 ≥ Sum_{i=1,...,x} p_i ≥ x * p_x   =>   x ≤ 1/p_x

How good is it ?
Encode the integers via g-coding:
  |g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/p_i):

  Sum_{i=1,...,|S|} p_i * |g(i)|
    ≤ Sum_{i=1,...,|S|} p_i * [ 2 * log(1/p_i) + 1 ]
    = 2 * H0(X) + 1

Not much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s*c^2 with 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 128^2 = 16512 words on at most 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on at most 2
bytes, hence more words on 1 byte, and thus it wins if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1n 2n 3n… nn  Huff = O(n2 log n), MTF = O(n log n) + n2

No much worse than Huffman
...but it may be far better
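A direct sketch of MTF encoding as described above (the initial list is simply the sorted alphabet of the input):

def mtf_encode(s):
    """For each symbol output its current position in L (0-based), then move it to front."""
    L = sorted(set(s))
    out = []
    for c in s:
        i = L.index(c)
        out.append(i)
        L.pop(i); L.insert(0, c)     # move-to-front: recently seen symbols get small codes
    return out

print(mtf_encode("aaabbbccc"))       # -> [0, 0, 0, 1, 0, 0, 2, 0, 0]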

MTF: how good is it ?
Encode the integers via g-coding:
  |g(i)| ≤ 2 * log i + 1
Consider the cost of encoding, with S put in front
(p^x_1 < p^x_2 < ... are the n_x positions where symbol x occurs):

  O(|S| log |S|) + Sum_{x=1}^{|S|} Sum_{i=2}^{n_x} |g( p^x_i - p^x_{i-1} )|

By Jensen's inequality:

  ≤ O(|S| log |S|) + Sum_{x=1}^{|S|} n_x * [ 2 * log(N/n_x) + 1 ]
  = O(|S| log |S|) + N * [ 2 * H0(X) + 1 ]

  =>  La[mtf] ≤ 2 * H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded.
How to keep efficiently the MTF-list:
  Search tree
    Leaves contain the symbols, ordered as in the MTF-list
    Nodes contain the size of their descending subtree
  Hash Table
    key is a symbol
    data is a pointer to the corresponding tree leaf

Each tree operation takes O(log |S|) time
Total is O(n log |S|), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
There is a memory

X = 1^n 2^n 3^n ... n^n  =>  Huff(X) = Θ(n^2 log n)  >  Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.  p(a) = .2, p(b) = .5, p(c) = .3:

  a -> [0.0, 0.2)    b -> [0.2, 0.7)    c -> [0.7, 1.0)

  f(i) = Sum_{j=1}^{i-1} p(j)        f(a) = .0, f(b) = .2, f(c) = .7

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac

  start:          [0.0, 1.0)
  after b (.5):   [0.2, 0.7)
  after a (.2):   [0.2, 0.3)
  after c (.3):   [0.27, 0.3)

The final sequence interval is [.27, .3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l 0  0 l i  l i -1  s i -1 * f c i 

s0  1

s i  s i -1 * p c i 

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

n

sn 

 p c 
i

i 1

The interval for a message sequence will be called the
sequence interval
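A sketch of the interval computation above with exact fractions (the real-number version; the integer/scaling version comes later in these slides):

from fractions import Fraction as F

def sequence_interval(msg, p, f):
    """Return (l, s): the sequence interval [l, l+s) of msg, given p[c] and cumulative f[c]."""
    l, s = F(0), F(1)
    for c in msg:
        l = l + s * f[c]      # l_i = l_{i-1} + s_{i-1} * f[c_i]
        s = s * p[c]          # s_i = s_{i-1} * p[c_i]
    return l, s

p = {"a": F(2, 10), "b": F(5, 10), "c": F(3, 10)}
f = {"a": F(0), "b": F(2, 10), "c": F(7, 10)}
print(sequence_interval("bac", p, f))   # -> (Fraction(27, 100), Fraction(3, 100)), i.e. [.27, .3)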

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:

  .49 ∈ [0.2, 0.7)    -> b    (new interval [0.2, 0.7))
  .49 ∈ [0.3, 0.55)   -> b    (new interval [0.3, 0.55))
  .49 ∈ [0.475, 0.55) -> c

The message is bbc.

Representing a real number
Binary fractional representation:
  .75   = .11
  1/3   = .0101...
  11/16 = .1011

Algorithm
  1. x = 2 * x
  2. If x < 1 output 0
  3. else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval ?
e.g. [0, .33) = .01      [.33, .66) = .1      [.66, 1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
           min       max       interval
  .11      .110...   .111...   [.75, 1.0)
  .101     .1010...  .1011...  [.625, .75)
We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

  Sequence interval: [.61, .79)
  Code interval of .101: [.625, .75)  ⊆  [.61, .79)

Can use L + s/2 truncated to 1 + ⌈log (1/s)⌉ bits

Bound on Arithmetic length

Note that ⌈-log s⌉ + 1 = ⌈log (2/s)⌉

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
  1 + ⌈log (1/s)⌉
  = 1 + ⌈ log Prod_{i=1,n} (1/p_i) ⌉
  ≤ 2 + Sum_{i=1,n} log (1/p_i)
  = 2 + Sum_{k=1,|S|} n p_k log (1/p_k)
  = 2 + n H0    bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R = 2^k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
  Output 1 followed by m 0s
  m = 0
  Message interval is expanded by 2

If u < R/2 then (bottom half)
  Output 0 followed by m 1s
  m = 0
  Message interval is expanded by 2

If l ≥ R/4 and u < 3R/4 then (middle half)
  Increment m
  Message interval is expanded by 2

All other cases: just continue...

You find this at

Arithmetic ToolBox
As a state machine

[Figure: ATB maps the current interval (L,s) and the next symbol c, with
 distribution (p_1,...,p_|S|), to the new interval (L',s').]

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox

[Figure: the symbol s (either a char c or esc) is passed to ATB with probability
 p[s | context]; ATB maps (L,s) to (L',s').]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts                 String = ACCBACCACBA B        k = 2

  Context   Counts        Context   Counts        Context   Counts
  Empty     A = 4         A         C = 3         AC        B = 1
            B = 2                   $ = 1                   C = 2
            C = 5         B         A = 2                   $ = 2
            $ = 3                   $ = 1         BA        C = 1
                          C         A = 1                   $ = 1
                                    B = 2         CA        C = 1
                                    C = 2                   $ = 1
                                    $ = 3         CB        A = 2
                                                            $ = 1
                                                  CC        A = 1
                                                            B = 1
                                                            $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
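A naive sketch of the LZ77 step described above (exhaustive longest-match search inside a sliding window; real implementations index the window, e.g. gzip uses a hash table on triplets):

def lz77_compress(s, window=6):
    """Return a list of (distance, length, next_char) triples."""
    out, i = [], 0
    while i < len(s):
        best_d, best_len = 0, 0
        for j in range(max(0, i - window), i):          # candidate copy positions
            l = 0
            while i + l < len(s) - 1 and s[j + l] == s[i + l]:
                l += 1                                  # overlap with the lookahead is allowed
            if l > best_len:
                best_d, best_len = i - j, l
        out.append((best_d, best_len, s[i + best_len]))
        i += best_len + 1
    return out

print(lz77_compress("aacaacabcabaaac"))
# -> [(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')], as in the windowed example above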

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later
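A sketch of the LZW encoder from the description above; for readability the dictionary is seeded only with the symbols actually occurring in the input (rather than all 256 ASCII codes), starting at code 112 as in the slides' example:

def lzw_encode(s, first_code=112):
    """Emit one code per longest dictionary match; add match+next_char to the dictionary."""
    dictionary = {}
    for c in sorted(set(s)):                 # seed with single symbols (full ASCII in practice)
        dictionary[c] = first_code; first_code += 1
    next_code = 256                          # new entries start at 256, as in the slides
    out, w = [], ""
    for c in s:
        if w + c in dictionary:
            w += c                           # extend the current match
        else:
            out.append(dictionary[w])
            dictionary[w + c] = next_code; next_code += 1
            w = c
    out.append(dictionary[w])
    return out

print(lzw_encode("aabaacababacb"))
# -> [112, 112, 113, 256, 114, 257, 261, 114, 113]  (the slides' example, plus the final flush)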

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
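A tiny Python sketch of both directions, matching the mississippi# example (the forward transform below naively sorts all rotations; the next slides build it from the suffix array instead):

def bwt(t):
    """Last column L of the sorted rotations of t (t must end with a unique '#')."""
    rots = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(r[-1] for r in rots)

def ibwt(L):
    """Invert the BWT via the LF mapping: L[i] precedes F[i] in T."""
    n = len(L)
    order = sorted(range(n), key=lambda i: (L[i], i))   # stable: equal chars keep relative order
    LF = [0] * n
    for f_pos, l_pos in enumerate(order):
        LF[l_pos] = f_pos                                # row of L[l_pos] as a first character
    r, out = 0, []                                       # row 0 starts with the sentinel '#'
    for _ in range(n - 1):                               # reconstruct T backward, except '#'
        out.append(L[r])
        r = LF[r]
    return "".join(reversed(out)) + "#"

L = bwt("mississippi#")
print(L)            # -> ipssm#pissii
print(ibwt(L))      # -> mississippi#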

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n^2 log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pr that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999 -- WebBase Crawl 2001:
Indegree follows a power-law distribution

  Pr[ in-degree(u) = k ]  ∝  1 / k^a ,    a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pr that a node has x links is 1/x^a, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:
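A small sketch (an assumption about the details) of this gap encoding, using the v -> 2v / 2|v|-1 mapping that the residual examples below apply to possibly negative values; the successor list is a made-up example:

def encode_gaps(x, successors):
    """S(x) = {s1-x, s2-s1-1, ..., sk-s_{k-1}-1}; the first, possibly negative,
    gap is mapped to a non-negative integer via v -> 2v (v>=0) / 2|v|-1 (v<0)."""
    s = sorted(successors)
    first = s[0] - x
    gaps = [2 * first if first >= 0 else 2 * abs(first) - 1]
    gaps += [s[i] - s[i - 1] - 1 for i in range(1, len(s))]
    return gaps

print(encode_gaps(15, [13, 15, 16, 17, 18, 19, 23, 24, 203]))
# -> [3, 1, 0, 0, 0, 0, 3, 0, 178]   (first gap 13-15 = -2 is mapped to 3)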

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides an efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
            Emacs size    Emacs time
  uncompr   27Mb          ---
  gzip      8Mb           35 secs
  zdelta    1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: weighted graph over the files plus a dummy node 0; edge weights are the
 zdelta/gzip sizes, and the min branching picks the cheapest reference for each file.]

            space    time
  uncompr   30Mb     ---
  tgz       20%      linear
  THIS      8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n^2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strictly n^2 time.

            space    time
  uncompr   260Mb    ---
  tgz       12%      2 mins
  THIS      8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
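A toy sketch of the client/server idea above (block information sent one way, then the new file encoded as copy/literal instructions). Real rsync sends a 4-byte rolling hash plus a 2-byte MD5 per block; here, for simplicity, the block contents themselves stand in for both:

BLOCK = 4   # toy block size; rsync defaults to max{700, sqrt(n)} bytes

def client_blocks(f_old):
    """Client side: one entry per aligned block of the old file."""
    return {f_old[i:i + BLOCK]: i // BLOCK
            for i in range(0, len(f_old) - BLOCK + 1, BLOCK)}

def server_encode(f_new, blocks):
    """Server side: emit ('copy', block_id) where a block of f_old matches, else ('lit', byte)."""
    out, i = [], 0
    while i < len(f_new):
        chunk = f_new[i:i + BLOCK]
        if len(chunk) == BLOCK and chunk in blocks:
            out.append(("copy", blocks[chunk])); i += BLOCK
        else:
            out.append(("lit", f_new[i:i + 1])); i += 1     # literals are gzip'ed in practice
    return out

f_old = b"aaaabbbbccccdddd"
f_new = b"aaaabbbbXccccdddd"
print(server_encode(f_new, client_blocks(f_old)))
# -> [('copy', 0), ('copy', 1), ('lit', b'X'), ('copy', 2), ('copy', 3)]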

Rsync: some experiments

              gcc size    emacs size
  total       27288       27326
  gzip        7563        8577
  zdelta      227         1431
  rsync       964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), the client checks them
Server deploys the common fref to compress the new ftar (rsync just compresses it).

A multi-round protocol

k blocks of n/k elems

log(n/k) levels

If the distance is ≤ k, then on each level ≤ k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

[Figure: P laid over T at position i, i.e. over the suffix T[i,N]]

Occurrences of P in T = All suffixes of T having P as a prefix

P = si     T = mississippi     occurrences at 4, 7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree

T# = mississippi#
     1 2 3 4 5 6 7 8 9 10 11 12

[Figure: the compacted trie of all suffixes of T#. Edge labels are substrings of T#
 (#, i, i#, mississippi#, p, pi#, ppi#, s, si, ssi, ...); the 12 leaves store the
 starting positions 1..12 of the corresponding suffixes.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Θ(N^2) space (if the suffixes of SUF(T) were stored explicitly)
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
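A compact sketch of the indirect binary search on SA described above (O(p) chars compared per probe); the SA construction here is the naive sort, i.e. the inefficient method criticized by the later slide:

def suffix_array(t):
    """Naive construction: sort suffix start positions by comparing suffixes directly."""
    return sorted(range(1, len(t) + 1), key=lambda i: t[i - 1:])     # 1-based, as in the slides

def occurrences(t, sa, p):
    """Starting positions of p in t: binary search on SA."""
    def first(pred):                       # first SA index whose suffix makes pred False
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            if pred(t[sa[mid] - 1:]):
                lo = mid + 1
            else:
                hi = mid
        return lo
    lo = first(lambda suf: suf < p)                    # first suffix >= p
    hi = first(lambda suf: suf[:len(p)] <= p)          # just past the suffixes prefixed by p
    return sorted(sa[lo:hi])

t = "mississippi#"
sa = suffix_array(t)
print(sa)                          # -> [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(occurrences(t, sa, "si"))    # -> [4, 7]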

Locating the occurrences

T = mississippi#      P = si      occ = 2

[Figure: binary searching for si# and si$ (with # < Σ < $) delimits the contiguous
 range of SA holding the suffixes sippi... and sissippi..., i.e. the occurrences
 at positions 7 and 4.]

Suffix Array search
• O(p + log2 N + occ) time

Suffix Trays: O(p + log2 |S| + occ)     [Cole et al., '06]
String B-tree                           [Ferragina-Grossi, '95]
Self-adjusting Suffix Arrays            [Ciriani et al., '02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
[e.g. Lcp = 4 between the adjacent suffixes issippi... and ississippi...]
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does it exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does it exist a substring of length ≥ L occurring ≥ C times ?
• Search for Lcp[i,i+C-2] whose entries are ≥ L


Slide 20

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store, for any level L:
 firstcode[L]   (= 00.....0 in the figure)
 Symbol[L,i], for each i in level L

This is ≤ h² + |S| log |S| bits

Canonical Huffman: Encoding
[Figure: the canonical codeword tree, with levels 1–5]

Canonical Huffman: Decoding
firstcode[1]=2, firstcode[2]=1, firstcode[3]=1, firstcode[4]=2, firstcode[5]=0

T = ...00010...
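A hedged sketch of one common canonical-decoding loop; the firstcode values are the slide’s, while the per-level symbol table is a made-up placeholder:

def canonical_decode(bits, firstcode, symbol):
    """bits: sequence of 0/1. firstcode[l] = integer value of the first codeword of length l.
    symbol[l][i] = i-th symbol on level l (hypothetical placeholder table)."""
    out, it = [], iter(bits)
    while True:
        try:
            v = next(it)
        except StopIteration:
            return out
        l = 1
        while v < firstcode[l]:          # keep extending the current codeword
            v = 2 * v + next(it)
            l += 1
        out.append(symbol[l][v - firstcode[l]])

firstcode = {1: 2, 2: 1, 3: 1, 4: 2, 5: 0}        # values from the slide
symbol = {5: ["s0", "s1", "s2", "s3"]}             # hypothetical symbols on level 5
print(canonical_decode([0, 0, 0, 1, 0], firstcode, symbol))   # ['s2'] — the ...00010... fragment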

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
− log₂(.999) ≈ .00144

If we were to send 1000 such symbols we might hope to use 1000 · .00144 ≈ 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:


Model takes |S|^k · (k · log |S|) + h² bits

It is H₀(S^L) ≤ L · H_k(S) + O(k · log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
[Figure: the word-based Huffman tree (fan-out 128) over the words of T = “bzip or not bzip”; each codeword is a sequence of 7-bit symbols packed into bytes, with the first bit of the first byte used as tag. Example shown: the byte-aligned, tagged codeword of “or”.]

CGrep and other ideas...
P= bzip = 1a 0b

[Figure: GREP over the compressed text — the codeword of P = bzip = 1a 0b is searched directly over C(T), the byte-aligned, tagged codewords of T = “bzip or not bzip”; candidate alignments are marked yes/no]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary: bzip, not, or, space
P = bzip = 1a 0b
S = “bzip or not bzip”
[Figure: the dictionary trie and the tagged, byte-aligned codewords of C(S); occurrences of P’s codeword are marked yes/no]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
[Figure: pattern P slid along text T]

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which arithmetic and bit operations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with small probability.
A randomized algorithm that never errs, whose running time is efficient with high probability.

We will consider a binary alphabet (i.e., T ∈ {0,1}^n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H(s)  =  Σ_{i=1..m} 2^{m−i} · s[i]

P = 0101
H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H(T_r)  =  2·H(T_{r−1})  −  2^m · T[r−1]  +  T[r+m−1]

T = 10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H(T1) = H(1011) = 11
H(T2) = 2·11 − 2⁴·1 + 0 = 22 − 16 + 0 = 6 = H(0110)

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally! (Horner’s rule, everything mod 7)
1·2 (mod 7) + 0 = 2
2·2 (mod 7) + 1 = 5
5·2 (mod 7) + 1 = 11 mod 7 = 4
4·2 (mod 7) + 1 = 9 mod 7 = 2
2·2 (mod 7) + 1 = 5
5 (mod 7) = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1):
2^m (mod q) = 2·(2^{m−1} (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q ≤ I, and compute P’s fingerprint Hq(P).
For each position r in T, compute Hq(Tr) and test whether it equals Hq(P). If the numbers are equal, either
 declare a probable match (randomized algorithm), or
 check and declare a definite match (deterministic algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
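A minimal sketch of the fingerprint scan on a binary text; here q is a fixed large prime rather than a randomly drawn one, and matches are only “probable” unless verified:

def karp_rabin(T, P, q=1_000_000_007):
    """Positions r (1-based) where P may occur in T; fingerprints are taken mod the prime q."""
    n, m = len(T), len(P)
    if m > n:
        return []
    pow_m = pow(2, m, q)                       # 2^m mod q, used to drop the leftmost bit
    hp = ht = 0
    for i in range(m):                         # Hq(P) and Hq(T_1), Horner-style
        hp = (2 * hp + int(P[i])) % q
        ht = (2 * ht + int(T[i])) % q
    hits = []
    for r in range(n - m + 1):
        if ht == hp:                           # probable match; verify T[r:r+m] == P for a definite one
            hits.append(r + 1)
        if r + m < n:                          # Hq(T_{r+2}) from Hq(T_{r+1})
            ht = (2 * ht - pow_m * int(T[r]) + int(T[r + m])) % q
    return hits

print(karp_rabin("10110101", "0101"))          # [5]  (cf. the slide's example)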

Problem 1: Solution
Dictionary: bzip, not, or, space
P = bzip = 1a 0b
S = “bzip or not bzip”
[Figure: the codeword of P matched against the tagged, byte-aligned codewords of C(S); matches marked yes/no]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

[Figure: the m×n matrix M for this example; the only 1 in row m = 3 is M(3,7) = 1, i.e. the occurrence of “for” ending at position 7 of T = california]
How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one.
We define the m-length binary vector U(x) for each character x in the alphabet: U(x) is set to 1 for the positions in P where character x appears.
Example: P = abaac
U(a) = (1,0,1,1,0)   U(b) = (0,1,0,0,0)   U(c) = (0,0,0,0,1)

How to construct M

 Initialize column 0 of M to all zeros
 For j > 0, the j-th column is obtained by

M(j) = BitShift( M(j-1) ) & U(T[j])

 For i > 1, entry M(i,j) = 1 iff
 (1) the first i-1 characters of P match the i-1 characters of T ending at character j-1   ⇔   M(i-1,j-1) = 1
 (2) P[i] = T[j]   ⇔   the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) into the i-th position; ANDing it with the i-th bit of U(T[j]) establishes whether both conditions hold.
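A compact sketch using Python integers as bit vectors (bit i−1 of the word plays the role of row i of M):

def shift_and(T, P):
    """Return the ending positions (1-based) of the exact occurrences of P in T."""
    m = len(P)
    U = {}                                  # U[c]: bit i-1 set iff P[i] == c
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    target = 1 << (m - 1)                   # bit of row m: a full match ends here
    M, ends = 0, []
    for j, c in enumerate(T, start=1):
        M = ((M << 1) | 1) & U.get(c, 0)    # BitShift(M(j-1)) & U(T[j])
        if M & target:
            ends.append(j)
    return ends

print(shift_and("xabxabaaca", "abaac"))     # [9]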

An example: T = xabxabaaca, P = abaac
U(a) = (1,0,1,1,0), U(b) = (0,1,0,0,0), U(c) = (0,0,0,0,1), U(x) = (0,0,0,0,0)

j=1: M(1) = BitShift(M(0)) & U(x) = (1,0,0,0,0) & (0,0,0,0,0) = (0,0,0,0,0)
j=2: M(2) = BitShift(M(1)) & U(a) = (1,0,0,0,0) & (1,0,1,1,0) = (1,0,0,0,0)
j=3: M(3) = BitShift(M(2)) & U(b) = (1,1,0,0,0) & (0,1,0,0,0) = (0,1,0,0,0)
...
j=9: M(9) = BitShift(M(8)) & U(c) = (1,1,0,0,1) & (0,0,0,0,1) = (0,0,0,0,1)
     M(5,9) = 1  →  an occurrence of P ends at position 9

[Figure: the full 5×10 matrix M, one column per text position]

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: Another solution
Dictionary: bzip, not, or, space
P = bzip = 1a 0b
S = “bzip or not bzip”
[Figure: the same dictionary/C(S) example, now searched with the byte-aligned codeword of P; matches marked yes/no]

Speed ≈ Compression ratio

Problem 2
Dictionary: bzip, not, or, space
Given a pattern P, find all the occurrences in S of all terms containing P as a substring
P = o   →   the dictionary terms containing P are:  not = 1g 0g 0a,  or = 1g 0a 0b
S = “bzip or not bzip”
[Figure: the occurrences of both codewords marked (yes) over C(S)]

Speed ≈ Compression ratio? No! Why?
A scan of C(S) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: text T with the occurrences of patterns P1 and P2 marked]

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

 S is the concatenation of the patterns in P
 R is a bitmap of length m: R[i] = 1 iff S[i] is the first symbol of a pattern
 Use a variant of the Shift-And method searching for S:
  For any symbol c, U’(c) = U(c) AND R, so U’(c)[i] = 1 iff S[i] = c and it is the first symbol of a pattern
  For any step j:
   compute M(j), then M(j) OR U’(T[j]). Why?
    It sets to 1 the first bit of each pattern that starts with T[j]
   Check whether there are occurrences ending in j. How?

Problem 3
Dictionary: bzip, not, or, space
Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing at most k mismatches
P = bot, k = 2
S = “bzip or not bzip”
[Figure: the dictionary trie and the codewords of C(S)]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
there are no more than l mismatches between the first i characters of P and the i characters of T ending at character j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

[Figure: P[1..i-1] aligned against the substring of T ending at j-1, with ≤ l mismatches (marked *)]

BitShift( Ml(j-1) ) & U(T[j])

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

[Figure: P[1..i-1] aligned against the substring of T ending at j-1, with ≤ l-1 mismatches (marked *)]

BitShift( Ml-1(j-1) )

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

Ml(j)  =  [ BitShift( Ml(j-1) ) & U(T[j]) ]   OR   BitShift( Ml-1(j-1) )
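A sketch of this recurrence, again with Python integers as columns (M[l] holds the current column of Ml):

def agrep_mismatches(T, P, k):
    """Ending positions (1-based) of substrings of T matching P with at most k mismatches."""
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)          # U[c]: bit i set iff P[i+1] == c
    target = 1 << (m - 1)
    M, ends = [0] * (k + 1), []
    for j, c in enumerate(T, start=1):
        prev = M[:]                            # the columns Ml(j-1)
        M[0] = ((prev[0] << 1) | 1) & U.get(c, 0)
        for l in range(1, k + 1):
            # case 1 (characters match) OR case 2 (spend one more mismatch)
            M[l] = (((prev[l] << 1) | 1) & U.get(c, 0)) | ((prev[l - 1] << 1) | 1)
        if M[k] & target:
            ends.append(j)
    return ends

print(agrep_mismatches("xabxabaaca", "abaad", 1))   # [9]  (cf. the M1 example below)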

Example M1
T = xabxabaaca,  P = abaad
[Figure: the 5×10 matrices M0 (exact matches of prefixes of P) and M1 (at most one mismatch); M1(5,9) = 1, i.e. P occurs with ≤ 1 mismatch ending at position 9]

How much do we pay?





The running time is O(k·n·(1+m/w)).
Again, the method is efficient in practice for small m.
Only O(k) columns of M are needed at any given time; hence the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary: bzip, not, or, space
Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing k mismatches
P = bot, k = 2   →   the matching dictionary term shown is  not = 1g 0g 0a
S = “bzip or not bzip”
[Figure: the occurrences of not’s codeword marked (yes) over C(S)]

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
g(x) = 000...0 (Length−1 zeroes) followed by x in binary, where x > 0 and Length = ⌊log₂ x⌋ + 1
e.g., 9 is represented as <000, 1001>.

g-code for x takes 2⌊log₂ x⌋ + 1 bits  (i.e. a factor of 2 from optimal)
Optimal for Pr(x) = 1/(2x²), and i.i.d. integers
It is a prefix-free encoding…

Given the following sequence of g-coded integers, reconstruct the original sequence:
0001000001100110000011101100111
Answer: 8  6  3  59  7
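A small sketch of g-encoding/decoding as bit strings:

def gamma_encode(x):
    """g(x) for x >= 1: (Length-1) zeroes followed by x in binary."""
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":               # count the leading zeroes
            z += 1; i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out

code = "".join(gamma_encode(x) for x in [8, 6, 3, 59, 7])
print(code)                  # 0001000001100110000011101100111
print(gamma_decode(code))    # [8, 6, 3, 59, 7]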

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code g(i).

Recall that |g(i)| ≤ 2·log i + 1
How good is this approach wrt Huffman?  Compression ratio ≤ 2·H0(s) + 1
Key fact:
1 ≥ Σ_{i=1..x} pi ≥ x · px   ⇒   x ≤ 1/px

How good is it ?
Encode the integers via g-coding:  |g(i)| ≤ 2·log i + 1
The cost of the encoding is (recall i ≤ 1/pi):

Σ_{i=1..|S|} pi · |g(i)|   ≤   Σ_{i=1..|S|} pi · [ 2·log(1/pi) + 1 ]   =   2·H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2

A new concept: Continuers vs Stoppers
The main idea is:
 Previously we used: s = c = 128
 Now s + c = 256 (we are playing with 8 bits)
 Thus s items are encoded with 1 byte
 and s·c with 2 bytes, s·c² with 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 128² = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230·26 = 6210 on 2 bytes, hence more on 1 byte and thus, if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L
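A minimal Python sketch of this loop (positions are 0-based here):

def mtf_encode(text, alphabet):
    L = list(alphabet)                 # e.g. ['a','b','c','d',...]
    out = []
    for s in text:
        pos = L.index(s)               # 1) output the position of s in L
        out.append(pos)
        L.pop(pos); L.insert(0, s)     # 2) move s to the front of L
    return out

print(mtf_encode("aaabbbccc", "abc"))  # [0, 0, 0, 1, 0, 0, 2, 0, 0]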

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1^n 2^n 3^n … n^n   ⇒   Huff = O(n² log n),  MTF = O(n log n) + n²

Not much worse than Huffman
...but it may be far better

MTF: how good is it ?
Encode the integers via g-coding:  |g(i)| ≤ 2·log i + 1
Put S at the front and consider the cost of encoding
(nx = #occurrences of symbol x, p_i^x = position of its i-th occurrence):

O(|S| log |S|)  +  Σ_{x=1..|S|} Σ_{i=2..nx} | g( p_i^x − p_{i-1}^x ) |

By Jensen’s inequality:
≤  O(|S| log |S|)  +  Σ_{x=1..|S|} nx · [ 2·log(N/nx) + 1 ]
=  O(|S| log |S|)  +  N · [ 2·H0(X) + 1 ]

⇒  La[mtf]  ≤  2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to maintain the MTF-list efficiently:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
There is a memory
X = 1^n 2^n 3^n … n^n   ⇒   Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)
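A tiny sketch of the transform on the slide’s example:

def rle(s):
    out, i = [], 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1
        out.append((s[i], j - i))      # (symbol, run length)
        i = j
    return out

print(rle("abbbaacccca"))  # [('a',1), ('b',3), ('a',2), ('c',4), ('a',1)]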

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.  p(a) = .2,  p(b) = .5,  p(c) = .3   →   f(a) = .0,  f(b) = .2,  f(c) = .7

f(i)  =  Σ_{j=1..i-1} p(j)

[Figure: the unit interval [0,1) split into a = [0,.2), b = [.2,.7), c = [.7,1)]

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
[Figure: nested intervals — start [0,1); after b → [.2,.7); after a → [.2,.3); after c → [.27,.3)]
The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities p[c] use the following:

l0 = 0,   li = li-1 + si-1 · f[ci]
s0 = 1,   si = si-1 · p[ci]

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is  sn = Π_{i=1..n} p[ci]

The interval for a message sequence will be called the sequence interval
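A small sketch computing the sequence interval with exact fractions (using the model of the previous slide, p(a)=.2, p(b)=.5, p(c)=.3):

from fractions import Fraction as F

p = {"a": F(2, 10), "b": F(5, 10), "c": F(3, 10)}
f = {"a": F(0), "b": F(2, 10), "c": F(7, 10)}      # cumulative prob. up to the symbol

def sequence_interval(msg):
    l, s = F(0), F(1)                  # l_0 = 0, s_0 = 1
    for c in msg:
        l = l + s * f[c]               # l_i = l_{i-1} + s_{i-1} * f[c_i]
        s = s * p[c]                   # s_i = s_{i-1} * p[c_i]
    return l, l + s

print(sequence_interval("bac"))        # (Fraction(27, 100), Fraction(3, 10)) i.e. [.27,.3)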

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:
[Figure: .49 falls into b = [.2,.7), then into b = [.3,.55), then into c = [.475,.55)]
The message is bbc.

Representing a real number
Binary fractional representation:
.75 = .11      1/3 = .0101...      11/16 = .1011

Algorithm:  1. x = 2·x   2. if x < 1 output 0   3. else x = x − 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
e.g.  [0,.33) → .01      [.33,.66) → .1      [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions.

code    min     max     interval
.11     .110    .111    [.75, 1.0)
.101    .1010   .1011   [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval is contained in the sequence interval (a dyadic number).

[Figure: the sequence interval [.61,.79) contains the code interval of .101 = [.625,.75)]

Can use L + s/2 truncated to 1 + ⌈log(1/s)⌉ bits

Bound on Arithmetic length
Note that −log s + 1 = log(2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most
1 + ⌈log(1/s)⌉ = 1 + ⌈log Π (1/pi)⌉ ≤ 2 + Σ_{j=1..n} log(1/pj) = 2 + Σ_{k=1..|S|} n·pk·log(1/pk) = 2 + n·H0  bits

nH0 + 0.02·n bits in practice, because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R = 2^k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
 Output 1 followed by m 0s
 m = 0
 Message interval is expanded by 2
If u < R/2 then (bottom half)
 Output 0 followed by m 1s
 m = 0
 Message interval is expanded by 2
If l ≥ R/4 and u < 3R/4 then (middle half)
 Increment m
 Message interval is expanded by 2
All other cases: just continue...

You find this at

Arithmetic ToolBox
As a state machine: given the current interval (L,s), a symbol c and its distribution (p1,....,pS), ATB outputs the refined interval (L’,s’) inside [L, L+s).
[Figure: the ATB box mapping (L,s) and c to (L’,s’)]
Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
[Figure: the ATB state machine, now driven by p[ s | context ], where s = c or esc]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
String = ACCBACCACBA B,   k = 2

Order 0 (empty context):   A = 4   B = 2   C = 5   $ = 3

Order 1:
 Context A:   C = 3   $ = 1
 Context B:   A = 2   $ = 1
 Context C:   A = 1   B = 2   C = 2   $ = 3

Order 2:
 Context AC:  B = 1   C = 2   $ = 2
 Context BA:  C = 1   $ = 1
 Context CA:  C = 1   $ = 1
 Context CB:  A = 2   $ = 1
 Context CC:  A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
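A minimal decoder sketch for (d, len, c) triples, showing that the overlapping copy works:

def lz77_decode(triples):
    """Each triple is (d, length, c): copy `length` chars from distance d, then append c."""
    out = []
    for d, length, c in triples:
        for _ in range(length):
            out.append(out[-d])        # works even when length > d (overlapping copy)
        out.append(c)
    return "".join(out)

# the windowed example of the previous slide:
print(lz77_decode([(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')]))
# aacaacabcabaaac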

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
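A toy encoder sketch; the 3-letter initial dictionary (a=112, b=113, c=114) mirrors the slide’s convention, whereas a real coder would start from the 256 byte values:

def lzw_encode(text, alphabet="abc", base=112):
    """Toy LZW: the initial dictionary maps the alphabet to codes base, base+1, ..."""
    d = {c: base + i for i, c in enumerate(alphabet)}
    nxt = 256
    w, out = "", []
    for c in text:
        if w + c in d:
            w += c                     # extend the current match
        else:
            out.append(d[w])           # emit the code of the longest match S ...
            d[w + c] = nxt; nxt += 1   # ... and add Sc to the dictionary
            w = c
    out.append(d[w])                   # flush the last match
    return out

print(lzw_encode("aabaacababacb"))
# [112, 112, 113, 256, 114, 257, 261, 114, 113]  (the slide's run, plus the final flush)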

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform (1994)
Let us be given a text T = mississippi#

All rotations of T:
mississippi#, ississippi#m, ssissippi#mi, sissippi#mis, issippi#miss, ssippi#missi, sippi#missis, ippi#mississ, ppi#mississi, pi#mississip, i#mississipp, #mississippi

Sort the rows; F is the first column, L the last one:
F              L
# mississipp   i
i #mississip   p
i ppi#missis   s
i ssippi#mis   s
i ssissippi#   m
m ississippi   #
p i#mississi   p
p pi#mississ   i
s ippi#missi   s
s issippi#mi   s
s sippi#miss   i
s sissippi#m   i

A famous example: [Figure: the BWT of a much longer text]

A useful tool: the L → F mapping
[Figure: the same sorted matrix, with columns F and L highlighted]
How do we map L’s chars onto F’s chars ?
... we need to distinguish equal chars in F ...
Take two equal chars of L: rotating their rows rightward by one position shows that they keep the same relative order in F !!

The BWT is invertible
[Figure: the sorted matrix with columns F and L]
Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}

How to compute the BWT ?

SA     sorted rotation        L
12     #mississippi           i
11     i#mississipp           p
8      ippi#mississ           s
5      issippi#miss           s
2      ississippi#m           m
1      mississippi#           #
10     pi#mississip           p
9      ppi#mississi           i
7      sippi#missis           s
4      sissippi#mis           s
6      ssippi#missi           i
3      ssissippi#mi           i

We said that: L[i] precedes F[i] in T
e.g. L[3] = T[7]
Given SA and T, we have L[i] = T[SA[i]-1]
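A small sketch that builds L from the (naively computed) suffix array and inverts it with the LF mapping:

def bwt(T):
    """L[i] = T[SA[i]-1]; T must end with a unique smallest terminator (e.g. '#')."""
    sa = sorted(range(len(T)), key=lambda i: T[i:])   # naive O(n^2 log n) construction
    return "".join(T[i - 1] for i in sa), sa          # T[-1] wraps to the terminator

def ibwt(L):
    """Invert via the LF mapping: equal chars keep their relative order from L to F."""
    n = len(L)
    F = sorted(range(n), key=lambda r: (L[r], r))     # stable sort of L's chars
    LF = [0] * n
    for f_row, l_row in enumerate(F):
        LF[l_row] = f_row                             # L[l_row] sits at row f_row of F
    out, r = [L[F[0]]], 0                             # T ends with the terminator; row 0 starts with it
    for _ in range(n - 1):
        out.append(L[r])                              # L[r] precedes F[r] in T
        r = LF[r]
    return "".join(reversed(out))

L, sa = bwt("mississippi#")
print(L, sa)       # ipssm#pissii  [11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]  (0-based SA)
print(ibwt(L))     # mississippi#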

How to construct SA from T ?

SA     suffix
12     #
11     i#
8      ippi#
5      issippi#
2      ississippi#
1      mississippi#
10     pi#
9      ppi#
7      sippi#
4      sissippi#
6      ssippi#
3      ssissippi#
Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion pages available (Google 7/08)
5-40K per page ⇒ hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)
 Set of nodes such that from any node one can reach any other node via an undirected path.

Strongly connected components (SCC)
 Set of nodes such that from any node one can reach any other node via a directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by humans

Exploit the structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…

 Physical network graph: V = routers, E = communication links
 The “cosine” graph (undirected, weighted): V = static web pages, E = semantic distance between pages
 Query-Log graph (bipartite, weighted): V = queries and URLs, E = (q,u) if u is a result for q and has been clicked by some user who issued q
 Social graph (undirected, unweighted): V = users, E = (x,y) if x knows y (facebook, address book, email, ...)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:
 Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution
Altavista crawl, 1999 — WebBase crawl, 2001
Indegree follows a power-law distribution:
Pr[ in-degree(u) = k ]  ∝  1/k^a,   a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 million pages, 150 million links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1 − x, s2 − s1 − 1, ..., sk − sk-1 − 1}

For negative entries (only the first gap can be negative): map them to non-negatives, v ≥ 0 → 2v, v < 0 → 2|v| − 1 (the same folding used for the residuals below)
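A sketch of this gap transformation; the successor list below is just an illustrative one (not taken from the figures), and the folding of the possibly-negative first gap follows the rule used for the residual examples later in this section (v ≥ 0 → 2v, v < 0 → 2|v|−1):

def gaps(x, succ):
    """succ: sorted successor list of node x -> gap-encoded list (WebGraph-style)."""
    fold = lambda v: 2 * v if v >= 0 else 2 * abs(v) - 1   # only the first gap may be negative
    out = [fold(succ[0] - x)]
    out += [succ[i] - succ[i - 1] - 1 for i in range(1, len(succ))]
    return out

print(gaps(15, [13, 15, 16, 17, 18, 19, 23, 24, 203]))   # hypothetical successor list
# [3, 1, 0, 0, 0, 0, 3, 0, 178]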

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides an efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
          Emacs size   Emacs time
uncompr   27Mb         ---
gzip      8Mb          35 secs
zdelta    1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: the weighted graph GF over the files, plus the dummy node 0; the min branching picks the cheapest reference for each file]

          space   time
uncompr   30Mb    ---
tgz       20%     linear
THIS      8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F, thus saving zdelta executions. Nonetheless, strictly n² time.

          space    time
uncompr   260Mb    ---
tgz       12%      2 mins
THIS      8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

         gcc size   emacs size
total    27288      27326
gzip     7563       8577
zdelta   227        1431
rsync    964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends the hashes (unlike the client in rsync), and the client checks them.
The server deploys the common fref to compress the new ftar (rsync just compresses it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff
P is a prefix of the i-th suffix of T (i.e. T[i,N])

[Figure: P aligned at position i inside T, covering a prefix of T[i,N]]

Occurrences of P in T = all suffixes of T having P as a prefix
Example: P = si, T = mississippi → occurrences at positions 4, 7

SUF(T) = sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
T# = mississippi#
[Figure: the suffix tree of T#; edges are labelled with substrings (e.g. si, ssi, ppi#, i#, mississippi#) and the 12 leaves store the starting positions of the suffixes]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Storing SUF(T) explicitly takes Θ(N²) space.

T = mississippi#
SA     SUF(T)
12     #
11     i#
8      ippi#
5      issippi#
2      ississippi#
1      mississippi#
10     pi#
9      ppi#
7      sippi#
4      sissippi#
6      ssippi#
3      ssissippi#

Each SA entry is a suffix pointer. Example query: P = si.

Suffix Array space:
• SA: Θ(N log₂ N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step (one to SA, one to the text).
[Figure: binary search for P = si over the SA of T = mississippi#; at each step the comparison tells whether P is smaller or larger than the probed suffix]

Suffix Array search:
• O(log₂ N) binary-search steps
• Each step takes O(p) char comparisons
 overall, O(p log₂ N) time
[improved bounds: Manber-Myers, ’90; Cole et al., ’06]
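A minimal sketch of the indirect binary search (naive SA construction; bisect’s key= needs Python 3.10+):

from bisect import bisect_left, bisect_right

def suffix_array(T):
    return sorted(range(len(T)), key=lambda i: T[i:])     # naive construction

def occurrences(T, sa, P):
    """All starting positions of P in T, via binary search on the suffix array."""
    lo = bisect_left(sa, P, key=lambda i: T[i:i + len(P)])
    hi = bisect_right(sa, P, key=lambda i: T[i:i + len(P)])
    return sorted(sa[lo:hi])

T = "mississippi#"
sa = suffix_array(T)
print(occurrences(T, sa, "si"))   # [3, 6]  (0-based; positions 4 and 7 in the slide's 1-based indexing)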

Locating the occurrences
T = mississippi#, P = si: occ = 2, at text positions 4 and 7 (suffixes sippi# and sissippi#); the range in SA is delimited by two binary searches, for si# and si$, assuming # < S < $.

Suffix Array search:  O(p + log₂ N + occ) time
Suffix Trays:  O(p + log₂ |S| + occ)   [Cole et al., ’06]
String B-tree   [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays   [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA
For T = mississippi# (SA as above):  Lcp = 0 0 1 4 0 0 1 0 2 1 3
(e.g. issippi# and ississippi# share a prefix of length 4)

• How long is the common prefix between T[i,...] and T[j,...] ?
 • Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
 • Search for Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
 • Search for a run Lcp[i,i+C-2] whose entries are all ≥ L

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 ≤ 1 extra bit per macro-symbol = 1/k extra bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:
 the model takes |S|^k · (k · log |S|) + h² bits  (where h might be |S|)
 it is H0(SL) ≤ L · Hk(S) + O(k · log |S|), for each k ≤ L

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
[Figure: word-based tagged Huffman for T = "bzip or not bzip". The tree assigns byte-aligned codewords to the words bzip, or, not and to the space; 7 bits of each byte carry the Huffman code and the extra tagging bit marks the first byte of each codeword. C(T) is the concatenation of the byte-aligned codewords.]

CGrep and other ideas...
P= bzip = 1a 0b

[Figure: GREP on the compressed text. The codeword of P = bzip = 1a 0b is searched directly inside C(T) for T = "bzip or not bzip"; the tag bits rule out false matches starting in the middle of a codeword (yes/no per alignment).]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

Dictionary = {bzip, not, or} (plus the space).   S = "bzip or not bzip".
[Figure: scan C(S) comparing the byte-aligned codeword of P against each codeword of C(S); yes/no per word.]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
[Figure: the pattern P slid over the text T, one alignment per position.]
 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with small probability.
A randomized algorithm that never errs, whose running time is efficient with high probability.

We will consider a binary alphabet (i.e., T ∈ {0,1}^n).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.
Let s be a string of length m:

H(s) = Σ_{i=1..m} 2^(m-i) · s[i]

Example: P = 0101,  H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5
s = s' if and only if H(s) = H(s')

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons

We can compute H(Tr) from H(Tr-1):

H(Tr) = 2·H(Tr-1) - 2^m·T(r-1) + T(r+m-1)

Example: T = 10110101,  T1 = 1011,  T2 = 0110
H(T1) = H(1011) = 11
H(T2) = 2·11 - 2⁴·1 + 0 = 22 - 16 = 6 = H(0110)

Arithmetic replaces Comparisons

A simple efficient algorithm:
 Compute H(P) and H(T1)
 Run over T, computing H(Tr) from H(Tr-1) in constant time, and compare H(P) = H(Tr)

Total running time O(n+m)?   NO! Why?
The problem is that when m is large, it is unreasonable to assume that each arithmetic operation can be done in O(1) time: values of H() are m-bit numbers, in general too BIG to fit in a machine word.

IDEA! Let's use modular arithmetic:
for some prime q, the Karp-Rabin fingerprint of a string s is defined by Hq(s) = H(s) (mod q)

An example
P = 101111,  q = 7
H(P) = 47,  Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1·2 (mod 7) + 0 = 2
2·2 (mod 7) + 1 = 5
5·2 (mod 7) + 1 = 4
4·2 (mod 7) + 1 = 2
2·2 (mod 7) + 1 = 5
5 (mod 7) = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1), since 2^m (mod q) = 2·(2^(m-1) (mod q)) (mod q).
Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm
 Choose a positive integer I
 Pick a random prime q ≤ I, and compute P's fingerprint Hq(P)
 For each position r in T, compute Hq(Tr) and test whether it equals Hq(P). If the two numbers are equal, either
   declare a probable match (randomized algorithm), or
   check and declare a definite match (deterministic algorithm)

Running time: excluding verification, O(n+m).
The randomized algorithm is correct w.h.p.
The deterministic algorithm has expected running time O(n+m).

Proof on the board
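A runnable Python sketch of the scan with the rolling fingerprint; here q is fixed for illustration (the algorithm above picks it at random) and every fingerprint hit is verified, as in the deterministic variant.

def karp_rabin(T, P, q=2_147_483_647):
    n, m = len(T), len(P)
    if m == 0 or m > n:
        return []
    hp = ht = 0
    for i in range(m):                    # Hq(P) and Hq(T1), computed incrementally
        hp = (2 * hp + int(P[i])) % q
        ht = (2 * ht + int(T[i])) % q
    top = pow(2, m - 1, q)                # 2^(m-1) mod q, used to drop the leftmost bit
    occ = []
    for r in range(n - m + 1):
        if hp == ht and T[r:r+m] == P:    # verification -> no false matches reported
            occ.append(r + 1)             # 1-based positions, as in the slides
        if r + m < n:                     # rolling update of the fingerprint
            ht = ((ht - int(T[r]) * top) * 2 + int(T[r + m])) % q
    return occ

print(karp_rabin("10110101", "0101"))     # [5], cf. the example above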

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

[Figure: Dictionary = {bzip, not, or, space}; S = "bzip or not bzip"; the codeword of P is fingerprint-matched against C(S), answering yes/no per codeword alignment.]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

[Figure: the m×n matrix M for T = california and P = for. The only 1-entries are M(1,5), M(2,6), M(3,7): row m has a 1 in column 7, i.e. P occurs ending at position 7 of T.]
How does M solve the exact match problem?

How to construct M

We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one.
Machines can perform bit and arithmetic operations between two words in constant time. Examples:
 And(A,B) is the bit-wise and between A and B.
 BitShift(A) is the value derived by shifting A's bits down by one and setting the first bit to 1,
   e.g. BitShift( (0,1,1,0,1) ) = (1,0,1,1,0).

Let w be the word size (e.g., 32 or 64 bits). We'll assume m ≤ w. NOTICE: any column of M fits in a memory word.

How to construct M

We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one.
We define the m-length binary vector U(x) for each character x of the alphabet: U(x) is set to 1 in the positions where character x appears in P.

Example: P = abaac
U(a) = (1,0,1,1,0),  U(b) = (0,1,0,0,0),  U(c) = (0,0,0,0,1)

How to construct M

 Initialize column 0 of M to all zeros
 For j > 0, the j-th column is obtained by

M(j) = BitShift( M(j-1) ) & U( T[j] )

For i > 1, entry M(i,j) = 1 iff
 (1) the first i-1 characters of P match the i-1 characters of T ending at character j-1   ⇔   M(i-1,j-1) = 1
 (2) P[i] = T[j]   ⇔   the i-th bit of U(T[j]) = 1
BitShift moves bit M(i-1,j-1) into the i-th position; ANDing it with the i-th bit of U(T[j]) establishes whether both conditions hold.
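A compact Python sketch of this recurrence, keeping column M(j) as an integer bit-mask (bit i-1 of the mask plays the role of row i); the test string is the one of the examples that follow.

def shift_and(T, P):
    """1-based positions where an occurrence of P ends in T, assuming m <= word size."""
    m = len(P)
    U = {}                                   # U[c]: bit i set iff P[i+1] = c (0-based i)
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M, last = 0, 1 << (m - 1)
    occ = []
    for j, c in enumerate(T, start=1):
        M = ((M << 1) | 1) & U.get(c, 0)     # M(j) = BitShift(M(j-1)) & U(T[j])
        if M & last:                         # row m is set -> P = T[j-m+1 .. j]
            occ.append(j)
    return occ

print(shift_and("xabxabaaca", "abaac"))      # [9]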

Examples j = 1, 2, 3, 9  (P = abaac, T = xabxabaaca):
[Figure: the columns of M are filled in left to right via M(j) = BitShift(M(j-1)) & U(T[j]). E.g. M(1) = BitShift(M(0)) & U(x) = (0,0,0,0,0); M(2) = BitShift(M(1)) & U(a) = (1,0,0,0,0); M(3) = BitShift(M(2)) & U(b) = (0,1,0,0,0); … ; M(9) = BitShift(M(8)) & U(c) = (0,0,0,0,1), whose 5-th bit signals an occurrence of P ending at position 9.]

Shift-And method: Complexity
 If m ≤ w, any column and any vector U() fit in a memory word → each step requires O(1) time.
 If m > w, any column and any vector U() can be divided into ⌈m/w⌉ memory words → each step requires O(m/w) time.
 Overall O(n(1+m/w)+m) time.
Thus it is very fast when the pattern length is close to the word size — very often in practice. Recall that w = 64 bits in modern architectures.

Some simple extensions

We want to allow the pattern to contain special symbols, like the class of chars [a-f].
Example: P = [a-b]baac
U(a) = (1,0,1,1,0),  U(b) = (1,1,0,0,0),  U(c) = (0,0,0,0,1)
(position 1 is set in both U(a) and U(b))

What about '?', '[^…]' (not)?

Problem 1: Another solution
Dictionary = {bzip, not, or} (plus the space).   P = bzip = 1a 0b.   S = "bzip or not bzip".
[Figure: the codeword of P is searched in C(S) with the Shift-And method; yes/no per alignment.]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

Dictionary = {bzip, not, or} (plus the space).   P = o.   S = "bzip or not bzip".
[Figure: the dictionary terms containing 'o' are "not" and "or"; their codewords are searched in C(S):]
not = 1 g 0 g 0 a
or = 1 g 0 a 0 b

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: the patterns P1 and P2 aligned at their occurrences inside the text T.]

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

 S is the concatenation of the patterns in P
 R is a bitmap of length m:  R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of the Shift-And method searching for S:
 For any symbol c, U'(c) = U(c) AND R, so U'(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
 For any step j:
   compute M(j), then M(j) OR U'(T[j]).  Why? It sets to 1 the first bit of each pattern that starts with T[j]
   check if there are occurrences ending in j.  How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

Dictionary = {bzip, not, or} (plus the space).   S = "bzip or not bzip".   P = bot, k = 2.
[Figure: C(S) and the codewords of the dictionary terms, to be matched allowing up to k mismatches.]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep

Our current goal: given k, find all the occurrences of P in T with up to k mismatches.
We define the matrix M^l to be an m by n binary matrix such that:

M^l(i,j) = 1 iff there are no more than l mismatches between the first i characters of P and the i characters of T ending at character j.

What is M^0?
How does M^k solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing M^l: case 1

The first i-1 characters of P match a substring of T ending at j-1 with at most l mismatches, and the next pair of characters in P and T are equal:

BitShift( M^l(j-1) ) & U( T[j] )

Computing M^l: case 2

The first i-1 characters of P match a substring of T ending at j-1 with at most l-1 mismatches (position i is allowed to mismatch):

BitShift( M^(l-1)(j-1) )

Computing M^l

 We compute M^l for all l = 0, …, k; for each j we compute M(j), M^1(j), …, M^k(j).
 For all l, initialize M^l(0) to the zero vector.

Combining the two cases, there is a match iff

M^l(j) = [ BitShift( M^l(j-1) ) & U(T[j]) ]  OR  BitShift( M^(l-1)(j-1) )

Example M^1   (T = xabxabaaca, P = abaad, k = 1)
[Figure: the 5×10 matrices M^0 and M^1. M^0 marks exact prefix matches; M^1 allows one mismatch. In particular M^1(5,9) = 1: abaad matches T[5,9] = abaac with one mismatch.]

How much do we pay?
 The running time is O(k·n·(1+m/w)).
 Again, the method is efficient in practice for small m.
 Only O(k) columns of M are needed at any given time, hence the space used by the algorithm is O(k) memory words.
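A Python sketch of the k-mismatch recurrence above (columns kept as integer bit-masks, same conventions as the exact Shift-And sketch); the test reproduces the M^1 example.

def shift_and_mismatches(T, P, k):
    """1-based end positions of occurrences of P in T with at most k mismatches."""
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    last = 1 << (m - 1)
    M = [0] * (k + 1)                       # M[l] holds column M^l(j-1)
    occ = []
    for j, c in enumerate(T, start=1):
        prev = M[:]                         # columns of step j-1
        for l in range(k + 1):
            col = ((prev[l] << 1) | 1) & U.get(c, 0)   # case 1: characters match
            if l > 0:
                col |= (prev[l - 1] << 1) | 1          # case 2: spend one mismatch here
            M[l] = col
        if M[k] & last:
            occ.append(j)
    return occ

print(shift_and_mismatches("xabxabaaca", "abaad", 1))   # [9]: abaac vs abaad, 1 mismatch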

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

Dictionary = {bzip, not, or} (plus the space).   S = "bzip or not bzip".   P = bot, k = 2.
[Figure: the Agrep automaton is run over the codewords in C(S); the only dictionary term matching bot with ≤ 2 mismatches is "not":]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations

The Shift-And method can solve other ops.

The edit distance between two strings p and s is d(p,s) = minimum number of operations needed to transform p into s via three ops:
 Insertion: insert a symbol in p
 Deletion: delete a symbol from p
 Substitution: change a symbol of p into a different one

Example: d(ananas, banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding
γ(x) = 0^(Length-1) followed by x in binary, with x > 0 and Length = ⌊log2 x⌋ + 1
e.g., 9 is represented as <000,1001>.

The γ-code for x takes 2⌊log2 x⌋ + 1 bits (i.e. a factor of 2 from optimal).
It is optimal for Pr(x) = 1/(2x²), and i.i.d. integers.
It is a prefix-free encoding…

Exercise: given the following sequence of γ-coded integers, reconstruct the original sequence:
0001000001100110000011101100111
Answer: 8, 6, 3, 59, 7
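A tiny Python encoder/decoder for the γ-code; the decoder applied to the exercise string returns the answer above.

def gamma_encode(x):
    """gamma(x) for x >= 1: (Length-1) zeros followed by x in binary."""
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":              # count the leading zeros = Length - 1
            z += 1; i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out

print(gamma_encode(9))                                        # 0001001
print(gamma_decode("0001000001100110000011101100111"))        # [8, 6, 3, 59, 7]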

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).
Recall that |γ(i)| ≤ 2·log i + 1.
How good is this approach w.r.t. Huffman?   Compression ratio ≤ 2·H0(S) + 1.
Key fact:   1 ≥ Σ_{i=1..x} pi ≥ x·px   ⇒   x ≤ 1/px

How good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1.
The cost of the encoding is (recall i ≤ 1/pi):

Σ_{i=1..|S|} pi · |γ(i)|  ≤  Σ_{i=1..|S|} pi · [ 2·log(1/pi) + 1 ]  =  2·H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
The distribution of words is skewed: 1/i^q, where 1 < q < 2.

A new concept: Continuers vs Stoppers. Previously we used s = c = 128.
The main idea is:
 s + c = 256 (we are playing with 8 bits)
 thus s items are encoded with 1 byte
 and s·c with 2 bytes, s·c² with 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer sequence, which can then be var-length coded.
 Start with the list of symbols L = [a,b,c,d,…]
 For each input symbol s:
   1) output the position of s in L
   2) move s to the front of L

There is a memory. Properties:
 Exploits temporal locality, and it is dynamic
 X = 1^n 2^n 3^n … n^n  →  Huff = O(n² log n) bits, MTF = O(n log n) + n² bits

Not much worse than Huffman
...but it may be far better
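A minimal Python sketch of the MTF transform (0-based positions; the slides count positions from 1), run on the RLE example string used later.

def mtf_encode(text, alphabet):
    L = list(alphabet)                 # the MTF list
    out = []
    for s in text:
        i = L.index(s)                 # 1) output the position of s in L
        out.append(i)
        L.insert(0, L.pop(i))          # 2) move s to the front of L
    return out

def mtf_decode(ranks, alphabet):
    L, out = list(alphabet), []
    for i in ranks:
        s = L[i]
        out.append(s)
        L.insert(0, L.pop(i))
    return "".join(out)

r = mtf_encode("abbbaacccca", "abcd")
print(r)                               # [0, 1, 0, 0, 1, 0, 2, 0, 0, 0, 1]
print(mtf_decode(r, "abcd"))           # abbbaacccca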

MTF: how good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1.
Put the alphabet S in front of the sequence and consider the cost of encoding (p^x_i is the position of the i-th occurrence of symbol x, n_x its number of occurrences):

O(|S| log |S|)  +  Σ_{x=1..|S|} Σ_{i=2..n_x} |γ( p^x_i - p^x_{i-1} )|

By Jensen's inequality this is

≤ O(|S| log |S|)  +  Σ_{x=1..|S|} n_x · [ 2·log(N/n_x) + 1 ]  =  O(|S| log |S|)  +  N · [ 2·H0(X) + 1 ]

Hence La[mtf] ≤ 2·H0(X) + O(1).

MTF: higher compression
To achieve higher compression we consider words (and separators) as the symbols to be encoded.
How to maintain the MTF-list efficiently:
 Search tree
   leaves contain the symbols, ordered as in the MTF-list
   nodes contain the size of their descending subtree
 Hash Table
   key is a symbol
   data is a pointer to the corresponding tree leaf

Each tree operation takes O(log |S|) time.
Total is O(n log |S|), where n = #symbols to be compressed.

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca ⇒ (a,1),(b,3),(a,2),(c,4),(a,1)
For binary strings it suffices to store the run lengths and the first bit.

There is a memory. Properties:
 Exploits spatial locality, and it is a dynamic code
 X = 1^n 2^n 3^n … n^n  →  Huff(X) = Θ(n² log n) > Rle(X) = Θ(n log n)
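A one-function Python sketch of RLE, reproducing the example above (the run lengths would then be γ-coded or fed to a statistical coder).

def rle_encode(s):
    runs, i = [], 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1
        runs.append((s[i], j - i))      # (symbol, run length)
        i = j
    return runs

print(rle_encode("abbbaacccca"))
# [('a', 1), ('b', 3), ('a', 2), ('c', 4), ('a', 1)]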

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive), using the cumulative function

f(i) = Σ_{j < i} p(j)

e.g. with p(a) = .2, p(b) = .5, p(c) = .3:   f(a) = .0, f(b) = .2, f(c) = .7,
i.e.  a ↔ [0,.2),  b ↔ [.2,.7),  c ↔ [.7,1.0)

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7)).

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
Start from [0,1);   b → [.2,.7);   a → [.2,.3);   c → [.27,.3)
The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c1 … cn with probabilities p[c], use the following:

l0 = 0,  s0 = 1,   li = li-1 + si-1 · f[ci],   si = si-1 · p[ci]

where f[c] is the cumulative probability up to symbol c (not included).
The final interval size is   sn = Π_{i=1..n} p[ci]

The interval for a message sequence will be called the
sequence interval
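A small Python sketch of these recurrences (floating-point, so values are approximate; a real coder uses the integer version described later), run on the bac example.

def sequence_interval(msg, p, f):
    """Left endpoint l and size s of the sequence interval of msg."""
    l, s = 0.0, 1.0
    for c in msg:
        l = l + s * f[c]          # l_i = l_{i-1} + s_{i-1} * f[c_i]
        s = s * p[c]              # s_i = s_{i-1} * p[c_i]
    return l, s

p = {"a": .2, "b": .5, "c": .3}
f = {"a": .0, "b": .2, "c": .7}
print(sequence_interval("bac", p, f))     # ~(0.27, 0.03) -> the interval [.27, .3)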

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
Start from [0,1):  .49 ∈ [.2,.7) → b;   within [.2,.7), .49 ∈ [.3,.55) → b;   within [.3,.55), .49 ∈ [.475,.55) → c.
The message is bbc.

Representing a real number
Binary fractional representation:
.75 = .11      1/3 = .010101…      11/16 = .1011

Algorithm (binary expansion of x ∈ [0,1)):
1. x = 2·x
2. if x < 1 output 0
3. else x = x - 1; output 1

So how about just using the shortest binary fractional representation within the sequence interval?
e.g.  [0,.33) → .01      [.33,.66) → .1      [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions:

number   min      max      interval
.11      .110…    .111…    [.75, 1.0)
.101     .1010…   .1011…   [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval is contained in the sequence interval (a dyadic number).
[Figure: sequence interval [.61,.79); the code interval of .101 = [.625,.75) is contained in it.]
Can use L + s/2 truncated to 1 + ⌈log2(1/s)⌉ bits

Bound on Arithmetic length
Note that -⌈log s⌉ + 1 ≤ log(2/s).

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most
1 + ⌈log(1/s)⌉ = 1 + ⌈log Π_i (1/pi)⌉ ≤ 2 + Σ_{i=1..n} log(1/pi) = 2 + Σ_{k=1..|S|} n·pk·log(1/pk) = 2 + n·H0  bits

In practice ≈ n·H0 + 0.02·n bits, because of rounding.

Integer Arithmetic Coding
Problem: operations on arbitrary-precision real numbers are expensive.
Key ideas of the integer version:
 Keep integers in range [0..R) where R = 2^k
 Use rounding to generate the integer interval
 Whenever the sequence interval falls into the top, bottom or middle half, expand the interval by a factor 2
Integer Arithmetic is an approximation.

Integer Arithmetic (scaling)
 If l ≥ R/2 (top half): output 1 followed by m 0s; m = 0; the message interval is expanded by 2
 If u < R/2 (bottom half): output 0 followed by m 1s; m = 0; the message interval is expanded by 2
 If l ≥ R/4 and u < 3R/4 (middle half): increment m; the message interval is expanded by 2
 In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine: the coder keeps the current interval (L,s); given the next symbol c and the distribution (p1,…,p|S|), the ATB maps (L,s) to the sub-interval (L',s') within [L, L+s).
Therefore, even the distribution can change over time.

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
At each step the model feeds p[ s | context ] to the ATB, where s is either the next character c or the escape symbol esc; the ATB updates (L,s) → (L',s').
Encoder and Decoder must know the protocol for selecting the same conditional probability distribution (PPM-variant).

PPM: Example Contexts   (String = ACCBACCACBA, next char B;  k = 2)

Empty context:   A = 4,  B = 2,  C = 5,  $ = 3
Context A:  C = 3, $ = 1      Context B:  A = 2, $ = 1      Context C:  A = 1, B = 2, C = 2, $ = 3
Context AC:  B = 1, C = 2, $ = 2      Context BA:  C = 1, $ = 1      Context CA:  C = 1, $ = 1
Context CB:  A = 2, $ = 1      Context CC:  A = 1, B = 1, $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6. Each triple is (distance of the longest match within W, its length, the next character).
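A greedy Python sketch of this parsing with a sliding window (brute-force match search, overlap with the cursor allowed); it reproduces the triples of the example above.

def lz77_parse(T, W=6):
    """Emit (distance, length, next-char) triples over a window of W chars."""
    out, i, n = [], 0, len(T)
    while i < n:
        best_d, best_len = 0, 0
        for d in range(1, min(i, W) + 1):       # candidate copy distances inside the window
            l = 0
            while i + l < n - 1 and T[i + l] == T[i - d + l]:
                l += 1                          # matches may overlap the cursor
            if l > best_len:
                best_d, best_len = d, l
        out.append((best_d, best_len, T[i + best_len]))
        i += best_len + 1                       # advance by len + 1
    return out

print(lz77_parse("aacaacabcabaaac"))
# [(0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), (3, 3, 'a'), (1, 2, 'c')]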

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb
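A Python sketch of the LZW encoder with the slides' toy alphabet mapping (a = 112, …); note the final 113 flushes the pending match, a step the slide stops before.

def lzw_encode(T, alphabet="abc", base=112):
    dic = {c: base + i for i, c in enumerate(alphabet)}   # a=112, b=113, c=114
    nxt = 256                                             # first free id for new entries
    out, S = [], ""
    for c in T:
        if S + c in dic:
            S += c                                        # extend the current match
        else:
            out.append(dic[S])                            # emit the id of the longest match S
            dic[S + c] = nxt; nxt += 1                    # add Sc to the dictionary
            S = c
    if S:
        out.append(dic[S])
    return out

print(lzw_encode("aabaacababacb"))
# [112, 112, 113, 256, 114, 257, 261, 114, 113]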

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
We are given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows  (1994):

F                 L
#  mississipp  i
i  #mississip  p
i  ppi#missis  s
i  ssippi#mis  s
i  ssissippi#  m
m  ississippi  #
p  i#mississi  p
p  pi#mississ  i
s  ippi#missi  s
s  issippi#mi  s
s  sippi#miss  i
s  sissippi#m  i

The last column L is the BWT of T (the row equal to T is the one starting with m).

A famous example

Much
longer...

A useful tool: the L → F mapping
How do we map L's chars onto F's chars? We need to distinguish equal chars in F…
Take two equal chars of L and rotate their rows rightward by one position: both rows now start with that char and they keep the same relative order. Hence the k-th occurrence of a char in L corresponds to the k-th occurrence of that char in F.

The BWT is invertible
[Figure: the sorted BWT matrix with its first column F = # i i i i m p p s s s s and last column L = i p s s m # p i s s i i, as above.]
Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
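A self-contained Python sketch of both directions (quadratic BWT construction, and the LF-based backward reconstruction corresponding to the pseudocode above); it assumes T ends with a unique '#'.

def bwt(T):
    """Last column L of the sorted rotations of T."""
    rows = sorted(T[i:] + T[:i] for i in range(len(T)))
    return "".join(row[-1] for row in rows)

def ibwt(L):
    n = len(L)
    # LF[i]: row of the sorted matrix starting with L[i]; the k-th occurrence of a
    # char in L maps to its k-th occurrence in F = sorted(L) (stable sort on ties)
    order = sorted(range(n), key=lambda i: (L[i], i))
    LF = [0] * n
    for f_pos, l_pos in enumerate(order):
        LF[l_pos] = f_pos
    r, out = L.index("#"), []     # the row whose last char is '#' is T itself
    for _ in range(n):
        out.append(L[r])          # L[r] precedes F[r] in T: emit T backwards
        r = LF[r]
    return "".join(reversed(out))

L = bwt("mississippi#")
print(L)            # ipssm#pissii
print(ibwt(L))      # mississippi#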

How to compute the BWT ?
SA = 12 11 8 5 2 1 10 9 7 4 6 3; the rows of the BWT matrix are the suffixes of T in SA order, and L lists their preceding characters: L = i p s s m # p i s s i i.

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
Input: T = mississippi#
SA = 12 11 8 5 2 1 10 9 7 4 6 3, i.e. the starting positions of the suffixes #, i#, ippi#, issippi#, ississippi#, mississippi#, pi#, ppi#, sippi#, sissippi#, ssippi#, ssissippi# in lexicographic order.

Elegant but inefficient. Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults
Many algorithms, now...

Compressing L seems promising...
Key observation: L is locally homogeneous → L is highly compressible.

Algorithm Bzip:
 Move-to-Front coding of L
 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33% compression ratio, but Bzip is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…
 Physical network graph:  V = routers,  E = communication links
 The "cosine" graph (undirected, weighted):  V = static web pages,  E = semantic distance between pages
 Query-Log graph (bipartite, weighted):  V = queries and URLs,  E = (q,u) if u is a result for q and has been clicked by some user who issued q
 Social graph (undirected, unweighted):  V = users,  E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution
Altavista crawl (1999) and WebBase crawl (2001): the indegree follows a power-law distribution

Pr[ in-degree(u) = k ]  ∝  1/k^α,   α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E):  V = URLs,  E = (u,v) if u has a hyperlink to v.
Isolated URLs are ignored (no IN, no OUT).

Three key properties:
 Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1
 Locality: usually most of the hyperlinks point to other URLs on the same host (about 80%).
 Similarity: pages close in lexicographic order tend to share many outgoing lists.

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)
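A Python sketch of just the plain gap encoding of a successor list (copy-lists, intervals and the final instantaneous codes are omitted); the successor list used below is hypothetical, and the folding of the possibly negative first gap follows the 2v / 2|v|-1 rule used in the interval slide that follows.

def encode_successors(x, succ):
    """Gap-encode the sorted successor list of node x:
    first gap s1 - x (folded to a non-negative int), then s_i - s_{i-1} - 1."""
    def fold(v):                 # 2v if v >= 0, 2|v| - 1 if v < 0
        return 2 * v if v >= 0 else 2 * (-v) - 1
    gaps = [fold(succ[0] - x)]
    gaps += [succ[i] - succ[i - 1] - 1 for i in range(1, len(succ))]
    return gaps

print(encode_successors(15, [13, 15, 16, 17, 18, 19, 23, 24, 203]))
# [3, 1, 0, 0, 0, 0, 3, 0, 178]  -- locality makes most gaps tiny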

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization
 Delta compression   [diff, zdelta, REBL,…]
   Compress file f deploying file f'
   Compress a group of files
   Speed up web access by sending the difference between the requested page and the ones available in cache
 File synchronization   [rsync, zsync]
   Client updates its old file fold with fnew available on a server
   Mirroring, Shared Crawling, Content Distribution Networks
 Set reconciliation
   Client updates its structured old file fold with fnew available on a server
   Update of contacts or appointments, intersect inverted lists in a P2P search engine

Z-delta compression   (one-to-one)

Problem: we have two files fknown and fnew, and the goal is to compute a file fd of minimum size such that fnew can be derived from fknown and fd.
 Assume that block moves and copies are allowed
 Find an optimal covering set of fnew based on fknown
 The LZ77-scheme provides an efficient, optimal solution: fknown is the "previously encoded text", and we compress the concatenation fknown·fnew emitting output only from fnew on
 zdelta is one of the best implementations

           Emacs size   Emacs time
uncompr    27Mb         ---
gzip       8Mb          35 secs
zdelta     1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: we wish to compress a group of files F — useful on a dynamic collection of web pages, back-ups, …
 Apply pairwise zdelta: find for each f ∈ F a good reference
 Reduction to the Min Branching problem on DAGs:
   Build a weighted graph GF: nodes = files, weights = zdelta-size
   Insert a dummy node connected to all files, whose edge weights are the gzip-coding sizes
   Compute the min branching = directed spanning tree of minimum total cost, covering G's nodes
[Figure: a small weighted graph with the dummy node and the minimum-cost branching.]

           space   time
uncompr    30Mb    ---
tgz        20%     linear
THIS       8%      quadratic

Improvement   (what about many-to-one compression of a group of files?)
Problem: constructing G is very costly, n² edge calculations (zdelta executions).
We wish to exploit some pruning approach:
 Collection analysis: cluster the files that appear similar and are thus good candidates for zdelta-compression; build a sparse weighted graph G'F containing only edges between those pairs of files
 Assign weights: estimate appropriate edge weights for G'F, thus saving zdelta executions. Nonetheless, strictly n² time

           space    time
uncompr    260Mb    ---
tgz        12%      2 mins
THIS       8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
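A toy Python sketch of the block-matching idea only, not of the real rsync protocol: MD5 stands in for both the rolling checksum and the strong hash, matches are tried only at block-aligned or byte-shifted offsets, and file names/contents are made up for illustration.

import hashlib

def block_hashes(f_old, B):
    """Client side: hash of every B-byte block of the old file."""
    return {hashlib.md5(f_old[i:i+B]).hexdigest(): i // B
            for i in range(0, len(f_old), B)}

def encode(f_new, hashes, B):
    """Server side: scan f_new; emit ('copy', block-id) on a hash hit, else a literal byte."""
    out, i = [], 0
    while i < len(f_new):
        h = hashlib.md5(f_new[i:i+B]).hexdigest()
        if len(f_new) - i >= B and h in hashes:
            out.append(("copy", hashes[h])); i += B
        else:
            out.append(("lit", f_new[i:i+1])); i += 1
    return out

f_old = b"the quick brown fox jumps over the lazy dog"
f_new = b"the quick brown cat jumps over the lazy dog"
enc = encode(f_new, block_hashes(f_old, 8), 8)
print(sum(1 for t, _ in enc if t == "copy"), "blocks copied")   # most bytes travel as references

In the real rsync a cheap 4-byte rolling hash is evaluated at every offset and the strong hash only confirms candidates; the block-size choice is exactly the problematic point noted above.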

Rsync: some experiments   (compressed size in KB, slightly outdated numbers)

           gcc     emacs
total      27288   27326
gzip       7563    8577
zdelta     227     1431
rsync      964     4452

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync
The server sends the hashes (unlike rsync, where the client does); the client checks them.
The server deploys the common fref to compress the new ftar (rsync compresses just ftar).

A multi-round protocol
 k blocks of n/k elements, log(n/k) levels.
 If the distance is k, then on each level at most k hashes do not find a match in the other file.
 The communication complexity is O(k · lg n · lg(n/k)) bits.

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T  iff  P is a prefix of the i-th suffix of T (i.e. T[i,N]).
[Figure: P aligned at position i of T, covering a prefix of T[i,N].]
Occurrences of P in T = all suffixes of T having P as a prefix.
Example: P = si, T = mississippi → occurrences at positions 4, 7.
SUF(T) = sorted set of suffixes of T.

Reduction

From substring search
To prefix search

The Suffix Tree
[Figure: the suffix tree of T# = mississippi#. Edges are labeled with substrings (e.g. i, s, si, ssi, p, pi#, ppi#, i#, #, mississippi#); each of the 12 leaves stores the starting position of its suffix.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

Storing SUF(T) explicitly takes Θ(N²) space; the Suffix Array stores only the starting positions:
SA = 12 11 8 5 2 1 10 9 7 4 6 3   for T = mississippi#
(suffixes: #, i#, ippi#, issippi#, ississippi#, mississippi#, pi#, ppi#, sippi#, sissippi#, ssippi#, ssissippi#)

• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 memory accesses per step.
Example: P = si, T = mississippi#. Compare P against the suffix in the middle of the current SA range: if P is larger, recurse on the right half; if P is smaller, recurse on the left half.

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char comparisons
 overall, O(p log2 N) time
(the log2 N factor can be made additive, + [Manber-Myers, '90]; N can be replaced by |S| [Cole et al, '06])
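A toy Python illustration of the data structure and of the two binary searches (quadratic construction, fine only for small texts; it assumes the text ends with the unique smallest character '#').

def suffix_array(T):
    # sort the suffix start positions (1-based) by the suffixes themselves
    return sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])

def sa_search(T, SA, P):
    """Positions of the occurrences of P in T, via two binary searches on SA."""
    def bound(past_equal):
        lo, hi = 0, len(SA)
        while lo < hi:
            mid = (lo + hi) // 2
            s = T[SA[mid] - 1:SA[mid] - 1 + len(P)]
            if s < P or (past_equal and s == P):
                lo = mid + 1
            else:
                hi = mid
        return lo
    return sorted(SA[bound(False):bound(True)])

T = "mississippi#"
SA = suffix_array(T)
print(SA)                      # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(sa_search(T, SA, "si"))  # [4, 7]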

Locating the occurrences
Once the SA range of P is known, the occurrences are listed by scanning it: for P = si on T = mississippi# the range is SA[9,10] = 7, 4, hence occ = 2.
The two range boundaries can be found by binary-searching si# and si$, where # < every char of S < $.

Suffix Array search: O(p + log2 N + occ) time
Suffix Trays: O(p + log2 |S| + occ)   [Cole et al., '06]
String B-tree   [Ferragina-Grossi, '95]
Self-adjusting Suffix Arrays   [Ciriani et al., '02]

Text mining
Lcp[1,N-1] = longest common prefix between suffixes adjacent in SA.

Lcp = 0 0 1 4 0 0 1 0 2 1 3     SA = 12 11 8 5 2 1 10 9 7 4 6 3     T = mississippi#
(e.g. Lcp[4] = 4 is the lcp of issippi… and ississippi…)

• How long is the common prefix between T[i,...] and T[j,...] ?
  It is the min of the subarray Lcp[h,k-1], where SA[h]=i and SA[k]=j.
• Is there a repeated substring of length ≥ L ?
  Search for some Lcp[i] ≥ L.
• Is there a substring of length ≥ L occurring ≥ C times ?
  Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.

Slide 22

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
[Figure: word-based tagged Huffman over T = "bzip or not bzip" — each codeword is a sequence of bytes carrying 7 bits of Huffman code plus 1 tag bit marking whether the byte starts a codeword; the tree leaves are the words bzip, or, not and the space]

CGrep and other ideas...
P = bzip = 1a 0b
[Figure: GREP run directly on the compressed text C(T) of T = "bzip or not bzip" — the compressed pattern is compared against the byte-aligned, tagged codewords, producing a yes/no answer at each codeword]
Speed ≈ Compression ratio

You find this under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary = {bzip, not, or, space},  S = "bzip or not bzip",  P = bzip = 1a 0b
[Figure: the compressed text C(S); the compressed pattern is compared against each byte-aligned codeword of C(S), producing yes/no answers]
Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
[Figure: a text T over {A,B,C,D} and a pattern P = A B to be searched in it]

Naïve solution:
 - For any position i of T, check if T[i,i+m-1] = P[1,m]
 - Complexity: O(nm) time

(Classical) Optimal solutions based on comparisons:
 - Knuth-Morris-Pratt
 - Boyer-Moore
 - Complexity: O(n + m) time

Semi-numerical pattern matching
We show methods in which arithmetic and bit operations replace comparisons.
We will survey two examples of such methods:
 - The Random Fingerprint method, due to Karp and Rabin
 - The Shift-And method, due to Baeza-Yates and Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in order to obtain:
 - An efficient randomized algorithm that makes an error with small probability.
 - A randomized algorithm that never errs, whose running time is efficient with high probability.
We will consider a binary alphabet (i.e., T ∈ {0,1}^n).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m:

H(s) = Σ_{i=1..m} 2^(m−i) · s[i]

Example: P = 0101
H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H(Tr) = 2·H(Tr-1) − 2^m · T[r−1] + T[r+m−1]

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H(T1) = H(1011) = 11
H(T2) = H(0110) = 2·11 − 2⁴·1 + 0 = 22 − 16 + 0 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
 - Compute H(P) and H(T1)
 - Run over T: compute H(Tr) from H(Tr-1) in constant time, and make the comparison H(P) = H(Tr).

Total running time O(n+m)?  NO! Why?
The problem is that when m is large, it is unreasonable to assume that each arithmetic operation can be done in O(1) time: the values of H() are m-bit long numbers and, in general, they are too BIG to fit in a machine word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally (Horner's rule, reducing mod 7 at each step):
1·2 + 0 = 2 (mod 7)
2·2 + 1 = 5 (mod 7)
5·2 + 1 = 4 (mod 7)
4·2 + 1 = 2 (mod 7)
2·2 + 1 = 5 (mod 7)   ⇒   Hq(P) = 5

We can still compute Hq(Tr) from Hq(Tr-1), since
2^m (mod q) = 2 · (2^(m−1) (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that:
 - q is small enough to keep computations efficient (i.e., the Hq() values fit in a machine word)
 - q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm
 - Choose a positive integer I
 - Pick a random prime q less than or equal to I, and compute P's fingerprint Hq(P).
 - For each position r in T, compute Hq(Tr) and test whether it equals Hq(P). If the numbers are equal, either
   - declare a probable match (randomized algorithm), or
   - check and declare a definite match (deterministic algorithm).

Running time, excluding verification: O(n+m).
The randomized algorithm is correct w.h.p.
The deterministic algorithm has expected running time O(n+m).

Proof on the board
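A minimal sketch of the whole algorithm on a binary text (assumptions: the small modulus q = 7 of the earlier example instead of a random prime, and explicit verification at every fingerprint hit, i.e. the deterministic variant).

def karp_rabin(T, P, q=7):
    n, m = len(T), len(P)
    pow_m = pow(2, m, q)                 # 2^m (mod q), kept small
    hp = ht = 0
    for i in range(m):                   # fingerprints of P and of T1
        hp = (2 * hp + P[i]) % q
        ht = (2 * ht + T[i]) % q
    occ = []
    for r in range(n - m + 1):
        if ht == hp and T[r:r+m] == P:   # verify, to rule out false matches
            occ.append(r + 1)            # 1-based positions, as in the slides
        if r + m < n:                    # rolling update: Hq(T_{r+1}) from Hq(T_r)
            ht = (2 * ht - pow_m * T[r] + T[r + m]) % q
    return occ

print(karp_rabin([1,0,1,1,0,1,0,1], [0,1,0,1]))   # -> [5], as in the example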

Problem 1: Solution
Dictionary = {bzip, not, or, space},  S = "bzip or not bzip",  P = bzip = 1a 0b
[Figure: the compressed pattern is matched against the byte-aligned codewords of C(S), giving a yes/no answer at each codeword]
Speed ≈ Compression ratio

The Shift-And method
Define M to be a binary m by n matrix such that:
M(i,j) = 1 iff the first i characters of P exactly match the i characters of T ending at character j,
i.e., M(i,j) = 1 iff P[1 … i] = T[j−i+1 … j]
Example: T = california and P = for

[Figure: the 3×10 matrix M for T = california, P = for — M(1,5) = 1 (f), M(2,6) = 1 (fo), M(3,7) = 1 (for); all other entries are 0]

How does M solve the exact match problem?

How to construct M
We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one.
Machines can perform bit and arithmetic operations between two words in constant time. Examples:
 - And(A,B) is the bit-wise AND between A and B.
 - BitShift(A) is the value derived by shifting A's bits down by one position and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to compute the j-th column of M from the (j-1)-th one.
We define the m-length binary vector U(x) for each character x in the alphabet: U(x) is set to 1 at the positions in P where character x appears.
Example:
P = abaac
U(a) = (1,0,1,1,0)    U(b) = (0,1,0,0,0)    U(c) = (0,0,0,0,1)

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M(j) = BitShift(M(j−1)) & U(T[j])


For i > 1, entry M(i,j) = 1 iff
 (1) the first i−1 characters of P match the i−1 characters of T ending at character j−1, i.e. M(i−1,j−1) = 1
 (2) P[i] = T[j], i.e. the i-th bit of U(T[j]) = 1
BitShift moves bit M(i−1,j−1) into the i-th position;
ANDing it with the i-th bit of U(T[j]) establishes whether both conditions hold.

An example (j = 1, 2, 3 and 9)
T = xabxabaaca, P = abaac
 - j=1: T[1]=x, U(x) = (0,0,0,0,0)  ⇒  M(1) = BitShift(M(0)) & U(x) = (0,0,0,0,0)
 - j=2: T[2]=a  ⇒  M(2) = BitShift(M(1)) & U(a) = (1,0,0,0,0) & (1,0,1,1,0) = (1,0,0,0,0)
 - j=3: T[3]=b  ⇒  M(3) = BitShift(M(2)) & U(b) = (1,1,0,0,0) & (0,1,0,0,0) = (0,1,0,0,0)
 - …
 - j=9: T[9]=c  ⇒  M(9) = BitShift(M(8)) & U(c) = (1,1,0,0,1) & (0,0,0,0,1) = (0,0,0,0,1): the 5th bit is set, so an occurrence of P ends at position 9.

Shift-And method: Complexity
 - If m ≤ w, any column and any vector U() fit in a memory word  ⇒  each step requires O(1) time.
 - If m > w, any column and vector U() can be divided into m/w memory words  ⇒  each step requires O(m/w) time.
 - Overall O(n(1 + m/w) + m) time.
Thus it is very fast when the pattern length is close to the word size — very often the case in practice. Recall that w = 64 bits in modern architectures.
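A minimal Python sketch of Shift-And: an integer plays the role of the machine word holding the current column of M, with bit i-1 standing for row i; a match is reported when the m-th bit becomes 1.

def shift_and(T, P):
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)      # U(c): bit i set iff P[i+1] == c
    D = 0                                   # the column M(j), initially all zeros
    occ = []
    for j, c in enumerate(T):
        D = ((D << 1) | 1) & U.get(c, 0)    # M(j) = BitShift(M(j-1)) & U(T[j])
        if D & (1 << (m - 1)):              # row m is set: an occurrence ends at j+1
            occ.append(j - m + 2)           # 1-based starting position
    return occ

print(shift_and("xabxabaaca", "abaac"))     # -> [5], as in the example above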

Some simple extensions
We want to allow the pattern to contain special symbols, like the character class [a-f].
Example: P = [a-b]baac
U(a) = (1,0,1,1,0)    U(b) = (1,1,0,0,0)    U(c) = (0,0,0,0,1)
What about ‘?’, ‘[^…]’ (not)?

Problem 1: Another solution
Dictionary = {bzip, not, or, space},  S = "bzip or not bzip",  P = bzip = 1a 0b
[Figure: Shift-And run over the compressed text C(S), matching the compressed pattern against the byte-aligned codewords — yes/no at each codeword]
Speed ≈ Compression ratio

Problem 2
Dictionary = {bzip, not, or, space},  S = "bzip or not bzip"
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring. Example: P = o.
[Figure: both dictionary terms containing "o" are searched for in C(S): not = 1g 0g 0a and or = 1g 0a 0b]
Speed ≈ Compression ratio? No! Why? A scan of C(S) is needed for each term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: a text T and a set of patterns P1, P2, … to be searched simultaneously]

Naïve solution:
 - Use an (optimal) exact matching algorithm, searching for each pattern of P separately
 - Complexity: O(nl + m) time — not good with many patterns
Optimal solution due to Aho and Corasick:
 - Complexity: O(n + l + m) time

A simple extension of Shift-And
 - S is the concatenation of the patterns in P
 - R is a bitmap of length m: R[i] = 1 iff S[i] is the first symbol of a pattern
Use a variant of the Shift-And method searching for S:
 - For any symbol c, U'(c) = U(c) AND R, so that U'(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
 - For any step j:
   - compute M(j)
   - then OR it with U'(T[j]). Why? This sets to 1 the first bit of each pattern that starts with T[j]
   - check whether there are occurrences ending in j. How?

Problem 3
Dictionary = {bzip, not, or, space},  S = "bzip or not bzip"
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring, allowing at most k mismatches. Example: P = bot, k = 2.
[Figure: the compressed text C(S) and the dictionary terms to be matched approximately]

Agrep: Shift-And method with errors
We extend the Shift-And method to find inexact occurrences of a pattern in a text.
Example: T = aatatccacaa, P = atcgaa.
P appears in T with 2 mismatches starting at position 4; it also occurs with 4 mismatches starting at position 2.

Agrep
Our current goal: given k, find all the occurrences of P in T with up to k mismatches.
We define the matrix M^l to be an m by n binary matrix such that:
M^l(i,j) = 1 iff there are no more than l mismatches between the first i characters of P and the i characters of T ending at position j.
What is M^0? How does M^k solve the k-mismatch problem?

Computing M^k
 - We compute M^l for all l = 0, …, k.
 - For each j, compute M(j), M^1(j), …, M^k(j).
 - For all l, initialize M^l(0) to the zero vector.
 - In order to compute M^l(j), we observe that entry M^l(i,j) = 1 iff one of the two following cases holds:

Computing M^l: case 1
The first i−1 characters of P match a substring of T ending at j−1 with at most l mismatches, and the next pair of characters in P and T are equal:
BitShift(M^l(j−1)) & U(T[j])

Computing M^l: case 2
The first i−1 characters of P match a substring of T ending at j−1 with at most l−1 mismatches (and P[i] may mismatch T[j]):
BitShift(M^(l−1)(j−1))

Computing M^l
We compute M^l for all l = 0, …, k; for each j we compute M(j), M^1(j), …, M^k(j), initializing every M^l(0) to the zero vector. Combining the two cases gives the recurrence

M^l(j) = [BitShift(M^l(j−1)) & U(T[j])]  OR  BitShift(M^(l−1)(j−1))
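A minimal extension of the previous Shift-And sketch to k mismatches, keeping one word per value of l and applying exactly the recurrence above.

def shift_and_k_mismatches(T, P, k):
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    D = [0] * (k + 1)                         # D[l] holds the column M^l(j)
    occ = []
    for j, c in enumerate(T):
        prev = D[:]                           # the columns M^l(j-1)
        D[0] = ((prev[0] << 1) | 1) & U.get(c, 0)
        for l in range(1, k + 1):
            D[l] = (((prev[l] << 1) | 1) & U.get(c, 0)) | ((prev[l - 1] << 1) | 1)
        if D[k] & (1 << (m - 1)):             # P matches with <= k mismatches ending at j+1
            occ.append(j - m + 2)             # 1-based starting position
    return occ

print(shift_and_k_mismatches("aatatccacaa", "atcgaa", 2))   # -> [4], the Agrep example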

Example
T = xabxabaaca, P = abaad
[Figure: the 5×10 matrices M^0 and M^1 for this example — M^1(i,j) = 1 when P[1..i] matches T[j−i+1..j] with at most one mismatch]

How much do we pay?
 - The running time is O(kn(1 + m/w)).
 - Again, the method is practically efficient for small m.
 - Only O(k) columns of M are needed at any given time; hence the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary = {bzip, not, or, space},  S = "bzip or not bzip",  P = bot, k = 2
[Figure: the approximate Shift-And is run over the compressed dictionary terms; e.g. not = 1g 0g 0a matches P = bot within 2 mismatches]

Agrep: more sophisticated operations
The Shift-And method can solve other ops as well.
The edit distance between two strings p and s is d(p,s) = minimum number of operations needed to transform p into s via three ops:
 - Insertion: insert a symbol in p
 - Deletion: delete a symbol from p
 - Substitution: change a symbol in p into a different one
Example: d(ananas, banane) = 3

Search by regular expressions
Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts on some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding
γ(x) = 0^(Length−1) followed by x in binary, where x > 0 and Length = ⌊log2 x⌋ + 1
e.g., 9 is represented as <000, 1001>.
The γ-code for x takes 2⌊log2 x⌋ + 1 bits (i.e., a factor of 2 from optimal).
Optimal for Pr(x) = 1/(2x²), and i.i.d. integers.
It is a prefix-free encoding…


Exercise: given the following sequence of γ-coded integers, reconstruct the original sequence:
0001000001100110000011101100111
Answer: 8, 6, 3, 59, 7
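A minimal encode/decode sketch for the γ-code; decoding the bit string above indeed returns 8, 6, 3, 59, 7.

def gamma_encode(x: int) -> str:
    assert x > 0
    b = bin(x)[2:]                     # binary representation: Length = len(b)
    return '0' * (len(b) - 1) + b      # Length-1 zeros, then x in binary

def gamma_decode(bits: str):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == '0':          # count the leading zeros (= Length - 1)
            z += 1
            i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out

print(gamma_encode(9))                                     # -> '0001001'
print(gamma_decode('0001000001100110000011101100111'))     # -> [8, 6, 3, 59, 7]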

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).
Recall that |γ(i)| ≤ 2·log i + 1.
How good is this approach with respect to Huffman?
Compression ratio ≤ 2·H0(S) + 1
Key fact: 1 ≥ Σ_{i=1..x} pi ≥ x·px   ⇒   x ≤ 1/px

How good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1.
The cost of the encoding is (recall that i ≤ 1/pi):

Σ_{i=1..|S|} pi · |γ(i)|  ≤  Σ_{i=1..|S|} pi · [2·log(1/pi) + 1]  =  2·H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding
Byte-aligned and tagged Huffman:
 - 128-ary Huffman tree
 - The first bit of the first byte is tagged
 - Configurations on 7 bits: just those of Huffman

End-tagged dense code (ETDC):
 - The rank r is mapped to the r-th binary sequence on 7·k bits
 - The first bit of the last byte is tagged

A better encoding — surprising changes:
 - It is a prefix code
 - Better compression: it uses all the 7-bit configurations

(s,c)-dense codes
The distribution of words is skewed: 1/i^q, where 1 < q < 2.
A new concept: Continuers vs Stoppers. Previously we used s = c = 128.
The main idea is:
 - s + c = 256 (we are playing with 8 bits)
 - thus s items are encoded with 1 byte
 - s·c with 2 bytes, s·c² with 3 bytes, ...

An example
 - 5000 distinct words
 - ETDC encodes 128 + 128² = 16512 words within 2 bytes
 - the (230,26)-dense code encodes 230 + 230·26 = 6210 words within 2 bytes, hence more words on 1 byte — and thus better if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, assuming c = 256 − s:
 - brute-force approach, or
 - binary search: on real distributions there seems to be one unique minimum
   (Ks = max codeword length, Fsk = cumulative probability of the symbols whose codeword length is ≤ k)

Experiments: (s,c)-DC is quite interesting… search is 6% faster than byte-aligned Huffword.

Streaming compression
Still, you need to determine and sort all the terms…. Can we do everything in one pass?
 - Move-to-Front (MTF):
   - as a freq-sorting approximator
   - as a caching strategy
   - as a compressor
 - Run-Length-Encoding (RLE):
   - FAX compression

Move-to-Front Coding
Transforms a char sequence into an integer sequence, which can then be var-length coded:
 - start with the list of symbols L = [a,b,c,d,…]
 - for each input symbol s: 1) output the position of s in L; 2) move s to the front of L
There is a memory. Properties:
 - exploits temporal locality, and it is dynamic
 - X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n² log n), MTF = O(n log n) + n²
Not much worse than Huffman… but it may be far better (see the sketch below).
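A minimal MTF encoder/decoder sketch (positions are 0-based here; the slides output positions counted from the front of the list).

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def mtf_encode(seq, alphabet=ALPHABET):
    lst, out = list(alphabet), []
    for s in seq:
        i = lst.index(s)               # current position of s in the list
        out.append(i)
        lst.insert(0, lst.pop(i))      # move s to the front
    return out

def mtf_decode(codes, alphabet=ALPHABET):
    lst, out = list(alphabet), []
    for i in codes:
        s = lst.pop(i)
        out.append(s)
        lst.insert(0, s)
    return ''.join(out)

codes = mtf_encode("mississippi")
print(codes)                           # repeated symbols get small integers (temporal locality)
print(mtf_decode(codes))               # -> 'mississippi'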

MTF: how good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1.
Put S at the front of the list and consider the cost of encoding:

O(|S| log |S|) + Σ_{x=1..|S|} Σ_{i=2..n_x} |γ(p_x^i − p_x^{i−1})|

By Jensen's inequality this is

≤ O(|S| log |S|) + Σ_{x=1..|S|} n_x · [2·log(N/n_x) + 1]  =  O(|S| log |S|) + N·[2·H0(X) + 1]

Hence La[mtf] ≤ 2·H0(X) + O(1).

MTF: higher compression
To achieve higher compression we consider words (and separators) as the symbols to be encoded.
How to maintain the MTF-list efficiently:
 - Search tree:
   - leaves contain the symbols, ordered as in the MTF-list
   - nodes contain the size of their descending subtree
 - Hash table:
   - the key is a symbol
   - the data is a pointer to the corresponding tree leaf
Each tree operation takes O(log |S|) time; the total is O(n log |S|), where n = #symbols to be compressed.

Run-Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca ⇒ (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings ⇒ just the run lengths and one initial bit.
Properties (there is a memory):
 - exploits spatial locality, and it is a dynamic code
 - X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = Θ(n² log n) > Rle(X) = Θ(n log n)
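And a matching RLE sketch reproducing the example above.

def rle_encode(s):
    out, i = [], 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1
        out.append((s[i], j - i))      # (symbol, run length)
        i = j
    return out

def rle_decode(pairs):
    return ''.join(c * k for c, k in pairs)

pairs = rle_encode("abbbaacccca")
print(pairs)                           # -> [('a',1), ('b',3), ('a',2), ('c',4), ('a',1)]
print(rle_decode(pairs))               # -> 'abbbaacccca'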

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive); e.g. with p(a) = .2, p(b) = .5, p(c) = .3:

f(i) = Σ_{j=1..i−1} p(j),   so   f(a) = .0, f(b) = .2, f(c) = .7

[Figure: the unit interval split into a = [0,.2), b = [.2,.7), c = [.7,1.0)]
The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7)).

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
[Figure: start from [0,1); symbol b narrows the interval to [.2,.7); then a narrows it to [.2,.3); then c narrows it to [.27,.3)]
The final sequence interval is [.27,.3).

Arithmetic Coding
To code a sequence of symbols c1 c2 … cn with probabilities p[c], use the following:

l0 = 0,   s0 = 1
li = l(i−1) + s(i−1) · f[ci]
si = s(i−1) · p[ci]

f[c] is the cumulative probability up to symbol c (not included). The final interval size is

sn = Π_{i=1..n} p[ci]

The interval for a message sequence will be called the sequence interval.

Uniquely defining an interval
Important property: the intervals for distinct messages of length n never overlap.
Therefore, specifying any number in the final interval uniquely determines the message.
Decoding is similar to encoding, but at each step we need to determine which symbol interval the value falls in, and then reduce the interval accordingly (a floating-point sketch of both directions follows).
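A minimal floating-point sketch of both directions (it only narrows intervals and rescales; no bit output, no integer scaling — see the integer version later). The probabilities are those of the running example.

p = {'a': .2, 'b': .5, 'c': .3}
f = {'a': .0, 'b': .2, 'c': .7}        # cumulative probability up to the symbol

def sequence_interval(msg):
    l, s = 0.0, 1.0
    for c in msg:                       # l_i = l_{i-1} + s_{i-1}*f[c],  s_i = s_{i-1}*p[c]
        l, s = l + s * f[c], s * p[c]
    return l, s

def decode(x, length):
    out = []
    for _ in range(length):
        c = next(c for c in p if f[c] <= x < f[c] + p[c])   # symbol interval containing x
        out.append(c)
        x = (x - f[c]) / p[c]            # rescale x into that symbol interval
    return ''.join(out)

l, s = sequence_interval("bac")
print(l, l + s)                          # ~ 0.27 and 0.3, up to floating-point rounding
print(decode(0.49, 3))                   # -> 'bbc', as in the decoding example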

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:
[Figure: .49 lies in b's interval [.2,.7); within it, .49 lies in b's sub-interval [.3,.55); within that, .49 lies in c's sub-interval [.475,.55)]
The message is bbc.

Representing a real number
Binary fractional representation:
.75 = .11      1/3 = .010101…      11/16 = .1011

Algorithm (to emit the bits of x ∈ [0,1)):
1. x = 2·x
2. if x < 1, output 0
3. else x = x − 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
e.g. [0,.33) → .01,    [.33,.66) → .1,    [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions:

code     min       max       interval
.11      .110…     .111…     [.75, 1.0)
.101     .1010…    .1011…    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval is contained in the sequence interval (a dyadic number).
[Figure: sequence interval [.61,.79) containing the code interval [.625,.75) of .101]
Can use L + s/2 truncated to 1 + ⌈log (1/s)⌉ bits.
Bound on Arithmetic length: note that −⌈log s⌉ + 1 = ⌈log (2/s)⌉.

Bound on Length
Theorem: for a text of length n, the Arithmetic encoder generates at most
1 + ⌈log (1/s)⌉ = 1 + ⌈log Π_{i=1..n} (1/pi)⌉
≤ 2 + Σ_{i=1..n} log (1/pi)
= 2 + Σ_{k=1..|S|} n·pk · log (1/pk)
= 2 + n·H0  bits.
In practice it takes about nH0 + 0.02·n bits, because of rounding.

Integer Arithmetic Coding
The problem is that operations on arbitrary-precision real numbers are expensive.
Key ideas of the integer version:
 - keep integers in the range [0..R), where R = 2^k
 - use rounding to generate the integer interval
 - whenever the sequence interval falls into the top, bottom or middle half, expand the interval by a factor of 2
Integer Arithmetic is an approximation.

Integer Arithmetic (scaling)
 - If l ≥ R/2 (top half): output 1 followed by m 0s; m = 0; the message interval is expanded by 2.
 - If u < R/2 (bottom half): output 0 followed by m 1s; m = 0; the message interval is expanded by 2.
 - If l ≥ R/4 and u < 3R/4 (middle half): increment m; the message interval is expanded by 2.
 - In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine: the ATB receives the current interval (L,s) and a symbol c with its distribution (p1,…,p|S|), and returns the new interval (L',s').
[Figure: ATB state machine, (L,s) → (L',s')]
Therefore, even the distribution can change over time.

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
[Figure: the ATB is driven by p[s | context], where s = c or esc; (L,s) → (L',s')]
Encoder and Decoder must know the protocol for selecting the same conditional probability distribution (the PPM variant).
PPM: Example Contexts (k = 2)
String = ACCBACCACBA, next symbol = B

Context    Counts
(empty)    A = 4, B = 2, C = 5, $ = 3
A          C = 3, $ = 1
B          A = 2, $ = 1
C          A = 1, B = 2, C = 2, $ = 3
AC         B = 1, C = 2, $ = 2
BA         C = 1, $ = 1
CA         C = 1, $ = 1
CB         A = 2, $ = 1
CC         A = 1, B = 1, $ = 2
You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings. The differences are:
 - how the dictionary is stored
 - how it is extended
 - how it is indexed
 - how elements are removed
No explicit frequency estimation.
LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary: all the substrings starting before the cursor. Example output at the cursor: <2,3,c>.

Algorithm's step:
 - output <d, len, c> where
   - d = distance of the copied string from the current position
   - len = length of the longest match
   - c = next char in the text beyond the longest match
 - advance by len + 1
A buffer “window” of fixed length moves along the text.

Example: LZ77 with window (window size = 6)
T = aacaacabcabaaac
Output: (0,0,a) (1,1,c) (3,4,b) (3,3,a) (1,2,c)
Each triple is <distance, length of the longest match within the window W, next character beyond the match>.

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if len > d? (the copy overlaps the text being written)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
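The same idea as a runnable Python sketch: decoding a stream of <distance, length, char> triples; copying one character at a time makes the overlapping case work with no special handling.

def lz77_decode(triples):
    out = []
    for d, length, c in triples:
        for _ in range(length):
            out.append(out[-d])          # copy from d positions back (may overlap the output)
        out.append(c)
    return ''.join(out)

print(lz77_decode([(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')]))
# -> 'aacaacabcabaaac', the text of the windowed example above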

LZ77 Optimizations used by gzip
LZSS: output one of the following formats: (0, position, length) or (1, char).
Typically the second format is used if length < 3.
Special greedy: possibly use a shorter match so that the next match is better.
A hash table speeds up the searches on triplets.
Triples are coded with Huffman's code.

LZ78
Dictionary: substrings stored in a trie (each has an id).
Coding loop:
 - find the longest match S in the dictionary
 - output its id and the next character c after the match in the input string
 - add the substring Sc to the dictionary
Decoding: builds the same dictionary and looks at ids.

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don't send the extra character c, but still add Sc to the dictionary.
Dictionary: initialized with 256 ASCII entries (e.g. a = 112).
The decoder is one step behind the coder, since it does not know c.
There is an issue for strings of the form SSc where S[0] = c, and these are handled specially!!!
(A sketch of the encoder follows the decoding example below.)

LZW: Encoding Example
T = a a b a a c a b a b a c b
Output: 112, 112, 113, 256, 114, 257, 261, 114, …
Dictionary additions: 256=aa, 257=ab, 258=ba, 259=aac, 260=ca, 261=aba, 262=abac, 263=cb

LZW: Decoding Example
Input: 112, 112, 113, 256, 114, 257, 261, 114  →  a a b a a c a b a b a …
Dictionary additions (built one step later than the coder): 256=aa, 257=ab, 258=ba, 259=aac, 260=ca, 261=aba, …
Note the tricky step: when codeword 261 arrives it is not yet in the decoder's dictionary (it is being defined by this very step); it is resolved as the previous match plus its own first character, i.e. aba.
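A minimal LZW encoder sketch reproducing the encoding example above; the three starting codes mirror the slide's "a = 112" convention rather than a real 256-entry ASCII table.

def lzw_encode(text):
    dictionary = {'a': 112, 'b': 113, 'c': 114}   # illustrative initial codes only
    next_code = 256
    out, S = [], ''
    for c in text:
        if S + c in dictionary:
            S = S + c                              # keep extending the current match
        else:
            out.append(dictionary[S])              # emit the longest match found
            dictionary[S + c] = next_code          # add Sc to the dictionary
            next_code += 1
            S = c
    if S:
        out.append(dictionary[S])
    return out

print(lzw_encode("aabaacababacb"))
# -> [112, 112, 113, 256, 114, 257, 261, 114, 113]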

LZ78 and LZW issues
How do we keep the dictionary small?
 - Throw the dictionary away when it reaches a certain size (used in GIF)
 - Throw the dictionary away when it is no longer effective at compressing (used in Unix compress)
 - Throw the least-recently-used (LRU) entry away when it reaches a certain size (used in BTLZ, the British Telecom standard)
You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform (1994)
Let us be given a text T = mississippi#. Form all of its cyclic rotations:
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows lexicographically; F is the first column, L the last:
F               L
# mississipp   i
i #mississip   p
i ppi#missis   s
i ssippi#mis   s
i ssissippi#   m
m ississippi   #
p i#mississi   p
p pi#mississ   i
s ippi#missi   s
s issippi#mi   s
s sippi#miss   i
s sissippi#m   i

L is the BWT of T.
[Figure: a famous example on a much longer text]

A useful tool: the L → F mapping
[Figure: the sorted BWT matrix again, with its first column F and last column L]
How do we map L's chars onto F's chars? ... We need to distinguish equal chars in F...
Take two equal chars of L: rotating their rows rightward by one position shows that they appear in F in the same relative order !!

The BWT is invertible
[Figure: the sorted BWT matrix with columns F and L for T = mississippi#]
Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
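A minimal sketch of both directions, assuming (as in the example) that T ends with a unique sentinel '#' smaller than every other character; the forward transform is the "elegant but inefficient" rotation sort, the inversion follows the two key properties above.

def bwt(T):
    rotations = sorted(T[i:] + T[:i] for i in range(len(T)))
    return ''.join(row[-1] for row in rotations)

def inverse_bwt(L):
    n = len(L)
    # LF[i] = position in F = sorted(L) of L[i]; equal chars keep their relative order
    order = sorted(range(n), key=lambda i: (L[i], i))
    LF = [0] * n
    for f_pos, l_pos in enumerate(order):
        LF[l_pos] = f_pos
    # row 0 is the rotation starting with '#': walk backward, since L[r] precedes F[r] in T
    out, r = ['#'], 0
    for _ in range(n - 1):
        out.append(L[r])
        r = LF[r]
    return ''.join(reversed(out))

L = bwt("mississippi#")
print(L)                   # -> 'ipssm#pissii', the L column above
print(inverse_bwt(L))      # -> 'mississippi#'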

How to compute the BWT ?
[Figure: the BWT matrix side by side with the suffix array SA = 12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3 and the column L = i p s s m # p i s s i i]
We said that L[i] precedes F[i] in T; for example L[3] = T[7].
Given SA and T, we have L[i] = T[SA[i] − 1].

How to construct SA from T ?
Input: T = mississippi#
SA    suffix
12    #
11    i#
8     ippi#
5     issippi#
2     ississippi#
1     mississippi#
10    pi#
9     ppi#
7     sippi#
4     sissippi#
6     ssippi#
3     ssissippi#

Elegant but inefficient. Obvious inefficiencies:
 • Θ(n² log n) time in the worst case
 • Θ(n log n) cache misses or I/O faults
Many algorithms, now...

Compressing L seems promising...
Key observation: L is locally homogeneous ⇒ L is highly compressible.
Algorithm Bzip:
 - Move-to-Front coding of L
 - Run-Length coding
 - Statistical coder
Bzip vs. Gzip: 20% vs. 33% compression ratio, but Bzip is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus γ(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web's Characteristics
Size:
 - 1 trillion pages available (Google, 7/2008)
 - 5-40K per page ⇒ hundreds of terabytes
 - size grows every day!!
Change:
 - 8% new pages, 25% new links change weekly
 - life time of about 10 days

The Bow Tie

Some definitions
 - Weakly connected components (WCC): a set of nodes such that from any node one can reach any other node via an undirected path.
 - Strongly connected components (SCC): a set of nodes such that from any node one can reach any other node via a directed path.

Observing the Web graph
 - We do not know which percentage of it we know
 - The only way to discover the graph structure of the web as hypertext is via large-scale crawls
 - Warning: the picture might be distorted by
   - size limitations of the crawl
   - crawling rules
   - perturbations of the "natural" process of birth and death of nodes and links

Why is it interesting?
 - It is the largest artifact ever conceived by humans
 - Exploit the structure of the Web for
   - crawl strategies
   - search
   - spam detection
   - discovering communities on the web
   - classification/organization
 - Predict the evolution of the Web
   - sociological understanding

Many other large graphs…
 - Physical network graph: V = routers, E = communication links
 - The “cosine” graph (undirected, weighted): V = static web pages, E = semantic distance between pages
 - Query-Log graph (bipartite, weighted): V = queries and URLs, E = (q,u) if u is a result for q and has been clicked by some user who issued q
 - Social graph (undirected, unweighted): V = users, E = (x,y) if x knows y (facebook, address book, email, ...)

Definition
Directed graph G = (V,E):
 - V = URLs, E = (u,v) if u has a hyperlink to v
 - isolated URLs are ignored (no IN & no OUT)
Three key properties:
 - Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution
Altavista crawl (1999) and WebBase crawl (2001): the indegree follows a power-law distribution

Pr[in-degree(u) = k]  ∝  1/k^α,   α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E):
 - V = URLs, E = (u,v) if u has a hyperlink to v
 - isolated URLs are ignored (no IN, no OUT)
Three key properties:
 - Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1
 - Locality: usually most of the hyperlinks point to other URLs on the same host (about 80%).
 - Similarity: pages close to each other in lexicographic order tend to share many outgoing lists.

A Picture of the Web Graph
[Figure: adjacency matrix of a crawl of 21 million pages and 150 million links, with nodes in URL order — locality appears as a dense diagonal]

URL-sorting (e.g. pages of Berkeley and Stanford cluster together)  ⇒  URL compression + delta encoding

The library WebGraph
Gap encoding (locality): instead of the uncompressed adjacency list, store the successor list
S(x) = {s1 − x, s2 − s1 − 1, ..., sk − s(k−1) − 1}
with a special mapping for the possibly negative first entry (a sketch of this idea follows below).

Copy-lists (similarity): encode the adjacency list of y with respect to a reference list x chosen among the previous W lists; each bit of y's copy-list tells whether the corresponding successor of the reference x is also a successor of y. The reference index is chosen in [0,W] so as to give the best compression; reference chains may be limited.

Copy-blocks = RLE(Copy-list): the copy-list is run-length encoded:
 - the first copy block is 0 if the copy list starts with 0;
 - the last block is omitted (we know the length…);
 - the length is decremented by one for all blocks.

This is a Java and C++ lib (≈3 bits/edge).
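A small sketch of the locality idea only — gaps of a sorted successor list, each γ-coded, with the possibly negative first gap folded to a non-negative integer. This illustrates the principle, not WebGraph's actual on-disk format; the successor list in the example is hypothetical.

def gamma(x):                          # gamma code of an integer x >= 1
    b = bin(x)[2:]
    return '0' * (len(b) - 1) + b

def encode_successors(x, succ):
    succ = sorted(succ)
    first = succ[0] - x                # may be negative
    folded = 2 * first if first >= 0 else 2 * (-first) - 1   # fold the sign
    gaps = [folded] + [succ[i] - succ[i - 1] - 1 for i in range(1, len(succ))]
    return ''.join(gamma(g + 1) for g in gaps)               # +1 because gamma needs x >= 1

# hypothetical successor list for a node 15: small gaps give very short codes
print(encode_successors(15, [13, 15, 16, 17, 18, 19, 23, 24, 203]))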

Extra-nodes: Compressing Intervals
Exploit consecutivity among the extra-nodes (those not copied from the reference):
 - Intervals: use their left extreme and length
 - Interval length: decremented by Lmin = 2
 - Residuals: differences between residuals, or with respect to the source node
Examples of the encoding:
 - 0 = (15-15)*2 (positive)
 - 2 = (23-19)-2 (jump >= 2)
 - 600 = (316-16)*2
 - 3 = |13-15|*2-1 (negative)
 - 3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
[Figure: sender and receiver connected by a network link; the receiver already holds some knowledge about the data]
 - network links are getting faster and faster, but
 - many clients are still connected by fairly slow links (mobile?)
 - people wish to send more and more data
How can we make this transparent to the user?

Two standard techniques
 - caching: “avoid sending the same object again”
   - done on the basis of objects
   - only works if objects are completely unchanged
   - what about objects that are slightly changed?
 - compression: “remove redundancy in transmitted data”
   - avoid repeated substrings in the data
   - can be extended to the history of past transmissions (at some overhead)
   - what if the sender has never seen the data at the receiver ?

Types of Techniques
 - Common knowledge between sender & receiver
   - unstructured file: delta compression
 - “Partial” knowledge
   - unstructured files: file synchronization
   - record-based data: set reconciliation

Formalization
 - Delta compression [diff, zdelta, REBL,…]
   - compress file f deploying file f'
   - compress a group of files
   - speed up web access by sending differences between the requested page and the ones available in cache
 - File synchronization [rsync, zsync]
   - the client updates its old file f_old with f_new available on a server
   - mirroring, shared crawling, content distribution networks
 - Set reconciliation
   - the client updates a structured old file f_old with f_new available on a server
   - update of contacts or appointments, intersecting inverted lists in a P2P search engine

Z-delta compression (one-to-one)
Problem: we have two files f_known and f_new, and the goal is to compute a file f_d of minimum size such that f_new can be derived from f_known and f_d.
 - Assume that block moves and copies are allowed
 - Find an optimal covering set of f_new based on f_known
 - The LZ77 scheme provides an efficient, optimal solution: f_known is the “previously encoded text”; compress the concatenation f_known·f_new starting from f_new
 - zdelta is one of the best implementations

              Emacs size   Emacs time
uncompressed  27Mb         ---
gzip          8Mb          35 secs
zdelta        1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: a pair of proxies located on the two sides of the slow link use a proprietary protocol to increase performance over that link.
[Figure: Client ↔ client-side proxy — (slow link, delta-encoding) — server-side proxy ↔ (fast link) ↔ web; both proxies hold the reference page]
Use zdelta to reduce traffic:
 - the old version is available at both proxies
 - restricted to pages already visited (30% hits), URL-prefix match
 - small cache

Cluster-based delta compression
Problem: we wish to compress a group of files F
 - useful on a dynamic collection of web pages, back-ups, …
 - apply pairwise zdelta: find for each f ∈ F a good reference
 - reduction to the Min Branching problem on DAGs:
   - build a weighted graph G_F: nodes = files, weights = zdelta sizes
   - insert a dummy node connected to all files, whose edge weights are the gzip sizes
   - compute the min branching = directed spanning tree of minimum total cost covering G's nodes
[Figure: example graph with a dummy root (node 0) and edge weights such as 620, 2000, 220, 123, 20]

              space   time
uncompressed  30Mb    ---
tgz           20%     linear
THIS          8%      quadratic

Improvement (what about many-to-one compression of a group of files?)
Problem: constructing G is very costly — n² edge calculations (zdelta executions). We wish to exploit some pruning approach:
 - Collection analysis: cluster the files that appear similar and are thus good candidates for zdelta compression; build a sparse weighted graph G'_F containing only the edges between those pairs of files
 - Assign weights: estimate appropriate edge weights for G'_F, thus saving zdelta executions. Nonetheless, still strictly n² time.

              space   time
uncompressed  260Mb   ---
tgz           12%     2 mins
THIS          8%      16 mins

Algoritmi per IR

File Synchronization

File synch: the problem
[Figure: the client sends a request; the server holds f_new, the client holds f_old; the server answers with an update]
 - the client wants to update an out-dated file
 - the server has the new file but does not know the old file
 - update without sending the entire f_new (using similarity)
 - rsync: file synch tool, distributed with Linux
Delta compression is a sort of local synch, since the server has both copies of the files.
The rsync algorithm
[Figure: the client sends the hashes of the blocks of f_old; the server replies with the encoded file]
 - simple, widely used, single roundtrip
 - optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
 - the choice of the block size is problematic (default: max{700, √n} bytes)
 - not good in theory: the granularity of changes may disrupt the use of blocks
(A toy sketch of the block-matching idea follows.)
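A toy sketch of the block-matching idea only (not rsync's real protocol, hashes or encoding, and without the strong-hash confirmation rsync performs to rule out weak-hash collisions): the client hashes the fixed-size blocks of its old file, and the server slides over the new file emitting either "copy block i" tokens or literal characters.

B = 4                                   # toy block size; rsync uses hundreds of bytes

def weak_hash(s):                       # stand-in for the rolling hash (recomputed here;
    h = 0                               # a real rolling hash is updated in O(1) per shift)
    for ch in s:
        h = (h * 257 + ord(ch)) % (1 << 16)
    return h

def client_hashes(f_old):
    return {weak_hash(f_old[j:j + B]): j // B
            for j in range(0, len(f_old) - B + 1, B)}

def server_delta(f_new, hashes):
    out, i = [], 0
    while i < len(f_new):
        blk = hashes.get(weak_hash(f_new[i:i + B]))
        if blk is not None and len(f_new) - i >= B:
            out.append(('copy', blk))   # a block the client already has
            i += B
        else:
            out.append(('lit', f_new[i]))
            i += 1
    return out

f_old = "the quick brown fox jumps"
f_new = "the quick red fox jumps!!"
print(server_delta(f_new, client_hashes(f_old)))   # mostly 'copy' tokens, literals around the change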

Rsync: some experiments
          gcc size   emacs size
total     27288      27326
gzip      7563       8577
zdelta    227        1431
rsync     964        4452
Compressed sizes in KB (slightly outdated numbers).
Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync
 - The server sends the hashes (unlike the client in rsync); the client checks them.
 - The server deploys the common f_ref to compress the new f_tar (rsync compresses just it).

A multi-round protocol
 - k blocks of n/k elements, log(n/k) levels
 - If the distance is k, then on each level at most about k hashes do not find a match in the other file.
 - The communication complexity is O(k lg n lg(n/k)) bits.

Next lecture

Set reconciliation
Problem: given two sets S_A and S_B of integer values located on two machines A and B, determine the difference between the two sets, at one or both of the machines.
Requirements: the cost should be proportional to the size k of the difference, where k may or may not be known in advance to the parties.
Note: set reconciliation is “easier” than file sync [it is record-based] — not perfectly true, but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 - Word-based indexes — here a notion of “word” must be devised!
   » Inverted files, Signature files, Bitmaps.
 - Full-text indexes — no constraint on text and queries!
   » Suffix Array, Suffix tree, String B-tree, ...

How do we solve Prefix Search?
 - A trie !!
 - An array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T  iff  P is a prefix of the i-th suffix of T (i.e. of T[i,N]).
[Figure: T with the suffix T[i,N] highlighted and P aligned at position i]
Occurrences of P in T = all suffixes of T having P as a prefix.
Example: P = si, T = mississippi  ⇒  occurrences at positions 4, 7.
SUF(T) = sorted set of suffixes of T.
Reduction: from substring search to prefix search (over the suffixes of T).

The Suffix Tree
T# = mississippi#
[Figure: the suffix tree of T# — edges labeled with substrings (e.g. #, i, i#, p, pi#, ppi#, s, si, ssi, mississippi#); the 12 leaves carry the starting positions of the corresponding suffixes]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.
Storing SUF(T) explicitly would take Θ(N²) space; we store only the array SA of suffix pointers.

T = mississippi#
SA    SUF(T)
12    #
11    i#
8     ippi#
5     issippi#
2     ississippi#
1     mississippi#
10    pi#
9     ppi#
7     sippi#
4     sissippi#
6     ssippi#
3     ssissippi#

Suffix Array space:
 • SA: Θ(N log2 N) bits
 • text T: N chars
 ⇒ in practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 memory accesses per step.
[Figure: the binary search for P = si on T = mississippi# — at each step the middle suffix is compared with P, deciding whether P is smaller or larger]

Suffix Array search:
 • O(log2 N) binary-search steps
 • each step takes O(p) char comparisons
 ⇒ overall, O(p log2 N) time
 [improvable to O(p + log2 N), Manber-Myers ’90; and to O(p + log2 |S|), Cole et al. ’06]

Locating the occurrences
[Figure: the binary searches for the boundaries si$ and si# delimit the contiguous range of SA whose suffixes are prefixed by si — here the entries 4 and 7, so occ = 2]
Suffix Array search: O(p + log2 N + occ) time.
Suffix Trays: O(p + log2 |S| + occ)  [Cole et al., ’06]
String B-tree  [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays  [Ciriani et al., ’02]
(A toy sketch of the binary search follows.)
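A small sketch of the indirect binary search over SA (using the quadratic toy construction of the earlier slide); two searches delimit the contiguous range of suffixes prefixed by P.

def suffix_array(T):
    # the "elegant but inefficient" construction: sort the suffix start positions (1-based)
    return sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])

def locate(T, SA, P):
    p = len(P)
    def pref(k):                        # first p chars of the k-th smallest suffix
        i = SA[k]
        return T[i - 1:i - 1 + p]
    lo, hi = 0, len(SA)                 # leftmost suffix whose prefix is >= P
    while lo < hi:
        mid = (lo + hi) // 2
        if pref(mid) < P: lo = mid + 1
        else: hi = mid
    left = lo
    lo, hi = left, len(SA)              # leftmost suffix whose prefix is > P
    while lo < hi:
        mid = (lo + hi) // 2
        if pref(mid) <= P: lo = mid + 1
        else: hi = mid
    return sorted(SA[left:lo])

T = "mississippi#"
SA = suffix_array(T)
print(SA)                   # -> [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(locate(T, SA, "si"))  # -> [4, 7]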

Text mining
Lcp[1,N-1] = longest common prefix between suffixes adjacent in SA.
[Figure: SA of T = mississippi# with the Lcp values between adjacent suffixes; e.g. Lcp = 4 between issippi# and ississippi#]
 • How long is the common prefix between T[i,...] and T[j,...] ?
   It is the minimum of the subarray Lcp[h,k-1], where SA[h] = i and SA[k] = j.
 • Does there exist a repeated substring of length ≥ L ?
   Search for some Lcp[i] ≥ L.
 • Does there exist a substring of length ≥ L occurring ≥ C times ?
   Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.


Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary: bzip, not, or (and the space separator)
P = bzip = 1a 0b
S = “bzip or not bzip”

[Figure: the tagged word-based Huffman tree and the compressed text C(S); the two bytes of
C(P) are compared against C(S), and the tag bits tell where codewords begin, so every
candidate position is answered yes/no without decompressing.]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
[Figure: the pattern P aligned under the text T at the positions where it occurs]
 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errs, whose running
time is efficient with high probability.

We will consider a binary alphabet (i.e., T ∈ {0,1}^n).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H(s) = Σi=1,…,m 2^(m−i) * s[i]

P=0101
H(P) = 2³*0 + 2²*1 + 2¹*0 + 2⁰*1 = 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1):

H(Tr) = 2 * H(Tr-1) − 2^m * T[r-1] + T[r+m-1]

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H(T1) = H(1011) = 11
H(T2) = H(0110) = 2*11 − 2^4*1 + 0 = 22 − 16 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1 * 2 (mod 7) + 0 = 2
2 * 2 (mod 7) + 1 = 5
5 * 2 (mod 7) + 1 = 4
4 * 2 (mod 7) + 1 = 2
2 * 2 (mod 7) + 1 = 5
5 (mod 7) = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1):
2^m (mod q) = 2 * (2^(m-1) (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
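A compact Python sketch of the algorithm above; to keep it short it uses one fixed prime q instead of picking a random prime ≤ I, and it always verifies candidate matches, so it behaves like the deterministic variant.

def karp_rabin(T, P, q=2**31 - 1):
    n, m = len(T), len(P)
    if m > n: return []
    hp = ht = 0
    for i in range(m):                         # Hq(P) and Hq(T1)
        hp = (2 * hp + int(P[i])) % q
        ht = (2 * ht + int(T[i])) % q
    top = pow(2, m - 1, q)                     # 2^(m-1) mod q, used to drop the leftmost bit
    occ = []
    for r in range(n - m + 1):
        if hp == ht and T[r:r + m] == P:       # verify, so no false match is ever reported
            occ.append(r)
        if r + m < n:                          # roll the window: remove T[r], append T[r+m]
            ht = (2 * (ht - int(T[r]) * top) + int(T[r + m])) % q
    return occ

print(karp_rabin("10110101", "0101"))          # -> [4], i.e. the slide's T5 (1-based)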

Problem 1: Solution
Dictionary: bzip, not, or (and the space separator)
P = bzip = 1a 0b
S = “bzip or not bzip”

[Figure: C(P) is matched against C(S); the tag bits mark codeword beginnings, so each candidate
position is answered yes/no without decompressing S.]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

      c a l i f o r n i a
      1 2 3 4 5 6 7 8 9 10
f 1   0 0 0 0 1 0 0 0 0 0
o 2   0 0 0 0 0 1 0 0 0 0
r 3   0 0 0 0 0 0 1 0 0 0
How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the (j-1)-th one.
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1 for the
positions in P where character x appears.
Example:
P = abaac
U(a) = (1,0,1,1,0)   U(b) = (0,1,0,0,0)   U(c) = (0,0,0,0,1)

How to construct M



Initialize column 0 of M to all zeros
For j > 0, the j-th column is obtained as

M(j) = BitShift( M(j-1) ) & U(T[j])

For i > 1, entry M(i,j) = 1 iff
 (1) the first i-1 characters of P match the i-1 characters of T ending at character j-1  ⇔  M(i-1,j-1) = 1
 (2) P[i] = T[j]  ⇔  the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) into the i-th position;
AND-ing it with the i-th bit of U(T[j]) establishes whether both conditions hold.
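The same recurrence written as a Python sketch, with bit i-1 of an integer playing the role of row i of M; the function name and the 0-based positions in the output are my own choices.

def shift_and(T, P):
    m = len(P)
    U = {}
    for i, c in enumerate(P):                  # U[c]: positions of c in P, as a bitmask
        U[c] = U.get(c, 0) | (1 << i)
    M, occ = 0, []
    for j, c in enumerate(T):
        M = ((M << 1) | 1) & U.get(c, 0)       # M(j) = BitShift(M(j-1)) & U(T[j])
        if M & (1 << (m - 1)):                 # last row set: an occurrence ends at j
            occ.append(j - m + 1)
    return occ

print(shift_and("xabxabaaca", "abaac"))        # -> [4] (0-based; the slide's position 5)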

An example j=1
T = xabxabaaca,  P = abaac

M(1) = BitShift(M(0)) & U(T[1]) = (1,0,0,0,0) & U(x) = (1,0,0,0,0) & (0,0,0,0,0) = (0,0,0,0,0)

An example j=2
T = xabxabaaca,  P = abaac

M(2) = BitShift(M(1)) & U(T[2]) = (1,0,0,0,0) & U(a) = (1,0,0,0,0) & (1,0,1,1,0) = (1,0,0,0,0)

An example j=3
T = xabxabaaca,  P = abaac

M(3) = BitShift(M(2)) & U(T[3]) = (1,1,0,0,0) & U(b) = (1,1,0,0,0) & (0,1,0,0,0) = (0,1,0,0,0)

An example j=9
T = xabxabaaca,  P = abaac

      x a b x a b a a c
      1 2 3 4 5 6 7 8 9
a 1   0 1 0 0 1 0 1 1 0
b 2   0 0 1 0 0 1 0 0 0
a 3   0 0 0 0 0 0 1 0 0
a 4   0 0 0 0 0 0 0 1 0
c 5   0 0 0 0 0 0 0 0 1

M(9) = BitShift(M(8)) & U(T[9]) = (1,1,0,0,1) & U(c) = (1,1,0,0,1) & (0,0,0,0,1) = (0,0,0,0,1)
The 1 in the last row says that P occurs in T ending at position 9.

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
U(a) = (1,0,1,1,0)   U(b) = (1,1,0,0,0)   U(c) = (0,0,0,0,1)

What about ‘?’, ‘[^…]’ (not) ?

Problem 1: Another solution
Dictionary: bzip, not, or (and the space separator)
P = bzip = 1a 0b
S = “bzip or not bzip”

[Figure: as before, C(P) = 1a 0b is searched directly in C(S), with yes/no at each candidate codeword]

Speed ≈ Compression ratio

Problem 2
Dictionary: bzip, not, or (and the space separator)

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

P = o
S = “bzip or not bzip”
not = 1 g 0 g 0 a
or  = 1 g 0 a 0 b

[Figure: the codewords of “not” and “or” (the terms containing “o”) are both searched in C(S)]

Speed ≈ Compression ratio? No! Why?
A scan of C(S) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: the text T with the occurrences of patterns P1 and P2 marked]
 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

 S is the concatenation of the patterns in P
 R is a bitmap of length m
  R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of the Shift-And method searching for S
 For any symbol c, U’(c) = U(c) AND R
  U’(c)[i] = 1 iff S[i]=c and it is the first symbol of a pattern
 For any step j,
  compute M(j)
  then M(j) OR U’(T[j]). Why?
   It sets to 1 the first bit of each pattern that starts with T[j]
  Check if there are occurrences ending in j. How?

Problem 3
Dictionary: bzip, not, or (and the space separator)

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

P = bot   k = 2
S = “bzip or not bzip”

[Figure: the word-based Huffman tree and the compressed text C(S)]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix M^l to be an m by n binary
matrix, such that:

M^l(i,j) = 1 iff
there are no more than l mismatches between the
first i characters of P and the i characters of T
ending at character j.

What is M^0?
How does M^k solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing M^l: case 1

The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

BitShift( M^l(j-1) ) & U(T[j])

Computing M^l: case 2

The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

BitShift( M^(l-1)(j-1) )

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))
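A direct Python transcription of this recurrence, keeping one bitmask per value of l (0..k); the function name and the 0-based output positions are mine.

def agrep_mismatches(T, P, k):
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M = [0] * (k + 1)                          # M[l] = column j of matrix M^l
    occ = []
    for j, c in enumerate(T):
        prev = M[:]
        M[0] = ((prev[0] << 1) | 1) & U.get(c, 0)
        for l in range(1, k + 1):
            # case 1: extend with <= l mismatches; case 2: pay one extra mismatch at position j
            M[l] = (((prev[l] << 1) | 1) & U.get(c, 0)) | ((prev[l - 1] << 1) | 1)
        if M[k] & (1 << (m - 1)):
            occ.append(j - m + 1)
    return occ

print(agrep_mismatches("aatatccacaa", "atcgaa", 2))   # -> [3] (0-based; the slide's position 4)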

Example M1
T = xabxabaaca,  P = abaad

        x a b x a b a a c a
        1 2 3 4 5 6 7 8 9 10
M1 =
  a 1   1 1 1 1 1 1 1 1 1 1
  b 2   0 0 1 0 0 1 0 1 1 0
  a 3   0 0 0 1 0 0 1 0 0 1
  a 4   0 0 0 0 1 0 0 1 0 0
  d 5   0 0 0 0 0 0 0 0 1 0

M0 =
  a 1   0 1 0 0 1 0 1 1 0 1
  b 2   0 0 1 0 0 1 0 0 0 0
  a 3   0 0 0 0 0 0 1 0 0 0
  a 4   0 0 0 0 0 0 0 1 0 0
  d 5   0 0 0 0 0 0 0 0 0 0

How much do we pay?





The running time is O(kn(1+m/w)).
Again, the method is practically efficient for
small m.
Still, only O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary: bzip, not, or (and the space separator)

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

P = bot   k = 2
S = “bzip or not bzip”
not = 1 g 0 g 0 a

[Figure: the codeword of “not” (the dictionary term matching “bot” within 2 mismatches) is searched in C(S): yes at its occurrence]

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
Write x > 0 as (Length − 1) zeroes followed by the binary representation of x,
where Length = ⌊log2 x⌋ + 1
e.g., 9 is represented as <000, 1001>.

g-code for x takes 2⌊log2 x⌋ + 1 bits
(i.e. a factor of 2 from optimal)

 Optimal for Pr(x) = 1/(2x²), and i.i.d. integers
 It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

→ 8, 6, 3, 59, 7
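A tiny encoder/decoder for this code (function names are mine); the decoder reproduces the exercise above.

def gamma_encode(x):
    b = bin(x)[2:]                 # x in binary, Length = len(b)
    return "0" * (len(b) - 1) + b  # Length-1 zeroes, then the binary representation

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":      # count the leading zeroes = Length - 1
            z += 1; i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out

print(gamma_encode(9))                                    # -> 0001001
print(gamma_decode("0001000001100110000011101100111"))    # -> [8, 6, 3, 59, 7]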

Analysis
Sort the pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(S) + 1
Key fact:
1 ≥ Σi=1,…,x pi ≥ x * px   ⇒   x ≤ 1/px

How good is it ?
Encode the integers via g-coding:
|g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):

Σi=1,…,|S| pi * |g(i)|  ≤  Σi=1,…,|S| pi * [2 * log(1/pi) + 1]  =  2 * H0(X) + 1

Not much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

 s + c = 256 (we are playing with 8 bits)
 Thus s items are encoded with 1 byte
 And s*c with 2 bytes, s*c² on 3 bytes, ...

An example

 5000 distinct words
 ETDC encodes 128 + 128² = 16512 words on 2 bytes
 (230,26)-dense code encodes 230 + 230*26 = 6210 on 2
 bytes, hence more on 1 byte and thus better if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:
 Exploits temporal locality, and it is dynamic

 X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n² log n) bits, MTF = O(n log n) + n² bits

Not much worse than Huffman
...but it may be far better
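A minimal MTF encoder (1-based positions, to be fed afterwards to the g-coder); the starting list and the example string are my own choices.

def mtf_encode(text, alphabet):
    L = list(alphabet)
    out = []
    for s in text:
        i = L.index(s)
        out.append(i + 1)          # output the position of s in L (1-based)
        L.pop(i); L.insert(0, s)   # move s to the front of L
    return out

print(mtf_encode("mississippi", ['i', 'm', 'p', 's']))
# -> [2, 2, 4, 1, 2, 2, 1, 2, 4, 1, 2]: repeated symbols quickly cost position 1 or 2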

MTF: how good is it ?
Encode the integers via g-coding:
|g(i)| ≤ 2 * log i + 1
Put S in the front and consider the cost of encoding
(px1 < px2 < … < pxnx are the positions of symbol x in the text):

O(|S| log |S|) + Σx=1,…,|S| Σi=2,…,nx |g( pxi − pxi−1 )|

By Jensen’s inequality this is

≤ O(|S| log |S|) + Σx=1,…,|S| nx * [2 * log(N/nx) + 1]
= O(|S| log |S|) + N * [2 * H0(X) + 1]

⇒  La[mtf] ≤ 2 * H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:
 Exploits spatial locality, and it is a dynamic code
 There is a memory

 X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)
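The corresponding one-pass encoder, reproducing the example above.

def rle_encode(s):
    out = []
    for c in s:
        if out and out[-1][0] == c:
            out[-1] = (c, out[-1][1] + 1)      # extend the current run
        else:
            out.append((c, 1))                 # start a new run
    return out

print(rle_encode("abbbaacccca"))
# -> [('a', 1), ('b', 3), ('a', 2), ('c', 4), ('a', 1)]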

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.  p(a) = .2, p(b) = .5, p(c) = .3

f(i) = Σj=1,…,i−1 p(j)

f(a) = .0, f(b) = .2, f(c) = .7
so:  a → [.0, .2),  b → [.2, .7),  c → [.7, 1.0)

The interval for a particular symbol will be called
the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
Start from [0,1) and repeatedly restrict to the current symbol’s interval:
b → [.2,.7)     a → [.2,.3)     c → [.27,.3)
The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l0 = 0      li = li-1 + si-1 * f[ci]
s0 = 1      si = si-1 * p[ci]

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

sn = ∏i=1,…,n p[ci]

The interval for a message sequence will be called the
sequence interval
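The two recurrences above executed on the running example (a sketch with plain floats, so the endpoints are only approximate; a real coder uses the integer version discussed later).

def sequence_interval(msg, p, f):
    l, s = 0.0, 1.0
    for c in msg:
        l, s = l + s * f[c], s * p[c]          # li = li-1 + si-1*f[ci],  si = si-1*p[ci]
    return l, l + s

p = {'a': .2, 'b': .5, 'c': .3}
f = {'a': .0, 'b': .2, 'c': .7}
print(sequence_interval("bac", p, f))          # -> approximately (0.27, 0.3), i.e. [.27,.3)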

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
.49 ∈ [.2,.7)   → b
within [.2,.7):   .49 ∈ [.3,.55)   → b
within [.3,.55):  .49 ∈ [.475,.55) → c

The message is bbc.

Representing a real number
Binary fractional representation:
.75 = .11      1/3 = .0101…      11/16 = .1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
         min     max     interval
.11      .110    .111    [.75, 1.0)
.101     .1010   .1011   [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

[Figure: a sequence interval [.61, .79) containing the code interval of .101 = [.625, .75)]

Can use L + s/2 truncated to 1 + ⌈log (1/s)⌉ bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + ⌈log (1/s)⌉ =
= 1 + ⌈log ∏i=1,n (1/pi)⌉
≤ 2 + ∑i=1,n log (1/pi)
= 2 + ∑k=1,|S| n pk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R = 2^k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l ≥ R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

[Figure: the ATB as a state machine, given the current interval (L,s) and the distribution
(p1,…,p|S|), encoding symbol c produces the sub-interval (L’,s’)]

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
[Figure: PPM feeds the ATB with p[s | context], where s is the next char c or the escape symbol esc;
the ATB maps the interval (L,s) to (L’,s’)]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
String = ACCBACCACBA B        k=2

Context    Counts
(empty)    A=4  B=2  C=5  $=3

A          C=3  $=1
B          A=2  $=1
C          A=1  B=2  C=2  $=3

AC         B=1  C=2  $=2
BA         C=1  $=1
CA         C=1  $=1
CB         A=2  $=1
CC         A=1  B=1  $=2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
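The same copy rule as a tiny decoder for (d, len, c) triples; copying one character at a time makes the overlapping case (len > d) work for free. The triples below are those of the windowed example above.

def lz77_decode(triples):
    out = []
    for d, length, c in triples:
        for _ in range(length):
            out.append(out[len(out) - d])      # copy from d positions back (may overlap)
        out.append(c)                          # then the explicit next char
    return "".join(out)

print(lz77_decode([(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')]))
# -> aacaacabcabaaac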

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later
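A short LZW encoder matching the example above; the toy initial dictionary a=112, b=113, c=114 stands in for the 256 ASCII entries, and new entries are numbered from 256.

def lzw_encode(text, initial):
    dic = dict(initial)
    next_code = 256
    out, S = [], text[0]
    for c in text[1:]:
        if S + c in dic:
            S += c                              # extend the current match
        else:
            out.append(dic[S])                  # emit the id of S (no extra char)
            dic[S + c] = next_code; next_code += 1
            S = c
    out.append(dic[S])
    return out

print(lzw_encode("aabaacababacb", {'a': 112, 'b': 113, 'c': 114}))
# -> [112, 112, 113, 256, 114, 257, 261, 114, 113]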

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L → F mapping
[The sorted-rotations matrix again, with only columns F and L shown; the middle of each row is unknown]

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
[F and L columns of the BWT matrix; the middle of each row is unknown]

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
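The whole round trip in a few lines of Python: a didactic forward transform that sorts all rotations, and the inversion via the LF mapping just described (the end-marker '#' is assumed to be unique).

def bwt(T):
    rot = sorted(T[i:] + T[:i] for i in range(len(T)))   # sort all rotations
    return "".join(row[-1] for row in rot)               # L = last column

def ibwt(L, sentinel='#'):
    n = len(L)
    order = sorted(range(n), key=lambda r: (L[r], r))    # where each char of L lands in F
    LF = [0] * n
    for f_pos, r in enumerate(order):
        LF[r] = f_pos
    T = [""] * n
    r = L.index(sentinel)         # the row ending with the sentinel is T itself
    for i in range(n - 1, -1, -1):
        T[i] = L[r]               # L[r] precedes F[r] in T
        r = LF[r]
    return "".join(T)

print(bwt("mississippi#"))        # -> ipssm#pissii
print(ibwt("ipssm#pissii"))       # -> mississippi#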

How to compute the BWT ?
SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]   (for T = mississippi#)
[BWT matrix: the lexicographically sorted rotations of T; its last column is L = ipssm#pissii]

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in-degree(u) = k ]  ∝  1 / k^a ,   a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/x^a, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 million pages, 150 million links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1 − x, s2 − s1 − 1, ..., sk − sk−1 − 1}

For negative entries v: map v ≥ 0 to 2v and v < 0 to 2|v| − 1 (cf. the examples below)
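A small sketch of the gap (and sign-folding) encoding of a successor list; the node id and successor values below are made-up, chosen only to show a negative first gap.

def gaps(x, successors):
    s = sorted(successors)
    g = [s[0] - x] + [s[i] - s[i - 1] - 1 for i in range(1, len(s))]
    g[0] = 2 * g[0] if g[0] >= 0 else 2 * abs(g[0]) - 1   # fold the possibly negative first gap
    return g

print(gaps(15, [13, 15, 16, 17, 18, 19, 23, 24, 203, 315, 1034]))
# -> [3, 1, 0, 0, 0, 0, 3, 0, 178, 111, 718]: small numbers, thanks to locality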

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides an efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
           Emacs size   Emacs time
uncompr    27Mb         ---
gzip       8Mb          35 secs
zdelta     1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: a small weighted graph of files plus the dummy node; edge weights are zdelta sizes
(e.g. 20, 123, 220, 620, 2000) and the min branching picks the cheapest reference for each file]

           space   time
uncompr    30Mb    ---
tgz        20%     linear
THIS       8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n² edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strictly n² time

           space    time
uncompr    260Mb    ---
tgz        12%      2 mins
THIS       8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

          gcc size   emacs size
total     27288      27326
gzip      7563       8577
zdelta    227        1431
rsync     964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients check them
Server deploys the common fref to compress the new ftar (rsync compresses just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Figure: the suffix tree of T# = mississippi#, with edge labels such as “ssi”, “ppi#”, “pi#”, “si”,
“i#”, “mississippi#”, and leaves storing the starting positions 1…12 of the suffixes]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
5

Θ(N²) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
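The indirect binary search as a Python sketch (didactic suffix-array construction by plain sorting; names are mine). It reproduces the mississippi# example: the suffixes prefixed by P are contiguous in SA.

def sa_build(T):
    return sorted(range(len(T)), key=lambda i: T[i:])     # O(N^2 log N), fine for a demo

def sa_search(T, SA, P):
    def bound(upper):                                     # binary search, O(p) per comparison
        lo, hi = 0, len(SA)
        while lo < hi:
            mid = (lo + hi) // 2
            pref = T[SA[mid]:SA[mid] + len(P)]
            if pref < P or (upper and pref == P):
                lo = mid + 1
            else:
                hi = mid
        return lo
    return [SA[i] for i in range(bound(False), bound(True))]

T = "mississippi#"
SA = sa_build(T)
print([i + 1 for i in SA])                                # -> [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(sorted(i + 1 for i in sa_search(T, SA, "si")))      # -> [4, 7]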

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O(p + log2 N + occ) time

Suffix Trays: O(p + log2 |S| + occ)   [Cole et al., ‘06]
String B-tree   [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays   [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
• Search for Lcp[i,i+C-2] whose entries are ≥ L


code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet (i.e., T ∈ {0,1}^n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

  H(s) = Σ_{i=1..m} 2^(m-i) · s[i]

P = 0101
H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

  H(T_r) = 2 · H(T_{r-1}) − 2^m · T(r−1) + T(r+m−1)

T = 10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H(T1) = H(1011) = 11
H(T2) = H(0110) = 2·11 − 2⁴·1 + 0 = 22 − 16 + 0 = 6 = H(0110)

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
  1·2 (mod 7) + 0 = 2
  2·2 (mod 7) + 1 = 5
  5·2 (mod 7) + 1 = 4
  4·2 (mod 7) + 1 = 2
  2·2 (mod 7) + 1 = 5
  5 (mod 7) = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1):
  2^m (mod q) = 2 · (2^(m-1) (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
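A minimal sketch of the whole scheme on the slides’ binary strings; the only liberty taken is that q is a fixed prime rather than a random prime ≤ I, so treat that constant (and the function name) as illustrative assumptions.

#include <string>
#include <vector>

// Karp-Rabin scan: report all r where Hq(T_r) == Hq(P), verifying each hit
// to rule out false matches (the "deterministic" variant of the slide).
std::vector<int> karp_rabin(const std::string& T, const std::string& P) {
    const long long q = 1000000007LL;            // fixed prime (assumption)
    const int n = T.size(), m = P.size();
    std::vector<int> occ;
    if (m == 0 || n < m) return occ;
    long long hp = 0, ht = 0, top = 1;           // top = 2^(m-1) mod q
    for (int i = 0; i < m; ++i) {
        hp = (2 * hp + (P[i] - '0')) % q;        // binary alphabet, as on the slide
        ht = (2 * ht + (T[i] - '0')) % q;
        if (i) top = (2 * top) % q;
    }
    for (int r = 0; ; ++r) {                     // r = starting position (0-based)
        if (hp == ht && T.compare(r, m, P) == 0) occ.push_back(r);
        if (r + m >= n) break;
        ht = (2 * (ht - (T[r] - '0') * top % q + q) + (T[r + m] - '0')) % q;
    }
    return occ;
}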

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

[Figure: P is encoded with the dictionary’s code and its codeword is searched in C(S), S = “bzip or not bzip”, with an exact string-matching algorithm; each candidate alignment yields yes/no.]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M  (m = 3 rows for P = for, n = 10 columns for T = california):

          c  a  l  i  f  o  r  n  i  a
          1  2  3  4  5  6  7  8  9  10
  f  1    0  0  0  0  1  0  0  0  0  0
  o  2    0  0  0  0  0  1  0  0  0  0
  r  3    0  0  0  0  0  0  1  0  0  0

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the (j-1)-th one.
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1 for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

  M(j) = BitShift( M(j-1) ) & U( T[j] )


For i > 1, entry M(i,j) = 1 iff
 (1) the first i-1 characters of P match the i-1 characters of T ending at character j-1   ⇔  M(i-1, j-1) = 1
 (2) P[i] = T[j]   ⇔  the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) into the i-th position;
AND-ing with the i-th bit of U(T[j]) establishes whether both conditions hold.
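A sketch of the method for m ≤ w = 64, with bit i−1 of a machine word playing the role of row i of the current column M(j); the function name is illustrative.

#include <cstdint>
#include <string>
#include <vector>

// Shift-And for m <= 64: the shift injects a 1 into the "first" (low-order)
// bit, then the column is AND-ed with U(T[j]) as in the recurrence above.
std::vector<int> shift_and(const std::string& T, const std::string& P) {
    const int m = P.size();
    uint64_t U[256] = {0};
    for (int i = 0; i < m; ++i) U[(unsigned char)P[i]] |= 1ULL << i;  // U(x)
    uint64_t M = 0;
    std::vector<int> occ;
    for (size_t j = 0; j < T.size(); ++j) {
        M = ((M << 1) | 1ULL) & U[(unsigned char)T[j]];   // BitShift + AND
        if (M & (1ULL << (m - 1)))                        // M(m, j) = 1
            occ.push_back((int)j - m + 1);                // occurrence start
    }
    return occ;
}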

An example, j = 1 … 9
T = x a b x a b a a c a      P = a b a a c

  U(a) = (1,0,1,1,0)ᵗ   U(b) = (0,1,0,0,0)ᵗ   U(c) = (0,0,0,0,1)ᵗ   U(x) = (0,0,0,0,0)ᵗ

  j=1:  M(1) = BitShift(M(0)) & U(T[1]) = (1,0,0,0,0)ᵗ & (0,0,0,0,0)ᵗ = (0,0,0,0,0)ᵗ
  j=2:  M(2) = BitShift(M(1)) & U(T[2]) = (1,0,0,0,0)ᵗ & (1,0,1,1,0)ᵗ = (1,0,0,0,0)ᵗ
  j=3:  M(3) = BitShift(M(2)) & U(T[3]) = (1,1,0,0,0)ᵗ & (0,1,0,0,0)ᵗ = (0,1,0,0,0)ᵗ
  ...
  j=9:  M(9) = BitShift(M(8)) & U(T[9]) = (1,1,0,0,1)ᵗ & (0,0,0,0,1)ᵗ = (0,0,0,0,1)ᵗ

The whole matrix after column 9 (rows i = 1..5 for P = abaac, columns j = 1..9):

          1  2  3  4  5  6  7  8  9
  i=1     0  1  0  0  1  0  1  1  0
  i=2     0  0  1  0  0  1  0  0  0
  i=3     0  0  0  0  0  0  1  0  0
  i=4     0  0  0  0  0  0  0  1  0
  i=5     0  0  0  0  0  0  0  0  1

M(5,9) = 1, so an occurrence of P ends at position 9 of T (it starts at position 5).

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac

  U(a) = (1,0,1,1,0)ᵗ     U(b) = (1,1,0,0,0)ᵗ     U(c) = (0,0,0,0,1)ᵗ

What about ‘?’, ‘[^…]’ (not)?

Problem 1: Another solution
Dictionary

P = bzip = 1a 0b

[Figure: as before, the dictionary {bzip, not, or, space} is Huffman-coded and P’s codeword is searched in C(S), S = “bzip or not bzip”, e.g. with the Shift-And method just described; each candidate alignment yields yes/no.]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

P = o

[Figure: the dictionary terms containing P = o as a substring are “not” and “or”, with codewords not = 1g 0g 0a and or = 1g 0a 0b; each of these codewords is searched in C(S), S = “bzip or not bzip”.]

not = 1 g 0 g 0 a
or  = 1 g 0 a 0 b

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: text T with candidate alignments of the patterns P1 and P2]

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

S is the concatenation of the patterns in P
R is a bitmap of length m:
  R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of the Shift-And method searching for S:
  For any symbol c, U’(c) = U(c) AND R
   U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
  For any step j,
   compute M(j)
   then OR it with U’(T[j]). Why?
   This sets to 1 the first bit of each pattern that starts with T[j]
   Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

[Figure: dictionary {bzip, not, or, space}, S = “bzip or not bzip” stored as C(S), and the query P = bot with k = 2 allowed mismatches]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix M^l to be an m by n binary
matrix, such that:

M^l(i,j) = 1 iff there are no more than l mismatches
between the first i characters of P and the i characters
of T ending at character j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

[Figure: P[1..i-1] aligned with the substring of T ending at j-1]

  BitShift( M^l(j-1) ) & U( T[j] )

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

[Figure: P[1..i-1] aligned with the substring of T ending at j-1]

  BitShift( M^(l-1)(j-1) )

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))
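A sketch that keeps k+1 machine words, one per allowed number of mismatches, and applies the recurrence above at every text position (again assuming m ≤ 64; names are illustrative).

#include <cstdint>
#include <string>
#include <vector>

// Shift-And with up to k mismatches: M[l] holds column j of M^l.
std::vector<int> agrep_mismatch(const std::string& T, const std::string& P, int k) {
    const int m = P.size();
    uint64_t U[256] = {0};
    for (int i = 0; i < m; ++i) U[(unsigned char)P[i]] |= 1ULL << i;
    std::vector<uint64_t> M(k + 1, 0), old(k + 1, 0);
    std::vector<int> occ;
    for (size_t j = 0; j < T.size(); ++j) {
        old = M;
        for (int l = 0; l <= k; ++l) {
            uint64_t x = ((old[l] << 1) | 1ULL) & U[(unsigned char)T[j]];  // case 1
            if (l > 0) x |= (old[l - 1] << 1) | 1ULL;                      // case 2
            M[l] = x;
        }
        if (M[k] & (1ULL << (m - 1)))                 // <= k mismatches end at j
            occ.push_back((int)j - m + 1);
    }
    return occ;
}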

Example M1
T = x a b x a b a a c a      P = a b a a d

M1 =
          1  2  3  4  5  6  7  8  9  10
  i=1     1  1  1  1  1  1  1  1  1  1
  i=2     0  0  1  0  0  1  0  1  1  0
  i=3     0  0  0  1  0  0  1  0  0  1
  i=4     0  0  0  0  1  0  0  1  0  0
  i=5     0  0  0  0  0  0  0  0  1  0

M0 =
          1  2  3  4  5  6  7  8  9  10
  i=1     0  1  0  0  1  0  1  1  0  1
  i=2     0  0  1  0  0  1  0  0  0  0
  i=3     0  0  0  0  0  0  1  0  0  0
  i=4     0  0  0  0  0  0  0  1  0  0
  i=5     0  0  0  0  0  0  0  0  0  0

How much do we pay?





The running time is O( k·n·(1 + m/w) ).
Again, the method is practically efficient for
small m.
Only O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

[Figure: dictionary {bzip, not, or, space}, S = “bzip or not bzip” stored as C(S), P = bot, k = 2; the only dictionary term within 2 mismatches of P is “not”, whose codeword is then searched in C(S).]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding

  γ(x) = 000...........0  x in binary      (Length − 1 zeroes, then x in binary)

  x > 0 and Length = ⌊log2 x⌋ + 1
  e.g., 9 is represented as <000, 1001>.

  The γ-code for x takes 2⌊log2 x⌋ + 1 bits
  (i.e. a factor of 2 from optimal)

  Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

  0001000001100110000011101100111   →   8, 6, 3, 59, 7
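A small sketch of γ-encoding and γ-decoding over a string of ‘0’/‘1’ characters; running the decoder on the bit string above returns 8, 6, 3, 59, 7.

#include <string>
#include <vector>

// gamma-code: (len-1) zeroes followed by x in binary, len = floor(log2 x) + 1.
std::string gamma_encode(const std::vector<unsigned>& xs) {
    std::string out;
    for (unsigned x : xs) {                        // requires x > 0
        int len = 0;
        for (unsigned t = x; t; t >>= 1) ++len;    // number of bits of x
        out.append(len - 1, '0');
        for (int i = len - 1; i >= 0; --i) out += ((x >> i) & 1) ? '1' : '0';
    }
    return out;
}

std::vector<unsigned> gamma_decode(const std::string& bits) {
    std::vector<unsigned> xs;
    size_t p = 0;
    while (p < bits.size()) {
        int zeros = 0;
        while (p < bits.size() && bits[p] == '0') { ++zeros; ++p; }
        unsigned x = 0;
        for (int i = 0; i <= zeros; ++i) x = (x << 1) | (bits[p++] - '0');
        xs.push_back(x);
    }
    return xs;
}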

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2 · H0(S) + 1
Key fact:
  1 ≥ Σ_{i=1..x} p_i ≥ x · p_x   ⟹   x ≤ 1/p_x

How good is it?
Encode the integers via γ-coding:
  |γ(i)| ≤ 2 · log i + 1
The cost of the encoding is (recall i ≤ 1/p_i):

  Σ_{i=1..|S|} p_i · |γ(i)|  ≤  Σ_{i=1..|S|} p_i · [ 2 · log (1/p_i) + 1 ]  =  2 · H0(X) + 1

Not much worse than Huffman,
and improvable to H0(X) + 2 + ...

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations
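A sketch of the end-tagged dense code idea: the rank of a word is written in a dense base-128 numbering, 7 bits per byte, and the tag bit marks the last byte (here, by setting its high bit). Function name and details are illustrative.

#include <vector>

// End-tagged dense code: ranks 0..127 take 1 byte, the next 128^2 ranks take
// 2 bytes, and so on; every 7-bit configuration is used ("dense").
std::vector<unsigned char> etdc_encode(unsigned long long rank) {
    unsigned long long base = 0, span = 128;
    int k = 1;                                    // number of bytes
    while (rank >= base + span) { base += span; span *= 128; ++k; }
    unsigned long long off = rank - base;
    std::vector<unsigned char> bytes(k);
    for (int i = k - 1; i >= 0; --i) { bytes[i] = off % 128; off /= 128; }
    bytes[k - 1] |= 0x80;                         // tag the last byte
    return bytes;
}

With this numbering, ranks 0..127 take 1 byte and the next 128² ranks take 2 bytes, matching the 128 + 128² = 16512 words of the example a few lines below.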

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:

  s + c = 256 (we are playing with 8 bits)
  Thus s items are encoded with 1 byte,
  s·c with 2 bytes, s·c² with 3 bytes, ...

  Previously we used: s = c = 128

An example




5000 distinct words
ETDC encodes 128 + 128² = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

 Exploits temporal locality, and it is dynamic

 X = 1^n 2^n 3^n … n^n  ⟹  Huff = O(n² log n) bits, MTF = O(n log n) + n² bits

Not much worse than Huffman
...but it may be far better
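A direct sketch of the coder just described, with the list kept as a plain linked list; the balanced-tree/hash solution of the later slides replaces the linear scan.

#include <list>
#include <vector>

// Move-to-Front: output the (1-based) position of each symbol in the list L,
// then move that symbol to the front. L starts as 0, 1, ..., 255.
std::vector<int> mtf_encode(const std::vector<unsigned char>& in) {
    std::list<unsigned char> L;
    for (int c = 0; c < 256; ++c) L.push_back((unsigned char)c);
    std::vector<int> out;
    for (unsigned char c : in) {
        int pos = 1;
        auto it = L.begin();
        while (*it != c) { ++it; ++pos; }   // find c in the list
        out.push_back(pos);
        L.erase(it);
        L.push_front(c);                    // move to front
    }
    return out;
}

The output integers are then coded with a variable-length code such as the γ-code above.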

MTF: how good is it ?
Encode the integers via γ-coding:
  |γ(i)| ≤ 2 · log i + 1
Put S at the front and consider the cost of encoding
(p_x^i = position of the i-th occurrence of symbol x, n_x = #occurrences of x, N = total length):

  O(|S| log |S|)  +  Σ_{x=1..|S|}  Σ_{i=2..n_x}  |γ( p_x^i − p_x^{i−1} )|

By Jensen’s inequality:

  ≤  O(|S| log |S|)  +  Σ_{x=1..|S|} n_x · [ 2 · log (N/n_x) + 1 ]
  =  O(|S| log |S|)  +  N · [ 2 · H0(X) + 1 ]

  ⟹  La[mtf] ≤ 2 · H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded.
How to keep the MTF-list efficiently:

 Search tree
  Leaves contain the symbols, ordered as in the MTF-list
  Nodes contain the size of their descending subtree

 Hash Table
  key is a symbol
  data is a pointer to the corresponding tree leaf

Each tree operation takes O(log |S|) time
Total is O(n log |S|), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings ⟹ just the run lengths and one bit
Properties:

 Exploits spatial locality, and it is a dynamic code
 There is a memory
 X = 1^n 2^n 3^n … n^n  ⟹  Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)
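A sketch of the run extractor; the run lengths would then be fed to a variable-length integer coder such as the γ-code.

#include <string>
#include <utility>
#include <vector>

// Run-Length Encoding: abbbaacccca => (a,1)(b,3)(a,2)(c,4)(a,1).
std::vector<std::pair<char,int>> rle(const std::string& s) {
    std::vector<std::pair<char,int>> runs;
    for (size_t i = 0; i < s.size(); ) {
        size_t j = i;
        while (j < s.size() && s[j] == s[i]) ++j;   // extend the current run
        runs.push_back({s[i], (int)(j - i)});
        i = j;
    }
    return runs;
}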

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.  p(a) = .2,  p(b) = .5,  p(c) = .3

  1.0 ─┐
      │  c = .3
  0.7 ─┤
      │  b = .5
  0.2 ─┤
      │  a = .2
  0.0 ─┘

  f(i) = Σ_{j<i} p(j)        f(a) = .0,  f(b) = .2,  f(c) = .7

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
  [0, 1)   --b-->   [.2, .7)   --a-->   [.2, .3)   --c-->   [.27, .3)

(at each step the current interval is split in the proportions a = .2, b = .5, c = .3,
 and the sub-interval of the next symbol is kept)

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
  l_0 = 0        l_i = l_{i-1} + s_{i-1} · f[c_i]
  s_0 = 1        s_i = s_{i-1} · p[c_i]

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

  s_n = Π_{i=1..n} p[c_i]

The interval for a message sequence will be called the
sequence interval
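A sketch that just applies the two recurrences with plain doubles (fine only for short messages; the integer version with range R = 2^k, discussed later, is what real coders use). Probabilities and symbol names follow the running example.

#include <string>

// Sequence-interval computation: l_i = l_{i-1} + s_{i-1} * f[c_i],
// s_i = s_{i-1} * p[c_i].
struct Interval { double l, s; };

Interval sequence_interval(const std::string& msg,
                           const double p[], const double f[]) {
    Interval iv{0.0, 1.0};
    for (char c : msg) {
        int i = c - 'a';                // symbols a, b, c of the running example
        iv.l += iv.s * f[i];
        iv.s *= p[i];
    }
    return iv;
}

// Example: double p[] = {.2, .5, .3}, f[] = {.0, .2, .7};
// sequence_interval("bac", p, f) yields l = .27, s = .03, i.e. [.27, .3).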

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
  .49 ∈ [.2, .7)    →  b      (new interval [.2, .7))
  .49 ∈ [.3, .55)   →  b      (new interval [.3, .55))
  .49 ∈ [.475, .55) →  c

The message is bbc.

Representing a real number
Binary fractional representation:
  .75 = .11        1/3 = .010101...        11/16 = .1011

Algorithm
 1. x = 2·x
 2. If x < 1, output 0
 3. else x = x − 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
           min       max       interval
  .11      .110…     .111…     [.75, 1.0)
  .101     .1010…    .1011…    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

[Figure: Sequence Interval [.61, .79) containing the Code Interval of .101 = [.625, .75)]

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
  1 + ⌈log (1/s)⌉  =  1 + ⌈log Π_{i=1..n} (1/p_i)⌉
                   ≤  2 + Σ_{i=1..n} log (1/p_i)
                   =  2 + Σ_{k=1..|S|} n·p_k · log (1/p_k)
                   =  2 + n·H0   bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

[Figure: the Arithmetic ToolBox (ATB) as a state machine: given the current interval (L, s) and the next symbol c with distribution (p1, ..., p|S|), it produces the new interval (L’, s’)]

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
[Figure: PPM feeds the ATB with the conditional probability p[ s | context ], where s = c or esc; the interval (L, s) is updated to (L’, s’) as before]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts      String = ACCBACCACBA B,   k = 2

  Order-0 (empty context):   A = 4    B = 2    C = 5    $ = 3

  Order-1 contexts:
    A:   C = 3,  $ = 1
    B:   A = 2,  $ = 1
    C:   A = 1,  B = 2,  C = 2,  $ = 3

  Order-2 contexts:
    AC:  B = 1,  C = 2,  $ = 2
    BA:  C = 1,  $ = 1
    CA:  C = 1,  $ = 1
    CB:  A = 2,  $ = 1
    CC:  A = 1,  B = 1,  $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves
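A quadratic-time sketch of the parser, without the fixed-size window (adding it only means restricting the candidate starting positions); it emits the triples <d, len, c> described above. Without the window the chosen distances can differ from the windowed example on the next slide, but the parsing is still a valid LZ77 parsing.

#include <string>
#include <tuple>
#include <vector>

// LZ77 parsing: at each cursor emit <d, len, c> and advance by len + 1.
std::vector<std::tuple<int,int,char>> lz77(const std::string& T) {
    std::vector<std::tuple<int,int,char>> out;
    size_t cur = 0;
    while (cur < T.size()) {
        int best_len = 0, best_d = 0;
        for (size_t start = 0; start < cur; ++start) {        // candidate source
            size_t l = 0;
            while (cur + l + 1 < T.size() && T[start + l] == T[cur + l]) ++l;
            if ((int)l > best_len) { best_len = l; best_d = cur - start; }
        }
        out.push_back({best_d, best_len, T[cur + best_len]});  // next char
        cur += best_len + 1;
    }
    return out;
}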

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
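A sketch of the encoder; unlike the slide, which seeds the dictionary with all 256 ASCII codes (e.g. a = 112), this sketch seeds it only with the characters that actually occur, just to stay short.

#include <map>
#include <string>
#include <vector>

// LZW: output the code of the longest dictionary match S, then insert S + c.
std::vector<int> lzw_encode(const std::string& T) {
    std::map<std::string,int> dict;
    int next_code = 0;
    for (char c : T)                                        // seed single chars
        if (dict.emplace(std::string(1, c), next_code).second) ++next_code;
    std::vector<int> out;
    std::string S;
    for (char c : T) {
        if (dict.count(S + c)) S += c;                      // extend the match
        else {
            out.push_back(dict[S]);
            dict[S + c] = next_code++;                      // add Sc
            S = std::string(1, c);
        }
    }
    if (!S.empty()) out.push_back(dict[S]);
    return out;
}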

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform (1994)
Let us be given a text T = mississippi#

  Rotations of T:        Sort the rows:     F              L
  mississippi#                              # mississipp   i
  ississippi#m                              i #mississip   p
  ssissippi#mi                              i ppi#missis   s
  sissippi#mis                              i ssippi#mis   s
  issippi#miss                              i ssissippi#   m
  ssippi#missi                              m ississippi   #
  sippi#missis                              p i#mississi   p
  ippi#mississ                              p pi#mississ   i
  ppi#mississi                              s ippi#missi   s
  pi#mississip                              s issippi#mi   s
  i#mississipp                              s sippi#miss   i
  #mississippi                              s sissippi#m   i

A famous example

Much
longer...

A useful tool: the L → F mapping

[Same sorted matrix as above, with columns F and L; the text T is unknown to the decoder.]

How do we map L’s chars onto F’s chars?
... we need to distinguish equal chars in F ...

Take two equal chars of L:
rotate their rows rightward by one position;
they keep the same relative order !!

The BWT is invertible
[Same F and L columns as above; T is unknown.]

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
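A runnable version of InvertBWT, under the assumption that T ends with a unique, lexicographically smallest sentinel ‘#’ (so row 0 of the sorted matrix is the rotation that starts with ‘#’). LF is computed by counting, exactly as property 1 above suggests.

#include <string>
#include <vector>

// Invert the BWT: LF[i] = C[L[i]] + (rank of L[i] among L[0..i]),
// and L[r] precedes F[r] in T, so T is filled backward.
std::string invert_bwt(const std::string& L) {
    const int n = L.size();
    std::vector<int> cnt(256, 0), C(256, 0), rank(n);
    for (int i = 0; i < n; ++i) rank[i] = cnt[(unsigned char)L[i]]++;
    for (int c = 1; c < 256; ++c) C[c] = C[c - 1] + cnt[c - 1];   // chars < c
    std::string T(n, '#');                      // T[n-1] = '#' (sentinel)
    int r = 0;                                  // row 0 starts with '#'
    for (int i = n - 2; i >= 0; --i) {
        T[i] = L[r];                            // L[r] precedes F[r] in T
        r = C[(unsigned char)L[r]] + rank[r];   // LF-mapping
    }
    return T;
}

// invert_bwt("ipssm#pissii") returns "mississippi#".
// Given SA instead, L itself is obtained as L[i] = T[SA[i]-1] (next slide).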

How to compute the BWT ?
  SA = 12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3        L = i p s s m # p i s s i i

  (SA lists the starting positions of the suffixes of T = mississippi#, in
   lexicographic order of the suffixes; the BWT matrix rows are the corresponding rotations)

We said that: L[i] precedes F[i] in T.        For example, L[3] = T[7].
Given SA and T, we have  L[i] = T[ SA[i] − 1 ]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Q(n2 log n) time in the worst-case
• Q(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion pages available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution

Altavista crawl, 1999, and WebBase crawl, 2001:
the indegree follows a power-law distribution

  Pr[ in-degree(u) = k ]  ≈  1 / k^α,    α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/x^α, α ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



What if the sender has never seen the data at the receiver?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides an efficient, optimal solution




fknown is the “previously encoded text”: compress the concatenation fknown·fnew, emitting only the part that starts at fnew

zdelta is one of the best implementations
             Emacs size     Emacs time
  uncompr    27Mb           ---
  gzip       8Mb            35 secs
  zdelta     1.5Mb          42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: weighted graph GF over the files, plus the dummy node 0; edge weights are zdelta sizes (e.g. 20, 123, 220, 620, 2000) and the min branching picks the cheapest reference for each file]

             space     time
  uncompr    30Mb      ---
  tgz        20%       linear
  THIS       8%        quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
             space     time
  uncompr    260Mb     ---
  tgz        12%       2 mins
  THIS       8%        16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
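A sketch of the weak rolling checksum idea behind the 4-byte hash mentioned above: two running sums over the current block that can be slid by one byte in O(1). Constants, names and the way the two sums are packed are illustrative, not rsync’s actual code.

#include <cstdint>
#include <string>

// Weak rolling checksum over a window of 'len' bytes: a = sum of the bytes,
// b = position-weighted sum; sliding the window by one byte is O(1).
struct Rolling {
    uint32_t a = 0, b = 0, len = 0;
    void init(const std::string& s, size_t start, uint32_t l) {
        a = b = 0; len = l;
        for (uint32_t i = 0; i < l; ++i) {
            unsigned char x = s[start + i];
            a += x;
            b += (l - i) * x;
        }
    }
    void roll(unsigned char out_byte, unsigned char in_byte) {
        a += in_byte;  a -= out_byte;        // drop the old byte, add the new one
        b += a;        b -= len * out_byte;  // re-weight the remaining bytes
    }
    uint32_t digest() const { return (b << 16) | (a & 0xffff); }
};

One side computes this digest for each block of its file; the other side slides the window over its own file one byte at a time, looks the digest up in a hash table, and confirms candidate matches with the stronger hash.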

Rsync: some experiments

             gcc size     emacs size
  total      27288        27326
  gzip       7563         8577
  zdelta     227          1431
  rsync      964          4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance between the two files is k, then on each level ≤ k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Figure: the suffix tree of T# = mississippi# (positions 1..12): from the root, edges labeled #, i, mississippi#, p, s; internal branching nodes spell i, p, s, si, ssi, ...; each leaf stores the starting position of its suffix]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

Storing SUF(T) explicitly would take Θ(N²) space, so we keep only the suffix pointers:

  SA      SUF(T)
  12      #
  11      i#
   8      ippi#
   5      issippi#
   2      ississippi#
   1      mississippi#
  10      pi#
   9      ppi#
   7      sippi#
   4      sissippi#
   6      ssippi#
   3      ssissippi#

  T = mississippi#          P = si

Suffix Array:
 • SA: Θ(N log2 N) bits
 • Text T: N chars
 ⟹ In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
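A sketch of the indirect binary search: suffixes are compared with P only on their first |P| characters, so each of the O(log N) steps costs O(p). SA is assumed to hold 0-based starting positions; names are illustrative.

#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Returns the SA range [lo, hi) of the suffixes having P as a prefix.
std::pair<int,int> sa_search(const std::string& T, const std::vector<int>& SA,
                             const std::string& P) {
    auto less_than_P = [&](int suf, const std::string& pat) {
        return T.compare(suf, pat.size(), pat) < 0;   // compare only |P| chars
    };
    int lo = std::lower_bound(SA.begin(), SA.end(), P, less_than_P) - SA.begin();
    int hi = lo;                                      // extend over the occurrences
    while (hi < (int)SA.size() && T.compare(SA[hi], P.size(), P) == 0) ++hi;
    return {lo, hi};
}

The linear extension of hi could be replaced by a second binary search, giving the O(p log2 N + occ) bound of the slide.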

Locating the occurrences
occ = 2

[Figure: SA of T = mississippi#; the two suffixes prefixed by P = si are sippi# (position 7) and sissippi# (position 4); they occupy a contiguous range of SA, whose two extremes are found by binary search]

Suffix Array search
 • O(p + log2 N + occ) time
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does it exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does it exist a substring of length ≥ L occurring ≥ C times ?
• Search for Lcp[i,i+C-2] whose entries are ≥ L
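A sketch for the first query above: the common-prefix length of T[i,...] and T[j,...] is the minimum of Lcp over the SA positions that separate the two suffixes. Here the minimum is taken by a plain scan; a range-minimum data structure would answer in O(1) after linear preprocessing. We assume 0-based arrays with Lcp[r] = lcp of the suffixes SA[r] and SA[r+1], and pos = inverse of SA.

#include <algorithm>
#include <climits>
#include <vector>

// lcp between the suffixes starting at text positions i and j (i != j).
int lcp_of_suffixes(int i, int j,
                    const std::vector<int>& pos, const std::vector<int>& Lcp) {
    int h = std::min(pos[i], pos[j]);      // ranks of the two suffixes in SA
    int k = std::max(pos[i], pos[j]);
    int m = INT_MAX;
    for (int r = h; r < k; ++r)            // min over Lcp[h .. k-1]
        m = std::min(m, Lcp[r]);
    return m;
}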



Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1):

   H(Tr) = 2 * H(Tr-1) - 2^m * T[r-1] + T[r+m-1]

   T = 10110101
   T1 = 1011,  T2 = 0110
   H(T1) = H(1011) = 11
   H(T2) = 2*11 - 2^4*1 + 0 = 22 - 16 + 0 = 6 = H(0110)

Arithmetic replaces Comparisons




A simple efficient algorithm:
 Compute H(P) and H(T1)
 Run over T
 Compute H(Tr) from H(Tr-1) in constant time, and make the comparison H(P) = H(Tr)

Total running time O(n+m)?
 NO! Why?
 The problem is that when m is large, it is unreasonable to assume that each arithmetic operation can be done in O(1) time.
 Values of H() are m-bit long numbers: in general, they are too BIG to fit in a machine word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!

   1
   1*2 + 0 (mod 7) = 2
   2*2 + 1 (mod 7) = 5
   5*2 + 1 (mod 7) = 4
   4*2 + 1 (mod 7) = 2
   2*2 + 1 (mod 7) = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1), since
   2^m (mod q) = 2 * ( 2^(m-1) (mod q) ) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm

 Choose a positive integer I
 Pick a random prime q less than or equal to I, and compute P's fingerprint Hq(P)
 For each position r in T, compute Hq(Tr) and test whether it equals Hq(P). If the numbers are equal, either
   declare a probable match (randomized algorithm), or
   check and declare a definite match (deterministic algorithm)

Running time: excluding verification, O(n+m).
The randomized algorithm is correct w.h.p.
The deterministic algorithm has expected running time O(n+m).

Proof on the board
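A minimal C sketch of the Karp-Rabin scan over the binary-alphabet example above; q is fixed to the small prime 7 just for illustration (the real algorithm picks q as a random prime ≤ I), and only probable matches are reported, since a definite match would re-check the m characters.

/* Karp-Rabin fingerprint scan, sketched in C on the slides' example. */
#include <stdio.h>
#include <string.h>

int main(void) {
    const char *T = "10110101", *P = "0101";   /* example from the slides */
    long q = 7;                                /* small prime, for illustration */
    long n = strlen(T), m = strlen(P);
    long hp = 0, ht = 0, p2 = 1;               /* p2 = 2^(m-1) mod q */
    for (long i = 0; i < m; i++) {
        hp = (2*hp + (P[i]-'0')) % q;          /* incremental Hq(P) */
        ht = (2*ht + (T[i]-'0')) % q;          /* Hq(T1) */
        if (i < m-1) p2 = (2*p2) % q;
    }
    for (long r = 0; r + m <= n; r++) {
        if (ht == hp) printf("probable match at position %ld\n", r + 1);
        if (r + m < n)                          /* roll the window: drop T[r], add T[r+m] */
            ht = (2*(ht - (T[r]-'0')*p2 % q + q) + (T[r+m]-'0')) % q;
    }
    return 0;
}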

Problem 1: Solution
Dictionary = {bzip, not, or, space};  S = "bzip or not bzip";  P = bzip = 1a 0b

[Figure: scan C(S) comparing the codeword of P against the tagged, byte-aligned codewords; every comparison is answered yes/no without decompression.]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

[Figure: the m x n matrix M for T = california and P = for; e.g. M(1,5) = M(2,6) = M(3,7) = 1, since P[1..3] = for = T[5..7], and all the other entries of those columns are 0.]

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M

 We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one.
 We define the m-length binary vector U(x) for each character x in the alphabet: U(x) is set to 1 for the positions in P where character x appears.

Example: P = abaac
   U(a) = (1,0,1,1,0)
   U(b) = (0,1,0,0,0)
   U(c) = (0,0,0,0,1)

How to construct M

 Initialize column 0 of M to all zeroes.
 For j > 0, the j-th column is obtained as

      M(j) = BitShift( M(j-1) ) & U( T[j] )

 For i > 1, entry M(i,j) = 1 iff
   (1) the first i-1 characters of P match the i-1 characters of T ending at character j-1, i.e. M(i-1,j-1) = 1, and
   (2) P[i] = T[j], i.e. the i-th bit of U(T[j]) = 1.
 BitShift moves bit M(i-1,j-1) into the i-th position; ANDing it with the i-th bit of U(T[j]) establishes whether both conditions hold.
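A minimal C sketch of the Shift-And scan for m ≤ w, run on the slides' example P = abaac, T = xabxabaaca; bit i-1 of the machine word plays the role of row i of column M(j).

/* Shift-And, sketched in C for patterns of length <= 64. */
#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void) {
    const char *T = "xabxabaaca", *P = "abaac";       /* example from the slides */
    size_t n = strlen(T), m = strlen(P);
    uint64_t U[256] = {0};
    for (size_t i = 0; i < m; i++)
        U[(unsigned char)P[i]] |= (uint64_t)1 << i;   /* U(x): 1s where x occurs in P */
    uint64_t M = 0;                                   /* column M(j-1), initially all 0s */
    for (size_t j = 0; j < n; j++) {
        M = ((M << 1) | 1) & U[(unsigned char)T[j]];  /* BitShift, then AND with U(T[j]) */
        if (M & ((uint64_t)1 << (m - 1)))             /* row m set: P ends here */
            printf("occurrence ending at position %zu\n", j + 1);
    }
    return 0;
}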

An example (T = xabxabaaca, P = abaac, with the vectors U(x), U(a), U(b), U(c) defined as above)

[Figure: the columns M(1), M(2), M(3), ..., M(9) computed step by step with M(j) = BitShift(M(j-1)) & U(T[j]); at j = 9 the last entry of the column becomes 1, signalling the occurrence of P ending at position 9 of T.]

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions

We want to allow the pattern to contain special symbols, like the character classes [a-f].

Example: P = [a-b]baac
   U(a) = (1,0,1,1,0)
   U(b) = (1,1,0,0,0)
   U(c) = (0,0,0,0,1)

What about '?', '[^…]' (not)?

Problem 1: Another solution
Dictionary = {bzip, not, or, space};  S = "bzip or not bzip";  P = bzip = 1a 0b

[Figure: the same search carried out with the bit-parallel scan just described, over the byte-aligned codewords of C(S).]

Speed ≈ Compression ratio

Problem 2
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring.

Dictionary = {bzip, not, or, space};  S = "bzip or not bzip";  P = o

[Figure: the matching terms are not = 1g 0g 0a and or = 1g 0a 0b; their codewords are then searched inside C(S).]

Speed ≈ Compression ratio? No! Why?
A scan of C(S) is needed for each term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: text T with the occurrences of the patterns P1 and P2 highlighted]

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

 S is the concatenation of the patterns in P
 R is a bitmap of length m: R[i] = 1 iff S[i] is the first symbol of a pattern
 Use a variant of the Shift-And method searching for S:
   For any symbol c, U'(c) = U(c) AND R, so U'(c)[i] = 1 iff S[i] = c and it is the first symbol of a pattern
   For any step j, compute M(j) and then OR it with U'(T[j]). Why? This sets to 1 the first bit of each pattern that starts with T[j]
   Check if there are occurrences ending in j. How? (see the sketch below)
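A minimal C sketch of this multi-pattern variant. The pattern set and text are illustrative, and the final question is answered with one possible choice (not stated on the slide): a mask E marking the last position of every pattern, tested against M(j).

/* Multi-pattern Shift-And over the concatenation S of the patterns. */
#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void) {
    const char *pat[] = {"ab", "aac"};            /* P = {P1, P2}, illustrative */
    int np = 2;
    const char *T = "xabaacx";                    /* illustrative text */
    uint64_t R = 0, E = 0, U[256] = {0};
    int m = 0;
    for (int p = 0; p < np; p++) {                /* build R (first positions), E (last), U */
        R |= (uint64_t)1 << m;
        for (const char *c = pat[p]; *c; c++, m++)
            U[(unsigned char)*c] |= (uint64_t)1 << m;
        E |= (uint64_t)1 << (m - 1);
    }
    uint64_t M = 0;
    for (size_t j = 0; j < strlen(T); j++) {
        uint64_t Uc = U[(unsigned char)T[j]];
        M = (((M << 1) | 1) & Uc) | (Uc & R);     /* Shift-And step, then OR with U'(T[j]) */
        if (M & E) printf("some pattern ends at position %zu\n", j + 1);
    }
    return 0;
}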

Problem 3
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring, allowing at most k mismatches.

Dictionary = {bzip, not, or, space};  S = "bzip or not bzip";  P = bot, k = 2

[Figure: the codewords of the matching terms are then searched inside C(S).]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal: given k, find all the occurrences of P in T with up to k mismatches.
We define the matrix M^l to be an m by n binary matrix such that:

   M^l(i,j) = 1 iff there are no more than l mismatches between the first i characters of P and the i characters of T ending at character j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

[Figure: P[1..i-1] aligned against T ending at position j-1, followed by the equal pair P[i] = T[j]]

   BitShift( M^l(j-1) ) & U( T[j] )

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

[Figure: P[1..i-1] aligned against T ending at position j-1, with at most l-1 mismatches; position i of P absorbs the new mismatch]

   BitShift( M^(l-1)(j-1) )

Computing M^l

 We compute M^l for all l = 0, ..., k; for each j we compute M(j), M^1(j), ..., M^k(j).
 For all l, initialize M^l(0) to the zero vector.
 In order to compute M^l(j), we observe that there is a match iff case 1 or case 2 above holds, i.e.

      M^l(j) = [ BitShift( M^l(j-1) ) & U( T[j] ) ]  OR  BitShift( M^(l-1)(j-1) )
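A minimal C sketch of the recurrence above, run with k = 1 on the example of the next slide (P = abaad, T = xabxabaaca); it reports the text positions where P occurs with at most k mismatches.

/* Agrep-style k-mismatch search via the bit-parallel recurrence. */
#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define K 1

int main(void) {
    const char *T = "xabxabaaca", *P = "abaad";
    size_t n = strlen(T), m = strlen(P);
    uint64_t U[256] = {0};
    for (size_t i = 0; i < m; i++) U[(unsigned char)P[i]] |= (uint64_t)1 << i;
    uint64_t M[K+1] = {0}, Mnew[K+1];
    for (size_t j = 0; j < n; j++) {
        uint64_t Uc = U[(unsigned char)T[j]];
        Mnew[0] = ((M[0] << 1) | 1) & Uc;                         /* exact Shift-And */
        for (int l = 1; l <= K; l++)                              /* allow one more mismatch */
            Mnew[l] = (((M[l] << 1) | 1) & Uc) | ((M[l-1] << 1) | 1);
        memcpy(M, Mnew, sizeof(M));
        if (M[K] & ((uint64_t)1 << (m - 1)))
            printf("occurrence with <= %d mismatches ending at %zu\n", K, j + 1);
    }
    return 0;
}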

Example M^1 (T = xabxabaaca, P = abaad)

[Figure: the matrices M^0 and M^1, column by column. In M^1 the entry (5,9) is 1: P occurs with at most one mismatch ending at position 9 of T (abaac vs. abaad).]

How much do we pay?





The running time is O( k n (1 + m/w) ).
Again, the method is practically efficient for small m.
Still, only O(k) columns of M are needed at any given time; hence the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring, allowing k mismatches.

Dictionary = {bzip, not, or, space};  S = "bzip or not bzip";  P = bot, k = 2

[Figure: the matching term shown is not = 1g 0g 0a, whose codeword is then searched inside C(S).]

Agrep: more sophisticated operations


The Shift-And method can solve other ops.

The edit distance between two strings p and s is d(p,s) = the minimum number of operations needed to transform p into s via three ops:
 Insertion: insert a symbol in p
 Deletion: delete a symbol from p
 Substitution: change a symbol of p into a different one

Example: d(ananas, banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding

   g(x) = 0 0 ... 0 (Length-1 zeroes) followed by x in binary

 x > 0 and Length = floor(log2 x) + 1
 e.g., 9 is represented as <000, 1001>
 The g-code for x takes 2*floor(log2 x) + 1 bits (i.e., a factor of 2 from the optimal log2 x)
 Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers
 It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

   0001000001100110000011101100111

Answer: 8, 6, 3, 59, 7
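A small C sketch that decodes the exercise's bit string with the g-code rule above (count the leading zeroes, then read that many bits plus one); it prints 8 6 3 59 7.

/* g-code (Elias gamma) decoding of a bit string. */
#include <stdio.h>
#include <string.h>

int main(void) {
    const char *bits = "0001000001100110000011101100111";
    size_t i = 0, n = strlen(bits);
    while (i < n) {
        int len = 0;
        while (bits[i] == '0') { len++; i++; }   /* Length-1 leading zeroes */
        long x = 0;
        for (int j = 0; j <= len; j++, i++)      /* read Length = len+1 bits */
            x = 2*x + (bits[i] - '0');
        printf("%ld ", x);
    }
    printf("\n");
    return 0;
}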

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code g(i).
Recall that |g(i)| ≤ 2*log2 i + 1.
How good is this approach wrt Huffman?  Compression ratio ≤ 2*H0(S) + 1.
Key fact:  1 ≥ Σ_{j=1,...,i} pj ≥ i*pi, hence i ≤ 1/pi.

How good is it ?
Encode the integers via g-coding: |g(i)| ≤ 2*log2 i + 1.
The cost of the encoding is (recall i ≤ 1/pi):

   Σ_{i=1,...,|S|} pi * |g(i)|  ≤  Σ_{i=1,...,|S|} pi * [ 2*log2(1/pi) + 1 ]  =  2*H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ...

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
The distribution of words is skewed: 1/i^q, where 1 < q < 2.

 A new concept: Continuers vs Stoppers. The main idea is:
   s + c = 256 (we are playing with 8 bits)
   Thus s items are encoded with 1 byte
   And s*c items with 2 bytes, s*c^2 with 3 bytes, ...
   Previously we used: s = c = 128

An example




5000 distinct words
 ETDC encodes 128 + 128^2 = 16512 words on at most 2 bytes
 A (230,26)-dense code encodes 230 + 230*26 = 6210 words on at most 2 bytes, hence more words fit on 1 byte, which pays off if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, assuming c = 256 - s.

 Brute-force approach
 Binary search: on real distributions there seems to be a unique minimum
   K_s = max codeword length
   F_s^k = cumulative probability of the symbols whose |codeword| ≤ k

Experiments: (s,c)-DC is quite interesting…

Search is 6% faster than byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory (the current list).
Properties:
 Exploits temporal locality, and it is dynamic
 X = 1^n 2^n 3^n ... n^n  =>  Huff = O(n^2 log n) bits, MTF = O(n log n) + n^2 bits

Not much worse than Huffman
...but it may be far better
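A minimal C sketch of the MTF coder over the byte alphabet, following exactly the two steps above; the input string is illustrative.

/* Move-to-Front coding: emit each symbol's current position, then move it to the front. */
#include <stdio.h>
#include <string.h>

int main(void) {
    unsigned char list[256];
    for (int i = 0; i < 256; i++) list[i] = (unsigned char)i;   /* L = [0,1,2,...,255] */
    const unsigned char *in = (const unsigned char *)"aabbbaac"; /* illustrative input */
    size_t n = strlen((const char *)in);
    for (size_t k = 0; k < n; k++) {
        int pos = 0;
        while (list[pos] != in[k]) pos++;                /* 1) output the position of s in L */
        printf("%d ", pos);
        for (; pos > 0; pos--) list[pos] = list[pos-1];  /* 2) move s to the front of L */
        list[0] = in[k];
    }
    printf("\n");
    return 0;
}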

MTF: how good is it ?
Encode the integers via g-coding: |g(i)| ≤ 2*log2 i + 1.
Put the alphabet S in front, and consider the cost of encoding: for each symbol x, its i-th occurrence is encoded as the gap from the previous occurrence:

   O(|S| log |S|)  +  Σ_{x=1,...,|S|} Σ_{i=2,...,nx} | g( p_i^x - p_(i-1)^x ) |

By Jensen's inequality this is:

   ≤  O(|S| log |S|)  +  Σ_{x=1,...,|S|} nx * [ 2*log2(N/nx) + 1 ]
   =  O(|S| log |S|)  +  N * [ 2*H0(X) + 1 ]

Hence  La[mtf] ≤ 2*H0(X) + O(1).

MTF: higher compression
To achieve higher compression we consider words (and separators) as the symbols to be encoded.
How to keep the MTF-list efficiently:
 Search tree
   Leaves contain the symbols, ordered as in the MTF-list
   Nodes contain the size of their descending subtree
 Hash table
   key is a symbol
   data is a pointer to the corresponding tree leaf

Each tree operation takes O(log |S|) time.
Total is O(n log |S|), where n = #symbols to be compressed.

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:
 Exploits spatial locality, and it is a dynamic code
 There is a memory
 X = 1^n 2^n 3^n ... n^n  =>  Huff(X) = Θ(n^2 log n)  >  Rle(X) = Θ(n log n)
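A minimal C run-length encoder reproducing the slide's example output (a,1)(b,3)(a,2)(c,4)(a,1).

/* Run-Length Encoding of a character string. */
#include <stdio.h>
#include <string.h>

int main(void) {
    const char *X = "abbbaacccca";
    size_t n = strlen(X), i = 0;
    while (i < n) {
        size_t run = 1;
        while (i + run < n && X[i + run] == X[i]) run++;   /* extend the current run */
        printf("(%c,%zu)", X[i], run);
        i += run;
    }
    printf("\n");
    return 0;
}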

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.  p(a) = .2, p(b) = .5, p(c) = .3

   f(i) = Σ_{j=1,...,i-1} p(j)        so  f(a) = .0, f(b) = .2, f(c) = .7

[Figure: the interval [0,1) partitioned into a = [0,.2), b = [.2,.7), c = [.7,1)]

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7)).

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
[Figure: the nested intervals produced while coding b, a, c:  [0,1) → [.2,.7) → [.2,.3) → [.27,.3)]

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c1 c2 ... cn with probabilities p[c], use:

   l0 = 0        li = l(i-1) + s(i-1) * f[ ci ]
   s0 = 1        si = s(i-1) * p[ ci ]

f[c] is the cumulative probability up to symbol c (not included).
The final interval size is

   sn = Π_{i=1,...,n} p[ ci ]

The interval for a message sequence will be called the sequence interval.
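A small C sketch of the interval computation above, with the slides' model p(a)=.2, p(b)=.5, p(c)=.3; for the message bac it prints the sequence interval [0.27, 0.30). A real coder would then emit a dyadic number inside that interval, using the integer scaling described later.

/* Sequence-interval computation for arithmetic coding. */
#include <stdio.h>

int main(void) {
    double p[3] = {0.2, 0.5, 0.3};   /* p(a), p(b), p(c) */
    double f[3] = {0.0, 0.2, 0.7};   /* cumulative probability up to the symbol, excluded */
    const char *msg = "bac";
    double l = 0.0, s = 1.0;
    for (const char *c = msg; *c; c++) {
        int i = *c - 'a';
        l = l + s * f[i];            /* l_i = l_(i-1) + s_(i-1) * f[c_i] */
        s = s * p[i];                /* s_i = s_(i-1) * p[c_i]           */
    }
    printf("sequence interval = [%.4f, %.4f)\n", l, l + s);
    return 0;
}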

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
[Figure: .49 falls in b's interval [.2,.7); within it, in b's sub-interval [.3,.55); within that, in c's sub-interval [.475,.55)]

The message is bbc.

Representing a real number
Binary fractional representation:

   .75 = .11        1/3 = .01010101...        11/16 = .1011

Algorithm:
 1. x = 2*x
 2. If x < 1, output 0
 3. else x = x - 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
e.g.  [0,.33) → .01      [.33,.66) → .1      [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions.

   code     min      max       interval
   .11      .110     .111      [.75, 1.0)
   .101     .1010    .1011     [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

[Figure: the sequence interval [.61, .79) contains the code interval of .101, i.e. [.625, .75)]

Can use l + s/2 truncated to 1 + ceil( log2 (1/s) ) bits.

Bound on Arithmetic length
Note that -log2 s + 1 = log2 (2/s).

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

   1 + ceil( log2 (1/s) )
   = 1 + ceil( log2 Π_i (1/p_i) )
   ≤ 2 + Σ_{i=1,...,n} log2 (1/p_i)
   = 2 + Σ_{k=1,...,|S|} n*p_k * log2 (1/p_k)
   = 2 + n * H0   bits

In practice nH0 + 0.02 n bits, because of rounding.

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R = 2^k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)

If l ≥ R/2 (top half):
   Output 1 followed by m 0s;  m = 0
   Message interval is expanded by 2

If u < R/2 (bottom half):
   Output 0 followed by m 1s;  m = 0
   Message interval is expanded by 2

If l ≥ R/4 and u < 3R/4 (middle half):
   Increment m
   Message interval is expanded by 2

All other cases: just continue...

You find this at

Arithmetic ToolBox
As a state machine: fed with the current interval (L,s), a symbol c and its distribution (p1,....,pS), the ATB produces the new interval (L',s').

[Figure: ATB state machine mapping (L,s) to (L',s')]

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
[Figure: the ATB is driven by p[ s | context ], where s is either a character c or the escape symbol esc; it maps the interval (L,s) to (L',s')]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts          String = ACCBACCACBA B,  k = 2

   Context        Counts
   Empty          A = 4   B = 2   C = 5   $ = 3

   A              C = 3   $ = 1
   B              A = 2   $ = 1
   C              A = 1   B = 2   C = 2   $ = 3

   AC             B = 1   C = 2   $ = 2
   BA             C = 1   $ = 1
   CA             C = 1   $ = 1
   CB             A = 2   $ = 1
   CC             A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm's step:
 Output <d, len, c> where
   d = distance of the copied string wrt the current position
   len = length of the longest match
   c = next char in the text beyond the longest match
 Advance by len + 1

A buffer "window" has fixed length and moves.

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
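A minimal C sketch of the LZW encoder: the dictionary starts with the 256 single characters and grows with S·c after every emitted phrase (so the ids use real ASCII codes and differ from the slide's toy numbering, and the decoder, not shown, stays one step behind). A linear dictionary search keeps the sketch short; real coders use a trie or a hash table.

/* LZW encoder sketch: emit phrase ids, never the extra character. */
#include <stdio.h>
#include <string.h>

#define MAXD 512
#define MAXL 32

static char dict[MAXD][MAXL];
static int  dlen[MAXD];

int main(void) {
    const char *T = "aabaacababacb";                  /* the slide's text */
    size_t n = strlen(T), i = 0;
    int ndict = 0;
    for (int c = 0; c < 256; c++) {                   /* 256 one-char entries */
        dict[ndict][0] = (char)c; dlen[ndict] = 1; ndict++;
    }
    while (i < n) {
        int best = -1, bestlen = 0;
        for (int d = 0; d < ndict; d++)               /* longest dictionary match at i */
            if (dlen[d] > bestlen && i + dlen[d] <= n && memcmp(T + i, dict[d], dlen[d]) == 0)
                { best = d; bestlen = dlen[d]; }
        printf("%d ", best);                          /* output the phrase id */
        if (i + bestlen < n && ndict < MAXD && bestlen + 1 < MAXL) {
            memcpy(dict[ndict], T + i, bestlen + 1);  /* add S·c to the dictionary */
            dlen[ndict] = bestlen + 1; ndict++;
        }
        i += bestlen;                                 /* LZW: the char c is NOT consumed */
    }
    printf("\n");
    return 0;
}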

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform (1994)
We are given a text T = mississippi#. Take all its cyclic rotations (the first one spells T):

   mississippi#
   ississippi#m
   ssissippi#mi
   sissippi#mis
   issippi#miss
   ssippi#missi
   sippi#missis
   ippi#mississ
   ppi#mississi
   pi#mississip
   i#mississipp
   #mississippi

Sort the rows; F is the first column, L the last one:

   F              L
   # mississipp  i
   i #mississip  p
   i ppi#missis  s
   i ssippi#mis  s
   i ssissippi#  m
   m ississippi  #
   p i#mississi  p
   p pi#mississ  i
   s ippi#missi  s
   s issippi#mi  s
   s sippi#miss  i
   s sissippi#m  i

L is the BWT of T.

A famous example

Much
longer...

A useful tool: the L → F mapping

How do we map L's chars onto F's chars? ... We need to distinguish equal chars in F...
Take two equal chars of L and rotate their rows rightward by one position: the two rows remain sorted, so they keep the same relative order. Hence the i-th occurrence of a char in L corresponds to the i-th occurrence of that char in F.

The BWT is invertible
[Figure: the sorted rows of the BWT matrix with only the columns F and L shown; the characters in between are unknown to the decoder]

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
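A C sketch of InvertBWT via the LF mapping, assuming '#' is the sentinel and the lexicographically smallest character; run on the example column L = ipssm#pissii it prints mississippi#.

/* BWT inversion: LF[i] = C[L[i]] + (rank of L[i] among its equals in L[0..i-1]). */
#include <stdio.h>
#include <string.h>

int main(void) {
    const char *L = "ipssm#pissii";                     /* the slides' example */
    int n = strlen(L);
    int cnt[256] = {0}, C[256] = {0}, seen[256] = {0}, LF[64];
    char out[64];
    for (int i = 0; i < n; i++) cnt[(unsigned char)L[i]]++;
    for (int c = 1; c < 256; c++) C[c] = C[c-1] + cnt[c-1];   /* C[c] = #chars smaller than c */
    for (int i = 0; i < n; i++) {                             /* build the LF-array */
        unsigned char c = (unsigned char)L[i];
        LF[i] = C[c] + seen[c];
        seen[c]++;
    }
    int r = 0;                       /* row 0 is the rotation starting with '#' */
    out[n] = '\0';
    out[n-1] = '#';
    for (int i = n - 2; i >= 0; i--) {   /* reconstruct T backward: T[i] = L[r]; r = LF[r] */
        out[i] = L[r];
        r = LF[r];
    }
    printf("%s\n", out);
    return 0;
}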

How to compute the BWT ?

   SA    BWT matrix      L
   12    #mississipp     i
   11    i#mississip     p
    8    ippi#missis     s
    5    issippi#mis     s
    2    ississippi#     m
    1    mississippi     #
   10    pi#mississi     p
    9    ppi#mississ     i
    7    sippi#missi     s
    4    sissippi#mi     s
    6    ssippi#miss     i
    3    ssissippi#m     i

We said that L[i] precedes F[i] in T; e.g. L[3] = T[7].
Given SA and T, we have L[i] = T[SA[i] - 1].

How to construct SA from T ?     Input: T = mississippi#

   SA
   12   #
   11   i#
    8   ippi#
    5   issippi#
    2   ississippi#
    1   mississippi#
   10   pi#
    9   ppi#
    7   sippi#
    4   sissippi#
    6   ssippi#
    3   ssissippi#

Elegant but inefficient (comparison-based sorting of the suffixes). Obvious inefficiencies:
 • Θ(n^2 log n) time in the worst case
 • Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion pages available (Google, 7/2008)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)
  Set of nodes such that from any node you can reach any other node via an undirected path.

Strongly connected components (SCC)
  Set of nodes such that from any node you can reach any other node via a directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by humans.

Exploit the structure of the Web for:









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…

 Physical network graph
   V = routers
   E = communication links

 The "cosine" graph (undirected, weighted)
   V = static web pages
   E = semantic distance between pages

 Query-Log graph (bipartite, weighted)
   V = queries and URLs
   E = (q,u) if u is a result for q and has been clicked by some user who issued q

 Social graph (undirected, unweighted)
   V = users
   E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution
Indegree follows a power law distribution (Altavista crawl 1999, WebBase crawl 2001):

   Pr[ in-degree(u) = k ]  ∝  1 / k^a ,    a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph

[Figure: adjacency-matrix view of a web snapshot of 21 million pages and 150 million links]

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = { s1 - x, s2 - s1 - 1, ..., sk - s(k-1) - 1 }

For negative entries (only the first gap s1 - x can be negative): map v ≥ 0 to 2v and v < 0 to 2|v| - 1, as in the examples below.
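A small C sketch of this gap encoding on an illustrative successor list of node 15 (the list itself is made up, but the negative-gap mapping matches the worked values shown a few slides below, e.g. 3 = |13-15|*2-1).

/* Gap encoding of an adjacency (successor) list, WebGraph style. */
#include <stdio.h>

int main(void) {
    int x = 15;                                           /* source node (illustrative)   */
    int succ[] = {13, 15, 16, 17, 18, 19, 23, 24, 203};   /* sorted successors (made up)  */
    int k = 9;
    for (int i = 0; i < k; i++) {
        int gap = (i == 0) ? succ[0] - x : succ[i] - succ[i-1] - 1;
        if (i == 0) gap = (gap >= 0) ? 2*gap : 2*(-gap) - 1;   /* v-mapping of the first gap */
        printf("%d ", gap);
    }
    printf("\n");
    return 0;
}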

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y's copy-list tells whether the corresponding successor of the reference x is also a successor of y;
The reference index is chosen in [0,W] so as to give the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Exploit consecutivity among the extra-nodes (the successors not copied from the reference):

 Intervals: use their left extreme and length
 Interval length: decremented by Lmin = 2
 Residuals: differences between consecutive residuals, or wrt the source node

   0 = (15-15)*2        (positive)
   2 = (23-19)-2        (jump >= 2)
   600 = (316-16)*2
   3 = |13-15|*2-1      (negative)
   3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background

[Figure: sender and receiver connected by a network link; the receiver already holds some knowledge about the data]

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization

 Delta compression   [diff, zdelta, REBL,…]
   Compress file f deploying file f'
   Compress a group of files
   Speed-up web access by sending differences between the requested page and the ones available in cache

 File synchronization   [rsync, zsync]
   Client updates old file f_old with f_new available on a server
   Mirroring, Shared Crawling, Content Distribution Networks

 Set reconciliation
   Client updates structured old file f_old with f_new available on a server
   Update of contacts or appointments, intersect inverted lists in a P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides an efficient, optimal solution:
   f_known is the "previously encoded text"; compress the concatenation f_known·f_new, starting from f_new.

zdelta is one of the best implementations:

              Emacs size    Emacs time
   uncompr    27Mb          ---
   gzip       8Mb           35 secs
   zdelta     1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

[Figure: Client ↔ slow link (delta-encoded pages) ↔ Proxy ↔ fast link ↔ web; both sides of the slow link keep a reference version of the requested page]

Use zdelta to reduce traffic:
 Old version available at both proxies
 Restricted to pages already visited (30% hits), URL-prefix match
 Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: example graph over the files, plus a dummy node 0 connected to all of them; edge weights are zdelta (or gzip) sizes, and the min branching picks the cheapest reference for each file]

              space    time
   uncompr    30Mb     ---
   tgz        20%      linear
   THIS       8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly: n^2 edge calculations (zdelta executions)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G'_F, thus saving zdelta executions. Nonetheless, strictly n^2 time.

              space    time
   uncompr    260Mb    ---
   tgz        12%      2 mins
   THIS       8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
[Figure: the Client holds f_old and sends a request; the Server holds f_new and sends back an update]

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
[Figure: the Client sends the block hashes of f_old; the Server, which holds f_new, sends back the encoded file]

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

              gcc size    emacs size
   total      27288       27326
   gzip       7563        8577
   zdelta     227         1431
   rsync      964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync
The server sends the hashes (unlike rsync, where the client does), and the client checks them.
The server deploys the common f_ref to compress the new f_tar (rsync compresses just it).

A multi-round protocol
 k blocks of n/k elements
 log(n/k) levels
 If the distance is k, then on each level at most k hashes do not find a match in the other file.
 The communication complexity is O(k log n log(n/k)) bits.

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

[Figure: the pattern P aligned at position i of T, i.e. P is a prefix of the suffix T[i,N]]

Occurrences of P in T = all suffixes of T having P as a prefix
   e.g. P = si, T = mississippi:  occurrences at positions 4 and 7

SUF(T) = sorted set of suffixes of T

Reduction: from substring search to prefix search (over SUF(T)).

The Suffix Tree

[Figure: the compacted trie of all the suffixes of T# = mississippi#; edge labels are substrings of T# (e.g. mississippi#, i, #, s, si, ssi, ppi#, pi#) and the 12 leaves store the starting positions of the suffixes.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

Storing SUF(T) explicitly takes Θ(N^2) space; the suffix array SA stores just the suffix pointers.  (T = mississippi#, P = si)

   SA    SUF(T)
   12    #
   11    i#
    8    ippi#
    5    issippi#
    2    ississippi#
    1    mississippi#
   10    pi#
    9    ppi#
    7    sippi#
    4    sissippi#
    6    ssippi#
    3    ssissippi#

Suffix Array space:
 • SA: Θ(N log2 N) bits
 • Text T: N chars
 => In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step (one to SA, one to T).

[Figure: binary search for P = si over the SA of T = mississippi#; each comparison tells whether P is larger or smaller than the probed suffix]

Suffix Array search
 • O(log2 N) binary-search steps
 • Each step takes O(p) char comparisons
 => overall, O(p log2 N) time
    improved to O(p + log2 N)  [Manber-Myers, '90]  and to O(p + log2 |S|)  [Cole et al, '06]
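A C sketch of the indirect binary search above: it finds the SA range of the suffixes prefixed by P and lists their starting positions, using the (1-based) suffix array of T = mississippi# hard-coded from the slides; for P = si it prints occ = 2 and positions 7 and 4.

/* Pattern search over a suffix array by indirect binary search. */
#include <stdio.h>
#include <string.h>

int main(void) {
    const char *T = "mississippi#";
    int SA[] = {12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3};   /* 1-based suffix positions */
    int n = 12;
    const char *P = "si";
    int m = strlen(P);
    int lo = 0, hi = n;
    while (lo < hi) {                                      /* first suffix >= P on its first m chars */
        int mid = (lo + hi) / 2;
        if (strncmp(T + SA[mid] - 1, P, m) < 0) lo = mid + 1; else hi = mid;
    }
    int first = lo;
    lo = first; hi = n;
    while (lo < hi) {                                      /* first suffix strictly > P on its first m chars */
        int mid = (lo + hi) / 2;
        if (strncmp(T + SA[mid] - 1, P, m) <= 0) lo = mid + 1; else hi = mid;
    }
    printf("occ = %d:", lo - first);
    for (int i = first; i < lo; i++) printf(" %d", SA[i]);
    printf("\n");
    return 0;
}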

Locating the occurrences

[Figure: the range of SA containing the suffixes prefixed by P = si, namely sippi… (position 7) and sissippi… (position 4); occ = 2]

Suffix Array search: O(p + log2 N + occ) time
 Suffix Trays: O(p + log2 |S| + occ)   [Cole et al., '06]
 String B-tree   [Ferragina-Grossi, '95]
 Self-adjusting Suffix Arrays   [Ciriani et al., '02]

Text mining
Lcp[1, N-1] = longest common prefix between suffixes adjacent in SA.

T = mississippi#

   SA:    12  11   8   5   2   1  10   9   7   4   6   3
   Lcp:      0   1   1   4   0   0   1   0   2   1   3

   e.g. Lcp = 4 between issippi# and ississippi# (the suffixes starting at positions 5 and 2)

 • How long is the common prefix between T[i,...] and T[j,...] ?
   It is the min of the subarray Lcp[h,k-1] such that SA[h]=i and SA[k]=j.
 • Does there exist a repeated substring of length ≥ L ?
   Search for some Lcp[i] ≥ L.
 • Does there exist a substring of length ≥ L occurring ≥ C times ?
   Search for a run Lcp[i, i+C-2] whose entries are all ≥ L.
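A small C sketch of the last two tests, run on the Lcp array of T = mississippi# shown above; the thresholds L and C are illustrative parameters.

/* Lcp-based text-mining tests on the suffix array of mississippi#. */
#include <stdio.h>

int main(void) {
    int Lcp[] = {0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3};   /* Lcp[1, N-1], N = 12 */
    int n = 11, L = 3, C = 2;                        /* illustrative thresholds */
    int rep = 0, times = 0, run = 0;
    for (int i = 0; i < n; i++) {
        if (Lcp[i] >= L) rep = 1;                    /* some Lcp[i] >= L: repeat of length >= L */
        run = (Lcp[i] >= L) ? run + 1 : 0;           /* look for C-1 consecutive entries >= L   */
        if (run >= C - 1) times = 1;
    }
    printf("repeated substring of length >= %d: %s\n", L, rep ? "yes" : "no");
    printf("substring of length >= %d occurring >= %d times: %s\n", L, C, times ? "yes" : "no");
    return 0;
}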


Slide 26

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P = 101111
q = 7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally:
1·2 (mod 7) + 0 = 2
2·2 (mod 7) + 1 = 5
5·2 (mod 7) + 1 = 4
4·2 (mod 7) + 1 = 2
2·2 (mod 7) + 1 = 5
5 (mod 7) = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1):
2^m (mod q) = 2·(2^(m-1) (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
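A minimal Python sketch of the Karp-Rabin scan over a binary text (the modulus q is fixed here just for illustration; the real algorithm picks a random prime below a bound I, and the verification step below makes this the deterministic variant):

def karp_rabin(T, P, q=2**31 - 1):
    # T and P are strings over {0,1}; q is a prime modulus
    n, m = len(T), len(P)
    if m > n:
        return []
    two_m = pow(2, m, q)                      # 2^m (mod q), used by the sliding update
    hp = ht = 0
    for i in range(m):                        # Hq(P) and Hq(T1) via Horner's rule
        hp = (2 * hp + int(P[i])) % q
        ht = (2 * ht + int(T[i])) % q
    occ = []
    for r in range(n - m + 1):
        if ht == hp and T[r:r + m] == P:      # verify: "definite match" variant
            occ.append(r + 1)                 # 1-based position, as in the examples above
        if r + m < n:                         # Hq(T_{r+1}) from Hq(T_r)
            ht = (2 * ht - two_m * int(T[r]) + int(T[r + m])) % q
    return occ

# karp_rabin("10110101", "0101") -> [5]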

Problem 1: Solution
Dictionary = {bzip, not, or, space}
P = bzip = 1a 0b
S = “bzip or not bzip”

[Figure: scan C(S) codeword by codeword (codewords are byte-aligned and tagged) and compare each one against the codeword of P: a yes/no answer per codeword.]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

[Figure: the 3×10 matrix M for T = california, P = for: M(1,5) = 1 (“f”), M(2,6) = 1 (“fo”), M(3,7) = 1 (“for”), all other entries are 0.]

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to compute the j-th column of M from the (j-1)-th one.
We define the m-length binary vector U(x) for each character x in the alphabet: U(x) is set to 1 for the positions in P where character x appears.
Example:
P = abaac
U(a) = (1,0,1,1,0)ᵀ   U(b) = (0,1,0,0,0)ᵀ   U(c) = (0,0,0,0,1)ᵀ

How to construct M



Initialize column 0 of M to all zeros.
For j > 0, the j-th column is obtained by

M(j) = BitShift(M(j-1)) & U(T[j])

For i > 1, entry M(i,j) = 1 iff
(1) the first i-1 characters of P match the i-1 characters of T ending at character j-1  ⇔  M(i-1,j-1) = 1
(2) P[i] = T[j]  ⇔  the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) into the i-th position;
AND this with the i-th bit of U(T[j]) to establish whether both conditions hold.
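A small Python sketch of this recurrence; here each column M(j) is kept as an integer whose bit i-1 represents row i (Python integers play the role of the machine word, so the m ≤ w restriction disappears):

def shift_and(T, P):
    m = len(P)
    U = {}                                  # U[x]: bit (i-1) set iff P[i] == x
    for i, x in enumerate(P):
        U[x] = U.get(x, 0) | (1 << i)
    occ = []
    M = 0                                   # column 0: all zeros
    for j, c in enumerate(T, start=1):
        # BitShift: shift down by one and set the first bit to 1
        M = ((M << 1) | 1) & U.get(c, 0)
        if M & (1 << (m - 1)):              # M(m, j) = 1: occurrence ends at j
            occ.append(j - m + 1)
    return occ

# shift_and("xabxabaaca", "abaac") -> [5]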

An example (j = 1, 2, 3, …, 9)
T = xabxabaaca, P = abaac

[Figure: the columns of M computed one after the other via M(j) = BitShift(M(j-1)) & U(T[j]). At j = 9, M(5,9) = 1, i.e. P[1…5] = T[5…9]: an occurrence of P ending at position 9.]

Shift-And method: Complexity








If m ≤ w, any column and any vector U() fit in a memory word.
 Any step requires O(1) time.
If m > w, any column and any vector U() can be divided into ⌈m/w⌉ memory words.
 Any step requires O(m/w) time.
Overall O(n(1 + m/w) + m) time.
Thus, it is very fast when the pattern length is close to the word size.
 Very often the case in practice. Recall that w = 64 bits in modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: Another solution
Dictionary = {bzip, not, or, space}
P = bzip = 1a 0b
S = “bzip or not bzip”

[Figure: an alternative way to search the codeword of P = bzip directly in C(S), answering yes/no at codeword boundaries.]

Speed ≈ Compression ratio

Problem 2
Dictionary = {bzip, not, or, space}

Given a pattern P, find all the occurrences in S of all terms containing P as a substring.

P = o
S = “bzip or not bzip”
The terms containing P are: not = 1g 0g 0a and or = 1g 0a 0b

[Figure: C(S) is scanned once per matching term, reporting yes/no at codeword boundaries.]

Speed ≈ Compression ratio? No! Why?
A scan of C(S) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: the text T with the occurrences of the patterns P1 and P2 marked.]
 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

S is the concatenation of the patterns in P.
R is a bitmap of length m:
 R[i] = 1 iff S[i] is the first symbol of a pattern.

Use a variant of the Shift-And method searching for S:
 For any symbol c, U’(c) = U(c) AND R
  U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
 For any step j,
  compute M(j)
  then M(j) OR U’(T[j]). Why?
  It sets to 1 the first bit of each pattern that starts with T[j]
 Check if there are occurrences ending in j. How? (see the sketch below)
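A possible Python rendering of this multi-pattern variant, under the assumptions above (S is the concatenation of the patterns, R marks their first positions; an extra map F of their last positions, introduced here only for the final check, tells which pattern ends at each bit):

def multi_shift_and(T, patterns):
    S = "".join(patterns)
    U, R, F = {}, 0, {}                 # F: "last bit of a pattern" -> pattern index
    pos = 0
    for idx, p in enumerate(patterns):
        R |= 1 << pos                   # first symbol of pattern idx
        F[pos + len(p) - 1] = idx       # last symbol of pattern idx
        for k, x in enumerate(p):
            U[x] = U.get(x, 0) | (1 << (pos + k))
        pos += len(p)
    occ = []
    M = 0
    for j, c in enumerate(T, start=1):
        Uc = U.get(c, 0)
        M = ((M << 1) & Uc) | (Uc & R)  # normal step, then turn on pattern-start bits
        for bit, idx in F.items():      # any pattern ending at position j?
            if M & (1 << bit):
                occ.append((j - len(patterns[idx]) + 1, patterns[idx]))
    return occ

# multi_shift_and("abcacab", ["ab", "ca"]) -> [(1,'ab'), (3,'ca'), (5,'ca'), (6,'ab')]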

Problem 3
Dictionary = {bzip, not, or, space}

Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing at most k mismatches.

P = bot, k = 2
S = “bzip or not bzip”

[Figure: the codewords of the dictionary terms and C(S); e.g. “not” matches P = bot with one mismatch.]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position 4; it also occurs with 4 mismatches starting at position 2.

aatatccacaa        aatatccacaa
   atcgaa            atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Mˡ(i,j) = 1 iff there are no more than l mismatches between the first i characters of P and the i characters of T ending at character j.

What is M⁰?
How does Mᵏ solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Mˡ: case 1

The first i-1 characters of P match a substring of T ending at j-1 with at most l mismatches, and the next pair of characters in P and T are equal.

BitShift(Mˡ(j-1)) & U(T[j])

Computing Mˡ: case 2

The first i-1 characters of P match a substring of T ending at j-1 with at most l-1 mismatches (so P[i] is allowed to mismatch T[j]).

BitShift(Mˡ⁻¹(j-1))

Computing Mˡ

We compute Mˡ for all l = 0, …, k.
For each j compute M⁰(j), M¹(j), …, Mᵏ(j).
For all l, initialize Mˡ(0) to the zero vector.

In order to compute Mˡ(j), we observe that there is a match iff

Mˡ(j) = [ BitShift(Mˡ(j-1)) & U(T[j]) ]  OR  BitShift(Mˡ⁻¹(j-1))
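A compact Python sketch of this k-mismatch recurrence (bit i-1 of each integer column represents row i, as in the Shift-And sketch above):

def agrep_mismatches(T, P, k):
    m = len(P)
    U = {}
    for i, x in enumerate(P):
        U[x] = U.get(x, 0) | (1 << i)
    occ = []
    M = [0] * (k + 1)                       # M[l] = column M^l(j), initially j = 0
    for j, c in enumerate(T, start=1):
        Uc = U.get(c, 0)
        prev = M[:]                         # the columns M^l(j-1)
        for l in range(k + 1):
            col = ((prev[l] << 1) | 1) & Uc            # case 1: P[i] = T[j]
            if l > 0:
                col |= (prev[l - 1] << 1) | 1          # case 2: spend one mismatch
            M[l] = col
        if M[k] & (1 << (m - 1)):
            occ.append(j - m + 1)           # P occurs with <= k mismatches ending at j
    return occ

# agrep_mismatches("aatatccacaa", "atcgaa", 2) -> [4]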

Example M¹
T = xabxabaaca, P = abaad

[Tables of M⁰ and M¹ for all columns j = 1,…,10. In M⁰ no row-5 entry is ever 1 (P never occurs exactly), while M¹(5,9) = 1: P occurs with at most 1 mismatch ending at position 9 (T[5…9] = abaac vs abaad).]

How much do we pay?

The running time is O(k·n·(1 + m/w)).
Again, the method is practically efficient for small m.
Only O(k) columns of M are needed at any given time; hence the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary = {bzip, not, or, space}

Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing k mismatches.

P = bot, k = 2
S = “bzip or not bzip”

[Figure: “not” (= 1g 0g 0a) matches P within k mismatches, so its occurrences are found by scanning C(S) for its codeword.]

Agrep: more sophisticated operations

The Shift-And method can solve other ops.

The edit distance between two strings p and s is d(p,s) = the minimum number of operations needed to transform p into s via three ops:
 Insertion: insert a symbol in p
 Deletion: delete a symbol from p
 Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding

γ(x) = 000…0 (Length-1 zeros) followed by x in binary,
where x > 0 and Length = ⌊log₂ x⌋ + 1
e.g., 9 is represented as <000,1001>.

The γ-code for x takes 2⌊log₂ x⌋ + 1 bits
(i.e. a factor of 2 from optimal)

Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of γ-coded integers, reconstruct the original sequence:

0001000001100110000011101100111

8   6   3   59   7
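A tiny Python sketch of γ-encoding/decoding, which can be used to check the exercise above:

def gamma_encode(x):
    # x > 0: (Length-1) zeros followed by x in binary, Length = floor(log2 x) + 1
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":                # count the unary part
            zeros += 1
            i += 1
        out.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return out

# gamma_decode("0001000001100110000011101100111") -> [8, 6, 3, 59, 7]
# "".join(gamma_encode(x) for x in [8, 6, 3, 59, 7]) reproduces the bit string above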

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).

Recall that: |γ(i)| ≤ 2·log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2·H0(s) + 1
Key fact:
1 ≥ Σ_{i=1,...,x} pi ≥ x·px   ⟹   x ≤ 1/px

How good is it ?
Encode the integers via γ-coding:
|γ(i)| ≤ 2·log i + 1
The cost of the encoding is (recall i ≤ 1/pi):

Σ_{i=1,...,|S|} pi · |γ(i)|  ≤  Σ_{i=1,...,|S|} pi · [2·log(1/pi) + 1]  =  2·H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding

Byte-aligned and tagged Huffman
 128-ary Huffman tree
 First bit of the first byte is tagged
 Configurations on 7 bits: just those of Huffman

End-tagged dense code
 The rank r is mapped to the r-th binary sequence on 7·k bits
 First bit of the last byte is tagged

A better encoding — surprising changes
 It is a prefix-code
 Better compression: it uses all 7-bit configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2

A new concept: Continuers vs Stoppers
 Previously we used: s = c = 128

The main idea is:
 s + c = 256 (we are playing with 8 bits)
 Thus s items are encoded with 1 byte
 And s·c with 2 bytes, s·c² with 3 bytes, ...

An example
 5000 distinct words
 ETDC encodes 128 + 128² = 16512 words within 2 bytes
 A (230,26)-dense code encodes 230 + 230·26 = 6210 words within 2 bytes, hence more on 1 byte, and thus it wins if the distribution is skewed...
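A Python sketch of one possible (s,c)-dense assignment. The exact byte layout is a convention I am assuming here (last byte = stopper value in [0,s), earlier bytes = continuer values in [s,256)); it is only one of the concrete choices compatible with the scheme above:

def sc_dense_encode(rank, s):
    # rank: 0-based position of the word in the frequency-sorted list; c = 256 - s
    c = 256 - s
    k, first = 1, 0                       # there are s * c^(k-1) codewords of k bytes
    while rank >= first + s * c ** (k - 1):
        first += s * c ** (k - 1)
        k += 1
    r = rank - first
    byts = [r % s]                        # last byte: a stopper value in [0, s)
    r //= s
    for _ in range(k - 1):
        byts.append(s + (r % c))          # earlier bytes: continuer values in [s, 256)
        r //= c
    return bytes(reversed(byts))

# With s = 230 (c = 26): ranks 0..229 take 1 byte, ranks 230..6209 take 2 bytes,
# matching the 230 + 230*26 = 6210 words of the example above.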

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC is quite interesting…

Search is 6% faster than byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer sequence, that can then be var-length coded.

 Start with the list of symbols L = [a,b,c,d,…]
 For each input symbol s
  1) output the position of s in L
  2) move s to the front of L

There is a memory
Properties:
 Exploits temporal locality, and it is dynamic
 X = 1ⁿ2ⁿ3ⁿ…nⁿ  ⟹  Huff = O(n² log n), MTF = O(n log n) + n²

Not much worse than Huffman
...but it may be far better
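A direct Python sketch of the MTF transform (positions are emitted 1-based, so they can be fed to the γ-coder above):

def mtf_encode(text, alphabet):
    L = list(alphabet)                 # initial MTF list, e.g. the sorted symbols
    out = []
    for s in text:
        i = L.index(s)                 # position of s in L
        out.append(i + 1)              # 1-based, ready for gamma-coding
        L.pop(i)
        L.insert(0, s)                 # move s to the front
    return out

def mtf_decode(codes, alphabet):
    L = list(alphabet)
    out = []
    for i in codes:
        s = L[i - 1]
        out.append(s)
        L.pop(i - 1)
        L.insert(0, s)
    return "".join(out)

# mtf_encode("abba", "abcd") -> [1, 2, 1, 2]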

MTF: how good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1
Put S in front of the sequence and consider the cost of encoding:

O(|S| log |S|) + Σ_{x=1,...,|S|} Σ_{i=2,...,n_x} |γ(pᵢˣ - pᵢ₋₁ˣ)|

By Jensen’s inequality this is:

≤ O(|S| log |S|) + Σ_{x=1,...,|S|} n_x · [2·log(N/n_x) + 1]
= O(|S| log |S|) + N·[2·H0(X) + 1]

L_a[mtf] ≤ 2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and separators) as symbols to be encoded.
How to keep the MTF-list efficiently:

 Search tree
  Leaves contain the symbols, ordered as in the MTF-list
  Nodes contain the size of their descending subtree
 Hash Table
  key is a symbol
  data is a pointer to the corresponding tree leaf

Each tree operation takes O(log |S|) time
Total is O(n log |S|), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings ⟹ just numbers and one bit
Properties:
 Exploits spatial locality, and it is a dynamic code
 There is a memory
 X = 1ⁿ2ⁿ3ⁿ…nⁿ  ⟹  Huff(X) = Θ(n² log n) > Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive):

f(i) = Σ_{j=1,...,i-1} p(j)

e.g. p(a) = .2, p(b) = .5, p(c) = .3  ⟹  f(a) = .0, f(b) = .2, f(c) = .7
[Figure: the unit interval split into a = [0,.2), b = [.2,.7), c = [.7,1.0).]

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac   (p(a) = .2, p(b) = .5, p(c) = .3)

[Figure: start from [0,1); after “b” the interval becomes [.2,.7); after “a” it becomes [.2,.3); after “c” it becomes [.27,.3).]

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c1 c2 … cn with probabilities p[c], use the following:

l0 = 0    li = l(i-1) + s(i-1) · f[ci]
s0 = 1    si = s(i-1) · p[ci]

f[c] is the cumulative prob. up to symbol c (not included).
The final interval size is

sn = Π_{i=1,...,n} p[ci]

The interval for a message sequence will be called the sequence interval.
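A few lines of Python that compute the sequence interval with exactly these formulas (real-valued, so only a sketch: a practical coder uses the integer/scaling version described later):

def sequence_interval(msg, p):
    # p: dict symbol -> probability; f[c] = cumulative probability of the symbols before c
    f, acc = {}, 0.0
    for c in sorted(p):                   # any fixed symbol order works
        f[c] = acc
        acc += p[c]
    l, s = 0.0, 1.0                       # l_0 = 0, s_0 = 1
    for c in msg:
        l = l + s * f[c]                  # l_i = l_{i-1} + s_{i-1} * f[c_i]
        s = s * p[c]                      # s_i = s_{i-1} * p[c_i]
    return l, l + s                       # the sequence interval [l, l+s)

# sequence_interval("bac", {"a": .2, "b": .5, "c": .3}) -> (0.27, 0.3)  (up to float rounding)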

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:

[Figure: .49 falls in b = [.2,.7); rescaling, (.49-.2)/.5 = .58 falls again in b; rescaling once more, (.58-.2)/.5 = .76 falls in c = [.7,1).]

The message is bbc.

Representing a real number
Binary fractional representation:

.75 = .11     1/3 = .010101…     11/16 = .1011

Algorithm
1. x = 2·x
2. If x < 1 output 0
3. else x = x - 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
e.g. [0,.33) → .01     [.33,.66) → .1     [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions.

  code    min      max       interval
  .11     .110…    .111…     [.75, 1.0)
  .101    .1010…   .1011…    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval is contained in the sequence interval (a dyadic number).

[Figure: sequence interval [.61,.79) contains the code interval of .101, i.e. [.625,.75).]

Can use L + s/2 truncated to 1 + ⌈log (1/s)⌉ bits

Bound on Arithmetic length

Note that -log s + 1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most
1 + ⌈log (1/s)⌉ = 1 + ⌈log Π_{i=1,...,n} (1/pi)⌉
              ≤ 2 + Σ_{i=1,...,n} log (1/pi)
              = 2 + Σ_{k=1,...,|S|} n·pk·log (1/pk)
              = 2 + n·H0  bits

nH0 + 0.02·n bits in practice, because of rounding

Integer Arithmetic Coding
Problem: operations on arbitrary-precision real numbers are expensive.
Key ideas of the integer version:
 Keep integers in range [0..R) where R = 2^k
 Use rounding to generate integer intervals
 Whenever the sequence interval falls into the top, bottom or middle half, expand the interval by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
 Output 1 followed by m 0s; m = 0; the message interval is expanded by 2
If u < R/2 then (bottom half)
 Output 0 followed by m 1s; m = 0; the message interval is expanded by 2
If l ≥ R/4 and u < 3R/4 then (middle half)
 Increment m; the message interval is expanded by 2
In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine

[Figure: the ATB maps the current interval (L,s) and a symbol c, with distribution (p1,…,p|S|), to the new interval (L’,s’), i.e. L’ = L + s·f(c) and s’ = s·p(c).]

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox

[Figure: at each step the PPM model feeds the ATB with p[s | context], where s is either the next character c or the escape symbol esc; the ATB updates the interval (L,s) → (L’,s’).]

The Encoder and the Decoder must know the protocol for selecting the same conditional probability distribution (PPM-variant).

PPM: Example Contexts (k = 2)
String = ACCBACCACBA B

Context    Counts
(empty)    A = 4   B = 2   C = 5   $ = 3
A          C = 3   $ = 1
B          A = 2   $ = 1
C          A = 1   B = 2   C = 2   $ = 3
AC         B = 1   C = 2   $ = 2
BA         C = 1   $ = 1
CA         C = 1   $ = 1
CB         A = 2   $ = 1
CC         A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit frequency estimation

LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c

Dictionary = all substrings starting before the Cursor (the current position); e.g. output <2,3,c>

Algorithm’s step:
 Output <d, len, c>
  d = distance of the copied string wrt the current position
  len = length of the longest match
  c = next char in text beyond the longest match
 Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character
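A naive Python sketch of the windowed LZ77 parser (quadratic matching, just to make the triples concrete; gzip speeds this up with hashing on triplets, as noted below):

def lz77_parse(T, W=6):
    i, out, n = 0, [], len(T)
    while i < n:
        best_d, best_len = 0, 0
        for j in range(max(0, i - W), i):         # candidate starts within the window
            l = 0
            while i + l < n - 1 and T[j + l] == T[i + l]:
                l += 1                            # the match may overlap the cursor
            if l > best_len:
                best_d, best_len = i - j, l
        nxt = T[i + best_len]                     # next char beyond the longest match
        out.append((best_d, best_len, nxt))
        i += best_len + 1
    return out

# lz77_parse("aacaacabcabaaac") -> [(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')]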

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later
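A Python sketch of the LZW decoder, including the special “SSc” case discussed above, where the received code is the one the encoder has just created and is therefore not yet in the decoder’s dictionary (here the dictionary is initialized with the 256 real ASCII codes, so a = 97, b = 98, c = 99 instead of the slide’s a = 112):

def lzw_decode(codes):
    dic = {i: chr(i) for i in range(256)}     # initial dictionary: single ASCII chars
    nxt = 256
    prev = dic[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in dic:
            cur = dic[code]
        else:                                  # the tricky case: code was just created
            cur = prev + prev[0]               # it must be S + S[0] (the "SSc" situation)
        out.append(cur)
        dic[nxt] = prev + cur[0]               # what the encoder added one step earlier
        nxt += 1
        prev = cur
    return "".join(out)

# lzw_decode([97, 97, 98, 256, 99, 257, 261, 99, 98]) == "aabaacababacb"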

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: the L → F mapping

[Figure: the sorted-rotations matrix of T = mississippi#, with first column F = # i i i i m p p s s s s and last column L = i p s s m # p i s s i i.]

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
[Figure: the same matrix; F and L are known, the middle of each row is unknown.]
Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
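A compact Python sketch of both directions: the forward BWT by sorting the rotations (elegant but inefficient, as noted below) and the inverse via the LF-mapping, as in InvertBWT. The only deviation is that the backward scan starts from the row whose last character is the sentinel, so the reconstructed string is T itself rather than one of its rotations:

def bwt(T):
    # T must end with a unique, lexicographically smallest sentinel, e.g. '#'
    rot = sorted(T[i:] + T[:i] for i in range(len(T)))
    return "".join(r[-1] for r in rot)

def inverse_bwt(L, sentinel="#"):
    # LF[i] = position in F of the char L[i]; equal chars keep their relative order
    count, rank = {}, []
    for c in L:
        rank.append(count.get(c, 0))
        count[c] = count.get(c, 0) + 1
    first, tot = {}, 0
    for c in sorted(count):                    # F is just L sorted
        first[c] = tot
        tot += count[c]
    LF = [first[c] + rank[i] for i, c in enumerate(L)]
    r = L.index(sentinel)                      # the row holding T itself
    T = [""] * len(L)
    for i in range(len(L) - 1, -1, -1):        # reconstruct T backward: T[i] = L[r]
        T[i] = L[r]
        r = LF[r]
    return "".join(T)

# bwt("mississippi#") == "ipssm#pissii"
# inverse_bwt("ipssm#pissii") == "mississippi#"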

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics

Size
 1 trillion pages available (Google, 7/08)
 5-40K per page => hundreds of terabytes
 Size grows every day!!

Change
 8% new pages, 25% new links change weekly
 Life time of about 10 days

The Bow Tie

Some definitions

Weakly connected components (WCC)
 Set of nodes such that from any node you can go to any other node via an undirected path.

Strongly connected components (SCC)
 Set of nodes such that from any node you can go to any other node via a directed path.

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?

 Largest artifact ever conceived by humans
 Exploit the structure of the Web for
  Crawl strategies
  Search
  Spam detection
  Discovering communities on the web
  Classification/organization
 Predict the evolution of the Web
  Sociological understanding

Many other large graphs…

 Physical network graph
  V = Routers
  E = communication links

 The “cosine” graph (undirected, weighted)
  V = static web pages
  E = semantic distance between pages

 Query-Log graph (bipartite, weighted)
  V = queries and URLs
  E = (q,u) if u is a result for q, and has been clicked by some user who issued q

 Social graph (undirected, unweighted)
  V = users
  E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)
 V = URLs, E = (u,v) if u has an hyperlink to v
Isolated URLs are ignored (no IN & no OUT)

Three key properties:
 Skewed distribution: Pb that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999 — WebBase Crawl 2001
Indegree follows a power law distribution:

Pr[ in-degree(u) = k ]  ∝  1/k^a,   a ≈ 2.1
A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph

[Figure: a plot of the web graph (21 million pages, 150 million links) with URLs sorted lexicographically (Berkeley, Stanford, …).]

URL-sorting
URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



 LZ77-scheme provides an efficient, optimal solution
  fknown is the “previously encoded text”: compress fknown·fnew starting from fnew
 zdelta is one of the best implementations

            Emacs size   Emacs time
 uncompr    27Mb         ---
 gzip       8Mb          35 secs
 zdelta     1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

[Figure: the client-side proxy and the server-side proxy sit on the two ends of the slow link; requests reach the web over the fast link, and the page travels back over the slow link delta-encoded against a reference copy held by both proxies.]

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: a small weighted graph over the files, plus the dummy node; edge weights are zdelta sizes (e.g. 20, 123, 220, 620, 2000) and the min branching picks the cheapest reference for each file.]

            space   time
 uncompr    30Mb    ---
 tgz        20%     linear
 THIS       8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
            space    time
 uncompr    260Mb    ---
 tgz        12%      2 mins
 THIS       8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
[Figure: the Client sends the block hashes of f_old to the Server; the Server replies with the encoded file (copy/literal instructions) built from f_new.]

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
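A toy Python sketch of the block-matching idea behind rsync — not the real protocol: real rsync uses a 4-byte rolling checksum plus a strong hash and more careful block handling; here whole-block MD5s are simply compared at every offset just to make the copy/literal output concrete:

import hashlib

def block_hashes(f_old, B):
    # what the client sends: one strong hash per block of f_old
    return {hashlib.md5(f_old[i:i + B]).hexdigest(): i // B
            for i in range(0, len(f_old), B)}

def rsync_encode(f_new, hashes, B):
    # what the server sends back: a mix of ("copy", block_id) and ("lit", byte) items
    out, i = [], 0
    while i < len(f_new):
        h = hashlib.md5(f_new[i:i + B]).hexdigest()
        if len(f_new) - i >= B and h in hashes:
            out.append(("copy", hashes[h]))
            i += B
        else:
            out.append(("lit", f_new[i:i + 1]))
            i += 1
    return out

def rsync_decode(items, f_old, B):
    # what the client does: rebuild f_new from its old blocks plus the literals
    out = []
    for kind, val in items:
        out.append(f_old[val * B:(val + 1) * B] if kind == "copy" else val)
    return b"".join(out)

# f_old = b"the quick brown fox jumps"; f_new = b"the quick red fox jumps!"
# rsync_decode(rsync_encode(f_new, block_hashes(f_old, 4), 4), f_old, 4) == f_new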

Rsync: some experiments

           gcc size   emacs size
 total     27288      27326
 gzip      7563       8577
 zdelta    227        1431
 rsync     964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The Server sends hashes (unlike the client in rsync), and the client checks them.
The Server deploys the common fref to compress the new ftar (rsync compresses just it).

A multi-round protocol
 k blocks of n/k elems
 log n/k levels
 If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
 The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

[Figure: P is a prefix of the suffix T[i,N].]

Occurrences of P in T = all suffixes of T having P as a prefix

Example: P = si, T = mississippi → occurrences at positions 4, 7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree

[Figure: the suffix tree of T# = mississippi#; edges are labeled with substrings (#, i, si, ssi, p, pi, ppi#, mississippi#, …) and the 12 leaves store the starting positions of the suffixes.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

Storing SUF(T) explicitly takes Θ(N²) space.

T = mississippi#        P = si

SA = 12 11 8 5 2 1 10 9 7 4 6 3   (each entry is a suffix pointer)
SUF(T) = #, i#, ippi#, issippi#, ississippi#, mississippi#, pi#, ppi#, sippi#, sissippi#, ssippi#, ssissippi#

Suffix Array
• SA: Θ(N log₂ N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
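A short Python sketch of this indirect binary search on SA (SA stores 1-based starting positions, as in the example above):

def sa_search(T, SA, P):
    # returns the SA range [lo, hi) of the suffixes having P as a prefix
    p = len(P)
    def pref(i):                       # first p chars of the suffix T[SA[i], ...]
        s = SA[i] - 1
        return T[s:s + p]
    l, r = 0, len(SA)                  # leftmost suffix whose prefix is >= P
    while l < r:
        mid = (l + r) // 2
        if pref(mid) < P:              # one O(p) string comparison per step
            l = mid + 1
        else:
            r = mid
    lo = l
    l, r = lo, len(SA)                 # leftmost suffix whose prefix is > P
    while l < r:
        mid = (l + r) // 2
        if pref(mid) <= P:
            l = mid + 1
        else:
            r = mid
    return lo, l                       # the occurrences are SA[lo:l]

# T = "mississippi#"; SA = [12,11,8,5,2,1,10,9,7,4,6,3]
# sa_search(T, SA, "si") -> (8, 10); SA[8:10] == [7, 4], i.e. occ = 2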

Locating the occurrences

[Figure: binary-search in SA for the boundaries of the range of suffixes prefixed by P = si, i.e. between si# and si$ (with # < Σ < $): the range contains SA entries 4 and 7, hence occ = 2.]

T = mississippi#

Suffix Array search
• O(p + log₂ N + occ) time
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

[Example: Lcp[4] = 4 is the length of the common prefix between the adjacent suffixes issippi# and ississippi#.]
• How long is the common prefix between T[i,...] and T[j,...] ?
 • Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
 • Search for Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
 • Search for a window Lcp[i,i+C-2] whose entries are all ≥ L


Slide 27

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
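A minimal Python sketch of the Karp-Rabin scan, just to make the steps concrete. Assumptions of mine, not of the slides: the prime q is fixed instead of being drawn at random below I, a 256-ary alphabet is used instead of the binary one, and every fingerprint hit is verified (so this is the deterministic variant).

# Karp-Rabin: compare rolling fingerprints Hq(Tr) with Hq(P); verify on equality.
def karp_rabin(T, P, q=2**31 - 1, base=256):
    n, m = len(T), len(P)
    if m == 0 or m > n:
        return []
    hp = ht = 0
    high = pow(base, m - 1, q)                 # base^(m-1) mod q, weight of the char leaving the window
    for i in range(m):                         # fingerprints of P and of T[1,m]
        hp = (hp * base + ord(P[i])) % q
        ht = (ht * base + ord(T[i])) % q
    occ = []
    for r in range(n - m + 1):
        if hp == ht and T[r:r + m] == P:       # check -> definite match (no false matches reported)
            occ.append(r + 1)                  # 1-based position
        if r + m < n:                          # roll the window: drop T[r], append T[r+m]
            ht = ((ht - ord(T[r]) * high) * base + ord(T[r + m])) % q
    return occ

# karp_rabin("10110101", "0101") -> [5], as in the example above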

Problem 1: Solution
Dictionary = {bzip, not, or, space};  S = "bzip or not bzip";  P = bzip = 1a 0b.
[Figure: the codeword tree and the compressed text C(S); C(S) is scanned for P's codeword, and each candidate position is marked yes/no.]
Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M        c  a  l  i  f  o  r  n  i  a
         1  2  3  4  5  6  7  8  9  10
f   1    0  0  0  0  1  0  0  0  0  0
o   2    0  0  0  0  0  1  0  0  0  0
r   3    0  0  0  0  0  0  1  0  0  0

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size (e.g., 32 or 64 bits). We'll assume m ≤ w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the (j-1)-th one.
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1 for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M(j) = BitShift(M(j-1)) & U(T[j])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1 characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) into the i-th position;
AND this with the i-th bit of U(T[j]) to establish whether both conditions hold.
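A compact Python sketch of the Shift-And scan just described (assuming m ≤ w); this is my own rendering, with bit i−1 of a machine word playing the role of row i of the column M(j).

# Shift-And: M(j) = BitShift(M(j-1)) & U(T[j]), kept in a single machine word.
def shift_and(T, P):
    m = len(P)
    U = {}
    for i, x in enumerate(P):                  # U(x): 1-bits at the positions of x in P
        U[x] = U.get(x, 0) | (1 << i)
    goal = 1 << (m - 1)                        # bit m-1 set <=> an occurrence of P ends here
    col, occ = 0, []
    for j, c in enumerate(T):
        col = ((col << 1) | 1) & U.get(c, 0)   # BitShift (shift down, first bit set to 1), then AND with U(T[j])
        if col & goal:
            occ.append(j - m + 2)              # 1-based starting position
    return occ

# shift_and("xabxabaaca", "abaac") -> [5], matching the worked example that follows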

An example j=1
P = abaac, T = xabxabaaca
U(x) = (0,0,0,0,0)^T
BitShift(M(0)) & U(T[1]) = (1,0,0,0,0)^T & (0,0,0,0,0)^T = (0,0,0,0,0)^T
So column 1 of M is all zeros.

An example j=2
T[2] = a, U(a) = (1,0,1,1,0)^T
BitShift(M(1)) & U(T[2]) = (1,0,0,0,0)^T & (1,0,1,1,0)^T = (1,0,0,0,0)^T
So column 2 of M is (1,0,0,0,0)^T: the prefix a of P ends at position 2.

An example j=3
T[3] = b, U(b) = (0,1,0,0,0)^T
BitShift(M(2)) & U(T[3]) = (1,1,0,0,0)^T & (0,1,0,0,0)^T = (0,1,0,0,0)^T
So column 3 of M is (0,1,0,0,0)^T: the prefix ab of P ends at position 3.

An example j=9
T[9] = c, U(c) = (0,0,0,0,1)^T
BitShift(M(8)) & U(T[9]) = (1,1,0,0,1)^T & (0,0,0,0,1)^T = (0,0,0,0,1)^T
The 5-th bit of column 9 is set, so an occurrence of P = abaac ends at position 9 of T.

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: Another solution
Dictionary = {bzip, not, or, space};  S = "bzip or not bzip";  P = bzip = 1a 0b.
[Figure: the codeword tree and the compressed text C(S), scanned for P's codeword; candidate positions are marked yes/no.]
Speed ≈ Compression ratio

Problem 2
Dictionary = {bzip, not, or, space};  S = "bzip or not bzip".
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring.
P = o  ⇒  the matching terms are  not = 1g 0g 0a  and  or = 1g 0a 0b.
[Figure: the codeword tree and the compressed text C(S), with the occurrences of the matching terms marked yes.]
Speed ≈ Compression ratio? No! Why? A scan of C(S) is needed for each term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: text T with the occurrences of patterns P1 and P2 highlighted.]

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And
 S is the concatenation of the patterns in P
 R is a bitmap of length m: R[i] = 1 iff S[i] is the first symbol of a pattern
 Use a variant of the Shift-And method searching for S:
   For any symbol c, U'(c) = U(c) AND R
    U'(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
   For any step j:
    compute M(j)
    then set M(j) = M(j) OR U'(T[j]). Why? To set to 1 the first bit of each pattern that starts with T[j]
    check if there are occurrences ending in j. How? (see the sketch below)
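Here is a rough Python sketch of this multi-pattern variant (my own rendering of the steps above, assuming the total pattern length fits in a machine word); the extra bitmap E, marking the last symbol of each pattern, is one way to answer the final "How?".

# Multi-pattern Shift-And over S = concatenation of the patterns.
def multi_shift_and(T, patterns):
    S = "".join(patterns)
    U, R, E, pos = {}, 0, 0, 0
    for P in patterns:
        R |= 1 << pos                          # R: first symbol of each pattern
        E |= 1 << (pos + len(P) - 1)           # E: last symbol of each pattern
        pos += len(P)
    for i, x in enumerate(S):
        U[x] = U.get(x, 0) | (1 << i)
    col, occ = 0, []
    for j, c in enumerate(T):
        Uc = U.get(c, 0)
        col = ((col << 1) & Uc) | (Uc & R)     # M(j) = BitShift(M(j-1)) & U(T[j]), then OR with U'(T[j])
        if col & E:                            # some pattern ends at position j+1 (1-based)
            occ.append(j + 1)
    return occ

# multi_shift_and("abba", ["ab", "ba"]) -> [2, 4]   (hypothetical patterns, just for illustration)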

Problem 3
Dictionary = {bzip, not, or, space};  S = "bzip or not bzip".
Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing at most k mismatches.
P = bot, k = 2
[Figure: the codeword tree and the compressed text C(S).]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal: given k, find all the occurrences of P
in T with up to k mismatches.
We define the matrix Ml to be an m by n binary
matrix such that:

Ml(i,j) = 1 iff
there are no more than l mismatches between the
first i characters of P and the i characters of T
ending at character j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

[Figure: P[1..i-1] aligned with T ending at position j-1, followed by an equal pair of characters.]
BitShift(M^l(j-1)) & U(T[j])

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

[Figure: P[1..i-1] aligned with T ending at position j-1, with at most l-1 mismatches.]
BitShift(M^(l-1)(j-1))

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M^l(j) = [BitShift(M^l(j-1)) & U(T(j))]  OR  BitShift(M^(l-1)(j-1))
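A Python sketch of this k-mismatch recurrence (again assuming m ≤ w, and again my own rendering); M[l] holds the current column of M^l.

# Agrep-style Shift-And with up to k mismatches.
def shift_and_k_mismatches(T, P, k):
    m = len(P)
    U, mask = {}, (1 << m) - 1
    for i, x in enumerate(P):
        U[x] = U.get(x, 0) | (1 << i)
    goal = 1 << (m - 1)
    M, occ = [0] * (k + 1), []
    for j, c in enumerate(T):
        Uc = U.get(c, 0)
        prev = M[:]                                        # the columns at step j-1
        M[0] = ((prev[0] << 1) | 1) & Uc                   # exact matching, as before
        for l in range(1, k + 1):
            # M^l(j) = [BitShift(M^l(j-1)) & U(T[j])]  OR  BitShift(M^(l-1)(j-1))
            M[l] = ((((prev[l] << 1) | 1) & Uc) | ((prev[l - 1] << 1) | 1)) & mask
        if M[k] & goal:
            occ.append(j - m + 2)                          # 1-based starting position
    return occ

# shift_and_k_mismatches("aatatccacaa", "atcgaa", 2) -> [4], as in the Agrep example above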

Example M1
T = xabxabaaca,  P = abaad

M1 =      1  2  3  4  5  6  7  8  9  10
     1    1  1  1  1  1  1  1  1  1  1
     2    0  0  1  0  0  1  0  1  1  0
     3    0  0  0  1  0  0  1  0  0  1
     4    0  0  0  0  1  0  0  1  0  0
     5    0  0  0  0  0  0  0  0  1  0

M0 =      1  2  3  4  5  6  7  8  9  10
     1    0  1  0  0  1  0  1  1  0  1
     2    0  0  1  0  0  1  0  0  0  0
     3    0  0  0  0  0  0  1  0  0  0
     4    0  0  0  0  0  0  0  1  0  0
     5    0  0  0  0  0  0  0  0  0  0

How much do we pay?





The running time is O(kn(1+m/w)).
Again, the method is practically efficient for small m.
Still, only O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary = {bzip, not, or, space};  S = "bzip or not bzip";  P = bot, k = 2.
The matching term is  not = 1g 0g 0a ; its occurrences are located by scanning C(S) for that codeword.
[Figure: the codeword tree and the compressed text C(S), with the occurrences of "not" marked yes.]

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
g(x) = 0^(Length−1) followed by x in binary, with x > 0 and Length = ⌊log2 x⌋ + 1
e.g., 9 is represented as <000, 1001>.
The g-code for x takes 2⌊log2 x⌋ + 1 bits (i.e., a factor of 2 from optimal).
Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8

6

3

59

7
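A small Python sketch of g-coding and decoding, matching the definition above; the bit strings are plain Python strings just for clarity.

# g-code: Length-1 zeros followed by the binary representation of x (x > 0).
def gamma_encode(x):
    b = bin(x)[2:]                    # x in binary, len(b) = floor(log2 x) + 1
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":         # count the leading zeros = Length - 1
            z += 1
            i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out

# gamma_decode("0001000001100110000011101100111") -> [8, 6, 3, 59, 7], as in the exercise above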

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2 · H0(s) + 1
Key fact:
1 ≥ Σ_{i=1,…,x} pi ≥ x · px  ⇒  x ≤ 1/px

How good is it ?
Encode the integers via g-coding:  |g(i)| ≤ 2·log i + 1
The cost of the encoding is (recall i ≤ 1/p_i):

Σ_{i=1,…,|S|} p_i · |g(i)|  ≤  Σ_{i=1,…,|S|} p_i · [2·log(1/p_i) + 1]  =  2·H0(X) + 1

Not much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s·c with 2 bytes, s·c² on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 128² = 16512 words on at most 2 bytes
(230,26)-dense code encodes 230 + 230·26 = 6210 on at most 2
bytes, hence more on 1 byte and thus better if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC is rather interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n² log n), MTF = O(n log n) + n²

Not much worse than Huffman
...but it may be far better
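A minimal Python sketch of the MTF transform (and its inverse), following the two steps above; positions are 0-based here, and the starting list in the usage comment is a hypothetical example.

# Move-to-Front: output the position of each symbol in L, then move it to the front.
def mtf_encode(s, start_list):
    L, out = list(start_list), []
    for c in s:
        i = L.index(c)
        out.append(i)                 # 1) position of c in L (0-based)
        L.insert(0, L.pop(i))         # 2) move c to the front of L
    return out

def mtf_decode(codes, start_list):
    L, out = list(start_list), []
    for i in codes:
        c = L.pop(i)
        out.append(c)
        L.insert(0, c)
    return "".join(out)

# mtf_encode("aabbbba", "abcd") -> [0, 0, 1, 0, 0, 0, 1]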

MTF: how good is it ?
Encode the integers via g-coding:  |g(i)| ≤ 2·log i + 1
Put S in the front and consider the cost of encoding:

O(|S| log |S|) + Σ_{x=1,…,|S|} Σ_{i=2,…,n_x} |g(p_i^x − p_{i−1}^x)|

By Jensen's inequality:

≤ O(|S| log |S|) + Σ_{x=1,…,|S|} n_x · [2·log(N/n_x) + 1]
≤ O(|S| log |S|) + N·[2·H0(X) + 1]

⇒  La[mtf] ≤ 2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)
(there is a memory)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).
e.g.  p(a) = .2, p(b) = .5, p(c) = .3  ⇒  f(a) = .0, f(b) = .2, f(c) = .7
f(i) = Σ_{j=1,…,i−1} p(j)
[Figure: the unit interval [0,1) partitioned into a = [0,.2), b = [.2,.7), c = [.7,1).]
The interval for a particular symbol will be called
the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
[Figure: start from [0,1); symbol b narrows it to [.2,.7); then a narrows it to [.2,.3); then c narrows it to [.27,.3).]
The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l 0  0 l i  l i -1  s i -1 * f c i 

s0  1

s i  s i -1 * p c i 

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

n

sn 

 p c 
i

i 1

The interval for a message sequence will be called the
sequence interval
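A toy Python sketch of the sequence-interval computation above (real-valued, not the integer implementation discussed later); f is built from p by taking the symbols in sorted order, as in the running example.

# Sequence interval of a message: l_i = l_{i-1} + s_{i-1}*f[c_i],  s_i = s_{i-1}*p[c_i].
def sequence_interval(msg, p):
    f, acc = {}, 0.0
    for c in sorted(p):               # cumulative probability of the symbols before c
        f[c] = acc
        acc += p[c]
    l, s = 0.0, 1.0
    for c in msg:
        l, s = l + s * f[c], s * p[c]
    return l, l + s                   # the sequence interval [l, l+s)

# sequence_interval("bac", {"a": .2, "b": .5, "c": .3}) ≈ (0.27, 0.3), as in the example above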

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:
[Figure: .49 ∈ [.2,.7) ⇒ b; within that interval, .49 ∈ [.3,.55) ⇒ b; within that, .49 ∈ [.475,.55) ⇒ c.]
The message is bbc.

Representing a real number
Binary fractional representation:
.75 = .11      1/3 = .0101…      11/16 = .1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
         min      max      interval
.11      .110     .111     [.75, 1.0)
.101     .1010    .1011    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval: [.61, .79)
Code Interval (.101): [.625, .75)

Can use L + s/2 truncated to 1 + ⌈log (1/s)⌉ bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + ⌈log (1/s)⌉ = 1 + ⌈log ∏_i (1/p_i)⌉
≤ 2 + ∑_{i=1,…,n} log (1/p_i)
= 2 + ∑_{k=1,…,|S|} n·p_k · log (1/p_k)
= 2 + n·H0  bits

nH0 + 0.02·n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

[Figure: given the current interval (L,s), the symbol distribution (p1,…,p|S|) and the next symbol c, the ATB returns the new interval (L',s').]

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
[Figure: the PPM model feeds p[s|context] (with s = c or esc) to the ATB, which maps the interval (L,s) to (L',s').]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts   (String = ACCBACCACBA B,  k = 2)

Context    Counts
Empty      A = 4   B = 2   C = 5   $ = 3

Context    Counts
A          C = 3   $ = 1
B          A = 2   $ = 1
C          A = 1   B = 2   C = 2   $ = 3

Context    Counts
AC         B = 1   C = 2   $ = 2
BA         C = 1   $ = 1
CA         C = 1   $ = 1
CB         A = 2   $ = 1
CC         A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character
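A naive Python sketch of the LZ77 parsing with a sliding window (window size W as a parameter, and my own brute-force match search); it reproduces the triples of the example above.

# LZ77 step: output (d, len, c) = (copy distance, longest-match length, next char), advance by len+1.
def lz77_encode(T, W=6):
    i, out = 0, []
    while i < len(T):
        best_len, best_d = 0, 0
        for d in range(1, min(i, W) + 1):                  # candidate distances inside the window
            l = 0
            while i + l < len(T) - 1 and T[i + l - d] == T[i + l]:
                l += 1                                     # the copy may overlap the cursor
            if l > best_len:
                best_len, best_d = l, d
        out.append((best_d, best_len, T[i + best_len]))
        i += best_len + 1
    return out

# lz77_encode("aacaacabcabaaac") -> [(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')]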

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
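A short Python sketch of LZW encoding as just described (no extra char is emitted, Sc is added to the dictionary); here the dictionary starts with the 256 byte values via ord(), while the slides use a toy numbering with a = 112.

# LZW: emit the id of the longest match S, add Sc to the dictionary, restart from c.
def lzw_encode(T):
    D = {chr(i): i for i in range(256)}
    out, S, nxt = [], "", 256
    for c in T:
        if S + c in D:
            S += c                    # extend the current match
        else:
            out.append(D[S])          # emit the id of the longest match S
            D[S + c] = nxt            # add Sc to the dictionary (c itself is not emitted)
            nxt += 1
            S = c
    if S:
        out.append(D[S])
    return out

# lzw_encode("aabaacabab") emits the ids of: a, a, b, aa, c, ab, ab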

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
We are given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows  (BWT, 1994)

F                L
#  mississipp    i
i  #mississip    p
i  ppi#missis    s
i  ssippi#mis    s
i  ssissippi#    m
m  ississippi    #
p  i#mississi    p
p  pi#mississ    i
s  ippi#missi    s
s  issippi#mi    s
s  sippi#miss    i
s  sissippi#m    i

A famous example

Much
longer...

A useful tool: the L → F mapping
[The same sorted-rotation matrix as above, with first column F and last column L.]
How do we map L's chars onto F's chars?
... We need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
[Figure: the sorted-rotation matrix as above, with first column F and last column L; the middle of the matrix is unknown to the decoder.]
Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}

How to compute the BWT ?
SA = 12 11 8 5 2 1 10 9 7 4 6 3
[Figure: the BWT matrix rows are the sorted rotations of T; its last column is L = i p s s m # p i s s i i.]
We said that: L[i] precedes F[i] in T.   E.g. L[3] = T[7].
Given SA and T, we have L[i] = T[SA[i]-1]
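A tiny Python sketch of exactly this relation: build SA naively (the elegant-but-inefficient way of the next slide) and read off L[i] = T[SA[i]-1].

# BWT via the suffix array: L[i] = T[SA[i]-1] (position 0 wraps around to the final #).
def bwt_from_sa(T):
    sa = sorted(range(len(T)), key=lambda i: T[i:])    # naive suffix sorting
    return "".join(T[i - 1] for i in sa)               # index -1 wraps to the last char

# bwt_from_sa("mississippi#") -> "ipssm#pissii", the column L shown above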

How to construct SA from T ?
Input: T = mississippi#

SA    suffix
12    #
11    i#
8     ippi#
5     issippi#
2     ississippi#
1     mississippi#
10    pi#
9     ppi#
7     sippi#
4     sissippi#
6     ssippi#
3     ssissippi#

Elegant but inefficient. Obvious inefficiencies:
• Θ(n² log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC):
set of nodes such that from any node one can reach any other node via an
undirected path.

Strongly connected components (SCC):
set of nodes such that from any node one can reach any other node via a
directed path.

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…
• Physical network graph: V = Routers, E = communication links
• The "cosine" graph (undirected, weighted): V = static web pages, E = semantic distance between pages
• Query-Log graph (bipartite, weighted): V = queries and URLs, E = (q,u) if u is a result for q and has been clicked by some user who issued q
• Social graph (undirected, unweighted): V = users, E = (x,y) if x knows y (facebook, address book, email, ...)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution
Altavista crawl (1999) and WebBase crawl (2001): indegree follows a power law distribution
Pr[in-degree(u) = k]  ∝  1/k^α,   α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/x^α, α ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
[Figure: adjacency-matrix plot of the web graph, axes i and j.]
21 million pages, 150 million links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77-scheme provides an efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
           Emacs size    Emacs time
uncompr    27Mb          ---
gzip       8Mb           35 secs
zdelta     1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: a small weighted graph over the files plus a dummy node 0; edge weights are zdelta sizes (e.g. 20, 123, 220, 620, 2000) and the min branching is highlighted.]

           space    time
uncompr    30Mb     ---
tgz        20%      linear
THIS       8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F, thus saving
zdelta executions. Nonetheless, strictly n² time.

           space    time
uncompr    260Mb    ---
tgz        12%      2 mins
THIS       8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

          gcc size    emacs size
total     27288       27326
gzip      7563        8577
zdelta    227         1431
rsync     964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Figure: the suffix tree of T# = mississippi#: edges are labeled with substrings (e.g. i, s, si, ssi, ppi#, pi#, i#, mississippi#) and the 12 leaves store the starting positions of the suffixes.]
T# = mississippi#

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Θ(N²) space for SUF(T):

SA    SUF(T)
12    #
11    i#
8     ippi#
5     issippi#
2     ississippi#
1     mississippi#
10    pi#
9     ppi#
7     sippi#
4     sissippi#
6     ssippi#
3     ssissippi#

T = mississippi#      P = si
Suffix Array space:
• SA: Θ(N log2 N) bits
• Text T: N chars
⇒ In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step.
[Figure: binary search over SA for P = si in T = mississippi#; at each step P is compared with the suffix in the middle of the current range, deciding whether P is smaller or larger.]

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

Improvements: [Manber-Myers, ’90], [Cole et al., ’06]
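A compact Python sketch of the indirect binary search above; SA is 0-based here, and occurrences are reported as 1-based positions as in the slides.

# Binary search over SA: each comparison costs O(p) chars, O(p log N + occ) overall.
def sa_search(T, SA, P):
    def first(greater):                            # first suffix whose p-char prefix is >= P (or > P)
        lo, hi = 0, len(SA)
        while lo < hi:
            mid = (lo + hi) // 2
            s = T[SA[mid]:SA[mid] + len(P)]
            if s < P or (greater and s == P):
                lo = mid + 1
            else:
                hi = mid
        return lo
    l, r = first(False), first(True)               # SA[l:r] = suffixes having P as a prefix
    return sorted(SA[i] + 1 for i in range(l, r))

# SA0 = sorted(range(len("mississippi#")), key=lambda i: "mississippi#"[i:])
# sa_search("mississippi#", SA0, "si") -> [4, 7]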

Locating the occurrences
[Figure: the contiguous range of SA whose suffixes start with P = si, i.e. sippi# (position 7) and sissippi# (position 4); occ = 2.]
Suffix Array search: O(p + log2 N + occ) time
Suffix Trays: O(p + log2 |S| + occ)   [Cole et al., ’06]
String B-tree   [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays   [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

T = mississippi#
SA  = 12 11 8 5 2 1 10 9 7 4 6 3
Lcp =    0  1 1 4 0 0  1 0 2 1 3

E.g. lcp(issippi#, ississippi#) = 4.
• How long is the common prefix between T[i,...] and T[j,...] ?
  • Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  • Search for Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  • Search for a run Lcp[i,i+C-2] whose entries are all ≥ L


The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C · p · f/(1+f)
This is at least 10^4 · f/(1+f)
If we fetch B ≈ 4Kb in time C, and the algorithm uses all of them:
(1/B) · (p · f/(1+f) · C) ≈ 30 · f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
       4K    8K    16K    32K    128K    256K    512K    1M
n³     22s   3m    26m    3.5h   28h     --      --      --
n²     0     0     0      1s     26s     106s    7m      28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT
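A runnable Python version of the scan above, under the slide's assumption that every subsum is non-zero (for an all-negative array this sketch returns 0).

# Single scan: restart the running sum when it drops to <= 0, keep the best sum seen.
def max_subarray_sum(A):
    best, cur = 0, 0
    for x in A:
        cur = 0 if cur + x <= 0 else cur + x
        best = max(best, cur)
    return best

# max_subarray_sum([2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]) -> 12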

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
  if (i < j) then
     m = (i+j)/2;            // Divide
     Merge-Sort(A,i,m);      // Conquer
     Merge-Sort(A,m+1,j);
     Merge(A,i,m,j)          // Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:
  n = 10^9 tuples ⇒ a few Gbs
  Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms
Analysis of mergesort on disk:
  It is an indirect sort: Θ(n log2 n) random I/Os
  [5ms] · n log2 n ≈ 1.5 years
In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree
[Figure: the recursion tree of Merge-Sort on a small example array; there are log2 N levels of merging.]
If the run-size is larger than B (i.e. after the first step!!),
fetching all of it in memory for merging does not help.
How do we deploy the disk/memory features?
N/M runs, each sorted in internal memory (no I/Os)
— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort
The key is to balance run-size and #runs to merge.
Sort N items with main-memory M and disk-pages B:
  Pass 1: Produce (N/M) sorted runs.
  Pass i: merge X = M/B runs  ⇒  log_{M/B} N/M passes

Multiway Merging
[Figure: X input buffers of B items (one per run) and one output buffer in main memory; at each step the minimum of the buffer heads min(Bf1[p1], Bf2[p2], …, BfX[pX]) goes to the output buffer; an input buffer is refilled from its run when exhausted (pi = B), and the output buffer is flushed when full.]

Cost of Multi-way Merge-Sort
Number of passes = log_{M/B} #runs = log_{M/B} N/M
Optimal cost = Θ((N/B) log_{M/B} N/M) I/Os
In practice:
  M/B ≈ 1000 ⇒ #passes = log_{M/B} N/M ≈ 1
  One multiway merge ⇒ 2 passes = a few mins (tuning depends on disk features)
• Large fan-out (M/B) decreases #passes
• Compression would decrease the cost of a pass!
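As a toy illustration (ignoring the buffering of B items and the disk I/Os), one merging pass over X sorted runs can be sketched in Python with a heap over the run heads:

# One multiway merging pass: repeatedly extract the minimum among the X run heads.
import heapq

def multiway_merge(runs):
    return list(heapq.merge(*runs))

# multiway_merge([[1, 5, 13], [2, 7, 19], [3, 4, 8]]) -> [1, 2, 3, 4, 5, 7, 8, 13, 19]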

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables X and C (start with X = first item, C = 1)
For each subsequent item s of the stream,
  if (X==s) then C++
  else { C--; if (C==0) { X=s; C=1; } }
Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6·10^9, size = 6Gb
n = 10^6 documents
TotT = 10^9 (avg term length is 6 chars)
t = 5·10^5 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix   (t = 500K terms,  n = 1 million docs)

            Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony            1                 1             0          0        0        1
Brutus            1                 1             0          1        0        0
Caesar            1                 1             0          1        1        1
Calpurnia         0                 1             0          0        0        0
Cleopatra         1                 0             0          0        0        0
mercy             1                 0             1          1        1        1
worser            1                 0             1          1        1        0

1 if the play contains the word, 0 otherwise.  Space is 500Gb!

Solution 2: Inverted index
[Figure: each dictionary term (Brutus, Calpurnia, Caesar) points to the sorted list of the documents containing it.]
We can still do better: i.e. 30-50% of the original text.
1. Typically we use about 12 bytes per posting
2. We have 10^9 total terms ⇒ at least 12Gb space
3. Compressing the 6Gb documents gets 1.5Gb data
Better index, but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

Σ_{i=1,…,n-1} 2^i = 2^n − 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
  i(s) = log2 (1/p(s)) = −log2 p(s)
Lower probability ⇒ higher information

Entropy is the weighted average of i(s):
  H(S) = Σ_{s∈S} p(s) · log2 (1/p(s))   bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
[Figure: the binary trie of the code, with leaves a = 0, b = 100, c = 101, d = 11.]

Average Length
For a code C with codeword length L[s], the
average length is defined as
La(C) = Σ_{s∈S} p(s) · L[s]

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H(S) ≤ La(C)
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5
[Figure: the Huffman tree: a and b merge into a node of weight .3, which merges with c into .5, which merges with d into 1.]
a = 000, b = 001, c = 01, d = 1
There are 2^(n-1) "equivalent" Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at the root and take the branch for
each bit received. When at a leaf, output its
symbol and return to the root.
abc... → 00000101
101001... → dcb
[Figure: the same Huffman tree as above.]

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
[Figure: word-based Huffman with fan-out 128; each codeword is a sequence of bytes and the first bit of the first byte carries the tag. Example: T = "bzip or not bzip", with byte-aligned codewords such as [bzip] = 1a 0b, and the compressed text C(T).]

CGrep and other ideas...
P = bzip = 1a 0b
T = "bzip or not bzip"
[Figure: GREP is run directly on the compressed text C(T), searching for P's byte-aligned codeword; candidate positions are checked and marked yes/no.]
Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
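A minimal sketch (C++, names of my own choosing, not the course's code) of the Karp-Rabin scan just described: the fingerprint is kept small by reducing modulo the prime q at every step, and each hash hit is verified by a direct comparison, which makes this the deterministic variant.

#include <cstdint>
#include <string>
#include <vector>

// Karp-Rabin over a binary alphabet: report all 0-based r where Hq(T_r) == Hq(P).
std::vector<size_t> karp_rabin(const std::string& T, const std::string& P, uint64_t q) {
    size_t n = T.size(), m = P.size();
    std::vector<size_t> matches;
    if (m == 0 || n < m) return matches;
    uint64_t hP = 0, hT = 0, pow = 1;             // pow = 2^(m-1) mod q
    for (size_t i = 0; i + 1 < m; ++i) pow = (pow * 2) % q;
    for (size_t i = 0; i < m; ++i) {
        hP = (hP * 2 + (P[i] - '0')) % q;
        hT = (hT * 2 + (T[i] - '0')) % q;
    }
    for (size_t r = 0; ; ++r) {                   // r = candidate starting position
        if (hT == hP && T.compare(r, m, P) == 0)  // verification => no false matches reported
            matches.push_back(r);
        if (r + m >= n) break;
        // slide the window: drop T[r], append T[r+m]
        hT = ((hT + q - (pow * (T[r] - '0')) % q) * 2 + (T[r + m] - '0')) % q;
    }
    return matches;
}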

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

[Slide figure: the word-based Huffman tree for the dictionary {bzip, not, or, space}; the compressed text C(S) of S = “bzip or not bzip” is scanned codeword by codeword, comparing each codeword with P’s codeword 1a 0b and marking where it matches (yes) and where it does not.]

S = “bzip or not bzip”

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M (m rows, n columns):

        j:  1 2 3 4 5 6 7 8 9 10
        T:  c a l i f o r n i a
  P = for
  M(1,·):   0 0 0 0 1 0 0 0 0 0
  M(2,·):   0 0 0 0 0 1 0 0 0 0
  M(3,·):   0 0 0 0 0 0 1 0 0 0

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size (e.g., 32 or 64 bits). We’ll assume m ≤ w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one.
We define the m-length binary vector U(x) for each character x in the alphabet: U(x) is set to 1 for the positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

  M(j) = BitShift( M(j-1) ) & U(T[j])


For i > 1, entry M(i,j) = 1 iff
(1) the first i-1 characters of P match the i-1 characters of T ending at character j-1, i.e. M(i-1,j-1) = 1
(2) P[i] = T[j], i.e. the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) into the i-th position; AND-ing it with the i-th bit of U(T[j]) establishes whether both conditions hold.

An example, j=1
T = xabxabaaca,  P = abaac,  U(x) = (0,0,0,0,0)ᵀ
BitShift(M(0)) & U(T[1]) = (1,0,0,0,0)ᵀ & (0,0,0,0,0)ᵀ = (0,0,0,0,0)ᵀ
so column M(1) = (0,0,0,0,0)ᵀ

An example, j=2
T = xabxabaaca,  P = abaac,  U(a) = (1,0,1,1,0)ᵀ
BitShift(M(1)) & U(T[2]) = (1,0,0,0,0)ᵀ & (1,0,1,1,0)ᵀ = (1,0,0,0,0)ᵀ
so column M(2) = (1,0,0,0,0)ᵀ

An example, j=3
T = xabxabaaca,  P = abaac,  U(b) = (0,1,0,0,0)ᵀ
BitShift(M(2)) & U(T[3]) = (1,1,0,0,0)ᵀ & (0,1,0,0,0)ᵀ = (0,1,0,0,0)ᵀ
so column M(3) = (0,1,0,0,0)ᵀ

An example, j=9
T = xabxabaaca,  P = abaac,  U(c) = (0,0,0,0,1)ᵀ
BitShift(M(8)) & U(T[9]) = (1,1,0,0,1)ᵀ & (0,0,0,0,1)ᵀ = (0,0,0,0,1)ᵀ
so column M(9) = (0,0,0,0,1)ᵀ: the bit of row m is set, hence an occurrence of P ends at position 9.

The matrix M after column 9:
        j:  1 2 3 4 5 6 7 8 9
  row 1:    0 1 0 0 1 0 1 1 0
  row 2:    0 0 1 0 0 1 0 0 0
  row 3:    0 0 0 0 0 0 1 0 0
  row 4:    0 0 0 0 0 0 0 1 0
  row 5:    0 0 0 0 0 0 0 0 1

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.
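A compact sketch (C++, my own naming, assuming m ≤ w = 64) of the Shift-And scan just described: one 64-bit word holds the current column M(j), updated with a shift, a forced low-order bit, and an AND with U(T[j]).

#include <cstdint>
#include <string>
#include <vector>

// Shift-And: report the 0-based ending positions of all occurrences of P in T (|P| <= 64).
std::vector<size_t> shift_and(const std::string& T, const std::string& P) {
    size_t m = P.size();
    std::vector<size_t> ends;
    if (m == 0 || m > 64) return ends;
    uint64_t U[256] = {0};                       // U[x]: bit i-1 set iff P[i] == x
    for (size_t i = 0; i < m; ++i) U[(unsigned char)P[i]] |= 1ULL << i;
    uint64_t M = 0;                              // current column M(j)
    uint64_t last = 1ULL << (m - 1);             // the bit of row m
    for (size_t j = 0; j < T.size(); ++j) {
        M = ((M << 1) | 1ULL) & U[(unsigned char)T[j]];   // BitShift, then AND with U(T[j])
        if (M & last) ends.push_back(j);         // a full match ends at j
    }
    return ends;
}

For example, shift_and("xabxabaaca", "abaac") returns {8}, the 0-based version of the occurrence ending at position 9 in the example above.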

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: Another solution
Dictionary: bzip, not, or, space          P = bzip = 1a 0b
S = “bzip or not bzip”
[Slide figure: the word-based Huffman tree and the compressed text C(S), scanned to locate P’s codeword (yes/no at each codeword).]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

Dictionary: bzip, not, or, space          P = o
S = “bzip or not bzip”
[Slide figure: the word-based Huffman tree; the terms containing P = o as a substring are “not” and “or”, with codewords not = 1g 0g 0a and or = 1g 0a 0b; each of them is searched in C(S).]

Speed ≈ Compression ratio? No! Why?
A scan of C(S) is needed for each term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: the text T with the patterns P1 and P2 aligned at their occurrences.]

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

S is the concatenation of the patterns in P.
R is a bitmap of length m: R[i] = 1 iff S[i] is the first symbol of a pattern.

Use a variant of the Shift-And method searching for S:
 For any symbol c, U’(c) = U(c) AND R, so U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern.
 For any step j, compute M(j) and then OR it with U’(T[j]). Why? This sets to 1 the first bit of each pattern that starts with T[j].
 Check if there are occurrences ending in j. How? Test the bits of M(j) corresponding to the last symbol of each pattern.

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

Dictionary: bzip, not, or, space          P = bot, k = 2
S = “bzip or not bzip”
[Slide figure: the same word-based Huffman tree and compressed text C(S).]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa, P = atcgaa
P appears in T with 2 mismatches starting at position 4; it also occurs with 4 mismatches starting at position 2.

  aatatccacaa
     atcgaa        (2 mismatches, start 4)

  aatatccacaa
   atcgaa          (4 mismatches, start 2)

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix M^l to be an m by n binary matrix such that:

M^l(i,j) = 1 iff there are no more than l mismatches between the first i characters of P and the i characters of T ending at character j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing M^l: case 1

The first i-1 characters of P match a substring of T ending at j-1, with at most l mismatches, and the next pair of characters in P and T are equal:

  BitShift( M^l(j-1) ) & U(T[j])

Computing M^l: case 2

The first i-1 characters of P match a substring of T ending at j-1, with at most l-1 mismatches:

  BitShift( M^(l-1)(j-1) )

Computing M^l

We compute M^l for all l = 0, …, k. For each j we compute M(j), M^1(j), …, M^k(j).
For all l, initialize M^l(0) to the zero vector.
In order to compute M^l(j), we observe that there is a match iff

  M^l(j) = [ BitShift( M^l(j-1) ) & U(T[j]) ]  OR  BitShift( M^(l-1)(j-1) )

Example M1

T = xabxabaaca,  P = abaad

M^1 =       j:  1 2 3 4 5 6 7 8 9 10
     row 1:     1 1 1 1 1 1 1 1 1 1
     row 2:     0 0 1 0 0 1 0 1 1 0
     row 3:     0 0 0 1 0 0 1 0 0 1
     row 4:     0 0 0 0 1 0 0 1 0 0
     row 5:     0 0 0 0 0 0 0 0 1 0

M^0 =       j:  1 2 3 4 5 6 7 8 9 10
     row 1:     0 1 0 0 1 0 1 1 0 1
     row 2:     0 0 1 0 0 1 0 0 0 0
     row 3:     0 0 0 0 0 0 1 0 0 0
     row 4:     0 0 0 0 0 0 0 1 0 0
     row 5:     0 0 0 0 0 0 0 0 0 0

How much do we pay?

The running time is O(k n (1 + m/w)).
Again, the method is practically efficient for small m.
Only O(k) columns of M are needed at any given time; hence, the space used by the algorithm is O(k) memory words.
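A sketch (C++, my own naming, assuming m ≤ 64) of the k-mismatch recurrence above: one 64-bit word per level l = 0..k, each column obtained from the previous column of the same level and of level l-1.

#include <cstdint>
#include <string>
#include <vector>

// Shift-And with up to k mismatches: report 0-based ending positions of approximate occurrences.
std::vector<size_t> shift_and_k_mismatches(const std::string& T, const std::string& P, int k) {
    size_t m = P.size();
    std::vector<size_t> ends;
    if (m == 0 || m > 64) return ends;
    uint64_t U[256] = {0};
    for (size_t i = 0; i < m; ++i) U[(unsigned char)P[i]] |= 1ULL << i;
    uint64_t last = 1ULL << (m - 1);
    std::vector<uint64_t> M(k + 1, 0);            // M[l] = current column of level l
    for (size_t j = 0; j < T.size(); ++j) {
        uint64_t u = U[(unsigned char)T[j]];
        uint64_t prevLevelOld = M[0];             // M^(l-1)(j-1), starting with level 0
        M[0] = ((M[0] << 1) | 1ULL) & u;          // exact-match column
        for (int l = 1; l <= k; ++l) {
            uint64_t old = M[l];                  // M^l(j-1)
            // M^l(j) = (BitShift(M^l(j-1)) & U(T[j]))  OR  BitShift(M^(l-1)(j-1))
            M[l] = (((old << 1) | 1ULL) & u) | ((prevLevelOld << 1) | 1ULL);
            prevLevelOld = old;
        }
        if (M[k] & last) ends.push_back(j);       // occurrence with <= k mismatches ends at j
    }
    return ends;
}

On the example above (T = xabxabaaca, P = abaad, k = 1) this reports the ending position 8 (0-based), i.e. column 9 of M^1.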

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

Dictionary: bzip, not, or, space          P = bot, k = 2
S = “bzip or not bzip”
not = 1g 0g 0a
[Slide figure: the same word-based Huffman tree and compressed text C(S).]

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is d(p,s) = the minimum number of operations needed to transform p into s via three ops:
 Insertion: insert a symbol in p
 Deletion: delete a symbol from p
 Substitution: change a symbol in p with a different one

Example: d(ananas, banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding

g(x) = 000…0 followed by x in binary, with (Length - 1) leading zeros,
where x > 0 and Length = ⌊log2 x⌋ + 1.
e.g., 9 is represented as <000, 1001>.

The g-code for x takes 2⌊log2 x⌋ + 1 bits (i.e., a factor of 2 from optimal).

Optimal for Pr(x) = 1/(2x²), and i.i.d. integers.

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

Answer: 8, 6, 3, 59, 7
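A small sketch (C++, my own helper names) of g-encoding and decoding over a string of ‘0’/‘1’ characters, following the definition above: ⌊log2 x⌋ zeros, then x in binary.

#include <cstdint>
#include <string>
#include <vector>

// gamma code: (Length-1) zeros followed by the Length-bit binary representation of x (x > 0).
std::string gamma_encode(uint64_t x) {
    int len = 0;
    for (uint64_t t = x; t > 0; t >>= 1) ++len;        // len = floor(log2 x) + 1
    std::string out(len - 1, '0');
    for (int b = len - 1; b >= 0; --b) out.push_back(((x >> b) & 1) ? '1' : '0');
    return out;
}

// decode a concatenation of gamma codes,
// e.g. "0001000001100110000011101100111" -> 8 6 3 59 7 as in the exercise above
std::vector<uint64_t> gamma_decode(const std::string& bits) {
    std::vector<uint64_t> out;
    size_t i = 0;
    while (i < bits.size()) {
        int zeros = 0;
        while (bits[i] == '0') { ++zeros; ++i; }        // unary part: number of extra bits
        uint64_t x = 0;
        for (int b = 0; b <= zeros; ++b) x = (x << 1) | (bits[i++] - '0');   // zeros+1 bits
        out.push_back(x);
    }
    return out;
}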

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2·log2 i + 1
How good is this approach wrt Huffman?  Compression ratio ≤ 2·H0(s) + 1
Key fact:
  1 ≥ Σ_{i=1..x} p_i ≥ x·p_x   ⟹   x ≤ 1/p_x

How good is it ?
Encode the integers via g-coding: |g(i)| ≤ 2·log2 i + 1
The cost of the encoding is (recall i ≤ 1/p_i):

  Σ_{i=1..|S|} p_i · |g(i)|  ≤  Σ_{i=1..|S|} p_i · [ 2·log2 (1/p_i) + 1 ]  =  2·H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations
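A minimal sketch (C++, my own naming, one common formulation of the scheme) of the end-tagged dense code just described: the rank r is written in base 128, and only the last byte carries the tag bit.

#include <cstdint>
#include <vector>

// End-Tagged Dense Code: encode the 0-based rank of a word into a byte sequence.
// Continuer bytes have the top bit 0; the last (stopper) byte has the top bit 1.
std::vector<uint8_t> etdc_encode(uint64_t rank) {
    std::vector<uint8_t> bytes;
    bytes.push_back((uint8_t)(rank % 128) | 0x80);    // stopper: tag the last byte
    while (rank >= 128) {
        rank = rank / 128 - 1;
        bytes.push_back((uint8_t)(rank % 128));       // continuer bytes
    }
    // bytes were produced last-first; reverse them to get the emitted order
    return std::vector<uint8_t>(bytes.rbegin(), bytes.rend());
}

With this formulation ranks 0..127 take 1 byte and ranks 128..16511 take 2 bytes, matching the 128 + 128² = 16512 words on at most 2 bytes mentioned below.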

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128
 s + c = 256 (we are playing with 8 bits)
 Thus s items are encoded with 1 byte,
 s·c with 2 bytes, s·c² with 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 128² = 16512 words on at most 2 bytes
A (230,26)-dense code encodes 230 + 230·26 = 6210 words on at most 2 bytes, hence more words on 1 byte, and thus it wins if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1^n 2^n 3^n … n^n  ⟹  Huff = O(n² log n), MTF = O(n log n) + n²

No much worse than Huffman
...but it may be far better
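A tiny sketch (C++, the list kept as a vector, fine for small alphabets; my own naming) of the MTF transform described above: each symbol is replaced by its current 1-based position in the list and then moved to the front.

#include <algorithm>
#include <string>
#include <vector>

// Move-To-Front: map each symbol of the input to its (1-based) position in the list L,
// then move that symbol to the front of L.
std::vector<int> mtf_encode(const std::string& s, std::vector<char> L /* initial symbol list */) {
    std::vector<int> out;
    out.reserve(s.size());
    for (char c : s) {
        auto it = std::find(L.begin(), L.end(), c);
        int pos = (int)(it - L.begin());
        out.push_back(pos + 1);                   // 1-based position, to be gamma-coded
        L.erase(it);
        L.insert(L.begin(), c);                   // move to front
    }
    return out;
}

For instance, mtf_encode("aabbb", {'a','b','c'}) gives 1,1,2,1,1: repeated symbols become runs of 1s, which is what makes MTF pair so well with RLE and a statistical coder.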

MTF: how good is it ?
Encode the integers via g-coding: |g(i)| ≤ 2·log2 i + 1
Put S in front and consider the cost of encoding:

  O(|S| log |S|) + Σ_{x=1..|S|} Σ_{i=2..n_x} |g( p_i^x - p_{i-1}^x )|

By Jensen’s inequality:

  ≤ O(|S| log |S|) + Σ_{x=1..|S|} n_x · [ 2·log2 (N / n_x) + 1 ]
  ≤ O(|S| log |S|) + N · [ 2·H0(X) + 1 ]

  L_a[mtf] ≤ 2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code (there is a memory)
X = 1^n 2^n 3^n … n^n  ⟹  Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)
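A small sketch (C++, my own naming) of run-length encoding as in the example above: each maximal run is emitted as a (symbol, length) pair.

#include <string>
#include <utility>
#include <vector>

// RLE: "abbbaacccca" -> (a,1),(b,3),(a,2),(c,4),(a,1)
std::vector<std::pair<char, size_t>> rle_encode(const std::string& s) {
    std::vector<std::pair<char, size_t>> runs;
    for (size_t i = 0; i < s.size(); ) {
        size_t j = i;
        while (j < s.size() && s[j] == s[i]) ++j;    // extend the current run
        runs.push_back({s[i], j - i});
        i = j;
    }
    return runs;
}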

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.  p(a) = .2, p(b) = .5, p(c) = .3

  f(i) = Σ_{j=1..i-1} p(j)

  f(a) = .0,  f(b) = .2,  f(c) = .7

[Figure: the unit interval [0,1) split into a = [0, .2), b = [.2, .7), c = [.7, 1).]

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7)).

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
[Figure: the interval is refined symbol by symbol]
  start:           [0, 1)
  after b (= .5):  [.2, .7)
  after a (= .2):  [.2, .3)
  after c (= .3):  [.27, .3)
The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l 0  0 l i  l i -1  s i -1 * f c i 

s0  1

s i  s i -1 * p c i 

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

n

sn 

 p c 
i

i 1

The interval for a message sequence will be called the
sequence interval
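A sketch (C++, doubles, my own naming) of the sequence-interval computation above; real coders use the integer renormalization described later, but this shows the l_i / s_i recurrence directly on the slides’ example.

#include <cstdio>
#include <map>
#include <string>

// Compute the sequence interval [l, l+s) of a message, given symbol probabilities.
// Distribution from the slides: p(a)=.2, p(b)=.5, p(c)=.3 -> f(a)=0, f(b)=.2, f(c)=.7
int main() {
    std::map<char, double> p = {{'a', .2}, {'b', .5}, {'c', .3}};
    std::map<char, double> f = {{'a', .0}, {'b', .2}, {'c', .7}};  // cumulative, symbol excluded
    std::string msg = "bac";
    double l = 0.0, s = 1.0;                     // l_0 = 0, s_0 = 1
    for (char c : msg) {
        l = l + s * f[c];                        // l_i = l_{i-1} + s_{i-1} * f[c_i]
        s = s * p[c];                            // s_i = s_{i-1} * p[c_i]
    }
    std::printf("sequence interval = [%.4f, %.4f)\n", l, l + s);   // [0.2700, 0.3000)
    return 0;
}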

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
[Figure: the decoder repeatedly locates .49 inside the current interval]
  .49 ∈ [.2, .7)    = b’s symbol interval       → new interval [.2, .7)
  .49 ∈ [.3, .55)   = b’s sub-interval of it    → new interval [.3, .55)
  .49 ∈ [.475, .55) = c’s sub-interval of it    → new interval [.475, .55)

The message is bbc.

Representing a real number
Binary fractional representation:
. 75

 . 11

1/ 3

 . 01 01

11 / 16

 . 1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
  number   min       max       interval
  .11      .110…     .111…     [.75, 1.0)
  .101     .1010…    .1011…    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

[Figure: the sequence interval is [.61, .79); the code interval of .101, i.e. [.625, .75), is contained in it.]

Can use L + s/2 truncated to 1 + ⌈log2 (1/s)⌉ bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

  1 + ⌈log2 (1/s)⌉  =  1 + ⌈log2 Π_{i=1..n} (1/p_i)⌉
                    ≤  2 + Σ_{i=1..n} log2 (1/p_i)
                    =  2 + Σ_{k=1..|S|} n·p_k · log2 (1/p_k)
                    =  2 + n·H0   bits

In practice ≈ n·H0 + 0.02·n bits, because of rounding.

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)

If l ≥ R/2 then (top half)
  Output 1 followed by m 0s; m = 0; the message interval is expanded by 2

If u < R/2 then (bottom half)
  Output 0 followed by m 1s; m = 0; the message interval is expanded by 2

If l ≥ R/4 and u < 3R/4 then (middle half)
  Increment m; the message interval is expanded by 2

In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine

[Figure: the Arithmetic ToolBox (ATB) as a state machine: given the current interval (L,s), the symbol distribution (p1,…,p|S|) and the next symbol c, it outputs the new interval (L’,s’) with L’ = L + s·f(c) and s’ = s·p(c).]

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts        String = ACCBACCACBA B,   k = 2

  Context Empty:   A = 4   B = 2   C = 5   $ = 3

  Context A:    C = 3   $ = 1
  Context B:    A = 2   $ = 1
  Context C:    A = 1   B = 2   C = 2   $ = 3

  Context AC:   B = 1   C = 2   $ = 2
  Context BA:   C = 1   $ = 1
  Context CA:   C = 1   $ = 1
  Context CB:   A = 2   $ = 1
  Context CC:   A = 1   B = 1   $ = 2
You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves
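A simplified sketch (C++, my own naming) of LZ77 with a sliding window, emitting (distance, length, next-char) triples as in the example that follows; the match search is kept naive for clarity (gzip uses a hash table, as noted later).

#include <cstddef>
#include <string>
#include <tuple>
#include <vector>

// LZ77 with window W: emit (d, len, c) and advance the cursor by len+1.
std::vector<std::tuple<size_t, size_t, char>> lz77_encode(const std::string& T, size_t W) {
    std::vector<std::tuple<size_t, size_t, char>> out;
    size_t cursor = 0, n = T.size();
    while (cursor < n) {
        size_t best_len = 0, best_d = 0;
        size_t start = (cursor > W) ? cursor - W : 0;
        for (size_t i = start; i < cursor; ++i) {            // candidate copy positions
            size_t len = 0;
            // the copy may overlap the text being encoded (len may exceed cursor - i)
            while (cursor + len + 1 < n && T[i + len] == T[cursor + len]) ++len;
            if (len > best_len) { best_len = len; best_d = cursor - i; }
        }
        char next = T[cursor + best_len];                    // char beyond the longest match
        out.emplace_back(best_d, best_len, next);
        cursor += best_len + 1;
    }
    return out;
}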

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
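A compact sketch (C++, my own naming, dictionary kept as a map from strings to ids) of the LZW coder just described: ids 0..255 are the initial single-char entries, each step outputs the id of the longest dictionary match S and inserts S plus the next character.

#include <cstdint>
#include <map>
#include <string>
#include <vector>

// LZW: output a sequence of dictionary ids; the dictionary grows with S+c after each match.
std::vector<uint32_t> lzw_encode(const std::string& T) {
    std::map<std::string, uint32_t> dict;
    for (int c = 0; c < 256; ++c) dict[std::string(1, (char)c)] = (uint32_t)c;   // initial entries
    std::vector<uint32_t> out;
    uint32_t next_id = 256;
    std::string S;
    for (char c : T) {
        if (dict.count(S + c)) {
            S += c;                                // extend the current match
        } else {
            out.push_back(dict[S]);                // emit the longest match found so far
            dict[S + c] = next_id++;               // add S+c to the dictionary
            S = std::string(1, c);                 // restart the match from c
        }
    }
    if (!S.empty()) out.push_back(dict[S]);        // flush the last match
    return out;
}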

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows   (1994)

  F              L
  # mississipp i
  i #mississip p
  i ppi#missis s
  i ssippi#mis s
  i ssissippi# m
  m ississippi #
  p i#mississi p
  p pi#mississ i
  s ippi#missi s
  s issippi#mi s
  s sippi#miss i
  s sissippi#m i

A famous example

Much
longer...

A useful tool: the L → F mapping

[The same sorted-rotations matrix as above, with its F and L columns.]

How do we map L’s chars onto F’s chars ?
... we need to distinguish equal chars in F...
Take two equal chars of L and rotate their rows rightward by one: they keep the same relative order !!

The BWT is invertible

[The same sorted-rotations matrix, F and L columns.]

Two key properties:
1. The LF-array maps L’s chars to F’s chars
2. L[ i ] precedes F[ i ] in T

Reconstruct T backward:   T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
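A small sketch (C++, my own naming; the BWT is built by a naive O(n² log n) sort of the rotations, enough for this toy example) of computing the BWT and inverting it with the LF-mapping, along the lines of the pseudocode above.

#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// BWT by sorting all rotations of T; T is assumed to end with a unique smallest char, e.g. '#'.
std::string bwt(const std::string& T) {
    size_t n = T.size();
    std::vector<size_t> rot(n);
    std::iota(rot.begin(), rot.end(), 0);
    std::sort(rot.begin(), rot.end(), [&](size_t a, size_t b) {
        return T.substr(a) + T.substr(0, a) < T.substr(b) + T.substr(0, b);
    });
    std::string L(n, ' ');
    for (size_t i = 0; i < n; ++i) L[i] = T[(rot[i] + n - 1) % n];   // last column
    return L;
}

// Invert the BWT: equal chars keep the same relative order in L and F (the key property).
std::string inverse_bwt(const std::string& L) {
    size_t n = L.size();
    std::vector<size_t> idx(n);
    std::iota(idx.begin(), idx.end(), 0);
    std::stable_sort(idx.begin(), idx.end(), [&](size_t a, size_t b) { return L[a] < L[b]; });
    std::vector<size_t> LF(n);
    for (size_t j = 0; j < n; ++j) LF[idx[j]] = j;    // L[i] corresponds to F[LF[i]]
    std::string out(n, ' ');
    size_t r = 0;                                     // row 0 of the sorted matrix starts with '#'
    for (size_t k = n; k > 0; --k) { out[k - 1] = L[r]; r = LF[r]; }
    return out.substr(1) + out[0];                    // out starts with '#': rotate it back to the end
}

// e.g. bwt("mississippi#") == "ipssm#pissii", and inverse_bwt of that returns "mississippi#".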

How to compute the BWT ?

  SA    BWT matrix      L
  12    #mississipp     i
  11    i#mississip     p
   8    ippi#missis     s
   5    issippi#mis     s
   2    ississippi#     m
   1    mississippi     #
  10    pi#mississi     p
   9    ppi#mississ     i
   7    sippi#missi     s
   4    sissippi#mi     s
   6    ssippi#miss     i
   3    ssissippi#m     i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?

  SA
  12   #
  11   i#
   8   ippi#
   5   issippi#
   2   ississippi#
   1   mississippi#
  10   pi#
   9   ppi#
   7   sippi#
   4   sissippi#
   6   ssippi#
   3   ssissippi#
Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Prob. that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
  Pr[ in-degree(u) = k ]  ∝  1 / k^a,   a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Prob. that a node has x links is 1/x^a, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides an efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
            Emacs size    Emacs time
  uncompr   27Mb          ---
  gzip      8Mb           35 secs
  zdelta    1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: a weighted graph over the files plus a dummy node 0; edge weights are zdelta sizes (e.g. 20, 123, 220, 620, 2000), and the min branching picks the cheapest reference for each file.]

            space    time
  uncompr   30Mb     ---
  tgz       20%      linear
  THIS      8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
            space    time
  uncompr   260Mb    ---
  tgz       12%      2 mins
  THIS      8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
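A rough sketch (C++, my own simplified hashes, not rsync's real 4-byte rolling checksum + MD5 pair) of the core idea: the blocks of f_old are indexed by hash, and f_new is scanned with a rolling window so that unchanged blocks are sent as references and everything else as literals.

#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// One token of the encoded file: either "copy block b of f_old" or a literal character.
struct Token { bool is_copy; size_t block; char lit; };

std::vector<Token> rsync_encode(const std::string& f_old, const std::string& f_new, size_t B) {
    const uint64_t b = 257;                       // base of the polynomial rolling hash (mod 2^64)
    uint64_t bpow = 1;                            // b^(B-1), used to drop the leading char
    for (size_t i = 0; i + 1 < B; ++i) bpow *= b;
    auto hash_at = [&](const std::string& s, size_t p) {      // hash of s[p, p+B)
        uint64_t h = 0;
        for (size_t i = 0; i < B; ++i) h = h * b + (unsigned char)s[p + i];
        return h;
    };
    std::unordered_map<uint64_t, size_t> index;   // block hash -> block number in f_old
    for (size_t p = 0; p + B <= f_old.size(); p += B) index[hash_at(f_old, p)] = p / B;

    std::vector<Token> out;
    size_t i = 0;
    bool fresh = true;                            // window hash must be (re)computed
    uint64_t h = 0;
    while (i < f_new.size()) {
        if (i + B > f_new.size()) {               // tail shorter than a block: plain literals
            out.push_back({false, 0, f_new[i++]});
            continue;
        }
        if (fresh) { h = hash_at(f_new, i); fresh = false; }
        auto it = index.find(h);
        if (it != index.end() && f_old.compare(it->second * B, B, f_new, i, B) == 0) {
            out.push_back({true, it->second, 0}); // a whole block of f_old matched: copy it
            i += B; fresh = true;
        } else {
            out.push_back({false, 0, f_new[i]});  // no match: emit a literal, roll the hash
            if (i + B < f_new.size())
                h = (h - (unsigned char)f_new[i] * bpow) * b + (unsigned char)f_new[i + B];
            else fresh = true;
            ++i;
        }
    }
    return out;
}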

Rsync: some experiments

            gcc size    emacs size
  total     27288       27326
  gzip      7563        8577
  zdelta    227         1431
  rsync     964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits.

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits.

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

[Figure: P drawn as a prefix of the suffix T[i,N].]

Occurrences of P in T = All suffixes of T having P as a prefix

Example: P = si, T = mississippi  →  occurrences at positions 4, 7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree

[Slide figure: the suffix tree of T# = mississippi#, with edge labels such as i, s, si, ssi, p, pi#, ppi#, #, mississippi#, and leaves labelled with the starting positions 1..12 of the suffixes.]

T# = mississippi#

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
  SA      SUF(T)               (Θ(N²) space if the suffixes are stored explicitly)
  12      #
  11      i#
   8      ippi#
   5      issippi#
   2      ississippi#
   1      mississippi#
  10      pi#
   9      ppi#
   7      sippi#
   4      sissippi#
   6      ssippi#
   3      ssissippi#
T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
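A sketch (C++, my own naming) of the indirected binary search on SA described above: each comparison looks at no more than |P| characters of a suffix, and the contiguous SA range of suffixes prefixed by P gives all the occurrences.

#include <algorithm>
#include <string>
#include <vector>

// Starting positions (0-based) of all occurrences of P in T, given the suffix array SA
// (SA[i] = 0-based starting position of the i-th smallest suffix of T).
std::vector<size_t> sa_search(const std::string& T, const std::vector<size_t>& SA,
                              const std::string& P) {
    auto suf_less = [&](size_t s, const std::string& p) {   // suffix < P (first |P| chars)
        return T.compare(s, p.size(), p) < 0;
    };
    auto less_suf = [&](const std::string& p, size_t s) {   // P < suffix (first |P| chars)
        return T.compare(s, p.size(), p) > 0;
    };
    auto lo = std::lower_bound(SA.begin(), SA.end(), P, suf_less);
    auto hi = std::upper_bound(SA.begin(), SA.end(), P, less_suf);
    return std::vector<size_t>(lo, hi);   // O(p log N) comparisons, plus occ to report
}

For T = "mississippi#" and P = "si" this returns {6, 3} (0-based, in SA order), i.e. the slide’s occurrences at positions 7 and 4.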

Locating the occurrences
  SA = 12 11 8 5 2 1 10 9 7 4 6 3,   T = mississippi#,   P = si

[Figure: the binary search isolates the contiguous SA range of suffixes prefixed by “si”, i.e. entries 7 (sippi…) and 4 (sissippi…): occ = 2, occurrences at positions 4 and 7.]

Suffix Array search
• O(p + log2 N + occ) time
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

  Lcp = 0 0 1 4 0 0 1 0 2 1 3
  SA  = 12 11 8 5 2 1 10 9 7 4 6 3
  T = mississippi#
  (e.g. the adjacent suffixes issippi… and ississippi… share a prefix of length 4)

• How long is the common prefix between T[i,...] and T[j,...] ?
  • It is the min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  • Search for some Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  • Search for a window Lcp[i, i+C-2] whose entries are all ≥ L.
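A tiny sketch (C++, my own naming) of the last two queries above, answered by a single scan of the Lcp array.

#include <algorithm>
#include <vector>

// Is there a repeated substring of length >= L ?  (some adjacent suffixes share >= L chars)
bool has_repeat(const std::vector<int>& lcp, int L) {
    return std::any_of(lcp.begin(), lcp.end(), [&](int v) { return v >= L; });
}

// Is there a substring of length >= L occurring >= C times ?
// (a window of C-1 consecutive Lcp entries, all >= L)
bool has_frequent_repeat(const std::vector<int>& lcp, int L, int C) {
    int run = 0;
    for (int v : lcp) {
        run = (v >= L) ? run + 1 : 0;
        if (run >= C - 1) return true;
    }
    return C <= 1;   // a substring trivially occurs at least once
}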


Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution

[Slide figure: the tagged word-based Huffman trie over the dictionary {bzip, space, not, or}, and the compressed text C(S) for S = “bzip or not bzip”. The pattern P = bzip is translated into its codeword (1a 0b) and searched directly in C(S); its two occurrences are marked “yes”.]

Speed ≈ Compression ratio

The Shift-And method

Define M to be a binary m-by-n matrix such that:
M(i,j) = 1 iff the first i characters of P exactly match the i characters of T ending at character j,
i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1 … j].

Example: T = california and P = for

          c  a  l  i  f  o  r  n  i  a
    j =   1  2  3  4  5  6  7  8  9  10
  f (i=1) 0  0  0  0  1  0  0  0  0  0
  o (i=2) 0  0  0  0  0  1  0  0  0  0
  r (i=3) 0  0  0  0  0  0  1  0  0  0

How does M solve the exact match problem?

How to construct M

We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one.
Machines can perform bit and arithmetic operations between two words in constant time.

Examples:
 And(A,B) is the bit-wise AND between A and B.
 BitShift(A) is the value derived by shifting A’s bits down by one and setting the first bit to 1.

   BitShift( (0,1,1,0,1)ᵀ ) = (1,0,1,1,0)ᵀ

Let w be the word size (e.g., 32 or 64 bits). We’ll assume m = w.
NOTICE: any column of M fits in a memory word.

How to construct M

We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one.
We define the m-length binary vector U(x) for each character x in the alphabet: U(x) is set to 1 for the positions in P where character x appears.

Example: P = abaac
  U(a) = (1,0,1,1,0)ᵀ    U(b) = (0,1,0,0,0)ᵀ    U(c) = (0,0,0,0,1)ᵀ

How to construct M

 Initialize column 0 of M to all zeros.
 For j > 0, the j-th column is obtained by

   M(j) = BitShift( M(j-1) ) & U( T[j] )

For i > 1, entry M(i,j) = 1 iff
 (1) the first i-1 characters of P match the i-1 characters of T ending at character j-1   ⇔   M(i-1, j-1) = 1
 (2) P[i] = T[j]   ⇔   the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1, j-1) into the i-th position;
AND-ing this with the i-th bit of U(T[j]) establishes whether both conditions hold.

An example (j = 1, 2, 3, …, 9)

T = xabxabaaca,  P = abaac,  M(0) = (0,0,0,0,0)ᵀ,  U(x) = (0,0,0,0,0)ᵀ

  j=1, T[1]=x:  M(1) = BitShift(M(0)) & U(x) = (1,0,0,0,0)ᵀ & (0,0,0,0,0)ᵀ = (0,0,0,0,0)ᵀ
  j=2, T[2]=a:  M(2) = BitShift(M(1)) & U(a) = (1,0,0,0,0)ᵀ & (1,0,1,1,0)ᵀ = (1,0,0,0,0)ᵀ
  j=3, T[3]=b:  M(3) = BitShift(M(2)) & U(b) = (1,1,0,0,0)ᵀ & (0,1,0,0,0)ᵀ = (0,1,0,0,0)ᵀ
  …
  j=9, T[9]=c:  M(9) = BitShift(M(8)) & U(c) = (1,1,0,0,1)ᵀ & (0,0,0,0,1)ᵀ = (0,0,0,0,1)ᵀ

The full matrix M (rows i = 1..5, columns j = 1..9):

          x  a  b  x  a  b  a  a  c
    j =   1  2  3  4  5  6  7  8  9
  i=1     0  1  0  0  1  0  1  1  0
  i=2     0  0  1  0  0  1  0  0  0
  i=3     0  0  0  0  0  0  1  0  0
  i=4     0  0  0  0  0  0  0  1  0
  i=5     0  0  0  0  0  0  0  0  1

M(5,9) = 1: an occurrence of P ends at position 9 (i.e., it starts at position 5).

Shift-And method: Complexity

 If m ≤ w, any column and any vector U() fit in a memory word
   ⇒ any step requires O(1) time.
 If m > w, any column and any vector U() can be divided into m/w memory words
   ⇒ any step requires O(m/w) time.

Overall O(n(1+m/w)+m) time.
Thus it is very fast when the pattern length is close to the word size, which is very often the case in practice. Recall that w = 64 bits in modern architectures.
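A compact way to realize this bit-parallelism is to keep each column M(j) as an unsigned integer, with bit i standing for row i+1. The following Python sketch (an illustration, not the slides' original code) follows the recurrence M(j) = BitShift(M(j-1)) & U(T[j]) exactly:

def shift_and(T, P):
    # Bit i of the mask plays the role of row i+1 of the column M(j)
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)        # U[c] has bit i set iff P[i] == c
    occ, M = [], 0
    for j, c in enumerate(T):
        M = ((M << 1) | 1) & U.get(c, 0)     # BitShift then AND with U(T[j])
        if M & (1 << (m - 1)):               # top bit set: P ends at position j
            occ.append(j - m + 1)            # 0-based starting position
    return occ

For instance, shift_and("xabxabaaca", "abaac") returns [4], i.e. the occurrence starting at position 5 of the running example.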

Some simple extensions

We want to allow the pattern to contain special symbols, like the [a-f] classes of chars.

P = [a-b]baac
  U(a) = (1,0,1,1,0)ᵀ    U(b) = (1,1,0,0,0)ᵀ    U(c) = (0,0,0,0,1)ᵀ

What about ‘?’ and ‘[^…]’ (not)?

Problem 1: Another solution

[Slide figure: the same tagged-Huffman trie and compressed text C(S) for S = “bzip or not bzip”; the codeword of P = bzip (1a 0b) is compared against C(S) codeword by codeword, with “yes”/“no” marks on the candidate alignments.]

Speed ≈ Compression ratio

Problem 2

Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring.

[Slide figure: the tagged-Huffman trie and C(S) for S = “bzip or not bzip”, with P = o. The terms containing “o” are not (codeword 1g 0g 0a) and or (codeword 1g 0a 0b), and their occurrences are marked in C(S).]

Speed ≈ Compression ratio?  No! Why?
A scan of C(S) is needed for each term that contains P.

Multi-Pattern Matching Problem

Given a set of patterns P = {P1, P2, …, Pl} of total length m, we want to find all the occurrences of those patterns in the text T[1,n].

[Slide figure: a text T with the occurrences of P1 and P2 highlighted.]

 Naïve solution
   Use an (optimal) exact-matching algorithm to search for each pattern of P
   Complexity: O(nl + m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
   Complexity: O(n + l + m) time

A simple extension of Shift-And

 S is the concatenation of the patterns in P.
 R is a bitmap of length m: R[i] = 1 iff S[i] is the first symbol of a pattern.

Use a variant of the Shift-And method searching for S:

 For any symbol c, define U’(c) = U(c) AND R,
   so U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern.
 For any step j:
   compute M(j), then OR it with U’(T[j]). Why?
     This sets to 1 the first bit of each pattern that starts with T[j].
   Check whether there are occurrences ending at j. How?

Problem 3

Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring, allowing at most k mismatches.

[Slide figure: the tagged-Huffman trie and C(S) for S = “bzip or not bzip”, with P = bot and k = 2.]

Agrep: Shift-And method with errors

We extend the Shift-And method for finding inexact occurrences of a pattern in a text.

Example: T = aatatccacaa, P = atcgaa
P appears in T with 2 mismatches starting at position 4; it also occurs with 4 mismatches starting at position 2.

Agrep

Our current goal: given k, find all the occurrences of P in T with up to k mismatches.
We define the matrix Ml to be an m-by-n binary matrix such that:

  Ml(i,j) = 1 iff there are no more than l mismatches between the first i characters of P and the i characters of T ending at character j.

What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk

 We compute Ml for all l = 0, …, k.
 For each j, compute M0(j), M1(j), …, Mk(j).
 For all l, initialize Ml(0) to the zero vector.
 In order to compute Ml(j), we observe that there is a match iff one of the following two cases holds.

Computing Ml: case 1

The first i-1 characters of P match a substring of T ending at j-1 with at most l mismatches, and the next pair of characters in P and T are equal:

  BitShift( Ml(j-1) ) & U( T[j] )

Computing Ml: case 2

The first i-1 characters of P match a substring of T ending at j-1 with at most l-1 mismatches (one unit of the mismatch budget is spent on the character at position i):

  BitShift( Ml-1(j-1) )

Computing Ml

 We compute Ml for all l = 0, …, k.
 For each j, compute M0(j), M1(j), …, Mk(j).
 For all l, initialize Ml(0) to the zero vector.
 In order to compute Ml(j), combine the two cases:

   Ml(j) = [ BitShift( Ml(j-1) ) & U(T[j]) ]  OR  BitShift( Ml-1(j-1) )
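As a hedged illustration of the recurrence above, the Python sketch below (function name and example are mine, not from the slides) keeps one integer bitmask per mismatch level and reports the ends of occurrences with at most k mismatches:

def shift_and_k_mismatches(T, P, k):
    # Bit-parallel search with up to k mismatches:
    #   M^l(j) = ( BitShift(M^l(j-1)) & U(T[j]) )  OR  BitShift(M^(l-1)(j-1))
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)            # bit i of U[c] is 1 iff P[i] == c
    M = [0] * (k + 1)                            # M[l] = column M^l(j-1) as an integer
    occ = []
    for j, c in enumerate(T):
        newM = [0] * (k + 1)
        for l in range(k + 1):
            cur = ((M[l] << 1) | 1) & U.get(c, 0)      # case 1: extend with a matching char
            if l > 0:
                cur |= (M[l - 1] << 1) | 1             # case 2: spend one mismatch here
            newM[l] = cur
        M = newM
        if M[k] & (1 << (m - 1)):                # P occurs ending at j with <= k mismatches
            occ.append(j - m + 1)
    return occ

For the slides' example, shift_and_k_mismatches("aatatccacaa", "atcgaa", 2) returns [3], i.e. the 2-mismatch occurrence starting at position 4 in 1-based indexing.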

Example: M1 and M0

T = xabxabaaca,  P = abaad

M1 =
          x  a  b  x  a  b  a  a  c  a
    j =   1  2  3  4  5  6  7  8  9  10
  i=1     1  1  1  1  1  1  1  1  1  1
  i=2     0  0  1  0  0  1  0  1  1  0
  i=3     0  0  0  1  0  0  1  0  0  1
  i=4     0  0  0  0  1  0  0  1  0  0
  i=5     0  0  0  0  0  0  0  0  1  0

M0 =
          x  a  b  x  a  b  a  a  c  a
    j =   1  2  3  4  5  6  7  8  9  10
  i=1     0  1  0  0  1  0  1  1  0  1
  i=2     0  0  1  0  0  1  0  0  0  0
  i=3     0  0  0  0  0  0  1  0  0  0
  i=4     0  0  0  0  0  0  0  1  0  0
  i=5     0  0  0  0  0  0  0  0  0  0

M1(5,9) = 1: P occurs with at most 1 mismatch ending at position 9.

How much do we pay?

 The running time is O(k·n·(1+m/w)).
 Again, the method is practically efficient for small m.
 Only O(k) columns of M are needed at any given time; hence the space used by the algorithm is O(k) memory words.

Problem 3: Solution

Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring, allowing k mismatches.

[Slide figure: the tagged-Huffman trie and C(S) for S = “bzip or not bzip”, with P = bot and k = 2; the term not (codeword 1g 0g 0a) is reported as a match.]

Agrep: more sophisticated operations

The Shift-And method can solve other operations too.

The edit distance between two strings p and s is d(p,s) = the minimum number of operations needed to transform p into s via three ops:
 Insertion: insert a symbol in p
 Deletion: delete a symbol from p
 Substitution: change a symbol in p into a different one

Example: d(ananas, banane) = 3

Search by regular expressions
 Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding

  γ(x) = 0 0 … 0  followed by x in binary, where the number of leading 0s is Length − 1
         (x > 0 and Length = ⌊log2 x⌋ + 1)

e.g., 9 is represented as <000, 1001>.

The γ-code for x takes 2⌊log2 x⌋ + 1 bits (i.e., a factor of 2 from optimal).
It is optimal for Pr(x) = 1/(2x²) and i.i.d. integers.
It is a prefix-free encoding…

Exercise: given the following sequence of γ-coded integers, reconstruct the original sequence:

  0001000001100110000011101100111
  →  8, 6, 3, 59, 7     (parse: 0001000 | 00110 | 011 | 00000111011 | 00111)
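A small Python sketch (mine, under the definition above) makes the coding and the decoding exercise mechanical:

def gamma_encode(x):
    # Elias gamma code: (Length-1) zeros followed by x written in binary (x > 0)
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    # Decode a concatenation of gamma codes back into the integer sequence
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":            # count leading zeros = Length - 1
            z += 1
            i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out

gamma_encode(9) gives "0001001", and gamma_decode("0001000001100110000011101100111") gives [8, 6, 3, 59, 7], matching the exercise.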

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).

Recall that |γ(i)| ≤ 2·log i + 1.
How good is this approach w.r.t. Huffman?
Compression ratio ≤ 2·H0(s) + 1.
Key fact:  1 ≥ Σ i=1,…,x pi ≥ x·px   ⇒   x ≤ 1/px

How good is it?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1.
The cost of the encoding is (recall i ≤ 1/pi):

  Σ i=1,…,|S|  pi · |γ(i)|   ≤   Σ i=1,…,|S|  pi · [ 2·log(1/pi) + 1 ]   =   2·H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + …

A better encoding

 Byte-aligned and tagged Huffman
   128-ary Huffman tree
   First bit of the first byte is tagged
   Configurations on 7 bits: just those of Huffman

 End-tagged dense code
   The rank r is mapped to the r-th binary sequence on 7·k bits
   First bit of the last byte is tagged

Surprising changes:
 it is a prefix-code;
 better compression: it uses all 7-bit configurations.

(s,c)-dense codes
The distribution of words is skewed: 1/i^q, where 1 < q < 2.

A new concept: Continuers vs Stoppers
 Previously we used s = c = 128.

The main idea is:
 s + c = 256 (we are playing with 8 bits)
 thus s items are encoded with 1 byte,
 s·c with 2 bytes, s·c² with 3 bytes, ...

An example
 5000 distinct words
 ETDC encodes 128 + 128² = 16512 words within 2 bytes
 The (230,26)-dense code encodes 230 + 230·26 = 6210 words within 2 bytes, hence more words on 1 byte, and thus it wins if the distribution is skewed…

Optimal (s,c)-dense codes
Find the optimal s, assuming c = 256 − s.

 Brute-force approach
 Binary search: on real distributions there seems to be a unique minimum.

   Ks = max codeword length
   Fs,k = cumulative probability of the symbols whose codeword length is ≤ k

Experiments: (s,c)-DC is quite interesting…
Search is 6% faster than byte-aligned Huffword.

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move-to-Front Coding
Transforms a char sequence into an integer sequence, which can then be var-length coded.

 Start with the list of symbols L = [a,b,c,d,…]
 For each input symbol s:
   1) output the position of s in L
   2) move s to the front of L

There is a memory.
Properties: it exploits temporal locality, and it is dynamic.

 X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n² log n) bits, MTF = O(n log n) + n² bits

Not much worse than Huffman… but it may be far better.
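A minimal Python sketch of the MTF step (names are mine; this naive list-based version costs O(|L|) per symbol, whereas the search-tree/hash-table organization discussed below brings it to O(log |S|)):

def mtf_encode(s, alphabet):
    # Emit the current position (1-based) of each symbol, then move it to the front
    L = list(alphabet)
    out = []
    for c in s:
        pos = L.index(c)             # 0-based position in the list
        out.append(pos + 1)
        L.pop(pos)
        L.insert(0, c)               # move c to the front
    return out

For example, mtf_encode("aabbbaccc", "abc") returns [1, 1, 2, 1, 1, 2, 3, 1, 1]: runs of equal symbols become runs of 1s, which is what makes the output highly compressible.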

MTF: how good is it?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1.
Put S in front and consider the cost of encoding (p_i^x = position in the sequence of the i-th occurrence of symbol x):

  O(|S| log |S|)  +  Σ x=1,…,|S|  Σ i=2,…,nx  |γ( p_i^x − p_(i-1)^x )|

By Jensen’s inequality this is at most:

  O(|S| log |S|)  +  Σ x=1,…,|S|  nx · [ 2·log(N/nx) + 1 ]
  =  O(|S| log |S|)  +  N · [ 2·H0(X) + 1 ]

Hence  La[mtf]  ≤  2·H0(X) + O(1)  bits per symbol.

MTF: higher compression
To achieve higher compression we consider words (and separators) as the symbols to be encoded.

How to maintain the MTF-list efficiently:

 Search tree
   leaves contain the symbols, ordered as in the MTF-list;
   nodes contain the size of their descending subtree.

 Hash table
   key is a symbol;
   data is a pointer to the corresponding tree leaf.

Each tree operation takes O(log |S|) time.
Total is O(n log |S|), where n = #symbols to be compressed.

Run-Length Encoding (RLE)
If spatial locality is very high, then
  abbbaacccca  ⇒  (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings: just the run lengths and one starting bit.

Properties: it exploits spatial locality, and it is a dynamic code. There is a memory.

 X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive), of width equal to its probability:

  f(i) = Σ j=1,…,i-1  p(j)

e.g.  p(a) = .2, p(b) = .5, p(c) = .3
      a = [0.0, 0.2)    b = [0.2, 0.7)    c = [0.7, 1.0)
      f(a) = .0,  f(b) = .2,  f(c) = .7

The interval for a particular symbol will be called the symbol interval (e.g., for b it is [.2,.7)).

Arithmetic Coding: Encoding Example
Coding the message sequence: bac

  start:       [0.0, 1.0)
  after ‘b’:   [0.2, 0.7)       (b’s symbol interval)
  after ‘a’:   [0.2, 0.3)       (the a-portion of [0.2, 0.7))
  after ‘c’:   [0.27, 0.3)      (the c-portion of [0.2, 0.3))

The final sequence interval is [.27, .3).

Arithmetic Coding
To code a sequence of symbols c1 c2 … cn with probabilities p[c], use the following:

  l0 = 0,   li = li-1 + si-1 · f[ci]
  s0 = 1,   si = si-1 · p[ci]

f[c] is the cumulative probability up to symbol c (not included).
The final interval size is

  sn = Π i=1,…,n  p[ci]

The interval for a message sequence will be called the sequence interval.
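The two recurrences translate directly into code. The floating-point Python sketch below is only an illustration of the interval computation (real coders use the integer/scaling version described later); the function name and the symbol order are my own choices.

def sequence_interval(msg, prob):
    # l_i = l_{i-1} + s_{i-1}*f[c_i],  s_i = s_{i-1}*p[c_i],
    # where f[c] is the cumulative probability of the symbols preceding c.
    f, acc = {}, 0.0
    for c in prob:                      # fixed symbol order, e.g. 'a','b','c'
        f[c] = acc
        acc += prob[c]
    l, s = 0.0, 1.0
    for c in msg:
        l = l + s * f[c]
        s = s * prob[c]
    return l, s                         # the sequence interval is [l, l+s)

For the slides' example, sequence_interval("bac", {'a': .2, 'b': .5, 'c': .3}) returns approximately (0.27, 0.03), i.e. the interval [.27, .3).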

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:

  .49 ∈ [0.2, 0.7)     = b’s interval                         →  output b
  .49 ∈ [0.3, 0.55)    = b’s sub-interval of [0.2, 0.7)       →  output b
  .49 ∈ [0.475, 0.55)  = c’s sub-interval of [0.3, 0.55)      →  output c

The message is bbc.

Representing a real number
Binary fractional representation:

  .75 = .11      1/3 = .0101…      11/16 = .1011

Algorithm:
  1. x = 2·x
  2. if x < 1, output 0
  3. else x = x − 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
e.g.  [0, .33) → .01     [.33, .66) → .1     [.66, 1) → .11

Representing a code interval
Binary fractional numbers can be viewed as intervals, by considering all their completions.

  code    min       max       interval
  .11     .110…     .111…     [.75, 1.0)
  .101    .1010…    .1011…    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional (dyadic) number whose code interval is contained in the sequence interval.

  [Slide figure: a sequence interval [.61, .79) containing the code interval of .101 = [.625, .75).]

One can use L + s/2 truncated to 1 + ⌈log (1/s)⌉ bits.

Bound on Arithmetic length
Note that −log s + 1 = log (2/s).

Bound on Length
Theorem: for a text of length n, the Arithmetic encoder generates at most

  1 + ⌈log (1/s)⌉  =  1 + ⌈log Π i (1/pi)⌉
                   ≤  2 + Σ j=1,…,n  log (1/pj)
                   =  2 + Σ k=1,…,|S|  n·pk · log (1/pk)
                   =  2 + n·H0   bits

In practice it takes nH0 + 0.02·n bits, because of rounding.

Integer Arithmetic Coding
Problem: operations on arbitrary-precision real numbers are expensive.
Key ideas of the integer version:
 keep integers in the range [0, R), where R = 2^k;
 use rounding to generate the integer interval;
 whenever the sequence interval falls into the top, bottom or middle half, expand the interval by a factor 2.

Integer Arithmetic is an approximation.

Integer Arithmetic (scaling)
 If l ≥ R/2 (top half):
   output 1 followed by m 0s; set m = 0; the interval is expanded by 2.
 If u < R/2 (bottom half):
   output 0 followed by m 1s; set m = 0; the interval is expanded by 2.
 If l ≥ R/4 and u < 3R/4 (middle half):
   increment m; the interval is expanded by 2.
 In all other cases, just continue…

You find this at

Arithmetic ToolBox
As a state machine: the coder keeps the current interval (L, s); given the next symbol c and the distribution (p1, …, p|S|), the ATB maps (L, s) to the sub-interval (L’, s’) assigned to c.

Therefore, even the distribution can change over time.

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
The PPM model feeds the ATB with p[ s | context ], where s is either a character c or the escape symbol esc; the ATB maps the current interval (L, s) to (L’, s’) accordingly.

Encoder and decoder must know the protocol for selecting the same conditional probability distribution (the PPM-variant).

PPM: Example Contexts
String = ACCBACCACBA, next symbol = B, k = 2

  Context Empty:   A = 4   B = 2   C = 5   $ = 3

  Context A:       C = 3   $ = 1
  Context B:       A = 2   $ = 1
  Context C:       A = 1   B = 2   C = 2   $ = 3

  Context AC:      B = 1   C = 2   $ = 2
  Context BA:      C = 1   $ = 1
  Context CA:      C = 1   $ = 1
  Context CB:      A = 2   $ = 1
  Context CC:      A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) for n → ∞ !!

LZ77
Text: a a c a a c a b c a b a b a c
Dictionary: ???    Cursor: the current position (all substrings starting here)
Output example: <2,3,c>

Algorithm’s step:
 Output <d, len, c>:
   d = distance of the copied string w.r.t. the current position
   len = length of the longest match
   c = next char in the text beyond the longest match
 Advance by len + 1

A buffer “window” has fixed length and moves along the text.

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later
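To tie the LZW rules to the example above, here is a minimal Python sketch of the encoder (my own illustration; the dictionary is seeded with the slides' toy codes a=112, b=113, c=114 rather than real ASCII values):

def lzw_encode(text):
    dict_ = {"a": 112, "b": 113, "c": 114}   # toy initial dictionary, as in the slides
    next_code = 256
    S, out = "", []
    for c in text:
        if S + c in dict_:
            S = S + c                        # extend the current match
        else:
            out.append(dict_[S])             # emit the longest match S ...
            dict_[S + c] = next_code         # ... and add Sc to the dictionary
            next_code += 1
            S = c
    if S:
        out.append(dict_[S])                 # emit what is left at the end
    return out

lzw_encode("aabaacababacb") returns [112, 112, 113, 256, 114, 257, 261, 114, 113], matching the slide (the final 113 covers the trailing b).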

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform (1994)
We are given a text T = mississippi#. Consider all its cyclic rotations and sort them:

  Rotations of T:       Sorted rows:   F  (middle)      L
  mississippi#                         #  mississipp    i
  ississippi#m                         i  #mississip    p
  ssissippi#mi                         i  ppi#missis    s
  sissippi#mis                         i  ssippi#mis    s
  issippi#miss                         i  ssissippi#    m
  ssippi#missi                         m  ississippi    #
  sippi#missis                         p  i#mississi    p
  ippi#mississ                         p  pi#mississ    i
  ppi#mississi                         s  ippi#missi    s
  pi#mississip                         s  issippi#mi    s
  i#mississipp                         s  sippi#miss    i
  #mississippi                         s  sissippi#m    i

L is the BWT of T. A famous example… much longer in practice.

A useful tool: the L → F mapping

[Same BWT matrix as above, with the F and L columns.]

How do we map L’s chars onto F’s chars?
… we need to distinguish equal chars in F…
Take two equal chars of L and rotate their rows rightward by one position: the two rows keep the same relative order. Hence equal chars preserve their relative order between L and F.

The BWT is invertible

[Same BWT matrix, F and L columns.]

Two key properties:
 1. The LF-array maps L’s chars to F’s chars (equal chars keep their relative order).
 2. L[i] precedes F[i] in T.

Reconstruct T backward:  T = .... i p p i #

InvertBWT(L)
  Compute LF[0,n-1];
  r = 0; i = n;
  while (i > 0) {
    T[i] = L[r];
    r = LF[r]; i--;
  }
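The following Python sketch (mine, not the slides' code) builds the BWT naively by sorting all rotations and inverts it with the LF-mapping; it assumes T ends with a unique, lexicographically smallest terminator such as '#'. It is only illustrative: the next slide shows that real construction goes through the suffix array.

def bwt(T):
    # Naive BWT: sort all rotations of T and take the last column
    n = len(T)
    rotations = sorted(T[i:] + T[:i] for i in range(n))
    return "".join(row[-1] for row in rotations)

def inverse_bwt(L):
    # Invert via LF: equal chars keep their relative order between L and F
    n = len(L)
    order = sorted(range(n), key=lambda r: (L[r], r))   # stable sort of L gives F
    LF = [0] * n
    for f_pos, r in enumerate(order):
        LF[r] = f_pos                                   # LF[r] = F-position of char L[r]
    chars, r = [], 0                                    # row 0 starts with the terminator '#'
    for _ in range(n):
        chars.append(L[r])
        r = LF[r]
    rev = "".join(reversed(chars))                      # '#' followed by T without its terminator
    return rev[1:] + rev[0]

For the running example, bwt("mississippi#") gives "ipssm#pissii" and inverse_bwt("ipssm#pissii") gives back "mississippi#".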

How to compute the BWT?

We said that L[i] precedes F[i] in T. Given the suffix array SA and T, we have L[i] = T[SA[i] − 1].

  T  = mississippi#
  SA = 12 11  8  5  2  1 10  9  7  4  6  3
  L  =  i  p  s  s  m  #  p  i  s  s  i  i        e.g. L[3] = T[SA[3]−1] = T[7]

How to construct SA from T? Sort all the suffixes:

  12: #
  11: i#
   8: ippi#
   5: issippi#
   2: ississippi#
   1: mississippi#
  10: pi#
   9: ppi#
   7: sippi#
   4: sissippi#
   6: ssippi#
   3: ssissippi#

Elegant but inefficient. Obvious inefficiencies:
 • Θ(n² log n) time in the worst case
 • Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii      (# at position 16)

Mtf-list = [i,m,p,s]
Mtf  = 020030000030030200300300000100000
Mtf’ = 030040000040040300400400000200000    (values shifted by 1; e.g. Bin(6)=110, Wheeler’s code)
RLE0 = 03141041403141410210                 (alphabet of size |S|+1)

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols…
… plus γ(16), plus the original Mtf-list (i,m,p,s).

You find this in your Linux distribution.

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…

 Physical network graph
   V = routers, E = communication links

 The “cosine” graph (undirected, weighted)
   V = static web pages, E = semantic distance between pages

 Query-Log graph (bipartite, weighted)
   V = queries and URLs, E = (q,u) if u is a result for q and has been clicked by some user who issued q

 Social graph (undirected, unweighted)
   V = users, E = (x,y) if x knows y (facebook, address book, email, …)

Definition
Directed graph G = (V,E):
 V = URLs, E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN and no OUT links)

Three key properties (the first one):
 Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution
Altavista crawl (1999) and WebBase crawl (2001): the indegree follows a power-law distribution

  Pr[ in-degree(u) = k ]  ∝  1 / k^a,   a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E):
 V = URLs, E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN, no OUT)

Three key properties:
 Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1
 Locality: usually most of the hyperlinks point to other URLs on the same host (about 80%)
 Similarity: pages close in lexicographic order tend to share many outgoing lists

[Slide figure: adjacency-matrix picture of a web crawl with 21 million pages and 150 million links.]

URL-sorting (Berkeley, Stanford): URL compression + delta encoding.

The library WebGraph
 Uncompressed adjacency list
 Adjacency list with compressed gaps (exploits locality)

Successor list S(x) = { s1 − x, s2 − s1 − 1, …, sk − s(k-1) − 1 }
For negative entries (only the first gap can be negative): a value v ≥ 0 is mapped to 2v, a value v < 0 to 2|v| − 1, as in the examples shown later.
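A small Python sketch of this gap encoding (function name and the example node are mine; the nonnegative mapping follows the examples given in the Extra-nodes slide below):

def encode_successors(x, succ):
    # First gap is s1 - x (may be negative), later gaps are s_i - s_{i-1} - 1 (>= 0)
    gaps, prev = [], None
    for i, s in enumerate(sorted(succ)):
        if i == 0:
            g = s - x
            gaps.append(2 * g if g >= 0 else 2 * (-g) - 1)   # map to a non-negative int
        else:
            gaps.append(s - prev - 1)
        prev = s
    return gaps

For a hypothetical node 17 with successors [13, 15, 16, 22], encode_successors(17, [13, 15, 16, 22]) returns [7, 1, 0, 5]; the small values are then handed to a variable-length integer code.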

Copy-lists
 Reference chains, possibly limited to length W.
 Uncompressed adjacency list → adjacency list with copy lists (exploits similarity).
 Each bit of y’s copy-list tells whether the corresponding successor of the reference x is also a successor of y;
 the reference index is chosen in [0,W] so as to give the best compression.

Copy-blocks = RLE(Copy-list)
 Adjacency list with copy lists → adjacency list with copy blocks (RLE on the bit sequences).
 The first copy-block is 0 if the copy-list starts with 0;
 the last block is omitted (we know the length…);
 the length is decremented by one for all blocks.

This is a Java and C++ lib (≈ 3 bits/edge).

Extra-nodes: Compressing Intervals
Adjacency list with copy blocks → exploit consecutivity in the extra-nodes:

 Intervals: use their left extreme and length
 Interval length: decremented by Lmin = 2
 Residuals: differences between residuals, or w.r.t. the source

Examples:
  0    = (15 − 15)·2         (positive)
  2    = (23 − 19) − 2       (jump ≥ 2)
  600  = (316 − 16)·2
  3    = |13 − 15|·2 − 1     (negative)
  3018 = 3041 − 22 − 1

Algoritmi per IR

Compression of file collections

Background
[Slide figure: a sender transmits data over a network link to a receiver, which may already hold some knowledge about the data.]

 network links are getting faster and faster, but
 many clients are still connected by fairly slow links (mobile?)
 people wish to send more and more data

How can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression (one-to-one)

Problem: we have two files fknown and fnew, and the goal is to compute a file fd of minimum size such that fnew can be derived from fknown and fd.

 Assume that block moves and copies are allowed.
 Find an optimal covering set of fnew based on fknown.
 The LZ77-scheme provides an efficient, optimal solution:
   fknown is the “previously encoded text”; compress the concatenation fknown·fnew, starting from fnew.
 zdelta is one of the best implementations.

            Emacs size    Emacs time
  uncompr   27Mb          ---
  gzip      8Mb           35 secs
  zdelta    1.5Mb         42 secs

Efficient Web Access
Dual-proxy architecture: a pair of proxies located on each side of the slow link use a proprietary protocol to increase performance over this link.

[Slide figure: Client ↔ (slow link, delta-encoding) ↔ Proxy ↔ (fast link) ↔ web; both sides keep the reference page.]

Use zdelta to reduce traffic:
 the old version is available at both proxies
 restricted to pages already visited (30% hits), URL-prefix match
 small cache

Cluster-based delta compression
Problem: we wish to compress a group of files F.
 Useful on a dynamic collection of web pages, back-ups, …

 Apply pairwise zdelta: find for each f ∈ F a good reference.
 Reduction to the Min Branching problem on DAGs:
   build a weighted graph GF: nodes = files, weights = zdelta-sizes;
   insert a dummy node connected to all files, whose edge weights are the gzip-sizes;
   compute the min branching = directed spanning tree of minimum total cost covering G’s nodes.

[Slide figure: a small weighted graph with a dummy node 0 and files 1, 2, 3, 5; edge weights such as 20, 123, 220, 620, 2000.]

            space    time
  uncompr   30Mb     ---
  tgz       20%      linear
  THIS      8%       quadratic

Improvement (group of files)
What about many-to-one compression?

Problem: constructing G is very costly, n² edge calculations (zdelta executions). We wish to exploit some pruning approach:

 Collection analysis: cluster the files that appear similar and are thus good candidates for zdelta-compression; build a sparse weighted graph G’F containing only the edges between those pairs of files.
 Assign weights: estimate appropriate edge weights for G’F, thus saving zdelta executions. Nonetheless, still n² time.

            space    time
  uncompr   260Mb    ---
  tgz       12%      2 mins
  THIS      8%       16 mins

Algoritmi per IR

File Synchronization

File synch: the problem
[Slide figure: a Client holding f_old sends a request to a Server holding f_new, and receives an update.]

 the client wants to update an out-dated file
 the server has the new file but does not know the old file
 update without sending the entire f_new (using similarity)
 rsync: file-synch tool, distributed with Linux

Delta compression is a sort of local synch, since the server has both copies of the files.

The rsync algorithm
[Slide figure: the Client sends block hashes of f_old; the Server replies with the encoded file built from f_new.]

 simple, widely used, single roundtrip
 optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
 the choice of the block size is problematic (default: max{700, √n} bytes)
 not good in theory: the granularity of changes may disrupt the use of blocks
Rsync: some experiments

             gcc size    emacs size
  total      27288       27326
  gzip        7563        8577
  zdelta       227        1431
  rsync        964        4452

Compressed size in KB (slightly outdated numbers).
Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync
 The server sends the hashes (unlike the client in rsync); the client checks them.
 The server deploys the common fref to compress the new ftar (rsync compresses just it).

A multi-round protocol
 k blocks of n/k elements, log(n/k) levels.
 If the distance is k, then on each level at most k hashes do not find a match in the other file.
 The communication complexity is O(k · lg n · lg(n/k)) bits.

Next lecture

Set reconciliation
Problem: given two sets SA and SB of integer values located on two machines A and B, determine the difference between the two sets at one or both of the machines.

Requirements: the cost should be proportional to the size k of the difference, where k may or may not be known in advance to the parties.

Note: set reconciliation is “easier” than file sync [it is record-based]. Not perfectly true, but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
  iff  P is a prefix of the i-th suffix of T (i.e., of T[i,N]).

Occurrences of P in T = all suffixes of T having P as a prefix.

  Example: P = si, T = mississippi  →  occurrences at positions 4 and 7.

SUF(T) = sorted set of suffixes of T.

Reduction: from substring search to prefix search (over SUF(T)).

The Suffix Tree

[Slide figure: the suffix tree of T# = mississippi# (positions 1..12); edge labels are substrings such as “ssi”, “ppi#”, “si”, “i#”, “mississippi#”, and the leaves store the starting positions of the 12 suffixes.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

Storing SUF(T) explicitly would take Θ(N²) space; we store only the suffix pointers:

  T  = mississippi#
  SA = 12 11 8 5 2 1 10 9 7 4 6 3

  SUF(T):
  12  #
  11  i#
   8  ippi#
   5  issippi#
   2  ississippi#
   1  mississippi#
  10  pi#
   9  ppi#
   7  sippi#
   4  sissippi#
   6  ssippi#
   3  ssissippi#

Suffix Array space:
 • SA: Θ(N log2 N) bits
 • Text T: N chars
 ⇒ in practice, a total of 5N bytes.

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step.

  T = mississippi#,  P = si
  Compare P with the suffix pointed to by the middle entry of SA:
  if P is larger move to the right half, if P is smaller move to the left half.

Suffix Array search
 • O(log2 N) binary-search steps
 • each step takes O(p) char comparisons
 ⇒ overall, O(p log2 N) time
 • improvable to O(p + log2 N)  [Manber-Myers, ’90]
 • and to a dependence on |S|   [Cole et al., ’06]
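A minimal Python sketch of this indirect binary search (my own illustration: naive SA construction plus the plain O(p log N) search, reporting all occurrences in the located range):

def build_sa(T):
    # Naive suffix-array construction, fine for small texts
    return sorted(range(len(T)), key=lambda i: T[i:])

def sa_search(T, SA, P):
    # Two binary searches delimit the SA range of suffixes prefixed by P
    def prefix(i):
        return T[i:i + len(P)]
    lo, hi = 0, len(SA)
    while lo < hi:                       # first suffix whose |P|-prefix is >= P
        mid = (lo + hi) // 2
        if prefix(SA[mid]) < P:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    lo, hi = first, len(SA)
    while lo < hi:                       # first suffix whose |P|-prefix is > P
        mid = (lo + hi) // 2
        if prefix(SA[mid]) > P:
            hi = mid
        else:
            lo = mid + 1
    return sorted(SA[k] for k in range(first, lo))   # 0-based occurrence positions

With T = "mississippi#" and SA = build_sa(T), sa_search(T, SA, "si") returns [3, 6], i.e. positions 4 and 7 in the slides' 1-based indexing.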

Locating the occurrences

The binary search identifies the contiguous SA range of the suffixes prefixed by P.
Example: T = mississippi#, P = si → the range contains the suffixes 7 (sippi…) and 4 (sissippi…), so occ = 2.

Suffix Array search: O(p + log2 N + occ) time.

 Suffix Trays: O(p + log2 |S| + occ)    [Cole et al., ’06]
 String B-tree                          [Ferragina-Grossi, ’95]
 Self-adjusting Suffix Arrays           [Ciriani et al., ’02]

Text mining
Lcp[1, N-1]: Lcp[i] = length of the longest common prefix between the suffixes adjacent in SA (i.e., SA[i] and SA[i+1]).

  T   = mississippi#
  SA  = 12 11  8  5  2  1 10  9  7  4  6  3
  Lcp =    0  1  1  4  0  0  1  0  2  1  3

 • How long is the common prefix between T[i,…] and T[j,…]?
   It is the minimum of the subarray Lcp[h, k-1], where SA[h] = i and SA[k] = j.
 • Does there exist a repeated substring of length ≥ L?
   Search for some Lcp[i] ≥ L.
 • Does there exist a substring of length ≥ L occurring ≥ C times?
   Search for a window Lcp[i, i+C-2] whose entries are all ≥ L.
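A short Python sketch of these Lcp-based queries (my own illustration, using build_sa from the earlier sketch; the naive Lcp computation is quadratic, whereas Kasai's algorithm does it in O(N)):

def lcp_array(T, SA):
    # Lcp[i] = length of the longest common prefix of the suffixes SA[i] and SA[i+1]
    def lcp(i, j):
        k = 0
        while i + k < len(T) and j + k < len(T) and T[i + k] == T[j + k]:
            k += 1
        return k
    return [lcp(SA[i], SA[i + 1]) for i in range(len(SA) - 1)]

def has_repeat_of_length(T, SA, L):
    # A repeated substring of length >= L exists iff some Lcp entry is >= L
    return any(v >= L for v in lcp_array(T, SA))

With T = "mississippi#" the maximum Lcp value is 4 (the repeated substring "issi"), so has_repeat_of_length(T, build_sa(T), 4) is True and has_repeat_of_length(T, build_sa(T), 5) is False.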


We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for   (m = 3, n = 10)

            c  a  l  i  f  o  r  n  i  a
            1  2  3  4  5  6  7  8  9  10
  f   1     0  0  0  0  1  0  0  0  0  0
  o   2     0  0  0  0  0  1  0  0  0  0
  r   3     0  0  0  0  0  0  1  0  0  0

M(3,7) = 1: P occurs in T ending at position 7.

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the (j-1)-th one.
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1 for the
positions in P where character x appears.
Example:
P = abaac
U(a) = [1,0,1,1,0]    U(b) = [0,1,0,0,0]    U(c) = [0,0,0,0,1]

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M(j) = BitShift( M(j-1) ) & U( T[j] )


For i > 1, entry M(i,j) = 1 iff
(1) the first i-1 characters of P match the i-1 characters of T
ending at character j-1            ⇔ M(i-1,j-1) = 1
(2) P[i] = T[j]                    ⇔ the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) into the i-th position;
AND this with the i-th bit of U(T[j]) to establish if both are true.

An example j=1
P = abaac,  T = xabxabaaca,  M(0) = [0,0,0,0,0]
T[1] = x,  U(x) = [0,0,0,0,0]
M(1) = BitShift(M(0)) & U(T[1]) = [1,0,0,0,0] & [0,0,0,0,0] = [0,0,0,0,0]

An example j=2
T[2] = a,  U(a) = [1,0,1,1,0]
M(2) = BitShift(M(1)) & U(T[2]) = [1,0,0,0,0] & [1,0,1,1,0] = [1,0,0,0,0]

An example j=3
T[3] = b,  U(b) = [0,1,0,0,0]
M(3) = BitShift(M(2)) & U(T[3]) = [1,1,0,0,0] & [0,1,0,0,0] = [0,1,0,0,0]

An example j=9
T[9] = c,  U(c) = [0,0,0,0,1]
M(9) = BitShift(M(8)) & U(T[9]) = [1,1,0,0,1] & [0,0,0,0,1] = [0,0,0,0,1]

The whole matrix M for P = abaac, T = xabxabaaca (columns j = 1..9):

            x  a  b  x  a  b  a  a  c
  a   1     0  1  0  0  1  0  1  1  0
  b   2     0  0  1  0  0  1  0  0  0
  a   3     0  0  0  0  0  0  1  0  0
  a   4     0  0  0  0  0  0  0  1  0
  c   5     0  0  0  0  0  0  0  0  1

M(5,9) = 1: P occurs in T ending at position 9, i.e. starting at position 5.

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.
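
A small Python sketch of the Shift-And scan just analyzed, packing each column M(j) into one integer (bit i-1 stands for row i). This is an illustrative rendering, not the slides' code.

  def shift_and(T, P):
      m = len(P)
      U = {}                                   # U[x]: bit i-1 set iff P[i] = x
      for i, c in enumerate(P):
          U[c] = U.get(c, 0) | (1 << i)
      M, occ = 0, []
      for j, c in enumerate(T, 1):
          # M(j) = BitShift(M(j-1)) & U(T[j]); BitShift = shift down, first bit set to 1
          M = ((M << 1) | 1) & U.get(c, 0)
          if M & (1 << (m - 1)):               # row m is set: occurrence ending at j
              occ.append(j - m + 1)
      return occ

  print(shift_and("xabxabaaca", "abaac"))      # -> [5]
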

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
U(a) = [1,0,1,1,0]    U(b) = [1,1,0,0,0]    U(c) = [0,0,0,0,1]

What about ‘?’ and ‘[^…]’ (not)?

Problem 1: Another solution
Dictionary = {bzip, not, or, space},   P = bzip = 1a 0b
S = “bzip or not bzip”

[Figure: the codeword tree and the compressed text C(S); scanning C(S) for P’s codeword marks yes/no at each codeword boundary]

Speed ≈ Compression ratio

Problem 2
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring.
Dictionary = {bzip, not, or, space},   P = o
S = “bzip or not bzip”

[Figure: the codeword tree and the compressed text C(S); the terms containing “o” are
  not = 1g 0g 0a    and    or = 1g 0a 0b ]

Speed ≈ Compression ratio? No! Why?
A scan of C(S) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: text T with occurrences of the patterns P1 and P2 highlighted]
 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of length m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) AND R
 U’(c)[i] = 1 iff S[i] = c and it is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that starts with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring, allowing at most k mismatches.
Dictionary = {bzip, not, or, space},   P = bot,  k = 2
S = “bzip or not bzip”

[Figure: the codeword tree and the compressed text C(S)]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal: given k, find all the occurrences of P
in T with up to k mismatches.
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
there are no more than l mismatches between the
first i characters of P and the i characters of T ending
at character j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

[Figure: P[1..i-1] aligned against T ending at j-1 with at most l mismatches, and P[i] = T[j]]

  BitShift( Ml(j-1) ) & U( T[j] )

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

[Figure: P[1..i-1] aligned against T ending at j-1 with at most l-1 mismatches]

  BitShift( Ml-1(j-1) )

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

Ml(j) = [ BitShift( Ml(j-1) ) & U(T[j]) ]  OR  BitShift( Ml-1(j-1) )

Example M1
T = xabxabaaca,  P = abaad

M0 (exact matching):
            x  a  b  x  a  b  a  a  c  a
  a   1     0  1  0  0  1  0  1  1  0  1
  b   2     0  0  1  0  0  1  0  0  0  0
  a   3     0  0  0  0  0  0  1  0  0  0
  a   4     0  0  0  0  0  0  0  1  0  0
  d   5     0  0  0  0  0  0  0  0  0  0

M1 (at most 1 mismatch):
            x  a  b  x  a  b  a  a  c  a
  a   1     1  1  1  1  1  1  1  1  1  1
  b   2     0  0  1  0  0  1  0  1  1  0
  a   3     0  0  0  1  0  0  1  0  0  1
  a   4     0  0  0  0  1  0  0  1  0  0
  d   5     0  0  0  0  0  0  0  0  1  0

M1(5,9) = 1: P = abaad occurs ending at position 9 with at most one mismatch (T[5,9] = abaac).

How much do we pay?





The running time is O(kn(1+m/w)).
Again, the method is practically efficient for
small m.
Still, only O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.
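
A Python sketch of the k-mismatch recurrence Ml(j) = [BitShift(Ml(j-1)) & U(T[j])] OR BitShift(Ml-1(j-1)), with one integer per column; names and structure are mine, chosen to mirror the slides' notation.

  def agrep_mismatch(T, P, k):
      m = len(P)
      U = {}
      for i, c in enumerate(P):
          U[c] = U.get(c, 0) | (1 << i)
      M = [0] * (k + 1)                        # M[l] = current column of M^l
      occ = []
      for j, c in enumerate(T, 1):
          prev = M[:]                          # the columns at position j-1
          M[0] = ((prev[0] << 1) | 1) & U.get(c, 0)
          for l in range(1, k + 1):
              M[l] = (((prev[l] << 1) | 1) & U.get(c, 0)) | ((prev[l - 1] << 1) | 1)
          if M[k] & (1 << (m - 1)):
              occ.append(j - m + 1)            # start of a match with <= k mismatches
      return occ

  print(agrep_mismatch("xabxabaaca", "abaad", 1))   # -> [5], as in Example M1
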

Problem 3: Solution
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring, allowing k mismatches.
Dictionary = {bzip, not, or, space},   P = bot,  k = 2
S = “bzip or not bzip”

[Figure: the codeword tree and the compressed text C(S); the term matching bot within 2 mismatches is
  not = 1g 0g 0a ]

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum number of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
(Length-1) zeroes, followed by x in binary


x > 0 and Length = ⌊log2 x⌋ + 1
e.g., 9 is represented as <000,1001>.


The g-code for x takes 2⌊log2 x⌋ + 1 bits

(i.e., a factor of 2 from optimal)


Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8   6   3   59   7
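
A short Python sketch of the g-code (Elias gamma) just described; the decoder checks the exercise above. Illustrative only.

  def gamma_encode(x):                 # x > 0
      b = bin(x)[2:]                   # x in binary; Length = floor(log2 x) + 1
      return "0" * (len(b) - 1) + b    # (Length-1) zeroes, then x in binary

  def gamma_decode(bits):
      out, i = [], 0
      while i < len(bits):
          z = 0
          while bits[i] == "0":        # count the leading zeroes = Length-1
              z += 1; i += 1
          out.append(int(bits[i:i+z+1], 2))
          i += z + 1
      return out

  print(gamma_encode(9))                                        # 0001001
  print(gamma_decode("0001000001100110000011101100111"))        # [8, 6, 3, 59, 7]
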

Analysis
Sort the pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2·log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2·H0(s) + 1
Key fact:
1 ≥ Σi=1,...,x pi ≥ x·px   ⟹   x ≤ 1/px

How good is it?
Encode the integers via g-coding:
|g(i)| ≤ 2·log i + 1
The cost of the encoding is (recall i ≤ 1/pi):

  Σi=1,...,|S| pi·|g(i)|  ≤  Σi=1,...,|S| pi·[2·log(1/pi) + 1]  =  2·H0(X) + 1

Not much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s·c with 2 bytes, s·c² on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 128² = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230·26 = 6210 on 2
bytes, hence more on 1 byte, and thus better if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 256 - s.


Brute-force approach



Binary search:


On real distributions, it seems there is one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC is quite interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1^n 2^n 3^n … n^n  ⟹  Huff = O(n² log n),  MTF = O(n log n) + n²

Not much worse than Huffman
...but it may be far better
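
A minimal MTF coder/decoder in Python, following the two steps listed above (0-based positions, for brevity). Illustrative only.

  def mtf_encode(s, alphabet):
      L = list(alphabet)
      out = []
      for c in s:
          i = L.index(c)          # 1) output the position of c in L
          out.append(i)
          L.insert(0, L.pop(i))   # 2) move c to the front of L
      return out

  def mtf_decode(codes, alphabet):
      L = list(alphabet)
      out = []
      for i in codes:
          c = L[i]
          out.append(c)
          L.insert(0, L.pop(i))
      return "".join(out)

  codes = mtf_encode("abcaaabbb", "abcd")
  print(codes)                          # [0, 1, 2, 2, 0, 0, 2, 0, 0]
  print(mtf_decode(codes, "abcd"))      # abcaaabbb
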

MTF: how good is it ?
Encode the integers via g-coding:
|g(i)| ≤ 2·log i + 1
Put S in front and consider the cost of encoding:

  O(|S| log |S|)  +  Σx=1,...,|S|  Σi=2,...,nx  |g( px,i - px,i-1 )|

where px,i is the position of the i-th occurrence of symbol x.

By Jensen’s inequality:

  ≤  O(|S| log |S|)  +  Σx=1,...,|S|  nx · [ 2·log(N/nx) + 1 ]

  =  O(|S| log |S|)  +  N·[ 2·H0(X) + 1 ]

Hence   La[mtf]  ≤  2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1^n 2^n 3^n … n^n  ⟹

There is a memory

Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.  p(a) = .2,  p(b) = .5,  p(c) = .3

  f(i) = Σj=1,...,i-1 p(j)      so  f(a) = .0,  f(b) = .2,  f(c) = .7

[Figure: the unit interval [0,1) split into  a = [0,.2),  b = [.2,.7),  c = [.7,1.0)]

The interval for a particular symbol will be called
the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
Start from [0,1):
  b  →  [.2, .7)    (size .5)
  a  →  [.2, .3)    (size .1)
  c  →  [.27, .3)   (size .03)

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c1...cn with probabilities
p[c] use the following:

  l0 = 0      li = li-1 + si-1 · f[ci]
  s0 = 1      si = si-1 · p[ci]

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

  sn = Πi=1,...,n p[ci]

The interval for a message sequence will be called the
sequence interval
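
A tiny Python sketch of the recurrences above (real-number version, no scaling), reproducing the "bac" example; symbol order and names are assumptions made here for illustration.

  def seq_interval(msg, p):
      syms = sorted(p)                      # a fixed symbol order, e.g. a, b, c
      f, acc = {}, 0.0
      for c in syms:
          f[c] = acc; acc += p[c]           # f[c] = cumulative prob. up to c (excluded)
      l, s = 0.0, 1.0                       # l0 = 0, s0 = 1
      for c in msg:
          l = l + s * f[c]                  # l_i = l_{i-1} + s_{i-1} * f[c_i]
          s = s * p[c]                      # s_i = s_{i-1} * p[c_i]
      return l, l + s                       # the sequence interval [l, l+s)

  print(seq_interval("bac", {"a": .2, "b": .5, "c": .3}))   # roughly (0.27, 0.30)
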

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore, specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:

  .49 ∈ [.2, .7)      →  b
  .49 ∈ [.3, .55)     →  b
  .49 ∈ [.475, .55)   →  c

The message is bbc.

Representing a real number
Binary fractional representation:
  .75   = .11
  1/3   = .01 01 01 ...   (repeating)
  11/16 = .1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
  number    min      max      interval
  .11       .110     .111     [.75, 1.0)
  .101      .1010    .1011    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

[Figure: sequence interval [.61, .79) containing the code interval of .101, i.e. [.625, .75)]

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

[Figure: the Arithmetic ToolBox (ATB) as a state machine — given the current interval (L,s), the distribution (p1,...,p|S|) and the next symbol c, it outputs the new interval (L’,s’)]

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
[Figure: the ATB state machine driven by PPM — at each step the symbol s = c or esc is coded with probability p[s|context], mapping (L,s) to (L’,s’)]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
String = ACCBACCACBA B,    k = 2

  Context: Empty       A = 4   B = 2   C = 5   $ = 3

  Context: A           C = 3   $ = 1
  Context: B           A = 2   $ = 1
  Context: C           A = 1   B = 2   C = 2   $ = 3

  Context: AC          B = 1   C = 2   $ = 2
  Context: BA          C = 1   $ = 1
  Context: CA          C = 1   $ = 1
  Context: CB          A = 2   $ = 1
  Context: CC          A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
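
A compact Python sketch of the LZ77 scheme above — triples <d, len, c> over a sliding window, plus the copy-based decoder that handles the l > d overlap. Brute-force matching, purely for illustration.

  def lz77_encode(T, W=6):
      i, out = 0, []
      while i < len(T):
          best_d, best_len = 0, 0
          for j in range(max(0, i - W), i):         # candidate copies inside the window
              l = 0
              while i + l < len(T) - 1 and T[j + l] == T[i + l]:
                  l += 1                            # overlap with the lookahead is allowed
              if l > best_len:
                  best_d, best_len = i - j, l
          out.append((best_d, best_len, T[i + best_len]))   # next char beyond the match
          i += best_len + 1
      return out

  def lz77_decode(triples):
      out = []
      for d, l, c in triples:
          for _ in range(l):                        # works even when l > d (overlap)
              out.append(out[len(out) - d])
          out.append(c)
      return "".join(out)

  code = lz77_encode("aacaacabcabaaac")
  print(code)                 # [(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')]
  print(lz77_decode(code))    # aacaacabcabaaac
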

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
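
A Python sketch of LZW including the special SSc case (the decoder receives a code it has not created yet). Real ASCII gives 'a' = 97, not the 112 used on the slides; the dictionary below starts with the 256 byte values.

  def lzw_encode(T):
      dic = {chr(i): i for i in range(256)}        # initial 256 ascii entries
      S, out = "", []
      for c in T:
          if S + c in dic:
              S = S + c                            # extend the current match
          else:
              out.append(dic[S])
              dic[S + c] = len(dic)                # add Sc, but do not send c
              S = c
      out.append(dic[S])
      return out

  def lzw_decode(codes):
      dic = {i: chr(i) for i in range(256)}
      prev = dic[codes[0]]
      out = [prev]
      for k in codes[1:]:
          if k in dic:
              cur = dic[k]
          else:                                    # the special SSc case
              cur = prev + prev[0]
          out.append(cur)
          dic[len(dic)] = prev + cur[0]            # decoder is one step behind the coder
          prev = cur
      return "".join(out)

  s = "aabaacababacb"                              # the string of the encoding example
  print(lzw_decode(lzw_encode(s)) == s)            # True
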

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
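
A Python sketch of the transform and of the backward reconstruction via the LF mapping used by InvertBWT; rotations are sorted explicitly, so this is quadratic-space and for illustration only.

  def bwt(T):                                      # T must end with a unique '#'
      rot = sorted(T[i:] + T[:i] for i in range(len(T)))
      return "".join(r[-1] for r in rot)           # L = last column of the sorted rotations

  def ibwt(L):
      n = len(L)
      order = sorted(range(n), key=lambda i: L[i]) # stable sort: F-row -> L-row
      LF = [0] * n
      for f_row, l_row in enumerate(order):
          LF[l_row] = f_row                        # LF maps L's chars onto F's chars
      r, out = 0, []
      for _ in range(n):                           # rebuild T backward: L[r] precedes F[r]
          out.append(L[r]); r = LF[r]
      s = "".join(reversed(out))
      return s[1:] + s[0]                          # rotate the leading '#' back to the end

  L = bwt("mississippi#")
  print(L)                # ipssm#pissii
  print(ibwt(L))          # mississippi#
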

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node one can reach any other node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node one can reach any other node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by humans


Exploit the structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph
  V = Routers
  E = communication links

The “cosine” graph (undirected, weighted)
  V = static web pages
  E = semantic distance between pages

Query-Log graph (bipartite, weighted)
  V = queries and URLs
  E = (q,u) if u is a result for q, and has been clicked by some
  user who issued q

Social graph (undirected, unweighted)
  V = users
  E = (x,y) if x knows y (facebook, address book, email, ..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
  Pr[ in-degree(u) = k ]  ≈  1/k^a ,     a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/x^a, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
[Figure: adjacency-matrix picture of the Web graph]

21 million pages, 150 million links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides an efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
             Emacs size    Emacs time
  uncompr    27Mb          ---
  gzip       8Mb           35 secs
  zdelta     1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

[Figure: dual-proxy architecture — Client ↔ (slow link, delta-encoding) ↔ Proxy ↔ (fast link) ↔ web; both sides keep a reference page and only the delta of the requested page crosses the slow link]

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: weighted graph GF over the files, plus a dummy node connected to all of them; the min branching selects, for each file, either a reference file (zdelta cost) or the dummy node (gzip cost)]

             space     time
  uncompr    30Mb      ---
  tgz        20%       linear
  THIS       8%        quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
             space     time
  uncompr    260Mb     ---
  tgz        12%       2 mins
  THIS       8%        16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

             gcc size    emacs size
  total      27288       27326
  gzip       7563        8577
  zdelta     227         1431
  rsync      964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
T# = mississippi#

[Figure: the suffix tree of T#, with labeled edges and one leaf per suffix, numbered 1...12 by starting position]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
5

Θ(N²) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirect binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
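
A Python sketch of the indirect binary search over SA described above (O(p log2 N) character comparisons); the suffix array itself is built naively here, purely for illustration.

  def suffix_array(T):
      return sorted(range(len(T)), key=lambda i: T[i:])       # naive O(n^2 log n) build

  def sa_search(T, SA, P):
      # two binary searches delimit the SA range of suffixes having P as a prefix
      n, p = len(SA), len(P)
      lo, hi = 0, n
      while lo < hi:                                           # leftmost suffix >= P
          mid = (lo + hi) // 2
          if T[SA[mid]:SA[mid] + p] < P: lo = mid + 1
          else: hi = mid
      first = lo
      lo, hi = first, n
      while lo < hi:                                           # leftmost suffix with prefix > P
          mid = (lo + hi) // 2
          if T[SA[mid]:SA[mid] + p] <= P: lo = mid + 1
          else: hi = mid
      return sorted(SA[k] + 1 for k in range(first, lo))       # 1-based occurrences

  T = "mississippi#"
  SA = suffix_array(T)
  print(sa_search(T, SA, "si"))                                # [4, 7]
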

Locating the occurrences
SA,   occ = 2,   T = mississippi#

[Figure: the two binary searches (for si# and si$) delimit the SA range holding the suffixes sippi... and sissippi..., i.e. the occurrences at positions 4 and 7]

Suffix Array search
• O (p + log2 N + occ) time

# < S < $
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
• Search for a window Lcp[i,i+C-2] whose entries are all ≥ L


Slide 31

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 10^4 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and the algo uses all of them:
(1/B) * (p * f/(1+f) * C)  ≈  30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
[Figure: memory hierarchy — CPU registers, L1/L2 caches (few Mbs, some nanosecs, few words fetched), RAM (few Gbs, tens of nanosecs), disk (few Tbs, few millisecs, B = 32K pages), network (many Tbs, even secs, packets)]

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
        4K    8K    16K   32K    128K   256K   512K   1M
  n³    22s   3m    26m   3.5h   28h    --     --     --
  n²    0     0     0     1s     26s    106s   7m     28m

An optimal solution
We assume every subsum ≠ 0

[Figure: A drawn as a prefix of sum < 0 followed by the Optimum window of sum > 0]

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm

sum = 0; max = -1;
For i = 1, ..., n do
  if (sum + A[i] ≤ 0) sum = 0;
  else { sum += A[i]; max = MAX{max, sum}; }

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT
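
The same linear-time scan, as a runnable Python sketch (the usual Kadane-style formulation); variable names are mine.

  def max_subarray(A):
      best = None                       # best sum seen so far
      cur = 0                           # sum of the current candidate window
      for x in A:
          if cur + x <= 0:
              cur = 0                   # restart: the window so far cannot help
          else:
              cur += x
              best = cur if best is None else max(best, cur)
      return best

  print(max_subarray([2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]))   # 12
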

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort  Θ(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
  if (i < j) then
    m = (i+j)/2;              // Divide
    Merge-Sort(A,i,m);        // Conquer
    Merge-Sort(A,m+1,j);
    Merge(A,i,m,j)            // Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Θ(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
[Figure: binary merge-sort recursion tree over the leaves 2 10 | 5 1 | 13 19 | 9 7 | 15 4 | 8 3 | 12 17 | 6 11, merged level by level up to the sorted sequence 1 2 3 4 5 6 7 8 9 10 11 12 13 15 17 19; a box of size M marks what fits in memory]

How do we deploy the disk/memory features ?
N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.
Pass i: merge X = M/B runs   ⟹   log_{M/B} (N/M) passes

[Figure: X = M/B input buffers and one output buffer, each of B items, kept in main memory; runs are streamed from disk and the merged run is written back to disk]

Multiway Merging

[Figure: input buffers Bf1...Bfx (one per run, X = M/B) with pointers p1...pX, and an output buffer Bfo; repeatedly emit min(Bf1[p1], Bf2[p2], ..., Bfx[pX]) into Bfo, fetch the next page of run i when pi = B, and flush Bfo to the merged output run when it is full, until EOF]

Cost of Multi-way Merge-Sort


Number of passes = log_{M/B} #runs  =  log_{M/B} (N/M)

Optimal cost = Θ((N/B) log_{M/B} (N/M)) I/Os

In practice

  M/B ≈ 1000   ⟹   #passes = log_{M/B} (N/M) ≈ 1
  One multiway merge   ⟹   2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!
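
A toy Python sketch of one merge pass: X sorted runs merged with a heap, touching each run strictly sequentially (which is what makes a pass disk-friendly). Runs are in-memory lists here; names are illustrative.

  import heapq

  def multiway_merge(runs):
      # repeatedly extract the minimum among the current heads of the X runs
      heap = [(run[0], r, 0) for r, run in enumerate(runs) if run]
      heapq.heapify(heap)
      out = []
      while heap:
          val, r, i = heapq.heappop(heap)
          out.append(val)
          if i + 1 < len(runs[r]):                  # advance the pointer of run r
              heapq.heappush(heap, (runs[r][i + 1], r, i + 1))
      return out

  runs = [[1, 2, 5, 10], [2, 7, 9, 13], [3, 4, 8, 15], [6, 11, 12, 17]]
  print(multiway_merge(runs))    # [1, 2, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 17]
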

Can compression help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables <X,C>
For each item s of the stream,

  if (X == s) then C++
  else { C--; if (C == 0) { X = s; C = 1; } }

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...
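
The same one-pass scan as runnable Python, in the standard equivalent formulation (check the counter first); the test stream is the one from the slide.

  def majority_candidate(stream):
      X, C = None, 0
      for s in stream:
          if C == 0:
              X, C = s, 1
          elif X == s:
              C += 1
          else:
              C -= 1
      return X            # the majority item, if one occurs more than N/2 times

  A = list("bacccdcbaaaccbccc")
  print(majority_candidate(A))    # 'c'
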

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 10^9  ⟹  size = 6Gb
n = 10^6 documents
TotT = 10^9 (avg term length is 6 chars)
t = 5 * 10^5 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix      (t = 500K terms  ×  n = 1 million docs)

              Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
  Antony            1                1              0          0       0        1
  Brutus            1                1              0          1       0        0
  Caesar            1                1              0          1       1        1
  Calpurnia         0                1              0          0       0        0
  Cleopatra         1                0              0          0       0        0
  mercy             1                0              1          1       1        1
  worser            1                0              1          1       1        0

1 if the play contains the word, 0 otherwise.        Space is 500Gb !

Solution 2: Inverted index
  Brutus     →  2  4  8  16  32  64  128
  Calpurnia  →  1  2  3  5  8  13  21  34
  Caesar     →  13  16

We can still do better: i.e. 30÷50% of the original text

1. Typically use about 12 bytes per posting
2. We have 10^9 total terms  ⟹  at least 12Gb space
3. Compressing the 6Gb of documents gets 1.5Gb of data
Better index, but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in fewer bits?
NO, they are 2^n but we have fewer compressed messages:

  Σi=1,...,n-1 2^i  =  2^n - 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:

  i(s) = log2 (1/p(s)) = - log2 p(s)

Lower probability  →  higher information

Entropy is the weighted average of i(s):

  H(S) = Σs∈S p(s) · log2 (1/p(s))    bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie

[Figure: binary trie with leaves  a = 0,  b = 100,  c = 101,  d = 11]

Average Length
For a code C with codeword length L[s], the
average length is defined as
  La(C) = Σs∈S p(s) · L[s]

We say that a prefix code C is optimal if for
all prefix codes C’, La(C) ≤ La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, there exists a prefix
code with the same symbol lengths and thus the
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then  pi < pj  ⟹  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
[Figure: Huffman tree — merge a(.1) + b(.2) = (.3); then (.3) + c(.2) = (.5); then (.5) + d(.5) = (1); left edges labeled 0, right edges 1]
a=000, b=001, c=01, d=1
There are 2^(n-1) “equivalent” Huffman trees

What about ties (and thus, tree depth) ?
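
A small Python sketch that builds a Huffman code for the running example with a heap; ties between equal weights are broken arbitrarily, which is exactly the ambiguity the question above points at. Names are illustrative.

  import heapq
  from itertools import count

  def huffman_codes(freqs):
      tick = count()                                   # tie-breaker for equal weights
      heap = [(p, next(tick), {s: ""}) for s, p in freqs.items()]
      heapq.heapify(heap)
      while len(heap) > 1:
          p1, _, c1 = heapq.heappop(heap)              # the two least-probable trees
          p2, _, c2 = heapq.heappop(heap)
          merged = {s: "0" + w for s, w in c1.items()}
          merged.update({s: "1" + w for s, w in c2.items()})
          heapq.heappush(heap, (p1 + p2, next(tick), merged))
      return heap[0][2]

  print(huffman_codes({"a": .1, "b": .2, "c": .2, "d": .5}))
  # {'d': '0', 'c': '10', 'a': '110', 'b': '111'} — same codeword lengths as the
  # tree on the slide, just one of the equivalent Huffman trees
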

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
  abc...     →  000 001 01  =  00000101
  101001...  →  d c b

[Figure: the Huffman tree of the running example, used for both encoding and decoding]

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

  firstcode[L]   (the first codeword of level L; it is of the form 00.....0)

  Symbol[L,i], for each i in level L

This is ≤ h² + |S| log |S| bits

Canonical Huffman: Encoding and Decoding

[Figure: a canonical Huffman tree with levels 1...5]

firstcode[1]=2   firstcode[2]=1   firstcode[3]=1   firstcode[4]=2   firstcode[5]=0

T = ...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(.999) ≈ .00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:

  The model takes |S|^k · (k · log |S|) + h² bits   (where h might be |S|)

  It is H0(S^L) ≤ L · Hk(S) + O(k · log |S|), for each k ≤ L

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
[Figure: word-based Huffman over T = “bzip or not bzip” — the Huffman tree has fan-out 128, each codeword byte carries 7 bits of code, and the first byte of every codeword is tagged; C(T) is the resulting byte-aligned compressed text]

CGrep and other ideas...
P = bzip = 1a 0b,    T = “bzip or not bzip”

[Figure: GREP over the compressed text C(T) — the codeword of P is matched directly against C(T), marking yes at the two occurrences of “bzip” and no elsewhere]
Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary = {bzip, not, or, space},   P = bzip = 1a 0b
S = “bzip or not bzip”

[Figure: the codeword tree and the compressed text C(S); matching P’s codeword against C(S) marks yes at the two occurrences of “bzip” and no elsewhere]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
[Figure: the pattern P slid along the text T]
 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bit-operations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet (i.e., T ∈ {0,1}^n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit bit-parallelism to
compute the j-th column of M from the (j-1)-th one.
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1 for the
positions in P where character x appears.
Example:
P = abaac
U(a) = (1,0,1,1,0)   U(b) = (0,1,0,0,0)   U(c) = (0,0,0,0,1)

How to construct M



Initialize column 0 of M to all zeros.
For j > 0, the j-th column is obtained by

M(j) = BitShift( M(j-1) ) & U( T[j] )

For i > 1, entry M(i,j) = 1 iff
(1) the first i-1 characters of P match the i-1 characters of T
    ending at character j-1    ⇔ M(i-1,j-1) = 1
(2) P[i] = T[j]                ⇔ the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) into the i-th position;
AND this with the i-th bit of U(T[j]) to establish whether both are true.

An example, j=1
P = abaac, T = xabxabaaca

M(1) = BitShift(M(0)) & U(T[1]) = (1,0,0,0,0) & U(x) = (1,0,0,0,0) & (0,0,0,0,0) = (0,0,0,0,0)

An example, j=2
P = abaac, T = xabxabaaca

M(2) = BitShift(M(1)) & U(T[2]) = (1,0,0,0,0) & U(a) = (1,0,0,0,0) & (1,0,1,1,0) = (1,0,0,0,0)

An example, j=3
P = abaac, T = xabxabaaca

M(3) = BitShift(M(2)) & U(T[3]) = (1,1,0,0,0) & U(b) = (1,1,0,0,0) & (0,1,0,0,0) = (0,1,0,0,0)

An example, j=9
P = abaac, T = xabxabaaca

        T:  x  a  b  x  a  b  a  a  c
        j:  1  2  3  4  5  6  7  8  9
  a  i=1    0  1  0  0  1  0  1  1  0
  b  i=2    0  0  1  0  0  1  0  0  0
  a  i=3    0  0  0  0  0  0  1  0  0
  a  i=4    0  0  0  0  0  0  0  1  0
  c  i=5    0  0  0  0  0  0  0  0  1

M(9) = BitShift(M(8)) & U(T[9]) = (1,1,0,0,1) & U(c) = (1,1,0,0,1) & (0,0,0,0,1) = (0,0,0,0,1)
The 1 in row m = 5 signals an occurrence of P ending at position 9 of T.

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.
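A compact C sketch of the Shift-And scan under the assumption m ≤ w = 64: each column M(j) is kept in one 64-bit word (bit i-1 stands for row i), on the slides' example P = abaac, T = xabxabaaca:

#include <stdio.h>
#include <string.h>
#include <stdint.h>

/* Shift-And exact matching, assuming m <= 64 (one column per machine word). */
void shift_and(const char *P, const char *T) {
    size_t m = strlen(P), n = strlen(T);
    uint64_t U[256] = {0}, M = 0;

    for (size_t i = 0; i < m; i++)                  /* U[x]: bit i set iff P[i+1] = x       */
        U[(unsigned char)P[i]] |= 1ULL << i;

    for (size_t j = 0; j < n; j++) {
        M = ((M << 1) | 1ULL) & U[(unsigned char)T[j]];   /* M(j) = BitShift(M(j-1)) & U(T[j]) */
        if (M & (1ULL << (m - 1)))                  /* row m set: P ends at position j+1    */
            printf("occurrence ending at %zu\n", j + 1);
    }
}

int main(void) {
    shift_and("abaac", "xabxabaaca");               /* prints: occurrence ending at 9 */
    return 0;
}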

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
U(a) = (1,0,1,1,0)   U(b) = (1,1,0,0,0)   U(c) = (0,0,0,0,1)

What about ‘?’, ‘[^…]’ (not)?

Problem 1: Another solution
Dictionary = { bzip, not, or } (plus space), P = bzip = 1a 0b
S = “bzip or not bzip”

[Figure: the scan of C(S) with the Shift-And machinery; candidate alignments are rejected (no) unless they are byte-aligned and tagged, and the two true occurrences of “1a 0b” are reported (yes).]

Speed ≈ Compression ratio

Problem 2
Dictionary = { bzip, not, or }
Given a pattern P, find all the occurrences in S of all terms containing P as a substring.
P = o,  S = “bzip or not bzip”

[Figure: the dictionary terms containing ‘o’ are not and or, whose codewords are
not = 1g 0g 0a   and   or = 1g 0a 0b;
their occurrences in C(S) are then located and reported (yes).]

Speed ≈ Compression ratio? No! Why?
A scan of C(S) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: a text T with the occurrences of the two patterns P1 and P2 marked.]

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

S is the concatenation of the patterns in P.
R is a bitmap of length m:
R[i] = 1 iff S[i] is the first symbol of a pattern.

Use a variant of the Shift-And method searching for S:
For any symbol c, U’(c) = U(c) AND R
⇒ U’(c)[i] = 1 iff S[i] = c and it is the first symbol of a pattern.
For any step j:
  compute M(j)
  then M(j) OR U’(T[j]). Why?
  ⇒ Set to 1 the first bit of each pattern that starts with T[j]
  Check if there are occurrences ending in j. How?

Problem 3
Dictionary = { bzip, not, or }
Given a pattern P, find all the occurrences in S of all terms containing P as a substring,
allowing at most k mismatches.
P = bot, k = 2,  S = “bzip or not bzip”

[Figure: the dictionary terms and the codewords of C(S), to be matched against P with up to k mismatches.]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
there are no more than l mismatches between the
first i characters of P and the i characters of T ending
at character j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

[Figure: P[1..i-1] aligned against T[..j-1] with at most l mismatches, followed by P[i] = T[j].]

BitShift( Ml(j-1) ) & U( T[j] )

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

[Figure: P[1..i-1] aligned against T[..j-1] with at most l-1 mismatches; position i of P is charged as the l-th mismatch.]

BitShift( Ml-1(j-1) )

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))
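A C sketch of this recurrence, again assuming m ≤ 64 and a small bound on k: one word Ml per mismatch level, updated column by column, on the slides' example P = abaad, T = xabxabaaca with k = 1:

#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define KMAX 8

/* Shift-And with up to k mismatches (agrep-style), assuming m <= 64 and k <= KMAX. */
void agrep(const char *P, const char *T, int k) {
    size_t m = strlen(P), n = strlen(T);
    uint64_t U[256] = {0}, M[KMAX + 1] = {0}, old[KMAX + 1];

    for (size_t i = 0; i < m; i++)
        U[(unsigned char)P[i]] |= 1ULL << i;

    for (size_t j = 0; j < n; j++) {
        memcpy(old, M, sizeof(M));                    /* columns j-1, for every level     */
        for (int l = 0; l <= k; l++) {
            uint64_t match = ((old[l] << 1) | 1ULL) & U[(unsigned char)T[j]];
            uint64_t subst = (l > 0) ? ((old[l-1] << 1) | 1ULL) : 0;  /* pay one mismatch */
            M[l] = match | subst;
        }
        if (M[k] & (1ULL << (m - 1)))
            printf("occurrence with <=%d mismatches ending at %zu\n", k, j + 1);
    }
}

int main(void) {
    agrep("abaad", "xabxabaaca", 1);                  /* prints the occurrence ending at 9 */
    return 0;
}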

Example M1
T = xabxabaaca, P = abaad

M1 =
        j:  1  2  3  4  5  6  7  8  9  10
  a  i=1    1  1  1  1  1  1  1  1  1  1
  b  i=2    0  0  1  0  0  1  0  1  1  0
  a  i=3    0  0  0  1  0  0  1  0  0  1
  a  i=4    0  0  0  0  1  0  0  1  0  0
  d  i=5    0  0  0  0  0  0  0  0  1  0

M0 =
        j:  1  2  3  4  5  6  7  8  9  10
  a  i=1    0  1  0  0  1  0  1  1  0  1
  b  i=2    0  0  1  0  0  1  0  0  0  0
  a  i=3    0  0  0  0  0  0  1  0  0  0
  a  i=4    0  0  0  0  0  0  0  1  0  0
  d  i=5    0  0  0  0  0  0  0  0  0  0

How much do we pay?

The running time is O( k n (1 + m/w) ).
Again, the method is practically efficient for small m.
Still, only O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary = { bzip, not, or }
Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing k mismatches.
P = bot, k = 2,  S = “bzip or not bzip”

[Figure: running the k-mismatch scan over C(S); the term not = 1g 0g 0a is within 2 mismatches of P, and its occurrences in C(S) are reported (yes).]

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding
γ(x) = 0^(Length−1) · (x in binary),  where x > 0 and Length = ⌊log2 x⌋ + 1
e.g., 9 is represented as <000,1001>.

γ-code for x takes 2⌊log2 x⌋ + 1 bits
(i.e., a factor of 2 from optimal)

Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of γ-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

Answer: 8, 6, 3, 59, 7
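A small C sketch of γ-encoding and γ-decoding, writing the bits as a '0'/'1' string just to keep the example self-contained:

#include <stdio.h>
#include <string.h>

/* gamma(x): (Length-1) zeros followed by the Length = floor(log2 x)+1 bits of x. */
void gamma_encode(unsigned x, char *out) {
    int L = 0;
    for (unsigned t = x; t; t >>= 1) L++;          /* L = number of bits of x */
    for (int i = 0; i < L - 1; i++) *out++ = '0';
    for (int i = L - 1; i >= 0; i--) *out++ = ((x >> i) & 1) ? '1' : '0';
    *out = '\0';
}

/* Decode a concatenation of gamma codes, printing the integers. */
void gamma_decode(const char *bits) {
    while (*bits) {
        int L = 1;
        while (*bits == '0') { L++; bits++; }      /* count the leading zeros  */
        unsigned x = 0;
        for (int i = 0; i < L; i++) x = (x << 1) | (unsigned)(*bits++ - '0');
        printf("%u ", x);
    }
    printf("\n");
}

int main(void) {
    char buf[64];
    gamma_encode(9, buf);
    printf("gamma(9) = %s\n", buf);                      /* 0001001       */
    gamma_decode("0001000001100110000011101100111");     /* 8 6 3 59 7    */
    return 0;
}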

Analysis
Sort the pi in decreasing order, and encode si via
the variable-length code γ(i).

Recall that: |γ(i)| ≤ 2 log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2 H0(s) + 1
Key fact:
1 ≥ Σi=1,...,x pi ≥ x·px   ⇒   x ≤ 1/px

How good is it ?
Encode the integers via γ-coding:
|γ(i)| ≤ 2 log i + 1
The cost of the encoding is (recall i ≤ 1/pi):

Σi=1,...,|S| pi · |γ(i)|  ≤  Σi=1,...,|S| pi · [ 2 log(1/pi) + 1 ]  =  2 H0(X) + 1

Not much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding

Byte-aligned and tagged Huffman
  128-ary Huffman tree
  First bit of the first byte is tagged
  Configurations on 7 bits: just those of Huffman

End-tagged dense code (ETDC)
  The rank r is mapped to the r-th binary sequence on 7·k bits
  First bit of the last byte is tagged

A better encoding
Surprising changes
  It is a prefix-code
  Better compression: it uses all the 7-bit configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2

A new concept: Continuers vs Stoppers
Previously we used: s = c = 128

The main idea is:
s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s·c with 2 bytes, s·c^2 on 3 bytes, ...

An example
5000 distinct words
ETDC encodes 128 + 128^2 = 16512 words on 2 bytes
A (230,26)-dense code encodes 230 + 230·26 = 6210 on 2
bytes, hence more words on 1 byte, and thus better if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 256 − s.

Brute-force approach

Binary search:
on real distributions, it seems there is one unique minimum

Ks = max codeword length
Fsk = cum. prob. of the symbols whose |cw| ≤ k

Experiments: (s,c)-DC very interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:
Exploits temporal locality, and it is dynamic

X = 1^n 2^n 3^n … n^n  ⇒  Huff = Θ(n^2 log n) bits, MTF = O(n log n) + n^2 bits

Not much worse than Huffman
...but it may be far better
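A short C sketch of the MTF step over bytes, using a plain array for the list L (a linear scan per symbol, so O(|S|) per step rather than the O(log |S|) of the tree-based solution discussed later):

#include <stdio.h>
#include <string.h>

/* Move-to-Front over bytes: emit the current position of each symbol, then
   move it to the front of the list (positions are 0-based here). */
void mtf_encode(const char *s) {
    unsigned char list[256];
    for (int i = 0; i < 256; i++) list[i] = (unsigned char)i;   /* L = [0,1,...,255] */

    for (; *s; s++) {
        unsigned char c = (unsigned char)*s;
        int pos = 0;
        while (list[pos] != c) pos++;               /* position of c in the list */
        printf("%d ", pos);
        memmove(list + 1, list, (size_t)pos);       /* shift the prefix down...  */
        list[0] = c;                                /* ...and put c in front     */
    }
    printf("\n");
}

int main(void) {
    mtf_encode("aabbbaccc");   /* temporal locality ⇒ long runs of small ranks */
    return 0;
}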

MTF: how good is it ?
Encode the integers via γ-coding:  |γ(i)| ≤ 2 log i + 1
Put the alphabet S in front and consider the cost of encoding:

O(|S| log |S|) + Σx=1,...,|S| Σi=2,...,nx |γ( p_i^x − p_{i-1}^x )|

where p_1^x < p_2^x < ... are the positions of the nx occurrences of symbol x.

By Jensen’s inequality:

≤ O(|S| log |S|) + Σx=1,...,|S| nx · [ 2 log(N/nx) + 1 ]
≤ O(|S| log |S|) + N · [ 2 H0(X) + 1 ]

⇒ La[mtf] ≤ 2 H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep the MTF-list efficiently:

Search tree
  Leaves contain the symbols, ordered as in the MTF-list
  Nodes contain the size of their descending subtree

Hash table
  key is a symbol
  data is a pointer to the corresponding tree leaf

Each tree operation takes O(log |S|) time
Total is O(n log |S|), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings ⇒ just the run lengths and one bit
Properties:
Exploits spatial locality, and it is a dynamic code. There is a memory.

X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = Θ(n^2 log n) > Rle(X) = Θ(n log n)
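A tiny C sketch of RLE producing (char, run-length) pairs, on the example above:

#include <stdio.h>
#include <string.h>

/* Run-Length Encoding: emit (char, run length) pairs. */
void rle(const char *s) {
    size_t n = strlen(s);
    for (size_t i = 0; i < n; ) {
        size_t j = i;
        while (j < n && s[j] == s[i]) j++;        /* extend the current run */
        printf("(%c,%zu)", s[i], j - i);
        i = j;
    }
    printf("\n");
}

int main(void) {
    rle("abbbaacccca");        /* prints (a,1)(b,3)(a,2)(c,4)(a,1) */
    return 0;
}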

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.  p(a) = .2, p(b) = .5, p(c) = .3

f(i) = Σj=1,...,i-1 p(j)   ⇒   f(a) = .0, f(b) = .2, f(c) = .7

[Figure: the unit interval [0,1) partitioned into a = [0,.2), b = [.2,.7), c = [.7,1).]

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac

[Figure: start from [0,1); after b the interval is [.2,.7); after a it shrinks to [.2,.3); after c it becomes [.27,.3).]

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l0 = 0,  s0 = 1
li = li-1 + si-1 · f[ci]
si = si-1 · p[ci]

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

sn = Πi=1,...,n p[ci]

The interval for a message sequence will be called the
sequence interval
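A small C sketch of the interval update, with the probabilities of the running example and plain doubles (it only illustrates the formulas, not a finite-precision coder):

#include <stdio.h>
#include <string.h>

/* Sequence interval [l, l+s) for a message, given p[] and cumulative f[]. */
int main(void) {
    const char *sym = "abc";
    const double p[] = {0.2, 0.5, 0.3};     /* p(a), p(b), p(c)              */
    const double f[] = {0.0, 0.2, 0.7};     /* f(a), f(b), f(c): cum. prob.  */
    const char *msg = "bac";                /* only symbols from sym allowed */

    double l = 0.0, s = 1.0;                /* l0 = 0, s0 = 1                */
    for (const char *c = msg; *c; c++) {
        int i = (int)(strchr(sym, *c) - sym);
        l = l + s * f[i];                   /* l_i = l_{i-1} + s_{i-1} f[c_i] */
        s = s * p[i];                       /* s_i = s_{i-1} p[c_i]           */
        printf("after '%c': [%g, %g)\n", *c, l, l + s);
    }
    return 0;
}

On "bac" it prints [.2,.7), then [.2,.3), then the final sequence interval [.27,.3).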

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:

[Figure: .49 falls in b = [.2,.7); within it, .49 falls in b = [.3,.55); within that, .49 falls in c = [.475,.55).]

The message is bbc.

Representing a real number
Binary fractional representation:
.75 = .11      1/3 = .010101...      11/16 = .1011

Algorithm
1. x = 2·x
2. If x < 1 output 0
3. else x = x − 1; output 1

So how about just using the shortest binary
fractional representation in the sequence interval?
e.g. [0,.33) = .01     [.33,.66) = .1     [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
        min     max     interval
.11     .110    .111    [.75, 1.0)
.101    .1010   .1011   [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

[Figure: sequence interval [.61, .79); the code interval of .101, i.e. [.625, .75), is contained in it.]

Can use L + s/2 truncated to 1 + ⌈log (1/s)⌉ bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

1 + ⌈log (1/s)⌉ = 1 + ⌈log Πi (1/pi)⌉
              ≤ 2 + Σi=1,...,n log (1/pi)
              = 2 + Σk=1,...,|S| n pk log (1/pk)
              = 2 + n H0   bits

nH0 + 0.02 n bits in practice, because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R = 2^k
Use rounding to generate integer intervals
Whenever the sequence interval falls into the top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
  Output 1 followed by m 0s
  m = 0
  Message interval is expanded by 2

If u < R/2 then (bottom half)
  Output 0 followed by m 1s
  m = 0
  Message interval is expanded by 2

If l ≥ R/4 and u < 3R/4 then (middle half)
  Increment m
  Message interval is expanded by 2

All other cases: just continue...

You find this at

Arithmetic ToolBox
As a state machine

[Figure: the ATB takes the current interval (L,s) and a symbol c with distribution (p1,...,p|S|), and returns the new interval (L’,s’) with L’ = L + s·f[c] and s’ = s·p[c].]

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox

[Figure: the ATB is driven by p[ s | context ], where s is either a real symbol c or the escape symbol esc; it maps (L,s) to (L’,s’) as before.]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
String = ACCBACCACBA B,  k = 2

Context Empty:   A = 4, B = 2, C = 5, $ = 3
Context A:       C = 3, $ = 1
Context B:       A = 2, $ = 1
Context C:       A = 1, B = 2, C = 2, $ = 3
Context AC:      B = 1, C = 2, $ = 2
Context BA:      C = 1, $ = 1
Context CA:      C = 1, $ = 1
Context CB:      A = 2, $ = 1
Context CC:      A = 1, B = 1, $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
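A self-contained C sketch of the decoder for (d, len, c) triples, using the same left-to-right copy so that overlapping matches (len > d) are handled; it replays the slides' windowed parsing:

#include <stdio.h>

typedef struct { int d, len; char c; } Triple;       /* (distance, length, next char) */

/* Decode a sequence of LZ77 triples into out[]; copies may overlap the output. */
int lz77_decode(const Triple *z, int nz, char *out) {
    int cursor = 0;
    for (int t = 0; t < nz; t++) {
        for (int i = 0; i < z[t].len; i++)            /* left-to-right: handles len > d */
            out[cursor + i] = out[cursor - z[t].d + i];
        cursor += z[t].len;
        out[cursor++] = z[t].c;
    }
    out[cursor] = '\0';
    return cursor;
}

int main(void) {
    /* the slides' windowed parsing of  a a c a a c a b c a b a a a c */
    Triple z[] = { {0,0,'a'}, {1,1,'c'}, {3,4,'b'}, {3,3,'a'}, {1,2,'c'} };
    char out[64];
    lz77_decode(z, 5, out);
    printf("%s\n", out);                              /* aacaacabcabaaac */
    return 0;
}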

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
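A C sketch of the LZW coding loop, keeping dictionary entries as (prefix code, appended char) pairs and using the slides' toy numbering a=112, b=113, c=114 instead of a full 256-entry ASCII table; the SSc corner case only affects the decoder, so it does not appear here:

#include <stdio.h>

#define MAXD 4096

int main(void) {
    const char *T = "aabaacababacb";          /* the slides' encoding example           */
    int prefix[MAXD], ch[MAXD], next = 256;   /* entries 256.. are (prefix, char) pairs */

    int cur = T[0] - 'a' + 112;               /* code of the first character            */
    for (size_t j = 1; T[j]; j++) {
        int c = T[j] - 'a' + 112, found = -1;
        for (int e = 256; e < next; e++)      /* is (cur, c) already in the dictionary? */
            if (prefix[e] == cur && ch[e] == c) { found = e; break; }
        if (found >= 0) cur = found;          /* extend the current match               */
        else {
            printf("%d ", cur);               /* emit the code of the longest match     */
            prefix[next] = cur; ch[next] = c; /* add match + c as a new entry           */
            next++;
            cur = c;
        }
    }
    printf("%d\n", cur);                      /* emit the last pending code             */
    return 0;
}

On the example it prints 112 112 113 256 114 257 261 114 113, matching the slide (plus the final code).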

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform (1994)
Let us be given a text T = mississippi#

All the rotations of T:
mississippi#  ississippi#m  ssissippi#mi  sissippi#mis
issippi#miss  ssippi#missi  sippi#missis  ippi#mississ
ppi#mississi  pi#mississip  i#mississipp  #mississippi

Sort the rows:
F                L
#  mississipp    i
i  #mississip    p
i  ppi#missis    s
i  ssippi#mis    s
i  ssissippi#    m
m  ississippi    #
p  i#mississi    p
p  pi#mississ    i
s  ippi#missi    s
s  issippi#mi    s
s  sippi#miss    i
s  sissippi#m    i

A famous example
Much longer...

A useful tool: the L → F mapping

[Same sorted-rotation matrix as above, with first column F and last column L; the text T is otherwise unknown.]

How do we map L’s chars onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal chars of L;
rotate their rows rightward by one position:
same relative order !!

The BWT is invertible

[Same sorted-rotation matrix: column F, column L; T is unknown to the decoder.]

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
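A C sketch of InvertBWT, computing LF by counting and assuming L is the BWT of a text terminated by a unique '#' that sorts before every other character (so row 0 of the sorted matrix is the one starting with '#'); buffers are sized for this small example:

#include <stdio.h>
#include <string.h>

/* Invert the BWT: LF[i] = (first row of char L[i] in F) + (rank of L[i] in L[0..i]). */
void invert_bwt(const char *L, char *T) {
    int n = (int)strlen(L);
    int cnt[256] = {0}, start[256], seen[256] = {0}, LF[64];

    for (int i = 0; i < n; i++) cnt[(unsigned char)L[i]]++;
    for (int c = 0, s = 0; c < 256; c++) { start[c] = s; s += cnt[c]; }
    for (int i = 0; i < n; i++) {
        unsigned char c = (unsigned char)L[i];
        LF[i] = start[c] + seen[c]++;
    }

    int r = 0;                                 /* row 0 = "#...", so L[0] precedes '#' in T */
    T[n - 1] = '#';
    for (int i = n - 2; i >= 0; i--) { T[i] = L[r]; r = LF[r]; }
    T[n] = '\0';
}

int main(void) {
    char T[64];
    invert_bwt("ipssm#pissii", T);             /* the BWT of mississippi# from the slides */
    printf("%s\n", T);                         /* prints mississippi#                     */
    return 0;
}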

How to compute the BWT ?

SA     first 11 chars of the row   L
12     #mississipp                 i
11     i#mississip                 p
 8     ippi#missis                 s
 5     issippi#mis                 s
 2     ississippi#                 m
 1     mississippi                 #
10     pi#mississi                 p
 9     ppi#mississ                 i
 7     sippi#missi                 s
 4     sissippi#mi                 s
 6     ssippi#miss                 i
 3     ssissippi#m                 i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
Input: T = mississippi#

SA
12   #
11   i#
 8   ippi#
 5   issippi#
 2   ississippi#
 1   mississippi#
10   pi#
 9   ppi#
 7   sippi#
 4   sissippi#
 6   ssippi#
 3   ssissippi#

Elegant but inefficient
Obvious inefficiencies:
• Θ(n^2 log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults
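A short C sketch of this elegant-but-inefficient approach: sort the suffix pointers with strcmp and read L off the text via L[i] = T[SA[i]−1] (0-based, with SA[i] = 0 wrapping to the final '#'):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static const char *text;                      /* text, visible to the comparator */

static int suf_cmp(const void *a, const void *b) {
    int i = *(const int *)a, j = *(const int *)b;
    return strcmp(text + i, text + j);        /* compare the two suffixes        */
}

int main(void) {
    const char *T = "mississippi#";           /* '#' is smaller than any letter  */
    int n = (int)strlen(T), SA[64];
    char L[64];

    text = T;
    for (int i = 0; i < n; i++) SA[i] = i;
    qsort(SA, (size_t)n, sizeof(int), suf_cmp);   /* Θ(n^2 log n) worst case, as noted */

    for (int i = 0; i < n; i++)               /* L[i] = char preceding suffix SA[i]   */
        L[i] = T[(SA[i] + n - 1) % n];
    L[n] = '\0';

    for (int i = 0; i < n; i++) printf("%d ", SA[i] + 1);   /* 1-based, as in the slides */
    printf("\nL = %s\n", L);                                 /* ipssm#pissii              */
    return 0;
}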

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus γ(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size
1 trillion pages are available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change
8% new pages, 25% new links change weekly
Lifetime of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)
Set of nodes such that from any node one can reach any other node via an
undirected path.

Strongly connected components (SCC)
Set of nodes such that from any node one can reach any other node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?

The largest artifact ever conceived by humans

Exploit the structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…

Physical network graph
  V = routers
  E = communication links

The “cosine” graph (undirected, weighted)
  V = static web pages
  E = semantic distance between pages

Query-Log graph (bipartite, weighted)
  V = queries and URLs
  E = (q,u) if u is a result for q, and has been clicked by some
      user who issued q

Social graph (undirected, unweighted)
  V = users
  E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)
V = URLs, E = (u,v) if u has a hyperlink to v
Isolated URLs are ignored (no IN & no OUT)

Three key properties:
Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in-degree(u) = k ]  ∝  1/k^a,   a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
[Figure: adjacency matrix of a web crawl, rows i and columns j]

21 million pages, 150 million links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y’s copy-list informs whether the corresponding successor of the
reference x is also a successor of y;
The reference index is chosen in [0,W] so as to give the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization

Delta compression   [diff, zdelta, REBL,…]
  Compress file f deploying file f’
  Compress a group of files
  Speed-up web access by sending differences between the requested
  page and the ones available in cache

File synchronization   [rsync, zsync]
  Client updates old file f_old with f_new available on a server
  Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation
  Client updates structured old file f_old with f_new available on a server
  Update of contacts or appointments, intersect IL in P2P search engines

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides an efficient, optimal solution

f_known is the “previously encoded text”: compress f_known·f_new starting from f_new

zdelta is one of the best implementations

           Emacs size   Emacs time
uncompr    27Mb         ---
gzip       8Mb          35 secs
zdelta     1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

[Figure: Client —(slow link, delta-encoding)— Proxy —(fast link)— web; both sides of the slow link keep the reference page and exchange only the delta of the requested page.]

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: a small weighted graph over the files plus a dummy node 0; edge weights (e.g. 20, 123, 220, 620, 2000) are zdelta sizes, dummy edges are gzip sizes; the min branching picks the cheapest incoming edge of each file.]

           space   time
uncompr    30Mb    ---
tgz        20%     linear
THIS       8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly: n^2 edge calculations (zdelta executions)

We wish to exploit some pruning approach

Collection analysis: Cluster the files that appear similar and thus are
good candidates for zdelta-compression. Build a sparse weighted graph
G’_F containing only edges between those pairs of files

Assign weights: Estimate appropriate edge weights for G’_F, thus saving
zdelta executions. Nonetheless, strictly n^2 time

           space    time
uncompr    260Mb    ---
tgz        12%      2 mins
THIS       8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

          gcc size   emacs size
total     27288      27326
gzip      7563       8577
zdelta    227        1431
rsync     964        4452

Compressed sizes in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol
k blocks of n/k elems, log(n/k) levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree

[Figure: the suffix tree of T# = mississippi# (positions 1..12); edge labels are substrings of T# (e.g. #, i, s, p, si, ssi, i#, pi#, ppi#, mississippi#) and each leaf stores the starting position of its suffix.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Θ(N^2) space, if the suffixes are stored explicitly

SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
⇒ In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirect binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
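A C sketch of the indirect binary search, assuming SA has already been built: it finds the leftmost suffix prefixed by P and then scans the contiguous range (so O(p log N + occ·p) character comparisons in this simple form):

#include <stdio.h>
#include <string.h>

/* Report all positions of P in T using a (precomputed) suffix array SA. */
void sa_search(const char *T, const int *SA, int n, const char *P) {
    size_t m = strlen(P);
    int lo = 0, hi = n;                       /* first suffix whose prefix is >= P */
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (strncmp(T + SA[mid], P, m) < 0) lo = mid + 1;
        else hi = mid;
    }
    for (int i = lo; i < n && strncmp(T + SA[i], P, m) == 0; i++)
        printf("occurrence at position %d\n", SA[i] + 1);   /* 1-based, as in the slides */
}

int main(void) {
    const char *T = "mississippi#";
    int SA[] = {11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2};      /* suffix array of T, 0-based */
    sa_search(T, SA, 12, "si");                              /* positions 7 and 4          */
    return 0;
}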

Locating the occurrences

[Figure: the binary search on SA delimits the contiguous range of suffixes prefixed by P = si; here occ = 2, at positions 4 and 7 of T = mississippi#.]

Suffix Array search
• O(p + log2 N + occ) time

Suffix Trays: O(p + log2 |S| + occ)   [Cole et al., ‘06]
String B-tree   [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays   [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

T = mississippi#

SA  = 12 11 8 5 2 1 10 9 7 4 6 3
Lcp =  0  1 1 4 0 0  1 0 2 1 3

(e.g. the Lcp entry 4 is the length of the common prefix of the adjacent suffixes issippi# and ississippi#)
• How long is the common prefix between T[i,...] and T[j,...] ?
• It is the min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
• Search for an entry Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
• Search for a window Lcp[i,i+C-2] whose entries are all ≥ L


Slide 32

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros.
For j > 0, the j-th column is obtained by

   M(j) = BitShift( M(j-1) ) & U( T[j] )

For i > 1, entry M(i,j) = 1 iff
(1) the first i-1 characters of P match the i-1 characters of T
    ending at character j-1        ⇔  M(i-1,j-1) = 1
(2) P[i] = T[j]                    ⇔  the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) into the i-th position;
AND-ing it with the i-th bit of U(T[j]) establishes whether both conditions are true.
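Below is a minimal C sketch of this update loop, not taken from the slides: it assumes m ≤ 64 so that a column of M fits in one 64-bit word (bit i-1 stores row i), and the function name shift_and is just illustrative.

#include <stdint.h>
#include <stdio.h>

/* Shift-And: report all occurrences of P[0..m-1] in T[0..n-1].
   Assumes m <= 64, so a column of M fits in one word (bit i-1 = row i). */
void shift_and(const char *T, long n, const char *P, long m) {
    uint64_t U[256] = {0}, M = 0, last = 1ULL << (m - 1);
    for (long i = 0; i < m; i++)
        U[(unsigned char)P[i]] |= 1ULL << i;              /* U(x): 1s where P holds x */
    for (long j = 0; j < n; j++) {
        M = ((M << 1) | 1ULL) & U[(unsigned char)T[j]];   /* M(j) = BitShift(M(j-1)) & U(T[j]) */
        if (M & last)                                     /* bit m is set: a match ends at j   */
            printf("occurrence ending at position %ld\n", j + 1);
    }
}

Calling shift_and("xabxabaaca", 10, "abaac", 5) reports the single occurrence ending at position 9, as in the example that follows.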

An example, j=1.     T = xabxabaaca,   P = abaac,   U(x) = (0,0,0,0,0)

   M(1) = BitShift( M(0) ) & U( T[1] ) = (1,0,0,0,0) & (0,0,0,0,0) = (0,0,0,0,0)

An example, j=2.     T[2] = a,   U(a) = (1,0,1,1,0)

   M(2) = BitShift( M(1) ) & U( T[2] ) = (1,0,0,0,0) & (1,0,1,1,0) = (1,0,0,0,0)

An example, j=3.     T[3] = b,   U(b) = (0,1,0,0,0)

   M(3) = BitShift( M(2) ) & U( T[3] ) = (1,1,0,0,0) & (0,1,0,0,0) = (0,1,0,0,0)

An example, j=9.     T[9] = c,   U(c) = (0,0,0,0,1)

   M(9) = BitShift( M(8) ) & U( T[9] ) = (1,1,0,0,1) & (0,0,0,0,1) = (0,0,0,0,1)

The m-th bit of M(9) is 1, so an occurrence of P = abaac ends at position 9 of T.

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac

   U(a) = (1,0,1,1,0)    U(b) = (1,1,0,0,0)    U(c) = (0,0,0,0,1)

What about ‘?’ and ‘[^…]’ (not)?

Problem 1: Another solution
Dictionary = {bzip, not, or},   S = “bzip or not bzip”,   P = bzip = 1a 0b

[Figure: the compressed text C(S) is scanned for the codeword of P; the byte-aligned, tagged codewords let the scan reject the false alignments (“no”) and report the two true occurrences (“yes”).]

Speed ≈ Compression ratio

Problem 2
Dictionary = {bzip, not, or},   S = “bzip or not bzip”,   P = o

Given a pattern P, find all the occurrences in S of all the dictionary terms containing P as a substring.

[Figure: the terms containing “o” are not = 1g 0g 0a and or = 1g 0a 0b; their codewords are searched for in the compressed text C(S), and both are found (“yes”).]

Speed ≈ Compression ratio? No! Why?
A scan of C(S) is needed for each term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: a text T with one occurrence of P1 and one occurrence of P2 highlighted.]
 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

 S is the concatenation of the patterns in P
 R is a bitmap of length m: R[i] = 1 iff S[i] is the first symbol of a pattern

 Use a variant of the Shift-And method searching for S:
   For any symbol c, U’(c) = U(c) AND R
      U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
   For any step j,
      compute M(j)
      then M(j) OR U’(T[j]). Why?
         It sets to 1 the first bit of each pattern that starts with T[j]
      Check if there are occurrences ending in j. How?
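A minimal C sketch of one step of this variant, under the same m ≤ 64 assumption as before; R marks the first positions of the patterns inside S, and F (an extra bitmap not named on the slide) marks their last positions so that occurrences can be detected. The names are illustrative.

#include <stdint.h>

/* One step of the multi-pattern Shift-And variant (m = |S| <= 64). */
uint64_t multi_step(uint64_t M, unsigned char c,
                    const uint64_t U[256], uint64_t R, uint64_t F) {
    uint64_t Uc = U[c];
    M = ((M << 1) | 1ULL) & Uc;      /* the usual Shift-And update                        */
    M |= (Uc & R);                   /* M(j) OR U'(c): restart patterns beginning with c  */
    if (M & F) {
        /* some pattern ends at this text position: report it */
    }
    return M;
}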

Problem 3
Dictionary = {bzip, not, or},   S = “bzip or not bzip”,   P = bot,   k = 2

Given a pattern P, find all the occurrences in S of all the dictionary terms containing P as a substring, allowing at most k mismatches.

[Figure: the dictionary trie and the compressed text C(S), as in the previous slides.]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal: given k, find all the occurrences of P
in T with up to k mismatches.
We define the matrix Ml to be an m by n binary
matrix such that:

Ml(i,j) = 1 iff there are no more than l mismatches between the
first i characters of P and the i characters of T ending at character j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

[Figure: P[1..i-1] aligned against T[..j-1] with at most l mismatches, followed by the matching pair P[i] = T[j].]

   BitShift( Ml(j-1) ) & U( T[j] )

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

[Figure: P[1..i-1] aligned against T[..j-1] with at most l-1 mismatches; the pair P[i], T[j] may mismatch.]

   BitShift( Ml-1(j-1) )

Computing Ml

 We compute Ml for all l = 0, …, k.
 For each j compute M0(j), M1(j), …, Mk(j).
 For all l, initialize Ml(0) to the zero vector.
 In order to compute Ml(j), we observe that there is a match iff case 1 or case 2 above holds:

   Ml(j) = [ BitShift( Ml(j-1) ) & U( T[j] ) ]  OR  BitShift( Ml-1(j-1) )
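A C sketch of this k-mismatch recurrence, not from the slides: it assumes m ≤ 64 and k < 64, keeps the k+1 columns Ml(j-1) in an array, and uses an illustrative function name.

#include <stdint.h>
#include <stdio.h>

/* Shift-And with up to k mismatches (agrep-style). Assumes m <= 64 and k < 64. */
void shift_and_k(const char *T, long n, const char *P, long m, int k) {
    uint64_t U[256] = {0}, last = 1ULL << (m - 1);
    uint64_t Mprev[64] = {0}, Mcur[64];               /* Ml(j-1) and Ml(j), l = 0..k */
    for (long i = 0; i < m; i++)
        U[(unsigned char)P[i]] |= 1ULL << i;
    for (long j = 0; j < n; j++) {
        uint64_t Uc = U[(unsigned char)T[j]];
        for (int l = 0; l <= k; l++) {
            Mcur[l] = ((Mprev[l] << 1) | 1ULL) & Uc;          /* case 1: P[i] = T[j]        */
            if (l > 0)
                Mcur[l] |= (Mprev[l-1] << 1) | 1ULL;          /* case 2: spend one mismatch */
        }
        if (Mcur[k] & last)
            printf("occurrence with <= %d mismatches ending at %ld\n", k, j + 1);
        for (int l = 0; l <= k; l++) Mprev[l] = Mcur[l];
    }
}

On T = xabxabaaca and P = abaad with k = 1 it reports an occurrence ending at position 9, matching the M1 example below.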

Example M1
T = xabxabaaca,   P = abaad

M1 =      1 2 3 4 5 6 7 8 9 10
     1    1 1 1 1 1 1 1 1 1 1
     2    0 0 1 0 0 1 0 1 1 0
     3    0 0 0 1 0 0 1 0 0 1
     4    0 0 0 0 1 0 0 1 0 0
     5    0 0 0 0 0 0 0 0 1 0

M0 =      1 2 3 4 5 6 7 8 9 10
     1    0 1 0 0 1 0 1 1 0 1
     2    0 0 1 0 0 1 0 0 0 0
     3    0 0 0 0 0 0 1 0 0 0
     4    0 0 0 0 0 0 0 1 0 0
     5    0 0 0 0 0 0 0 0 0 0

How much do we pay?





The running time is O( k n (1 + m/w) ).
Again, the method is practically efficient for small m.
Only O(k) columns of M are needed at any given time; hence the
space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary = {bzip, not, or},   S = “bzip or not bzip”,   P = bot,   k = 2

Given a pattern P, find all the occurrences in S of all the dictionary terms containing P as a substring, allowing k mismatches.

[Figure: the Agrep scan over C(S) reports, among the terms, not = 1g 0g 0a (“yes”), which matches P = bot within the allowed mismatches.]

Agrep: more sophisticated operations


The Shift-And method can solve other ops.

The edit distance between two strings p and s is
d(p,s) = minimum number of operations needed to transform p into s via three ops:
 Insertion: insert a symbol in p
 Deletion: delete a symbol from p
 Substitution: change a symbol in p with a different one

Example: d(ananas, banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding

   γ(x) = 0 0 … 0 x-in-binary        (Length-1 zeros, then x written in binary)

   with x > 0 and Length = ⌊log2 x⌋ + 1
   e.g., 9 is represented as <000, 1001>.

 The γ-code for x takes 2⌊log2 x⌋ + 1 bits
   (i.e., a factor of 2 from optimal)

 It is optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of γ-coded integers, reconstruct the original sequence:

   0001000001100110000011101100111

Answer:   0001000 | 00110 | 011 | 00000111011 | 00111   →   8, 6, 3, 59, 7
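A small C sketch of γ-encoding and decoding over strings of '0'/'1' characters, purely illustrative and not part of the slides; gamma_encode and gamma_decode are made-up names, and the encoder assumes the output buffer starts empty and is large enough.

#include <stdio.h>
#include <string.h>

/* Append the gamma-code of x > 0 to out, written as '0'/'1' characters. */
void gamma_encode(unsigned x, char *out) {
    int len = 0;
    for (unsigned t = x; t > 0; t >>= 1) len++;          /* len = floor(log2 x) + 1 */
    for (int i = 0; i < len - 1; i++) strcat(out, "0");  /* len-1 zeros             */
    for (int i = len - 1; i >= 0; i--)                   /* x in binary, MSB first  */
        strcat(out, ((x >> i) & 1) ? "1" : "0");
}

/* Decode and print a sequence of gamma-codes. */
void gamma_decode(const char *bits) {
    for (long i = 0; bits[i]; ) {
        int z = 0;
        while (bits[i] == '0') { z++; i++; }             /* unary part: count zeros */
        unsigned x = 0;
        for (int j = 0; j <= z; j++, i++)                /* then read z+1 bits      */
            x = (x << 1) | (unsigned)(bits[i] - '0');
        printf("%u ", x);
    }
    printf("\n");
}

gamma_decode("0001000001100110000011101100111") prints 8 6 3 59 7, the answer to the exercise above.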

Analysis
Sort the pi in decreasing order, and encode si via
the variable-length code γ(i).

Recall that: |γ(i)| ≤ 2 log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2 H0(s) + 1
Key fact:
   1 ≥ Σi=1,...,x pi ≥ x * px   ⇒   x ≤ 1/px

How good is it?
Encode the integers via γ-coding:
   |γ(i)| ≤ 2 log i + 1
The cost of the encoding is (recall i ≤ 1/pi):

   Σi=1,...,|S| pi * |γ(i)|  ≤  Σi=1,...,|S| pi * [ 2 log(1/pi) + 1 ]  =  2 H0(X) + 1

Not much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2

 A new concept: Continuers vs Stoppers
 The main idea is:
    s + c = 256 (we are playing with 8 bits)
    Thus s items are encoded with 1 byte
    And s*c with 2 bytes, s*c² on 3 bytes, ...
 Previously we used: s = c = 128

An example




 5000 distinct words
 ETDC encodes 128 + 128² = 16512 words on at most 2 bytes
 A (230,26)-dense code encodes 230 + 230*26 = 6210 words on at most 2 bytes,
   hence more words on 1 byte, and thus it wins if the distribution is skewed...
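A C sketch of one possible (s,c)-dense encoder, assuming the convention that byte values 0..s-1 are stoppers and s..255 continuers, with the stopper emitted last; the exact byte layout used in the papers may differ, so treat this only as an illustration of the stopper/continuer idea (scdc_encode is a made-up name).

/* (s,c)-dense codeword of rank r (0-based), written into buf; returns its length.
   Byte values 0..s-1 act as stoppers, s..s+c-1 as continuers, with s + c = 256. */
int scdc_encode(unsigned r, int s, int c, unsigned char buf[8]) {
    int len = 0;
    buf[len++] = (unsigned char)(r % s);                 /* last byte: a stopper      */
    r /= s;
    while (r > 0) {
        r -= 1;
        buf[len++] = (unsigned char)(s + (int)(r % c));  /* earlier bytes: continuers */
        r /= c;
    }
    for (int i = 0, j = len - 1; i < j; i++, j--) {      /* reverse: continuers first */
        unsigned char t = buf[i]; buf[i] = buf[j]; buf[j] = t;
    }
    return len;
}

With s = c = 128 this behaves like ETDC: ranks 0..127 take one byte and ranks up to 16511 take two; with (s,c) = (230,26), 230 ranks take one byte and ranks up to 6209 take two, as in the example above.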

Optimal (s,c)-dense codes
Find the optimal s, assuming c = 256 - s.

 Brute-force approach
 Binary search: on real distributions, there seems to be a unique minimum

   Ks = max codeword length
   Fs,k = cumulative probability of the symbols whose |codeword| ≤ k

Experiments: (s,c)-DC is quite interesting…

Search is 6% faster than byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:
 It exploits temporal locality, and it is dynamic

 X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n² log n) bits,  MTF = O(n log n) + n² bits

Not much worse than Huffman
...but it may be far better
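A naive C sketch of the MTF encoder, not from the slides: the list is kept as a plain array and each symbol costs a linear scan; mtf_encode is a made-up name and every input symbol is assumed to occur in the initial alphabet.

#include <stdio.h>
#include <string.h>

/* Move-to-Front: print, for each input symbol, its current (1-based) position
   in the list L, then move it to the front. Naive: O(|L|) per symbol. */
void mtf_encode(const char *text, const char *alphabet) {
    char L[256];
    strcpy(L, alphabet);
    for (long i = 0; text[i]; i++) {
        int p = (int)(strchr(L, text[i]) - L);     /* position of the symbol in L     */
        printf("%d ", p + 1);
        memmove(L + 1, L, p);                      /* shift the first p symbols right */
        L[0] = text[i];                            /* ... and put the symbol in front */
    }
    printf("\n");
}

The next slides discuss how to support the same two operations in O(log |S|) time with a search tree plus a hash table.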

MTF: how good is it?
Encode the integers via γ-coding:  |γ(i)| ≤ 2 log i + 1
Put S in the front and consider the cost of encoding:

   O( |S| log |S| )  +  Σx=1,...,|S| Σi=2,...,nx | γ( pi(x) - pi-1(x) ) |

where pi(x) denotes the position of the i-th occurrence of symbol x.

By Jensen’s inequality:

   ≤  O( |S| log |S| )  +  Σx=1,...,|S| nx [ 2 log (N / nx) + 1 ]
   =  O( |S| log |S| )  +  N [ 2 H0(X) + 1 ]

Hence   La[mtf]  ≤  2 H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep the MTF-list efficiently:

 Search tree
   Leaves contain the symbols, ordered as in the MTF-list
   Nodes contain the size of their descending subtree
 Hash Table
   key is a symbol
   data is a pointer to the corresponding tree leaf

Each tree operation takes O(log |S|) time.
Total is O(n log |S|), where n = #symbols to be compressed.

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings ⇒ just the run lengths and one bit

Properties:
 There is a memory
 It exploits spatial locality, and it is a dynamic code
 X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.   p(a) = .2,  p(b) = .5,  p(c) = .3

   f(i) = Σj<i p(j)        f(a) = .0,  f(b) = .2,  f(c) = .7

[Figure: the unit interval [0,1) partitioned into a = [0,.2), b = [.2,.7), c = [.7,1).]

The interval for a particular symbol will be called
the symbol interval (e.g., for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac

   start:  [0, 1)
   b  →  [.2, .7)     (size .5)
   a  →  [.2, .3)     (size .1)
   c  →  [.27, .3)    (size .03)

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c1 c2 ... cn with probabilities p[c], use the following:

   l0 = 0      li = li-1 + si-1 * f[ci]
   s0 = 1      si = si-1 * p[ci]

f[c] is the cumulative prob. up to symbol c (not included).
Final interval size is

   sn = Πi=1,...,n p[ci]

The interval for a message sequence will be called the
sequence interval
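A tiny C sketch that just evaluates these two recurrences on the running example with real arithmetic, for intuition only; a real coder uses the integer version described later in these slides.

#include <stdio.h>

/* Compute the sequence interval [l, l+s) of the message "bac" with
   p = {.2, .5, .3} and f = {0, .2, .7} (symbols 0 = a, 1 = b, 2 = c). */
int main(void) {
    double p[3] = {0.2, 0.5, 0.3}, f[3] = {0.0, 0.2, 0.7};
    int msg[3] = {1, 0, 2};                      /* "bac" */
    double l = 0.0, s = 1.0;
    for (int i = 0; i < 3; i++) {
        l = l + s * f[msg[i]];                   /* li = li-1 + si-1 * f[ci] */
        s = s * p[msg[i]];                       /* si = si-1 * p[ci]        */
        printf("after symbol %d: [%g, %g)\n", msg[i], l, l + s);
    }
    return 0;                                    /* final interval: [0.27, 0.3) */
}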

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
   .49 ∈ [.2, .7)       →  b     (sequence interval: [.2, .7))
   .49 ∈ [.3, .55)      →  b     (sequence interval: [.3, .55))
   .49 ∈ [.475, .55)    →  c

The message is bbc.

Representing a real number
Binary fractional representation:

   .75 = .11        1/3 = .010101…        11/16 = .1011

Algorithm
 1.  x = 2*x
 2.  If x < 1 output 0
 3.  else x = x - 1; output 1

So how about just using the shortest binary
fractional number lying in the sequence interval?
e.g.  [0,.33) → .01       [.33,.66) → .1       [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
   code     min         max          interval
   .11      .11000…     .11111…      [.75, 1.0)
   .101     .10100…     .10111…      [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

[Figure: the sequence interval [.61, .79) contains the code interval [.625, .75) of the codeword .101.]

Can use l + s/2 truncated to 1 + ⌈log (1/s)⌉ bits

Bound on Arithmetic length

Note that  -log s + 1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

   1 + ⌈log (1/s)⌉  =  1 + ⌈log Πi (1/pi)⌉
                    ≤  2 + Σi=1,...,n log (1/pi)
                    =  2 + Σk=1,...,|S| n pk log (1/pk)
                    =  2 + n H0    bits

In practice it takes nH0 + 0.02 n bits,
because of rounding.

Integer Arithmetic Coding
Problem: operations on arbitrary-precision real numbers are expensive.
Key ideas of the integer version:
 Keep integers in the range [0..R) where R = 2^k
 Use rounding to generate integer intervals
 Whenever the sequence interval falls into the top, bottom or middle half, expand the interval by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
 If l ≥ R/2 (top half):
   Output 1 followed by m 0s;  m = 0;  the interval is expanded by 2
 If u < R/2 (bottom half):
   Output 0 followed by m 1s;  m = 0;  the interval is expanded by 2
 If l ≥ R/4 and u < 3R/4 (middle half):
   Increment m;  the interval is expanded by 2
 In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine

[Figure: the ATB maps the current interval (L,s) and the pair (symbol c, distribution (p1,...,p|S|)) to the new interval (L’,s’).]

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
[Figure: the ATB is driven by p[ s | context ], where s is either a character c or the escape symbol esc; it maps the interval (L,s) to (L’,s’).]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts                     String = ACCBACCACBA B,   k = 2

Order-0 context (empty):   A = 4,  B = 2,  C = 5,  $ = 3

Order-1 contexts:
   A:   C = 3,  $ = 1
   B:   A = 2,  $ = 1
   C:   A = 1,  B = 2,  C = 2,  $ = 3

Order-2 contexts:
   AC:  B = 1,  C = 2,  $ = 2
   BA:  C = 1,  $ = 1
   CA:  C = 1,  $ = 1
   CB:  A = 2,  $ = 1
   CC:  A = 1,  B = 1,  $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character
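A naive C sketch of this parsing loop, not from the slides: it scans the window by brute force (a real implementation like gzip uses hashing), allows the copy to overlap the cursor, and uses the made-up name lz77.

#include <stdio.h>
#include <string.h>

/* Naive LZ77 parsing with a sliding window of width W: at each step emit
   (distance, length, next char) and advance by length+1. */
void lz77(const char *T, long W) {
    long n = (long)strlen(T), i = 0;
    while (i < n) {
        long best_len = 0, best_d = 0;
        long start = (i - W < 0) ? 0 : i - W;
        for (long j = start; j < i; j++) {                  /* candidate copy positions   */
            long len = 0;
            while (i + len < n - 1 && T[j + len] == T[i + len])
                len++;                                      /* the copy may overlap the cursor */
            if (len > best_len) { best_len = len; best_d = i - j; }
        }
        printf("(%ld,%ld,%c) ", best_d, best_len, T[i + best_len]);
        i += best_len + 1;                                  /* advance by len + 1         */
    }
    printf("\n");
}

On T = "aacaacabcabaaac" with W = 6 it emits (0,0,a) (1,1,c) (3,4,b) (3,3,a) (1,2,c), the triples of the example above.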

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
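A C sketch of the LZW decoder, not from the slides: for simplicity the dictionary starts from a small explicit alphabet (code i = alphabet[i]) instead of the 256 ASCII entries, strings are bounded by fixed-size buffers, and lzw_decode is a made-up name. The special case is the one mentioned above: a code may arrive one step before the decoder has defined it.

#include <stdio.h>
#include <string.h>

#define MAXD 4096

/* LZW decoding: each new dictionary entry is prev + first char of cur. */
void lzw_decode(const int *codes, int ncodes, const char *alphabet) {
    static char dict[MAXD][256];
    int next = (int)strlen(alphabet);                   /* first free code */
    for (int i = 0; i < next; i++) { dict[i][0] = alphabet[i]; dict[i][1] = 0; }

    char prev[256] = "";
    for (int i = 0; i < ncodes; i++) {
        char cur[256];
        if (codes[i] < next) strcpy(cur, dict[codes[i]]);
        else { strcpy(cur, prev); strncat(cur, prev, 1); }   /* special case: code not yet defined */
        printf("%s", cur);
        if (prev[0]) {                                   /* define the entry the encoder created */
            strcpy(dict[next], prev);
            strncat(dict[next], cur, 1);
            next++;
        }
        strcpy(prev, cur);
    }
    printf("\n");
}

With alphabet "abc" (a=0, b=1, c=2) and codes {0,0,1,3,2,4,8,2}, which mirror the encoding example on the next slide, it prints aabaacababac and rebuilds the entries aa, ab, ba, aac, ca, aba, abac.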

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example

Input    Text decoded so far        Dict
112      a
112      a a                        256 = aa
113      a a b                      257 = ab
256      a a b a a                  258 = ba
114      a a b a a c                259 = aac
257      a a b a a c a b            260 = ca
261      a a b a a c a b ?          261 is not yet in the dictionary
114      a a b a a c a b a b        261 = aba    (one step later)

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform                (1994)

Let us be given a text T = mississippi#. Consider all its cyclic rotations:

   mississippi#    ississippi#m    ssissippi#mi    sissippi#mis
   issippi#miss    ssippi#missi    sippi#missis    ippi#mississ
   ppi#mississi    pi#mississip    i#mississipp    #mississippi

Sort the rows; F is the first column, L the last one:

   F                  L
   #   mississipp   i
   i   #mississip   p
   i   ppi#missis   s
   i   ssippi#mis   s
   i   ssissippi#   m
   m   ississippi   #
   p   i#mississi   p
   p   pi#mississ   i
   s   ippi#missi   s
   s   issippi#mi   s
   s   sippi#miss   i
   s   sissippi#m   i

L is the BWT of T.

A famous example

Much
longer...

A useful tool:  L → F mapping

[Figure: the BWT matrix of mississippi#, showing only the first column F and the last column L; the middle of each row is unknown.]

How do we map L’s chars onto F’s chars?
... we need to distinguish equal chars in F...

Take two equal chars of L: rotating their rows rightward by one position shows that they keep the same relative order in F !!

The BWT is invertible

[Figure: again the columns F and L of the BWT matrix of mississippi#; the middle is unknown.]

Two key properties:
1. The LF-array maps L’s chars to F’s chars
2. L[ i ] precedes F[ i ] in T

Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
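A C sketch of the same inversion, not from the slides, that also shows one standard way to build the LF array by counting; it assumes the text ends with a unique smallest character '#', and invert_bwt is a made-up name.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Invert the BWT: build LF by counting, then walk backward as in InvertBWT. */
void invert_bwt(const char *L, char *T) {
    long n = (long)strlen(L);
    if (n == 0) { T[0] = 0; return; }
    long count[256] = {0}, first[256], seen[256] = {0};
    long *LF = malloc(n * sizeof(long));
    for (long i = 0; i < n; i++) count[(unsigned char)L[i]]++;
    long sum = 0;
    for (int c = 0; c < 256; c++) { first[c] = sum; sum += count[c]; } /* where c's run starts in F */
    for (long i = 0; i < n; i++) {                                     /* LF[i]: row of F holding L[i] */
        int c = (unsigned char)L[i];
        LF[i] = first[c] + seen[c]++;
    }
    long r = 0;                       /* row 0 is "#...", so F[0] = '#' is T's last char */
    T[n] = 0; T[n-1] = '#';
    for (long i = n - 1; i > 0; i--) { T[i-1] = L[r]; r = LF[r]; }
    free(LF);
}

invert_bwt("ipssm#pissii", buf) rebuilds mississippi#.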

How to compute the BWT ?

SA = 12 11 8 5 2 1 10 9 7 4 6 3          L = i p s s m # p i s s i i

[Figure: the rows of the BWT matrix are the suffixes of T in lexicographic order, so the suffix array SA gives the row order.]

We said that L[i] precedes F[i] in T.     E.g., L[3] = T[7].
Given SA and T, we have  L[i] = T[ SA[i] - 1 ]

How to construct SA from T ?

Input: T = mississippi#
SA = 12 11 8 5 2 1 10 9 7 4 6 3, i.e. the starting positions of the suffixes
#, i#, ippi#, issippi#, ississippi#, mississippi#, pi#, ppi#, sippi#, sissippi#, ssippi#, ssissippi#
in lexicographic order.

Elegant but inefficient (sort the suffixes by comparisons):
• Θ(n² log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


 Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution

[Figure: in-degree distributions of the Altavista crawl (1999) and of the WebBase crawl (2001); the in-degree follows a power-law distribution.]

   Pr[ in-degree(u) = k ]  ∝  1/k^α,    α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph

[Figure: adjacency-matrix plot (i vs j) of a web crawl with 21 millions of pages and 150 millions of links.]

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = { s1 - x,  s2 - s1 - 1,  ...,  sk - sk-1 - 1 }

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y’s copy-list says whether the corresponding successor of the reference x is also a successor of y;
the reference index is chosen in [0,W] so as to give the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
[Figure: a sender transmits data over the network to a receiver that already holds some knowledge about the data.]

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



What if the sender has never seen the data at the receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77-scheme provides an efficient, optimal solution

   fknown is the “previously encoded text”: compress the concatenation fknown·fnew, starting from fnew

 zdelta is one of the best implementations

              Emacs size    Emacs time
   uncompr    27Mb          ---
   gzip       8Mb           35 secs
   zdelta     1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

[Figure: the client-side proxy and the server-side proxy sit at the two ends of the slow link; the requested page is delta-encoded against a reference page available at both proxies, while the server-side proxy fetches it from the web over the fast link.]

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: a small weighted graph over the files, plus a dummy node connected to all of them; edge weights are zdelta sizes, dummy edges are gzip sizes, and the min branching picks the cheapest reference for each file.]

              space    time
   uncompr    30Mb     ---
   tgz        20%      linear
   THIS       8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: constructing G is very costly, n² edge calculations (zdelta executions)

 We wish to exploit some pruning approach
 Collection analysis: cluster the files that appear similar and are thus good
   candidates for zdelta-compression. Build a sparse weighted graph G’F
   containing only edges between those pairs of files
 Assign weights: estimate appropriate edge weights for G’F, thus saving
   zdelta executions. Nonetheless, still n² time

              space    time
   uncompr    260Mb    ---
   tgz        12%      2 mins
   THIS       8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
[Figure: the client holds f_old, the server holds f_new; the client sends an update request and the server answers without shipping the whole f_new.]

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
[Figure: the client sends the hashes of f_old’s blocks to the server, which answers with an encoded version of f_new built from matching blocks and literal bytes.]

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

              gcc size    emacs size
   total      27288       27326
   gzip       7563        8577
   zdelta     227         1431
   rsync      964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), the client checks them.
Server deploys the common fref to compress the new ftar (rsync just compresses it on its own).

A multi-round protocol
 k blocks of n/k elements
 log(n/k) levels
 If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
 The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

[Figure: P aligned at position i of T, i.e. P is a prefix of the suffix T[i,N].]

Occurrences of P in T = All suffixes of T having P as a prefix

Example:  P = si,  T = mississippi   →   occurrences at positions 4, 7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree

T# = mississippi#

[Figure: the suffix tree of mississippi#: edges are labeled with substrings (#, i, s, si, ssi, p, pi#, ppi#, i#, mississippi#, ...) and its 12 leaves store the starting positions of the suffixes.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

T = mississippi#          P = si

   SA           SUF(T)
   12           #
   11           i#
    8           ippi#
    5           issippi#
    2           ississippi#
    1           mississippi#
   10           pi#
    9           ppi#
    7           sippi#
    4           sissippi#
    6           ssippi#
    3           ssissippi#

Storing SUF(T) explicitly would take Θ(N²) space; each SA entry is a suffix pointer.

Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
⇒ In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison

T = mississippi#,   P = si

[Figure: two binary-search steps on SA = 12 11 8 5 2 1 10 9 7 4 6 3; comparing P against the suffix pointed by the middle entry tells whether P is larger or smaller, and each step makes 2 accesses (one to SA, one to the text).]

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
⇒ overall, O(p log2 N) time

Improvable to O(p + log2 N) [Manber-Myers, ’90], and to O(p + log2 |S|) [Cole et al, ’06]
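A C sketch of the plain O(p log2 N) binary search described above (not the Manber-Myers refinement); SA is 1-based as in the slides, and sa_lower_bound is a made-up name.

#include <stdio.h>
#include <string.h>

/* Return the first SA index whose suffix is lexicographically >= P.
   Each comparison costs at most p = |P| character comparisons. */
long sa_lower_bound(const char *T, const long *SA, long N, const char *P, long p) {
    long lo = 0, hi = N;
    while (lo < hi) {
        long mid = (lo + hi) / 2;
        if (strncmp(T + SA[mid] - 1, P, p) < 0) lo = mid + 1;   /* SA holds 1-based positions */
        else hi = mid;
    }
    return lo;
}

int main(void) {
    const char *T = "mississippi#";
    long SA[] = {12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3};
    const char *P = "si";
    long N = 12, p = (long)strlen(P);
    long i = sa_lower_bound(T, SA, N, P, p);
    while (i < N && strncmp(T + SA[i] - 1, P, p) == 0)          /* scan the contiguous range of matches */
        printf("occurrence at position %ld\n", SA[i++]);
    return 0;
}

It prints the occurrences at positions 7 and 4, i.e. the range located on the next slide.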

Locating the occurrences

T = mississippi#,   P = si   →   occ = 2

[Figure: binary-searching for si# and si$ (where # < every char of S < $) delimits the contiguous SA range holding sippi# and sissippi#, i.e. the occurrences at positions 7 and 4.]

Suffix Array search
• O(p + log2 N + occ) time

Suffix Trays: O(p + log2 |S| + occ)     [Cole et al., ‘06]
String B-tree                            [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays             [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

T = mississippi#

   SA   suffix           Lcp
   12   #
   11   i#                0
    8   ippi#             1
    5   issippi#          1
    2   ississippi#       4
    1   mississippi#      0
   10   pi#               0
    9   ppi#              1
    7   sippi#            0
    4   sissippi#         2
    6   ssippi#           1
    3   ssissippi#        3

(e.g., Lcp = 4 for the adjacent suffixes issippi# and ississippi#)

• How long is the common prefix between T[i,...] and T[j,...] ?
   • It is the min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
   • Search for an Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times ?
   • Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.


Slide 33

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with i-th bit of U(T[j]) to estabilish if both are true

An example j=1
1 2 3 4 5 6 7 8 9 10

n
m

1

12345

1

0

P=abaac

2

0

3

0

4

0

5

0

T=xabxabaaca

0
 
0
U (x)  0
 
0
 
0

2

3

4

5

6

7

1 
 
0
BitShift ( M ( 0 )) & U (T [1])   0  &
 
0
 
0

8

9

0 0
   
0 0
0  0
   
0 0
   
0 0

1
0

An example, j=2
T[2] = a and U(a) = (1,0,1,1,0)ᵀ, so
M(2) = BitShift(M(1)) & U(T[2]) = (1,0,0,0,0)ᵀ & (1,0,1,1,0)ᵀ = (1,0,0,0,0)ᵀ

An example, j=3
T[3] = b and U(b) = (0,1,0,0,0)ᵀ, so
M(3) = BitShift(M(2)) & U(T[3]) = (1,1,0,0,0)ᵀ & (0,1,0,0,0)ᵀ = (0,1,0,0,0)ᵀ

An example, j=9
T[9] = c and U(c) = (0,0,0,0,1)ᵀ, so
M(9) = BitShift(M(8)) & U(T[9]) = (1,1,0,0,1)ᵀ & (0,0,0,0,1)ᵀ = (0,0,0,0,1)ᵀ
The m-th (last) bit of column 9 is 1: an occurrence of P = abaac ends at position 9 of T.

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: Another solution
Dictionary = {bzip, not, or, space};  S = "bzip or not bzip";  P = bzip = 1a 0b
[Figure: the Shift-And automaton run over the codewords of C(S), marking "yes"/"no" at each candidate codeword boundary; the two occurrences of bzip are found.]

Speed ≈ Compression ratio

Problem 2
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring (e.g. P = o).
[Figure: dictionary {bzip, not, or, space} and compressed text C(S) for S = "bzip or not bzip"; the matching terms are encoded as not = 1g 0g 0a and or = 1g 0a 0b.]

Speed ≈ Compression ratio? No! Why?
A scan of C(S) is needed for each term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: text T with one occurrence of P1 and one occurrence of P2 highlighted.]

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

S is the concatenation of the patterns in P.
R is a bitmap of length m: R[i] = 1 iff S[i] is the first symbol of a pattern.
Use a variant of the Shift-And method searching for S:
  For any symbol c, U’(c) = U(c) AND R
    U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
  For any step j,
    compute M(j)
    then OR it with U’(T[j]). Why?
      It sets to 1 the first bit of each pattern that starts with T[j]
    Check if there are occurrences ending in j. How? (see the sketch below)
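A minimal sketch of this multi-pattern variant, under the assumption (not in the slides) that an extra bitmap E, marking the last symbol of each pattern, is what answers the final question; R and U'(c) are as described above.

def multi_shift_and(T, patterns):
    S = "".join(patterns)
    U, R, E = {}, 0, 0
    pos = 0
    for P in patterns:
        R |= 1 << pos                          # first symbol of this pattern
        E |= 1 << (pos + len(P) - 1)           # last symbol of this pattern (assumed reporting mask)
        for i, c in enumerate(P):
            U[c] = U.get(c, 0) | (1 << (pos + i))
        pos += len(P)
    M = 0
    occ = []
    for j, c in enumerate(T):
        Uc = U.get(c, 0)
        M = ((M << 1) & Uc) | (Uc & R)         # shift-and, then OR in U'(c) = U(c) & R
        if M & E:                              # some pattern ends at position j
            occ.append(j)
    return occ

# multi_shift_and("abcacab", ["ca", "ab"]) -> [1, 3, 5, 6]   (0-based ending positions)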

Problem 3
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring, allowing at most k mismatches (e.g. P = bot, k = 2).
[Figure: dictionary {bzip, not, or, space} and compressed text C(S) for S = "bzip or not bzip".]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal: given k, find all the occurrences of P
in T with up to k mismatches.
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
there are no more than l mismatches between the
first i characters of P and the i characters of T
ending at character j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

[Figure: P[1..i-1] aligned with T[..j-1] with at most l mismatches (marked *), then P[i] = T[j].]

  BitShift( M_l(j-1) ) & U(T[j])

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

[Figure: P[1..i-1] aligned with T[..j-1] with at most l-1 mismatches; the pair P[i], T[j] may mismatch.]

  BitShift( M_{l-1}(j-1) )

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))
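A minimal sketch of this recurrence, keeping the k+1 current columns M_0(j),...,M_k(j) as machine words; for l = 0 the second term is absent (there is no M_{-1}).

def shift_and_mismatches(T, P, k):
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    last = 1 << (m - 1)
    M = [0] * (k + 1)                          # M[l] = column M_l(j-1)
    occ = []
    for j, c in enumerate(T):
        Uc = U.get(c, 0)
        prev = 0                               # M_{l-1}(j-1)
        for l in range(k + 1):
            cur = ((M[l] << 1) | 1) & Uc       # case 1: exact extension
            if l > 0:
                cur |= (prev << 1) | 1         # case 2: spend one mismatch (BitShift sets the first bit)
            prev, M[l] = M[l], cur
        if M[k] & last:
            occ.append(j - m + 1)              # start of an occurrence with <= k mismatches (0-based)
    return occ

# shift_and_mismatches("aatatccacaa", "atcgaa", 2) -> [3]   (position 4 in 1-based counting)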

Example M1
T = xabxabaaca,  P = abaad

M1 =       j: 1  2  3  4  5  6  7  8  9  10
     i=1:     1  1  1  1  1  1  1  1  1  1
     i=2:     0  0  1  0  0  1  0  1  1  0
     i=3:     0  0  0  1  0  0  1  0  0  1
     i=4:     0  0  0  0  1  0  0  1  0  0
     i=5:     0  0  0  0  0  0  0  0  1  0

M0 =       j: 1  2  3  4  5  6  7  8  9  10
     i=1:     0  1  0  0  1  0  1  1  0  1
     i=2:     0  0  1  0  0  1  0  0  0  0
     i=3:     0  0  0  0  0  0  1  0  0  0
     i=4:     0  0  0  0  0  0  0  1  0  0
     i=5:     0  0  0  0  0  0  0  0  0  0

How much do we pay?





The running time is O(kn(1+m/w)).
Again, the method is practically efficient for
small m.
Still, only O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring, allowing k mismatches (e.g. P = bot, k = 2).
[Figure: dictionary {bzip, not, or, space} and compressed text C(S) for S = "bzip or not bzip"; the matching term is not = 1g 0g 0a.]

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum number of operations
needed to transform p into s via three ops:
  Insertion: insert a symbol in p
  Deletion: delete a symbol from p
  Substitution: change a symbol in p with a different one

Example: d(ananas, banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding

  g(x) = 0^(Length-1) followed by x in binary,
  where x > 0 and Length = ⌊log2 x⌋ + 1
  e.g., 9 is represented as <000, 1001>.

  The g-code for x takes 2⌊log2 x⌋ + 1 bits
  (i.e., a factor of 2 from optimal).
  Optimal for Pr(x) = 1/(2x²), and i.i.d. integers.

It is a prefix-free encoding…

Given the following sequence of g-coded
integers, reconstruct the original sequence:

  0001000001100110000011101100111

  → 8, 6, 3, 59, 7
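A minimal sketch of g-encoding/decoding over bit-strings (illustrative names, not library code):

def gamma_encode(x):                      # x > 0
    b = bin(x)[2:]                        # binary representation, no leading zeros
    return "0" * (len(b) - 1) + b         # Length-1 zeros, then x in binary

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":             # count the leading zeros = Length-1
            z += 1; i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out

# gamma_encode(9) == "0001001"
# gamma_decode("0001000001100110000011101100111") == [8, 6, 3, 59, 7]   (the exercise above)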

Analysis
Sort the pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that |g(i)| ≤ 2·log2 i + 1.
How good is this approach wrt Huffman?
Compression ratio ≤ 2·H0(S) + 1.
Key fact:
  1 ≥ Σi=1,...,x pi ≥ x·px   ⟹   x ≤ 1/px

How good is it ?
Encode the integers via g-coding: |g(i)| ≤ 2·log2 i + 1.
The cost of the encoding is (recall i ≤ 1/pi):

  Σi=1,...,|S| pi·|g(i)|  ≤  Σi=1,...,|S| pi·[2·log2(1/pi) + 1]  =  2·H0(X) + 1

Not much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2

A new concept: Continuers vs Stoppers
  Previously we used: s = c = 128

The main idea is:
  s + c = 256 (we are playing with 8 bits)
  Thus s items are encoded with 1 byte,
  s·c with 2 bytes, s·c² with 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC is quite interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:
  It exploits temporal locality, and it is a dynamic code
  X = 1^n 2^n 3^n … n^n  ⟹  Huff = O(n² log n) bits, MTF = O(n log n) + n² bits
Not much worse than Huffman
...but it may be far better
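A minimal sketch of MTF over an explicit list (positions are 0-based here); a real implementation would keep the list with the search-tree/hash-table machinery described further below.

def mtf_encode(s, alphabet):
    L = list(alphabet)                    # e.g. ['a','b','c','d',...]
    out = []
    for c in s:
        i = L.index(c)                    # position of c in the current list
        out.append(i)
        L.pop(i); L.insert(0, c)          # move c to the front
    return out

def mtf_decode(codes, alphabet):
    L = list(alphabet)
    out = []
    for i in codes:
        c = L[i]
        out.append(c)
        L.pop(i); L.insert(0, c)
    return "".join(out)

# mtf_encode("aabbbc", "abc") == [0, 0, 1, 0, 0, 2]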

MTF: how good is it ?
Encode the integers via g-coding:
  |g(i)| ≤ 2·log2 i + 1
Put the alphabet S at the front of the list and consider the cost of encoding:

  O(|S| log |S|)  +  Σx=1,...,|S|  Σi=2,...,nx  |g( p_i^x - p_{i-1}^x )|

where p_1^x < p_2^x < ... are the positions of symbol x in the input. By Jensen’s inequality:

  ≤  O(|S| log |S|)  +  Σx=1,...,|S|  nx·[ 2·log2(N/nx) + 1 ]
  =  O(|S| log |S|)  +  N·[ 2·H0(X) + 1 ]

Hence  La[mtf] ≤ 2·H0(X) + O(1).

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep the MTF-list efficiently:

  Search tree
    Leaves contain the symbols, ordered as in the MTF-list
    Nodes contain the size of their descending subtree

  Hash Table
    key is a symbol
    data is a pointer to the corresponding tree leaf

Each tree operation takes O(log |S|) time
Total is O(n log |S|), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings ⟹ just the run lengths and one bit
Properties:
  There is a memory: it exploits spatial locality, and it is a dynamic code
  X = 1^n 2^n 3^n … n^n  ⟹  Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive), according to the cumulative probabilities

  f(i) = Σj<i p(j)

e.g.  p(a) = .2, p(b) = .5, p(c) = .3  ⟹  f(a) = .0, f(b) = .2, f(c) = .7
      a → [0.0, 0.2),   b → [0.2, 0.7),   c → [0.7, 1.0)

The interval for a particular symbol will be called
the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac

  start:    [0.0, 1.0)
  after b:  [0.2, 0.7)     (size .5)
  after a:  [0.2, 0.3)     (size .1)
  after c:  [0.27, 0.3)    (size .03)

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities p[c] use the following:

  l0 = 0      li = l(i-1) + s(i-1)·f[ci]
  s0 = 1      si = s(i-1)·p[ci]

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

  sn = Πi=1,...,n p[ci]

The interval for a message sequence will be called the
sequence interval
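A minimal sketch of the sequence-interval computation with these formulas, on the toy distribution above (plain floating point, so only illustrative; the integer version discussed below is what one would really implement):

def sequence_interval(msg, p, f):
    l, s = 0.0, 1.0                       # l0 = 0, s0 = 1
    for c in msg:
        l = l + s * f[c]                  # li = l(i-1) + s(i-1) * f[ci]
        s = s * p[c]                      # si = s(i-1) * p[ci]
    return l, s

p = {"a": 0.2, "b": 0.5, "c": 0.3}
f = {"a": 0.0, "b": 0.2, "c": 0.7}
# sequence_interval("bac", p, f) ≈ (0.27, 0.03), i.e. the interval [.27, .30)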

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:

  .49 ∈ [0.2, 0.7)    → b,  rescale within [0.2, 0.7)
  .49 ∈ [0.3, 0.55)   → b,  rescale within [0.3, 0.55)
  .49 ∈ [0.475, 0.55) → c

The message is bbc.

Representing a real number
Binary fractional representation:
  .75   = .11
  1/3   = .010101...
  11/16 = .1011

Algorithm
  1. x = 2·x
  2. If x < 1 output 0
  3. else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval?
e.g. [0,.33) = .01     [.33,.66) = .1     [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.

  number   min     max     interval
  .11      .110    .111    [.75, 1.0)
  .101     .1010   .1011   [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

  Sequence interval: [.61, .79)
  Code interval (.101): [.625, .75)

Can use L + s/2 truncated to 1 + ⌈log2(1/s)⌉ bits

Bound on Arithmetic length

Note that ⌈-log2 s⌉ + 1 = ⌈log2(2/s)⌉

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most

  1 + ⌈log2(1/s)⌉
  = 1 + ⌈log2 Πi (1/pi)⌉
  ≤ 2 + Σi=1,...,n log2(1/pi)
  = 2 + Σk=1,...,|S| n·pk·log2(1/pk)
  = 2 + n·H0   bits

nH0 + 0.02·n bits in practice,
because of rounding.

Integer Arithmetic Coding
Problem: operations on arbitrary-precision
real numbers are expensive.
Key ideas of the integer version:
  Keep integers in range [0..R) where R = 2^k
  Use rounding to generate integer intervals
  Whenever the sequence interval falls into the top,
  bottom or middle half, expand the interval
  by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
  If l ≥ R/2 (top half):
    Output 1 followed by m 0s; m = 0; the message interval is expanded by 2
  If u < R/2 (bottom half):
    Output 0 followed by m 1s; m = 0; the message interval is expanded by 2
  If l ≥ R/4 and u < 3R/4 (middle half):
    Increment m; the message interval is expanded by 2
  In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine: ATB maps the current interval (L,s) and the pair
⟨symbol c, distribution (p1,...,p|S|)⟩ to the new interval (L',s').
Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
[Figure: the ATB state machine driven by p[s | context], where s = c or esc, mapping (L,s) to (L',s').]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts      String = ACCBACCACBA, next symbol B,  k = 2

  Order 0                Order 1                 Order 2
  Context (empty):       Context A: C=3  $=1     Context AC: B=1  C=2  $=2
    A=4 B=2 C=5 $=3      Context B: A=2  $=1     Context BA: C=1  $=1
                         Context C: A=1  B=2     Context CA: C=1  $=1
                                    C=2  $=3     Context CB: A=2  $=1
                                                 Context CC: A=1  B=1  $=2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor:
  for (i = 0; i < len; i++)
      out[cursor + i] = out[cursor - d + i];   // the copy may overlap the bytes just written



Output is correct: abcdcdcdcdcdce
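A minimal sketch of a decoder for (d, len, c) triples, using exactly the overlap-friendly copy above (illustrative, not gzip's code):

def lz77_decode(triples):
    out = []
    for d, length, c in triples:
        start = len(out) - d
        for i in range(length):           # byte by byte, so the copy may overlap itself
            out.append(out[start + i])
        out.append(c)                     # the explicit next character
    return "".join(out)

# The windowed example above:
# lz77_decode([(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')]) == "aacaacabcabaaac"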

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash table to speed up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257   a a b a a c a b            260=ca

261   a a b a a c a b a b a ?    261=aba   (261 is not yet in the decoder’s dictionary:
                                            it is inferred one step later as prev + prev[0])

114   a a b a a c a b a b a c    262=abac
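A minimal sketch of LZW on strings (the dictionary is seeded with the distinct chars of the input rather than 256 ASCII entries, just to keep it self-contained); the decoder resolves the special case by building the missing entry one step later:

def lzw_encode(s):
    dic = {c: i for i, c in enumerate(sorted(set(s)))}
    w, out = "", []
    for c in s:
        if w + c in dic:
            w += c
        else:
            out.append(dic[w])
            dic[w + c] = len(dic)         # add Sc to the dictionary, but do not send c
            w = c
    out.append(dic[w])
    return out, sorted(set(s))

def lzw_decode(codes, alphabet):
    dic = {i: c for i, c in enumerate(alphabet)}
    prev = dic[codes[0]]
    out = [prev]
    for k in codes[1:]:
        cur = dic[k] if k in dic else prev + prev[0]   # the code-not-yet-known case
        out.append(cur)
        dic[len(dic)] = prev + cur[0]
        prev = cur
    return "".join(out)

# codes, alpha = lzw_encode("aabaacababacb");  lzw_decode(codes, alpha) == "aabaacababacb"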

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform   (1994)
Let us be given a text T = mississippi#

All rotations of T          Sort the rows         F                L
mississippi#                                      #  mississipp    i
ississippi#m                                      i  #mississip    p
ssissippi#mi                                      i  ppi#missis    s
sissippi#mis                                      i  ssippi#mis    s
issippi#miss                                      i  ssissippi#    m
ssippi#missi                                      m  ississippi    #
sippi#missis                                      p  i#mississi    p
ippi#mississ                                      p  pi#mississ    i
ppi#mississi                                      s  ippi#missi    s
pi#mississip                                      s  issippi#mi    s
i#mississipp                                      s  sippi#miss    i
#mississippi                                      s  sissippi#m    i

A famous example (on much longer texts)...

A useful tool: L → F mapping
How do we map L’s chars onto F’s chars ?
... We need to distinguish equal chars in F...

Take two equal chars of L;
rotate their rows rightward by one position:
they keep the same relative order !!

[Figure: the sorted BWT matrix of T = mississippi#, with the F and L columns highlighted.]

The BWT is invertible
Two key properties:
  1. The LF-array maps L’s chars to F’s chars
  2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

  T = .... i ppi #

InvertBWT(L)
  Compute LF[0,n-1];
  r = 0; i = n;
  while (i > 0) {
    T[i] = L[r];
    r = LF[r]; i--;
  }

How to compute the BWT ?

  SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]        (suffix array of T = mississippi#)
  The rows of the BWT matrix are the suffixes in SA order, and L = i p s s m # p i s s i i

We said that: L[i] precedes F[i] in T
  e.g.  L[3] = T[7]
Given SA and T, we have L[i] = T[SA[i]-1]
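A minimal sketch: here the BWT is obtained by sorting the rotations directly (the slides obtain L from the suffix array as L[i] = T[SA[i]-1]), and the inversion follows the LF-mapping idea of InvertBWT; it assumes T ends with a unique smallest end-marker such as '#'.

def bwt(T):
    n = len(T)
    rows = sorted(range(n), key=lambda i: T[i:] + T[:i])   # sort all rotations
    return "".join(T[(i - 1) % n] for i in rows)           # last column L

def ibwt(L):
    n = len(L)
    order = sorted(range(n), key=lambda r: (L[r], r))      # stable sort of L gives F: equal chars keep order
    LF = [0] * n
    for f, r in enumerate(order):
        LF[r] = f                                          # L[r] corresponds to F[f]
    out, r = [], 0                                         # row 0 starts with the end-marker
    for _ in range(n):
        out.append(L[r])                                   # L[r] precedes F[r] in T: fill T backward
        r = LF[r]
    s = "".join(reversed(out))                             # this is T rotated so that '#' comes first
    return s[1:] + s[0]

# bwt("mississippi#") == "ipssm#pissii"  and  ibwt("ipssm#pissii") == "mississippi#"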

How to construct SA from T ?

  SA     suffix
  12     #
  11     i#
   8     ippi#
   5     issippi#
   2     ississippi#
   1     mississippi#
  10     pi#
   9     ppi#
   7     sippi#
   4     sissippi#
   6     ssippi#
   3     ssissippi#

Elegant but inefficient            Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…

  Physical network graph
    V = Routers
    E = communication links

  The “cosine” graph (undirected, weighted)
    V = static web pages
    E = semantic distance between pages

  Query-Log graph (bipartite, weighted)
    V = queries and URLs
    E = (q,u) if u is a result for q, and has been clicked by some
        user who issued q

  Social graph (undirected, unweighted)
    V = users
    E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)
  V = URLs, E = (u,v) if u has a hyperlink to v
Isolated URLs are ignored (no IN & no OUT)
Three key properties:
  Skewed distribution: the probability that a node has x links is ∝ 1/x^a, a ≈ 2.1

The In-degree distribution
Altavista crawl (1999) and WebBase crawl (2001):
the indegree follows a power-law distribution

  Pr[ in-degree(u) = k ]  ∝  1/k^a ,    a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has a hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: the probability that a node has x links is ∝ 1/x^a, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
[Figure: adjacency matrix of a crawl with 21 million pages and 150 million links; URL-sorting (e.g. the Berkeley and Stanford hosts) makes the matrix block-structured.]

URL compression + Delta encoding

The library WebGraph   (a Java and C++ lib, ≈3 bits/edge)

Gap encoding (exploits locality)
  From the uncompressed adjacency list to an adjacency list with compressed gaps:
  Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}
  (negative entries, possible only for the first gap, get a special encoding)

Copy-lists (exploit similarity)
  Each bit of y’s copy-list tells whether the corresponding successor of the
  reference list x is also a successor of y;
  the reference index is chosen in [0,W] so as to give the best compression
  (reference chains possibly limited).

Copy-blocks = RLE(Copy-list)
  The first copy block is 0 if the copy list starts with 0;
  The last block is omitted (we know the length…);
  The length is decremented by one for all blocks
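A minimal sketch of the gap computation for one successor list (the sign handling of the first entry and the g-coding of the gaps are omitted); the names are illustrative, this is not the WebGraph API.

def gaps(x, succ):                         # succ: sorted successor list of node x
    out = [succ[0] - x]                    # first gap is relative to the node id (may be negative)
    out += [succ[i] - succ[i - 1] - 1 for i in range(1, len(succ))]
    return out

# gaps(15, [17, 20, 21, 30]) == [2, 2, 0, 8]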

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



What if the sender has never seen the data at the receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides an efficient, optimal solution

  fknown is the “previously encoded text”: compress the concatenation fknown·fnew starting from fnew

zdelta is one of the best implementations

            Emacs size   Emacs time
  uncompr   27Mb         ---
  gzip      8Mb          35 secs
  zdelta    1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

[Figure: Client ↔ (slow link, delta-encoding) ↔ Proxy ↔ (fast link) ↔ web; both proxies hold a reference version of the requested page.]

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: example weighted graph GF with a dummy node 0; edge weights are zdelta sizes (e.g. 20, 123, 220, 620, 2000), and the min branching selects the cheapest reference for each file.]

            space   time
  uncompr   30Mb    ---
  tgz       20%     linear
  THIS      8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
            space   time
  uncompr   260Mb   ---
  tgz       12%     2 mins
  THIS      8%      16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
[Figure: the Client (holding f_old) sends block hashes to the Server (holding f_new); the Server replies with the encoded file, i.e. copy/literal instructions built against f_new.]

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
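A minimal sketch of the server-side matching idea: a weak rsync-style checksum filters candidate alignments and a strong hash (MD5, as in the slides) confirms them; here the weak checksum is recomputed at every position instead of being rolled in O(1), and all names are illustrative.

import hashlib

def weak(block):                                   # rsync-style weak checksum (two 16-bit sums)
    s1 = sum(block) & 0xFFFF
    s2 = sum((len(block) - i) * b for i, b in enumerate(block)) & 0xFFFF
    return (s2 << 16) | s1

def find_matches(f_new, block_hashes, B):
    # block_hashes: {weak_value -> (md5_hexdigest, block index)} computed by the client on f_old
    matches = []
    for i in range(len(f_new) - B + 1):            # naive re-hash; real rsync rolls the checksum in O(1)
        w = weak(f_new[i:i + B])
        if w in block_hashes:
            strong, idx = block_hashes[w]
            if hashlib.md5(f_new[i:i + B]).hexdigest() == strong:
                matches.append((i, idx))           # this region can be sent as "copy block idx"
    return matches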

Rsync: some experiments

           gcc size   emacs size
  total    27288      27326
  gzip     7563       8577
  zdelta   227        1431
  rsync    964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends the hashes (unlike the client in rsync), and the client checks them.
The server deploys the common fref to compress the new ftar (rsync compresses just it).

A multi-round protocol

k blocks of n/k elems, log(n/k) levels.
If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits.

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

[Figure: P aligned at position i of T, matching a prefix of the suffix T[i,N].]

Occurrences of P in T = all suffixes of T having P as a prefix
  e.g.  P = si, T = mississippi  ⟹  occurrences at positions 4, 7

SUF(T) = sorted set of suffixes of T

Reduction: from substring search to prefix search

The Suffix Tree
[Figure: the suffix tree of T# = mississippi#, with edge labels (#, i, s, p, si, ssi, ppi#, pi#, i#, mississippi#, ...) and leaves storing the starting positions 1..12 of the suffixes.]

The Suffix Array
Prop 1. All suffixes in SUF(T) having prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

Storing SUF(T) explicitly takes Θ(N²) space, so we store suffix pointers:

  SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]      T = mississippi#,  P = si
  SUF(T) = #, i#, ippi#, issippi#, ississippi#, mississippi#, pi#, ppi#, sippi#, sissippi#, ssippi#, ssissippi#

Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
  ⟹ In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison,
2 accesses per step (one to SA, one to the text).

[Figure: binary search for P = si over the SA of T = mississippi#; at each step P is
compared with the suffix pointed to by the middle SA entry, and the search moves
right if P is larger, left if P is smaller.]

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
  ⟹ overall, O(p log2 N) time
  [improvable to O(p + log2 N), Manber-Myers ’90, and to O(p + log2 |S|), Cole et al. ’06]
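A minimal sketch of the indirect binary search (0-based positions, naive SA construction as in the "elegant but inefficient" slide above):

def sa_build(T):
    return sorted(range(len(T)), key=lambda i: T[i:])

def sa_search(T, SA, P):
    lo, hi = 0, len(SA)                     # find the leftmost suffix whose prefix is >= P
    while lo < hi:
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + len(P)] < P:
            lo = mid + 1
        else:
            hi = mid
    occ = []
    while lo < len(SA) and T[SA[lo]:SA[lo] + len(P)] == P:
        occ.append(SA[lo]); lo += 1         # the matching suffixes are contiguous (Prop 1)
    return sorted(occ)

# T = "mississippi#";  SA = sa_build(T);  sa_search(T, SA, "si") == [3, 6]   (positions 4, 7 in 1-based counting)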

Locating the occurrences
[Figure: the contiguous SA range of suffixes prefixed by P = si (namely sippi# and
sissippi#) gives the occ = 2 occurrences, at positions 7 and 4 of T = mississippi#.]

Suffix Array search
• O(p + log2 N + occ) time

  Suffix Trays: O(p + log2 |S| + occ)    [Cole et al., ’06]
  String B-tree                          [Ferragina-Grossi, ’95]
  Self-adjusting Suffix Arrays           [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

  SA  = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]        T = mississippi#
  Lcp = [ 0,  1, 1, 4, 0, 0,  1, 0, 2, 1, 3]
  (e.g. Lcp = 4 for the adjacent suffixes issippi# and ississippi#)

• How long is the common prefix between T[i,...] and T[j,...] ?
  • Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  • Search for some Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  • Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.


then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with i-th bit of U(T[j]) to estabilish if both are true

An example, j = 1
[Slide figure: T = xabxabaaca, P = abaac. M(1) = BitShift(M(0)) & U(T[1]) = (1,0,0,0,0) & U(x) = (0,0,0,0,0): T[1] = x does not occur in P.]

An example, j = 2
[M(2) = BitShift(M(1)) & U(T[2]) = (1,0,0,0,0) & U(a) = (1,0,0,0,0): T[2] = a matches P[1].]

An example, j = 3
[M(3) = BitShift(M(2)) & U(T[3]) = (1,1,0,0,0) & U(b) = (0,1,0,0,0): T[3] = b matches P[2].]

An example, j = 9
[M(9) = BitShift(M(8)) & U(T[9]) = (1,1,0,0,1) & U(c) = (0,0,0,0,1): M(5,9) = 1, so an occurrence of P ends at position 9, i.e. it starts at position 5.]

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
U(a) = (1,0,1,1,0), U(b) = (1,1,0,0,0), U(c) = (0,0,0,0,1)

What about '?' and '[^…]' (negated classes of chars)?

Problem 1: Another solution
[Slide figure: the same dictionary and compressed text C(S) for S = "bzip or not bzip"; the compressed pattern P = bzip = 1a 0b is searched directly in C(S), with candidate positions marked yes/no.]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P, find all the occurrences in S of all the dictionary terms containing P as a substring.
[Slide figure: with P = o, the dictionary terms containing o are not = 1g 0g 0a and or = 1g 0a 0b, whose codewords are then searched in C(S) for S = "bzip or not bzip".]

Speed ≈ Compression ratio? No! Why? We need one scan of C(S) for each dictionary term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Slide figure: a text T with the occurrences of patterns P1 and P2 highlighted.]

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

S is the concatenation of the patterns in P.
R is a bitmap of length m: R[i] = 1 iff S[i] is the first symbol of a pattern.
Use a variant of the Shift-And method searching for S (a sketch follows below):
 For any symbol c, U'(c) = U(c) AND R, so U'(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern.
 For any step j, compute M(j) and then OR it with U'(T[j]). Why? This sets to 1 the first bit of each pattern that starts with T[j].
 Check if there are occurrences ending at j. How?
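A rough sketch of this variant, assuming the total pattern length fits in one 64-bit word; multi_shift_and, the bitmap E of pattern endings and the reporting format are illustrative additions:

#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Multi-pattern Shift-And: S = concatenation of the patterns,
// R marks the first position of each pattern, E its last position.
void multi_shift_and(const std::string& T, const std::vector<std::string>& pats) {
    std::string S;
    uint64_t R = 0, E = 0, U[256] = {0};
    std::vector<int> pat_of_bit(64, -1);
    for (size_t k = 0; k < pats.size(); ++k) {
        R |= 1ULL << S.size();                    // first symbol of pattern k
        S += pats[k];
        E |= 1ULL << (S.size() - 1);              // last symbol of pattern k
        pat_of_bit[S.size() - 1] = (int)k;
    }
    for (size_t i = 0; i < S.size(); ++i) U[(unsigned char)S[i]] |= 1ULL << i;

    uint64_t M = 0;
    for (size_t j = 0; j < T.size(); ++j) {
        uint64_t u = U[(unsigned char)T[j]];
        M = ((M << 1) & u)                        // usual Shift-And step on S
          | (u & R);                              // U'(T[j]): restart every pattern beginning with T[j]
        uint64_t hits = M & E;                    // occurrences ending at j
        for (size_t i = 0; hits != 0 && i < S.size(); ++i)
            if (hits & (1ULL << i))
                std::cout << "pattern " << pat_of_bit[i] << " ends at " << j + 1 << "\n";
    }
}

int main() { multi_shift_and("bzip or not bzip", {"or", "not"}); }   // positions 7 and 11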

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

[Slide figure: the dictionary (bzip, not, or, space) and the compressed text C(S) for S = "bzip or not bzip", searched with P = bot and k = 2.]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

[Slide figure: P[1..i-1] aligned against the substring of T ending at j-1, with at most l mismatches, and P[i] = T[j].]

BitShift(Ml(j-1)) & U(T[j])

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

[Slide figure: P[1..i-1] aligned against the substring of T ending at j-1, with at most l-1 mismatches.]

BitShift(Ml-1(j-1))

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))
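A compact sketch of this recurrence, assuming m ≤ 64 and mismatches only (k_mismatch and its output format are illustrative):

#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Agrep-style Shift-And with up to k mismatches: one 64-bit word per column Ml(j),
// Ml(j) = (BitShift(Ml(j-1)) & U(T[j])) | BitShift(Ml-1(j-1)).
void k_mismatch(const std::string& T, const std::string& P, int k) {
    size_t m = P.size();
    uint64_t U[256] = {0};
    for (size_t i = 0; i < m; ++i) U[(unsigned char)P[i]] |= 1ULL << i;

    std::vector<uint64_t> M(k + 1, 0), prev;
    for (size_t j = 0; j < T.size(); ++j) {
        prev = M;
        uint64_t u = U[(unsigned char)T[j]];
        M[0] = ((prev[0] << 1) | 1ULL) & u;               // exact-match column (plain Shift-And)
        for (int l = 1; l <= k; ++l)
            M[l] = (((prev[l] << 1) | 1ULL) & u)          // case 1: P[i] = T[j]
                 | ((prev[l - 1] << 1) | 1ULL);           // case 2: spend one mismatch on T[j]
        if (M[k] & (1ULL << (m - 1)))                     // at most k mismatches, ending at j
            std::cout << "occurrence ending at " << j + 1 << "\n";
    }
}

int main() { k_mismatch("aatatccacaa", "atcgaa", 2); }

On the earlier example (P = atcgaa, k = 2) this reports only the occurrence starting at position 4, i.e. ending at position 9.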

Example M1
[Slide figure: T = xabxabaaca, P = abaad; the 5-by-10 matrices M0 (exact prefix matches) and M1 (at most one mismatch). For instance M1(5,9) = 1, because P matches T[5..9] = abaac with a single mismatch, in its last character.]

How much do we pay?





The running time is O(kn(1+m/w)).
Again, the method is practically efficient for small m.
Still, only O(k) columns of M are needed at any given time; hence the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing k mismatches.
[Slide figure: the dictionary and C(S) for S = "bzip or not bzip", with P = bot and k = 2; matching terms such as not = 1g 0g 0a are then searched in C(S).]

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum number of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p into a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code (Elias' γ-code) for integer encoding
g(x) = 0^(Length-1) · bin(x), for x > 0, where Length = ⌊log2 x⌋ + 1 is the number of bits of x in binary.
e.g., 9 is represented as <000, 1001>.

The g-code for x takes 2⌊log2 x⌋ + 1 bits (i.e., a factor of 2 from optimal).

It is optimal for Pr(x) = 1/(2x²) and i.i.d. integers.

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8   6   3   59   7
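A small sketch of the g-code, checked against the quiz above (function names are illustrative; x = 0 is not representable):

#include <iostream>
#include <string>
#include <vector>

// g(x): Length-1 zeros followed by x written in binary (which starts with 1).
std::string gamma_encode(const std::vector<unsigned>& xs) {
    std::string out;
    for (unsigned x : xs) {
        std::string bin;
        for (unsigned v = x; v > 0; v >>= 1) bin = char('0' + (v & 1)) + bin;
        out += std::string(bin.size() - 1, '0') + bin;
    }
    return out;
}

std::vector<unsigned> gamma_decode(const std::string& bits) {
    std::vector<unsigned> xs;
    size_t i = 0;
    while (i < bits.size()) {
        size_t zeros = 0;
        while (bits[i + zeros] == '0') ++zeros;           // Length-1 leading zeros
        unsigned x = 0;
        for (size_t j = 0; j <= zeros; ++j) x = (x << 1) | unsigned(bits[i + zeros + j] - '0');
        xs.push_back(x);
        i += 2 * zeros + 1;                               // |g(x)| = 2*Length - 1 bits
    }
    return xs;
}

int main() {
    for (unsigned x : gamma_decode("0001000001100110000011101100111"))
        std::cout << x << " ";                            // prints: 8 6 3 59 7
}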

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How good is this approach w.r.t. Huffman?  Compression ratio ≤ 2 * H0(S) + 1
Key fact:  1 ≥ Σi=1,...,x pi ≥ x * px   ⇒   x ≤ 1/px

How good is it?
Encode the integers via the g-code: |g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):

   Σi=1,...,|S| pi * |g(i)|   ≤   Σi=1,...,|S| pi * [2 * log(1/pi) + 1]   =   2 * H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 128² = 16512 words within 2 bytes.
A (230,26)-dense code encodes 230 + 230*26 = 6210 words within 2 bytes, hence more words on just 1 byte: a win if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC is quite interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L
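A tiny sketch of this loop (mtf_encode and the symbol list in main are illustrative; positions are 1-based as in the description above):

#include <iostream>
#include <list>
#include <string>
#include <vector>

// Move-to-Front: output the current position of each symbol in L, then move it to the front.
std::vector<int> mtf_encode(const std::string& s, std::list<char> L) {
    std::vector<int> out;
    for (char c : s) {
        int pos = 1;
        auto it = L.begin();
        while (*it != c) { ++it; ++pos; }    // position of c in L
        out.push_back(pos);
        L.erase(it);
        L.push_front(c);                     // move c to the front of L
    }
    return out;
}

int main() {
    for (int p : mtf_encode("abaac", {'a', 'b', 'c'})) std::cout << p << " ";   // prints: 1 2 2 1 3
}

The naive list scan costs O(|S|) per symbol; the search-tree and hash-table organization described later brings each operation down to O(log |S|).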

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = O(n² log n) bits, while MTF(X) = O(n log n) + n² bits

Not much worse than Huffman
...but it may be far better

MTF: how good is it?
Encode the integers via the g-code: |g(i)| ≤ 2 * log i + 1
Prepend the alphabet S to the sequence (this accounts for the first term below) and consider the cost of encoding, where pix is the position of the i-th occurrence of symbol x:

   O(|S| log |S|)  +  Σx=1,...,|S| Σi=2,...,nx | g( pix - pi-1x ) |

By Jensen's inequality:

   ≤ O(|S| log |S|)  +  Σx=1,...,|S| nx * [ 2 * log(N/nx) + 1 ]
   = O(|S| log |S|)  +  N * [ 2 * H0(X) + 1 ]

Hence La[mtf] ≤ 2 * H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to maintain the MTF-list efficiently:

 Search tree: leaves contain the symbols, ordered as in the MTF-list; internal nodes store the size of their descending subtree.
 Hash table: the key is a symbol, the data is a pointer to the corresponding tree leaf.

Each tree operation takes O(log |S|) time; the total is O(n log |S|), where n = #symbols to be compressed.

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings, just the run lengths and one bit suffice.
Properties: it exploits spatial locality, it is a dynamic code, and there is a memory.

X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = Θ(n² log n) > Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.  p(a) = .2, p(b) = .5, p(c) = .3, giving f(a) = .0, f(b) = .2, f(c) = .7
[Slide figure: the unit interval [0,1) split into a = [0,.2), b = [.2,.7), c = [.7,1).]

f(i) = Σj<i p(j)

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
[Slide figure: coding bac, starting from [0,1); b narrows it to [.2,.7), a to [.2,.3), and c to [.27,.3).]

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c1…cn with probabilities p[c], use the following:

   l0 = 0,   li = li-1 + si-1 * f[ci]
   s0 = 1,   si = si-1 * p[ci]

f[c] is the cumulative probability up to symbol c (not included).
The final interval size is

   sn = Πi=1,...,n p[ci]

The interval for a message sequence will be called the sequence interval.
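A tiny sketch of these recurrences on the message bac of the previous slide (floating point is used only for illustration; real coders use the integer version described later):

#include <iostream>
#include <map>
#include <string>

int main() {
    std::map<char, double> p = {{'a', .2}, {'b', .5}, {'c', .3}};
    std::map<char, double> f = {{'a', .0}, {'b', .2}, {'c', .7}};   // cumulative prob., c excluded
    double l = 0.0, s = 1.0;                                        // l0 = 0, s0 = 1
    for (char c : std::string("bac")) {
        l = l + s * f[c];                                           // li = li-1 + si-1 * f[ci]
        s = s * p[c];                                               // si = si-1 * p[ci]
    }
    std::cout << "[" << l << ", " << l + s << ")\n";                // prints [0.27, 0.3)
}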

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
[Slide figure: .49 lies in b = [.2,.7); within that interval, in b = [.3,.55); within that, in c = [.475,.55).]

The message is bbc.

Representing a real number
Binary fractional representation:
   .75 = .11      1/3 = .010101… (repeating)      11/16 = .1011

Algorithm:
   1. x = 2 * x
   2. if x < 1, output 0
   3. else x = x - 1; output 1

So how about just using the shortest binary fractional number in the sequence interval?
e.g.  [0,.33) → .01      [.33,.66) → .1      [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all their completions:

   number   min      max      interval
   .11      .110…    .111…    [.75, 1.0)
   .101     .1010…   .1011…   [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

[Slide figure: the sequence interval [.61,.79) contains the code interval of .101, namely [.625,.75).]

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

   1 + ⌈log(1/s)⌉ = 1 + ⌈log Πi=1,n (1/pi)⌉
   ≤ 2 + Σi=1,n log(1/pi)
   = 2 + Σk=1,|S| n pk log(1/pk)
   = 2 + n H0   bits

In practice, nH0 + 0.02 n bits, because of rounding.

Integer Arithmetic Coding
The problem is that operations on arbitrary-precision real numbers are expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 (top half):
  output 1 followed by m 0s; set m = 0; the message interval is expanded by 2.

If u < R/2 (bottom half):
  output 0 followed by m 1s; set m = 0; the message interval is expanded by 2.

If l ≥ R/4 and u < 3R/4 (middle half):
  increment m; the message interval is expanded by 2.

In all other cases, just continue...
You find this at

Arithmetic ToolBox
As a state machine
[Slide figure: the ATB takes the current interval (L,s), the next symbol c and the distribution (p1,...,p|S|), and outputs the new interval (L',s').]

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
[Slide figure: PPM feeds the ATB with p[s | context], where s is either the next character c or the escape symbol esc; the interval (L,s) is mapped to (L',s').]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts      String = ACCBACCACBA B      k = 2

Empty context:  A = 4, B = 2, C = 5, $ = 3
Order-1:   A: C = 3, $ = 1     B: A = 2, $ = 1     C: A = 1, B = 2, C = 2, $ = 3
Order-2:   AC: B = 1, C = 2, $ = 2     BA: C = 1, $ = 1     CA: C = 1, $ = 1     CB: A = 2, $ = 1     CC: A = 1, B = 1, $ = 2
You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
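A small sketch of the encoder, with the dictionary pre-loaded as in the encoding example on the next slide (the a = 112, b = 113, c = 114 ids follow the slide's convention; the decoder-side special case is not shown):

#include <iostream>
#include <map>
#include <string>
#include <vector>

// LZW: emit the id of the longest dictionary match S, add S + next-char
// to the dictionary, but do not emit the next char itself.
std::vector<int> lzw_encode(const std::string& T, std::map<std::string, int> dict, int next) {
    std::vector<int> out;
    std::string S;
    for (char c : T) {
        if (dict.count(S + c)) { S += c; continue; }   // extend the current match
        out.push_back(dict[S]);                        // emit the id of S
        dict[S + c] = next++;                          // add Sc to the dictionary
        S = std::string(1, c);
    }
    if (!S.empty()) out.push_back(dict[S]);            // flush the last match
    return out;
}

int main() {
    std::map<std::string, int> dict = {{"a", 112}, {"b", 113}, {"c", 114}};
    for (int id : lzw_encode("aabaacababacb", dict, 256)) std::cout << id << " ";
    // prints: 112 112 113 256 114 257 261 114 113 (the final 113 is the flush of the last b)
}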

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
We are given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
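A runnable sketch of this inversion, assuming the text ends with a unique smallest sentinel '#': row 0 of the sorted matrix then starts with '#', and LF is computed by counting, as in the pseudocode above.

#include <iostream>
#include <string>
#include <vector>

std::string invert_bwt(const std::string& L) {
    int n = (int)L.size();
    std::vector<int> cnt(256, 0), C(256, 0), LF(n), seen(256, 0);
    for (unsigned char c : L) cnt[c]++;
    for (int c = 1; c < 256; ++c) C[c] = C[c - 1] + cnt[c - 1];   // #chars of L smaller than c
    for (int i = 0; i < n; ++i) {
        unsigned char c = L[i];
        LF[i] = C[c] + seen[c]++;               // equal chars keep their relative order (property 1)
    }
    std::string S(n, ' ');
    int r = 0;                                  // row 0 starts with the sentinel '#'
    for (int i = n - 1; i >= 0; --i) {          // L[r] precedes F[r] in T (property 2): walk backwards
        S[i] = L[r];
        r = LF[r];
    }
    return S.substr(1) + S[0];                  // S comes out as '#' + text; move the sentinel to the end
}

int main() {
    std::cout << invert_bwt("ipssm#pissii") << "\n";   // prints mississippi#
}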

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Q(n2 log n) time in the worst-case
• Q(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion pages are available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node one can reach any other node via an undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node one can reach any other node via a directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by humans



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…

 Physical network graph: V = routers, E = communication links
 The "cosine" graph (undirected, weighted): V = static web pages, E = semantic distance between pages
 Query-Log graph (bipartite, weighted): V = queries and URLs, E = (q,u) if u is a result for q and has been clicked by some user who issued q
 Social graph (undirected, unweighted): V = users, E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution

Altavista crawl (1999) and WebBase crawl (2001): the indegree follows a power-law distribution

   Pr[in-degree(u) = k] ∝ 1/k^a,  with a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/x^a, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
[Slide figure: the adjacency matrix plotted with source i on one axis and destination j on the other; 21 million pages, 150 million links.]

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:
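A tiny sketch of this gap encoding; the successor list in main is made up for illustration, and the mapping of the possibly-negative first entry to 2v (if v ≥ 0) or 2|v|-1 (if v < 0) is the one used in the numeric examples of the extra-nodes slide below:

#include <iostream>
#include <vector>

// Gap-encode the successor list of node x: first gap s1 - x (may be negative),
// then si - si-1 - 1 >= 0 for the remaining, sorted successors.
std::vector<unsigned> encode_gaps(unsigned x, const std::vector<unsigned>& succ) {
    std::vector<unsigned> out;
    if (succ.empty()) return out;
    long long v = (long long)succ[0] - (long long)x;
    out.push_back((unsigned)(v >= 0 ? 2 * v : 2 * (-v) - 1));     // fold negatives into non-negatives
    for (size_t i = 1; i < succ.size(); ++i)
        out.push_back(succ[i] - succ[i - 1] - 1);
    return out;
}

int main() {
    for (unsigned g : encode_gaps(15, {13, 15, 16, 17, 18})) std::cout << g << " ";
    // prints: 3 1 0 0 0   (first gap 13-15 = -2 is mapped to 2*2-1 = 3)
}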

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77 scheme provides an efficient, optimal solution




fknown is the "previously encoded text": compress the concatenation fknown · fnew, starting the parsing from fnew

zdelta is one of the best implementations
              Emacs size    Emacs time
uncompr       27Mb          ---
gzip          8Mb           35 secs
zdelta        1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

[Slide figure: Client and Proxy each hold a reference page; requests cross the slow link delta-encoded, while the Proxy fetches pages from the web over the fast link.]

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Slide figure: a small weighted graph over the files, plus a dummy node; edge weights are zdelta sizes (e.g. 20, 123, 220, 620, 2000) and the min branching picks the cheapest reference for each file.]

              space    time
uncompr       30Mb     ---
tgz           20%      linear
THIS          8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
              space    time
uncompr       260Mb    ---
tgz           12%      2 mins
THIS          8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
[Slide figure: the Client holds f_old and requests an update; the Server holds f_new.]

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
[Slide figure: the Client sends block hashes of f_old; the Server replies with the encoded file.]

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

           gcc size    emacs size
total      27288       27326
gzip       7563        8577
zdelta     227         1431
rsync      964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The Server sends the hashes (unlike the client in rsync), and the client checks them.
The Server deploys the common fref to compress the new ftar (rsync just compresses it).

A multi-round protocol

k blocks of n/k elements each; log(n/k) levels.
If the distance is k, then on each level at most k hashes fail to find a match in the other file.
The communication complexity is O(k log n log(n/k)) bits.

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Slide figure: the suffix tree of T# = mississippi#, with edge labels such as #, i, i#, mississippi#, p, pi#, ppi#, s, si, ssi, and leaves labelled with the starting positions of the corresponding suffixes (1..12).]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
5

Q(N2) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Q(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
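A small sketch of this indirect binary search, with std::lower_bound/upper_bound standing in for the hand-rolled search; SA is the 1-based suffix array of the slide:

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// Each comparison is an O(p) compare between P and the first p chars of a suffix.
std::pair<int, int> sa_search(const std::string& T, const std::vector<int>& SA,
                              const std::string& P) {
    auto lo = std::lower_bound(SA.begin(), SA.end(), P,
        [&](int pos, const std::string& pat) { return T.compare(pos - 1, pat.size(), pat) < 0; });
    auto hi = std::upper_bound(SA.begin(), SA.end(), P,
        [&](const std::string& pat, int pos) { return T.compare(pos - 1, pat.size(), pat) > 0; });
    return { int(lo - SA.begin()), int(hi - SA.begin()) };   // SA[lo..hi-1]: suffixes prefixed by P
}

int main() {
    std::string T = "mississippi#";
    std::vector<int> SA = {12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3};   // from the slide
    auto range = sa_search(T, SA, "si");
    for (int r = range.first; r < range.second; ++r) std::cout << SA[r] << " ";   // prints: 7 4
}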

Locating the occurrences
[Slide figure: the SA of T = mississippi#; binary searches for si# and si$ delimit the rows whose suffixes start with si, giving occ = 2 occurrences, at positions 7 and 4.]

Suffix Array search
• O (p + log2 N + occ) time

(# < any symbol of S < $)
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp = [0, 0, 1, 4, 0, 0, 1, 0, 2, 1, 3]      SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]      T = mississippi#
[Slide figure: e.g. the suffixes issippi… and ississippi…, adjacent in SA, share a longest common prefix of length 4.]
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does it exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does it exist a substring of length ≥ L occurring ≥ C times ?
• Search for Lcp[i,i+C-2] whose entries are ≥ L


Slide 35

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with i-th bit of U(T[j]) to estabilish if both are true

An example j=1
1 2 3 4 5 6 7 8 9 10

n
m

1

12345

1

0

P=abaac

2

0

3

0

4

0

5

0

T=xabxabaaca

0
 
0
U (x)  0
 
0
 
0

2

3

4

5

6

7

1 
 
0
BitShift ( M ( 0 )) & U (T [1])   0  &
 
0
 
0

8

9

0 0
   
0 0
0  0
   
0 0
   
0 0

1
0

An example j=2
1 2 3 4 5 6 7 8 9 10

n
m

1

2

12345

1

0

1

P=abaac

2

0

0

3

0

0

4

0

0

5

0

0

T=xabxabaaca

1 
 
0
U (a )   1 
 
1 
 
0

3

4

5

6

7

1 
 
0
BitShift ( M (1)) & U (T [ 2 ])   0  &
 
0
 
0

8

9

1  1 
   
0 0
1    0 
   
1   0 
   
0 0

1
0

An example j=3
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

12345

1

0

1

0

P=abaac

2

0

0

1

3

0

0

0

4

0

0

0

5

0

0

0

T=xabxabaaca

0
 
1 
U (b )   0 
 
0
 
0

4

5

6

7

1 
 
1 
BitShift ( M ( 2 )) & U (T [ 3 ])   0  &
 
0
 
0

8

9

0 0
   
1  1 
0  0
   
0 0
   
0 0

1
0

An example j=9
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

4

5

6

7

8

9

12345

1

0

1

0

0

1

0

1

1

0

P=abaac

2

0

0

1

0

0

1

0

0

0

3

0

0

0

0

0

0

1

0

0

4

0

0

0

0

0

0

0

1

0

5

0

0

0

0

0

0

0

0

1

T=xabxabaaca

0
 
0
U (c )   0 
 
0
 
1 

1 
 
1 
BitShift ( M ( 8 )) & U (T [ 9 ])   0  &
 
0
 
1 

0 0
   
0 0
0  0
   
0 0
   
1  1 

1
0

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: Another solution
Dictionary = { bzip, not, or, (space) }

P = bzip = 1a 0b

[Figure: the word-based Huffman dictionary with tagged, byte-aligned codewords; the codeword of P is searched directly in the compressed text C(S) for S = "bzip or not bzip", and every byte-aligned alignment is marked yes/no.]

Speed ≈ Compression ratio

Problem 2
Dictionary = { bzip, not, or, (space) }

Given a pattern P, find all the occurrences in S of all terms containing P as a substring.

P = o
S = "bzip or not bzip"
not = 1g 0g 0a
or  = 1g 0a 0b

[Figure: the same compressed dictionary/text as before; every dictionary term containing P is located, and its codeword is then searched in C(S).]

Speed ≈ Compression ratio? No! Why?
A scan of C(S) is needed for each term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: a text T (A B C A C A B D D A B A) with occurrences of P1 and P2 highlighted.]
 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

S is the concatenation of the patterns in P.
R is a bitmap of length m:
  R[i] = 1 iff S[i] is the first symbol of a pattern.

Use a variant of the Shift-And method searching for S:
  For any symbol c, U'(c) = U(c) AND R
    U'(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
  For any step j,
    compute M(j)
    then OR it with U'(T[j]). Why?
      It sets to 1 the first bit of each pattern that starts with T[j]
    Check if there are occurrences ending in j. How? (a sketch follows below)
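A possible bit-parallel sketch of this variant, under the same m ≤ 64 assumption (the names multi_shift_and, R and E are illustrative; E marks the last symbol of each pattern and answers the "occurrences ending in j" question):

#include <stdint.h>
#include <stdio.h>

void multi_shift_and(const char *T, const char **pat, int l) {
    uint64_t U[256] = {0}, R = 0, E = 0;
    int m = 0;
    for (int p = 0; p < l; p++) {
        R |= 1ULL << m;                                   /* first symbol of a pattern */
        for (int i = 0; pat[p][i]; i++, m++)
            U[(unsigned char)pat[p][i]] |= 1ULL << m;
        E |= 1ULL << (m - 1);                             /* last symbol of a pattern  */
    }
    uint64_t M = 0;
    for (int j = 0; T[j]; j++) {
        unsigned char c = (unsigned char)T[j];
        M = (((M << 1) | 1ULL) & U[c]) | (U[c] & R);      /* OR with U'(c) = U(c) AND R */
        if (M & E) printf("some pattern ends at position %d\n", j + 1);
    }
}

int main(void) {
    const char *pats[] = { "aba", "ca" };
    multi_shift_and("xabxabaaca", pats, 2);               /* "aba" ends at 7, "ca" at 10 */
    return 0;
}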

Problem 3
Dictionary = { bzip, not, or, (space) }

Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing at most k mismatches.

P = bot, k = 2
S = "bzip or not bzip"

[Figure: the same compressed dictionary/text as before.]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal: given k, find all the occurrences of P in T with up to k mismatches.
We define the matrix M^l to be an m by n binary matrix such that:

M^l(i,j) = 1 iff there are no more than l mismatches between the first i characters of P and the i characters of T ending at character j.

What is M^0?
How does M^k solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

BitShift( M^l(j-1) ) & U(T[j])

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

BitShift( M^(l-1)(j-1) )

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M^l(j) = [ BitShift( M^l(j-1) ) & U(T[j]) ]  OR  BitShift( M^(l-1)(j-1) )
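A hedged C sketch of this recurrence (again assuming m ≤ 64 and k < 64, one word per matrix M^l, l = 0..k; names are illustrative):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

void agrep_mismatch(const char *T, const char *P, int k) {
    int n = strlen(T), m = strlen(P);
    uint64_t U[256] = {0};
    for (int i = 0; i < m; i++)
        U[(unsigned char)P[i]] |= 1ULL << i;
    uint64_t Ml[64] = {0}, last = 1ULL << (m - 1);        /* Ml[l] = current column of M^l */
    for (int j = 0; j < n; j++) {
        unsigned char c = (unsigned char)T[j];
        uint64_t prev = 0;                     /* BitShift(M^(l-1)(j-1)) term; empty for l=0 */
        for (int l = 0; l <= k; l++) {
            uint64_t old = Ml[l];              /* M^l(j-1)                                   */
            Ml[l] = (((old << 1) | 1ULL) & U[c]) | prev;
            prev = (old << 1) | 1ULL;          /* becomes the case-2 term for level l+1      */
            if (Ml[l] & last)
                printf("match with <= %d mismatches ending at %d\n", l, j + 1);
        }
    }
}

int main(void) {
    agrep_mismatch("xabxabaaca", "abaad", 1);  /* reports a 1-mismatch match ending at 9 */
    return 0;
}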

Example M1
T = xabxabaaca,  P = abaad,  l = 1 mismatch

M1 =
        x  a  b  x  a  b  a  a  c  a
   j =  1  2  3  4  5  6  7  8  9  10
 i=1    1  1  1  1  1  1  1  1  1  1
 i=2    0  0  1  0  0  1  0  1  1  0
 i=3    0  0  0  1  0  0  1  0  0  1
 i=4    0  0  0  0  1  0  0  1  0  0
 i=5    0  0  0  0  0  0  0  0  1  0

M0 =
 i=1    0  1  0  0  1  0  1  1  0  1
 i=2    0  0  1  0  0  1  0  0  0  0
 i=3    0  0  0  0  0  0  1  0  0  0
 i=4    0  0  0  0  0  0  0  1  0  0
 i=5    0  0  0  0  0  0  0  0  0  0

M1(5,9) = 1: P occurs ending at position 9 of T with at most one mismatch.
How much do we pay?





The running time is O(kn(1+m/w)).
Again, the method is practically efficient for small m.
Still, only O(k) columns (those of M^0, ..., M^k) are needed at any given time. Hence, for m ≤ w, the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary = { bzip, not, or, (space) }

Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing k mismatches.

P = bot, k = 2
S = "bzip or not bzip"
not = 1g 0g 0a

[Figure: the same compressed dictionary/text as before; the matching term is located and its codeword is searched in C(S).]

Agrep: more sophisticated operations


The Shift-And method can solve other ops.

The edit distance between two strings p and s is d(p,s) = the minimum number of operations needed to transform p into s via three ops:
  Insertion: insert a symbol in p
  Deletion: delete a symbol from p
  Substitution: change a symbol of p into a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding
γ(x) = 000...0 followed by x in binary   (Length-1 zeros, where x > 0 and Length = ⌊log2 x⌋ + 1)

e.g., 9 is represented as <000,1001>.

γ-code for x takes 2⌊log2 x⌋ + 1 bits   (i.e. a factor of 2 from optimal)

Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…

Given the following sequence of γ-coded integers, reconstruct the original sequence:

  0001000001100110000011101100111

Answer: 8, 6, 3, 59, 7
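A small C sketch of γ-encoding/decoding, writing codewords as strings of '0'/'1' characters for clarity (a real implementation would pack bits into words; function names are illustrative):

#include <stdio.h>
#include <string.h>

void gamma_encode(unsigned x, char *out) {            /* appends gamma(x) to out    */
    char bin[40]; int len = 0;
    for (unsigned y = x; y > 0; y >>= 1) len++;       /* len = floor(log2 x) + 1    */
    for (int i = len - 1; i >= 0; i--) bin[len - 1 - i] = ((x >> i) & 1) ? '1' : '0';
    bin[len] = '\0';
    for (int i = 0; i < len - 1; i++) strcat(out, "0");   /* len-1 zeros ...         */
    strcat(out, bin);                                      /* ... then x in binary    */
}

unsigned gamma_decode(const char **s) {               /* consumes one codeword      */
    int zeros = 0;
    while (**s == '0') { zeros++; (*s)++; }
    unsigned x = 0;
    for (int i = 0; i <= zeros; i++) { x = (x << 1) | (unsigned)(**s - '0'); (*s)++; }
    return x;
}

int main(void) {
    char buf[256] = "";
    unsigned in[] = {8, 6, 3, 59, 7};
    for (int i = 0; i < 5; i++) gamma_encode(in[i], buf);
    printf("%s\n", buf);                              /* 0001000001100110000011101100111 */
    const char *p = buf;
    while (*p) printf("%u ", gamma_decode(&p));       /* 8 6 3 59 7 */
    printf("\n");
    return 0;
}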

Analysis
Sort the p_i in decreasing order, and encode s_i via the variable-length code γ(i).

Recall that: |γ(i)| ≤ 2·log i + 1
How good is this approach with respect to Huffman?
Compression ratio ≤ 2·H0(s) + 1
Key fact:   1 ≥ Σ_{i=1..x} p_i ≥ x·p_x   ⟹   x ≤ 1/p_x

How good is it ?
Encode the integers via γ-coding:  |γ(i)| ≤ 2·log i + 1
The cost of the encoding is (recall that i ≤ 1/p_i):

  Σ_{i=1..S} p_i · |γ(i)|  ≤  Σ_{i=1..S} p_i · [ 2·log(1/p_i) + 1 ]  =  2·H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ...

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s·c with 2 bytes, s·c² with 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 128² = 16512 words on at most 2 bytes
(230,26)-dense code encodes 230 + 230·26 = 6210 words on at most 2 bytes, hence more words on just 1 byte; thus, if the distribution is skewed, it compresses better...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:
  It exploits temporal locality, and it is dynamic
  X = 1^n 2^n 3^n … n^n  ⟹  Huff = O(n² log n), MTF = O(n log n) + n²
Not much worse than Huffman... but it may be far better
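A simple C sketch of the two MTF steps above over a byte alphabet (the linear-time list scan is only for illustration; the tree/hash organization discussed later makes each step logarithmic):

#include <stdio.h>
#include <string.h>

void mtf_encode(const unsigned char *s, int n, int *out) {
    unsigned char L[256];
    for (int i = 0; i < 256; i++) L[i] = (unsigned char)i;   /* L = [0,1,2,...]   */
    for (int j = 0; j < n; j++) {
        int pos = 0;
        while (L[pos] != s[j]) pos++;                         /* 1) output position */
        out[j] = pos;
        for (int i = pos; i > 0; i--) L[i] = L[i - 1];        /* 2) move to front   */
        L[0] = s[j];
    }
}

int main(void) {
    const char *s = "aaabbbccc";
    int out[16];
    mtf_encode((const unsigned char *)s, (int)strlen(s), out);
    for (size_t j = 0; j < strlen(s); j++) printf("%d ", out[j]);
    printf("\n");   /* 97 0 0 98 0 0 99 0 0 : runs of equal symbols become runs of 0 */
    return 0;
}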

MTF: how good is it ?
Encode the integers via γ-coding:  |γ(i)| ≤ 2·log i + 1
Put the alphabet S in front and consider the cost of encoding (p_x,i = position of the i-th occurrence of symbol x, n_x = #occurrences of x, N = text length):

  O(S log S) + Σ_{x=1..S} Σ_{i=2..n_x} |γ( p_x,i − p_x,i−1 )|

By Jensen's inequality:

  ≤ O(S log S) + Σ_{x=1..S} n_x · [ 2·log(N/n_x) + 1 ]
  = O(S log S) + N · [ 2·H0(X) + 1 ]

Hence  La[mtf] ≤ 2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploits spatial locality, and it is a dynamic code
There is a memory
X = 1^n 2^n 3^n … n^n  ⟹  Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)
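A one-function C sketch of this transformation, reproducing the abbbaacccca example above:

#include <stdio.h>

void rle(const char *s) {
    for (int i = 0; s[i]; ) {
        int j = i;
        while (s[j] == s[i]) j++;          /* extend the current run */
        printf("(%c,%d)", s[i], j - i);
        i = j;
    }
    printf("\n");
}

int main(void) {
    rle("abbbaacccca");    /* prints (a,1)(b,3)(a,2)(c,4)(a,1) */
    return 0;
}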

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.   p(a) = .2,  p(b) = .5,  p(c) = .3

       a → [0, .2)      b → [.2, .7)      c → [.7, 1.0)

       f(a) = .0,  f(b) = .2,  f(c) = .7        where   f(i) = Σ_{j<i} p(j)

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
Start from [0, 1.0):
   after b  →  [.2, .7)     (size .5)
   after a  →  [.2, .3)     (size .1)
   after c  →  [.27, .3)    (size .03)

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l_0 = 0        l_i = l_{i-1} + s_{i-1} · f[c_i]
s_0 = 1        s_i = s_{i-1} · p[c_i]

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is   s_n = Π_{i=1..n} p[c_i]

The interval for a message sequence will be called the
sequence interval
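A tiny C sketch of these recurrences on the running example (floating point only for illustration; the integer version is discussed later; symbol intervals as above: a=[0,.2), b=[.2,.7), c=[.7,1)):

#include <stdio.h>

int main(void) {
    double p[3] = {0.2, 0.5, 0.3};        /* p[a], p[b], p[c]            */
    double f[3] = {0.0, 0.2, 0.7};        /* cumulative, symbol excluded */
    const char *msg = "bac";
    double l = 0.0, s = 1.0;              /* l_0 = 0, s_0 = 1            */
    for (int i = 0; msg[i]; i++) {
        int c = msg[i] - 'a';
        l = l + s * f[c];                 /* l_i = l_{i-1} + s_{i-1} * f[c_i] */
        s = s * p[c];                     /* s_i = s_{i-1} * p[c_i]           */
        printf("after '%c': [%.4f, %.4f)\n", msg[i], l, l + s);
    }
    /* final sequence interval for "bac" is [0.27, 0.30) */
    return 0;
}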

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
Start from [0, 1.0):
   .49 ∈ [.2, .7)     →  b     (new interval [.2, .7))
   .49 ∈ [.3, .55)    →  b     (new interval [.3, .55))
   .49 ∈ [.475, .55)  →  c

The message is bbc.

Representing a real number
Binary fractional representation:
   .75 = .11        1/3 = .010101...        11/16 = .1011

Algorithm (emit the bits of x ∈ [0,1)):
1.  x = 2*x
2.  if x < 1 output 0
3.  else x = x - 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
e.g.   [0,.33) = .01      [.33,.66) = .1      [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
          min        max        interval
 .11      .110...    .111...    [.75, 1.0)
 .101     .1010...   .1011...   [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

e.g. sequence interval [.61, .79); the code interval of .101 is [.625, .75), which is contained in it.

Can use L + s/2 truncated to 1 + ⌈log (1/s)⌉ bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most
   1 + ⌈log (1/s)⌉  =  1 + ⌈log Π_{i=1..n} (1/p_i)⌉
   ≤  2 + Σ_{i=1..n} log (1/p_i)
   =  2 + Σ_{k=1..|S|} n·p_k·log (1/p_k)
   =  2 + n·H0  bits

In practice it takes ≈ nH0 + 0.02·n bits, because of rounding.

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 (top half):
   Output 1 followed by m 0s;  m = 0;  the message interval is expanded by 2

If u < R/2 (bottom half):
   Output 0 followed by m 1s;  m = 0;  the message interval is expanded by 2

If l ≥ R/4 and u < 3R/4 (middle half):
   Increment m;  the message interval is expanded by 2

In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine

[Figure: the ATB as a state machine. Given the current interval (L,s), the next symbol c and its distribution (p1,....,pS), it outputs the new interval (L',s'):  ATB: (L,s), c → (L',s').]

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
[Figure: at each step the PPM model feeds p[ s | context ], with s = c or esc, to the ATB, which maps the state (L,s) to (L',s').]

Encoder and Decoder must know the protocol for selecting the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
String = ACCBACCACBA, next symbol B,  k = 2

Context: Empty     A = 4    B = 2    C = 5    $ = 3

Context: A         C = 3    $ = 1
Context: B         A = 2    $ = 1
Context: C         A = 1    B = 2    C = 2    $ = 3

Context: AC        B = 1    C = 2    $ = 2
Context: BA        C = 1    $ = 1
Context: CA        C = 1    $ = 1
Context: CB        A = 2    $ = 1
Context: CC        A = 1    B = 1    $ = 2

($ denotes the escape symbol)
You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character
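A naive C sketch of this parsing with a sliding window (a quadratic search for the longest match, only to illustrate the emitted triples; on the example text it produces exactly the five triples listed above; the function name lz77 is illustrative):

#include <stdio.h>
#include <string.h>

void lz77(const char *T, int W) {
    int n = strlen(T), i = 0;
    while (i < n) {
        int best_len = 0, best_d = 0;
        int start = (i - W > 0) ? i - W : 0;
        for (int j = start; j < i; j++) {               /* candidate copy sources     */
            int len = 0;
            /* stop one char early so a "next character" always exists              */
            while (i + len < n - 1 && T[j + len] == T[i + len]) len++;
            if (len > best_len) { best_len = len; best_d = i - j; }
        }
        printf("(%d,%d,%c)\n", best_d, best_len, T[i + best_len]);
        i += best_len + 1;                               /* advance by len + 1         */
    }
}

int main(void) {
    lz77("aacaacabcabaaac", 6);   /* window size 6: (0,0,a) (1,1,c) (3,4,b) (3,3,a) (1,2,c) */
    return 0;
}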

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
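A C sketch of the LZW decoder, including the special SSc case just mentioned: when a received code is not yet in the dictionary, the decoded entry is prev + prev[0]. The dictionary is a plain array of strings, and the codes 112/113/114 for a/b/c follow the slide's toy convention (not ASCII); the input codes are the encoder output of the next slide:

#include <stdio.h>
#include <string.h>

int main(void) {
    int input[] = {112, 112, 113, 256, 114, 257, 261, 114};     /* encoder output  */
    int m = 8;
    static char dict[512][64];
    for (int c = 0; c < 256; c++) { dict[c][0] = (char)c; dict[c][1] = '\0'; }
    strcpy(dict[112], "a"); strcpy(dict[113], "b"); strcpy(dict[114], "c");
    int next = 256;
    char prev[64] = "", cur[64];
    for (int i = 0; i < m; i++) {
        int code = input[i];
        if (code < next) strcpy(cur, dict[code]);
        else { strcpy(cur, prev); strncat(cur, prev, 1); }       /* special SSc case */
        printf("%s", cur);
        if (prev[0]) {                                           /* add prev+cur[0]  */
            strcpy(dict[next], prev);
            strncat(dict[next], cur, 1);
            next++;
        }
        strcpy(prev, cur);
    }
    printf("\n");    /* aabaacababac : a prefix of T = aabaacababacb */
    return 0;
}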

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
We are given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows                          (1994)

F                  L
#  mississipp  i
i  #mississip  p
i  ppi#missis  s
i  ssippi#mis  s
i  ssissippi#  m
m  ississippi  #
p  i#mississi  p
p  pi#mississ  i
s  ippi#missi  s
s  issippi#mi  s
s  sippi#miss  i
s  sissippi#m  i

L is the BWT of the text T.

A famous example
(much longer...)

A useful tool: the L → F mapping

How do we map L's chars onto F's chars?
... we need to distinguish equal chars in F ...

Take two equal chars of L: rotating their rows rightward by one position, they end up in F in the same relative order !!
The BWT is invertible
(the F and L columns are the same as above; the text T is unknown to the decoder)
Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
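A concrete C sketch of InvertBWT, where LF is computed by counting character occurrences and exploiting the fact that equal chars keep their relative order between L and F (assumptions: '#' is the unique and smallest end-marker, so row 0 of the sorted matrix starts with '#'; here the end-marker is written explicitly at the last position and the remaining n-1 characters are filled backward; n ≤ 63 only to keep the arrays small):

#include <stdio.h>
#include <string.h>

void invert_bwt(const char *L, char *T) {          /* T must have room for n+1 chars */
    int n = strlen(L);
    int count[256] = {0}, first[256], seen[256] = {0}, LF[64];
    for (int i = 0; i < n; i++) count[(unsigned char)L[i]]++;
    for (int c = 0, pos = 0; c < 256; c++) { first[c] = pos; pos += count[c]; }
    for (int i = 0; i < n; i++) {                  /* LF[i] = row of F holding L[i]  */
        int c = (unsigned char)L[i];
        LF[i] = first[c] + seen[c]++;
    }
    int r = 0;                                     /* row 0 starts with '#'          */
    T[n - 1] = '#';
    for (int i = n - 2; i >= 0; i--) {             /* reconstruct T backward         */
        T[i] = L[r];
        r = LF[r];
    }
    T[n] = '\0';
}

int main(void) {
    char T[64];
    invert_bwt("ipssm#pissii", T);                 /* L of the example above         */
    printf("%s\n", T);                             /* prints mississippi#            */
    return 0;
}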

How to compute the BWT ?
SA = 12 11 8 5 2 1 10 9 7 4 6 3
The rows of the BWT matrix are the sorted rotations (equivalently, the sorted suffixes), and  L = i p s s m # p i s s i i.
We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in-degree(u) = k ]  ∝  1 / k^a        with  a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/x^a, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides an efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
            Emacs size    Emacs time
uncompr     27Mb          ---
gzip        8Mb           35 secs
zdelta      1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: an example weighted graph GF with the dummy node; edge weights (zdelta/gzip sizes) such as 20, 123, 220, 620, 2000.]

            space    time
uncompr     30Mb     ---
tgz         20%      linear
THIS        8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
            space    time
uncompr     260Mb    ---
tgz         12%      2 mins
THIS        8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
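A C sketch of a weak rolling checksum in the spirit of rsync's (the exact constants and the pairing with a strong hash such as MD5 differ in the real tool; the window size B, the example string and the hash combination are illustrative): the pair (a,b) of a window of size B is updated in O(1) when the window slides by one character.

#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void) {
    const unsigned char *X = (const unsigned char *)"the quick brown fox";
    int n = (int)strlen((const char *)X), B = 4;
    uint32_t a = 0, b = 0;
    for (int i = 0; i < B; i++) {            /* checksum of the first window   */
        a += X[i];
        b += (uint32_t)(B - i) * X[i];
    }
    printf("window 0: a=%u b=%u hash=%u\n", a, b, (b << 16) | (a & 0xffff));
    for (int k = 1; k + B <= n; k++) {       /* slide the window by one        */
        a = a - X[k - 1] + X[k + B - 1];     /* drop the old char, add the new */
        b = b - (uint32_t)B * X[k - 1] + a;
        printf("window %d: a=%u b=%u hash=%u\n", k, a, b, (b << 16) | (a & 0xffff));
    }
    return 0;
}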

Rsync: some experiments

          gcc size    emacs size
total     27288       27326
gzip      7563        8577
zdelta    227         1431
rsync     964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

log(n/k) levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Figure: the suffix tree of T# = mississippi#, with edge labels such as #, i, i#, p, pi#, ppi#, s, si, ssi, mississippi#, and leaves storing the starting positions 1..12 of the corresponding suffixes.]

T# = mississippi#

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Storing SUF(T) explicitly takes Θ(N²) space; the Suffix Array stores only the starting positions:

SA      SUF(T)
12      #
11      i#
 8      ippi#
 5      issippi#
 2      ississippi#
 1      mississippi#
10      pi#
 9      ppi#
 7      sippi#
 4      sissippi#
 6      ssippi#
 3      ssissippi#

T = mississippi#        P = si

Suffix Array space:
• SA: Θ(N log2 N) bits
• Text T: N chars
⟹ In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison.

SA = 12 11 8 5 2 1 10 9 7 4 6 3,   T = mississippi#,   P = si

Each binary-search step accesses SA[mid] and the suffix T[SA[mid], N] (2 accesses per step) and compares it with P, which tells whether P is larger or smaller than that suffix.

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
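A C sketch of this indirect binary search, using strncmp for the O(p) suffix comparison (the SA values are the slide's, shifted to 0-based; this version stops at one occurrence, while reporting all occ of them requires binary-searching the two boundaries of the range, as discussed next):

#include <stdio.h>
#include <string.h>

int sa_lookup(const char *T, const int *SA, int n, const char *P) {
    int p = (int)strlen(P), lo = 0, hi = n - 1;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        int cmp = strncmp(P, T + SA[mid], p);     /* compare P with the suffix  */
        if (cmp == 0) return SA[mid];             /* P prefixes this suffix     */
        if (cmp < 0) hi = mid - 1; else lo = mid + 1;
    }
    return -1;                                    /* no occurrence              */
}

int main(void) {
    const char *T = "mississippi#";
    int SA[12] = {11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2};   /* 0-based */
    int pos = sa_lookup(T, SA, 12, "si");
    printf("one occurrence of \"si\" starts at position %d (0-based)\n", pos);
    return 0;
}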

Locating the occurrences
The occ = 2 occurrences of P = si in T = mississippi# form a contiguous range of SA: binary-searching the two boundaries of the range (conceptually, the positions of si# and si$, using the convention that # is smaller and $ larger than every character of S) returns the SA entries 7 and 4.

Suffix Array search:
• O(p + log2 N + occ) time
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp = 0 0 1 4 0 0 1 0 2 1 3        SA = 12 11 8 5 2 1 10 9 7 4 6 3        T = mississippi#
(e.g. the entry 4 is the lcp of the adjacent suffixes issippi# and ississippi#)

• How long is the common prefix between T[i,...] and T[j,...] ?
  • Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  • Search for an entry Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  • Search for a run Lcp[i,i+C-2] whose entries are all ≥ L


i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with i-th bit of U(T[j]) to estabilish if both are true

An example j=1
1 2 3 4 5 6 7 8 9 10

n
m

1

12345

1

0

P=abaac

2

0

3

0

4

0

5

0

T=xabxabaaca

0
 
0
U (x)  0
 
0
 
0

2

3

4

5

6

7

1 
 
0
BitShift ( M ( 0 )) & U (T [1])   0  &
 
0
 
0

8

9

0 0
   
0 0
0  0
   
0 0
   
0 0

1
0

An example j=2
1 2 3 4 5 6 7 8 9 10

n
m

1

2

12345

1

0

1

P=abaac

2

0

0

3

0

0

4

0

0

5

0

0

T=xabxabaaca

1 
 
0
U (a )   1 
 
1 
 
0

3

4

5

6

7

1 
 
0
BitShift ( M (1)) & U (T [ 2 ])   0  &
 
0
 
0

8

9

1  1 
   
0 0
1    0 
   
1   0 
   
0 0

1
0

An example j=3
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

12345

1

0

1

0

P=abaac

2

0

0

1

3

0

0

0

4

0

0

0

5

0

0

0

T=xabxabaaca

0
 
1 
U (b )   0 
 
0
 
0

4

5

6

7

1 
 
1 
BitShift ( M ( 2 )) & U (T [ 3 ])   0  &
 
0
 
0

8

9

0 0
   
1  1 
0  0
   
0 0
   
0 0

1
0

An example j=9
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

4

5

6

7

8

9

12345

1

0

1

0

0

1

0

1

1

0

P=abaac

2

0

0

1

0

0

1

0

0

0

3

0

0

0

0

0

0

1

0

0

4

0

0

0

0

0

0

0

1

0

5

0

0

0

0

0

0

0

0

1

T=xabxabaaca

0
 
0
U (c )   0 
 
0
 
1 

1 
 
1 
BitShift ( M ( 8 )) & U (T [ 9 ])   0  &
 
0
 
1 

0 0
   
0 0
0  0
   
0 0
   
1  1 

1
0

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2

Given a pattern P, find all the occurrences in S of all terms containing P as a substring.

Dictionary = { bzip, not, or, … }       P = o       S = “bzip or not bzip”

[Figure: the compressed text C(S) and the codewords of the dictionary terms containing P:  not = 1g 0g 0a,  or = 1g 0a 0b; their matches in C(S) are marked yes.]

Speed ≈ Compression ratio? No! Why?
A scan of C(S) is needed for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: the text T, with occurrences of patterns P1 and P2 marked.]
 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

 S is the concatenation of the patterns in P
 R is a bitmap of length m:  R[i] = 1 iff S[i] is the first symbol of a pattern

 Use a variant of the Shift-And method searching for S:
  For any symbol c, U’(c) = U(c) AND R
   U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
  For any step j,
   compute M(j)
   then M(j) OR U’(T[j]). Why?
    It sets to 1 the first bit of each pattern that starts with T[j]
   Check if there are occurrences ending in j. How?

(A minimal code sketch follows.)

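A hedged sketch of the multi-pattern variant just described (illustrative names; this is my reading of the slide, not its code): S is the concatenation of the patterns, U’(c) = U(c) AND R, and after each step we OR in U’(T[j]); an occurrence of a pattern is reported when the bit of its last position in S is set.

def multi_shift_and(T, patterns):
    S = "".join(patterns)
    U, R, last, pos = {}, 0, [], 0
    for P in patterns:
        R |= 1 << pos                        # R marks the first symbol of each pattern
        last.append((P, pos + len(P) - 1))   # bit of the last symbol of each pattern
        for c in P:
            U[c] = U.get(c, 0) | (1 << pos)
            pos += 1
    Uprime = {c: U[c] & R for c in U}        # U'(c) = U(c) AND R
    occ, M = [], 0
    for j, c in enumerate(T, 1):
        M = ((M << 1) & U.get(c, 0)) | Uprime.get(c, 0)   # compute M(j), then OR U'(T[j])
        for P, b in last:
            if M & (1 << b):                 # an occurrence of P ends at position j
                occ.append((P, j - len(P) + 1))
    return occ

print(multi_shift_and("abcacabd", ["ca", "ab"]))   # [('ab', 1), ('ca', 3), ('ca', 5), ('ab', 6)]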
Problem 3

Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing at most k mismatches.

Dictionary = { bzip, not, or, … }       S = “bzip or not bzip”       P = bot,  k = 2

[Figure: the compressed text C(S) and the codewords of the dictionary terms, to be matched against P with up to k mismatches.]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep

 Our current goal: given k, find all the occurrences of P in T with up to k mismatches.
 We define the matrix Ml to be an m-by-n binary matrix such that:

  Ml(i,j) = 1 iff there are no more than l mismatches between the first i characters of P and the i characters of T ending at character j.

 What is M0?
 How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

[Figure: the first i-1 characters of P aligned against T ending at j-1, with at most l mismatches (marked *), followed by the matching pair P[i] = T[j].]

  BitShift( Ml(j-1) ) & U( T[j] )

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

[Figure: the first i-1 characters of P aligned against T ending at j-1, with at most l-1 mismatches (marked *); position j is charged as one more mismatch.]

  BitShift( Ml-1(j-1) )

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

  Ml(j) = [ BitShift( Ml(j-1) ) & U( T[j] ) ]  OR  BitShift( Ml-1(j-1) )

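A minimal sketch of this recurrence (illustrative, not the slides’ code): the k+1 columns M0(j), …, Mk(j) are kept as integers and updated with the formula above.

def shift_and_mismatches(T, P, k):
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M = [0] * (k + 1)                        # M[l] = current column of M^l
    occ = []
    for j, c in enumerate(T, 1):
        prev = M[:]                          # the columns M^l(j-1)
        for l in range(k + 1):
            col = ((prev[l] << 1) | 1) & U.get(c, 0)       # case 1: P[i] = T[j]
            if l > 0:
                col |= (prev[l - 1] << 1) | 1              # case 2: spend one mismatch at i
            M[l] = col
        for l in range(k + 1):
            if M[l] & (1 << (m - 1)):        # an occurrence with <= l mismatches ends at j
                occ.append((j - m + 1, l))
                break                        # report the smallest such l only
    return occ

print(shift_and_mismatches("aatatccacaa", "atcgaa", 2))    # [(4, 2)]: the example above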
Example M1
T = xabxabaaca       P = abaad

M1 (rows i = 1..5, columns j = 1..10):
  1 1 1 1 1 1 1 1 1 1
  0 0 1 0 0 1 0 1 1 0
  0 0 0 1 0 0 1 0 0 1
  0 0 0 0 1 0 0 1 0 0
  0 0 0 0 0 0 0 0 1 0

M0 (rows i = 1..5, columns j = 1..10):
  0 1 0 0 1 0 1 1 0 1
  0 0 1 0 0 1 0 0 0 0
  0 0 0 0 0 0 1 0 0 0
  0 0 0 0 0 0 0 1 0 0
  0 0 0 0 0 0 0 0 0 0

How much do we pay?

 The running time is O( k n (1 + m/w) ).
 Again, the method is practically efficient for small m.
 Still, only O(k) columns of M are needed at any given time. Hence, the space used by the algorithm is O(k) memory words.

Problem 3: Solution

Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing k mismatches.

Dictionary = { bzip, not, or, … }       S = “bzip or not bzip”       P = bot,  k = 2

[Figure: the compressed text C(S) and the codewords of the dictionary terms; the terms within k mismatches of P (e.g. not = 1g 0g 0a) are marked yes.]

Agrep: more sophisticated operations

 The Shift-And method can solve other ops

 The edit distance between two strings p and s is d(p,s) = the minimum number of operations needed to transform p into s via three ops:
  Insertion: insert a symbol in p
  Deletion: delete a symbol from p
  Substitution: change a symbol in p with a different one

Example: d(ananas, banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding

  γ(x) = 0 0 … 0 x-in-binary       (Length-1 zeroes, then x in binary)

 x > 0 and Length = ⌊log2 x⌋ + 1
 e.g., 9 is represented as <000,1001>.

 The γ-code for x takes 2⌊log2 x⌋ + 1 bits
 (i.e. a factor of 2 from optimal)

 Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…

 Given the following sequence of γ-coded integers, reconstruct the original sequence:

  0001000001100110000011101100111

 Answer: 8, 6, 3, 59, 7

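A minimal γ-encode/decode sketch (illustrative); it reproduces the exercise above.

def gamma_encode(x: int) -> str:
    assert x > 0
    b = bin(x)[2:]                      # x in binary; Length = len(b) = floor(log2 x) + 1
    return "0" * (len(b) - 1) + b       # (Length-1) zeroes, then x in binary

def gamma_decode_all(bits: str):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":           # count the leading zeroes = Length-1
            z += 1; i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out

print(gamma_encode(9))                                       # 0001001
print(gamma_decode_all("0001000001100110000011101100111"))   # [8, 6, 3, 59, 7]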
Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).

Recall that: |γ(i)| ≤ 2·log i + 1
How good is this approach w.r.t. Huffman?
Compression ratio ≤ 2·H0(s) + 1
Key fact:
  1 ≥ Σi=1,…,x pi ≥ x·px   ⟹   x ≤ 1/px

How good is it ?
Encode the integers via γ-coding:
  |γ(i)| ≤ 2·log i + 1
The cost of the encoding is (recall i ≤ 1/pi):

  Σi=1,…,|S| pi · |γ(i)|   ≤   Σi=1,…,|S| pi · [ 2·log(1/pi) + 1 ]   =   2·H0(X) + 1

Not much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example

 5000 distinct words
 ETDC encodes 128 + 128² = 16512 words using at most 2 bytes
 The (230,26)-dense code encodes 230 + 230·26 = 6210 words using at most 2 bytes, hence more words on 1 byte, and thus it wins if the distribution is skewed...

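A hedged sketch of the byte counts involved (function names and the 0-based rank convention are mine): with s stoppers and c continuers, s words fit in 1 byte, s·c more in 2 bytes, s·c² more in 3 bytes, and so on; it reproduces the counts of the example above.

def sc_dense_length(rank: int, s: int, c: int) -> int:
    # rank = 0,1,2,... of a word by decreasing frequency; s + c = 256
    nbytes, capacity = 1, s
    while rank >= capacity:
        rank -= capacity
        nbytes += 1
        capacity *= c                    # s words on 1 byte, s*c on 2, s*c^2 on 3, ...
    return nbytes

def words_within(nbytes: int, s: int, c: int) -> int:
    return sum(s * c ** i for i in range(nbytes))

print(words_within(2, 128, 128))         # ETDC:            128 + 128*128 = 16512
print(words_within(2, 230, 26))          # (230,26)-dense:  230 + 230*26  =  6210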
Optimal (s,c)-dense codes
Find the optimal s, assuming c = 256 - s.

 Brute-force approach
 Binary search:
  On real distributions, it seems there is one unique minimum

 Ks = max codeword length
 Fsk = cumulative probability of the symbols whose |cw| ≤ k

Experiments: (s,c)-DC is quite interesting…

Search is 6% faster than byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:
 Exploits temporal locality, and it is dynamic
 X = 1^n 2^n 3^n … n^n  ⟹  Huff = O(n² log n) bits, MTF = O(n log n) + n² bits

Not much worse than Huffman
...but it may be far better

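A minimal MTF sketch (illustrative; positions are 0-based here).

def mtf_encode(text, alphabet):
    L, out = list(alphabet), []
    for s in text:
        p = L.index(s)                   # 1) output the position of s in L
        out.append(p)
        L.insert(0, L.pop(p))            # 2) move s to the front of L
    return out

def mtf_decode(codes, alphabet):
    L, out = list(alphabet), []
    for p in codes:
        out.append(L[p])
        L.insert(0, L.pop(p))
    return "".join(out)

codes = mtf_encode("abbbaabbbb", "abcd")
print(codes)                             # [0, 1, 0, 0, 1, 0, 1, 0, 0, 0]: temporal locality -> many 0s
print(mtf_decode(codes, "abcd"))         # abbbaabbbb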
MTF: how good is it ?
Encode the integers via γ-coding:
  |γ(i)| ≤ 2·log i + 1
Put S in front of the sequence and consider the cost of encoding, where px,i denotes the position of the i-th occurrence of symbol x:

  O(|S| log |S|)  +  Σx=1,…,|S| Σi=2,…,nx | γ( px,i - px,i-1 ) |

By Jensen’s inequality:

  ≤  O(|S| log |S|)  +  Σx=1,…,|S| nx · [ 2·log(N/nx) + 1 ]
  =  O(|S| log |S|)  +  N · [ 2·H0(X) + 1 ]

Hence  La[mtf] ≤ 2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and separators) as the symbols to be encoded.
How to maintain the MTF-list efficiently:

 Search tree
  Leaves contain the symbols, ordered as in the MTF-list
  Nodes contain the size of their descending subtree

 Hash Table
  key is a symbol
  data is a pointer to the corresponding tree leaf

Each tree operation takes O(log |S|) time
Total is O(n log |S|), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
  abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just the run lengths and one bit
Properties:
 Exploits spatial locality, and it is a dynamic code
 There is a memory
 X = 1^n 2^n 3^n … n^n  ⟹  Huff(X) = Θ(n² log n) > Rle(X) = Θ(n log n)

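A minimal RLE sketch matching the example above (illustrative).

from itertools import groupby

def rle_encode(s):
    return [(c, len(list(g))) for c, g in groupby(s)]

def rle_decode(pairs):
    return "".join(c * n for c, n in pairs)

pairs = rle_encode("abbbaacccca")
print(pairs)                             # [('a', 1), ('b', 3), ('a', 2), ('c', 4), ('a', 1)]
print(rle_decode(pairs))                 # abbbaacccca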
Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive).

e.g.    a = .2 → [0.0, 0.2)      b = .5 → [0.2, 0.7)      c = .3 → [0.7, 1.0)

  f(i) = Σj=1,…,i-1 p(j)          f(a) = .0,  f(b) = .2,  f(c) = .7

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac

  start:     [0.0, 1.0)
  after b:   [0.2, 0.7)      (b = .5)
  after a:   [0.2, 0.3)      (a = .2)
  after c:   [0.27, 0.3)     (c = .3)

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c1 … cn with probabilities p[c], use the following:

  l0 = 0        li = li-1 + si-1 · f[ci]
  s0 = 1        si = si-1 · p[ci]

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

  sn = Πi=1,…,n p[ci]

The interval for a message sequence will be called the sequence interval

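A minimal floating-point sketch of these formulas, plus decoding by a number in the interval (illustrative only: a real coder uses the integer/scaling version discussed below).

p = {"a": 0.2, "b": 0.5, "c": 0.3}
f = {"a": 0.0, "b": 0.2, "c": 0.7}       # cumulative prob. up to the symbol (not included)

def sequence_interval(msg):
    l, s = 0.0, 1.0                      # l0 = 0, s0 = 1
    for c in msg:
        l, s = l + s * f[c], s * p[c]    # li = li-1 + si-1*f[ci],  si = si-1*p[ci]
    return l, l + s

def decode(x, n):
    out = []
    for _ in range(n):
        for c in p:
            if f[c] <= x < f[c] + p[c]:  # which symbol interval contains x?
                out.append(c)
                x = (x - f[c]) / p[c]    # rescale x to [0,1) and continue
                break
    return "".join(out)

print(sequence_interval("bac"))          # (0.27, 0.3), up to floating-point noise
print(decode(0.49, 3))                   # bbc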
Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:

  .49 ∈ [0.2, 0.7)    → b,  then split [0.2, 0.7)  into a:[0.2,0.3)   b:[0.3,0.55)   c:[0.55,0.7)
  .49 ∈ [0.3, 0.55)   → b,  then split [0.3, 0.55) into a:[0.3,0.35)  b:[0.35,0.475) c:[0.475,0.55)
  .49 ∈ [0.475, 0.55) → c

The message is bbc.

Representing a real number
Binary fractional representation:

  .75 = .11        1/3 = .010101…        11/16 = .1011

Algorithm:
 1. x = 2·x
 2. If x < 1 output 0
 3. else x = x - 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
e.g.   [0,.33) → .01      [.33,.66) → .1      [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions.

  code     min       max       interval
  .11      .110…     .111…     [.75, 1.0)
  .101     .1010…    .1011…    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval is contained in the sequence interval (dyadic number).

  Sequence interval: [.61, .79)        Code interval of .101: [.625, .75)  ⊆  [.61, .79)

Can use L + s/2 truncated to 1 + ⌈log (1/s)⌉ bits

Bound on Arithmetic length

Note that  -log s + 1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most
  1 + ⌈ log (1/s) ⌉
  = 1 + ⌈ log Πi=1,n (1/pi) ⌉
  ≤ 2 + Σi=1,n log (1/pi)
  = 2 + Σk=1,|S| n·pk · log (1/pk)
  = 2 + n·H0   bits

nH0 + 0.02·n bits in practice, because of rounding

Integer Arithmetic Coding
The problem is that operations on arbitrary-precision real numbers are expensive.
Key ideas of the integer version:
 Keep integers in range [0..R) where R = 2k
 Use rounding to generate integer intervals
 Whenever the sequence interval falls into the top, bottom or middle half, expand the interval by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
 Output 1 followed by m 0s
 m = 0
 Message interval is expanded by 2

If u < R/2 then (bottom half)
 Output 0 followed by m 1s
 m = 0
 Message interval is expanded by 2

If l ≥ R/4 and u < 3R/4 then (middle half)
 Increment m
 Message interval is expanded by 2

In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine

[Figure: the ATB as a state machine: given the current interval (L, s) and the next symbol c with distribution (p1,…,p|S|), it outputs the new interval (L’, s’) ⊆ [L, L+s).]

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox

[Figure: PPM feeds the ATB with p[ s | context ], where s = c or esc; the ATB maps the interval (L, s) to (L’, s’).]

Encoder and Decoder must know the protocol for selecting the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
String = ACCBACCACBA B        k = 2

  Context Empty:   A = 4   B = 2   C = 5   $ = 3

  Context A:       C = 3   $ = 1
  Context B:       A = 2   $ = 1
  Context C:       A = 1   B = 2   C = 2   $ = 3

  Context AC:      B = 1   C = 2   $ = 2
  Context BA:      C = 1   $ = 1
  Context CA:      C = 1   $ = 1
  Context CB:      A = 2   $ = 1
  Context CC:      A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

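A minimal LZ77 decoder sketch (illustrative; triples are (d, len, c) as in the slides), including the overlapping-copy case just described.

def lz77_decode(triples):
    out = []
    for d, length, c in triples:
        start = len(out) - d
        for i in range(length):          # char-by-char copy also handles len > d (overlap)
            out.append(out[start + i])
        out.append(c)
    return "".join(out)

# The windowed example above:
print(lz77_decode([(0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), (3, 3, 'a'), (1, 2, 'c')]))
# aacaacabcabaaac

# The overlap example: seen = abcd, next codeword is (2,9,e)
print(lz77_decode([(0, 0, 'a'), (0, 0, 'b'), (0, 0, 'c'), (0, 0, 'd'), (2, 9, 'e')]))
# abcdcdcdcdcdce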
LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use a shorter match so
that the next match is better
Hash table to speed up the search for triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

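A minimal LZ78 encoder sketch (illustrative; the flush of a pending match at the end is one possible convention); it reproduces the coding example that follows.

def lz78_encode(text):
    dictionary = {"": 0}                 # id 0 = empty string
    out, S = [], ""
    for c in text:
        if S + c in dictionary:
            S += c                       # keep extending the longest match S
        else:
            out.append((dictionary[S], c))        # output (id of S, next char)
            dictionary[S + c] = len(dictionary)   # add Sc to the dictionary
            S = ""
    if S:                                # pending match: the dictionary is prefix-closed,
        out.append((dictionary[S[:-1]], S[-1]))   # so encode S as (id of its prefix, last char)
    return out

print(lz78_encode("aabaacabcabcb"))
# [(0,'a'), (1,'b'), (1,'a'), (0,'c'), (2,'c'), (5,'b')]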
LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

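A minimal LZW encoder sketch (illustrative). Following the slides’ toy numbering, the initial dictionary maps a→112, b→113, c→114 (not real ASCII codes); it reproduces the encoding example that follows, plus a final code for the leftover match.

def lzw_encode(text, initial):
    dictionary = dict(initial)           # e.g. {'a': 112, 'b': 113, 'c': 114}
    next_id, out, S = 256, [], ""
    for c in text:
        if S + c in dictionary:
            S += c                       # keep extending the current match
        else:
            out.append(dictionary[S])    # output only the id of S (no extra char)...
            dictionary[S + c] = next_id  # ...but still add Sc to the dictionary
            next_id += 1
            S = c
    if S:
        out.append(dictionary[S])
    return out

print(lzw_encode("aabaacababacb", {"a": 112, "b": 113, "c": 114}))
# [112, 112, 113, 256, 114, 257, 261, 114, 113]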
LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
We are given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: the L → F mapping

[Figure: the sorted-rotations matrix with only the first column F = # i i i i m p p s s s s and the last column L = i p s s m # p i s s i i visible; the middle part is unknown.]

How do we map L’s chars onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible

[Figure: again the sorted-rotations matrix with only F = # i i i i m p p s s s s and L = i p s s m # p i s s i i known.]

Two key properties:
1. The LF-array maps L’s chars to F’s chars
2. L[ i ] precedes F[ i ] in T

Reconstruct T backward:

  T = .... i ppi #

InvertBWT(L)
  Compute LF[0,n-1];
  r = 0; i = n;
  while (i > 0) {
    T[i] = L[r];
    r = LF[r]; i--;
  }

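A minimal sketch of both directions (illustrative): the forward transform naively sorts all rotations, exactly the inefficient approach criticized on the next slide, while the inverse follows the LF-mapping idea of InvertBWT above.

def bwt(T):
    # T is assumed to end with a unique smallest sentinel '#'
    rotations = sorted(T[i:] + T[:i] for i in range(len(T)))
    return "".join(row[-1] for row in rotations)

def inverse_bwt(L):
    n = len(L)
    # stable sort of L gives F: the k-th occurrence of c in L is the k-th occurrence of c in F
    order = sorted(range(n), key=lambda i: (L[i], i))
    LF = [0] * n
    for j, i in enumerate(order):
        LF[i] = j                        # LF maps a position of L to the row starting with that char
    r, out = L.index("#"), []            # the row whose last char is '#' is T itself
    for _ in range(n):
        out.append(L[r])                 # L[r] precedes F[r] in T: collect T backward
        r = LF[r]
    return "".join(reversed(out))

L = bwt("mississippi#")
print(L)                                 # ipssm#pissii
print(inverse_bwt(L))                    # mississippi#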
How to compute the BWT ?

  SA = 12 11 8 5 2 1 10 9 7 4 6 3

[Figure: the BWT matrix, i.e. the 12 rotations of T in sorted order; its last column is L = i p s s m # p i s s i i.]

We said that: L[i] precedes F[i] in T

  L[3] = T[ 7 ]

Given SA and T, we have L[i] = T[ SA[i] - 1 ]

How to construct SA from T ?

Input: T = mississippi#

  SA    suffix
  12    #
  11    i#
   8    ippi#
   5    issippi#
   2    ississippi#
   1    mississippi#
  10    pi#
   9    ppi#
   7    sippi#
   4    sissippi#
   6    ssippi#
   3    ssissippi#

Elegant but inefficient
Obvious inefficiencies:
 • Θ(n² log n) time in the worst-case
 • Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion pages available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by humans

Exploit the structure of the Web for
 Crawl strategies
 Search
 Spam detection
 Discovering communities on the web
 Classification/organization

Predict the evolution of the Web
 Sociological understanding

Many other large graphs…

 Physical network graph
  V = Routers
  E = communication links

 The “cosine” graph (undirected, weighted)
  V = static web pages
  E = semantic distance between pages

 Query-Log graph (bipartite, weighted)
  V = queries and URLs
  E = (q,u) if u is a result for q, and has been clicked by some user who issued q

 Social graph (undirected, unweighted)
  V = users
  E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)
 V = URLs,  E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN & no OUT)
Three key properties:
 Skewed distribution: the probability that a node has x links is 1/xα, α ≈ 2.1

The In-degree distribution

[Figure: in-degree distributions measured on the Altavista crawl (1999) and the WebBase crawl (2001); the indegree follows a power-law distribution.]

  Pr[ in-degree(u) = k ]  ∝  1 / kα ,    α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)
 V = URLs,  E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN, no OUT)

Three key properties:
 Skewed distribution: the probability that a node has x links is 1/xα, α ≈ 2.1
 Locality: usually most of the hyperlinks point to other URLs on the same host (about 80%).
 Similarity: pages close in lexicographic order tend to share many outgoing lists

A Picture of the Web Graph

[Figure: the (i, j) adjacency matrix of a crawl with 21 million pages and 150 million links, after URL-sorting (Berkeley, Stanford).]

URL compression + Delta encoding

The library WebGraph

[Figure: the uncompressed adjacency list vs. the adjacency list with compressed gaps (exploiting locality).]

Successor list  S(x) = { s1 - x,  s2 - s1 - 1,  ...,  sk - sk-1 - 1 }

For negative entries:

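One common convention (stated here as an assumption, not necessarily WebGraph’s exact code) is to map the possibly negative first gap to a non-negative integer with a zig-zag map, consistent with the v*2 / |v|*2-1 rule that shows up in the residual examples further below. A minimal sketch of the gap encoding above:

def to_gaps(x, successors):
    # S(x) = { s1 - x, s2 - s1 - 1, ..., sk - sk-1 - 1 }; only the first gap can be negative
    gaps = [successors[0] - x]
    gaps += [successors[i] - successors[i - 1] - 1 for i in range(1, len(successors))]
    return gaps

def from_gaps(x, gaps):
    succ = [x + gaps[0]]
    for g in gaps[1:]:
        succ.append(succ[-1] + g + 1)
    return succ

def zigzag(v):                           # assumed map for negative entries: v*2 if v >= 0, |v|*2-1 otherwise
    return 2 * v if v >= 0 else 2 * (-v) - 1

succ = [13, 15, 16, 17, 18, 19, 23, 24, 203]     # a made-up successor list for node x = 15
print(to_gaps(15, succ))                 # [-2, 1, 0, 0, 0, 0, 3, 0, 178]
print(from_gaps(15, to_gaps(15, succ)))  # back to the original list
print(zigzag(-2))                        # 3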
Copy-lists

 Reference chains, possibly limited

[Figure: the uncompressed adjacency list vs. the adjacency list with copy lists (exploiting similarity).]

Each bit of y informs whether the corresponding successor of y is also a successor of the reference x;
the reference index is chosen in [0,W] so as to give the best compression.

Copy-blocks = RLE(Copy-list)

[Figure: the adjacency list with copy lists vs. the adjacency list with copy blocks (RLE on bit sequences).]

 The first copy block is 0 if the copy list starts with 0;
 The last block is omitted (we know the length…);
 The length is decremented by one for all blocks

This is a Java and C++ lib  (≈3 bits/edge)

Extra-nodes: Compressing Intervals

[Figure: the adjacency list with copy blocks vs. the final encoding, where consecutive runs in the extra-nodes are turned into intervals.]

 Intervals: use their left extreme and length
  Int. length: decremented by Lmin = 2
 Residuals: differences between residuals, or w.r.t. the source
   0    = (15-15)*2      (positive)
   2    = (23-19)-2      (jump >= 2)
   600  = (316-16)*2
   3    = |13-15|*2-1    (negative)
   3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background

[Figure: a sender and a receiver connected by a network link; the receiver already holds some knowledge about the data.]

 network links are getting faster and faster, but
 many clients are still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques

 caching: “avoid sending the same object again”
  done on the basis of objects
  only works if objects are completely unchanged
  How about objects that are slightly changed?

 compression: “remove redundancy in transmitted data”
  avoid repeated substrings in data
  can be extended to the history of past transmissions (overhead)
  What if the sender has never seen the data at the receiver ?

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization

 Delta compression    [diff, zdelta, REBL,…]
  Compress file f deploying file f’
  Compress a group of files
  Speed up web access by sending differences between the requested page and the ones available in cache

 File synchronization    [rsync, zsync]
  Client updates an old file fold with fnew available on a server
  Mirroring, Shared Crawling, Content Distr. Net

 Set reconciliation
  Client updates a structured old file fold with fnew available on a server
  Update of contacts or appointments, intersect inverted lists in a P2P search engine

Z-delta compression    (one-to-one)

Problem: We have two files fknown and fnew and the goal is to compute a file fd of minimum size such that fnew can be derived from fknown and fd

 Assume that block moves and copies are allowed
 Find an optimal covering set of fnew based on fknown
 The LZ77-scheme provides an efficient, optimal solution
  fknown is the “previously encoded text”; compress fknown fnew starting from fnew
 zdelta is one of the best implementations

            Emacs size    Emacs time
  uncompr   27Mb          ---
  gzip      8Mb           35 secs
  zdelta    1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: a pair of proxies located on each side of the slow link use a proprietary protocol to increase performance over this link

[Figure: Client ↔ (request / delta-encoding over the slow link) ↔ Proxy ↔ (request / page over the fast link) ↔ web; both proxies keep a reference version of the page.]

Use zdelta to reduce traffic:
 Old version available at both proxies
 Restricted to pages already visited (30% hits), URL-prefix match
 Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F
 Useful on a dynamic collection of web pages, back-ups, …

 Apply pairwise zdelta: find for each f ∈ F a good reference
 Reduction to the Min Branching problem on DAGs
  Build a weighted graph GF: nodes = files, weights = zdelta-sizes
  Insert a dummy node connected to all, whose edge weights are the gzip-coding sizes
  Compute the min branching = directed spanning tree of min total cost, covering G’s nodes.

[Figure: an example weighted graph with a dummy node and the min branching highlighted.]

            space    time
  uncompr   30Mb     ---
  tgz       20%      linear
  THIS      8%       quadratic

Improvement    (group of files)

What about many-to-one compression?

Problem: Constructing G is very costly: n² edge calculations (zdelta executions)
 We wish to exploit some pruning approach

 Collection analysis: Cluster the files that appear similar and thus are good candidates for zdelta-compression. Build a sparse weighted graph G’F containing only edges between those pairs of files.
 Assign weights: Estimate appropriate edge weights for G’F, thus saving zdelta executions. Nonetheless, strictly n² time.

            space    time
  uncompr   260Mb    ---
  tgz       12%      2 mins
  THIS      8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem

[Figure: the Client holds f_old, the Server holds f_new; the client sends a request, the server sends back an update.]

 client wants to update an out-dated file
 server has the new file but does not know the old file
 update without sending the entire f_new (using similarity)
 rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch, since the server has both copies of the files

The rsync algorithm

[Figure: the Client (holding f_old) sends block hashes to the Server (holding f_new); the Server replies with the encoded file.]

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

            gcc size    emacs size
  total     27288       27326
  gzip      7563        8577
  zdelta    227         1431
  rsync     964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

 The Server sends the hashes (unlike the client in rsync), the client checks them
 The Server deploys the common fref to compress the new ftar (rsync just compresses it).

A multi-round protocol

 k blocks of n/k elems
 log n/k levels

If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff
P is a prefix of the i-th suffix of T (i.e. T[i,N])

[Figure: P aligned at position i of T, as a prefix of the suffix T[i,N].]

Occurrences of P in T = All suffixes of T having P as a prefix

  P = si     T = mississippi     →  occurrences at 4, 7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree

T# = mississippi#
     1 2 3 4 5 6 7 8 9 10 11 12

[Figure: the suffix tree of T#, with edges labeled by substrings (#, i#, ppi#, ssi, si, pi#, mississippi#, …) and leaves labeled by the starting positions 1..12 of the corresponding suffixes.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

T = mississippi#        (storing SUF(T) explicitly would take Θ(N²) space)

  SA    SUF(T)
  12    #
  11    i#
   8    ippi#
   5    issippi#
   2    ississippi#
   1    mississippi#
  10    pi#
   9    ppi#
   7    sippi#
   4    sissippi#
   6    ssippi#
   3    ssissippi#

P = si  →  its occurrences form a contiguous range of suffix pointers (7, 4)

Suffix Array
 • SA: Θ(N log2 N) bits
 • Text T: N chars
  In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison

  SA = 12 11 8 5 2 1 10 9 7 4 6 3        T = mississippi#        P = si

[Figure: a binary-search step: P is compared with the suffix pointed to by the middle SA entry; here P is larger, so the search continues in the right half. 2 accesses per step.]

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison

  SA = 12 11 8 5 2 1 10 9 7 4 6 3        T = mississippi#        P = si

[Figure: the next binary-search step: now P is smaller than the compared suffix, so the search continues in the left half.]

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]

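A minimal sketch of the indirect binary search (illustrative; the naive SA construction below is the Θ(N² log N) one criticized earlier): two binary searches delimit the contiguous range of suffixes prefixed by P, in O(p log N) character comparisons.

def suffix_array(T):
    return sorted(range(len(T)), key=lambda i: T[i:])     # naive construction, fine for a demo

def sa_search(T, SA, P):
    lo, hi = 0, len(SA)
    while lo < hi:                                         # leftmost suffix whose prefix is >= P
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + len(P)] < P: lo = mid + 1
        else: hi = mid
    left, hi = lo, len(SA)
    while lo < hi:                                         # leftmost suffix whose prefix is > P
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + len(P)] <= P: lo = mid + 1
        else: hi = mid
    return SA[left:lo]                                     # 0-based starting positions

T = "mississippi#"
SA = suffix_array(T)
print([i + 1 for i in SA])                                 # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(sorted(i + 1 for i in sa_search(T, SA, "si")))       # [4, 7]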
Locating the occurrences

  T = mississippi#        P = si        occ = 2

[Figure: in SA, the suffixes prefixed by “si” (sippi…, sissippi…) form a contiguous range, delimited by binary-searching for the boundaries (conceptually si# and si$); the reported occurrences are 7 and 4.]

Suffix Array search
 • O( p + log2 N + occ ) time      (with the sentinels ordered # < S < $)

 Suffix Trays: O( p + log2 |S| + occ )     [Cole et al., ‘06]
 String B-tree                             [Ferragina-Grossi, ’95]
 Self-adjusting Suffix Arrays              [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

  T = mississippi#
  SA  = 12 11 8 5 2 1 10 9 7 4 6 3
  Lcp =   0  1 1 4 0 0  1 0 2 1 3

  (e.g. Lcp = 4 between the adjacent suffixes issippi# and ississippi#)

• How long is the common prefix between T[i,...] and T[j,...] ?
  • Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  • Search for an entry Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  • Search for a subarray Lcp[i,i+C-2] whose entries are all ≥ L


The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with i-th bit of U(T[j]) to estabilish if both are true

An example j=1
1 2 3 4 5 6 7 8 9 10

n
m

1

12345

1

0

P=abaac

2

0

3

0

4

0

5

0

T=xabxabaaca

0
 
0
U (x)  0
 
0
 
0

2

3

4

5

6

7

1 
 
0
BitShift ( M ( 0 )) & U (T [1])   0  &
 
0
 
0

8

9

0 0
   
0 0
0  0
   
0 0
   
0 0

1
0

An example j=2
1 2 3 4 5 6 7 8 9 10

n
m

1

2

12345

1

0

1

P=abaac

2

0

0

3

0

0

4

0

0

5

0

0

T=xabxabaaca

1 
 
0
U (a )   1 
 
1 
 
0

3

4

5

6

7

1 
 
0
BitShift ( M (1)) & U (T [ 2 ])   0  &
 
0
 
0

8

9

1  1 
   
0 0
1    0 
   
1   0 
   
0 0

1
0

An example j=3
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

12345

1

0

1

0

P=abaac

2

0

0

1

3

0

0

0

4

0

0

0

5

0

0

0

T=xabxabaaca

0
 
1 
U (b )   0 
 
0
 
0

4

5

6

7

1 
 
1 
BitShift ( M ( 2 )) & U (T [ 3 ])   0  &
 
0
 
0

8

9

0 0
   
1  1 
0  0
   
0 0
   
0 0

1
0

An example j=9
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

4

5

6

7

8

9

12345

1

0

1

0

0

1

0

1

1

0

P=abaac

2

0

0

1

0

0

1

0

0

0

3

0

0

0

0

0

0

1

0

0

4

0

0

0

0

0

0

0

1

0

5

0

0

0

0

0

0

0

0

1

T=xabxabaaca

0
 
0
U (c )   0 
 
0
 
1 

1 
 
1 
BitShift ( M ( 8 )) & U (T [ 9 ])   0  &
 
0
 
1 

0 0
   
0 0
0  0
   
0 0
   
1  1 

1
0

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem

Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want to find all the occurrences of those patterns in the text T[1,n].

[Figure: a text T with one occurrence of P1 and one of P2 highlighted.]

 Naïve solution
   Use an (optimal) exact-matching algorithm searching each pattern of P separately
   Complexity: O(nl + m) time, not good with many patterns
 Optimal solution due to Aho and Corasick
   Complexity: O(n + l + m) time

A simple extension of Shift-And

 S is the concatenation of the patterns in P.
 R is a bitmap of length m: R[i] = 1 iff S[i] is the first symbol of a pattern.
 Use a variant of the Shift-And method searching for S:
   For any symbol c, U’(c) = U(c) AND R, so U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern.
   For any step j, compute M(j) and then OR it with U’(T[j]). Why? This sets to 1 the first bit of each pattern that starts with T[j].
   Check if there are occurrences ending in j. How? (One possible answer is in the sketch below.)
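
A hedged Python sketch of this multi-pattern variant (m ≤ w assumed); testing the bit of each pattern's last symbol, as done here, is one way to answer the “How?” above.

  def multi_shift_and(T, patterns):
      S = "".join(patterns)                   # concatenation of the patterns
      U, R, last, pos = {}, 0, [], 0
      for P in patterns:
          R |= 1 << pos                       # first symbol of this pattern
          last.append(pos + len(P) - 1)       # last symbol of this pattern
          pos += len(P)
      for i, c in enumerate(S):
          U[c] = U.get(c, 0) | (1 << i)
      M, occs = 0, []
      for j, c in enumerate(T):
          Uc = U.get(c, 0)
          M = ((M << 1) & Uc) | (Uc & R)      # shift-and, then OR with U'(T[j])
          for k, b in enumerate(last):
              if M & (1 << b):                # pattern k ends at position j
                  occs.append((k, j - len(patterns[k]) + 2))
      return occs

  print(multi_shift_and("xabxabaaca", ["aba", "ca"]))   # -> [(0, 5), (1, 9)]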

Problem 3

Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing at most k mismatches.

[Figure: the same dictionary {bzip, not, or, space} and compressed text C(S) of S = “bzip or not bzip”, now with P = bot and k = 2.]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position 4; it also occurs with 4 mismatches starting at position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep

 Our current goal: given k, find all the occurrences of P in T with up to k mismatches.
 We define the matrix M^l to be an m-by-n binary matrix such that:

   M^l(i,j) = 1 iff there are no more than l mismatches between the first i characters of P and the i characters of T ending at character j.

 What is M^0?
 How does M^k solve the k-mismatch problem?

Computing M^k

 We compute M^l for all l = 0, …, k.
 For each j we compute M^0(j), M^1(j), …, M^k(j).
 For all l we initialize M^l(0) to the zero vector.
 In order to compute M^l(j), we observe that there is a match iff one of the two cases below holds.

Computing M^l: case 1

 The first i-1 characters of P match a substring of T ending at j-1, with at most l mismatches, and the next pair of characters in P and T are equal.

This case is captured by   BitShift( M^l(j-1) ) & U( T[j] )

Computing M^l: case 2

 The first i-1 characters of P match a substring of T ending at j-1, with at most l-1 mismatches (so the pair P[i], T[j] is allowed to mismatch).

This case is captured by   BitShift( M^(l-1)(j-1) )

Computing M^l

 We compute M^l for all l = 0, …, k; for each j we compute M^0(j), M^1(j), …, M^k(j), after initializing every M^l(0) to the zero vector.
 Combining the two cases, for l ≥ 1:

   M^l(j) = [ BitShift( M^l(j-1) ) & U( T[j] ) ]  OR  BitShift( M^(l-1)(j-1) )

A small sketch of this recurrence follows.
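
A possible Python rendering of the recurrence, keeping one integer per column (m ≤ w assumed); names and the 1-based output are illustrative.

  def agrep_mismatch(T, P, k):
      m = len(P)
      U = {}
      for i, c in enumerate(P):
          U[c] = U.get(c, 0) | (1 << i)
      M = [0] * (k + 1)                        # M[l] = column j of M^l
      occs = []
      for j, c in enumerate(T):
          Uc, prev = U.get(c, 0), M[:]
          M[0] = ((prev[0] << 1) | 1) & Uc     # exact-match column M^0
          for l in range(1, k + 1):
              # case 1: equal pair   |   case 2: one more mismatch
              M[l] = (((prev[l] << 1) | 1) & Uc) | ((prev[l - 1] << 1) | 1)
          if M[k] & (1 << (m - 1)):
              occs.append(j - m + 2)
      return occs

  print(agrep_mismatch("xabxabaaca", "abaad", 1))   # -> [5], as in the example below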

Example M^1     (T = xabxabaaca, P = abaad)

M^1 =
         j:  1 2 3 4 5 6 7 8 9 10
   i=1:      1 1 1 1 1 1 1 1 1 1
   i=2:      0 0 1 0 0 1 0 1 1 0
   i=3:      0 0 0 1 0 0 1 0 0 1
   i=4:      0 0 0 0 1 0 0 1 0 0
   i=5:      0 0 0 0 0 0 0 0 1 0

M^0 =
         j:  1 2 3 4 5 6 7 8 9 10
   i=1:      0 1 0 0 1 0 1 1 0 1
   i=2:      0 0 1 0 0 1 0 0 0 0
   i=3:      0 0 0 0 0 0 1 0 0 0
   i=4:      0 0 0 0 0 0 0 1 0 0
   i=5:      0 0 0 0 0 0 0 0 0 0

How much do we pay?

 The running time is O( k n (1 + m/w) ).
 Again, the method is efficient in practice for small m.
 Moreover, only O(k) columns of M are needed at any given time; hence the space used by the algorithm is O(k) memory words.

Problem 3: Solution

Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing k mismatches.

[Figure: the same dictionary and compressed text C(S) of S = “bzip or not bzip”, with P = bot and k = 2; the matching term is not, whose codeword is 1g 0g 0a.]

Agrep: more sophisticated operations

 The Shift-And method can solve other operations.
 The edit distance between two strings p and s is d(p,s) = the minimum number of operations needed to transform p into s via three operations:
   Insertion: insert a symbol in p
   Deletion: delete a symbol from p
   Substitution: change a symbol of p into a different one

Example: d(ananas, banane) = 3   (a worked sketch follows)
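
For reference, a sketch of the classical dynamic program for d(p,s); this is not the bit-parallel Agrep variant, just the textbook recurrence.

  def edit_distance(p, s):
      m, n = len(p), len(s)
      D = [[0] * (n + 1) for _ in range(m + 1)]   # D[i][j] = d(p[:i], s[:j])
      for i in range(m + 1): D[i][0] = i          # i deletions
      for j in range(n + 1): D[0][j] = j          # j insertions
      for i in range(1, m + 1):
          for j in range(1, n + 1):
              sub = 0 if p[i - 1] == s[j - 1] else 1
              D[i][j] = min(D[i - 1][j] + 1,      # deletion
                            D[i][j - 1] + 1,      # insertion
                            D[i - 1][j - 1] + sub)
      return D[m][n]

  print(edit_distance("ananas", "banane"))        # -> 3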

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…

Canonical Huffman still needs to know the codeword lengths, and thus to build the tree…
This may be extremely time/space costly when you deal with GBs of textual data.

A simple algorithm: sort the probabilities p_i in decreasing order, and encode the symbol s_i via the variable-length code for the integer i.

g-code for integer encoding
 g(x) = (Length - 1) zeros, followed by x written in binary, where x > 0 and Length = floor(log2 x) + 1
 e.g., 9 is represented as <000,1001>.
 The g-code for x takes 2·floor(log2 x) + 1 bits (i.e., a factor of 2 from optimal).
 It is optimal for Pr(x) = 1/(2x^2), and i.i.d. integers.

It is a prefix-free encoding…

Given the following sequence of g-coded integers, reconstruct the original sequence:

  0001000001100110000011101100111

Answer: 8, 6, 3, 59, 7
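
A small Python sketch of g-coding matching the example above (function names are illustrative).

  def gamma_encode(x):                 # x > 0: (Length-1) zeros, then x in binary
      b = bin(x)[2:]
      return "0" * (len(b) - 1) + b

  def gamma_decode(bits):
      out, i = [], 0
      while i < len(bits):
          z = 0
          while bits[i] == "0":        # count the leading zeros
              z += 1; i += 1
          out.append(int(bits[i:i + z + 1], 2))
          i += z + 1
      return out

  print(gamma_encode(9))                                   # 0001001
  print(gamma_decode("0001000001100110000011101100111"))   # [8, 6, 3, 59, 7]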

Analysis

Sort the p_i in decreasing order, and encode s_i via the variable-length code g(i).
Recall that |g(i)| ≤ 2·log i + 1.
How good is this approach with respect to Huffman? Compression ratio ≤ 2·H0(s) + 1.
Key fact:
  1 ≥ Σ_{i=1..x} p_i ≥ x · p_x    ⟹    x ≤ 1/p_x

How good is it?

Encode the integers via g-coding: |g(i)| ≤ 2·log i + 1.
The cost of the encoding is (recall that i ≤ 1/p_i):

  Σ_{i=1..|S|} p_i · |g(i)|  ≤  Σ_{i=1..|S|} p_i · [ 2·log(1/p_i) + 1 ]  =  2·H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + …

A better encoding

 Byte-aligned and tagged Huffman
   128-ary Huffman tree
   The first bit of the first byte is tagged
   Only the 7-bit configurations produced by Huffman are used

 End-tagged dense code
   The rank r is mapped to the r-th binary sequence on 7·k bits
   The first bit of the last byte is tagged

Surprising changes:
 It is a prefix code
 Better compression: it uses all the 7-bit configurations

(s,c)-dense codes

The distribution of words is skewed: 1/i^q, where 1 < q < 2.

A new concept: Continuers vs Stoppers
 Previously we used s = c = 128.
The main idea is:
 s + c = 256 (we are playing with 8 bits)
 Thus s items are encoded with 1 byte
 And s·c items with 2 bytes, s·c^2 with 3 bytes, …

An example

 5000 distinct words
 ETDC encodes 128 + 128^2 = 16512 words on at most 2 bytes
 A (230,26)-dense code encodes 230 + 230·26 = 6210 words on at most 2 bytes, hence more words on 1 byte, and thus it wins if the distribution is skewed… (a small sketch follows)
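
A possible (s,c)-dense encoder in Python; the byte ordering and the 0-based rank convention are assumptions, chosen so that the counts above (230 one-byte codewords, 6210 codewords of at most two bytes) come out right.

  def sc_encode(rank, s=128, c=128):
      # continuers are byte values in [s, s+c); the final stopper is in [0, s)
      group = s                          # number of codewords of the current length
      while rank >= group:
          rank -= group
          group *= c
      stopper = rank % s
      rank //= s
      cont = []
      while group > s:                   # one continuer per extra byte
          cont.append(s + rank % c)
          rank //= c
          group //= c
      return cont[::-1] + [stopper]

  print(sc_encode(229, 230, 26))   # [229]      -- last word on 1 byte
  print(sc_encode(230, 230, 26))   # [230, 0]   -- first word on 2 bytes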

Optimal (s,c)-dense codes

Find the optimal s, assuming c = 256 - s.
 Brute-force approach
 Binary search: on real distributions there seems to be a unique minimum

(K_s = max codeword length; F_s^k = cumulative probability of the symbols whose codeword length is ≤ k)

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move-to-Front Coding

Transforms a char sequence into an integer sequence, which can then be var-length coded.
 Start with the list of symbols L = [a,b,c,d,…]
 For each input symbol s:
   1) output the position of s in L
   2) move s to the front of L

There is a memory. Properties:
 It exploits temporal locality, and it is dynamic.
 X = 1^n 2^n 3^n … n^n  ⟹  Huff = O(n^2 log n) bits, MTF = O(n log n) + n^2 bits

Not much worse than Huffman… but it may be far better.
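
A minimal MTF coder/decoder sketch in Python.

  def mtf_encode(text, alphabet):
      L, out = list(alphabet), []
      for ch in text:
          i = L.index(ch)              # position of ch in the current list
          out.append(i)
          L.insert(0, L.pop(i))        # move ch to the front
      return out

  def mtf_decode(codes, alphabet):
      L, out = list(alphabet), []
      for i in codes:
          ch = L.pop(i)
          out.append(ch)
          L.insert(0, ch)
      return "".join(out)

  codes = mtf_encode("abbbaacccca", "abcd")
  print(codes)                         # [0, 1, 0, 0, 1, 0, 2, 0, 0, 0, 1]
  print(mtf_decode(codes, "abcd"))     # abbbaacccca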

MTF: how good is it?

Encode the integers via g-coding: |g(i)| ≤ 2·log i + 1.
Charging the first occurrence of each symbol to the initial list (put S at the front), the cost of the encoding is:

  O(|S| log |S|)  +  Σ_{x=1..|S|} Σ_{i=2..n_x} | g( p_x^i - p_x^{i-1} ) |

where n_x is the number of occurrences of symbol x and p_x^i is the position of its i-th occurrence.
By Jensen’s inequality this is:

  ≤  O(|S| log |S|)  +  Σ_{x=1..|S|} n_x · [ 2·log(N / n_x) + 1 ]
  =  O(|S| log |S|)  +  N · [ 2·H0(X) + 1 ]

Hence   La[mtf]  ≤  2·H0(X) + O(1).

MTF: higher compression

To achieve higher compression we consider words (and separators) as the symbols to be encoded.
How to maintain the MTF-list efficiently:
 Search tree
   Leaves contain the symbols, ordered as in the MTF-list
   Nodes contain the size of their descending subtree
 Hash table
   key is a symbol
   data is a pointer to the corresponding tree leaf

Each tree operation takes O(log |S|) time.
The total is O(n log |S|), where n = #symbols to be compressed.

Run-Length Encoding (RLE)

If spatial locality is very high, then
  abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings: just the run lengths and one initial bit.
Properties (there is a memory):
 It exploits spatial locality, and it is a dynamic code.
 X = 1^n 2^n 3^n … n^n  ⟹  Huff(X) = Θ(n^2 log n)  >  Rle(X) = Θ(n log n)
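
A tiny RLE sketch reproducing the example above.

  def rle_encode(s):
      runs, i = [], 0
      while i < len(s):
          j = i
          while j < len(s) and s[j] == s[i]:
              j += 1
          runs.append((s[i], j - i))
          i = j
      return runs

  print(rle_encode("abbbaacccca"))   # [('a',1), ('b',3), ('a',2), ('c',4), ('a',1)]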

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)

Assign each symbol an interval of the range from 0 (inclusive) to 1 (exclusive), of width equal to its probability, starting at the cumulative probability

  f(i) = Σ_{j<i} p(j)

e.g., with p(a) = .2, p(b) = .5, p(c) = .3:   f(a) = .0, f(b) = .2, f(c) = .7
so a ↦ [0,.2), b ↦ [.2,.7), c ↦ [.7,1).

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7)).

Arithmetic Coding: Encoding Example

Coding the message sequence bac, with p(a)=.2, p(b)=.5, p(c)=.3:

  start:    [0, 1)
  after b:  [.2, .7)
  after a:  [.2, .3)
  after c:  [.27, .3)

The final sequence interval is [.27, .3).

Arithmetic Coding

To code a sequence of symbols c_1 … c_n with probabilities p[c], use:

  l_0 = 0     s_0 = 1
  l_i = l_{i-1} + s_{i-1} · f[c_i]
  s_i = s_{i-1} · p[c_i]

where f[c] is the cumulative probability up to symbol c (not included).
The final interval size is   s_n = Π_{i=1..n} p[c_i].

The interval for a message sequence will be called the sequence interval.
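
A sketch of the recurrence in Python, applied to the bac example; floating point is used only for illustration (real coders use the integer version discussed later).

  def sequence_interval(msg, p, f):
      l, s = 0.0, 1.0                 # l_0 = 0, s_0 = 1
      for c in msg:
          l = l + s * f[c]            # l_i = l_{i-1} + s_{i-1} * f[c_i]
          s = s * p[c]                # s_i = s_{i-1} * p[c_i]
      return l, s                     # the sequence interval is [l, l+s)

  p = {'a': .2, 'b': .5, 'c': .3}
  f = {'a': .0, 'b': .2, 'c': .7}
  print(sequence_interval("bac", p, f))   # ~ (0.27, 0.03), i.e. [.27, .30)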

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore, specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example

Decoding the number .49, knowing that the message has length 3 (p(a)=.2, p(b)=.5, p(c)=.3):

  .49 ∈ [.2,.7) = the interval of b                                                   → 1st symbol b
  within [.2,.7):   a ↦ [.2,.3),  b ↦ [.3,.55),   c ↦ [.55,.7);   .49 ∈ [.3,.55)      → 2nd symbol b
  within [.3,.55):  a ↦ [.3,.35), b ↦ [.35,.475), c ↦ [.475,.55); .49 ∈ [.475,.55)    → 3rd symbol c

The message is bbc.

Representing a real number

Binary fractional representation:
  .75 = .11      1/3 = .010101…      11/16 = .1011

Algorithm (emit the bits of x ∈ [0,1)):
  1. x = 2·x
  2. if x < 1 output 0
  3. else x = x - 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
  e.g.  [0,.33) → .01      [.33,.66) → .1      [.66,1) → .11

Representing a code interval

We can view binary fractional numbers as intervals by considering all their completions:

  number   min     max      interval
  .11      .110    .111…    [.75, 1.0)
  .101     .1010   .1011…   [.625, .75)

We will call this the code interval.

Selecting the code interval

To find a prefix code, find a binary fractional number whose code interval is contained in the sequence interval (a dyadic number).

[Figure: a sequence interval [.61, .79) containing the code interval of .101, i.e. [.625, .75).]

One can use L + s/2 truncated to 1 + ceil( log (1/s) ) bits.

Bound on Arithmetic length

Note that  -log s + 1 = log (2/s).

Bound on Length

Theorem: For a text of length n, the Arithmetic encoder generates at most

  1 + ceil( log (1/s) )  =  1 + ceil( log Π_{i=1..n} (1/p_i) )
                          ≤  2 + Σ_{i=1..n} log (1/p_i)
                          =  2 + Σ_{k=1..|S|} n·p_k · log (1/p_k)
                          =  2 + n·H0   bits

In practice it takes nH0 + 0.02·n bits, because of rounding.

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep the integers in the range [0..R), where R = 2^k.
Use rounding to generate integer intervals.
Whenever the sequence interval falls into the top, bottom or middle half, expand the interval by a factor of 2.

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

[Figure: the ATB maps the current interval (L,s), the distribution (p_1,…,p_|S|) and the next symbol c to the new interval (L’,s’).]

Therefore, even the distribution can change over time.

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox

[Figure: the ATB is driven by the conditional probability p[ s | context ], where s is either a character c or the escape symbol esc; it maps the interval (L,s) to (L’,s’).]

Encoder and Decoder must know the protocol for selecting the same conditional probability distribution (PPM-variant).

PPM: Example Contexts     (String = ACCBACCACBA B,  k = 2)

  Context Empty:   A = 4   B = 2   C = 5   $ = 3

  Context A:    C = 3   $ = 1
  Context B:    A = 2   $ = 1
  Context C:    A = 1   B = 2   C = 2   $ = 3

  Context AC:   B = 1   C = 2   $ = 2
  Context BA:   C = 1   $ = 1
  Context CA:   C = 1   $ = 1
  Context CB:   A = 2   $ = 1
  Context CC:   A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/
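
A Python sketch of the context counting behind the table above; the escape counts, which depend on the PPM variant, are omitted.

  from collections import defaultdict

  def ppm_counts(text, k):
      # counts[context][symbol] = how many times `symbol` followed `context`
      counts = defaultdict(lambda: defaultdict(int))
      for order in range(k + 1):
          for i in range(order, len(text)):
              counts[text[i - order:i]][text[i]] += 1
      return counts

  c = ppm_counts("ACCBACCACBA", 2)
  print(dict(c[""]))     # {'A': 4, 'C': 5, 'B': 2}
  print(dict(c["A"]))    # {'C': 3}
  print(dict(c["AC"]))   # {'C': 2, 'B': 1}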

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character
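
A naive Python sketch of the LZ77 step with a window, reproducing the triples above; the quadratic scan is only for illustration (real encoders such as gzip use hashing).

  def lz77_encode(T, W):
      out, i = [], 0
      while i < len(T):
          best_d, best_len = 0, 0
          for d in range(1, min(i, W) + 1):           # candidate distances in the window
              l = 0
              # the copy may overlap the cursor, exactly as in the decoder
              while i + l < len(T) - 1 and T[i + l - d] == T[i + l]:
                  l += 1
              if l > best_len:
                  best_d, best_len = d, l
          out.append((best_d, best_len, T[i + best_len]))   # <d, len, next char>
          i += best_len + 1                                  # advance by len + 1
      return out

  print(lz77_encode("aacaacabcabaaac", 6))
  # [(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')]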

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example

  Input   Output so far                Dict
  112     a
  112     a a                          256 = aa
  113     a a b                        257 = ab
  256     a a b a a                    258 = ba
  114     a a b a a c                  259 = aac
  257     a a b a a c a b              260 = ca
  261     a a b a a c a b ?            261 is not yet known: the special case, output aba
  114     a a b a a c a b a b a c      261 = aba  (added one step later), then 262 = abac

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us be given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows     (1994)

  F           L
  #mississipp i
  i#mississip p
  ippi#missis s
  issippi#mis s
  ississippi# m
  mississippi #
  pi#mississi p
  ppi#mississ i
  sippi#missi s
  sissippi#mi s
  ssippi#miss i
  ssissippi#m i

F = first column, L = last column; the row mississippi# is T itself.

A famous example

Much
longer...

A useful tool: the L → F mapping

[The same sorted matrix as above, with first column F = # i i i i m p p s s s s and last column L = i p s s m # p i s s i i; T is unknown.]

How do we map L’s chars onto F’s chars?
… we need to distinguish equal chars in F…

Take two equal chars of L and rotate their rows rightward by one position: the two resulting rows start with that char and keep the same relative order. Hence equal chars appear in the same relative order in L and in F.

The BWT is invertible

[Again the sorted matrix with F = # i i i i m p p s s s s and L = i p s s m # p i s s i i; T is unknown.]

Two key properties:
1. The LF-array maps L’s chars to F’s chars.
2. L[ i ] precedes F[ i ] in T.

Reconstruct T backward:   T = …. i p p i #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
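
A compact Python sketch of the forward and inverse transform on the example above; sorting the rotations is quadratic and only for illustration (real tools build the suffix array instead).

  def bwt(T):                      # T must end with a unique smallest char, here '#'
      rots = sorted(T[i:] + T[:i] for i in range(len(T)))
      return "".join(row[-1] for row in rots)

  def inverse_bwt(L):
      F = sorted(L)
      seen, rank = {}, []          # rank[i] = occurrence number of L[i] among equal chars
      for c in L:
          rank.append(seen.get(c, 0)); seen[c] = seen.get(c, 0) + 1
      first = {}                   # first row of each char in F
      for i, c in enumerate(F):
          first.setdefault(c, i)
      LF = [first[c] + rank[i] for i, c in enumerate(L)]
      r, out = L.index('#'), []    # start from the row whose last char is '#'
      for _ in range(len(L)):
          out.append(L[r]); r = LF[r]
      return "".join(reversed(out))

  L = bwt("mississippi#")
  print(L)                         # ipssm#pissii
  print(inverse_bwt(L))            # mississippi#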

How to compute the BWT?

  SA    BWT matrix      L
  12    #mississipp     i
  11    i#mississip     p
   8    ippi#missis     s
   5    issippi#mis     s
   2    ississippi#     m
   1    mississippi     #
  10    pi#mississi     p
   9    ppi#mississ     i
   7    sippi#missi     s
   4    sissippi#mi     s
   6    ssippi#miss     i
   3    ssissippi#m     i

We said that L[i] precedes F[i] in T; e.g. L[3] = T[7].
Given SA and T, we have   L[i] = T[ SA[i] - 1 ].

How to construct SA from T?

  SA
  12   #
  11   i#
   8   ippi#
   5   issippi#
   2   ississippi#
   1   mississippi#
  10   pi#
   9   ppi#
   7   sippi#
   4   sissippi#
   6   ssippi#
   3   ssissippi#

Input: T = mississippi#

Elegant but inefficient. Obvious inefficiencies:
 Θ(n^2 log n) time in the worst case
 Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…

 Physical network graph
   V = routers
   E = communication links
 The “cosine” graph (undirected, weighted)
   V = static web pages
   E = semantic distance between pages
 Query-Log graph (bipartite, weighted)
   V = queries and URLs
   E = (q,u) if u is a result for q and has been clicked by some user who issued q
 Social graph (undirected, unweighted)
   V = users
   E = (x,y) if x knows y (facebook, address book, email, …)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: the probability that a node has x links is 1/x^α, with α ≈ 2.1

The In-degree distribution

[Figure: in-degree distributions measured on the Altavista crawl (1999) and the WebBase crawl (2001); the in-degree follows a power-law distribution:]

  Pr[ in-degree(u) = k ]  ∝  1 / k^α,    α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: the probability that a node has x links is 1/x^α, with α ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph

[Figure: a plot of a Web graph with 21 million pages and 150 million links; the URLs are sorted (examples shown: Berkeley, Stanford).]

URL compression + Delta encoding

The library WebGraph

From the uncompressed adjacency list to an adjacency list with compressed gaps (exploiting locality):
 the successor list of node x, S(x) = {s1, s2, …, sk}, is encoded by the gaps {s1 - x, s2 - s1 - 1, …, sk - s(k-1) - 1}
 the first gap may be negative, so it is mapped to a non-negative integer (see the sketch below)

Copy-lists (exploiting similarity), with reference chains of possibly limited length:
 each bit of the copy-list of y tells whether the corresponding successor of the reference node x is also a successor of y;
 the reference index is chosen in [0,W] so as to give the best compression.

Copy-blocks = RLE(copy-list):
 the copy-list is run-length encoded into copy-blocks;
 the first copy-block is 0 if the copy-list starts with 0;
 the last block is omitted (we know the length…);
 the length is decremented by one for all blocks.

This is a Java and C++ lib (≈3 bits/edge).
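
A sketch in Python of the gap encoding of a successor list; the mapping of the possibly-negative first gap follows the signed-to-natural rule used in the interval examples below (2v for v ≥ 0, 2|v|-1 for v < 0), and the node ids are hypothetical.

  def encode_successors(x, succs):
      # succs must be sorted increasingly; the first gap is relative to the node id x
      def nat(v):
          return 2 * v if v >= 0 else 2 * (-v) - 1
      gaps = [nat(succs[0] - x)]
      gaps += [succs[i] - succs[i - 1] - 1 for i in range(1, len(succs))]
      return gaps

  print(encode_successors(15, [13, 15, 16, 17, 19]))   # [3, 1, 0, 0, 1]

The small gaps can then be written with a variable-length code such as the g-code seen earlier.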

Extra-nodes: Compressing Intervals

Consecutive successors among the extra-nodes are grouped into intervals:
 intervals are encoded with their left extreme and length;
 the interval length is decremented by Lmin = 2;
 residuals are encoded as differences between consecutive residuals, or from the source node, mapping signed values to naturals (2v for v ≥ 0, 2|v|-1 for v < 0).

Examples from the figure:
  0    = (15-15)·2      (positive)
  2    = (23-19)-2      (jump ≥ 2)
  600  = (316-16)·2
  3    = |13-15|·2 - 1  (negative)
  3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization

 Delta compression   [diff, zdelta, REBL, …]
   Compress file f deploying file f’
   Compress a group of files
   Speed up web access by sending the difference between the requested page and the ones available in cache
 File synchronization   [rsync, zsync]
   Client updates its old file f_old with the f_new available on a server
   Mirroring, shared crawling, content distribution networks
 Set reconciliation
   Client updates its structured old file f_old with the f_new available on a server
   Update of contacts or appointments, intersecting inverted lists in a P2P search engine

Z-delta compression     (one-to-one)

Problem: we have two files f_known and f_new, and the goal is to compute a file f_d of minimum size such that f_new can be derived from f_known and f_d.
 Assume that block moves and copies are allowed
 Find an optimal covering set of f_new based on f_known
 The LZ77-scheme provides an efficient, optimal solution:
   f_known is the “previously encoded text”; compress the concatenation f_known·f_new starting from f_new
 zdelta is one of the best implementations

            Emacs size   Emacs time
  uncompr   27Mb         ---
  gzip      8Mb          35 secs
  zdelta    1.5Mb        42 secs

Efficient Web Access

Dual proxy architecture: a pair of proxies located on each side of the slow link use a proprietary protocol to increase performance over this link.

[Figure: the client sends a request over the slow link and receives a delta-encoding of the page; the proxy forwards the request over the fast link to the web; both sides hold the reference page.]

Use zdelta to reduce traffic:
 the old version is available at both proxies
 restricted to pages already visited (30% hits), URL-prefix match
 small cache

Cluster-based delta compression

Problem: we wish to compress a group of files F.
 Useful on a dynamic collection of web pages, back-ups, …
 Apply pairwise zdelta: find for each f ∈ F a good reference
 Reduction to the Min Branching problem on DAGs:
   Build a weighted graph G_F: nodes = files, weights = zdelta-sizes
   Insert a dummy node connected to all files, whose edge weights are the gzip-coding sizes
   Compute the min branching = directed spanning tree of minimum total cost covering G’s nodes

[Figure: a small example graph with a dummy node 0 and edge weights such as 20, 123, 220, 620, 2000.]

            space   time
  uncompr   30Mb    ---
  tgz       20%     linear
  THIS      8%      quadratic

Improvement     (what about many-to-one compression of a group of files?)

Problem: constructing G is very costly, n^2 edge calculations (zdelta executions).
We wish to exploit some pruning approach:
 Collection analysis: cluster the files that appear similar and are thus good candidates for zdelta-compression; build a sparse weighted graph G’_F containing only the edges between those pairs of files.
 Assign weights: estimate appropriate edge weights for G’_F, thus saving zdelta executions. Nonetheless, still n^2 time.

            space   time
  uncompr   260Mb   ---
  tgz       12%     2 mins
  THIS      8%      16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm

[Figure: the client sends the block hashes of f_old; the server, which holds f_new, replies with the encoded file.]

The rsync algorithm (contd)
 simple, widely used, single roundtrip
 optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
 choice of block size is problematic (default: max{700, √n} bytes)
 not good in theory: the granularity of changes may disrupt the use of blocks
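
A sketch of a weak rolling checksum in the spirit of rsync’s 4-byte hash; the modulus and weighting are illustrative assumptions, not rsync’s actual constants.

  M = 1 << 16

  def weak(block):                       # a = plain sum, b = position-weighted sum
      a = sum(block) % M
      b = sum((len(block) - i) * x for i, x in enumerate(block)) % M
      return a, b

  def roll(a, b, out_byte, in_byte, B):  # slide the window by one byte in O(1)
      a = (a - out_byte + in_byte) % M
      b = (b - B * out_byte + a) % M
      return a, b

  data, B = b"the quick brown fox jumps over the lazy dog", 8
  a, b = weak(data[0:B])
  print(roll(a, b, data[0], data[B], B) == weak(data[1:B + 1]))   # True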

Rsync: some experiments

            gcc size   emacs size
  total     27288      27326
  gzip      7563       8577
  zdelta    227        1431
  rsync     964        4452

Compressed size in KB (slightly outdated numbers).

Factor 3-5 gap between rsync and zdelta!!

A new framework: zsync

The server sends the hashes (unlike the client in rsync), and the client checks them.
The server deploys the common f_ref to compress the new f_tar (rsync compresses just it).

A multi-round protocol

k blocks of n/k elements, log(n/k) levels.
If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits.

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts

Pattern P occurs at position i of T
  iff
P is a prefix of the i-th suffix of T (i.e. T[i,N]).

Occurrences of P in T = all suffixes of T having P as a prefix.
Example: T = mississippi, P = si  →  occurrences at positions 4, 7.

SUF(T) = sorted set of suffixes of T.

Reduction

From substring search
To prefix search

The Suffix Tree

[Figure: the suffix tree of T# = mississippi#. Edges are labeled with substrings of T# (such as #, i, i#, mississippi#, p, pi#, ppi#, s, si, ssi) and its 12 leaves are labeled with the starting positions 1..12 of the corresponding suffixes.]

The Suffix Array

Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

(Storing SUF(T) explicitly would take Θ(N^2) space; the suffix array stores only the suffix pointers.)

  SA    SUF(T)
  12    #
  11    i#
   8    ippi#
   5    issippi#
   2    ississippi#
   1    mississippi#
  10    pi#
   9    ppi#
   7    sippi#
   4    sissippi#
   6    ssippi#
   3    ssissippi#

T = mississippi#

Suffix Array space:
 SA: Θ(N log2 N) bits
 Text T: N chars
  In practice, a total of 5N bytes.

Searching a pattern

Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step.

[Figure: two steps of the binary search for P = si over SA = 12 11 8 5 2 1 10 9 7 4 6 3 on T = mississippi#; in one step P is larger than the probed suffix, in the other P is smaller.]

Suffix Array search
 O(log2 N) binary-search steps
 each step takes O(p) char comparisons
  overall, O(p log2 N) time
 improvable to O(p + log2 N) [Manber-Myers, ’90] and to bounds depending on |S| [Cole et al, ’06]

A small sketch of the binary search follows.
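
A Python sketch of the indirected binary search on the mississippi# example; the naive SA construction is only for illustration.

  def suffix_array(T):
      return sorted(range(len(T)), key=lambda i: T[i:])

  def sa_search(T, SA, P):
      lo, hi = 0, len(SA)
      while lo < hi:                                   # leftmost suffix with prefix >= P
          mid = (lo + hi) // 2
          if T[SA[mid]:SA[mid] + len(P)] < P: lo = mid + 1
          else: hi = mid
      first, hi = lo, len(SA)
      while lo < hi:                                   # leftmost suffix with prefix > P
          mid = (lo + hi) // 2
          if T[SA[mid]:SA[mid] + len(P)] <= P: lo = mid + 1
          else: hi = mid
      return sorted(SA[i] + 1 for i in range(first, lo))   # 1-based positions

  T = "mississippi#"
  SA = suffix_array(T)
  print([i + 1 for i in SA])      # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
  print(sa_search(T, SA, "si"))   # [4, 7]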

Locating the occurrences

[Figure: the binary search for P = si on T = mississippi# isolates the contiguous SA range of the suffixes sippi# and sissippi#, i.e. occ = 2 occurrences at positions 7 and 4; the range boundaries can be found by searching for P padded with the smallest (#) and largest ($) symbols.]

Suffix Array search: O(p + log2 N + occ) time.

 Suffix Trays: O(p + log2 |S| + occ)   [Cole et al., ‘06]
 String B-tree   [Ferragina-Grossi, ’95]
 Self-adjusting Suffix Arrays   [Ciriani et al., ’02]

Text mining

Lcp[1,N-1] = longest common prefix between suffixes adjacent in SA.

  SA:   12 11  8  5  2  1 10  9  7  4  6  3
  Lcp:      0  1  1  4  0  0  1  0  2  1  3

(T = mississippi#; e.g. the Lcp between issippi# and ississippi# is 4.)

• How long is the common prefix between T[i,...] and T[j,...]?
  • The min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L?
  • Search for an entry Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times?
  • Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.


Slide 38

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with i-th bit of U(T[j]) to estabilish if both are true

An example j=1
1 2 3 4 5 6 7 8 9 10

n
m

1

12345

1

0

P=abaac

2

0

3

0

4

0

5

0

T=xabxabaaca

0
 
0
U (x)  0
 
0
 
0

2

3

4

5

6

7

1 
 
0
BitShift ( M ( 0 )) & U (T [1])   0  &
 
0
 
0

8

9

0 0
   
0 0
0  0
   
0 0
   
0 0

1
0

An example j=2
1 2 3 4 5 6 7 8 9 10

n
m

1

2

12345

1

0

1

P=abaac

2

0

0

3

0

0

4

0

0

5

0

0

T=xabxabaaca

1 
 
0
U (a )   1 
 
1 
 
0

3

4

5

6

7

1 
 
0
BitShift ( M (1)) & U (T [ 2 ])   0  &
 
0
 
0

8

9

1  1 
   
0 0
1    0 
   
1   0 
   
0 0

1
0

An example j=3
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

12345

1

0

1

0

P=abaac

2

0

0

1

3

0

0

0

4

0

0

0

5

0

0

0

T=xabxabaaca

0
 
1 
U (b )   0 
 
0
 
0

4

5

6

7

1 
 
1 
BitShift ( M ( 2 )) & U (T [ 3 ])   0  &
 
0
 
0

8

9

0 0
   
1  1 
0  0
   
0 0
   
0 0

1
0

An example j=9
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

4

5

6

7

8

9

12345

1

0

1

0

0

1

0

1

1

0

P=abaac

2

0

0

1

0

0

1

0

0

0

3

0

0

0

0

0

0

1

0

0

4

0

0

0

0

0

0

0

1

0

5

0

0

0

0

0

0

0

0

1

T=xabxabaaca

0
 
0
U (c )   0 
 
0
 
1 

1 
 
1 
BitShift ( M ( 8 )) & U (T [ 9 ])   0  &
 
0
 
1 

0 0
   
0 0
0  0
   
0 0
   
1  1 

1
0

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary: bzip, not, or (plus space)

Given a pattern P, find all the occurrences in S of all terms containing P as substring allowing at most k mismatches.

P = bot    k = 2

[Figure: the same word-based Huffman dictionary and compressed text C(S), S = "bzip or not bzip".]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
   atcgaa        (2 mismatches, starting at position 4)

aatatccacaa
 atcgaa          (4 mismatches, starting at position 2)

Agrep




Our current goal: given k, find all the occurrences of P in T with up to k mismatches.
We define the matrix Ml to be an m by n binary matrix, such that:

Ml(i,j) = 1 iff the first i characters of P match the i characters of T ending at character j with no more than l mismatches.

 What is M0?
 How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

[Figure: P[1..i-1] aligned with the substring of T ending at j-1; stars mark the (at most l) mismatches.]

BitShift( Ml(j-1) ) & U(T[j])

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

[Figure: P[1..i-1] aligned with the substring of T ending at j-1; stars mark the (at most l-1) mismatches.]

BitShift( Ml-1(j-1) )

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a match iff

Ml(j) = [ BitShift( Ml(j-1) ) & U(T[j]) ]  OR  BitShift( Ml-1(j-1) )
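
A possible C rendering of this recurrence, keeping only the current k+1 columns as noted later ("How much do we pay?"); it assumes m ≤ 64, and the names are illustrative:

#include <stdint.h>
#include <stdio.h>

/* Report positions j of T where P occurs ending at j with at most k mismatches (m <= 64). */
void agrep_mismatch(const char *T, int n, const char *P, int m, int k) {
    uint64_t U[256] = {0};
    for (int i = 0; i < m; i++)
        U[(unsigned char)P[i]] |= 1ULL << i;

    uint64_t M[65] = {0};                      /* M[l] holds the current column Ml(j)   */
    uint64_t last = 1ULL << (m - 1);
    for (int j = 0; j < n; j++) {
        uint64_t u = U[(unsigned char)T[j]];
        uint64_t below = 0;                    /* BitShift(M_{l-1}(j-1)); 0 when l = 0  */
        for (int l = 0; l <= k; l++) {
            uint64_t shifted = (M[l] << 1) | 1ULL;      /* BitShift(Ml(j-1))            */
            M[l] = (shifted & u) | below;               /* the recurrence above         */
            below = shifted;                            /* reused at level l+1          */
        }
        if (M[k] & last)
            printf("occurrence with <= %d mismatches ending at %d\n", k, j + 1);
    }
}

int main(void) {
    agrep_mismatch("xabxabaaca", 10, "abaad", 5, 1);   /* as in the Example M1 slide: position 9 */
    return 0;
}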

Example M1
T = xabxabaaca,   P = abaad

        j:  1  2  3  4  5  6  7  8  9  10
M1 =  i=1   1  1  1  1  1  1  1  1  1  1
      i=2   0  0  1  0  0  1  0  1  1  0
      i=3   0  0  0  1  0  0  1  0  0  1
      i=4   0  0  0  0  1  0  0  1  0  0
      i=5   0  0  0  0  0  0  0  0  1  0

M0 =  i=1   0  1  0  0  1  0  1  1  0  1
      i=2   0  0  1  0  0  1  0  0  0  0
      i=3   0  0  0  0  0  0  1  0  0  0
      i=4   0  0  0  0  0  0  0  1  0  0
      i=5   0  0  0  0  0  0  0  0  0  0

M1(5,9) = 1: P occurs ending at position 9 of T with at most 1 mismatch (T[5,9] = abaac vs P = abaad).
How much do we pay?

 The running time is O(kn(1+m/w)).
 Again, the method is practically efficient for small m.
 Still, only O(k) columns of M are needed at any given time. Hence, the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary: bzip, not, or (plus space)

Given a pattern P, find all the occurrences in S of all terms containing P as substring allowing k mismatches.

P = bot    k = 2

[Figure: the compressed text C(S), S = "bzip or not bzip", with the codewords of the matching dictionary terms marked.]

not = 1g 0g 0a

Agrep: more sophisticated operations

 The Shift-And method can solve other ops

 The edit distance between two strings p and s is d(p,s) = minimum number of operations needed to transform p into s via three ops:
   Insertion: insert a symbol in p
   Deletion: delete a symbol from p
   Substitution: change a symbol in p with a different one

 Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding

γ(x) = 0 0 ... 0 followed by x in binary       (Length-1 zeros)

 x > 0 and Length = ⌊log2 x⌋ + 1
 e.g., 9 is represented as <000,1001>.

 γ-code for x takes 2⌊log2 x⌋ + 1 bits      (i.e. a factor of 2 from optimal)

 Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…

 Given the following sequence of γ-coded integers, reconstruct the original sequence:

0001000001100110000011101100111

Answer: 8, 6, 3, 59, 7
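
A small C sketch of γ-encoding and decoding, writing the bits into a string for readability (a real coder would pack them into machine words); it reproduces the exercise above:

#include <stdio.h>
#include <string.h>

/* Append the gamma-code of x > 0 to the string out. */
void gamma_encode(unsigned x, char *out) {
    int len = 0;
    for (unsigned v = x; v > 0; v >>= 1) len++;          /* len = floor(log2 x) + 1   */
    for (int i = 0; i < len - 1; i++) strcat(out, "0");  /* len-1 zeros               */
    for (int i = len - 1; i >= 0; i--) {                 /* then x in binary          */
        char bit[2] = { (char)('0' + ((x >> i) & 1)), '\0' };
        strcat(out, bit);
    }
}

/* Decode and print a sequence of gamma-codes. */
void gamma_decode(const char *in) {
    for (int p = 0; in[p]; ) {
        int zeros = 0;
        while (in[p] == '0') { zeros++; p++; }           /* count the leading zeros    */
        unsigned x = 0;
        for (int i = 0; i <= zeros; i++)                 /* read zeros+1 binary digits */
            x = 2 * x + (unsigned)(in[p++] - '0');
        printf("%u ", x);
    }
    printf("\n");
}

int main(void) {
    char buf[64] = "";
    unsigned seq[] = {8, 6, 3, 59, 7};
    for (int i = 0; i < 5; i++) gamma_encode(seq[i], buf);
    printf("%s\n", buf);          /* 0001000001100110000011101100111 */
    gamma_decode(buf);            /* 8 6 3 59 7                      */
    return 0;
}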

Analysis
Sort pi in decreasing order, and encode si via the variable-length code γ(i).

Recall that: |γ(i)| ≤ 2 log i + 1
How good is this approach wrt Huffman? Compression ratio ≤ 2 H0(s) + 1
Key fact:
1 ≥ Σi=1,...,x pi ≥ x * px   ⇒   x ≤ 1/px

How good is it ?
Encode the integers via γ-coding:  |γ(i)| ≤ 2 log i + 1
The cost of the encoding is (recall i ≤ 1/pi):

Σi=1,...,|Σ| pi * |γ(i)|  ≤  Σi=1,...,|Σ| pi * [ 2 log (1/pi) + 1 ]  =  2 H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2

 A new concept: Continuers vs Stoppers
   Previously we used: s = c = 128

 The main idea is:
   s + c = 256   (we are playing with 8 bits)
   Thus s items are encoded with 1 byte
   And s*c with 2 bytes, s*c² on 3 bytes, ...

An example
 5000 distinct words
 ETDC encodes 128 + 128² = 16512 words within 2 bytes
 The (230,26)-dense code encodes 230 + 230*26 = 6210 words within 2 bytes, hence more words on 1 byte, and thus it wins if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, assuming c = 256-s.

 Brute-force approach

 Binary search:
   On real distributions, there seems to be one unique minimum

Ks = max codeword length
Fsk = cum. prob. of symbols whose |cw| <= k

Experiments: (s,c)-DC is quite interesting…
Search is 6% faster than byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

 Exploits temporal locality, and it is dynamic
 X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n² log n) bits, MTF = O(n log n) + n² bits

Not much worse than Huffman ...but it may be far better
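
A direct C sketch of the two MTF steps described above, over a byte alphabet and with a plain linear scan of the list L (the search-tree/hash-table solution of the next slides replaces this scan when the alphabet is the set of words):

#include <stdio.h>
#include <string.h>

/* MTF-encode in[0..n-1]: out[i] = position of in[i] in the current list L. */
void mtf_encode(const unsigned char *in, int n, int *out) {
    unsigned char L[256];
    for (int i = 0; i < 256; i++) L[i] = (unsigned char)i;   /* L = [0,1,2,...]       */
    for (int i = 0; i < n; i++) {
        int pos = 0;
        while (L[pos] != in[i]) pos++;      /* 1) output the position of s in L       */
        out[i] = pos;
        memmove(L + 1, L, (size_t)pos);     /* 2) move s to the front of L            */
        L[0] = in[i];
    }
}

int main(void) {
    const char *s = "abbbaacccca";
    int out[16];
    mtf_encode((const unsigned char *)s, 11, out);
    for (int i = 0; i < 11; i++) printf("%d ", out[i]);   /* runs become runs of 0s */
    printf("\n");
    return 0;
}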

MTF: how good is it ?
Encode the integers via γ-coding:  |γ(i)| ≤ 2 log i + 1
Put Σ in the front and consider the cost of encoding
(pxi = position of the i-th occurrence of symbol x, nx = #occurrences of x, N = input length):

O(|Σ| log |Σ|)  +  Σx=1,...,|Σ|  Σi=2,...,nx  |γ( pxi - pxi-1 )|

By Jensen's inequality:

≤  O(|Σ| log |Σ|)  +  Σx=1,...,|Σ|  nx * [ 2 log (N/nx) + 1 ]
=  O(|Σ| log |Σ|)  +  N * [ 2 H0(X) + 1 ]

La[mtf]  ≤  2 * H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and separators) as symbols to be encoded.
How to keep the MTF-list efficiently:

 Search tree
   Leaves contain the symbols, ordered as in the MTF-list
   Nodes contain the size of their descending subtree

 Hash Table
   key is a symbol
   data is a pointer to the corresponding tree leaf

 Each tree operation takes O(log |Σ|) time
 Total is O(n log |Σ|), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:

 Exploits spatial locality, and it is a dynamic code
 There is a memory
 X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = Θ(n² log n) > Rle(X) = Θ(n log n)
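
A tiny C sketch of RLE producing exactly the (symbol, run-length) pairs of the example above; for binary strings one would emit just the run lengths plus the first bit, as noted:

#include <stdio.h>

/* Print the run-length encoding of s as (char,length) pairs. */
void rle(const char *s) {
    for (int i = 0; s[i]; ) {
        int j = i;
        while (s[j] == s[i]) j++;          /* extend the current run */
        printf("(%c,%d) ", s[i], j - i);
        i = j;
    }
    printf("\n");
}

int main(void) {
    rle("abbbaacccca");      /* (a,1) (b,3) (a,2) (c,4) (a,1) */
    return 0;
}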

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive).

e.g.   a = .2  →  [0.0, 0.2)
       b = .5  →  [0.2, 0.7)
       c = .3  →  [0.7, 1.0)

f(i) = Σj<i p(j),   so  f(a) = .0,  f(b) = .2,  f(c) = .7

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac

Start with [0,1).
  b  →  [0.2, 0.7)
  a  →  [0.2, 0.3)      (the first 0.2-fraction of [0.2, 0.7))
  c  →  [0.27, 0.3)     (the last 0.3-fraction of [0.2, 0.3))

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities p[c] use the following:

l0 = 0     li = li-1 + si-1 * f[ci]
s0 = 1     si = si-1 * p[ci]

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is   sn = Πi=1,...,n p[ci]

The interval for a message sequence will be called the sequence interval
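
A minimal C sketch of these recurrences on the probabilities of the running example; it prints the sequence interval of "bac" (floating point is used only for illustration, the integer version of the later slides avoids it):

#include <stdio.h>
#include <string.h>

int main(void) {
    /* symbols a, b, c with p = .2, .5, .3 and cumulative f = .0, .2, .7 */
    const char *sym = "abc";
    double p[] = {0.2, 0.5, 0.3};
    double f[] = {0.0, 0.2, 0.7};

    const char *msg = "bac";
    double l = 0.0, s = 1.0;               /* l_0 = 0, s_0 = 1                      */
    for (int i = 0; msg[i]; i++) {
        int c = (int)(strchr(sym, msg[i]) - sym);
        l = l + s * f[c];                  /* l_i = l_{i-1} + s_{i-1} * f[c_i]      */
        s = s * p[c];                      /* s_i = s_{i-1} * p[c_i]                */
    }
    printf("sequence interval = [%.4f, %.4f)\n", l, l + s);   /* [0.2700, 0.3000)   */
    return 0;
}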

Uniquely defining an interval
Important property: the intervals for distinct messages of length n will never overlap.
Therefore specifying any number in the final interval uniquely determines the msg.
Decoding is similar to encoding, but at each step we need to determine what the message value is and then reduce the interval.

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:

  .49 ∈ [0.2, 0.7)   = interval of b                       →  1st symbol is b
  .49 ∈ [0.3, 0.55)  = interval of b within [0.2, 0.7)     →  2nd symbol is b
  .49 ∈ [0.475, 0.55) = interval of c within [0.3, 0.55)   →  3rd symbol is c

The message is bbc.

Representing a real number
Binary fractional representation:

.75 = .11        1/3 = .010101…        11/16 = .1011

Algorithm
1. x = 2 * x
2. If x < 1 output 0
3. else x = x - 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval.
e.g.  [0,.33) = .01     [.33,.66) = .1     [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions.

          min       max       interval
 .11      .110…     .111…     [.75, 1.0)
 .101     .1010…    .1011…    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval is contained in the sequence interval (dyadic number).

[Figure: sequence interval [.61, .79) containing the code interval [.625, .75) of the codeword .101]

Can use L + s/2 truncated to 1 + ⌈log (1/s)⌉ bits

Bound on Arithmetic length

Note that -log s + 1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most
1 + ⌈log (1/s)⌉ = 1 + ⌈log Π (1/pi)⌉
             ≤ 2 + Σi=1,...,n log (1/pi)
             = 2 + Σk=1,...,|Σ| n pk log (1/pk)
             = 2 + n H0     bits

nH0 + 0.02 n bits in practice, because of rounding

Integer Arithmetic Coding
Problem: operations on arbitrary-precision real numbers are expensive.
Key ideas of the integer version:

 Keep integers in range [0..R) where R = 2^k
 Use rounding to generate integer intervals
 Whenever the sequence interval falls into the top, bottom or middle half, expand the interval by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
  Output 1 followed by m 0s
  m = 0
  Message interval is expanded by 2

If u < R/2 then (bottom half)
  Output 0 followed by m 1s
  m = 0
  Message interval is expanded by 2

If l ≥ R/4 and u < 3R/4 then (middle half)
  Increment m
  Message interval is expanded by 2

All other cases: just continue...

You find this at

Arithmetic ToolBox
As a state machine

[Figure: the ATB takes the current interval (L,s) and a symbol c drawn from the distribution (p1,....,pS), and returns the new interval (L',s').]

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox

[Figure: the ATB is driven by the conditional distribution p[ s | context ], where the emitted symbol s is either a char c or esc; it maps the interval (L,s) to (L',s').]

Encoder and Decoder must know the protocol for selecting the same conditional probability distribution (PPM-variant)

PPM: Example Contexts          String = ACCBACCACBA B        k=2

Empty context:    A = 4,  B = 2,  C = 5,  $ = 3

Order-1 contexts:
  A:   C = 3,  $ = 1
  B:   A = 2,  $ = 1
  C:   A = 1,  B = 2,  C = 2,  $ = 3

Order-2 contexts:
  AC:  B = 1,  C = 2,  $ = 2
  BA:  C = 1,  $ = 1
  CA:  C = 1,  $ = 1
  CB:  A = 2,  $ = 1
  CC:  A = 1,  B = 1,  $ = 2
You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c

[Figure: the dictionary is the already-scanned text to the left of the cursor (all substrings starting there); the output for the position under the cursor is <2,3,c>.]

Algorithm's step:
 Output <d, len, c>
   d = distance of the copied string wrt the current position
   len = length of the longest match
   c = next char in text beyond the longest match
 Advance by len + 1

A buffer "window" has fixed length and moves

Example: LZ77 with window
Window size = 6; each triple is (distance of the longest match within W, its length, next character)

a a c a a c a b c a b a a a c     (0,0,a)
a a c a a c a b c a b a a a c     (1,1,c)
a a c a a c a b c a b a a a c     (3,4,b)
a a c a a c a b c a b a a a c     (3,3,a)
a a c a a c a b c a b a a a c     (1,2,c)

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids
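
A compact C sketch of the LZ78 coding loop; the dictionary trie is stored as (parent id, char) pairs and searched linearly for brevity (a real implementation would index the children, and the MAXD bound is purely illustrative):

#include <stdio.h>

#define MAXD 1024
int  par[MAXD];
char ch[MAXD];
int  dsize = 1;                                /* entry 0 = empty string               */

/* Return the id of the child of node cur labelled c, or 0 if it does not exist. */
int child(int cur, char c) {
    for (int i = 1; i < dsize; i++)
        if (par[i] == cur && ch[i] == c) return i;
    return 0;
}

void lz78_encode(const char *s) {
    int cur = 0;
    for (int i = 0; s[i]; i++) {
        int nxt = child(cur, s[i]);
        if (nxt) { cur = nxt; continue; }               /* keep extending the match S  */
        printf("(%d,%c) ", cur, s[i]);                  /* output (id of S, next char) */
        par[dsize] = cur; ch[dsize] = s[i]; dsize++;    /* add Sc to the dictionary    */
        cur = 0;
    }
    if (cur) printf("(%d,-)", cur);                     /* input ended inside a phrase */
    printf("\n");
}

int main(void) {
    lz78_encode("aabaacabcabcb");    /* (0,a) (1,b) (1,a) (0,c) (2,c) (5,b) */
    return 0;
}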

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example

Input    Output so far            Dict
112      a
112      a a                      256=aa
113      a a b                    257=ab
256      a a b a a                258=ba
114      a a b a a c              259=aac
257      a a b a a c a b          260=ca
261      a a b a a c a b a b a    261=aba

Code 261 was not yet in the decoder's dictionary when it arrived: the decoder is one
step behind the coder, and this is exactly the special case handled on the previous slide.

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
We are given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L → F mapping

[Figure: the sorted BWT matrix again, with first column F = #iiiimppssss and last column L = ipssm#pissii.]

How do we map L's chars onto F's chars ?
... Need to distinguish equal chars in F...

Take two equal chars of L
Rotate their rows rightward
Same relative order !!

The BWT is invertible

[Figure: the sorted BWT matrix, with F = #iiiimppssss and L = ipssm#pissii.]

Two key properties:
1. The LF-array maps L's chars to F's chars
2. L[ i ] precedes F[ i ] in T

Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;                 // row 0 is the row starting with #
while (i>0) {
  T[i] = L[r];                // L[r] is the char that precedes F[r] in T
  r = LF[r]; i--;             // move to the row whose F-char equals L[r]
}
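
"Compute LF[0,n-1]" is left implicit in the slides; one standard way to obtain it, sketched here in C under the usual assumptions (byte alphabet, # smaller than every other char), is by counting: LF[i] = C[L[i]] + (number of occurrences of L[i] in L[0,i-1]), where C[c] counts the characters of L smaller than c.

#include <stdio.h>

/* Fill LF[0..n-1] from the BWT column L[0..n-1]. */
void compute_LF(const unsigned char *L, int n, int *LF) {
    int count[256] = {0}, C[256], occ[256] = {0};
    for (int i = 0; i < n; i++) count[L[i]]++;
    int sum = 0;                           /* C[c] = #chars of L (hence of F) smaller than c */
    for (int c = 0; c < 256; c++) { C[c] = sum; sum += count[c]; }
    for (int i = 0; i < n; i++) {          /* occ[c] = #occurrences of c in L[0,i-1]         */
        LF[i] = C[L[i]] + occ[L[i]]++;
    }
}

int main(void) {
    const unsigned char *L = (const unsigned char *)"ipssm#pissii";  /* BWT of mississippi# */
    int LF[12];
    compute_LF(L, 12, LF);
    for (int i = 0; i < 12; i++) printf("%d ", LF[i]);   /* 1 6 8 9 5 0 7 2 10 11 3 4 */
    printf("\n");
    return 0;
}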

How to compute the BWT ?

SA = 12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3

[Figure: the BWT matrix, i.e. the lexicographically sorted rotations #mississipp·i, i#mississip·p, ippi#missis·s, ..., with last column L = ipssm#pissii.]

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]
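
In code that one-line relation reads as follows (a hedged sketch with 0-based indices, so the rotation starting at the first text position takes the final # as its L-char):

#include <stdio.h>

/* Build the BWT column L from T[0..n-1] (ending with #) and its suffix array SA. */
void bwt_from_sa(const char *T, int n, const int *SA, char *L) {
    for (int i = 0; i < n; i++)
        L[i] = T[(SA[i] + n - 1) % n];    /* the char preceding the i-th smallest suffix */
    L[n] = '\0';
}

int main(void) {
    const char *T = "mississippi#";
    int SA[] = {11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2};   /* 0-based version of the slide's SA */
    char L[13];
    bwt_from_sa(T, 12, SA, L);
    printf("%s\n", L);                    /* ipssm#pissii */
    return 0;
}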

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?

 It is the largest artifact ever conceived by humans

 Exploit the structure of the Web for
   Crawl strategies
   Search
   Spam detection
   Discovering communities on the web
   Classification/organization

 Predict the evolution of the Web
   Sociological understanding

Many other large graphs…

 Physical network graph
   V = Routers,  E = communication links

 The "cosine" graph (undirected, weighted)
   V = static web pages,  E = semantic distance between pages

 Query-Log graph (bipartite, weighted)
   V = queries and URLs,  E = (q,u) if u is a result for q, and has been clicked by some user who issued q

 Social graph (undirected, unweighted)
   V = users,  E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)
 V = URLs,  E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN & no OUT)

Three key properties:
 Skewed distribution: Pb that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution

[Figure: in-degree distributions of the Altavista crawl (1999) and the WebBase crawl (2001).]

Indegree follows a power law distribution:
Pr[ in-degree(u) = k ]  ∝  1/k^a,    a ≈ 2.1
A Picture of the Web Graph

Definition
Directed graph G = (V,E)
 V = URLs,  E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN, no OUT)

Three key properties:
 Skewed distribution: Pb that a node has x links is 1/x^a, a ≈ 2.1
 Locality: usually most of the hyperlinks point to other URLs on the same host (about 80%).
 Similarity: pages close in lexicographic order tend to share many outgoing lists

A Picture of the Web Graph

[Figure: the adjacency matrix of a crawl with 21 million pages and 150 million links.]

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph

Uncompressed adjacency list  →  Adjacency list with compressed gaps   (exploits locality)

Successor list S(x) = {s1 - x, s2 - s1 - 1, ..., sk - sk-1 - 1}

For negative entries:
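
A small C sketch of the gap transformation written above; the gaps would then be fed to a variable-length integer coder such as the γ-code seen earlier (the node id and successor list below are made up for illustration, and the mapping WebGraph applies to the possibly negative first gap is not reproduced here):

#include <stdio.h>

/* Turn the sorted successor list succ[0..k-1] of node x into gaps:
   succ[0]-x, succ[1]-succ[0]-1, ..., succ[k-1]-succ[k-2]-1. */
void gaps(int x, const int *succ, int k, int *out) {
    for (int i = 0; i < k; i++)
        out[i] = (i == 0) ? succ[0] - x : succ[i] - succ[i - 1] - 1;
}

int main(void) {
    int succ[] = {13, 15, 16, 17, 18, 19, 23, 24, 203};   /* hypothetical successors of node 15 */
    int g[9];
    gaps(15, succ, 9, g);
    for (int i = 0; i < 9; i++) printf("%d ", g[i]);      /* -2 1 0 0 0 0 3 0 178 */
    printf("\n");
    return 0;
}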

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of the copy-list of y tells whether the corresponding successor of the reference x is also a successor of y;
The reference index is chosen in [0,W] as the one that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity (≥ 3) in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background

[Figure: sender and receiver connected by a network link; the receiver already holds some knowledge about the data.]

 network links are getting faster and faster, but
 many clients are still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques

 caching: "avoid sending the same object again"
   done on the basis of objects
   only works if objects are completely unchanged
   How about objects that are slightly changed?

 compression: "remove redundancy in transmitted data"
   avoid repeated substrings in data
   can be extended to the history of past transmissions (overhead)
   How if the sender has never seen the data at the receiver ?

Types of Techniques

 Common knowledge between sender & receiver
   Unstructured file: delta compression

 "partial" knowledge
   Unstructured files: file synchronization
   Record-based data: set reconciliation
Formalization

 Delta compression     [diff, zdelta, REBL,…]
   Compress file f deploying file f'
   Compress a group of files
   Speed-up web access by sending the differences between the requested page and the ones available in cache

 File synchronization     [rsync, zsync]
   Client updates an old file fold with the fnew available on a server
   Mirroring, Shared Crawling, Content Distribution Networks

 Set reconciliation
   Client updates a structured old file fold with the fnew available on a server
   Update of contacts or appointments, intersect inverted lists in a P2P search engine

Z-delta compression     (one-to-one)

Problem: We have two files fknown and fnew and the goal is to compute a file fd of minimum size such that fnew can be derived from fknown and fd

 Assume that block moves and copies are allowed
 Find an optimal covering set of fnew based on fknown
 The LZ77-scheme provides an efficient, optimal solution
   fknown is the "previously encoded text": compress the concatenation fknown·fnew, starting from fnew
 zdelta is one of the best implementations

           Emacs size    Emacs time
uncompr    27Mb          ---
gzip       8Mb           35 secs
zdelta     1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: a pair of proxies located on each side of the slow link use a proprietary protocol to increase performance over this link

[Figure: Client ↔ (slow link, delta-encoding) ↔ Proxy ↔ (fast link) ↔ web; the reference page is cached on both sides, so only the request and the delta cross the slow link.]

Use zdelta to reduce traffic:
 Old version available at both proxies
 Restricted to pages already visited (30% hits), URL-prefix match
 Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F
 Useful on a dynamic collection of web pages, back-ups, …

 Apply pairwise zdelta: find for each f ∈ F a good reference

 Reduction to the Min Branching problem on DAGs
   Build a weighted graph GF, nodes = files, weights = zdelta-size
   Insert a dummy node connected to all, whose edge weights are the gzip-coding sizes
   Compute the min branching = directed spanning tree of min total cost, covering G's nodes

[Figure: a small example graph with a dummy node 0 and edge weights such as 20, 123, 220, 620, 2000.]

           space    time
uncompr    30Mb     ---
tgz        20%      linear
THIS       8%       quadratic

Improvement

What about many-to-one compression?     (group of files)

Problem: Constructing G is very costly, n² edge calculations (zdelta executions)
 We wish to exploit some pruning approach

 Collection analysis: Cluster the files that appear similar and thus are good candidates for zdelta-compression. Build a sparse weighted graph G'F containing only edges between those pairs of files

 Assign weights: Estimate appropriate edge weights for G'F, thus saving zdelta executions. Nonetheless, still n² time

           space     time
uncompr    260Mb     ---
tgz        12%       2 mins
THIS       8%        16 mins

Algoritmi per IR

File Synchronization

File synch: The problem

[Figure: the Client holds f_old and sends a request; the Server holds f_new and sends back an update.]

 the client wants to update an out-dated file
 the server has the new file but does not know the old file
 update without sending the entire f_new (using similarity)
 rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch, since the server has both copies of the files

The rsync algorithm

[Figure: the Client sends the block hashes of f_old to the Server; the Server, holding f_new, sends back the encoded file.]

The rsync algorithm     (contd)

 simple, widely used, single roundtrip
 optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
 choice of block size problematic (default: max{700, √n} bytes)
 not good in theory: granularity of changes may disrupt use of blocks
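
To illustrate the "rolling" ingredient, here is a toy C sketch of a rolling checksum in the spirit of (but much weaker than) the one rsync pairs with MD5: the hash of the next window is obtained in O(1) from the previous one.

#include <stdio.h>
#include <string.h>

/* Slide a window of length B over s and print a toy rolling checksum
   (the sum of the bytes in the window, updated in O(1) per shift). */
void rolling(const unsigned char *s, int n, int B) {
    unsigned h = 0;
    for (int i = 0; i < B; i++) h += s[i];
    printf("window at 0: %u\n", h);
    for (int i = B; i < n; i++) {
        h += s[i];                 /* add the byte entering the window  */
        h -= s[i - B];             /* drop the byte leaving the window  */
        printf("window at %d: %u\n", i - B + 1, h);
    }
}

int main(void) {
    const char *s = "the quick brown fox";
    rolling((const unsigned char *)s, (int)strlen(s), 4);
    return 0;
}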

Rsync: some experiments

           gcc size    emacs size
total      27288       27326
gzip       7563        8577
zdelta     227         1431
rsync      964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The Server sends the hashes (unlike the client in rsync), and the client checks them.
The Server deploys the common fref to compress the new ftar (rsync compresses just it).

A multi-round protocol

k blocks of n/k elems,  log(n/k) levels

If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts

Pattern P occurs at position i of T
iff
P is a prefix of the i-th suffix of T (i.e. T[i,N])

Occurrences of P in T = All suffixes of T having P as a prefix

Example:  P = si,  T = mississippi  →  occurrences at positions 4, 7

SUF(T) = Sorted set of suffixes of T

Reduction: from substring search to prefix search

The Suffix Tree

T# = mississippi#

[Figure: the suffix tree of T#; edges are labeled with substrings (e.g. ssi, ppi#, pi#, i#, mississippi#, #), and the 12 leaves are labeled with the starting positions of the corresponding suffixes.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

T = mississippi#
SUF(T) stored explicitly takes Θ(N²) space; the Suffix Array SA stores only the suffix pointers:

SA = 12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3
(the sorted suffixes are  #, i#, ippi#, issippi#, ississippi#, mississippi#, pi#, ppi#, sippi#, sissippi#, ssippi#, ssissippi#)

Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA:  O(p) time per suffix comparison

[Figure: binary search over SA for P = si in T = mississippi#; at this step P is larger than the probed suffix; 2 accesses per step.]

Searching a pattern
Indirect binary search on SA:  O(p) time per suffix comparison

[Figure: at a later step of the same search, P = si is smaller than the probed suffix.]

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
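
A hedged C sketch of the indirect binary search: it returns the first SA position whose suffix is ≥ P, with an O(p) string comparison per step; the occurrences of P then follow contiguously, as in the next slide.

#include <stdio.h>
#include <string.h>

/* Smallest i such that the suffix T[SA[i]..] is >= P (0-based indices). */
int sa_lower_bound(const char *T, const int *SA, int n, const char *P) {
    size_t p = strlen(P);
    int lo = 0, hi = n;                       /* search within SA[lo..hi)            */
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (strncmp(T + SA[mid], P, p) < 0)   /* O(p) comparison against one suffix  */
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

int main(void) {
    const char *T = "mississippi#";
    int SA[] = {11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2};   /* 0-based suffix array */
    int i = sa_lower_bound(T, SA, 12, "si");
    printf("first suffix >= \"si\": SA[%d] = %d\n", i, SA[i]);   /* SA[8] = 6, i.e. sippi# */
    return 0;
}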

Locating the occurrences

T = mississippi#,  P = si:  binary-search for the two boundary patterns si# and si$ (where # < Σ < $); the suffixes prefixed by si (sippi#, sissippi#) form a contiguous range of SA, here containing the text positions 7 and 4, so occ = 2.

Suffix Array search
• O(p + log2 N + occ) time

Suffix Trays: O(p + log2 |Σ| + occ)    [Cole et al., '06]
String B-tree    [Ferragina-Grossi, '95]
Self-adjusting Suffix Arrays    [Ciriani et al., '02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

T = mississippi#
SA  = 12 11 8 5 2 1 10 9 7 4 6 3
Lcp = 0  0  1 4 0 0 1  0 2 1 3
(e.g. the adjacent suffixes issippi... and ississippi... share a prefix of length 4)

• How long is the common prefix between T[i,...] and T[j,...] ?
  Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  Search for some Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  Search for a subarray Lcp[i,i+C-2] whose entries are all ≥ L.


Slide 39

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with i-th bit of U(T[j]) to estabilish if both are true

An example j=1
1 2 3 4 5 6 7 8 9 10

n
m

1

12345

1

0

P=abaac

2

0

3

0

4

0

5

0

T=xabxabaaca

0
 
0
U (x)  0
 
0
 
0

2

3

4

5

6

7

1 
 
0
BitShift ( M ( 0 )) & U (T [1])   0  &
 
0
 
0

8

9

0 0
   
0 0
0  0
   
0 0
   
0 0

1
0

An example j=2
1 2 3 4 5 6 7 8 9 10

n
m

1

2

12345

1

0

1

P=abaac

2

0

0

3

0

0

4

0

0

5

0

0

T=xabxabaaca

1 
 
0
U (a )   1 
 
1 
 
0

3

4

5

6

7

1 
 
0
BitShift ( M (1)) & U (T [ 2 ])   0  &
 
0
 
0

8

9

1  1 
   
0 0
1    0 
   
1   0 
   
0 0

1
0

An example j=3
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

12345

1

0

1

0

P=abaac

2

0

0

1

3

0

0

0

4

0

0

0

5

0

0

0

T=xabxabaaca

0
 
1 
U (b )   0 
 
0
 
0

4

5

6

7

1 
 
1 
BitShift ( M ( 2 )) & U (T [ 3 ])   0  &
 
0
 
0

8

9

0 0
   
1  1 
0  0
   
0 0
   
0 0

1
0

An example j=9
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

4

5

6

7

8

9

12345

1

0

1

0

0

1

0

1

1

0

P=abaac

2

0

0

1

0

0

1

0

0

0

3

0

0

0

0

0

0

1

0

0

4

0

0

0

0

0

0

0

1

0

5

0

0

0

0

0

0

0

0

1

T=xabxabaaca

0
 
0
U (c )   0 
 
0
 
1 

1 
 
1 
BitShift ( M ( 8 )) & U (T [ 9 ])   0  &
 
0
 
1 

0 0
   
0 0
0  0
   
0 0
   
1  1 

1
0

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal: given k, find all the occurrences of P
in T with up to k mismatches.
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
there are no more than l mismatches between the
first i characters of P and the i characters of T
ending at character j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1

 The first i-1 characters of P match a substring of T ending
   at j-1, with at most l mismatches, and the next pair of
   characters in P and T are equal.

   [figure: P[1..i-1] aligned against T[j-i+1..j-1] with at most l mismatches, and P[i]=T[j]]

   BitShift( Ml( j-1 ) ) & U( T[j] )

Computing Ml: case 2

 The first i-1 characters of P match a substring of T ending
   at j-1, with at most l-1 mismatches.

   [figure: P[1..i-1] aligned against T[j-i+1..j-1] with at most l-1 mismatches]

   BitShift( Ml-1( j-1 ) )

Computing Ml

 We compute Ml for all l=0, … ,k.
   For each j compute M0(j), M1(j), … , Mk(j)
 For all l initialize Ml(0) to the zero vector.
 In order to compute Ml(j), we observe that there is a
   match iff case 1 or case 2 holds, hence

   Ml( j ) = [ BitShift( Ml( j-1 ) ) & U( T[j] ) ]  OR  BitShift( Ml-1( j-1 ) )
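A sketch of the k-mismatch recurrence above, again assuming m ≤ w so each column fits in a word (the function name is mine):

def agrep_mismatches(T, P, k):
    # M[l] holds column j of Ml as an integer mask; bit i-1 <-> row i.
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M = [0] * (k + 1)               # Ml(0) = zero vector for every l
    occ = []
    for j, c in enumerate(T):
        Uc = U.get(c, 0)
        prev = 0                     # BitShift(M^{l-1}(j-1)); empty for l = 0
        for l in range(k + 1):
            shifted = (M[l] << 1) | 1
            M[l], prev = (shifted & Uc) | prev, shifted   # case 1 OR case 2
        if M[k] & (1 << (m - 1)):    # P ends at j with at most k mismatches
            occ.append(j - m + 1)
    return occ

# Slide example below: P = abaad matches T = xabxabaaca with 1 mismatch
# ending at position 9, i.e. starting at 0-based position 4.
print(agrep_mismatches("xabxabaaca", "abaad", 1))   # [4]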

Example M1
T = xabxabaaca,  P = abaad

        j: 1 2 3 4 5 6 7 8 9 10
M1 = i=1:  1 1 1 1 1 1 1 1 1 1
     i=2:  0 0 1 0 0 1 0 1 1 0
     i=3:  0 0 0 1 0 0 1 0 0 1
     i=4:  0 0 0 0 1 0 0 1 0 0
     i=5:  0 0 0 0 0 0 0 0 1 0

        j: 1 2 3 4 5 6 7 8 9 10
M0 = i=1:  0 1 0 0 1 0 1 1 0 1
     i=2:  0 0 1 0 0 1 0 0 0 0
     i=3:  0 0 0 0 0 0 1 0 0 0
     i=4:  0 0 0 0 0 0 0 1 0 0
     i=5:  0 0 0 0 0 0 0 0 0 0

The 1 in M1(5,9) signals an occurrence of P ending at position 9 of T with at most 1 mismatch.

How much do we pay?

 The running time is O(kn(1+m/w)).
 Again, the method is practically efficient for small m.
 Still, only O(k) columns of M are needed at any
   given time. Hence, the space used by the
   algorithm is O(k) memory words.

Problem 3: Solution
Dictionary = {bzip, not, or, space},  S = “bzip or not bzip”,  P = bot, k = 2

[figure: the word-based Huffman tree and the compressed text C(S), with yes/no marks at
 the codeword boundaries]

not = 1g 0g 0a

Agrep: more sophisticated operations

 The Shift-And method can solve other ops
 The edit distance between two strings p and s is
   d(p,s) = minimum number of operations
   needed to transform p into s via three ops:
     Insertion: insert a symbol in p
     Deletion: delete a symbol from p
     Substitution: change a symbol in p with a different one
   Example: d(ananas,banane) = 3

Search by regular expressions

 Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding

   γ(x) = 0^(Length-1) followed by x in binary

 x > 0 and Length = ⌊log2 x⌋ + 1
   e.g., 9 is represented as <000,1001>.
 γ-code for x takes 2⌊log2 x⌋ + 1 bits
   (i.e. a factor of 2 from optimal)
 Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers

It is a prefix-free encoding…

 Given the following sequence of γ-coded
   integers, reconstruct the original sequence:

   0001000001100110000011101100111

   Answer: 8  6  3  59  7
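A small Python sketch of γ-encoding/decoding (function names are mine), which reproduces the exercise above:

def gamma_encode(x):
    # gamma(x): (Length-1) zeros followed by x in binary, x > 0.
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":          # count the unary Length-1 prefix
            zeros += 1
            i += 1
        out.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return out

print(gamma_decode("0001000001100110000011101100111"))   # [8, 6, 3, 59, 7]
print("".join(gamma_encode(x) for x in [8, 6, 3, 59, 7]))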

Analysis
Sort the pi in decreasing order, and encode si via
the variable-length code γ(i).

Recall that: |γ(i)| ≤ 2 * log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
1 ≥ Σi=1,...,x pi ≥ x * px   ⇒   x ≤ 1/px

How good is it ?
Encode the integers via γ-coding:
|γ(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):

   Σi=1,…,|S| pi * |γ(i)|  ≤  Σi=1,…,|S| pi * [ 2 * log (1/pi) + 1 ]  =  2 * H0(X) + 1

Not much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2

 A new concept: Continuers vs Stoppers
   Previously we used: s = c = 128

 The main idea is:
   s + c = 256 (we are playing with 8 bits)
   Thus s items are encoded with 1 byte
   And s*c with 2 bytes, s*c^2 with 3 bytes, ...

An example

 5000 distinct words
 ETDC encodes 128 + 128^2 = 16512 words on 2 bytes
 A (230,26)-dense code encodes 230 + 230*26 = 6210 words on 2
   bytes, hence more on 1 byte, and thus wins if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 256-s.

 Brute-force approach
 Binary search:
   On real distributions, it seems there is a unique minimum

Ks = max codeword length
Fsk = cum. prob. of symbols whose |cw| <= k

Experiments: (s,c)-DC quite interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

 Exploits temporal locality, and it is dynamic

 X = 1^n 2^n 3^n … n^n  ⇒  Huff = Θ(n^2 log n), MTF = O(n log n) + n^2

Not much worse than Huffman
...but it may be far better
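A minimal MTF sketch (0-based positions; function names are mine):

def mtf_encode(text, alphabet):
    # Output the current position of each symbol, then move it to the front.
    L, out = list(alphabet), []
    for s in text:
        i = L.index(s)
        out.append(i)
        L.pop(i)
        L.insert(0, s)
    return out

def mtf_decode(codes, alphabet):
    L, out = list(alphabet), []
    for i in codes:
        s = L[i]
        out.append(s)
        L.pop(i)
        L.insert(0, s)
    return "".join(out)

codes = mtf_encode("abbbaacccca", "abcd")
print(codes)                          # runs of equal symbols become runs of 0s
print(mtf_decode(codes, "abcd"))      # abbbaacccca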

MTF: how good is it ?
Encode the integers via γ-coding:  |γ(i)| ≤ 2 * log i + 1
Put S in the front and consider the cost of encoding
(pxi denotes the position of the i-th occurrence of symbol x):

   O(|S| log |S|)  +  Σx=1,…,|S|  Σi=2,…,nx  |γ( pxi - pxi-1 )|

By Jensen’s inequality this is:

   ≤  O(|S| log |S|)  +  Σx=1,…,|S|  nx * [ 2 * log (N/nx) + 1 ]
   =  O(|S| log |S|)  +  N * [ 2 * H0(X) + 1 ]

   ⇒  La[mtf]  ≤  2 * H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as the symbols to be encoded
How to keep efficiently the MTF-list:

 Search tree
   Leaves contain the symbols, ordered as in the MTF-list
   Nodes contain the size of their descending subtree

 Hash Table
   key is a symbol
   data is a pointer to the corresponding tree leaf

 Each tree operation takes O(log |S|) time
 Total is O(n log |S|), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just the run lengths and one starting bit
Properties:

 Exploits spatial locality, and it is a dynamic code
 There is a memory

 X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = Θ(n^2 log n)  >  Rle(X) = Θ(n log n)
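A minimal RLE sketch reproducing the example above (function names are mine):

def rle_encode(s):
    # Collapse each maximal run into a (symbol, length) pair.
    out = []
    for c in s:
        if out and out[-1][0] == c:
            out[-1] = (c, out[-1][1] + 1)
        else:
            out.append((c, 1))
    return out

def rle_decode(runs):
    return "".join(c * n for c, n in runs)

print(rle_encode("abbbaacccca"))               # [('a',1),('b',3),('a',2),('c',4),('a',1)]
print(rle_decode(rle_encode("abbbaacccca")))   # abbbaacccca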

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.  p(a) = .2,  p(b) = .5,  p(c) = .3

   f(i) = Σj<i p(j)     so    f(a) = .0, f(b) = .2, f(c) = .7

[figure: the unit interval [0,1) split into a = [0,.2), b = [.2,.7), c = [.7,1)]

The interval for a particular symbol will be called
the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac

[figure: the interval is narrowed symbol by symbol:
  b: [0,1) -> [.2,.7)     a: [.2,.7) -> [.2,.3)     c: [.2,.3) -> [.27,.3) ]

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities p[c] use the following:

   l0 = 0      li = li-1 + si-1 * f[ci]
   s0 = 1      si = si-1 * p[ci]

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

   sn = Πi=1,…,n p[ci]

The interval for a message sequence will be called the
sequence interval
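A float-based sketch of the interval recurrence above (real coders use the integer version discussed later; function and variable names are mine):

def sequence_interval(msg, p):
    # Compute the sequence interval [l, l+s) for msg under distribution p.
    symbols = sorted(p)                  # fixed symbol order: a, b, c, ...
    f, acc = {}, 0.0
    for c in symbols:                    # f[c] = cumulative prob. below c
        f[c] = acc
        acc += p[c]
    l, s = 0.0, 1.0
    for c in msg:
        l = l + s * f[c]
        s = s * p[c]
    return l, l + s

# Slide example: p(a)=.2, p(b)=.5, p(c)=.3 and message "bac" -> [.27, .3)
print(sequence_interval("bac", {"a": .2, "b": .5, "c": .3}))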

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:

[figure: .49 lies in b = [.2,.7), then in b = [.3,.55), then in c = [.475,.55)]

The message is bbc.

Representing a real number
Binary fractional representation:

   .75 = .11      1/3 = .010101…      11/16 = .1011

Algorithm
1. x = 2 * x
2. If x < 1 output 0
3. else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01    [.33,.66) = .1    [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.

           min       max       interval
   .11     .110…     .111…     [.75, 1.0)
   .101    .1010…    .1011…    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

[figure: sequence interval [.61,.79) contains the code interval of .101, i.e. [.625,.75)]

Can use L + s/2 truncated to 1 + ⌈log (1/s)⌉ bits

Bound on Arithmetic length

Note that -log s + 1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most

   1 + ⌈log (1/s)⌉  =  1 + ⌈log Πi (1/pi)⌉
                    ≤  2 + Σi=1,…,n log (1/pi)
                    =  2 + Σk=1,…,|S| n pk log (1/pk)
                    =  2 + n H0   bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers are expensive.
Key Ideas of the integer version:

 Keep integers in range [0..R) where R = 2^k
 Use rounding to generate integer intervals
 Whenever the sequence interval falls into the top,
   bottom or middle half, expand the interval
   by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
   Output 1 followed by m 0s
   m = 0
   Message interval is expanded by 2

If u < R/2 then (bottom half)
   Output 0 followed by m 1s
   m = 0
   Message interval is expanded by 2

If l ≥ R/4 and u < 3R/4 then (middle half)
   Increment m
   Message interval is expanded by 2

In all other cases,
just continue...

You find this at

Arithmetic ToolBox
As a state machine

[figure: the ATB maps the current interval (L,s) and a symbol c, with distribution
 (p1,…,p|S|), to the narrowed interval (L’,s’)]

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox

[figure: at each step the ATB narrows the interval (L,s) to (L’,s’) using the conditional
 distribution p[ s | context ], where s = c or esc]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
String = ACCBACCACBA B      k=2

Context Empty:   A = 4   B = 2   C = 5   $ = 3

Context A:   C = 3   $ = 1
Context B:   A = 2   $ = 1
Context C:   A = 1   B = 2   C = 2   $ = 3

Context AC:  B = 1   C = 2   $ = 2
Context BA:  C = 1   $ = 1
Context CA:  C = 1   $ = 1
Context CB:  A = 2   $ = 1
Context CC:  A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!

LZ77

[figure: text  a a c a a c a b c a b a b a c  with the dictionary (all substrings starting
 before the cursor) and the cursor; the triple emitted here is <2,3,c>]

Algorithm’s step:

 Output <d, len, c>
   d = distance of copied string wrt current position
   len = length of longest match
   c = next char in text beyond longest match
 Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
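A decoding sketch for (d, len, c) triples, copying one character at a time so that overlapping copies work exactly as in the loop above (the function name is mine):

def lz77_decode(triples):
    out = []
    for d, length, c in triples:
        start = len(out) - d
        for i in range(length):          # character-by-character copy handles len > d
            out.append(out[start + i])
        out.append(c)
    return "".join(out)

# The windowed example above decodes back to the original text:
print(lz77_decode([(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')]))
# aacaacabcabaaac
# The overlap case: seen 'abcd', then (2,9,'e') -> abcdcdcdcdcdce
print(lz77_decode([(0,0,'a'), (0,0,'b'), (0,0,'c'), (0,0,'d'), (2,9,'e')]))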

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb
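An LZ78 encoder sketch that reproduces the example above (a plain dict stands in for the trie of the slides; the function name is mine):

def lz78_encode(text):
    dictionary = {"": 0}
    out, i = [], 0
    while i < len(text):
        phrase = ""
        # extend the current phrase while it is still in the dictionary
        while i < len(text) and phrase + text[i] in dictionary:
            phrase += text[i]
            i += 1
        c = text[i] if i < len(text) else ""
        out.append((dictionary[phrase], c))            # (id of longest match, next char)
        dictionary[phrase + c] = len(dictionary)       # new id = next integer
        i += 1
    return out

print(lz78_encode("aabaacabcabcb"))
# [(0,'a'), (1,'b'), (1,'a'), (0,'c'), (2,'c'), (5,'b')]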

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform                                   (1994)
Let us be given a text T = mississippi#

   Rotations of T        Sort the rows        F              L
   mississippi#                               #  mississipp  i
   ississippi#m                               i  #mississip  p
   ssissippi#mi                               i  ppi#missis  s
   sissippi#mis                               i  ssippi#mis  s
   issippi#miss                               i  ssissippi#  m
   ssippi#missi                               m  ississippi  #
   sippi#missis              ⇒                p  i#mississi  p
   ippi#mississ                               p  pi#mississ  i
   ppi#mississi                               s  ippi#missi  s
   pi#mississip                               s  issippi#mi  s
   i#mississipp                               s  sippi#miss  i
   #mississippi                               s  sissippi#m  i

A famous example
Much longer...

A useful tool: L → F mapping

[figure: the same sorted-rotations matrix as above, with the F and L columns highlighted;
 T is unknown to the decoder]

How do we map L’s chars onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible

[figure: the same sorted-rotations matrix, with the F and L columns; T is unknown]

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T

Reconstruct T backward:
   T = .... i ppi #

InvertBWT(L)
  Compute LF[0,n-1];
  r = 0; i = n;
  while (i > 0) {
    T[i] = L[r];
    r = LF[r]; i--;
  }

How to compute the BWT ?

   SA     BWT matrix      L
   12     #mississipp     i
   11     i#mississip     p
    8     ippi#missis     s
    5     issippi#mis     s
    2     ississippi#     m
    1     mississippi     #
   10     pi#mississi     p
    9     ppi#mississ     i
    7     sippi#missi     s
    4     sissippi#mi     s
    6     ssippi#miss     i
    3     ssissippi#m     i

We said that: L[i] precedes F[i] in T

   L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]
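A sketch of the quadratic rotation-sorting construction and of the LF-based inversion (function names are mine; it assumes T ends with a unique smallest end-marker #):

def bwt(T):
    # The "elegant but inefficient" construction: sort all rotations of T.
    rotations = sorted(T[i:] + T[:i] for i in range(len(T)))
    return "".join(row[-1] for row in rotations)

def inverse_bwt(L):
    # LF-mapping inversion, as in InvertBWT above. A stable sort of L gives F;
    # LF[r] is the row of F holding the same character occurrence as L[r].
    n = len(L)
    order = sorted(range(n), key=lambda r: (L[r], r))
    LF = [0] * n
    for f_pos, r in enumerate(order):
        LF[r] = f_pos
    out, r = [], 0
    for _ in range(n):            # walking LF from row 0 spells row 0 = "#" + T[:-1]
        out.append(L[r])
        r = LF[r]
    row0 = "".join(reversed(out))
    return row0[1:] + row0[0]     # rotate the end-marker back to the end

print(bwt("mississippi#"))           # ipssm#pissii
print(inverse_bwt("ipssm#pissii"))   # mississippi#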

How to construct SA from T ?

   SA
   12   #
   11   i#
    8   ippi#
    5   issippi#
    2   ississippi#
    1   mississippi#
   10   pi#
    9   ppi#
    7   sippi#
    4   sissippi#
    6   ssippi#
    3   ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n^2 log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus γ(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics

 Size
   1 trillion pages available (Google 7/08)
   5-40K per page => hundreds of terabytes
   Size grows every day!!

 Change
   8% new pages, 25% new links change weekly
   Lifetime of about 10 days

The Bow Tie

Some definitions

 Weakly connected components (WCC)
   Set of nodes such that from any node one can go to any other node via an
   undirected path.

 Strongly connected components (SCC)
   Set of nodes such that from any node one can go to any other node via a
   directed path.

[figure: a WCC and an SCC]

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?

 Largest artifact ever conceived by humans
 Exploit the structure of the Web for
   Crawl strategies
   Search
   Spam detection
   Discovering communities on the web
   Classification/organization
 Predict the evolution of the Web
   Sociological understanding

Many other large graphs…

 Physical network graph
   V = Routers
   E = communication links

 The “cosine” graph (undirected, weighted)
   V = static web pages
   E = semantic distance between pages

 Query-Log graph (bipartite, weighted)
   V = queries and URLs
   E = (q,u) if u is a result for q, and has been clicked by some
       user who issued q

 Social graph (undirected, unweighted)
   V = users
   E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)

 V = URLs, E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN & no OUT)

Three key properties:

 Skewed distribution: Pb that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution

[figure: in-degree distributions from the Altavista crawl (1999) and the WebBase crawl (2001);
 the indegree follows a power-law distribution]

   Pr[ in-degree(u) = k ]  ∝  1/k^α ,   α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)

 V = URLs, E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN, no OUT)

Three key properties:

 Skewed distribution: Pb that a node has x links is 1/x^α, α ≈ 2.1
 Locality: usually most of the hyperlinks point to other URLs on
   the same host (about 80%).
 Similarity: pages close in lexicographic order tend to share
   many outgoing lists

A Picture of the Web Graph

[figure: the adjacency matrix of a crawl with 21 million pages and 150 million links,
 with URLs sorted (Berkeley, Stanford)]

URL compression + Delta encoding

The library WebGraph

 Uncompressed adjacency list
 Adjacency list with compressed gaps   (locality)

   Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

   For negative entries:
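A sketch of the gap encoding of a successor list; the 2g / 2|g|-1 fold for the (possibly negative) first entry is borrowed from the Extra-nodes slide below, and applying it only to the first gap is my assumption about where WebGraph uses it:

def gap_encode(x, successors):
    # Successor list S(x) = {s1-x, s2-s1-1, ..., sk-s(k-1)-1} as in the slide.
    s = sorted(successors)
    first = s[0] - x
    folded = 2 * first if first >= 0 else 2 * (-first) - 1   # fold negative first gap
    return [folded] + [s[i] - s[i - 1] - 1 for i in range(1, len(s))]

# A node x = 15 with successors {13,15,16,17,18,19,23,24,203} (illustrative values):
print(gap_encode(15, [13, 15, 16, 17, 18, 19, 23, 24, 203]))
# [3, 1, 0, 0, 0, 0, 3, 0, 178]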

Copy-lists                                   (similarity)

Reference chains, possibly limited

[figure: the uncompressed adjacency lists vs. the adjacency lists with copy lists]

Each bit of y’s copy list tells whether the corresponding successor of the reference x is
also a successor of y;
The reference index is chosen in [0,W] so as to give the best compression.

Copy-blocks = RLE(Copy-list)

[figure: the adjacency lists with copy lists vs. the adjacency lists with copy blocks
 (RLE on the bit sequences)]

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with copy blocks  ->  consecutivity in extra-nodes

 Intervals: use their left extreme and length
   Int. length: decremented by Lmin = 2
 Residuals: differences between residuals, or w.r.t. the source

   0 = (15-15)*2    (positive)
   2 = (23-19)-2    (jump >= 2)
   600 = (316-16)*2
   3 = |13-15|*2-1  (negative)
   3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background

[figure: a sender and a receiver connected by a network link; the receiver already has
 some knowledge about the data]

 network links are getting faster and faster but
 many clients are still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques

 caching: “avoid sending the same object again”
   done on the basis of objects
   only works if objects are completely unchanged
   How about objects that are slightly changed?

 compression: “remove redundancy in transmitted data”
   avoid repeated substrings in data
   can be extended to the history of past transmissions   (overhead)
   How if the sender has never seen the data at the receiver ?

Types of Techniques

 Common knowledge between sender & receiver
   Unstructured file: delta compression

 “partial” knowledge
   Unstructured files: file synchronization
   Record-based data: set reconciliation

Formalization

 Delta compression                                  [diff, zdelta, REBL,…]
   Compress file f deploying file f’
   Compress a group of files
   Speed-up web access by sending differences between the requested
     page and the ones available in cache

 File synchronization                               [rsynch, zsync]
   Client updates old file fold with fnew available on a server
   Mirroring, Shared Crawling, Content Distr. Net

 Set reconciliation
   Client updates structured old file fold with fnew available on a server
   Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression                     (one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd

 Assume that block moves and copies are allowed
 Find an optimal covering set of fnew based on fknown
 The LZ77-scheme provides an efficient, optimal solution
   fknown is the “previously encoded text”: compress fknown·fnew starting from fnew
 zdelta is one of the best implementations

             Emacs size    Emacs time
  uncompr    27Mb          ---
  gzip       8Mb           35 secs
  zdelta     1.5Mb         42 secs
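The same principle can be toyed with in Python via zlib's preset dictionary, which seeds the LZ77 window with fknown; this is only a sketch of the idea, not of the actual zdelta tool (zlib keeps just the last 32KB of the dictionary, and names are mine):

import zlib

def delta_compress(f_known: bytes, f_new: bytes) -> bytes:
    # Copies of f_new into f_known cost only (distance, length) pairs.
    comp = zlib.compressobj(9, zlib.DEFLATED, 15, 9, zlib.Z_DEFAULT_STRATEGY, f_known)
    return comp.compress(f_new) + comp.flush()

def delta_decompress(f_known: bytes, f_d: bytes) -> bytes:
    dec = zlib.decompressobj(15, f_known)
    return dec.decompress(f_d) + dec.flush()

old = b"the quick brown fox jumps over the lazy dog " * 100
new = old.replace(b"lazy", b"sleepy")
fd = delta_compress(old, new)
print(len(new), len(zlib.compress(new, 9)), len(fd))   # delta is typically the smallest
assert delta_decompress(old, fd) == new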

Efficient Web Access
Dual proxy architecture: a pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

[figure: Client <-> client-side proxy <— slow link, delta-encoding —> server-side proxy
 <— fast link —> web]

Use zdelta to reduce traffic:

 Old version available at both proxies
 Restricted to pages already visited (30% hits), URL-prefix match
 Small cache
Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

0

620

2

20

123

2000

20
220

3

1
5

space

time

uncompr

30Mb

---

tgz

20%

linear

THIS

8%

quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n^2 edge calculations (zdelta exec)

 We wish to exploit some pruning approach
 Collection analysis: Cluster the files that appear similar and thus are
   good candidates for zdelta-compression. Build a sparse weighted graph
   G’F containing only edges between those pairs of files
 Assign weights: Estimate appropriate edge weights for G’F thus saving
   zdelta executions. Nonetheless, strictly n^2 time

             space    time
  uncompr    260Mb    ---
  tgz        12%      2 mins
  THIS       8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem

[figure: the Client holds f_old, the Server holds f_new; the client sends a request and
 gets back an update]

 client wants to update an out-dated file
 server has the new file but does not know the old file
 update without sending the entire f_new (using similarity)
 rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch,
since the server has both copies of the files

The rsync algorithm

[figure: the Client sends the hashes of f_old’s blocks to the Server, which replies with
 the encoded file built from f_new]

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
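A sketch of an rsync-style weak rolling checksum and of its O(1) sliding update (the real tool pairs it with a strong hash such as MD5; constants and names are mine):

def weak_checksum(block):
    a = sum(block)                                         # sum of bytes
    b = sum((len(block) - i) * x for i, x in enumerate(block))   # weighted sum
    return a, b

def roll(a, b, old_byte, new_byte, block_len):
    # Slide the window one byte to the right in O(1): drop old_byte, add new_byte.
    a = a - old_byte + new_byte
    b = b - block_len * old_byte + a
    return a, b

data = bytes(range(50))
k = 8
a, b = weak_checksum(data[0:k])
for i in range(1, 10):
    a, b = roll(a, b, data[i - 1], data[i - 1 + k], k)
    assert (a, b) == weak_checksum(data[i:i + k])   # rolling matches recomputation
print("rolling checksum OK")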

Rsync: some experiments

            gcc size    emacs size
  total     27288       27326
  gzip      7563        8577
  zdelta    227         1431
  rsync     964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), the client checks them
Server deploys the common fref to compress the new ftar (rsync compresses just it).

A multi-round protocol

 k blocks of n/k elems
 Log n/k levels

If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff
P is a prefix of the i-th suffix of T (i.e. T[i,N])

[figure: P aligned against T at position i, i.e. against the suffix T[i,N]]

Occurrences of P in T = All suffixes of T having P as a prefix

   P = si    T = mississippi    occurrences at 4,7

SUF(T) = Sorted set of suffixes of T

Reduction: from substring search to prefix search

The Suffix Tree

T# = mississippi#

[figure: the suffix tree of T#; edge labels are substrings of T# (e.g. i, s, p, si, ssi, i#,
 pi#, ppi#, mississippi#, #) and the 12 leaves store the starting positions of the suffixes]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic one of P.

Storing SUF(T) explicitly takes Θ(N^2) space

   SA    SUF(T)
   12    #
   11    i#
    8    ippi#
    5    issippi#
    2    ississippi#
    1    mississippi#
   10    pi#
    9    ppi#
    7    sippi#
    4    sissippi#
    6    ssippi#
    3    ssissippi#

T = mississippi#      (each SA entry is a suffix pointer)      P = si

Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison

[figure: binary search for P = si over the SA of T = mississippi#; at each step the middle
 suffix is compared with P (“P is larger”, “P is smaller”), with 2 accesses per step]

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

Improvable to O(p + log2 N) [Manber-Myers, ’90] and to O(p + log2 |S|) [Cole et al, ’06]
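A sketch of the indirect binary search (0-based indices; names are mine):

def sa_range(T, SA, P):
    # Find the range of rows of SA whose suffixes start with P.
    # Each comparison costs O(|P|), hence O(|P| log N) time overall.
    def lower(strictly_greater):
        lo, hi = 0, len(SA)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = T[SA[mid]:SA[mid] + len(P)]
            if prefix < P or (strictly_greater and prefix == P):
                lo = mid + 1
            else:
                hi = mid
        return lo
    return lower(False), lower(True)     # [left, right) range in SA

T = "mississippi#"
SA = [11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]      # 0-based suffix array of T
l, r = sa_range(T, SA, "si")
print([SA[i] + 1 for i in range(l, r)])          # 1-based occurrences of "si": [7, 4]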

Locating the occurrences

[figure: the SA of T = mississippi#; the contiguous rows whose suffixes sippi… and
 sissippi… start with P = si give occ = 2, at positions 7 and 4]

Suffix Array search
• O(p + log2 N + occ) time

Suffix Trays: O(p + log2 |S| + occ)     [Cole et al., ‘06]
String B-tree                           [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays            [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

   SA:   12 11 8 5 2 1 10 9 7 4 6 3
   Lcp:    0  0 1 4 0 0  1 0 2 1 3

T = mississippi#
(e.g. issippi and ississippi, adjacent in SA, share the prefix issi of length 4)

• How long is the common prefix between T[i,...] and T[j,...] ?
  • Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  • Search for Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  • Search for Lcp[i,i+C-2] whose entries are all ≥ L

Slide 40

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with i-th bit of U(T[j]) to estabilish if both are true

An example j=1
1 2 3 4 5 6 7 8 9 10

n
m

1

12345

1

0

P=abaac

2

0

3

0

4

0

5

0

T=xabxabaaca

0
 
0
U (x)  0
 
0
 
0

2

3

4

5

6

7

1 
 
0
BitShift ( M ( 0 )) & U (T [1])   0  &
 
0
 
0

8

9

0 0
   
0 0
0  0
   
0 0
   
0 0

1
0

An example j=2
1 2 3 4 5 6 7 8 9 10

n
m

1

2

12345

1

0

1

P=abaac

2

0

0

3

0

0

4

0

0

5

0

0

T=xabxabaaca

1 
 
0
U (a )   1 
 
1 
 
0

3

4

5

6

7

1 
 
0
BitShift ( M (1)) & U (T [ 2 ])   0  &
 
0
 
0

8

9

1  1 
   
0 0
1    0 
   
1   0 
   
0 0

1
0

An example j=3
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

12345

1

0

1

0

P=abaac

2

0

0

1

3

0

0

0

4

0

0

0

5

0

0

0

T=xabxabaaca

0
 
1 
U (b )   0 
 
0
 
0

4

5

6

7

1 
 
1 
BitShift ( M ( 2 )) & U (T [ 3 ])   0  &
 
0
 
0

8

9

0 0
   
1  1 
0  0
   
0 0
   
0 0

1
0

An example j=9
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

4

5

6

7

8

9

12345

1

0

1

0

0

1

0

1

1

0

P=abaac

2

0

0

1

0

0

1

0

0

0

3

0

0

0

0

0

0

1

0

0

4

0

0

0

0

0

0

0

1

0

5

0

0

0

0

0

0

0

0

1

T=xabxabaaca

0
 
0
U (c )   0 
 
0
 
1 

1 
 
1 
BitShift ( M ( 8 )) & U (T [ 9 ])   0  &
 
0
 
1 

0 0
   
0 0
0  0
   
0 0
   
1  1 

1
0

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1

The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

BitShift( Ml(j-1) ) & U( T[j] )

Computing Ml: case 2

The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

BitShift( Ml-1(j-1) )

Computing Ml

We compute Ml for all l = 0, … , k.
For each j compute M0(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Ml(j) = [ BitShift( Ml(j-1) ) & U(T[j]) ]  OR  BitShift( Ml-1(j-1) )

Example M1
T = xabxabaaca
P = abaad

M1 =
        j: 1 2 3 4 5 6 7 8 9 10
  i=1:     1 1 1 1 1 1 1 1 1 1
  i=2:     0 0 1 0 0 1 0 1 1 0
  i=3:     0 0 0 1 0 0 1 0 0 1
  i=4:     0 0 0 0 1 0 0 1 0 0
  i=5:     0 0 0 0 0 0 0 0 1 0

M0 =
        j: 1 2 3 4 5 6 7 8 9 10
  i=1:     0 1 0 0 1 0 1 1 0 1
  i=2:     0 0 1 0 0 1 0 0 0 0
  i=3:     0 0 0 0 0 0 1 0 0 0
  i=4:     0 0 0 0 0 0 0 1 0 0
  i=5:     0 0 0 0 0 0 0 0 0 0

How much do we pay?

The running time is O(kn(1 + m/w)).
Again, the method is practically efficient for small m.
Still, only O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words. (A sketch follows.)
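A hedged C++ sketch of the k-mismatch recurrence above, again assuming m ≤ w; the function name is illustrative.

#include <cstdint>
#include <string>
#include <vector>

// Shift-And with up to k mismatches (|P| <= 64): reports the positions where P ends
// with at most k mismatches, following the recurrence on the vectors M^0 .. M^k.
std::vector<size_t> agrep_mismatch(const std::string &T, const std::string &P, int k) {
    const size_t m = P.size();
    uint64_t U[256] = {0};
    for (size_t i = 0; i < m; ++i) U[(unsigned char)P[i]] |= 1ULL << i;
    const uint64_t last = 1ULL << (m - 1);
    std::vector<uint64_t> M(k + 1, 0), prev(k + 1, 0);   // current and previous column
    std::vector<size_t> occ;
    for (size_t j = 0; j < T.size(); ++j) {
        prev = M;
        for (int l = 0; l <= k; ++l) {
            uint64_t x = ((prev[l] << 1) | 1) & U[(unsigned char)T[j]];  // case 1: chars match
            if (l > 0) x |= (prev[l - 1] << 1) | 1;                      // case 2: spend one mismatch
            M[l] = x;
        }
        if (M[k] & last) occ.push_back(j);
    }
    return occ;
}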

Problem 3: Solution
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring, allowing k mismatches.
Dictionary: bzip, not, or
P = bot   k = 2
S = “bzip or not bzip”
[figure: the dictionary codewords and the compressed text C(S), with the matching terms marked; e.g. not = 1 g 0 g 0 a]

Agrep: more sophisticated operations

The Shift-And method can solve other ops

The edit distance between two strings p and s is
d(p,s) = minimum number of operations
needed to transform p into s via three ops:
 Insertion: insert a symbol in p
 Deletion: delete a symbol from p
 Substitution: change a symbol in p into a different one

Example: d(ananas,banane) = 3

Search by regular expressions
Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding

 g(x) = (Length-1 zeros) followed by the binary representation of x
 x > 0 and Length = ⌊log2 x⌋ + 1
 e.g., 9 is represented as <000,1001>.

 g-code for x takes 2⌊log2 x⌋ + 1 bits
 (i.e. a factor of 2 from optimal)

 Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…

Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

Answer: 8, 6, 3, 59, 7
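A short C++ sketch (not from the slides) of g-encoding and of decoding a concatenation of g-codes, matching the "Length-1 zeros followed by x in binary" rule; function names are illustrative.

#include <cstdint>
#include <string>
#include <vector>

// g-code of x > 0: (len-1) zeros followed by the len binary digits of x,
// where len = floor(log2 x) + 1.  E.g. g(9) = "000" + "1001".
std::string gamma_encode(uint64_t x) {
    int len = 0;
    for (uint64_t t = x; t; t >>= 1) ++len;
    std::string out(len - 1, '0');
    for (int i = len - 1; i >= 0; --i) out += ((x >> i) & 1) ? '1' : '0';
    return out;
}

// Decode a concatenation of g-codes, e.g. the exercise string above -> 8, 6, 3, 59, 7.
std::vector<uint64_t> gamma_decode(const std::string &bits) {
    std::vector<uint64_t> out;
    size_t i = 0;
    while (i < bits.size()) {
        size_t zeros = 0;
        while (bits[i] == '0') { ++zeros; ++i; }     // count the unary part
        uint64_t x = 0;
        for (size_t k = 0; k <= zeros; ++k, ++i)     // read zeros+1 binary digits
            x = (x << 1) | uint64_t(bits[i] - '0');
        out.push_back(x);
    }
    return out;
}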

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
1 ≥ Σi=1,...,x pi ≥ x * px   ⟹   x ≤ 1/px

How good is it ?
Encode the integers via g-coding:
|g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):

Σi=1,...,|S| pi * |g(i)|  ≤  Σi=1,...,|S| pi * [ 2 * log(1/pi) + 1 ]  =  2 * H0(X) + 1

Not much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2

A new concept: Continuers vs Stoppers
The main idea is:
 Previously we used: s = c = 128
 Now s + c = 256 (we are playing with 8 bits)
 Thus s items are encoded with 1 byte
 And s*c with 2 bytes, s*c² on 3 bytes, ...

An example

 5000 distinct words
 ETDC encodes 128 + 128² = 16512 words on 2 bytes
 A (230,26)-dense code encodes 230 + 230*26 = 6210 words on 2
bytes, hence more words on 1 byte; thus, if the distribution is skewed, it compresses better...
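A hedged C++ sketch of one possible (s,c)-dense encoder: the exact digit assignment is an assumption, but the code counts match the rule above (s ranks on 1 byte, s*c on 2 bytes, s*c² on 3 bytes, ...); with s = 128 it behaves like ETDC.

#include <cstdint>
#include <vector>

// (s,c)-dense encoding of a rank r >= 0 (s + c = 256): the last byte is a "stopper"
// in [0, s), every previous byte is a "continuer" in [s, 256).
std::vector<uint8_t> sc_encode(uint64_t r, unsigned s) {
    const unsigned c = 256 - s;
    std::vector<uint8_t> out;
    out.push_back(uint8_t(r % s));          // stopper (least significant digit)
    r /= s;
    while (r > 0) {
        --r;                                // remaining digits are in base c
        out.insert(out.begin(), uint8_t(s + r % c));   // continuer
        r /= c;
    }
    return out;                             // most significant byte first, stopper last
}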

Optimal (s,c)-dense codes
Find the optimal s, assuming c = 256 - s.

 Brute-force approach
 Binary search:
 On real distributions, there seems to be a unique minimum

Ks = max codeword length
Fsk = cumulative probability of the symbols whose |cw| ≤ k

Experiments: (s,c)-DC is quite interesting…

Search is 6% faster than byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded

 Start with the list of symbols L=[a,b,c,d,…]
 For each input symbol s
 1) output the position of s in L
 2) move s to the front of L

There is a memory
Properties:
 Exploits temporal locality, and it is dynamic

 X = 1^n 2^n 3^n … n^n  ⟹  Huff = O(n² log n), MTF = O(n log n) + n²

Not much worse than Huffman
...but it may be far better
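A short C++ sketch (not from the slides) of the MTF transform; here positions are 0-based (0 = front of the list), as in the Bzip example later in these notes.

#include <algorithm>
#include <list>
#include <string>
#include <vector>

// Move-to-Front: turn a symbol sequence into the sequence of positions in the
// dynamically reordered list L.
std::vector<int> mtf_encode(const std::string &text, std::list<char> L) {
    std::vector<int> out;
    for (char s : text) {
        auto it = std::find(L.begin(), L.end(), s);
        out.push_back(int(std::distance(L.begin(), it)));   // 1) output the position of s in L
        L.erase(it);
        L.push_front(s);                                     // 2) move s to the front of L
    }
    return out;
}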

MTF: how good is it ?
Encode the integers via g-coding:
|g(i)| ≤ 2 * log i + 1
Put the alphabet S in front and consider the cost of encoding:

O(|S| log |S|) + Σx=1,...,|S| Σi=2,...,nx |g( pix - pi-1x )|

By Jensen’s inequality:

≤ O(|S| log |S|) + Σx=1,...,|S| nx * [ 2 * log(N/nx) + 1 ]
= O(|S| log |S|) + N * [ 2 * H0(X) + 1 ]

Hence La[mtf] ≤ 2 * H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to maintain the MTF-list efficiently:

 Search tree
 Leaves contain the symbols, ordered as in the MTF-list
 Nodes contain the size of their descending subtree

 Hash Table
 key is a symbol
 data is a pointer to the corresponding tree leaf

Each tree operation takes O(log |S|) time
Total is O(n log |S|), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings => just the run lengths and one bit
Properties:
 Exploits spatial locality, and it is a dynamic code
 There is a memory

 X = 1^n 2^n 3^n … n^n  ⟹  Huff(X) = Θ(n² log n) > Rle(X) = Θ(n log n)
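A tiny C++ sketch (not from the slides) of RLE on the example above; the function name is illustrative.

#include <string>
#include <utility>
#include <vector>

// Run-Length Encoding: abbbaacccca -> (a,1),(b,3),(a,2),(c,4),(a,1)
std::vector<std::pair<char,int>> rle(const std::string &s) {
    std::vector<std::pair<char,int>> runs;
    for (char c : s) {
        if (!runs.empty() && runs.back().first == c) ++runs.back().second;
        else runs.push_back({c, 1});
    }
    return runs;
}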

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g. p(a) = .2, p(b) = .5, p(c) = .3

f(i) = Σj=1,...,i-1 p(j)    so    f(a) = .0, f(b) = .2, f(c) = .7

The interval for a particular symbol will be called
the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac

[0.0, 1.0)  --b-->  [0.2, 0.7)  --a-->  [0.2, 0.3)  --c-->  [0.27, 0.3)

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities p[c] use the following:

l0 = 0        li = li-1 + si-1 * f[ci]
s0 = 1        si = si-1 * p[ci]

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

sn = Πi=1,...,n p[ci]

The interval for a message sequence will be called the
sequence interval
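A short C++ sketch (not from the slides) of the interval-narrowing recurrence above, using floating point only for illustration; with p(a)=.2, p(b)=.5, p(c)=.3 the message "bac" yields [.27, .3). Real coders use the integer version discussed later in the deck.

#include <map>
#include <string>
#include <utility>

// Sequence interval [l, l+s) of a message, via l_i = l_{i-1} + s_{i-1} * f[c_i]
// and s_i = s_{i-1} * p[c_i].
std::pair<double,double> sequence_interval(const std::string &msg,
                                           const std::map<char,double> &p,
                                           const std::map<char,double> &f) {
    double l = 0.0, s = 1.0;
    for (char c : msg) {
        l = l + s * f.at(c);   // shift by the cumulative probability of c
        s = s * p.at(c);       // shrink by the probability of c
    }
    return {l, l + s};
}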

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:

.49 ∈ [.2,.7) → b;   then .49 ∈ [.3,.55) → b;   then .49 ∈ [.475,.55) → c

The message is bbc.

Representing a real number
Binary fractional representation:

.75 = .11        1/3 = .010101...        11/16 = .1011

Algorithm
1. x = 2 * x
2. If x < 1 output 0
3. else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01    [.33,.66) = .1    [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.

            min      max      interval
  .11       .110     .111     [.75, 1.0)
  .101      .1010    .1011    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

e.g. Sequence Interval [.61, .79),  Code Interval (.101) = [.625, .75)

Can use L + s/2 truncated to 1 + ⌈log (1/s)⌉ bits

Bound on Arithmetic length

Note that -log s + 1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most

1 + ⌈log (1/s)⌉ = 1 + ⌈log Πi (1/pi)⌉
              ≤ 2 + Σi=1,...,n log (1/pi)
              = 2 + Σk=1,...,|S| n pk log (1/pk)
              = 2 + n H0  bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers are expensive.
Key ideas of the integer version:

 Keep integers in range [0..R) where R = 2^k
 Use rounding to generate integer intervals
 Whenever the sequence interval falls into the top,
bottom or middle half, expand the interval
by a factor of 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
Output 1 followed by m 0s
m = 0
Message interval is expanded by 2

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m = 0
Message interval is expanded by 2

If l ≥ R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

All other cases: just continue...

You find this at

Arithmetic ToolBox
As a state machine: given the current interval (L,s), the distribution (p1,....,p|S|) and the next symbol c, the ATB returns the new interval (L’,s’).

ATB: (L,s), c  →  (L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
The ATB is driven by p[ s | context ], where s = c or esc.

ATB: (L,s), s  →  (L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
String = ACCBACCACBA B        k = 2

Context    Counts
Empty      A = 4   B = 2   C = 5   $ = 3

A          C = 3   $ = 1
B          A = 2   $ = 1
C          A = 1   B = 2   C = 2   $ = 3

AC         B = 1   C = 2   $ = 2
BA         C = 1   $ = 1
CA         C = 1   $ = 1
CB         A = 2   $ = 1
CC         A = 1   B = 1   $ = 2
You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if len > d? (the copy overlaps text that is still being written)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
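A hedged C++ sketch of LZW decoding, including the special case just mentioned; here the dictionary is initialized with the 256 byte values (the slide's example uses its own numbering, e.g. a = 112), and the function name is illustrative.

#include <cstdint>
#include <string>
#include <vector>

// LZW decoding; the decoder is one step behind, so when a code is not yet in the
// dictionary (the SSc case above) the missing entry is prev + prev[0].
std::string lzw_decode(const std::vector<uint32_t> &codes) {
    std::vector<std::string> dict(256);
    for (int c = 0; c < 256; ++c) dict[c] = std::string(1, char(c));
    std::string out, prev;
    for (uint32_t code : codes) {
        std::string cur;
        if (code < dict.size()) cur = dict[code];
        else cur = prev + prev[0];          // special case: the code about to be created
        out += cur;
        if (!prev.empty()) dict.push_back(prev + cur[0]);   // add Sc one step late
        prev = cur;
    }
    return out;
}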

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
We are given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L → F mapping
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i




How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
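A C++ sketch of the inversion above, filling in the LF computation by counting; it assumes the end-marker ‘#’ is the unique smallest character, as in the mississippi# example, and the function name is illustrative.

#include <string>
#include <vector>

// Invert the BWT: build LF (stable rank of L[i] among equal chars, plus the number
// of smaller chars), then walk backwards as in InvertBWT(L).
std::string invert_bwt(const std::string &L) {
    const size_t n = L.size();
    std::vector<size_t> count(256, 0), first(256, 0), seen(256, 0), LF(n);
    for (unsigned char c : L) ++count[c];
    for (int c = 1; c < 256; ++c) first[c] = first[c - 1] + count[c - 1];  // start of c in F
    for (size_t i = 0; i < n; ++i) {
        unsigned char c = L[i];
        LF[i] = first[c] + seen[c]++;       // row of F holding this occurrence of c
    }
    std::string T(n, ' ');
    T[n - 1] = '#';                         // the end-marker heads row 0
    size_t r = 0;                           // row 0: its L char precedes '#' in T
    for (size_t i = n - 1; i > 0; --i) {    // T[i] = L[r]; r = LF[r]
        T[i - 1] = L[r];
        r = LF[r];
    }
    return T;
}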

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…

 Physical network graph
 V = Routers
 E = communication links

 The “cosine” graph (undirected, weighted)
 V = static web pages
 E = semantic distance between pages

 Query-Log graph (bipartite, weighted)
 V = queries and URLs
 E = (q,u) if u is a result for q, and has been clicked by some
user who issued q

 Social graph (undirected, unweighted)
 V = users
 E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution

[two plots: Altavista crawl 1999, WebBase crawl 2001]
Indegree follows a power law distribution:

Pr[ in-degree(u) = k ]  ≈  1/k^a,    a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/x^a, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
[figure: adjacency matrix of the graph, rows i and columns j]

21 million pages, 150 million links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries (only the first gap s1-x can be negative): fold v ≥ 0 to 2v and v < 0 to 2|v|-1, as in the worked examples of the Extra-nodes slide below.
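A hedged C++ sketch of this gap transformation with the sign fold; the function name is illustrative and the folded values are then fed to a universal code.

#include <cstdint>
#include <vector>

// Gap-encode the successor list of node x: {s1-x, s2-s1-1, ..., sk-s_{k-1}-1}.
// Only the first gap can be negative; it is folded to a non-negative value
// (v -> 2v if v >= 0, 2|v|-1 if v < 0).
std::vector<uint64_t> gap_encode(uint64_t x, const std::vector<uint64_t> &succ /* sorted */) {
    std::vector<uint64_t> gaps;
    if (succ.empty()) return gaps;
    int64_t first = int64_t(succ[0]) - int64_t(x);
    gaps.push_back(first >= 0 ? uint64_t(2 * first) : uint64_t(2 * (-first) - 1));
    for (size_t i = 1; i < succ.size(); ++i)
        gaps.push_back(succ[i] - succ[i - 1] - 1);
    return gaps;
}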

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
[figure: a sender and a receiver connected by a network link; the receiver already holds some knowledge about the data]

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides an efficient, optimal solution

fknown is the “previously encoded text”: compress the concatenation fknown·fnew, starting the output from fnew

zdelta is one of the best implementations
              Emacs size    Emacs time
  uncompr     27Mb          ---
  gzip        8Mb           35 secs
  zdelta      1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

[figure: the Client-side proxy and the remote Proxy share a reference page; requests travel Client → slow link → Proxy → fast link → web, and the requested page is delta-encoded over the slow link]

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[figure: the weighted graph GF with the dummy node and zdelta/gzip edge weights; the min branching picks the cheapest reference for each file]

              space    time
  uncompr     30Mb     ---
  tgz         20%      linear
  THIS        8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly: n² edge calculations (zdelta executions)

We wish to exploit some pruning approach

 Collection analysis: Cluster the files that appear similar and thus are
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files

 Assign weights: Estimate appropriate edge weights for G’F, thus saving
zdelta executions. Nonetheless, still n² time

              space    time
  uncompr     260Mb    ---
  tgz         12%      2 mins
  THIS        8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
[figure: the Client holds f_old and asks the Server, which holds f_new, for an update]

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
[figure: the Client sends the block hashes of f_old; the Server, which holds f_new, replies with the encoded file built from matching blocks and literals]

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

              gcc size    emacs size
  total       27288       27326
  gzip        7563        8577
  zdelta      227         1431
  rsync       964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

log n/k levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree

T# = mississippi#

[figure: the suffix tree of T#; edges are labeled with substrings (e.g. “ssi”, “ppi#”, “mississippi#”) and each leaf is labeled with the starting position (1..12) of the suffix it spells out]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Θ(N²) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirect binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
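A C++ sketch (not from the slides) of the indirect binary search on SA: two binary searches delimit the contiguous block of suffixes having P as a prefix, for O(p log2 N) time overall; the function name is illustrative.

#include <string>
#include <utility>
#include <vector>

// Returns [begin, end): the range of SA rows whose suffix has P as a prefix.
std::pair<size_t,size_t> sa_search(const std::string &T, const std::vector<size_t> &SA,
                                   const std::string &P) {
    size_t lo = 0, hi = SA.size();
    while (lo < hi) {                                        // first suffix >= P (on |P| chars)
        size_t mid = (lo + hi) / 2;
        if (T.compare(SA[mid], P.size(), P) < 0) lo = mid + 1; else hi = mid;
    }
    size_t begin = lo; hi = SA.size();
    while (lo < hi) {                                        // first suffix > P (on |P| chars)
        size_t mid = (lo + hi) / 2;
        if (T.compare(SA[mid], P.size(), P) <= 0) lo = mid + 1; else hi = mid;
    }
    return {begin, lo};                                      // occurrences are SA[begin..lo)
}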

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
• It is the min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
• Search for some Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
• Search for a range Lcp[i,i+C-2] whose entries are all ≥ L


Slide 41

divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M ( j - 1))  U (T [ j ])
l

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M

l -1

( j - 1))

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1
T = xabxabaaca
P = abaad

        j:  1 2 3 4 5 6 7 8 9 10
M1: i=1     1 1 1 1 1 1 1 1 1 1
    i=2     0 0 1 0 0 1 0 1 1 0
    i=3     0 0 0 1 0 0 1 0 0 1
    i=4     0 0 0 0 1 0 0 1 0 0
    i=5     0 0 0 0 0 0 0 0 1 0

M0: i=1     0 1 0 0 1 0 1 1 0 1
    i=2     0 0 1 0 0 1 0 0 0 0
    i=3     0 0 0 0 0 0 1 0 0 0
    i=4     0 0 0 0 0 0 0 1 0 0
    i=5     0 0 0 0 0 0 0 0 0 0

How much do we pay?





The running time is O(kn(1+m/w)).
Again, the method is practically efficient for small m.
Still, only O(k) columns of the matrices are needed at any given time. Hence, the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary = {bzip, not, or, ...}

Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing k mismatches.

P = bot, k = 2      S = "bzip or not bzip"
not = 1g 0g 0a

[Slide diagram: the tagged-Huffman-coded dictionary and compressed text C(S); the matching term is marked yes]

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding

 g(x)  =  0 0 ... 0 (Length-1 zeros), then x in binary

 x > 0 and Length = ⌊log2 x⌋ + 1
 e.g., 9 is represented as <000, 1001>.

 g-code for x takes 2⌊log2 x⌋ + 1 bits
 (i.e. a factor of 2 from optimal)

 Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of g-coded integers, reconstruct the original sequence:

0001000001100110000011101100111   →   8, 6, 3, 59, 7
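A minimal C sketch of g-encoding and g-decoding; bits are handled as the characters '0'/'1' only to keep the example self-contained, and the main reproduces the exercise above.

#include <stdio.h>

/* append g(x) to buf as '0'/'1' chars, return new length; x > 0 */
int gamma_encode(unsigned x, char *buf, int len) {
    int nbits = 0;
    for (unsigned t = x; t > 1; t >>= 1) nbits++;     /* floor(log2 x)     */
    for (int i = 0; i < nbits; i++) buf[len++] = '0'; /* Length-1 zeros    */
    for (int i = nbits; i >= 0; i--)                  /* x in binary       */
        buf[len++] = (x >> i) & 1 ? '1' : '0';
    return len;
}

/* decode one integer starting at *pos, advancing *pos */
unsigned gamma_decode(const char *buf, int *pos) {
    int nzeros = 0;
    while (buf[*pos] == '0') { nzeros++; (*pos)++; }  /* count the zeros   */
    unsigned x = 0;
    for (int i = 0; i <= nzeros; i++)                 /* read nzeros+1 bits */
        x = (x << 1) | (buf[(*pos)++] - '0');
    return x;
}

int main(void) {
    char buf[256]; int len = 0, pos = 0;
    unsigned in[] = {8, 6, 3, 59, 7};
    for (int i = 0; i < 5; i++) len = gamma_encode(in[i], buf, len);
    buf[len] = '\0';
    printf("%s\n", buf);             /* 0001000001100110000011101100111   */
    while (pos < len) printf("%u ", gamma_decode(buf, &pos));
    printf("\n");
    return 0;
}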

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
 1 ≥ Σi=1,…,x pi ≥ x * px   ⇒   x ≤ 1/px

How good is it ?
Encode the integers via g-coding:
 |g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):

 Σi=1,…,|S| pi * |g(i)|

This is:

 Σi=1,…,|S| pi * [ 2 * log(1/pi) + 1 ]  ≤  2 * H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 128² = 16512 words within 2 bytes
A (230,26)-dense code encodes 230 + 230*26 = 6210 words within 2 bytes, hence more words on 1 byte, and thus it wins if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

It exploits temporal locality, and it is dynamic

X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n² log n) bits, MTF = O(n log n) + n² bits

Not much worse than Huffman
...but it may be far better
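A minimal C sketch of the MTF transform, with the list kept as a plain array (so O(|S|) per symbol; the search-tree/hash-table variant of the following slides brings this down to O(log |S|)).

#include <stdio.h>
#include <string.h>

/* For each input byte: output its current position in the list L,
   then move it to the front of L.                                  */
void mtf_encode(const unsigned char *in, int n, int *out) {
    unsigned char L[256];
    for (int i = 0; i < 256; i++) L[i] = (unsigned char)i;  /* L = [0,1,2,...] */
    for (int i = 0; i < n; i++) {
        int pos = 0;
        while (L[pos] != in[i]) pos++;       /* 1) position of s in L          */
        out[i] = pos;
        memmove(L + 1, L, pos);              /* 2) move s to the front of L    */
        L[0] = in[i];
    }
}

int main(void) {
    const char *s = "aaabbbba";
    int out[8];
    mtf_encode((const unsigned char *)s, 8, out);
    for (int i = 0; i < 8; i++) printf("%d ", out[i]);   /* 97 0 0 98 0 0 0 1 */
    printf("\n");
    return 0;
}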

MTF: how good is it ?
Encode the integers via g-coding:
 |g(i)| ≤ 2 * log i + 1
Put the initial list S at the front, and consider the cost of encoding (pix = position of the i-th occurrence of symbol x):

 O(|S| log |S|)  +  Σx=1,…,|S| Σi=2,…,nx |g( pix - pi-1x )|

By Jensen's inequality:

 ≤ O(|S| log |S|)  +  Σx=1,…,|S| nx * [ 2 * log(N/nx) + 1 ]
 = O(|S| log |S|)  +  N * [ 2 * H0(X) + 1 ]

Hence La[mtf] ≤ 2 * H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




It exploits spatial locality, and it is a dynamic code
X = 1^n 2^n 3^n … n^n  ⇒

There is a memory

Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)
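A minimal C sketch of RLE on a character string, reproducing the example above; the printed pair format is illustrative.

#include <stdio.h>

/* Emit (symbol, run length) pairs, e.g. abbbaacccca => (a,1)(b,3)(a,2)(c,4)(a,1) */
void rle_encode(const char *s) {
    for (int i = 0; s[i]; ) {
        int j = i;
        while (s[j] == s[i]) j++;          /* extend the current run  */
        printf("(%c,%d)", s[i], j - i);    /* output the pair         */
        i = j;
    }
    printf("\n");
}

int main(void) {
    rle_encode("abbbaacccca");             /* (a,1)(b,3)(a,2)(c,4)(a,1) */
    return 0;
}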

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.  p(a) = .2, p(b) = .5, p(c) = .3

 f(i) = Σj=1,…,i-1 p(j)

f(a) = .0, f(b) = .2, f(c) = .7

[Slide diagram: the interval [0.0, 1.0) partitioned into a = [0.0,.2), b = [.2,.7), c = [.7,1.0)]

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
[Slide diagram: the nested sequence intervals; start with [0.0,1.0), after b: [.2,.7), after a: [.2,.3), after c: [.27,.3)]

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
 l0 = 0,   li = li-1 + si-1 * f[ci]
 s0 = 1,   si = si-1 * p[ci]

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

 sn = Πi=1,…,n p[ci]

The interval for a message sequence will be called the
sequence interval
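A minimal C sketch of the sequence-interval computation, hard-coding the running distribution p(a)=.2, p(b)=.5, p(c)=.3; it stops at the interval [l, l+s), the bit-emission step is described in the following slides.

#include <stdio.h>

int main(void) {
    double p[3] = {0.2, 0.5, 0.3};          /* p[a], p[b], p[c]            */
    double f[3] = {0.0, 0.2, 0.7};          /* cumulative prob., excluded  */
    const char *msg = "bac";

    double l = 0.0, s = 1.0;                /* l0 = 0, s0 = 1              */
    for (int i = 0; msg[i]; i++) {
        int c = msg[i] - 'a';
        l = l + s * f[c];                   /* li = li-1 + si-1 * f[ci]    */
        s = s * p[c];                       /* si = si-1 * p[ci]           */
    }
    printf("sequence interval = [%.4f, %.4f)\n", l, l + s);  /* [.27, .30) */
    return 0;
}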

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
[Slide diagram: .49 falls into b's interval [.2,.7); after rescaling it falls again into b's interval, and finally into c's interval]

The message is bbc.

Representing a real number
Binary fractional representation:
 .75 = .11      1/3 = .0101…      11/16 = .1011

Algorithm
 1. x = 2*x
 2. If x < 1, output 0
 3. else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions.

  code    min       max       interval
  .11     .1100…    .1111…    [.75, 1.0)
  .101    .1010…    .1011…    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
 Output 1 followed by m 0s
 m = 0
 Message interval is expanded by 2

If u < R/2 then (bottom half)
 Output 0 followed by m 1s
 m = 0
 Message interval is expanded by 2

If l ≥ R/4 and u < 3R/4 then (middle half)
 Increment m
 Message interval is expanded by 2

All other cases: just continue...

You find this at

Arithmetic ToolBox
As a state machine

[Slide diagram: the arithmetic-coding state machine ATB maps the current interval (L,s) and a symbol c, drawn from the distribution (p1,…,p|S|), to the new interval (L',s')]
Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
[Slide diagram: PPM feeds the ATB state machine with p[s | context], where s is either a character c or esc; the interval (L,s) becomes (L',s')]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts      (String = ACCBACCACBA B,  k = 2)

Context Empty:  A = 4, B = 2, C = 5, $ = 3

Context A:  C = 3, $ = 1
Context B:  A = 2, $ = 1
Context C:  A = 1, B = 2, C = 2, $ = 3

Context AC:  B = 1, C = 2, $ = 2
Context BA:  C = 1, $ = 1
Context CA:  C = 1, $ = 1
Context CB:  A = 2, $ = 1
Context CC:  A = 1, B = 1, $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm's step:
 Output <d, len, c> where
  d = distance of the copied string wrt the current position
  len = length of the longest match
  c = next char in the text beyond the longest match
 Advance by len + 1

A buffer “window” has fixed length and moves
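A minimal C sketch of LZ77 parsing with a sliding window, using a naive quadratic match search just to show the parsing rule; it reproduces the five triples of the window example that follows.

#include <stdio.h>
#include <string.h>

void lz77_parse(const char *T, int W) {
    int n = (int)strlen(T);
    for (int cur = 0; cur < n; ) {
        int best_len = 0, best_d = 0;
        int lo = cur - W > 0 ? cur - W : 0;          /* window start            */
        for (int s = lo; s < cur; s++) {             /* candidate copy sources  */
            int len = 0;
            /* the copy may overlap the cursor (len can exceed cur-s)           */
            while (cur + len < n - 1 && T[s + len] == T[cur + len]) len++;
            if (len > best_len) { best_len = len; best_d = cur - s; }
        }
        printf("<%d,%d,%c>\n", best_d, best_len, T[cur + best_len]);
        cur += best_len + 1;                          /* advance by len + 1     */
    }
}

int main(void) {
    lz77_parse("aacaacabcabaaac", 6);   /* window size 6, as in the example below */
    return 0;
}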

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
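A minimal C sketch of the LZW coding loop, with the dictionary stored as plain strings and searched linearly (a real implementation uses a trie); codes follow the slide's convention a=112, b=113, c=114, with new entries numbered from 256.

#include <stdio.h>
#include <string.h>

#define MAXD 4096
static char dict[MAXD][64];
static int  ndict;

static int find(const char *s, int len) {
    for (int i = 0; i < ndict; i++)
        if ((int)strlen(dict[i]) == len && strncmp(dict[i], s, len) == 0)
            return i;
    return -1;
}
static int code(int id) { return id < 3 ? 112 + id : 256 + (id - 3); }

void lzw_encode(const char *T) {
    ndict = 0;
    for (char c = 'a'; c <= 'c'; c++) {              /* initial entries a,b,c  */
        dict[ndict][0] = c; dict[ndict][1] = '\0'; ndict++;
    }
    int n = (int)strlen(T), i = 0;
    while (i < n) {
        int len = 1, id = find(T + i, 1);            /* longest dict match S   */
        while (i + len < n && find(T + i, len + 1) >= 0)
            id = find(T + i, ++len);
        printf("%d ", code(id));                     /* output S's id only     */
        if (i + len < n && ndict < MAXD) {           /* add Sc to dictionary   */
            strncpy(dict[ndict], T + i, len + 1);
            dict[ndict][len + 1] = '\0';
            ndict++;
        }
        i += len;                                    /* c is NOT consumed      */
    }
    printf("\n");
}

int main(void) {
    lzw_encode("aabaacababacb");  /* 112 112 113 256 114 257 261 114 113 */
    return 0;
}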

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
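A minimal runnable C sketch of InvertBWT: the LF-mapping is built by counting, assuming the text ends with a unique terminator '#' that is smaller than every other character.

#include <stdio.h>
#include <string.h>

#define MAXN 4096

void invert_bwt(const char *L, char *T) {
    int n = (int)strlen(L);
    int count[256] = {0}, first[256] = {0}, seen[256] = {0}, LF[MAXN];

    for (int i = 0; i < n; i++) count[(unsigned char)L[i]]++;
    for (int c = 1; c < 256; c++)                    /* first[c] = where c starts in F */
        first[c] = first[c - 1] + count[c - 1];
    for (int i = 0; i < n; i++) {                    /* LF[i] = row of F holding L[i]  */
        unsigned char c = (unsigned char)L[i];
        LF[i] = first[c] + seen[c]++;
    }
    T[n - 1] = '#';                                  /* row 0 is "#..." so T ends in # */
    for (int i = n - 2, r = 0; i >= 0; i--) {        /* L[r] precedes F[r] in T        */
        T[i] = L[r];
        r = LF[r];
    }
    T[n] = '\0';
}

int main(void) {
    char T[MAXN];
    invert_bwt("ipssm#pissii", T);                   /* the L column of the example   */
    printf("%s\n", T);                               /* prints mississippi#           */
    return 0;
}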

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Q(n2 log n) time in the worst-case
• Q(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)
 Set of nodes such that from any node one can reach any other node via an undirected path.

Strongly connected components (SCC)
 Set of nodes such that from any node one can reach any other node via a directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…

 Physical network graph
  V = Routers
  E = communication links

 The "cosine" graph (undirected, weighted)
  V = static web pages
  E = semantic distance between pages

 Query-Log graph (bipartite, weighted)
  V = queries and URLs
  E = (q,u) if u is a result for q, and has been clicked by some user who issued q

 Social graph (undirected, unweighted)
  V = users
  E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution

[Plots: Altavista crawl 1999 and WebBase crawl 2001; the indegree follows a power-law distribution]

 Pr[ in-degree(u) = k ]  ∝  1/k^a,   a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
[Slide picture: dot-plot of the adjacency matrix, axes i and j]

21 million pages, 150 million links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1 - x, s2 - s1 - 1, ..., sk - sk-1 - 1}

For negative entries: map v ≥ 0 to 2v and v < 0 to 2|v| - 1 (the same mapping used for the residuals below)
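A minimal C sketch of this gap transformation on a successor list; the example node and list are illustrative, and the (possibly negative) first gap uses the 2v / 2|v|-1 mapping mentioned above.

#include <stdio.h>

int to_natural(int v) {               /* v >= 0 -> 2v,  v < 0 -> 2|v| - 1   */
    return v >= 0 ? 2 * v : 2 * (-v) - 1;
}

void encode_successors(int x, const int *s, int k) {
    printf("%d ", to_natural(s[0] - x));          /* s1 - x                 */
    for (int i = 1; i < k; i++)
        printf("%d ", s[i] - s[i - 1] - 1);       /* si - si-1 - 1 >= 0     */
    printf("\n");
}

int main(void) {
    int succ[] = {13, 15, 16, 17, 18, 19, 23, 24, 203};
    encode_successors(15, succ, 9);   /* node x = 15: 3 1 0 0 0 0 3 0 178   */
    return 0;
}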

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization

 Delta compression   [diff, zdelta, REBL,…]
  Compress file f deploying file f'
  Compress a group of files
  Speed-up web access by sending differences between the requested page and the ones available in cache

 File synchronization   [rsync, zsync]
  Client updates old file fold with fnew available on a server
  Mirroring, Shared Crawling, Content Distr. Net

 Set reconciliation
  Client updates structured old file fold with fnew available on a server
  Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides an efficient, optimal solution

 fknown is the "previously encoded text": compress the concatenation fknown fnew, starting from fnew

zdelta is one of the best implementations
           Emacs size   Emacs time
 uncompr   27Mb         ---
 gzip      8Mb          35 secs
 zdelta    1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Slide figure: example weighted graph GF, files as nodes plus a dummy node connected to all, edge weights = zdelta sizes; the min branching picks the cheapest reference for each file]

           space   time
 uncompr   30Mb    ---
 tgz       20%     linear
 THIS      8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
           space    time
 uncompr   260Mb    ---
 tgz       12%      2 mins
 THIS      8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

           gcc size   emacs size
 total     27288      27326
 gzip      7563       8577
 zdelta    227        1431
 rsync     964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), and the client checks them
Server deploys the common fref to compress the new ftar (rsync compresses just ftar).

A multi-round protocol

 k blocks of n/k elems
 log(n/k) levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Slide figure: the suffix tree of T# = mississippi#, with edge labels (#, i, s, p, ssi, ppi#, …) and leaves labeled by the starting positions 1..12 of the corresponding suffixes]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
5

Q(N2) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Q(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
• Search for a window Lcp[i,i+C-2] whose entries are all ≥ L


Slide 42

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 10⁴ * f/(1+f)
If we fetch B ≈ 4Kb in time C, and the algorithm uses all of it:
(1/B) * (p * f/(1+f) * C)  ≈  30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

        4K    8K    16K   32K    128K   256K   512K   1M
 n³     22s   3m    26m   3.5h   28h    --     --     --
 n²     0     0     0     1s     26s    106s   7m     28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm
 sum = 0; max = -1;
 For i = 1,...,n do
  If (sum + A[i] ≤ 0) then sum = 0;
  else { sum += A[i]; max = MAX{max, sum}; }

Note:
 • Sum < 0 when OPT starts;
 • Sum > 0 within OPT
(a sketch in C follows below)
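A minimal C sketch of the scan above on the example array; variable names mirror the slide, and it relies on the slide's assumption that every subsum is nonzero.

#include <stdio.h>

int max_subarray(const int *A, int n) {
    int sum = 0, max = -1;
    for (int i = 0; i < n; i++) {
        if (sum + A[i] <= 0) sum = 0;          /* restart after a bad prefix */
        else {
            sum += A[i];
            if (sum > max) max = sum;          /* MAX{max, sum}              */
        }
    }
    return max;
}

int main(void) {
    int A[] = {2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7};
    printf("%d\n", max_subarray(A, 11));       /* best window 6 1 -2 4 3 = 12 */
    return 0;
}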

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
  if (i < j) then
    m = (i+j)/2;           // Divide
    Merge-Sort(A,i,m);     // Conquer
    Merge-Sort(A,m+1,j);
    Merge(A,i,m,j)         // Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:
 n = 10⁹ tuples ⇒ few Gbs
 Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:
 It is an indirect sort: Θ(n log₂ n) random I/Os
 [5ms] * n log₂ n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
[Slide figure: recursion tree of binary Merge-Sort on a small example; runs of size M fit in internal memory, the larger merged runs do not]

How do we deploy the disk/memory features ?

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
[Slide figure: X = M/B sorted runs, each with an in-memory buffer Bf_i and a pointer p_i; the minimum of Bf1[p1], …, BfX[pX] is repeatedly moved to the output buffer Bfo; a run's next page is fetched when p_i = B, and Bfo is flushed to the merged output file when full]
Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs ≈ logM/B N/M
Optimal cost = Θ( (N/B) logM/B N/M ) I/Os

In practice
 M/B ≈ 1000  ⇒  #passes = logM/B N/M ≈ 1
 One multiway merge  ⇒  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm
 Use a pair of variables <X,C>, initially X = first item and C = 1
 For each subsequent item s of the stream,
  if (X == s) then C++
  else { C--; if (C == 0) { X = s; C = 1; } }
 Return X;
(a sketch in C follows below)
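A minimal C sketch of the scan above (the majority-vote idea) on the example stream; it returns only a candidate, so a verification pass is needed whenever the majority is not guaranteed in advance.

#include <stdio.h>

char majority_candidate(const char *A, int n) {
    char X = A[0];
    int C = 1;
    for (int i = 1; i < n; i++) {
        if (A[i] == X) C++;
        else if (--C == 0) { X = A[i]; C = 1; }   /* pair off the mismatch, restart */
    }
    return X;
}

int main(void) {
    const char *A = "bacccdcbaaaccbccc";          /* the stream of the slide        */
    printf("%c\n", majority_candidate(A, 17));    /* prints c                       */
    return 0;
}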

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







 N = 6 * 10⁹ chars, size = 6Gb
 n = 10⁶ documents
 TotT = 10⁹ tokens (avg term length is 6 chars)
 t = 5 * 10⁵ distinct terms

What kind of data structure should we build to support word-based searches ?

Solution 1: Term-Doc matrix      (1 if the play contains the word, 0 otherwise)

 t = 500K terms, n = 1 million documents

              Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
 Antony               1                  1              0          0        0        1
 Brutus               1                  1              0          1        0        0
 Caesar               1                  1              0          1        1        1
 Calpurnia            0                  1              0          0        0        0
 Cleopatra            1                  0              0          0        0        0
 mercy                1                  0              1          1        1        1
 worser               1                  0              1          1        1        0

Space is 500Gb !

Solution 2: Inverted index

 Brutus     →  2 4 8 16 32 64 128
 Calpurnia  →  1 2 3 5 8 13 21 34
 Caesar     →  13 16

We can still do better: i.e. 30-50% of the original text

1. Typically use about 12 bytes per posting
2. We have 10⁹ total terms ⇒ at least 12Gb space
3. Compressing the 6Gb documents gets 1.5Gb data
Better index, but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2^n, but we have fewer compressed messages…

 Σi=1,…,n-1 2^i = 2^n - 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the self information of s is:

 i(s) = log2 (1/p(s)) = -log2 p(s)

Lower probability  ⇒  higher information

Entropy is the weighted average of i(s):

 H(S) = Σs∈S p(s) * log2 (1/p(s))   bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
 La(C) = Σs∈S p(s) * L[s]

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
[Slide figure: Huffman tree, a(.1) and b(.2) merge into (.3); (.3) and c(.2) merge into (.5); (.5) and d(.5) merge into (1)]

a=000, b=001, c=01, d=1
There are 2^(n-1) "equivalent" Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
abc...  →  00000101        101001...  →  dcb

[Slide figure: the same Huffman tree, walked root-to-leaf for encoding and bit-by-bit for decoding]

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:

 Model takes |S|^k * (k * log |S|) + h² bits   (where h might be |S|)

 It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
[Slide figure: the word-based Huffman codeword for "or", with the tagging bit on the first of its 7-bit groups; the tree over the words of T and the byte-aligned compressed text C(T) for T = "bzip or not bzip"]
CGrep and other ideas...
P= bzip = 1a 0b

[Slide figure: GREP is run directly on the compressed text C(T) for T = "bzip or not bzip", searching the compressed pattern; each codeword is marked yes/no according to whether it matches]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary = {bzip, not, or, ...}

P = bzip = 1a 0b      S = "bzip or not bzip"

[Slide diagram: the word-based tagged-Huffman dictionary and the compressed text C(S); occurrences of P's codeword are marked yes/no]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
[Slide diagram: the pattern P aligned under the text T]

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

 H(s) = Σi=1,…,m 2^(m-i) * s[i]

P = 0101
H(P) = 2³*0 + 2²*1 + 2¹*0 + 2⁰*1 = 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

 H(Tr) = 2 * H(Tr-1) - 2^m * T(r-1) + T(r+m-1)

T = 10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H(T1) = H(1011) = 11
H(T2) = H(0110) = 2*11 - 2⁴*1 + 0 = 22 - 16 + 0 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally (Horner, working mod 7):
 1*2 + 0 (mod 7) = 2
 2*2 + 1 (mod 7) = 5
 5*2 + 1 (mod 7) = 4
 4*2 + 1 (mod 7) = 2
 2*2 + 1 (mod 7) = 5  =  Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1):
 2^m (mod q) = 2 * ( 2^(m-1) (mod q) ) (mod q)

Intermediate values are also small! (< 2q)
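A minimal C sketch of the Karp-Rabin scan on the binary example above, keeping every fingerprint modulo a prime q and verifying every fingerprint match (the deterministic variant).

#include <stdio.h>
#include <string.h>

void karp_rabin(const char *P, const char *T, long q) {
    int m = (int)strlen(P), n = (int)strlen(T);
    long hp = 0, ht = 0, top = 1;              /* top = 2^(m-1) mod q        */
    for (int i = 0; i < m - 1; i++) top = (top * 2) % q;
    for (int i = 0; i < m; i++) {              /* Horner, as in the example  */
        hp = (2 * hp + (P[i] - '0')) % q;
        ht = (2 * ht + (T[i] - '0')) % q;
    }
    for (int r = 0; ; r++) {
        if (hp == ht && strncmp(P, T + r, m) == 0)    /* verify the match    */
            printf("occurrence at position %d\n", r + 1);
        if (r + m >= n) break;
        /* slide the window: drop T[r], append T[r+m], all modulo q          */
        ht = ((ht - (T[r] - '0') * top % q + q) * 2 + (T[r + m] - '0')) % q;
    }
}

int main(void) {
    karp_rabin("0101", "10110101", 7);         /* occurrence at position 5   */
    return 0;
}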

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary = {bzip, not, or, ...}

P = bzip = 1a 0b      S = "bzip or not bzip"

[Slide diagram: the compressed text C(S); both occurrences of P's codeword are marked yes]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

The matrix M for T = california, P = for:

        c  a  l  i  f  o  r  n  i  a
        1  2  3  4  5  6  7  8  9  10
 f      0  0  0  0  1  0  0  0  0  0
 o      0  0  0  0  0  1  0  0  0  0
 r      0  0  0  0  0  0  1  0  0  0

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to compute the j-th column of M from the (j-1)-th one.
We define the m-length binary vector U(x) for each character x in the alphabet: U(x) is set to 1 for the positions in P where character x appears.
Example:
P = abaac
 U(a) = (1,0,1,1,0)ᵀ   U(b) = (0,1,0,0,0)ᵀ   U(c) = (0,0,0,0,1)ᵀ

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with i-th bit of U(T[j]) to estabilish if both are true
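A minimal C sketch of the basic Shift-And scan, assuming m <= 64 so the column M(j) fits in one machine word; it reports the occurrence of the running example P = abaac in T = xabxabaaca.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

void shift_and(const char *P, const char *T) {
    int m = (int)strlen(P);
    uint64_t U[256] = {0};
    for (int i = 0; i < m; i++)                       /* U(x): positions of x in P */
        U[(unsigned char)P[i]] |= 1ULL << i;

    uint64_t M = 0;
    for (int j = 0; T[j]; j++) {
        /* M(j) = BitShift(M(j-1)) & U(T[j]), with BitShift(X) = (X<<1)|1   */
        M = ((M << 1) | 1) & U[(unsigned char)T[j]];
        if (M & (1ULL << (m - 1)))                    /* full match ends at j     */
            printf("occurrence ending at position %d\n", j + 1);
    }
}

int main(void) {
    shift_and("abaac", "xabxabaaca");                 /* occurrence ending at 9   */
    return 0;
}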

An example j=1
T = xabxabaaca    P = abaac
U(x) = (0,0,0,0,0)ᵀ
M(1) = BitShift(M(0)) & U(T[1]) = (1,0,0,0,0)ᵀ & (0,0,0,0,0)ᵀ = (0,0,0,0,0)ᵀ

An example j=2
U(a) = (1,0,1,1,0)ᵀ
M(2) = BitShift(M(1)) & U(T[2]) = (1,0,0,0,0)ᵀ & (1,0,1,1,0)ᵀ = (1,0,0,0,0)ᵀ

An example j=3
U(b) = (0,1,0,0,0)ᵀ
M(3) = BitShift(M(2)) & U(T[3]) = (1,1,0,0,0)ᵀ & (0,1,0,0,0)ᵀ = (0,1,0,0,0)ᵀ

An example j=9
U(c) = (0,0,0,0,1)ᵀ
M(9) = BitShift(M(8)) & U(T[9]) = (1,1,0,0,1)ᵀ & (0,0,0,0,1)ᵀ = (0,0,0,0,1)ᵀ

The columns M(1..9) computed so far:
       j: 1 2 3 4 5 6 7 8 9
 i=1      0 1 0 0 1 0 1 1 0
 i=2      0 0 1 0 0 1 0 0 0
 i=3      0 0 0 0 0 0 1 0 0
 i=4      0 0 0 0 0 0 0 1 0
 i=5      0 0 0 0 0 0 0 0 1

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: Another solution
Dictionary = {bzip, not, or, ...}

P = bzip = 1a 0b      S = "bzip or not bzip"

[Slide diagram: the dictionary and the compressed text C(S); occurrences of P are marked yes/no]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

[Slide diagram: P[1..i-1] aligned with T ending at j-1, mismatches marked *]

BitShift( Ml(j-1) ) & U(T[j])

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

[Slide diagram: P[1..i-1] aligned with T ending at j-1, mismatches marked *]

BitShift( Ml-1(j-1) )

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

Ml(j) = [ BitShift( Ml(j-1) ) & U(T[j]) ]  OR  BitShift( Ml-1(j-1) )

Example M1
T = xabxabaaca
P = abaad

        j:  1 2 3 4 5 6 7 8 9 10
M1: i=1     1 1 1 1 1 1 1 1 1 1
    i=2     0 0 1 0 0 1 0 1 1 0
    i=3     0 0 0 1 0 0 1 0 0 1
    i=4     0 0 0 0 1 0 0 1 0 0
    i=5     0 0 0 0 0 0 0 0 1 0

M0: i=1     0 1 0 0 1 0 1 1 0 1
    i=2     0 0 1 0 0 1 0 0 0 0
    i=3     0 0 0 0 0 0 1 0 0 0
    i=4     0 0 0 0 0 0 0 1 0 0
    i=5     0 0 0 0 0 0 0 0 0 0

How much do we pay?





The running time is O(kn(1+m/w)
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary = {bzip, not, or}

Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing at most k mismatches.

S = “bzip or not bzip”,  P = bot,  k = 2

(Figure: the matching codewords are searched directly in C(S); both “not” = 1g 0g 0a and “bzip” contain P within 2 mismatches, so their occurrences in S are reported, “yes”.)

Agrep: more sophisticated operations

• The Shift-And method can solve other ops
• The edit distance between two strings p and s is d(p,s) = minimum number of operations needed to transform p into s via three ops:
  • Insertion: insert a symbol in p
  • Deletion: delete a symbol from p
  • Substitution: change a symbol in p with a different one

• Example: d(ananas,banane) = 3
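The classic dynamic program computes d(p,s) directly; a small Python sketch (not the bit-parallel Shift-And variant, just the definition turned into code):

    def edit_distance(p, s):
        m, n = len(p), len(s)
        D = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            D[i][0] = i                      # delete the remaining chars of p
        for j in range(n + 1):
            D[0][j] = j                      # insert the remaining chars of s
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if p[i - 1] == s[j - 1] else 1
                D[i][j] = min(D[i - 1][j] + 1,         # deletion
                              D[i][j - 1] + 1,         # insertion
                              D[i - 1][j - 1] + cost)  # substitution / match
        return D[m][n]

    print(edit_distance("ananas", "banane"))   # 3, as in the example above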

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding

    g(x) = 0 0 0 ........... 0  x in binary        (Length-1 zeros, then x in binary)

• x > 0 and Length = ⌊log2 x⌋ + 1
• e.g., 9 is represented as <000,1001>.
• The g-code for x takes 2⌊log2 x⌋ + 1 bits  (i.e. a factor of 2 from optimal)
• Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers

It is a prefix-free encoding…
• Given the following sequence of g-coded integers, reconstruct the original sequence:

    0001000001100110000011101100111

• Answer: 8, 6, 3, 59, 7
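As a check on the exercise above, a small Python sketch of g-encoding/decoding (codes are handled as strings of '0'/'1' for readability):

    def gamma_encode(x):
        """g-code of x > 0: (Length-1) zeros followed by x in binary."""
        b = bin(x)[2:]                   # binary representation, no leading zeros
        return "0" * (len(b) - 1) + b

    def gamma_decode(bits):
        """Decode a concatenation of g-codes back into the list of integers."""
        out, i = [], 0
        while i < len(bits):
            z = 0
            while bits[i] == "0":        # count the leading zeros = Length-1
                z += 1
                i += 1
            out.append(int(bits[i:i + z + 1], 2))
            i += z + 1
        return out

    print(gamma_encode(9))                                   # 0001001
    print(gamma_decode("0001000001100110000011101100111"))   # [8, 6, 3, 59, 7]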

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code g(i).

• Recall that: |g(i)| ≤ 2 * log i + 1
• How good is this approach wrt Huffman?  Compression ratio ≤ 2 * H0(s) + 1
• Key fact:   1 ≥ Σi=1,...,x pi ≥ x * px   ⟹   x ≤ 1/px

How good is it ?
Encode the integers via g-coding:  |g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):

    Σi=1,...,|S| pi * |g(i)|  ≤  Σi=1,...,|S| pi * [ 2 * log(1/pi) + 1 ]  =  2 * H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2

• A new concept: Continuers vs Stoppers
• The main idea is:
  • s + c = 256 (we are playing with 8 bits)
  • Thus s items are encoded with 1 byte
  • And s*c with 2 bytes, s*c^2 with 3 bytes, ...
• Previously we used: s = c = 128

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...
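A possible Python sketch of such a dense assignment (one concrete way to get s codewords of 1 byte, s·c of 2 bytes, s·c^2 of 3 bytes, …; the exact byte layout of the original proposal may differ). Ranks are 0-based, the last byte is a stopper in [0,s), the other bytes are continuers in [s, s+c):

    def sc_dense_encode(rank, s, c):
        assert s + c == 256
        out = [rank % s]                 # stopper (last byte)
        rank //= s
        while rank > 0:
            rank -= 1
            out.append(s + rank % c)     # continuers
            rank //= c
        return bytes(reversed(out))

    # With s=230, c=26 the 5000-th most frequent word still fits in 2 bytes
    # (230 + 230*26 = 6210 codewords of length <= 2); ETDC is the case s = c = 128.
    print(len(sc_dense_encode(4999, 230, 26)))    # 2
    print(len(sc_dense_encode(4999, 128, 128)))   # 2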

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 256 - s.

• Brute-force approach
• Binary search: on real distributions there seems to be a unique minimum

  Ks = max codeword length
  Fsk = cumulative probability of the symbols whose |cw| <= k

Experiments: (s,c)-DC is quite interesting…
• Search is 6% faster than byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer sequence, which can then be var-length coded

• Start with the list of symbols L=[a,b,c,d,…]
• For each input symbol s
  1) output the position of s in L
  2) move s to the front of L

There is a memory
Properties: it exploits temporal locality, and it is dynamic

• X = 1^n 2^n 3^n … n^n  ⟹  Huff = O(n^2 log n),  MTF = O(n log n) + n^2

Not much worse than Huffman
...but it may be far better
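A direct Python sketch of the transform just described (list-based, so O(|S|) per symbol; the efficient tree/hash version is discussed two slides below):

    def mtf_encode(text, alphabet):
        """Move-to-Front: output the position of each symbol in L, then move it to the front."""
        L = list(alphabet)
        out = []
        for s in text:
            i = L.index(s)
            out.append(i)
            L.pop(i)
            L.insert(0, s)           # move s to the front: this is the "memory"
        return out

    print(mtf_encode("abbbaacccca", "abcd"))   # [0, 1, 0, 0, 1, 0, 2, 0, 0, 0, 1]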

MTF: how good is it ?
Encode the integers via g-coding:  |g(i)| ≤ 2 * log i + 1
Put S in front of the sequence and consider the cost of encoding:

    O(|S| log |S|)  +  Σx=1,...,|S|  Σi=2,...,nx  | g( pxi - pxi-1 ) |

where pxi denotes the position of the i-th occurrence of symbol x. By Jensen’s inequality:

    ≤  O(|S| log |S|)  +  Σx=1,...,|S|  nx * [ 2 * log(N/nx) + 1 ]
    =  O(|S| log |S|)  +  N * [ 2 * H0(X) + 1 ]

hence La[mtf] ≤ 2 * H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and separators) as the symbols to be encoded.
How to maintain the MTF-list efficiently:

• Search tree
  • Leaves contain the symbols, ordered as in the MTF-list
  • Nodes contain the size of their descending subtree
• Hash Table
  • key is a symbol
  • data is a pointer to the corresponding tree leaf

• Each tree operation takes O(log |S|) time
• Total is O(n log |S|), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
   abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings ⟹ just the run lengths and one starting bit
Properties: it exploits spatial locality, it is a dynamic code, and there is a memory

• X = 1^n 2^n 3^n … n^n  ⟹  Huff(X) = Θ(n^2 log n)  >  Rle(X) = Θ(n log n)
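A one-liner-style Python sketch of RLE, reproducing the example above:

    def rle_encode(s):
        """Run-Length Encoding: a list of (symbol, run length) pairs."""
        out = []
        for ch in s:
            if out and out[-1][0] == ch:
                out[-1] = (ch, out[-1][1] + 1)
            else:
                out.append((ch, 1))
        return out

    print(rle_encode("abbbaacccca"))   # [('a',1), ('b',3), ('a',2), ('c',4), ('a',1)]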

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive).

e.g.  p(a) = .2, p(b) = .5, p(c) = .3   ⟹   a → [0,.2),  b → [.2,.7),  c → [.7,1)

    f(i) = Σj<i p(j)      so   f(a) = .0,  f(b) = .2,  f(c) = .7

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac

    start          [0, 1)
    b (p = .5)     [.2, .7)
    a (p = .2)     [.2, .3)
    c (p = .3)     [.27, .3)

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c1...cn with probabilities p[c] use the following:

    l0 = 0,   li = li-1 + si-1 * f[ci]
    s0 = 1,   si = si-1 * p[ci]

f[c] is the cumulative prob. up to symbol c (not included).
Final interval size is

    sn = Πi=1,...,n p[ci]

The interval for a message sequence will be called the sequence interval
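A toy Python sketch of the recurrence above (floating point is used only for illustration; real coders use the integer version described later):

    def sequence_interval(msg, p):
        """Return the sequence interval [l, l+s) of msg, given symbol probabilities p."""
        f, acc = {}, 0.0
        for c in sorted(p):          # cumulative probabilities f[c], in a fixed symbol order
            f[c] = acc
            acc += p[c]
        l, s = 0.0, 1.0
        for c in msg:
            l = l + s * f[c]         # l_i = l_{i-1} + s_{i-1} * f[c_i]
            s = s * p[c]             # s_i = s_{i-1} * p[c_i]
        return l, s

    l, s = sequence_interval("bac", {"a": 0.2, "b": 0.5, "c": 0.3})
    print(l, l + s)                  # ~0.27  ~0.30 : the interval [.27,.3) of the example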

Uniquely defining an interval
Important property: the intervals for distinct messages of length n never overlap.
Therefore, specifying any number in the final interval uniquely determines the message.
Decoding is similar to encoding, but at each step we need to determine the message symbol and then reduce the interval accordingly.

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:

    [0, 1)         .49 ∈ [.2, .7)    → b
    [.2, .7)       .49 ∈ [.3, .55)   → b
    [.3, .55)      .49 ∈ [.475, .55) → c

The message is bbc.

Representing a real number
Binary fractional representation:

    .75 = .11        1/3 = .010101...        11/16 = .1011

Algorithm (emit the bits of x ∈ [0,1)):
  1. x = 2*x
  2. if x < 1, output 0
  3. else x = x - 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
e.g.  [0,.33) → .01      [.33,.66) → .1      [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions.

            min      max      interval
    .11     .110     .111     [.75, 1.0)
    .101    .1010    .1011    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval is contained in the sequence interval (a dyadic number).

    Sequence interval [.61, .79)  ⊇  Code interval (.101) = [.625, .75)

Can use L + s/2 truncated to 1 + ⌈log2 (1/s)⌉ bits

Bound on Arithmetic length: note that -log2 s + 1 = log2 (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

    1 + ⌈log2 (1/s)⌉ = 1 + ⌈log2 Πi (1/pi)⌉
                     ≤ 2 + Σi=1,...,n log2 (1/pi)
                     = 2 + Σk=1,...,|S| n pk log2 (1/pk)
                     = 2 + n H0   bits

In practice ≈ nH0 + 0.02 n bits, because of rounding.

Integer Arithmetic Coding
Problem: operations on arbitrary-precision real numbers are expensive.
Key ideas of the integer version:
• Keep integers in range [0..R) where R = 2^k
• Use rounding to generate integer intervals
• Whenever the sequence interval falls into the top, bottom or middle half, expand the interval by a factor of 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
• If l ≥ R/2 (top half): output 1 followed by m 0s; m = 0; the message interval is expanded by 2
• If u < R/2 (bottom half): output 0 followed by m 1s; m = 0; the message interval is expanded by 2
• If l ≥ R/4 and u < 3R/4 (middle half): increment m; the message interval is expanded by 2
• In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine: given the current interval (L,s) and the next symbol c, with distribution (p1,....,p|S|), the ATB returns the new interval (L’,s’) with L’ = L + s*f(c) and s’ = s*p(c).
Therefore, even the distribution can change over time.

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
The ATB is driven by p[ s | context ], where s = c or esc: from (L,s) and the symbol (or escape) it returns the new interval (L’,s’).
Encoder and Decoder must know the protocol for selecting the same conditional probability distribution (PPM-variant).

PPM: Example Contexts        (String = ACCBACCACBA B,   k = 2)

Context Empty:   A = 4,  B = 2,  C = 5,  $ = 3
Context A:       C = 3,  $ = 1
Context B:       A = 2,  $ = 1
Context C:       A = 1,  B = 2,  C = 2,  $ = 3
Context AC:      B = 1,  C = 2,  $ = 2
Context BA:      C = 1,  $ = 1
Context CA:      C = 1,  $ = 1
Context CB:      A = 2,  $ = 1
Context CC:      A = 1,  B = 1,  $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!

LZ77
(Figure: text  a a c a a c a b c a b a b a c  with the cursor; the dictionary is the set of all substrings starting before the cursor, and the emitted triple here is <2,3,c>.)

Algorithm’s step:
• Output <d, len, c> where
  • d = distance of the copied string wrt the current position
  • len = length of the longest match
  • c = next char in text beyond the longest match
• Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
(Legend: in each line, the longest match within W, followed by the next character.)
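A didactic Python sketch of a windowed LZ77 parser that reproduces the triples of the example above (greedy longest match, overlapping copies allowed; real implementations like gzip use hash tables instead of this quadratic scan):

    def lz77_encode(text, window=6):
        out, i, n = [], 0, len(text)
        while i < n:
            best_d, best_len = 0, 0
            for start in range(max(0, i - window), i):      # candidate sources in the window
                l = 0
                while i + l < n - 1 and text[start + l] == text[i + l]:
                    l += 1                                   # may overlap the lookahead
                if l > best_len:
                    best_d, best_len = i - start, l
            out.append((best_d, best_len, text[i + best_len]))
            i += best_len + 1                                # advance by len + 1
        return out

    print(lz77_encode("aacaacabcabaaac"))
    # [(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')] as in the example above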

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids
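A short Python sketch of the coding loop (a dict stands in for the trie; the parse matches the example on the next slide):

    def lz78_encode(text):
        dictionary = {"": 0}                  # phrase -> id; 0 is the empty phrase
        out, w = [], ""
        for c in text:
            if w + c in dictionary:           # extend the longest match S
                w += c
            else:
                out.append((dictionary[w], c))
                dictionary[w + c] = len(dictionary)   # add Sc with the next free id
                w = ""
        if w:                                 # flush a pending match with no following char
            out.append((dictionary[w[:-1]], w[-1]))
        return out

    print(lz78_encode("aabaacabcabcb"))
    # [(0,'a'), (1,'b'), (1,'a'), (0,'c'), (2,'c'), (5,'b')]
    # dict: 1=a, 2=ab, 3=aa, 4=c, 5=abc, 6=abcb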

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
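A sketch of the encoder side in Python (the special decoder case above is not needed here). To keep it self-contained the dictionary is seeded with the symbols actually occurring in the input rather than the 256 ASCII codes, so the ids differ from the slide’s numbering:

    def lzw_encode(text):
        dictionary = {c: i for i, c in enumerate(sorted(set(text)))}   # toy seed
        out, w = [], ""
        for c in text:
            if w + c in dictionary:
                w += c
            else:
                out.append(dictionary[w])
                dictionary[w + c] = len(dictionary)    # add Sc, but do not transmit c
                w = c
        if w:
            out.append(dictionary[w])
        return out

    print(lzw_encode("aabaacababacb"))
    # [0, 0, 1, 3, 2, 4, 8, 2, 1] : the same parse as the slide's example,
    # with seed a=0, b=1, c=2 and additions aa=3, ab=4, ba=5, aac=6, ca=7, aba=8, abac=9, cb=10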

LZW: Encoding Example
T = a a b a a c a b a b a c b

    Output    Dict.
    112       256=aa
    112       257=ab
    113       258=ba
    256       259=aac
    114       260=ca
    257       261=aba
    261       262=abac
    114       263=cb

LZW: Decoding Example

    Input    Output so far                  Dict.
    112      a
    112      a a                            256=aa
    113      a a b                          257=ab
    256      a a b a a                      258=ba
    114      a a b a a c                    259=aac
    257      a a b a a c a b                260=ca
    261      a a b a a c a b ? → a b a      261=aba  (261 not yet in the dictionary: the special case)
    114      a a b a a c a b a b a c        262=abac

The decoder adds each entry one step later than the coder.

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
We are given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]
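Putting the two facts together, a minimal Python sketch that builds the BWT exactly this way (sort the suffixes, then take L[i] = T[SA[i]-1], wrapping around for the first row):

    def bwt(text, eos="#"):
        t = text + eos
        sa = sorted(range(len(t)), key=lambda i: t[i:])   # suffix array (simple, not efficient)
        return "".join(t[i - 1] for i in sa)              # L[i] = T[SA[i]-1]; t[-1] handles SA[i]=0

    print(bwt("mississippi"))   # ipssm#pissii, the L column of the example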

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n^2 log n) time in the worst case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions

• Weakly connected components (WCC)
  • Set of nodes such that from any node one can reach any other node via an undirected path.
• Strongly connected components (SCC)
  • Set of nodes such that from any node one can reach any other node via a directed path.

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?

• The largest artifact ever conceived by humans
• Exploit the structure of the Web for
  • Crawl strategies
  • Search
  • Spam detection
  • Discovering communities on the web
  • Classification/organization
• Predict the evolution of the Web
  • Sociological understanding

Many other large graphs…

• Physical network graph
  • V = Routers,  E = communication links

• The “cosine” graph (undirected, weighted)
  • V = static web pages,  E = semantic distance between pages

• Query-Log graph (bipartite, weighted)
  • V = queries and URLs,  E = (q,u) if u is a result for q, and has been clicked by some user who issued q

• Social graph (undirected, unweighted)
  • V = users,  E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)
• V = URLs,  E = (u,v) if u has a hyperlink to v
• Isolated URLs are ignored (no IN & no OUT)
Three key properties:
• Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution
Altavista crawl, 1999 / WebBase crawl, 2001: indegree follows a power-law distribution

    Pr[ in-degree(u) = k ]  ∝  1/k^a,    a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)
• V = URLs,  E = (u,v) if u has a hyperlink to v
• Isolated URLs are ignored (no IN, no OUT)

Three key properties:
• Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1
• Locality: usually most of the hyperlinks point to other URLs on the same host (about 80%).
• Similarity: pages close in lexicographic order tend to share many outgoing links

A Picture of the Web Graph
(Figure: a picture of the web graph; 21 million pages, 150 million links.)
URL-sorting (e.g. Berkeley, Stanford)  ⟹  URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:
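A small Python sketch of this gap encoding (function name is mine). The only subtle point is the first entry, which can be negative: it is mapped to a non-negative value with the 2g / 2|g|-1 trick used in the worked numbers a few slides below (“0 = (15-15)*2”, “3 = |13-15|*2-1”):

    def encode_successors(x, succ):
        succ = sorted(succ)
        first = succ[0] - x                                   # may be negative
        gaps = [2 * first if first >= 0 else 2 * abs(first) - 1]
        for prev, cur in zip(succ, succ[1:]):
            gaps.append(cur - prev - 1)                       # later gaps are >= 0
        return gaps

    # node 15 with successors {13, 15, 16, 17, 20}
    print(encode_successors(15, [13, 15, 16, 17, 20]))        # [3, 1, 0, 0, 2]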

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y’s copy list tells whether the corresponding successor of the reference x is also a successor of y;
the reference index is chosen in [0,W] so as to give the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



What if the sender has never seen the data at the receiver?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77-scheme provides an efficient, optimal solution




fknown is the “previously encoded text”: compress the concatenation fknown · fnew, starting from fnew

zdelta is one of the best implementations
               Emacs size    Emacs time
    uncompr    27Mb          ---
    gzip       8Mb           35 secs
    zdelta     1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

(Figure: the weighted graph GF on the files, with the dummy node; edge weights are the zdelta/gzip sizes.)

               space    time
    uncompr    30Mb     ---
    tgz        20%      linear
    THIS       8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
               space    time
    uncompr    260Mb    ---
    tgz        12%      2 mins
    THIS       8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
(Figure: the client sends the block hashes of f_old to the server; the server replies with the encoded file, from which the client rebuilds f_new.)

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
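To make the block-matching idea concrete, a Python sketch follows. It is not the real rsync protocol: a toy additive checksum stands in for the 4-byte rolling hash (and is recomputed rather than rolled), and a direct byte comparison stands in for the MD5 check; function name and output format are mine:

    def rsync_like_delta(f_old, f_new, B=4):
        weak = lambda s: sum(s) % (1 << 16)          # toy weak checksum of a block
        blocks = {}
        for k in range(0, len(f_old) - B + 1, B):    # "client" side: hash fixed blocks of f_old
            blocks.setdefault(weak(f_old[k:k + B]), []).append(k)
        out, i = [], 0
        while i < len(f_new):                        # "server" side: scan f_new
            chunk = f_new[i:i + B]
            hit = None
            if len(chunk) == B:
                for k in blocks.get(weak(chunk), []):
                    if f_old[k:k + B] == chunk:      # strong check (MD5 in real rsync)
                        hit = k
                        break
            if hit is not None:
                out.append(("copy", hit)); i += B    # reference to a block of f_old
            else:
                out.append(("lit", f_new[i])); i += 1   # literal byte
        return out

    print(rsync_like_delta(b"the quick brown fox jumps", b"the quick red fox jumps"))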

Rsync: some experiments

              gcc size    emacs size
    total     27288       27326
    gzip      7563        8577
    zdelta    227         1431
    rsync     964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

• The server sends the hashes (unlike the client in rsync), and the client checks them
• The server deploys the common fref to compress the new ftar (rsync compresses just it)

A multi-round protocol
• k blocks of n/k elems, log(n/k) levels
• If the distance is k, then on each level at most k hashes do not find a match in the other file.
• The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
T# = mississippi#
(Figure: the suffix tree of T#; edges are labeled with substrings such as ssi, ppi#, si, i#, mississippi#, and the 12 leaves store the starting positions of the suffixes.)

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
5

Θ(N^2) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirect binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
⟹ overall, O(p log2 N) time; improvable to O(p + log2 N) [Manber-Myers, ’90] and to O(p + log2 |S|) [Cole et al, ’06]
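A compact Python sketch of the indirect binary search (plain O(p log2 N) version, names are mine):

    def sa_range(text, sa, p):
        def bound(strict):                       # first SA position whose prefix is >= p (or > p)
            lo, hi = 0, len(sa)
            while lo < hi:
                mid = (lo + hi) // 2
                s = text[sa[mid]:sa[mid] + len(p)]   # O(p) chars compared per step
                if s < p or (strict and s == p):
                    lo = mid + 1
                else:
                    hi = mid
            return lo
        return bound(False), bound(True)         # occurrences are sa[l:r]

    t = "mississippi#"
    sa = sorted(range(len(t)), key=lambda i: t[i:])
    l, r = sa_range(t, sa, "si")
    print(sorted(sa[l:r]))                       # [3, 6]: the two occurrences of "si"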

Locating the occurrences
(Figure: two binary searches on SA, for the delimiters si# and si$, identify the contiguous range of suffixes prefixed by si; here occ = 2, at positions 4 and 7 of T = mississippi#.)

Suffix Array search:   O(p + log2 N + occ) time
Suffix Trays:          O(p + log2 |S| + occ)    [Cole et al., ‘06]
String B-tree                                   [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays                    [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

T = mississippi#:    SA = 12 11 8 5 2 1 10 9 7 4 6 3     Lcp = 0 1 1 4 0 0 1 0 2 1 3
(e.g. Lcp = 4 is the common prefix “issi” of the adjacent suffixes issippi... and ississippi...)

• How long is the common prefix between T[i,...] and T[j,...] ?
  • Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  • Search for Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  • Search for a window Lcp[i,i+C-2] whose entries are all ≥ L


Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M ( j - 1))  U (T [ j ])
l

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M

l -1

( j - 1))

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1
1 2 3 4 5 6 7 8 910

M1=

T=xabxabaaca
P=
abaad

1

2

3

4

5

6

7

8

9

1
0

1

1

1

1

1

1

1

1

1

1

1

2

0

0

1

0

0

1

0

1

1

0

3

0

0

0

1

0

0

1

0

0

1

4

0

0

0

0

1

0

0

1

0

0

5

0

0

0

0

0

0

0

0

1

0

M0=

1

2

3

4

5

6

7

8

9

1
0

1

0

1

0

0

1

0

1

1

0

1

2

0

0

1

0

0

1

0

0

0

0

3

0

0

0

0

0

0

1

0

0

0

4

0

0

0

0

0

0

0

1

0

0

5

0

0

0

0

0

0

0

0

0

0

How much do we pay?





The running time is O(kn(1+m/w)
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Delection: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
The g-code of x > 0 is: (Length−1) zeros, followed by x written in binary,
where Length = ⌊log2 x⌋ + 1.
e.g., 9 is represented as <000,1001>.

The g-code for x takes 2⌊log2 x⌋ + 1 bits
(i.e., a factor of 2 from optimal)

Optimal for Pr(x) ≈ 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8

6

3

59

7
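A small Python sketch of g-encoding/decoding, useful to check the exercise above (function names are illustrative):

def gamma_encode(x):
    # (Length-1) zeros, then x in binary; Length = floor(log2 x) + 1
    assert x > 0
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":        # count the leading zeros
            z += 1
            i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out

# The exercise above decodes as expected:
# gamma_decode("0001000001100110000011101100111") == [8, 6, 3, 59, 7]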

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
  1 ≥ Σi=1,...,x pi ≥ x · px   ⟹   x ≤ 1/px

How good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):

  Σi=1,...,|S| pi · |g(i)|

This is:

  Σi=1,...,|S| pi · [ 2·log(1/pi) + 1 ]   ≤   2·H0(X) + 1

Not much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1^n 2^n 3^n … n^n   ⟹   Huff = O(n² log n), MTF = O(n log n) + n²

Not much worse than Huffman
...but it may be far better
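A minimal Python sketch of the MTF transform just described (0-based positions; the function names and the explicit alphabet parameter are illustrative assumptions):

def mtf_encode(text, alphabet):
    # For each symbol: output its current position in L, then move it to front.
    L, out = list(alphabet), []
    for s in text:
        i = L.index(s)
        out.append(i)
        L.pop(i)
        L.insert(0, s)
    return out

def mtf_decode(codes, alphabet):
    L, out = list(alphabet), []
    for i in codes:
        s = L.pop(i)
        out.append(s)
        L.insert(0, s)
    return "".join(out)

# mtf_encode("aaabbbb", "ab") -> [0, 0, 0, 1, 0, 0, 0]: runs become runs of 0s,
# which is exactly why MTF output compresses well with g-codes or RLE.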

MTF: how good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
Put S in the front and consider the cost of encoding:

  O(|S| log |S|)  +  Σx=1,...,|S| Σi=2,...,nx | g( pxi − pxi−1 ) |

By Jensen’s inequality:

  ≤ O(|S| log |S|)  +  Σx=1,...,|S| nx · [ 2·log(N/nx) + 1 ]
  = O(|S| log |S|)  +  N · [ 2·H0(X) + 1 ]

Hence  La[mtf]  ≤  2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code (there is a memory)
X = 1^n 2^n 3^n … n^n   ⟹   Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g., with p(a) = .2, p(b) = .5, p(c) = .3 the unit interval is split as
  a = [0, .2),   b = [.2, .7),   c = [.7, 1.0)

  f(i) = Σj<i p(j)      so   f(a) = .0,  f(b) = .2,  f(c) = .7

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
Start from [0, 1):
  b  narrows it to [.2, .7)
  a  narrows it to [.2, .3)
  c  narrows it to [.27, .3)
The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
  l0 = 0       li = li−1 + si−1 · f[ci]
  s0 = 1       si = si−1 · p[ci]

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is
  sn = Πi=1,...,n p[ci]

The interval for a message sequence will be called the
sequence interval
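A small Python sketch of these recurrences (the function names and the fixed alphabetical symbol order used to build f[] are illustrative assumptions matching the example above):

def sequence_interval(msg, p):
    # l_i = l_{i-1} + s_{i-1} * f[c_i];  s_i = s_{i-1} * p[c_i]
    syms = sorted(p)
    f, acc = {}, 0.0
    for c in syms:
        f[c] = acc
        acc += p[c]
    l, s = 0.0, 1.0
    for c in msg:
        l = l + s * f[c]
        s = s * p[c]
    return l, s                       # the sequence interval is [l, l+s)

def decode(x, n, p):
    # Decode a number x in [0,1) knowing the message length n.
    syms, out = sorted(p), []
    for _ in range(n):
        acc = 0.0
        for c in syms:
            if x < acc + p[c]:
                out.append(c)
                x = (x - acc) / p[c]  # rescale into the chosen symbol interval
                break
            acc += p[c]
    return "".join(out)

# With p = {'a': .2, 'b': .5, 'c': .3}: sequence_interval("bac", p) gives
# approximately (0.27, 0.03), and decode(0.49, 3, p) == "bbc", as in the slides.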

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore, specifying any number in the final interval uniquely
determines the message.
Decoding is similar to encoding: at each step determine the next
symbol and then reduce (rescale) the interval.

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
  .49 lies in [.2, .7)     → first symbol b;  current interval [.2, .7)
  .49 lies in [.3, .55)    → second symbol b; current interval [.3, .55)
  .49 lies in [.475, .55)  → third symbol c

The message is bbc.

Representing a real number
Binary fractional representation:
  .75 = .11       1/3 = .0101…       11/16 = .1011

Algorithm:
  1. x = 2·x
  2. if x < 1 output 0
  3. else x = x − 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
  number    min      max      interval
  .11       .110…    .111…    [.75, 1.0)
  .101      .1010…   .1011…   [.625, .75)
We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

[Figure: the Arithmetic ToolBox (ATB) as a state machine: given the current interval (L,s), a symbol c and the distribution (p1,…,p|S|), it outputs the next interval (L’,s’) inside [L, L+s).]
Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
[Figure: the same ATB state machine driven by PPM: the symbol fed to the coder is s = c or esc, with probability p[s | context], mapping (L,s) to (L’,s’).]
Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts          String = ACCBACCACBA B,   k = 2

 Context    Counts
 Empty      A = 4    B = 2    C = 5    $ = 3

 Context    Counts
 A          C = 3    $ = 1
 B          A = 2    $ = 1
 C          A = 1    B = 2    C = 2    $ = 3

 Context    Counts
 AC         B = 1    C = 2    $ = 2
 BA         C = 1    $ = 1
 CA         C = 1    $ = 1
 CB         A = 2    $ = 1
 CC         A = 1    B = 1    $ = 2
You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character
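A simple Python sketch of this parsing step (the function name lz77_parse and the brute-force search for the longest match inside the window are illustrative; real implementations use hashing or suffix structures):

def lz77_parse(t, window):
    # Emit triples (d, len, c): d = backward distance of the copied string,
    # len = length of the longest match (it may overlap the current position),
    # c = next character beyond the match; then advance by len + 1.
    out, i, n = [], 0, len(t)
    while i < n:
        best_d, best_len = 0, 0
        for start in range(max(0, i - window), i):
            l = 0
            while i + l < n - 1 and t[start + l] == t[i + l]:
                l += 1
            if l > best_len:
                best_d, best_len = i - start, l
        out.append((best_d, best_len, t[i + best_len]))
        i += best_len + 1
    return out

# lz77_parse("aacaacabcabaaac", 6) ->
# [(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')], as in the example above.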

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
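The corresponding decoder, as a short Python sketch (illustrative name; copying one character at a time is exactly what makes the overlapping case l > d work):

def lz77_decode(triples):
    out = []
    for d, length, c in triples:
        for _ in range(length):
            out.append(out[-d])      # char-by-char copy handles the overlap l > d
        out.append(c)
    return "".join(out)

# lz77_decode([(0,0,'a'),(0,0,'b'),(0,0,'c'),(0,0,'d'),(2,9,'e')]) == "abcdcdcdcdcdce"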

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
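A minimal Python sketch of an LZW decoder showing how that special case is usually handled (function name illustrative; the slides use a toy dictionary with a = 112, here the usual 0–255 single-byte codes are assumed):

def lzw_decode(codes, first_code=256):
    # Dictionary initially holds the single characters; new entries are S + c.
    dic = {i: chr(i) for i in range(first_code)}
    nxt = first_code
    prev = dic[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in dic:
            cur = dic[code]
        else:                        # the tricky SSc case: code not yet known
            cur = prev + prev[0]     # it must be the previous string + its first char
        out.append(cur)
        dic[nxt] = prev + cur[0]     # what the (one step ahead) encoder inserted
        nxt += 1
        prev = cur
    return "".join(out)

# lzw_decode([ord('a'), ord('a'), ord('b'), 256]) == "aabaa"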

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input    Dict           Output so far
112                     a
112      256 = aa       a a
113      257 = ab       a a b
256      258 = ba       a a b a a
114      259 = aac      a a b a a c
257      260 = ca       a a b a a c a b
261      261 = aba      a a b a a c a b a b a    (261 is not yet in the dictionary:
                                                  the decoder is one step behind the
                                                  coder and derives it as “ab” + “a”)
114      262 = abac     a a b a a c a b a b a c

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
We are given the text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
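A self-contained Python sketch of this inversion (illustrative names; it assumes L is given as a string and that the terminator is the unique, smallest character, so that row 0 of the sorted matrix is the rotation starting with it):

def bwt_invert(L, terminator="#"):
    # F is the first column (L sorted); LF[i] is the row in F of the occurrence L[i]
    # (equal characters keep their relative order: a stable sort of L gives F).
    n = len(L)
    order = sorted(range(n), key=lambda i: L[i])
    LF = [0] * n
    for f_row, l_row in enumerate(order):
        LF[l_row] = f_row
    # Walk LF from row 0: L[r] is the character preceding F[r] in T,
    # so the text comes out backwards, starting just before the terminator.
    out, r = [terminator], 0
    for _ in range(n - 1):
        out.append(L[r])
        r = LF[r]
    return "".join(reversed(out))

# bwt_invert("ipssm#pissii") == "mississippi#"   (the running example)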

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion pages are available (Google, 7/08)
5–40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
  Pr[ in-degree(u) = k ]  ∝  1/k^a ,    a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides an efficient, optimal solution

fknown is the “previously encoded text”: compress the concatenation fknown·fnew, starting from fnew

zdelta is one of the best implementations
             Emacs size    Emacs time
 uncompr     27Mb          ---
 gzip        8Mb           35 secs
 zdelta      1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

[Figure: dual-proxy architecture — Client ⇄ client-side proxy ⇄ (slow link, delta-encoded pages) ⇄ server-side proxy ⇄ (fast link) ⇄ web; each request carries a reference to the cached version, and only the delta of the page travels over the slow link.]

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: an example weighted graph GF on the files plus the dummy node 0; edge weights are the zdelta/gzip sizes (e.g. 620, 2000, 220, 123, 20, …) and the min branching selects the cheapest reference for each file.]

             space     time
 uncompr     30Mb      ---
 tgz         20%       linear
 THIS        8%        quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
             space     time
 uncompr     260Mb     ---
 tgz         12%       2 mins
 THIS        8%        16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

            gcc size    emacs size
 total      27288       27326
 gzip       7563        8577
 zdelta     227         1431
 rsync      964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Figure: the suffix tree of T# = mississippi# (positions 1..12): from the root, edges labelled #, i, mississippi#, p, s lead to subtrees whose edges carry substrings such as ssi, ppi#, pi#, si, i#, and each of the 12 leaves stores the starting position of its suffix.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
5

Θ(N²) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
• Search for some Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
• Search for a run Lcp[i, i+C−2] whose entries are all ≥ L



Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
        4K    8K    16K    32K     128K    256K    512K    1M
 n³     22s   3m    26m    3.5h    28h     --      --      --
 n²     0     0     0      1s      26s     106s    7m      28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm
  sum = 0; max = -1;
  For i = 1,…,n do
    if (sum + A[i] ≤ 0) then sum = 0;
    else { sum += A[i]; max = MAX{max, sum}; }

Note:
 • sum < 0 just before OPT starts;
 • sum > 0 within OPT
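A one-scan Python sketch of this algorithm (illustrative name; it returns 0 on an all-negative array, consistently with the slide’s assumption that some subsum is positive):

def max_subarray(A):
    # Drop the running prefix as soon as its sum becomes <= 0,
    # keep the best window sum seen so far.
    best, s = 0, 0
    for x in A:
        s += x
        if s <= 0:
            s = 0
        elif s > best:
            best = s
    return best

# max_subarray([2,-5,6,1,-2,4,3,-13,9,-6,7]) == 12   (the window 6 1 -2 4 3)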

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort ⟹ Θ(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 10⁹ random I/Os = 10⁹ × 5ms ≈ 2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
 01  if (i < j) then
 02    m = (i+j)/2;            // Divide
 03    Merge-Sort(A,i,m);      // Conquer
 04    Merge-Sort(A,m+1,j);
 05    Merge(A,i,m,j)          // Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:
  n = 10⁹ tuples ⟹ a few Gbs
  Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:
  It is an indirect sort: Θ(n log₂ n) random I/Os
  [5ms] × n log₂ n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

[Figure: the recursion tree of binary Merge-Sort over log₂ N levels, showing how pairs of sorted runs (e.g. 1 2 5 10 and 2 7 8 13 19) are repeatedly merged into longer runs.]

If the run-size is larger than B (i.e. after the first step!!),
fetching all of it in memory for merging does not help.
How do we deploy the disk/mem features?

With internal memory M: N/M runs, each sorted in internal memory (no I/Os)
— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge.
Sort N items with main-memory M and disk-pages of B items:

  Pass 1: Produce (N/M) sorted runs.
  Pass i: merge X ≤ M/B runs  ⟹  log_{M/B}(N/M) passes
[Figure: multi-way merging with X = M/B input buffers and one output buffer, all of B items, streaming from and to disk.]

Multiway Merging
[Figure: input buffers Bf1 … BfX (one per run, with cursors p1 … pX) and output buffer Bfo; repeatedly move min(Bf1[p1], Bf2[p2], …, BfX[pX]) to Bfo, fetch the next page of run i when pi = B, and flush Bfo to the merged output run when it is full, until EOF.]

Cost of Multi-way Merge-Sort

  Number of passes = log_{M/B} #runs ≤ log_{M/B} (N/M)
  Optimal cost = Θ( (N/B) · log_{M/B} (N/M) ) I/Os

In practice
  M/B ≈ 1000  ⟹  #passes = log_{M/B} (N/M) ≈ 1
  One multiway merge ⟹ 2 passes = few mins     (tuning depends on disk features)

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!
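A compact Python sketch of the merging step described above, using a min-heap over the current items of the X runs (illustrative name; in-memory lists stand in for the disk-resident runs and buffers):

import heapq

def multiway_merge(runs):
    # Keep one "current item" per run in a min-heap; repeatedly output the
    # minimum and refill from the run it came from.
    iters = [iter(r) for r in runs]
    heap = []
    for k, it in enumerate(iters):
        first = next(it, None)
        if first is not None:
            heap.append((first, k))
    heapq.heapify(heap)
    out = []
    while heap:
        x, k = heapq.heappop(heap)
        out.append(x)
        nxt = next(iters[k], None)
        if nxt is not None:
            heapq.heappush(heap, (nxt, k))
    return out

# multiway_merge([[1,2,5,10], [2,7,8], [3,4,11]]) -> [1,2,2,3,4,5,7,8,10,11]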

Can compression help?

Goal: enlarge M and reduce N
  #passes = O(log_{M/B} (N/M))
  Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm

  Use a pair of variables ⟨X, C⟩ (initially X = first item, C = 1)
  For each subsequent item s of the stream:
    if (X == s) then C++
    else { C--; if (C == 0) { X = s; C = 1; } }
  Return X;

Proof

If X ≠ y at the end, then every one of y’s occurrences has a distinct
“negative” mate, i.e. an occurrence of some other item that cancelled it.
Hence these mates are ≥ #occ(y), so N ≥ 2 · #occ(y) — impossible when
#occ(y) > N/2.
(The returned X may be wrong only if no item occurs more than N/2 times.)
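A tiny Python sketch of this constant-space scan (illustrative name; this is the classical majority-vote formulation, equivalent to the pseudocode above):

def majority_candidate(stream):
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1
        elif X == s:
            C += 1
        else:
            C -= 1
    return X      # guaranteed correct only if some item occurs > N/2 times

# majority_candidate(list("bacccdcbaaaccbccc")) == "c"   (the stream above)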

Toy problem #4: Indexing


Consider the following TREC collection:







  N = 6 × 10⁹ characters  ⟹  size = 6Gb
  n = 10⁶ documents
  TotT = 10⁹ term occurrences (avg term length is 6 chars)
  t = 5 × 10⁵ distinct terms

What kind of data structure should we build to support
word-based searches ?

Solution 1: Term-Doc matrix     (n = 1 million docs, t = 500K terms)

            Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
 Antony            1                1             0          0       0        1
 Brutus            1                1             0          1       0        0
 Caesar            1                1             0          1       1        1
 Calpurnia         0                1             0          0       0        0
 Cleopatra         1                0             0          0       0        0
 mercy             1                0             1          1       1        1
 worser            1                0             1          1       1        0

 1 if the play contains the word, 0 otherwise.        Space is 500Gb !

Solution 2: Inverted index

[Figure: the postings lists of Brutus, Calpurnia and Caesar — for each term, the sorted list of the documents containing it. We can still do better: i.e. 30÷50% of the original text.]

1. Typically use about 12 bytes
2. We have 10⁹ total terms ⟹ at least 12Gb space
3. Compressing the 6Gb of documents gets 1.5Gb of data
Better index, but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

  Σi=1,...,n−1 2^i  =  2^n − 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
[Figure: the binary trie of the code, with a at the leaf reached by 0, b by 100, c by 101, d by 11.]

Average Length
For a code C with codeword length L[s], the
average length is defined as
  La(C) = Σs∈S p(s) · L[s]

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
[Figure: the Huffman tree built by first merging a(.1) and b(.2) into a node of weight .3, then merging it with c(.2) into .5, and finally with d(.5) into the root (1).]

a = 000, b = 001, c = 01, d = 1
There are 2^(n−1) “equivalent” Huffman trees

What about ties (and thus, tree depth) ?
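A short Python sketch of the construction (illustrative name; the tie-breaking counter is an assumption that fixes which of the many equivalent trees is built):

import heapq
from itertools import count

def huffman_codes(probs):
    # Repeatedly merge the two least-probable trees; codes are root-to-leaf paths.
    tie = count()
    heap = [(p, next(tie), {s: ""}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, next(tie), merged))
    return heap[0][2]

# huffman_codes({'a': .1, 'b': .2, 'c': .2, 'd': .5}) yields codeword lengths
# 3, 3, 2, 1: the same as a=000, b=001, c=01, d=1, up to swapping 0/1 labels.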

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
  abc…  →  000 001 01 …  =  00000101…            101001…  →  d c b …

[Figure: the same Huffman tree, traversed root-to-leaf for encoding and bit-by-bit for decoding.]

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
[Figure: the word-based Huffman code over T = “bzip or not bzip”: codewords are sequences of 7-bit configurations packed into bytes, with the first bit of the first byte tagged; e.g. the word “or” gets its own byte-aligned, tagged codeword, and C(T) is the concatenation of the codewords of the words and separators of T.]

CGrep and other ideas...
P= bzip = 1a 0b

[Figure: GREP is run directly on the compressed text C(T) of T = “bzip or not bzip”, matching the codeword of P (bzip = 1a 0b) against the byte-aligned codewords; the tag bits prevent false matches inside other codewords, and the two occurrences of bzip are reported (yes), the others skipped (no).]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

[Figure: the dictionary {bzip, not, or}, the compressed text C(S) of S = “bzip or not bzip”, and the scan of C(S) with the codeword of P = bzip (= 1a 0b); codeword-by-codeword comparison answers yes/no at each position.]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

  H(s) = Σi=1,...,m 2^(m−i) · s[i]

  P = 0101
  H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5
  s = s’ if and only if H(s) = H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

  H(Tr) = 2·H(Tr−1) − 2^m · T(r−1) + T(r+m−1)

  T = 10110101
  T1 = 1 0 1 1     T2 = 0 1 1 0
  H(T1) = H(1011) = 11
  H(T2) = H(0110) = 2·11 − 2⁴·1 + 0 = 22 − 16 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
  P = 101111,  q = 7
  H(P) = 47,   Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally:
  1·2 (mod 7) + 0 = 2
  2·2 (mod 7) + 1 = 5
  5·2 (mod 7) + 1 = 11 (mod 7) = 4
  4·2 (mod 7) + 1 = 9 (mod 7) = 2
  2·2 (mod 7) + 1 = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(Tr−1), since
  2^m (mod q) = 2 · (2^(m−1) (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
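A compact Python sketch of the deterministic variant (illustrative name; it assumes the binary alphabet used in the slides and verifies each fingerprint hit, so false matches never reach the output):

def karp_rabin(T, P, q=2**31 - 1):
    # Compare Hq(P) with the rolling fingerprint Hq(T_r), verifying candidates.
    n, m = len(T), len(P)
    if m > n:
        return []
    hp = ht = 0
    for i in range(m):
        hp = (2 * hp + int(P[i])) % q
        ht = (2 * ht + int(T[i])) % q
    top = pow(2, m - 1, q)                    # 2^(m-1) mod q
    occ = []
    for r in range(n - m + 1):
        if hp == ht and T[r:r + m] == P:      # verification step
            occ.append(r)
        if r + m < n:                         # slide: drop T[r], add T[r+m]
            ht = ((ht - int(T[r]) * top) * 2 + int(T[r + m])) % q
    return occ

# karp_rabin("10110101", "0101") -> [4]  (position 5 in the slides' 1-based indexing)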

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

[Figure: the same dictionary {bzip, not, or} and compressed text C(S) of S = “bzip or not bzip”; scanning C(S) codeword-by-codeword with the fingerprint of P = bzip (= 1a 0b) reports the two occurrences (yes) and skips the others.]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

       j:  1  2  3  4  5  6  7  8  9  10
  T:       c  a  l  i  f  o  r  n  i  a
  f (i=1): 0  0  0  0  1  0  0  0  0  0
  o (i=2): 0  0  0  0  0  1  0  0  0  0
  r (i=3): 0  0  0  0  0  0  1  0  0  0

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit bit-parallelism to compute the j-th column of M
from the (j−1)-th one.
We define the m-length binary vector U(x) for each character x in the
alphabet: U(x) is set to 1 at the positions in P where character x appears.
Example:
P = abaac
  U(a) = (1,0,1,1,0)ᵀ     U(b) = (0,1,0,0,0)ᵀ     U(c) = (0,0,0,0,1)ᵀ

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, entry M(i,j) = 1 iff
(1) the first i−1 characters of P match the i−1 characters of T
    ending at character j−1   ⇔  M(i−1, j−1) = 1
(2) P[i] = T[j]   ⇔  the i-th bit of U(T[j]) = 1

BitShift moves bit M(i−1, j−1) into the i-th position;
AND-ing this with the i-th bit of U(T[j]) establishes whether both conditions hold.
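A minimal Python sketch of the whole scan (illustrative name; Python integers stand in for the machine word, and positions are 0-based), before the step-by-step examples that follow:

def shift_and(T, P):
    # U[c] has bit i set iff P[i] = c;  M(j) = ((M(j-1) << 1) | 1) & U[T[j]]
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    last = 1 << (m - 1)
    M, occ = 0, []
    for j, c in enumerate(T):
        M = ((M << 1) | 1) & U.get(c, 0)
        if M & last:
            occ.append(j - m + 1)          # an occurrence of P ends at position j
    return occ

# shift_and("xabxabaaca", "abaac") -> [4]   (the running example, 0-based)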

An example: j=1
  T = x a b x a b a a c a,   P = abaac,   U(x) = (0,0,0,0,0)ᵀ
  M(1) = BitShift( M(0) ) & U(T[1]) = (1,0,0,0,0)ᵀ & (0,0,0,0,0)ᵀ = (0,0,0,0,0)ᵀ

An example: j=2
  U(a) = (1,0,1,1,0)ᵀ
  M(2) = BitShift( M(1) ) & U(T[2]) = (1,0,0,0,0)ᵀ & (1,0,1,1,0)ᵀ = (1,0,0,0,0)ᵀ

An example: j=3
  U(b) = (0,1,0,0,0)ᵀ
  M(3) = BitShift( M(2) ) & U(T[3]) = (1,1,0,0,0)ᵀ & (0,1,0,0,0)ᵀ = (0,1,0,0,0)ᵀ

An example: j=9
  U(c) = (0,0,0,0,1)ᵀ,   M(8) = (1,0,0,1,0)ᵀ
  M(9) = BitShift( M(8) ) & U(T[9]) = (1,1,0,0,1)ᵀ & (0,0,0,0,1)ᵀ = (0,0,0,0,1)ᵀ
  The last bit is 1: an occurrence of P = abaac ends at position 9 of T.

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: Another solution
Dictionary

P = bzip = 1a 0b

[Figure: the dictionary {bzip, not, or} and the compressed text C(S) of S = “bzip or not bzip”; the Shift-And automaton is run directly over the codewords of C(S), answering yes/no at each codeword boundary.]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

P = o

[Figure: the dictionary {bzip, not, or}, the compressed text C(S) of S = “bzip or not bzip”, and the pattern P = o; the terms containing P are located in C(S).]

not = 1 g 0 g 0 a
or  = 1 g 0 a 0 b

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M ( j - 1))  U (T [ j ])
l

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M

l -1

( j - 1))

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1
1 2 3 4 5 6 7 8 910

M1=

T=xabxabaaca
P=
abaad

1

2

3

4

5

6

7

8

9

1
0

1

1

1

1

1

1

1

1

1

1

1

2

0

0

1

0

0

1

0

1

1

0

3

0

0

0

1

0

0

1

0

0

1

4

0

0

0

0

1

0

0

1

0

0

5

0

0

0

0

0

0

0

0

1

0

M0=

1

2

3

4

5

6

7

8

9

1
0

1

0

1

0

0

1

0

1

1

0

1

2

0

0

1

0

0

1

0

0

0

0

3

0

0

0

0

0

0

1

0

0

0

4

0

0

0

0

0

0

0

1

0

0

5

0

0

0

0

0

0

0

0

0

0

How much do we pay?





The running time is O(kn(1+m/w)
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Delection: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
0 000 ...........0 x in b in ary
Length-1


x > 0 and Length = log2 x +1
e.g., 9 represented as <000,1001>.



g-code for x takes 2 log2 x +1 bits

(ie. factor of 2 from optimal)



Optimal for Pr(x) = 1/2x2, and i.i.d integers

It is a prefix-free encoding…

Exercise: given the following sequence of γ-coded integers, reconstruct the original sequence:

0001000001100110000011101100111   →   8, 6, 3, 59, 7
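A tiny sketch of γ-encoding/decoding (ours; bit-strings of '0'/'1' are used for clarity, not speed), which also decodes the exercise above:

  def gamma_encode(x):
      b = bin(x)[2:]                       # x > 0, Length = len(b) = floor(log2 x) + 1
      return "0" * (len(b) - 1) + b        # (Length - 1) zeros, then x in binary

  def gamma_decode_stream(bits):
      out, i = [], 0
      while i < len(bits):
          z = 0
          while bits[i] == "0":            # count the leading zeros = Length - 1
              z += 1; i += 1
          out.append(int(bits[i:i + z + 1], 2))
          i += z + 1
      return out

  assert gamma_encode(9) == "0001001"
  assert gamma_decode_stream("0001000001100110000011101100111") == [8, 6, 3, 59, 7]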

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).

Recall that: |γ(i)| ≤ 2·log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2·H0(s) + 1
Key fact:
1 ≥ Σi=1,…,x pi ≥ x·px   ⇒   x ≤ 1/px

How good is it ?
Encode the integers via γ-coding:
|γ(i)| ≤ 2·log i + 1
The cost of the encoding is (recall i ≤ 1/pi):

  Σi=1,…,|S| pi · |γ(i)|   ≤   Σi=1,…,|S| pi · [ 2·log(1/pi) + 1 ]   =   2·H0(X) + 1

Not much worse than Huffman,
and improvable to H0(X) + 2 + …

A better encoding

Byte-aligned and tagged Huffman
 128-ary Huffman tree
 First bit of the first byte is tagged
 Configurations on 7 bits: just those of Huffman

End-tagged dense code
 The rank r is mapped to the r-th binary sequence on 7·k bits
 First bit of the last byte is tagged

A better encoding: surprising changes
 It is a prefix-code
 Better compression: it uses all the 7-bit configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2

A new concept: Continuers vs Stoppers
 Previously we used: s = c = 128

The main idea is:
 s + c = 256 (we are playing with 8 bits)
 Thus s items are encoded with 1 byte
 And s·c with 2 bytes, s·c² on 3 bytes, ...

An example
 5000 distinct words
 ETDC encodes 128 + 128² = 16512 words on 2 bytes
 A (230,26)-dense code encodes 230 + 230·26 = 6210 words on 2 bytes, hence more words on 1 byte, which wins if the distribution is skewed...
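A minimal sketch (ours) of (s,c)-dense encoding of a rank, under one common byte layout: the final “stopper” byte holds a value in [0,s), every preceding “continuer” byte a value in [s,256):

  def scd_encode(rank, s=230):
      # last ("stopper") byte in [0, s), earlier ("continuer") bytes in [s, 256); c = 256 - s
      c = 256 - s
      k, first = 1, 0                      # s * c**(k-1) words get a codeword of exactly k bytes
      while rank >= first + s * c ** (k - 1):
          first += s * c ** (k - 1)
          k += 1
      off = rank - first
      out = [off % s]                      # stopper
      off //= s
      for _ in range(k - 1):
          out.append(s + off % c)          # continuers
          off //= c
      return bytes(reversed(out))

  assert len(scd_encode(200, s=230)) == 1      # with s = c = 128 (ETDC) rank 200 needs 2 bytes
  assert len(scd_encode(200, s=128)) == 2
  assert len(scd_encode(10000, s=230)) == 3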

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 256 − s.

 Brute-force approach
 Binary search: on real distributions, it seems there is one unique minimum

Ks = max codeword length
Fs,k = cumulative probability of the symbols whose codeword length is ≤ k

Experiments: (s,c)-DC is quite interesting…
Search is 6% faster than byte-aligned Huffword.

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:
 Exploits temporal locality, and it is dynamic
 X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n² log n), MTF = O(n log n) + n²

Not much worse than Huffman
...but it may be far better
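A small sketch (ours) of the MTF transform just described; positions are output 0-based:

  def mtf_encode(text, alphabet):
      L = list(alphabet)                  # the MTF list; front = position 0
      out = []
      for s in text:
          p = L.index(s)                  # 1) output the (0-based) position of s in L
          out.append(p)
          L.insert(0, L.pop(p))           # 2) move s to the front of L
      return out

  def mtf_decode(codes, alphabet):
      L = list(alphabet)
      out = []
      for p in codes:
          s = L.pop(p)
          out.append(s)
          L.insert(0, s)
      return "".join(out)

  assert mtf_encode("mississippi", "imps") == [1, 1, 3, 0, 1, 1, 0, 1, 3, 0, 1]
  assert mtf_decode(mtf_encode("mississippi", "imps"), "imps") == "mississippi"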

MTF: how good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1
Put S in front of the sequence and consider the cost of encoding
(p_i^x = position of the i-th occurrence of symbol x, n_x = #occurrences of x, N = sequence length):

  O(|S| log |S|) + Σx=1,…,|S| Σi=2,…,nx |γ( p_i^x − p_(i−1)^x )|

By Jensen’s inequality:

  ≤ O(|S| log |S|) + Σx=1,…,|S| nx · [ 2·log(N/nx) + 1 ]
  ≤ O(|S| log |S|) + N · [ 2·H0(X) + 1 ]

Hence La[mtf] ≤ 2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and separators) as the symbols to be encoded.
How to maintain the MTF-list efficiently:

 Search tree
   Leaves contain the symbols, ordered as in the MTF-list
   Nodes contain the size of their descending subtree
 Hash Table
   key is a symbol
   data is a pointer to the corresponding tree leaf

Each tree operation takes O(log |S|) time.
Total is O(n log |S|), where n = #symbols to be compressed.

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just the run lengths and one bit
Properties:
 Exploits spatial locality, and it is a dynamic code; there is a memory
 X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)
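A matching sketch (ours) for RLE, reproducing the example above:

  def rle_encode(s):
      out = []
      for c in s:
          if out and out[-1][0] == c:
              out[-1] = (c, out[-1][1] + 1)     # extend the current run
          else:
              out.append((c, 1))                # start a new run
      return out

  assert rle_encode("abbbaacccca") == [("a",1), ("b",3), ("a",2), ("c",4), ("a",1)]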

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive), whose lower end is the cumulative probability

  f(i) = Σj=1,…,i−1 p(j)

e.g. with p(a) = .2, p(b) = .5, p(c) = .3 we get f(a) = .0, f(b) = .2, f(c) = .7, i.e.
a ↦ [0, .2),  b ↦ [.2, .7),  c ↦ [.7, 1.0)

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7)).

Arithmetic Coding: Encoding Example
Coding the message sequence: bac

start:       [0, 1)
b = .5  ⇒   [.2, .7)
a = .2  ⇒   [.2, .3)
c = .3  ⇒   [.27, .3)

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities p[c] use the following:

  l0 = 0,  li = li−1 + si−1 · f[ci]
  s0 = 1,  si = si−1 · p[ci]

f[c] is the cumulative prob. up to symbol c (not included).
Final interval size is  sn = Πi=1,…,n p[ci]

The interval for a message sequence will be called the sequence interval.
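The recurrences translate directly into code. A small floating-point sketch (ours; a real coder uses the integer version discussed later) that computes the sequence interval and decodes a number from it:

  def cumulative(p):
      f, acc = {}, 0.0
      for c in sorted(p):                 # any fixed symbol order shared by coder and decoder
          f[c] = acc
          acc += p[c]
      return f

  def sequence_interval(msg, p):
      f = cumulative(p)
      l, s = 0.0, 1.0                     # l0 = 0, s0 = 1
      for c in msg:
          l = l + s * f[c]                # li = l(i-1) + s(i-1) * f[ci]
          s = s * p[c]                    # si = s(i-1) * p[ci]
      return l, s                         # the sequence interval is [l, l+s)

  def decode_number(x, n, p):
      f = cumulative(p)
      out = []
      for _ in range(n):
          for c in sorted(p, reverse=True):
              if x >= f[c]:               # find the symbol interval containing x
                  out.append(c)
                  x = (x - f[c]) / p[c]   # rescale and continue with the next symbol
                  break
      return "".join(out)

  p = {"a": .2, "b": .5, "c": .3}
  # sequence_interval("bac", p) ~ (0.27, 0.03), i.e. the interval [.27, .3) of the example
  # decode_number(0.49, 3, p) == "bbc", as in the decoding example below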

Uniquely defining an interval
Important property: the intervals for distinct messages of length n will never overlap.
Therefore, specifying any number in the final interval uniquely determines the msg.
Decoding is similar to encoding, but on each step we need to determine the message symbol and then reduce the interval.

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:

[0, 1):      .49 ∈ [.2, .7)     ⇒ b
[.2, .7):    .49 ∈ [.3, .55)    ⇒ b
[.3, .55):   .49 ∈ [.475, .55)  ⇒ c

The message is bbc.

Representing a real number
Binary fractional representation:   .75 = .11    1/3 = .01010101… (periodic)    11/16 = .1011

Algorithm
1. x = 2·x
2. If x < 1 output 0
3. else x = x − 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
e.g.  [0,.33) → .01    [.33,.66) → .1    [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions.

  number   min      max      interval
  .11      .110     .111     [.75, 1.0)
  .101     .1010    .1011    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval is contained in the sequence interval (a dyadic number).

e.g. sequence interval [.61, .79); the code interval of .101 is [.625, .75) ⊆ [.61, .79)

Can use l + s/2 truncated to 1 + ⌈log (1/s)⌉ bits

Bound on Arithmetic length
Note that −log s + 1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

  1 + ⌈log (1/s)⌉ = 1 + ⌈log Πi=1,…,n (1/pi)⌉
                  ≤ 2 + Σi=1,…,n log (1/pi)
                  = 2 + Σk=1,…,|S| n·pk·log (1/pk)
                  = 2 + n·H0   bits

In practice nH0 + 0.02·n bits, because of rounding.

Integer Arithmetic Coding
The problem is that operations on arbitrary-precision real numbers are expensive.
Key ideas of the integer version:
 Keep integers in the range [0..R) where R = 2^k
 Use rounding to generate integer intervals
 Whenever the sequence interval falls into the top, bottom or middle half, expand the interval by a factor of 2

Integer Arithmetic is an approximation.

Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
  Output 1 followed by m 0s; m = 0; the message interval is expanded by 2
If u < R/2 then (bottom half)
  Output 0 followed by m 1s; m = 0; the message interval is expanded by 2
If l ≥ R/4 and u < 3R/4 then (middle half)
  Increment m; the message interval is expanded by 2
All other cases: just continue...

You find this at

Arithmetic ToolBox
As a state machine: given the current interval (L,s), the symbol c and the distribution (p1,…,p|S|), the ATB outputs the new interval (L',s'):

  (L,s), c, (p1,…,p|S|)   --ATB-->   (L',s')

Therefore, even the distribution can change over time.

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.
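A small sketch (ours) of the bookkeeping PPM needs: context tables plus an escape whose count is taken here as the number of distinct symbols seen in that context — one common variant, and the one consistent with the example table two slides below; real PPM variants differ exactly in this choice.

  from collections import defaultdict

  def build_contexts(text, k):
      # counts[context][c] = how many times symbol c followed 'context' (contexts of length 0..k)
      counts = defaultdict(lambda: defaultdict(int))
      for i, c in enumerate(text):
          for order in range(0, k + 1):
              if i - order >= 0:
                  counts[text[i - order:i]][c] += 1
      return counts

  def ppm_prob(counts, context, c):
      # probability assigned to c: try the longest context, multiply in an escape
      # probability and shorten the context on a miss
      prob = 1.0
      while True:
          tbl = counts.get(context)
          if tbl:
              total = sum(tbl.values()) + len(tbl)   # symbol counts + escape count
              if c in tbl:
                  return prob * tbl[c] / total
              prob *= len(tbl) / total               # emit an escape
          if context == "":
              return prob                            # mass left for a uniform order(-1) model
          context = context[1:]                      # retry with a shorter context

  # counts = build_contexts("ACCBACCACBA", 2)
  # counts[""] has A:4, B:2, C:5 and counts["AC"] has B:1, C:2 -- as in the table below.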

PPM + Arithmetic ToolBox
The ATB is driven by p[ s | context ], where s is either a plain character c or the escape symbol esc:

  (L,s), p[ s | context ]   --ATB-->   (L',s')

Encoder and Decoder must know the protocol for selecting the same conditional probability distribution (PPM-variant).

PPM: Example Contexts          String = ACCBACCACBA B          k = 2

Context Empty:   A = 4   B = 2   C = 5   $ = 3

Context A:   C = 3   $ = 1
Context B:   A = 2   $ = 1
Context C:   A = 1   B = 2   C = 2   $ = 3

Context AC:  B = 1   C = 2   $ = 2
Context BA:  C = 1   $ = 1
Context CA:  C = 1   $ = 1
Context CB:  A = 2   $ = 1
Context CC:  A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit frequency estimation

LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary = ???     Cursor (all substrings starting here)     Example output: <2,3,c>

Algorithm’s step:
 Output ⟨d, len, c⟩ where
   d = distance of the copied string wrt the current position
   len = length of the longest match
   c = next char in the text beyond the longest match
 Advance by len + 1

A buffer “window” has fixed length and moves.

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if len > d? (overlap with the text to be compressed)
 E.g. seen = abcd, next codeword is (2,9,e)
 Simply copy starting at the cursor:
   for (i = 0; i < len; i++)
     out[cursor + i] = out[cursor - d + i]
 Output is correct: abcdcdcdcdcdce
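A complete LZ77 decoder is only a few lines; this sketch (ours) handles the overlapping case exactly as above:

  def lz77_decode(triples):
      out = []
      for d, length, c in triples:
          start = len(out) - d
          for i in range(length):
              out.append(out[start + i])   # byte-by-byte copy: correct even when length > d
          out.append(c)
      return "".join(out)

  # the windowed example above:
  assert lz77_decode([(0,0,"a"), (1,1,"c"), (3,4,"b"), (3,3,"a"), (1,2,"c")]) == "aacaacabcabaaac"
  # the overlapping case: after abcd, the codeword (2,9,"e") copies past the cursor
  assert lz77_decode([(0,0,"a"), (0,0,"b"), (0,0,"c"), (0,0,"d"), (2,9,"e")]) == "abcdcdcdcdcdce"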

LZ77 Optimizations used by gzip
LZSS: output one of the following formats
  (0, position, length)   or   (1, char)
Typically uses the second format if length < 3.
Special greedy: possibly use a shorter match so that the next match is better.
Hash table to speed up the search for triplets.
Triples are coded with Huffman’s code.

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later
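A compact LZW sketch (ours). The dictionary is initialized from a small explicit alphabet instead of the 256 ASCII codes, and the decoder shows the special handling of a code it has not built yet (the “one step later” case):

  def lzw_encode(text, alphabet):
      dic = {c: i for i, c in enumerate(alphabet)}    # toy init; real coders start from the 256 byte codes
      out, S = [], ""
      for c in text:
          if S + c in dic:
              S += c                                  # extend the current match
          else:
              out.append(dic[S])
              dic[S + c] = len(dic)                   # add Sc to the dictionary, but do not send c
              S = c
      out.append(dic[S])
      return out

  def lzw_decode(codes, alphabet):
      dic = {i: c for i, c in enumerate(alphabet)}
      prev = dic[codes[0]]
      out = [prev]
      for code in codes[1:]:
          cur = dic[code] if code in dic else prev + prev[0]   # code not known yet: the SSc case
          out.append(cur)
          dic[len(dic)] = prev + cur[0]
          prev = cur
      return "".join(out)

  msg = "aabaacababacb"                  # the string of the encoding example above
  assert lzw_decode(lzw_encode(msg, "abc"), "abc") == msg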

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform (1994)
Let us be given a text T = mississippi#

All rotations of T:        Sort the rows:    F            L
mississippi#                                 # mississipp i
ississippi#m                                 i #mississip p
ssissippi#mi                                 i ppi#missis s
sissippi#mis                                 i ssippi#mis s
issippi#miss                                 i ssissippi# m
ssippi#missi                                 m ississippi #
sippi#missis                                 p i#mississi p
ippi#mississ                                 p pi#mississ i
ppi#mississi                                 s ippi#missi s
pi#mississip                                 s issippi#mi s
i#mississipp                                 s sippi#miss i
#mississippi                                 s sissippi#m i

L = ipssm#pissii is the BWT of T.

A famous example

Much
longer...

A useful tool: the L → F mapping
[Same sorted-rotation matrix as above: F is its first column, L its last column, and the text T is unknown.]

How do we map L’s chars onto F’s chars?
... we need to distinguish equal chars in F...
Take two equal chars of L and rotate their rows rightward by one position: they keep the same relative order !!

The BWT is invertible
[Again the sorted-rotation matrix: F and L known, T unknown.]

Two key properties:
1. The LF-array maps L’s chars to F’s chars
2. L[ i ] precedes F[ i ] in T

Reconstruct T backward:    T = .... i ppi #

InvertBWT(L)
  Compute LF[0,n-1];
  r = 0; i = n;
  while (i > 0) {
    T[i] = L[r];
    r = LF[r]; i--;
  }
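A runnable version (ours) of the inversion just described, plus the naive rotation-sorting construction; it assumes the end marker # is the smallest character, so that row 0 of the sorted matrix starts with #:

  def bwt(T):
      # naive construction by sorting all rotations (a suffix array is the efficient way)
      rotations = sorted(T[i:] + T[:i] for i in range(len(T)))
      return "".join(r[-1] for r in rotations)

  def bwt_inverse(L):
      n = len(L)
      order = sorted(range(n), key=lambda i: L[i])   # stable: order[f] = position in L of F[f]
      LF = [0] * n
      for f_pos, l_pos in enumerate(order):
          LF[l_pos] = f_pos                          # LF maps a position of L to its position in F
      out, r = ["#"], 0                              # row 0 starts with the end marker #
      for _ in range(n - 1):
          out.append(L[r])                           # L[r] precedes F[r] in T
          r = LF[r]
      return "".join(reversed(out))

  assert bwt("mississippi#") == "ipssm#pissii"
  assert bwt_inverse("ipssm#pissii") == "mississippi#"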

How to compute the BWT ?

SA = [ 12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3 ]
The i-th row of the BWT matrix is the rotation of T = mississippi# starting at position SA[i].

We said that: L[i] precedes F[i] in T; e.g. L[3] = T[7].
Given SA and T, we have L[i] = T[ SA[i] − 1 ].

How to construct SA from T ?
Input: T = mississippi#

  SA    suffix
  12    #
  11    i#
   8    ippi#
   5    issippi#
   2    ississippi#
   1    mississippi#
  10    pi#
   9    ppi#
   7    sippi#
   4    sissippi#
   6    ssippi#
   3    ssissippi#

Elegant but inefficient. Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi#
L = ipppssssssmmmii#pppiiissssssiiiiii          (# is at position 16 of L)

MTF-list = [i,m,p,s]
Mtf  = 020030000030030200300300000100000
Mtf  = 030040000040040300400400000200000        Bin(6) = 110, Wheeler’s code
RLE0 = 03141041403141410210                     (alphabet of |S|+1 symbols)

Bzip2-output = Arithmetic/Huffman on the |S|+1 symbols...
... plus γ(16), plus the original MTF-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics
Size
 1 trillion pages available (Google, 7/08)
 5-40K per page => hundreds of terabytes
 Size grows every day!!
Change
 8% new pages, 25% new links change weekly
 Life time of about 10 days

The Bow Tie

Some definitions
Weakly connected components (WCC)
 Set of nodes such that from any node one can go to any other node via an undirected path.
Strongly connected components (SCC)
 Set of nodes such that from any node one can go to any other node via a directed path.

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?
 The largest artifact ever conceived by humans
 Exploit the structure of the Web for
   Crawl strategies
   Search
   Spam detection
   Discovering communities on the web
   Classification/organization
 Predict the evolution of the Web
   Sociological understanding

Many other large graphs…
 Physical network graph
   V = Routers, E = communication links
 The “cosine” graph (undirected, weighted)
   V = static web pages, E = semantic distance between pages
 Query-Log graph (bipartite, weighted)
   V = queries and URLs, E = (q,u) if u is a result for q and has been clicked by some user who issued q
 Social graph (undirected, unweighted)
   V = users, E = (x,y) if x knows y (facebook, address book, email, ..)

Definition
Directed graph G = (V,E)
 V = URLs, E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN & no OUT)
Three key properties:
 Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution
[Plots: Altavista crawl 1999 and WebBase crawl 2001 — the indegree follows a power-law distribution]

  Pr[ in-degree(u) = k ]  ≈  1/k^α,   α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)
 V = URLs, E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN, no OUT)

Three key properties:
 Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1
 Locality: usually most of the hyperlinks point to other URLs on the same host (about 80%).
 Similarity: pages close in lexicographic order tend to share many outgoing lists.

A Picture of the Web Graph
21 million pages, 150 million links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
From the uncompressed adjacency list to an adjacency list with compressed gaps (exploiting locality):

  Successor list S(x) = { s1 − x, s2 − s1 − 1, ..., sk − s(k−1) − 1 }
  (the first entry may be negative; negative entries are mapped to positive ones, see the Extra-nodes slide)

Copy-lists (exploiting similarity), with reference chains possibly limited:
 Each bit of y’s copy-list tells whether the corresponding successor of the reference x is also a successor of y;
 The reference index is chosen in [0,W] so as to give the best compression.

Copy-blocks = RLE(Copy-list):
 The first copy-block is 0 if the copy-list starts with 0;
 The last block is omitted (we know the length…);
 The length is decremented by one for all blocks.

This is a Java and C++ lib (≈ 3 bits/edge).
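A small sketch (ours, not the WebGraph code) of the gap encoding of a successor list; the node id and successor list below are made up, and the sign-folding map for the possibly negative first gap is the 2v / 2|v|−1 rule shown in the next slide:

  def gap_encode(x, successors):
      # S(x) = { s1 - x, s2 - s1 - 1, ..., sk - s(k-1) - 1 }; only the first entry may be negative
      gaps, prev = [], None
      for s in successors:
          gaps.append(s - x if prev is None else s - prev - 1)
          prev = s
      return gaps

  def fold_sign(v):
      # map for the possibly negative first entry: v >= 0 -> 2v, v < 0 -> 2|v| - 1
      return 2 * v if v >= 0 else 2 * (-v) - 1

  # hypothetical node 15 with successors [13, 15, 16, 17, 18, 19, 23, 24, 203]:
  assert gap_encode(15, [13, 15, 16, 17, 18, 19, 23, 24, 203]) == [-2, 1, 0, 0, 0, 0, 3, 0, 178]
  assert fold_sign(-2) == 3 and fold_sign(0) == 0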

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
[Setting: a sender transmits data to a receiver; the receiver already has some knowledge about the data.]

 network links are getting faster and faster, but
 many clients are still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques
 caching: “avoid sending the same object again”
   done on the basis of objects
   only works if objects are completely unchanged
   What about objects that are slightly changed?
 compression: “remove redundancy in transmitted data”
   avoid repeated substrings in data
   can be extended to the history of past transmissions (overhead)
   What if the sender has never seen the data at the receiver ?

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization
 Delta compression   [diff, zdelta, REBL,…]
   Compress file f deploying file f’
   Compress a group of files
   Speed up web access by sending the differences between the requested page and the ones available in cache
 File synchronization   [rsync, zsync]
   Client updates an old file f_old with f_new available on a server
   Mirroring, Shared Crawling, Content Distribution Networks
 Set reconciliation
   Client updates a structured old file f_old with f_new available on a server
   Update of contacts or appointments, intersecting inverted lists in a P2P search engine

Z-delta compression   (one-to-one)

Problem: we have two files f_known and f_new, and the goal is to compute a file f_d of minimum size such that f_new can be derived from f_known and f_d.
 Assume that block moves and copies are allowed
 Find an optimal covering set of f_new based on f_known
 The LZ77-scheme provides an efficient, optimal solution:
   f_known is the “previously encoded text”: compress the concatenation f_known·f_new, emitting output only from f_new onwards
 zdelta is one of the best implementations
           Emacs size   Emacs time
uncompr    27Mb         ---
gzip        8Mb         35 secs
zdelta     1.5Mb        42 secs
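Not zdelta itself, but the same idea can be sketched with zlib’s preset-dictionary feature (assuming the reference file fits the 32KB window), so that material already in f_known costs only back-references:

  import zlib

  def delta_compress(f_new: bytes, f_known: bytes) -> bytes:
      co = zlib.compressobj(level=9, zdict=f_known)     # f_known acts as a preset dictionary
      return co.compress(f_new) + co.flush()

  def delta_decompress(delta: bytes, f_known: bytes) -> bytes:
      do = zlib.decompressobj(zdict=f_known)
      return do.decompress(delta) + do.flush()

  old = b"".join(b"line %3d: value=%6d\n" % (i, i * i) for i in range(200))
  new = old + b"one new appended line\n"
  d = delta_compress(new, old)
  assert delta_decompress(d, old) == new
  # len(d) is typically a few dozen bytes here, far less than compressing f_new from scratch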

Efficient Web Access
Dual proxy architecture: a pair of proxies, located on each side of the slow link, use a proprietary protocol to increase performance over this link.

[Figure: Client ↔ client-side proxy ↔ (slow link, delta-encoding) ↔ server-side proxy ↔ (fast link) ↔ web; both proxies keep a reference copy of the requested page.]

Use zdelta to reduce traffic:
 Old version available at both proxies
 Restricted to pages already visited (30% hits), URL-prefix match
 Small cache

Cluster-based delta compression
Problem: we wish to compress a group of files F
 Useful on a dynamic collection of web pages, back-ups, …
 Apply pairwise zdelta: find for each f ∈ F a good reference

Reduction to the Min Branching problem on DAGs:
 Build a weighted graph G_F: nodes = files, weights = zdelta-sizes
 Insert a dummy node connected to all files, whose edge weights are the gzip-coding sizes
 Compute the min branching = directed spanning tree of minimum total cost, covering G’s nodes.

[Figure: a small example graph of files with zdelta/gzip edge weights (20, 123, 220, 620, 2000, …); the min branching picks the cheapest reference for each file.]

           space   time
uncompr    30Mb    ---
tgz        20%     linear
THIS        8%     quadratic

Improvement: what about many-to-one compression?   (group of files)

Problem: constructing G is very costly, n² edge calculations (zdelta executions).
We wish to exploit some pruning approach:
 Collection analysis: cluster the files that appear similar and are thus good candidates for zdelta-compression; build a sparse weighted graph G’_F containing only the edges between those pairs of files.
 Assign weights: estimate appropriate edge weights for G’_F, thus saving zdelta executions. Nonetheless, still strictly n² time.
           space    time
uncompr    260Mb    ---
tgz        12%      2 mins
THIS        8%      16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
[Setting: the Client holds f_old, the Server holds f_new; the Client sends a request and receives an update.]

 the client wants to update an out-dated file
 the server has the new file but does not know the old file
 update without sending the entire f_new (using similarity)
 rsync: file synch tool, distributed with Linux

Delta compression is a sort of “local” synch, since the server has both copies of the files.

The rsync algorithm
[Figure: the Client sends the block hashes of f_old to the Server; the Server sends back f_new encoded as block references plus literal bytes.]

(contd)
 simple, widely used, single roundtrip
 optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
 choice of block size problematic (default: max{700, √n} bytes)
 not good in theory: granularity of changes may disrupt use of blocks
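A toy sketch (ours) of the core rsync step: slide a rolling (weak) hash over f_new and replace blocks found in f_old’s hash set by references; a real implementation confirms every weak match with a strong hash (MD5) and uses the block size discussed above.

  BLOCK = 8                       # toy block size; rsync defaults to max(700, sqrt(n)) bytes
  BASE, MOD = 257, 1_000_003

  def weak_hash(block):
      h = 0
      for b in block:
          h = (h * BASE + b) % MOD
      return h

  def rsync_like_encode(f_new, old_hashes):
      # old_hashes: weak hash -> block index in f_old (this is what travels over the link)
      out, lit, i = [], bytearray(), 0
      power = pow(BASE, BLOCK - 1, MOD)
      h = weak_hash(f_new[:BLOCK]) if len(f_new) >= BLOCK else None
      while i + BLOCK <= len(f_new):
          if h in old_hashes:                     # real rsync also verifies with a strong hash
              if lit:
                  out.append(("lit", bytes(lit))); lit.clear()
              out.append(("copy", old_hashes[h]))
              i += BLOCK
              h = weak_hash(f_new[i:i + BLOCK]) if i + BLOCK <= len(f_new) else None
          else:
              lit.append(f_new[i])
              if i + BLOCK < len(f_new):          # roll the hash one byte forward
                  h = ((h - f_new[i] * power) * BASE + f_new[i + BLOCK]) % MOD
              i += 1
      lit.extend(f_new[i:])
      if lit:
          out.append(("lit", bytes(lit)))
      return out

  f_old = bytes(range(97, 123)) * 4               # "abc...z" repeated 4 times
  f_new = f_old[:40] + b"XYZ" + f_old[40:]
  hashes = {weak_hash(f_old[j:j + BLOCK]): j // BLOCK
            for j in range(0, len(f_old) - BLOCK + 1, BLOCK)}
  ops = rsync_like_encode(f_new, hashes)          # mostly ("copy", k) plus a 3-byte literal "XYZ"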

Rsync: some experiments

            gcc size   emacs size
  total     27288      27326
  gzip       7563       8577
  zdelta      227       1431
  rsync       964       4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync
 The Server sends the hashes (unlike the client in rsync); the client checks them.
 The Server deploys the common f_ref to compress the new f_tar (rsync just compresses f_tar on its own).

A multi-round protocol
 k blocks of n/k elements, log(n/k) levels
 If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
 The communication complexity is O(k lg n lg(n/k)) bits.

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff
P is a prefix of the i-th suffix of T (i.e. T[i,N])

Occurrences of P in T = all suffixes of T having P as a prefix
e.g. P = si, T = mississippi  ⇒  occurrences at positions 4, 7

SUF(T) = sorted set of suffixes of T

Reduction: from substring search to prefix search.

The Suffix Tree

T# = mississippi#
     2 4 6 8 10

[Figure: the suffix tree of T#; edges are labeled with substrings (i, si, ssi, p, pi#, ppi#, i#, mississippi#, …) and the 12 leaves carry the starting positions 1..12 of the corresponding suffixes.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. The starting position of that range is the lexicographic position of P.

Storing SUF(T) explicitly takes Θ(N²) space; store only the suffix pointers:

  SA    SUF(T)                  T = mississippi#
  12    #
  11    i#
   8    ippi#
   5    issippi#
   2    ississippi#
   1    mississippi#
  10    pi#
   9    ppi#
   7    sippi#
   4    sissippi#
   6    ssippi#
   3    ssissippi#

Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step (one to SA, one to T).
Example: T = mississippi#, P = si; at each step P is compared with the suffix pointed to by the middle SA entry and found smaller or larger, halving the range.

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time — improvable to O(p + log2 N) [Manber-Myers, ’90] and to bounds in |S| [Cole et al, ’06]

Locating the occurrences
T = mississippi#,  P = si
The occurrences of P are the contiguous SA entries whose suffixes start with si — here sippi# and sissippi#, i.e. text positions 7 and 4; occ = 2.
(The boundaries of the range can be found by binary-searching for si# and si$, i.e. just below and just above all strings starting with si.)

Suffix Array search
• O(p + log2 N + occ) time     (assuming # < S < $)

Suffix Trays: O(p + log2 |S| + occ)   [Cole et al., ‘06]
String B-tree   [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays   [Ciriani et al., ’02]
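A compact sketch (ours) of the suffix-array machinery of these slides: the “elegant but inefficient” construction and the indirect binary search:

  def suffix_array(T):
      # the naive construction: sort the (1-based) suffix start positions
      return sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])

  def occurrences(T, SA, P):
      # indirect binary search on SA: O(|P| log N) char comparisons overall
      def pref(i):                              # first |P| chars of the suffix starting at i
          return T[i - 1:i - 1 + len(P)]
      lo, hi = 0, len(SA)
      while lo < hi:                            # leftmost suffix whose prefix is >= P
          mid = (lo + hi) // 2
          if pref(SA[mid]) < P: lo = mid + 1
          else: hi = mid
      first = lo
      hi = len(SA)
      while lo < hi:                            # leftmost suffix whose prefix is > P
          mid = (lo + hi) // 2
          if pref(SA[mid]) <= P: lo = mid + 1
          else: hi = mid
      return sorted(SA[first:lo])

  T = "mississippi#"
  SA = suffix_array(T)
  assert SA == [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
  assert occurrences(T, SA, "si") == [4, 7]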

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

T = mississippi#
SA  = 12 11 8 5 2 1 10 9 7 4 6 3
Lcp =  0  1 1 4 0 0  1 0 2 1 3     (e.g. lcp(issippi#, ississippi#) = 4)

• How long is the common prefix between T[i,...] and T[j,...] ?
  Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  Search for some Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.
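The Lcp array itself can be built in O(N) time with Kasai’s algorithm; a small sketch (ours) that reproduces the array above and answers the first repeated-substring query:

  def lcp_array(T, SA):
      # Kasai et al.: Lcp[r] = lcp between the suffixes of rank r and r+1, in O(N) time.
      # SA holds 1-based starting positions, as in the slides.
      n = len(SA)
      sa0 = [p - 1 for p in SA]
      rank = [0] * n
      for r, p in enumerate(sa0):
          rank[p] = r
      Lcp, h = [0] * (n - 1), 0
      for p in range(n):                        # text positions in increasing order
          if rank[p] > 0:
              q = sa0[rank[p] - 1]              # suffix preceding suffix p in SA order
              while p + h < n and q + h < n and T[p + h] == T[q + h]:
                  h += 1
              Lcp[rank[p] - 1] = h
              if h:
                  h -= 1                        # the next lcp can drop by at most one
          else:
              h = 0
      return Lcp

  T = "mississippi#"
  SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
  Lcp = lcp_array(T, SA)
  assert Lcp == [0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3]
  # a repeated substring of length >= L exists iff some Lcp entry is >= L
  assert any(v >= 4 for v in Lcp) and not any(v >= 5 for v in Lcp)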


P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M ( j - 1))  U (T [ j ])
l

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M

l -1

( j - 1))

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1
1 2 3 4 5 6 7 8 910

M1=

T=xabxabaaca
P=
abaad

1

2

3

4

5

6

7

8

9

1
0

1

1

1

1

1

1

1

1

1

1

1

2

0

0

1

0

0

1

0

1

1

0

3

0

0

0

1

0

0

1

0

0

1

4

0

0

0

0

1

0

0

1

0

0

5

0

0

0

0

0

0

0

0

1

0

M0=

1

2

3

4

5

6

7

8

9

1
0

1

0

1

0

0

1

0

1

1

0

1

2

0

0

1

0

0

1

0

0

0

0

3

0

0

0

0

0

0

1

0

0

0

4

0

0

0

0

0

0

0

1

0

0

5

0

0

0

0

0

0

0

0

0

0

How much do we pay?





The running time is O(kn(1+m/w)
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Delection: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
0 000 ...........0 x in b in ary
Length-1


x > 0 and Length = log2 x +1
e.g., 9 represented as <000,1001>.



g-code for x takes 2 log2 x +1 bits

(ie. factor of 2 from optimal)



Optimal for Pr(x) = 1/2x2, and i.i.d integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8

6

3

59

7

Analysis
Sort the pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
  1 ≥ Σi=1,...,x pi ≥ x * px   ⇒   x ≤ 1/px

How good is it ?
Encode the integers via g-coding:  |g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):

  Σi=1,...,|S| pi * |g(i)|  ≤  Σi=1,...,|S| pi * [ 2 * log(1/pi) + 1 ]  =  2 * H0(X) + 1

Not much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding

Byte-aligned and tagged Huffman
 128-ary Huffman tree
 First bit of the first byte is tagged
 Configurations on 7 bits: just those of Huffman

End-tagged dense code (ETDC)
 The rank r is mapped to the r-th binary sequence on 7*k bits
 First bit of the last byte is tagged

A better encoding
Surprising changes
 It is a prefix-code
 Better compression: it uses all 7-bit configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2

A new concept: Continuers vs Stoppers
 Previously we used: s = c = 128

The main idea is:
 s + c = 256 (we are playing with 8 bits)
 Thus s items are encoded with 1 byte
 And s*c with 2 bytes, s*c^2 with 3 bytes, ...

An example
 5000 distinct words
 ETDC encodes 128 + 128^2 = 16512 words within 2 bytes
 A (230,26)-dense code encodes 230 + 230*26 = 6210 words within 2 bytes,
 hence more words on 1 byte, and thus it wins if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, assuming c = 256-s.

 Brute-force approach
 Binary search: on real distributions, there seems to be one unique minimum

Ks = max codeword length
Fsk = cumulative prob. of the symbols whose |cw| <= k

Experiments: (s,c)-DC is very interesting…
Search is 6% faster than byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded

 Start with the list of symbols L=[a,b,c,d,…]
 For each input symbol s
   1) output the position of s in L
   2) move s to the front of L

There is a memory
Properties:
 Exploits temporal locality, and it is dynamic
 X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n^2 log n), MTF = O(n log n) + n^2

Not much worse than Huffman
...but it may be far better
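
A minimal sketch of MTF coding; the naive list maintained here costs O(|S|) per symbol (the tree/hash solution discussed a few slides below brings it down to O(log |S|)).

    # Move-to-Front: emit the position of each symbol, then move it to the front.
    def mtf_encode(text, alphabet):
        L, out = list(alphabet), []
        for s in text:
            pos = L.index(s)          # 1) output the position of s in L (0-based here)
            out.append(pos)
            L.pop(pos)                # 2) move s to the front of L
            L.insert(0, s)
        return out

    def mtf_decode(codes, alphabet):
        L, out = list(alphabet), []
        for pos in codes:
            s = L[pos]
            out.append(s)
            L.pop(pos)
            L.insert(0, s)
        return "".join(out)

    print(mtf_encode("abbbaa", "abcd"))            # [0, 1, 0, 0, 1, 0]
    print(mtf_decode([0, 1, 0, 0, 1, 0], "abcd"))  # abbbaa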

MTF: how good is it ?
Encode the integers via g-coding:  |g(i)| ≤ 2 * log i + 1
Put the alphabet S at the front (cost O(|S| log |S|)) and consider the cost of encoding:

  O(|S| log |S|) + Σx=1,...,|S| Σi=2,...,nx |g( pix - pi-1x )|

where pix denotes the position in the text of the i-th occurrence of symbol x.
By Jensen’s inequality:

  ≤ O(|S| log |S|) + Σx=1,...,|S| nx * [ 2 * log(N/nx) + 1 ]
  = O(|S| log |S|) + N * [ 2 * H0(X) + 1 ]

  ⇒  La[mtf] ≤ 2 * H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as the symbols to be encoded
How to keep the MTF-list efficiently:

 Search tree
   Leaves contain the symbols, ordered as in the MTF-list
   Nodes contain the size of their descending subtree
 Hash Table
   key is a symbol
   data is a pointer to the corresponding tree leaf

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:
 There is a memory
 Exploits spatial locality, and it is a dynamic code
 X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = Θ(n^2 log n) > Rle(X) = Θ(n log n)
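
A minimal sketch of RLE as in the example above.

    # Run-Length Encoding: collapse maximal runs of equal characters.
    def rle_encode(s):
        runs, i = [], 0
        while i < len(s):
            j = i
            while j < len(s) and s[j] == s[i]:
                j += 1
            runs.append((s[i], j - i))      # (char, run length)
            i = j
        return runs

    print(rle_encode("abbbaacccca"))
    # [('a', 1), ('b', 3), ('a', 2), ('c', 4), ('a', 1)]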

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

  f(i) = Σj=1,...,i-1 p(j)

e.g.  p(a) = .2, p(b) = .5, p(c) = .3
      f(a) = .0, f(b) = .2, f(c) = .7

[figure: the unit interval [0,1) split into a = [0,.2), b = [.2,.7), c = [.7,1)]

The interval for a particular symbol will be called
the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac

  start:            [0, 1)
  after b (p=.5):   [.2, .7)
  after a (p=.2):   [.2, .3)
  after c (p=.3):   [.27, .3)

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c1...cn with probabilities p[c], use the following:

  l0 = 0,  li = li-1 + si-1 * f[ci]
  s0 = 1,  si = si-1 * p[ci]

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

  sn = Πi=1,...,n p[ci]

The interval for a message sequence will be called the
sequence interval
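
A minimal sketch of the li, si recurrence above on the running distribution (real-valued; the integer/scaled version comes later).

    # Sequence-interval computation for the distribution of the running example.
    p = {"a": 0.2, "b": 0.5, "c": 0.3}
    f = {"a": 0.0, "b": 0.2, "c": 0.7}     # cumulative prob. up to the symbol, excluded

    def sequence_interval(msg):
        l, s = 0.0, 1.0                    # l0 = 0, s0 = 1
        for c in msg:
            l = l + s * f[c]               # li = l(i-1) + s(i-1) * f[ci]
            s = s * p[c]                   # si = s(i-1) * p[ci]
        return l, l + s                    # the sequence interval [ln, ln + sn)

    print(sequence_interval("bac"))        # ~ (0.27, 0.3), as in the example above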

Uniquely defining an interval
Important property: the intervals for distinct
messages of length n will never overlap.
Therefore, specifying any number within the
final interval uniquely determines the message.
Decoding is similar to encoding, but at each
step we need to determine what the message
symbol is and then shrink the interval.

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:

  [0, 1)     : .49 falls in b = [.2, .7)     → output b
  [.2, .7)   : .49 falls in b = [.3, .55)    → output b
  [.3, .55)  : .49 falls in c = [.475, .55)  → output c

The message is bbc.

Representing a real number
Binary fractional representation:
  .75   = .11
  1/3   = .010101...
  11/16 = .1011

Algorithm
 1. x = 2 * x
 2. If x < 1 output 0
 3. else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence interval?
e.g. [0,.33) = .01    [.33,.66) = .1    [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.

  code    min      max      interval
  .11     .110…    .111…    [.75, 1.0)
  .101    .1010…   .1011…   [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (a dyadic number).

[figure: sequence interval [.61, .79) containing the code interval [.625, .75) of .101]

Can use L + s/2 truncated to 1 + ceil(log2(1/s)) bits

Bound on Arithmetic length

Note that ceil(-log2 s) + 1 = ceil(log2(2/s))

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

  1 + ceil(log2(1/s))
  = 1 + ceil( log2 Πi=1,...,n (1/pi) )
  ≤ 2 + Σi=1,...,n log2(1/pi)
  = 2 + Σk=1,...,|S| n pk log2(1/pk)
  = 2 + n H0   bits

In practice, nH0 + 0.02 n bits, because of rounding.

Integer Arithmetic Coding
Problem is that operations on arbitrary-precision
real numbers are expensive.
Key ideas of the integer version:
 Keep integers in the range [0..R) where R = 2^k
 Use rounding to generate integer intervals
 Whenever the sequence interval falls into the top,
  bottom or middle half, expand the interval by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 (top half):
  Output 1 followed by m 0s; m = 0
  Message interval is expanded by 2

If u < R/2 (bottom half):
  Output 0 followed by m 1s; m = 0
  Message interval is expanded by 2

If l ≥ R/4 and u < 3R/4 (middle half):
  Increment m
  Message interval is expanded by 2

In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine:

  ATB:  ( (L,s), symbol c, distribution (p1,...,p|S|) )  →  (L',s')
  with  L' = L + s * f[c]   and   s' = s * p[c]

[figure: the interval (L, L+s) shrinking to (L', L'+s') on symbol c]

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.
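
A tiny sketch of the k-order bookkeeping described above; the escape handling of a real PPM-variant is not modeled, only the per-context counts. The printed values match the example table that follows.

    # Count, for every context of length <= k, which symbols followed it.
    from collections import defaultdict

    def ppm_counts(text, k):
        counts = defaultdict(lambda: defaultdict(int))
        for j in range(len(text)):
            for order in range(0, k + 1):           # contexts of length 0..k
                if j - order < 0:
                    break
                ctx = text[j - order:j]
                counts[ctx][text[j]] += 1
        return counts

    counts = ppm_counts("ACCBACCACBA", 2)
    print(dict(counts["AC"]))   # {'C': 2, 'B': 1}, as in the table below
    print(len(counts["AC"]))    # 2 distinct successors -> escape count $ = 2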

PPM + Arithmetic ToolBox

  ATB:  ( (L,s), s = c or esc, p[ s | context ] )  →  (L',s')

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts        String = ACCBACCACBA B        k=2

  Context Empty:   A = 4   B = 2   C = 5   $ = 3

  Context A:       C = 3   $ = 1
  Context B:       A = 2   $ = 1
  Context C:       A = 1   B = 2   C = 2   $ = 3

  Context AC:      B = 1   C = 2   $ = 2
  Context BA:      C = 1   $ = 1
  Context CA:      C = 1   $ = 1
  Context CB:      A = 2   $ = 1
  Context CC:      A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!

LZ77
  a a c a a c a b c a b a b a c
  Dictionary = all substrings starting before the Cursor
  Output example: <2,3,c>

Algorithm’s step:
 Output <d, len, c> where
   d   = distance of the copied string wrt the current position
   len = length of the longest match
   c   = next char in the text beyond the longest match
 Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6; each triple encodes the longest match within the window W plus the next character.
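
A minimal sketch of the LZ77 parser with a sliding window; it reproduces the example above. The naive longest-match search is for illustration only (real implementations speed it up with hashing, as gzip does).

    # One-pass LZ77 parse emitting (d, len, c) triples; overlap with the cursor is allowed.
    def lz77_parse(T, W=6):
        out, i = [], 0
        while i < len(T):
            best_d, best_len = 0, 0
            for j in range(max(0, i - W), i):            # candidate copy starts in the window
                l = 0
                while i + l < len(T) - 1 and T[j + l] == T[i + l]:
                    l += 1
                if l > best_len:
                    best_d, best_len = i - j, l
            nxt = T[i + best_len]                         # next char beyond the longest match
            out.append((best_d, best_len, nxt))
            i += best_len + 1                             # advance by len + 1
        return out

    print(lz77_parse("aacaacabcabaaac"))
    # [(0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), (3, 3, 'a'), (1, 2, 'c')]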

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
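
A minimal sketch of LZW with the special case handled explicitly: when the decoder receives a code it does not know yet, the string must be prev + prev[0]. The dictionary here is initialized with single ASCII characters; the specific numeric codes differ from the slide's toy assignment (a = 112).

    # LZW encoder/decoder sketch; the decoder stays one step behind the encoder.
    def lzw_encode(text, first_code=256):
        dic = {chr(i): i for i in range(first_code)}
        out, S, nxt = [], "", first_code
        for c in text:
            if S + c in dic:
                S += c
            else:
                out.append(dic[S])
                dic[S + c] = nxt; nxt += 1
                S = c
        out.append(dic[S])
        return out

    def lzw_decode(codes, first_code=256):
        dic = {i: chr(i) for i in range(first_code)}
        prev = dic[codes[0]]; out = [prev]; nxt = first_code
        for code in codes[1:]:
            cur = dic[code] if code in dic else prev + prev[0]   # special SSc case
            out.append(cur)
            dic[nxt] = prev + cur[0]; nxt += 1                   # entry the encoder made one step ago
            prev = cur
        return "".join(out)

    print(lzw_decode(lzw_encode("aabaacababacb")) == "aabaacababacb")   # True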

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example

  Input   Output so far                 Dict
  112     a
  112     a a                           256=aa
  113     a a b                         257=ab
  256     a a b a a                     258=ba
  114     a a b a a c                   259=aac
  257     a a b a a c a b               260=ca
  261     a a b a a c a b a b a         261=aba   (decoder is one step later:
                                                   261 is unknown, so it must be
                                                   prev + prev[0] = "ab" + "a")
  114     a a b a a c a b a b a c       262=abac

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform   (1994)
Let us be given a text T = mississippi#

All rotations of T:
  mississippi#   ississippi#m   ssissippi#mi   sissippi#mis
  issippi#miss   ssippi#missi   sippi#missis   ippi#mississ
  ppi#mississi   pi#mississip   i#mississipp   #mississippi

Sort the rows.  F = first column, L = last column:

  F              L
  # mississipp i
  i #mississip p
  i ppi#missis s
  i ssippi#mis s
  i ssissippi# m
  m ississippi #
  p i#mississi p
  p pi#mississ i
  s ippi#missi s
  s issippi#mi s
  s sippi#miss i
  s sissippi#m i

L = BWT(T) = ipssm#pissii

A famous example

Much
longer...

A useful tool: the L → F mapping

[the same sorted BWT matrix as above: F = # i i i i m p p s s s s,
 L = i p s s m # p i s s i i; the middle of each row is unknown]

How do we map L’s chars onto F’s chars ?
... We need to distinguish equal chars in F...

Take two equal chars of L and rotate their rows rightward by one position:
they keep the same relative order in F !!

The BWT is invertible
[the sorted BWT matrix again: F = # i i i i m p p s s s s, L = i p s s m # p i s s i i;
 the middle of each row is unknown to the decoder]

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
  Compute LF[0,n-1];   // LF[r] = row of F holding the same char-occurrence as L[r]
  r = 0; i = n;
  while (i > 0) {
    T[i] = L[r];
    r = LF[r]; i--;
  }

How to compute the BWT ?

  SA    sorted rotation    L
  12    #mississipp        i
  11    i#mississip        p
   8    ippi#missis        s
   5    issippi#mis        s
   2    ississippi#        m
   1    mississippi        #
  10    pi#mississi        p
   9    ppi#mississ        i
   7    sippi#missi        s
   4    sissippi#mi        s
   6    ssippi#miss        i
   3    ssissippi#m        i

We said that: L[i] precedes F[i] in T
  e.g., L[3] = T[7]
Given SA and T, we have L[i] = T[SA[i]-1]
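
A small sketch that builds SA by naive suffix sorting and derives L via L[i] = T[SA[i]-1]; the inefficiency of this naive sort is exactly the point of the next slide.

    # BWT from the suffix array; T must end with a unique, smallest end-marker '#'.
    def bwt(T):
        n = len(T)
        SA = sorted(range(1, n + 1), key=lambda i: T[i - 1:])   # 1-based suffix start positions
        L = "".join(T[i - 2] for i in SA)   # L[r] = T[SA[r]-1]; i=1 wraps to T[-1] = '#'
        return SA, L

    SA, L = bwt("mississippi#")
    print(SA)   # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
    print(L)    # ipssm#pissii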

How to construct SA from T ?
Input: T = mississippi#

  SA    suffix
  12    #
  11    i#
   8    ippi#
   5    issippi#
   2    ississippi#
   1    mississippi#
  10    pi#
   9    ppi#
   7    sippi#
   4    sissippi#
   6    ssippi#
   3    ssissippi#

Elegant but inefficient.
Obvious inefficiencies:
 • Θ(n^2 log n) time in the worst-case
 • Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics

Size
 1 trillion pages available (Google, 7/2008)
 5-40K per page => hundreds of terabytes
 Size grows every day!!

Change
 8% new pages and 25% new links appear weekly
 Life time of a page is about 10 days

The Bow Tie

Some definitions

Weakly connected components (WCC)
 Set of nodes such that from any node one can reach any other node via an
 undirected path.

Strongly connected components (SCC)
 Set of nodes such that from any node one can reach any other node via a
 directed path.

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?

 The largest artifact ever conceived by humans
 Exploit the structure of the Web for:
   Crawl strategies
   Search
   Spam detection
   Discovering communities on the web
   Classification/organization
 Predict the evolution of the Web
   Sociological understanding

Many other large graphs…

 Physical network graph (undirected)
   V = routers, E = communication links

 The “cosine” graph (undirected, weighted)
   V = static web pages, E = semantic distance between pages

 Query-Log graph (bipartite, weighted)
   V = queries and URLs
   E = (q,u) if u is a result for q, and has been clicked by some user who issued q

 Social graph (undirected, unweighted)
   V = users, E = (x,y) if x knows y (facebook, address book, email, ...)

Definition
Directed graph G = (V,E)
 V = URLs, E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN & no OUT)

Three key properties; the first one:
 Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution
 Altavista crawl (1999) and WebBase crawl (2001):
 the indegree follows a power-law distribution

   Pr[ in-degree(u) = k ]  ∝  1/k^a ,   a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)
 V = URLs, E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN, no OUT)

Three key properties:
 Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1
 Locality: usually most of the hyperlinks point to other URLs on
  the same host (about 80%).
 Similarity: pages close in lexicographic order tend to share
  many outgoing lists.

A Picture of the Web Graph

[figure: adjacency matrix (i,j) of a crawl with 21 million pages and 150 million
links, after URL-sorting; hosts such as Berkeley and Stanford appear as dense blocks]

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries (only the first gap s1-x can be negative): map them to
non-negative integers, e.g. v ≥ 0 ↦ 2v and v < 0 ↦ 2|v|-1.
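
A small sketch of the gap transformation just described; the interleaving map for the (possibly negative) first gap is the assumption stated above, used here only for illustration.

    # Gap-encode the successor list of node x as in the WebGraph-style scheme above.
    def encode_gaps(x, successors):
        s = sorted(successors)
        gaps = [s[0] - x] + [s[i] - s[i - 1] - 1 for i in range(1, len(s))]
        # first gap may be negative: fold it into a non-negative integer
        first = 2 * gaps[0] if gaps[0] >= 0 else 2 * abs(gaps[0]) - 1
        return [first] + gaps[1:]

    print(encode_gaps(15, [13, 15, 16, 17, 18, 19, 23, 24, 203]))
    # [3, 1, 0, 0, 0, 0, 3, 0, 178]  -> small numbers, to be coded with g-like codes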

Copy-lists
Reference chains, possibly limited in length

Adjacency list with copy-lists (exploits similarity):
 Node y references a previous node x and stores a copy-list: each bit of the
 copy-list tells whether the corresponding successor of the reference x is
 also a successor of y.
 The reference index is chosen in [0,W] so as to give the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with copy-blocks (RLE on the copy-list bit sequences):
 The first copy-block is 0 if the copy-list starts with 0;
 The last block is omitted (we know the total length…);
 The length is decremented by one for all blocks.

This is a Java and C++ library   (≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with copy-blocks plus interval coding of the extra-nodes
(exploits consecutivity among the extra-nodes):
 Intervals: encoded with their left extreme and length
 Interval length: decremented by Lmin = 2
 Residuals: differences between consecutive residuals, or wrt the source node

Examples (from the slide):
 0    = (15-15)*2        (positive)
 2    = (23-19)-2        (jump >= 2)
 600  = (316-16)*2
 3    = |13-15|*2-1      (negative)
 3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background

[figure: a sender transmits data over a network link to a receiver, which may
already hold some knowledge about that data]

 network links are getting faster and faster, but
 many clients are still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques

 caching: “avoid sending the same object again”
   done on the basis of whole objects
   only works if objects are completely unchanged
   How about objects that are slightly changed?

 compression: “remove redundancy in transmitted data”
   avoid repeated substrings in data
   can be extended to the history of past transmissions (overhead)
   What if the sender has never seen the data at the receiver ?

Types of Techniques

 Common knowledge between sender & receiver
   Unstructured file: delta compression

 “Partial” knowledge
   Unstructured files: file synchronization
   Record-based data: set reconciliation

Formalization

 Delta compression                                  [diff, zdelta, REBL,…]
   Compress file f deploying file f’
   Compress a group of files
   Speed-up web access by sending differences between the requested
   page and the ones available in cache

 File synchronization                               [rsync, zsync]
   Client updates an old file fold with fnew available on a server
   Mirroring, Shared Crawling, Content Distribution Networks

 Set reconciliation
   Client updates a structured old file fold with fnew available on a server
   Update of contacts or appointments, intersect inverted lists in a P2P search engine

Z-delta compression      (one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd

 Assume that block moves and copies are allowed
 Find an optimal covering set of fnew based on fknown
 The LZ77-scheme provides an efficient, optimal solution:
   fknown is the “previously encoded text”; compress the concatenation
   fknown·fnew, emitting output only from fnew onwards
 zdelta is one of the best implementations

            Emacs size   Emacs time
  uncompr   27Mb         ---
  gzip      8Mb          35 secs
  zdelta    1.5Mb        42 secs

Efficient Web Access
Dual-proxy architecture: a pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

[figure: Client ↔ client-side proxy -- slow link -- server-side proxy ↔ web;
requests carry a reference to a cached page, responses are delta-encoded]

Use zdelta to reduce traffic:
 The old version is available at both proxies
 Restricted to pages already visited (30% hits), URL-prefix match
 Small cache
Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

0

620

2

20

123

2000

20
220

3

1
5

space

time

uncompr

30Mb

---

tgz

20%

linear

THIS

8%

quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly: n^2 edge calculations (zdelta executions)
We wish to exploit some pruning approach:

 Collection analysis: cluster the files that appear similar and are thus
 good candidates for zdelta-compression. Build a sparse weighted graph
 G’F containing only the edges between those pairs of files.

 Assign weights: estimate appropriate edge weights for G’F, thus saving
 zdelta executions. Nonetheless, strictly n^2 time.

            space   time
  uncompr   260Mb   ---
  tgz       12%     2 mins
  THIS      8%      16 mins

Algoritmi per IR

File Synchronization

File synch: The problem

[figure: the Client holds f_old, the Server holds f_new; the Client sends an
update request and receives enough information to rebuild f_new]

 client wants to update an out-dated file
 server has the new file but does not know the old file
 update without sending the entire f_new (using similarity)
 rsync: file-synch tool, distributed with Linux

Delta compression is a sort of "local" synch,
since the server has both copies of the files

The rsync algorithm

[figure: the Client (holding f_old) sends per-block hashes to the Server
(holding f_new); the Server replies with an encoded file made of block
references and literal bytes]

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
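
A toy sketch of the block-matching idea just described; the additive hash and all names are illustrative stand-ins for rsync's real rolling hash + MD5 pair (a real implementation also verifies matches with the strong checksum).

    # Client side: one weak hash per block of f_old.  Server side: slide over f_new
    # and emit either a block reference or a literal byte.
    def block_hashes(f_old, B):
        return {sum(f_old[i:i + B]): i // B
                for i in range(0, len(f_old) - B + 1, B)}

    def rsync_encode(f_new, hashes, B):
        out, i = [], 0
        while i < len(f_new):
            h = sum(f_new[i:i + B]) if i + B <= len(f_new) else None
            if h in hashes:
                out.append(("copy", hashes[h]))    # reference to a block of f_old
                i += B
            else:
                out.append(("lit", f_new[i]))      # literal byte
                i += 1
        return out

    f_old = b"the quick brown fox jumps over the lazy dog"
    f_new = b"the quick brown cat jumps over the lazy dog"
    enc = rsync_encode(f_new, block_hashes(f_old, 4), 4)
    print(sum(1 for t, _ in enc if t == "copy"), "blocks copied")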

Rsync: some experiments

            gcc size    emacs size
  total     27288       27326
  gzip       7563        8577
  zdelta      227        1431
  rsync       964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync
 The server sends the hashes (unlike the client in rsync); the client checks them.
 The server deploys the common fref to compress the new ftar (rsync compresses just it).

A multi-round protocol
 k blocks of n/k elements each, log(n/k) levels
 If the distance is k, then on each level at most k hashes do not find a match in the other file.
 The communication complexity is O(k lg n lg(n/k)) bits.

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
  iff  P is a prefix of the i-th suffix of T (i.e., of T[i,N])

Occurrences of P in T = all suffixes of T having P as a prefix

  Example: P = si, T = mississippi  →  occurrences at positions 4, 7

SUF(T) = sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree

T# = mississippi#
     1 2 3 4 5 6 7 8 9 10 11 12

[figure: the suffix tree of T#; edge labels are substrings of T# (e.g. "ssi",
"ppi#", "mississippi#"), and the leaves store the starting positions 1..12
of the corresponding suffixes]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

T = mississippi#

  SA    SUF(T)            [storing SUF(T) explicitly would take Θ(N^2) space]
  12    #
  11    i#
   8    ippi#
   5    issippi#
   2    ississippi#
   1    mississippi#
  10    pi#
   9    ppi#
   7    sippi#
   4    sissippi#
   6    ssippi#
   3    ssissippi#

Suffix Array:
 • SA: Θ(N log2 N) bits
 • Text T: N chars
 ⇒ In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison

  T = mississippi#     P = si
  Compare P against the suffix starting at SA[mid]: if P is larger go right,
  if P is smaller go left (2 random accesses per step: one to SA, one to T).

Suffix Array search
 • O(log2 N) binary-search steps
 • Each step takes O(p) char comparisons
 ⇒ overall, O(p log2 N) time
   [improvable to O(p + log N) with Lcp information, Manber-Myers ’90,
    and to O(p + log |S|), Cole et al. ’06]

Locating the occurrences
T = mississippi#    P = si   →   occ = 2, at positions 4 and 7 of T

All occurrences of P are contiguous in SA: binary-search for the two extremes
of the range (e.g., between si# and si$, where # < every char of S < $).

Suffix Array search
 • O(p + log2 N + occ) time

Suffix Trays: O(p + log2 |S| + occ)      [Cole et al., ’06]
String B-tree                            [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays             [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

T = mississippi#

  SA:    12 11  8  5  2  1 10  9  7  4  6  3
  Lcp:       0  1  1  4  0  0  1  0  2  1  3

  e.g., the Lcp between issippi# (SA entry 5) and ississippi# (SA entry 2) is 4.

• How long is the common prefix between T[i,...] and T[j,...] ?
  → the min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  → search for some Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  → search for a window Lcp[i,i+C-2] whose entries are all ≥ L.
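
A small sketch computing the Lcp array naively from adjacent suffixes in SA and using it for the repeated-substring test above (linear-time constructions such as Kasai's algorithm exist, but are not shown here).

    # Lcp[r] = longest common prefix of the suffixes at SA[r] and SA[r+1].
    def lcp_array(T, SA):
        def lcp(i, j):
            l = 0
            while i + l < len(T) and j + l < len(T) and T[i + l] == T[j + l]:
                l += 1
            return l
        return [lcp(SA[r], SA[r + 1]) for r in range(len(SA) - 1)]

    T = "mississippi#"
    SA = sorted(range(len(T)), key=lambda i: T[i:])
    Lcp = lcp_array(T, SA)
    print(Lcp)                          # adjacent-suffix lcp values
    print(any(v >= 4 for v in Lcp))     # True: "issi" is a repeated substring of length 4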


Slide 46

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with i-th bit of U(T[j]) to estabilish if both are true

An example j=1
1 2 3 4 5 6 7 8 9 10

n
m

1

12345

1

0

P=abaac

2

0

3

0

4

0

5

0

T=xabxabaaca

0
 
0
U (x)  0
 
0
 
0

2

3

4

5

6

7

1 
 
0
BitShift ( M ( 0 )) & U (T [1])   0  &
 
0
 
0

8

9

0 0
   
0 0
0  0
   
0 0
   
0 0

1
0

An example j=2
1 2 3 4 5 6 7 8 9 10

n
m

1

2

12345

1

0

1

P=abaac

2

0

0

3

0

0

4

0

0

5

0

0

T=xabxabaaca

1 
 
0
U (a )   1 
 
1 
 
0

3

4

5

6

7

1 
 
0
BitShift ( M (1)) & U (T [ 2 ])   0  &
 
0
 
0

8

9

1  1 
   
0 0
1    0 
   
1   0 
   
0 0

1
0

An example j=3
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

12345

1

0

1

0

P=abaac

2

0

0

1

3

0

0

0

4

0

0

0

5

0

0

0

T=xabxabaaca

0
 
1 
U (b )   0 
 
0
 
0

4

5

6

7

1 
 
1 
BitShift ( M ( 2 )) & U (T [ 3 ])   0  &
 
0
 
0

8

9

0 0
   
1  1 
0  0
   
0 0
   
0 0

1
0

An example j=9
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

4

5

6

7

8

9

12345

1

0

1

0

0

1

0

1

1

0

P=abaac

2

0

0

1

0

0

1

0

0

0

3

0

0

0

0

0

0

1

0

0

4

0

0

0

0

0

0

0

1

0

5

0

0

0

0

0

0

0

0

1

T=xabxabaaca

0
 
0
U (c )   0 
 
0
 
1 

1 
 
1 
BitShift ( M ( 8 )) & U (T [ 9 ])   0  &
 
0
 
1 

0 0
   
0 0
0  0
   
0 0
   
1  1 

1
0

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M ( j - 1))  U (T [ j ])
l

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M

l -1

( j - 1))

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1
1 2 3 4 5 6 7 8 910

M1=

T=xabxabaaca
P=
abaad

1

2

3

4

5

6

7

8

9

1
0

1

1

1

1

1

1

1

1

1

1

1

2

0

0

1

0

0

1

0

1

1

0

3

0

0

0

1

0

0

1

0

0

1

4

0

0

0

0

1

0

0

1

0

0

5

0

0

0

0

0

0

0

0

1

0

M0=

1

2

3

4

5

6

7

8

9

1
0

1

0

1

0

0

1

0

1

1

0

1

2

0

0

1

0

0

1

0

0

0

0

3

0

0

0

0

0

0

1

0

0

0

4

0

0

0

0

0

0

0

1

0

0

5

0

0

0

0

0

0

0

0

0

0

How much do we pay?





The running time is O(kn(1+m/w)
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Delection: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
0 000 ...........0 x in b in ary
Length-1


x > 0 and Length = log2 x +1
e.g., 9 represented as <000,1001>.



g-code for x takes 2 log2 x +1 bits

(ie. factor of 2 from optimal)



Optimal for Pr(x) = 1/2x2, and i.i.d integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8

6

3

59

7

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How much good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
1 ≥Si=1,...,x pi ≥ x * px  x ≤ 1/px

How good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):



p i  | g ( i ) |

i  1 ,..., S

This is:



i  1 ,.., S

p i  [ 2 * log

1

 1]

pi

 2 * H 0 ( X ) 1
No much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: roughly 1/i^q, where 1 < q < 2

A new concept: Continuers vs Stoppers
(previously we used s = c = 128)

The main idea is:
  s + c = 256 (we are playing with 8 bits)
  the s stopper values encode s items with 1 byte,
  then s*c items with 2 bytes, s*c^2 with 3 bytes, ...

An example

5000 distinct words
ETDC encodes 128 + 128^2 = 16512 words on at most 2 bytes
A (230,26)-dense code encodes 230 + 230*26 = 6210 words on at most 2 bytes,
hence more words on 1 byte: better if the distribution is skewed...
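As an aside, a few lines of Python (an illustrative sketch with a hypothetical function name) make the byte-count arithmetic of (s,c)-dense codes explicit:

def sc_dense_length(rank, s):
    # Bytes used by an (s,c)-dense code (c = 256 - s) for the word of the
    # given rank, 0-based with the most frequent word first.
    c = 256 - s
    k, covered = 1, s              # s words take 1 byte, s*c take 2, s*c^2 take 3, ...
    while rank >= covered:
        covered += s * c ** k
        k += 1
    return k

# Coverage within 2 bytes, as in the example above:
print(128 + 128 * 128)             # ETDC, i.e. (128,128): 16512 words
print(230 + 230 * 26)              # (230,26)-dense code:   6210 words
print(sc_dense_length(100, 230))   # 1  (ranks 0..229 fit in one byte)
print(sc_dense_length(200, 128))   # 2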

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 256 - s.

  Brute-force approach, or
  Binary search: on real distributions there seems to be one unique minimum

  Ks  = max codeword length
  Fsk = cumulative probability of the symbols whose |codeword| <= k

Experiments: (s,c)-DC is quite interesting…
Search is 6% faster than byte-aligned Huffword

Streaming compression
Still, you need to determine and sort all the terms….
Can we do everything in one pass?

Move-to-Front (MTF):
  as a freq-sorting approximator
  as a caching strategy
  as a compressor

Run-Length-Encoding (RLE):
  FAX compression

Move to Front Coding
Transforms a char sequence into an integer sequence, which can then be
var-length coded.

  Start with the list of symbols L = [a,b,c,d,…]
  For each input symbol s:
    1) output the position of s in L
    2) move s to the front of L

There is a memory
Properties:
  Exploits temporal locality, and it is dynamic
  X = 1^n 2^n 3^n … n^n  =>  Huff = O(n^2 log n), MTF = O(n log n) + n^2 bits

Not much worse than Huffman ...but it may be far better
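A minimal Python sketch of the MTF transform and its inverse (illustrative helpers, hypothetical names):

def mtf_encode(text, alphabet):
    L = list(alphabet)
    out = []
    for s in text:
        i = L.index(s)            # position of s in the current list
        out.append(i)
        L.insert(0, L.pop(i))     # move s to the front
    return out

def mtf_decode(codes, alphabet):
    L = list(alphabet)
    out = []
    for i in codes:
        s = L[i]
        out.append(s)
        L.insert(0, L.pop(i))
    return "".join(out)

codes = mtf_encode("abbbaacccca", "abcd")
print(codes)                         # [0, 1, 0, 0, 1, 0, 2, 0, 0, 0, 1]
print(mtf_decode(codes, "abcd"))     # "abbbaacccca"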

MTF: how good is it ?
Encode the integers via g-coding: |g(i)| ≤ 2 * log i + 1
Put the alphabet S at the front and consider the cost of encoding:

  O(|S| log |S|)  +  Sum_{x=1,...,|S|} Sum_{i=2,...,nx} | g( p_x^i - p_x^(i-1) ) |

(p_x^i is the position of the i-th occurrence of symbol x, so g is applied
to the gap between consecutive occurrences of x)

By Jensen’s inequality:

  ≤  O(|S| log |S|)  +  Sum_{x=1,...,|S|} nx * [ 2 * log(N / nx) + 1 ]
  ≤  O(|S| log |S|)  +  N * [ 2 * H0(X) + 1 ]

Hence  La[mtf]  ≤  2 * H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and separators) as the
symbols to be encoded.
How to maintain the MTF-list efficiently:

  Search tree
    Leaves contain the symbols, ordered as in the MTF-list
    Nodes contain the size of their descending subtree

  Hash Table
    key is a symbol
    data is a pointer to the corresponding tree leaf

Each tree operation takes O(log |S|) time
Total is O(n log |S|), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
  abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings, just the run lengths and one starting bit suffice.
Properties:
  Exploits spatial locality, and it is a dynamic code; there is a memory
  X = 1^n 2^n 3^n … n^n  =>  Huff(X) = Theta(n^2 log n) > Rle(X) = Theta(n log n)
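And a corresponding RLE sketch in Python (illustrative only):

def rle_encode(s):
    # Collapse each maximal run into a (symbol, length) pair.
    runs = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1][1] += 1
        else:
            runs.append([ch, 1])
    return [(c, n) for c, n in runs]

def rle_decode(runs):
    return "".join(c * n for c, n in runs)

r = rle_encode("abbbaacccca")
print(r)                 # [('a', 1), ('b', 3), ('a', 2), ('c', 4), ('a', 1)]
print(rle_decode(r))     # "abbbaacccca"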

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol an interval within the range from 0 (inclusive) to 1
(exclusive), of width equal to its probability, starting at

  f(i) = Sum_{j=1,...,i-1} p(j)

e.g. with p(a) = .2, p(b) = .5, p(c) = .3:
  f(a) = .0,  f(b) = .2,  f(c) = .7
  a -> [.0,.2),  b -> [.2,.7),  c -> [.7,1.0)

The interval for a particular symbol will be called the symbol interval
(e.g. for b it is [.2,.7)).

Arithmetic Coding: Encoding Example
Coding the message sequence: bac

  start:     [0.0, 1.0)
  after b:   [0.2, 0.7)      (b = .5 slice of [0,1))
  after a:   [0.2, 0.3)      (a = .2 slice of [.2,.7))
  after c:   [0.27, 0.3)     (c = .3 slice of [.2,.3))

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c_1 ... c_n with probabilities p[c], use:

  l_0 = 0,   l_i = l_(i-1) + s_(i-1) * f[c_i]
  s_0 = 1,   s_i = s_(i-1) * p[c_i]

f[c] is the cumulative probability up to symbol c (not included).
The final interval size is

  s_n = Prod_{i=1,...,n} p[c_i]

The interval for a message sequence will be called the sequence interval.
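A small Python sketch (illustrative, hypothetical helper name) of the interval computation above, reproducing the bac example:

def sequence_interval(msg, p):
    # Compute the sequence interval [l, l+s) of a message, given the
    # symbol probabilities p (a dict).
    syms = sorted(p)                 # any fixed symbol order works
    f, acc = {}, 0.0
    for c in syms:                   # f[c] = cumulative prob. of preceding symbols
        f[c] = acc
        acc += p[c]
    l, s = 0.0, 1.0
    for c in msg:
        l = l + s * f[c]             # l_i = l_{i-1} + s_{i-1} * f[c_i]
        s = s * p[c]                 # s_i = s_{i-1} * p[c_i]
    return l, s

l, s = sequence_interval("bac", {"a": .2, "b": .5, "c": .3})
print(l, l + s)    # ~0.27 ~0.3: the interval [.27,.3) of the example (up to float rounding)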

Uniquely defining an interval
Important property: the intervals for distinct messages of length n never overlap.
Therefore, specifying any number in the final interval uniquely determines the message.
Decoding is similar to encoding, but at each step we need to determine the
next message symbol and then reduce the interval accordingly.

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:

  .49 is in [.2,.7)    = b’s slice of [0,1)       -> output b
  .49 is in [.3,.55)   = b’s slice of [.2,.7)     -> output b
  .49 is in [.475,.55) = c’s slice of [.3,.55)    -> output c

The message is bbc.
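Its inverse, again as an illustrative Python sketch that ignores numerical-precision issues:

def arithmetic_decode(z, length, p):
    # Decode `length` symbols from the real number z, given the symbol
    # probabilities p (the inverse of the interval construction above).
    syms = sorted(p)
    msg = []
    l, s = 0.0, 1.0
    for _ in range(length):
        for c in syms:                    # find the sub-interval containing z
            lo = l
            hi = l + s * p[c]
            if lo <= z < hi:
                msg.append(c)
                l, s = lo, s * p[c]
                break
            l = hi                        # move to the next symbol interval
    return "".join(msg)

print(arithmetic_decode(0.49, 3, {"a": .2, "b": .5, "c": .3}))   # "bbc"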

Representing a real number
Binary fractional representation:
  .75   = .11
  1/3   = .010101...
  11/16 = .1011

Algorithm (emit the bits of x, with 0 < x < 1):
  1. x = 2 * x
  2. if x < 1 output 0
  3. else x = x - 1; output 1

So how about just using the shortest binary fractional representation in
the sequence interval?
  e.g.  [0,.33) -> .01      [.33,.66) -> .1      [.66,1) -> .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions.

  number   min      max      interval
  .11      .110     .111     [.75, 1.0)
  .101     .1010    .1011    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval
is contained in the sequence interval (a dyadic number).

  e.g. sequence interval [.61, .79) contains the code interval of .101 = [.625, .75)

Can use L + s/2 truncated to 1 + ceil(log (1/s)) bits

Bound on Arithmetic length

Note that  ceil(-log s) + 1 = ceil(log (2/s))

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

  1 + ceil( log (1/s) )
  = 1 + ceil( log Prod_{i=1,...,n} (1/p_i) )
  ≤ 2 + Sum_{i=1,...,n} log (1/p_i)
  = 2 + Sum_{k=1,...,|S|} n * p_k * log (1/p_k)
  = 2 + n * H0   bits

nH0 + 0.02 n bits in practice, because of rounding.

Integer Arithmetic Coding
Problem: operations on arbitrary-precision real numbers are expensive.
Key ideas of the integer version:
  Keep integers in the range [0, R) where R = 2^k
  Use rounding to generate integer sub-intervals
  Whenever the sequence interval falls into the top, bottom or middle half,
  expand the interval by a factor of 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l >= R/2 (top half):
  output 1 followed by m 0s; set m = 0; the message interval is expanded by 2
If u < R/2 (bottom half):
  output 0 followed by m 1s; set m = 0; the message interval is expanded by 2
If l >= R/4 and u < 3R/4 (middle half):
  increment m; the message interval is expanded by 2
In all other cases, just continue...
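The following Python sketch shows one common way to realize the three rescalings (the exact open/closed conventions on the interval ends are an assumption of this sketch, not the slides' protocol):

def rescale(l, u, R, m, out):
    # Repeatedly apply the top/bottom/middle-half expansions to the current
    # integer interval [l, u) inside [0, R), emitting bits into `out`.
    while True:
        if u < R // 2:                          # bottom half: emit 0, then m pending 1s
            out.append(0); out.extend([1] * m); m = 0
        elif l >= R // 2:                       # top half: emit 1, then m pending 0s
            out.append(1); out.extend([0] * m); m = 0
            l -= R // 2; u -= R // 2
        elif l >= R // 4 and u < 3 * R // 4:    # middle half: defer the bit
            m += 1
            l -= R // 4; u -= R // 4
        else:
            return l, u, m
        l *= 2; u *= 2                          # expand the interval by a factor 2

out = []
print(rescale(l=600, u=680, R=1024, m=0, out=out), out)   # (192, 832, 1) [1, 0]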

You find this at

Arithmetic ToolBox
As a state machine: the ATB keeps the current interval (L,s); fed with the
next symbol c and the current distribution (p1,...,p|S|), it produces the
new interval (L',s') = (L + s*f[c], s*p[c]).

  (L,s)  --[ c, (p1,...,p|S|) ]-->  ATB  -->  (L',s')

Therefore, even the distribution can change over time.

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.
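A rough Python sketch of the context fallback (the exact escape probabilities depend on the PPM variant, so the raw count tables are returned as-is; all names are hypothetical):

def ppm_events(contexts, history, c, k):
    # Which (symbol, count-table) pairs PPM sends to the arithmetic coder
    # for the next character c: try the longest context first, emit 'esc'
    # and shorten the context while c has never been seen in it.
    events = []
    for size in range(min(k, len(history)), -1, -1):
        ctx = history[len(history) - size:]
        counts = contexts.get(ctx, {})
        if c in counts:
            events.append((c, counts))        # coded within this context
            return events
        if counts:
            events.append(("esc", counts))    # context seen, symbol not
    events.append((c, None))                  # order -1: uniform over the alphabet
    return events

def ppm_update(contexts, history, c, k):
    # After coding c, update the statistics of every context of size <= k.
    for size in range(min(k, len(history)) + 1):
        ctx = history[len(history) - size:]
        contexts.setdefault(ctx, {})
        contexts[ctx][c] = contexts[ctx].get(c, 0) + 1

contexts, hist = {}, ""
for ch in "ACCBACCACBA":
    ppm_update(contexts, hist, ch, k=2)
    hist += ch
print(contexts[""])      # {'A': 4, 'C': 5, 'B': 2}  -- the Empty row of the table below
print(contexts["AC"])    # {'C': 2, 'B': 1}
print([e[0] for e in ppm_events(contexts, hist, "B", 2)])   # ['esc', 'esc', 'B']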

PPM + Arithmetic ToolBox
At each step the model feeds the ATB with the conditional distribution
p[ s | context ], where s is either the next character c or the escape
symbol esc; the ATB maps (L,s) to (L',s') as before.

Encoder and Decoder must know the protocol for selecting the same
conditional probability distribution (PPM-variant).

PPM: Example Contexts          String = ACCBACCACBA B        k = 2

  Context   Counts
  Empty     A = 4   B = 2   C = 5   $ = 3

  A         C = 3   $ = 1
  B         A = 2   $ = 1
  C         A = 1   B = 2   C = 2   $ = 3

  AC        B = 1   C = 2   $ = 2
  BA        C = 1   $ = 1
  CA        C = 1   $ = 1
  CB        A = 2   $ = 1
  CC        A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary = the part already scanned; the Cursor marks the current position
(all candidate substrings start there).   Example output: <2,3,c>

Algorithm’s step:
  Output <d, len, c> where
    d   = distance of the copied string wrt the current position
    len = length of the longest match
    c   = next char in the text beyond the longest match
  Advance by len + 1

A buffer “window” of fixed length slides over the text

Example: LZ77 with window (size 6)
T = a a c a a c a b c a b a a a c

  outputs:  (0,0,a)   (1,1,c)   (3,4,b)   (3,3,a)   (1,2,c)

At each step: longest match within the window W, then the next character.
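A didactic Python sketch of the greedy LZ77 parser with a sliding window (quadratic time; real encoders use hash tables, as noted below); it reproduces the example:

def lz77_encode(T, W):
    out, i, n = [], 0, len(T)
    while i < n:
        best_d, best_len = 0, 0
        for j in range(max(0, i - W), i):          # candidate copy sources in the window
            l = 0
            while i + l < n - 1 and T[j + l] == T[i + l]:
                l += 1                              # overlap with the cursor is allowed
            if l > best_len:
                best_d, best_len = i - j, l
        out.append((best_d, best_len, T[i + best_len]))
        i += best_len + 1
    return out

print(lz77_encode("aacaacabcabaaac", 6))
# [(0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), (3, 3, 'a'), (1, 2, 'c')]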

LZ77 Decoding
The decoder keeps the same dictionary window as the encoder:
it finds the referenced substring and inserts a copy of it.

What if len > d? (overlap with the text still being decompressed)
  E.g. seen = abcd, next codeword is (2,9,e)
  Simply copy starting at the cursor:
    for (i = 0; i < len; i++)
       out[cursor+i] = out[cursor-d+i];
  Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: output one of the two formats
  (0, position, length)   or   (1, char)
Typically the second format is used if length < 3.

Special greedy: possibly use a shorter match so that the next match is better
Hash table to speed up the search over triples
Triples are coded with a Huffman code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids
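A compact Python sketch of the LZ78 coder (the trie is simulated by a dictionary keyed on (parent id, char)); it reproduces the coding example that follows:

def lz78_encode(T):
    dict_ids = {}                 # (parent id, next char) -> new id
    next_id = 1
    out, cur = [], 0              # cur = id of the current longest match S (0 = empty)
    for c in T:
        if (cur, c) in dict_ids:          # keep extending the current match
            cur = dict_ids[(cur, c)]
        else:
            out.append((cur, c))          # output the id of S and the next char c
            dict_ids[(cur, c)] = next_id  # add S·c to the dictionary
            next_id += 1
            cur = 0
    if cur:                               # input ended in the middle of a match
        out.append((cur, ""))
    return out

print(lz78_encode("aabaacabcabcb"))
# [(0, 'a'), (1, 'b'), (1, 'a'), (0, 'c'), (2, 'c'), (5, 'b')]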

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send the extra character c, but still add Sc to the dictionary.
Dictionary:
  initialized with the 256 ascii entries (in the example, a = 112)
The decoder is one step behind the coder, since it does not know c.
  There is an issue for strings of the form SSc where S[0] = c:
  these are handled specially!!!
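A Python sketch of the LZW decoder, including the special case above (illustrative only; a toy 3-symbol alphabet replaces the 256 ASCII entries of the slides):

def lzw_decode(codes, alphabet):
    # Rebuild the dictionary one step behind the encoder.  When a code is
    # not yet in the dictionary (the special case), the missing entry must
    # be prev + prev[0].
    dict_ = {i: ch for i, ch in enumerate(alphabet)}
    next_id = len(alphabet)
    prev = dict_[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in dict_:
            cur = dict_[code]
        else:                                 # code just created by the encoder
            cur = prev + prev[0]
        out.append(cur)
        dict_[next_id] = prev + cur[0]        # the entry the encoder added one step earlier
        next_id += 1
        prev = cur
    return "".join(out)

# With a=0, b=1, c=2 this is the slides' example string; code 8 hits the special case.
print(lzw_decode([0, 0, 1, 3, 2, 4, 8, 2, 1], "abc"))   # "aabaacababacb"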

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform                              (1994)
Let us be given a text T = mississippi#

Rotations of T:          Sort the rows:    F ............ L
mississippi#             # mississipp i
ississippi#m             i #mississip p
ssissippi#mi             i ppi#missis s
sissippi#mis             i ssippi#mis s
issippi#miss             i ssissippi# m
ssippi#missi             m ississippi #
sippi#missis             p i#mississi p
ippi#mississ             p pi#mississ i
ppi#mississi             s ippi#missi s
pi#mississip             s issippi#mi s
i#mississipp             s sippi#miss i
#mississippi             s sissippi#m i

A famous example

Much
longer...

A useful tool: L → F mapping
(same sorted-rotation matrix as above, with F and L known and the rest unknown)

How do we map L’s chars onto F’s chars?
... we need to distinguish equal chars in F...
Take two equal chars of L: rotating their rows rightward by one position
shows that they keep the same relative order in F !!

The BWT is invertible
(again the matrix above: F and L are known, the middle is unknown)

Two key properties:
1. The LF-array maps L’s chars to F’s chars
2. L[ i ] precedes F[ i ] in T

Reconstruct T backward:   T = .... i p p i #

InvertBWT(L)
  Compute LF[0,n-1];
  r = 0; i = n;
  while (i > 0) {
    T[i] = L[r];
    r = LF[r]; i--;
  }

How to compute the BWT ?

  SA     BWT matrix       L
  12     #mississipp      i
  11     i#mississip      p
   8     ippi#missis      s
   5     issippi#mis      s
   2     ississippi#      m
   1     mississippi      #
  10     pi#mississi      p
   9     ppi#mississ      i
   7     sippi#missi      s
   4     sissippi#mi      s
   6     ssippi#miss      i
   3     ssissippi#m      i

We said that: L[i] precedes F[i] in T.   For example, L[3] = T[ 7 ].
Given SA and T, we have L[i] = T[SA[i]-1]
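A direct Python transcription of this relation (with the naive suffix-array construction criticized in the next slide):

def suffix_array(T):
    # Naive construction: sort the 1-based suffix starting positions.
    return sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])

def bwt(T):
    # L[i] = T[SA[i] - 1], where "position 0" means the last char of T.
    sa = suffix_array(T)
    return "".join(T[i - 2] if i > 1 else T[-1] for i in sa)

T = "mississippi#"
print(suffix_array(T))   # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(bwt(T))            # "ipssm#pissii", the L column above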

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Theta(n^2 log n) time in the worst-case
• Theta(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion pages available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change

8% new pages, 25% new links change weekly
Lifetime of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: the probability that a node has x links is about 1/x^a, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999 and WebBase crawl, 2001:
the indegree follows a power-law distribution

  Pr[ in-degree(u) = k ]  ∝  1 / k^a ,    a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: the probability that a node has x links is about 1/x^a, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph

[figure: dot-plot of the adjacency matrix, axes i and j]

21 million pages, 150 million links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77-scheme provides an efficient, optimal solution:

fknown plays the role of the “previously encoded text”; compress the
concatenation fknown·fnew starting from fnew.

zdelta is one of the best implementations
            Emacs size   Emacs time
uncompr     27Mb         ---
gzip        8Mb          35 secs
zdelta      1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

[figure: Client-side proxy (with reference page) <-- slow link, delta-encoded
requests/pages --> server-side Proxy (with reference page) <-- fast link --> web]

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[figure: weighted graph GF over the files plus the dummy node 0; each edge
weight is the zdelta (or gzip, from the dummy node) compressed size, and the
min branching selects one incoming edge per file]

            space   time
uncompr     30Mb    ---
tgz         20%     linear
THIS        8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
            space    time
uncompr     260Mb    ---
tgz         12%      2 mins
THIS        8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm (contd)

  simple, widely used, single roundtrip
  optimizations: 4-byte rolling hash + 2-byte MD5, gzip for the literals
  the choice of the block size is problematic (default: max{700, √n} bytes)
  not good in theory: the granularity of changes may disrupt the use of blocks
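For intuition, a small Python sketch of an Adler-style rolling checksum in the spirit of rsync's 4-byte weak hash (this is not rsync's exact formula):

def rolling_checksum(block):
    M = 1 << 16
    a = sum(block) % M
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % M
    return a, b

def roll(a, b, out_byte, in_byte, blocksize):
    # Slide the window one byte to the right in O(1) time.
    M = 1 << 16
    a = (a - out_byte + in_byte) % M
    b = (b - blocksize * out_byte + a) % M
    return a, b

data = b"the quick brown fox jumps over the lazy dog"
k = 8
a, b = rolling_checksum(data[0:k])
for i in range(1, len(data) - k + 1):
    a, b = roll(a, b, data[i - 1], data[i + k - 1], k)
    assert (a, b) == rolling_checksum(data[i:i + k])   # rolling == recomputing
print("rolling update matches full recomputation")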

Rsync: some experiments

           gcc size   emacs size
total      27288      27326
gzip        7563       8577
zdelta       227       1431
rsync        964       4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends the hashes (unlike the client in rsync), and the client checks them.
The server deploys the common fref to compress the new ftar (rsync compresses just ftar).

A multi-round protocol
  k blocks of n/k elements, log(n/k) levels

If the distance is k, then on each level at most k hashes do not find a
match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits.

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree

T# = mississippi#
     (positions 1,...,12)

[figure: the compacted trie of all the suffixes of T#; edge labels include
“i”, “s”, “si”, “ssi”, “p”, “pi#”, “ppi#”, “i#”, “mississippi#”, “#”, and
the leaves store the starting positions 1..12 of the corresponding suffixes]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

T = mississippi#       (storing SUF(T) explicitly would take Theta(N^2) space)

  SA      SUF(T)
  12      #
  11      i#
   8      ippi#
   5      issippi#
   2      ississippi#
   1      mississippi#
  10      pi#
   9      ppi#
   7      sippi#
   4      sissippi#
   6      ssippi#
   3      ssissippi#

P = si

Suffix Array = the array of suffix pointers:
 • SA: Theta(N log2 N) bits
 • Text T: N chars
 => In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison,
2 accesses per step (one to SA, one to T).

T = mississippi#,  P = si
At each step compare P against the suffix pointed to by the middle SA entry:
if P is larger, recurse on the right half; if P is smaller, on the left half.

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char comparisons
 => overall, O(p log2 N) time
   [improved to O(p + log2 N) by Manber-Myers ’90, and to O(p + log2 |S|) by Cole et al. ’06]
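A Python sketch of the indirect binary search (naively recomputing suffix comparisons, hence O(p log N) character comparisons; names are hypothetical):

def sa_range(T, SA, P):
    # SA holds 1-based suffix starting positions, sorted lexicographically.
    # Returns the SA sub-range [lo, hi) whose suffixes have prefix P.
    def suffix(r):
        return T[SA[r] - 1:]
    n = len(SA)
    lo, hi = 0, n
    while lo < hi:                       # first suffix >= P
        mid = (lo + hi) // 2
        if suffix(mid) < P:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    lo, hi = first, n
    while lo < hi:                       # first suffix not prefixed by P
        mid = (lo + hi) // 2
        if suffix(mid).startswith(P):
            lo = mid + 1
        else:
            hi = mid
    return first, lo

T = "mississippi#"
SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
a, b = sa_range(T, SA, "si")
print([SA[r] for r in range(a, b)])      # [7, 4]: "si" occurs at positions 7 and 4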

Locating the occurrences

T = mississippi#,  P = si,  occ = 2
The binary search isolates in SA the contiguous range of suffixes whose
prefix is “si” (conceptually, those lying between si# and si$: sippi... and
sissippi...); they point to the occurrences at positions 7 and 4.

Suffix Array search
• O(p + log2 N + occ) time

Suffix Trays: O(p + log2 |S| + occ)    [Cole et al., ’06]
String B-tree                          [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays           [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

T = mississippi#
  SA  = 12 11 8 5 2 1 10 9 7 4 6 3
  Lcp =  0  1 1 4 0 0  1 0 2 1 3      (e.g. lcp(issippi#, ississippi#) = 4)

• How long is the common prefix between T[i,...] and T[j,...] ?
  • The min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  • Search for an entry Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  • Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.


Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1
1 2 3 4 5 6 7 8 910

M1=

T=xabxabaaca
P=
abaad

1

2

3

4

5

6

7

8

9

1
0

1

1

1

1

1

1

1

1

1

1

1

2

0

0

1

0

0

1

0

1

1

0

3

0

0

0

1

0

0

1

0

0

1

4

0

0

0

0

1

0

0

1

0

0

5

0

0

0

0

0

0

0

0

1

0

M0=

1

2

3

4

5

6

7

8

9

1
0

1

0

1

0

0

1

0

1

1

0

1

2

0

0

1

0

0

1

0

0

0

0

3

0

0

0

0

0

0

1

0

0

0

4

0

0

0

0

0

0

0

1

0

0

5

0

0

0

0

0

0

0

0

0

0

How much do we pay?





The running time is O(kn(1+m/w)
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Delection: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
0 000 ...........0 x in b in ary
Length-1


x > 0 and Length = log2 x +1
e.g., 9 represented as <000,1001>.



g-code for x takes 2 log2 x +1 bits

(ie. factor of 2 from optimal)



Optimal for Pr(x) = 1/2x2, and i.i.d integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8

6

3

59

7

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How much good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
1 ≥Si=1,...,x pi ≥ x * px  x ≤ 1/px

How good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):



p i  | g ( i ) |

i  1 ,..., S

This is:



i  1 ,.., S

p i  [ 2 * log

1

 1]

pi

 2 * H 0 ( X ) 1
No much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploits temporal locality, and it is dynamic

X = 1^n 2^n 3^n … n^n  ⇒  Huff = Θ(n² log n), MTF = O(n log n) + n²

Not much worse than Huffman
...but it may be far better
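A minimal byte-alphabet MTF encoder (mine), to make steps 1) and 2) above concrete; the output positions would then be γ- or Huffman-coded.

#include <stdio.h>
#include <string.h>

/* Move-to-Front: turn a byte sequence into a sequence of list positions. */
void mtf_encode(const unsigned char *text, int n, int *out) {
    unsigned char list[256];
    for (int i = 0; i < 256; i++) list[i] = i;     /* initial list L = [0,1,2,...] */
    for (int i = 0; i < n; i++) {
        int pos = 0;
        while (list[pos] != text[i]) pos++;        /* 1) position of the symbol in L */
        out[i] = pos;
        memmove(list + 1, list, pos);              /* 2) move it to the front of L */
        list[0] = text[i];
    }
}

int main(void) {
    const char *t = "aaaabbbbcccc";                /* temporal locality => small integers */
    int out[12];
    mtf_encode((const unsigned char *)t, 12, out);
    for (int i = 0; i < 12; i++) printf("%d ", out[i]);
    printf("\n");                                  /* 97 0 0 0 98 0 0 0 99 0 0 0 */
    return 0;
}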

MTF: how good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1
Put Σ at the front of the list and consider the cost of encoding
(nx = #occurrences of symbol x, whose positions in the text are p^x_1 < p^x_2 < ...):

  cost ≤ O(|Σ| log |Σ|) + Σ_{x=1..|Σ|} Σ_{i=2..nx} |γ( p^x_i − p^x_{i-1} )|

By Jensen's inequality:

  ≤ O(|Σ| log |Σ|) + Σ_{x=1..|Σ|} nx · [ 2·log(N/nx) + 1 ]
  = O(|Σ| log |Σ|) + N · [ 2·H0(X) + 1 ]

Hence La[mtf] ≤ 2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings ⇒ just the run lengths and one bit
Properties:

There is a memory
Exploits spatial locality, and it is a dynamic code
X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = Θ(n² log n) > Rle(X) = Θ(n log n)
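A tiny sketch (mine) reproducing the slide's RLE example.

#include <stdio.h>

/* Run-Length Encoding: emit (symbol, run length) pairs. */
void rle(const char *s) {
    for (int i = 0; s[i]; ) {
        int j = i;
        while (s[j] == s[i]) j++;                  /* extend the current run */
        printf("(%c,%d)", s[i], j - i);
        i = j;
    }
    printf("\n");
}

int main(void) {
    rle("abbbaacccca");    /* (a,1)(b,3)(a,2)(c,4)(a,1) */
    return 0;
}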

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive),
according to the cumulative probabilities
  f(i) = Σ_{j < i} p(j)
e.g., with p(a) = .2, p(b) = .5, p(c) = .3:
  f(a) = .0, f(b) = .2, f(c) = .7
so a = [.0,.2), b = [.2,.7), c = [.7,1.0).

The interval for a particular symbol will be called
the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
  start   : [0, 1)
  after b : [.2, .7)
  after a : [.2, .3)
  after c : [.27, .3)
The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities p[c] use the following:
  l_0 = 0,  s_0 = 1
  l_i = l_{i-1} + s_{i-1} · f[c_i]
  s_i = s_{i-1} · p[c_i]
f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is
  s_n = ∏_{i=1..n} p[c_i]
The interval for a message sequence will be called the
sequence interval
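A few lines (mine) that simply apply these recurrences to the message bac; the doubles are only for illustration, real coders use the integer version described later.

#include <stdio.h>

int main(void) {
    /* probabilities and cumulative f[] for a, b, c as on the slide */
    double p[] = { .2, .5, .3 };
    double f[] = { .0, .2, .7 };
    const char *msg = "bac";
    double l = 0.0, s = 1.0;                       /* l_0 = 0, s_0 = 1 */
    for (int i = 0; msg[i]; i++) {
        int c = msg[i] - 'a';
        l = l + s * f[c];                          /* shrink and shift the interval */
        s = s * p[c];
    }
    printf("sequence interval = [%.4f, %.4f)\n", l, l + s);   /* [0.2700, 0.3000) */
    return 0;
}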

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:
  .49 ∈ [.2, .7)    = symbol interval of b  → output b
  .49 ∈ [.3, .55)   = sub-interval of b     → output b
  .49 ∈ [.475, .55) = sub-interval of c     → output c
The message is bbc.

Representing a real number
Binary fractional representation:
  .75 = .11     1/3 = .010101…     11/16 = .1011
Algorithm:
  1. x = 2*x
  2. If x < 1 output 0
  3. else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence interval?
  e.g. [0,.33) → .01    [.33,.66) → .1    [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
  number   min       max       interval
  .11      .1100…    .1111…    [.75, 1.0)
  .101     .1010…    .1011…    [.625, .75)
We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (a dyadic number).
  e.g. sequence interval = [.61, .79), code interval of .101 = [.625, .75)

Can use l + s/2 truncated to 1 + ⌈log (1/s)⌉ bits

Bound on Arithmetic length
Note that −log s + 1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
  1 + ⌈log (1/s)⌉ = 1 + ⌈log ∏_i (1/pi)⌉
  ≤ 2 + Σ_{i=1..n} log (1/pi)
  = 2 + Σ_{k=1..|Σ|} n·pk·log (1/pk)
  = 2 + n·H0 bits

In practice, nH0 + 0.02·n bits, because of rounding

Integer Arithmetic Coding
The problem is that operations on arbitrary-precision real numbers are expensive.
Key ideas of the integer version:
  Keep integers in range [0..R) where R = 2^k
  Use rounding to generate integer intervals
  Whenever the sequence interval falls into the top, bottom or middle half, expand the interval by a factor 2
Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
  Output 1 followed by m 0s
  m = 0
  Message interval is expanded by 2
If u < R/2 then (bottom half)
  Output 0 followed by m 1s
  m = 0
  Message interval is expanded by 2
If l ≥ R/4 and u < 3R/4 then (middle half)
  Increment m
  Message interval is expanded by 2
In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine: the ATB receives the current interval (L,s), a symbol c and its distribution (p1,....,p|Σ|), and returns the new interval (L',s').
[Slide figure: two chained ATB boxes mapping (L,s) to (L',s')]
Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
[Slide figure: the ATB is fed with p[ s | context ], where s = c or esc, and maps (L,s) to (L',s')]
Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts (k = 2)
String = ACCBACCACBA B   (the final B is the next symbol to be encoded)

Context: Empty   Counts: A = 4, B = 2, C = 5, $ = 3

Context: A       Counts: C = 3, $ = 1
Context: B       Counts: A = 2, $ = 1
Context: C       Counts: A = 1, B = 2, C = 2, $ = 3

Context: AC      Counts: B = 1, C = 2, $ = 2
Context: BA      Counts: C = 1, $ = 1
Context: CA      Counts: C = 1, $ = 1
Context: CB      Counts: A = 2, $ = 1
Context: CC      Counts: A = 1, B = 1, $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
[Slide figure: T = a a c a a c a b c a b a b a c; the dictionary is the set of all substrings starting before the cursor; the triple emitted at the cursor is <2,3,c>]

Algorithm's step:
  Output <d, len, c> where
    d = distance of the copied string wrt the current position
    len = length of the longest match
    c = next char in the text beyond the longest match
  Advance by len + 1

A buffer "window" has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
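A sketch (mine) of the overlap-safe copy just described, replayed on the slide's example; this is not gzip's actual format, only the decoding step for one triple.

#include <stdio.h>

/* Decode one LZ77 triple (d, len, c) at position `cursor` of out[].
   The copy is byte by byte, so overlapping copies (len > d) work. */
int lz77_decode_step(char *out, int cursor, int d, int len, char c) {
    for (int i = 0; i < len; i++)
        out[cursor + i] = out[cursor - d + i];   /* may read what we just wrote */
    cursor += len;
    out[cursor++] = c;                           /* explicit next character */
    return cursor;
}

int main(void) {
    char out[64] = "abcd";                       /* "seen = abcd" on the slide */
    int cursor = lz77_decode_step(out, 4, 2, 9, 'e');
    out[cursor] = '\0';
    printf("%s\n", out);                         /* abcdcdcdcdcdce */
    return 0;
}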

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
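A compact LZW encoder sketch (mine). To reproduce the slide's numbers it seeds only a, b, c at codes 112–114 and starts new entries at 256; real LZW seeds all 256 byte values. The dictionary here is a flat (prefix code, char) table with linear lookup, which is enough for an illustration.

#include <stdio.h>

#define MAXD 4096
static int dict_prefix[MAXD], dict_char[MAXD], dict_size;

static int lookup(int prefix, int ch) {
    for (int i = 256; i < dict_size; i++)
        if (dict_prefix[i] == prefix && dict_char[i] == ch) return i;
    return -1;
}

void lzw_encode(const char *t) {
    dict_size = 256;
    int code['z' + 1];
    code['a'] = 112; code['b'] = 113; code['c'] = 114;   /* seed single chars */
    int S = code[(unsigned char)t[0]];                   /* current match, as a code */
    for (int i = 1; t[i]; i++) {
        int next = lookup(S, t[i]);
        if (next != -1) { S = next; continue; }          /* extend the match */
        printf("%d ", S);                                /* output longest match */
        dict_prefix[dict_size] = S;                      /* add S + t[i] to dictionary */
        dict_char[dict_size] = t[i];
        dict_size++;
        S = code[(unsigned char)t[i]];                   /* restart from t[i] */
    }
    printf("%d\n", S);
}

int main(void) {
    lzw_encode("aabaacababacb");   /* 112 112 113 256 114 257 261 114 113, as on the slide */
    return 0;
}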

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform (1994)
Let us be given a text T = mississippi#. Write all its cyclic rotations, one per row:
  mississippi#
  ississippi#m
  ssissippi#mi
  sissippi#mis
  issippi#miss
  ssippi#missi
  sippi#missis
  ippi#mississ
  ppi#mississi
  pi#mississip
  i#mississipp
  #mississippi

Sort the rows:
  #mississippi
  i#mississipp
  ippi#mississ
  issippi#miss
  ississippi#m
  mississippi#
  pi#mississip
  ppi#mississi
  sippi#missis
  sissippi#mis
  ssippi#missi
  ssissippi#mi

F = first column = #iiiimppssss
L = last column  = ipssm#pissii   (L is the BWT of T)

A famous example
Much longer...

A useful tool: L → F mapping
[Slide figure: the same sorted-rotation matrix as above, with columns F and L; T is unknown to the decoder]

How do we map L's chars onto F's chars ?
... Need to distinguish equal chars in F...

Take two equal chars of L
Rotate their rows rightward
They keep the same relative order !!

The BWT is invertible
[Slide figure: again the sorted-rotation matrix with columns F and L; T is unknown]

Two key properties:
1. The LF-array maps L's chars to F's chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
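A self-contained C sketch (mine) of the same inversion, with LF[] built by counting; it assumes the text ends with a '#' smaller than every other character and uses fixed-size buffers for brevity.

#include <stdio.h>

void invert_bwt(const char *L, int n, char *T) {
    int count[256] = {0}, C[256] = {0}, seen[256] = {0}, LF[64], r = 0;
    for (int i = 0; i < n; i++) count[(unsigned char)L[i]]++;
    for (int c = 1; c < 256; c++)
        C[c] = C[c-1] + count[c-1];                /* C[c] = #chars smaller than c */
    for (int i = 0; i < n; i++) {
        unsigned char c = L[i];
        LF[i] = C[c] + seen[c]++;                  /* LF maps L's chars to F's chars */
        if (c == '#') r = i;                       /* row holding the original text */
    }
    for (int i = n - 1; i >= 0; i--) {
        T[i] = L[r];                               /* L[r] precedes F[r] in T */
        r = LF[r];
    }
    T[n] = '\0';
}

int main(void) {
    char T[64];
    invert_bwt("ipssm#pissii", 12, T);             /* L column from the slide */
    printf("%s\n", T);                             /* mississippi# */
    return 0;
}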

How to compute the BWT ?
Via the suffix array of T: the rows of the BWT matrix, in sorted order, correspond to the suffixes of T.
  SA = 12 11 8 5 2 1 10 9 7 4 6 3
  L  =  i  p s s m #  p i s s i i

We said that: L[i] precedes F[i] in T
  e.g. L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
Input: T = mississippi#
Sort the suffixes of T:
  SA = 12 11 8 5 2 1 10 9 7 4 6 3
  corresponding to the sorted suffixes
  #, i#, ippi#, issippi#, ississippi#, mississippi#, pi#, ppi#, sippi#, sissippi#, ssippi#, ssissippi#

Elegant but inefficient. Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...
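The "elegant but inefficient" approach, spelled out (mine): build SA by sorting suffix pointers with qsort and read off L[i] = T[SA[i]-1]; comparisons are character-wise, which is exactly where the Θ(n² log n) worst case comes from.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static const char *text;

static int cmp_suffix(const void *a, const void *b) {
    return strcmp(text + *(const int *)a, text + *(const int *)b);
}

int main(void) {
    text = "mississippi#";
    int n = strlen(text), SA[32];
    for (int i = 0; i < n; i++) SA[i] = i;          /* suffix starting positions */
    qsort(SA, n, sizeof(int), cmp_suffix);          /* naive: each cmp may cost O(n) */
    for (int i = 0; i < n; i++) printf("%d ", SA[i] + 1);   /* 12 11 8 5 2 1 10 9 7 4 6 3 */
    printf("\nL = ");
    for (int i = 0; i < n; i++)                     /* char preceding each sorted suffix */
        putchar(SA[i] == 0 ? text[n-1] : text[SA[i]-1]);
    printf("\n");                                   /* ipssm#pissii */
    return 0;
}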

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion pages are available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node one can go to any other node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node one can go to any other node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by humankind



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)
  V = URLs, E = (u,v) if u has a hyperlink to v
Isolated URLs are ignored (no IN & no OUT)
Three key properties:
  Skewed distribution: the probability that a node has x links is ∝ 1/x^α, with α ≈ 2.1

The In-degree distribution
[Slide figure: in-degree distributions measured on the Altavista crawl (1999) and on the WebBase crawl (2001); the indegree follows a power-law distribution]
  Pr[ in-degree(u) = k ] ∝ 1/k^α,   α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)
  V = URLs, E = (u,v) if u has a hyperlink to v
Isolated URLs are ignored (no IN, no OUT)

Three key properties:
  Skewed distribution: the probability that a node has x links is ∝ 1/x^α, with α ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
[Slide figure: the adjacency matrix (i,j) of a crawl with 21 million pages and 150 million links, after URL-sorting (Berkeley, Stanford)]

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:
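A small sketch (mine) of the gap-encoding of a successor list as written above. Only the first gap can be negative; here it is folded into a non-negative integer as 2g when g ≥ 0 and 2|g|−1 otherwise — this matches the ×2 folding in the worked numbers of the Extra-nodes slide below (e.g. 3 = |13−15|·2−1), but it is an assumption of this sketch, not necessarily WebGraph's exact rule.

#include <stdio.h>

void encode_successors(int x, const int *succ, int k, int *out) {
    for (int i = 0; i < k; i++) {
        int gap = (i == 0) ? succ[0] - x : succ[i] - succ[i-1] - 1;
        if (i == 0)                                /* only the first gap can be < 0 */
            out[i] = gap >= 0 ? 2 * gap : 2 * (-gap) - 1;
        else
            out[i] = gap;                          /* later gaps are already >= 0 */
    }
}

int main(void) {
    /* a toy node 15 with (hypothetical) successors 13, 15, 16, 17, 18, 19, 23, 24, 203 */
    int succ[] = {13, 15, 16, 17, 18, 19, 23, 24, 203}, out[9];
    encode_successors(15, succ, 9, out);
    for (int i = 0; i < 9; i++) printf("%d ", out[i]);   /* 3 1 0 0 0 0 3 0 178 */
    printf("\n");
    return 0;
}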

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77-scheme provides an efficient, optimal solution:

  fknown is the "previously encoded text": compress the concatenation fknown·fnew, starting from fnew

zdelta is one of the best implementations
           Emacs size   Emacs time
uncompr    27Mb         ---
gzip       8Mb          35 secs
zdelta     1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: a pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link
[Slide figure: Client (holding a reference page) — slow link, delta-encoding — Proxy (holding the same reference) — fast link — web; the request goes out, the page comes back]

Use zdelta to reduce traffic:
  Old version available at both proxies
  Restricted to pages already visited (30% hits), URL-prefix match
  Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Slide figure: a small weighted graph over files {1,2,3,5} plus the dummy node 0; edge weights are zdelta/gzip sizes (e.g. 20, 123, 220, 620, 2000), and the min branching picks the cheapest reference for each file]

           space   time
uncompr    30Mb    ---
tgz        20%     linear
THIS       8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
           space    time
uncompr    260Mb    ---
tgz        12%      2 mins
THIS       8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
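An rsync-style weak rolling checksum sketch (mine): a is the sum of the window's bytes, b the sum of its prefix sums, and sliding by one position costs O(1), so every offset of the old file can be probed cheaply. Constants and modular details differ from rsync's actual implementation.

#include <stdio.h>

#define B 4   /* window (block) length for this toy example */

unsigned weak_hash(const unsigned char *x, unsigned *a, unsigned *b) {
    *a = *b = 0;
    for (int i = 0; i < B; i++) { *a += x[i]; *b += (B - i) * x[i]; }
    return ((*b & 0xffff) << 16) | (*a & 0xffff);
}

unsigned roll(unsigned *a, unsigned *b, unsigned char out, unsigned char in) {
    *a = *a - out + in;                 /* drop the leftmost byte, add the new one */
    *b = *b - B * out + *a;
    return ((*b & 0xffff) << 16) | (*a & 0xffff);
}

int main(void) {
    const unsigned char *t = (const unsigned char *)"abcdefgh";
    unsigned a, b, h = weak_hash(t, &a, &b);
    for (int i = 1; i + B <= 8; i++) {
        h = roll(&a, &b, t[i-1], t[i+B-1]);        /* O(1) slide */
        unsigned a2, b2, h2 = weak_hash(t + i, &a2, &b2);
        printf("window %d: rolled %08x, direct %08x\n", i, h, h2);  /* they agree */
    }
    return 0;
}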

Rsync: some experiments

           gcc size   emacs size
total      27288      27326
gzip       7563       8577
zdelta     227        1431
rsync      964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), and the client checks them
Server deploys the common fref to compress the new ftar (rsync just compresses it).

A multi-round protocol

k blocks of n/k elems
Log n/k levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff P is a prefix of the i-th suffix of T (i.e. T[i,N])

Occurrences of P in T = All suffixes of T having P as a prefix
  e.g. P = si, T = mississippi  →  occurrences at positions 4, 7

SUF(T) = Sorted set of suffixes of T

Reduction: from substring search to prefix search (over the suffixes of T)

The Suffix Tree
T# = mississippi#
[Slide figure: the suffix tree of T#, with edges labeled by substrings (#, i, s, p, si, ssi, ppi#, pi#, i#, mississippi#, ...) and its 12 leaves storing the starting positions of the suffixes]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

Storing SUF(T) explicitly takes Θ(N²) space; store only the suffix pointers:
T = mississippi#
SA = 12 11 8 5 2 1 10 9 7 4 6 3
corresponding to the sorted suffixes
  #, i#, ippi#, issippi#, ississippi#, mississippi#, pi#, ppi#, sippi#, sissippi#, ssippi#, ssissippi#

Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
⇒ In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step
[Slide figures: two steps of the binary search for P = si on the SA of T = mississippi#; at one probed position P is larger than the suffix, at another it is smaller]

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
⇒ overall, O(p log2 N) time
  [improvable to O(p + log2 N) with Manber-Myers, '90, and to O(p + log2 |Σ|) with Cole et al, '06]
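A sketch (mine) of the indirect binary search just described, on the slide's SA; it returns one row whose suffix is prefixed by P (a real search would then binary-search the two boundaries of the range).

#include <stdio.h>
#include <string.h>

/* SA holds 1-based starting positions, as on the slides. */
int sa_search(const char *T, const int *SA, int N, const char *P) {
    int lo = 0, hi = N - 1, p = strlen(P);
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        int c = strncmp(P, T + SA[mid] - 1, p);    /* O(p) chars per step */
        if (c == 0) return mid;
        if (c < 0) hi = mid - 1; else lo = mid + 1;
    }
    return -1;
}

int main(void) {
    const char *T = "mississippi#";
    int SA[] = {12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3};   /* from the slide */
    int row = sa_search(T, SA, 12, "si");
    printf("row %d, position %d\n", row, SA[row]);  /* row 8, position 7 (si also occurs at 4) */
    return 0;
}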

Locating the occurrences
[Slide figure: on the SA of T = mississippi#, the occurrences of P = si are found by binary searching the two boundaries of its suffix range (conceptually, search si# and si$, assuming # < Σ < $); here occ = 2, namely the suffixes sippi... and sissippi... at positions 4 and 7]

Suffix Array search
• O(p + log2 N + occ) time

Suffix Trays: O(p + log2 |Σ| + occ)   [Cole et al., '06]
String B-tree   [Ferragina-Grossi, '95]
Self-adjusting Suffix Arrays   [Ciriani et al., '02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA
T = mississippi#
SA  = 12 11 8 5 2 1 10 9 7 4 6 3
Lcp =  0 0 1 4 0 0 1 0 2 1 3
  (e.g. the Lcp between the adjacent suffixes issippi# and ississippi# is 4)

• How long is the common prefix between T[i,...] and T[j,...] ?
  • Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  • Search for some Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  • Search for a window Lcp[i,i+C-2] whose entries are all ≥ L


Slide 48

How much do we pay?





The running time is O(kn(1+m/w)
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Delection: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
0 000 ...........0 x in b in ary
Length-1


x > 0 and Length = log2 x +1
e.g., 9 represented as <000,1001>.



g-code for x takes 2 log2 x +1 bits

(ie. factor of 2 from optimal)



Optimal for Pr(x) = 1/2x2, and i.i.d integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8

6

3

59

7

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How much good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
1 ≥Si=1,...,x pi ≥ x * px  x ≤ 1/px

How good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):



p i  | g ( i ) |

i  1 ,..., S

This is:



i  1 ,.., S

p i  [ 2 * log

1

 1]

pi

 2 * H 0 ( X ) 1
No much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1n 2n 3n… nn  Huff = O(n2 log n), MTF = O(n log n) + n2

No much worse than Huffman
...but it may be far better

MTF: how good is it ?
Encode the integers via g-coding:  |g(i)| ≤ 2 * log2 i + 1
Put the alphabet S in front and consider the cost of encoding
(nx = #occurrences of symbol x, px^i = position of its i-th occurrence):

   O(|S| log |S|) + ∑_{x=1,...,|S|} ∑_{i=2,...,nx} | g( px^i - px^{i-1} ) |

By Jensen's inequality:

   ≤ O(|S| log |S|) + ∑_{x=1,...,|S|} nx * [ 2 * log2 (N/nx) + 1 ]
   ≤ O(|S| log |S|) + N * [ 2 * H0(X) + 1 ]

Hence  La[mtf] ≤ 2 * H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to maintain the MTF-list efficiently:

Search tree
  Leaves contain the symbols, ordered as in the MTF-list
  Nodes contain the size of their descending subtree

Hash Table
  key is a symbol
  data is a pointer to the corresponding tree leaf

Each tree operation takes O(log |S|) time
Total is O(n log |S|), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings => just the run lengths and the first bit
Properties:
  Exploit spatial locality, and it is a dynamic code
  There is a memory
  X = 1^n 2^n 3^n … n^n  =>  Huff(X) = Θ(n² log n) > Rle(X) = Θ(n log n)
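A minimal run-length encoder for the example above: it emits pairs (symbol, run length).

def rle_encode(s):
    out = []
    for ch in s:
        if out and out[-1][0] == ch:
            out[-1][1] += 1                  # extend the current run
        else:
            out.append([ch, 1])              # start a new run
    return [(c, n) for c, n in out]

print(rle_encode("abbbaacccca"))   # [('a',1), ('b',3), ('a',2), ('c',4), ('a',1)]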

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive),
according to the cumulative probabilities

   f(i) = ∑_{j=1,...,i-1} p(j)

e.g. p(a) = .2, p(b) = .5, p(c) = .3, so
   f(a) = .0, f(b) = .2, f(c) = .7
   a = [0 , .2),   b = [.2 , .7),   c = [.7 , 1.0)

The interval for a particular symbol will be called
the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac   (p(a)=.2, p(b)=.5, p(c)=.3)

   start       [0.0 , 1.0)
   after b     [0.2 , 0.7)     (sub-intervals: a=[.2,.3), b=[.3,.55), c=[.55,.7))
   after a     [0.2 , 0.3)     (sub-intervals: a=[.2,.22), b=[.22,.27), c=[.27,.3))
   after c     [0.27, 0.3)

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c1 c2 ... cn with probabilities p[c], use:

   l_0 = 0,   l_i = l_{i-1} + s_{i-1} * f[c_i]
   s_0 = 1,   s_i = s_{i-1} * p[c_i]

f[c] is the cumulative probability up to symbol c (not included)
Final interval size is

   s_n = ∏_{i=1,...,n} p[c_i]

The interval for a message sequence will be called the
sequence interval
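A small sketch of the interval computation just defined: maintain the sequence interval [l, l+s) while scanning the message (the probabilities are the toy ones of the example).

p = {"a": 0.2, "b": 0.5, "c": 0.3}
f = {"a": 0.0, "b": 0.2, "c": 0.7}          # cumulative prob., symbol excluded

def sequence_interval(msg):
    l, s = 0.0, 1.0
    for c in msg:
        l = l + s * f[c]                    # l_i = l_{i-1} + s_{i-1} * f[c_i]
        s = s * p[c]                        # s_i = s_{i-1} * p[c_i]
    return l, s

l, s = sequence_interval("bac")
print(l, l + s)                             # ~0.27 0.3, the interval [.27,.3)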

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore, specifying any number in the
final interval uniquely determines the message.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:

   .49 ∈ [.2  , .7 )   = b's symbol interval      → output b, interval [.2 , .7)
   .49 ∈ [.3  , .55)   = b's sub-interval         → output b, interval [.3 , .55)
   .49 ∈ [.475, .55)   = c's sub-interval         → output c

The message is bbc.

Representing a real number
Binary fractional representation:
   .75 = .11      1/3 = .010101...      11/16 = .1011

Algorithm (binary expansion of x ∈ [0,1)):
   1. x = 2 * x
   2. If x < 1 output 0
   3. else x = x - 1; output 1

So how about just using the shortest binary fractional
representation in the sequence interval?
e.g. [0,.33) = .01    [.33,.66) = .1    [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.

   number   min     max     interval
   .11      .110    .111    [.75  , 1.0)
   .101     .1010   .1011   [.625 , .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number whose
code interval is contained in the sequence interval (a dyadic number).

   e.g. sequence interval [.61 , .79)  ⊇  code interval (.101) = [.625 , .75)

Can use L + s/2 truncated to 1 + ⌈log2 (1/s)⌉ bits

Bound on Arithmetic length
Note that -⌈log2 s⌉ + 1 = ⌈log2 (2/s)⌉
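A sketch of the bit selection described above: emit the binary expansion of L + s/2 (using the expansion algorithm of the previous slide) truncated to 1 + ⌈log2(1/s)⌉ bits; the resulting code interval then lies inside [L, L+s).

import math

def select_code_bits(L, s):
    nbits = 1 + math.ceil(math.log2(1 / s))
    x, bits = L + s / 2, []
    for _ in range(nbits):              # 1. x = 2x   2. output 0   3. else x -= 1; output 1
        x = 2 * x
        if x < 1:
            bits.append("0")
        else:
            bits.append("1")
            x -= 1
    return "".join(bits)

print(select_code_bits(0.27, 0.03))     # bits for the sequence interval [.27,.3) of "bac"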

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

   1 + ⌈ log2 (1/s) ⌉
   = 1 + ⌈ log2 ∏_{i=1,...,n} (1/p_i) ⌉
   ≤ 2 + ∑_{i=1,...,n} log2 (1/p_i)
   = 2 + ∑_{k=1,...,|S|} n*p_k * log2 (1/p_k)
   = 2 + n * H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary-precision real numbers are expensive.
Key ideas of the integer version:

  Keep integers in range [0..R) where R = 2^k
  Use rounding to generate integer intervals
  Whenever the sequence interval falls into the top, bottom or middle half,
  expand the interval by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
  Output 1 followed by m 0s; m = 0
  Message interval is expanded by 2

If u < R/2 then (bottom half)
  Output 0 followed by m 1s; m = 0
  Message interval is expanded by 2

If l ≥ R/4 and u < 3R/4 then (middle half)
  Increment m
  Message interval is expanded by 2

All other cases: just continue...

You find this at

Arithmetic ToolBox
As a state machine: the ATB receives the current interval (L,s), the
distribution (p1,....,pS) and the next symbol c, and returns the new
interval (L',s') with L' = L + s*f(c) and s' = s*p(c).

   (L,s) , c , (p1,....,pS)   --ATB-->   (L',s')

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
The ATB is driven by the conditional distribution p[ s | context ],
where s = c or esc:

   (L,s) , p[ s | context ]   --ATB-->   (L',s')

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
String = ACCBACCACBA B      k = 2

Context: Empty      Counts:  A = 4   B = 2   C = 5   $ = 3

Context: A          Counts:  C = 3   $ = 1
Context: B          Counts:  A = 2   $ = 1
Context: C          Counts:  A = 1   B = 2   C = 2   $ = 3

Context: AC         Counts:  B = 1   C = 2   $ = 2
Context: BA         Counts:  C = 1   $ = 1
Context: CA         Counts:  C = 1   $ = 1
Context: CB         Counts:  A = 2   $ = 1
Context: CC         Counts:  A = 1   B = 1   $ = 2
You find this at: compression.ru/ds/
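Below, a small sketch (not a full PPM coder) that builds the context statistics shown in the table above: for every context of length 0..k it counts the following characters, and the escape count $ is taken as the number of distinct followers (one common heuristic; PPM variants differ).

from collections import defaultdict

def ppm_counts(text, k):
    tables = defaultdict(lambda: defaultdict(int))
    for order in range(k + 1):
        for i in range(order, len(text)):
            ctx = text[i - order:i]
            tables[ctx][text[i]] += 1
    result = {}
    for ctx, cnt in tables.items():
        d = dict(cnt)
        d["$"] = len(cnt)              # escape count = #distinct followers
        result[ctx] = d
    return result

stats = ppm_counts("ACCBACCACBA", k=2)
print(stats[""])     # {'A': 4, 'C': 5, 'B': 2, '$': 3}
print(stats["C"])    # {'C': 2, 'B': 2, 'A': 1, '$': 3}
print(stats["AC"])   # {'C': 2, 'B': 1, '$': 2}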

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!

LZ77
T = a a c a a c a b c a b a b a c
Dictionary = the text before the Cursor (all substrings starting there)
Example output triple: <2,3,c>

Algorithm's step:
  Output <d, len, c> where
    d   = distance of the copied string wrt the current position
    len = length of the longest match
    c   = next char in the text beyond the longest match
  Advance by len + 1

A buffer "window" has fixed length and moves over the text

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
(at each step: the longest match within W, plus the next character)
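A sketch of LZ77 with a sliding window, matching the example above (window W = 6, triples <d, len, next-char>, self-overlapping copies allowed); real implementations use hashing instead of this quadratic scan.

def lz77_encode(T, W=6):
    out, i = [], 0
    while i < len(T):
        best_d, best_len = 0, 0
        for d in range(1, min(i, W) + 1):        # candidate copy distances
            l = 0
            while i + l + 1 < len(T) and T[i + l] == T[i - d + l]:
                l += 1                           # match may run past the cursor
            if l > best_len:
                best_d, best_len = d, l
        out.append((best_d, best_len, T[i + best_len]))
        i += best_len + 1                        # advance by len + 1
    return out

print(lz77_encode("aacaacabcabaaac"))
# [(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')]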

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
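A decoder sketch for such triples; the copy loop works even when len > d (the source overlaps the text being produced), exactly as in the overlap example just shown.

def lz77_decode(triples):
    out = []
    for d, length, c in triples:
        cursor = len(out)
        for i in range(length):
            out.append(out[cursor - d + i])   # copy, possibly self-overlapping
        out.append(c)
    return "".join(out)

# seen = abcd, next codeword is (2,9,e):
print(lz77_decode([(0, 0, 'a'), (0, 0, 'b'), (0, 0, 'c'), (0, 0, 'd'), (2, 9, 'e')]))
# abcdcdcdcdcdce   (pairs well with the encoder sketch above)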

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table (keyed on char triplets) to speed-up the search for matches
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb
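An LZW encoder sketch using the toy codes of the example above (a=112, b=113, c=114, new entries from 256): running it on the example string reproduces the output column, plus one final code for the pending suffix.

TOY_CODES = {"a": 112, "b": 113, "c": 114}

def lzw_encode(text, base=TOY_CODES):
    dictionary = dict(base)
    next_code = 256
    out, S = [], ""
    for c in text:
        if S + c in dictionary:
            S = S + c                       # keep extending the match
        else:
            out.append(dictionary[S])       # emit the code of the longest match S
            dictionary[S + c] = next_code   # add Sc, but do NOT emit c
            next_code += 1
            S = c
    if S:
        out.append(dictionary[S])           # flush the pending match
    return out

print(lzw_encode("aabaacababacb"))
# [112, 112, 113, 256, 114, 257, 261, 114, 113]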

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

(one step later)

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us be given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



The L → F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
  Compute LF[0,n-1];
  r = 0; i = n;              // start from row 0, the rotation beginning with #
  while (i > 0) {
    T[i] = L[r];             // L[r] is the char preceding F[r] in T
    r = LF[r]; i--;          // jump to the row of that preceding char
  }
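Below, a compact sketch of the BWT and of its inversion via the LF-mapping, following the two properties above (tested on T = mississippi#); the construction by sorting rotations is the "elegant but inefficient" one discussed next.

def bwt(T):
    rows = sorted(T[i:] + T[:i] for i in range(len(T)))   # sorted rotations
    return "".join(row[-1] for row in rows)                # last column L

def inverse_bwt(L):
    F = sorted(L)
    # LF[i] = row of F holding the same char occurrence as L[i]:
    # equal chars keep their relative order between L and F.
    count, rank = {}, []
    for c in L:
        count[c] = count.get(c, 0) + 1
        rank.append(count[c])                  # occurrence rank of L[i] in L
    first = {c: F.index(c) for c in set(L)}    # first row of c in F
    LF = [first[c] + k - 1 for c, k in zip(L, rank)]
    out, r = [], 0                             # walk T backward from row 0
    for _ in range(len(L)):
        out.append(L[r])
        r = LF[r]
    T = "".join(reversed(out))                 # "#mississippi"
    return T[1:] + T[0]                        # move the end-marker to the end

L = bwt("mississippi#")
print(L)                   # ipssm#pissii
print(inverse_bwt(L))      # mississippi#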

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion pages available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node we can reach any other node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node we can reach any other node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by humans

Exploit the structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…

Physical network graph
  V = Routers
  E = communication links

The "cosine" graph (undirected, weighted)
  V = static web pages
  E = semantic distance between pages

Query-Log graph (bipartite, weighted)
  V = queries and URLs
  E = (q,u) if u is a result for q, and has been clicked by some user who issued q

Social graph (undirected, unweighted)
  V = users
  E = (x,y) if x knows y (facebook, address book, email, ..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution

Altavista crawl 1999 and WebBase crawl 2001:
the indegree follows a power law distribution

   Pr[ in-degree(u) = k ]  ∝  1 / k^a ,   a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/x^a, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 million pages, 150 million links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:
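A sketch of the gap (delta) encoding of a successor list, as in the formula above: the first gap is relative to the source node x, the following ones are differences between consecutive sorted successors; the successor list below is only illustrative. Locality makes most gaps small, hence cheap with a variable-length integer code (e.g. the g-codes seen before).

def encode_gaps(x, successors):
    s = sorted(successors)
    gaps = [s[0] - x] + [s[i] - s[i - 1] - 1 for i in range(1, len(s))]
    return gaps                       # the first entry may be negative (handled apart)

def decode_gaps(x, gaps):
    s = [x + gaps[0]]
    for g in gaps[1:]:
        s.append(s[-1] + g + 1)
    return s

print(encode_gaps(15, [13, 15, 16, 17, 18, 19, 23, 24, 203]))
# [-2, 1, 0, 0, 0, 0, 3, 0, 178]
print(decode_gaps(15, [-2, 1, 0, 0, 0, 0, 3, 0, 178]))
# [13, 15, 16, 17, 18, 19, 23, 24, 203]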

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y's copy-list refers to a successor of the reference x, and tells
whether it is also a successor of y;
The reference index is chosen in [0,W] so as to give the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides an efficient, optimal solution

  fknown is the "previously encoded text": compress the concatenation fknown·fnew, starting from fnew

zdelta is one of the best implementations
               Emacs size    Emacs time
   uncompr     27Mb          ---
   gzip        8Mb           35 secs
   zdelta      1.5Mb         42 secs
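Not the zdelta tool itself, but the same LZ77-on-(fknown·fnew) idea expressed with the standard zlib API: fknown is passed as a preset dictionary, so substrings shared with fnew become cheap back-references; the data below is illustrative.

import random, zlib

def delta_compress(f_known: bytes, f_new: bytes) -> bytes:
    co = zlib.compressobj(level=9, zdict=f_known)
    return co.compress(f_new) + co.flush()

def delta_decompress(f_known: bytes, f_delta: bytes) -> bytes:
    do = zlib.decompressobj(zdict=f_known)
    return do.decompress(f_delta)

random.seed(0)
f_known = bytes(random.randrange(256) for _ in range(4000))      # the "old" file
f_new = f_known[:2000] + b" a small local edit " + f_known[2000:]
delta = delta_compress(f_known, f_new)
assert delta_decompress(f_known, delta) == f_new
# gzip-alone vs delta: the delta is typically far smaller for near-identical files
print(len(f_new), len(zlib.compress(f_new, 9)), len(delta))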

Efficient Web Access
Dual proxy architecture: a pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

   client-side proxy  <-- slow link (request / delta-encoding) -->  server-side proxy  <-- fast link -->  web
   (the reference page is available at both proxies; only the delta crosses the slow link)

Use zdelta to reduce traffic:
  Old version available at both proxies
  Restricted to pages already visited (30% hits), URL-prefix match
  Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Example: weighted graph GF with the dummy node; the min branching picks, for each
 file, either the gzip edge (from the dummy node) or a zdelta edge from another file]

               space    time
   uncompr     30Mb     ---
   tgz         20%      linear
   THIS        8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G'F, thus saving
zdelta executions. Nonetheless, strictly n² time

               space    time
   uncompr     260Mb    ---
   tgz         12%      2 mins
   THIS        8%       16 mins
16 mins

Algoritmi per IR

File Synchronization

File synch: The problem

   Client (has f_old)  -- request -->  Server (has f_new)
   Client  <-- update --------------   Server

  client wants to update an out-dated file
  server has the new file but does not know the old file
  update without sending the entire f_new (using similarity)
  rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch,
since the server has both copies of the files

The rsync algorithm

   Client sends the block hashes of f_old; the Server scans f_new and replies
   with the encoded file (copy/literal instructions)

The rsync algorithm (contd)

  simple, widely used, single roundtrip
  optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
  choice of block size problematic (default: max{700, √n} bytes)
  not good in theory: granularity of changes may disrupt use of blocks
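A simplified, single-round sketch of the rsync idea above: the client hashes the blocks of f_old, the server scans f_new and emits either a block reference or literal bytes. Real rsync uses a fast rolling checksum to avoid hashing at every offset; here every offset is hashed for clarity, and the block size is a toy value.

import hashlib

BLOCK = 8

def server_encode(f_new, old_hashes):
    lookup = {h: idx for idx, h in enumerate(old_hashes)}
    out, i = [], 0
    while i < len(f_new):
        h = hashlib.md5(f_new[i:i + BLOCK]).hexdigest()
        if len(f_new) - i >= BLOCK and h in lookup:
            out.append(("copy", lookup[h]))          # reference to an old block
            i += BLOCK
        else:
            out.append(("lit", f_new[i:i + 1]))      # literal byte
            i += 1
    return out

def client_decode(f_old, encoded):
    parts = []
    for kind, v in encoded:
        parts.append(f_old[v * BLOCK:(v + 1) * BLOCK] if kind == "copy" else v)
    return b"".join(parts)

f_old = b"the quick brown fox jumps over the lazy dog"
f_new = b"the quick brown cat jumps over the lazy dog!"
hashes = [hashlib.md5(f_old[i:i + BLOCK]).hexdigest()
          for i in range(0, len(f_old), BLOCK)]
enc = server_encode(f_new, hashes)
assert client_decode(f_old, enc) == f_new
print(sum(1 for k, _ in enc if k == "copy"), "blocks copied,",
      sum(1 for k, _ in enc if k == "lit"), "literal bytes")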

Rsync: some experiments

               gcc size    emacs size
   total       27288       27326
   gzip        7563        8577
   zdelta      227         1431
   rsync       964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends the hashes (unlike the client in rsync), the client checks them
Server deploys the common fref to compress the new ftar (rsync compresses just ftar)

A multi-round protocol

  k blocks of n/k elems
  log n/k levels

If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Figure: the suffix tree of T# = mississippi# (positions 1..12); internal edges
carry substring labels (i, s, si, ssi, p, ppi#, ...) and each leaf stores the
starting position of its suffix]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Θ(N²) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
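A sketch of the indirect binary search on SA just described: each step compares the pattern against a suffix in O(p) time, so the search costs O(p log2 N); all occurrences form a contiguous range of SA. Positions below are 0-based (the slides use 1-based positions), and the suffix-array construction is the naive one.

def suffix_array(T):
    return sorted(range(len(T)), key=lambda i: T[i:])     # naive construction

def sa_search(T, SA, P):
    lo, hi = 0, len(SA)                 # first suffix whose p-prefix is >= P
    while lo < hi:
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + len(P)] < P:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    lo, hi = first, len(SA)             # first suffix whose p-prefix is > P
    while lo < hi:
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + len(P)] <= P:
            lo = mid + 1
        else:
            hi = mid
    return sorted(SA[first:lo])         # starting positions of all occurrences

T = "mississippi#"
SA = suffix_array(T)
print(sa_search(T, SA, "si"))           # [3, 6]  ->  positions 4 and 7, 1-based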

Locating the occurrences

P = si, T = mississippi#: the occ = 2 occurrences (positions 4 and 7) correspond
to the contiguous SA entries pointing to the suffixes sippi# and sissippi#.
The range can be delimited by two binary searches (conceptually, for si# and si$).

Suffix Array search
• O(p + log2 N + occ) time

Suffix Trays: O(p + log2 |S| + occ)     [Cole et al., '06]
String B-tree                           [Ferragina-Grossi, '95]
Self-adjusting Suffix Arrays            [Ciriani et al., '02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
(e.g. Lcp = 4 between the adjacent suffixes issippi# and ississippi#)

• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Is there a repeated substring of length ≥ L ?
• Search for an entry Lcp[i] ≥ L
• Is there a substring of length ≥ L occurring ≥ C times ?
• Search for a subarray Lcp[i,i+C-2] whose entries are all ≥ L
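A small sketch of the Lcp-based mining query above: build Lcp between suffixes adjacent in SA (naively), then report a repeated substring of length ≥ L; on mississippi# with L = 3 it finds "issi" and "ssi".

def lcp_array(T, SA):
    def lcp(a, b):
        k = 0
        while a + k < len(T) and b + k < len(T) and T[a + k] == T[b + k]:
            k += 1
        return k
    return [lcp(SA[i], SA[i + 1]) for i in range(len(SA) - 1)]

T = "mississippi#"
SA = sorted(range(len(T)), key=lambda i: T[i:])
Lcp = lcp_array(T, SA)
L = 3
for i, v in enumerate(Lcp):
    if v >= L:
        print(T[SA[i]:SA[i] + v])      # a repeated substring of length >= L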


Slide 49

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with i-th bit of U(T[j]) to estabilish if both are true

An example j=1
1 2 3 4 5 6 7 8 9 10

n
m

1

12345

1

0

P=abaac

2

0

3

0

4

0

5

0

T=xabxabaaca

0
 
0
U (x)  0
 
0
 
0

2

3

4

5

6

7

1 
 
0
BitShift ( M ( 0 )) & U (T [1])   0  &
 
0
 
0

8

9

0 0
   
0 0
0  0
   
0 0
   
0 0

1
0

An example j=2
1 2 3 4 5 6 7 8 9 10

n
m

1

2

12345

1

0

1

P=abaac

2

0

0

3

0

0

4

0

0

5

0

0

T=xabxabaaca

1 
 
0
U (a )   1 
 
1 
 
0

3

4

5

6

7

1 
 
0
BitShift ( M (1)) & U (T [ 2 ])   0  &
 
0
 
0

8

9

1  1 
   
0 0
1    0 
   
1   0 
   
0 0

1
0

An example j=3
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

12345

1

0

1

0

P=abaac

2

0

0

1

3

0

0

0

4

0

0

0

5

0

0

0

T=xabxabaaca

0
 
1 
U (b )   0 
 
0
 
0

4

5

6

7

1 
 
1 
BitShift ( M ( 2 )) & U (T [ 3 ])   0  &
 
0
 
0

8

9

0 0
   
1  1 
0  0
   
0 0
   
0 0

1
0

An example j=9
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

4

5

6

7

8

9

12345

1

0

1

0

0

1

0

1

1

0

P=abaac

2

0

0

1

0

0

1

0

0

0

3

0

0

0

0

0

0

1

0

0

4

0

0

0

0

0

0

0

1

0

5

0

0

0

0

0

0

0

0

1

T=xabxabaaca

0
 
0
U (c )   0 
 
0
 
1 

1 
 
1 
BitShift ( M ( 8 )) & U (T [ 9 ])   0  &
 
0
 
1 

0 0
   
0 0
0  0
   
0 0
   
1  1 

1
0

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M ( j - 1))  U (T [ j ])
l

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M

l -1

( j - 1))

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1
1 2 3 4 5 6 7 8 910

M1=

T=xabxabaaca
P=
abaad

1

2

3

4

5

6

7

8

9

1
0

1

1

1

1

1

1

1

1

1

1

1

2

0

0

1

0

0

1

0

1

1

0

3

0

0

0

1

0

0

1

0

0

1

4

0

0

0

0

1

0

0

1

0

0

5

0

0

0

0

0

0

0

0

1

0

M0=

1

2

3

4

5

6

7

8

9

1
0

1

0

1

0

0

1

0

1

1

0

1

2

0

0

1

0

0

1

0

0

0

0

3

0

0

0

0

0

0

1

0

0

0

4

0

0

0

0

0

0

0

1

0

0

5

0

0

0

0

0

0

0

0

0

0

How much do we pay?





The running time is O(kn(1+m/w)
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Delection: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
0 000 ...........0 x in b in ary
Length-1


x > 0 and Length = log2 x +1
e.g., 9 represented as <000,1001>.



g-code for x takes 2 log2 x +1 bits

(ie. factor of 2 from optimal)



Optimal for Pr(x) = 1/2x2, and i.i.d integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8

6

3

59

7

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How much good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
1 ≥Si=1,...,x pi ≥ x * px  x ≤ 1/px

How good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):



p i  | g ( i ) |

i  1 ,..., S

This is:



i  1 ,.., S

p i  [ 2 * log

1

 1]

pi

 2 * H 0 ( X ) 1
No much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1n 2n 3n… nn  Huff = O(n2 log n), MTF = O(n log n) + n2

No much worse than Huffman
...but it may be far better

MTF: how good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
Put S in the front and consider the cost of encoding:
S

O ( S log S ) 



nx

g(p - p )

x 1 i  2

x

x

i

i -1

S

By Jensen’s:

 O ( S log S ) 

n
x 1

x

[ 2 * log

N

 1]

nx

 O ( S log S )  N * [ 2 * H 0 ( X )  1]
L a [ mtf ]  2 * H 0 ( X )  O (1)

MTF: higher compression
To achieve higher compression we consider words (and separators) as the symbols to be encoded
How to keep the MTF-list efficiently:

Search tree
  Leaves contain the symbols, ordered as in the MTF-list
  Nodes contain the size of their descending subtree

Hash Table
  key is a symbol
  data is a pointer to the corresponding tree leaf

Each tree operation takes O(log |S|) time
Total is O(n log |S|), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings ⟹ just the run lengths and one bit
Properties:
Exploits spatial locality, and it is a dynamic code
There is a memory

X = 1^n 2^n 3^n … n^n  ⟹  Huff(X) = Θ(n² log n) > Rle(X) = Θ(n log n)
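A matching sketch of RLE, reproducing the example above:

def rle_encode(s):
    runs, i = [], 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1
        runs.append((s[i], j - i))      # (symbol, run length)
        i = j
    return runs

print(rle_encode("abbbaacccca"))  # [('a', 1), ('b', 3), ('a', 2), ('c', 4), ('a', 1)]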

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g. with p(a) = .2, p(b) = .5, p(c) = .3 the cumulative values are
f(a) = .0, f(b) = .2, f(c) = .7, where f(i) = Σj=1,...,i−1 p(j);
the [0,1) line is split into a = [0,.2), b = [.2,.7), c = [.7,1).

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
Start from [0,1). After b the interval is [.2,.7); after a it is [.2,.3); after c it is [.27,.3).

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l0 = 0,   li = li−1 + si−1 · f[ci]
s0 = 1,   si = si−1 · p[ci]

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is  sn = ∏i=1,...,n p[ci]

The interval for a message sequence will be called the
sequence interval
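A small sketch of these recurrences on the running example (p(a)=.2, p(b)=.5, p(c)=.3), real-valued rather than the integer version discussed later:

p = {"a": 0.2, "b": 0.5, "c": 0.3}
f = {"a": 0.0, "b": 0.2, "c": 0.7}       # cumulative probability, symbol excluded

def sequence_interval(msg):
    l, s = 0.0, 1.0                      # l_0 = 0, s_0 = 1
    for ch in msg:
        l, s = l + s * f[ch], s * p[ch]
    return l, s                          # the sequence interval is [l, l+s)

def decode(x, n):
    msg = ""
    for _ in range(n):
        for ch in "abc":                 # find the symbol interval containing x
            if f[ch] <= x < f[ch] + p[ch]:
                msg += ch
                x = (x - f[ch]) / p[ch]  # rescale and continue
                break
    return msg

l, s = sequence_interval("bac")
print(l, l + s)          # ≈ 0.27 0.30  -> the interval [.27, .3)
print(decode(0.49, 3))   # bbc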

Uniquely defining an interval
Important property: the intervals for distinct messages of length n will never overlap.
Therefore, specifying any number in the final interval uniquely determines the message.
Decoding is similar to encoding, but at each step we need to determine the message
symbol and then reduce (rescale) the interval.

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
.49 lies in the symbol interval [.2,.7) ⟹ b; rescale: (.49 − .2)/.5 = .58, which lies in [.2,.7) ⟹ b;
rescale: (.58 − .2)/.5 = .76, which lies in [.7,1) ⟹ c.

The message is bbc.

Representing a real number
Binary fractional representation:
.75 = .11      1/3 = .010101…      11/16 = .1011

Algorithm (emit the bits of x ∈ [0,1)):
1. x = 2·x
2. if x < 1 output 0
3. else x = x − 1; output 1   (repeat)

So how about just using the shortest binary fractional representation in the sequence interval?
e.g. [0,.33) → .01    [.33,.66) → .1    [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
code    min     max     interval
.11     .110    .111    [.75, 1.0)
.101    .1010   .1011   [.625, .75)
We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Example: sequence interval = [.61, .79); the code interval of .101 is [.625, .75), which is contained in it.

Can use L + s/2 truncated to 1 + ⌈log2 (1/s)⌉ bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem: operations on arbitrary-precision real numbers are expensive.
Key ideas of the integer version:

Keep integers in the range [0..R) where R = 2^k
Use rounding to generate integer intervals
Whenever the sequence interval falls into the top, bottom or middle half, expand the interval by a factor of 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)

If l ≥ R/2 (top half):
Output 1 followed by m 0s; set m = 0.
The message interval is expanded by 2.

If u < R/2 (bottom half):
Output 0 followed by m 1s; set m = 0.
The message interval is expanded by 2.

If l ≥ R/4 and u < 3R/4 (middle half):
Increment m.
The message interval is expanded by 2.

In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine

(Diagram: the ATB maps the current interval (L,s) and the distribution (p1,....,pS), together with the
next symbol c, to the new interval (L’,s’), i.e. ATB: (L,s), c ⟼ (L’,s’).)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen the context followed by this character before?

Cannot code 0 probabilities!

The key idea of PPM is to reduce the context size if the previous match has not been seen:

If the character has not been seen before with the current context of size 3, send an escape-msg
and then try the context of size 2, then again an escape-msg and the context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
(Diagram: at each step PPM feeds the ATB the conditional distribution p[ s | context ], where s is either
the next char c or the escape esc; the ATB then updates the interval (L,s) to (L’,s’).)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts      String = ACCBACCACBA B      k = 2

Context    Counts
Empty      A = 4   B = 2   C = 5   $ = 3

Context    Counts
A          C = 3   $ = 1
B          A = 2   $ = 1
C          A = 1   B = 2   C = 2   $ = 3

Context    Counts
AC         B = 1   C = 2   $ = 2
BA         C = 1   $ = 1
CA         C = 1   $ = 1
CB         A = 2   $ = 1
CC         A = 1   B = 1   $ = 2
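A small sketch that rebuilds the tables above from the string S = "ACCBACCACBA", assuming the escape count "$" is one per distinct successor seen in that context (this convention matches the numbers in the table):

from collections import defaultdict

def ppm_counts(s, k):
    """For each order 0..k return {context: {symbol: count, '$': #distinct successors}}."""
    tables = []
    for order in range(k + 1):
        counts = defaultdict(lambda: defaultdict(int))
        for i in range(order, len(s)):
            ctx = s[i - order:i]
            counts[ctx][s[i]] += 1
        for ctx in counts:
            counts[ctx]["$"] = len(counts[ctx])   # escape: one per distinct successor
        tables.append({c: dict(d) for c, d in counts.items()})
    return tables

for order, table in enumerate(ppm_counts("ACCBACCACBA", 2)):
    print(order, table)
# 0 {'': {'A': 4, 'C': 5, 'B': 2, '$': 3}}
# 1 {'A': {'C': 3, '$': 1}, 'C': {'C': 2, 'B': 2, 'A': 1, '$': 3}, 'B': {'A': 2, '$': 1}}
# 2 {'AC': {'C': 2, 'B': 1, '$': 2}, 'CC': {'B': 1, 'A': 1, '$': 2}, 'CB': {'A': 2, '$': 1},
#    'BA': {'C': 1, '$': 1}, 'CA': {'C': 1, '$': 1}}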

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary = all the substrings starting before the Cursor; example output: <2,3,c>

Algorithm’s step:
Output <d, len, c> where
  d = distance of the copied string wrt the current position
  len = length of the longest match
  c = next char in the text beyond the longest match
Advance by len + 1

A buffer “window” of fixed length holds the dictionary and moves with the cursor

Example: LZ77 with window (window size = 6)
Input: a a c a a c a b c a b a a a c
Output: (0,0,a) (1,1,c) (3,4,b) (3,3,a) (1,2,c)
Each triple = (distance of the longest match within W, its length, next character)

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
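A sketch of the decoder, including the char-by-char copy that makes the overlapping case above work; triples are (distance, length, next char) as in these slides:

def lz77_decode(triples):
    out = []
    for d, length, c in triples:
        for _ in range(length):
            out.append(out[-d])        # works even when length > d (overlap)
        out.append(c)
    return "".join(out)

print(lz77_decode([(0, 0, "a"), (1, 1, "c"), (3, 4, "b"), (3, 3, "a"), (1, 2, "c")]))
# aacaacabcabaaac
print(lz77_decode([(0, 0, "a"), (0, 0, "b"), (0, 0, "c"), (0, 0, "d"), (2, 9, "e")]))
# abcdcdcdcdcdce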

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use a shorter match so that the next match is better
Hash table to speed up the search for matches (indexed on triples of chars)
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
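A sketch of LZW using the slides' toy numbering (a = 112, b = 113, c = 114) instead of a full 256-entry ASCII table; the decoder handles the "code not yet in the dictionary" special case mentioned above:

def lzw_encode(text, base):
    dic = dict(base)                    # string -> code
    nxt, out, cur = 256, [], ""
    for ch in text:
        if cur + ch in dic:
            cur += ch
        else:
            out.append(dic[cur])
            dic[cur + ch] = nxt         # add S·c to the dictionary
            nxt += 1
            cur = ch
    out.append(dic[cur])
    return out

def lzw_decode(codes, base):
    inv = {v: k for k, v in base.items()}   # code -> string
    nxt = 256
    prev = inv[codes[0]]
    out = [prev]
    for code in codes[1:]:
        entry = inv[code] if code in inv else prev + prev[0]   # the special case
        out.append(entry)
        inv[nxt] = prev + entry[0]      # decoder is one step behind the coder
        nxt += 1
        prev = entry
    return "".join(out)

base = {"a": 112, "b": 113, "c": 114}
codes = lzw_encode("aabaacababacb", base)
print(codes)                                       # [112, 112, 113, 256, 114, 257, 261, 114, 113]
print(lzw_decode(codes, base) == "aabaacababacb")  # True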

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us be given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows (Burrows-Wheeler, 1994):

F                  L
#  mississipp      i
i  #mississip      p
i  ppi#missis      s
i  ssippi#mis      s
i  ssissippi#      m
m  ississippi      #
p  i#mississi      p
p  pi#mississ      i
s  ippi#missi      s
s  issippi#mi      s
s  sippi#miss      i
s  sissippi#m      i

L is the Burrows-Wheeler Transform of T.

A famous example

Much
longer...

A useful tool: L → F mapping

(Same matrix as above: F = # i i i i m p p s s s s is known — it is just L sorted — while the rest of
the matrix is unknown; L = i p s s m # p i s s i i.)

How do we map L’s chars onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal chars of L
Rotate their rows rightward by one position
⟹ they keep the same relative order !!

The BWT is invertible
(Again F = # i i i i m p p s s s s and L = i p s s m # p i s s i i; the middle of the matrix is unknown.)

Two key properties:
1. The LF-array maps L’s chars to F’s chars
2. L[ i ] precedes F[ i ] in T

Reconstruct T backward:
T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
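A sketch of the forward transform (via sorted rotations, as in the matrix above) and of the backward reconstruction via the LF-mapping, assuming # is a unique end-marker smaller than every other character:

def bwt(t):
    assert t.endswith("#")                       # '#' assumed unique and smallest
    rot = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(r[-1] for r in rot)           # L = last column

def ibwt(L):
    n = len(L)
    # LF[r] = row of F holding the same char occurrence as L[r]
    # (equal chars keep their relative order, so a stable sort of L gives the mapping)
    order = sorted(range(n), key=lambda r: (L[r], r))
    LF = [0] * n
    for f_row, r in enumerate(order):
        LF[r] = f_row
    T = [""] * n
    T[n - 1] = "#"                               # row 0 is the rotation starting with '#'
    r = 0
    for i in range(n - 2, -1, -1):               # reconstruct T backward
        T[i] = L[r]
        r = LF[r]
    return "".join(T)

L = bwt("mississippi#")
print(L)          # ipssm#pissii
print(ibwt(L))    # mississippi#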

How to compute the BWT ?
SA = 12 11 8 5 2 1 10 9 7 4 6 3     (the rows of the BWT matrix = the sorted suffixes)
L  =  i  p s s m 6# p  i s s i i    with L taken column-wise from the matrix above

We said that: L[i] precedes F[i] in T
e.g. L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i] − 1]

How to construct SA from T ?
SA    sorted suffixes
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#

Input: T = mississippi#

Elegant but inefficient. Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics

Size
  1 trillion pages available (Google, 7/2008)
  5-40K per page => hundreds of terabytes
  Size grows every day!!

Change
  8% new pages and 25% new links change weekly
  Life time of about 10 days

The Bow Tie

Some definitions

Weakly connected components (WCC)
  Set of nodes such that from any node one can reach any other node via an undirected path.

Strongly connected components (SCC)
  Set of nodes such that from any node one can reach any other node via a directed path.

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?

The largest artifact ever conceived by humans

Exploit the structure of the Web for:
  Crawl strategies
  Search
  Spam detection
  Discovering communities on the web
  Classification/organization

Predict the evolution of the Web
  Sociological understanding

Many other large graphs…

Physical network graph
  V = Routers
  E = communication links

The “cosine” graph (undirected, weighted)
  V = static web pages
  E = semantic distance between pages

Query-Log graph (bipartite, weighted)
  V = queries and URLs
  E = (q,u) if u is a result for q, and has been clicked by some user who issued q

Social graph (undirected, unweighted)
  V = users
  E = (x,y) if x knows y (facebook, address book, email, ..)

Definition
Directed graph G = (V,E)
  V = URLs, E = (u,v) if u has a hyperlink to v
  Isolated URLs are ignored (no IN & no OUT)

Three key properties:
  Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution
  Altavista crawl (1999) and WebBase crawl (2001): the indegree follows a power-law distribution

  Pr[ in-degree(u) = k ]  ∝  1/k^a,   a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)
  V = URLs, E = (u,v) if u has a hyperlink to v
  Isolated URLs are ignored (no IN, no OUT)

Three key properties:
  Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1
  Locality: usually most of the hyperlinks point to other URLs on the same host (about 80%).
  Similarity: pages close in lexicographic order tend to share many outgoing links.

A Picture of the Web Graph
(Adjacency-matrix plot of a crawl: 21 million pages, 150 million links.)

URL-sorting (e.g. all Berkeley pages together, then all Stanford pages, ...)

URL compression + Delta encoding

The library WebGraph
From the uncompressed adjacency list to an adjacency list with compressed gaps (locality):

Successor list S(x) = {s1, s2, ..., sk} is stored as the gaps {s1 − x, s2 − s1 − 1, ..., sk − sk−1 − 1}

For negative entries:
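A sketch of this gap transformation on a hypothetical successor list (the node ids below are made up); note that the first gap may be negative, which is why the slide mentions a separate treatment (e.g. a signed/zig-zag code) for negative entries:

def gaps(x, successors):
    s = sorted(successors)
    out = [s[0] - x]                           # may be negative
    out += [s[i] - s[i - 1] - 1 for i in range(1, len(s))]
    return out

def ungaps(x, g):
    s = [x + g[0]]
    for d in g[1:]:
        s.append(s[-1] + d + 1)
    return s

succ = [13, 15, 16, 17, 18, 19, 23, 24, 203]   # hypothetical successors of node x = 15
print(gaps(15, succ))                          # [-2, 1, 0, 0, 0, 0, 3, 0, 178]
print(ungaps(15, gaps(15, succ)) == sorted(succ))   # True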

Copy-lists

Reference chains, possibly limited in length.

From the adjacency list to an adjacency list with copy lists (similarity):
Each bit of y’s copy list tells whether the corresponding successor of the reference x is also a
successor of y; the reference index is chosen in [0,W] so as to give the best compression.

Copy-blocks = RLE(Copy-list)
From copy lists to copy blocks (RLE on the bit sequences):

The first copy block is 0 if the copy list starts with 0;
the last block is omitted (we know the length…);
the length is decremented by one for all blocks.

This is a Java and C++ lib (≈3 bits/edge)

Extra-nodes: Compressing Intervals
From copy blocks to intervals, exploiting consecutivity in the extra-nodes.

Intervals: use their left extreme and length
Interval length: decremented by Lmin = 2
Residuals: differences between consecutive residuals, or wrt the source node

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
(Setting: a sender transmits data to a receiver that already has some knowledge about the data.)

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques

caching: “avoid sending the same object again”
  done on the basis of whole objects
  only works if objects are completely unchanged
  What about objects that are slightly changed?

compression: “remove redundancy in transmitted data”
  avoid repeated substrings in the data
  can be extended to the history of past transmissions (overhead)
  What if the sender has never seen the data at the receiver?

Types of Techniques

Common knowledge between sender & receiver
  Unstructured file: delta compression

“Partial” knowledge
  Unstructured files: file synchronization
  Record-based data: set reconciliation

Formalization

Delta compression   [diff, zdelta, REBL,…]
  Compress file f deploying file f’
  Compress a group of files
  Speed up web access by sending differences between the requested page and the ones available in cache

File synchronization   [rsync, zsync]
  Client updates an old file f_old with f_new available on a server
  Mirroring, Shared Crawling, Content Distribution Networks

Set reconciliation
  Client updates a structured old file f_old with f_new available on a server
  Update of contacts or appointments, intersection of inverted lists in a P2P search engine

Z-delta compression   (one-to-one)

Problem: We have two files f_known and f_new and the goal is to compute a file f_d of minimum size
such that f_new can be derived from f_known and f_d.

Assume that block moves and copies are allowed
Find an optimal covering set of f_new based on f_known
The LZ77-scheme provides an efficient, optimal solution:
f_known is the “previously encoded text”; compress the concatenation f_known·f_new, emitting output only from f_new onwards.

zdelta is one of the best implementations
           Emacs size    Emacs time
uncompr    27Mb          ---
gzip       8Mb           35 secs
zdelta     1.5Mb         42 secs
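Not the actual zdelta tool: just a sketch of the idea above, obtained by preloading zlib's LZ77 window with f_known (the zdict preset-dictionary facility), so that f_new can be expressed mostly as copies from f_known:

import zlib

def delta_compress(f_known: bytes, f_new: bytes) -> bytes:
    co = zlib.compressobj(level=9, zdict=f_known)
    return co.compress(f_new) + co.flush()

def delta_decompress(f_known: bytes, f_delta: bytes) -> bytes:
    do = zlib.decompressobj(zdict=f_known)
    return do.decompress(f_delta) + do.flush()

old = b"the quick brown fox jumps over the lazy dog " * 100
new = old.replace(b"lazy", b"sleepy")
delta = delta_compress(old, new)
print(len(new), len(zlib.compress(new, 9)), len(delta))   # the delta is typically the smallest
assert delta_decompress(old, delta) == new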

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

(Diagram: Client ↔ client-side proxy — slow link carrying delta-encoded traffic — server-side proxy ↔
fast link ↔ web; both proxies keep a reference copy of the requested page.)

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F
  Useful on a dynamic collection of web pages, back-ups, …

Apply pairwise zdelta: find for each f ∈ F a good reference

Reduction to the Min Branching problem on DAGs:
  Build a weighted graph G_F: nodes = files, edge weights = zdelta-sizes
  Insert a dummy node connected to all files, whose edge weights are the gzip-coding sizes
  Compute the min branching = directed spanning tree of minimum total cost covering G’s nodes.

(Example: a small file graph with a dummy root node 0 and edges weighted by the zdelta/gzip sizes,
e.g. 20, 123, 220, 620, 2000.)
           space    time
uncompr    30Mb     ---
tgz        20%      linear
THIS       8%       quadratic

Improvement: what about many-to-one compression (a group of files)?

Problem: Constructing G is very costly: n² edge calculations (zdelta executions)
We wish to exploit some pruning approach:

Collection analysis: cluster the files that appear similar and are thus good candidates for
zdelta-compression; build a sparse weighted graph G’_F containing only the edges between those pairs of files.

Assign weights: estimate appropriate edge weights for G’_F, thus saving zdelta executions.
Nonetheless, strictly n² time.
           space    time
uncompr    260Mb    ---
tgz        12%      2 mins
THIS       8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
(Diagram: the Client holds f_old, the Server holds f_new; the client sends a request and the server returns an update.)

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
(Diagram: the Client sends the block hashes of f_old to the Server; the Server, which holds f_new,
returns the encoded file.)

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
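A sketch (not rsync's actual code) of a weak rolling checksum in the spirit of the 4-byte rolling hash above: it can be slid one position at a time in O(1), so the side holding the new file can test every alignment against the received block hashes.

M = 1 << 16

def weak_hash(block: bytes):
    a = sum(block) % M
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % M
    return a, b                      # packed as (b << 16) | a in a 4-byte value

def roll(a, b, out_byte, in_byte, blocksize):
    a = (a - out_byte + in_byte) % M
    b = (b - blocksize * out_byte + a) % M
    return a, b

data = b"abcdefghij"
B = 4
a, b = weak_hash(data[0:B])
for i in range(1, len(data) - B + 1):
    a, b = roll(a, b, data[i - 1], data[i + B - 1], B)
    assert (a, b) == weak_hash(data[i:i + B])   # rolling update matches recomputation
print("rolling checksum consistent")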

Rsync: some experiments

          gcc size    emacs size
total     27288       27326
gzip      7563        8577
zdelta    227         1431
rsync     964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The Server sends the hashes (unlike the client in rsync); the client checks them.
The Server deploys the common f_ref to compress the new f_tar (rsync just compresses it, without using f_ref).

A multi-round protocol
  k blocks of n/k elements
  log(n/k) levels
  If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
  The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol
  k blocks of n/k elements
  log(n/k) levels
  If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
  The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

Occurrences of P in T = all suffixes of T having P as a prefix

Example: P = si, T = mississippi ⟹ occurrences at positions 4 and 7

SUF(T) = sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
(Figure: the suffix tree of T# = mississippi#. Edges are labeled with substrings of T#
(e.g. #, i, p, s, si, ssi, i#, pi#, ppi#, mississippi#, ...) and the 12 leaves are labeled with the
starting positions 1..12 of the corresponding suffixes.)

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
SUF(T) stored explicitly would take Θ(N²) space.

SA    SUF(T)
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#

T = mississippi#   (each SA entry is a suffix pointer; P = si selects a contiguous range)

Suffix Array space:
• SA: Θ(N log2 N) bits
• Text T: N chars
⟹ In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 memory accesses per step
(one to SA, one to the text).

T = mississippi#,  P = si
Compare P with the suffix pointed to by the middle SA entry: if P is larger move right,
if P is smaller move left.

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char comparisons
⟹ overall, O(p log2 N) time
Improvable to O(p + log2 N) [Manber-Myers, ’90] and to O(p + log2 |S|) [Cole et al, ’06]

Locating the occurrences
Two binary searches on SA, for the lexicographic positions of si# and si$ (where # is smaller than
every character of S, and $ larger), delimit the contiguous range of suffixes prefixed by P = si:
here occ = 2, at positions 4 and 7 of T = mississippi#.

Suffix Array search
• O(p + log2 N + occ) time

Suffix Trays: O(p + log2 |S| + occ)   [Cole et al., ‘06]
String B-tree   [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays   [Ciriani et al., ’02]
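A sketch of the whole pipeline on the running example: the "elegant but inefficient" SA construction by plain suffix sorting, followed by the two binary searches that delimit the occurrences of P:

def suffix_array(t):
    return sorted(range(len(t)), key=lambda i: t[i:])   # Θ(n² log n) worst case

def search_range(t, sa, p):
    lo, hi = 0, len(sa)                 # leftmost suffix whose |p|-prefix is >= p
    while lo < hi:
        mid = (lo + hi) // 2
        if t[sa[mid]:sa[mid] + len(p)] < p:
            lo = mid + 1
        else:
            hi = mid
    left = lo
    lo, hi = left, len(sa)              # leftmost suffix whose |p|-prefix is > p
    while lo < hi:
        mid = (lo + hi) // 2
        if t[sa[mid]:sa[mid] + len(p)] <= p:
            lo = mid + 1
        else:
            hi = mid
    return left, lo                     # occurrences are sa[left:lo]

t = "mississippi#"
sa = suffix_array(t)
print([i + 1 for i in sa])              # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
l, r = search_range(t, sa, "si")
print(sorted(i + 1 for i in sa[l:r]))   # [4, 7]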

Text mining
Lcp[1,N−1] = longest-common-prefix between suffixes adjacent in SA

SA  = 12 11 8 5 2 1 10 9 7 4 6 3
Lcp =  0  1 1 4 0 0  1 0 2 1 3        (Lcp[i] = lcp of the suffixes in SA[i] and SA[i+1])

T = mississippi#    (e.g. Lcp = 4 for the adjacent suffixes issippi# and ississippi#)

• How long is the common prefix between T[i,...] and T[j,...] ?
  • It is the min of the subarray Lcp[h,k−1] s.t. SA[h] = i and SA[k] = j.
• Does there exist a repeated substring of length ≥ L ?
  • Search for some Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  • Search for a window Lcp[i, i+C−2] whose entries are all ≥ L.


x > 0 and Length = log2 x +1
e.g., 9 represented as <000,1001>.



g-code for x takes 2 log2 x +1 bits

(ie. factor of 2 from optimal)



Optimal for Pr(x) = 1/2x2, and i.i.d integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8

6

3

59

7

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How much good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
1 ≥Si=1,...,x pi ≥ x * px  x ≤ 1/px

How good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):



p i  | g ( i ) |

i  1 ,..., S

This is:



i  1 ,.., S

p i  [ 2 * log

1

 1]

pi

 2 * H 0 ( X ) 1
No much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1n 2n 3n… nn  Huff = O(n2 log n), MTF = O(n log n) + n2

No much worse than Huffman
...but it may be far better

MTF: how good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
Put S in the front and consider the cost of encoding:
S

O ( S log S ) 



nx

g(p - p )

x 1 i  2

x

x

i

i -1

S

By Jensen’s:

 O ( S log S ) 

n
x 1

x

[ 2 * log

N

 1]

nx

 O ( S log S )  N * [ 2 * H 0 ( X )  1]
L a [ mtf ]  2 * H 0 ( X )  O (1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1n 2n 3n… nn 

There is a memory

Huff(X) = Q(n2 log n) > Rle(X) = Q(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g. with p(a) = .2, p(b) = .5, p(c) = .3 and cumulative probabilities

f(i) = ∑_{j<i} p(j),   i.e.  f(a) = .0, f(b) = .2, f(c) = .7

the unit interval is split as  a = [0,.2),  b = [.2,.7),  c = [.7,1.0).

The interval for a particular symbol will be called
the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
Start from [0,1): after b the interval is [.2,.7), after a it shrinks to [.2,.3), after c to [.27,.3).
The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c_1 ... c_n with probabilities p[c], use:

l_0 = 0,   l_i = l_{i−1} + s_{i−1} · f[c_i]
s_0 = 1,   s_i = s_{i−1} · p[c_i]

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

s_n = ∏_{i=1,...,n} p[c_i]

The interval for a message sequence will be called the
sequence interval
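A direct transcription of the recurrence (floating point, illustration only; a real coder uses the integer version discussed below):

def sequence_interval(msg, p, f):
    l, s = 0.0, 1.0
    for c in msg:
        l = l + s * f[c]        # l_i = l_{i-1} + s_{i-1} * f[c_i]
        s = s * p[c]            # s_i = s_{i-1} * p[c_i]
    return l, l + s             # the sequence interval [l, l+s)

p = {'a': .2, 'b': .5, 'c': .3}
f = {'a': .0, 'b': .2, 'c': .7}
# sequence_interval("bac", p, f) ≈ (0.27, 0.30), i.e. [.27,.3) as in the encoding example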

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
.49 falls in [.2,.7) ⇒ b; within that interval it falls in [.3,.55) ⇒ b; within that, in [.475,.55) ⇒ c.

The message is bbc.

Representing a real number
Binary fractional representation:
  .75 = .11      1/3 = .01 01 01 …      11/16 = .1011

Algorithm (emitting the bits of x ∈ [0,1)):
  1. x = 2*x
  2. If x < 1, output 0
  3. else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence interval?
e.g.  [0,.33) → .01     [.33,.66) → .1     [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
code     min       max        interval
.11      .110      .111       [.75, 1.0)
.101     .1010     .1011      [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Example: sequence interval [.61, .79); the code interval of .101 is [.625, .75), which is contained in it.

Can use L + s/2 truncated to 1 + ⌈log (1/s)⌉ bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most
  1 + ⌈log (1/s)⌉  =  1 + ⌈log ∏_{i=1,...,n} (1/p_i)⌉
  ≤ 2 + ∑_{i=1,...,n} log (1/p_i)
  = 2 + ∑_{k=1,...,|S|} n·p_k · log (1/p_k)
  = 2 + n·H0 bits

nH0 + 0.02·n bits in practice,
because of rounding

Integer Arithmetic Coding
Problem: operations on arbitrary-precision real numbers are expensive.
Key ideas of the integer version:

  Keep integers in range [0..R) where R = 2^k
  Use rounding to generate integer intervals
  Whenever the sequence interval falls into the top,
  bottom or middle half, expand the interval
  by a factor of 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine: given the current interval (L,s), the distribution (p1,....,pS) and the next symbol c, the ATB outputs the new interval (L',s').

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).
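A minimal sketch of the context-count tables a PPM model maintains (names are mine, not the course's reference code):

from collections import defaultdict

def ppm_counts(text, k):
    # counts[context][c] = how many times char c followed 'context' (contexts of size 0..k)
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(text)):
        for order in range(k + 1):
            if i - order >= 0:
                counts[text[i - order:i]][text[i]] += 1
    return counts

# counts["th"]["e"] / sum(counts["th"].values()) is the empirical p(e|th),
# i.e. 7/12 in the example above once "th" has been seen 12 times, 7 of them followed by e.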

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
String = ACCBACCACBA B,   k = 2

Context Empty:   A = 4,  B = 2,  C = 5,  $ = 3

Context A:   C = 3,  $ = 1
Context B:   A = 2,  $ = 1
Context C:   A = 1,  B = 2,  C = 2,  $ = 3

Context AC:  B = 1,  C = 2,  $ = 2
Context BA:  C = 1,  $ = 1
Context CA:  C = 1,  $ = 1
Context CB:  A = 2,  $ = 1
Context CC:  A = 1,  B = 1,  $ = 2
You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
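A minimal decoder following exactly this copy rule (triples (d, len, c); d = 0 is used only with len = 0, i.e. "no match, just emit c"):

def lz77_decode(triples):
    out = []
    for d, length, c in triples:
        for _ in range(length):
            out.append(out[-d])     # copy one char from d positions back;
                                    # correct even when length > d (overlap)
        out.append(c)
    return ''.join(out)

# lz77_decode([(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')]) == 'aacaacabcabaaac'
# (the windowed example above)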

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids
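A compact sketch of the coding loop (ids start at 1, id 0 denotes the empty string; helper name is mine):

def lz78_encode(text):
    dictionary = {}                 # substring -> id
    out, S = [], ''
    for c in text:
        if S + c in dictionary:
            S += c                                   # extend the longest match
        else:
            out.append((dictionary.get(S, 0), c))    # output (id of S, next char c)
            dictionary[S + c] = len(dictionary) + 1  # add Sc to the dictionary
            S = ''
    if S:
        out.append((dictionary.get(S, 0), ''))       # flush a pending match
    return out

# lz78_encode("aabaacabcabcb") == [(0,'a'), (1,'b'), (1,'a'), (0,'c'), (2,'c'), (5,'b')],
# matching the coding example below.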

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
We are given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]
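A tiny sketch that builds the BWT exactly as in the picture, by sorting all rotations (fine for short strings; the next slides discuss why this is inefficient at scale):

def bwt(text):
    # text is assumed to end with the unique marker '#'
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return ''.join(row[-1] for row in rotations)     # L = last column

# bwt("mississippi#") == "ipssm#pissii", the L column shown above.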

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Prob. that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows a power-law distribution

WebBase Crawl 2001

Pr[ in-degree(u) = k ]  ∝  1/k^a,   a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Prob. that a node has x links is 1/x^a, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides an efficient, optimal solution

fknown is the "previously encoded text": compress the concatenation fknown·fnew, emitting output only from fnew onwards

zdelta is one of the best implementations
            Emacs size    Emacs time
uncompr     27Mb          ---
gzip        8Mb           35 secs
zdelta      1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: the weighted graph GF over the files plus the dummy node; edge weights are zdelta sizes (e.g. 20, 123, 220, 620, 2000, ...), dummy-node weights are gzip sizes; the min branching picks the cheapest reference for each file.]

            space    time
uncompr     30Mb     ---
tgz         20%      linear
THIS        8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n² time.

            space    time
uncompr     260Mb    ---
tgz         12%      2 mins
THIS        8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

            gcc size    emacs size
total       27288       27326
gzip        7563        8577
zdelta      227         1431
rsync       964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Figure: the suffix tree of T# = mississippi#, with edges labeled by substrings and leaves labeled by the starting positions of the corresponding suffixes.]

T# = mississippi#

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Θ(N²) space (if the suffixes of SUF(T) were stored explicitly)
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Θ(N log₂ N) bits
• Text T: N chars
⇒ In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
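A direct sketch of the indirected binary search (toy suffix-array construction included; O(p) chars compared per step):

def suffix_array(text):
    return sorted(range(len(text)), key=lambda i: text[i:])   # O(n^2 log n), toy only

def search(text, sa, P):
    lo, hi = 0, len(sa)
    while lo < hi:                                   # leftmost suffix >= P
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(P)] < P: lo = mid + 1
        else: hi = mid
    first = lo
    lo, hi = first, len(sa)
    while lo < hi:                                   # leftmost suffix whose prefix > P
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(P)] <= P: lo = mid + 1
        else: hi = mid
    return sa[first:lo]            # Prop. 1: the occurrences are contiguous in SA

# T = "mississippi#"; search(T, suffix_array(T), "si") == [6, 3]
# (0-based starting positions, i.e. the occurrences 7 and 4 of the slides)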

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
(e.g. the Lcp between the adjacent suffixes issippi… and ississippi… is 4)

• How long is the common prefix between T[i,...] and T[j,...] ?
  • Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Is there a repeated substring of length ≥ L ?
  • Search for some Lcp[i] ≥ L
• Is there a substring of length ≥ L occurring ≥ C times ?
  • Search for a run Lcp[i,i+C-2] whose entries are all ≥ L


The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C · p · f/(1+f)
This is at least 10⁴ · f/(1+f)
If we fetch B ≈ 4Kb in time C, and the algorithm uses all of them:
(1/B) · (p · f/(1+f) · C)  ≈  30 · f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
[Figure: the same memory hierarchy as before — CPU registers, L1/L2 cache (few MBs, some nanosecs), RAM (few GBs, tens of nanosecs), HD (few TBs, few millisecs, B = 32K page), net (many TBs, even secs, packets).]

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
        4K    8K    16K    32K    128K    256K    512K    1M
n³      22s   3m    26m    3.5h   28h     --      --      --
n²      0     0     0      1s     26s     106s    7m      28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm

  sum = 0; max = -1;
  For i = 1,...,n do
    If (sum + A[i] ≤ 0) then sum = 0;
    else { sum += A[i]; max = MAX{max, sum}; }

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT
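A runnable version of the scan (like the slide's pseudocode, it assumes the array contains at least one positive element):

def max_subarray_sum(A):
    best, running = float('-inf'), 0
    for x in A:
        if running + x <= 0:
            running = 0              # the optimum cannot start before this point
        else:
            running += x
            best = max(best, running)
    return best

# max_subarray_sum([2,-5,6,1,-2,4,3,-13,9,-6,7]) == 12   (the subarray 6 1 -2 4 3)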

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort: Θ(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
  if (i < j) then
    m = (i+j)/2;             Divide
    Merge-Sort(A,i,m);       Conquer
    Merge-Sort(A,m+1,j);
    Merge(A,i,m,j)           Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:
  n = 10⁹ tuples  ⇒  few Gbs
  Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:
  It is an indirect sort: Θ(n log₂ n) random I/Os
  [5ms] × n log₂ n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log₂ N levels

If the run-size is larger than B (i.e. after the first step!!),
fetching all of it in memory for merging does not help.

[Figure: the recursion tree of binary Merge-Sort, merging sorted runs of doubling size.]

How do we deploy the disk/memory features ?
N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.
Pass i: merge X ≤ M/B runs  ⇒  log_{M/B}(N/M) passes

[Figure: X input buffers and one output buffer, each of B items, kept in main memory; runs are streamed from disk through the input buffers and the merged run is written back to disk.]

Multiway Merging
[Figure: X = M/B input buffers Bf1,...,BfX (one per run, with cursors p1,...,pX) and one output buffer Bfo. Repeatedly output min(Bf1[p1], Bf2[p2], …, BfX[pX]); fetch the next page of run i when pi = B; flush Bfo to the merged output run when it is full, until EOF.]
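A small in-memory sketch of X-way merging with a heap (Python lists stand in for the B-sized buffers of the figure):

import heapq

def multiway_merge(runs):
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        val, i, pos = heapq.heappop(heap)       # min of the current run heads
        out.append(val)
        if pos + 1 < len(runs[i]):              # "fetch" the next item of run i
            heapq.heappush(heap, (runs[i][pos + 1], i, pos + 1))
    return out

# multiway_merge([[1,2,5,10], [2,7,9,13], [3,4,11,12]]) returns the single merged run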

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Can compression help?

Goal: enlarge M and reduce N
  #passes = O(log_{M/B} (N/M))
  Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables <X,C>
For each item s of the stream,
  if (X == s) then C++
  else { C--; if (C == 0) { X = s; C = 1; } }

Return X;

Proof
(Problems arise if the mode occurs ≤ N/2.)

If X ≠ y at the end, then every one of y's
occurrences has a "negative" mate (an occurrence that cancelled it).
Hence these mates should be ≥ #occ(y).

As a result 2·#occ(y) ≤ N, contradicting #occ(y) > N/2.
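A runnable version of the majority scan (Boyer-Moore voting), equivalent to the pseudocode above:

def majority_candidate(stream):
    X, C = None, 0
    for s in stream:
        if X == s:
            C += 1
        else:
            C -= 1
            if C <= 0:
                X, C = s, 1         # adopt s as the new candidate
    return X                        # the mode, provided it occurs > N/2 times

# majority_candidate("bacccdcbaaaccbccc") == 'c'   (the stream of the example: 9 c's out of 17)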

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 × 10⁹ chars,  size = 6Gb
n = 10⁶ documents
TotT = 10⁹ terms (avg term length is 6 chars)
t = 5 × 10⁵ distinct terms

What kind of data structure should we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million documents, t = 500K terms; entry is 1 if the play contains the word, 0 otherwise

             Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony              1                1             0          0        0        1
Brutus              1                1             0          1        0        0
Caesar              1                1             0          1        1        1
Calpurnia           0                1             0          0        0        0
Cleopatra           1                0             0          0        0        0
mercy               1                0             1          1        1        1
worser              1                0             1          1        1        0

Space is 500Gb !

Solution 2: Inverted index
[Figure: each dictionary term (Brutus, Calpurnia, Caesar) points to its sorted posting list of docIDs, e.g. 2 4 8 16 32 64 128. We can still do better, i.e. ≈50% of the original text.]

1. Typically use about 12 bytes
2. We have 10⁹ total terms  ⇒  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in fewer bits ?
NO, they are 2^n but we have fewer compressed messages:

∑_{i=1,...,n−1} 2^i  =  2^n − 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
[Figure: the binary trie of the code — a hangs at depth 1, d at depth 2, b and c at depth 3.]

Average Length
For a code C with codeword length L[s], the
average length is defined as
La(C)  =  ∑_{s∈S} p(s) · L[s]

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, there exists a prefix
code with the same symbol lengths and thus the
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
[Figure: the Huffman tree — merge a(.1) and b(.2) into (.3), then (.3) and c(.2) into (.5), then (.5) and d(.5) into (1).]

a=000, b=001, c=01, d=1
There are 2^(n-1) "equivalent" Huffman trees

What about ties (and thus, tree depth) ?
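A compact construction sketch over the running example (a heap of (prob, tiebreak, tree) triples; the exact bits depend on tie-breaking and child ordering, the codeword lengths do not):

import heapq
from itertools import count

def huffman_codes(probs):
    tick = count()
    heap = [(p, next(tick), s) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, t1 = heapq.heappop(heap)     # merge the two least-probable trees
        p2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(tick), (t1, t2)))
    codes = {}
    def walk(tree, code=""):
        if isinstance(tree, tuple):
            walk(tree[0], code + "0")
            walk(tree[1], code + "1")
        else:
            codes[tree] = code or "0"       # root-to-leaf path = codeword
    walk(heap[0][2])
    return codes

# huffman_codes({'a': .1, 'b': .2, 'c': .2, 'd': .5}) gives codeword lengths 3,3,2,1
# for a,b,c,d — e.g. the slide's choice a=000, b=001, c=01, d=1.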

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
Example (on the tree above):  abc…  →  000 001 01 …        101001…  →  d c b

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is  −log(.999) ≈ .00144

If we were to send 1000 such symbols we
might hope to use 1000 · .00144 ≈ 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:
  Model takes |S|^k · (k · log |S|) + h² bits   (where h might be |S|)
  It is H0(S^L) ≤ L · Hk(S) + O(k · log |S|), for each k ≤ L

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
[Figure: the word-based Huffman tree for T = "bzip or not bzip" (words and separators are the symbols); codewords are byte-aligned and tagged on the first bit, e.g. C(bzip) = 1a 0b, and C(T) is the concatenation of the words' codewords.]

CGrep and other ideas...
P = bzip = 1a 0b
[Figure: searching the codeword of P directly in C(T) with GREP; each byte-aligned candidate position is checked and marked yes/no.]
Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

[Figure: the dictionary {bzip, not, or, space} with its Huffman tree and codewords, the compressed text C(S) for S = "bzip or not bzip", and the byte-aligned scan of C(S) for P's codeword (yes/no at each candidate).]
Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H(s)  =  ∑_{i=1,...,m} 2^(m−i) · s[i]

P = 0101
H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5
s = s'  if and only if  H(s) = H(s')

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(T_r) from H(T_{r−1}):

H(T_r)  =  2·H(T_{r−1})  −  2^m · T(r−1)  +  T(r+m−1)

T = 10110101
T_1 = 1011,  T_2 = 0110
H(T_1) = H(1011) = 11
H(T_2) = 2·11 − 2⁴·1 + 0 = 22 − 16 = 6 = H(0110)

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
  1·2 + 0 ≡ 2 (mod 7)
  2·2 + 1 ≡ 5 (mod 7)
  5·2 + 1 ≡ 4 (mod 7)
  4·2 + 1 ≡ 2 (mod 7)
  2·2 + 1 ≡ 5 (mod 7)   =  Hq(P)

We can still compute Hq(T_r) from Hq(T_{r−1}):
  2^m (mod q) = 2·(2^(m−1) (mod q)) (mod q)

Intermediate values are also small!  (< 2q)
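A small sketch of the whole fingerprint scan over a binary text, with the rolling update done mod q (the default q below is arbitrary, not from the slides):

def karp_rabin(T, P, q=2**31 - 1):
    m = len(P)
    two_m = pow(2, m, q)                        # 2^m (mod q)
    hp = ht = 0
    for i in range(m):                          # fingerprints of P and of T_1
        hp = (2 * hp + int(P[i])) % q
        ht = (2 * ht + int(T[i])) % q
    hits = []
    for r in range(len(T) - m + 1):
        if ht == hp:
            hits.append(r + 1)                  # probable match at 1-based position r+1
        if r + m < len(T):                      # Hq(T_{r+2}) from Hq(T_{r+1})
            ht = (2 * ht - two_m * int(T[r]) + int(T[r + m])) % q
    return hits

# karp_rabin("10110101", "0101") == [5], as in the example above (T_5 = 0101).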

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
  (1) the first i-1 characters of P match the i-1 characters of T ending at character j-1   ⇔  M(i-1,j-1) = 1
  (2) P[i] = T[j]   ⇔  the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) into the i-th position;
AND this with the i-th bit of U(T[j]) to establish whether both hold.
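A bit-parallel sketch of the whole method, keeping each column of M in one integer (bit i−1 of the word is row i of M):

def shift_and(T, P):
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)       # U(c): positions of c in P
    M, last = 0, 1 << (len(P) - 1)
    occ = []
    for j, c in enumerate(T):
        M = ((M << 1) | 1) & U.get(c, 0)    # M(j) = BitShift(M(j-1)) & U(T[j])
        if M & last:                        # row m is set: P ends at position j
            occ.append(j - len(P) + 2)      # 1-based starting position
    return occ

# shift_and("xabxabaaca", "abaac") == [5]   (the occurrence found at j = 9 in the example)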

An example
T = xabxabaaca,  P = abaac
[Figure: the columns M(1), M(2), M(3), …, M(9), each computed as BitShift(M(j−1)) & U(T[j]); at j = 9 the last bit of the column becomes 1, i.e. an occurrence of P ends at position 9 of T.]

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of length m
  R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of the Shift-And method searching for S:
  For any symbol c, U'(c) = U(c) AND R
    U'(c)[i] = 1 iff S[i] = c and it is the first symbol of a pattern
  For any step j:
    compute M(j)
    then M(j) OR U'(T[j]). Why? To set to 1 the first bit of each pattern that starts with T[j]
    Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal: given k, find all the occurrences of P
in T with up to k mismatches.
We define the matrix M^l to be an m by n binary
matrix, such that:

M^l(i,j) = 1 iff
there are no more than l mismatches between the
first i characters of P and the i characters of T
ending at character j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

[Figure: P aligned with T ending at j−1, with ≤ l mismatches in its first i−1 characters, and P[i] = T[j].]

BitShift( M^l(j−1) )  &  U(T[j])

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

[Figure: P aligned with T ending at j−1, with ≤ l−1 mismatches in its first i−1 characters; the mismatch budget pays for position i.]

BitShift( M^(l−1)(j−1) )

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M^l(j)  =  [ BitShift( M^l(j−1) ) & U(T(j)) ]  OR  BitShift( M^(l−1)(j−1) )

Example
T = xabxabaaca,  P = abaad
[Figure: the matrices M⁰ (exact matching) and M¹ (at most one mismatch) for P against T, filled column by column with the rule above.]
How much do we pay?





The running time is O(k·n·(1+m/w)).
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
g-code of x:  (Length − 1) zeroes, followed by x in binary

  x > 0 and Length = ⌊log₂ x⌋ + 1
  e.g., 9 is represented as <000, 1001>.

  The g-code for x takes 2·⌊log₂ x⌋ + 1 bits
  (i.e. a factor of 2 from optimal)

  Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111   →   8, 6, 3, 59, 7

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2·H0(S) + 1
Key fact:
  1  ≥  ∑_{i=1,...,x} p_i  ≥  x · p_x   ⇒   x ≤ 1/p_x

How good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):

∑_{i=1,...,|S|} p_i · |g(i)|  ≤  ∑_{i=1,...,|S|} p_i · [ 2·log(1/p_i) + 1 ]  =  2·H0(X) + 1

Not much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1n 2n 3n… nn  Huff = O(n2 log n), MTF = O(n log n) + n2

No much worse than Huffman
...but it may be far better

MTF: how good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
Put S in the front and consider the cost of encoding:
S

O ( S log S ) 



nx

g(p - p )

x 1 i  2

x

x

i

i -1

S

By Jensen’s:

 O ( S log S ) 

n
x 1

x

[ 2 * log

N

 1]

nx

 O ( S log S )  N * [ 2 * H 0 ( X )  1]
L a [ mtf ]  2 * H 0 ( X )  O (1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1n 2n 3n… nn 

There is a memory

Huff(X) = Q(n2 log n) > Rle(X) = Q(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.
1.0

c = .3

0.7

i -1

f (i ) 



p( j)

j 1

b = .5
0.2
0.0

f(a) = .0, f(b) = .2, f(c) = .7
a = .2

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
1.0

0.7
c = .3

0.7

c = .3
0.55

0.0

a = .2

0.3
0.2

c = .3

0.27
b = .5

b = .5
0.2

0.3

a = .2

b = .5
0.22

a = .2

0.2

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
  l_0 = 0      l_i = l_{i-1} + s_{i-1} * f[c_i]
  s_0 = 1      s_i = s_{i-1} * p[c_i]

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is   s_n = Π_{i=1..n} p[c_i]

The interval for a message sequence will be called the
sequence interval
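A minimal float-based Python sketch of these recurrences (real coders use the integer version described later):

def sequence_interval(msg, p, f):
    """Return (l, s): the sequence interval [l, l+s) of msg."""
    l, s = 0.0, 1.0
    for c in msg:
        l = l + s * f[c]      # l_i = l_{i-1} + s_{i-1} * f[c_i]
        s = s * p[c]          # s_i = s_{i-1} * p[c_i]
    return l, s

p = {'a': .2, 'b': .5, 'c': .3}
f = {'a': .0, 'b': .2, 'c': .7}
l, s = sequence_interval("bac", p, f)
print(l, l + s)               # ~0.27 and ~0.30, i.e. the interval [.27, .3)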

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
  .49 ∈ [.2, .7)                  ⇒  b ;  new interval [.2, .7)
  .49 ∈ [.3, .55)  ⊂ [.2, .7)     ⇒  b ;  new interval [.3, .55)
  .49 ∈ [.475, .55) ⊂ [.3, .55)   ⇒  c

The message is bbc.

Representing a real number
Binary fractional representation:

  .75 = .11        1/3 = .010101…        11/16 = .1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
  number    min       max        interval
  .11       .110      .111…      [.75, 1.0)
  .101      .1010     .1011…     [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

[Figure: a sequence interval [.61, .79) containing the code interval of .101, i.e. [.625, .75)]

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 (top half):
  Output 1 followed by m 0s;  set m = 0;  the message interval is expanded by 2

If u < R/2 (bottom half):
  Output 0 followed by m 1s;  set m = 0;  the message interval is expanded by 2

If l ≥ R/4 and u < 3R/4 (middle half):
  Increment m;  the message interval is expanded by 2

All other cases: just continue...

You find this at

Arithmetic ToolBox
As a state machine

[Figure: the Arithmetic ToolBox ATB maps the current interval (L,s) and a symbol c,
drawn from the distribution (p_1,…,p_|S|), to the refined interval (L’,s’) with
L’ = L + s·f(c) and s’ = s·p(c)]

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.
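A minimal Python sketch of the statistics a PPM model keeps (context → symbol counts); the escape count used here — one per distinct symbol seen — is just one of the possible heuristics mentioned above:

from collections import defaultdict

def ppm_counts(text, k):
    """counts[context][symbol] for every context of length 0..k seen in text."""
    counts = defaultdict(lambda: defaultdict(int))
    for i, c in enumerate(text):
        for order in range(0, k + 1):
            if i - order < 0:
                continue
            ctx = text[i - order:i]
            counts[ctx][c] += 1
    return counts

def prob(counts, ctx, c):
    """p(c | ctx); escape weight = number of distinct symbols seen in ctx."""
    seen = counts[ctx]
    total = sum(seen.values()) + len(seen)
    return seen[c] / total if c in seen else None   # None -> emit escape, shorten ctx

cnt = ppm_counts("ACCBACCACBA", k=2)
print(dict(cnt["AC"]))     # {'C': 2, 'B': 1}, as in the table below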

PPM + Arithmetic ToolBox
[Figure: as before, the ATB maps (L,s) and the symbol s = c or esc, now with
probability p[ s | context ], to the refined interval (L’,s’)]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts        (String = ACCBACCACBA B,   k = 2)

  Order 0 (empty context):    A = 4   B = 2   C = 5   $ = 3

  Order 1:
    A :  C = 3   $ = 1
    B :  A = 2   $ = 1
    C :  A = 1   B = 2   C = 2   $ = 3

  Order 2:
    AC:  B = 1   C = 2   $ = 2
    BA:  C = 1   $ = 1
    CA:  C = 1   $ = 1
    CB:  A = 2   $ = 1
    CC:  A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves
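A minimal Python sketch of this parsing step (quadratic search inside the window; real coders hash triplets for speed, as gzip does below):

def lz77_parse(t, w):
    """Greedy LZ77 parsing of string t with a window of size w: list of (d, len, c)."""
    i, out = 0, []
    while i < len(t):
        best_d, best_len = 0, 0
        for j in range(max(0, i - w), i):           # candidate copy starts in the window
            l = 0
            while i + l < len(t) - 1 and t[j + l] == t[i + l]:
                l += 1                              # overlap with the cursor is allowed
            if l > best_len:
                best_d, best_len = i - j, l
        nxt = t[i + best_len]                       # next char beyond the longest match
        out.append((best_d, best_len, nxt))
        i += best_len + 1                           # advance by len + 1
    return out

print(lz77_parse("aacaacabcabaaac", 6))
# [(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')]  -- as in the example below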

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
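A minimal Python sketch of the LZW coding loop, seeded here with the toy ids of the next slide instead of the full 256 ASCII entries:

def lzw_encode(t, first_ids):
    """LZW: output dictionary ids only; for every emitted S also add S+c."""
    dic = dict(first_ids)            # e.g. {"a": 112, "b": 113, "c": 114}
    next_id = 256
    out, S = [], ""
    for c in t:
        if S + c in dic:
            S = S + c                # extend the current match
        else:
            out.append(dic[S])       # emit the longest match found so far
            dic[S + c] = next_id     # add S+c to the dictionary
            next_id += 1
            S = c
    out.append(dic[S])               # flush the last match
    return out

print(lzw_encode("aabaacababacb", {"a": 112, "b": 113, "c": 114}))
# [112, 112, 113, 256, 114, 257, 261, 114, 113]  (the slide stops before the final flush)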

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us be given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows (Burrows-Wheeler, 1994).  F is the first column, L the last one:

  F                L
  # mississipp     i
  i #mississip     p
  i ppi#missis     s
  i ssippi#mis     s
  i ssissippi#     m
  m ississippi     #
  p i#mississi     p
  p pi#mississ     i
  s ippi#missi     s
  s issippi#mi     s
  s sippi#miss     i
  s sissippi#m     i

  BWT(T) = L = ipssm#pissii

A famous example

Much
longer...

A useful tool: the L → F mapping

How do we map L’s chars onto F’s chars ?
... we need to distinguish equal chars in F...

Take two equal chars of L and rotate their rows rightward by one position:
they keep the same relative order in F. Hence the k-th occurrence of a char
in L corresponds to the k-th occurrence of that char in F (the LF mapping).

The BWT is invertible
Two key properties:
  1. The LF-array maps L’s chars to F’s chars
  2. L[ i ] precedes F[ i ] in T

Reconstruct T backward:   T = .... i p p i #

InvertBWT(L)
  Compute LF[0,n-1];
  r = 0; i = n;
  while (i > 0) {
    T[i] = L[r];
    r = LF[r]; i--;
  }
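A minimal Python sketch of both directions (sorting all rotations is fine only for toy inputs; the LF mapping is built by stably sorting L):

def bwt(t):
    """Burrows-Wheeler Transform by sorting all rotations (toy-sized inputs only)."""
    rot = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(r[-1] for r in rot)               # L = last column

def inv_bwt(L):
    """Invert the BWT: equal chars of L keep, in F, their relative order."""
    n = len(L)
    order = sorted(range(n), key=lambda i: (L[i], i))   # F[j] = L[order[j]]
    LF = [0] * n
    for f_pos, l_pos in enumerate(order):
        LF[l_pos] = f_pos
    r, out = 0, []            # row 0 starts with the terminator (smallest char)
    for _ in range(n):
        out.append(L[r])      # L[r] precedes F[r] in T
        r = LF[r]
    out.reverse()
    return "".join(out[1:] + out[:1])   # rotate the terminator back to the end

L = bwt("mississippi#")
print(L)                      # ipssm#pissii
print(inv_bwt(L))             # mississippi#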

How to compute the BWT ?
  SA      BWT matrix          L
  12      #mississipp         i
  11      i#mississip         p
   8      ippi#missis         s
   5      issippi#mis         s
   2      ississippi#         m
   1      mississippi         #
  10      pi#mississi         p
   9      ppi#mississ         i
   7      sippi#missi         s
   4      sissippi#mi         s
   6      ssippi#miss         i
   3      ssissippi#m         i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
  SA      sorted suffixes of T = mississippi#
  12      #
  11      i#
   8      ippi#
   5      issippi#
   2      ississippi#
   1      mississippi#
  10      pi#
   9      ppi#
   7      sippi#
   4      sissippi#
   6      ssippi#
   3      ssissippi#

Elegant but inefficient.  Obvious inefficiencies:
• Θ(n² log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)

  Set of nodes such that from any node you can reach any other node via an
  undirected path.

Strongly connected components (SCC)

  Set of nodes such that from any node you can reach any other node via a
  directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by humankind

Exploit the structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph
  V = Routers
  E = communication links

The “cosine” graph (undirected, weighted)
  V = static web pages
  E = semantic distance between pages

Query-Log graph (bipartite, weighted)
  V = queries and URLs
  E = (q,u) if u is a result for q, and has been clicked by some user who issued q

Social graph (undirected, unweighted)
  V = users
  E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has a hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:

Skewed distribution: Pb that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution

Altavista crawl (1999) and WebBase crawl (2001): the indegree follows a power-law distribution

  Pr[ in-degree(u) = k ]  ∝  1 / k^α ,     α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has a hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/x^α, α ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph

[Adjacency-matrix plot (i,j) of a crawl: 21 million pages, 150 million links]

URL-sorting (Berkeley, Stanford, …)
URL compression + Delta encoding

The library WebGraph
Uncompressed adjacency list  vs.  adjacency list with compressed gaps (locality):

  Successor list S(x) = {s_1 − x, s_2 − s_1 − 1, ..., s_k − s_{k−1} − 1}

For negative entries (only the first gap can be negative): map v ≥ 0 to 2v and
v < 0 to 2|v| − 1, as in the residual examples further below.
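A minimal Python sketch of this gap transformation (the subsequent variable-length coding of each gap, e.g. with γ/ζ codes, is omitted; the node and successor ids are made up for illustration):

def gaps(x, successors):
    """Turn the sorted successor list of node x into small gaps (WebGraph-style)."""
    out, prev = [], x
    for k, s in enumerate(successors):
        g = s - prev - (0 if k == 0 else 1)        # s1 - x, then s_i - s_{i-1} - 1
        if k == 0:
            g = 2 * g if g >= 0 else 2 * (-g) - 1  # only the first gap can be negative
        out.append(g)
        prev = s
    return out

# hypothetical node 17 with successors 13, 15, 16, 22, 24:
print(gaps(17, [13, 15, 16, 22, 24]))    # [7, 1, 0, 5, 1]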

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides an efficient, optimal solution




fknown is the “previously encoded text”: compress the concatenation fknown·fnew, starting from fnew

zdelta is one of the best implementations
             Emacs size    Emacs time
  uncompr    27Mb          ---
  gzip       8Mb           35 secs
  zdelta     1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

[Figure: Client — proxy (reference) — slow link — proxy (reference) — fast link — web;
the request goes out to the web, and the returned Page is delta-encoded across the slow link]

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: example weighted graph GF over 5 files plus the dummy node 0; edge weights
(e.g. 20, 123, 220, 620, 2000) are the zdelta sizes, dummy edges the gzip sizes]

             space    time
  uncompr    30Mb     ---
  tgz        20%      linear
  THIS       8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly: n² edge calculations (zdelta executions)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F, thus saving
zdelta executions. Nonetheless, still n² time
             space    time
  uncompr    260Mb    ---
  tgz        12%      2 mins
  THIS       8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
[Figure: the Client (holding f_old) sends the block hashes of f_old to the Server;
the Server (holding f_new) answers with an encoded f_new made of literals plus
references to the client’s matching blocks]

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
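A minimal Python sketch of a weak rolling checksum of the kind rsync uses (an Adler-style pair of sums; the real tool pairs it with a strong per-block hash such as MD5):

def weak_hashes(data, block):
    """Rolling checksum of every window of length `block`, updated in O(1) per shift."""
    a = sum(data[:block]) % 65536                                  # sum of bytes
    b = sum((block - i) * data[i] for i in range(block)) % 65536   # weighted sum
    out = [(a << 16) | b]
    for i in range(block, len(data)):
        a = (a - data[i - block] + data[i]) % 65536       # slide the window by one byte
        b = (b - block * data[i - block] + a) % 65536
        out.append((a << 16) | b)
    return out

print(weak_hashes(b"the quick brown fox", 8)[:3])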

Rsync: some experiments

            gcc size    emacs size
  total     27288       27326
  gzip      7563        8577
  zdelta    227         1431
  rsync     964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends the hashes (unlike the client in rsync), and the client checks them.
The server deploys the common fref to compress the new ftar (rsync just compresses it).

A multi-round protocol

k blocks of n/k elems,  log(n/k) levels

If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

Occurrences of P in T = all suffixes of T having P as a prefix

  Example: P = si, T = mississippi  ⇒  occurrences at positions 4, 7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree

T# = mississippi#
     1 2 3 4 5 6 7 8 9 10 11 12

[Figure: the suffix tree of T#; edge labels are substrings of T# (e.g. #, i, ssi,
ppi#, si, pi#, mississippi#, …) and each leaf stores the starting position
(1, 2, …, 12) of its suffix]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position (in SUF(T)) is the lexicographic position of P.

Storing SUF(T) explicitly takes Θ(N²) space; the suffix array keeps only the
starting positions:

  SA      SUF(T)                     T = mississippi#
  12      #
  11      i#
   8      ippi#
   5      issippi#
   2      ississippi#
   1      mississippi#
  10      pi#
   9      ppi#
   7      sippi#
   4      sissippi#
   6      ssippi#
   3      ssissippi#

Suffix Array space:
• SA: Θ(N log2 N) bits
• Text T: N chars
⇒ In practice, a total of 5N bytes

Searching a pattern P = si in T = mississippi#

Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step.
At each step the middle suffix T[SA[m], N] is compared with P; move right if P is
larger, left if P is smaller.

Suffix Array search:
• O(log2 N) binary-search steps
• Each step takes O(p) char comparisons
⇒ overall, O(p log2 N) time

  improved to O(p + log2 N)    [Manber-Myers, ’90]
  and to O(p + log2 |S|)       [Cole et al, ’06]

Locating the occurrences

Once the binary search has delimited the range of suffixes prefixed by P (delimit it
by searching for P# and P$, with # < S < $), all occurrences are reported by scanning
that contiguous range of SA.  Here P = si, T = mississippi#: the range contains
sippi# and sissippi#, i.e. occ = 2 occurrences, at positions 7 and 4.

Suffix Array search:
• O( p + log2 N + occ ) time

Suffix Trays: O( p + log2 |S| + occ )     [Cole et al., ‘06]
String B-tree                             [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays              [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

  Lcp = 0 0 1 4 0 0 1 0 2 1 3      SA = 12 11 8 5 2 1 10 9 7 4 6 3      T = mississippi#

  (e.g. the Lcp between issippi# and ississippi# is 4)

• How long is the common prefix between T[i,...] and T[j,...] ?
  Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  Search for some Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  Search for a window Lcp[i,i+C-2] whose entries are all ≥ L


Slide 52

Algoritmi per IR

Prologo

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
  C * p * f/(1+f)

This is at least 10^4 * f/(1+f)

If we fetch B ≈ 4KB in time C, and the algorithm uses all of them:
  (1/B) * (p * f/(1+f) * C)  ≈  30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
[Memory-hierarchy figure: CPU registers — L1/L2 caches (few MBs, some nanosecs, few
words fetched) — RAM (few GBs, tens of nanosecs) — HD (few TBs, few millisecs,
B = 32K page) — net (many TBs, even secs, packets)]

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
         4K    8K    16K   32K    128K   256K   512K   1M
  n^3    22s   3m    26m   3.5h   28h    --     --     --
  n^2    0     0     0     1s     26s    106s   7m     28m

An optimal solution
We assume every subsum ≠ 0

  A = [  <0 part  |  >0 Optimum part  ]
  A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm
  sum = 0; max = -1;
  For i = 1,...,n do
    If (sum + A[i] ≤ 0)  sum = 0;
    else { sum += A[i];  max = MAX{max, sum}; }

Note:
• Sum < 0 when OPT starts;
• Sum > 0 within OPT
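The same scan as runnable Python (a slight variant that also handles the zero/all-negative cases the slide rules out):

def max_subarray(A):
    """One linear scan: drop the running prefix as soon as it cannot help."""
    best, run = A[0], 0
    for x in A:
        run += x
        best = max(best, run)
        if run <= 0:        # a non-positive prefix can only hurt the optimum
            run = 0
    return best

print(max_subarray([2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]))   # 12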

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 10^9 random I/Os = 10^9 * 5ms  ≈  2 months

Binary Merge-Sort
  Merge-Sort(A,i,j)
  01  if (i < j) then
  02    m = (i+j)/2;           // Divide
  03    Merge-Sort(A,i,m);     // Conquer
  04    Merge-Sort(A,m+1,j);
  05    Merge(A,i,m,j)         // Combine

Cost of Mergesort on large data

Take Wikipedia in Italian, compute word frequencies:
  n = 10^9 tuples  ⇒  a few GBs
  Typical disk (Seagate Cheetah 150GB): seek time ~5ms

Analysis of mergesort on disk:
  It is an indirect sort: Θ(n log2 n) random I/Os
  [5ms] * n log2 n  ≈  1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

[Figure: the log2 N levels of the mergesort recursion tree over a sample array;
each level merges pairs of sorted runs]

If the run-size is larger than B (i.e. after the first step!!), fetching all of it
in memory for merging does not help.

How do we deploy the disk/mem features ?
With internal memory M: produce N/M runs, each sorted in internal memory (no I/Os)
— the I/O-cost for merging them pairwise is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort

The key is to balance run-size and #runs to merge.
Sort N items with main-memory M and disk-pages B:

  Pass 1: Produce (N/M) sorted runs.
  Pass i: merge X ≤ M/B runs   ⇒   log_{M/B}(N/M) passes

[Figure: X input buffers of B items (INPUT 1, …, INPUT X) in main memory are merged
into one OUTPUT buffer of B items; runs live on disk]

Multiway Merging

[Figure: each of the X = M/B runs has its current page buffered in Bf_r with a
pointer p_r; at each step the minimum among Bf_1[p_1], …, Bf_X[p_X] is moved to the
output buffer Bf_o; a run’s page is fetched again when p_r = B, and Bf_o is flushed
to the merged output run when it is full, until EOF]
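A minimal in-memory Python sketch of the X-way merging step (a heap replaces the linear scan for the minimum; on disk each run would be read B items at a time):

import heapq

def multiway_merge(runs):
    """Merge X sorted runs into one sorted output, popping the minimum head each time."""
    heads = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heads)
    out = []
    while heads:
        val, i, j = heapq.heappop(heads)        # smallest current head
        out.append(val)
        if j + 1 < len(runs[i]):
            heapq.heappush(heads, (runs[i][j + 1], i, j + 1))   # refill from run i
    return out

print(multiway_merge([[1, 2, 5, 10], [2, 7, 8, 13], [3, 4, 11, 12]]))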

Cost of Multi-way Merge-Sort

  Number of passes = log_{M/B} #runs  ≤  log_{M/B} (N/M)
  Optimal cost = Θ( (N/B) log_{M/B} (N/M) )  I/Os

In practice
  M/B ≈ 1000  ⇒  #passes = log_{M/B}(N/M) ≈ 1
  One multiway merge  ⇒  2 passes = few mins     (tuning depends on disk features)

  Large fan-out (M/B) decreases #passes
  Compression would decrease the cost of a pass!

Can compression help?

Goal: enlarge M and reduce N
  #passes = O( log_{M/B} (N/M) )
  Cost of a pass = O( N/B )

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm

  Use a pair of variables <X,C>
  For each item s of the stream:
    if (X == s) then C++
    else { C--; if (C == 0) { X = s; C = 1; } }
  Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...
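A minimal Python sketch of the same one-pass majority vote (an equivalent formulation of the loop above), followed by the verification pass needed when the mode might not exceed N/2:

def majority_candidate(stream):
    """One pass, two variables: returns the only possible > N/2 item."""
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1
        elif X == s:
            C += 1
        else:
            C -= 1
    return X

A = "bacccdcbaaaccbccc"
X = majority_candidate(A)
print(X, A.count(X) > len(A) / 2)    # 'c' True  (second pass verifies the candidate)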

Toy problem #4: Indexing

Consider the following TREC collection:
  N = 6 * 10^9  ⇒  size = 6GB
  n = 10^6 documents
  TotT = 10^9 total terms (avg term length is 6 chars)
  t = 5 * 10^5 distinct terms

What kind of data structure do we build to support word-based searches ?

Solution 1: Term-Doc matrix      (t = 500K terms,   n = 1 million docs)

              Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
  Antony            1                 1             0          0       0        1
  Brutus            1                 1             0          1       0        0
  Caesar            1                 1             0          1       1        1
  Calpurnia         0                 1             0          0       0        0
  Cleopatra         1                 0             0          0       0        0
  mercy             1                 0             1          1       1        1
  worser            1                 0             1          1       1        0

  1 if the play contains the word, 0 otherwise.

Space is 500Gb !

Solution 2: Inverted index

  Brutus     →  2  4  8  16  32  64  128
  Calpurnia  →  1  2  3  5  8  13  21  34
  Caesar     →  13  16

We can still do better: i.e. 30-50% of the original text

1. Typically we use about 12 bytes per posting
2. We have 10^9 total terms  ⇒  at least 12GB of space
3. Compressing the 6GB of documents gets 1.5GB of data
Better index, but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM into fewer bits ?
NO: they are 2^n, but the shorter compressed messages are at most

  Σ_{i=1..n-1} 2^i  =  2^n − 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the self information of s is:

  i(s) = log2 (1 / p(s)) = − log2 p(s)

Lower probability  ⇒  higher information

Entropy is the weighted average of i(s):

  H(S) = Σ_{s ∈ S} p(s) · log2 (1 / p(s))     bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which no codeword is a prefix of another one
  e.g.  a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie:

  [trie figure: edges labelled 0/1; leaves a (0), b (100), c (101), d (11)]

Average Length
For a code C with codeword lengths L[s], the average length is defined as

  La(C) = Σ_{s ∈ S} p(s) · L[s]

We say that a prefix code C is optimal if for all prefix codes C’,  La(C) ≤ La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal uniquely decodable code there exists a
prefix code with the same codeword lengths, and thus the same optimal average
length. And vice versa…

Theorem (golden rule). If C is an optimal prefix code for a source with
probabilities {p1, …, pn}, then  pi < pj  ⇒  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any uniquely
decodable code C, we have   H(S) ≤ La(C)

Theorem (upper bound, Shannon). For any probability distribution, there exists a
prefix code C such that   La(C) ≤ H(S) + 1
  (the Shannon code takes ⌈log 1/p⌉ bits)

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

[Huffman tree: merge a(.1)+b(.2) → (.3); merge (.3)+c(.2) → (.5); merge (.5)+d(.5) → (1)]

  a = 000,  b = 001,  c = 01,  d = 1

There are 2^(n-1) “equivalent” Huffman trees

What about ties (and thus, tree depth) ?
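A minimal Python sketch of the construction on this running example (ties are broken arbitrarily, which is exactly why several equivalent trees exist):

import heapq
from itertools import count

def huffman_codes(probs):
    """Build a Huffman code: repeatedly merge the two least probable trees."""
    tick = count()                       # tie-breaker so heap tuples always compare
    heap = [(p, next(tick), {s: ""}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, next(tick), merged))
    return heap[0][2]

print(huffman_codes({"a": .1, "b": .2, "c": .2, "d": .5}))
# prints one optimal code: a and b get 3 bits, c gets 2, d gets 1
# (equivalent, up to relabeling, to a=000, b=001, c=01, d=1)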

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to the symbol to be encoded.
Decoding: Start at the root and take the branch for each bit received. When at a
leaf, output its symbol and return to the root.

  abc...  →  000 001 01          101 001 ...  →  d c b

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

The model size may be large. Huffman codes can be made succinct in the representation
of the codeword tree, and fast in (de)coding: the Canonical Huffman tree.

We store, for any level L:
  firstcode[L]     ( = 00.....0 )
  Symbol[L,i], for each i in level L

This is ≤ h² + |S| log |S| bits

Canonical Huffman: Encoding / Decoding example (levels 1…5)

  firstcode[1] = 2
  firstcode[2] = 1
  firstcode[3] = 1
  firstcode[4] = 2
  firstcode[5] = 0

  T = ...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
  ⇒ 1 extra bit per macro-symbol = 1/k extra bits per symbol
  ⇒ but a larger model has to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:
  the model takes |S|^k * (k * log |S|) + h² bits   (where h might be |S|)
  and H0(S^L) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

Compress + Search ?                             [Moura et al, 98]

Compressed text derived from a word-based Huffman:
  Symbols of the Huffman tree are the words of T
  The Huffman tree has fan-out 128
  Codewords are byte-aligned and tagged: the first bit of the first byte of each
  codeword is the tag, and 7 bits per byte are left for the Huffman code

[Figure: a 128-ary Huffman tree over the words of T = “bzip or not bzip” (bzip, or,
not, space, …); each word gets a byte-aligned, tagged codeword and C(T) is the
concatenation of those codewords]

CGrep and other ideas...

Search for P = bzip directly in the compressed text: encode P with the same
word-based Huffman (in the toy figure, P = bzip → codeword 1a 0b) and run GREP over
C(T); the tag bit marking the first byte of every codeword lets the matcher discard
alignments that do not start at a codeword boundary.

  T = “bzip or not bzip”          Speed ≈ Compression ratio

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary = {bzip, not, or, space},  P = bzip = 1a 0b,  S = “bzip or not bzip”.
Find the occurrences of P by scanning the compressed text C(S) directly.

[Same tagged-Huffman figure as before, with the matching codewords marked yes/no]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons

Strings are also numbers, H: strings → numbers.
Let s be a string of length m:

  H(s) = Σ_{i=1..m} 2^(m−i) · s[i]

  P = 0101   ⇒   H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5

s = s’ if and only if H(s) = H(s’)

Definition:
let Tr denote the m-length substring of T starting at position r (i.e., Tr = T[r, r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons

We can compute H(Tr) from H(Tr-1):

  H(Tr) = 2 · H(Tr-1) − 2^m · T(r-1) + T(r+m-1)

  T = 10110101,  m = 4:
  T1 = 1011,  T2 = 0110
  H(T1) = H(1011) = 11
  H(T2) = 2·11 − 2⁴·1 + 0 = 22 − 16 = 6 = H(0110)

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P = 101111,  q = 7

  H(P) = 47
  Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally:
  1·2 (mod 7) + 0 = 2
  2·2 (mod 7) + 1 = 5
  5·2 (mod 7) + 1 = 4
  4·2 (mod 7) + 1 = 2
  2·2 (mod 7) + 1 = 5
  5 (mod 7) = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1), using
  2^m (mod q) = 2·( 2^(m-1) (mod q) ) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
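A minimal Python sketch of the fingerprint scan (a fixed prime q is used here for illustration, and every fingerprint hit is verified, i.e. the deterministic variant):

def karp_rabin(T, P, q=2**31 - 1):
    """Report all r with Hq(Tr) == Hq(P); verify to discard false matches."""
    n, m = len(T), len(P)
    hp = ht = 0
    for i in range(m):                        # fingerprints of P and of T1
        hp = (2 * hp + int(P[i])) % q
        ht = (2 * ht + int(T[i])) % q
    top = pow(2, m - 1, q)                    # 2^(m-1) mod q, to drop T[r]
    occ = []
    for r in range(n - m + 1):
        if hp == ht and T[r:r + m] == P:      # verification step
            occ.append(r + 1)                 # 1-based, as in the slides
        if r + m < n:                         # roll: drop T[r], append T[r+m]
            ht = (2 * (ht - int(T[r]) * top) + int(T[r + m])) % q
    return occ

print(karp_rabin("10110101", "0101"))   # [5]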

Problem 1: Solution
Dictionary = {bzip, not, or, space},  P = bzip = 1a 0b,  S = “bzip or not bzip”.
Search for P’s codeword directly in C(S); the tag bits mark codeword beginnings,
so candidate matches that fall inside other codewords are discarded.

[Same tagged-Huffman figure as before]          Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

  M is an m × n matrix; here m = 3, n = 10:

        c a l i f o r n i a
    f   0 0 0 0 1 0 0 0 0 0
    o   0 0 0 0 0 1 0 0 0 0
    r   0 0 0 0 0 0 1 0 0 0
How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the (j-1)-th one.
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:  P = abaac

  U(a) = (1,0,1,1,0)ᵀ      U(b) = (0,1,0,0,0)ᵀ      U(c) = (0,0,0,0,1)ᵀ

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

  M(j) = BitShift( M(j-1) )  &  U( T[j] )

For i > 1, entry M(i,j) = 1 iff
  (1) the first i-1 characters of P match the i-1 characters of T ending at
      character j-1        ⇔  M(i-1,j-1) = 1
  (2) P[i] = T[j]          ⇔  the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) into the i-th position; ANDing it with the i-th bit
of U(T[j]) establishes whether both conditions hold.
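A minimal Python sketch of the method, using an integer as the bit-vector (bit i−1 plays the role of row i of the column M(j)):

def shift_and(T, P):
    """Report the ending positions (1-based) of exact occurrences of P in T."""
    m = len(P)
    U = {}                                    # U[c] has bit i set iff P[i+1] == c
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M, goal, occ = 0, 1 << (m - 1), []
    for j, c in enumerate(T, start=1):
        M = ((M << 1) | 1) & U.get(c, 0)      # M(j) = BitShift(M(j-1)) & U(T[j])
        if M & goal:                          # top bit set: all of P matches, ends at j
            occ.append(j)
    return occ

print(shift_and("xabxabaaca", "abaac"))   # [9]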

An example:  P = abaac,  T = xabxabaaca    (m = 5, n = 10)

  j=1:  M(1) = BitShift(M(0)) & U(x) = (1,0,0,0,0)ᵀ & (0,0,0,0,0)ᵀ = (0,0,0,0,0)ᵀ
  j=2:  M(2) = BitShift(M(1)) & U(a) = (1,0,0,0,0)ᵀ & (1,0,1,1,0)ᵀ = (1,0,0,0,0)ᵀ
  j=3:  M(3) = BitShift(M(2)) & U(b) = (1,1,0,0,0)ᵀ & (0,1,0,0,0)ᵀ = (0,1,0,0,0)ᵀ
  ...
  j=9:  M(9) = BitShift(M(8)) & U(c) = (1,1,0,0,1)ᵀ & (0,0,0,0,1)ᵀ = (0,0,0,0,1)ᵀ

After processing T[1..9] the matrix is:

        x a b x a b a a c
    a   0 1 0 0 1 0 1 1 0
    b   0 0 1 0 0 1 0 0 0
    a   0 0 0 0 0 0 1 0 0
    a   0 0 0 0 0 0 0 1 0
    c   0 0 0 0 0 0 0 0 1

M(5,9) = 1  ⇒  an occurrence of P ends at position 9 (i.e. it starts at position 5).

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac

  U(a) = (1,0,1,1,0)ᵀ      U(b) = (1,1,0,0,0)ᵀ      U(c) = (0,0,0,0,1)ᵀ

What about ‘?’, ‘[^…]’ (not).

Problem 1: Another solution
Dictionary = {bzip, not, or, space},  P = bzip = 1a 0b,  S = “bzip or not bzip”.
Scan C(S) with the Shift-And machinery just described, using P’s codeword as the pattern.

[Same tagged-Huffman figure as before]          Speed ≈ Compression ratio

Problem 2
Given a pattern P, find all the occurrences in S of all dictionary terms containing
P as a substring.
Dictionary = {bzip, not, or, space},  P = o,  S = “bzip or not bzip”,
not = 1g 0g 0a,  or = 1g 0a 0b.

[Same tagged-Huffman figure as before]

Speed ≈ Compression ratio?  No!  Why?
A scan of C(S) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: a text T with occurrences of the two patterns P1 and P2 highlighted]

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of length m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Given a pattern P, find all the occurrences in S of all dictionary terms containing
P as a substring, allowing at most k mismatches.
Dictionary = {bzip, not, or, space},  P = bot,  k = 2,  S = “bzip or not bzip”.

[Same tagged-Huffman figure as before]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff there are no more than l mismatches between the first i
characters of P and the i characters of T ending at character j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

  BitShift( M^l (j-1) )  &  U( T[j] )

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

  BitShift( M^(l-1) (j-1) )

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

  M^l (j)  =  [ BitShift( M^l (j-1) ) & U( T[j] ) ]  OR  BitShift( M^(l-1) (j-1) )

Example (k = 1):  P = abaad,  T = xabxabaaca

  M1 =        x a b x a b a a c a        M0 =        x a b x a b a a c a
       a      1 1 1 1 1 1 1 1 1 1             a      0 1 0 0 1 0 1 1 0 1
       b      0 0 1 0 0 1 0 1 1 0             b      0 0 1 0 0 1 0 0 0 0
       a      0 0 0 1 0 0 1 0 0 1             a      0 0 0 0 0 0 1 0 0 0
       a      0 0 0 0 1 0 0 1 0 0             a      0 0 0 0 0 0 0 1 0 0
       d      0 0 0 0 0 0 0 0 1 0             d      0 0 0 0 0 0 0 0 0 0

  M1(5,9) = 1  ⇒  P occurs ending at position 9 with at most 1 mismatch.

How much do we pay?





The running time is O( k n (1 + m/w) )
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary = {bzip, not, or, space},  P = bot,  k = 2,  S = “bzip or not bzip”,
not = 1g 0g 0a.  Run the k-mismatch Shift-And (Agrep) over C(S) to locate the
dictionary terms matching P within k mismatches.

[Same tagged-Huffman figure as before]

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding

  γ(x) = 0^(Length−1)  followed by  x in binary,   with Length = ⌊log2 x⌋ + 1,  x > 0

  e.g., 9 is represented as <000, 1001>.

γ-code for x takes 2⌊log2 x⌋ + 1 bits   (i.e. a factor of 2 from optimal)

Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

  Answer:  8,  6,  3,  59,  7

Analysis
Sort the p_i in decreasing order, and encode s_i via the variable-length code γ(i).

Recall that: |γ(i)| ≤ 2 log i + 1
How good is this approach wrt Huffman?   Compression ratio ≤ 2 H0(S) + 1
Key fact:   1 ≥ Σ_{j=1..i} p_j ≥ i · p_i    ⇒    i ≤ 1/p_i

How good is it ?
Encode the integers via γ-coding:  |γ(i)| ≤ 2 log i + 1.
The cost of the encoding is (recall i ≤ 1/p_i):

  Σ_{i=1..|S|} p_i · |γ(i)|  ≤  Σ_{i=1..|S|} p_i · [ 2 log (1/p_i) + 1 ]  =  2 H0(X) + 1

Not much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1n 2n 3n… nn  Huff = O(n2 log n), MTF = O(n log n) + n2

No much worse than Huffman
...but it may be far better

MTF: how good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
Put S in the front and consider the cost of encoding:
S

O ( S log S ) 



nx

g(p - p )

x 1 i  2

x

x

i

i -1

S

By Jensen’s:

 O ( S log S ) 

n
x 1

x

[ 2 * log

N

 1]

nx

 O ( S log S )  N * [ 2 * H 0 ( X )  1]
L a [ mtf ]  2 * H 0 ( X )  O (1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1n 2n 3n… nn 

There is a memory

Huff(X) = Q(n2 log n) > Rle(X) = Q(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.
1.0

c = .3

0.7

i -1

f (i ) 



p( j)

j 1

b = .5
0.2
0.0

f(a) = .0, f(b) = .2, f(c) = .7
a = .2

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
1.0

0.7
c = .3

0.7

c = .3
0.55

0.0

a = .2

0.3
0.2

c = .3

0.27
b = .5

b = .5
0.2

0.3

a = .2

b = .5
0.22

a = .2

0.2

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l 0  0 l i  l i -1  s i -1 * f c i 

s0  1

s i  s i -1 * p c i 

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

n

sn 

 p c 
i

i 1

The interval for a message sequence will be called the
sequence interval

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
[Figure: successive refinement of the interval while decoding]
  .49 ∈ [.2, .7)     → b ;  refine within [.2, .7)
  .49 ∈ [.3, .55)    → b ;  refine within [.3, .55)
  .49 ∈ [.475, .55)  → c

The message is bbc.

Representing a real number
Binary fractional representation:
  .75   = .11
  1/3   = .0101 0101 …
  11/16 = .1011

Algorithm (emit the binary expansion of x ∈ [0,1)):
  1.  x = 2·x
  2.  If x < 1 output 0
  3.  else x = x − 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
  number    min       max       interval
  .11       .110…     .111…     [.75, 1.0)
  .101      .1010…    .1011…    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

[Figure: sequence interval [.61, .79) contains the code interval of .101, i.e. [.625, .75)]

Can use L + s/2 truncated to 1 + ⌈log2 (1/s)⌉ bits
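A tiny Python sketch of this truncation rule (illustrative; it continues the floating-point sketch above and is not the integer coder):

import math

def code_bits(l, s):
    # x = l + s/2 truncated to 1 + ceil(log2(1/s)) bits
    nbits = 1 + math.ceil(math.log2(1.0 / s))
    x = l + s / 2.0
    bits = []
    for _ in range(nbits):
        x *= 2.0
        if x < 1.0:
            bits.append('0')
        else:
            bits.append('1')
            x -= 1.0
    return ''.join(bits)

print(code_bits(0.27, 0.03))   # '0100100': its code interval lies inside [.27, .3)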

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
  Output 1 followed by m 0s
  m = 0
  Message interval is expanded by 2

If u < R/2 then (bottom half)
  Output 0 followed by m 1s
  m = 0
  Message interval is expanded by 2

If l ≥ R/4 and u < 3R/4 then (middle half)
  Increment m
  Message interval is expanded by 2

All other cases: just continue...
You find this at

Arithmetic ToolBox
As a state machine

[Figure: the ATB maps the current interval (L, s) and the next symbol c, drawn from the distribution (p_1, …, p_|Σ|), to the new interval (L', s') with L' = L + s·f[c] and s' = s·p[c]]

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.
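A much-simplified Python sketch of the PPM idea just described (a toy model, not any specific PPM variant; here the escape count equals the number of distinct symbols seen in a context, one possible heuristic):

from collections import defaultdict

class SimplePPM:
    def __init__(self, k):
        self.k = k
        self.counts = defaultdict(lambda: defaultdict(int))   # context -> symbol -> count

    def update(self, history, c):
        for j in range(self.k + 1):                            # context sizes 0..k
            ctx = history[max(0, len(history) - j):]
            self.counts[ctx][c] += 1

    def prob(self, history, c):
        # walk down from the longest context, multiplying escape probabilities
        p = 1.0
        for j in range(self.k, -1, -1):
            ctx = history[max(0, len(history) - j):]
            tab = self.counts.get(ctx)
            if not tab:
                continue
            total = sum(tab.values()) + len(tab)               # +len(tab) for the escape
            if c in tab:
                return p * tab[c] / total
            p *= len(tab) / total                              # escape, fall back to a shorter context
        return p / 256                                         # uniform order(-1) model

model = SimplePPM(k=2)
text = "ACCBACCACBA"
for i, ch in enumerate(text):
    model.update(text[:i], ch)
print(model.prob(text, "B"))    # P(next = B | ...BA) under this toy model

A real implementation would feed each of these conditional probabilities (including the escapes) to the arithmetic coder.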

PPM + Arithmetic ToolBox
[Figure: PPM feeds the ATB with p[ s | context ], where s is either the next character c or the escape symbol esc; the ATB maps (L, s) to (L', s') as before]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts        String = ACCBACCACBA B        k = 2

  Context    Counts
  Empty      A = 4   B = 2   C = 5   $ = 3

  Context    Counts
  A          C = 3   $ = 1
  B          A = 2   $ = 1
  C          A = 1   B = 2   C = 2   $ = 3

  Context    Counts
  AC         B = 1   C = 2   $ = 2
  BA         C = 1   $ = 1
  CA         C = 1   $ = 1
  CB         A = 2   $ = 1
  CC         A = 1   B = 1   $ = 2
You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
[Figure: Dictionary = all substrings starting in the already-scanned prefix; Cursor = current position; example output <2,3,c>]

Algorithm's step:
 • Output <d, len, c> where
   d = distance of the copied string wrt the current position
   len = length of the longest match
   c = next char in the text beyond the longest match
 • Advance by len + 1

A buffer “window” has fixed length and moves
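A compact Python sketch of this greedy step with a sliding window (illustrative; the window size W and the triple format are those of the example that follows):

def lz77_encode(T, W):
    out, i = [], 0
    while i < len(T):
        best_d, best_len = 0, 0
        for j in range(max(0, i - W), i):          # candidate match starts inside the window
            l = 0
            while i + l < len(T) - 1 and T[j + l] == T[i + l]:   # overlap allowed; keep a next char
                l += 1
            if l > best_len:
                best_d, best_len = i - j, l
        nxt = T[i + best_len]                      # next char beyond the longest match
        out.append((best_d, best_len, nxt))
        i += best_len + 1                          # advance by len + 1
    return out

print(lz77_encode("aacaacabcabaaac", 6))
# [(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')]  -- same triples as the example below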

Example: LZ77 with window
a a c a a c a b c a b a a a c

Emitted triples (cursor position in parentheses):
  (0,0,a)    cursor at 1
  (1,1,c)    cursor at 2
  (3,4,b)    cursor at 4
  (3,3,a)    cursor at 9
  (1,2,c)    cursor at 13

Window size = 6
The longest match is searched within W; the third component is the next character.

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids
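A short Python sketch of the coding/decoding loops just listed (illustrative; a plain dictionary of strings stands in for the trie):

def lz78_encode(T):
    D, out, i = {}, [], 0                 # phrase -> id (a trie in practice)
    while i < len(T):
        S, last_id = "", 0
        while i < len(T) and S + T[i] in D:
            S += T[i]; last_id = D[S]; i += 1
        c = T[i] if i < len(T) else ""
        out.append((last_id, c))          # id of the longest match, plus the next char
        D[S + c] = len(D) + 1             # add Sc to the dictionary
        i += 1
    return out

def lz78_decode(codes):
    D, out = {0: ""}, []
    for pid, c in codes:
        phrase = D[pid] + c
        out.append(phrase)
        D[len(D)] = phrase                # rebuild the same dictionary
    return "".join(out)

codes = lz78_encode("aabaacabcabcb")
print(codes)                 # [(0,'a'), (1,'b'), (1,'a'), (0,'c'), (2,'c'), (5,'b')]
print(lz78_decode(codes))    # aabaacabcabcb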

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
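A Python sketch of LZW on the tiny 3-letter alphabet of the example that follows (a = 112, b = 113, c = 114 as in the slides; new entries start at 256, an assumption of this sketch); the decoder handles the special case of a code that has just been created by the encoder:

def lzw_encode(T, first=("a", "b", "c"), base=112):
    D = {ch: base + i for i, ch in enumerate(first)}
    out, S, nxt = [], "", 256
    for ch in T:
        if S + ch in D:
            S += ch
        else:
            out.append(D[S])
            D[S + ch] = nxt; nxt += 1          # add Sc, but do not send c
            S = ch
    out.append(D[S])
    return out

def lzw_decode(codes, first=("a", "b", "c"), base=112):
    D = {base + i: ch for i, ch in enumerate(first)}
    nxt, prev = 256, D[codes[0]]
    out = [prev]
    for code in codes[1:]:
        cur = D[code] if code in D else prev + prev[0]   # special case: code not yet in D
        out.append(cur)
        D[nxt] = prev + cur[0]; nxt += 1                 # one step behind the encoder
        prev = cur
    return "".join(out)

codes = lzw_encode("aabaacababacb")
print(codes)               # [112, 112, 113, 256, 114, 257, 261, 114, 113]
print(lzw_decode(codes))   # aabaacababacb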

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
  Input    Output (so far)              Dict
  112      a
  112      a a                          256=aa
  113      a a b                        257=ab
  256      a a b a a                    258=ba
  114      a a b a a c                  259=aac
  257      a a b a a c a b ?            260=ca
  261      a a b a a c a b a b a        261=aba    (resolved one step later)
  114      a a b a a c a b a b a c      262=abac

The decoder is one step behind the coder: when code 261 arrives it is not yet in the dictionary, and it is resolved as the previous string plus its own first character (ab + a = aba).

LZ78 and LZW issues
How do we keep the dictionary small?
 • Throw the dictionary away when it reaches a certain size (used in GIF)
 • Throw the dictionary away when it is no longer effective at compressing (used in Unix compress)
 • Throw the least-recently-used (LRU) entry away when it reaches a certain size (used in BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform     (1994)
Consider the text T = mississippi#

All cyclic rotations of T:
  mississippi#   ississippi#m   ssissippi#mi   sissippi#mis
  issippi#miss   ssippi#missi   sippi#missis   ippi#mississ
  ppi#mississi   pi#mississip   i#mississipp   #mississippi

Sort the rows; F = first column, L = last column:
  F               L
  # mississipp    i
  i #mississip    p
  i ppi#missis    s
  i ssippi#mis    s
  i ssissippi#    m
  m ississippi    #
  p i#mississi    p
  p pi#mississ    i
  s ippi#missi    s
  s issippi#mi    s
  s sippi#miss    i
  s sissippi#m    i

A famous example: on much longer texts the clustering effect in L is far stronger.

A useful tool: the L → F mapping
[Figure: the same sorted-rotations matrix, with the F and L columns as above; T is unknown to the decoder]

How do we map L's chars onto F's chars ?
... we need to distinguish equal chars in F...

Take two equal chars of L and rotate their rows rightward by one position: the two rows keep the same relative order. Hence equal chars appear in the same relative order in L and in F.

The BWT is invertible
[Figure: the same sorted-rotations matrix with columns F and L; T is unknown]

Two key properties:
1. The LF-array maps L's chars onto F's chars
2. L[ i ] precedes F[ i ] in T

Reconstruct T backward:
T = .... i p p i #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}

How to compute the BWT ?
  SA     BWT matrix (sorted rotations)    L
  12     #mississippi                     i
  11     i#mississipp                     p
   8     ippi#mississ                     s
   5     issippi#miss                     s
   2     ississippi#m                     m
   1     mississippi#                     #
  10     pi#mississip                     p
   9     ppi#mississi                     i
   7     sippi#missis                     s
   4     sissippi#mis                     s
   6     ssippi#missi                     i
   3     ssissippi#mi                     i

We said that: L[i] precedes F[i] in T
Given SA and T, we have L[i] = T[SA[i]−1]      (e.g. L[3] = T[SA[3]−1] = T[7])
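A Python sketch of both directions (illustrative: the forward transform sorts all rotations, i.e. the "elegant but inefficient" way, and the inverse uses the LF mapping; the sentinel handling is an assumption of this sketch):

def bwt(T):
    n = len(T)
    rows = sorted(T[i:] + T[:i] for i in range(n))      # all cyclic rotations, sorted
    return "".join(row[-1] for row in rows)              # L = last column

def inverse_bwt(L, end="#"):
    n = len(L)
    # stable sort of L gives F; LF[i] = position in F of the char L[i]
    F = sorted(range(n), key=lambda i: (L[i], i))
    LF = [0] * n
    for r, i in enumerate(F):
        LF[i] = r
    r = L.index(end)                                     # row whose last char is the sentinel
    out = []
    for _ in range(n):                                   # reconstruct T backward
        out.append(L[r])
        r = LF[r]
    return "".join(reversed(out))

print(bwt("mississippi#"))           # ipssm#pissii
print(inverse_bwt("ipssm#pissii"))   # mississippi#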

How to construct SA from T ?                Input: T = mississippi#
  SA
  12    #
  11    i#
   8    ippi#
   5    issippi#
   2    ississippi#
   1    mississippi#
  10    pi#
   9    ppi#
   7    sippi#
   4    sissippi#
   6    ssippi#
   3    ssissippi#

Elegant but inefficient: sort the suffixes by direct comparison.
Obvious inefficiencies:
 • Θ(n² log n) time in the worst-case
 • Θ(n log n) cache misses or I/O faults
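The "elegant but inefficient" construction is a one-liner (Python sketch; 1-based positions as in the slides):

def suffix_array(T):
    # direct comparison-based sort of the suffixes: simple, but Θ(n² log n) in the worst case
    return sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])

print(suffix_array("mississippi#"))
# [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]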

Many algorithms, now...

Compressing L seems promising...
Key observation:
 L is locally homogeneous  →  L is highly compressible

Algorithm Bzip :
 • Move-to-Front coding of L
 • Run-Length coding
 • Statistical coder

Bzip vs. Gzip: 20% vs. 33% compression, but Bzip is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii            (# at position 16)

Mtf-list = [i,m,p,s]
Mtf  = 020030000030030200300300000100000
Mtf  = 030040000040040300400400000200000          Bin(6)=110, Wheeler's code
RLE0 = 03141041403141410210                       alphabet of |Σ|+1 symbols

Bzip2-output = Arithmetic/Huffman on |Σ|+1 symbols...
... plus γ(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web's Characteristics
 • Size
   - 1 trillion pages available (Google 7/08)
   - 5-40K per page => hundreds of terabytes
   - Size grows every day!!
 • Change
   - 8% new pages, 25% new links change weekly
   - Life time of about 10 days

The Bow Tie

Some definitions
 • Weakly connected components (WCC)
   Set of nodes such that from any node one can reach any other node via an undirected path.
 • Strongly connected components (SCC)
   Set of nodes such that from any node one can reach any other node via a directed path.

Observing the Web Graph
 • We do not know which percentage of it we know
 • The only way to discover the graph structure of the web as hypertext is via large scale crawls
 • Warning: the picture might be distorted by
   - Size limitation of the crawl
   - Crawling rules
   - Perturbations of the "natural" process of birth and death of nodes and links

Why is it interesting?
 • Largest artifact ever conceived by humankind
 • Exploit the structure of the Web for
   - Crawl strategies
   - Search
   - Spam detection
   - Discovering communities on the web
   - Classification/organization
 • Predict the evolution of the Web
   - Sociological understanding

Many other large graphs…
 • Physical network graph
   V = Routers
   E = communication links
 • The "cosine" graph (undirected, weighted)
   V = static web pages
   E = semantic distance between pages
 • Query-Log graph (bipartite, weighted)
   V = queries and URLs
   E = (q,u) if u is a result for q, and has been clicked by some user who issued q
 • Social graph (undirected, unweighted)
   V = users
   E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)
 • V = URLs, E = (u,v) if u has a hyperlink to v
 • Isolated URLs are ignored (no IN & no OUT)

Three key properties; the first one:
 • Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution
Altavista crawl (1999) and WebBase crawl (2001): the indegree follows a power-law distribution

  Pr[ in-degree(u) = k ]  ≈  1 / k^α ,     α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)
 • V = URLs, E = (u,v) if u has a hyperlink to v
 • Isolated URLs are ignored (no IN, no OUT)

Three key properties:
 • Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1
 • Locality: usually most of the hyperlinks point to other URLs on the same host (about 80%).
 • Similarity: pages close in lexicographic order tend to share many outgoing lists

A Picture of the Web Graph
[Figure: adjacency matrix of a crawl — 21 million pages, 150 million links]

URL-sorting
[Figure: URLs sorted lexicographically, so that pages of the same host (e.g. Berkeley, Stanford) become adjacent]
URL compression + Delta encoding

The library WebGraph
 • Uncompressed adjacency list
 • Adjacency list with compressed gaps   (exploits locality)
   - Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}
   - For negative entries: map an integer v to 2v if v ≥ 0, and to 2|v|-1 if v < 0

 • Adjacency list with copy-lists   (exploits similarity)
   - Reference chains, possibly limited
   - Each bit of y's copy-list tells whether the corresponding successor of the reference x is also a successor of y
   - The reference index is chosen in [0,W] so as to give the best compression

 • Adjacency list with copy-blocks = RLE(copy-list)
   - The first copy block is 0 if the copy list starts with 0
   - The last block is omitted (we know the length…)
   - The length is decremented by one for all blocks

This is a Java and C++ lib      (≈3 bits/edge)
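A toy Python sketch of the two ideas above — gap coding of a sorted successor list and a copy-list wrt a reference list; this is not the actual WebGraph on-disk format, and the example lists are made up:

def int2nat(v):
    # map possibly-negative gaps to naturals: 2v if v >= 0, 2|v|-1 otherwise
    return 2 * v if v >= 0 else 2 * (-v) - 1

def gap_encode(x, succ):
    # succ = sorted successor list of node x; the first gap may be negative
    return [int2nat(succ[0] - x)] + [succ[i] - succ[i - 1] - 1 for i in range(1, len(succ))]

def copy_list(succ, ref_succ):
    # one bit per successor of the reference: is it also a successor of the current node?
    s, r = set(succ), set(ref_succ)
    bits = [1 if v in s else 0 for v in ref_succ]
    extra = [v for v in succ if v not in r]        # residuals, then gap-encoded
    return bits, extra

print(gap_encode(15, [13, 15, 16, 17, 18, 19, 23, 24, 203]))
# [3, 1, 0, 0, 0, 0, 3, 0, 178]
print(copy_list([13, 15, 16, 17, 18, 19, 23, 24, 203],
                [13, 15, 16, 17, 18, 19, 23, 24, 315, 316]))
# ([1, 1, 1, 1, 1, 1, 1, 1, 0, 0], [203])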

Extra-nodes: Compressing Intervals
Consecutive runs among the extra-nodes are coded as intervals:
 • Intervals: use their left extreme and length
 • Interval length: decremented by Lmin = 2
 • Residuals: differences between consecutive residuals, or wrt the source node

Example (left extremes and residuals use the mapping for possibly-negative values):
  0    = (15-15)*2       (positive)
  2    = (23-19)-2       (jump >= 2)
  600  = (316-16)*2
  3    = |13-15|*2-1     (negative)
  3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
[Figure: a sender transmits data to a receiver over a network link; the receiver may already hold some knowledge about the data]

 • network links are getting faster and faster, but
 • many clients are still connected by fairly slow links (mobile?)
 • people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques
 • caching: "avoid sending the same object again"
   - done on the basis of whole objects
   - only works if objects are completely unchanged
   - How about objects that are slightly changed?
 • compression: "remove redundancy in transmitted data"
   - avoid repeated substrings in data
   - can be extended to the history of past transmissions (overhead)
   - What if the sender has never seen the data at the receiver ?

Types of Techniques
 • Common knowledge between sender & receiver
   - Unstructured file: delta compression
 • "Partial" knowledge
   - Unstructured files: file synchronization
   - Record-based data: set reconciliation

Formalization
 • Delta compression                       [diff, zdelta, REBL,…]
   - Compress file f deploying file f'
   - Compress a group of files
   - Speed-up web access by sending the difference between the requested page and the ones available in cache
 • File synchronization                    [rsync, zsync]
   - Client updates an old file f_old with the f_new available on a server
   - Mirroring, Shared Crawling, Content Distr. Net
 • Set reconciliation
   - Client updates a structured old file f_old with the f_new available on a server
   - Update of contacts or appointments, intersect IL in P2P search engines

Z-delta compression      (one-to-one)
Problem: We have two files f_known and f_new and the goal is to compute a file f_d of minimum size such that f_new can be derived from f_known and f_d
 • Assume that block moves and copies are allowed
 • Find an optimal covering set of f_new based on f_known
 • The LZ77-scheme provides an efficient, optimal solution
   - f_known is the "previously encoded text": compress the concatenation f_known · f_new, starting from f_new
 • zdelta is one of the best implementations

            Emacs size    Emacs time
  uncompr   27Mb          ---
  gzip      8Mb           35 secs
  zdelta    1.5Mb         42 secs

Efficient Web Access
Dual-proxy architecture: a pair of proxies located on the two sides of the slow link use a proprietary protocol to increase performance over this link

[Figure: Client ↔ client-side proxy ↔ (slow link, delta-encoding) ↔ server-side proxy ↔ (fast link) ↔ web; both proxies keep the reference page for each request]

Use zdelta to reduce traffic:
 • Old version available at both proxies
 • Restricted to pages already visited (30% hits), URL-prefix match
 • Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F
 • Useful on a dynamic collection of web pages, back-ups, …

 • Apply pairwise zdelta: find for each f ∈ F a good reference
 • Reduction to the Min Branching problem on DAGs
   - Build a weighted graph G_F: nodes = files, weights = zdelta-size
   - Insert a dummy node connected to all, whose edge weights are the gzip-coded sizes
   - Compute the min branching = directed spanning tree of min total cost, covering G's nodes

[Figure: example graph with dummy node 0 and files 1, 2, 3, 5; edge weights such as 20, 123, 220, 620, 2000]

            space    time
  uncompr   30Mb     ---
  tgz       20%      linear
  THIS      8%       quadratic

Improvement: what about many-to-one compression?   (group of files)
Problem: Constructing G is very costly: n² edge calculations (zdelta executions)
 • We wish to exploit some pruning approach
   - Collection analysis: cluster the files that appear similar and are thus good candidates for zdelta-compression; build a sparse weighted graph G'_F containing only the edges between those pairs of files
   - Assign weights: estimate appropriate edge weights for G'_F, thus saving zdelta executions. Nonetheless, strictly n² time

            space    time
  uncompr   260Mb    ---
  tgz       12%      2 mins
  THIS      8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
[Figure: the Client holds f_old, the Server holds f_new; the client sends a request and receives an update]

 • client wants to update an out-dated file
 • server has the new file but does not know the old file
 • update without sending the entire f_new (using similarity)
 • rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch, since the server has both copies of the files

The rsync algorithm
[Figure: the Client sends to the Server the hashes of f_old's blocks; the Server replies with an encoded file made of block references and literals]

The rsync algorithm (contd)
 • simple, widely used, single roundtrip
 • optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
 • choice of block size problematic (default: max{700, √n} bytes)
 • not good in theory: granularity of changes may disrupt use of blocks
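A much-simplified, single-round Python sketch of the idea (illustrative only: a toy block size and Python's built-in hash stand in for rsync's rolling checksum + MD5, and no collision check is done):

BLOCK = 4   # toy block size; rsync's default is max(700, sqrt(n)) bytes

def client_hashes(f_old):
    blocks = [f_old[i:i + BLOCK] for i in range(0, len(f_old), BLOCK)]
    return {hash(b): i for i, b in enumerate(blocks)}

def server_encode(f_new, hashes):
    ops, i, lit = [], 0, ""
    while i < len(f_new):
        b = f_new[i:i + BLOCK]
        if len(b) == BLOCK and hash(b) in hashes:
            if lit: ops.append(("lit", lit)); lit = ""
            ops.append(("copy", hashes[hash(b)]))          # reference to a block of f_old
            i += BLOCK
        else:
            lit += f_new[i]; i += 1                        # slide by one char, like a rolling hash
    if lit: ops.append(("lit", lit))
    return ops

def client_decode(ops, f_old):
    blocks = [f_old[i:i + BLOCK] for i in range(0, len(f_old), BLOCK)]
    return "".join(blocks[a] if op == "copy" else a for op, a in ops)

f_old, f_new = "the quick brown fox!", "the quick red-brown fox!"
ops = server_encode(f_new, client_hashes(f_old))
print(ops)                                   # block copies plus one literal "k red-br"
print(client_decode(ops, f_old) == f_new)    # True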

Rsync: some experiments
            gcc size    emacs size
  total     27288       27326
  gzip      7563        8577
  zdelta    227         1431
  rsync     964         4452

Compressed size in KB (slightly outdated numbers)
Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync
 • The server sends the hashes (unlike the client in rsync), and the client checks them
 • The server deploys the common f_ref to compress the new f_tar (rsync compresses just it)

A multi-round protocol
 • k blocks of n/k elems, log(n/k) levels
 • If the distance is k, then on each level ≤ k hashes do not find a match in the other file
 • The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets S_A and S_B of integer values located on two machines A and B, determine the difference between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size k of the difference, where k may or may not be known in advance to the parties.

Note:
 • set reconciliation is "easier" than file sync [it is record-based]
 • Not perfectly true but...


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
  iff  P is a prefix of the i-th suffix of T (i.e. T[i,N])

[Figure: P aligned against T at position i, covering a prefix of T[i,N]]

Occurrences of P in T = all suffixes of T having P as a prefix
  e.g. P = si, T = mississippi  →  occurrences at positions 4, 7

SUF(T) = sorted set of suffixes of T

Reduction: from substring search to prefix search (over the suffixes of T)

The Suffix Tree
[Figure: the suffix tree of T# = mississippi#: edges are labelled with substrings (#, i, s, p, si, ssi, ppi#, pi#, i#, mississippi#, …) and the 12 leaves are labelled with the starting positions 1..12 of the corresponding suffixes]

T# = mississippi#     (positions 1..12)

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

Storing SUF(T) explicitly takes Θ(N²) space; store only the suffix pointers:

  SA     SUF(T)
  12     #
  11     i#
   8     ippi#
   5     issippi#
   2     ississippi#
   1     mississippi#
  10     pi#
   9     ppi#
   7     sippi#
   4     sissippi#
   6     ssippi#
   3     ssissippi#

T = mississippi#       (e.g. P = si)

Suffix Array space:
 • SA: Θ(N log2 N) bits
 • Text T: N chars
 → In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison

[Figure: binary search over SA for P = si on T = mississippi#; at each step P is compared with the suffix pointed to by the middle SA entry ("P is larger" / "P is smaller"), 2 memory accesses per step]

Suffix Array search
 • O(log2 N) binary-search steps
 • Each step takes O(p) char comparisons
 → overall, O(p log2 N) time
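A Python sketch of the indirect binary search (illustrative; SA holds 1-based positions as in the slides, and only the first |P| characters of each probed suffix are compared):

def sa_search(T, SA, P):
    def pref(k):                                  # first |P| chars of the k-th smallest suffix
        i = SA[k]
        return T[i - 1:i - 1 + len(P)]
    lo, hi = 0, len(SA)                           # leftmost suffix whose prefix is >= P
    while lo < hi:
        mid = (lo + hi) // 2
        if pref(mid) < P: lo = mid + 1
        else: hi = mid
    first = lo
    lo, hi = first, len(SA)                       # leftmost suffix whose prefix is > P
    while lo < hi:
        mid = (lo + hi) // 2
        if pref(mid) <= P: lo = mid + 1
        else: hi = mid
    return [SA[k] for k in range(first, lo)]      # starting positions of the occ occurrences

T = "mississippi#"
SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(sorted(sa_search(T, SA, "si")))             # [4, 7]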

Improvements: O(p + log2 N) time [Manber-Myers, '90], O(p + log |Σ|) [Cole et al, '06]

Locating the occurrences
[Figure: binary-searching for si# and si$ (with # smaller and $ larger than every character) delimits the contiguous range of suffixes prefixed by si — sippi…, sissippi… — hence occ = 2, at positions 4 and 7]

Suffix Array search
 • O(p + log2 N + occ) time

Related structures:
 • Suffix Trays: O(p + log2 |Σ| + occ)      [Cole et al., '06]
 • String B-tree                             [Ferragina-Grossi, '95]
 • Self-adjusting Suffix Arrays              [Ciriani et al., '02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

  Lcp    SA    suffix
         12    #
   0     11    i#
   1      8    ippi#
   1      5    issippi#
   4      2    ississippi#
   0      1    mississippi#
   0     10    pi#
   1      9    ppi#
   0      7    sippi#
   2      4    sissippi#
   1      6    ssippi#
   3      3    ssissippi#

T = mississippi#     (e.g. Lcp = 4 for the adjacent suffixes issippi# and ississippi#)

• How long is the common prefix between T[i,...] and T[j,...] ?
  • Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  • Search for some Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  • Search for a window Lcp[i,i+C-2] whose entries are all ≥ L
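A short Python sketch of the Lcp array and of the first repeated-substring query (illustrative; the naive pairwise lcp is quadratic in the worst case, which is fine for this toy input):

def lcp_array(T, SA):
    # Lcp[i] = lcp between the suffixes starting at SA[i-1] and SA[i]  (1-based positions)
    def lcp(a, b):
        s1, s2 = T[a - 1:], T[b - 1:]
        l = 0
        while l < min(len(s1), len(s2)) and s1[l] == s2[l]:
            l += 1
        return l
    return [lcp(SA[i - 1], SA[i]) for i in range(1, len(SA))]

def has_repeat(T, SA, L):
    # does a substring of length >= L occur at least twice ?
    return any(v >= L for v in lcp_array(T, SA))

T = "mississippi#"
SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(lcp_array(T, SA))        # [0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3]
print(has_repeat(T, SA, 4))    # True  ("issi" occurs twice)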



(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Q(n2 log n) time in the worst-case
• Q(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in - degree ( u )  k ] 

a  2.1

1

k

a

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides and efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
Emacs size

Emacs time

uncompr

27Mb

---

gzip

8Mb

35 secs

zdelta

1.5Mb

42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

0

620

2

20

123

2000

20
220

3

1
5

space

time

uncompr

30Mb

---

tgz

20%

linear

THIS

8%

quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
space

time

uncompr

260Mb

---

tgz

12%

2 mins

THIS

8%

16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

gcc size
total
27288
gzip
7563
zdelta
227
rsync
964

emacs size
27326
8577
1431
4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
#

0

s

i

1
ssi

ppi#

4

#

1

p

i

3

2

1

ppi#

6

pi#

7

8
5

T# = mississippi#
2 4 6 8 10

si

ppi#

i#

ppi#

11

mississippi#

12

2

1 10

9

4

3

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
5

Q(N2) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Q(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does it exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does it exist a substring of length ≥ L occurring ≥ C times ?
• Search for Lcp[i,i+C-2] whose entries are ≥ L


Slide 54

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with i-th bit of U(T[j]) to estabilish if both are true

An example j=1
1 2 3 4 5 6 7 8 9 10

n
m

1

12345

1

0

P=abaac

2

0

3

0

4

0

5

0

T=xabxabaaca

0
 
0
U (x)  0
 
0
 
0

2

3

4

5

6

7

1 
 
0
BitShift ( M ( 0 )) & U (T [1])   0  &
 
0
 
0

8

9

0 0
   
0 0
0  0
   
0 0
   
0 0

1
0

An example j=2
1 2 3 4 5 6 7 8 9 10

n
m

1

2

12345

1

0

1

P=abaac

2

0

0

3

0

0

4

0

0

5

0

0

T=xabxabaaca

1 
 
0
U (a )   1 
 
1 
 
0

3

4

5

6

7

1 
 
0
BitShift ( M (1)) & U (T [ 2 ])   0  &
 
0
 
0

8

9

1  1 
   
0 0
1    0 
   
1   0 
   
0 0

1
0

An example j=3
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

12345

1

0

1

0

P=abaac

2

0

0

1

3

0

0

0

4

0

0

0

5

0

0

0

T=xabxabaaca

0
 
1 
U (b )   0 
 
0
 
0

4

5

6

7

1 
 
1 
BitShift ( M ( 2 )) & U (T [ 3 ])   0  &
 
0
 
0

8

9

0 0
   
1  1 
0  0
   
0 0
   
0 0

1
0

An example j=9
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

4

5

6

7

8

9

12345

1

0

1

0

0

1

0

1

1

0

P=abaac

2

0

0

1

0

0

1

0

0

0

3

0

0

0

0

0

0

1

0

0

4

0

0

0

0

0

0

0

1

0

5

0

0

0

0

0

0

0

0

1

T=xabxabaaca

0
 
0
U (c )   0 
 
0
 
1 

1 
 
1 
BitShift ( M ( 8 )) & U (T [ 9 ])   0  &
 
0
 
1 

0 0
   
0 0
0  0
   
0 0
   
1  1 

1
0

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M ( j - 1))  U (T [ j ])
l

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M

l -1

( j - 1))

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1
1 2 3 4 5 6 7 8 910

M1=

T=xabxabaaca
P=
abaad

1

2

3

4

5

6

7

8

9

1
0

1

1

1

1

1

1

1

1

1

1

1

2

0

0

1

0

0

1

0

1

1

0

3

0

0

0

1

0

0

1

0

0

1

4

0

0

0

0

1

0

0

1

0

0

5

0

0

0

0

0

0

0

0

1

0

M0=

1

2

3

4

5

6

7

8

9

1
0

1

0

1

0

0

1

0

1

1

0

1

2

0

0

1

0

0

1

0

0

0

0

3

0

0

0

0

0

0

1

0

0

0

4

0

0

0

0

0

0

0

1

0

0

5

0

0

0

0

0

0

0

0

0

0

How much do we pay?





The running time is O(kn(1+m/w)
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Delection: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
0 000 ...........0 x in b in ary
Length-1


x > 0 and Length = log2 x +1
e.g., 9 represented as <000,1001>.



g-code for x takes 2 log2 x +1 bits

(ie. factor of 2 from optimal)



Optimal for Pr(x) = 1/2x2, and i.i.d integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8

6

3

59

7

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How much good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
1 ≥Si=1,...,x pi ≥ x * px  x ≤ 1/px

How good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):



p i  | g ( i ) |

i  1 ,..., S

This is:



i  1 ,.., S

p i  [ 2 * log

1

 1]

pi

 2 * H 0 ( X ) 1
No much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1n 2n 3n… nn  Huff = O(n2 log n), MTF = O(n log n) + n2

No much worse than Huffman
...but it may be far better

MTF: how good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
Put S in the front and consider the cost of encoding:
S

O ( S log S ) 



nx

g(p - p )

x 1 i  2

x

x

i

i -1

S

By Jensen’s:

 O ( S log S ) 

n
x 1

x

[ 2 * log

N

 1]

nx

 O ( S log S )  N * [ 2 * H 0 ( X )  1]
L a [ mtf ]  2 * H 0 ( X )  O (1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1n 2n 3n… nn 

There is a memory

Huff(X) = Q(n2 log n) > Rle(X) = Q(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.
1.0

c = .3

0.7

i -1

f (i ) 



p( j)

j 1

b = .5
0.2
0.0

f(a) = .0, f(b) = .2, f(c) = .7
a = .2

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
1.0

0.7
c = .3

0.7

c = .3
0.55

0.0

a = .2

0.3
0.2

c = .3

0.27
b = .5

b = .5
0.2

0.3

a = .2

b = .5
0.22

a = .2

0.2

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l 0  0 l i  l i -1  s i -1 * f c i 

s0  1

s i  s i -1 * p c i 

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

n

sn 

 p c 
i

i 1

The interval for a message sequence will be called the
sequence interval

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
1.0

0.7
c = .3

0.7
0.49

b = .5

c = .3

0.55
0.49

0.2

0.0

0.55

b = .5

0.3
a = .2

The message is bbc.

0.2

0.49
0.475

c = .3

b = .5
0.35

a = .2

0.3

a = .2

Representing a real number
Binary fractional representation:
. 75

 . 11

1/ 3

 . 01 01

11 / 16

 . 1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
m in

m ax

interval

.11

.110

.111

[.75 ,1.0 )

.101

.1010

.1011

[.625 , .75 )

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers are expensive.
Key ideas of the integer version:
 Keep integers in range [0..R) where R = 2^k
 Use rounding to generate integer intervals
 Whenever the sequence interval falls into the top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
Output 1 followed by m 0s
m = 0
Message interval is expanded by 2

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m = 0
Message interval is expanded by 2

If l ≥ R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

In all other cases, just continue...
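
A sketch of just this renormalization step in Python (l and u are the integer endpoints in [0,R), m is the count of pending middle expansions, emit is a hypothetical bit-output callback; conventions vary slightly across implementations):

def renormalize(l, u, m, R, emit):
    # expand the interval until it spans more than half of [0, R)
    while True:
        if u < R // 2:                        # bottom half: emit 0 and m pending 1s
            emit(0)
            for _ in range(m): emit(1)
            m = 0
        elif l >= R // 2:                     # top half: emit 1 and m pending 0s
            emit(1)
            for _ in range(m): emit(0)
            m = 0
            l -= R // 2; u -= R // 2
        elif l >= R // 4 and u < 3 * R // 4:  # middle half: defer the bit
            m += 1
            l -= R // 4; u -= R // 4
        else:
            return l, u, m                    # all other cases: just continue
        l *= 2; u *= 2                        # message interval expanded by 2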

You find this at

Arithmetic ToolBox
As a state machine

Given the current interval (L,s) and the distribution (p1,....,pS), encoding the next symbol c maps (L,s) into the sub-interval (L’,s’):   ATB: (L,s), c → (L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).
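
A tiny sketch of the order-k count bookkeeping in Python (escape probabilities, discussed next, are not modeled here):

from collections import defaultdict

def context_counts(text, k):
    counts = defaultdict(lambda: defaultdict(int))   # context -> next char -> count
    for i in range(k, len(text)):
        counts[text[i-k:i]][text[i]] += 1
    return counts

c = context_counts("the theme of the thesis", 2)
# p(e | "th") is estimated as c["th"]["e"] / sum(c["th"].values())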

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
At each step PPM supplies the distribution p[ s|context ] with s = c or esc, and the ATB maps (L,s) into (L’,s’) accordingly.

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts                String = ACCBACCACBA B        k = 2

Context Empty:   A = 4,  B = 2,  C = 5,  $ = 3

Context A:   C = 3,  $ = 1
Context B:   A = 2,  $ = 1
Context C:   A = 1,  B = 2,  C = 2,  $ = 3

Context AC:  B = 1,  C = 2,  $ = 2
Context BA:  C = 1,  $ = 1
Context CA:  C = 1,  $ = 1
Context CB:  A = 2,  $ = 1
Context CC:  A = 1,  B = 1,  $ = 2
You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:
 Output the triple <d, len, c> where
   d = distance of copied string wrt current position
   len = length of longest match
   c = next char in text beyond longest match
 Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character
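
A brute-force LZ77 parser sketch in Python that reproduces the triples of this example (real implementations replace the inner scan with hashing, as gzip does):

def lz77_encode(text, W=6):
    i, out = 0, []
    while i < len(text):
        best_d, best_len = 0, 0
        lo = max(0, i - W)                          # window start
        for j in range(lo, i):                      # candidate copy positions
            l = 0
            while i + l < len(text) - 1 and text[j + l] == text[i + l]:
                l += 1                              # match may run past i (overlap)
            if l > best_len:
                best_d, best_len = i - j, l
        nxt = text[i + best_len]                    # next char beyond the match
        out.append((best_d, best_len, nxt))
        i += best_len + 1                           # advance by len + 1
    return out

print(lz77_encode("aacaacabcabaaac", W=6))
# [(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')]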

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if len > d? (overlap with text to be compressed)
 E.g. seen = abcd, next codeword is (2,9,e)
 Simply copy left-to-right starting at the cursor:
   for (i = 0; i < len; i++)
     out[cursor+i] = out[cursor-d+i];   // works even with overlap: the source byte is already written
 Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1, char)
Typically uses the second format if length < 3.
Special greedy: possibly use a shorter match so
that the next match is better
Hash table to speed up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb
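
A compact LZW encoder sketch in Python; to mirror the example, the hypothetical initial dictionary maps a=112, b=113, c=114 and new entries start at 256 (a real coder starts from all 256 byte values):

def lzw_encode(text, initial=None):
    dic = dict(initial or {'a': 112, 'b': 113, 'c': 114})   # string -> id
    next_id = 256
    out, S = [], ""
    for ch in text:
        if S + ch in dic:
            S = S + ch                  # extend the current match
        else:
            out.append(dic[S])          # emit the longest match found so far
            dic[S + ch] = next_id       # add S + c to the dictionary
            next_id += 1
            S = ch
    out.append(dic[S])                  # flush the last match
    return out

# lzw_encode("aabaacababacb") -> [112, 112, 113, 256, 114, 257, 261, 114, 113]
# with the dictionary entries 256=aa, 257=ab, 258=ba, 259=aac, 260=ca, 261=aba, 262=abac, 263=cb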

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
We are given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
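
A runnable counterpart of InvertBWT in Python: the LF mapping is obtained by stably sorting the positions of L, and the walk rebuilds T backwards; a minimal sketch assuming a unique final sentinel '#':

def invert_bwt(L, sentinel='#'):
    n = len(L)
    # LF mapping: the k-th occurrence of char c in L is the k-th occurrence of c in F
    pairs = sorted(range(n), key=lambda i: L[i])    # Python's sort is stable
    LF = [0] * n
    for f_pos, l_pos in enumerate(pairs):
        LF[l_pos] = f_pos
    # Row 0 starts with the sentinel, so L[0] is the last char of T before it;
    # walking r -> LF[r] visits T backwards.
    out, r = [], 0
    for _ in range(n - 1):
        out.append(L[r])
        r = LF[r]
    return ''.join(reversed(out)) + sentinel

# invert_bwt("ipssm#pissii") -> "mississippi#"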

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]
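
A direct transcription of L[i] = T[SA[i]-1] in Python (0-based indices, naive suffix sorting just for illustration):

def bwt_from_sa(T):
    n = len(T)
    SA = sorted(range(n), key=lambda i: T[i:])        # naive suffix sorting
    return ''.join(T[(sa - 1) % n] for sa in SA)      # L[i] = T[SA[i] - 1] (cyclically)

print(bwt_from_sa("mississippi#"))                    # ipssm#pissii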

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n^2 log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size
 1 trillion pages available (Google 7/08)
 5-40K per page => hundreds of terabytes
 Size grows every day!!

Change
 8% new pages, 25% new links change weekly
 Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)
 Set of nodes such that from any node one can reach any other node via an undirected path.

Strongly connected components (SCC)
 Set of nodes such that from any node one can reach any other node via a directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by humans



Exploit the structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…

 Physical network graph
   V = routers,  E = communication links

 The “cosine” graph (undirected, weighted)
   V = static web pages,  E = semantic distance between pages

 Query-Log graph (bipartite, weighted)
   V = queries and URLs
   E = (q,u) if u is a result for q, and has been clicked by some user who issued q

 Social graph (undirected, unweighted)
   V = users
   E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pr that a node has x links is ∝ 1/x^a, a ≈ 2.1

The In-degree distribution

Altavista crawl (1999) and WebBase crawl (2001):
the indegree follows a power-law distribution

Pr[ in-degree(u) = k ]  ∝  1 / k^a ,   a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pr that a node has x links is ∝ 1/x^a, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many entries of their outgoing (adjacency) lists

A Picture of the Web Graph
[Figure: adjacency-matrix plot; axes i and j index the URL-sorted pages]

21 million pages, 150 million links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:
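
A sketch of this gap transformation in Python (the first gap is relative to the source x and may be negative; as noted above, negative entries would then need a further mapping to naturals):

def gaps(x, succ):
    # succ is the sorted successor list; returns {s1-x, s2-s1-1, ..., sk-s_{k-1}-1}
    out, prev = [], None
    for s in succ:
        out.append(s - x if prev is None else s - prev - 1)
        prev = s
    return out

def ungaps(x, g):
    succ, prev = [], None
    for d in g:
        s = x + d if prev is None else prev + d + 1
        succ.append(s)
        prev = s
    return succ

# gaps(15, [13, 15, 16, 17, 18, 19, 23, 24, 203]) -> [-2, 1, 0, 0, 0, 0, 3, 0, 178]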

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] as the one that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
[Figure: sender transmits data to receiver; the receiver already holds some knowledge about the data]

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques

 caching: “avoid sending the same object again”
 done on the basis of objects
 only works if objects completely unchanged
 How about objects that are slightly changed?

 compression: “remove redundancy in transmitted data”
 avoid repeated substrings in data
 can be extended to history of past transmissions (overhead)
 What if the sender has never seen the data at the receiver?

Types of Techniques

 Common knowledge between sender & receiver
 Unstructured file: delta compression

 “Partial” knowledge
 Unstructured files: file synchronization
 Record-based data: set reconciliation

Formalization

 Delta compression   [diff, zdelta, REBL,…]
 Compress file f deploying file f’
 Compress a group of files
 Speed-up web access by sending differences between the requested page and the ones available in cache

 File synchronization   [rsynch, zsync]
 Client updates old file f_old with f_new available on a server
 Mirroring, Shared Crawling, Content Distr. Net

 Set reconciliation
 Client updates structured old file f_old with f_new available on a server
 Update of contacts or appointments, intersect IL in P2P search engines

Z-delta compression

(one-to-one)

Problem: We have two files f_known and f_new and the goal is
to compute a file f_d of minimum size such that f_new can
be derived from f_known and f_d


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides an efficient, optimal solution

 f_known is the “previously encoded text”: compress the concatenation f_known·f_new starting from f_new

zdelta is one of the best implementations

            Emacs size   Emacs time
  uncompr   27Mb         ---
  gzip      8Mb          35 secs
  zdelta    1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

[Figure: the client-side proxy and the server-side proxy exchange a delta-encoding of the requested page over the slow link, relative to a reference page both hold; the server-side proxy fetches the full page from the web over the fast link]

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f ∈ F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all nodes, with edge weights equal to the gzip-compressed sizes



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: weighted graph over the files with a dummy root node; edge weights are zdelta (or gzip) sizes, and the min branching selects the cheapest reference for each file]

            space   time
  uncompr   30Mb    ---
  tgz       20%     linear
  THIS      8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n^2 edge calculations (zdelta executions)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta executions. Nonetheless, still n^2 time

            space   time
  uncompr   260Mb   ---
  tgz       12%     2 mins
  THIS      8%      16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
[Figure: the Client holds f_old, the Server holds f_new; the client sends a request and the server replies with an update]

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch,
since the server has both copies of the files

The rsync algorithm
[Figure: the Client sends the block hashes of f_old to the Server, which replies with f_new encoded as block references plus literals]

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
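
An Adler-style weak rolling checksum sketch in Python, in the spirit of rsync's 4-byte rolling hash (parameters illustrative, not the exact rsync formula); matches found this way are then confirmed with the strong (MD5) hash:

M = 1 << 16

def weak_hash(block):
    # block is a sequence of byte values
    a = sum(block) % M
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % M
    return (b << 16) | a

def roll(h, out_byte, in_byte, blocksize):
    # slide the window one byte to the right: drop out_byte, append in_byte
    a, b = h & 0xFFFF, h >> 16
    a = (a - out_byte + in_byte) % M
    b = (b - blocksize * out_byte + a) % M
    return (b << 16) | a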

Rsync: some experiments

           gcc size   emacs size
  total    27288      27326
  gzip     7563       8577
  zdelta   227        1431
  rsync    964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), and the client checks them
Server deploys the common f_ref to compress the new f_tar (rsync compresses just it).

A multi-round protocol

k blocks of n/k elems

log n/k levels

If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Figure: the suffix tree of T# = mississippi# (positions 1..12); edges are labeled with substrings such as i, s, si, ssi, p, pi#, ppi#, i#, #, mississippi#, and each leaf stores the starting position of its suffix]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Storing SUF(T) explicitly takes Θ(N^2) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirect binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
⇒ overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
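
A sketch of the indirect binary search in Python, returning the (contiguous) occurrences of P; indices are 0-based:

def sa_range(T, SA, P):
    lo, hi = 0, len(SA)
    while lo < hi:                               # first suffix whose |P|-prefix is >= P
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + len(P)] < P: lo = mid + 1
        else: hi = mid
    first = lo
    lo, hi = first, len(SA)
    while lo < hi:                               # first suffix not starting with P
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + len(P)] == P: lo = mid + 1
        else: hi = mid
    return [SA[k] for k in range(first, lo)]     # starting positions of P's occurrences

T  = "mississippi#"
SA = [11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]      # 0-based version of the table above
print(sa_range(T, SA, "si"))                     # [6, 3]  (positions 7 and 4, 1-based)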

Locating the occurrences

T = mississippi#,  P = si
Binary searching for the bounds si# and si$ in SA isolates the contiguous range holding the suffixes sippi… and sissippi…, i.e. SA entries 7 and 4: occ = 2 occurrences, at positions 4 and 7.

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
• Search for an entry Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
• Search for a window Lcp[i,i+C-2] whose entries are all ≥ L
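
A sketch in Python of the last two tests, given the Lcp array (Lcp[i] = lcp of the suffixes at SA[i] and SA[i+1]); helper names are illustrative:

def has_repeat_of_length(Lcp, L):
    # is there a substring of length >= L occurring at least twice?
    return any(v >= L for v in Lcp)

def occurs_C_times(Lcp, L, C):
    # a substring of length >= L occurring >= C times corresponds to
    # C consecutive suffixes sharing a prefix of length >= L, i.e.
    # a window Lcp[i .. i+C-2] whose entries are all >= L
    if C <= 1:
        return True
    run = 0
    for v in Lcp:
        run = run + 1 if v >= L else 0
        if run >= C - 1:
            return True
    return False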


Slide 55

Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1n 2n 3n… nn 

There is a memory

Huff(X) = Q(n2 log n) > Rle(X) = Q(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.
1.0

c = .3

0.7

i -1

f (i ) 



p( j)

j 1

b = .5
0.2
0.0

f(a) = .0, f(b) = .2, f(c) = .7
a = .2

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
1.0

0.7
c = .3

0.7

c = .3
0.55

0.0

a = .2

0.3
0.2

c = .3

0.27
b = .5

b = .5
0.2

0.3

a = .2

b = .5
0.22

a = .2

0.2

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l 0  0 l i  l i -1  s i -1 * f c i 

s0  1

s i  s i -1 * p c i 

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

n

sn 

 p c 
i

i 1

The interval for a message sequence will be called the
sequence interval

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
1.0

0.7
c = .3

0.7
0.49

b = .5

c = .3

0.55
0.49

0.2

0.0

0.55

b = .5

0.3
a = .2

The message is bbc.

0.2

0.49
0.475

c = .3

b = .5
0.35

a = .2

0.3

a = .2

Representing a real number
Binary fractional representation:
. 75

 . 11

1/ 3

 . 01 01

11 / 16

 . 1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
m in

m ax

interval

.11

.110

.111

[.75 ,1.0 )

.101

.1010

.1011

[.625 , .75 )

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.
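A small Python sketch of the PPM bookkeeping: a table of counts per context plus the escape fallback. The escape probability below follows a simple PPMC-like rule (one escape count per distinct symbol seen in the context), which is just one possible choice among the variants mentioned above.

from collections import defaultdict, Counter

K = 2
counts = defaultdict(Counter)            # context string (length <= K) -> counts of the next symbol

def update(history, sym):
    # record sym under every context of size 0..K that ends the history
    for k in range(min(K, len(history)) + 1):
        counts[history[len(history) - k:]][sym] += 1

def probability(history, sym):
    # start from the longest context; on a miss, emit an escape and shorten the context
    prob = 1.0
    for k in range(min(K, len(history)), -1, -1):
        ctx = counts[history[len(history) - k:]]
        total, distinct = sum(ctx.values()), len(ctx)
        if total == 0:
            continue
        if ctx[sym] > 0:
            return prob * ctx[sym] / (total + distinct)
        prob *= distinct / (total + distinct)       # escape probability
    return prob * 1e-6                   # symbol never seen anywhere: tiny fallback mass

history = ""
for c in "ACCBACCACBA":
    update(history, c)
    history = (history + c)[-K:]
print(probability(history, "B"))         # P(B | "BA"): escapes down to the empty context in this toy run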

PPM + Arithmetic ToolBox
[Same ATB state machine as before, but now driven by p[ s | context ], where s = c or esc:
(L,s) → (L',s').]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
String = ACCBACCACBA B        k = 2

Context: Empty     A = 4   B = 2   C = 5   $ = 3

Context: A         C = 3   $ = 1
Context: B         A = 2   $ = 1
Context: C         A = 1   B = 2   C = 2   $ = 3

Context: AC        B = 1   C = 2   $ = 2
Context: BA        C = 1   $ = 1
Context: CA        C = 1   $ = 1
Context: CB        A = 2   $ = 1
Context: CC        A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves
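A naive sketch of this coding loop in Python (quadratic scan of the window; a real implementation indexes the window, e.g. with the hash table mentioned in the gzip slide). On the window example shown next it emits the same five triples.

def lz77_encode(text, W=6):
    # emit (d, len, c) triples using a sliding window of size W
    i, out = 0, []
    while i < len(text):
        best_d, best_len = 0, 0
        for d in range(1, min(i, W) + 1):              # candidate copy distances within the window
            length = 0
            while i + length < len(text) - 1 and text[i + length - d] == text[i + length]:
                length += 1                            # the copy may overlap the cursor (len > d)
            if length > best_len:
                best_d, best_len = d, length
        out.append((best_d, best_len, text[i + best_len]))
        i += best_len + 1                              # advance by len + 1
    return out

print(lz77_encode("aacaacabcabaaac"))
# -> [(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')]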

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
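And the corresponding decoder, a direct transcription of the copy loop above; copying one character at a time makes the overlapping case (len > d) work for free.

def lz77_decode(triples):
    out = []
    for d, length, c in triples:
        for _ in range(length):
            out.append(out[len(out) - d])   # copy char by char, so the source may overlap the output
        out.append(c)
    return "".join(out)

print(lz77_decode([(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')]))   # -> aacaacabcabaaac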

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash table to speed up the search for matching triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids
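A sketch of the LZ78 coding loop in Python; the trie is emulated by a plain dictionary mapping phrases to ids (id 0 = empty string), which is enough to show the mechanics.

def lz78_encode(text):
    dictionary = {"": 0}
    out, i = [], 0
    while i < len(text):
        s = ""
        while i < len(text) and s + text[i] in dictionary:   # longest match S in the dictionary
            s += text[i]
            i += 1
        c = text[i] if i < len(text) else ""
        out.append((dictionary[s], c))                       # output (id of S, next char c)
        dictionary[s + c] = len(dictionary)                  # add Sc with the next free id
        i += 1
    return out

print(lz78_encode("aabaacabcabcb"))
# -> [(0,'a'), (1,'b'), (1,'a'), (0,'c'), (2,'c'), (5,'b')], as in the example below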

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
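A sketch of the LZW coder in Python. To stay close to the toy example below, the dictionary is pre-loaded only with a, b, c using the slide's fictional ids 112-114 (not the full 256 ASCII entries), and new entries start at 256; the decoder-side special case is not shown.

def lzw_encode(text, alphabet=("a", "b", "c"), first_id=112):
    dictionary = {ch: first_id + i for i, ch in enumerate(alphabet)}
    next_id = 256
    out, s = [], text[0]
    for c in text[1:]:
        if s + c in dictionary:
            s += c                              # keep extending the current match
        else:
            out.append(dictionary[s])           # emit only the id of S (no extra char)...
            dictionary[s + c] = next_id         # ...but still add Sc to the dictionary
            next_id += 1
            s = c
    out.append(dictionary[s])                   # flush the last match
    return out

print(lzw_encode("aabaacababacb"))
# -> [112, 112, 113, 256, 114, 257, 261, 114, 113]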

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
(1994)
Let us be given the text T = mississippi#

All cyclic rotations of T:          After sorting the rows (F = first column, L = last column):
mississippi#                        # mississipp i
ississippi#m                        i #mississip p
ssissippi#mi                        i ppi#missis s
sissippi#mis                        i ssippi#mis s
issippi#miss                        i ssissippi# m
ssippi#missi                        m ississippi #
sippi#missis                        p i#mississi p
ippi#mississ                        p pi#mississ i
ppi#mississi                        s ippi#missi s
pi#mississip                        s issippi#mi s
i#mississipp                        s sippi#miss i
#mississippi                        s sissippi#m i

F = # i i i i m p p s s s s        L = i p s s m # p i s s i i


A famous example

Much
longer...

A useful tool: the L → F mapping

[Same sorted rotation matrix as above, with the F and L columns shown.]

How do we map L's chars onto F's chars ?
... Need to distinguish equal chars in F...

Take two equal chars of L and rotate their rows rightward by one position:
the rows stay sorted, so the two chars keep the same relative order in F !!

The BWT is invertible
[Same sorted rotation matrix:  F = # i i i i m p p s s s s,   L = i p s s m # p i s s i i.]

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
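A runnable Python counterpart of InvertBWT. The LF-array is built with the usual counting argument (equal chars keep their relative order between L and F = sorted(L)); here the walk starts from the row whose last char is the end-marker, one of several equivalent conventions.

def invert_bwt(L, eos="#"):
    F = sorted(L)
    first = {}
    for i, c in enumerate(F):
        first.setdefault(c, i)                   # first row of F holding char c
    seen, LF = {}, []
    for c in L:
        LF.append(first[c] + seen.get(c, 0))     # LF maps L's chars onto F's chars
        seen[c] = seen.get(c, 0) + 1
    row, out = L.index(eos), []                  # the row whose last char is eos is T itself
    for _ in range(len(L)):
        out.append(L[row])                       # L[row] precedes F[row] in T
        row = LF[row]
    return "".join(reversed(out))

print(invert_bwt("ipssm#pissii"))                # -> mississippi#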

How to compute the BWT ?
SA     BWT matrix (sorted rotations)     L
12     #mississippi                      i
11     i#mississipp                      p
 8     ippi#mississ                      s
 5     issippi#miss                      s
 2     ississippi#m                      m
 1     mississippi#                      #
10     pi#mississip                      p
 9     ppi#mississi                      i
 7     sippi#missis                      s
 4     sissippi#mis                      s
 6     ssippi#missi                      i
 3     ssissippi#mi                      i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12   #
11   i#
 8   ippi#
 5   issippi#
 2   ississippi#
 1   mississippi#
10   pi#
 9   ppi#
 7   sippi#
 4   sissippi#
 6   ssippi#
 3   ssissippi#

Elegant but inefficient                    Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults
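The "elegant but inefficient" construction, literally, plus the L[i] = T[SA[i]-1] rule of the previous slide (0-based in the code, printed 1-based to match the slides); fine for toy strings, hopeless on massive data.

def suffix_array_naive(T):
    # comparison-based sorting of the suffixes: Θ(n² log n) time in the worst case
    return sorted(range(len(T)), key=lambda i: T[i:])

def bwt_from_sa(T, SA):
    # L[i] = T[SA[i]-1]; index -1 wraps around for the suffix starting at position 0
    return "".join(T[i - 1] for i in SA)

T = "mississippi#"
SA = suffix_array_naive(T)
print([i + 1 for i in SA])     # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(bwt_from_sa(T, SA))      # ipssm#pissii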

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution
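A minimal Move-to-Front sketch (the Wheeler/RLE0 step and the final statistical coder are not reproduced); with the slide's initial list [i,m,p,s] it yields the leading digits of the Mtf string above.

def mtf_encode(s, alphabet):
    lst = list(alphabet)                  # current MTF list
    out = []
    for c in s:
        i = lst.index(c)
        out.append(i)
        lst.pop(i)
        lst.insert(0, c)                  # move the just-seen symbol to the front
    return out

# locally homogeneous input -> output dominated by 0s, which is what RLE then exploits
print(mtf_encode("ipppssssssmmmii", "imps"))   # [0, 2, 0, 0, 3, 0, 0, 0, 0, 0, 3, 0, 0, 3, 0]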

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size
 1 trillion of pages available (Google 7/08)
 5-40K per page => hundreds of terabytes
 Size grows every day!!

Change
 8% new pages, 25% new links change weekly
 Life time of about 10 days

The Bow Tie

Some definitions

 Weakly connected components (WCC)
   Set of nodes such that from any node one can reach any other node via an undirected path.

 Strongly connected components (SCC)
   Set of nodes such that from any node one can reach any other node via a directed path.

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?

 The largest artifact ever conceived by humans
 Exploit the structure of the Web for
   Crawl strategies
   Search
   Spam detection
   Discovering communities on the web
   Classification/organization
 Predict the evolution of the Web
   Sociological understanding

Many other large graphs…

 Physical network graph
   V = Routers
   E = communication links

 The “cosine” graph (undirected, weighted)
   V = static web pages
   E = semantic distance between pages

 Query-Log graph (bipartite, weighted)
   V = queries and URLs
   E = (q,u) if u is a result for q, and has been clicked by some user who issued q

 Social graph (undirected, unweighted)
   V = users
   E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)
 V = URLs,  E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN & no OUT)

Three key properties:
 Skewed distribution: the probability that a node has x links is 1/x^α,  α ≈ 2.1

The In-degree distribution

Altavista crawl, 1999; WebBase crawl, 2001
Indegree follows a power law distribution:

Pr[ in-degree(u) = k ]  ∝  1/k^α,     α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)
 V = URLs,  E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN, no OUT)

Three key properties:
 Skewed distribution: the probability that a node has x links is 1/x^α,  α ≈ 2.1
 Locality: usually most of the hyperlinks point to other URLs on the same host (about 80%).
 Similarity: pages close in lexicographic order tend to share many outgoing lists

A Picture of the Web Graph
j

i

21 million pages, 150 million links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries: map them to naturals (e.g. v ≥ 0 → 2v and v < 0 → 2|v|-1, as in the residual examples below)
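A small sketch of this gap encoding; the signed map for the first entry is an assumption consistent with the positive/negative examples shown a couple of slides below (the actual variable-length codes used by WebGraph are not reproduced).

def encode_successors(x, succs):
    # gap-encode the successor list of node x: {s1-x, s2-s1-1, ..., sk-s(k-1)-1}
    gaps, prev = [], None
    for s in sorted(succs):
        if prev is None:
            g = s - x                                          # the first gap may be negative...
            gaps.append(2 * g if g >= 0 else 2 * abs(g) - 1)   # ...map it to a natural number
        else:
            gaps.append(s - prev - 1)                          # later gaps are >= 0 by construction
        prev = s
    return gaps

print(encode_successors(15, [15, 16, 17, 22]))   # [0, 0, 0, 4]   (first gap 15-15 = 0)
print(encode_successors(15, [13, 15, 16, 22]))   # [3, 1, 0, 5]   (first gap 13-15 = -2 -> 3)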

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




 caching: “avoid sending the same object again”
   done on the basis of objects
   only works if objects are completely unchanged
   What about objects that are slightly changed?

 compression: “remove redundancy in transmitted data”
   avoid repeated substrings in data
   can be extended to the history of past transmissions  (overhead)
   What if the sender has never seen the data at the receiver ?

Types of Techniques


 Common knowledge between sender & receiver
   Unstructured file: delta compression

 “partial” knowledge
   Unstructured files: file synchronization
   Record-based data: set reconciliation

Formalization


 Delta compression   [diff, zdelta, REBL,…]
   Compress file f deploying file f’
   Compress a group of files
   Speed-up web access by sending differences between the requested page and the ones
   available in cache

 File synchronization   [rsync, zsync]
   Client updates old file f_old with f_new available on a server
   Mirroring, Shared Crawling, Content Distr. Net

 Set reconciliation
   Client updates structured old file f_old with f_new available on a server
   Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77-scheme provides an efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
            Emacs size    Emacs time
uncompr     27Mb          ---
gzip        8Mb           35 secs
zdelta      1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

[Figure: Client ⇄ slow link (delta-encoded pages) ⇄ Proxy ⇄ fast link (requests and pages) ⇄ web;
each proxy keeps a reference (old) version of the requested page.]

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: weighted graph GF over the files (nodes 1, 2, 3, 5 plus the dummy node 0);
edge weights shown include 620, 2000, 220, 123, 20.]

            space    time
uncompr     30Mb     ---
tgz         20%      linear
THIS        8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
            space     time
uncompr     260Mb     ---
tgz         12%       2 mins
THIS        8%        16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
[Figure: the Client (holding f_old) sends the block hashes of f_old to the Server (holding f_new);
the Server replies with an encoded file made of block references and literals.]

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
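A toy illustration of the mechanics (not rsync's actual format): the client hashes fixed-size blocks of f_old, and the receiver of those hashes expresses f_new as block copies plus literal bytes. A simple polynomial hash stands in for the 4-byte rolling checksum + strong hash pair, and it is recomputed at every offset instead of being rolled in O(1).

B = 4                                            # toy block size (rsync's default is much larger)
MOD, BASE = (1 << 31) - 1, 256

def block_hash(block):
    h = 0
    for b in block:
        h = (h * BASE + b) % MOD
    return h

def sync_encode(f_old, f_new):
    # one hash per block of f_old (what the client sends); then scan f_new at every offset
    hashes = {block_hash(f_old[i:i + B]): i // B for i in range(0, len(f_old) - B + 1, B)}
    out, i = [], 0
    while i < len(f_new):
        if len(f_new) - i >= B and block_hash(f_new[i:i + B]) in hashes:
            out.append(("COPY", hashes[block_hash(f_new[i:i + B])]))   # reuse a block of f_old
            i += B
        else:
            out.append(("LIT", f_new[i]))        # no block matches here: send a literal byte
            i += 1
    return out

print(sync_encode(b"the quick brown fox!", b"the quick red fox!!!"))
# -> COPY blocks 0 and 1, literals for "k red ", COPY block 4, literals for "!!"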

Rsync: some experiments

            gcc size     emacs size
total       27288        27326
gzip         7563         8577
zdelta        227         1431
rsync         964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), the client checks them
Server deploys the common f_ref to compress the new f_tar (rsync compresses just it).

A multi-round protocol

k blocks of n/k elems,  log n/k levels

If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

Occurrences of P in T = All suffixes of T having P as a prefix

e.g.  P = si,  T = mississippi   →   occurrences at positions 4, 7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Figure: the suffix tree of T# = mississippi#. Edge labels include #, i, s, p, si, ssi, i#, pi#, ppi#
and mississippi#; the 12 leaves are labelled with the starting positions 1..12 of the corresponding suffixes.]

T# = mississippi#

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Storing SUF(T) explicitly takes Θ(N²) space; the suffix array SA keeps only the suffix pointers:

SA     SUF(T)
12     #
11     i#
 8     ippi#
 5     issippi#
 2     ississippi#
 1     mississippi#
10     pi#
 9     ppi#
 7     sippi#
 4     sissippi#
 6     ssippi#
 3     ssissippi#

T = mississippi#        P = si

Suffix Array space:
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison

SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]     T = mississippi#     P = si

At each step compare P against the suffix starting at the middle SA entry (2 accesses per step,
one to SA and one to T): if P is larger, recurse on the right half; if P is smaller, on the left half.

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
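The indirect binary search, as a Python sketch (0-based positions; the two searches return the SA range of the suffixes prefixed by P, hence all occ occurrences at once):

def sa_range(T, SA, P):
    # O(p) chars compared per step, O(p log n) overall
    def first_position(strictly_greater):
        lo, hi = 0, len(SA)
        while lo < hi:
            mid = (lo + hi) // 2
            pref = T[SA[mid]:SA[mid] + len(P)]          # one access to SA, one to T
            if pref < P or (strictly_greater and pref == P):
                lo = mid + 1
            else:
                hi = mid
        return lo
    return first_position(False), first_position(True)

T = "mississippi#"
SA = sorted(range(len(T)), key=lambda i: T[i:])         # naive SA, 0-based
lo, hi = sa_range(T, SA, "si")
print(SA[lo:hi])       # [6, 3] -> occurrences at positions 7 and 4 (1-based), occ = 2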

Locating the occurrences
SA,  T = mississippi#,  P = si   →   occ = 2

All suffixes prefixed by P occupy a contiguous range of SA, delimited by the searches for
si# and si$ (where # is smaller and $ larger than any char of S): here the entries
7 (sippi…) and 4 (sissippi…), so the occurrences are at text positions 4 and 7.

Suffix Array search
• O(p + log2 N + occ) time
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp = [0, 0, 1, 4, 0, 0, 1, 0, 2, 1, 3]
SA  = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]          T = mississippi#

e.g. lcp(issippi…, ississippi…) = 4

• How long is the common prefix between T[i,...] and T[j,...] ?
  • Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  • Search for Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  • Search for a window Lcp[i,i+C-2] whose entries are all ≥ L
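A small sketch of the first query above, computed naively (a real implementation would answer the min with a range-minimum-query structure). It assumes Lcp[i] is the lcp between the suffixes in SA[i] and SA[i+1]; positions are 0-based here.

def build_lcp(T, SA):
    def lcp(a, b):
        k = 0
        while a + k < len(T) and b + k < len(T) and T[a + k] == T[b + k]:
            k += 1
        return k
    return [lcp(SA[i], SA[i + 1]) for i in range(len(SA) - 1)]

def lcp_of_suffixes(T, SA, Lcp, i, j):
    # lcp(T[i,...], T[j,...]) = min of Lcp over the SA range between the two suffixes
    h, k = sorted((SA.index(i), SA.index(j)))
    return min(Lcp[h:k])

T = "mississippi#"
SA = sorted(range(len(T)), key=lambda i: T[i:])
Lcp = build_lcp(T, SA)
print(lcp_of_suffixes(T, SA, Lcp, 4, 1))   # issippi# vs ississippi# share "issi" -> 4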


Slide 56

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with i-th bit of U(T[j]) to estabilish if both are true

An example j=1
1 2 3 4 5 6 7 8 9 10

n
m

1

12345

1

0

P=abaac

2

0

3

0

4

0

5

0

T=xabxabaaca

0
 
0
U (x)  0
 
0
 
0

2

3

4

5

6

7

1 
 
0
BitShift ( M ( 0 )) & U (T [1])   0  &
 
0
 
0

8

9

0 0
   
0 0
0  0
   
0 0
   
0 0

1
0

An example j=2
1 2 3 4 5 6 7 8 9 10

n
m

1

2

12345

1

0

1

P=abaac

2

0

0

3

0

0

4

0

0

5

0

0

T=xabxabaaca

1 
 
0
U (a )   1 
 
1 
 
0

3

4

5

6

7

1 
 
0
BitShift ( M (1)) & U (T [ 2 ])   0  &
 
0
 
0

8

9

1  1 
   
0 0
1    0 
   
1   0 
   
0 0

1
0

An example j=3
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

12345

1

0

1

0

P=abaac

2

0

0

1

3

0

0

0

4

0

0

0

5

0

0

0

T=xabxabaaca

0
 
1 
U (b )   0 
 
0
 
0

4

5

6

7

1 
 
1 
BitShift ( M ( 2 )) & U (T [ 3 ])   0  &
 
0
 
0

8

9

0 0
   
1  1 
0  0
   
0 0
   
0 0

1
0

An example j=9
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

4

5

6

7

8

9

12345

1

0

1

0

0

1

0

1

1

0

P=abaac

2

0

0

1

0

0

1

0

0

0

3

0

0

0

0

0

0

1

0

0

4

0

0

0

0

0

0

0

1

0

5

0

0

0

0

0

0

0

0

1

T=xabxabaaca

0
 
0
U (c )   0 
 
0
 
1 

1 
 
1 
BitShift ( M ( 8 )) & U (T [ 9 ])   0  &
 
0
 
1 

0 0
   
0 0
0  0
   
0 0
   
1  1 

1
0

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M ( j - 1))  U (T [ j ])
l

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M

l -1

( j - 1))

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1
1 2 3 4 5 6 7 8 910

M1=

T=xabxabaaca
P=
abaad

1

2

3

4

5

6

7

8

9

1
0

1

1

1

1

1

1

1

1

1

1

1

2

0

0

1

0

0

1

0

1

1

0

3

0

0

0

1

0

0

1

0

0

1

4

0

0

0

0

1

0

0

1

0

0

5

0

0

0

0

0

0

0

0

1

0

M0=

1

2

3

4

5

6

7

8

9

1
0

1

0

1

0

0

1

0

1

1

0

1

2

0

0

1

0

0

1

0

0

0

0

3

0

0

0

0

0

0

1

0

0

0

4

0

0

0

0

0

0

0

1

0

0

5

0

0

0

0

0

0

0

0

0

0

How much do we pay?





The running time is O(kn(1+m/w)
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Delection: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
0 000 ...........0 x in b in ary
Length-1


x > 0 and Length = log2 x +1
e.g., 9 represented as <000,1001>.



g-code for x takes 2 log2 x +1 bits

(ie. factor of 2 from optimal)



Optimal for Pr(x) = 1/2x2, and i.i.d integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8

6

3

59

7

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How much good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
1 ≥Si=1,...,x pi ≥ x * px  x ≤ 1/px

How good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):



p i  | g ( i ) |

i  1 ,..., S

This is:



i  1 ,.., S

p i  [ 2 * log

1

 1]

pi

 2 * H 0 ( X ) 1
No much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1n 2n 3n… nn  Huff = O(n2 log n), MTF = O(n log n) + n2

No much worse than Huffman
...but it may be far better

MTF: how good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
Put S in the front and consider the cost of encoding:
S

O ( S log S ) 



nx

g(p - p )

x 1 i  2

x

x

i

i -1

S

By Jensen’s:

 O ( S log S ) 

n
x 1

x

[ 2 * log

N

 1]

nx

 O ( S log S )  N * [ 2 * H 0 ( X )  1]
L a [ mtf ]  2 * H 0 ( X )  O (1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1n 2n 3n… nn 

There is a memory

Huff(X) = Q(n2 log n) > Rle(X) = Q(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.
1.0

c = .3

0.7

i -1

f (i ) 



p( j)

j 1

b = .5
0.2
0.0

f(a) = .0, f(b) = .2, f(c) = .7
a = .2

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
1.0

0.7
c = .3

0.7

c = .3
0.55

0.0

a = .2

0.3
0.2

c = .3

0.27
b = .5

b = .5
0.2

0.3

a = .2

b = .5
0.22

a = .2

0.2

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l 0  0 l i  l i -1  s i -1 * f c i 

s0  1

s i  s i -1 * p c i 

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

n

sn 

 p c 
i

i 1

The interval for a message sequence will be called the
sequence interval

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
1.0

0.7
c = .3

0.7
0.49

b = .5

c = .3

0.55
0.49

0.2

0.0

0.55

b = .5

0.3
a = .2

The message is bbc.

0.2

0.49
0.475

c = .3

b = .5
0.35

a = .2

0.3

a = .2

Representing a real number
Binary fractional representation:
. 75

 . 11

1/ 3

 . 01 01

11 / 16

 . 1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
m in

m ax

interval

.11

.110

.111

[.75 ,1.0 )

.101

.1010

.1011

[.625 , .75 )

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.
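A toy sketch of this escape mechanism (Python; not any specific PPM variant,
since variants differ exactly in how escape probabilities are chosen). It just
produces the stream of (context, symbol-or-escape) events that would be fed to
the arithmetic coder, plus the raw follower counts per context:

  def ppm_events(text, k):
      counts = {order: {} for order in range(k + 1)}
      events = []
      for i, c in enumerate(text):
          emitted = False
          for order in range(min(k, i), -1, -1):       # shrink the context on escape
              ctx = text[i - order:i]
              if c in counts[order].get(ctx, {}):
                  events.append((ctx, c))
                  emitted = True
                  break
              events.append((ctx, '$'))                # escape-msg
          if not emitted:
              events.append(('<uniform>', c))          # char never seen: order -1 model
          for order in range(min(k, i) + 1):           # update the statistics
              ctx = text[i - order:i]
              counts[order].setdefault(ctx, {})
              counts[order][ctx][c] = counts[order][ctx].get(c, 0) + 1
      return events, counts

  events, counts = ppm_events("ACCBACCACBA", 2)
  print(counts[1]['A'])    # {'C': 3}, cf. the example-contexts table below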

PPM + Arithmetic ToolBox
At each step the ATB maps (L,s) to (L',s') using the conditional
distribution p[ s | context ], where s is either the next char c or esc.

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts        String = ACCBACCACBA B,   k = 2

Context Empty:   A = 4   B = 2   C = 5   $ = 3

Context A:       C = 3   $ = 1
Context B:       A = 2   $ = 1
Context C:       A = 1   B = 2   C = 2   $ = 3

Context AC:      B = 1   C = 2   $ = 2
Context BA:      C = 1   $ = 1
Context CA:      C = 1   $ = 1
Context CB:      A = 2   $ = 1
Context CC:      A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:
  Output <d, len, c> where
    d   = distance of copied string wrt current position
    len = length of longest match
    c   = next char in text beyond longest match
  Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6.  At each step: longest match within W, then the next character.
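A minimal greedy LZ77 encoder sketch (Python; no window limit, so on the text
above it reproduces the same trace, since the window never overflows here;
gzip adds the sliding window, hashing and LZSS/Huffman layers on top):

  def lz77_encode(text):
      out, i = [], 0
      while i < len(text):
          best_d, best_len = 0, 0
          for d in range(1, i + 1):                   # candidate copy distances
              l = 0
              while i + l < len(text) - 1 and text[i + l - d] == text[i + l]:
                  l += 1                              # copies may overlap the cursor
              if l > best_len:
                  best_d, best_len = d, l
          c = text[i + best_len]                      # next char beyond the match
          out.append((best_d, best_len, c))
          i += best_len + 1
      return out

  print(lz77_encode("aacaacabcabaaac"))
  # [(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')]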

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if len > d? (the copy overlaps text that has not been written yet)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
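The same copy rule as a complete decoder sketch (Python); it also handles the
overlapping case, so on the triple (2,9,e) after "abcd" it produces
abcdcdcdcdcdce, as claimed above:

  def lz77_decode(triples):
      out = []
      for d, length, c in triples:
          for _ in range(length):
              out.append(out[-d])     # out[cursor+i] = out[cursor-d+i]
          out.append(c)
      return ''.join(out)

  print(lz77_decode([(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')]))
  # aacaacabcabaaac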

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table over triplets to speed up the search for matches
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb
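A minimal sketch of the LZ78 coding loop (Python; the dictionary maps strings
to ids, id 0 being the empty string; if the text ends exactly on a dictionary
match, the last pair carries an empty char):

  def lz78_encode(text):
      dict_, out, i = {'': 0}, [], 0
      while i < len(text):
          s = ''
          while i < len(text) and s + text[i] in dict_:   # longest match S
              s += text[i]; i += 1
          c = text[i] if i < len(text) else ''
          out.append((dict_[s], c))                       # output (id, next char)
          dict_[s + c] = len(dict_)                       # add Sc to the dictionary
          i += 1
      return out

  print(lz78_encode("aabaacabcabcb"))
  # [(0,'a'), (1,'b'), (1,'a'), (0,'c'), (2,'c'), (5,'b')]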

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

(one step later)
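A sketch of LZW decoding that handles the special case above (Python): when a
received code is not yet in the dictionary, it must denote the previous string
plus its own first character. The toy mapping a=112, b=113, c=114 and the
start of new entries at 256 follow the example above.

  def lzw_decode(codes, initial, first_new=256):
      dict_ = dict(initial)
      nxt = first_new
      prev = dict_[codes[0]]
      out = [prev]
      for code in codes[1:]:
          cur = dict_[code] if code in dict_ else prev + prev[0]   # special case
          out.append(cur)
          dict_[nxt] = prev + cur[0]       # the decoder is one step behind
          nxt += 1
          prev = cur
      return ''.join(out)

  codes = [112, 112, 113, 256, 114, 257, 261, 114]
  print(lzw_decode(codes, {112: 'a', 113: 'b', 114: 'c'}))
  # aabaacababac  (the encoder's input, up to its last shown codeword)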

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform (1994)
Let us be given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows.  F = first column, L = last column:

  F                 L
  #  mississipp   i
  i  #mississip   p
  i  ppi#missis   s
  i  ssippi#mis   s
  i  ssissippi#   m
  m  ississippi   #
  p  i#mississi   p
  p  pi#mississ   i
  s  ippi#missi   s
  s  issippi#mi   s
  s  sippi#miss   i
  s  sissippi#m   i

  L = BWT(T) = ipssm#pissii

A famous example... (a much longer text)

A useful tool: the L → F mapping
How do we map L’s chars onto F’s chars?
... we need to distinguish equal chars in F...

Take two equal chars in L and rotate their rows rightward:
they keep the same relative order in F !!
Hence the k-th occurrence of a char c in L corresponds to
the k-th occurrence of c in F.

The BWT is invertible
(look again at the F and L columns of the sorted-rotations matrix)

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
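A runnable sketch of InvertBWT (Python): LF[i] is the position in F of the
character L[i], computed by ranking equal characters; T is rebuilt backward
starting from row 0 (the rotation beginning with #), and finally rotated so
that the terminator ends up at the end.

  def invert_bwt(L):
      F = sorted(L)
      first = {}                                  # first position of c in F
      for i, c in enumerate(F):
          first.setdefault(c, i)
      seen, LF = {}, []
      for c in L:                                 # k-th c in L -> k-th c in F
          LF.append(first[c] + seen.get(c, 0))
          seen[c] = seen.get(c, 0) + 1
      T, r = [], 0
      for _ in range(len(L)):                     # reconstruct T backward
          T.append(L[r])
          r = LF[r]
      T.reverse()
      return ''.join(T[1:] + T[:1])               # move '#' to the end

  print(invert_bwt("ipssm#pissii"))               # mississippi#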

How to compute the BWT ?     (T = mississippi#)

  SA    BWT matrix     L
  12    #mississipp    i
  11    i#mississip    p
   8    ippi#missis    s
   5    issippi#mis    s
   2    ississippi#    m
   1    mississippi    #
  10    pi#mississi    p
   9    ppi#mississ    i
   7    sippi#missi    s
   4    sissippi#mi    s
   6    ssippi#miss    i
   3    ssissippi#m    i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]
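A sketch of that rule in Python, using plain sorting of the suffixes as a
stand-in for suffix-array construction (which is exactly the "elegant but
inefficient" approach discussed next); positions are 1-based as in the table:

  def bwt_from_sa(T):
      SA = sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])
      L = ''.join(T[i - 2] for i in SA)    # L[i] = T[SA[i]-1]; SA[i]=1 wraps to '#'
      return SA, L

  SA, L = bwt_from_sa("mississippi#")
  print(SA)    # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
  print(L)     # ipssm#pissii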

How to construct SA from T ?

  SA
  12   #
  11   i#
   8   ippi#
   5   issippi#
   2   ississippi#
   1   mississippi#
  10   pi#
   9   ppi#
   7   sippi#
   4   sissippi#
   6   ssippi#
   3   ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion pages are available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions

 Weakly connected components (WCC)
   Set of nodes such that from any node one can reach any other node
   via an undirected path.

 Strongly connected components (SCC)
   Set of nodes such that from any node one can reach any other node
   via a directed path.

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?

 It is the largest artifact ever conceived by humans

 Exploit the structure of the Web for:
Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…

 Physical network graph
   V = Routers
   E = communication links

 The “cosine” graph (undirected, weighted)
   V = static web pages
   E = semantic distance between pages

 Query-Log graph (bipartite, weighted)
   V = queries and URLs
   E = (q,u) if u is a result for q, and has been clicked by some user who issued q

 Social graph (undirected, unweighted)
   V = users
   E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pr that a node has x links is ∝ 1/x^α, α ≈ 2.1

The In-degree distribution

Altavista crawl, 1999 / WebBase crawl, 2001:
the indegree follows a power-law distribution

  Pr[ in-degree(u) = k ]  ∝  1/k^α,   α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pr that a node has x links is ∝ 1/x^α, α ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages that are close in lexicographic order tend to share
many entries of their outgoing (adjacency) lists

A Picture of the Web Graph
[Adjacency-matrix picture: 21 million pages, 150 million links, with the URLs
sorted lexicographically (Berkeley, Stanford crawl)]

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = { s1 - x,  s2 - s1 - 1,  ...,  sk - s(k-1) - 1 }

For negative entries:
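A tiny sketch of this gap transformation (Python; the successor list is a
made-up example in the spirit of the WebGraph figures; note the first gap may
be negative, which is why it needs a signed code):

  def gaps(x, successors):
      s = sorted(successors)
      return [s[0] - x] + [s[i] - s[i - 1] - 1 for i in range(1, len(s))]

  print(gaps(15, [13, 16, 17, 18, 19, 23, 24, 203]))
  # [-2, 2, 0, 0, 0, 3, 0, 178]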

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y’s copy-list tells whether the corresponding successor of the
reference x is also a successor of y;
the reference index is chosen in [0,W] as the one giving the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy-block has length 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity in the extra-nodes (runs of ≥ 3 consecutive ids)

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques

 caching: “avoid sending the same object again”
   done on the basis of objects
   only works if objects are completely unchanged
   How about objects that are slightly changed?

 compression: “remove redundancy in transmitted data”
   avoid repeated substrings in data
   can be extended to the history of past transmissions (overhead)
   What if the sender has never seen the data at the receiver?

Types of Techniques

 Common knowledge between sender & receiver
   Unstructured file: delta compression

 “partial” knowledge
   Unstructured files: file synchronization
   Record-based data: set reconciliation

Formalization

 Delta compression   [diff, zdelta, REBL,…]
   Compress file f deploying file f’
   Compress a group of files
   Speed-up web access by sending differences between the requested
   page and the ones available in cache

 File synchronization   [rsync, zsync]
   Client updates old file f_old with f_new available on a server
   Mirroring, Shared Crawling, Content Distr. Net

 Set reconciliation
   Client updates structured old file f_old with f_new available on a server
   Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression   (one-to-one)

Problem: We have two files f_known and f_new, and the goal is
to compute a file f_d of minimum size such that f_new can
be derived from f_known and f_d

 Assume that block moves and copies are allowed
 Find an optimal covering set of f_new based on f_known
 The LZ77-scheme provides an efficient, optimal solution:
   f_known is the “previously encoded text”; compress the
   concatenation f_known·f_new starting from f_new

zdelta is one of the best implementations
             Emacs size   Emacs time
  uncompr    27Mb         ---
  gzip       8Mb          35 secs
  zdelta     1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: a pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link.

  Client  ⇄ (slow link: request / delta-encoded page) ⇄  Proxy  ⇄ (fast link: request / page) ⇄  web

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

             space   time
  uncompr    30Mb    ---
  tgz        20%     linear
  THIS       8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly: n² edge calculations (i.e. zdelta executions)


We wish to exploit some pruning approach


 Collection analysis: cluster the files that appear similar and are thus
   good candidates for zdelta-compression; build a sparse weighted graph
   G’F containing only the edges between those pairs of files

 Assign weights: estimate appropriate edge weights for G’F, thus saving
   zdelta executions. Nonetheless, still quadratic time overall.
             space   time
  uncompr    260Mb   ---
  tgz        12%     2 mins
  THIS       8%      16 mins

Algoritmi per IR

File Synchronization

File synch: The problem

  Client (has f_old)  --- request --->  Server (has f_new)
  Client (has f_old)  <--- update ----  Server (has f_new)

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm

  Client (f_old)  --- block hashes --->  Server (f_new)
  Client (f_old)  <--- encoded file ---  Server (f_new)

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
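A toy sketch of the rolling-checksum idea mentioned above (Python; this is the
classic weak checksum in spirit, not rsync’s exact code): the hash of a block
can be updated in O(1) when the window slides by one byte, which is what lets
the server test f_new against the hashes at every offset cheaply.

  def weak_hash(block):
      a = sum(block) % 65536
      b = sum((len(block) - i) * x for i, x in enumerate(block)) % 65536
      return (b << 16) | a

  def roll(h, out_byte, in_byte, blocklen):
      a = ((h & 0xffff) - out_byte + in_byte) % 65536
      b = ((h >> 16) - blocklen * out_byte + a) % 65536
      return (b << 16) | a

  data = bytes(range(40))
  h = weak_hash(data[0:8])
  for i in range(1, 32):
      h = roll(h, data[i - 1], data[i + 7], 8)
      assert h == weak_hash(data[i:i + 8])       # rolling == recomputing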

Rsync: some experiments

            gcc size   emacs size
  total     27288      27326
  gzip      7563       8577
  zdelta    227        1431
  rsync     964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The Server sends the hashes (unlike rsync, where the client sends them), and the client checks them.
The Server deploys the common f_ref to compress the new f_tar (rsync just compresses it).

A multi-round protocol

Split the file into k blocks of n/k elements, and recurse: log(n/k) levels.
If the distance is ≤ k, then on each level at most k hashes fail to find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits.

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

Occurrences of P in T = all suffixes of T having P as a prefix

  Example: T = mississippi,  P = si  →  occurrences at positions 4, 7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Figure: the compacted trie of all suffixes of T# = mississippi#; edges are
labelled with substrings (e.g. ssi, ppi#, ...), and each leaf stores the
starting position of its suffix]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.

T = mississippi#       (storing SUF(T) explicitly takes Θ(N²) space)

  SA    SUF(T)
  12    #
  11    i#
   8    ippi#
   5    issippi#
   2    ississippi#
   1    mississippi#
  10    pi#
   9    ppi#
   7    sippi#
   4    sissippi#
   6    ssippi#
   3    ssissippi#

Suffix Array:
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison (2 accesses per step).

T = mississippi#,  P = si
At each step, compare P against the suffix pointed to by the middle entry of SA:
if P is larger, recurse on the right half; if P is smaller, recurse on the left half.

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
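A sketch of this search in Python (SA is 1-based, as in the example above;
each comparison looks at most p characters, so the total cost is O(p log2 N)):

  def sa_range(T, SA, P):
      def first_not(pred):                      # first index whose suffix fails pred
          lo, hi = 0, len(SA)
          while lo < hi:
              mid = (lo + hi) // 2
              prefix = T[SA[mid] - 1 : SA[mid] - 1 + len(P)]
              if pred(prefix):
                  lo = mid + 1
              else:
                  hi = mid
          return lo
      left = first_not(lambda s: s < P)         # first suffix with prefix >= P
      right = first_not(lambda s: s <= P)       # first suffix with prefix >  P
      return left, right                        # occurrences are SA[left:right]

  T = "mississippi#"
  SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
  l, r = sa_range(T, SA, "si")
  print(r - l, sorted(SA[l:r]))                 # 2 occurrences, at positions [4, 7]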

Locating the occurrences
T = mississippi#,  P = si
The SA range of suffixes prefixed by si (delimited, e.g., by binary searching
for si# and si$, using the convention # < S < $) gives occ = 2 occurrences,
at positions 4 and 7.

Suffix Array search
• O(p + log2 N + occ) time

Suffix Trays: O(p + log2 |S| + occ)      [Cole et al., ‘06]
String B-tree                            [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays             [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

T = mississippi#
  SA  = 12 11 8 5 2 1 10 9 7 4 6 3
  Lcp =  0  1 1 4 0 0  1 0 2 1 3
  (e.g. Lcp = 4 between issippi# and ississippi#)

• How long is the common prefix between T[i,...] and T[j,...] ?
  → the min of the subarray Lcp[h,k-1] such that SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  → search for some Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  → search for a window Lcp[i,i+C-2] whose entries are all ≥ L.
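A small sketch of these queries (Python; the Lcp array is built naively in
O(N²) time just for illustration, Kasai’s algorithm would do it in O(N)):

  def lcp_array(T, SA):
      def lcp(a, b):
          n = 0
          while a + n < len(T) and b + n < len(T) and T[a + n] == T[b + n]:
              n += 1
          return n
      return [lcp(SA[i] - 1, SA[i + 1] - 1) for i in range(len(SA) - 1)]

  T = "mississippi#"
  SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
  Lcp = lcp_array(T, SA)
  print(Lcp)                          # [0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3]
  print(any(v >= 3 for v in Lcp))     # True: some substring of length >= 3 repeats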



l 0  0 l i  l i -1  s i -1 * f c i 

s0  1

s i  s i -1 * p c i 

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

n

sn 

 p c 
i

i 1

The interval for a message sequence will be called the
sequence interval

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
1.0

0.7
c = .3

0.7
0.49

b = .5

c = .3

0.55
0.49

0.2

0.0

0.55

b = .5

0.3
a = .2

The message is bbc.

0.2

0.49
0.475

c = .3

b = .5
0.35

a = .2

0.3

a = .2

Representing a real number
Binary fractional representation:
. 75

 . 11

1/ 3

 . 01 01

11 / 16

 . 1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
m in

m ax

interval

.11

.110

.111

[.75 ,1.0 )

.101

.1010

.1011

[.625 , .75 )

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
String = ACCBACCACBA B,  k = 2

  Context Empty:   A = 4   B = 2   C = 5   $ = 3

  Context A:   C = 3   $ = 1
  Context B:   A = 2   $ = 1
  Context C:   A = 1   B = 2   C = 2   $ = 3

  Context AC:  B = 1   C = 2   $ = 2
  Context BA:  C = 1   $ = 1
  Context CA:  C = 1   $ = 1
  Context CB:  A = 2   $ = 1
  Context CC:  A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary: all substrings starting before the Cursor.   Output at the cursor shown: <2,3,c>

Algorithm's step:
 Output <d, len, c> where
   d = distance of the copied string wrt the current position
   len = length of the longest match
   c = next char in the text beyond the longest match
 Advance by len + 1

A buffer "window" has fixed length and moves
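A naive C++ sketch of this parsing step repeated over the text (quadratic search for the longest match inside a window W; illustrative only, not gzip's implementation):

  #include <algorithm>
  #include <string>
  #include <tuple>
  #include <vector>
  using namespace std;

  // Emit <d, len, c> triples and advance by len+1 at each step.
  vector<tuple<int,int,char>> lz77_encode(const string& T, int W) {
      vector<tuple<int,int,char>> out;
      int n = T.size();
      for (int cur = 0; cur < n; ) {
          int best_len = 0, best_d = 0;
          for (int start = max(0, cur - W); start < cur; start++) {   // candidate copy source
              int len = 0;
              while (cur + len + 1 < n && T[start + len] == T[cur + len]) len++;  // may overlap the cursor
              if (len > best_len) { best_len = len; best_d = cur - start; }
          }
          out.push_back({best_d, best_len, T[cur + best_len]});       // char beyond the longest match
          cur += best_len + 1;
      }
      return out;
  }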

Example: LZ77 with window (window size = 6)
Text: a a c a a c a b c a b a a a c
Output per step (longest match within W, next character):
 (0,0,a)
 (1,1,c)
 (3,4,b)
 (3,3,a)
 (1,2,c)

LZ77 Decoding
The decoder keeps the same dictionary window as the encoder: it finds the substring and inserts a copy of it.

What if len > d ? (overlap with the text still to be decompressed)
 E.g. seen = abcd, next codeword is (2,9,e)
 Simply copy starting at the cursor:
   for (i = 0; i < len; i++)
     out[cursor+i] = out[cursor-d+i]
 Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Input: a a b a a c a b c a b c b

  Output   Dict.
  (0,a)    1 = a
  (1,b)    2 = ab
  (1,a)    3 = aa
  (0,c)    4 = c
  (2,c)    5 = abc
  (5,b)    6 = abcb

LZ78: Decoding Example

  Input    Output so far                Dict.
  (0,a)    a                            1 = a
  (1,b)    a a b                        2 = ab
  (1,a)    a a b a a                    3 = aa
  (0,c)    a a b a a c                  4 = c
  (2,c)    a a b a a c a b c            5 = abc
  (5,b)    a a b a a c a b c a b c b    6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
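A minimal C++ sketch of the LZW coding loop (encoder only; the SSc special case only affects decoding). The trace below uses a = 112, b = 113, c = 114 as in the slide's convention:

  #include <string>
  #include <unordered_map>
  #include <vector>
  using namespace std;

  // LZW encoding: emit the code of the longest dictionary match S, add Sc to the
  // dictionary, but do not send c. Single chars are pre-loaded.
  vector<int> lzw_encode(const string& T) {
      unordered_map<string,int> dict;
      int next_code = 256;
      for (int c = 0; c < 256; c++) dict[string(1, (char)c)] = c;

      vector<int> out;
      string S;                                    // current match
      for (char c : T) {
          if (dict.count(S + c)) S += c;           // extend the match
          else {
              out.push_back(dict[S]);              // emit code of S
              dict[S + c] = next_code++;           // add Sc to the dictionary
              S = string(1, c);
          }
      }
      if (!S.empty()) out.push_back(dict[S]);
      return out;
  }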

LZW: Encoding Example
Input: a a b a a c a b a b a c b   (a = 112, b = 113, c = 114)

  Output   Dict.
  112      256 = aa
  112      257 = ab
  113      258 = ba
  256      259 = aac
  114      260 = ca
  257      261 = aba
  261      262 = abac
  114      263 = cb

LZW: Decoding Example

  Input   Output so far              Dict.
  112     a
  112     a a                        256 = aa
  113     a a b                      257 = ab
  256     a a b a a                  258 = ba
  114     a a b a a c                259 = aac
  257     a a b a a c a b            260 = ca
  261     a a b a a c a b ?          261 is not yet in the dictionary: the decoder is one step behind
  114     a a b a a c a b a b a      261 = aba (resolved one step later)

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform (1994)
Let us be given a text T = mississippi#. Write all its cyclic rotations:

mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows. F is the first column, L the last one:

F              L
# mississipp  i
i #mississip  p
i ppi#missis  s
i ssippi#mis  s
i ssissippi#  m
m ississippi  #
p i#mississi  p
p pi#mississ  i
s ippi#missi  s
s issippi#mi  s
s sippi#miss  i
s sissippi#m  i

A famous example; a real text T is much longer...

A useful tool: the L → F mapping
(Same sorted matrix as above, with T unknown.)
How do we map L's chars onto F's chars ?  ... We need to distinguish equal chars in F...
Take two equal chars of L and rotate their rows rightward: they keep the same relative order !!

The BWT is invertible
(Same sorted matrix, with T unknown.)
Two key properties:
1. The LF-array maps L's chars to F's chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:   T = .... i p p i #

InvertBWT(L)
  Compute LF[0,n-1];
  r = 0; i = n;
  while (i>0) {
    T[i] = L[r];
    r = LF[r]; i--;
  }
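A compact C++ sketch of this inversion (it rebuilds LF by counting and, unlike the 1-based pseudocode above, reconstructs the full T including the end-marker '#'; illustrative only):

  #include <array>
  #include <string>
  #include <vector>
  using namespace std;

  string invert_bwt(const string& L) {
      int n = L.size();
      array<int,256> cnt{}, first{}, seen{};
      for (unsigned char c : L) cnt[c]++;
      for (int c = 1; c < 256; c++) first[c] = first[c-1] + cnt[c-1];   // start of each char-block in F

      vector<int> LF(n);
      for (int i = 0; i < n; i++)                      // equal chars of L keep their relative order in F
          LF[i] = first[(unsigned char)L[i]] + seen[(unsigned char)L[i]]++;

      string T(n, ' ');
      int r = (int)L.find('#');                        // the row equal to T is the one ending with '#'
      for (int i = n - 1; i >= 0; i--) { T[i] = L[r]; r = LF[r]; }
      return T;                                        // invert_bwt("ipssm#pissii") == "mississippi#"
  }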

How to compute the BWT ?
The BWT matrix is the matrix of sorted rotations; pair it with the Suffix Array:

SA = 12 11 8 5 2 1 10 9 7 4 6 3
L  =  i  p s s m #  p i s s i i

We said that: L[i] precedes F[i] in T   (e.g. L[3] = T[7])
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
Input: T = mississippi#

  SA   suffix
  12   #
  11   i#
   8   ippi#
   5   issippi#
   2   ississippi#
   1   mississippi#
  10   pi#
   9   ppi#
   7   sippi#
   4   sissippi#
   6   ssippi#
   3   ssissippi#

Elegant but inefficient. Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…

 Physical network graph
   V = Routers
   E = communication links

 The "cosine" graph (undirected, weighted)
   V = static web pages
   E = semantic distance between pages

 Query-Log graph (bipartite, weighted)
   V = queries and URLs
   E = (q,u) if u is a result for q, and has been clicked by some user who issued q

 Social graph (undirected, unweighted)
   V = users
   E = (x,y) if x knows y (facebook, address book, email, ..)

Definition
Directed graph G = (V,E):
 V = URLs, E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN & no OUT)
Three key properties; the first one:
 Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution
Altavista crawl (1999) and WebBase crawl (2001): the indegree follows a power-law distribution
  Pr[ in-degree(u) = k ]  ≈  1/k^a,   a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed adjacency list → adjacency list with compressed gaps (exploits locality)
Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}
For negative entries (the first gap s1-x may be negative): map v ≥ 0 to 2v and v < 0 to 2|v|-1, as in the example below.

Copy-lists (exploit similarity); reference chains possibly limited.
Uncompressed adjacency list → adjacency list with copy lists:
each bit of y's copy-list tells whether the corresponding successor of the reference x is also a successor of y; the reference index is chosen in [0,W] so as to give the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with copy lists → adjacency list with copy blocks (RLE on the bit sequences):
the first copy block is 0 if the copy list starts with 0; the last block is omitted (we know the length…); the length is decremented by one for all blocks.

This is a Java and C++ lib (≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with copy blocks → consecutive runs become intervals among the extra-nodes.
Intervals: use their left extreme and length; the interval length is decremented by Lmin = 2.
Residuals: differences between residuals, or wrt the source. Example:
  0 = (15-15)*2 (positive)
  2 = (23-19)-2 (jump >= 2)
  600 = (316-16)*2
  3 = |13-15|*2-1 (negative)
  3018 = 3041-22-1
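A tiny C++ sketch of this gap encoding of a successor list (the negative-gap mapping follows the example above; illustrative, not the WebGraph API):

  #include <cstdlib>
  #include <vector>
  using namespace std;

  // Gaps for S(x) = {s1-x, s2-s1-1, ..., sk-s(k-1)-1}; the first gap may be negative,
  // so it is mapped via v(t) = 2t if t >= 0, 2|t|-1 otherwise.
  vector<int> encode_successors(int x, const vector<int>& succ /* sorted */) {
      vector<int> gaps;
      if (succ.empty()) return gaps;
      int t = succ[0] - x;
      gaps.push_back(t >= 0 ? 2*t : 2*abs(t) - 1);
      for (size_t i = 1; i < succ.size(); i++)
          gaps.push_back(succ[i] - succ[i-1] - 1);   // strictly increasing => non-negative
      return gaps;                                   // each gap is then written with a universal code
  }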

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides and efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
           Emacs size   Emacs time
uncompr    27Mb         ---
gzip       8Mb          35 secs
zdelta     1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: weighted graph over the files plus a dummy node 0; edge weights are the zdelta sizes (e.g. 20, 123, 220, 620, 2000), and the min branching picks the cheapest reference for each file.]

           space   time
uncompr    30Mb    ---
tgz        20%     linear
THIS       8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
           space    time
uncompr    260Mb    ---
tgz        12%      2 mins
THIS       8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

          gcc size   emacs size
total     27288      27326
gzip      7563       8577
zdelta    227        1431
rsync     964        4452

Compressed size in KB (slightly outdated numbers)
Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync
The server sends hashes (unlike the client in rsync); the client checks them.
The server deploys the common fref to compress the new ftar (rsync compresses just it).

A multi-round protocol
k blocks of n/k elems, log n/k levels.
If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff
P is a prefix of the i-th suffix of T (i.e. T[i,N])

Occurrences of P in T = all suffixes of T having P as a prefix
Example: P = si, T = mississippi → occurrences at positions 4, 7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Figure: the suffix tree of T# = mississippi# (positions 1..12); internal edges carry labels such as #, i, p, s, si, ssi, i#, pi#, ppi#, mississippi#, and each of the 12 leaves stores the starting position of its suffix.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.
Storing SUF(T) explicitly takes Θ(N²) space; the suffix array keeps only the suffix pointers:

  SA   SUF(T)
  12   #
  11   i#
   8   ippi#
   5   issippi#
   2   ississippi#
   1   mississippi#
  10   pi#
   9   ppi#
   7   sippi#
   4   sissippi#
   6   ssippi#
   3   ssissippi#

T = mississippi#   (P = si)
Suffix Array:
• SA: Θ(N log2 N) bits
• Text T: N chars
⇒ In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step.
[Figure: T = mississippi#, P = si; at one step of the binary search P is larger than the probed suffix, at another it is smaller, until the range of suffixes prefixed by P is found.]

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
⇒ overall, O(p log2 N) time
(improvable to O(p + log2 N) [Manber-Myers, '90], and to bounds in terms of |S| [Cole et al, '06])
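A minimal C++ sketch of this indirect binary search, returning the SA range of suffixes prefixed by P (illustrative; assumes SA is already built):

  #include <string>
  #include <utility>
  #include <vector>
  using namespace std;

  // Returns [lo, hi): the occurrences of P start at T[SA[lo]], ..., T[SA[hi-1]].
  pair<int,int> sa_search(const string& T, const vector<int>& SA, const string& P) {
      int n = SA.size(), l = 0, r = n;
      while (l < r) {                                  // first suffix whose |P|-prefix is >= P
          int mid = (l + r) / 2;
          if (T.compare(SA[mid], P.size(), P) < 0) l = mid + 1; else r = mid;
      }
      int lo = l; r = n;
      while (l < r) {                                  // first suffix whose |P|-prefix is  > P
          int mid = (l + r) / 2;
          if (T.compare(SA[mid], P.size(), P) <= 0) l = mid + 1; else r = mid;
      }
      return {lo, l};
  }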

Locating the occurrences
[Figure: SA of T = mississippi#; the occ=2 occurrences of P = si (positions 4 and 7) are contiguous in SA, found by binary-searching the two boundaries of the range of suffixes prefixed by si.]

Suffix Array search
• O(p + log2 N + occ) time
Suffix Trays: O(p + log2 |S| + occ)   [Cole et al., '06]
String B-tree   [Ferragina-Grossi, '95]
Self-adjusting Suffix Arrays   [Ciriani et al., '02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

T   = mississippi#
SA  = 12 11 8 5 2 1 10 9 7 4 6 3
Lcp =  0  1 1 4 0 0  1 0 2 1 3      (e.g. Lcp = 4 between issippi# and ississippi#)

• How long is the common prefix between T[i,...] and T[j,...] ?
  • Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  • Search for Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  • Search for a run Lcp[i,i+C-2] whose entries are all ≥ L


Slide 58

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 10^4 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and the algorithm uses all of them:
(1/B) * (p * f/(1+f) * C)  ≈  30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray
 Goal: Given a stock and its D-performance over time, find the time window in which it achieved the best "market performance".
 Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

      4K    8K   16K   32K    128K   256K   512K   1M
n³    22s   3m   26m   3.5h   28h    --     --     --
n²    0     0    0     1s     26s    106s   7m     28m
An optimal solution
We assume every subsum ≠ 0.
[Figure: the optimum is preceded by a prefix of negative sum and has positive running sums inside.]
A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm
 sum = 0; max = -1;
 For i = 1,...,n do
   If (sum + A[i] ≤ 0) sum = 0;
   else { sum += A[i]; max = MAX{max, sum}; }

Note:
• Sum < 0 when OPT starts;
• Sum > 0 within OPT
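A direct C++ transcription of the scan above (illustrative):

  #include <algorithm>
  #include <vector>
  using namespace std;

  // Reset the running sum when it drops to <= 0, otherwise extend it and keep the best.
  long long max_subarray(const vector<long long>& A) {
      long long sum = 0, best = -1;        // as in the slide: max = -1 (a positive optimum is assumed)
      for (long long x : A) {
          if (sum + x <= 0) sum = 0;
          else { sum += x; best = max(best, sum); }
      }
      return best;
  }
  // For A = {2,-5,6,1,-2,4,3,-13,9,-6,7} it returns 12 (the subarray 6 1 -2 4 3).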

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort ⇒ Θ(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 10^9 random I/Os = 10^9 * 5ms ≈ 2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02   m = (i+j)/2;            // Divide
03   Merge-Sort(A,i,m);      // Conquer
04   Merge-Sort(A,m+1,j);
05   Merge(A,i,m,j)          // Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree
[Figure: the log2 N levels of merging over runs of numbers; if the run-size is larger than B (i.e. after the first step!!), fetching all of it in memory for merging does not help. How do we deploy the disk/memory features?]

N/M runs, each sorted in internal memory (no I/Os)
— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort
 The key is to balance run-size and #runs to merge
 Sort N items with main memory M and disk pages B:
   Pass 1: produce (N/M) sorted runs.
   Pass i: merge X ≤ M/B runs  ⇒  logM/B N/M passes
[Figure: X input buffers (INPUT 1..X) and one OUTPUT buffer, each of B items, in main memory; the runs stream in from disk and the merged run streams back to disk.]

Multiway Merging
[Figure: buffers Bf1..Bfx with pointers p1..pX over Run 1 .. Run X=M/B; repeatedly select min(Bf1[p1], Bf2[p2], …, Bfx[pX]), append it to the output buffer Bfo (flush when full), and fetch the next page of a run whenever its pointer pi reaches B, until EOF; the output file is the merged run.]
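An in-memory C++ sketch of the X-way merge with a min-heap (whole runs stand in for the disk pages and buffers; illustrative):

  #include <functional>
  #include <queue>
  #include <utility>
  #include <vector>
  using namespace std;

  vector<int> multiway_merge(const vector<vector<int>>& runs) {
      using Item = pair<int, pair<int,int>>;              // (value, (run id, index in run))
      priority_queue<Item, vector<Item>, greater<Item>> heap;
      for (int r = 0; r < (int)runs.size(); r++)
          if (!runs[r].empty()) heap.push({runs[r][0], {r, 0}});

      vector<int> out;
      while (!heap.empty()) {
          auto [v, pos] = heap.top(); heap.pop();
          auto [r, i] = pos;
          out.push_back(v);                                // emit the current minimum ("flush to Bfo")
          if (i + 1 < (int)runs[r].size())                 // fetch the next element of that run
              heap.push({runs[r][i+1], {r, i+1}});
      }
      return out;
  }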

Cost of Multi-way Merge-Sort
 Number of passes = logM/B #runs ≤ logM/B N/M
 Optimal cost = Θ((N/B) logM/B N/M) I/Os
In practice
 M/B ≈ 1000  ⇒  #passes = logM/B N/M ≈ 1
 One multiway merge  ⇒  2 passes = few mins (tuning depends on disk features)
  A large fan-out (M/B) decreases #passes
  Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements
 Goal: Top queries over a stream of N items (|S| large).
 Math Problem: Find the item y whose frequency is > N/2, using the smallest space (i.e. if the mode occurs > N/2).

A = b a c c c d c b a a a c c b c c c

Algorithm
 Use a pair of variables <X,C>
 For each item s of the stream:
   if (X==s) then C++
   else { C--; if (C==0) { X=s; C=1; } }
 Return X;
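A C++ sketch of the same majority-vote scan (illustrative):

  #include <string>
  using namespace std;

  // If some item occurs > N/2 times, X ends up being that item.
  char majority_candidate(const string& stream) {
      char X = 0; long long C = 0;
      for (char s : stream) {
          if (C == 0)      { X = s; C = 1; }   // adopt a new candidate
          else if (X == s) { C++; }
          else             { C--; }            // a "negative mate" cancels one occurrence
      }
      return X;
  }
  // majority_candidate("bacccdcbaaaccbccc") == 'c'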

Proof (sketch)
If X ≠ y at the end, then every one of y's occurrences has a "negative" mate, so the mates are ≥ #occ(y) and the stream would contain ≥ 2 * #occ(y) > N items — a contradiction.
(Problems arise if the mode occurs ≤ N/2.)

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 10^9  ⇒  size = 6Gb
n = 10^6 documents
TotT = 10^9 term occurrences (avg term length is 6 chars)
t = 5 * 10^5 distinct terms

What kind of data structure do we build to support word-based searches ?

Solution 1: Term-Doc matrix  (t = 500K terms, n = 1 million docs; 1 if the play contains the word, 0 otherwise)

             Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
  Antony     1                 1              0            0       0        1
  Brutus     1                 1              0            1       0        0
  Caesar     1                 1              0            1       1        1
  Calpurnia  0                 1              0            0       0        0
  Cleopatra  1                 0              0            0       0        0
  mercy      1                 0              1            1       1        1
  worser     1                 0              1            1       1        0

Space is 500Gb !

Solution 2: Inverted index

  Brutus     →  2 4 8 16 32 64 128
  Caesar     →  1 2 3 5 8 13 21 34
  Calpurnia  →  13 16

We can still do better: i.e. 30-50% of the original text
1. Typically use about 12 bytes per posting
2. We have 10^9 total terms  ⇒  at least 12Gb space
3. Compressing the 6Gb of documents gets 1.5Gb of data
Better index, but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string is (losslessly) compressed, some other must expand.
Take all messages of length n. Is it possible to compress ALL OF THEM in fewer bits ?
NO, they are 2^n but we have fewer compressed msgs…
  ∑_{i=1}^{n-1} 2^i = 2^n − 2

We need to talk about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the self-information of s is:
  i(s) = log2 (1/p(s)) = −log2 p(s)
Lower probability  ⇒  higher information

Entropy is the weighted average of i(s):
  H(S) = ∑_{s∈S} p(s) · log2 (1/p(s))   bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword lengths L[s], the average length is defined as
  La(C) = ∑_{s∈S} p(s) · L[s]

We say that a prefix code C is optimal if for all prefix codes C', La(C) ≤ La(C')

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any uniquely decodable code C, we have
  H(S) ≤ La(C)
Theorem (upper bound, Shannon). For any probability distribution, there exists a prefix code C such that
  La(C) ≤ H(S) + 1
(The Shannon code takes ⌈log 1/p⌉ bits.)

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5
[Tree: a(.1) and b(.2) merge into (.3); (.3) and c(.2) merge into (.5); (.5) and d(.5) merge into (1); each edge is labelled 0/1.]
a=000, b=001, c=01, d=1
There are 2^(n-1) "equivalent" Huffman trees

What about ties (and thus, tree depth) ?
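A short C++ sketch of the greedy construction with a priority queue (illustrative; ties are broken arbitrarily, which is exactly the issue raised above):

  #include <queue>
  #include <utility>
  #include <vector>
  using namespace std;

  struct Node { double p; char sym; Node *left = nullptr, *right = nullptr; };

  Node* build_huffman(const vector<pair<char,double>>& symbols) {
      auto cmp = [](Node* a, Node* b) { return a->p > b->p; };
      priority_queue<Node*, vector<Node*>, decltype(cmp)> pq(cmp);
      for (auto [s, p] : symbols) pq.push(new Node{p, s});
      while (pq.size() > 1) {
          Node* a = pq.top(); pq.pop();                 // the two least-probable nodes...
          Node* b = pq.top(); pq.pop();
          pq.push(new Node{a->p + b->p, 0, a, b});      // ...are merged under a new internal node
      }
      return pq.top();                                  // codewords = root-to-leaf 0/1 paths
  }
  // With p(a)=.1, p(b)=.2, p(c)=.2, p(d)=.5 one optimal assignment is a=000, b=001, c=01, d=1.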

Encoding and Decoding
Encoding: emit the root-to-leaf path leading to the symbol to be encoded.
Decoding: start at the root and take the branch for each bit received. When at a leaf, output its symbol and return to the root.
  abc...  →  00000101
  101001...  →  dcb
[Same tree as above: a=000, b=001, c=01, d=1.]

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h² + |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. Its self-information is
  −log(.999) ≈ .00144
If we were to send 1000 such symbols we might hope to use 1000 · .00144 ≈ 1.44 bits.
Using Huffman, we take at least 1 bit per symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
  1 extra bit per macro-symbol = 1/k extra bits per symbol
  Larger model to be transmitted
Shannon took infinite sequences, and k → ∞ !!

In practice, we have:
  The model takes |S|^k (k * log |S|) + h² bits (where h might be |S|)
  It is H0(S^L) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

Compress + Search ?   [Moura et al, 98]
Compressed text derived from a word-based Huffman:
 Symbols of the Huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged: each byte carries 7 bits of Huffman code plus 1 tagging bit

[Figure: T = "bzip or not bzip"; the dictionary words (bzip, or, not, space) get byte-aligned codewords, and C(T) is the concatenation of the tagged codewords.]

CGrep and other ideas...
P = bzip = 1a 0b
[Figure: a GREP over the compressed text C(T) of T = "bzip or not bzip": the tagged codeword of P is compared against each codeword of C(T), answering yes/no per word.]

Speed ≈ Compression ratio

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary = {bzip, not, or, space}, S = "bzip or not bzip", P = bzip = 1a 0b
[Figure: scan the compressed C(S) comparing P's tagged codeword with each codeword; each comparison answers yes/no.]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons
 Strings are also numbers, H: strings → numbers.
 Let s be a string of length m:
   H(s) = ∑_{i=1}^{m} 2^{m-i} · s[i]
 P = 0101  →  H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5
 s = s' if and only if H(s) = H(s')

Definition:
let Tr denote the m-length substring of T starting at position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons
 We can compute H(Tr) from H(Tr-1):
   H(Tr) = 2·H(Tr-1) − 2^m·T[r-1] + T[r+m-1]
 T = 10110101, m = 4:
   T1 = 1011, T2 = 0110
   H(T1) = H(1011) = 11
   H(T2) = H(0110) = 2·11 − 2⁴·1 + 0 = 22 − 16 = 6 = H(0110)

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P = 101111, q = 7
H(P) = 47;  Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally:
  1·2 (mod 7) + 0 = 2
  2·2 (mod 7) + 1 = 5
  5·2 (mod 7) + 1 = 4 (mod 7)
  4·2 (mod 7) + 1 = 2 (mod 7)
  2·2 (mod 7) + 1 = 5
  5 (mod 7) = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1):
  2^m (mod q) = 2·(2^{m-1} (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
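A C++ sketch of the fingerprint scan with verification (the deterministic variant; the modulus here is a fixed large prime rather than a random one, purely for illustration):

  #include <string>
  #include <vector>
  using namespace std;

  vector<int> karp_rabin(const string& T, const string& P, long long q = 1000000007LL) {
      int n = T.size(), m = P.size();
      vector<int> occ;
      if (m > n) return occ;
      long long hp = 0, ht = 0, pow_m = 1;                 // pow_m = 2^(m-1) mod q
      for (int i = 0; i < m; i++) {
          hp = (2*hp + P[i]) % q;
          ht = (2*ht + T[i]) % q;
          if (i) pow_m = (2*pow_m) % q;
      }
      for (int r = 0; ; r++) {
          if (hp == ht && T.compare(r, m, P) == 0) occ.push_back(r);   // verify to rule out false matches
          if (r + m >= n) break;
          ht = (2*(ht - T[r]*pow_m % q + q) + T[r+m]) % q;             // Hq(T_{r+1}) from Hq(T_r)
      }
      return occ;
  }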

Problem 1: Solution
[Same figure: the tagged codeword of P = bzip (= 1a 0b) is compared against each codeword of C(S), S = "bzip or not bzip"; each comparison answers yes/no.]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

[Matrix M (m rows, n columns) for T = california, P = for: the only 1-entries are M(1,5), M(2,6), M(3,7); the 1 in the last row at column 7 witnesses the occurrence of "for" ending at position 7.]

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M
 We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one.
 We define the m-length binary vector U(x) for each character x in the alphabet: U(x) is 1 at the positions in P where character x appears.
 Example: P = abaac
   U(a) = (1,0,1,1,0)
   U(b) = (0,1,0,0,0)
   U(c) = (0,0,0,0,1)
How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with i-th bit of U(T[j]) to estabilish if both are true
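A word-sized C++ sketch of this bit-parallel scan (assumes m ≤ 64, i.e. the pattern fits in one machine word; illustrative):

  #include <cstdint>
  #include <string>
  #include <vector>
  using namespace std;

  // Bit i-1 of the word plays the role of row i of the matrix M.
  vector<int> shift_and(const string& T, const string& P) {
      int m = P.size();
      uint64_t U[256] = {0};
      for (int i = 0; i < m; i++) U[(unsigned char)P[i]] |= 1ULL << i;      // U(x): positions of x in P

      vector<int> occ;
      uint64_t M = 0;                                                       // column j of the matrix
      for (int j = 0; j < (int)T.size(); j++) {
          M = ((M << 1) | 1ULL) & U[(unsigned char)T[j]];                   // M(j) = BitShift(M(j-1)) & U(T[j])
          if (M & (1ULL << (m - 1))) occ.push_back(j - m + 1);              // last row set => occurrence ending at j
      }
      return occ;
  }
  // shift_and("xabxabaaca", "abaac") reports the occurrence at 0-based position 4 (ending at j = 9, 1-based).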

An example (T = xabxabaaca, P = abaac)
[Worked figure: columns M(1), M(2), M(3), …, M(9) computed via M(j) = BitShift(M(j-1)) & U(T[j]); at j = 9 the 5th (last) bit of the column is 1, so P = abaac occurs ending at position 9.]
Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions
We want to allow the pattern to contain special symbols, like the character class [a-f].
P = [a-b]baac
  U(a) = (1,0,1,1,0)
  U(b) = (1,1,0,0,0)
  U(c) = (0,0,0,0,1)
What about '?', '[^…]' (not).

Problem 1: An other solution
[Same figure: dictionary {bzip, not, or, space}, S = "bzip or not bzip", P = bzip = 1a 0b matched against the codewords of C(S).]

Speed ≈ Compression ratio

Problem 2
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring.
Dictionary = {bzip, not, or, space}, S = "bzip or not bzip", P = o; e.g. not = 1g 0g 0a, or = 1g 0a 0b.

Speed ≈ Compression ratio? No! Why?
A scan of C(S) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring, allowing at most k mismatches.
Dictionary = {bzip, not, or, space}, S = "bzip or not bzip", P = bot, k = 2.

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal: given k, find all the occurrences of P in T with up to k mismatches.
We define the matrix Ml to be an m by n binary matrix such that:
  Ml(i,j) = 1 iff there are no more than l mismatches between the first i characters of P and the i characters of T ending at character j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1
 The first i-1 characters of P match a substring of T ending at j-1 with at most l mismatches, and the next pair of characters in P and T are equal:
   BitShift(Ml(j-1)) & U(T[j])

Computing Ml: case 2
 The first i-1 characters of P match a substring of T ending at j-1 with at most l-1 mismatches (the j-th character is charged as a mismatch):
   BitShift(Ml-1(j-1))

Computing Ml
 We compute Ml for all l = 0, …, k; for each j compute M0(j), M1(j), …, Mk(j); for all l initialize Ml(0) to the zero vector.
 In order to compute Ml(j), we observe that there is a match iff
   Ml(j) = [ BitShift(Ml(j-1)) & U(T(j)) ]  OR  BitShift(Ml-1(j-1))

Example
T = xabxabaaca, P = abaad, k = 1
[Figure: the 5×10 matrices M0 and M1; the 1-entries in the last row of M1 mark the positions of T where P occurs with at most one mismatch (e.g. ending at position 9, where T[5..9] = abaac differs from abaad in one character).]

How much do we pay?
 The running time is O(kn(1+m/w)).
 Again, the method is practically efficient for small m.
 Still, only O(k) columns of M are needed at any given time. Hence the space used by the algorithm is O(k) memory words.

Problem 3: Solution
[Same figure: S = "bzip or not bzip", P = bot, k = 2; the term not (codeword 1g 0g 0a) is within 2 mismatches of P.]

Agrep: more sophisticated operations
 The Shift-And method can solve other ops.
 The edit distance between two strings p and s is d(p,s) = the minimum number of operations needed to transform p into s via three ops:
   Insertion: insert a symbol in p
   Deletion: delete a symbol from p
   Substitution: change a symbol in p into a different one
 Example: d(ananas,banane) = 3
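For reference, the classic dynamic program computing this distance (a C++ sketch, not the bit-parallel agrep version):

  #include <algorithm>
  #include <string>
  #include <vector>
  using namespace std;

  int edit_distance(const string& p, const string& s) {
      int m = p.size(), n = s.size();
      vector<vector<int>> D(m + 1, vector<int>(n + 1));
      for (int i = 0; i <= m; i++) D[i][0] = i;          // delete all of p[1..i]
      for (int j = 0; j <= n; j++) D[0][j] = j;          // insert all of s[1..j]
      for (int i = 1; i <= m; i++)
          for (int j = 1; j <= n; j++)
              D[i][j] = min({ D[i-1][j] + 1,             // deletion
                              D[i][j-1] + 1,             // insertion
                              D[i-1][j-1] + (p[i-1] != s[j-1]) });  // substitution / match
      return D[m][n];                                    // edit_distance("ananas","banane") == 3
  }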

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
 x > 0 and Length = ⌊log2 x⌋ + 1
 g(x) = (Length − 1) zeros, followed by x in binary
 e.g., 9 is represented as <000,1001>
 The g-code for x takes 2⌊log2 x⌋ + 1 bits (i.e. a factor of 2 from optimal)
 Optimal for Pr(x) = 1/2x², and i.i.d. integers

It is a prefix-free encoding…
 Given the following sequence of g-coded integers, reconstruct the original sequence:
   0001000001100110000011101100111
 → 0001000 | 00110 | 011 | 00000111011 | 00111  =  8, 6, 3, 59, 7
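A small C++ sketch of g-encoding/decoding matching the exercise above (illustrative):

  #include <string>
  #include <vector>
  using namespace std;

  // g(x): (⌊log2 x⌋) zeros followed by the binary representation of x (x > 0).
  string gamma_encode(unsigned x) {
      string bin;
      for (unsigned y = x; y > 0; y >>= 1) bin = char('0' + (y & 1)) + bin;
      return string(bin.size() - 1, '0') + bin;          // gamma_encode(9) == "0001001"
  }

  // Decode a concatenation of g-codes: count z leading zeros, then read z+1 more bits.
  vector<unsigned> gamma_decode(const string& bits) {
      vector<unsigned> out;
      size_t i = 0;
      while (i < bits.size()) {
          size_t z = 0;
          while (bits[i + z] == '0') z++;
          unsigned x = 0;
          for (size_t j = 0; j <= z; j++) x = (x << 1) | (bits[i + z + j] - '0');
          out.push_back(x);
          i += 2 * z + 1;
      }
      return out;   // gamma_decode("0001000001100110000011101100111") == {8,6,3,59,7}
  }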

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code g(i).
Recall that: |g(i)| ≤ 2·log i + 1
How good is this approach wrt Huffman?  Compression ratio ≤ 2·H0(s) + 1
Key fact:  1 ≥ ∑_{i=1,...,x} pi ≥ x·px  ⇒  x ≤ 1/px

How good is it ?
Encode the integers via g-coding: |g(i)| ≤ 2·log i + 1
The cost of the encoding is (recall i ≤ 1/pi):
  ∑_{i=1,…,|S|} pi·|g(i)|  ≤  ∑_{i=1,…,|S|} pi·[2·log(1/pi) + 1]  =  2·H0(X) + 1

No much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move-to-Front Coding
Transforms a char sequence into an integer sequence, that can then be var-length coded:
 Start with the list of symbols L = [a,b,c,d,…]
 For each input symbol s:
   1) output the position of s in L
   2) move s to the front of L

There is a memory. Properties: it exploits temporal locality, and it is dynamic.
 X = 1^n 2^n 3^n … n^n  →  Huff = O(n² log n), MTF = O(n log n) + n² bits

No much worse than Huffman ...but it may be far better
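A C++ sketch of the MTF transform (O(|S|) per symbol; illustrative):

  #include <algorithm>
  #include <iterator>
  #include <list>
  #include <string>
  #include <vector>
  using namespace std;

  // For each symbol output its current position in the list L, then move it to the front.
  vector<int> mtf_encode(const string& text, list<char> L /* initial symbol list, e.g. [a,b,c,...] */) {
      vector<int> out;
      for (char s : text) {
          auto it = find(L.begin(), L.end(), s);
          out.push_back((int)distance(L.begin(), it));   // position of s in L (0-based)
          L.erase(it);
          L.push_front(s);                               // temporal locality => small integers
      }
      return out;
  }
  // mtf_encode("aabbbc", {'a','b','c'}) == {0,0,1,0,0,2}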

MTF: how good is it ?
Encode the integers via g-coding: |g(i)| ≤ 2·log i + 1
Put S in front, and consider the cost of encoding (p_x^i = position of the i-th occurrence of symbol x):
  O(|S| log |S|) + ∑_{x=1}^{|S|} ∑_{i=2}^{n_x} g(p_x^i − p_x^{i-1})
By Jensen's inequality:
  ≤ O(|S| log |S|) + ∑_{x=1}^{|S|} n_x · [2·log(N/n_x) + 1]
  = O(|S| log |S|) + N·[2·H0(X) + 1]
  ⇒  La[mtf] ≤ 2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
  abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In the case of binary strings  ⇒  just the run lengths and one bit
There is a memory. Properties: it exploits spatial locality, and it is a dynamic code.
  X = 1^n 2^n 3^n … n^n  →  Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)
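A tiny C++ sketch of RLE (illustrative):

  #include <string>
  #include <utility>
  #include <vector>
  using namespace std;

  // (symbol, run length) pairs, as in abbbaacccca => (a,1)(b,3)(a,2)(c,4)(a,1).
  vector<pair<char,int>> rle_encode(const string& s) {
      vector<pair<char,int>> out;
      for (size_t i = 0; i < s.size(); ) {
          size_t j = i;
          while (j < s.size() && s[j] == s[i]) j++;
          out.push_back({s[i], (int)(j - i)});
          i = j;
      }
      return out;
  }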

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive).
e.g.  a = .2, b = .5, c = .3  →  f(a) = .0, f(b) = .2, f(c) = .7
  f(i) = ∑_{j<i} p(j)
The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac   (a = .2, b = .5, c = .3)
  start [0,1);  after b → [.2,.7);  after a → [.2,.3);  after c → [.27,.3)
The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities p[c] use the following:
  l_0 = 0,  s_0 = 1
  l_i = l_{i-1} + s_{i-1} · f[c_i]
  s_i = s_{i-1} · p[c_i]
f[c] is the cumulative probability up to symbol c (not included).
The final interval size is  s_n = ∏_{i=1}^{n} p[c_i]
The interval for a message sequence will be called the sequence interval
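A direct C++ transcription of these interval updates (floating point, for illustration only; a real coder uses the integer version with scaling):

  #include <utility>
  #include <vector>
  using namespace std;

  pair<double,double> sequence_interval(const vector<int>& msg,       // symbols as indices 0..|S|-1
                                        const vector<double>& p,      // p[c]
                                        const vector<double>& f) {    // f[c] = cumulative prob. before c
      double l = 0.0, s = 1.0;
      for (int c : msg) {
          l = l + s * f[c];
          s = s * p[c];
      }
      return {l, s};          // the sequence interval is [l, l+s)
  }
  // With p = {.2,.5,.3}, f = {0,.2,.7} (a,b,c) and msg = b a c the result is l = .27, s = .03.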

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3 (a = .2, b = .5, c = .3):
  .49 ∈ [.2,.7) → b;  then .49 ∈ [.3,.55) → b;  then .49 ∈ [.475,.55) → c
The message is bbc.

Representing a real number
Binary fractional representation:
  .75    = .11
  1/3    = .0101...
  11/16  = .1011

Algorithm (to emit the bits of x in [0,1)):
1. x = 2*x
2. If x < 1 output 0
3. else x = x - 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
e.g.  [0,.33) = .01    [.33,.66) = .1    [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
m in

m ax

interval

.11

.110

.111

[.75 ,1.0 )

.101

.1010

.1011

[.625 , .75 )

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that −log s + 1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most
  1 + ⌈log (1/s)⌉ = 1 + ⌈log ∏_{i=1,n} (1/p_i)⌉
  ≤ 2 + ∑_{i=1,n} log (1/p_i)
  = 2 + ∑_{k=1,|S|} n p_k log (1/p_k)
  = 2 + n H0 bits

In practice nH0 + 0.02 n bits, because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
String = ACCBACCACBA B,  k = 2

  Context Empty:   A = 4   B = 2   C = 5   $ = 3

  Context A:   C = 3   $ = 1
  Context B:   A = 2   $ = 1
  Context C:   A = 1   B = 2   C = 2   $ = 3

  Context AC:  B = 1   C = 2   $ = 2
  Context BA:  C = 1   $ = 1
  Context CA:  C = 1   $ = 1
  Context CB:  A = 2   $ = 1
  Context CC:  A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Q(n2 log n) time in the worst-case
• Q(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in - degree ( u )  k ] 

a  2.1

1

k

a

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides and efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
Emacs size

Emacs time

uncompr

27Mb

---

gzip

8Mb

35 secs

zdelta

1.5Mb

42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

0

620

2

20

123

2000

20
220

3

1
5

space

time

uncompr

30Mb

---

tgz

20%

linear

THIS

8%

quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
space

time

uncompr

260Mb

---

tgz

12%

2 mins

THIS

8%

16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

gcc size
total
27288
gzip
7563
zdelta
227
rsync
964

emacs size
27326
8577
1431
4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
#

0

s

i

1
ssi

ppi#

4

#

1

p

i

3

2

1

ppi#

6

pi#

7

8
5

T# = mississippi#
2 4 6 8 10

si

ppi#

i#

ppi#

11

mississippi#

12

2

1 10

9

4

3

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
5

Q(N2) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Q(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does it exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does it exist a substring of length ≥ L occurring ≥ C times ?
• Search for Lcp[i,i+C-2] whose entries are ≥ L


Slide 59

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with i-th bit of U(T[j]) to estabilish if both are true

An example j=1
1 2 3 4 5 6 7 8 9 10

n
m

1

12345

1

0

P=abaac

2

0

3

0

4

0

5

0

T=xabxabaaca

0
 
0
U (x)  0
 
0
 
0

2

3

4

5

6

7

1 
 
0
BitShift ( M ( 0 )) & U (T [1])   0  &
 
0
 
0

8

9

0 0
   
0 0
0  0
   
0 0
   
0 0

1
0

An example j=2
1 2 3 4 5 6 7 8 9 10

n
m

1

2

12345

1

0

1

P=abaac

2

0

0

3

0

0

4

0

0

5

0

0

T=xabxabaaca

1 
 
0
U (a )   1 
 
1 
 
0

3

4

5

6

7

1 
 
0
BitShift ( M (1)) & U (T [ 2 ])   0  &
 
0
 
0

8

9

1  1 
   
0 0
1    0 
   
1   0 
   
0 0

1
0

An example j=3
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

12345

1

0

1

0

P=abaac

2

0

0

1

3

0

0

0

4

0

0

0

5

0

0

0

T=xabxabaaca

0
 
1 
U (b )   0 
 
0
 
0

4

5

6

7

1 
 
1 
BitShift ( M ( 2 )) & U (T [ 3 ])   0  &
 
0
 
0

8

9

0 0
   
1  1 
0  0
   
0 0
   
0 0

1
0

An example j=9
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

4

5

6

7

8

9

12345

1

0

1

0

0

1

0

1

1

0

P=abaac

2

0

0

1

0

0

1

0

0

0

3

0

0

0

0

0

0

1

0

0

4

0

0

0

0

0

0

0

1

0

5

0

0

0

0

0

0

0

0

1

T=xabxabaaca

0
 
0
U (c )   0 
 
0
 
1 

1 
 
1 
BitShift ( M ( 8 )) & U (T [ 9 ])   0  &
 
0
 
1 

0 0
   
0 0
0  0
   
0 0
   
1  1 

1
0

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M ( j - 1))  U (T [ j ])
l

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M

l -1

( j - 1))

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1
1 2 3 4 5 6 7 8 910

M1=

T=xabxabaaca
P=
abaad

1

2

3

4

5

6

7

8

9

1
0

1

1

1

1

1

1

1

1

1

1

1

2

0

0

1

0

0

1

0

1

1

0

3

0

0

0

1

0

0

1

0

0

1

4

0

0

0

0

1

0

0

1

0

0

5

0

0

0

0

0

0

0

0

1

0

M0=

1

2

3

4

5

6

7

8

9

1
0

1

0

1

0

0

1

0

1

1

0

1

2

0

0

1

0

0

1

0

0

0

0

3

0

0

0

0

0

0

1

0

0

0

4

0

0

0

0

0

0

0

1

0

0

5

0

0

0

0

0

0

0

0

0

0

How much do we pay?





The running time is O(kn(1+m/w)
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Delection: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
0 000 ...........0 x in b in ary
Length-1


x > 0 and Length = log2 x +1
e.g., 9 represented as <000,1001>.



g-code for x takes 2 log2 x +1 bits

(ie. factor of 2 from optimal)



Optimal for Pr(x) = 1/2x2, and i.i.d integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8

6

3

59

7

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How much good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
1 ≥Si=1,...,x pi ≥ x * px  x ≤ 1/px

How good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):



p i  | g ( i ) |

i  1 ,..., S

This is:



i  1 ,.., S

p i  [ 2 * log

1

 1]

pi

 2 * H 0 ( X ) 1
No much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1n 2n 3n… nn  Huff = O(n2 log n), MTF = O(n log n) + n2

No much worse than Huffman
...but it may be far better

MTF: how good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
Put S in the front and consider the cost of encoding:
S

O ( S log S ) 



nx

g(p - p )

x 1 i  2

x

x

i

i -1

S

By Jensen’s:

 O ( S log S ) 

n
x 1

x

[ 2 * log

N

 1]

nx

 O ( S log S )  N * [ 2 * H 0 ( X )  1]
L a [ mtf ]  2 * H 0 ( X )  O (1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1n 2n 3n… nn 

There is a memory

Huff(X) = Q(n2 log n) > Rle(X) = Q(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.
1.0

c = .3

0.7

i -1

f (i ) 



p( j)

j 1

b = .5
0.2
0.0

f(a) = .0, f(b) = .2, f(c) = .7
a = .2

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
1.0

0.7
c = .3

0.7

c = .3
0.55

0.0

a = .2

0.3
0.2

c = .3

0.27
b = .5

b = .5
0.2

0.3

a = .2

b = .5
0.22

a = .2

0.2

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l 0  0 l i  l i -1  s i -1 * f c i 

s0  1

s i  s i -1 * p c i 

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

n

sn 

 p c 
i

i 1

The interval for a message sequence will be called the
sequence interval

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
1.0

0.7
c = .3

0.7
0.49

b = .5

c = .3

0.55
0.49

0.2

0.0

0.55

b = .5

0.3
a = .2

The message is bbc.

0.2

0.49
0.475

c = .3

b = .5
0.35

a = .2

0.3

a = .2

Representing a real number
Binary fractional representation:
. 75

 . 11

1/ 3

 . 01 01

11 / 16

 . 1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
m in

m ax

interval

.11

.110

.111

[.75 ,1.0 )

.101

.1010

.1011

[.625 , .75 )

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
Context
Empty

Counts
A
B
C
$

=
=
=
=

4
2
5
3

Context
A
B
C

Counts
C
$
A
$
A
B
C
$

=
=
=
=
=
=
=
=

3
1
2
1
1
2
2
3

Context
AC

BA
CA
CB
CC

String = ACCBACCACBA B

k=2

Counts
B
C
$
C
$
C
$
A
$
A
B
$

=
=
=
=
=
=
=
=
=
=
=
=

1
2
2
1
1
1
1
2
1
1
1
2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
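A compact decoder for a stream of (d, len, c) triples, as a sketch (no sliding-window management, output buffer assumed large enough); the copy loop works even when len > d, exactly as in the overlap example above:

#include <stdio.h>

struct triple { int d, len; char c; };

/* Decode a sequence of LZ77 triples (d = backward distance, len, next char c). */
static int lz77_decode(const struct triple *in, int n, char *out) {
    int cursor = 0;
    for (int t = 0; t < n; t++) {
        for (int i = 0; i < in[t].len; i++)          /* copy, possibly overlapping itself */
            out[cursor + i] = out[cursor - in[t].d + i];
        cursor += in[t].len;
        out[cursor++] = in[t].c;                     /* append the explicit character */
    }
    out[cursor] = '\0';
    return cursor;
}

int main(void) {
    /* The triples produced for the windowed example of the slides */
    struct triple msg[] = { {0,0,'a'}, {1,1,'c'}, {3,4,'b'}, {3,3,'a'}, {1,2,'c'} };
    char out[64];
    lz77_decode(msg, 5, out);
    printf("%s\n", out);                             /* prints aacaacabcabaaac */
    return 0;
}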

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
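A sketch of the decoder's one-step-behind logic, with the dictionary stored as (prefix id, appended char) pairs; the codes below are what the coder of the slides would emit for aabaacababac, but using real ASCII ids instead of the toy numbering a = 112, and the special case handles a code that is not yet in the dictionary (the SSc situation):

#include <stdio.h>
#include <string.h>

#define MAXD 4096
static int  pref[MAXD];            /* dictionary entry 'code' = string(pref[code]) + last[code] */
static char last[MAXD];

/* Expand dictionary entry 'code' into buf, return its length. */
static int expand(int code, char *buf) {
    if (code < 256) { buf[0] = (char)code; return 1; }
    int len = expand(pref[code], buf);
    buf[len] = last[code];
    return len + 1;
}

static void lzw_decode(const int *in, int n, char *out) {
    int next = 256;                            /* first free dictionary id        */
    int prev = in[0];
    int pos  = expand(prev, out);              /* the first code is always a literal */
    for (int t = 1; t < n; t++) {
        char buf[256];
        int  len, code = in[t];
        if (code < next) {                     /* normal case: code already known  */
            len = expand(code, buf);
        } else {                               /* special SSc case: code == next   */
            len = expand(prev, buf);           /* its string = prev-string ...     */
            buf[len] = buf[0];                 /* ... followed by its own 1st char */
            len++;
        }
        memcpy(out + pos, buf, (size_t)len);
        pos += len;
        pref[next] = prev;                     /* the coder added prev-string + buf[0] */
        last[next] = buf[0];
        next++;
        prev = code;
    }
    out[pos] = '\0';
}

int main(void) {
    int codes[] = { 'a', 'a', 'b', 256, 'c', 257, 261, 'c' };
    char out[64];
    lzw_decode(codes, 8, out);
    printf("%s\n", out);                       /* prints aabaacababac */
    return 0;
}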

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
We are given the text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
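A sketch of InvertBWT in C, assuming L is a byte array whose (unique) sentinel '#' is the smallest character, so that row 0 of the sorted matrix is the one starting with '#'; LF is built by counting, then T is rebuilt backwards as in the pseudocode above:

#include <stdio.h>
#include <string.h>

#define MAXN 4096                      /* sketch limit on the text length */

/* Rebuild T (without the sentinel) from the BWT column L. */
static void invert_bwt(const char *L, int n, char *T) {
    int C[256] = {0}, seen[256] = {0};
    int LF[MAXN];

    for (int i = 0; i < n; i++) C[(unsigned char)L[i]]++;
    /* C[c] = number of characters smaller than c = row of the first c in F */
    for (int c = 255, sum = n; c >= 0; c--) { sum -= C[c]; C[c] = sum; }

    /* LF[i] = row of F holding the same character occurrence as L[i] */
    for (int i = 0; i < n; i++) {
        unsigned char c = (unsigned char)L[i];
        LF[i] = C[c] + seen[c]++;
    }

    int r = 0;                         /* row 0 starts with the sentinel */
    for (int i = n - 2; i >= 0; i--) { /* T[n-1] is the sentinel itself  */
        T[i] = L[r];
        r = LF[r];
    }
    T[n - 1] = '\0';
}

int main(void) {
    const char *L = "ipssm#pissii";    /* BWT of mississippi# (the L column above) */
    char T[64];
    invert_bwt(L, (int)strlen(L), T);
    printf("%s\n", T);                 /* prints mississippi */
    return 0;
}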

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion pages available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node one can reach any other node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node one can reach any other node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by humans



Exploit the structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: the probability that a node has x links is ∝ 1/x^α, α ≈ 2.1

The In-degree distribution
Altavista crawl, 1999 — WebBase crawl, 2001: the indegree follows a power-law distribution
Pr[ in-degree(u) = k ]  ∝  1/k^α,   α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: the probability that a node has x links is ∝ 1/x^α, α ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 million pages, 150 million links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1 − x, s2 − s1 − 1, ..., sk − s_(k−1) − 1}

For negative entries (only the first gap s1 − x can be negative): map v ≥ 0 to 2v and v < 0 to 2|v| − 1, as in the residual example further below; a small sketch follows.
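A small sketch of the gap transformation on one successor list, using the mapping above for the possibly-negative first gap; the successor values are made up for illustration, and the γ/ζ coding of the resulting small integers is assumed to happen elsewhere:

#include <stdio.h>

/* Map a possibly-negative integer to a natural: v >= 0 -> 2v, v < 0 -> 2|v|-1. */
static unsigned nu(int v) { return v >= 0 ? 2u * (unsigned)v : 2u * (unsigned)(-v) - 1; }

/* Turn the successor list of node x into small gaps:
   { nu(s1 - x), s2 - s1 - 1, ..., sk - s_(k-1) - 1 }. */
static void gap_encode(int x, const int *s, int k, unsigned *out) {
    out[0] = nu(s[0] - x);
    for (int i = 1; i < k; i++) out[i] = (unsigned)(s[i] - s[i - 1] - 1);
}

int main(void) {
    int succ[] = {13, 15, 16, 17, 18, 19, 23, 24, 203};  /* hypothetical successors of node 15 */
    unsigned g[9];
    gap_encode(15, succ, 9, g);
    for (int i = 0; i < 9; i++) printf("%u ", g[i]);      /* 3 1 0 0 0 0 3 0 178 */
    printf("\n");
    return 0;
}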

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y's copy-list tells whether the corresponding successor of the reference x is also a successor of y;
the reference index is chosen in [0,W] so as to give the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides an efficient, optimal solution

fknown is the “previously encoded text”: compress the concatenation fknown·fnew, starting the parsing at fnew

zdelta is one of the best implementations
           Emacs size   Emacs time
uncompr    27Mb         ---
gzip       8Mb          35 secs
zdelta     1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: example graph GF with a dummy node 0; edge weights are zdelta sizes (e.g. 620, 2000, 220, 123, 20, ...), and the min branching picks the cheapest reference for each file.]

           space    time
uncompr    30Mb     ---
tgz        20%      linear
THIS       8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strictly n² time

           space    time
uncompr    260Mb    ---
tgz        12%      2 mins
THIS       8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
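A sketch of a weak rolling checksum in the spirit of rsync's 4-byte hash (constants, block matching and the 2-byte strong check are omitted): a is the sum of the bytes in the window, b the sum of their prefix sums, and both slide by one position in O(1):

#include <stdio.h>
#include <string.h>
#include <stdint.h>

struct rollsum { uint32_t a, b; };

/* Checksum of the block X[k..l]. */
static struct rollsum rs_init(const unsigned char *X, int k, int l) {
    struct rollsum r = {0, 0};
    for (int i = k; i <= l; i++) { r.a += X[i]; r.b += (uint32_t)(l - i + 1) * X[i]; }
    return r;
}

/* Slide the window from [k..l] to [k+1..l+1] in constant time. */
static struct rollsum rs_roll(struct rollsum r, const unsigned char *X, int k, int l) {
    r.a = r.a - X[k] + X[l + 1];
    r.b = r.b - (uint32_t)(l - k + 1) * X[k] + r.a;
    return r;
}

int main(void) {
    const unsigned char *X = (const unsigned char *)"the quick brown fox";
    int B = 8;                                        /* block size */
    struct rollsum r = rs_init(X, 0, B - 1);
    for (int k = 0; k + B < (int)strlen((const char *)X); k++) {
        printf("hash(%d) = %08x\n", k, (r.b << 16) | (r.a & 0xffff));
        r = rs_roll(r, X, k, k + B - 1);              /* reuse the previous sums */
    }
    return 0;
}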

Rsync: some experiments

           gcc size   emacs size
total      27288      27326
gzip       7563       8577
zdelta     227        1431
rsync      964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
#

0

s

i

1
ssi

ppi#

4

#

1

p

i

3

2

1

ppi#

6

pi#

7

8
5

T# = mississippi#
2 4 6 8 10

si

ppi#

i#

ppi#

11

mississippi#

12

2

1 10

9

4

3

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Θ(N²) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Θ(N log₂ N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirect binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
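A sketch of the O(p log₂ N + occ) search in C: two binary searches over SA delimit the range of suffixes having P as a prefix (strncmp gives the O(p) comparison per step); the suffix array below is the 0-based version of the one in the slides:

#include <stdio.h>
#include <string.h>

/* Return in [*lo, *hi) the range of rows of SA whose suffix has prefix P. */
static void sa_search(const char *T, const int *SA, int n,
                      const char *P, int p, int *lo, int *hi) {
    int l = 0, r = n;
    while (l < r) {                       /* first suffix >= P */
        int m = (l + r) / 2;
        if (strncmp(T + SA[m], P, p) < 0) l = m + 1; else r = m;
    }
    *lo = l;
    r = n;
    while (l < r) {                       /* first suffix whose p-prefix is > P */
        int m = (l + r) / 2;
        if (strncmp(T + SA[m], P, p) <= 0) l = m + 1; else r = m;
    }
    *hi = l;
}

int main(void) {
    const char *T = "mississippi#";
    int SA[] = {11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2};   /* 0-based suffix array of T */
    int lo, hi;
    sa_search(T, SA, 12, "si", 2, &lo, &hi);
    printf("occ = %d at positions:", hi - lo);
    for (int i = lo; i < hi; i++) printf(" %d", SA[i]);  /* 6 and 3 (0-based) */
    printf("\n");
    return 0;
}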

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Is there a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Is there a substring of length ≥ L occurring ≥ C times ?
• Search for a window Lcp[i, i+C−2] whose entries are all ≥ L


Slide 60

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 10^4 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and the algorithm uses all of it:
(1/B) * (p * f/(1+f) * C)  ≈  30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
        4K    8K    16K   32K    128K   256K   512K   1M
n³      22s   3m    26m   3.5h   28h    --     --     --
n²      0     0     0     1s     26s    106s   7m     28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm
  sum = 0; max = -1;
  for i = 1,...,n do
    if (sum + A[i] ≤ 0) sum = 0;
    else { sum += A[i]; max = MAX{max, sum}; }

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT
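The same scan as a runnable C sketch (a Kadane-style linear algorithm), also tracking the endpoints of the best window since the original question asks for the time window:

#include <stdio.h>

int main(void) {
    int A[] = {2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7};
    int n = 11;
    int sum = 0, max = -1, start = 0, best_l = 0, best_r = 0;

    for (int i = 0; i < n; i++) {
        if (sum + A[i] <= 0) {            /* the optimum never starts inside a <0 prefix */
            sum = 0; start = i + 1;
        } else {
            sum += A[i];
            if (sum > max) { max = sum; best_l = start; best_r = i; }
        }
    }
    printf("max sum = %d on A[%d..%d]\n", max, best_l, best_r);   /* 12 on A[2..6] */
    return 0;
}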

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 10^9 random I/Os = 10^9 * 5ms ≈ 2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01  if (i < j) then
02    m = (i+j)/2;               // Divide
03    Merge-Sort(A,i,m);         // Conquer
04    Merge-Sort(A,m+1,j);
05    Merge(A,i,m,j)             // Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:
  n = 10^9 tuples  ⇒  few Gbs
  Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms
Analysis of mergesort on disk:
  It is an indirect sort: Θ(n log₂ n) random I/Os
  [5ms] * n log₂ n  ≈  1.5 years
In practice, it is faster because of caching... (2 passes, R/W)

Merge-Sort Recursion Tree
[Figure: the recursion tree of mergesort on a toy sequence, with log₂ N merge levels.]
If the run-size is larger than B (i.e. after the first step!!), fetching all of it in memory for merging does not help.
How do we deploy the disk/memory features?
N/M runs, each sorted in internal memory (no I/Os)
— I/O-cost for merging is ≈ 2 (N/B) log₂ (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X ≤ M/B runs at a time  ⇒  log_(M/B) (N/M) passes
[Figure: X input buffers and one output buffer, of B items each, kept in main memory; runs are streamed from disk and the merged run is written back to disk.]

Multiway Merging
Keep one buffer Bf_i (with a pointer p_i to its current item) per run, and one output buffer Bf_o.
Repeatedly move min(Bf1[p1], Bf2[p2], ..., BfX[pX]) to Bf_o; fetch the next page of run i when p_i reaches B, and flush Bf_o to the merged output run when it is full, until EOF on all X = M/B runs.
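A minimal sketch of the merging step, assuming the X runs are already sorted arrays in memory; a real implementation streams pages of B items from disk and keeps the run heads in a heap, while here a linear scan of the X heads keeps the code short:

#include <stdio.h>

#define X 3                                  /* number of runs to merge (X <= M/B) */

int main(void) {
    int run0[] = {1, 2, 5, 10}, run1[] = {3, 4, 8, 15}, run2[] = {6, 11, 12, 17};
    int *run[X] = {run0, run1, run2};
    int len[X]  = {4, 4, 4};
    int pos[X]  = {0, 0, 0};                 /* p_i: current head of run i */
    int left    = len[0] + len[1] + len[2];

    while (left-- > 0) {
        int best = -1;
        for (int i = 0; i < X; i++)          /* min over the current heads */
            if (pos[i] < len[i] && (best < 0 || run[i][pos[i]] < run[best][pos[best]]))
                best = i;
        printf("%d ", run[best][pos[best]++]);   /* would go to the output buffer Bf_o */
    }
    printf("\n");                            /* 1 2 3 4 5 6 8 10 11 12 15 17 */
    return 0;
}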

Cost of Multi-way Merge-Sort


Number of passes = log_(M/B) #runs  ≈  log_(M/B) (N/M)
Optimal cost = Θ((N/B) log_(M/B) (N/M)) I/Os
In practice:
  M/B ≈ 1000  ⇒  #passes = log_(M/B) (N/M) ≈ 1
  One multiway merge  ⇒  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Can compression help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm
  Use a pair of variables X (candidate) and C (counter)
  For each item s of the stream:
    if (X == s) then C++
    else { C--; if (C == 0) { X = s; C = 1; } }
  Return X;
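The same one-pass algorithm as a runnable sketch on the slide's stream (this formulation adopts a new candidate when the counter hits zero, which is equivalent to the pseudocode above):

#include <stdio.h>
#include <string.h>

int main(void) {
    const char *A = "bacccdcbaaaccbccc";       /* the stream of the slide */
    char X = 0;
    int  C = 0;

    for (size_t i = 0; i < strlen(A); i++) {
        if (C == 0)         { X = A[i]; C = 1; }   /* adopt a new candidate */
        else if (X == A[i]) C++;
        else                C--;
    }
    printf("candidate = %c\n", X);             /* c; must be verified if the mode may be <= N/2 */
    return 0;
}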

Proof (sketch)
If the returned X ≠ y, then every occurrence of y has been cancelled by a distinct “negative” mate, so the mates are ≥ #occ(y) and 2 * #occ(y) ≤ N: this contradicts #occ(y) > N/2.
(Problems arise if the most frequent item occurs ≤ N/2 times: then the returned X may be wrong.)

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 10^9 chars, size = 6Gb
n = 10^6 documents
TotT = 10^9 word occurrences (avg term length is 6 chars)
t = 5 * 10^5 distinct terms

What kind of data structure should we build to support word-based searches ?

Solution 1: Term-Doc matrix      (t = 500K terms,  n = 1 million documents)

            Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony             1                1             0          0       0        1
Brutus             1                1             0          1       0        0
Caesar             1                1             0          1       1        1
Calpurnia          0                1             0          0       0        0
Cleopatra          1                0             0          0       0        0
mercy              1                0             1          1       1        1
worser             1                0             1          1       1        0

Entry = 1 if the play contains the word, 0 otherwise.      Space is 500Gb !

Solution 2: Inverted index

Brutus     →  2 4 8 16 32 64 128
Calpurnia  →  1 2 3 5 8 13 21 34
Caesar     →  13 16

We can still do better: i.e. 30-50% of the original text
1. Typically use about 12 bytes per posting
2. We have 10^9 total term occurrences  ⇒  at least 12Gb space
3. Compressing the 6Gb documents gets 1.5Gb data
Better index, but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string is (lossless) compressed, some other must expand.
Take all messages of length n. Is it possible to compress ALL OF THEM in fewer bits?
NO, they are 2^n but we have fewer compressed messages:
Σ_(i=1..n−1) 2^i = 2^n − 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the self-information of s is:
i(s) = log₂ (1/p(s)) = −log₂ p(s)
Lower probability  ⇒  higher information
Entropy is the weighted average of i(s):
H(S) = Σ_(s∈S) p(s) · log₂ (1/p(s))   bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword lengths L[s], the average length is defined as
La(C) = Σ_(s∈S) p(s) · L[s]
We say that a prefix code C is optimal if for all prefix codes C', La(C) ≤ La(C')

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  ⇒  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any uniquely decodable code C, we have
H(S) ≤ La(C)
Theorem (upper bound, Shannon). For any probability distribution, there exists a prefix code C such that
La(C) ≤ H(S) + 1
(Shannon code: symbol s takes ⌈log₂ 1/p(s)⌉ bits)

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5
Merge a(.1) and b(.2) into (.3); merge (.3) and c(.2) into (.5); merge (.5) and d(.5) into (1).
Resulting codewords: a = 000, b = 001, c = 01, d = 1
There are 2^(n−1) “equivalent” Huffman trees.
What about ties (and thus, tree depth) ?
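A small sketch of the construction on the running example, using a naive pick-the-two-minima loop instead of a priority queue; the codeword lengths (hence the average length) come out as above, while the actual bit patterns depend on how ties and left/right branches are resolved:

#include <stdio.h>

#define NSYM 4

int main(void) {
    double p[2 * NSYM] = {0.1, 0.2, 0.2, 0.5};     /* a, b, c, d */
    int    parent[2 * NSYM] = {0};
    int    alive[2 * NSYM]  = {1, 1, 1, 1};
    int    m = NSYM;                               /* next free internal node */

    for (int step = 0; step < NSYM - 1; step++) {  /* repeatedly merge the two rarest */
        int lo1 = -1, lo2 = -1;
        for (int i = 0; i < m; i++) {
            if (!alive[i]) continue;
            if (lo1 < 0 || p[i] < p[lo1]) { lo2 = lo1; lo1 = i; }
            else if (lo2 < 0 || p[i] < p[lo2]) lo2 = i;
        }
        p[m] = p[lo1] + p[lo2];
        parent[lo1] = parent[lo2] = m;
        alive[lo1] = alive[lo2] = 0;
        alive[m++] = 1;
    }

    const char *name = "abcd";
    for (int s = 0; s < NSYM; s++) {               /* codeword length = depth in the tree */
        int len = 0;
        for (int v = s; v != m - 1; v = parent[v]) len++;
        printf("L[%c] = %d\n", name[s], len);      /* a:3  b:3  c:2  d:1 */
    }
    return 0;
}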

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
Example: abc...  →  000·001·01 = 00000101          101001...  →  1·01·001  =  d c b

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. Its self-information is
−log₂(0.999) ≈ 0.00144 bits
If we were to send 1000 such symbols we might hope to use 1000 * 0.00144 ≈ 1.44 bits.
Using Huffman, we take at least 1 bit per symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra bits per symbol
 Larger model to be transmitted
Shannon took infinite sequences, and k → ∞ !!
In practice, we have:
 The model takes |S|^k · (k · log |S|) + h² bits (where h might be |S|)
 It is H0(S_L) ≤ L · Hk(S) + O(k · log |S|), for each k ≤ L

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m
H(s) = Σ_(i=1..m) 2^(m−i) · s[i]
P = 0101  ⇒  H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5
s = s' if and only if H(s) = H(s')

Definition: let Tr denote the m-length substring of T starting at position r (i.e., Tr = T[r, r+m−1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(T_(r−1)):
H(Tr) = 2 · H(T_(r−1)) − 2^m · T[r−1] + T[r+m−1]

T = 10110101,  T1 = 1011,  T2 = 0110
H(T1) = H(1011) = 11
H(T2) = H(0110) = 2·11 − 2⁴·1 + 0 = 22 − 16 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P = 101111,  q = 7
H(P) = 47,  Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally (Horner):
1·2 (mod 7) + 0 = 2
2·2 (mod 7) + 1 = 5
5·2 (mod 7) + 1 = 4
4·2 (mod 7) + 1 = 2
2·2 (mod 7) + 1 = 5
5 (mod 7) = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(T_(r−1)):
2^m (mod q) = 2 · (2^(m−1) (mod q)) (mod q)
Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
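A sketch of the fingerprint scan in C over the binary example of the slides; here q is a small fixed prime rather than a randomly drawn one, and every fingerprint hit is verified, i.e. the “definite match” flavour:

#include <stdio.h>
#include <string.h>

int main(void) {
    const char *T = "10110101", *P = "0101";
    int n = (int)strlen(T), m = (int)strlen(P);
    long q = 7;                                    /* would be a random prime <= I */

    long hP = 0, hT = 0, pow = 1;                  /* pow = 2^(m-1) mod q */
    for (int i = 0; i < m; i++) {
        hP = (2 * hP + (P[i] - '0')) % q;          /* Horner, reducing mod q every step */
        hT = (2 * hT + (T[i] - '0')) % q;
        if (i) pow = (2 * pow) % q;
    }

    for (int r = 0; r + m <= n; r++) {
        if (hP == hT && strncmp(T + r, P, m) == 0) /* verify to rule out false matches */
            printf("occurrence at position %d\n", r + 1);
        if (r + m < n)                             /* drop T[r], append T[r+m] */
            hT = ((hT - (T[r] - '0') * pow % q + q) * 2 + (T[r + m] - '0')) % q;
    }
    return 0;
}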

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

[Matrix M for T = california, P = for (m = 3 rows, n = 10 columns): the only 1-entries are M(1,5), M(2,6), M(3,7), i.e. P occurs in T ending at position 7.]

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit bit-parallelism to compute the j-th column of M from the (j−1)-th one.
We define the m-length binary vector U(x) for each character x in the alphabet: U(x) is set to 1 at the positions in P where character x appears.
Example: P = abaac
U(a) = (1,0,1,1,0)ᵀ   U(b) = (0,1,0,0,0)ᵀ   U(c) = (0,0,0,0,1)ᵀ

How to construct M



Initialize column 0 of M to all zeros.
For j > 0, the j-th column is obtained as
M(j) = BitShift( M(j−1) ) & U(T[j])
For i > 1, entry M(i,j) = 1 iff
(1) the first i−1 characters of P match the i−1 characters of T ending at character j−1  ⇔  M(i−1, j−1) = 1
(2) P[i] = T[j]  ⇔  the i-th bit of U(T[j]) = 1
BitShift moves bit M(i−1, j−1) into the i-th position; AND-ing it with the i-th bit of U(T[j]) establishes whether both conditions hold.

[Worked example: T = xabxabaaca, P = abaac, with U(x) = (0,0,0,0,0)ᵀ. The columns M(1), M(2), M(3), ..., M(9) are computed as M(j) = BitShift(M(j−1)) & U(T[j]); at j = 9 the bottom bit M(5,9) = 1, i.e. P occurs in T ending at position 9.]

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.
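A sketch with 64-bit words (so it assumes m ≤ w = 64), following the update M(j) = BitShift(M(j−1)) & U(T[j]); bit i−1 of the machine word plays the role of row i of the column:

#include <stdio.h>
#include <string.h>
#include <stdint.h>

/* Report all occurrences of P (|P| <= 64) in T with the Shift-And method. */
static void shift_and(const char *T, const char *P) {
    int n = (int)strlen(T), m = (int)strlen(P);
    uint64_t U[256] = {0}, M = 0, last = 1ULL << (m - 1);

    for (int i = 0; i < m; i++)                          /* U[c]: positions of c in P */
        U[(unsigned char)P[i]] |= 1ULL << i;

    for (int j = 0; j < n; j++) {
        M = ((M << 1) | 1ULL) & U[(unsigned char)T[j]];  /* BitShift + AND */
        if (M & last)                                    /* row m set: full match */
            printf("occurrence ending at position %d\n", j + 1);
    }
}

int main(void) {
    shift_and("xabxabaaca", "abaac");                    /* the slides' example: ends at 9 */
    return 0;
}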

Some simple extensions


We want to allow the pattern to contain special symbols, like [a-f] classes of chars.
P = [a-b]baac  ⇒  U(a) = (1,0,1,1,0)ᵀ,  U(b) = (1,1,0,0,0)ᵀ,  U(c) = (0,0,0,0,1)ᵀ
What about ‘?’, ‘[^…]’ (not)?

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of length m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) AND R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal: given k, find all the occurrences of P in T with up to k mismatches.
We define the matrix Ml to be an m by n binary matrix, such that:
Ml(i,j) = 1 iff there are no more than l mismatches between the first i characters of P and the i characters of T ending at character j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1
The first i−1 characters of P match a substring of T ending at j−1 with at most l mismatches, and the next pair of characters in P and T are equal:
  BitShift( M^l(j−1) ) & U(T[j])

Computing Ml: case 2
The first i−1 characters of P match a substring of T ending at j−1 with at most l−1 mismatches (the j-th character is charged as the extra mismatch):
  BitShift( M^(l−1)(j−1) )

Computing Ml
Combining the two cases, for l = 0,…,k and each j:
  M^l(j) = [ BitShift( M^l(j−1) ) & U(T[j]) ]  OR  BitShift( M^(l−1)(j−1) )

Example M1
T = xabxabaaca,  P = abaad,  k = 1
[Figure: the two 5×10 matrices M⁰ and M¹; M¹(5,9) = 1, i.e. P occurs ending at position 9 of T with at most one mismatch.]

How much do we pay?





The running time is O(kn(1 + m/w)).
Again, the method is practically efficient for small m.
Still, only O(k) columns of M are needed at any given time. Hence, the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding
γ(x) = 0^(Length−1) · (x in binary),  where x > 0 and Length = ⌊log₂ x⌋ + 1
e.g., 9 is represented as <000, 1001>.
The γ-code for x takes 2⌊log₂ x⌋ + 1 bits  (i.e. a factor of 2 from optimal)
Optimal for Pr(x) = 1/(2x²), and i.i.d. integers
It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8

6

3

59

7
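A sketch of γ encoding/decoding over strings of '0'/'1' characters (a real coder would pack bits); running the decoder on the exercise above reproduces 8 6 3 59 7:

#include <stdio.h>
#include <string.h>

/* gamma(x): (len-1) zeros followed by x in binary, len = floor(log2 x) + 1. */
static void gamma_encode(unsigned x, char *out) {
    int len = 0;
    for (unsigned t = x; t; t >>= 1) len++;
    for (int i = 0; i < len - 1; i++) *out++ = '0';
    for (int i = len - 1; i >= 0; i--) *out++ = (char)('0' + (x >> i & 1));
    *out = '\0';
}

static const char *gamma_decode(const char *in, unsigned *x) {
    int zeros = 0;
    while (*in == '0') { zeros++; in++; }          /* unary part: Length - 1 */
    *x = 0;
    for (int i = 0; i <= zeros; i++)               /* read Length binary digits */
        *x = (*x << 1) | (unsigned)(*in++ - '0');
    return in;                                     /* position after the codeword */
}

int main(void) {
    char buf[64];
    gamma_encode(9, buf);
    printf("gamma(9) = %s\n", buf);                /* 0001001 */

    const char *s = "0001000001100110000011101100111";
    unsigned x;
    while (*s) { s = gamma_decode(s, &x); printf("%u ", x); }   /* 8 6 3 59 7 */
    printf("\n");
    return 0;
}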

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).
Recall that: |γ(i)| ≤ 2·log i + 1
How good is this approach wrt Huffman? Compression ratio ≤ 2·H0(s) + 1
Key fact:  1 ≥ Σ_(j=1..i) pj ≥ i·pi  ⇒  i ≤ 1/pi

How good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1
The cost of the encoding is (recall i ≤ 1/pi):
Σ_(i=1..|S|) pi·|γ(i)|  ≤  Σ_(i=1..|S|) pi·[2·log(1/pi) + 1]  =  2·H0(X) + 1
Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1ⁿ 2ⁿ 3ⁿ … nⁿ  ⇒  Huff = O(n² log n),  MTF = O(n log n) + n²

No much worse than Huffman
...but it may be far better
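A sketch of MTF over single bytes, keeping the list L as a plain array (the search-tree/hash-table machinery discussed below only matters for large, word-based alphabets):

#include <stdio.h>
#include <string.h>

int main(void) {
    unsigned char L[256];
    for (int i = 0; i < 256; i++) L[i] = (unsigned char)i;   /* L = [0,1,2,...,255] */

    const char *S = "abbbaacccca";
    for (size_t k = 0; k < strlen(S); k++) {
        unsigned char c = (unsigned char)S[k];
        int pos = 0;
        while (L[pos] != c) pos++;                 /* 1) output the position of c in L */
        printf("%d ", pos);
        memmove(L + 1, L, (size_t)pos);            /* 2) move c to the front of L      */
        L[0] = c;
    }
    printf("\n");     /* prints 97 98 0 0 1 0 99 0 0 0 1: repeats become runs of small numbers */
    return 0;
}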

MTF: how good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1
Put S at the front and consider the cost of encoding:
O(|S| log |S|)  +  Σ_(x=1..|S|) Σ_(i=2..n_x) γ( p_i^x − p_(i−1)^x )
By Jensen's inequality:
  ≤  O(|S| log |S|)  +  Σ_(x=1..|S|) n_x · [ 2·log(N/n_x) + 1 ]
  =  O(|S| log |S|)  +  N · [ 2·H0(X) + 1 ]
⇒  La[mtf]  ≤  2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code (there is a memory)
X = 1ⁿ 2ⁿ 3ⁿ … nⁿ  ⇒  Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive), of length equal to its probability:
f(i) = Σ_(j=1..i−1) p(j)
e.g. p(a) = .2, p(b) = .5, p(c) = .3  ⇒  f(a) = .0, f(b) = .2, f(c) = .7
i.e. a → [0,.2),  b → [.2,.7),  c → [.7,1.0)
The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
Start from [0,1).  After b: [.2,.7).  After a: [.2,.3).  After c: [.27,.3).
The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities p[c] use the following:
l₀ = 0,   lᵢ = lᵢ₋₁ + sᵢ₋₁ · f(cᵢ)
s₀ = 1,   sᵢ = sᵢ₋₁ · p(cᵢ)
f[c] is the cumulative probability up to symbol c (not included).
The final interval size is  sₙ = Π_(i=1..n) p(cᵢ)
The interval for a message sequence will be called the sequence interval
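A sketch of the interval computation with doubles, directly from the recurrences above; real coders use the integer renormalization shown later to avoid losing precision:

#include <stdio.h>
#include <string.h>

int main(void) {
    /* Alphabet a, b, c with p = .2, .5, .3 and cumulative f = .0, .2, .7 */
    const char *sym = "abc";
    double p[] = {0.2, 0.5, 0.3}, f[] = {0.0, 0.2, 0.7};

    const char *msg = "bac";
    double l = 0.0, s = 1.0;                       /* l0 = 0, s0 = 1 */

    for (size_t i = 0; i < strlen(msg); i++) {
        int c = (int)(strchr(sym, msg[i]) - sym);
        l = l + s * f[c];                          /* l_i = l_(i-1) + s_(i-1) * f(c_i) */
        s = s * p[c];                              /* s_i = s_(i-1) * p(c_i)           */
        printf("after '%c': [%.4f, %.4f)\n", msg[i], l, l + s);
    }
    /* final sequence interval: [0.2700, 0.3000), as in the example above */
    return 0;
}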

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:
.49 ∈ [.2,.7) → b;   .49 ∈ [.3,.55) → b;   .49 ∈ [.475,.55) → c.
The message is bbc.

Representing a real number
Binary fractional representation:   .75 = .11,   1/3 = .010101…,   11/16 = .1011
Algorithm:  1. x = 2·x   2. if x < 1 output 0   3. else x = x − 1; output 1
So how about just using the shortest binary fractional representation in the sequence interval?
e.g.  [0,.33) → .01,   [.33,.66) → .1,   [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions:
.11   →  min .110,  max .111   →  [.75, 1.0)
.101  →  min .1010, max .1011  →  [.625, .75)
We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

[Figure: a sequence interval [.61, .79) containing the code interval of .101 = [.625, .75).]
Can use L + s/2 truncated to 1 + ⌈log₂(1/s)⌉ bits
Bound on Arithmetic length: note that −log₂ s + 1 = log₂(2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most
1 + ⌈log₂(1/s)⌉ = 1 + ⌈log₂ Πᵢ (1/pᵢ)⌉
  ≤ 2 + Σ_(i=1..n) log₂(1/pᵢ)
  = 2 + Σ_(k=1..|S|) n·p_k·log₂(1/p_k)
  = 2 + n·H0   bits
In practice nH0 + 0.02·n bits, because of rounding.

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l ≥ R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
Context
Empty

Counts
A
B
C
$

=
=
=
=

4
2
5
3

Context
A
B
C

Counts
C
$
A
$
A
B
C
$

=
=
=
=
=
=
=
=

3
1
2
1
1
2
2
3

Context
AC

BA
CA
CB
CC

String = ACCBACCACBA B

k=2

Counts
B
C
$
C
$
C
$
A
$
A
B
$

=
=
=
=
=
=
=
=
=
=
=
=

1
2
2
1
1
1
1
2
1
1
1
2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) as n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
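A sketch of the "one step behind" decoder over a toy alphabet a,b,c mapped to ids 0,1,2 (standing in for the 256 ASCII entries): whenever a received id is not yet in the dictionary we are in the special case, and the missing entry must be prev + prev[0]. The code sequence below mirrors the following example (112/113/114/256/... become 0/1/2/3/...), plus one final code for the trailing b that the slide's animation stops short of.

# Sketch of LZW decoding over a toy 3-letter alphabet (real LZW starts from 256 ASCII codes).
def lzw_decode(codes, alphabet="abc"):
    dictionary = {i: ch for i, ch in enumerate(alphabet)}
    prev = dictionary[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        else:                                  # the special case: the encoder created this id just now
            entry = prev + prev[0]
        out.append(entry)
        dictionary[len(dictionary)] = prev + entry[0]   # decoder is one step behind
        prev = entry
    return "".join(out)

print(lzw_decode([0, 0, 1, 3, 2, 4, 8, 2, 1]))   # -> aabaacababacb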

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us be given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows        (1994)

F                  L
#  mississipp   i
i  #mississip   p
i  ppi#missis   s
i  ssippi#mis   s
i  ssissippi#   m
m  ississippi   #
p  i#mississi   p
p  pi#mississ   i
s  ippi#missi   s
s  issippi#mi   s
s  sippi#miss   i
s  sissippi#m   i

A famous example

Much
longer...

A useful tool: L → F mapping

[Figure: the sorted BWT matrix again, with only the F and L columns shown; the rows’ interior is marked “unknown”.]

How do we map L’s chars onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal chars of L
Rotate their rows rightward
 Same relative order !!

The BWT is invertible

[Figure: the sorted BWT matrix, with only the F and L columns known (“unknown” interior).]

Two key properties:
1. LF-array maps L’s chars to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
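The same inversion as a hedged Python sketch: a stable sort of the positions of L builds the LF-array, and T is then rebuilt backward starting from the row that begins with the end marker (the sentinel is handled explicitly here).

# Sketch of BWT inversion via the LF-mapping (a stable sort builds LF).
def invert_bwt(L, end_marker="#"):
    n = len(L)
    order = sorted(range(n), key=lambda i: L[i])   # stable: equal chars keep relative order
    LF = [0] * n
    for f_pos, l_pos in enumerate(order):
        LF[l_pos] = f_pos                          # L[l_pos] is the same occurrence as F[f_pos]
    out = [end_marker]                             # T surely ends with the end marker
    r = LF[L.index(end_marker)]                    # row starting with the end marker
    for _ in range(n - 1):                         # walk T backward: L[r] precedes F[r] in T
        out.append(L[r])
        r = LF[r]
    return "".join(reversed(out))

print(invert_bwt("ipssm#pissii"))                  # -> mississippi#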

How to compute the BWT ?

 SA     BWT matrix        L
 12     #mississipp       i
 11     i#mississip       p
  8     ippi#missis       s
  5     issippi#mis       s
  2     ississippi#       m
  1     mississippi       #
 10     pi#mississi       p
  9     ppi#mississ       i
  7     sippi#missi       s
  4     sissippi#mi       s
  6     ssippi#miss       i
  3     ssissippi#m       i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]
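In code, this relation yields the BWT directly once SA is available; the sketch below builds SA the naive way (the “elegant but inefficient” route of the next slide) and is meant only as an illustration.

# Naive suffix-array construction (sort all suffixes) and BWT via L[i] = T[SA[i]-1].
def bwt_via_sa(T):
    n = len(T)
    SA = sorted(range(n), key=lambda i: T[i:])       # suffix starting positions, sorted
    L = "".join(T[(i - 1) % n] for i in SA)          # char preceding each sorted suffix
    return SA, L

SA, L = bwt_via_sa("mississippi#")
print([i + 1 for i in SA])   # 1-based, as in the slides: [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(L)                     # ipssm#pissii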

How to construct SA from T ?          Input: T = mississippi#

 SA
 12   #
 11   i#
  8   ippi#
  5   issippi#
  2   ississippi#
  1   mississippi#
 10   pi#
  9   ppi#
  7   sippi#
  4   sissippi#
  6   ssippi#
  3   ssissippi#

Elegant but inefficient. Obvious inefficiencies:
• Θ(n² log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !
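As an illustration of the first stage, here is a sketch of Move-to-Front coding applied to L. The course's variant keeps # out of the Mtf-list and transmits its position separately, so the exact digit stream in the example below may differ slightly from this sketch's output.

# Sketch of Move-to-Front coding, the first stage bzip applies to L.
def mtf_encode(s, mtf_list):
    front = list(mtf_list)
    out = []
    for c in s:
        i = front.index(c)          # position of c in the current list
        out.append(i)
        front.pop(i)                # move c to the front
        front.insert(0, c)
    return out

L = "ipppssssssmmmii#pppiiissssssiiiiii"
print(mtf_encode(L, sorted(set(L))))   # mostly 0s and small values, by local homogeneity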

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet of |S|+1 symbols

Bzip2-output = Arithmetic/Huffman on the |S|+1 symbols...
... plus γ(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size
 1 trillion pages available (Google 7/08)
 5-40K per page => hundreds of terabytes
 Size grows every day!!

Change
 8% new pages, 25% new links change weekly
 Life time of about 10 days

The Bow Tie

Some definitions

Weakly connected components (WCC)
 Set of nodes such that from any node you can reach any other node via an undirected path.

Strongly connected components (SCC)
 Set of nodes such that from any node you can reach any other node via a directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by humankind



Exploit the structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…

Physical network graph
 V = Routers,  E = communication links

The “cosine” graph (undirected, weighted)
 V = static web pages,  E = semantic distance between pages

Query-Log graph (bipartite, weighted)
 V = queries and URLs,  E = (q,u) if u is a result for q and has been clicked by some user who issued q

Social graph (undirected, unweighted)
 V = users,  E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has a hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Prob. that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution
(Altavista crawl 1999, WebBase crawl 2001)

Indegree follows a power-law distribution:

   Pr[ in-degree(u) = k ]  ∝  1 / k^α ,    α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has a hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Prob. that a node has x links is 1/x^α, α ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph

[Figure: adjacency-matrix plot (axes i, j) of a crawl of 21 million pages and 150 million links, under URL-sorting; labels shown: Berkeley, Stanford.]

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = { s1 − x,  s2 − s1 − 1,  ...,  sk − s(k−1) − 1 }

For negative entries:
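A sketch of the gap encoding on a hypothetical successor list (the numbers echo the interval example later in this section). The note on negative entries refers to the signed-to-unsigned mapping visible in that same example: v → 2v for v ≥ 0 and v → 2|v| − 1 for v < 0.

# Sketch of WebGraph-style gap encoding of a sorted successor list S(x).
def gaps(x, successors):
    successors = sorted(successors)
    out = [successors[0] - x]                          # only this entry may be negative
    out += [b - a - 1 for a, b in zip(successors, successors[1:])]
    return out

def to_unsigned(v):                                    # mapping used for negative entries
    return 2 * v if v >= 0 else 2 * abs(v) - 1

print([to_unsigned(g) if i == 0 else g
       for i, g in enumerate(gaps(15, [13, 15, 16, 17, 18, 19, 23, 24, 203]))])
# locality keeps most gaps tiny, hence cheap to code with a universal code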

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y’s copy-list tells whether the corresponding successor of the reference x is also a
successor of y;
The reference index is chosen in [0,W] as the one that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
[Figure: a sender transmits data over a network link to a receiver; the receiver already holds some knowledge about the data.]

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



What if the sender has never seen the data available at the receiver ?

(overhead)

Types of Techniques

Common knowledge between sender & receiver
 Unstructured file: delta compression

“Partial” knowledge
 Unstructured files: file synchronization
 Record-based data: set reconciliation

Formalization

Delta compression   [diff, zdelta, REBL,…]
 Compress file f deploying file f’
 Compress a group of files
 Speed-up web access by sending differences between the requested page and the ones available in cache

File synchronization   [rsync, zsync]
 Client updates old file f_old with f_new available on a server
 Mirroring, Shared Crawling, Content Distribution Networks

Set reconciliation
 Client updates structured old file f_old with f_new available on a server
 Update of contacts or appointments, intersect inverted lists in a P2P search engine

Z-delta compression    (one-to-one)

Problem: We have two files f_known and f_new and the goal is
to compute a file f_d of minimum size such that f_new can
be derived from f_known and f_d

 Assume that block moves and copies are allowed
 Find an optimal covering set of f_new based on f_known
 The LZ77-scheme provides an efficient, optimal solution:
  f_known is the “previously encoded text”; compress the concatenation f_known f_new starting from f_new

zdelta is one of the best implementations
           Emacs size   Emacs time
uncompr    27Mb         ---
gzip       8Mb          35 secs
zdelta     1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: a pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

[Figure: Client ↔ client-side proxy (holds the reference) — slow link carrying delta-encoded pages — server-side proxy (holds the reference) ↔ fast link ↔ web; requests flow one way, pages flow back.]

Use zdelta to reduce traffic:
 Old version available at both proxies
 Restricted to pages already visited (30% hits), URL-prefix match
 Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: example weighted graph over the files, plus the dummy node; edge weights are zdelta/gzip sizes (values such as 20, 123, 220, 620, 2000).]

           space   time
uncompr    30Mb    ---
tgz        20%     linear
THIS       8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly: n² edge calculations (zdelta executions)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’_F, thus saving
zdelta executions. Nonetheless, still strictly n² time
           space    time
uncompr    260Mb    ---
tgz        12%      2 mins
THIS       8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem

[Figure: the Client holds f_old, the Server holds f_new; the client sends a request and gets back an update.]

 client wants to update an out-dated file
 server has new file but does not know the old file
 update without sending entire f_new (using similarity)
 rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch, since the server has both copies of the files

The rsync algorithm

[Figure: the Client (holding f_old) sends block hashes to the Server (holding f_new); the Server answers with an encoded file from which the Client rebuilds f_new.]

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
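A toy sketch of the block-matching idea (my own simplification, not the real rsync protocol or wire format): the client checksums fixed-size blocks of f_old; the other side slides over f_new, using a cheap "weak" checksum to find candidates and a stronger hash to confirm, emitting copy instructions or literal bytes. The real tool uses a 4-byte rolling checksum updated in O(1) per shift plus truncated MD5, as listed above.

import hashlib

# Toy sketch of rsync-style block matching (not the real protocol or checksums).
def weak(block):                       # stand-in for the rolling checksum
    return sum(block) % (1 << 16)

def match_blocks(f_old, f_new, B):
    # Client side: checksum every B-sized block of the old file.
    table = {}
    for i in range(0, len(f_old) - B + 1, B):
        blk = f_old[i:i + B]
        table.setdefault(weak(blk), []).append((i, hashlib.md5(blk).hexdigest()))
    # Server side: slide over the new file; emit (COPY, offset) or literal bytes.
    out, j = [], 0
    while j + B <= len(f_new):
        w = weak(f_new[j:j + B])       # a real implementation updates this in O(1) per shift
        hit = next((off for off, h in table.get(w, [])
                    if h == hashlib.md5(f_new[j:j + B]).hexdigest()), None)
        if hit is not None:
            out.append(("COPY", hit)); j += B
        else:
            out.append(("LIT", f_new[j:j + 1])); j += 1
    out.extend(("LIT", f_new[k:k + 1]) for k in range(j, len(f_new)))
    return out

print(match_blocks(b"the quick brown fox", b"the quick red fox", B=4))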

Rsync: some experiments

           gcc size   emacs size
total      27288      27326
gzip       7563       8577
zdelta     227        1431
rsync      964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), the client checks them
Server deploys the common f_ref to compress the new f_tar (rsync compresses just it).

A multi-round protocol

k blocks of n/k elems
log n/k levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

[Figure: P aligned at position i of T, covering a prefix of the suffix T[i,N].]

Occurrences of P in T = All suffixes of T having P as a prefix

Example: P = si, T = mississippi  →  occurrences at positions 4 and 7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree

T# = mississippi#

[Figure: the suffix tree of T#; edges carry substring labels (e.g. “si”, “ssi”, “ppi#”, “pi#”, “i#”, “#”, “mississippi#”) and the 12 leaves store the starting positions 1..12 of the suffixes.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

T = mississippi#        (storing SUF(T) explicitly would take Θ(N²) space)

 SA    SUF(T)
 12    #
 11    i#
  8    ippi#
  5    issippi#
  2    ississippi#
  1    mississippi#
 10    pi#
  9    ppi#
  7    sippi#
  4    sissippi#
  6    ssippi#
  3    ssissippi#

P = si  →  a suffix pointer into the contiguous range of suffixes prefixed by si

Suffix Array space:
• SA: Θ(N log₂ N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step

[Figure: two steps of the binary search for P = si over the SA of T = mississippi#; one probe finds “P is larger”, the next “P is smaller”, halving the SA range each time.]

Suffix Array search
• O(log₂ N) binary-search steps
• Each step takes O(p) char comparisons
 overall, O(p log₂ N) time

Improvable to O(p + log₂ N) [Manber-Myers, ’90] and to O(p + log₂ |Σ|) with Suffix Trays [Cole et al., ’06]
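A compact sketch of the binary search over SA (using bisect with a key function, which requires Python ≥ 3.10): each comparison looks at most at p characters of the probed suffix, giving the O(p log₂ N) bound above.

import bisect

# Sketch of pattern search in a suffix array: binary search for the contiguous
# range of suffixes having P as a prefix.
def sa_search(T, SA, P):
    lo = bisect.bisect_left(SA, P, key=lambda i: T[i:i + len(P)])
    hi = bisect.bisect_right(SA, P, key=lambda i: T[i:i + len(P)])
    return SA[lo:hi]                                  # starting positions of the occurrences

T = "mississippi#"
SA = sorted(range(len(T)), key=lambda i: T[i:])
print(sorted(p + 1 for p in sa_search(T, SA, "si")))  # -> [4, 7] (1-based, as in the slides)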

Locating the occurrences

[Figure: for P = si on T = mississippi#, the binary searches for si# and si$ (with # < Σ < $) delimit the contiguous SA range; here occ = 2, at text positions 4 and 7.]

Suffix Array search
• O(p + log₂ N + occ) time

Suffix Trays: O(p + log₂ |Σ| + occ)   [Cole et al., ‘06]
String B-tree   [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays   [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

T = mississippi#
SA  = 12 11 8 5 2 1 10 9 7 4 6 3
Lcp =  0  0 1 4 0 0  1 0 2 1 3

[The entry 4 is the lcp of the adjacent suffixes issippi# and ississippi#.]

• How long is the common prefix between T[i,...] and T[j,...] ?
 • Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
 • Search for an entry Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times ?
 • Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.
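A sketch of the Lcp array (computed naively here; it can be built in O(N) time) and of the last two queries above, on the mississippi# example.

# Naive Lcp construction from SA, plus the two repeat queries from this slide.
def lcp_array(T, SA):
    def lcp(a, b):
        k = 0
        while a + k < len(T) and b + k < len(T) and T[a + k] == T[b + k]:
            k += 1
        return k
    return [lcp(SA[i], SA[i + 1]) for i in range(len(SA) - 1)]

T = "mississippi#"
SA = sorted(range(len(T)), key=lambda i: T[i:])
Lcp = lcp_array(T, SA)
L_, C = 3, 3
print(any(v >= L_ for v in Lcp))                     # repeated substring of length >= 3 ?  (True: "issi")
print(any(min(Lcp[i:i + C - 1]) >= L_                # substring of length >= 3 occurring >= 3 times ?  (False)
          for i in range(len(Lcp) - C + 2)))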


Slide 61

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
Context
Empty

Counts
A
B
C
$

=
=
=
=

4
2
5
3

Context
A
B
C

Counts
C
$
A
$
A
B
C
$

=
=
=
=
=
=
=
=

3
1
2
1
1
2
2
3

Context
AC

BA
CA
CB
CC

String = ACCBACCACBA B

k=2

Counts
B
C
$
C
$
C
$
A
$
A
B
$

=
=
=
=
=
=
=
=
=
=
=
=

1
2
2
1
1
1
1
2
1
1
1
2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
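Below is a runnable Python sketch of both sides (an assumption for the demo: the initial dictionary is just {a:112, b:113, c:114}, the slide's toy numbering rather than real ASCII codes). The decoder's else-branch is exactly the special case above: a code the decoder has not defined yet must denote the previous phrase plus its own first character.

    def lzw_encode(text, dict0):
        d = dict(dict0)                          # phrase -> code
        nxt, out, w = 256, [], ""
        for c in text:
            if w + c in d:
                w += c
            else:
                out.append(d[w])
                d[w + c] = nxt; nxt += 1
                w = c
        if w:
            out.append(d[w])
        return out

    def lzw_decode(codes, dict0):
        inv = {v: k for k, v in dict0.items()}   # code -> phrase
        nxt = 256
        w = inv[codes[0]]; out = [w]
        for code in codes[1:]:
            entry = inv[code] if code in inv else w + w[0]   # the special case
            out.append(entry)
            inv[nxt] = w + entry[0]; nxt += 1
            w = entry
        return "".join(out)

    d0 = {"a": 112, "b": 113, "c": 114}
    codes = lzw_encode("aabaacababacb", d0)
    print(codes)                  # [112, 112, 113, 256, 114, 257, 261, 114, 113]
    print(lzw_decode(codes, d0))  # aabaacababacb   (code 261 exercises the special case)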

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input   Output so far               Dict
112     a
112     a a                         256=aa
113     a a b                       257=ab
256     a a b a a                   258=ba
114     a a b a a c                 259=aac
257     a a b a a c a b ?           260=ca
261     ?                           (261 is not yet in the dictionary)
114     a a b a a c a b a b a ...   261=aba   one step later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a certain size (used in GIF)
 Throw the dictionary away when it is no longer effective at compressing (e.g. compress)
 Throw the least-recently-used (LRU) entry away when it reaches a certain size (used in BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
We are given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

(1994)

F               L
# mississipp   i
i #mississip   p
i ppi#missis   s
i ssippi#mis   s
i ssissippi#   m
m ississippi   #
p i#mississi   p
p pi#mississ   i
s ippi#missi   s
s issippi#mi   s
s sippi#miss   i
s sissippi#m   i

A famous example

Much
longer...

A useful tool: the L → F mapping
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



How do we map L’s chars onto F’s chars ?
... Need to distinguish equal chars in F ...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
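A runnable Python sketch of InvertBWT (assuming '#' is the unique terminator and sorts before every other character, as in the example); LF is obtained as the inverse of the stable sort of L's positions.

    def bwt_inverse(L):
        n = len(L)
        order = sorted(range(n), key=lambda i: L[i])   # stable: equal chars keep their order
        LF = [0] * n                                   # LF[i] = row whose first char is the
        for rank, pos in enumerate(order):             # same text occurrence as L[i]
            LF[pos] = rank
        out, r = [], 0                                 # row 0 is the one starting with '#'
        for _ in range(n - 1):                         # recover the text chars, last to first
            out.append(L[r])
            r = LF[r]
        return "".join(reversed(out)) + "#"

    print(bwt_inverse("ipssm#pissii"))                 # mississippi#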

How to compute the BWT ?
SA    BWT matrix     L
12    #mississipp    i
11    i#mississip    p
 8    ippi#missis    s
 5    issippi#mis    s
 2    ississippi#    m
 1    mississippi    #
10    pi#mississi    p
 9    ppi#mississ    i
 7    sippi#missi    s
 4    sissippi#mi    s
 6    ssippi#miss    i
 3    ssissippi#m    i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12   #
11   i#
 8   ippi#
 5   issippi#
 2   ississippi#
 1   mississippi#
10   pi#
 9   ppi#
 7   sippi#
 4   sissippi#
 6   ssippi#
 3   ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults
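The “elegant but inefficient” construction in Python (a toy sketch: sorting the suffixes explicitly costs Θ(n² log n) character comparisons in the worst case), together with the rule L[i] = T[SA[i]-1]:

    def sa_and_bwt(T):
        n = len(T)
        SA = sorted(range(n), key=lambda i: T[i:])     # sort the suffixes explicitly
        L = "".join(T[i - 1] for i in SA)              # i - 1 wraps to the last char when i == 0
        return SA, L

    SA, L = sa_and_bwt("mississippi#")
    print([i + 1 for i in SA])   # 1-based as in the slides: [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
    print(L)                     # ipssm#pissii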

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)
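As a sketch of the first stage only (Move-to-Front; bzip2's real pipeline then applies run-length and a statistical coder), here is a tiny Python MTF encoder run on the BWT string of the earlier mississippi# example; note how runs of equal characters in L become runs of zeros.

    def mtf_encode(s, alphabet):
        lst = list(alphabet)
        out = []
        for c in s:
            i = lst.index(c)       # output the current position of c ...
            out.append(i)
            lst.pop(i)             # ... and move c to the front of the list
            lst.insert(0, c)
        return out

    L = "ipssm#pissii"
    print(mtf_encode(L, sorted(set(L))))   # [1, 3, 4, 0, 4, 4, 3, 4, 4, 0, 1, 0]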

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion pages available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)
 Set of nodes such that from any node one can go to any other node via an undirected path.

Strongly connected components (SCC)
 Set of nodes such that from any node one can go to any other node via a directed path.

[Figure: an example WCC and SCC.]

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by humankind

Exploit the structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph
 V = Routers
 E = communication links

The “cosine” graph (undirected, weighted)
 V = static web pages
 E = semantic distance between pages

Query-Log graph (bipartite, weighted)
 V = queries and URLs
 E = (q,u) if u is a result for q, and has been clicked by some user who issued q

Social graph (undirected, unweighted)
 V = users
 E = (x,y) if x knows y (facebook, address book, email, ..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has a hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999; WebBase crawl, 2001 [plots]
Indegree follows a power law distribution:

Pr[ in-degree(u) = k ]  ∝  1 / k^a ,   a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has a hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/x^a, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
[Figure: the (i,j) adjacency-matrix plot of a crawl with 21 million pages and 150 million links, with URL-sorting; the Berkeley and Stanford hosts are marked.]

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s_1 - x, s_2 - s_1 - 1, ..., s_k - s_{k-1} - 1}

For negative entries (only the first gap can be negative): map them to naturals, as in the sketch after the interval-compression example below.

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit in the copy-list of y tells whether the corresponding successor of the reference x is also a successor of y;
The reference index is chosen in [0,W] so as to give the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1
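A small Python sketch of two of the numeric transforms used above (with a hypothetical node and successor list, just to show the shapes of the numbers): plain gap encoding of a successor list, and the mapping of a possibly-negative first gap to a natural number.

    def to_nat(d):
        # signed gap -> natural: 2d if d >= 0, 2|d| - 1 if d < 0
        return 2 * d if d >= 0 else 2 * (-d) - 1

    def gap_encode(x, succ):
        # succ = sorted successor list of node x; first entry relative to x,
        # the others as (difference - 1)
        gaps = [to_nat(succ[0] - x)]
        gaps += [succ[i] - succ[i - 1] - 1 for i in range(1, len(succ))]
        return gaps

    print(to_nat(15 - 15), to_nat(13 - 15), to_nat(316 - 16))   # 0, 3, 600  (as above)
    print(gap_encode(15, [13, 15, 16, 17, 19, 23]))             # [3, 1, 0, 0, 1, 3]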

Algoritmi per IR

Compression of file collections

Background
[Figure: a sender transmits data to a receiver; the receiver already holds some knowledge about the data.]

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver
 Unstructured file: delta compression

“Partial” knowledge
 Unstructured files: file synchronization
 Record-based data: set reconciliation

Formalization


Delta compression   [diff, zdelta, REBL, …]
 Compress file f deploying file f’
 Compress a group of files
 Speed-up web access by sending differences between the requested page and the ones available in cache

File synchronization   [rsync, zsync]
 Client updates old file f_old with f_new available on a server
 Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation
 Client updates structured old file f_old with f_new available on a server
 Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides an efficient, optimal solution

 fknown is the “previously encoded text”: compress the concatenation fknown fnew, emitting output only from fnew onwards

zdelta is one of the best implementations
          Emacs size   Emacs time
uncompr   27Mb         ---
gzip      8Mb          35 secs
zdelta    1.5Mb        42 secs
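zdelta itself is a standalone tool, but the same one-to-one idea can be sketched with Python's zlib by priming the compressor with fknown as a preset dictionary (this illustrates the principle, it is not zdelta's actual format):

    import zlib

    def delta_compress(f_known: bytes, f_new: bytes) -> bytes:
        c = zlib.compressobj(level=9, zdict=f_known)   # f_known acts as the "already seen" text
        return c.compress(f_new) + c.flush()

    def delta_decompress(f_known: bytes, f_delta: bytes) -> bytes:
        d = zlib.decompressobj(zdict=f_known)
        return d.decompress(f_delta) + d.flush()

    f_known = b"The cache keeps an old version of the requested web page on the client side."
    f_new   = b"The cache keeps a new version of the requested web page on the server side."
    delta = delta_compress(f_known, f_new)
    assert delta_decompress(f_known, delta) == f_new
    print(len(f_new), len(zlib.compress(f_new, 9)), len(delta))   # the delta is typically the shortest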

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

[Figure: the client-side proxy and the server-side proxy each keep a reference (cached) page; requests flow from client to web, full pages travel over the fast link, and only the delta-encoding crosses the slow link.]

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: an example graph GF over files 1, 2, 3, 5 plus the dummy node 0; edge weights (e.g. 20, 123, 220, 620, 2000) are the zdelta/gzip sizes, and the min branching picks the cheapest reference for each file.]

          space   time
uncompr   30Mb    ---
tgz       20%     linear
THIS      8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
          space   time
uncompr   260Mb   ---
tgz       12%     2 mins
THIS      8%      16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
[Figure: the client, holding f_old, sends a request; the server, holding f_new, must send back an update.]

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
[Figure: the client, holding f_old, sends the block hashes; the server, holding f_new, sends back the encoded file (block references plus literals).]

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
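The heart of the single round trip is the rolling weak checksum: the server can slide a window over f_new one byte at a time, updating the checksum in O(1), and only invoke the strong hash on weak matches. A Python sketch of such a checksum (rsync's real one is adler32-like; the constants here are illustrative):

    def weak_checksum(block):
        a = sum(block) & 0xFFFF
        b = sum((len(block) - i) * x for i, x in enumerate(block)) & 0xFFFF
        return (b << 16) | a

    def roll(a, b, out_byte, in_byte, blocksize):
        # slide the window one byte to the right in O(1)
        a = (a - out_byte + in_byte) & 0xFFFF
        b = (b - blocksize * out_byte + a) & 0xFFFF
        return a, b

    data, n = b"abcdefghij", 4
    a = sum(data[:n]) & 0xFFFF
    b = sum((n - i) * x for i, x in enumerate(data[:n])) & 0xFFFF
    a, b = roll(a, b, data[0], data[n], n)
    assert ((b << 16) | a) == weak_checksum(data[1:n + 1])   # same value, O(1) update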

Rsync: some experiments

          gcc size   emacs size
total     27288      27326
gzip      7563       8577
zdelta    227        1431
rsync     964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends the hashes (in rsync it is the client that sends them), and the client checks them
The server deploys the common f_ref to compress the new f_tar (rsync just compresses it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

[Figure: P aligned at position i of T, matching a prefix of the suffix T[i,N].]
Occurrences of P in T = All suffixes of T having P as a prefix

Example: P = si, T = mississippi → occurrences at positions 4 and 7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Figure: the suffix tree of T# = mississippi#: edge labels are substrings of T# (e.g. i, s, p, si, ssi, ppi#, pi#, i#, mississippi#, #) and each leaf stores the starting position (1..12) of its suffix.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
T = mississippi#     (storing SUF(T) explicitly would take Θ(N²) space)

SA     SUF(T)
12     #
11     i#
 8     ippi#
 5     issippi#
 2     ississippi#
 1     mississippi#
10     pi#
 9     ppi#
 7     sippi#
 4     sissippi#
 6     ssippi#
 3     ssissippi#

(each SA entry is a suffix pointer into T)

P = si
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

Improvable to O(p + log2 N) [Manber-Myers, ’90], and to O(p + log2 |S|) [Cole et al, ’06]
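A hand-rolled version of this indirect binary search in Python (a sketch; each comparison inspects at most p = |P| characters of one suffix):

    def sa_search(T, SA, P):
        # Prop 1: the suffixes prefixed by P form a contiguous range of SA
        def boundary(right):
            lo, hi = 0, len(SA)
            while lo < hi:
                mid = (lo + hi) // 2
                pref = T[SA[mid]:SA[mid] + len(P)]      # O(p) characters per step
                if pref < P or (right and pref == P):
                    lo = mid + 1
                else:
                    hi = mid
            return lo
        left, r = boundary(False), boundary(True)
        return [SA[k] + 1 for k in range(left, r)]      # 1-based positions, as in the slides

    T = "mississippi#"
    SA = sorted(range(len(T)), key=lambda i: T[i:])
    print(sa_search(T, SA, "si"))                       # [7, 4] -> occurrences at 4 and 7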

Locating the occurrences
T = mississippi#,  P = si
[Figure: binary searching SA between the boundaries “si#” and “si$” isolates the contiguous range of suffixes prefixed by si, namely the entries 7 (sippi…) and 4 (sissippi…); hence occ = 2 and the occurrences are at positions 4 and 7.]

Suffix Array search
• O (p + log2 N + occ) time

(assuming sentinel characters with # < S < $)
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

T = mississippi#
SA  = 12 11 8 5 2 1 10 9 7 4 6 3
Lcp =  0  0 1 4 0 0  1 0 2 1 3
(e.g. the suffixes issippi… and ississippi…, adjacent in SA, share a prefix of length 4)
• How long is the common prefix between T[i,...] and T[j,...] ?
• It is the min of the subarray Lcp[h,k-1], where SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
• Search for an entry Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
• Search for a window Lcp[i,i+C-2] whose entries are all ≥ L
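A toy Python sketch of these queries (quadratic LCP construction, just for illustration; linear-time constructions exist):

    def lcp_array(T, SA):
        def lcp(a, b):
            k = 0
            while a + k < len(T) and b + k < len(T) and T[a + k] == T[b + k]:
                k += 1
            return k
        return [lcp(SA[i], SA[i + 1]) for i in range(len(SA) - 1)]

    T = "mississippi#"
    SA = sorted(range(len(T)), key=lambda i: T[i:])
    Lcp = lcp_array(T, SA)
    # a repeated substring of length >= L exists iff some Lcp entry is >= L
    print(max(Lcp))                                              # 4  ("issi" occurs twice)
    # a substring of length >= L occurring >= C times exists iff C-1 consecutive
    # Lcp entries are all >= L
    C, L_min = 3, 1
    print(any(min(Lcp[i:i + C - 1]) >= L_min
              for i in range(len(Lcp) - C + 2)))                 # True ("i" and "s" occur 4 times)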


Slide 62

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
Context
Empty

Counts
A
B
C
$

=
=
=
=

4
2
5
3

Context
A
B
C

Counts
C
$
A
$
A
B
C
$

=
=
=
=
=
=
=
=

3
1
2
1
1
2
2
3

Context
AC

BA
CA
CB
CC

String = ACCBACCACBA B

k=2

Counts
B
C
$
C
$
C
$
A
$
A
B
$

=
=
=
=
=
=
=
=
=
=
=
=

1
2
2
1
1
1
1
2
1
1
1
2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us be given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
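A C sketch of the same backward reconstruction, with LF built by counting: LF[i] = (number of chars of L smaller than L[i]) + (rank of L[i] among its equal chars up to position i). Array sizes are toy values and the text is assumed to end with a unique smallest char '#'.

#include <stdio.h>
#include <string.h>

void invert_bwt(const char *bwt, int n, char *T) {
    int C[256] = {0}, seen[256] = {0}, LF[4096];      /* toy sizes */
    for (int i = 0; i < n; i++) C[(unsigned char)bwt[i]]++;
    for (int c = 255, tot = n; c >= 0; c--) { tot -= C[c]; C[c] = tot; }
    /* now C[c] = #chars smaller than c = first row of F starting with c */
    for (int i = 0; i < n; i++) {
        unsigned char c = bwt[i];
        LF[i] = C[c] + seen[c];                       /* property 1: L[i] -> its F row */
        seen[c]++;
    }
    int k = (int)(strchr(bwt, '#') - bwt);            /* row whose rotation is T itself */
    for (int i = n - 1; i >= 0; i--) {                /* property 2: L[k] precedes F[k] */
        T[i] = bwt[k];
        k = LF[k];
    }
    T[n] = '\0';
}

int main(void) {
    const char *L = "ipssm#pissii";                   /* BWT of mississippi# (see above) */
    char T[64];
    invert_bwt(L, strlen(L), T);
    printf("%s\n", T);                                /* prints mississippi# */
    return 0;
}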

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n^2 log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults
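A C sketch of this elegant-but-inefficient construction: sort the suffixes with qsort on plain string comparisons (with a unique smallest endmarker, suffix order equals rotation order) and then read L[i] = T[SA[i]−1].

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static const char *text;                              /* shared by the comparator */

static int suf_cmp(const void *a, const void *b) {
    return strcmp(text + *(const int *)a, text + *(const int *)b);
}

int main(void) {
    const char *T = "mississippi#";                   /* '#' = unique smallest endmarker */
    int n = strlen(T), SA[64];
    text = T;
    for (int i = 0; i < n; i++) SA[i] = i;
    qsort(SA, n, sizeof(int), suf_cmp);               /* Θ(n^2 log n) worst case */
    for (int i = 0; i < n; i++)                       /* L[i] = char preceding suffix SA[i] */
        putchar(T[(SA[i] + n - 1) % n]);
    putchar('\n');                                    /* prints ipssm#pissii */
    return 0;
}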

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion pages available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit the structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has a hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pr that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in-degree(u) = k ]  ∝  1/k^a ,    a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has a hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pr that a node has x links is 1/x^a, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 million pages, 150 million links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1
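A toy C sketch of the locality step alone: gap-encode a sorted successor list, mapping the possibly negative first gap to a non-negative value as above. The copy-list/copy-block and interval machinery of WebGraph is omitted, and the example successor list is made up for illustration.

#include <stdio.h>

/* v-mapping of a possibly negative gap g: 2g if g >= 0, 2|g|-1 otherwise. */
static unsigned nat(int g) { return g >= 0 ? 2u * g : 2u * (-g) - 1; }

void encode_gaps(int x, const int *succ, int k) {
    if (k == 0) return;
    printf("%u ", nat(succ[0] - x));                  /* first successor wrt node x */
    for (int i = 1; i < k; i++)
        printf("%d ", succ[i] - succ[i-1] - 1);       /* remaining gaps are >= 0 */
    printf("\n");
}

int main(void) {
    int succ[] = {13, 15, 16, 17, 18, 19, 23, 24, 203};   /* successors of node 15 */
    encode_gaps(15, succ, 9);                         /* prints 3 1 0 0 0 0 3 0 178 */
    return 0;
}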

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77-scheme provides an efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
          Emacs size   Emacs time
uncompr   27Mb         ---
gzip      8Mb          35 secs
zdelta    1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: example weighted graph G_F with the dummy node; edge weights (zdelta/gzip sizes) such as 20, 123, 220, 620, 2000.]

          space   time
uncompr   30Mb    ---
tgz       20%     linear
THIS      8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n^2 time

          space    time
uncompr   260Mb    ---
tgz       12%      2 mins
THIS      8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
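A C sketch of a 4-byte rolling (Adler-like) weak checksum in the spirit of rsync's; the per-block strong hash (MD5/MD4) and the block-matching protocol are omitted, and names are illustrative.

#include <stdio.h>
#include <string.h>
#include <stdint.h>

/* Weak rolling checksum over a window of B bytes: a = sum of bytes,
   b = weighted sum; both kept mod 2^16. */
typedef struct { uint32_t a, b; } Roll;

Roll roll_init(const unsigned char *x, int B) {
    Roll r = {0, 0};
    for (int i = 0; i < B; i++) { r.a += x[i]; r.b += (uint32_t)(B - i) * x[i]; }
    r.a &= 0xffff; r.b &= 0xffff;
    return r;
}

/* Slide the window one byte to the right: drop 'out', append 'in'. O(1). */
Roll roll_slide(Roll r, unsigned char out, unsigned char in, int B) {
    r.a = (r.a - out + in) & 0xffff;
    r.b = (r.b - (uint32_t)B * out + r.a) & 0xffff;
    return r;
}

int main(void) {
    const unsigned char *s = (const unsigned char *)"the quick brown fox";
    int B = 4, n = strlen((const char *)s);
    Roll r = roll_init(s, B);
    for (int i = 1; i + B <= n; i++) {
        r = roll_slide(r, s[i-1], s[i+B-1], B);
        Roll check = roll_init(s + i, B);             /* recompute from scratch */
        if (r.a != check.a || r.b != check.b) { printf("mismatch!\n"); return 1; }
    }
    printf("rolling checksum consistent on all windows\n");
    return 0;
}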

Rsync: some experiments

         gcc size   emacs size
total    27288      27326
gzip     7563       8577
zdelta   227        1431
rsync    964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance between the files is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance between the files is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
#

0

s

i

1
ssi

ppi#

4

#

1

p

i

3

2

1

ppi#

6

pi#

7

8
5

T# = mississippi#
2 4 6 8 10

si

ppi#

i#

ppi#

11

mississippi#

12

2

1 10

9

4

3

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Θ(N^2) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirect binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

Improvements: O(p + log2 N) time [Manber-Myers, '90];  O(p + log2 |S|) time [Cole et al, '06]
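A C sketch of the indirect binary search (lower bound of P in SA, then a scan of the contiguous occurrences); the suffix array is the 0-based version of the one in these slides, and each comparison costs O(p) via strncmp.

#include <stdio.h>
#include <string.h>

/* First position in SA whose suffix is >= P on the first p chars. */
int sa_lower_bound(const char *T, const int *SA, int n, const char *P) {
    int p = strlen(P), lo = 0, hi = n;
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (strncmp(T + SA[mid], P, p) < 0) lo = mid + 1;   /* P is larger */
        else hi = mid;                                      /* P is smaller or equal */
    }
    return lo;
}

int main(void) {
    const char *T = "mississippi#";
    int SA[] = {11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2};      /* 0-based suffix array */
    int n = 12, first = sa_lower_bound(T, SA, n, "si");
    for (int i = first; i < n && strncmp(T + SA[i], "si", 2) == 0; i++)
        printf("occurrence at position %d\n", SA[i] + 1);   /* prints 7 and 4 */
    return 0;
}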

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
• Search for Lcp[i,i+C-2] whose entries are ≥ L



The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 10^4 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and the algorithm uses all of them:
(1/B) * (p * f/(1+f) * C)  ≈  30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
        4K    8K    16K   32K    128K   256K   512K   1M
n^3     22s   3m    26m   3.5h   28h    --     --     --
n^2     0     0     0     1s     26s    106s   7m     28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT
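A C sketch of the linear scan just described (a Kadane-style algorithm), run on the slide's array; like the slide, it assumes the optimal sum is positive.

#include <stdio.h>

int main(void) {
    int A[] = {2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7};
    int n = 11, sum = 0, max = 0;          /* assumes a positive-sum optimum exists */
    for (int i = 0; i < n; i++) {
        if (sum + A[i] <= 0) sum = 0;      /* a prefix with sum <= 0 never helps */
        else {
            sum += A[i];
            if (sum > max) max = sum;      /* best window seen so far */
        }
    }
    printf("max subarray sum = %d\n", max);   /* 12, i.e. 6+1-2+4+3 */
    return 0;
}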

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort ⇒ Θ(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 10^9 random I/Os = 10^9 * 5ms ≈ 2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
  if (i < j) then
    m = (i+j)/2;          // Divide
    Merge-Sort(A,i,m);    // Conquer
    Merge-Sort(A,m+1,j);
    Merge(A,i,m,j)        // Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Θ(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
[Figure: merge-sort recursion tree on an example array — log2 N levels, runs merged pairwise at each level.]

How do we deploy the disk/mem features ?

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X ≤ M/B runs  ⇒  log_{M/B} (N/M) passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = log_{M/B} #runs  ≤  log_{M/B} (N/M)

Optimal cost = Θ((N/B) log_{M/B} (N/M)) I/Os

In practice

M/B ≈ 1000  ⇒  #passes = log_{M/B} (N/M) ≈ 1
One multiway merge  ⇒  2 passes (R/W) = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

May compression help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables <X,C>, with X = first item and C = 1
For each subsequent item s of the stream,
  if (X == s) then C++
  else { C--; if (C == 0) { X = s; C = 1; } }
Return X;

Proof
(The algorithm may fail if the mode occurs ≤ N/2 times.)
If the returned X ≠ y, then every one of y's occurrences has a distinct
"negative" mate, hence the mates are at least #occ(y) and so 2 · #occ(y) ≤ N.
This contradicts the assumption #occ(y) > N/2, so the returned X must be y.
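A C sketch of the two-variable streaming algorithm (the Boyer–Moore majority vote, written in its usual form), run on the slide's stream; a second pass would be needed to verify the candidate when the promise "mode > N/2" does not hold.

#include <stdio.h>
#include <string.h>

int main(void) {
    const char *A = "bacccdcbaaaccbccc";     /* the stream from the slide */
    int N = strlen(A), C = 0;
    char X = 0;
    for (int i = 0; i < N; i++) {
        if (C == 0)        { X = A[i]; C = 1; }   /* adopt a new candidate */
        else if (X == A[i])  C++;
        else                 C--;                 /* pair off with a "negative" mate */
    }
    printf("candidate = %c\n", X);            /* prints c (9 occurrences > 17/2) */
    return 0;
}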

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 10^9 chars, i.e. size = 6Gb
n = 10^6 documents
TotT = 10^9 term occurrences (avg term length is 6 chars)
t = 5 * 10^5 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million documents, t = 500K terms;  entry = 1 if the play contains the word, 0 otherwise

            Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony             1                1             0          0        0        1
Brutus             1                1             0          1        0        0
Caesar             1                1             0          1        1        1
Calpurnia          0                1             0          0        0        0
Cleopatra          1                0             0          0        0        0
mercy              1                0             1          1        1        1
worser             1                0             1          1        1        0

Space is 500Gb !

Solution 2: Inverted index
Brutus    →  2  4  8  16  32  64  128
Calpurnia →  1  2  3  5  8  13  21  34
Caesar    →  13  16

We can still do better: i.e. 30-50% of the original text

1. Typically we use about 12 bytes per posting
2. We have 10^9 total terms  →  at least 12Gb space
3. Compressing the 6Gb of documents gets 1.5Gb of data
Better index, but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

∑_{i=1,…,n−1} 2^i  =  2^n − 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i(s) = log2 (1/p(s)) = − log2 p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H(S) = ∑_{s ∈ S} p(s) · log2 (1/p(s))    bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La(C) = ∑_{s ∈ S} p(s) · L[s]

We say that a prefix code C is optimal if for
all prefix codes C', La(C) ≤ La(C')

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  ⇒  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H(S) ≤ La(C)
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La(C) ≤ H(S) + 1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
[Figure: Huffman tree built by merging a(.1) + b(.2) → (.3), then (.3) + c(.2) → (.5), then (.5) + d(.5) → (1).]

a = 000, b = 001, c = 01, d = 1
There are 2^(n−1) "equivalent" Huffman trees

What about ties (and thus, tree depth) ?
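A minimal C sketch that builds the Huffman tree of the running example by repeatedly merging the two smallest-weight nodes (plain O(n^2) selection, no heap) and prints the codeword lengths; ties are broken arbitrarily, so the codewords may differ from the slide while the lengths and the average length agree.

#include <stdio.h>

#define NSYM 4
/* Huffman over p(a)=.1 p(b)=.2 p(c)=.2 p(d)=.5, computing code lengths. */
int main(void) {
    const char *name = "abcd";
    double w[2 * NSYM] = {0.1, 0.2, 0.2, 0.5};    /* leaves, then internal nodes */
    int parent[2 * NSYM], alive[2 * NSYM] = {1, 1, 1, 1};
    int nodes = NSYM;

    for (int step = 0; step < NSYM - 1; step++) { /* n-1 merges build the tree */
        int m1 = -1, m2 = -1;
        for (int i = 0; i < nodes; i++) {
            if (!alive[i]) continue;
            if (m1 < 0 || w[i] < w[m1]) { m2 = m1; m1 = i; }
            else if (m2 < 0 || w[i] < w[m2]) m2 = i;
        }
        w[nodes] = w[m1] + w[m2];                 /* new internal node */
        alive[nodes] = 1; alive[m1] = alive[m2] = 0;
        parent[m1] = parent[m2] = nodes;
        nodes++;
    }
    for (int s = 0; s < NSYM; s++) {              /* depth of a leaf = code length */
        int len = 0;
        for (int v = s; v != nodes - 1; v = parent[v]) len++;
        printf("L[%c] = %d\n", name[s], len);     /* a:3 b:3 c:2 d:1 */
    }
    return 0;
}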

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
abc...  →  00000101
101001...  →  dcb

[Figure: the same Huffman tree as above, used for encoding and decoding.]

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:
  firstcode[L]   (= 00.....0)
  Symbol[L,i], for each i in level L
This is ≤ h^2 + |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...
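A C sketch of the canonical decoding loop, under the common convention that firstcode[L] is the numeric value of the smallest codeword of length L (this numbering may differ from the one on the slide); the toy code, the symbol tables and the bit source are illustrative placeholders.

#include <stdio.h>

#define MAXLEN 8
/* Toy canonical code: lengths a:1, b:2, c:3, d:3  ->  a=0, b=10, c=110, d=111 */
static int firstcode[MAXLEN + 1] = {0, 0, 2, 6};
static int numl[MAXLEN + 1]      = {0, 1, 1, 2};
static char symbol[MAXLEN + 1][4] = {"", "a", "b", "cd"};

static const char *bits = "110010111";      /* encodes c a b d */
static int pos = 0;
static int nextbit(void) { return bits[pos++] - '0'; }

int main(void) {
    while (bits[pos]) {
        int v = nextbit(), len = 1;
        /* extend the codeword until its value falls in the length-len range */
        while (numl[len] == 0 || v - firstcode[len] >= numl[len]) {
            v = 2 * v + nextbit();
            len++;
        }
        putchar(symbol[len][v - firstcode[len]]);   /* rank within that length */
    }
    putchar('\n');                                  /* prints cabd */
    return 0;
}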

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
− log2 (.999) ≈ .00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

  H(s) = ∑_{i=1,…,m} 2^{m−i} · s[i]

P = 0101
H(P) = 2^3·0 + 2^2·1 + 2^1·0 + 2^0·1 = 5

s = s' if and only if H(s) = H(s')

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H(T_r) = 2·H(T_{r−1}) − 2^m·T(r−1) + T(r+m−1)

T = 10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H(T1) = H(1011) = 11
H(T2) = H(0110) = 2·11 − 2^4·1 + 0 = 22 − 16 + 0 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1·2 (mod 7) + 0 = 2
2·2 (mod 7) + 1 = 5
5·2 (mod 7) + 1 = 4
4·2 (mod 7) + 1 = 2
2·2 (mod 7) + 1 = 5
5 (mod 7) = 5 = Hq(P)

We can still compute Hq(T_r) from Hq(T_{r−1}):
  2^m (mod q) = 2·(2^{m−1} (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
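A C sketch of the fingerprint scan with a small fixed prime q = 7 (in practice q is chosen at random from a suitable range), on the binary example of these slides; probable matches are verified, so no false match is reported.

#include <stdio.h>
#include <string.h>

int main(void) {
    const char *T = "10110101", *P = "0101";
    long long q = 7;                                /* in practice: a random prime */
    int n = strlen(T), m = strlen(P);

    long long hP = 0, hT = 0, pow = 1;              /* pow = 2^(m-1) mod q */
    for (int i = 0; i < m - 1; i++) pow = (pow * 2) % q;
    for (int i = 0; i < m; i++) {                   /* fingerprints of P and T_1 */
        hP = (hP * 2 + (P[i] - '0')) % q;
        hT = (hT * 2 + (T[i] - '0')) % q;
    }
    for (int r = 0; r + m <= n; r++) {
        if (hP == hT && strncmp(T + r, P, m) == 0)  /* check: filter false matches */
            printf("occurrence at position %d\n", r + 1);
        if (r + m < n)                              /* H_q(T_{r+1}) from H_q(T_r) */
            hT = ((hT - (T[r] - '0') * pow % q + q) * 2 + (T[r + m] - '0')) % q;
    }
    return 0;                                       /* prints: occurrence at position 5 */
}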

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

[Figure: the m × n matrix M for T = california and P = for. The only entries equal to 1 are M(1,5), M(2,6), M(3,7): an occurrence of P ends at position j = 7 of T.]

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the (j−1)-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M(j) = BitShift(M(j−1)) & U(T[j])


For i > 1, entry M(i,j) = 1 iff
(1) the first i−1 characters of P match the i−1 characters of T ending at character j−1  ⇔  M(i−1,j−1) = 1
(2) P[i] = T[j]  ⇔  the i-th bit of U(T[j]) = 1

BitShift moves bit M(i−1,j−1) into the i-th position;
AND-ing this with the i-th bit of U(T[j]) establishes whether both hold.

An example (P = abaac, T = xabxabaaca): the columns M(1), M(2), …, M(10) are computed one from the other via M(j) = BitShift(M(j−1)) & U(T[j]). For instance M(2) = (1,0,0,0,0)^t since T[2] = a, M(3) = (0,1,0,0,0)^t since T[3] = b, and in M(9) the 5-th bit equals 1: an occurrence of P ends at position 9 of T.

Shift-And method: Complexity








If m ≤ w, any column and vector U() fit in a memory word
 ⇒ any step requires O(1) time.
If m > w, any column and vector U() can be divided into m/w memory words
 ⇒ any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when the pattern length is close to the word size
 — very often the case in practice. Recall that w = 64 bits in modern architectures.
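A C sketch for patterns fitting in one machine word (m ≤ 64): column M(j) is a uint64_t whose i-th least-significant bit is M(i,j), updated exactly as above; run on the slides' example it reports the occurrence ending at position 9.

#include <stdio.h>
#include <string.h>
#include <stdint.h>

void shift_and(const char *T, const char *P) {
    int n = strlen(T), m = strlen(P);
    uint64_t U[256] = {0}, M = 0, last = 1ULL << (m - 1);
    for (int i = 0; i < m; i++)                        /* U(x): positions of x in P */
        U[(unsigned char)P[i]] |= 1ULL << i;
    for (int j = 0; j < n; j++) {
        M = ((M << 1) | 1) & U[(unsigned char)T[j]];   /* BitShift + AND */
        if (M & last)                                  /* bit m set: full match */
            printf("occurrence ending at position %d\n", j + 1);
    }
}

int main(void) {
    shift_and("xabxabaaca", "abaac");   /* prints: occurrence ending at position 9 */
    return 0;
}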

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U'(c) = U(c) AND R
 ⇒ U'(c)[i] = 1 iff S[i] = c and it is the first symbol of a pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
the first i characters of P match the i characters of T
ending at character j, with no more than l mismatches.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

[Figure: P aligned with T ending at position j−1 with ≤ l mismatches, extended by one matching character at position j.]

BitShift(M^l(j−1)) & U(T[j])

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

[Figure: P aligned with T ending at position j−1 with ≤ l−1 mismatches, extended by one (possibly mismatching) character at position j.]

BitShift(M^(l−1)(j−1))

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M^l(j) = [ BitShift(M^l(j−1)) & U(T[j]) ]  OR  BitShift(M^(l−1)(j−1))

Example M1
(T = xabxabaaca, P = abaad, k = 1): M^0 and M^1 are computed column by column with the recurrence above. M^0 has no entry set in its last row (P does not occur exactly), while M^1(5,9) = 1: P occurs with at most one mismatch ending at position 9 of T.

How much do we pay?





The running time is O(kn(1+m/w)
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum number of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
g-code of x:   000...........0  x in binary,  with (Length − 1) zeroes in front

x > 0 and Length = ⌊log2 x⌋ + 1
e.g., 9 is represented as <000,1001>.

The g-code for x takes 2⌊log2 x⌋ + 1 bits

(ie. factor of 2 from optimal)



Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8   6   3   59   7
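A C sketch of g-encoding and g-decoding to/from a 0/1 string buffer; run on 8, 6, 3, 59, 7 it reproduces the bit sequence of the exercise above.

#include <stdio.h>
#include <string.h>

/* Append the gamma code of x > 0 to buf: (len-1) zeroes, then x in binary
   on len = floor(log2 x) + 1 bits. */
void gamma_enc(unsigned x, char *buf) {
    int len = 0;
    for (unsigned t = x; t > 0; t >>= 1) len++;
    for (int i = 0; i < len - 1; i++) strcat(buf, "0");
    for (int i = len - 1; i >= 0; i--) strcat(buf, (x >> i) & 1 ? "1" : "0");
}

/* Decode one gamma code starting at buf[*pos]; advances *pos. */
unsigned gamma_dec(const char *buf, int *pos) {
    int len = 1;
    while (buf[*pos] == '0') { len++; (*pos)++; }     /* count leading zeroes */
    unsigned x = 0;
    for (int i = 0; i < len; i++) x = 2 * x + (buf[(*pos)++] - '0');
    return x;
}

int main(void) {
    unsigned v[] = {8, 6, 3, 59, 7};
    char buf[128] = "";
    for (int i = 0; i < 5; i++) gamma_enc(v[i], buf);
    printf("%s\n", buf);                /* 0001000001100110000011101100111 */
    for (int i = 0, pos = 0; i < 5; i++) printf("%u ", gamma_dec(buf, &pos));
    printf("\n");                       /* 8 6 3 59 7 */
    return 0;
}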

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How much good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
1 ≥ ∑_{i=1,…,x} pi ≥ x · px   ⇒   x ≤ 1/px

How good is it ?
Encode the integers via g-coding:
  |g(i)| ≤ 2 · log i + 1
The cost of the encoding is (recall i ≤ 1/pi):

  ∑_{i=1,…,|S|} pi · |g(i)|  ≤  ∑_{i=1,…,|S|} pi · [2 · log (1/pi) + 1]  =  2 · H0(X) + 1

Not much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n^2 log n), MTF = O(n log n) + n^2

No much worse than Huffman
...but it may be far better
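A C sketch of the MTF transform over a byte alphabet, keeping the list in a plain array (the search-tree / hash-table speed-ups discussed below are omitted).

#include <stdio.h>
#include <string.h>

void mtf(const char *s, int *out) {
    unsigned char list[256];
    for (int i = 0; i < 256; i++) list[i] = (unsigned char)i;  /* L = [0,1,2,...] */
    for (int j = 0; s[j]; j++) {
        unsigned char c = s[j];
        int pos = 0;
        while (list[pos] != c) pos++;            /* 1) output the position of c in L */
        out[j] = pos;
        memmove(list + 1, list, pos);            /* 2) move c to the front of L */
        list[0] = c;
    }
}

int main(void) {
    const char *s = "aaabbbccc";                 /* high temporal locality */
    int out[16];
    mtf(s, out);
    for (int j = 0; s[j]; j++) printf("%d ", out[j]);
    printf("\n");                                /* 97 0 0 98 0 0 99 0 0 */
    return 0;
}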

MTF: how good is it ?
Encode the integers via g-coding:
|g(i)| ≤ 2 * log i + 1
Put S in the front and consider the cost of encoding:

  O(S log S) + ∑_{x=1,…,S} ∑_{i=2,…,n_x} |g(p_{x,i} − p_{x,i−1})|

By Jensen's inequality:

  ≤ O(S log S) + ∑_{x=1,…,S} n_x · [2 · log (N/n_x) + 1]
  = O(S log S) + N · [2 · H0(X) + 1]

La[mtf] ≤ 2 · H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1^n 2^n 3^n … n^n  ⇒

There is a memory

Huff(X) = Θ(n^2 log n) > Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.
with p(a) = .2, p(b) = .5, p(c) = .3:

f(i) = ∑_{j=1,…,i−1} p(j)   ⇒   f(a) = .0, f(b) = .2, f(c) = .7

i.e.  a → [0.0, 0.2),  b → [0.2, 0.7),  c → [0.7, 1.0)

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
Start from [0,1):  b ⇒ [.2,.7);  then a ⇒ [.2,.3);  then c ⇒ [.27,.3)

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l_0 = 0      l_i = l_{i−1} + s_{i−1} · f[c_i]
s_0 = 1      s_i = s_{i−1} · p[c_i]

f[c] is the cumulative prob. up to symbol c (not included)

Final interval size is   s_n = ∏_{i=1,…,n} p[c_i]

The interval for a message sequence will be called the
sequence interval
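A toy C sketch (real-valued, not the integer version discussed later) that applies the recurrences above to the message bac with the slide's model, printing the sequence interval [.27, .3).

#include <stdio.h>

int main(void) {
    /* symbol intervals: a = [0,.2), b = [.2,.7), c = [.7,1) */
    double p[] = {0.2, 0.5, 0.3}, f[] = {0.0, 0.2, 0.7};
    const char *msg = "bac";

    double l = 0.0, s = 1.0;                 /* l_0 = 0, s_0 = 1 */
    for (int i = 0; msg[i]; i++) {
        int c = msg[i] - 'a';
        l = l + s * f[c];                    /* l_i = l_{i-1} + s_{i-1} * f[c_i] */
        s = s * p[c];                        /* s_i = s_{i-1} * p[c_i]           */
    }
    printf("sequence interval = [%.4f, %.4f)\n", l, l + s);   /* [0.2700, 0.3000) */
    return 0;
}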

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
.49 ∈ [.2,.7) ⇒ b;  within [.2,.7), .49 ∈ [.3,.55) ⇒ b;  within [.3,.55), .49 ∈ [.475,.55) ⇒ c

The message is bbc.

Representing a real number
Binary fractional representation:
.75 = .11      1/3 = .0101…      11/16 = .1011

Algorithm
1.  x = 2·x
2.  If x < 1 output 0
3.  else x = x − 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
          min      max      interval
.11       .110     .111     [.75, 1.0)
.101      .1010    .1011    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
  1 + ⌈log (1/s)⌉  =  1 + ⌈log ∏_i (1/p_i)⌉
  ≤ 2 + ∑_{i=1,…,n} log (1/p_i)
  = 2 + ∑_{k=1,…,|S|} n·p_k · log (1/p_k)
  = 2 + n·H0   bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R = 2^k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l ≥ R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
Empty context (order 0):
  A = 4    B = 2    C = 5    $ = 3

Order-1 contexts:
  A:  C = 3, $ = 1
  B:  A = 2, $ = 1
  C:  A = 1, B = 2, C = 2, $ = 3

Order-2 contexts (k = 2):
  AC:  B = 1, C = 2, $ = 2
  BA:  C = 1, $ = 1
  CA:  C = 1, $ = 1
  CB:  A = 2, $ = 1
  CC:  A = 1, B = 1, $ = 2

String = ACCBACCACBA B

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

(sorted rotations of T; F = first column, L = last column)
F            L
# mississipp i
i #mississip p
i ppi#missis s
i ssippi#mis s
i ssissippi# m
m ississippi #
p i#mississi p
p pi#mississ i
s ippi#missi s
s issippi#mi s
s sippi#miss i
s sissippi#m i
(1994)

A famous example

Much
longer...

A useful tool: L → F mapping

(same sorted-rotations matrix as above: F = first column, L = last column; the middle of the matrix, i.e. the text itself, is unknown to the decoder)

How do we map L’s chars onto F’s chars ?
... we need to distinguish equal chars in F...

Take two equal chars in L and rotate their rows rightward by one position:
they end up in F keeping the same relative order !!

The BWT is invertible
(the sorted-rotations matrix again: only L is known; F is simply L’s chars in sorted order)

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
  Compute LF[0,n-1];
  r = 0; i = n;
  while (i > 0) {
    T[i] = L[r];
    r = LF[r]; i--;
  }
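
A runnable sketch of the same inversion (illustrative Python): LF is built by counting, the for-loop below is the walk of InvertBWT, and since row 0 of the sorted matrix is the one starting with #, the walk spells T backward starting from the char preceding #.

def invert_bwt(L):
    n = len(L)
    # F = sorted(L); start[c] = first row of c's block in F
    start, total = {}, 0
    for c in sorted(set(L)):
        start[c] = total
        total += L.count(c)
    # LF[i] = row of F holding the same occurrence of the char L[i]
    seen, LF = {}, [0] * n
    for i, c in enumerate(L):
        LF[i] = start[c] + seen.get(c, 0)
        seen[c] = seen.get(c, 0) + 1
    # walk backward from row 0 (the row of the sorted matrix starting with #)
    out, r = [], 0
    for _ in range(n):
        out.append(L[r])
        r = LF[r]
    t = "".join(reversed(out))     # this is the rotation of T beginning with #
    return t[1:] + t[0]            # move the terminator back to the end

print(invert_bwt("ipssm#pissii"))  # -> "mississippi#"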

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults
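
The “elegant but inefficient” construction, spelled out (illustrative Python): sort the suffixes to get SA, then read L off the text with L[i] = T[SA[i]-1] as on the previous slide.

def bwt_by_sorting(T):
    n = len(T)
    sa = sorted(range(n), key=lambda i: T[i:])       # naive suffix sorting: Θ(n² log n) in the worst case
    L = "".join(T[i - 1] for i in sa)                # char preceding each suffix (cyclically, T[-1] precedes suffix 0)
    return L, [i + 1 for i in sa]                    # 1-based SA, as on the slides

L, SA = bwt_by_sorting("mississippi#")
print(L)     # -> "ipssm#pissii"
print(SA)    # -> [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]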

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)
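
A Move-to-Front sketch (illustrative Python) for the first stage of the pipeline above; the slide applies the same idea to the whole L, starting from the list [i,m,p,s] and sending the position of # aside.

def mtf_encode(s, alphabet):
    lst, out = list(alphabet), []
    for c in s:
        i = lst.index(c)
        out.append(i)
        lst.insert(0, lst.pop(i))    # move the just-seen symbol to the front
    return out

print(mtf_encode("ipppssssss", "imps"))
# -> [0, 2, 0, 0, 3, 0, 0, 0, 0, 0]   (the first 10 symbols of L, which contain no #)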

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size
  1 trillion pages available (Google, 7/08)
  5-40K per page => hundreds of terabytes
  Size grows every day!!

Change
  8% new pages, 25% new links change weekly
  Life time of about 10 days

The Bow Tie

Some definitions
  Weakly connected components (WCC): set of nodes such that from any node one can reach any other node via an undirected path.
  Strongly connected components (SCC): set of nodes such that from any node one can reach any other node via a directed path.

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by humankind



Exploit the structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…

  Physical network graph
    V = Routers
    E = communication links

  The “cosine” graph (undirected, weighted)
    V = static web pages
    E = semantic distance between pages

  Query-Log graph (bipartite, weighted)
    V = queries and URLs
    E = (q,u) if u is a result for q, and has been clicked by some user who issued q

  Social graph (undirected, unweighted)
    V = users
    E = (x,y) if x knows y (facebook, address book, email, ..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution
(Altavista crawl 1999, WebBase crawl 2001: the in-degree follows a power-law distribution)

Pr[ in-degree(u) = k ]  ∝  1 / k^α ,   α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/x^α, α ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages that are close in lexicographic (URL) order tend to have very similar lists of outgoing links

A Picture of the Web Graph
(figure: adjacency matrix with axes i and j)
21 million pages, 150 million links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:
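
For negative entries the gap is remapped to a non-negative integer. A small sketch of this gap encoding (illustrative Python; the remapping below, 2d for d ≥ 0 and 2|d|−1 for d < 0, is the same convention used in the worked residual numbers a few slides below, e.g. 3 = |13−15|*2−1):

def vmap(d):
    return 2 * d if d >= 0 else 2 * (-d) - 1      # non-negative code for the (possibly negative) first gap

def encode_successors(x, succ):
    succ = sorted(succ)
    gaps = [vmap(succ[0] - x)]                    # first gap is relative to the source node x
    gaps += [succ[i] - succ[i - 1] - 1 for i in range(1, len(succ))]
    return gaps

# an illustrative node x = 15 with successors 13, 15, 16, 17, 18, 19, 23, 24, 203
print(encode_successors(15, [13, 15, 16, 17, 18, 19, 23, 24, 203]))
# -> [3, 1, 0, 0, 0, 0, 3, 0, 178]

In the real WebGraph format this gap coding is applied only to the residual successors left after the copy-lists and intervals of the next slides.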

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y’s copy list tells whether the corresponding successor of the reference x is also a successor of y;
the reference index is chosen in [0,W] so as to give the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
the last block is omitted (we know the total length…);
the length is decremented by one for all blocks but the first.

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity in extra-nodes:
  Intervals: coded with their left extreme and length
  Interval length: decremented by Lmin = 2
  Residuals: differences between consecutive residuals, or wrt the source node

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
(figure: sender → data → receiver, with some knowledge about the data already at the receiver)

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



What if the sender has never seen the data at the receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization

 Delta compression   [diff, zdelta, REBL, …]
   Compress file f deploying file f’
   Compress a group of files
   Speed-up web access by sending differences between the requested page and the ones available in cache

 File synchronization   [rsync, zsync]
   Client updates old file f_old with f_new available on a server
   Mirroring, Shared Crawling, Content Distr. Net

 Set reconciliation
   Client updates structured old file f_old with f_new available on a server
   Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: we have two files f_known and f_new, and the goal is to compute a file f_d of
minimum size such that f_new can be derived from f_known and f_d


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77-scheme provides an efficient, optimal solution




f_known is the “previously encoded” text: compress the concatenation f_known·f_new, starting from f_new

zdelta is one of the best implementations
          Emacs size    Emacs time
uncompr   27Mb          ---
gzip      8Mb           35 secs
zdelta    1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

(figure: Client → request → client-side proxy — slow link with delta-encoding — server-side proxy → request → fast link → web, which returns the Page; a reference version of the page is kept by the proxies on both sides of the slow link)

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f ∈ F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

(figure: example of the weighted graph G_F — a dummy node plus the file nodes, with edge weights such as 20, 20, 123, 220, 620, 2000)

          space    time
uncompr   30Mb     ---
tgz       20%      linear
THIS      8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: estimate appropriate edge weights for G’_F, thus saving zdelta executions. Nonetheless, it still takes Θ(n²) time.
          space    time
uncompr   260Mb    ---
tgz       12%      2 mins
THIS      8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
(figure: the Client holds f_old, the Server holds f_new; the client sends a request and the server sends back an update)

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
(figure: the Client, holding f_old, sends block hashes to the Server; the Server, holding f_new, replies with the encoded file)

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
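
A toy sketch of the rsync idea (illustrative Python): the client hashes fixed-size blocks of f_old; the server slides a window over f_new and emits either a reference to a matching old block or a literal byte. The weak Adler-like hash, the missing O(1) rolling update, and the missing strong-hash (MD5) verification are all simplifications.

def weak_hash(block):
    a = sum(block) % 65521
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % 65521
    return (b << 16) | a

def rsync_encode(f_old, f_new, bs=4):
    blocks = {weak_hash(f_old[i:i + bs]): i // bs            # client side: hash every block of f_old
              for i in range(0, len(f_old) - bs + 1, bs)}
    out, j = [], 0
    while j < len(f_new):                                    # server side: scan f_new
        window = f_new[j:j + bs]
        h = weak_hash(window) if len(window) == bs else None
        if h in blocks:
            out.append(("copy", blocks[h]))                  # a block of f_old reappears in f_new
            j += bs
        else:
            out.append(("lit", f_new[j]))                    # unmatched byte sent as a literal
            j += 1
    return out

print(rsync_encode(b"the quick brown fox", b"the quick red fox"))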

Rsync: some experiments

         gcc size   emacs size
total    27288      27326
gzip      7563       8577
zdelta     227       1431
rsync      964       4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends the hashes (in rsync it is the client that does), and the client checks them.
The server deploys the common f_ref to compress the new f_tar (rsync compresses just f_tar).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets S_A and S_B of integer values, located on two machines A and B,
determine the difference between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

(figure: P aligned at position i of T, i.e. P matches a prefix of T[i,N])

Occurrences of P in T = all suffixes of T having P as a prefix

Example: P = si, T = mississippi → occurrences at positions 4 and 7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
(figure: the suffix tree of T# = mississippi#, with edge labels such as #, i, i#, mississippi#, p, pi#, ppi#, s, si, ssi, and its 12 leaves labeled with the starting positions 1..12 of the corresponding suffixes)

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Θ(N²) space if the suffixes are stored explicitly:

SA    SUF(T)
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#

T = mississippi#   (each SA entry is a suffix pointer into T; e.g. P = si selects the contiguous block 7, 4)

Suffix Array
• SA: Θ(N log₂ N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison

SA = 12 11 8 5 2 1 10 9 7 4 6 3
T = mississippi#, P = si

P is larger (than the suffix probed at this step); 2 accesses per step

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison

SA = 12 11 8 5 2 1 10 9 7 4 6 3
T = mississippi#, P = si

P is smaller (than the suffix probed at this step)

Suffix Array search
• O(log₂ N) binary-search steps
• Each step takes O(p) char comparisons
 overall, O(p log₂ N) time
improvable to O(p + log₂ N) [Manber-Myers, ’90] and to O(p + log |Σ|) [Cole et al., ’06]

Locating the occurrences

T = mississippi#, P = si: binary search for the SA interval of the suffixes prefixed by si (as if searching the two patterns si# and si$). It contains sippi# and sissippi#, hence occ = 2 and the occurrences are at positions 7 and 4.

Suffix Array search
• O(p + log₂ N + occ) time

(with the convention # < Σ < $)

Suffix Trays: O(p + log₂ |Σ| + occ)   [Cole et al., ’06]
String B-tree   [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays   [Ciriani et al., ’02]
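
A sketch of the binary search just described (illustrative Python; it materializes the |P|-long prefixes of the suffixes only to keep the code short, a real implementation compares P against T on the fly):

from bisect import bisect_left, bisect_right

def suffix_array(T):
    return sorted(range(len(T)), key=lambda i: T[i:])        # 0-based starting positions

def occurrences(T, SA, P):
    keys = [T[i:i + len(P)] for i in SA]                     # prefix of each suffix, in SA order
    lo, hi = bisect_left(keys, P), bisect_right(keys, P)     # the contiguous block of Prop 1
    return sorted(SA[lo:hi])

T = "mississippi#"
SA = suffix_array(T)
print([i + 1 for i in SA])                                   # -> [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print([i + 1 for i in occurrences(T, SA, "si")])             # -> [4, 7]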

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

(figure: the Lcp array aligned with SA for T = mississippi#; e.g. the adjacent suffixes issippi# and ississippi# share the prefix issi, so their Lcp entry is 4)
• How long is the common prefix between T[i,...] and T[j,...] ?
  • It is the min of the subarray Lcp[h,k-1], where SA[h]=i and SA[k]=j (h < k).
• Does there exist a repeated substring of length ≥ L ?
  • Search for an entry Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  • Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.
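
A small sketch of the Lcp computation and of the “repeated substring of length ≥ L” test above (illustrative Python; real implementations build Lcp in O(N) time, e.g. with Kasai’s algorithm):

def lcp_array(T, SA):
    def lcp(i, j):
        k = 0
        while i + k < len(T) and j + k < len(T) and T[i + k] == T[j + k]:
            k += 1
        return k
    return [lcp(SA[r], SA[r + 1]) for r in range(len(SA) - 1)]   # Lcp of adjacent suffixes in SA order

T = "mississippi#"
SA = sorted(range(len(T)), key=lambda i: T[i:])
Lcp = lcp_array(T, SA)
L = 3
print(any(v >= L for v in Lcp))    # repeated substring of length >= 3 ?  -> True (e.g. "issi")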


Slide 64

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with i-th bit of U(T[j]) to estabilish if both are true

An example j=1
1 2 3 4 5 6 7 8 9 10

n
m

1

12345

1

0

P=abaac

2

0

3

0

4

0

5

0

T=xabxabaaca

0
 
0
U (x)  0
 
0
 
0

2

3

4

5

6

7

1 
 
0
BitShift ( M ( 0 )) & U (T [1])   0  &
 
0
 
0

8

9

0 0
   
0 0
0  0
   
0 0
   
0 0

1
0

An example j=2
1 2 3 4 5 6 7 8 9 10

n
m

1

2

12345

1

0

1

P=abaac

2

0

0

3

0

0

4

0

0

5

0

0

T=xabxabaaca

1 
 
0
U (a )   1 
 
1 
 
0

3

4

5

6

7

1 
 
0
BitShift ( M (1)) & U (T [ 2 ])   0  &
 
0
 
0

8

9

1  1 
   
0 0
1    0 
   
1   0 
   
0 0

1
0

An example j=3
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

12345

1

0

1

0

P=abaac

2

0

0

1

3

0

0

0

4

0

0

0

5

0

0

0

T=xabxabaaca

0
 
1 
U (b )   0 
 
0
 
0

4

5

6

7

1 
 
1 
BitShift ( M ( 2 )) & U (T [ 3 ])   0  &
 
0
 
0

8

9

0 0
   
1  1 
0  0
   
0 0
   
0 0

1
0

An example j=9
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

4

5

6

7

8

9

12345

1

0

1

0

0

1

0

1

1

0

P=abaac

2

0

0

1

0

0

1

0

0

0

3

0

0

0

0

0

0

1

0

0

4

0

0

0

0

0

0

0

1

0

5

0

0

0

0

0

0

0

0

1

T=xabxabaaca

0
 
0
U (c )   0 
 
0
 
1 

1 
 
1 
BitShift ( M ( 8 )) & U (T [ 9 ])   0  &
 
0
 
1 

0 0
   
0 0
0  0
   
0 0
   
1  1 

1
0

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M ( j - 1))  U (T [ j ])
l

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M

l -1

( j - 1))

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1
1 2 3 4 5 6 7 8 910

M1=

T=xabxabaaca
P=
abaad

1

2

3

4

5

6

7

8

9

1
0

1

1

1

1

1

1

1

1

1

1

1

2

0

0

1

0

0

1

0

1

1

0

3

0

0

0

1

0

0

1

0

0

1

4

0

0

0

0

1

0

0

1

0

0

5

0

0

0

0

0

0

0

0

1

0

M0=

1

2

3

4

5

6

7

8

9

1
0

1

0

1

0

0

1

0

1

1

0

1

2

0

0

1

0

0

1

0

0

0

0

3

0

0

0

0

0

0

1

0

0

0

4

0

0

0

0

0

0

0

1

0

0

5

0

0

0

0

0

0

0

0

0

0

How much do we pay?





The running time is O(kn(1+m/w)
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Delection: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
0 000 ...........0 x in b in ary
Length-1


x > 0 and Length = log2 x +1
e.g., 9 represented as <000,1001>.



g-code for x takes 2 log2 x +1 bits

(ie. factor of 2 from optimal)



Optimal for Pr(x) = 1/2x2, and i.i.d integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8

6

3

59

7

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How much good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
1 ≥Si=1,...,x pi ≥ x * px  x ≤ 1/px

How good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):



p i  | g ( i ) |

i  1 ,..., S

This is:



i  1 ,.., S

p i  [ 2 * log

1

 1]

pi

 2 * H 0 ( X ) 1
No much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1n 2n 3n… nn  Huff = O(n2 log n), MTF = O(n log n) + n2

No much worse than Huffman
...but it may be far better

MTF: how good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
Put S in the front and consider the cost of encoding:
S

O ( S log S ) 



nx

g(p - p )

x 1 i  2

x

x

i

i -1

S

By Jensen’s:

 O ( S log S ) 

n
x 1

x

[ 2 * log

N

 1]

nx

 O ( S log S )  N * [ 2 * H 0 ( X )  1]
L a [ mtf ]  2 * H 0 ( X )  O (1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1n 2n 3n… nn 

There is a memory

Huff(X) = Q(n2 log n) > Rle(X) = Q(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.
1.0

c = .3

0.7

i -1

f (i ) 



p( j)

j 1

b = .5
0.2
0.0

f(a) = .0, f(b) = .2, f(c) = .7
a = .2

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
1.0

0.7
c = .3

0.7

c = .3
0.55

0.0

a = .2

0.3
0.2

c = .3

0.27
b = .5

b = .5
0.2

0.3

a = .2

b = .5
0.22

a = .2

0.2

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l 0  0 l i  l i -1  s i -1 * f c i 

s0  1

s i  s i -1 * p c i 

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

n

sn 

 p c 
i

i 1

The interval for a message sequence will be called the
sequence interval

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
1.0

0.7
c = .3

0.7
0.49

b = .5

c = .3

0.55
0.49

0.2

0.0

0.55

b = .5

0.3
a = .2

The message is bbc.

0.2

0.49
0.475

c = .3

b = .5
0.35

a = .2

0.3

a = .2

Representing a real number
Binary fractional representation:
. 75

 . 11

1/ 3

 . 01 01

11 / 16

 . 1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
m in

m ax

interval

.11

.110

.111

[.75 ,1.0 )

.101

.1010

.1011

[.625 , .75 )

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
Context
Empty

Counts
A
B
C
$

=
=
=
=

4
2
5
3

Context
A
B
C

Counts
C
$
A
$
A
B
C
$

=
=
=
=
=
=
=
=

3
1
2
1
1
2
2
3

Context
AC

BA
CA
CB
CC

String = ACCBACCACBA B

k=2

Counts
B
C
$
C
$
C
$
A
$
A
B
$

=
=
=
=
=
=
=
=
=
=
=
=

1
2
2
1
1
1
1
2
1
1
1
2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
  Compute LF[0,n-1];
  T[n] = #;  r = 0;  i = n-1;
  while (i > 0) {
    T[i] = L[r];
    r = LF[r];  i--;
  }
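The same inversion as a runnable Python sketch (it assumes, as in the example, that T ends with the unique sentinel '#'):

def invert_bwt(L, sentinel="#"):
    first = {}                                  # first row of each char in F = sorted(L)
    for row, c in enumerate(sorted(L)):
        first.setdefault(c, row)
    seen, LF = {}, []
    for c in L:                                 # k-th occurrence of c in L -> k-th in F
        LF.append(first[c] + seen.get(c, 0))
        seen[c] = seen.get(c, 0) + 1
    out, r = [sentinel], 0                      # row 0 is the rotation starting with '#'
    for _ in range(len(L) - 1):                 # rebuild T backwards via LF
        out.append(L[r])
        r = LF[r]
    return "".join(reversed(out))

print(invert_bwt("ipssm#pissii"))               # -> mississippi#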

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]
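A tiny Python sketch of exactly this identity; the suffix array is built by naive sorting, which is the "elegant but inefficient" construction discussed next.

def bwt_from_sa(T):
    SA = sorted(range(len(T)), key=lambda i: T[i:])    # naive suffix sorting
    L = "".join(T[i - 1] for i in SA)                  # L[i] = T[SA[i]-1]; i=0 wraps to the last char
    return SA, L

SA, L = bwt_from_sa("mississippi#")
print([i + 1 for i in SA])    # 1-based SA: [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(L)                      # ipssm#pissii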

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node one can go to any other node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node one can go to any other node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by humans



Exploit the structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pr that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[in-degree(u) = k]  ∝  1/k^α,    α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pr that a node has x links is 1/x^α, α ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 million pages, 150 million links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s₁ − x, s₂ − s₁ − 1, ..., s_k − s_{k−1} − 1}

For negative entries (only the first gap s₁ − x can be negative): map v ≥ 0 to 2v and v < 0 to 2|v| − 1, as in the examples further below.

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides an efficient, optimal solution




fknown is the “previously encoded text”: compress the concatenation fknown·fnew, starting from fnew

zdelta is one of the best implementations
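zdelta itself is a modified zlib; as a rough stand-in for the same idea with a stock library, Python's zlib can seed the compressor's window with (the last 32KB of) fknown as a preset dictionary. This is only a sketch of the principle, not zdelta's actual format.

import zlib

def delta_compress(f_known: bytes, f_new: bytes) -> bytes:
    co = zlib.compressobj(level=9, zdict=f_known[-32768:])   # window seeded with f_known
    return co.compress(f_new) + co.flush()

def delta_decompress(f_known: bytes, f_delta: bytes) -> bytes:
    do = zlib.decompressobj(zdict=f_known[-32768:])
    return do.decompress(f_delta) + do.flush()

old = b"\n".join(b"paragraph %d: ordinary text about topic %d" % (i, i) for i in range(500))
new = old.replace(b"topic 250", b"a freshly edited topic")
delta = delta_compress(old, new)
assert delta_decompress(old, delta) == new
print(len(new), len(zlib.compress(new)), len(delta))   # the delta is typically far smaller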
           Emacs size   Emacs time
uncompr       27Mb         ---
gzip           8Mb        35 secs
zdelta       1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: an example of the weighted graph G_F — the files plus the dummy node 0, with zdelta/gzip sizes (e.g. 20, 123, 220, 620, 2000) as edge weights; the min branching keeps the cheapest incoming edge per file.]

           space     time
uncompr    30Mb      ---
tgz        20%       linear
THIS        8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n² edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta executions. Nonetheless, this still takes n² time
           space     time
uncompr    260Mb     ---
tgz        12%       2 mins
THIS        8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
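A toy Python sketch of the block-matching step. The weak hash here is just a masked byte sum and is recomputed per window for clarity; real rsync maintains a 4-byte rolling checksum updated in O(1) per shift and confirms candidates with MD4/MD5.

import hashlib

def rsync_delta(f_old: bytes, f_new: bytes, B: int = 700):
    def weak(block):                      # toy weak hash (stand-in for the rolling checksum)
        return sum(block) & 0xFFFF
    table = {}                            # receiver side: hashes of f_old's blocks
    for idx in range(len(f_old) // B):
        blk = f_old[idx * B:(idx + 1) * B]
        table.setdefault(weak(blk), []).append((hashlib.md5(blk).digest(), idx))
    ops, i = [], 0                        # sender side: slide a window over f_new
    while i + B <= len(f_new):
        win = f_new[i:i + B]
        hit = next((idx for h, idx in table.get(weak(win), [])
                    if h == hashlib.md5(win).digest()), None)
        if hit is not None:
            ops.append(("copy", hit)); i += B       # reuse block `hit` of f_old
        else:
            ops.append(("lit", f_new[i])); i += 1   # emit one literal byte and shift
    ops += [("lit", b) for b in f_new[i:]]
    return ops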

Rsync: some experiments

          gcc size   emacs size
total       27288       27326
gzip         7563        8577
zdelta        227        1431
rsync         964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance between the two files is k, then on each level ≤ k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance between the two files is k, then on each level ≤ k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Figure: the suffix tree of T# = mississippi# — edges are labelled with substrings (#, i, s, p, si, ssi, ppi#, pi#, ...), and each of the 12 leaves stores the starting position (1..12) of its suffix.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Θ(N²) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Θ(N log₂ N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
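A compact Python sketch of the indirect binary search (naive SA construction; each comparison looks at at most p characters of a suffix):

def sa_range(T, SA, P):
    n, p = len(SA), len(P)
    lo, hi = 0, n
    while lo < hi:                                  # leftmost suffix whose p-prefix >= P
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + p] < P: lo = mid + 1
        else: hi = mid
    left, hi = lo, n
    while lo < hi:                                  # leftmost suffix whose p-prefix > P
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + p] <= P: lo = mid + 1
        else: hi = mid
    return left, lo                                 # occurrences are SA[left:lo]

T = "mississippi#"
SA = sorted(range(len(T)), key=lambda i: T[i:])
l, r = sa_range(T, SA, "si")
print(sorted(x + 1 for x in SA[l:r]))               # -> [4, 7], as in the picture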

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
(e.g. Lcp = 4 for the adjacent suffixes issippi# and ississippi#)

• How long is the common prefix between T[i,...] and T[j,...] ?
• It is the min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
• Search for an entry Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times ?
• Search for a run Lcp[i,i+C-2] whose entries are all ≥ L.


Slide 65

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C · p · f/(1+f)
This is at least 10⁴ · f/(1+f)
If we fetch B ≈ 4Kb in time C, and the algorithm uses all of it:
(1/B) · (p · f/(1+f) · C)  ≈  30 · f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
n:     4K    8K    16K    32K    128K   256K   512K   1M
n³:    22s   3m    26m    3.5h   28h    --     --     --
n²:    0     0     0      1s     26s    106s   7m     28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT
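The same scan as runnable Python (it assumes, like the slide, that every subsum is nonzero and that the array contains at least one positive element):

def max_subarray(A):
    best, best_range, cur, start = None, None, 0, 0
    for i, x in enumerate(A):
        if cur + x <= 0:
            cur, start = 0, i + 1                   # sum would drop <= 0: restart after i
        else:
            cur += x                                # sum stays > 0 within a candidate window
            if best is None or cur > best:
                best, best_range = cur, (start, i)
    return best, best_range

A = [2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]
print(max_subarray(A))                              # -> (12, (2, 6)) i.e. the window 6 1 -2 4 3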

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 10⁹ random I/Os = 10⁹ × 5ms ≈ 2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n = 10⁹ tuples ⇒ a few GBs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Θ(n log₂ n) random I/Os



[5ms] × n log₂ n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
[Figure: the binary merge-sort recursion tree on a sample array — log₂N levels, each merging pairs of longer and longer sorted runs.]

How do we deploy the disk/memory features ?
N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X ≤ M/B runs  ⇒  log_{M/B}(N/M) passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
[Figure: X = M/B input buffers Bf1..BfX (one per run, with pointers p1..pX) and one output buffer Bfo. Repeatedly take min(Bf1[p1], Bf2[p2], …, BfX[pX]) and append it to Bfo; flush Bfo to the merged output run when it is full, and fetch the next page of run i when pi reaches B (or EOF).]

Cost of Multi-way Merge-Sort


Number of passes = log_{M/B}(#runs)  ≈  log_{M/B}(N/M)

Optimal cost = Θ((N/B) · log_{M/B}(N/M)) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables X (candidate) and C (counter); initialize X with the first item and C = 1.
For each subsequent item s of the stream:
   if (X == s) then C++
   else { C--;  if (C == 0) { X = s;  C = 1; } }
Return X;
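The same one-pass scan in Python (often called a Boyer–Moore-style majority vote); X is guaranteed to be the answer only when some item really occurs more than N/2 times.

def majority_candidate(stream):
    it = iter(stream)
    X, C = next(it), 1                   # candidate and counter, seeded with the first item
    for s in it:
        if X == s:
            C += 1
        else:
            C -= 1
            if C == 0:
                X, C = s, 1              # counter exhausted: s becomes the new candidate
    return X

print(majority_candidate("bacccdcbaaaccbccc"))      # -> 'c' (9 occurrences out of 17)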

Proof
If the algorithm ends with X ≠ y, then every one of y's occurrences has been cancelled by a distinct “negative” mate, so the mates are ≥ #occ(y). Together with y's occurrences this would require 2 · #occ(y) > N items — a contradiction.
(There can be problems, i.e. no guarantee on X, if the most frequent item occurs ≤ N/2 times.)

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 × 10⁹  ⇒  size = 6GB
n = 10⁶ documents
TotT = 10⁹  (avg term length is 6 chars)
t = 5 × 10⁵ distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million documents, t = 500K terms

            Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony             1               1              0          0       0        1
Brutus             1               1              0          1       0        0
Caesar             1               1              0          1       1        1
Calpurnia          0               1              0          0       0        0
Cleopatra          1               0              0          0       0        0
mercy              1               0              1          1       1        1
worser             1               0              1          1       1        0

(1 if the play contains the word, 0 otherwise)

Space is 500Gb !

Solution 2: Inverted index
Brutus    →  2  4  8  16  32  64  128
Calpurnia →  1  2  3  5  8  13  21  34
Caesar    →  13  16

1. Typically use about 12 bytes per posting
2. We have 10⁹ total terms  ⇒  at least 12GB space
3. Compressing the 6GB of documents gets 1.5GB of data
Better index, but yet it is >10 times the text !!!!

We can still do better: i.e. 30÷50% of the original text

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO: they are 2ⁿ, but the shorter messages are only  ∑_{i=1}^{n−1} 2^i = 2ⁿ − 2 …

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the self-information of s is:

i(s) = log₂(1/p(s)) = −log₂ p(s)

Lower probability  ⇒  higher information

Entropy is the weighted average of i(s):

H(S) = ∑_{s∈S} p(s) · log₂(1/p(s))   bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
L_a(C) = ∑_{s∈S} p(s) · L[s]

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H(S) ≤ L_a(C)
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

L_a(C) ≤ H(S) + 1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2^(n−1) “equivalent” Huffman trees

What about ties (and thus, tree depth) ?
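A short Python sketch of the construction (repeatedly merge the two least-probable subtrees with a heap; ties are broken arbitrarily, which is exactly why many "equivalent" trees exist):

import heapq

def huffman_codes(probs):
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {s: "0" for s in heap[0][2]}          # degenerate one-symbol alphabet
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)              # the two least probable subtrees
        p1, i1, c1 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c0.items()}
        merged.update({s: "1" + w for s, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, i1, merged))  # merged node takes a popped tie-break index
    return heap[0][2]

codes = huffman_codes({"a": .1, "b": .2, "c": .2, "d": .5})
print({s: len(w) for s, w in codes.items()})         # lengths 3, 3, 2, 1 for a, b, c, d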

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
−log₂(0.999) ≈ 0.00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H(s) = ∑_{i=1}^{m} 2^{m−i} · s[i]

P = 0101
H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5
s = s′ if and only if H(s) = H(s′)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H(T_r) = 2 · H(T_{r−1}) − 2^m · T[r−1] + T[r+m−1]

T = 10110101
T1 = 1011,  T2 = 0110
H(T1) = H(1011) = 11
H(T2) = H(0110) = 2·11 − 2⁴·1 + 0 = 22 − 16 + 0 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1·2 (mod 7) + 0 = 2
2·2 (mod 7) + 1 = 5
5·2 (mod 7) + 1 = 4
4·2 (mod 7) + 1 = 2
2·2 (mod 7) + 1 = 5
5 (mod 7) = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1):
2^m (mod q) = 2 · (2^{m−1} (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
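A runnable Python sketch of the whole scheme over a generic text alphabet. Here q is fixed to the Mersenne prime 2³¹−1 instead of being drawn at random, and every fingerprint hit is verified, so the output is always correct.

def karp_rabin(T, P, q=2_147_483_647, d=256):
    n, m = len(T), len(P)
    if m > n: return []
    h = pow(d, m - 1, q)                     # d^(m-1) mod q, used to drop the leading char
    hp = ht = 0
    for i in range(m):                       # Hq(P) and Hq(T_1)
        hp = (hp * d + ord(P[i])) % q
        ht = (ht * d + ord(T[i])) % q
    occ = []
    for r in range(n - m + 1):
        if hp == ht and T[r:r + m] == P:     # verify, to rule out false matches
            occ.append(r)
        if r < n - m:                        # roll the window: drop T[r], add T[r+m]
            ht = ((ht - ord(T[r]) * h) * d + ord(T[r + m])) % q
    return occ

print(karp_rabin("10110101", "0101"))        # -> [4]  (the match at T_5 of the example)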

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

         c a l i f o r n i a
     j:  1 2 3 4 5 6 7 8 9 10
  f  1:  0 0 0 0 1 0 0 0 0 0
  o  2:  0 0 0 0 0 1 0 0 0 0
  r  3:  0 0 0 0 0 0 1 0 0 0

(M is an m × n matrix: one row per prefix of P, one column per position of T.)

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the (j−1)-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
U(a) = (1,0,1,1,0)ᵗ    U(b) = (0,1,0,0,0)ᵗ    U(c) = (0,0,0,0,1)ᵗ

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with i-th bit of U(T[j]) to estabilish if both are true

An example (P = abaac, T = xabxabaaca):

j=1:  M(1) = BitShift(M(0)) & U(T[1]) = (1,0,0,0,0)ᵗ & U(x) = (0,0,0,0,0)ᵗ
j=2:  M(2) = BitShift(M(1)) & U(a) = (1,0,0,0,0)ᵗ & (1,0,1,1,0)ᵗ = (1,0,0,0,0)ᵗ
j=3:  M(3) = BitShift(M(2)) & U(b) = (1,1,0,0,0)ᵗ & (0,1,0,0,0)ᵗ = (0,1,0,0,0)ᵗ
…
j=9:  M(9) = BitShift(M(8)) & U(c) = (1,1,0,0,1)ᵗ & (0,0,0,0,1)ᵗ = (0,0,0,0,1)ᵗ

The matrix M after column j=9:

           j: 1 2 3 4 5 6 7 8 9
   a  i=1:    0 1 0 0 1 0 1 1 0
   b  i=2:    0 0 1 0 0 1 0 0 0
   a  i=3:    0 0 0 0 0 0 1 0 0
   a  i=4:    0 0 0 0 0 0 0 1 0
   c  i=5:    0 0 0 0 0 0 0 0 1

M(5,9) = 1: an occurrence of P ends at position 9 of T.

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.
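A Python sketch of the whole method, with an arbitrary-size integer playing the role of the machine word that stores a column of M (bit i−1 of the integer is row i of the slide's column):

def shift_and(T, P):
    U = {}                                   # U[c]: bitmask of the positions of c in P
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M, goal, occ = 0, 1 << (len(P) - 1), []
    for j, c in enumerate(T):
        M = ((M << 1) | 1) & U.get(c, 0)     # M(j) = BitShift(M(j-1)) & U(T[j])
        if M & goal:                         # last row set: an occurrence ends at j
            occ.append(j - len(P) + 1)
    return occ

print(shift_and("xabxabaaca", "abaac"))      # -> [4]  (0-based; the match ending at column 9)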

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac

U(a) = (1,0,1,1,0)ᵗ    U(b) = (1,1,0,0,0)ᵗ    U(c) = (0,0,0,0,1)ᵗ

What about ‘?’, ‘[^…]’ (not) ?

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M ( j - 1))  U (T [ j ])
l

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M

l -1

( j - 1))

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1  (P = abaad, T = xabxabaaca, k = 1):

            j: 1 2 3 4 5 6 7 8 9 10
M0 =  i=1:     0 1 0 0 1 0 1 1 0 1
      i=2:     0 0 1 0 0 1 0 0 0 0
      i=3:     0 0 0 0 0 0 1 0 0 0
      i=4:     0 0 0 0 0 0 0 1 0 0
      i=5:     0 0 0 0 0 0 0 0 0 0

M1 =  i=1:     1 1 1 1 1 1 1 1 1 1
      i=2:     0 0 1 0 0 1 0 1 1 0
      i=3:     0 0 0 1 0 0 1 0 0 1
      i=4:     0 0 0 0 1 0 0 1 0 0
      i=5:     0 0 0 0 0 0 0 0 1 0

M1(5,9) = 1: P occurs ending at position 9 of T with at most 1 mismatch.

How much do we pay?





The running time is O(k·n·(1 + m/w)).
Again, the method is practically efficient for small m.
Still, only O(k) columns of M are needed at any given time. Hence, the space used by the algorithm is O(k) memory words.
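A Python sketch of the k-mismatch recurrence above, keeping one integer per level l = 0..k (big Python ints again stand in for the m/w machine words):

def shift_and_k_mismatches(T, P, k):
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    goal, M = 1 << (len(P) - 1), [0] * (k + 1)
    ends = []
    for j, c in enumerate(T):
        prev = M[:]                                          # the columns at position j-1
        M[0] = ((prev[0] << 1) | 1) & U.get(c, 0)            # exact-match level
        for l in range(1, k + 1):
            # M^l(j) = [BitShift(M^l(j-1)) & U(T[j])] | BitShift(M^(l-1)(j-1))
            M[l] = (((prev[l] << 1) | 1) & U.get(c, 0)) | ((prev[l - 1] << 1) | 1)
        if M[k] & goal:
            ends.append(j)                                   # a window with <= k mismatches ends at j
    return ends

print(shift_and_k_mismatches("aatatccacaa", "atcgaa", 2))    # -> [8]  (the 2-mismatch hit of the example)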

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Delection: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
γ(x) = 000…0 followed by x in binary        (Length − 1 zeros)

x > 0 and Length = ⌊log₂ x⌋ + 1
e.g., 9 is represented as <000, 1001>.

γ-code for x takes 2⌊log₂ x⌋ + 1 bits        (i.e. a factor of 2 from optimal)

Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

Answer: 8, 6, 3, 59, 7
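A tiny Python sketch of γ-encoding and decoding (bits are kept in a string for readability):

def gamma_encode(x):                     # x >= 1
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b        # Length-1 zeros, then x in binary

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":            # unary part: Length - 1
            z += 1; i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out

print(gamma_encode(9))                                       # -> 0001001   (<000,1001>)
print(gamma_decode("0001000001100110000011101100111"))       # -> [8, 6, 3, 59, 7]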

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2 · H₀(S) + 1
Key fact:
1 ≥ ∑_{i=1,...,x} p_i ≥ x · p_x   ⇒   x ≤ 1/p_x

How good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):



∑_{i=1,...,|S|} p_i · |γ(i)|

This is at most:

∑_{i=1,...,|S|} p_i · [2 · log₂(1/p_i) + 1]  =  2 · H₀(X) + 1

Not much worse than Huffman, and improvable to H₀(X) + 2 + ...

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1ⁿ 2ⁿ 3ⁿ … nⁿ  ⇒  Huff = O(n² log n), MTF = O(n log n) + n²

No much worse than Huffman
...but it may be far better
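A direct Python sketch of the MTF transform (the list is kept explicitly, so each symbol costs O(|S|) here; the search-tree/hash-table organisation discussed below brings this down to O(log |S|) per symbol):

def mtf_encode(seq, alphabet):
    L = list(alphabet)
    out = []
    for s in seq:
        i = L.index(s)                   # position of s in the current list
        out.append(i)
        L.insert(0, L.pop(i))            # move s to the front
    return out

print(mtf_encode("ipppssssssmmmii", "imps"))
# -> [0, 2, 0, 0, 3, 0, 0, 0, 0, 0, 3, 0, 0, 3, 0]   (cf. the Bzip example earlier)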

MTF: how good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
Put S in the front and consider the cost of encoding:
O(|S| log |S|)  +  ∑_{x=1..|S|} ∑_{i=2..n_x} |γ(p_i^x − p_{i−1}^x)|

By Jensen's:

≤  O(|S| log |S|)  +  ∑_{x=1..|S|} n_x · [2 · log₂(N/n_x) + 1]
=  O(|S| log |S|)  +  N · [2 · H₀(X) + 1]

⇒   L_a[mtf]  ≤  2 · H₀(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1ⁿ 2ⁿ 3ⁿ … nⁿ   ⇒   Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)

There is a memory

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g. with p(a) = .2, p(b) = .5, p(c) = .3 the ranges are:  a → [0.0, 0.2),  b → [0.2, 0.7),  c → [0.7, 1.0)

f(i) = ∑_{j=1}^{i−1} p(j),   so   f(a) = .0,  f(b) = .2,  f(c) = .7

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
[Figure: coding b, then a, then c progressively narrows the interval — [0,1) → [.2,.7) for b → [.2,.3) for a → [.27,.3) for c.]

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l₀ = 0,    lᵢ = l_{i−1} + s_{i−1} · f(cᵢ)
s₀ = 1,    sᵢ = s_{i−1} · p(cᵢ)

f[c] is the cumulative prob. up to symbol c (not included)

Final interval size is   sₙ = ∏_{i=1..n} p(cᵢ)

The interval for a message sequence will be called the
sequence interval

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval
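A small Python sketch that computes the sequence interval with the recurrence shown above (plain floating point rather than the integer implementation discussed later, so tiny rounding differences are possible):

def sequence_interval(msg, p):
    f, acc = {}, 0.0
    for c, pc in p.items():              # f[c] = cumulative probability below c
        f[c] = acc
        acc += pc
    l, s = 0.0, 1.0
    for c in msg:
        l, s = l + s * f[c], s * p[c]    # l_i = l_{i-1} + s_{i-1}*f(c_i);  s_i = s_{i-1}*p(c_i)
    return l, l + s

print(sequence_interval("bac", {"a": 0.2, "b": 0.5, "c": 0.3}))   # ≈ (0.27, 0.3), as in the example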

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
[Figure: 0.49 falls in b's interval [.2,.7); within it, again in b's sub-interval; within that, in c's sub-interval.]

The message is bbc.

Representing a real number
Binary fractional representation:
. 75

 . 11

1/ 3

 . 01 01

11 / 16

 . 1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
            min       max       interval
.11         .110      .111      [.75, 1.0)
.101        .1010     .1011     [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
String = ACCBACCACBA B      (k = 2)

Context    Counts               Context    Counts
Empty      A=4  B=2  C=5  $=3   AC         B=1  C=2  $=2
A          C=3  $=1             BA         C=1  $=1
B          A=2  $=1             CA         C=1  $=1
C          A=1  B=2  C=2  $=3   CB         A=2  $=1
                                CC         A=1  B=1  $=2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}

How to compute the BWT ?

SA    BWT matrix row (sorted suffix)    L
12    #mississipp...                    i
11    i#mississip...                    p
 8    ippi#missis...                    s
 5    issippi#mis...                    s
 2    ississippi#...                    m
 1    mississippi...                    #
10    pi#mississi...                    p
 9    ppi#mississ...                    i
 7    sippi#missi...                    s
 4    sissippi#mi...                    s
 6    ssippi#miss...                    i
 3    ssissippi#m...                    i

We said that L[i] precedes F[i] in T  (e.g. L[3] = T[7]).
Given SA and T, we have L[i] = T[SA[i] - 1].

How to construct SA from T ?

Input: T = mississippi#

SA    suffix
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#

Elegant but inefficient: sort the suffixes with a comparison-based sorter.
Obvious inefficiencies:
 • Θ(n² log n) time in the worst case
 • Θ(n log n) cache misses or I/O faults
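A direct Python sketch of this elegant-but-inefficient construction, together with the rule L[i] = T[SA[i]-1] of the previous slide (positions are 1-based as in the figures):

def suffix_array(T):
    # sort the suffix starting positions by comparing the suffixes themselves
    return sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])

def bwt_from_sa(T, SA):
    # L[i] = char preceding the i-th smallest suffix (wrapping around for suffix 1)
    return "".join(T[i - 2] if i > 1 else T[-1] for i in SA)

T = "mississippi#"
SA = suffix_array(T)
print(SA)                  # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(bwt_from_sa(T, SA))  # ipssm#pissii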

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions

 Weakly connected component (WCC): a set of nodes such that from any node one
 can reach any other node via an undirected path.

 Strongly connected component (SCC): a set of nodes such that from any node one
 can reach any other node via a directed path.
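A minimal illustration of the two notions on a made-up directed graph, assuming the networkx library is available:

import networkx as nx

G = nx.DiGraph([("a", "b"), ("b", "a"), ("b", "c"), ("d", "c")])   # (u,v): u links to v

# WCC: reachability along undirected paths
print(list(nx.weakly_connected_components(G)))    # [{'a', 'b', 'c', 'd'}]
# SCC: reachability along directed paths, in both directions
print(list(nx.strongly_connected_components(G)))  # e.g. [{'c'}, {'a', 'b'}, {'d'}]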

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…

 Physical network graph
   V = routers,  E = communication links

 The "cosine" graph (undirected, weighted)
   V = static web pages,  E = semantic distance between pages

 Query-Log graph (bipartite, weighted)
   V = queries and URLs,  E = (q,u) if u is a result for q and has been clicked
   by some user who issued q

 Social graph (undirected, unweighted)
   V = users,  E = (x,y) if x knows y (facebook, address book, email, ...)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution

Altavista crawl (1999) and WebBase crawl (2001): the in-degree follows a
power-law distribution

    Pr[ in-degree(u) = k ]  ∝  1 / k^α ,    with α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed adjacency list
  vs.
Adjacency list with compressed gaps   (exploits locality)

Successor list S(x) = {s1, s2, ..., sk} is stored as the gap sequence
  {s1 - x,  s2 - s1 - 1,  ...,  sk - s(k-1) - 1}
For negative entries (only the first gap s1 - x can be negative):
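A tiny Python sketch of this gap transformation and of its inverse (the node id and the successor list below are made-up values):

def to_gaps(x, succ):
    # succ must be sorted; first gap is relative to x, the others to the previous
    # successor (minus 1, since successors are strictly increasing)
    return [succ[0] - x] + [succ[i] - succ[i - 1] - 1 for i in range(1, len(succ))]

def from_gaps(x, gaps):
    succ = [x + gaps[0]]
    for g in gaps[1:]:
        succ.append(succ[-1] + g + 1)
    return succ

x, succ = 15, [13, 15, 16, 17, 18, 19, 23, 24, 203]
gaps = to_gaps(x, succ)
print(gaps)                          # [-2, 1, 0, 0, 0, 0, 3, 0, 178]
assert from_gaps(x, gaps) == succ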

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y's copy-list tells whether the corresponding successor of the
reference list of x is also a successor of y;
the reference x is chosen in a window [0,W] before y so as to give the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsync, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression   (one-to-one)

Problem: We have two files f_known and f_new, and the goal is to compute a file
f_d of minimum size such that f_new can be derived from f_known and f_d.

 Assume that block moves and copies are allowed
 Find an optimal covering set of f_new based on f_known
 The LZ77-scheme provides an efficient, optimal solution:
   f_known plays the role of the "previously encoded text"; compress the
   concatenation f_known·f_new, emitting output only from f_new onwards

zdelta is one of the best implementations
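This is not zdelta itself, but the underlying idea can be approximated in Python with zlib's preset dictionary (zdict), which plays the role of the already-known text f_known:

import zlib

def delta_compress(f_known: bytes, f_new: bytes) -> bytes:
    comp = zlib.compressobj(level=9, zdict=f_known)   # f_known = prior context
    return comp.compress(f_new) + comp.flush()

def delta_decompress(f_known: bytes, f_delta: bytes) -> bytes:
    decomp = zlib.decompressobj(zdict=f_known)
    return decomp.decompress(f_delta) + decomp.flush()

old = "".join(f"row {i}: the quick brown fox jumps over the lazy dog\n"
              for i in range(300)).encode()
new = old.replace(b"lazy", b"sleepy")
delta = delta_compress(old, new)
assert delta_decompress(old, delta) == new
# the delta is typically much smaller than compressing f_new alone
print(len(new), len(zlib.compress(new, 9)), len(delta))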
            Emacs size   Emacs time
uncompr     27 Mb        ---
gzip        8 Mb         35 secs
zdelta      1.5 Mb       42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F
 Useful on a dynamic collection of web pages, back-ups, …

 Apply pairwise zdelta: find for each f ∈ F a good reference

 Reduction to the Min Branching problem on DAGs (a sketch follows below):
   Build a weighted graph G_F: nodes = files, edge weights = zdelta-sizes
   Insert a dummy node connected to all files, whose edge weights are the
   gzip-coding sizes
   Compute the min branching = directed spanning tree of minimum total cost,
   covering G's nodes.
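A sketch of the reduction with made-up sizes, assuming networkx (whose Edmonds implementation is exposed as minimum_spanning_arborescence); an edge (u,v) means "compress v using u as reference", and the dummy node means "compress from scratch":

import networkx as nx

gzip_size   = {"f1": 620, "f2": 2000, "f3": 220}                       # hypothetical
zdelta_size = {("f1", "f2"): 20, ("f2", "f3"): 123, ("f1", "f3"): 20}  # hypothetical

G = nx.DiGraph()
for f, w in gzip_size.items():
    G.add_edge("dummy", f, weight=w)        # compress f on its own (gzip)
for (u, v), w in zdelta_size.items():
    G.add_edge(u, v, weight=w)              # compress v via zdelta against u

B = nx.minimum_spanning_arborescence(G)     # Edmonds' minimum branching
print(sorted(B.edges(data="weight")))
# e.g. [('dummy', 'f1', 620), ('f1', 'f2', 20), ('f1', 'f3', 20)]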

[Figure: a toy instance with the dummy node 0, file nodes 1, 2, 3, 5 and edge
 weights 620, 2000, 220, 123, 20, 20.]
          space   time
uncompr   30 Mb   ---
tgz       20%     linear
THIS      8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
          space    time
uncompr   260 Mb   ---
tgz       12%      2 mins
THIS      8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm

[Figure: the Client sends the block hashes of its f_old; the Server answers
 with the encoded file, built from f_new and those hashes.]

The rsync algorithm (contd)
 simple, widely used, single roundtrip
 optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
 choice of block size problematic (default: max{700, √n} bytes)
 not good in theory: granularity of changes may disrupt use of blocks
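A Python sketch of the rolling-checksum trick rsync relies on (the constants below are illustrative, not rsync's exact ones): a weak hash of a B-byte window that slides by one byte in O(1):

def weak_hash(block):
    a = sum(block) % 65536
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % 65536
    return a, b

def roll(a, b, out_byte, in_byte, B):
    # slide the window one byte: drop out_byte, append in_byte
    a = (a - out_byte + in_byte) % 65536
    b = (b - B * out_byte + a) % 65536
    return a, b

data, B = b"abcdefghij", 4
a, b = weak_hash(data[0:4])
a, b = roll(a, b, data[0], data[4], B)
assert (a, b) == weak_hash(data[1:5])    # rolling matches recomputing from scratch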

Rsync: some experiments

          gcc size   emacs size
total     27288      27326
gzip      7563       8577
zdelta    227        1431
rsync     964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

 The server sends the hashes (unlike rsync, where the client does), and the
 client checks them.
 The server deploys the common f_ref to compress the new f_tar (rsync just
 compresses it on its own).

A multi-round protocol
 k blocks of n/k elements each, log(n/k) levels.
 If the distance is k, then on each level at most k hashes do not find a match
 in the other file.
 The communication complexity is O(k lg n lg(n/k)) bits.

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
  iff  P is a prefix of the i-th suffix of T (i.e. T[i,N])

Occurrences of P in T = all suffixes of T having P as a prefix

  Example: P = si, T = mississippi  →  occurrences at positions 4, 7

SUF(T) = sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree

[Figure: the suffix tree of T# = mississippi#. Edges are labelled with
 substrings of T# (e.g. i, s, p, si, ssi, i#, pi#, ppi#, mississippi#, #) and
 each of the 12 leaves stores the starting position (1..12) of its suffix.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position in SA is the lexicographic position of P.

Storing SUF(T) explicitly takes Θ(N²) space; the suffix array keeps only the
suffix pointers:

SA    SUF(T)
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#

T = mississippi#        (running query: P = si)

Suffix Array space:
 • SA: Θ(N log2 N) bits
 • Text T: N chars
 → In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison.

SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3],   T = mississippi#,   P = si

Each binary-search step compares P against the suffix pointed to by the middle
SA entry (2 memory accesses per step): if P is larger, recurse on the right
half; if P is smaller, on the left half.

Suffix Array search
 • O(log2 N) binary-search steps
 • Each step takes O(p) char comparisons
 → overall, O(p log2 N) time
   (improvable to O(p + log2 N) [Manber-Myers, '90], and the log2 N can be
    replaced by log2 |Σ| with suffix trays [Cole et al, '06])
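A direct Python sketch of the indirect binary search: two searches delimit the SA range of the suffixes having P as a prefix (positions are 1-based as in the slides):

def sa_range(T, SA, P):
    def suffix(r):                      # the r-th lexicographically smallest suffix
        return T[SA[r] - 1:]
    lo, hi = 0, len(SA)
    while lo < hi:                      # leftmost suffix >= P
        mid = (lo + hi) // 2
        if suffix(mid) < P: lo = mid + 1
        else: hi = mid
    first = lo
    lo, hi = first, len(SA)
    while lo < hi:                      # leftmost suffix not starting with P
        mid = (lo + hi) // 2
        if suffix(mid).startswith(P): lo = mid + 1
        else: hi = mid
    return first, lo                    # occurrences are SA[first:lo]

T = "mississippi#"
SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
l, r = sa_range(T, SA, "si")
print(sorted(SA[l:r]))                  # [4, 7]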

Locating the occurrences
All occurrences of P are contiguous in SA, so it suffices to binary-search for
the two extremes of that range: conceptually, search for P# (P followed by a
char smaller than any other, e.g. si#) and for P$ (P followed by a char larger
than any other, e.g. si$). For P = si in T = mississippi# the range contains
the SA entries 7 and 4, hence occ = 2.

Suffix Array search:  O(p + log2 N + occ) time

Suffix Trays: O(p + log2 |Σ| + occ)         [Cole et al, '06]
String B-tree                               [Ferragina-Grossi, '95]
Self-adjusting Suffix Arrays                [Ciriani et al, '02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA
(e.g. issippi# and ississippi#, adjacent in SA, share the prefix issi, so their
 Lcp entry is 4).

SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3],   T = mississippi#

• How long is the common prefix between T[i,...] and T[j,...] ?
  → the minimum of the subarray Lcp[h,k-1], where SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  → search for an entry Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  → search for a window Lcp[i,i+C-2] whose entries are all ≥ L.
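A small Python sketch of these queries: it computes Lcp directly (Θ(N²) here; Kasai's algorithm would do it in O(N)) and extracts the longest repeated substring as the maximum Lcp entry:

def lcp_array(T, SA):
    def lcp(a, b):
        k = 0
        while k < len(a) and k < len(b) and a[k] == b[k]:
            k += 1
        return k
    # Lcp[i] = common prefix length of the suffixes in SA[i] and SA[i+1]
    return [lcp(T[SA[i] - 1:], T[SA[i + 1] - 1:]) for i in range(len(SA) - 1)]

T = "mississippi#"
SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
Lcp = lcp_array(T, SA)

i = max(range(len(Lcp)), key=Lcp.__getitem__)      # position of the max Lcp entry
print(Lcp[i], T[SA[i] - 1:SA[i] - 1 + Lcp[i]])     # 4 issi  (longest repeated substring)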


Slide 66

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (losslessly) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM into fewer bits ?
NO: they are 2^n, but the shorter bit-strings available as compressed messages
are only

    Σ_{i=1}^{n-1} 2^i  =  2^n - 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the self-information of s is:

    i(s) = log2 ( 1 / p(s) ) = - log2 p(s)

Lower probability  →  higher information.

Entropy is the weighted average of i(s):

    H(S) = Σ_{s ∈ S} p(s) · log2 ( 1 / p(s) )    bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword lengths L[s], the average length is defined as

    La(C) = Σ_{s ∈ S} p(s) · L[s]

We say that a prefix code C is optimal if for all prefix codes C', La(C) ≤ La(C')

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

Repeatedly merge the two least-probable nodes:
  a(.1) + b(.2) → (.3);   (.3) + c(.2) → (.5);   (.5) + d(.5) → (1)

Resulting codewords:  a=000, b=001, c=01, d=1

There are 2^(n-1) "equivalent" Huffman trees.
What about ties (and thus, tree depth) ?
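A Python sketch of the construction with a heap, merging the two least-probable nodes at every step; ties are broken arbitrarily, so it returns one of the equivalent optimal codes (same codeword lengths 3, 3, 2, 1 as above):

import heapq
from itertools import count

def huffman_code(probs):
    tiebreak = count()                    # avoids comparing dicts on equal probabilities
    heap = [(p, next(tiebreak), {s: ""}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)   # two least-probable nodes
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
    return heap[0][2]

print(huffman_code({"a": .1, "b": .2, "c": .2, "d": .5}))
# an optimal prefix code with lengths (3, 3, 2, 1); labelling may differ from the slide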

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.
Let s be a string of length m:

    H(s) = Σ_{i=1}^{m} 2^(m-i) · s[i]

P = 0101
H(P) = 2^3·0 + 2^2·1 + 2^1·0 + 2^0·1 = 5
s = s' if and only if H(s) = H(s')

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(T_r) from H(T_{r-1}):

    H(T_r) = 2·H(T_{r-1}) - 2^m·T(r-1) + T(r+m-1)

T = 10110101
T_1 = 1 0 1 1 ,   T_2 = 0 1 1 0

H(T_1) = H(1011) = 11
H(T_2) = H(0110) = 2·11 - 2^4·1 + 0 = 22 - 16 + 0 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
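A compact Python sketch of the whole scheme on the binary-alphabet example above; every fingerprint hit is verified, so this is the variant that never errs and runs in expected O(n+m) time (q below is an illustrative prime, not a randomly drawn one):

def karp_rabin(T, P, q=2**31 - 1):
    n, m = len(T), len(P)
    if m > n:
        return []
    d = 2                                    # radix = alphabet size
    h = pow(d, m - 1, q)                     # weight of the char leaving the window
    hp = ht = 0
    for i in range(m):                       # fingerprints of P and of T_1
        hp = (hp * d + int(P[i])) % q
        ht = (ht * d + int(T[i])) % q
    occ = []
    for r in range(n - m + 1):
        if hp == ht and T[r:r + m] == P:     # verify, ruling out false matches
            occ.append(r + 1)                # 1-based positions as in the slides
        if r < n - m:                        # slide: drop T[r], add T[r+m]
            ht = ((ht - int(T[r]) * h) * d + int(T[r + m])) % q
    return occ

print(karp_rabin("10110101", "0101"))        # [5]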

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with i-th bit of U(T[j]) to estabilish if both are true

An example j=1
1 2 3 4 5 6 7 8 9 10

n
m

1

12345

1

0

P=abaac

2

0

3

0

4

0

5

0

T=xabxabaaca

0
 
0
U (x)  0
 
0
 
0

2

3

4

5

6

7

1 
 
0
BitShift ( M ( 0 )) & U (T [1])   0  &
 
0
 
0

8

9

0 0
   
0 0
0  0
   
0 0
   
0 0

1
0

An example j=2
1 2 3 4 5 6 7 8 9 10

n
m

1

2

12345

1

0

1

P=abaac

2

0

0

3

0

0

4

0

0

5

0

0

T=xabxabaaca

1 
 
0
U (a )   1 
 
1 
 
0

3

4

5

6

7

1 
 
0
BitShift ( M (1)) & U (T [ 2 ])   0  &
 
0
 
0

8

9

1  1 
   
0 0
1    0 
   
1   0 
   
0 0

1
0

An example j=3
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

12345

1

0

1

0

P=abaac

2

0

0

1

3

0

0

0

4

0

0

0

5

0

0

0

T=xabxabaaca

0
 
1 
U (b )   0 
 
0
 
0

4

5

6

7

1 
 
1 
BitShift ( M ( 2 )) & U (T [ 3 ])   0  &
 
0
 
0

8

9

0 0
   
1  1 
0  0
   
0 0
   
0 0

1
0

An example j=9
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

4

5

6

7

8

9

12345

1

0

1

0

0

1

0

1

1

0

P=abaac

2

0

0

1

0

0

1

0

0

0

3

0

0

0

0

0

0

1

0

0

4

0

0

0

0

0

0

0

1

0

5

0

0

0

0

0

0

0

0

1

T=xabxabaaca

0
 
0
U (c )   0 
 
0
 
1 

1 
 
1 
BitShift ( M ( 8 )) & U (T [ 9 ])   0  &
 
0
 
1 

0 0
   
0 0
0  0
   
0 0
   
1  1 

1
0

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.
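A Python sketch of the method: each column M(j) lives in a single integer, so the update is exactly one BitShift and one AND per text character (Python integers stand in for the machine word, hence no m ≤ w restriction here):

def shift_and(T, P):
    m = len(P)
    U = {}                                    # U[c]: bit i-1 set iff P[i] == c
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M, occ = 0, []
    for j, c in enumerate(T, start=1):
        # BitShift(M) = shift down by one row (up by one bit) and set the first bit to 1
        M = ((M << 1) | 1) & U.get(c, 0)
        if M & (1 << (m - 1)):                # M(m,j) = 1: a full match ends at j
            occ.append(j - m + 1)
    return occ

print(shift_and("xabxabaaca", "abaac"))       # [5]
print(shift_and("california", "for"))         # [5]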

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M ( j - 1))  U (T [ j ])
l

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M

l -1

( j - 1))

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1
1 2 3 4 5 6 7 8 910

M1=

T=xabxabaaca
P=
abaad

1

2

3

4

5

6

7

8

9

1
0

1

1

1

1

1

1

1

1

1

1

1

2

0

0

1

0

0

1

0

1

1

0

3

0

0

0

1

0

0

1

0

0

1

4

0

0

0

0

1

0

0

1

0

0

5

0

0

0

0

0

0

0

0

1

0

M0=

1

2

3

4

5

6

7

8

9

1
0

1

0

1

0

0

1

0

1

1

0

1

2

0

0

1

0

0

1

0

0

0

0

3

0

0

0

0

0

0

1

0

0

0

4

0

0

0

0

0

0

0

1

0

0

5

0

0

0

0

0

0

0

0

0

0

How much do we pay?





The running time is O(kn(1+m/w)
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Delection: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding

    γ(x) = 0^(Length-1) · (x in binary),    with x > 0 and Length = ⌊log2 x⌋ + 1

e.g., 9 is represented as <000, 1001>.

The γ-code for x takes 2⌊log2 x⌋ + 1 bits  (i.e. a factor of 2 from optimal).
Optimal for Pr(x) = 1/(2x²), and i.i.d. integers.

It is a prefix-free encoding…
Given the following sequence of γ-coded integers, reconstruct the original
sequence:

    0001000001100110000011101100111   →   8, 6, 3, 59, 7
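A Python sketch of the γ-encoder and of the streaming decoder used in the exercise above:

def gamma_encode(x):
    # (Length-1) zeros followed by x in binary, x > 0
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    # the code is prefix-free, so no separators are needed
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i + z] == "0":            # count the leading zeros = Length-1
            z += 1
        out.append(int(bits[i + z:i + 2 * z + 1], 2))
        i += 2 * z + 1
    return out

print(gamma_encode(9))                                      # 0001001
print(gamma_decode("0001000001100110000011101100111"))      # [8, 6, 3, 59, 7]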

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How much good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
1 ≥Si=1,...,x pi ≥ x * px  x ≤ 1/px

How good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):



p i  | g ( i ) |

i  1 ,..., S

This is:



i  1 ,.., S

p i  [ 2 * log

1

 1]

pi

 2 * H 0 ( X ) 1
No much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1n 2n 3n… nn  Huff = O(n2 log n), MTF = O(n log n) + n2

No much worse than Huffman
...but it may be far better
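A Python sketch of the transform (positions are 0-based, as in the Bzip example of the BWT part; the initial list is a parameter):

def mtf_encode(text, alphabet):
    L = list(alphabet)
    out = []
    for s in text:
        pos = L.index(s)
        out.append(pos)            # output the position of s in L
        L.pop(pos)
        L.insert(0, s)             # move s to the front: this is the "memory"
    return out

# runs of equal symbols become runs of 0, which RLE then squeezes
print(mtf_encode("aaabbbccc", "abc"))    # [0, 0, 0, 1, 0, 0, 2, 0, 0]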

MTF: how good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
Put S in the front and consider the cost of encoding:
S

O ( S log S ) 



nx

g(p - p )

x 1 i  2

x

x

i

i -1

S

By Jensen’s:

 O ( S log S ) 

n
x 1

x

[ 2 * log

N

 1]

nx

 O ( S log S )  N * [ 2 * H 0 ( X )  1]
L a [ mtf ]  2 * H 0 ( X )  O (1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1n 2n 3n… nn 

There is a memory

Huff(X) = Q(n2 log n) > Rle(X) = Q(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.
1.0

c = .3

0.7

i -1

f (i ) 



p( j)

j 1

b = .5
0.2
0.0

f(a) = .0, f(b) = .2, f(c) = .7
a = .2

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
1.0

0.7
c = .3

0.7

c = .3
0.55

0.0

a = .2

0.3
0.2

c = .3

0.27
b = .5

b = .5
0.2

0.3

a = .2

b = .5
0.22

a = .2

0.2

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c_1 ... c_n with probabilities p[c], use:

    l_0 = 0 ,   l_i = l_{i-1} + s_{i-1} * f[c_i]
    s_0 = 1 ,   s_i = s_{i-1} * p[c_i]

f[c] is the cumulative probability up to symbol c (not included).
The final interval size is

    s_n = Π_{i=1..n} p[c_i]

The interval for a message sequence will be called the sequence interval.
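These recurrences are directly executable; a float-based Python sketch (real coders use the integer, renormalized version discussed below), checked on the bac example:

def sequence_interval(msg, p, f):
    l, s = 0.0, 1.0
    for c in msg:
        l = l + s * f[c]           # l_i = l_{i-1} + s_{i-1} * f[c_i]
        s = s * p[c]               # s_i = s_{i-1} * p[c_i]
    return l, s

p = {"a": .2, "b": .5, "c": .3}
f = {"a": .0, "b": .2, "c": .7}
l, s = sequence_interval("bac", p, f)
print(l, l + s)                    # ~0.27, ~0.30 : the interval [.27, .3)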

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
1.0

0.7
c = .3

0.7
0.49

b = .5

c = .3

0.55
0.49

0.2

0.0

0.55

b = .5

0.3
a = .2

The message is bbc.

0.2

0.49
0.475

c = .3

b = .5
0.35

a = .2

0.3

a = .2

Representing a real number
Binary fractional representation:
. 75

 . 11

1/ 3

 . 01 01

11 / 16

 . 1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
m in

m ax

interval

.11

.110

.111

[.75 ,1.0 )

.101

.1010

.1011

[.625 , .75 )

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
Context
Empty

Counts
A
B
C
$

=
=
=
=

4
2
5
3

Context
A
B
C

Counts
C
$
A
$
A
B
C
$

=
=
=
=
=
=
=
=

3
1
2
1
1
2
2
3

Context
AC

BA
CA
CB
CC

String = ACCBACCACBA B

k=2

Counts
B
C
$
C
$
C
$
A
$
A
B
$

=
=
=
=
=
=
=
=
=
=
=
=

1
2
2
1
1
1
1
2
1
1
1
2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character
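A brute-force Python sketch of this windowed scheme; it reproduces the triples of the example above (real implementations such as gzip use hashing rather than the inner scan):

def lz77_encode(text, W=6):
    i, out = 0, []
    while i < len(text):
        best_d, best_len = 0, 0
        for j in range(max(0, i - W), i):          # candidate copies start in the window
            l = 0
            # matches may run past the cursor (self-referencing copies are allowed)
            while i + l < len(text) - 1 and text[j + l] == text[i + l]:
                l += 1
            if l > best_len:
                best_d, best_len = i - j, l
        out.append((best_d, best_len, text[i + best_len]))   # (d, len, next char)
        i += best_len + 1                                    # advance by len + 1
    return out

print(lz77_encode("aacaacabcabaaac"))
# [(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')]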

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Q(n2 log n) time in the worst-case
• Q(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…

 Physical network graph
   V = Routers
   E = communication links

 The "cosine" graph (undirected, weighted)
   V = static web pages
   E = semantic distance between pages

 Query-Log graph (bipartite, weighted)
   V = queries and URLs
   E = (q,u) if u is a result for q, and has been clicked by some user who issued q

 Social graph (undirected, unweighted)
   V = users
   E = (x,y) if x knows y (facebook, address book, email, ..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999 — WebBase crawl, 2001:
the indegree follows a power law distribution

   Pr[ in-degree(u) = k ]  ∝  1 / k^a ,    a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:
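A small sketch in Python of the gap transformation above; for the possibly negative first entry, the signed-to-unsigned mapping shown here (2v for v ≥ 0, 2|v|−1 for v < 0) is the same one used in the Extra-nodes example later, while the successor list itself is made up for illustration:

# Gap-encode a sorted successor list with respect to the source node x (locality: gaps are small).
def gaps(x, succ):                        # succ is sorted increasingly
    out = [succ[0] - x]                   # may be negative
    out += [succ[i] - succ[i - 1] - 1 for i in range(1, len(succ))]
    return out

def to_unsigned(v):                       # map 0,-1,1,-2,2,... onto 0,1,2,3,4,...
    return 2 * v if v >= 0 else 2 * (-v) - 1

x, succ = 15, [13, 15, 16, 17, 18, 19, 23, 24, 203]
print(gaps(x, succ))                      # [-2, 1, 0, 0, 0, 0, 3, 0, 178]
print([to_unsigned(v) for v in gaps(x, succ)])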

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization

 Delta compression   [diff, zdelta, REBL,…]
   Compress file f deploying file f'
   Compress a group of files
   Speed-up web access by sending differences between the requested page
   and the ones available in cache

 File synchronization   [rsync, zsync]
   Client updates old file f_old with f_new available on a server
   Mirroring, Shared Crawling, Content Distribution Networks

 Set reconciliation
   Client updates a structured old file f_old with f_new available on a server
   Update of contacts or appointments, intersecting inverted lists in a P2P search engine

Z-delta compression   (one-to-one)

Problem: We have two files f_known and f_new, and the goal is to compute a file f_d of
minimum size such that f_new can be derived from f_known and f_d.

 Assume that block moves and copies are allowed
 Find an optimal covering set of f_new based on f_known
 The LZ77-scheme provides an efficient, optimal solution:
   f_known is the "previously encoded text"; compress the concatenation f_known·f_new,
   emitting output only from f_new onwards
 zdelta is one of the best implementations
           Emacs size   Emacs time
uncompr    27Mb         ---
gzip       8Mb          35 secs
zdelta     1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: the weighted graph G_F built on the files, with the dummy node and zdelta/gzip sizes
 as edge weights (20, 123, 220, 620, 2000, …); the min branching keeps one incoming edge per node.]

           space   time
uncompr    30Mb    ---
tgz        20%     linear
THIS       8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
           space    time
uncompr    260Mb    ---
tgz        12%      2 mins
THIS       8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
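A toy sketch of the client/server exchange (in Python; whole blocks are hashed with MD5 only and the hash is recomputed at every position, so this illustrates the block-matching idea rather than the real rsync protocol with its 4-byte rolling hash):

import hashlib

B = 4   # toy block size (rsync's default is max{700, sqrt(n)} bytes)

def block_hashes(f_old):
    blocks = [f_old[i:i + B] for i in range(0, len(f_old), B)]
    return {hashlib.md5(b).hexdigest(): i for i, b in enumerate(blocks)}

def encode(f_new, hashes):
    # emit either the id of a block of f_old, or a literal byte
    out, i = [], 0
    while i < len(f_new):
        h = hashlib.md5(f_new[i:i + B]).hexdigest()
        if len(f_new) - i >= B and h in hashes:
            out.append(("copy", hashes[h])); i += B
        else:
            out.append(("lit", f_new[i:i + 1])); i += 1
    return out

def decode(encoded, f_old):
    out = b""
    for op, x in encoded:
        out += f_old[x * B:(x + 1) * B] if op == "copy" else x
    return out

f_old = b"the quick brown fox jumps"
f_new = b"the quick red fox jumps!"
assert decode(encode(f_new, block_hashes(f_old)), f_old) == f_new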

Rsync: some experiments

          gcc size   emacs size
total     27288      27326
gzip      7563       8577
zdelta    227        1431
rsync     964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree

[Figure: the suffix tree of T# = mississippi#; edges carry substring labels
 (#, i, ssi, si, p, ppi#, pi#, i#, mississippi#, …) and each of the 12 leaves stores
 the starting position of its suffix.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position in SUF(T) is the lexicographic position of P.

Storing SUF(T) explicitly takes Θ(N²) space; we keep only the suffix pointers:

SA    SUF(T)
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#

T = mississippi#    (e.g. P = si selects a contiguous range of SUF(T))

Suffix Array space:
• SA: Θ(N log₂ N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirect binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does it exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does it exist a substring of length ≥ L occurring ≥ C times ?
• Search for Lcp[i,i+C-2] whose entries are ≥ L


Slide 67

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)

This is at least 10⁴ * f/(1+f)

If we fetch B ≈ 4Kb in time C, and the algorithm uses all of them:
(1/B) * (p * f/(1+f) * C)  ≈  30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
        4K    8K    16K   32K    128K   256K   512K   1M
n³      22s   3m    26m   3.5h   28h    --     --     --
n²      0     0     0     1s     26s    106s   7m     28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT
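The same scan written out (a Python sketch that follows the slide's pseudocode, under its assumption that every subsum is nonzero):

def max_subarray_sum(A):
    best, cur = A[0], 0            # the slides use max = -1 as the initial sentinel
    for x in A:
        cur = 0 if cur + x <= 0 else cur + x
        best = max(best, cur)
    return best

A = [2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]
print(max_subarray_sum(A))         # 12 = 6 + 1 - 2 + 4 + 3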

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 10⁹ random I/Os = 10⁹ * 5ms ≈ 2 months

Binary Merge-Sort

Merge-Sort(A,i,j)
01 if (i < j) then
02   m = (i+j)/2;           // Divide
03   Merge-Sort(A,i,m);     // Conquer
04   Merge-Sort(A,m+1,j);
05   Merge(A,i,m,j)         // Combine

Cost of Mergesort on large data

 Take Wikipedia in Italian, compute word freq:
   n = 10⁹ tuples  few Gbs
 Typical disk (Seagate Cheetah 150Gb): seek time ~5ms

 Analysis of mergesort on disk:
   It is an indirect sort: Θ(n log₂ n) random I/Os
   [5ms] * n log₂ n ≈ 1.5 years

 In practice, it is faster because of caching...   (2 passes (R/W))

Merge-Sort Recursion Tree

[Figure: recursion tree of depth log₂ N over an example array, with sorted runs merged level by level.]

If the run-size is larger than B (i.e. after the first step!!),
fetching all of it in memory for merging does not help.

How do we deploy the disk/mem features ?

With an internal memory of size M: N/M runs, each sorted in internal memory (no I/Os)
— I/O-cost for merging is ≈ 2 (N/B) log₂ (N/M)

Multi-way Merge-Sort

 The key is to balance run-size and #runs to merge
 Sort N items with main-memory M and disk-pages B:
   Pass 1: Produce (N/M) sorted runs.
   Pass i: merge X ≤ M/B runs   log_{M/B}(N/M) passes

[Figure: X input buffers (one per run, INPUT 1 … INPUT X) and one output buffer, each of B items,
 kept in main memory; runs are read from disk and the merged run is written back to disk.]

Multiway Merging

[Figure: buffers Bf1 … Bfx with pointers p1 … pX over the current pages of runs 1 … X = M/B.
 At each step output min(Bf1[p1], Bf2[p2], …, Bfx[pX]) into the output buffer Bfo;
 fetch a new page of run i when pi = B, flush Bfo when full; the merged run grows in the
 output file until EOF.]
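A sketch of the X-way merging step (Python, using a heap of the current front elements, one per run; a real implementation streams pages of B items instead of keeping whole runs in memory):

import heapq

def multiway_merge(runs):
    # runs: list of already-sorted sequences (on disk they would be read B items at a time)
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        val, i, j = heapq.heappop(heap)     # min(Bf1[p1], ..., Bfx[pX])
        out.append(val)                     # goes to the output buffer, flushed when full
        if j + 1 < len(runs[i]):
            heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
    return out

print(multiway_merge([[1, 2, 5, 10], [2, 7, 9, 13], [3, 4, 8, 19]]))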

Cost of Multi-way Merge-Sort

 Number of passes = log_{M/B} #runs ≈ log_{M/B}(N/M)
 Optimal cost = Θ( (N/B) log_{M/B}(N/M) ) I/Os

In practice
 M/B ≈ 1000   #passes = log_{M/B}(N/M) ≈ 1
 One multiway merge  2 passes = few mins    (tuning depends on disk features)

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Can compression help?
 Goal: enlarge M and reduce N
   #passes = O(log_{M/B}(N/M))
   Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements

 Goal: Top queries over a stream of N items (S large).
 Math Problem: Find the item y whose frequency is > N/2,
   using the smallest space (i.e. assuming the mode occurs > N/2 times).

A = b a c c c d c b a a a c c b c c c

Algorithm
 Use a pair of variables <X,C>
 For each item s of the stream:
     if (X==s) then C++
     else { C--; if (C==0) { X=s; C=1; } }
 Return X;

Proof
Problems could arise only if #occ(y) ≤ N/2.
If X≠y at the end, then every one of y's occurrences has a distinct "negative" mate,
hence these mates should be ≥ #occ(y).
As a result we would get 2 * #occ(y) ≤ N, contradicting #occ(y) > N/2...
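The two-variable scan in code (a Python sketch of the slide's algorithm, with the else-branch slightly rearranged but equivalent):

def majority_candidate(stream):
    X, C = None, 0
    for s in stream:
        if s == X:
            C += 1
        else:
            C -= 1
            if C <= 0:              # adopt the new item, as in the else-branch above
                X, C = s, 1
    return X                        # guaranteed correct only if some item occurs > N/2 times

A = "baccc dcbaa accbccc".replace(" ", "")
print(majority_candidate(A))        # 'c'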

Toy problem #4: Indexing

 Consider the following TREC collection:
   N = 6 * 10⁹   size = 6Gb
   n = 10⁶ documents
   TotT = 10⁹ terms (avg term length is 6 chars)
   t = 5 * 10⁵ distinct terms

What kind of data structure do we build to support word-based searches ?

Solution 1: Term-Doc matrix     (n = 1 million docs, t = 500K terms)

            Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony             1                1             0          0       0        1
Brutus             1                1             0          1       0        0
Caesar             1                1             0          1       1        1
Calpurnia          0                1             0          0       0        0
Cleopatra          1                0             0          0       0        0
mercy              1                0             1          1       1        1
worser             1                0             1          1       1        0

1 if the play contains the word, 0 otherwise.        Space is 500Gb !

Solution 2: Inverted index

Brutus     → 2 4 8 16 32 64 128
Calpurnia  → 1 2 3 5 8 13 21 34
Caesar     → 13 16

(We can still do better: i.e. 30-50% of the original text)

1. Typically use about 12 bytes per posting
2. We have 10⁹ total terms  at least 12Gb space
3. Compressing the 6Gb of documents gets 1.5Gb of data
Better index, but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (losslessly) compressed, some other must expand.

Take all messages of length n.
Is it possible to compress ALL OF THEM into fewer bits ?
NO: they are 2ⁿ, but the shorter compressed messages are at most

   ∑_{i=1}^{n-1} 2^i  =  2ⁿ − 2

We need to talk about stochastic sources.

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the self-information of s is:

   i(s) = log₂ (1/p(s)) = − log₂ p(s)

Lower probability  higher information

Entropy is the weighted average of i(s):

   H(S) = ∑_{s∈S} p(s) · log₂ (1/p(s))    bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword lengths L[s], the average length is defined as

   L_a(C) = ∑_{s∈S} p(s) · L[s]

We say that a prefix code C is optimal if for all prefix codes C’, L_a(C) ≤ L_a(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal uniquely decodable code, there exists a prefix
code with the same codeword lengths, and thus the same optimal average length. And vice versa…

Theorem (golden rule). If C is an optimal prefix code for a source with probabilities {p1, …, pn}
then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any uniquely
decodable code C, we have

   H(S) ≤ L_a(C)

Theorem (upper bound, Shannon). For any probability distribution, there exists a prefix
code C (the Shannon code, where symbol s takes ⌈log₂ 1/p(s)⌉ bits) such that

   L_a(C) ≤ H(S) + 1

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

[Figure: Huffman tree — a(.1) and b(.2) merge into (.3); (.3) and c(.2) merge into (.5);
 (.5) and d(.5) merge into (1).]

a=000, b=001, c=01, d=1

There are 2^(n-1) "equivalent" Huffman trees.
What about ties (and thus, tree depth) ?
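A small sketch of the construction on the running example (in Python; ties are broken arbitrarily by the heap, which is exactly the source of the "equivalent trees" and of the depth question above):

import heapq

def huffman_codes(probs):
    # repeatedly merge the two least-probable nodes
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

print(huffman_codes({"a": .1, "b": .2, "c": .2, "d": .5}))
# codeword lengths 3,3,2,1 — the slide's a=000, b=001, c=01, d=1 up to 0/1 relabeling at each node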

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. Its self-information is

   − log₂(.999) ≈ .00144 bits

If we were to send 1000 such symbols we might hope to use 1000 * .00144 ≈ 1.44 bits.
Using Huffman, we take at least 1 bit per symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k  ∞ !!

In practice, we have:
 The model takes |S|^k * (k * log |S|) + h² bits   (where h might be |S|)
 It is H₀(Sᴸ) ≤ L * H_k(S) + O(k * log |S|), for each k ≤ L

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons

 Strings are also numbers, H: strings → numbers.
 Let s be a string of length m:

   H(s) = ∑_{i=1}^{m} 2^(m−i) · s[i]

   P = 0101
   H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5

 s = s’ if and only if H(s) = H(s’)

Definition: let T_r denote the m-length substring of T starting at
position r (i.e., T_r = T[r, r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons

 We can compute H(T_r) from H(T_{r−1}):

   H(T_r) = 2 · H(T_{r−1}) − 2^m · T(r−1) + T(r+m−1)

T = 10110101,  m = 4
T1 = 1 0 1 1,  T2 = 0 1 1 0
H(T1) = H(1011) = 11
H(T2) = H(0110) = 2·11 − 2⁴·1 + 0 = 22 − 16 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P = 101111,  q = 7
H(P) = 47,  Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
(1·2 (mod 7)) + 0 = 2
(2·2 (mod 7)) + 1 = 5
(5·2 (mod 7)) + 1 = 4
(4·2 (mod 7)) + 1 = 2
(2·2 (mod 7)) + 1 = 5
5 (mod 7) = 5 = Hq(P)

We can still compute Hq(T_r) from Hq(T_{r−1}), since
2^m (mod q) = 2·( 2^(m−1) (mod q) ) (mod q).

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
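A sketch of the matcher (in Python; a fixed prime q is used here for readability, whereas the analysis above requires picking q at random from a large enough range, and the explicit verification makes every reported match definite):

def karp_rabin(T, P, q=2**31 - 1):
    n, m = len(T), len(P)
    if m > n:
        return []
    pow_m = pow(2, m, q)                      # 2^m (mod q), used to drop the leftmost char
    hp = ht = 0
    for i in range(m):                        # fingerprints of P and of T[0, m-1]
        hp = (2 * hp + ord(P[i])) % q
        ht = (2 * ht + ord(T[i])) % q
    occ = []
    for r in range(n - m + 1):
        if ht == hp and T[r:r + m] == P:      # check -> declare a definite match
            occ.append(r)
        if r + m < n:                         # Hq(T_{r+1}) from Hq(T_r) in O(1)
            ht = (2 * ht - pow_m * ord(T[r]) + ord(T[r + m])) % q
    return occ

print(karp_rabin("10110101", "0101"))         # [4] (0-based), i.e. position 5 of the slides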

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M

 We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one.
 Machines can perform bit and arithmetic operations between two words in constant time.
 Examples:
   And(A,B) is the bit-wise and between A and B.
   BitShift(A) is the value derived by shifting A's bits down by one and setting the first bit to 1.

     BitShift( (0,1,1,0,1) ) = (1,0,1,1,0)

 Let w be the word size (e.g., 32 or 64 bits). We'll assume m = w.
   NOTICE: any column of M fits in a memory word.

How to construct M

 We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one.
 We define the m-length binary vector U(x) for each character x in the alphabet:
 U(x) is set to 1 in the positions where character x appears in P.

 Example: P = abaac
   U(a) = (1,0,1,1,0)    U(b) = (0,1,0,0,0)    U(c) = (0,0,0,0,1)

How to construct M

 Initialize column 0 of M to all zeros
 For j > 0, the j-th column is obtained by

   M(j) = BitShift( M(j-1) ) & U( T[j] )

 For i > 1, entry M(i,j) = 1 iff
   (1) the first i-1 characters of P match the i-1 characters of T ending at character j-1
        ⇔ M(i-1,j-1) = 1
   (2) P[i] = T[j]  ⇔  the i-th bit of U(T[j]) = 1

 BitShift moves bit M(i-1,j-1) into the i-th position;
 AND-ing it with the i-th bit of U(T[j]) establishes whether both conditions hold.
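A bit-parallel sketch (in Python; an integer plays the role of the column M(j), with bit i−1 of the integer standing for row i):

def shift_and(T, P):
    m = len(P)
    U = {}                                   # U[c] has bit i set iff P[i] == c
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M, occ = 0, []
    for j, c in enumerate(T):
        # BitShift(M) = shift down by one and set the first bit; then AND with U(T[j])
        M = ((M << 1) | 1) & U.get(c, 0)
        if M & (1 << (m - 1)):               # row m is set: an occurrence of P ends at j
            occ.append(j - m + 1)
    return occ

print(shift_and("xabxabaaca", "abaac"))      # [4] (0-based): the match ending at position 9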

A worked example:  T = xabxabaaca,  P = abaac

[Figure: the columns M(1), M(2), M(3), …, M(9) computed step by step as
 M(j) = BitShift(M(j-1)) & U(T[j]); e.g. BitShift(M(0)) & U(T[1]) = (1,0,0,0,0) & U(x) = (0,0,0,0,0),
 BitShift(M(1)) & U(T[2]) = (1,0,0,0,0) & U(a) = (1,0,0,0,0), and so on.
 At j = 9 the 5-th bit of M(9) equals 1, so an occurrence of P ends at position 9 of T.]
Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions

 We want to allow the pattern to contain special symbols, like the class of chars [a-f].

 Example: P = [a-b]baac
   U(a) = (1,0,1,1,0)    U(b) = (1,1,0,0,0)    U(c) = (0,0,0,0,1)

 What about ‘?’, ‘[^…]’ (not) ?

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

 S is the concatenation of the patterns in P
 R is a bitmap of length m:
   R[i] = 1 iff S[i] is the first symbol of a pattern

 Use a variant of the Shift-And method searching for S:
   For any symbol c, U’(c) = U(c) AND R
     U’(c)[i] = 1 iff S[i] = c and it is the first symbol of a pattern
   For any step j,
     compute M(j)
     then M(j) OR U’(T[j]). Why?
       It sets to 1 the first bit of each pattern that starts with T[j]
     Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M ( j - 1))  U (T [ j ])
l

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M

l -1

( j - 1))

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example:  T = xabxabaaca,  P = abaad

[Figure: the 5×10 binary matrices M⁰ (exact prefix matches of P) and M¹ (matches with at most
 one mismatch), computed column by column with the recurrence above.]
How much do we pay?

 The running time is O( k n (1 + m/w) ).
 Again, the method is practically efficient for small m.
 Only O(k) columns of M are needed at any given time. Hence, the space used by the
 algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations

 The Shift-And method can solve other ops.

 The edit distance between two strings p and s is d(p,s) = the minimum number of
 operations needed to transform p into s via three ops:
   Insertion: insert a symbol in p
   Deletion: delete a symbol from p
   Substitution: change a symbol in p with a different one

 Example: d(ananas, banane) = 3

Search by regular expressions
 Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort the p_i in decreasing order, and encode s_i via the variable-length code for the integer i.

g-code for integer encoding:
   g(x) = (Length − 1) zeros, followed by x in binary

 x > 0 and Length = ⌊log₂ x⌋ + 1
 e.g., 9 is represented as <000,1001>.
 The g-code for x takes 2⌊log₂ x⌋ + 1 bits   (i.e. a factor of 2 from optimal)
 Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…

 Given the following sequence of g-coded integers, reconstruct the original sequence:

   0001000001100110000011101100111

   → 8, 6, 3, 59, 7
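A sketch of encoding and decoding g-codes (in Python), which also reproduces the exercise above:

def gamma_encode(x):                  # x > 0
    b = bin(x)[2:]                    # x in binary, ⌊log2 x⌋+1 bits
    return "0" * (len(b) - 1) + b     # Length-1 zeros, then x in binary

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":
            zeros += 1; i += 1
        out.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return out

code = "".join(gamma_encode(x) for x in (8, 6, 3, 59, 7))
print(code)                            # 0001000001100110000011101100111
print(gamma_decode(code))              # [8, 6, 3, 59, 7]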

Analysis
Sort the p_i in decreasing order, and encode s_i via the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How good is this approach wrt Huffman?  Compression ratio ≤ 2 * H₀(s) + 1
Key fact:
   1 ≥ ∑_{i=1,...,x} p_i ≥ x * p_x    x ≤ 1/p_x

How good is it ?
Encode the integers via g-coding:  |g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/p_i):

   ∑_{i=1,...,|S|} p_i * |g(i)|  ≤  ∑_{i=1,...,|S|} p_i * [ 2 * log (1/p_i) + 1 ]  ≤  2 * H₀(X) + 1

Not much worse than Huffman, and improvable to H₀(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory.
Properties:
 Exploits temporal locality, and it is dynamic
 X = 1ⁿ 2ⁿ 3ⁿ … nⁿ   Huff = O(n² log n), MTF = O(n log n) + n²

Not much worse than Huffman ...but it may be far better
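A sketch of the transform and its inverse (in Python; positions are emitted 0-based here, while the slides count from 1):

def mtf_encode(text, alphabet):
    L, out = list(alphabet), []
    for s in text:
        i = L.index(s)                 # position of s in the list (O(|S|) per symbol in this sketch)
        out.append(i)
        L.insert(0, L.pop(i))          # move s to the front: recently-seen symbols get small codes
    return out

def mtf_decode(codes, alphabet):
    L, out = list(alphabet), []
    for i in codes:
        s = L[i]
        out.append(s)
        L.insert(0, L.pop(i))
    return "".join(out)

enc = mtf_encode("abbbaacccca", "abcd")
print(enc)                             # [0, 1, 0, 0, 1, 0, 2, 0, 0, 0, 1]: small integers on runs
assert mtf_decode(enc, "abcd") == "abbbaacccca"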

MTF: how good is it ?
Encode the integers via g-coding:  |g(i)| ≤ 2 * log i + 1
Put S in front and consider the cost of encoding:

   O(|S| log |S|) + ∑_{x=1}^{|S|} ∑_{i=2}^{n_x} | g( p_i^x − p_{i-1}^x ) |

(p_1^x, p_2^x, … are the positions of the occurrences of symbol x, and n_x their number).

By Jensen’s inequality:

   ≤ O(|S| log |S|) + ∑_{x=1}^{|S|} n_x * [ 2 * log (N/n_x) + 1 ]
   ≤ O(|S| log |S|) + N * [ 2 * H₀(X) + 1 ]

   L_a[mtf] ≤ 2 * H₀(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
   abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just the run lengths and one bit.

There is a memory.
Properties:
 Exploits spatial locality, and it is a dynamic code
 X = 1ⁿ 2ⁿ 3ⁿ … nⁿ   Huff(X) = Θ(n² log n) > Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol an interval range in [0,1) of width p(symbol), starting at

   f(i) = ∑_{j=1}^{i−1} p(j)

e.g.  p(a) = .2, p(b) = .5, p(c) = .3    f(a) = .0, f(b) = .2, f(c) = .7
      a → [0.0, 0.2),   b → [0.2, 0.7),   c → [0.7, 1.0)

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac

   start:   [0.0, 1.0)
   b = .5:  [0.2, 0.7)
   a = .2:  [0.2, 0.3)
   c = .3:  [0.27, 0.3)

The final sequence interval is [.27, .3)

Arithmetic Coding
To code a sequence of symbols c with probabilities p[c] use the following:

   l_0 = 0      l_i = l_{i-1} + s_{i-1} * f[c_i]
   s_0 = 1      s_i = s_{i-1} * p[c_i]

f[c] is the cumulative prob. up to symbol c (not included).
The final interval size is

   s_n = ∏_{i=1}^{n} p[c_i]

The interval for a message sequence will be called the sequence interval.
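A sketch of the interval computation (in Python, with the slides' model p(a)=.2, p(b)=.5, p(c)=.3; symbols are taken in sorted order when building f, as in the picture above):

def sequence_interval(msg, p):
    # f[c] = cumulative probability of the symbols preceding c
    f, acc = {}, 0.0
    for c in sorted(p):
        f[c] = acc; acc += p[c]
    l, s = 0.0, 1.0                    # l_0 = 0, s_0 = 1
    for c in msg:
        l = l + s * f[c]               # l_i = l_{i-1} + s_{i-1} * f[c_i]
        s = s * p[c]                   # s_i = s_{i-1} * p[c_i]
    return l, l + s                    # the sequence interval [l, l+s)

print(sequence_interval("bac", {"a": .2, "b": .5, "c": .3}))   # ≈ (0.27, 0.3)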

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
1.0

0.7
c = .3

0.7
0.49

b = .5

c = .3

0.55
0.49

0.2

0.0

0.55

b = .5

0.3
a = .2

The message is bbc.

0.2

0.49
0.475

c = .3

b = .5
0.35

a = .2

0.3

a = .2

Representing a real number
Binary fractional representation:

   .75 = .11      1/3 = .0101…      11/16 = .1011

Algorithm
1. x = 2 * x
2. If x < 1 output 0
3. else x = x - 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
e.g.  [0,.33) = .01     [.33,.66) = .1     [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions.

           min      max      interval
   .11     .110     .111     [.75, 1.0)
   .101    .1010    .1011    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length
Note that −log s + 1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

   1 + ⌈log (1/s)⌉ = 1 + ⌈log ∏_i (1/p_i)⌉
                   ≤ 2 + ∑_{i=1,n} log (1/p_i)
                   = 2 + ∑_{k=1,|S|} n·p_k · log (1/p_k)
                   = 2 + n·H₀  bits

(nH₀ + 0.02 n bits in practice, because of rounding)

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
   Output 1 followed by m 0s;  m = 0
   Message interval is expanded by 2

If u < R/2 then (bottom half)
   Output 0 followed by m 1s;  m = 0
   Message interval is expanded by 2

If l ≥ R/4 and u < 3R/4 then (middle half)
   Increment m
   Message interval is expanded by 2

All other cases: just continue...

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts      (String = ACCBACCACBA B,   k = 2)

Order-0 (empty context):   A = 4,  B = 2,  C = 5,  $ = 3
Order-1 contexts:          A:  C = 3, $ = 1
                           B:  A = 2, $ = 1
                           C:  A = 1, B = 2, C = 2, $ = 3
Order-2 contexts:          AC: B = 1, C = 2, $ = 2
                           BA: C = 1, $ = 1
                           CA: C = 1, $ = 1
                           CB: A = 2, $ = 1
                           CC: A = 1, B = 1, $ = 2
You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Q(n2 log n) time in the worst-case
• Q(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics

Size
 1 trillion pages available (Google, 7/08)
 5-40K per page => hundreds of terabytes
 Size grows every day!!

Change
 8% new pages, 25% new links change weekly
 Life time of about 10 days

The Bow Tie

Some definitions

 Weakly connected components (WCC)
   Set of nodes such that from any node one can reach any other node via an undirected path.

 Strongly connected components (SCC)
   Set of nodes such that from any node one can reach any other node via a directed path.

[Figure: an example of a WCC and of an SCC]

Observing the Web Graph

 We do not know which percentage of it we know
 The only way to discover the graph structure of the web as hypertext is via large scale crawls
 Warning: the picture might be distorted by
   Size limitation of the crawl
   Crawling rules
   Perturbations of the "natural" process of birth and death of nodes and links

Why is it interesting?

 Largest artifact ever conceived by humans
 Exploit the structure of the Web for
   Crawl strategies
   Search
   Spam detection
   Discovering communities on the web
   Classification/organization
 Predict the evolution of the Web
 Sociological understanding

Many other large graphs…

 Physical network graph
   V = routers
   E = communication links

 The “cosine” graph (undirected, weighted)
   V = static web pages
   E = semantic distance between pages

 Query-Log graph (bipartite, weighted)
   V = queries and URLs
   E = (q,u) if u is a result for q, and has been clicked by some user who issued q

 Social graph (undirected, unweighted)
   V = users
   E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)
 V = URLs, E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN & no OUT)

Three key properties:
 Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution

Altavista crawl (1999) and WebBase crawl (2001): the indegree follows a power-law distribution

    Pr[ in-degree(u) = k ]  ∝  1 / k^a ,      a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)
 V = URLs, E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN, no OUT)

Three key properties:
 Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1
 Locality: usually most of the hyperlinks point to other URLs on the same host (about 80%).
 Similarity: pages close in lexicographic order tend to share many outgoing lists

A Picture of the Web Graph
[Figure: the adjacency matrix (i,j) of a crawl of 21 million pages and 150 million links, with URLs sorted lexicographically (Berkeley and Stanford hosts visible as blocks): locality and similarity show up as dots concentrated near the diagonal]

URL compression + Delta encoding

The library WebGraph

Uncompressed adjacency list  vs  adjacency list with compressed gaps (exploits locality):

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries (only the first gap s1-x can be negative): map them to non-negative integers with v(z) = 2z if z ≥ 0, and v(z) = 2|z|-1 otherwise.
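
A small sketch of this gap encoding (the successor values are illustrative; the v(z) mapping of the possibly-negative first entry is the one also used for the residuals in the Extra-nodes slide below):

def encode_gaps(x, successors):
    s = sorted(successors)
    gaps = [s[0] - x] + [s[i] - s[i - 1] - 1 for i in range(1, len(s))]
    # first entry may be negative: v(z) = 2z if z >= 0, else 2|z| - 1
    gaps[0] = 2 * gaps[0] if gaps[0] >= 0 else 2 * abs(gaps[0]) - 1
    return gaps

# hypothetical node 15 with its successor list
print(encode_gaps(15, [13, 15, 16, 17, 18, 19, 23, 24, 203]))
# -> [3, 1, 0, 0, 0, 0, 3, 0, 178]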

Copy-lists

Reference chains, possibly limited in length.

Uncompressed adjacency list  vs  adjacency list with copy lists (exploits similarity):
to encode the successors of y, pick a reference node x among the previous ones;
the copy list of y holds one bit per successor of the reference x, telling whether that successor is also a successor of y.
The reference index is chosen in [0,W] so as to give the best compression.

Copy-blocks = RLE(Copy-list)

Adjacency list with copy lists  vs  adjacency list with copy blocks (RLE on the bit sequences):
 The first copy block is 0 if the copy list starts with 0;
 The last block is omitted (we know the length…);
 The length is decremented by one for all blocks.

This is a Java and C++ lib (≈3 bits/edge)

Extra-nodes: Compressing Intervals

Adjacency list with copy blocks  →  exploit consecutivity among the extra-nodes:

 Intervals: coded with their left extreme and their length
 Interval length: decremented by Lmin = 2
 Residuals: differences between consecutive residuals, or with the source

Examples:
 0 = (15-15)*2        (positive)
 2 = (23-19)-2        (jump >= 2)
 600 = (316-16)*2
 3 = |13-15|*2-1      (negative)
 3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background

  sender  --- data --->  receiver   (some knowledge about the data is already at the receiver)

 network links are getting faster and faster, but
 many clients are still connected by fairly slow links (mobile?)
 people wish to send more and more data

How can we make this transparent to the user?

Two standard techniques

 caching: “avoid sending the same object again”
   done on the basis of objects
   only works if objects are completely unchanged
   How about objects that are slightly changed?

 compression: “remove redundancy in transmitted data”
   avoid repeated substrings in data
   can be extended to the history of past transmissions (overhead)
   What if the sender has never seen the data at the receiver?

Types of Techniques

 Common knowledge between sender & receiver
   Unstructured file: delta compression

 “Partial” knowledge
   Unstructured files: file synchronization
   Record-based data: set reconciliation

Formalization

 Delta compression   [diff, zdelta, REBL,…]
   Compress file f deploying file f’
   Compress a group of files
   Speed up web access by sending the differences between the requested page and the ones available in cache

 File synchronization   [rsync, zsync]
   Client updates an old file f_old with f_new available on a server
   Mirroring, Shared Crawling, Content Distribution Networks

 Set reconciliation
   Client updates a structured old file f_old with f_new available on a server
   Update of contacts or appointments, intersection of inverted lists in a P2P search engine

Z-delta compression   (one-to-one)

Problem: We have two files f_known and f_new, and the goal is to compute a file f_d of minimum size such that f_new can be derived from f_known and f_d.

 Assume that block moves and copies are allowed
 Find an optimal covering set of f_new based on f_known
 The LZ77-scheme provides an efficient, optimal solution:
   f_known is the “previously encoded text”: compress the concatenation f_known·f_new, emitting codewords only from f_new onwards
 zdelta is one of the best implementations

           Emacs size   Emacs time
uncompr    27Mb          ---
gzip        8Mb          35 secs
zdelta     1.5Mb         42 secs
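
zdelta itself is a dedicated tool; the same idea can be sketched with the standard zlib module, whose preset-dictionary feature plays the role of f_known (a sketch of the principle, not of zdelta):

import zlib

def delta_compress(f_known: bytes, f_new: bytes) -> bytes:
    co = zlib.compressobj(level=9, zdict=f_known)   # f_known acts as shared context
    return co.compress(f_new) + co.flush()

def delta_decompress(f_known: bytes, delta: bytes) -> bytes:
    do = zlib.decompressobj(zdict=f_known)
    return do.decompress(delta) + do.flush()

old = b"the quick brown fox jumps over the lazy dog " * 100
new = old.replace(b"lazy", b"sleepy")
d = delta_compress(old, new)
assert delta_decompress(old, d) == new
print(len(new), "->", len(d), "bytes")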

Efficient Web Access

Dual proxy architecture: a pair of proxies located on the two sides of the slow link use a proprietary protocol to increase performance over that link.

  Client  <->  client-side proxy (holds reference)  <-- slow link: request / delta-encoding -->  server-side proxy (holds reference)  <-- fast link: request / page -->  web

Use zdelta to reduce traffic:
 the old version is available at both proxies
 restricted to pages already visited (30% hits), URL-prefix match
 small cache

Cluster-based delta compression

Problem: We wish to compress a group of files F
 Useful on a dynamic collection of web pages, back-ups, …
 Apply pairwise zdelta: find for each f ∈ F a good reference

Reduction to the Min Branching problem on DAGs:
 Build a weighted graph G_F: nodes = files, edge weights = zdelta-sizes
 Insert a dummy node connected to all files, with edge weights equal to the gzip-coding sizes
 Compute the min branching = directed spanning tree of minimum total cost, covering G’s nodes

[Figure: a small example graph with a dummy node 0 and edge weights such as 620, 2000, 220, 123, 20]

           space   time
uncompr    30Mb    ---
tgz        20%     linear
THIS        8%     quadratic
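
This reduction is easy to try out; here is a hedged sketch using networkx (the gzip/zdelta sizes below are made-up numbers, only meant to mirror the figure):

import networkx as nx

files = ["f1", "f2", "f3"]
gzip_cost = {"f1": 620, "f2": 2000, "f3": 220}            # hypothetical sizes
zdelta_cost = {("f1", "f2"): 123, ("f2", "f1"): 123,      # hypothetical sizes
               ("f1", "f3"): 20,  ("f3", "f1"): 20,
               ("f2", "f3"): 220, ("f3", "f2"): 220}

G = nx.DiGraph()
for f in files:                        # dummy node 0 models "compress f with gzip"
    G.add_edge(0, f, weight=gzip_cost[f])
for (u, v), w in zdelta_cost.items():  # edge u->v models "delta-code v against u"
    G.add_edge(u, v, weight=w)

branching = nx.minimum_spanning_arborescence(G)
print(list(branching.edges(data="weight")))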

Improvement   (what about many-to-one compression of a group of files?)

Problem: Constructing G is very costly: n² edge calculations (zdelta executions).
We wish to exploit some pruning approach:

 Collection analysis: cluster the files that appear similar, and are thus good candidates for zdelta-compression; build a sparse weighted graph G’_F containing only the edges between those pairs of files
 Assign weights: estimate appropriate edge weights for G’_F, thus saving zdelta executions. Nonetheless, still quadratic time.

           space    time
uncompr    260Mb    ---
tgz        12%      2 mins
THIS        8%      16 mins

Algoritmi per IR

File Synchronization

File synch: the problem

  Client (has f_old)  --- request --->  Server (has f_new)  --- update --->  Client

 the client wants to update an out-dated file
 the server has the new file but does not know the old file
 update without sending the entire f_new (exploit their similarity)
 rsync: file synch tool, distributed with Linux

Delta compression is a sort of “local” synch, since the server has both copies of the files.

The rsync algorithm

  Client (f_old)  --- hashes of f_old’s blocks --->  Server (f_new)  --- encoded file --->  Client

The rsync algorithm   (contd)

 simple, widely used, single roundtrip
 optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
 choice of block size problematic (default: max{700, √n} bytes)
 not good in theory: granularity of changes may disrupt use of blocks
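
The rolling hash is what makes the server-side scan over f_new linear; a sketch of an adler-32-like weak rolling checksum (not rsync’s exact code) follows:

def rolling_checksums(data: bytes, block: int):
    M = 1 << 16
    a = sum(data[:block]) % M
    b = sum((block - i) * data[i] for i in range(block)) % M
    yield (b << 16) | a
    for i in range(block, len(data)):
        # slide the window by one byte: both sums are updated in O(1)
        a = (a - data[i - block] + data[i]) % M
        b = (b - block * data[i - block] + a) % M
        yield (b << 16) | a

data = b"abcdefghabcdefgh"
print(list(rolling_checksums(data, 4))[:3])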

Rsync: some experiments

           gcc size   emacs size
total      27288      27326
gzip        7563       8577
zdelta       227       1431
rsync        964       4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

 The server sends the hashes (unlike rsync, where the client sends them) and the client checks them
 The server deploys the common f_ref to compress the new f_tar (rsync, instead, just compresses the literals)

A multi-round protocol

 k blocks of n/k elements each, log(n/k) levels
 If the distance is k, then on each level at most k hashes do not find a match in the other file.
 The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation

Problem: Given two sets S_A and S_B of integer values, located on two machines A and B, determine the difference between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size k of the difference, where k may or may not be known in advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]
 ... not perfectly true, but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

 k blocks of n/k elements each, log(n/k) levels
 If the distance is k, then on each level at most k hashes do not find a match in the other file.
 The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts

Pattern P occurs at position i of T
  iff
P is a prefix of the i-th suffix of T (i.e. of T[i,N])

[Figure: P aligned at position i of T, i.e. at the beginning of the suffix T[i,N]]

Occurrences of P in T = all suffixes of T having P as a prefix

Example: P = si, T = mississippi  →  occurrences at positions 4 and 7

SUF(T) = sorted set of suffixes of T

Reduction: from substring search to prefix search (over SUF(T))

The Suffix Tree

T# = mississippi#   (positions 1..12)

[Figure: the suffix tree of T#, i.e. the compacted trie of all its suffixes; edges are labeled with substrings of T# (such as i, si, ssi, ppi#, pi#, i#, #, mississippi#) and each of the 12 leaves is labeled with the starting position of its suffix]

The Suffix Array

Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. The starting position of that range is the lexicographic position of P.

Storing SUF(T) explicitly takes Θ(N²) space; the suffix array keeps only the (sorted) starting positions:

SA    SUF(T)            T = mississippi#
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#

Suffix Array space:
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes
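
A compact sketch in Python of the next slides: the “elegant but inefficient” construction by direct sorting of the suffixes, plus the indirect binary search for a pattern (positions reported 1-based, as in the figures):

def suffix_array(T):
    return sorted(range(len(T)), key=lambda i: T[i:])    # Θ(n² log n) in the worst case

def search(T, SA, P):
    # indirect binary search: each comparison reads O(p) chars of one suffix
    lo, hi = 0, len(SA)
    while lo < hi:                                # leftmost suffix with prefix >= P
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + len(P)] < P: lo = mid + 1
        else: hi = mid
    first = lo
    hi = len(SA)
    while lo < hi:                                # leftmost suffix with prefix > P
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + len(P)] <= P: lo = mid + 1
        else: hi = mid
    return sorted(SA[first:lo])                   # all occurrences of P in T

T = "mississippi#"
SA = suffix_array(T)
print([s + 1 for s in SA])                        # -> [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print([s + 1 for s in search(T, SA, "si")])       # -> [4, 7]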

Searching a pattern

Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step (one to SA, one to the text).

T = mississippi#     P = si
[Figure: two binary-search steps over SA; at each step P is compared with the suffix T[SA[mid],...], and the range halves according to whether P is larger or smaller]

Suffix Array search:
• O(log2 N) binary-search steps
• Each step takes O(p) char comparisons
 Overall, O(p log2 N) time
   improvable to O(p + log2 N)     [Manber-Myers, ’90]
   and to O(p + log2 |S|)          [Cole et al., ’06]

Locating the occurrences

T = mississippi#     P = si     occ = 2

The occurrences form a contiguous range of SA (here the entries 7 and 4, i.e. the suffixes sippi# and sissippi#); the range can be delimited by searching for si# and si$, where # is smaller and $ larger than every character of S.

Suffix Array search:
• O(p + log2 N + occ) time

Suffix Trays: O(p + log2 |S| + occ)    [Cole et al., ‘06]
String B-tree                          [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays           [Ciriani et al., ’02]

Text mining

Lcp[1,N-1] = longest common prefix between suffixes adjacent in SA

Lcp   SA     T = mississippi#
 0    12
 1    11
 1     8
 4     5
 0     2
 0     1
 1    10
 0     9
 2     7
 1     4
 3     6
       3

(e.g. Lcp = 4 between the adjacent suffixes issippi# and ississippi#)

• How long is the common prefix between T[i,...] and T[j,...] ?
  Take the min of the subarray Lcp[h,k-1], where SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  Search for some Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.
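
A small sketch of these three queries (assuming Lcp[i] holds the lcp between the suffixes of rank i and i+1, 0-based, and i ≠ j in the first query):

def lcp_of_positions(SA, Lcp, i, j):
    # lcp between T[i,...] and T[j,...] = min of Lcp over the SA-range between their ranks
    h, k = sorted((SA.index(i), SA.index(j)))
    return min(Lcp[h:k])

def has_repeat_of_length(Lcp, L):
    # is there a substring of length >= L occurring at least twice?
    return any(v >= L for v in Lcp)

def has_substring_occurring(Lcp, L, C):
    # C occurrences of a substring of length >= L = C-1 consecutive Lcp entries >= L
    run = 0
    for v in Lcp:
        run = run + 1 if v >= L else 0
        if run >= C - 1:
            return True
    return False

T, SA, Lcp = "abab#", [4, 2, 0, 3, 1], [0, 2, 0, 1]   # tiny hand-checked example
print(lcp_of_positions(SA, Lcp, 0, 2))     # -> 2 ("abab#" and "ab#" share "ab")
print(has_repeat_of_length(Lcp, 2))        # -> True
print(has_substring_occurring(Lcp, 1, 2))  # -> True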


LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Q(n2 log n) time in the worst-case
• Q(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: the probability that a node has x links is ∝ 1/x^α, α ≈ 2.1

The In-degree distribution

Altavista crawl (1999) and WebBase crawl (2001): the indegree follows a power-law distribution

  Pr[ in-degree(u) = k ] ∝ 1/k^α,  with α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: the probability that a node has x links is ∝ 1/x^α, α ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 million pages, 150 million links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77-scheme provides an efficient, optimal solution




fknown acts as the “previously encoded text”: compress the concatenation fknown·fnew, emitting output only from fnew onwards

zdelta is one of the best implementations
           Emacs size   Emacs time
uncompr    27Mb         ---
gzip       8Mb          35 secs
zdelta     1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: the weighted graph GF with a dummy node 0; edge weights are the zdelta sizes (e.g. 20, 123, 220, 620, 2000), dummy edges weigh the gzip coding, and the min branching picks the cheapest reference for each file.]

           space    time
uncompr    30Mb     ---
tgz        20%      linear
THIS       8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
           space    time
uncompr    260Mb    ---
tgz        12%      2 mins
THIS       8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
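A toy, rsync-flavoured sketch of the block-matching idea (hypothetical helper names; the real protocol uses a 4-byte rolling checksum updated in O(1) per shift and a stronger block hash):

import hashlib

B = 4                                            # toy block size (rsync's default is larger)

def weak(block):                                 # toy "rolling" checksum: byte sum
    return sum(block) & 0xFFFF

def signatures(f_old):                           # what the client sends to the server
    sig = {}
    for idx in range(len(f_old) // B):
        block = f_old[idx * B:(idx + 1) * B]
        sig.setdefault(weak(block), []).append((hashlib.md5(block).hexdigest(), idx))
    return sig

def delta(f_new, sig):                           # what the server sends back
    out, i = [], 0
    while i + B <= len(f_new):
        window = f_new[i:i + B]
        hit = next((idx for h, idx in sig.get(weak(window), [])
                    if h == hashlib.md5(window).hexdigest()), None)
        if hit is not None:
            out.append(('copy', hit)); i += B    # reuse a block of f_old
        else:
            out.append(('lit', f_new[i:i + 1])); i += 1
    out.extend(('lit', f_new[j:j + 1]) for j in range(i, len(f_new)))
    return out

f_old = b"the quick brown fox jumps"
f_new = b"the quick red fox jumps!"
print(delta(f_new, signatures(f_old)))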

Rsync: some experiments

           gcc size   emacs size
total      27288      27326
gzip       7563       8577
zdelta     227        1431
rsync      964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends the hashes (unlike rsync, where the client does), and the client checks them.
The server deploys the common fref to compress the new ftar (rsync compresses the new file on its own).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Figure: the suffix tree of T# = mississippi# (positions 1..12). Edge labels are substrings of T# (e.g. #, i, s, p, si, ssi, ppi#, pi#, i#, mississippi#) and each leaf stores the starting position of its suffix.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

Storing SUF(T) explicitly takes Θ(N²) space; we store only the suffix pointers:

SA     SUF(T)
12     #
11     i#
 8     ippi#
 5     issippi#
 2     ississippi#
 1     mississippi#
10     pi#
 9     ppi#
 7     sippi#
 4     sissippi#
 6     ssippi#
 3     ssissippi#

T = mississippi#      P = si

Suffix Array space:
• SA: Θ(N log₂ N) bits
• Text T: N chars
⇒ In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step.
SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3],   T = mississippi#,   P = si
At each step, compare P with the suffix starting at the middle SA entry: if P is larger, recurse on the right half; if P is smaller, recurse on the left half.

Suffix Array search
• O(log₂ N) binary-search steps
• Each step takes O(p) char comparisons
⇒ overall, O(p log₂ N) time
Improved to O(p + log₂ N) by [Manber-Myers, ’90], and with |Σ| in place of N by [Cole et al, ’06].
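A plain-Python sketch of the indirect binary search (SA stores 1-based positions as in the slides; slicing whole suffixes is wasteful, a real implementation compares at most p characters per step):

def sa_range(T, SA, P):
    # Returns [lo, hi): the SA entries whose suffixes have P as a prefix.
    n, p = len(SA), len(P)
    lo, hi = 0, n
    while lo < hi:                               # leftmost suffix >= P
        mid = (lo + hi) // 2
        if T[SA[mid] - 1:] < P: lo = mid + 1
        else: hi = mid
    left = lo
    lo, hi = left, n
    while lo < hi:                               # leftmost suffix not prefixed by P
        mid = (lo + hi) // 2
        if T[SA[mid] - 1:][:p] <= P: lo = mid + 1
        else: hi = mid
    return left, lo

T = "mississippi#"
SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
lo, hi = sa_range(T, SA, "si")
print([SA[i] for i in range(lo, hi)])            # [7, 4]: the occurrences of "si"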

Locating the occurrences
All suffixes prefixed by P = si are contiguous in SA: here occ = 2, at text positions 7 (sippi…) and 4 (sissippi…).
The range can be delimited by binary-searching the lexicographic positions of si# and si$, assuming # < Σ < $.

Suffix Array search: O(p + log₂ N + occ) time
Suffix Trays: O(p + log₂ |Σ| + occ)   [Cole et al., ‘06]
String B-tree   [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays   [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp = [0, 0, 1, 4, 0, 0, 1, 0, 2, 1, 3]
SA  = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
T   = mississippi#

e.g. the adjacent suffixes issippi# and ississippi# share the prefix issi, so their Lcp entry is 4.
• How long is the common prefix between T[i,...] and T[j,...] ?
  It is the minimum of the subarray Lcp[h,k-1], where SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  Search for an entry Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.
(A small sketch of these queries follows.)
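A minimal sketch of these mining queries (naive Lcp construction, for illustration only):

def lcp_array(T, SA):
    # Lcp[i] = length of the common prefix of the suffixes at SA[i] and SA[i+1] (1-based positions).
    def lcp(a, b):
        k = 0
        while a + k <= len(T) and b + k <= len(T) and T[a + k - 1] == T[b + k - 1]:
            k += 1
        return k
    return [lcp(SA[i], SA[i + 1]) for i in range(len(SA) - 1)]

def has_repeat(Lcp, L):                  # repeated substring of length >= L ?
    return any(v >= L for v in Lcp)

def has_frequent(Lcp, L, C):             # substring of length >= L occurring >= C times ?
    return any(min(Lcp[i:i + C - 1]) >= L for i in range(len(Lcp) - C + 2))

T  = "mississippi#"
SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
Lcp = lcp_array(T, SA)
print(Lcp)                               # the entry 4 corresponds to the repeat "issi"
print(has_repeat(Lcp, 4), has_frequent(Lcp, 3, 2))   # True True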


Slide 69

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
  C · p · f/(1+f)
This is at least 10⁴ · f/(1+f).
If we fetch B ≈ 4Kb in time C, and the algorithm uses all of them:
  (1/B) · (p · f/(1+f) · C)  ≈  30 · f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU (registers) → L1 → L2 → RAM → HD → net
[Memory-hierarchy figure: caches of a few Mbs, some-nanosecond access, few words fetched; RAM of a few Gbs, tens of nanoseconds, some words fetched; disks of a few Tbs, few milliseconds, B = 32K pages; the network, many Tbs, up to seconds, moving packets.]

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray

Goal: Given a stock and its daily (D) performance over time, find the time window in which it achieved the best “market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

           4K    8K    16K    32K    128K   256K   512K   1M
n³ algo    22s   3m    26m    3.5h   28h    --     --     --
n² algo    0     0     0      1s     26s    106s   7m     28m

An optimal solution
We assume every subsum ≠ 0.

[Figure: A is split into a prefix of sum < 0, followed by the optimum window of sum > 0.]
A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm
  sum = 0; max = -1;
  For i = 1,...,n do
    if (sum + A[i] ≤ 0) sum = 0;
    else { sum += A[i]; max = MAX{max, sum}; }

Note:
• Sum < 0 right before OPT starts;
• Sum > 0 within OPT
(A Python rendering follows.)
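The same scan in Python (a standard Kadane-style sweep; it returns the best sum, under the slide's implicit assumption that the optimum is positive):

def max_subarray(A):
    best = cur = 0
    for x in A:
        cur = max(cur + x, 0)            # reset the running sum when it drops below zero
        best = max(best, cur)
    return best

print(max_subarray([2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]))   # 12 = 6+1-2+4+3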

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 10⁹ random I/Os = 10⁹ × 5ms ≈ 2 months

Binary Merge-Sort

Merge-Sort(A,i,j)
  if (i < j) then
    m = (i+j)/2;              // Divide
    Merge-Sort(A,i,m);        // Conquer
    Merge-Sort(A,m+1,j);      // Conquer
    Merge(A,i,m,j)            // Combine

Cost of Mergesort on large data

Take Wikipedia in Italian, compute word frequencies:
  n = 10⁹ tuples ⇒ a few Gbs
  Typical disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:
  It is an indirect sort: Θ(n log₂ n) random I/Os
  [5ms] × n log₂ n ≈ 1.5 years

In practice it is faster, because of caching (and the 2 passes (R/W) per level)...

Merge-Sort Recursion Tree

[Figure: the log₂ N levels of the mergesort recursion tree; pairs of sorted runs (e.g. 1 2 5 10 and 2 7 9 13) are repeatedly merged into longer runs.]

If the run-size is larger than B (i.e. after the first step!!), fetching all of it in memory for merging does not help.

How do we deploy the disk/memory features?
With internal memory M: build N/M runs, each sorted in internal memory (no extra I/Os)
⇒ I/O-cost for merging is ≈ 2 (N/B) log₂ (N/M)

Multi-way Merge-Sort

The key is to balance run-size and #runs to merge.
Sort N items with main memory M and disk pages B:
  Pass 1: Produce (N/M) sorted runs.
  Pass i: merge X ≤ M/B runs  ⇒  log_{M/B}(N/M) passes

[Figure: X input buffers and one output buffer, each of B items, sit in main memory between the input disk and the output disk.]

Multiway Merging

[Figure: buffers Bf1 … Bfx, one per run (Run 1 … Run X = M/B), each with a pointer p1 … pX to its current page; the minimum min(Bf1[p1], Bf2[p2], …, Bfx[pX]) is moved to the output buffer Bfo. A buffer is refilled from disk when its pointer reaches B (Fetch, if pi = B); Bfo is flushed to the merged run on disk when full, until EOF.]

Cost of Multi-way Merge-Sort

Number of passes = log_{M/B} #runs  ≤  log_{M/B}(N/M)
Optimal cost = Θ((N/B) log_{M/B}(N/M)) I/Os

In practice:
  M/B ≈ 1000  ⇒  #passes = log_{M/B}(N/M) ≈ 1
  One multiway merge ⇒ 2 passes = a few minutes
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!
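A tiny in-memory illustration of one merging pass (the heap plays the role of the min(Bf1[p1], …, BfX[pX]) selection; real external sorting streams pages through the buffers instead of holding the runs in RAM):

import heapq

def multiway_merge(runs):
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        val, i, j = heapq.heappop(heap)          # overall minimum among the runs' heads
        out.append(val)
        if j + 1 < len(runs[i]):
            heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
    return out

runs = [[1, 2, 5, 10], [2, 7, 9, 13], [3, 4, 8, 19]]
print(multiway_merge(runs))                      # [1, 2, 2, 3, 4, 5, 7, 8, 9, 10, 13, 19]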

Can compression help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (with a large alphabet Σ).
Math Problem: Find the item y whose frequency is > N/2, using the smallest space
(i.e., find the mode, provided it occurs > N/2 times).

A=b a c c c d c b a a a c c b c c c
.

Algorithm

Use a pair of variables <X, C>, initially C = 0.
For each item s of the stream:
  if (C == 0) { X = s; C = 1; }
  else if (X == s) C++;
  else C--;
Return X;
(A Python version follows.)
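In Python (this is the Boyer-Moore majority-vote scheme; it returns the majority item whenever one really occurs more than N/2 times):

def majority_candidate(stream):
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1
        elif X == s:
            C += 1
        else:
            C -= 1
    return X

A = "bacccdcbaaaccbccc"
print(majority_candidate(A))      # 'c' (it occurs 9 times out of 17)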

Proof
Suppose y occurs > N/2 times but the algorithm returns X ≠ y. Then every occurrence of y has been “cancelled” by a distinct occurrence of some other item (its “negative” mate), so the mates are ≥ #occ(y). As a result N ≥ 2 · #occ(y) > N, a contradiction. (Problems arise only if the most frequent item occurs ≤ N/2 times.)

Toy problem #4: Indexing


Consider the following TREC collection:
  N = 6 · 10⁹ chars  ⇒  size = 6Gb
  n = 10⁶ documents
  TotT = 10⁹ word occurrences (avg term length is 6 chars)
  t = 5 · 10⁵ distinct terms

What kind of data structure should we build to support word-based searches?

Solution 1: Term-Doc matrix   (n = 1 million documents, t = 500K terms)

             Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony              1                1             0          0       0        1
Brutus              1                1             0          1       0        0
Caesar              1                1             0          1       1        1
Calpurnia           0                1             0          0       0        0
Cleopatra           1                0             0          0       0        0
mercy               1                0             1          1       1        1
worser              1                0             1          1       1        0

Entry = 1 if the play contains the word, 0 otherwise.
Space is 500Gb !

Solution 2: Inverted index

Brutus    → 2 4 8 16 32 64 128
Calpurnia → 1 2 3 5 8 13 21 34
Caesar    → 13 16

We can still do better, i.e. 30-50% of the original text:
1. Each posting typically uses about 12 bytes
2. We have 10⁹ total terms ⇒ at least 12Gb space
3. Compressing the 6Gb of documents gets 1.5Gb of data
A better index, but it is still >10 times the text !!!!
(A toy construction follows.)
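A toy in-memory inverted index (hypothetical mini-collection; the doc-ids per term are kept sorted so that Boolean queries reduce to list intersections):

from collections import defaultdict

docs = {1: "antony brutus caesar cleopatra mercy worser",
        2: "antony brutus caesar calpurnia",
        4: "brutus caesar mercy worser",
        5: "caesar mercy worser"}

index = defaultdict(list)
for doc_id in sorted(docs):                      # doc-ids appended in increasing order
    for term in sorted(set(docs[doc_id].split())):
        index[term].append(doc_id)

print(index["brutus"])                                        # [1, 2, 4]
print(sorted(set(index["brutus"]) & set(index["caesar"])))    # AND query -> [1, 2, 4]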

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string is (losslessly) compressed, some other must expand.
Take all messages of length n. Is it possible to compress ALL of them into fewer bits?
NO: there are 2ⁿ of them, but the shorter compressed messages are only

  Σ_{i=1}^{n-1} 2^i = 2ⁿ − 2

We need to talk about stochastic sources.

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the self-information of s is:

  i(s) = log₂ (1/p(s)) = − log₂ p(s)

Lower probability ⇒ higher information.

Entropy is the weighted average of i(s):

  H(S) = Σ_{s∈S} p(s) · log₂ (1/p(s))   bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable-length code in which no codeword is a prefix of another one,
e.g. a = 0, b = 100, c = 101, d = 11.
It can be viewed as a binary trie:
[Figure: binary trie with leaf a under edge 0, leaves b and c under 100 and 101, leaf d under 11.]

Average Length
For a code C with codeword lengths L[s], the average length is defined as

  La(C) = Σ_{s∈S} p(s) · L[s]

We say that a prefix code C is optimal if for all prefix codes C’, La(C) ≤ La(C’).

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal uniquely decodable code, there exists a prefix code with the same codeword lengths and thus the same optimal average length. And vice versa…

Theorem (golden rule). If C is an optimal prefix code for a source with probabilities {p1, …, pn},
then pi < pj ⇒ L[si] ≥ L[sj].

Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any uniquely decodable code C, we have

  H(S) ≤ La(C)

Theorem (upper bound, Shannon). For any probability distribution, there exists a prefix code C such that

  La(C) ≤ H(S) + 1

(The Shannon code assigns ⌈log₂ 1/p(s)⌉ bits to s.)

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

[Figure: Huffman tree. a(.1) and b(.2) are merged into a node of weight .3, which is merged with c(.2) into .5, which is merged with d(.5) into the root (1).]

a = 000, b = 001, c = 01, d = 1
There are 2^(n-1) “equivalent” Huffman trees.

What about ties (and thus, tree depth)?  (A heap-based construction is sketched below.)
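A compact heap-based construction for the running example (ties are broken by insertion order, so the tree — and the 0/1 labels — may differ from the slide's while the codeword lengths are the same):

import heapq
from itertools import count

def huffman(probs):
    tick = count()                                   # tie-breaker for equal probabilities
    heap = [(p, next(tick), sym) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)            # two least-probable nodes
        p2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(tick), (left, right)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):
            walk(node[0], prefix + "0"); walk(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes

print(huffman({"a": .1, "b": .2, "c": .2, "d": .5}))
# codeword lengths: a -> 3, b -> 3, c -> 2, d -> 1 bits, as in the example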

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. Its self-information is

  −log₂(.999) ≈ .00144

If we were to send 1000 such symbols we might hope to use 1000 × .00144 ≈ 1.44 bits.
Using Huffman, we take at least 1 bit per symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 ⇒ 1 extra bit per macro-symbol = 1/k extra bits per symbol
 ⇒ but a larger model has to be transmitted
Shannon took infinite sequences, and k → ∞ !!
In practice, we have:
  The model takes |Σ|^k · (k · log |Σ|) + h² bits (where h might be |Σ|)
  It holds H₀(S^L) ≤ L · H_k(S) + O(k · log |Σ|), for each k ≤ L

Compress + Search ?   [Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the Huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged: in every byte, 1 tag bit marks whether it starts a codeword, and the remaining 7 bits carry the Huffman configuration.

[Figure: the 128-ary word-based Huffman tree for T = “bzip or not bzip”, with byte-aligned tagged codewords assigned to the words bzip, or, not and to the space.]

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary = {bzip, not, or, space};  S = “bzip or not bzip”;  P = bzip, whose codeword is “1a 0b”.

[Figure: scan the compressed text C(S) codeword by codeword, comparing each byte-aligned codeword with the codeword of P: yes on the two occurrences of bzip, no elsewhere.]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the pattern P[1,m] in the text T[1,n].

[Figure: the pattern P = A B slid along the text T = A B C A B D A B.]

 Naïve solution
   For any position i of T, check if T[i,i+m-1] = P[1,m]
   Complexity: O(nm) time

 (Classical) optimal solutions based on comparisons
   Knuth-Morris-Pratt, Boyer-Moore
   Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons

Strings are also numbers, H: strings → numbers.
Let s be a string of length m:

  H(s) = Σ_{i=1}^{m} 2^{m−i} · s[i]

P = 0101  ⇒  H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5
s = s’ if and only if H(s) = H(s’)

Definition: let Tr denote the m-length substring of T starting at position r (i.e., Tr = T[r, r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons

We can compute H(Tr) from H(Tr-1):

  H(Tr) = 2·H(Tr-1) − 2^m·T(r-1) + T(r+m-1)

T = 10110101, m = 4:
  T1 = 1011, T2 = 0110
  H(T1) = H(1011) = 11
  H(T2) = H(0110) = 2·11 − 2⁴·1 + 0 = 22 − 16 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P = 101111, q = 7
H(P) = 47;  Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally (Horner’s rule, reducing mod 7 at each step):
  1·2 (mod 7) + 0 = 2
  2·2 (mod 7) + 1 = 5
  5·2 (mod 7) + 1 = 4
  4·2 (mod 7) + 1 = 2
  2·2 (mod 7) + 1 = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1), since 2^m (mod q) = 2·(2^{m-1} (mod q)) (mod q).
Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
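A small Karp-Rabin sketch on a binary text, along the lines above (q is fixed here instead of being drawn at random, and every fingerprint hit is verified, i.e. the deterministic variant):

def karp_rabin(T, P, q=101):
    n, m = len(T), len(P)
    if m > n: return []
    top = pow(2, m - 1, q)                           # 2^(m-1) mod q
    hp = ht = 0
    for i in range(m):
        hp = (2 * hp + int(P[i])) % q
        ht = (2 * ht + int(T[i])) % q
    hits = []
    for r in range(n - m + 1):
        if ht == hp and T[r:r + m] == P:             # verify, to rule out false matches
            hits.append(r + 1)                       # 1-based position, as in the slides
        if r + m < n:                                # roll: drop T[r], append T[r+m]
            ht = (2 * (ht - int(T[r]) * top) + int(T[r + m])) % q
    return hits

print(karp_rabin("10110101", "0101"))                # [5]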

Problem 1: Solution
Dictionary = {bzip, not, or, space};  S = “bzip or not bzip”;  P = bzip = “1a 0b”.

[Figure: the same scan as in Problem 1 over C(S); the byte-aligned, tagged codewords make it safe to compare P’s codeword directly against the compressed text, and the two occurrences of bzip are reported.]

Speed ≈ Compression ratio

The Shift-And method

Define M to be a binary m-by-n matrix such that:
  M(i,j) = 1 iff the first i characters of P exactly match the i characters of T ending at character j,
  i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1 ... j]

Example: T = california and P = for
[Figure: the 3 × 10 matrix M; M(1,5) = 1 (f ends at position 5), M(2,6) = 1 (fo ends at 6), M(3,7) = 1 (for ends at 7); the starred cells mark the occurrence.]

How does M solve the exact match problem?

How to construct M

We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one: machines can perform bit and arithmetic operations between two words in constant time.
Examples:
  And(A,B) is the bit-wise and between A and B.
  BitShift(A) is the value derived by shifting A’s bits down by one and setting the first bit to 1, e.g. BitShift((0,1,1,0,1)ᵀ) = (1,0,1,1,0)ᵀ.

Let w be the word size (e.g., 32 or 64 bits). We’ll assume m = w. NOTICE: any column of M fits in a memory word.

How to construct M

We compute the j-th column of M from the (j-1)-th one.
We define the m-length binary vector U(x) for each character x in the alphabet: U(x) is set to 1 at the positions where character x appears in P.
Example: P = abaac
  U(a) = (1,0,1,1,0)ᵀ,   U(b) = (0,1,0,0,0)ᵀ,   U(c) = (0,0,0,0,1)ᵀ

How to construct M

Initialize column 0 of M to all zeros.
For j > 0, the j-th column is obtained as

  M(j) = BitShift(M(j-1)) & U(T[j])

For i > 1, entry M(i,j) = 1 iff
  (1) the first i-1 characters of P match the i-1 characters of T ending at character j-1  ⇔  M(i-1,j-1) = 1
  (2) P[i] = T[j]  ⇔  the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) into the i-th position; AND-ing it with the i-th bit of U(T[j]) establishes whether both conditions hold. (A word-parallel sketch follows.)
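A word-parallel sketch in Python, using an integer as the bit-vector for a column of M (bit i-1 stands for row i; positions are 1-based as above):

def shift_and(T, P):
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)                # U(c): 1s at the positions of c in P
    M, goal, occ = 0, 1 << (len(P) - 1), []
    for j, c in enumerate(T, start=1):
        M = ((M << 1) | 1) & U.get(c, 0)             # M(j) = BitShift(M(j-1)) & U(T[j])
        if M & goal:                                 # the m-th bit is set: P ends at j
            occ.append(j - len(P) + 1)
    return occ

print(shift_and("xabxabaaca", "abaac"))              # [5]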

An example, j=1   (P = abaac, T = xabxabaaca)

  M(1) = BitShift(M(0)) & U(T[1]) = (1,0,0,0,0)ᵀ & U(x) = (1,0,0,0,0)ᵀ & (0,0,0,0,0)ᵀ = (0,0,0,0,0)ᵀ

An example, j=2

  M(2) = BitShift(M(1)) & U(T[2]) = (1,0,0,0,0)ᵀ & U(a) = (1,0,0,0,0)ᵀ & (1,0,1,1,0)ᵀ = (1,0,0,0,0)ᵀ

An example, j=3

  M(3) = BitShift(M(2)) & U(T[3]) = (1,1,0,0,0)ᵀ & U(b) = (1,1,0,0,0)ᵀ & (0,1,0,0,0)ᵀ = (0,1,0,0,0)ᵀ

An example, j=9

  M(9) = BitShift(M(8)) & U(T[9]) = (1,1,0,0,1)ᵀ & U(c) = (1,1,0,0,1)ᵀ & (0,0,0,0,1)ᵀ = (0,0,0,0,1)ᵀ

M(5,9) = 1: an occurrence of P = abaac ends at position 9 of T.

Shift-And method: Complexity

  If m ≤ w, any column and any vector U() fit in a memory word  ⇒  any step requires O(1) time.
  If m > w, any column and any vector U() can be divided into ⌈m/w⌉ memory words  ⇒  any step requires O(m/w) time.
  Overall O(n(1 + m/w) + m) time.
  Thus, it is very fast when the pattern length is close to the word size — very often the case in practice; recall that w = 64 bits in modern architectures.

Some simple extensions

We want to allow the pattern to contain special symbols, like the character class [a-f].
P = [a-b]baac  ⇒  U(a) = (1,0,1,1,0)ᵀ,  U(b) = (1,1,0,0,0)ᵀ,  U(c) = (0,0,0,0,1)ᵀ

What about ‘?’ and ‘[^…]’ (negation)?

Problem 1: Another solution
Dictionary = {bzip, not, or, space};  S = “bzip or not bzip”;  P = bzip = “1a 0b”.

[Figure: the same dictionary, codewords and compressed text C(S) as before; the occurrences of P’s codeword are now located by running a bit/byte-parallel matcher over C(S) instead of a codeword-by-codeword comparison.]

Speed ≈ Compression ratio

Problem 2
Dictionary = {bzip, not, or, space};  S = “bzip or not bzip”;  P = o.
Given a pattern P, find all the occurrences in S of all terms containing P as a substring.

[Figure: the terms containing “o” are not and or; their occurrences are marked in C(S).]

not = 1g 0g 0a
or  = 1g 0a 0b

Speed ≈ Compression ratio? No! Why? A scan of C(S) is needed for each term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want to find all the occurrences of those patterns in the text T[1,n].

[Figure: patterns P1 and P2 matched at several positions of the text T.]

 Naïve solution
   Use an (optimal) exact-matching algorithm to search for each pattern of P
   Complexity: O(nl + m) time — not good with many patterns

 Optimal solution due to Aho and Corasick
   Complexity: O(n + l + m) time

A simple extension of Shift-And

  S is the concatenation of the patterns in P
  R is a bitmap of length m, with R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of the Shift-And method searching for S:
  For any symbol c, U’(c) = U(c) AND R, so that U’(c)[i] = 1 iff S[i] = c and it is the first symbol of a pattern
  For any step j:
    compute M(j), then OR it with U’(T[j]). Why? This sets to 1 the first bit of each pattern that starts with T[j]
    check if there are occurrences ending in j. How?

Problem 3
Dictionary = {bzip, not, or, space};  S = “bzip or not bzip”;  P = bot, k = 2.
Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing at most k mismatches.

[Figure: the same dictionary, codewords and compressed text C(S) as before.]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep

Our current goal: given k, find all the occurrences of P in T with up to k mismatches.
We define the matrix M^l to be an m-by-n binary matrix, such that:
  M^l(i,j) = 1 iff there are no more than l mismatches between the first i characters of P and the i characters of T ending at character j.

What is M^0?
How does M^k solve the k-mismatch problem?

Computing M^k

We compute M^l for all l = 0, …, k; for each j we compute M^0(j), M^1(j), …, M^k(j); for all l we initialize M^l(0) to the zero vector.
In order to compute M^l(j), we observe that there is a match with ≤ l mismatches ending at j iff either:

Case 1 — the first i-1 characters of P match a substring of T ending at j-1 with at most l mismatches, and the next pair of characters in P and T are equal:
  BitShift(M^l(j-1)) & U(T[j])

Case 2 — the first i-1 characters of P match a substring of T ending at j-1 with at most l-1 mismatches (so P[i] and T[j] are allowed to differ):
  BitShift(M^{l-1}(j-1))

Hence:
  M^l(j) = [ BitShift(M^l(j-1)) & U(T[j]) ]  OR  BitShift(M^{l-1}(j-1))

Example: M^0 and M^1   (P = abaad, T = xabxabaaca)

         j:  1 2 3 4 5 6 7 8 9 10
M^1   i=1:   1 1 1 1 1 1 1 1 1 1
      i=2:   0 0 1 0 0 1 0 1 1 0
      i=3:   0 0 0 1 0 0 1 0 0 1
      i=4:   0 0 0 0 1 0 0 1 0 0
      i=5:   0 0 0 0 0 0 0 0 1 0

M^0   i=1:   0 1 0 0 1 0 1 1 0 1
      i=2:   0 0 1 0 0 1 0 0 0 0
      i=3:   0 0 0 0 0 0 1 0 0 0
      i=4:   0 0 0 0 0 0 0 1 0 0
      i=5:   0 0 0 0 0 0 0 0 0 0

M^1(5,9) = 1: P occurs at T[5,9] = abaac with one mismatch.

How much do we pay?

  The running time is O(k · n · (1 + m/w)).
  Again, the method is practically efficient for small m.
  Only O(k) columns of M are needed at any given time; hence, the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary = {bzip, not, or, space};  S = “bzip or not bzip”;  P = bot, k = 2.

[Figure: the same dictionary, codewords and compressed text C(S) as before; the term not (codeword 1g 0g 0a) contains P = bot within k = 2 mismatches, and its occurrences in C(S) are reported.]

Agrep: more sophisticated operations

The Shift-And method can solve other ops.

The edit distance between two strings p and s is d(p,s) = the minimum number of operations needed to transform p into s via three ops:
  Insertion: insert a symbol in p
  Deletion: delete a symbol from p
  Substitution: change a symbol of p into a different one

Example: d(ananas,banane) = 3  (a DP sketch follows)
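For reference, the textbook dynamic program for this edit distance (not the bit-parallel Shift-And variant):

def edit_distance(p, s):
    m, n = len(p), len(s)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1): D[i][0] = i
    for j in range(n + 1): D[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(D[i - 1][j] + 1,                           # deletion
                          D[i][j - 1] + 1,                           # insertion
                          D[i - 1][j - 1] + (p[i - 1] != s[j - 1]))  # substitution
    return D[m][n]

print(edit_distance("ananas", "banane"))   # 3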

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding

  γ(x) = 0^(ℓ-1) followed by the binary representation of x, where x > 0 and ℓ = ⌊log₂ x⌋ + 1
  e.g., 9 is represented as <000, 1001>.

  The γ-code for x takes 2⌊log₂ x⌋ + 1 bits  (i.e. a factor of 2 from optimal)
  It is optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…

Given the following sequence of γ-coded integers, reconstruct the original sequence:

  0001000001100110000011101100111   →   8, 6, 3, 59, 7

(a tiny encoder/decoder follows)
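A few lines of Python make the exercise mechanical (a minimal γ encoder/decoder, not tied to any particular library):

def gamma_encode(x):                       # x > 0
    b = bin(x)[2:]                         # binary representation, len(b) = floor(log2 x) + 1
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":              # count the leading zeroes
            z += 1; i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out

print(gamma_encode(9))                                     # 0001001
print(gamma_decode("0001000001100110000011101100111"))     # [8, 6, 3, 59, 7]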

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).

Recall that: |γ(i)| ≤ 2·log i + 1
How good is this approach wrt Huffman?  Compression ratio ≤ 2·H₀(s) + 1
Key fact:
  1 ≥ Σ_{i=1,...,x} pi ≥ x·px   ⇒   x ≤ 1/px

How good is it?
The cost of the encoding is (recall i ≤ 1/pi):

  Σ_{i=1,…,|Σ|} pi · |γ(i)|  ≤  Σ_{i=1,…,|Σ|} pi · [2·log(1/pi) + 1]  =  2·H₀(X) + 1

Not much worse than Huffman, and improvable to H₀(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
The distribution of words is skewed: 1/i^θ, where 1 < θ < 2

A new concept: Continuers vs Stoppers
  Previously we used: s = c = 128
The main idea is:
  s + c = 256 (we are playing with 8 bits)
  Thus s items are encoded with 1 byte
  And s·c with 2 bytes, s·c² with 3 bytes, ...

An example
  5000 distinct words
  ETDC encodes 128 + 128² = 16512 words on ≤ 2 bytes
  The (230,26)-dense code encodes 230 + 230·26 = 6210 words on ≤ 2 bytes, hence more on 1 byte, and thus it wins if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L
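A minimal MTF encoder/decoder (0-based positions; applied below to the BWT output of mississippi#, just as Bzip does before the run-length and statistical stages):

def mtf_encode(s, alphabet):
    L, out = list(alphabet), []
    for c in s:
        i = L.index(c)
        out.append(i)
        L.insert(0, L.pop(i))              # move the symbol to the front
    return out

def mtf_decode(codes, alphabet):
    L, out = list(alphabet), []
    for i in codes:
        c = L[i]
        out.append(c)
        L.insert(0, L.pop(i))
    return ''.join(out)

L_bwt = "ipssm#pissii"
codes = mtf_encode(L_bwt, sorted(set(L_bwt)))
print(codes)                               # [1, 3, 4, 0, 4, 4, 3, 4, 4, 0, 1, 0]
print(mtf_decode(codes, sorted(set(L_bwt))) == L_bwt)      # True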

There is a memory: MTF exploits temporal locality, and it is a dynamic code.

  X = 1ⁿ 2ⁿ 3ⁿ … nⁿ  ⇒  Huff = O(n² log n) bits, MTF = O(n log n) + n² bits

Not much worse than Huffman ...but it may be far better.

MTF: how good is it ?
Encode the integers via γ-coding:  |γ(i)| ≤ 2·log i + 1
Put Σ at the front of the list; denoting by p_{x,i} the position of the i-th occurrence of symbol x, the cost of encoding is

  O(|Σ| log |Σ|) + Σ_{x=1..|Σ|} Σ_{i=2..n_x} γ(p_{x,i} − p_{x,i-1})

By Jensen’s inequality:

  ≤ O(|Σ| log |Σ|) + Σ_{x=1..|Σ|} n_x · [2·log(N/n_x) + 1]
  = O(|Σ| log |Σ|) + N·[2·H₀(X) + 1]

Hence  La[mtf] ≤ 2·H₀(X) + O(1).

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
  abbbaacccca ⇒ (a,1),(b,3),(a,2),(c,4),(a,1)
In the case of binary strings ⇒ just the run lengths and one bit.
Properties: there is a memory; RLE exploits spatial locality, and it is a dynamic code.
  X = 1ⁿ 2ⁿ 3ⁿ … nⁿ  ⇒  Huff(X) = Θ(n² log n) > Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol an interval range in [0,1), based on the cumulative probability

  f(i) = Σ_{j<i} p(j)

e.g. with p(a) = .2, p(b) = .5, p(c) = .3:  f(a) = .0, f(b) = .2, f(c) = .7,
so a ↦ [0, .2), b ↦ [.2, .7), c ↦ [.7, 1.0).

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7)).

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
  start:  [0, 1)
  b  ⇒  [.2, .7)
  a  ⇒  [.2, .3)
  c  ⇒  [.27, .3)
The final sequence interval is [.27, .3).

Arithmetic Coding
To code a sequence of symbols c₁…cₙ with probabilities p[c], use the following:

  l₀ = 0,  s₀ = 1
  lᵢ = lᵢ₋₁ + sᵢ₋₁ · f[cᵢ]
  sᵢ = sᵢ₋₁ · p[cᵢ]

f[c] is the cumulative probability up to symbol c (not included).
The final interval size is  sₙ = Π_{i=1..n} p[cᵢ].
The interval for a message sequence will be called the sequence interval.

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval
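A toy arithmetic coder on exact fractions (no integer scaling or rounding; it just exposes the interval arithmetic of the formulas above):

from fractions import Fraction

P = {"a": Fraction(2, 10), "b": Fraction(5, 10), "c": Fraction(3, 10)}
F = {"a": Fraction(0), "b": Fraction(2, 10), "c": Fraction(7, 10)}   # cumulative probabilities

def encode(msg):
    l, s = Fraction(0), Fraction(1)
    for c in msg:
        l, s = l + s * F[c], s * P[c]
    return l, s                            # the sequence interval [l, l+s)

def decode(x, length):
    msg, l, s = [], Fraction(0), Fraction(1)
    for _ in range(length):
        for c in "abc":                    # find the symbol interval containing x
            lo, hi = l + s * F[c], l + s * (F[c] + P[c])
            if lo <= x < hi:
                msg.append(c); l, s = lo, s * P[c]; break
    return ''.join(msg)

l, s = encode("bac")
print(float(l), float(l + s))              # 0.27 0.3
print(decode(Fraction(49, 100), 3))        # bbc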

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:
  .49 ∈ [.2, .7)     ⇒  first symbol  b   (new interval [.2, .7))
  .49 ∈ [.3, .55)    ⇒  second symbol b   (new interval [.3, .55))
  .49 ∈ [.475, .55)  ⇒  third symbol  c
The message is bbc.

Representing a real number
Binary fractional representation:
  .75 = .11,   1/3 = .010101…,   11/16 = .1011

Algorithm (emit the bits of x ∈ [0,1)):
  1. x = 2·x
  2. If x < 1 output 0
  3. else x = x − 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
e.g. [0,.33) → .01,   [.33,.66) → .1,   [.66,1) → .11

Representing a code interval
We can view binary fractional numbers as intervals by considering all completions:

  code    min      max      interval
  .11     .110     .111     [.75, 1.0)
  .101    .1010    .1011    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval is contained in the sequence interval (a dyadic number).

[Figure: the sequence interval [.61, .79) contains the code interval [.625, .75) of the number .101.]

Can use L + s/2 truncated to 1 + ⌈log (1/s)⌉ bits.

Bound on Arithmetic length: note that −log s + 1 = log (2/s).

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most
  1 + ⌈log (1/s)⌉ = 1 + ⌈log Π_i (1/pᵢ)⌉
  ≤ 2 + Σ_{i=1..n} log (1/pᵢ)
  = 2 + Σ_{k=1..|Σ|} n·p_k·log (1/p_k)
  = 2 + n·H₀   bits

In practice ≈ n·H₀ + 0.02·n bits, because of rounding.

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
Context
Empty

Counts
A
B
C
$

=
=
=
=

4
2
5
3

Context
A
B
C

Counts
C
$
A
$
A
B
C
$

=
=
=
=
=
=
=
=

3
1
2
1
1
2
2
3

Context
AC

BA
CA
CB
CC

String = ACCBACCACBA B

k=2

Counts
B
C
$
C
$
C
$
A
$
A
B
$

=
=
=
=
=
=
=
=
=
=
=
=

1
2
2
1
1
1
1
2
1
1
1
2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
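The decoding loop, written out in Python (copies are done one character at a time, so the overlapping case len > d works for free):

def lz77_decode(triples):
    out = []
    for d, length, c in triples:
        start = len(out) - d
        for i in range(length):
            out.append(out[start + i])     # may read characters appended during this same copy
        out.append(c)
    return ''.join(out)

print(lz77_decode([(0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), (3, 3, 'a'), (1, 2, 'c')]))
# aacaacabcabaaac  (the windowed example above)
print(lz77_decode([(0, 0, 'a'), (0, 0, 'b'), (0, 0, 'c'), (0, 0, 'd'), (2, 9, 'e')]))
# abcdcdcdcdcdce   (the overlapping case)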

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later
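A compact LZW sketch on the 3-letter alphabet of the example (codes 112-114 for a, b, c as in the slides, new entries from 256); the decoder handles the SSc corner case by synthesizing w + w[0]:

def lzw_encode(s, first=256):
    table = {"a": 112, "b": 113, "c": 114}
    out, w = [], ""
    for ch in s:
        if w + ch in table:
            w += ch
        else:
            out.append(table[w])
            table[w + ch] = first; first += 1
            w = ch
    out.append(table[w])
    return out

def lzw_decode(codes, first=256):
    table = {112: "a", 113: "b", 114: "c"}
    w = table[codes[0]]
    out = [w]
    for code in codes[1:]:
        entry = table[code] if code in table else w + w[0]   # the "one step behind" case
        out.append(entry)
        table[first] = w + entry[0]; first += 1
        w = entry
    return ''.join(out)

codes = lzw_encode("aabaacababacb")
print(codes)                                   # [112, 112, 113, 256, 114, 257, 261, 114, 113]
print(lzw_decode(codes) == "aabaacababacb")    # True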

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Q(n2 log n) time in the worst-case
• Q(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)
 V = URLs, E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN & no OUT)

Three key properties:
 Skewed distribution: prob. that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution

Altavista crawl (1999) and WebBase crawl (2001):
In-degree follows a power-law distribution

Pr[ in-degree(u) = k ] ∝ 1/k^α,  with α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)
 V = URLs, E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN, no OUT)

Three key properties:
 Skewed distribution: prob. that a node has x links is 1/x^α, α ≈ 2.1
 Locality: usually most of the hyperlinks point to other URLs on the same host (about 80%)
 Similarity: pages close in lexicographic order tend to share many successors (similar adjacency lists)

A Picture of the Web Graph
(Figure: adjacency-matrix plot, row i vs column j, after URL-sorting; 21 million pages, 150 million links from a crawl of the Berkeley and Stanford domains.)

URL compression + Delta encoding

The WebGraph library
From the uncompressed adjacency list to an adjacency list with compressed gaps (exploiting locality):

Successor list S(x) = {s1 - x, s2 - s1 - 1, ..., sk - s(k-1) - 1}

For negative entries (only the first gap s1 - x can be negative): map them to non-negative integers, as done for the residuals below (2x if positive, 2|x|-1 if negative).
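A minimal sketch of this gap encoding for one sorted successor list; the signed mapping for the first entry follows the slides’ examples, while the γ/ζ coding of the resulting integers is omitted. Names are illustrative.

```cpp
#include <vector>

// Map a possibly-negative integer to a non-negative one:
// 2x for x >= 0, 2|x|-1 for x < 0 (the mapping suggested by the slides).
unsigned int_to_nat(int x) { return x >= 0 ? 2u * (unsigned)x : 2u * (unsigned)(-x) - 1; }

// Gap-encode the sorted successor list of node x (locality => small gaps).
std::vector<unsigned> encode_gaps(int x, const std::vector<int>& succ) {
    std::vector<unsigned> out;
    if (succ.empty()) return out;
    out.push_back(int_to_nat(succ[0] - x));            // first gap may be negative
    for (size_t i = 1; i < succ.size(); i++)
        out.push_back((unsigned)(succ[i] - succ[i - 1] - 1));  // later gaps are >= 0
    return out;
}
```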

Copy-lists (exploiting similarity)

Reference chains, possibly limited in length.
From the uncompressed adjacency list to an adjacency list with copy lists:
 The copy list of y has one bit per successor of the reference x, telling whether that successor is also a successor of y;
 The reference index is chosen in [0,W] so as to give the best compression.

Copy-blocks = RLE(Copy-list)
From the adjacency list with copy lists to an adjacency list with copy blocks (RLE on the bit sequences):
 The first copy block is 0 if the copy list starts with 0;
 The last block is omitted (we know the total length…);
 The length is decremented by one for all blocks but the first.

WebGraph is available as a Java and C++ library (≈3 bits/edge).

Extra-nodes: Compressing Intervals
From the adjacency list with copy blocks, exploit consecutivity in the extra-nodes:
 Intervals: encoded with their left extreme and length
 Interval length: decremented by Lmin = 2
 Residuals: encoded as differences between consecutive residuals, or w.r.t. the source node

Examples from the slide’s figure:
 0 = (15-15)*2        (positive)
 2 = (23-19)-2        (jump >= 2)
 600 = (316-16)*2
 3 = |13-15|*2-1      (negative)
 3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
(Figure: a sender transmits data to a receiver; the receiver already holds some knowledge about that data.)

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques

 Caching: “avoid sending the same object again”
  Done on the basis of whole objects
  Only works if objects are completely unchanged
  What about objects that are only slightly changed?

 Compression: “remove redundancy in transmitted data”
  Avoid repeated substrings in the data
  Can be extended to the history of past transmissions (at some overhead)
  What if the sender has never seen the data held at the receiver?

Types of Techniques

 Common knowledge between sender & receiver
  Unstructured file: delta compression

 “Partial” knowledge
  Unstructured files: file synchronization
  Record-based data: set reconciliation

Formalization

 Delta compression   [diff, zdelta, REBL,…]
  Compress file f deploying file f’
  Compress a group of files
  Speed up web access by sending differences between the requested page and the ones available in cache

 File synchronization   [rsynch, zsync]
  Client updates old file f_old with f_new available on a server
  Mirroring, Shared Crawling, Content Distribution Networks

 Set reconciliation
  Client updates a structured old file f_old with f_new available on a server
  Update of contacts or appointments, intersect inverted lists in a P2P search engine

Z-delta compression (one-to-one)

Problem: We have two files f_known and f_new, and the goal is to compute a file f_d of minimum size such that f_new can be derived from f_known and f_d.

 Assume that block moves and copies are allowed
 Find an optimal covering set of f_new based on f_known
 The LZ77-scheme provides an efficient, optimal solution:
  f_known is the “previously encoded text”: compress the concatenation f_known·f_new, emitting output only from f_new onwards

zdelta is one of the best implementations
           Emacs size   Emacs time
uncompr    27 MB        ---
gzip       8 MB         35 secs
zdelta     1.5 MB       42 secs
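The same idea can be approximated with an off-the-shelf LZ77 coder by feeding f_known as a preset dictionary. This is only a sketch of the principle, not zdelta itself: zlib’s 32 KB window limits how much of f_known is actually exploited, and error handling is omitted.

```cpp
#include <string>
#include <vector>
#include <zlib.h>

// Delta-compress fnew against fknown by presenting (a suffix of) fknown as a
// preset dictionary to deflate. The decoder must call inflateSetDictionary
// with the same bytes before inflating.
std::vector<unsigned char> delta_compress(const std::string& fknown,
                                          const std::string& fnew) {
    z_stream zs{};                                   // zero-initialized stream
    deflateInit(&zs, Z_BEST_COMPRESSION);
    // Only the last 32 KB of the dictionary fit in deflate's window.
    size_t off = fknown.size() > 32768 ? fknown.size() - 32768 : 0;
    deflateSetDictionary(&zs, (const Bytef*)fknown.data() + off,
                         (uInt)(fknown.size() - off));
    std::vector<unsigned char> out(deflateBound(&zs, fnew.size()));
    zs.next_in   = (Bytef*)fnew.data();
    zs.avail_in  = (uInt)fnew.size();
    zs.next_out  = out.data();
    zs.avail_out = (uInt)out.size();
    deflate(&zs, Z_FINISH);                          // single-shot compression
    out.resize(zs.total_out);
    deflateEnd(&zs);
    return out;
}
```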

Efficient Web Access
Dual proxy architecture: a pair of proxies located on each side of the slow link use a proprietary protocol to increase performance over this link.

(Figure: Client ⇄ proxy across the slow link, exchanging requests, reference pages and delta-encodings; proxy ⇄ web across the fast link, fetching the requested page.)

Use zdelta to reduce traffic:
 Old version available at both proxies
 Restricted to pages already visited (30% hits), URL-prefix match
 Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F
 Useful on a dynamic collection of web pages, back-ups, …
 Apply pairwise zdelta: find for each f ∈ F a good reference

Reduction to the Min Branching problem on directed graphs:
 Build a weighted graph GF: nodes = files, edge weights = zdelta-sizes
 Insert a dummy node connected to all files, whose edge weights are the gzip-coding sizes
 Compute the min branching = directed spanning tree of minimum total cost covering GF’s nodes (a construction sketch follows)
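A minimal sketch of the graph construction only. The helpers gzip_size and zdelta_size and the downstream min-branching solver (e.g. Edmonds’ algorithm) are assumed to exist and are purely illustrative; they are declared but not defined here.

```cpp
#include <string>
#include <vector>

// Assumed helpers (illustrative only): compressed sizes in bytes.
size_t gzip_size(const std::string& f);                            // |gzip(f)|
size_t zdelta_size(const std::string& ref, const std::string& f);  // |zdelta(f | ref)|

// Build the complete weighted graph G_F with a dummy node 0.
// w[i][j] = cost of encoding file j-1 using file i-1 as reference
// (row 0: encode with plain gzip). A min-branching solver rooted at
// node 0 then picks the cheapest reference for every file.
std::vector<std::vector<size_t>> build_GF(const std::vector<std::string>& files) {
    size_t n = files.size();
    std::vector<std::vector<size_t>> w(n + 1, std::vector<size_t>(n + 1, 0));
    for (size_t j = 1; j <= n; j++) {
        w[0][j] = gzip_size(files[j - 1]);                  // dummy -> file j
        for (size_t i = 1; i <= n; i++)
            if (i != j) w[i][j] = zdelta_size(files[i - 1], files[j - 1]);
    }
    return w;   // feed to a min-branching (optimum arborescence) algorithm
}
```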

(Figure: a small example graph; edge weights are zdelta sizes between files, and the dummy node’s edges carry the gzip sizes.)
           space   time
uncompr    30 MB   ---
tgz        20%     linear
THIS       8%      quadratic

Improvement: what about many-to-one compression (a group of files)?

Problem: Constructing G is very costly, n² edge calculations (zdelta executions)
 We wish to exploit some pruning approach
 Collection analysis: cluster the files that appear similar and are thus good candidates for zdelta-compression; build a sparse weighted graph G’F containing only edges between those pairs of files
 Assign weights: estimate appropriate edge weights for G’F, thus saving zdelta executions. Nonetheless, still Θ(n²) time
           space    time
uncompr    260 MB   ---
tgz        12%      2 mins
THIS       8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
(Figure: the Client holds f_old and sends a request to the Server, which holds f_new and sends back an update.)

 Client wants to update an out-dated file
 Server has the new file but does not know the old file
 Update without sending the entire f_new (exploiting similarity)
 rsync: file-synch tool, distributed with Linux

Delta compression is a sort of “local” synch, since the server has both copies of the files.

The rsync algorithm
(Figure: the Client splits f_old into blocks and sends their hashes; the Server, holding f_new, sends back the encoded file as copied-block references plus literals.)

The rsync algorithm (contd)

 Simple, widely used, single roundtrip
 Optimizations: 4-byte rolling hash + 2-byte MD5, gzip for the literals
 Choice of block size is problematic (default: max{700, √n} bytes)
 Not good in theory: the granularity of changes may disrupt the use of blocks
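A sketch of a weak rolling checksum in the spirit of rsync’s (the exact constants, the 4-byte packing and the pairing with the strong MD5 hash differ in the real tool):

```cpp
#include <cstdint>

// Weak rolling checksum over a window of k bytes:
// a = sum of bytes, b = sum of position-weighted bytes; both mod 2^16.
struct RollingHash {
    uint32_t a = 0, b = 0;
    size_t k = 0;
    void init(const unsigned char* buf, size_t k_) {
        a = b = 0; k = k_;
        for (size_t i = 0; i < k; i++) {
            a = (a + buf[i]) & 0xffff;
            b = (b + (uint32_t)(k - i) * buf[i]) & 0xffff;
        }
    }
    // Slide the window one byte to the right: drop 'out', append 'in'.
    // O(1) per step, which is what makes scanning every offset of f_new feasible.
    void roll(unsigned char out, unsigned char in) {
        a = (a - out + in) & 0xffff;
        b = (b - (uint32_t)k * out + a) & 0xffff;
    }
    uint32_t digest() const { return (b << 16) | a; }
};
```

When the weak digest of the current window matches a block hash of f_old, the (assumed) strong hash is checked before emitting a copy instruction; otherwise a literal byte is emitted and the window rolls forward.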

Rsync: some experiments

          gcc size   emacs size
total     27288      27326
gzip      7563       8577
zdelta    227        1431
rsync     964        4452

Compressed sizes in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

 The server sends the hashes (unlike rsync, where the client does), and the client checks them
 The server deploys the common f_ref to compress the new f_tar (rsync instead compresses just f_tar on its own)

A multi-round protocol
 k blocks of n/k elements each, log(n/k) levels
 If the distance is k, then on each level at most k hashes do not find a match in the other file
 The communication complexity is O(k log n log(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located on two machines A and B, determine the difference between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size k of the difference, where k may or may not be known in advance to the parties.

Note:
 Set reconciliation is “easier” than file sync [it is record-based]
 Not perfectly true but...

Recurring minimum for improving the estimate + 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?
 A trie !!
 Or a sorted array of string pointers !! (binary search; see the sketch below)
What about Substring Search ?
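Before turning to substring search, a minimal sketch of the sorted-array solution to prefix search (names illustrative): binary search for the range of dictionary strings having P as a prefix.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Return the [first, last) range of the sorted dictionary whose strings
// have P as a prefix. O(|P| log n) character comparisons overall.
std::pair<size_t, size_t> prefix_range(const std::vector<std::string>& dict,
                                       const std::string& P) {
    auto lo = std::lower_bound(dict.begin(), dict.end(), P);
    // The upper end is the first string whose |P|-char prefix already exceeds P.
    auto hi = std::upper_bound(lo, dict.end(), P,
        [](const std::string& p, const std::string& s) {
            return s.compare(0, p.size(), p) > 0;
        });
    return {(size_t)(lo - dict.begin()), (size_t)(hi - dict.begin())};
}
```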

Basic notation and facts
Pattern P occurs at position i of T
 iff
P is a prefix of the i-th suffix of T (i.e. of T[i,N])

Occurrences of P in T = all suffixes of T having P as a prefix
Example: P = si, T = mississippi  =>  occurrences at positions 4, 7

SUF(T) = sorted set of suffixes of T

Reduction: from substring search to prefix search

The Suffix Tree
(Figure: the suffix tree of T# = mississippi#, with edges labeled by substrings of T# — e.g. i, s, si, ssi, p, pi#, ppi#, i#, #, mississippi# — and leaves labeled with the starting positions 1..12 of the corresponding suffixes.)

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position in SA is the lexicographic position of P.

Storing SUF(T) explicitly takes Θ(N²) space; the suffix array SA stores only the suffix pointers.

T = mississippi#,  P = si

SA   SUF(T)
12   #
11   i#
 8   ippi#
 5   issippi#
 2   ississippi#
 1   mississippi#
10   pi#
 9   ppi#
 7   sippi#
 4   sissippi#
 6   ssippi#
 3   ssissippi#

Suffix Array space:
• SA: Θ(N log₂ N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 memory accesses per step (one to SA, one to T).

(Figure: two binary-search steps for P = si on the SA of T = mississippi#; at each step P is compared with the suffix pointed to by the middle SA entry, and the search continues in the half where P is larger or smaller.)

Suffix Array search
• O(log₂ N) binary-search steps
• Each step takes O(p) char comparisons
 Overall, O(p log₂ N) time
 Improvable to O(p + log₂ N) [Manber-Myers, ’90] and to O(p + log₂ |Σ|) [Cole et al., ’06]
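A minimal sketch of the O(p log N) search (0-based indices, not the slides’ code): it returns the contiguous range of SA rows whose suffixes start with P.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Binary search on the suffix array: O(|P| log N) character comparisons.
// Returns [lo, hi) = rows of sa whose suffixes have P as a prefix.
std::pair<int, int> sa_search(const std::string& T, const std::vector<int>& sa,
                              const std::string& P) {
    // First row whose suffix is not lexicographically smaller than P.
    auto lo = std::partition_point(sa.begin(), sa.end(),
        [&](int pos) { return T.compare(pos, P.size(), P) < 0; });
    // First row after lo whose suffix no longer has P as a prefix.
    auto hi = std::partition_point(lo, sa.end(),
        [&](int pos) { return T.compare(pos, P.size(), P) == 0; });
    return {(int)(lo - sa.begin()), (int)(hi - sa.begin())};
}
```

The number of occurrences is hi - lo, and the positions themselves are sa[lo], ..., sa[hi-1].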

Locating the occurrences
(Figure: on the SA of T = mississippi#, the occ = 2 suffixes prefixed by si — sippi... and sissippi..., i.e. positions 7 and 4 — form a contiguous range; the range is delimited by binary-searching si# and si$, with # < Σ < $, and the occurrences are reported by scanning it.)

Suffix Array search: O(p + log₂ N + occ) time

Suffix Trays: O(p + log₂ |Σ| + occ)   [Cole et al., ’06]
String B-tree                          [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays           [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest common prefix between suffixes adjacent in SA

T = mississippi#
SA  = 12 11  8  5  2  1 10  9  7  4  6  3
Lcp =     0  0  1  4  0  0  1  0  2  1  3

(e.g. Lcp = 4 between the adjacent suffixes issippi... and ississippi..., starting at positions 5 and 2)

• How long is the common prefix between T[i,...] and T[j,...] ?
  Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  Search for some Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.
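A sketch of how Lcp can be computed in O(N) time from T and SA (Kasai et al.’s algorithm, 0-based indices; lcp[i] is the lcp of the suffixes at SA[i-1] and SA[i], with lcp[0] = 0):

```cpp
#include <string>
#include <vector>

// Kasai et al.: O(N)-time LCP construction from T and its suffix array.
std::vector<int> build_lcp(const std::string& T, const std::vector<int>& sa) {
    int n = (int)T.size();
    std::vector<int> rank(n), lcp(n, 0);
    for (int i = 0; i < n; i++) rank[sa[i]] = i;   // inverse of SA
    int h = 0;                                     // current lcp value
    for (int i = 0; i < n; i++) {                  // process suffixes of T left to right
        if (rank[i] > 0) {
            int j = sa[rank[i] - 1];               // suffix preceding T[i..] in SA order
            while (i + h < n && j + h < n && T[i + h] == T[j + h]) h++;
            lcp[rank[i]] = h;
            if (h > 0) h--;                        // lcp can drop by at most 1
        } else {
            h = 0;
        }
    }
    return lcp;
}
// The longest repeated substring of T has length max_i lcp[i];
// the third question above becomes a sliding-window minimum over lcp.
```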


Slide 70

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with i-th bit of U(T[j]) to estabilish if both are true

An example j=1
1 2 3 4 5 6 7 8 9 10

n
m

1

12345

1

0

P=abaac

2

0

3

0

4

0

5

0

T=xabxabaaca

0
 
0
U (x)  0
 
0
 
0

2

3

4

5

6

7

1 
 
0
BitShift ( M ( 0 )) & U (T [1])   0  &
 
0
 
0

8

9

0 0
   
0 0
0  0
   
0 0
   
0 0

1
0

An example j=2
1 2 3 4 5 6 7 8 9 10

n
m

1

2

12345

1

0

1

P=abaac

2

0

0

3

0

0

4

0

0

5

0

0

T=xabxabaaca

1 
 
0
U (a )   1 
 
1 
 
0

3

4

5

6

7

1 
 
0
BitShift ( M (1)) & U (T [ 2 ])   0  &
 
0
 
0

8

9

1  1 
   
0 0
1    0 
   
1   0 
   
0 0

1
0

An example j=3
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

12345

1

0

1

0

P=abaac

2

0

0

1

3

0

0

0

4

0

0

0

5

0

0

0

T=xabxabaaca

0
 
1 
U (b )   0 
 
0
 
0

4

5

6

7

1 
 
1 
BitShift ( M ( 2 )) & U (T [ 3 ])   0  &
 
0
 
0

8

9

0 0
   
1  1 
0  0
   
0 0
   
0 0

1
0

An example j=9
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

4

5

6

7

8

9

12345

1

0

1

0

0

1

0

1

1

0

P=abaac

2

0

0

1

0

0

1

0

0

0

3

0

0

0

0

0

0

1

0

0

4

0

0

0

0

0

0

0

1

0

5

0

0

0

0

0

0

0

0

1

T=xabxabaaca

0
 
0
U (c )   0 
 
0
 
1 

1 
 
1 
BitShift ( M ( 8 )) & U (T [ 9 ])   0  &
 
0
 
1 

0 0
   
0 0
0  0
   
0 0
   
1  1 

1
0

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M ( j - 1))  U (T [ j ])
l

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M

l -1

( j - 1))

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1
1 2 3 4 5 6 7 8 910

M1=

T=xabxabaaca
P=
abaad

1

2

3

4

5

6

7

8

9

1
0

1

1

1

1

1

1

1

1

1

1

1

2

0

0

1

0

0

1

0

1

1

0

3

0

0

0

1

0

0

1

0

0

1

4

0

0

0

0

1

0

0

1

0

0

5

0

0

0

0

0

0

0

0

1

0

M0=

1

2

3

4

5

6

7

8

9

1
0

1

0

1

0

0

1

0

1

1

0

1

2

0

0

1

0

0

1

0

0

0

0

3

0

0

0

0

0

0

1

0

0

0

4

0

0

0

0

0

0

0

1

0

0

5

0

0

0

0

0

0

0

0

0

0

How much do we pay?





The running time is O(kn(1+m/w)
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Delection: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
0 000 ...........0 x in b in ary
Length-1


x > 0 and Length = log2 x +1
e.g., 9 represented as <000,1001>.



g-code for x takes 2 log2 x +1 bits

(ie. factor of 2 from optimal)



Optimal for Pr(x) = 1/2x2, and i.i.d integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8

6

3

59

7

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How much good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
1 ≥Si=1,...,x pi ≥ x * px  x ≤ 1/px

How good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):



p i  | g ( i ) |

i  1 ,..., S

This is:



i  1 ,.., S

p i  [ 2 * log

1

 1]

pi

 2 * H 0 ( X ) 1
No much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1n 2n 3n… nn  Huff = O(n2 log n), MTF = O(n log n) + n2

No much worse than Huffman
...but it may be far better

MTF: how good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
Put S in the front and consider the cost of encoding:
S

O ( S log S ) 



nx

g(p - p )

x 1 i  2

x

x

i

i -1

S

By Jensen’s:

 O ( S log S ) 

n
x 1

x

[ 2 * log

N

 1]

nx

 O ( S log S )  N * [ 2 * H 0 ( X )  1]
L a [ mtf ]  2 * H 0 ( X )  O (1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1n 2n 3n… nn 

There is a memory

Huff(X) = Q(n2 log n) > Rle(X) = Q(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.
1.0

c = .3

0.7

i -1

f (i ) 



p( j)

j 1

b = .5
0.2
0.0

f(a) = .0, f(b) = .2, f(c) = .7
a = .2

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
1.0

0.7
c = .3

0.7

c = .3
0.55

0.0

a = .2

0.3
0.2

c = .3

0.27
b = .5

b = .5
0.2

0.3

a = .2

b = .5
0.22

a = .2

0.2

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l 0  0 l i  l i -1  s i -1 * f c i 

s0  1

s i  s i -1 * p c i 

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

n

sn 

 p c 
i

i 1

The interval for a message sequence will be called the
sequence interval

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
1.0

0.7
c = .3

0.7
0.49

b = .5

c = .3

0.55
0.49

0.2

0.0

0.55

b = .5

0.3
a = .2

The message is bbc.

0.2

0.49
0.475

c = .3

b = .5
0.35

a = .2

0.3

a = .2

Representing a real number
Binary fractional representation:
. 75

 . 11

1/ 3

 . 01 01

11 / 16

 . 1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
m in

m ax

interval

.11

.110

.111

[.75 ,1.0 )

.101

.1010

.1011

[.625 , .75 )

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
Context
Empty

Counts
A
B
C
$

=
=
=
=

4
2
5
3

Context
A
B
C

Counts
C
$
A
$
A
B
C
$

=
=
=
=
=
=
=
=

3
1
2
1
1
2
2
3

Context
AC

BA
CA
CB
CC

String = ACCBACCACBA B

k=2

Counts
B
C
$
C
$
C
$
A
$
A
B
$

=
=
=
=
=
=
=
=
=
=
=
=

1
2
2
1
1
1
1
2
1
1
1
2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Q(n2 log n) time in the worst-case
• Q(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in - degree ( u )  k ] 

a  2.1

1

k

a

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



What if the sender has never seen the data at the receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides an efficient, optimal solution

fknown is the “previously encoded text”: compress the concatenation fknown·fnew, starting from fnew

zdelta is one of the best implementations
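To play with the idea of compressing fnew by deploying fknown — this is only a hedged stand-in for zdelta, using the preset-dictionary feature of Python's zlib — one can do something like:

import random, zlib

def delta_compress(f_known, f_new):
    # f_known acts as a preset dictionary: substrings of f_new that already
    # occur in f_known are encoded as copies (an LZ77-style covering of f_new).
    c = zlib.compressobj(level=9, zdict=f_known)
    return c.compress(f_new) + c.flush()

def delta_decompress(f_known, f_delta):
    d = zlib.decompressobj(zdict=f_known)
    return d.decompress(f_delta) + d.flush()

random.seed(0)
f_known = bytes(random.getrandbits(8) for _ in range(4000))  # incompressible on its own
f_new = f_known[:2000] + b"a few new bytes" + f_known[2000:]
f_d = delta_compress(f_known, f_new)
assert delta_decompress(f_known, f_d) == f_new
print(len(f_new), len(zlib.compress(f_new, 9)), len(f_d))    # plain zlib vs delta size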
          Emacs size   Emacs time
uncompr   27Mb         ---
gzip      8Mb          35 secs
zdelta    1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.
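A hedged sketch of the reduction, not the actual system: it assumes the networkx library (for Edmonds' minimum spanning arborescence) and uses gzip as a crude stand-in for the zdelta sizes; all names are illustrative.

import gzip, itertools
import networkx as nx

def gzip_size(data):
    return len(gzip.compress(data))

def zdelta_size(ref, tgt):
    # crude stand-in for |zdelta(ref -> tgt)|: gzip the pair and subtract
    return len(gzip.compress(ref + tgt)) - len(gzip.compress(ref))

def min_branching(files):
    # files: dict name -> bytes
    G = nx.DiGraph()
    G.add_node("dummy")
    for name, data in files.items():
        G.add_edge("dummy", name, weight=gzip_size(data))        # compress from scratch
    for a, b in itertools.permutations(files, 2):
        G.add_edge(a, b, weight=zdelta_size(files[a], files[b]))  # delta-compress b wrt a
    # Edmonds' algorithm: directed spanning tree of minimum total cost, rooted at "dummy"
    return nx.minimum_spanning_arborescence(G)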

[Figure: example weighted graph over the files plus the dummy node 0; edge weights (e.g. 20, 123, 220, 620, 2000) are the zdelta/gzip sizes.]

          space   time
uncompr   30Mb    ---
tgz       20%     linear
THIS      8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F, thus saving
zdelta executions. Nonetheless, this still takes ~n² time
          space   time
uncompr   260Mb   ---
tgz       12%     2 mins
THIS      8%      16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
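To make the rolling-hash idea concrete, here is a toy Python sketch in the spirit of rsync's weak checksum (an Adler-like hash); it is not the real rsync code, and the block length and names are made up:

def weak_hash(block):
    # Adler-like checksum: a = sum of bytes, b = weighted sum (both mod 2^16)
    a = sum(block) % 65536
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % 65536
    return (b << 16) | a

def roll(h, out_byte, in_byte, blen):
    # slide the window one byte to the right in O(1)
    a, b = h & 0xFFFF, h >> 16
    a = (a - out_byte + in_byte) % 65536
    b = (b - blen * out_byte + a) % 65536
    return (b << 16) | a

def matching_offsets(f_new, f_old, blen=8):
    # weak hashes of f_old's blocks (this is what travels over the network)
    old = {weak_hash(f_old[i:i + blen]): i
           for i in range(0, len(f_old) - blen + 1, blen)}
    h = weak_hash(f_new[:blen])
    for j in range(len(f_new) - blen + 1):
        if h in old:                      # candidate match (rsync confirms it with a strong hash)
            yield j, old[h]
        if j + blen < len(f_new):
            h = roll(h, f_new[j], f_new[j + blen], blen)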

Rsync: some experiments

gcc size
total
27288
gzip
7563
zdelta
227
rsync
964

emacs size
27326
8577
1431
4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync); the client checks them
Server deploys the common fref to compress the new ftar (rsync compresses just it).

A multi-round protocol

k blocks of n/k elems

log(n/k) levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree

T# = mississippi#

[Figure: the suffix tree of T#. Edges are labeled with substrings (#, i, i#, p, pi#, ppi#, s, si, ssi, mississippi#, ...) and the 12 leaves store the starting positions of the suffixes of T#.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

SUF(T) would take Θ(N²) space; store instead only the suffix pointers (starting positions):

T = mississippi#          (e.g. P = si selects the contiguous entries 7, 4)

SA   SUF(T)
12   #
11   i#
8    ippi#
5    issippi#
2    ississippi#
1    mississippi#
10   pi#
9    ppi#
7    sippi#
4    sissippi#
6    ssippi#
3    ssissippi#

Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison

T = mississippi#          P = si
SA = 12 11 8 5 2 1 10 9 7 4 6 3

Compare P against the suffix starting at the middle entry of SA: if P is larger, recurse on the right half; if P is smaller, recurse on the left half. Each step makes 2 accesses (one to SA, one to T).

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char comparisons
 overall, O(p log2 N) time

Improvable to O(p + log2 N) [Manber-Myers, ’90] and to O(p + log2 |S|) [Cole et al, ’06]
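A small Python sketch of the O(p log2 N) binary search described above (assuming 1-based positions in SA, as in the slides; not from the slides):

def sa_range(T, SA, P):
    # binary search for the contiguous SA range of suffixes having P as a prefix
    # (Prop. 1 and 2); each comparison looks at O(|P|) chars of one suffix
    def first(strictly_greater):
        lo, hi = 0, len(SA)
        while lo < hi:
            mid = (lo + hi) // 2
            pref = T[SA[mid] - 1 : SA[mid] - 1 + len(P)]   # SA is 1-based
            if pref < P or (strictly_greater and pref == P):
                lo = mid + 1
            else:
                hi = mid
        return lo
    return first(False), first(True)

T  = "mississippi#"
SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
l, r = sa_range(T, SA, "si")
print(sorted(SA[l:r]))     # -> [4, 7]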

Locating the occurrences

T = mississippi#          P = si          occ = 2

Binary-search the two boundaries given by si# and si$, where # < S < $: the occurrences of P are the contiguous SA entries in between — here 7 (sippi#) and 4 (sissippi#).

Suffix Array search
• O(p + log2 N + occ) time

Suffix Trays: O(p + log2 |S| + occ)   [Cole et al., ‘06]
String B-tree   [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays   [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

T = mississippi#
SA  = 12 11 8 5 2 1 10 9 7 4 6 3
Lcp = 0 0 1 4 0 0 1 0 2 1 3        (e.g. the entry 4 is the Lcp of issippi# and ississippi#)

• How long is the common prefix between T[i,...] and T[j,...] ?
  • Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  • Search for some Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  • Search for a run Lcp[i,i+C-2] whose entries are all ≥ L
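A hedged Python sketch of the three LCP-based queries above (illustrative; it assumes a 0-based Lcp array where Lcp[h] is the lcp of the suffixes of ranks h and h+1):

def lcp_of_suffixes(i, j, SA, Lcp):
    # common-prefix length of T[i,...] and T[j,...]:
    # min of Lcp over the SA interval separating the two suffixes
    rank = {s: h for h, s in enumerate(SA)}
    h, k = sorted((rank[i], rank[j]))
    return min(Lcp[h:k])

def has_repeat_of_length(L, Lcp):
    # a repeated substring of length >= L exists iff some Lcp entry is >= L
    return any(v >= L for v in Lcp)

def has_frequent_substring(L, C, Lcp):
    # a substring of length >= L occurring >= C times (C >= 2) shows up as
    # C-1 consecutive Lcp entries, all >= L
    w = C - 1
    return any(min(Lcp[i:i + w]) >= L for i in range(len(Lcp) - w + 1))

SA  = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
Lcp = [0, 0, 1, 4, 0, 0, 1, 0, 2, 1, 3]
print(lcp_of_suffixes(5, 2, SA, Lcp))    # -> 4 (issippi# vs ississippi#)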


Slide 71

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 10^4 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and the algorithm uses all of it:
(1/B) * (p * f/(1+f) * C)  ≈  30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
       4K    8K    16K   32K    128K   256K   512K   1M
n^3    22s   3m    26m   3.5h   28h    --     --     --
n^2    0     0     0     1s     26s    106s   7m     28m

An optimal solution
We assume every subsum ≠ 0

[Figure: A is split so that the part just before the optimum has sum < 0, while every prefix inside the optimum has sum > 0]

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm
  sum = 0; max = -1;
  For i = 1,...,n do
     if (sum + A[i] ≤ 0) then sum = 0;
     else { sum += A[i]; max = MAX{max, sum}; }

Note:
• Sum < 0 when OPT starts;
• Sum > 0 within OPT
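The same scan, as a runnable Python sketch (illustrative; it assumes, as the slide does, that the optimum sum is positive):

def max_subarray_sum(A):
    best = float("-inf")     # the slide uses -1, assuming a positive optimum
    run = 0
    for x in A:
        if run + x <= 0:
            run = 0          # the optimum cannot start with a non-positive prefix
        else:
            run += x
            best = max(best, run)
    return best

print(max_subarray_sum([2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]))   # -> 12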

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort  Θ(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02    m = (i+j)/2;              // Divide
03    Merge-Sort(A,i,m);        // Conquer
04    Merge-Sort(A,m+1,j);
05    Merge(A,i,m,j)            // Combine

Cost of Mergesort on large data

Take Wikipedia in Italian, compute word freq:
  n = 10^9 tuples  few Gbs
  Typical disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:
  It is an indirect sort: Θ(n log2 n) random I/Os
  [5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...    2 passes (R/W)

Merge-Sort Recursion Tree

[Figure: the recursion tree of merge-sort — log2 N levels of pairwise merges of sorted runs, from single items at the leaves up to the fully sorted sequence.]

If the run-size is larger than B (i.e. after the first merging step!!), fetching all of it in memory for merging does not help.

How do we deploy the disk/memory features?

With internal memory M: N/M runs, each sorted in internal memory (no extra I/Os)
— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort

The key is to balance run-size and #runs to merge.
Sort N items with main memory M and disk pages of B items:
  Pass 1: produce N/M sorted runs.
  Pass i: merge X = M/B runs at a time  ⇒  log_{M/B}(N/M) merging passes

[Figure: X = M/B input buffers of B items each, one output buffer, data streaming from disk through main memory back to disk.]

Multiway Merging

[Figure: one buffer Bf1..Bfx per run (X = M/B) plus an output buffer Bfo. Repeatedly emit min(Bf1[p1], Bf2[p2], …, Bfx[pX]); fetch a new page of run i when pi = B; flush Bfo to the merged output run when it is full; stop at EOF.]
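The merging logic of the figure, sketched in Python with a heap over the X current heads (only the in-memory logic, not the buffered disk I/O):

import heapq

def multiway_merge(runs):
    # runs: list of already-sorted sequences (the sorted runs);
    # repeatedly extract the minimum among the current heads
    iters = [iter(r) for r in runs]
    heap = []
    for i, it in enumerate(iters):
        first = next(it, None)
        if first is not None:
            heapq.heappush(heap, (first, i))
    out = []
    while heap:
        v, i = heapq.heappop(heap)
        out.append(v)
        nxt = next(iters[i], None)     # refill from the run we just consumed
        if nxt is not None:
            heapq.heappush(heap, (nxt, i))
    return out

print(multiway_merge([[1, 2, 5, 10], [7, 9, 13, 19], [3, 4, 8, 15], [6, 11, 12, 17]]))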

Cost of Multi-way Merge-Sort

Number of passes = log_{M/B} #runs  ≈  log_{M/B} (N/M)
Optimal cost = Θ( (N/B) log_{M/B} (N/M) ) I/Os

In practice
  M/B ≈ 1000  ⇒  #passes = log_{M/B} (N/M) ≈ 1
  One multiway merge  ⇒  2 passes = few mins     (tuning depends on disk features)

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Can compression help?

Goal: enlarge M and reduce N
  #passes = O(log_{M/B} (N/M))
  Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm
  Use a pair of variables <X,C>, initialized with X = first item and C = 1
  For each subsequent item s of the stream,
     if (X == s) then C++
     else { C--; if (C == 0) { X = s; C = 1; } }
  Return X;

Proof
(Problematic only if the mode occurs ≤ N/2 times.)

If X ≠ y at the end, then every one of y’s occurrences has a “negative” mate (a distinct item that cancelled it). Hence the mates are ≥ #occ(y), so N ≥ 2 * #occ(y) > N: a contradiction.
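The same one-pass algorithm as a runnable Python sketch (illustrative; the returned candidate is meaningful only if some item really occurs more than N/2 times):

def majority_candidate(stream):
    # one pass, one pair of variables <X, C>
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1
        elif X == s:
            C += 1
        else:
            C -= 1
    return X

A = "bacccdcbaaaccbccc"          # the slide's stream, without spaces
print(majority_candidate(A))     # -> 'c'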

Toy problem #4: Indexing

Consider the following TREC collection:
  N = 6 * 10^9   size = 6Gb
  n = 10^6 documents
  TotT = 10^9 terms (avg term length is 6 chars)
  t = 5 * 10^5 distinct terms

What kind of data structure should we build to support word-based searches ?

Solution 1: Term-Doc matrix      (t = 500K terms, n = 1 million documents)

             Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony              1               1              0          0        0        1
Brutus              1               1              0          1        0        0
Caesar              1               1              0          1        1        1
Calpurnia           0               1              0          0        0        0
Cleopatra           1               0              0          0        0        0
mercy               1               0              1          1        1        1
worser              1               0              1          1        1        0

1 if the play contains the word, 0 otherwise.      Space is 500Gb !

Solution 2: Inverted index

Brutus     →  2  4  8  16  32  64  128
Calpurnia  →  1  2  3  5  8  13  21  34
Caesar     →  13  16

We can still do better: i.e. 30-50% of the original text

1. Typically each posting uses about 12 bytes
2. We have 10^9 total terms  ⇒  at least 12Gb space
3. Compressing the 6Gb of documents gets 1.5Gb of data
Better index, but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

   ∑_{i=1}^{n-1} 2^i  =  2^n - 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the self information of s is:

   i(s) = log2 (1/p(s)) = - log2 p(s)

Lower probability  ⇒  higher information

Entropy is the weighted average of i(s):

   H(S) = ∑_{s∈S} p(s) * log2 (1/p(s))   bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which no codeword is a prefix of another one
e.g. a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie:

[Figure: binary trie with leaf a reached by 0, b by 100, c by 101, d by 11]

Average Length
For a code C with codeword lengths L[s], the average length is defined as

   La(C) = ∑_{s∈S} p(s) * L[s]

We say that a prefix code C is optimal if for all prefix codes C’, La(C) ≤ La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal uniquely decodable code, there exists a prefix code with the same codeword lengths and thus the same optimal average length. And vice versa…

Theorem (golden rule). If C is an optimal prefix code for a source with probabilities {p1, …, pn}, then pi < pj  ⇒  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any uniquely decodable code C, we have

   H(S) ≤ La(C)

Theorem (upper bound, Shannon). For any probability distribution, there exists a prefix code C such that

   La(C) ≤ H(S) + 1

(e.g. the Shannon code, which assigns ⌈log2 1/p⌉ bits to a symbol of probability p)

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

[Huffman tree: merge a(.1)+b(.2) → (.3); merge (.3)+c(.2) → (.5); merge (.5)+d(.5) → (1)]

a = 000, b = 001, c = 01, d = 1

There are 2^(n-1) “equivalent” Huffman trees

What about ties (and thus, tree depth) ?
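A compact Python sketch of the greedy construction (illustrative, not the slides' code): repeatedly merge the two least probable subtrees with a heap.

import heapq
from itertools import count

def huffman_codes(probs):
    # probs: dict symbol -> probability; returns dict symbol -> codeword
    tiebreak = count()                       # avoids comparing dicts on equal weights
    heap = [(p, next(tiebreak), {s: ""}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)      # two least probable (sub)trees
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
    return heap[0][2]

print(huffman_codes({"a": .1, "b": .2, "c": .2, "d": .5}))
# prints an optimal prefix code with lengths a:3, b:3, c:2, d:1
# (one of the 2^(n-1) equivalent trees; the bits may differ from the slide)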

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. Its self information is

   - log2(.999) ≈ .00144 bits

If we were to send 1000 such symbols we might hope to use 1000 * .00144 ≈ 1.44 bits.

Using Huffman, we take at least 1 bit per symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
  1 extra bit per macro-symbol = 1/k extra bits per symbol
  Larger model to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:
  The model takes |S|^k * (k * log |S|) + h^2 bits (where h might be |S|)
  It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons

Strings are also numbers, H: strings → numbers.
Let s be a string of length m:

   H(s) = ∑_{i=1}^{m} 2^(m-i) * s[i]

P = 0101
H(P) = 2^3*0 + 2^2*1 + 2^1*0 + 2^0*1 = 5
s = s’ if and only if H(s) = H(s’)

Definition:
let Tr denote the m-length substring of T starting at position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons

We can compute H(Tr) from H(Tr-1) in constant time:

   H(Tr) = 2 * H(Tr-1) - 2^m * T[r-1] + T[r+m-1]

T = 10110101,  T1 = 1011,  T2 = 0110
   H(T1) = H(1011) = 11
   H(T2) = H(0110) = 2*11 - 2^4*1 + 0 = 22 - 16 = 6 = H(0110)

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P = 101111,  q = 7
H(P) = 47,  Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally:
   (1*2 + 0) mod 7 = 2
   (2*2 + 1) mod 7 = 5
   (5*2 + 1) mod 7 = 4
   (4*2 + 1) mod 7 = 2
   (2*2 + 1) mod 7 = 5  =  Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1), using
   2^m (mod q) = 2 * ( 2^(m-1) (mod q) )  (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
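A runnable Python sketch of the fingerprint scan on a binary text (illustrative: the modulus q is fixed here just to keep the sketch short, while the real algorithm picks a random prime ≤ I; the explicit verification makes it a deterministic matcher):

def karp_rabin_matches(T, P, q=2_147_483_647):
    m = len(P)
    if m == 0 or m > len(T):
        return []
    pow_m = pow(2, m, q)                       # 2^m (mod q), used to drop the leading bit
    def val(c): return 1 if c == '1' else 0    # binary alphabet, as in the slides
    hp = ht = 0
    for i in range(m):
        hp = (2 * hp + val(P[i])) % q
        ht = (2 * ht + val(T[i])) % q
    out = []
    for r in range(len(T) - m + 1):
        if hp == ht and T[r:r + m] == P:       # check => no false matches reported
            out.append(r + 1)                  # 1-based position
        if r + m < len(T):
            ht = (2 * ht - pow_m * val(T[r]) + val(T[r + m])) % q
    return out

print(karp_rabin_matches("10110101", "0101"))   # -> [5]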

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

[Example: M is an m x n binary matrix for T = california, P = for. The only 1-entries are M(1,5), M(2,6), M(3,7): the last bit of column 7 is set, so an occurrence of P ends at position 7 of T.]

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M

We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one.
We define the m-length binary vector U(x) for each character x in the alphabet: U(x) is set to 1 at the positions in P where character x appears.

Example: P = abaac
   U(a) = (1,0,1,1,0)   U(b) = (0,1,0,0,0)   U(c) = (0,0,0,0,1)

Initialize column 0 of M to all zeros.
For j > 0, the j-th column is obtained by

   M(j) = BitShift( M(j-1) ) & U(T[j])

For i > 1, entry M(i,j) = 1 iff
  (1) the first i-1 characters of P match the i-1 characters of T ending at character j-1   ⇔  M(i-1,j-1) = 1
  (2) P[i] = T[j]   ⇔  the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) into the i-th position; AND-ing it with the i-th bit of U(T[j]) establishes whether both conditions hold.
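For concreteness, a bit-parallel Python sketch of the whole Shift-And scan (illustrative; it stores the first char of P in the least significant bit, so BitShift becomes a left shift that sets that bit):

def shift_and(T, P):
    # bit i-1 of the current column M tells whether P[1..i] ends at position j of T
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)          # U(c): positions of c in P
    M, last = 0, 1 << (m - 1)
    occ = []
    for j, c in enumerate(T, start=1):
        M = ((M << 1) | 1) & U.get(c, 0)       # BitShift(M) & U(T[j])
        if M & last:
            occ.append(j - m + 1)              # starting position (1-based)
    return occ

print(shift_and("xabxabaaca", "abaac"))   # -> [5]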

Worked examples (T = xabxabaaca, P = abaac):

[Columns computed as M(j) = BitShift(M(j-1)) & U(T[j]): M(1) = (1,0,0,0,0) & U(x) = all zeros; M(2) = (1,0,0,0,0) & U(a) = (1,0,0,0,0); M(3) = (0,1,0,0,0); … ; M(9) = BitShift(M(8)) & U(c) = (0,0,0,0,1). The last bit of column 9 is 1, so an occurrence of P ends at position 9 of T.]

Shift-And method: Complexity

If m ≤ w, any column and any vector U() fit in a memory word.
  ⇒ Any step requires O(1) time.
If m > w, any column and any vector U() can be divided into m/w memory words.
  ⇒ Any step requires O(m/w) time.
Overall O(n(1 + m/w) + m) time.

Thus, it is very fast when the pattern length is close to the word size.
  Very often the case in practice. Recall that w = 64 bits in modern architectures.

Some simple extensions

We want to allow the pattern to contain special symbols, like the class of chars [a-f].

Example: P = [a-b]baac  ⇒  just set the class positions in the U() vectors:
   U(a) = (1,0,1,1,0)   U(b) = (1,1,0,0,0)   U(c) = (0,0,0,0,1)

What about ‘?’, ‘[^…]’ (not) ?

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary matrix, such that:

Ml(i,j) = 1 iff there are no more than l mismatches between the first i characters of P and the i characters of T ending at position j.

What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1

The first i-1 characters of P match a substring of T ending at j-1, with at most l mismatches, and the next pair of characters in P and T are equal:

   BitShift( Ml(j-1) ) & U(T[j])

Computing Ml: case 2

The first i-1 characters of P match a substring of T ending at j-1, with at most l-1 mismatches (position j of T is charged as the l-th mismatch):

   BitShift( Ml-1(j-1) )

Computing Ml

We compute Ml for all l = 0, …, k.
For each j compute M(j), M1(j), …, Mk(j).
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a match iff case 1 or case 2 holds:

   Ml(j) = [ BitShift( Ml(j-1) ) & U(T[j]) ]  OR  BitShift( Ml-1(j-1) )

Example M1     (T = xabxabaaca, P = abaad)

[Figure: the matrices M0 and M1 for this T and P, computed column by column with the formula above. M1 has 1-entries also where one mismatch is tolerated; e.g. its last row is set at column 9, since T[5..9] = abaac matches P = abaad with a single mismatch.]
How much do we pay?

The running time is O(k n (1 + m/w)).
Again, the method is practically efficient for small m.
Only O(k) columns of M are needed at any given time. Hence, the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Delection: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding

   g(x) = 0^(Length-1) · (x in binary),   where x > 0 and Length = ⌊log2 x⌋ + 1

e.g., 9 is represented as <000, 1001>.

g-code for x takes 2⌊log2 x⌋ + 1 bits   (i.e. a factor of 2 from the optimum)

Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers

It is a prefix-free encoding…

Given the following sequence of g-coded integers, reconstruct the original sequence:

   0001000001100110000011101100111   →   8, 6, 3, 59, 7
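A small Python sketch of g-encoding and g-decoding (illustrative), which also checks the exercise above:

def gamma_encode(x):
    # x > 0: (Length-1) zeros, then x in binary (Length = floor(log2 x) + 1 bits)
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":                  # count the leading zeros
            z += 1; i += 1
        out.append(int(bits[i:i + z + 1], 2))  # read z+1 bits of binary value
        i += z + 1
    return out

print(gamma_encode(9))                                    # -> 0001001, i.e. <000,1001>
print(gamma_decode("0001000001100110000011101100111"))   # -> [8, 6, 3, 59, 7]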

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How good is this approach wrt Huffman?  Compression ratio ≤ 2 * H0(s) + 1
Key fact:   1 ≥ ∑_{i=1,...,x} pi ≥ x * px   ⇒   x ≤ 1/px

How good is it ?
The cost of the encoding is (recall i ≤ 1/pi):

   ∑_{i=1,...,|S|} pi * |g(i)|  ≤  ∑_{i=1,...,|S|} pi * [ 2 * log(1/pi) + 1 ]  =  2 * H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:
  Exploits temporal locality, and it is dynamic
  X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n^2 log n) bits, MTF = O(n log n) + n^2 bits

Not much worse than Huffman ... but it may be far better
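A minimal Python sketch of MTF encoding/decoding (illustrative; unlike the slide, the starting list here simply contains all the symbols, # included):

def mtf_encode(text, alphabet):
    # output the current position (0-based) of each symbol, then move it to front
    L = list(alphabet)
    out = []
    for c in text:
        i = L.index(c)
        out.append(i)
        L.insert(0, L.pop(i))
    return out

def mtf_decode(codes, alphabet):
    L = list(alphabet)
    out = []
    for i in codes:
        c = L[i]
        out.append(c)
        L.insert(0, L.pop(i))
    return "".join(out)

L_bwt = "ipppssssssmmmii#pppiiissssssiiiiii"
codes = mtf_encode(L_bwt, sorted(set(L_bwt)))
assert mtf_decode(codes, sorted(set(L_bwt))) == L_bwt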

MTF: how good is it ?
Encode the integers via g-coding: |g(i)| ≤ 2 * log i + 1
Put S at the front and consider the cost of encoding (p_x^i = position of the i-th occurrence of symbol x):

   O(|S| log |S|) + ∑_{x=1,...,|S|} ∑_{i≥2} | g( p_x^i - p_x^(i-1) ) |

By Jensen’s inequality:

   ≤ O(|S| log |S|) + ∑_{x=1,...,|S|} nx * [ 2 * log(N/nx) + 1 ]
   = O(|S| log |S|) + N * [ 2 * H0(X) + 1 ]

   La[mtf] ≤ 2 * H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1n 2n 3n… nn 

There is a memory

Huff(X) = Q(n2 log n) > Rle(X) = Q(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive):

   f(i) = ∑_{j=1}^{i-1} p(j)

e.g.  p(a) = .2, p(b) = .5, p(c) = .3   ⇒   f(a) = .0, f(b) = .2, f(c) = .7
      a = [0.0, 0.2),  b = [0.2, 0.7),  c = [0.7, 1.0)

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac   (p(a) = .2, p(b) = .5, p(c) = .3)

   start: [0, 1)
   b  →  [0.2, 0.7)
   a  →  [0.2, 0.3)
   c  →  [0.27, 0.3)

The final sequence interval is [.27, .3)

Arithmetic Coding
To code a sequence of symbols c with probabilities p[c] use the following:

   l_0 = 0 ,   l_i = l_{i-1} + s_{i-1} * f[c_i]
   s_0 = 1 ,   s_i = s_{i-1} * p[c_i]

f[c] is the cumulative prob. up to symbol c (not included).
Final interval size is

   s_n = ∏_{i=1}^{n} p[c_i]

The interval for a message sequence will be called the sequence interval
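The two recurrences, as a runnable Python sketch that computes the sequence interval of a message (illustrative; real coders use the integer version discussed later, and the symbol order defining f only needs to be shared by encoder and decoder):

def sequence_interval(msg, p):
    # returns (l, s): the sequence interval [l, l+s), following
    # l_i = l_{i-1} + s_{i-1}*f[c_i]  and  s_i = s_{i-1}*p[c_i]
    f, acc = {}, 0.0
    for c in sorted(p):          # cumulative probabilities f[c]
        f[c] = acc
        acc += p[c]
    l, s = 0.0, 1.0
    for c in msg:
        l = l + s * f[c]
        s = s * p[c]
    return l, s

l, s = sequence_interval("bac", {"a": .2, "b": .5, "c": .3})
print(l, l + s)    # ≈ 0.27 and 0.3: the interval [.27, .3) of the slides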

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:

   .49 ∈ [0.2, 0.7)    →  b ;  the interval becomes [0.2, 0.7)
   .49 ∈ [0.3, 0.55)   →  b ;  the interval becomes [0.3, 0.55)
   .49 ∈ [0.475, 0.55) →  c

The message is bbc.

Representing a real number
Binary fractional representation:

   .75 = .11      1/3 = .010101...      11/16 = .1011

Algorithm:
  1. x = 2 * x
  2. If x < 1 output 0
  3. else x = x - 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
e.g.  [0, .33) → .01      [.33, .66) → .1      [.66, 1) → .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions.

           min      max      interval
   .11     .110     .111     [.75, 1.0)
   .101    .1010    .1011    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

   1 + ⌈log (1/s)⌉  =  1 + ⌈log ∏_{i=1,n} (1/p_i)⌉
                    ≤  2 + ∑_{i=1,n} log (1/p_i)
                    =  2 + ∑_{k=1,|S|} n p_k log (1/p_k)
                    =  2 + n H0   bits

nH0 + 0.02 n bits in practice, because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
   Output 1 followed by m 0s; set m = 0; the interval is expanded by a factor 2

If u < R/2 then (bottom half)
   Output 0 followed by m 1s; set m = 0; the interval is expanded by a factor 2

If l ≥ R/4 and u < 3R/4 then (middle half)
   Increment m; the interval is expanded by a factor 2

In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts       (String = ACCBACCACBA | B,   k = 2)

Context Empty:   A = 4   B = 2   C = 5   $ = 3

Context A:   C = 3   $ = 1
Context B:   A = 2   $ = 1
Context C:   A = 1   B = 2   C = 2   $ = 3

Context AC:  B = 1   C = 2   $ = 2
Context BA:  C = 1   $ = 1
Context CA:  C = 1   $ = 1
Context CB:  A = 2   $ = 1
Context CC:  A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
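A toy Python sketch of the LZW coding loop (illustrative; the dictionary here starts with the 256 single-byte codes, whereas the slide writes a = 112 for readability):

def lzw_encode(s):
    dic = {chr(i): i for i in range(256)}   # initial dictionary: one code per char
    nxt = 256
    out, S = [], ""
    for c in s:
        if S + c in dic:
            S += c                           # extend the current match
        else:
            out.append(dic[S])               # emit the longest match found
            dic[S + c] = nxt                 # add S+c to the dictionary
            nxt += 1
            S = c
    if S:
        out.append(dic[S])
    return out

print(lzw_encode("aabaacababacb"))
# -> [97, 97, 98, 256, 99, 257, 261, 99, 98] with ASCII codes (a=97);
#    with the slide's convention (a=112) the same run reads 112 112 113 256 114 257 261 114 ...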

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible

Two key properties:
1. The LF-array maps L’s chars to F’s chars
2. L[ i ] precedes F[ i ] in T

Reconstruct T backward:   T = .... i p p i #

InvertBWT(L)
  Compute LF[0,n-1];
  r = 0; i = n;
  while (i > 0) {
     T[i] = L[r];
     r = LF[r]; i--;
  }
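A Python sketch of InvertBWT (illustrative): the LF-mapping is obtained by stably sorting the positions of L, then T is rebuilt backward as in the pseudocode above.

def inverse_bwt(L, sentinel="#"):
    n = len(L)
    # F is L sorted; pair each L-position with its F-position (LF-mapping),
    # keeping the relative order of equal chars (stable sort).
    order = sorted(range(n), key=lambda i: (L[i], i))
    LF = [0] * n
    for f, i in enumerate(order):
        LF[i] = f
    out = []
    r = 0                      # row 0 of the sorted matrix starts with the sentinel
    for _ in range(n - 1):     # reconstruct T backward
        out.append(L[r])
        r = LF[r]
    return "".join(reversed(out)) + sentinel

print(inverse_bwt("ipssm#pissii"))   # -> mississippi#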

How to compute the BWT ?
[Figure: the sorted BWT matrix of T = mississippi#, with SA = (12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3) and last column L = i p s s m # p i s s i i]

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA   sorted suffixes
12   #
11   i#
8    ippi#
5    issippi#
2    ississippi#
1    mississippi#
10   pi#
9    ppi#
7    sippi#
4    sissippi#
6    ssippi#
3    ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pr that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution

Altavista crawl (1999) and WebBase crawl (2001): the indegree follows a power-law distribution

   Pr[ in-degree(u) = k ]  ∝  1/k^a ,   a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
From the adjacency list with copy lists to an adjacency list with copy blocks (RLE on the bit sequences):
 The first copy block is 0 if the copy list starts with 0;
 The last block is omitted (we know the total length…);
 The length is decremented by one for all blocks.

WebGraph is a Java and C++ library, achieving ≈3 bits/edge.

Extra-nodes: Compressing Intervals
From the adjacency list with copy blocks, the remaining extra-nodes are encoded by exploiting consecutivity:
 Intervals: runs of consecutive integers are coded with their left extreme and length
 Interval lengths are decremented by Lmin = 2
 Residuals: coded as differences between consecutive residuals, or w.r.t. the source node

Examples (from the figure):
 0    = (15-15)*2        (positive)
 2    = (23-19)-2        (jump >= 2)
 600  = (316-16)*2
 3    = |13-15|*2-1      (negative)
 3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
[Figure: a sender transmits data over the network to a receiver, which may already hold some knowledge about that data]

 network links are getting faster and faster, but
 many clients are still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques

 caching: “avoid sending the same object again”
» done on the basis of whole objects
» only works if objects are completely unchanged
» What about objects that are slightly changed?

 compression: “remove redundancy in transmitted data”
» avoid repeated substrings in data
» can be extended to the history of past transmissions (at some overhead)
» What if the sender has never seen the data at the receiver?

Types of Techniques

 Common knowledge between sender & receiver
» Unstructured file: delta compression

 “Partial” knowledge
» Unstructured files: file synchronization
» Record-based data: set reconciliation

Formalization

 Delta compression   [diff, zdelta, REBL,…]
» Compress file f deploying file f’
» Compress a group of files
» Speed up web access by sending differences between the requested page and the ones available in cache

 File synchronization   [rsync, zsync]
» Client updates an old file f_old with f_new available on a server
» Mirroring, Shared Crawling, Content Distribution Networks

 Set reconciliation
» Client updates a structured old file f_old with f_new available on a server
» Update of contacts or appointments, intersection of inverted lists in a P2P search engine

Z-delta compression   (one-to-one)

Problem: We have two files f_known and f_new, and the goal is to compute a file f_d of minimum size such that f_new can be derived from f_known and f_d.
 Assume that block moves and copies are allowed
 Find an optimal covering set of f_new based on f_known
 The LZ77-scheme provides an efficient, optimal solution:
» f_known is the “previously encoded text”: compress the concatenation f_known · f_new, starting from f_new
 zdelta is one of the best implementations
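zdelta itself is a stand-alone library; as a rough approximation of the same principle, zlib’s preset-dictionary feature lets f_known play the role of the previously encoded text (zlib only looks back 32KB and has no explicit block moves, so this is only a sketch of the idea, not zdelta):

import zlib

def delta_encode(f_known: bytes, f_new: bytes) -> bytes:
    c = zlib.compressobj(level=9, zdict=f_known)   # f_known acts as the already-seen text
    return c.compress(f_new) + c.flush()

def delta_decode(f_known: bytes, f_delta: bytes) -> bytes:
    d = zlib.decompressobj(zdict=f_known)
    return d.decompress(f_delta) + d.flush()

old = b"the quick brown fox jumps over the lazy dog " * 100
new = old.replace(b"lazy", b"sleepy")
delta = delta_encode(old, new)
assert delta_decode(old, delta) == new
print(len(zlib.compress(new, 9)), "vs", len(delta))   # the delta is far smaller than compressing f_new alone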
           Emacs size   Emacs time
uncompr    27Mb         ---
gzip       8Mb          35 secs
zdelta     1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: a pair of proxies, located on the two sides of the slow link, uses a proprietary protocol to increase performance over this link.
[Figure: Client <-> client-side proxy <-> (slow link, delta-encoding) <-> server-side proxy <-> (fast link, request/page) <-> web; both proxies keep the reference pages]

Use zdelta to reduce traffic:
 Old version available at both proxies
 Restricted to pages already visited (30% hits), URL-prefix match
 Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F
 Useful on a dynamic collection of web pages, back-ups, …
 Apply pairwise zdelta: find for each f ∈ F a good reference

Reduction to the Min Branching problem on DAGs:
 Build a weighted graph GF: nodes = files, weights = zdelta-sizes
 Insert a dummy node connected to all nodes, whose edge weights are the gzip-coded sizes
 Compute the min branching = directed spanning tree of minimum total cost, covering G’s nodes.
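A sketch of this reduction under stated assumptions: pairwise weights are approximated with the zlib preset-dictionary trick from the previous slide, and the min branching is delegated to networkx (the function minimum_spanning_arborescence is assumed available; Edmonds’ algorithm could be coded directly instead).

import zlib
import networkx as nx                 # assumed dependency, used only for Edmonds' algorithm

def delta_size(ref: bytes, tgt: bytes) -> int:
    c = zlib.compressobj(level=9, zdict=ref)
    return len(c.compress(tgt) + c.flush())

def best_references(files):
    # node 0 is the dummy node; the files are nodes 1..n
    G = nx.DiGraph()
    for i, fi in enumerate(files, start=1):
        G.add_edge(0, i, weight=len(zlib.compress(fi, 9)))     # gzip-coding from the dummy node
        for j, fj in enumerate(files, start=1):
            if i != j:
                G.add_edge(i, j, weight=delta_size(fi, fj))    # cost of coding fj w.r.t. fi
    T = nx.minimum_spanning_arborescence(G, attr="weight")     # the min branching
    return {v: u for (u, v) in T.edges()}                      # file -> chosen reference (0 = none)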

[Figure: toy graph GF on files 1, 2, 3, 5 plus the dummy node 0, with edge weights such as 20, 123, 220, 620, 2000 given by the zdelta/gzip sizes; the min branching picks the cheapest reference for each file]

           space   time
uncompr    30Mb    ---
tgz        20%     linear
THIS       8%      quadratic

Improvement: what about many-to-one compression (a group of files)?

Problem: Constructing G is very costly: n^2 edge calculations (zdelta executions).
We wish to exploit some pruning approach:
 Collection analysis: cluster the files that appear similar and are thus good candidates for zdelta-compression. Build a sparse weighted graph G’F containing only the edges between those pairs of files.
 Assign weights: estimate appropriate edge weights for G’F, thus saving zdelta executions. Nonetheless, still Q(n^2) time.
           space    time
uncompr    260Mb    ---
tgz        12%      2 mins
THIS       8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
[Figure: the Client holds f_old, the Server holds f_new; the client sends a request and receives an update]

 the client wants to update an outdated file
 the server has the new file but does not know the old one
 update without sending the entire f_new (using similarity)
 rsync: file synch tool, distributed with Linux

Delta compression is a sort of "local" synch, since the server has both copies of the files.

The rsync algorithm
[Figure: the Client (holding f_old) sends the block hashes of f_old; the Server (holding f_new) replies with the encoded file]

The rsync algorithm   (contd)

 simple, widely used, single roundtrip
 optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
 choice of block size problematic (default: max{700, √n} bytes)
 not good in theory: granularity of changes may disrupt use of blocks
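A toy sketch of the round just described, under simplifying assumptions (a full MD5 per block instead of the rolling-hash/short-MD5 pair, one literal byte at a time, no compression of literals): the client sends the block hashes of f_old, the server scans f_new and answers with block references or literals.

import hashlib

def client_hashes(f_old: bytes, B: int):
    return {hashlib.md5(f_old[i:i+B]).digest(): i // B
            for i in range(0, len(f_old), B)}

def server_encode(f_new: bytes, hashes, B: int):
    ops, i = [], 0
    while i < len(f_new):
        h = hashlib.md5(f_new[i:i+B]).digest()
        if len(f_new) - i >= B and h in hashes:
            ops.append(("copy", hashes[h])); i += B        # reference to a block of f_old
        else:
            ops.append(("lit", f_new[i:i+1])); i += 1      # literal byte
    return ops

def client_decode(ops, f_old: bytes, B: int):
    return b"".join(f_old[v*B:(v+1)*B] if op == "copy" else v for op, v in ops)

f_old = b"abcdefgh" * 200
f_new = f_old[:512] + b"XY" + f_old[512:]
B = 64
ops = server_encode(f_new, client_hashes(f_old, B), B)
assert client_decode(ops, f_old, B) == f_new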

Rsync: some experiments

         gcc size   emacs size
total    27288      27326
gzip     7563       8577
zdelta   227        1431
rsync    964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

 The server sends the hashes (unlike rsync, where the client does), and the client checks them.
 The server deploys the common f_ref to compress the new f_tar (rsync just compresses it).

A multi-round protocol

 k blocks of n/k elements each, over log(n/k) levels
 If the distance is k, then on each level at most k hashes do not find a match in the other file.
 The communication complexity is O(k lg n lg(n/k)) bits
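A sketch of the multi-round idea under a strong simplifying assumption: the two files have the same length and differ only by in-place edits (no insertions or deletions), so blocks can be compared positionally; each round exchanges one hash per still-unresolved block and recurses only where the hashes disagree.

import hashlib

def differing_ranges(f_a: bytes, f_b: bytes, k: int = 4, min_block: int = 16):
    out = []
    def rec(lo, hi):
        if hashlib.sha1(f_a[lo:hi]).digest() == hashlib.sha1(f_b[lo:hi]).digest():
            return                                   # hashes match: nothing to ship for this block
        if hi - lo <= min_block:
            out.append((lo, hi)); return             # small enough: ship the block itself
        step = (hi - lo + k - 1) // k                # split into k sub-blocks and recurse
        for s in range(lo, hi, step):
            rec(s, min(s + step, hi))
    rec(0, max(len(f_a), len(f_b)))
    return out                                       # byte ranges where the two files still differ

a = b"0123456789abcdef" * 64
b = bytearray(a); b[100] = ord("X"); b[900] = ord("Y")   # two in-place edits
print(differing_ranges(a, bytes(b)))                     # two small ranges, around 100 and 900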

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values, located on two machines A and B, determine the difference between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size k of the difference, where k may or may not be known in advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]
 Not perfectly true, but...



Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
  iff  P is a prefix of the i-th suffix of T (i.e. T[i,N])

Occurrences of P in T = all suffixes of T having P as a prefix
  e.g.  P = si, T = mississippi  =>  occurrences at positions 4, 7

SUF(T) = sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Figure: the suffix tree of T# = mississippi#; edges are labeled with substrings (e.g. "ssi", "ppi#", "si", "mississippi#") and the 12 leaves store the starting positions of the corresponding suffixes]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.
(Storing SUF(T) explicitly would take Q(N2) space.)

T = mississippi#         SA    SUF(T)
                         12    #
                         11    i#
                          8    ippi#
                          5    issippi#
                          2    ississippi#
                          1    mississippi#
                         10    pi#
                          9    ppi#
                          7    sippi#       <- P = si
                          4    sissippi#    <- P = si
                          6    ssippi#
                          3    ssissippi#

Suffix Array space:
• SA: Q(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step.
[Figure: two steps of the binary search for P = si over the SA of T = mississippi#; at each step P is compared with the suffix starting at SA[mid], and the search continues in the half where P is lexicographically larger/smaller]

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 Overall, O(p log2 N) time
  [improvable to O(p + log2 N) via Lcp information, Manber-Myers ’90, and to O(p + log2 |S|), Cole et al. ’06]
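A minimal sketch: the suffix array is built by plainly sorting the suffixes (the Q(N2 log N) "elegant but inefficient" construction; linear-time algorithms exist) and a pattern is searched with the indirect binary search just described, returning the contiguous SA range of suffixes prefixed by P.

def suffix_array(T):
    return sorted(range(len(T)), key=lambda i: T[i:])

def search(T, SA, P):
    lo, hi = 0, len(SA)                      # leftmost suffix >= P
    while lo < hi:
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + len(P)] < P: lo = mid + 1
        else: hi = mid
    first = lo
    lo, hi = first, len(SA)                  # leftmost suffix whose first |P| chars exceed P
    while lo < hi:
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + len(P)] <= P: lo = mid + 1
        else: hi = mid
    return sorted(SA[first:lo])              # the occ starting positions of P in T

T = "mississippi#"
SA = suffix_array(T)                         # [11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]
print(search(T, SA, "si"))                   # [3, 6]  (0-based; the slides' 1-based 4 and 7)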

Locating the occurrences
All the occurrences of P lie in a contiguous range of SA; the range can be delimited by binary-searching the two patterns P# and P$ (with # < S < $).
[Figure: for P = si in T = mississippi#, the range contains the two entries SA = 7 and SA = 4, hence occ = 2]

Suffix Array search
• O(p + log2 N + occ) time

Suffix Trays: O(p + log2 |S| + occ)    [Cole et al., ‘06]
String B-tree                          [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays           [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

T = mississippi#
SA  = 12 11  8  5  2  1 10  9  7  4  6  3
Lcp =  0  0  1  4  0  0  1  0  2  1  3
(e.g. Lcp = 4 between the adjacent suffixes issippi# and ississippi#)

• How long is the common prefix between T[i,...] and T[j,...] ?
  It is the min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  Search for some Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  Search for a subarray Lcp[i,i+C-2] whose entries are all ≥ L.
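A short sketch of the mining step above: the Lcp array is computed naively from adjacent SA entries (Kasai's algorithm does it in O(N)), and the query "is there a repeated substring of length ≥ L?" becomes a scan for an Lcp entry ≥ L.

def lcp(T, i, j):
    k = 0
    while i + k < len(T) and j + k < len(T) and T[i + k] == T[j + k]:
        k += 1
    return k

T = "mississippi#"
SA = sorted(range(len(T)), key=lambda i: T[i:])               # suffix array, as in the earlier sketch
Lcp = [lcp(T, SA[h], SA[h + 1]) for h in range(len(SA) - 1)]  # e.g. 4 between issippi# and ississippi#

L = 3
print(any(v >= L for v in Lcp))          # True: some substring of length >= 3 repeats (e.g. "issi")
print(max(Lcp))                          # length of the longest repeated substring (here 4)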


 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in - degree ( u )  k ] 

a  2.1

1

k

a

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



What if the sender has never seen the data at the receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77-scheme provides an efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
            Emacs size    Emacs time
uncompr     27Mb          ---
gzip        8Mb           35 secs
zdelta      1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

(Figure: example weighted graph G_F with a dummy root node; edge weights are the zdelta/gzip sizes, e.g. 20, 123, 220, 620, 2000.)

            space    time
uncompr     30Mb     ---
tgz         20%      linear
THIS        8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
            space    time
uncompr     260Mb    ---
tgz         12%      2 mins
THIS        8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
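A toy single-roundtrip sketch of this scheme in Python (helper names are made up; plain MD5 on whole blocks stands in for the real 4-byte rolling hash + 2-byte MD5 pair, so it re-hashes at every position instead of rolling):

import hashlib

def block_hashes(f_old, B):
    # client: one hash per non-overlapping block of f_old
    return {hashlib.md5(f_old[i:i + B]).hexdigest(): i // B
            for i in range(0, len(f_old), B)}

def encode(f_new, hashes, B):
    # server: emit ('copy', block_id) on a hash hit, otherwise a literal byte
    out, i = [], 0
    while i < len(f_new):
        h = hashlib.md5(f_new[i:i + B]).hexdigest()
        if len(f_new) - i >= B and h in hashes:
            out.append(('copy', hashes[h])); i += B
        else:
            out.append(('lit', f_new[i:i + 1])); i += 1
    return out

def decode(f_old, encoded, B):
    # client: rebuild f_new from f_old's blocks plus the literals
    return b''.join(f_old[j * B:(j + 1) * B] if op == 'copy' else j
                    for op, j in encoded)

f_old = b"the quick brown fox jumps over the lazy dog"
f_new = b"the quick brown cat jumps over the lazy dog!"
enc = encode(f_new, block_hashes(f_old, 8), 8)
assert decode(f_old, enc, 8) == f_new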

Rsync: some experiments

          gcc size    emacs size
total     27288       27326
gzip      7563        8577
zdelta    227         1431
rsync     964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends the hashes (unlike rsync, where the client sends them) and the client checks them.
The server deploys the common fref to compress the new ftar (rsync just compresses it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
#

0

s

i

1
ssi

ppi#

4

#

1

p

i

3

2

1

ppi#

6

pi#

7

8
5

T# = mississippi#
2 4 6 8 10

si

ppi#

i#

ppi#

11

mississippi#

12

2

1 10

9

4

3

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
5

Θ(N²) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirect binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

Improvable to O(p + log2 N) [Manber-Myers, ’90] and to O(p + log2 |S|) [Cole et al, ’06]
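A minimal Python sketch of this indirect binary search (naive construction; each comparison inspects at most p characters of one suffix):

T = "mississippi#"
SA = sorted(range(len(T)), key=lambda i: T[i:])        # naive suffix-array construction

def sa_range(P):
    # leftmost/rightmost suffix of SA having P as a prefix
    def bound(strict):
        lo, hi = 0, len(SA)
        while lo < hi:
            mid = (lo + hi) // 2
            pref = T[SA[mid]:SA[mid] + len(P)]
            if pref < P or (strict and pref == P):
                lo = mid + 1
            else:
                hi = mid
        return lo
    return bound(False), bound(True)

l, r = sa_range("si")
print(sorted(SA[l:r]))     # -> [3, 6], i.e. positions 4 and 7 with the slides' 1-based counting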

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
• Search for a window Lcp[i,i+C-2] whose entries are all ≥ L


Slide 73

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 10^4 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and the algorithm uses all of it:
(1/B) * (p * f/(1+f) * C) ≈ 30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
        4K    8K    16K   32K    128K   256K   512K   1M
n^3     22s   3m    26m   3.5h   28h    --     --     --
n^2     0     0     0     1s     26s    106s   7m     28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT
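The same scan, as a runnable Python sketch (a small variant that also records the window):

def max_subarray(A):
    best, best_window = float('-inf'), None
    s, start = 0, 0
    for i, x in enumerate(A):
        if s + x <= 0:
            s, start = 0, i + 1           # the optimum cannot start inside a <= 0 prefix
        else:
            s += x
            if s > best:
                best, best_window = s, (start, i)
    return best, best_window

print(max_subarray([2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]))   # -> (12, (2, 6))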

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort: Θ(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 10^9 random I/Os = 10^9 * 5ms ≈ 2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02   m = (i+j)/2;              (Divide)
03   Merge-Sort(A,i,m);        (Conquer)
04   Merge-Sort(A,m+1,j);
05   Merge(A,i,m,j)            (Combine)

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n = 10^9 tuples ⇒ a few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Θ(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = log_{M/B} #runs ≈ log_{M/B} (N/M)

Optimal cost = Θ( (N/B) · log_{M/B} (N/M) ) I/Os
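A minimal in-memory model of one merging pass (Python; a heap repeatedly extracts the minimum among the current heads of the X runs, as in the buffer picture — real code would read and write pages of B items):

import heapq

def multiway_merge(runs):
    # runs: list of sorted lists, each modelling a sorted run on disk
    heap = [(run[0], r, 0) for r, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        val, r, i = heapq.heappop(heap)
        out.append(val)                              # flushed when the output buffer fills
        if i + 1 < len(runs[r]):
            heapq.heappush(heap, (runs[r][i + 1], r, i + 1))
    return out

print(multiway_merge([[1, 2, 5, 10], [2, 7, 9, 13], [3, 4, 8, 19]]))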

In practice



M/B ≈ 1000 ⇒ #passes = log_{M/B} (N/M) ≈ 1
One multiway merge ⇒ 2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Can compression help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...
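A runnable sketch of the two-variable algorithm (a slight rewriting of the pseudocode above; the returned candidate is the majority item only when one really occurs more than N/2 times):

def majority_candidate(stream):
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1
        elif s == X:
            C += 1
        else:
            C -= 1
    return X

A = "baccc" + "dcbaaaccbccc"          # the stream of the slide, 17 items
print(majority_candidate(A))          # -> 'c' (it occurs 9 > 17/2 times)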

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 10^9, size = 6Gb
n = 10^6 documents
TotT = 10^9 (avg term length is 6 chars)
t = 5 * 10^5 distinct terms

What kind of data structure should we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million documents, t = 500K terms
Entry is 1 if the play contains the word, 0 otherwise:

            Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony             1                1             0          0       0        1
Brutus             1                1             0          1       0        0
Caesar             1                1             0          1       1        1
Calpurnia          0                1             0          0       0        0
Cleopatra          1                0             0          0       0        0
mercy              1                0             1          1       1        1
worser             1                0             1          1       1        0

Space is 500Gb !

Solution 2: Inverted index
Brutus    → 2 4 8 16 32 64 128
Calpurnia → 1 2 3 5 8 13 21 34
Caesar    → 13 16

We can still do better: i.e. 30-50% of the original text

1. Typically use about 12 bytes per posting
2. We have 10^9 total terms ⇒ at least 12Gb space
3. Compressing the 6Gb documents gets 1.5Gb data
Better index, but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (losslessly) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL of them into fewer bits ?
NO, they are 2^n but we have fewer compressed messages:

∑_{i=1,…,n−1} 2^i = 2^n − 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i(s) = log2 (1 / p(s)) = − log2 p(s)

Lower probability ⇒ higher information

Entropy is the weighted average of i(s)
H(S) = ∑_{s ∈ S} p(s) · log2 (1 / p(s))   bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into its codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La(C) = ∑_{s ∈ S} p(s) · L[s]

We say that a prefix code C is optimal if for
all prefix codes C’, La(C) ≤ La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, there exists a prefix
code with the same codeword lengths and thus the
same optimal average length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj ⇒ L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H(S) ≤ La(C)
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La(C) ≤ H(S) + 1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2^(n-1) “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)
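A short Python sketch building the code of the running example with a priority queue (ties may be broken differently from the slides, so the bit patterns can differ while the codeword lengths and the average length stay the same):

import heapq

def huffman_codes(probs):
    heap = [(p, i, s) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    nxt = len(heap)
    while len(heap) > 1:
        p1, _, t1 = heapq.heappop(heap)
        p2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, nxt, (t1, t2)))     # merge the two smallest trees
        nxt += 1
    codes = {}
    def walk(t, code):
        if isinstance(t, tuple):
            walk(t[0], code + "0"); walk(t[1], code + "1")
        else:
            codes[t] = code or "0"
    walk(heap[0][2], "")
    return codes

probs = {'a': .1, 'b': .2, 'c': .2, 'd': .5}
codes = huffman_codes(probs)
print(codes)                                            # lengths: a,b -> 3 bits, c -> 2, d -> 1
print(sum(probs[s] * len(codes[s]) for s in probs))     # average length = 1.8 bits/symbol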

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
− log2(.999) ≈ .00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H(s) = ∑_{i=1,…,m} 2^{m−i} * s[i]

P = 0101
H(P) = 2^3*0 + 2^2*1 + 2^1*0 + 2^0*1 = 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H(T_r) = 2 * H(T_{r−1}) − 2^m * T(r−1) + T(r+m−1)

T = 10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H(T1) = H(1011) = 11
H(T2) = H(0110) = 2*11 − 2^4*1 + 0 = 22 − 16 + 0 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally:
(1*2 mod 7) + 0 = 2
(2*2 mod 7) + 1 = 5
(5*2 mod 7) + 1 = 4
(4*2 mod 7) + 1 = 2
(2*2 mod 7) + 1 = 5
5 mod 7 = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(Tr−1):
2^m (mod q) = 2 * (2^{m−1} (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
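A Python sketch of the fingerprint scan on a binary text (a fixed small prime is used only for illustration, whereas the algorithm above picks a random prime; the explicit character-by-character check makes it the “definite match” variant):

def karp_rabin(T, P, q=101):
    n, m = len(T), len(P)
    hp = ht = 0
    for i in range(m):                         # Hq(P) and Hq(T1)
        hp = (2 * hp + int(P[i])) % q
        ht = (2 * ht + int(T[i])) % q
    top = pow(2, m - 1, q)                     # 2^(m-1) mod q
    occ = []
    for r in range(n - m + 1):
        if ht == hp and T[r:r + m] == P:       # verify, so no false match is reported
            occ.append(r)
        if r + m < n:                          # roll: drop T[r], append T[r+m]
            ht = ((ht - int(T[r]) * top) * 2 + int(T[r + m])) % q
    return occ

print(karp_rabin("10110101", "0101"))          # -> [4] (position 5 with 1-based counting)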

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the (j−1)-th one.
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1 for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with the i-th bit of U(T[j]) to establish whether both are true
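A bit-parallel Python sketch of this construction (an integer plays the role of the machine word, bit i−1 standing for row i of the column; BitShift becomes a left shift OR 1, and a set top bit signals a complete match):

def shift_and(T, P):
    m = len(P)
    U = {}
    for i, c in enumerate(P):                  # U[c]: bit i set iff P[i+1] = c
        U[c] = U.get(c, 0) | (1 << i)
    M, top = 0, 1 << (m - 1)
    occ = []
    for j, c in enumerate(T):
        M = ((M << 1) | 1) & U.get(c, 0)       # M(j) = BitShift(M(j-1)) & U(T[j])
        if M & top:                            # row m set: an occurrence ends at j
            occ.append(j - m + 1)
    return occ

print(shift_and("xabxabaaca", "abaac"))        # -> [4], the match ending at column j = 9 of the example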

An example j=1
1 2 3 4 5 6 7 8 9 10

n
m

1

12345

1

0

P=abaac

2

0

3

0

4

0

5

0

T=xabxabaaca

0
 
0
U (x)  0
 
0
 
0

2

3

4

5

6

7

1 
 
0
BitShift ( M ( 0 )) & U (T [1])   0  &
 
0
 
0

8

9

0 0
   
0 0
0  0
   
0 0
   
0 0

1
0

An example j=2
1 2 3 4 5 6 7 8 9 10

n
m

1

2

12345

1

0

1

P=abaac

2

0

0

3

0

0

4

0

0

5

0

0

T=xabxabaaca

1 
 
0
U (a )   1 
 
1 
 
0

3

4

5

6

7

1 
 
0
BitShift ( M (1)) & U (T [ 2 ])   0  &
 
0
 
0

8

9

1  1 
   
0 0
1    0 
   
1   0 
   
0 0

1
0

An example j=3
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

12345

1

0

1

0

P=abaac

2

0

0

1

3

0

0

0

4

0

0

0

5

0

0

0

T=xabxabaaca

0
 
1 
U (b )   0 
 
0
 
0

4

5

6

7

1 
 
1 
BitShift ( M ( 2 )) & U (T [ 3 ])   0  &
 
0
 
0

8

9

0 0
   
1  1 
0  0
   
0 0
   
0 0

1
0

An example j=9
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

4

5

6

7

8

9

12345

1

0

1

0

0

1

0

1

1

0

P=abaac

2

0

0

1

0

0

1

0

0

0

3

0

0

0

0

0

0

1

0

0

4

0

0

0

0

0

0

0

1

0

5

0

0

0

0

0

0

0

0

1

T=xabxabaaca

0
 
0
U (c )   0 
 
0
 
1 

1 
 
1 
BitShift ( M ( 8 )) & U (T [ 9 ])   0  &
 
0
 
1 

0 0
   
0 0
0  0
   
0 0
   
1  1 

1
0

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: Another solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of length m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
there are no more than l mismatches between the
first i characters of P and the i characters of T
ending at character j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

BitShift( M^l(j−1) ) & U(T[j])

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

BitShift( M^{l−1}(j−1) )

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M^l(j) = [ BitShift( M^l(j−1) ) & U(T[j]) ]  OR  BitShift( M^{l−1}(j−1) )
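A sketch extending the previous bit-parallel code to this recurrence (one integer column per error level l = 0,…,k; again Python integers stand in for machine words):

def shift_and_k_mismatches(T, P, k):
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    top = 1 << (m - 1)
    M = [0] * (k + 1)                          # M[l] = column M^l at the previous position
    occ = []
    for j, c in enumerate(T):
        prev, cur = M, []
        for l in range(k + 1):
            col = ((prev[l] << 1) | 1) & U.get(c, 0)       # case 1: the characters match
            if l > 0:
                col |= (prev[l - 1] << 1) | 1              # case 2: spend one mismatch
            cur.append(col)
        M = cur
        if M[k] & top:
            occ.append(j - m + 1)              # occurrence with <= k mismatches ending at j
    return occ

print(shift_and_k_mismatches("xabxabaaca", "abaad", 1))    # -> [4]: "abaac" vs "abaad", 1 mismatch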

Example M1
1 2 3 4 5 6 7 8 910

M1=

T=xabxabaaca
P=
abaad

1

2

3

4

5

6

7

8

9

1
0

1

1

1

1

1

1

1

1

1

1

1

2

0

0

1

0

0

1

0

1

1

0

3

0

0

0

1

0

0

1

0

0

1

4

0

0

0

0

1

0

0

1

0

0

5

0

0

0

0

0

0

0

0

1

0

M0=

1

2

3

4

5

6

7

8

9

1
0

1

0

1

0

0

1

0

1

1

0

1

2

0

0

1

0

0

1

0

0

0

0

3

0

0

0

0

0

0

1

0

0

0

4

0

0

0

0

0

0

0

1

0

0

5

0

0

0

0

0

0

0

0

0

0

How much do we pay?





The running time is O(k n (1 + m/w)).
Again, the method is practically efficient for
small m.
Still, only O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Delection: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
0 000 ...........0 x in b in ary
Length-1


x > 0 and Length = log2 x +1
e.g., 9 represented as <000,1001>.



g-code for x takes 2 log2 x +1 bits

(ie. factor of 2 from optimal)



Optimal for Pr(x) = 1/2x2, and i.i.d integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8

6

3

59

7

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How much good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
1 ≥Si=1,...,x pi ≥ x * px  x ≤ 1/px

How good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):



p i  | g ( i ) |

i  1 ,..., S

This is:



i  1 ,.., S

p i  [ 2 * log

1

 1]

pi

 2 * H 0 ( X ) 1
No much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 128^2 = 16512 words on at most 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on at most 2
bytes, hence more words on 1 byte, and thus it wins if the distribution is skewed…

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n^2 log n), MTF = O(n log n) + n^2

No much worse than Huffman
...but it may be far better
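A Python sketch of the encoder/decoder (positions are 0-based here; γ or any other integer coder is then applied to the output):

def mtf_encode(text, alphabet):
    L, out = list(alphabet), []
    for s in text:
        i = L.index(s)
        out.append(i)
        L.insert(0, L.pop(i))          # move-to-front: recently seen symbols get small codes
    return out

def mtf_decode(codes, alphabet):
    L, out = list(alphabet), []
    for i in codes:
        out.append(L[i])
        L.insert(0, L.pop(i))
    return "".join(out)

enc = mtf_encode("abcddcbaa", "abcd")
print(enc)                             # -> [0, 1, 2, 3, 0, 1, 2, 3, 0]
assert mtf_decode(enc, "abcd") == "abcddcbaa"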

MTF: how good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2 * log i + 1
Put S in the front and consider the cost of encoding:

O(|S| log |S|)  +  ∑_{x=1,…,|S|} ∑_{i=2,…,n_x} |γ( p_i^x − p_{i−1}^x )|

By Jensen’s inequality this is:

≤  O(|S| log |S|)  +  ∑_{x=1,…,|S|} n_x * [ 2 * log (N / n_x) + 1 ]
=  O(|S| log |S|)  +  N * [ 2 * H0(X) + 1 ]

Hence  La[mtf]  ≤  2 * H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1^n 2^n 3^n … n^n  ⇒

There is a memory

Huff(X) = Θ(n^2 log n) > Rle(X) = Θ(n log n)
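The corresponding one-pass run-length encoder, as a tiny sketch:

def rle_encode(s):
    out = []
    for c in s:
        if out and out[-1][0] == c:
            out[-1] = (c, out[-1][1] + 1)      # extend the current run
        else:
            out.append((c, 1))                 # open a new run
    return out

print(rle_encode("abbbaacccca"))               # -> [('a',1), ('b',3), ('a',2), ('c',4), ('a',1)]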

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.  p(a) = .2, p(b) = .5, p(c) = .3   ⇒   a = [0, .2),  b = [.2, .7),  c = [.7, 1.0)

f(i) = ∑_{j=1,…,i−1} p(j),   i.e.  f(a) = .0, f(b) = .2, f(c) = .7

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
1.0

0.7
c = .3

0.7

c = .3
0.55

0.0

a = .2

0.3
0.2

c = .3

0.27
b = .5

b = .5
0.2

0.3

a = .2

b = .5
0.22

a = .2

0.2

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l 0  0 l i  l i -1  s i -1 * f c i 

s0  1

s i  s i -1 * p c i 

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

n

sn 

 p c 
i

i 1

The interval for a message sequence will be called the
sequence interval
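A float-based Python sketch of this interval computation (fine for short messages; real coders use the integer/scaling version presented below):

def sequence_interval(msg, p, f):
    l, s = 0.0, 1.0
    for c in msg:
        l, s = l + s * f[c], s * p[c]          # l_i, s_i from l_{i-1}, s_{i-1}
    return l, l + s

p = {'a': .2, 'b': .5, 'c': .3}
f = {'a': .0, 'b': .2, 'c': .7}
lo, hi = sequence_interval("bac", p, f)
print(round(lo, 3), round(hi, 3))              # -> 0.27 0.3, the sequence interval [.27, .3)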

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
1.0

0.7
c = .3

0.7
0.49

b = .5

c = .3

0.55
0.49

0.2

0.0

0.55

b = .5

0.3
a = .2

The message is bbc.

0.2

0.49
0.475

c = .3

b = .5
0.35

a = .2

0.3

a = .2

Representing a real number
Binary fractional representation:
.75 = .11        1/3 = .010101…        11/16 = .1011

Algorithm
1. x = 2*x
2. If x < 1 output 0
3. else x = x − 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
code    min      max      interval
.11     .110     .111     [.75, 1.0)
.101    .1010    .1011    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
Context
Empty

Counts
A
B
C
$

=
=
=
=

4
2
5
3

Context
A
B
C

Counts
C
$
A
$
A
B
C
$

=
=
=
=
=
=
=
=

3
1
2
1
1
2
2
3

Context
AC

BA
CA
CB
CC

String = ACCBACCACBA B

k=2

Counts
B
C
$
C
$
C
$
A
$
A
B
$

=
=
=
=
=
=
=
=
=
=
=
=

1
2
2
1
1
1
1
2
1
1
1
2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
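A Python sketch of both directions: the forward transform literally sorts all rotations (the “elegant but inefficient” route of the next slide), while the inverse follows the LF-mapping of the pseudocode above.

def bwt(T):
    # T must end with a unique smallest symbol '#'
    rows = sorted(T[i:] + T[:i] for i in range(len(T)))
    return "".join(r[-1] for r in rows)

def ibwt(L):
    F = sorted(L)
    first = {}
    for i, c in enumerate(F):              # first row of each char in F
        first.setdefault(c, i)
    rank, seen = [], {}
    for c in L:                            # rank of each L-char among its equals
        rank.append(seen.get(c, 0)); seen[c] = seen.get(c, 0) + 1
    LF = [first[c] + rank[i] for i, c in enumerate(L)]     # property 1: L -> F mapping
    out, r = [], L.index('#')              # the row whose L-char is '#' is T itself
    for _ in range(len(L)):
        out.append(L[r])                   # property 2: L[r] precedes F[r] in T
        r = LF[r]
    return "".join(reversed(out))

L = bwt("mississippi#")
print(L)                                   # -> ipssm#pissii
assert ibwt(L) == "mississippi#"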

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n^2 log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in - degree ( u )  k ] 

a  2.1

1

k

a

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides an efficient, optimal solution




fknown is the “previously encoded text”: compress the concatenation fknown·fnew, starting from fnew

zdelta is one of the best implementations
           Emacs size    Emacs time
uncompr    27Mb          ---
gzip       8Mb           35 secs
zdelta     1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: example of the weighted graph GF with the dummy node; edge weights are the zdelta/gzip sizes, and the min branching picks the cheapest reference for each file]

           space    time
uncompr    30Mb     ---
tgz        20%      linear
THIS       8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
           space    time
uncompr    260Mb    ---
tgz        12%      2 mins
THIS       8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

          gcc size    emacs size
total     27288       27326
gzip      7563        8577
zdelta    227         1431
rsync     964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike rsync, where the client sends them); the client checks them.
Server deploys the common fref to compress the new ftar (rsync compresses just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes fail to find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes fail to find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Figure: suffix tree of T# = mississippi#; edges are labelled with substrings (i, s, si, ssi, ppi#, ...) and the leaves carry the starting positions of the corresponding suffixes]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Θ(N^2) space if SUF(T) is stored explicitly

SA    SUF(T)
12    #
11    i#
8     ippi#
5     issippi#
2     ississippi#
1     mississippi#
10    pi#
9     ppi#
7     sippi#
4     sissippi#
6     ssippi#
3     ssissippi#

T = mississippi#   (SA stores the suffix pointers)

P=si
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison

SA = 12 11 8 5 2 1 10 9 7 4 6 3,   T = mississippi#,   P = si

P is larger than the middle suffix: 2 accesses per step

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison

SA = 12 11 8 5 2 1 10 9 7 4 6 3,   T = mississippi#,   P = si

P is smaller than the current suffix

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time, improvable to O(p + log2 N) [Manber-Myers, ’90] and to O(p + log2 |S|) [Cole et al, ’06]
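A minimal sketch of the O(p log2 N) search: plain binary search over SA, paying up to p character comparisons per step (names and the 0-based indexing are illustrative).

#include <string>
#include <utility>
#include <vector>

// Range [lo, hi) of positions in SA whose suffixes have P as a prefix.
std::pair<size_t, size_t> sa_search(const std::string& T,
                                    const std::vector<int>& SA,
                                    const std::string& P) {
    // compare the first |P| chars of the suffix starting at 'suf' against P: O(p)
    auto cmp = [&](int suf) { return T.compare(suf, P.size(), P); };
    size_t lo = 0, hi = SA.size();
    while (lo < hi) {                       // first suffix >= P
        size_t m = (lo + hi) / 2;
        if (cmp(SA[m]) < 0) lo = m + 1; else hi = m;
    }
    size_t begin = lo;
    hi = SA.size();
    while (lo < hi) {                       // first suffix whose p-char prefix is > P
        size_t m = (lo + hi) / 2;
        if (cmp(SA[m]) <= 0) lo = m + 1; else hi = m;
    }
    return {begin, lo};                     // occ = lo - begin
}

With T = mississippi# and its suffix array, sa_search(T, SA, "si") returns a range of size 2, i.e. the two occurrences of si.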

Locating the occurrences
SA = 12 11 8 5 2 1 10 9 7 4 6 3,   T = mississippi#

All suffixes prefixed by P = si are contiguous in SA: sippi# (position 7) and sissippi# (position 4), so occ = 2. The range is delimited by binary-searching the two boundary patterns si# and si$ (where # is smaller and $ larger than every symbol).

Suffix Array search
• O (p + log2 N + occ) time

Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp = 0 0 1 4 0 0 1 0 2 1 3
SA  = 12 11 8 5 2 1 10 9 7 4 6 3
T = mississippi#

E.g. the adjacent suffixes issippi# and ississippi# share a prefix of length 4.
• How long is the common prefix between T[i,...] and T[j,...] ?
• It is the minimum of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Is there a repeated substring of length ≥ L ?
• Search for some Lcp[i] ≥ L
• Is there a substring of length ≥ L occurring ≥ C times ?
• Search for a window Lcp[i,i+C-2] whose entries are all ≥ L


Slide 74

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)

This is at least 10^4 * f/(1+f)

If we fetch B ≈ 4Kb in time C, and the algorithm uses all of them:
(1/B) * (p * f/(1+f) * C)  ≈  30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs, Tens of nanosecs, Some words fetched
Few Tbs, Few millisecs, B = 32K page
Many Tbs, Even secs, Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
       4K    8K    16K   32K    128K   256K   512K   1M
n^3    22s   3m    26m   3.5h   28h    --     --     --
n^2    0     0     0     1s     26s    106s   7m     28m

An optimal solution
We assume every subsum ≠ 0

[Figure: A drawn as a prefix with sum < 0, followed by the region containing the Optimum, whose running sums are > 0]

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum = 0; max = -1;
For i = 1, ..., n do
    If (sum + A[i] ≤ 0) sum = 0;
    else { sum += A[i]; max = MAX{max, sum}; }

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT
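The same scan in C++ (a sketch; it returns the best sum over non-empty windows and restarts the running sum exactly as in the pseudocode above).

#include <algorithm>
#include <vector>

long long max_subarray_sum(const std::vector<long long>& A) {
    long long sum = 0, best = A.empty() ? 0 : A[0];
    for (long long x : A) {
        sum = std::max(0LL, sum) + x;   // reset when the running sum would be <= 0
        best = std::max(best, sum);
    }
    return best;   // A = {2,-5,6,1,-2,4,3,-13,9,-6,7}  ->  12  (window 6 1 -2 4 3)
}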

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort  ⇒  Θ(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 10^9 random I/Os = 10^9 * 5ms  ⇒  ≈ 2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01  if (i < j) then
02      m = (i+j)/2;               // Divide
03      Merge-Sort(A,i,m);         // Conquer
04      Merge-Sort(A,m+1,j);
05      Merge(A,i,m,j)             // Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:

  n = 10^9 tuples  ⇒  a few Gbs
  Typical disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:

  It is an indirect sort: Θ(n log2 n) random I/Os

  [5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N levels

If the run-size is larger than B (i.e. after the first step!!),
fetching all of it in memory for merging does not help

How do we deploy the disk/memory features?

[Figure: binary merge-sort recursion tree over a sample sequence of keys; M marks the runs that fit in internal memory]

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X ≤ M/B runs at a time  ⇒  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
[Figure: X = M/B input buffers Bf1 … BfX, one per run, plus one output buffer Bfo, each holding B items.
Repeatedly output min(Bf1[p1], Bf2[p2], …, BfX[pX]); fetch the next page of run i when pi = B; flush Bfo to the merged output run when it is full, until EOF.]
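An in-memory sketch of the X-way merging step above, with a min-heap over the current heads of the runs; the real external version reads and writes pages of B items, which is omitted here, and the names are illustrative.

#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Merge X sorted runs into one sorted run (one heap entry per run).
std::vector<int> multiway_merge(const std::vector<std::vector<int>>& runs) {
    using Head = std::pair<int, size_t>;                    // (value, run index)
    std::priority_queue<Head, std::vector<Head>, std::greater<Head>> heap;
    std::vector<size_t> pos(runs.size(), 0);
    for (size_t r = 0; r < runs.size(); ++r)
        if (!runs[r].empty()) heap.push({runs[r][0], r});
    std::vector<int> out;
    while (!heap.empty()) {
        auto [v, r] = heap.top(); heap.pop();
        out.push_back(v);                                   // "flush" to the merged run
        if (++pos[r] < runs[r].size())
            heap.push({runs[r][pos[r]], r});                // "fetch" the next head of run r
    }
    return out;                                             // with X = M/B this is one pass
}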

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs ≤ logM/B N/M

Optimal cost = Θ((N/B) logM/B N/M) I/Os

In practice

M/B ≈ 1000  ⇒  #passes = logM/B N/M ≈ 1
One multiway merge  ⇒  2 passes (R/W) = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Can compression help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables <X, C>, initialized e.g. with X = first item and C = 1
For each subsequent item s of the stream:

    if (X == s) then C++
    else { C--; if (C == 0) { X = s; C = 1; } }

Return X;
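The same one-pass majority scheme in C++ (a standard variant: the candidate is replaced as soon as the counter reaches zero; the result is guaranteed to be the mode only if some item really occurs > N/2 times).

#include <string>

char majority_candidate(const std::string& stream) {
    char X = 0;
    long long C = 0;
    for (char s : stream) {
        if (C == 0)      { X = s; C = 1; }   // adopt a fresh candidate
        else if (X == s) { ++C; }            // same item: reinforce
        else             { --C; }            // different item: cancel one occurrence
    }
    return X;   // for A = "bacccdcbaaaccbccc" it returns 'c'
}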

Proof

It may return garbage if no item occurs > N/2 times.

If X ≠ y at the end, then every one of y’s occurrences has a distinct “negative” mate (an occurrence that cancelled it). Hence these mates should be ≥ #occ(y).

As a result, the stream would contain 2 * #occ(y) > N items: a contradiction.

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 10^9 characters, size = 6Gb
n = 10^6 documents
TotT = 10^9 term occurrences (avg term length is 6 chars)
t = 5 * 10^5 distinct terms

What kind of data structure should we build to support word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million documents, t = 500K terms.  Entry is 1 if the play contains the word, 0 otherwise.

            Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
Antony              1                    1              0           0        0         1
Brutus              1                    1              0           1        0         0
Caesar              1                    1              0           1        1         1
Calpurnia           0                    1              0           0        0         0
Cleopatra           1                    0              0           0        0         0
mercy               1                    0              1           1        1         1
worser              1                    0              1           1        1         0

Space is 500Gb !

Solution 2: Inverted index
Brutus    →  2 4 8 16 32 64 128
Calpurnia →  1 2 3 5 8 13 21 34
Caesar    →  13 16

We can still do better, i.e. 30–50% of the original text:

1. Typically a posting uses about 12 bytes
2. We have 10^9 total term occurrences  ⇒  at least 12Gb of space
3. Compressing the 6Gb of documents gets 1.5Gb of data
Better index, but yet it is >10 times the (compressed) text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in fewer bits ?
NO, they are 2^n but we have fewer compressed messages:

∑_{i=1..n-1} 2^i = 2^n − 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i(s) = log2 (1 / p(s)) = − log2 p(s)

Lower probability  ⇒  higher information

Entropy is the weighted average of i(s):

H(S) = ∑_{s ∈ S} p(s) * log2 (1 / p(s))    bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La(C) = ∑_{s ∈ S} p(s) * L[s]

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal uniquely decodable code, there exists a prefix code with the same codeword lengths, and thus the same optimal average length. And vice versa…

Theorem (golden rule). If C is an optimal prefix code for a source with probabilities {p1, …, pn}, then pi < pj  ⇒  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H(S) ≤ La(C)

Theorem (upper bound, Shannon). For any probability distribution, there exists a prefix code C such that

La(C) ≤ H(S) + 1

The Shannon code assigns to symbol s a codeword of ⌈log2 (1/p(s))⌉ bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!
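A small sketch of the greedy construction (repeatedly merge the two least-probable nodes); it returns only the codeword lengths, from which a canonical code can be assigned. Names are illustrative.

#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Huffman codeword length of each symbol, given its probability.
std::vector<int> huffman_lengths(const std::vector<double>& p) {
    int n = p.size();
    using Node = std::pair<double, int>;                    // (probability, node id)
    std::priority_queue<Node, std::vector<Node>, std::greater<Node>> pq;
    std::vector<int> parent(n, -1);                         // leaves 0..n-1, internals n, n+1, ...
    for (int i = 0; i < n; ++i) pq.push({p[i], i});
    while (pq.size() > 1) {
        auto [pa, a] = pq.top(); pq.pop();                  // two least-probable nodes
        auto [pb, b] = pq.top(); pq.pop();
        int id = parent.size();
        parent.push_back(-1);
        parent[a] = parent[b] = id;                         // merge them under a new node
        pq.push({pa + pb, id});
    }
    std::vector<int> len(n, 0);
    for (int i = 0; i < n; ++i)                             // codeword length = leaf depth
        for (int v = i; parent[v] != -1; v = parent[v]) ++len[i];
    return len;   // p = {.1, .2, .2, .5}  ->  {3, 3, 2, 1}, as in the running example below
}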

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
[Figure: Huffman tree built by repeatedly merging the two least-probable nodes: a(.1)+b(.2) → (.3), (.3)+c(.2) → (.5), (.5)+d(.5) → (1)]

a=000, b=001, c=01, d=1
There are 2^(n-1) “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
abc...  →  00000101
101001...  →  dcb...

[Figure: the Huffman tree of the running example, walked root-to-leaf for encoding and bit-by-bit for decoding]

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

  firstcode[L]   (the first codeword of that level; at the deepest level it is 00.....0)

  Symbol[L,i], for each i in level L

This is ≤ h^2 + |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
− log2 (.999) ≈ .00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:

The model takes |S|^k * (k * log |S|) + h^2 bits (where h might be |S|)

It is H0(S^L) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
[Figure: word-based Huffman tree with fan-out 128 over the words of T = “bzip or not bzip” (bzip, or, not, space); each codeword is a sequence of 7-bit symbols, byte-aligned, and the first bit of each byte is used for tagging]

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary: bzip, not, or, space

P = bzip = 1a 0b

[Figure: the dictionary words are Huffman-coded; P’s tagged codeword is searched directly in the compressed text C(S), S = “bzip or not bzip”, answering yes/no at each byte-aligned position]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m:

H(s) = ∑_{i=1..m} 2^(m−i) * s[i]

P = 0101
H(P) = 2^3*0 + 2^2*1 + 2^1*0 + 2^0*1 = 5

s = s’ if and only if H(s) = H(s’)

Definition:
let Tr denote the m-length substring of T starting at position r (i.e., Tr = T[r, r+m−1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H(Tr) = 2 * H(Tr−1) − 2^m * T(r−1) + T(r+m−1)

T = 10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H(T1) = H(1011) = 11
H(T2) = H(0110) = 2*11 − 2^4*1 + 0 = 22 − 16 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P = 101111,  q = 7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally, scanning the bits of P:
1*2 + 0 = 2 (mod 7)
2*2 + 1 = 5 (mod 7)
5*2 + 1 = 4 (mod 7)
4*2 + 1 = 2 (mod 7)
2*2 + 1 = 5 (mod 7)  =  Hq(P)

We can still compute Hq(Tr) from Hq(Tr−1), since
2^m (mod q) = 2 * (2^(m−1) (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
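A compact sketch of the scan over a binary text with a fixed prime q (a real implementation picks q at random below I; here probable matches are verified explicitly, so no false positives are reported). Names are illustrative.

#include <string>
#include <vector>

// 0-based start positions r with Hq(Tr) = Hq(P); candidates are verified explicitly.
std::vector<int> karp_rabin(const std::string& T, const std::string& P, long long q) {
    int n = T.size(), m = P.size();
    std::vector<int> hits;
    if (m == 0 || n < m) return hits;
    long long hP = 0, hT = 0, pow = 1;                      // pow = 2^(m-1) mod q
    for (int i = 0; i < m; ++i) {
        hP = (2 * hP + (P[i] - '0')) % q;
        hT = (2 * hT + (T[i] - '0')) % q;
        if (i < m - 1) pow = (2 * pow) % q;
    }
    for (int r = 0; ; ++r) {
        if (hT == hP && T.compare(r, m, P) == 0) hits.push_back(r);
        if (r + m >= n) break;
        // Hq(T_{r+1}) = 2*(Hq(T_r) - T[r]*2^(m-1)) + T[r+m]   (mod q)
        hT = ((hT - (T[r] - '0') * pow % q + q) * 2 + (T[r + m] - '0')) % q;
    }
    return hits;   // e.g. karp_rabin("10110101", "0101", 7) -> {4} (0-based)
}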

Problem 1: Solution
Dictionary: bzip, not, or, space

P = bzip = 1a 0b

[Figure: as above, P’s codeword is matched byte-aligned against C(S), S = “bzip or not bzip”, answering yes/no at each position]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

[Figure: the m×n matrix M for T = california and P = for; e.g. M(3,7) = 1 because P = “for” ends at position 7 of T]

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit bit-parallelism to compute the j-th column of M from the (j−1)-th one.
We define the m-length binary vector U(x) for each character x in the alphabet: U(x) is set to 1 at the positions in P where character x appears.

Example: P = abaac
U(a) = (1,0,1,1,0)
U(b) = (0,1,0,0,0)
U(c) = (0,0,0,0,1)

How to construct M



Initialize column 0 of M to all zeros.
For j > 0, the j-th column is obtained as

M(j) = BitShift( M(j−1) ) & U( T[j] )

For i > 1, entry M(i,j) = 1 iff
(1) the first i−1 characters of P match the i−1 characters of T ending at position j−1, i.e. M(i−1, j−1) = 1
(2) P[i] = T[j], i.e. the i-th bit of U(T[j]) is 1

BitShift moves bit M(i−1, j−1) into the i-th position; AND-ing it with the i-th bit of U(T[j]) establishes whether both conditions hold.
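A word-sized sketch of the loop above (m ≤ w = 64): bit i−1 of the machine word plays the role of row i of the current column of M. Names are illustrative.

#include <cstdint>
#include <string>
#include <vector>

// 0-based end positions of the exact occurrences of P in T (requires m <= 64).
std::vector<int> shift_and(const std::string& T, const std::string& P) {
    int m = P.size();
    std::vector<int> ends;
    if (m == 0 || m > 64) return ends;
    uint64_t U[256] = {0};
    for (int i = 0; i < m; ++i) U[(unsigned char)P[i]] |= 1ULL << i;   // the U(x) vectors
    uint64_t M = 0;                                                    // column M(j-1)
    for (int j = 0; j < (int)T.size(); ++j) {
        M = ((M << 1) | 1ULL) & U[(unsigned char)T[j]];                // BitShift(M) & U(T[j])
        if (M & (1ULL << (m - 1))) ends.push_back(j);                  // bit m set: full match
    }
    return ends;   // T = "xabxabaaca", P = "abaac"  ->  {8}
}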

An example
T = xabxabaaca,   P = abaac
U(a) = 10110, U(b) = 01000, U(c) = 00001, U(x) = 00000   (bit i set iff P[i] is that character)

M(0) = 00000 and, column by column, M(j) = BitShift(M(j−1)) & U(T[j]):

  j  :   1     2     3     4     5     6     7     8     9    10
 T[j]:   x     a     b     x     a     b     a     a     c     a
 M(j): 00000 10000 01000 00000 10000 01000 10100 10010 00001 10000

At j = 9 the 5th (last) bit of M(9) is 1: P occurs in T ending at position 9.

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: Another solution
Dictionary: bzip, not, or, space

P = bzip = 1a 0b

[Figure: P’s codeword is again matched against C(S), S = “bzip or not bzip”, this time with the Shift-And machinery]

Speed ≈ Compression ratio

Problem 2
Dictionary: bzip, not, or, space

Given a pattern P, find all the occurrences in S of all terms containing P as a substring

P = o

The codewords of the matching terms are then searched in C(S), S = “bzip or not bzip”:
not = 1g 0g 0a
or  = 1g 0a 0b

Speed ≈ Compression ratio? No! Why?
A scan of C(S) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: text T with the occurrences of patterns P1 and P2 marked]
 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

S is the concatenation of the patterns in P
R is a bitmap of length m:
  R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of the Shift-And method searching for S:
  For any symbol c, U'(c) = U(c) AND R
    U'(c)[i] = 1 iff S[i] = c and it is the first symbol of a pattern
  For any step j,
    compute M(j)
    then M(j) OR U'(T[j]). Why?
      Set to 1 the first bit of each pattern that starts with T[j]
    Check if there are occurrences ending in j. How?

Problem 3
Dictionary: bzip, not, or, space

Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing at most k mismatches

P = bot, k = 2

[Figure: the dictionary terms within k mismatches of P are identified, and their codewords are searched in C(S), S = “bzip or not bzip”]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix M^l to be an m by n binary matrix, such that:

M^l(i,j) = 1 iff there are no more than l mismatches between the first i characters of P and the i characters of T ending at position j.

What is M^0 ?
How does M^k solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing M^l: case 1

The first i−1 characters of P match a substring of T ending at j−1, with at most l mismatches, and the next pair of characters in P and T are equal:

BitShift( M^l(j−1) ) & U( T[j] )

Computing M^l: case 2

The first i−1 characters of P match a substring of T ending at j−1, with at most l−1 mismatches (position i may then mismatch):

BitShift( M^(l−1)(j−1) )

Computing M^l

We compute M^l for all l = 0, …, k.
For each j compute M^0(j), M^1(j), …, M^k(j).
For all l, initialize M^l(0) to the zero vector.
In order to compute M^l(j), we observe that there is a match iff case 1 or case 2 above holds:

M^l(j) = [ BitShift( M^l(j−1) ) & U( T[j] ) ]  OR  BitShift( M^(l−1)(j−1) )
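A sketch of the k-mismatch recurrence with one 64-bit word per column M^l (again m ≤ 64; names illustrative).

#include <cstdint>
#include <string>
#include <vector>

// 0-based end positions where P occurs in T with at most k mismatches.
std::vector<int> shift_and_mismatch(const std::string& T, const std::string& P, int k) {
    int m = P.size();
    std::vector<int> ends;
    if (m == 0 || m > 64) return ends;
    uint64_t U[256] = {0};
    for (int i = 0; i < m; ++i) U[(unsigned char)P[i]] |= 1ULL << i;
    std::vector<uint64_t> M(k + 1, 0);                       // M[l] holds column M^l(j-1)
    for (int j = 0; j < (int)T.size(); ++j) {
        uint64_t prevShifted = 0;                            // BitShift(M^(l-1)(j-1))
        for (int l = 0; l <= k; ++l) {
            uint64_t shifted = (M[l] << 1) | 1ULL;           // BitShift(M^l(j-1))
            M[l] = (shifted & U[(unsigned char)T[j]])        // case 1: characters match
                 | (l > 0 ? prevShifted : 0);                // case 2: spend one mismatch
            prevShifted = shifted;
        }
        if (M[k] & (1ULL << (m - 1))) ends.push_back(j);
    }
    return ends;   // T = "aatatccacaa", P = "atcgaa", k = 2  ->  {8}
}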

Example M^1
T = xabxabaaca,   P = abaad

M^0(j): 00000 10000 01000 00000 10000 01000 10100 10010 00000 10000   (j = 1,…,10)
M^1(j): 10000 10000 11000 10100 10010 11000 10100 11010 11001 10100

M^1(5,9) = 1: P occurs ending at position 9 of T with at most 1 mismatch.

How much do we pay?





The running time is O(k n (1 + m/w)).
Again, the method is practically efficient for small m.
Only O(k) columns of M are needed at any given time, hence the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary: bzip, not, or, space

Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing k mismatches

P = bot, k = 2

[Figure: the only dictionary term within 2 mismatches of P is “not”, whose codeword is 1g 0g 0a; that codeword is then searched in C(S), S = “bzip or not bzip”]

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding

γ(x) = 0^(Length−1) followed by x in binary,  where x > 0 and Length = ⌊log2 x⌋ + 1
e.g., 9 is represented as <000, 1001>.

γ-code for x takes 2⌊log2 x⌋ + 1 bits   (i.e. a factor of 2 from optimal)

Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers

It is a prefix-free encoding…

Given the following sequence of γ-coded integers, reconstruct the original sequence:

0001000001100110000011101100111

8  6  3  59  7
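A bit-string sketch of γ-encoding/decoding as defined above (strings of '0'/'1' instead of packed bits, for clarity).

#include <string>
#include <vector>

// gamma(x) = (Length-1) zeros followed by x in binary, x > 0.
std::string gamma_encode(unsigned x) {
    std::string bin;
    for (unsigned v = x; v > 0; v >>= 1) bin = char('0' + (v & 1)) + bin;
    return std::string(bin.size() - 1, '0') + bin;   // e.g. 9 -> "000" + "1001"
}

// Decode a concatenation of gamma codewords.
std::vector<unsigned> gamma_decode(const std::string& bits) {
    std::vector<unsigned> out;
    size_t i = 0;
    while (i < bits.size()) {
        size_t z = 0;
        while (bits[i + z] == '0') ++z;              // Length-1 leading zeros
        unsigned x = 0;
        for (size_t j = 0; j <= z; ++j) x = 2 * x + (bits[i + z + j] - '0');
        out.push_back(x);
        i += 2 * z + 1;
    }
    return out;   // on the sequence above it returns 8 6 3 59 7
}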

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).

Recall that: |γ(i)| ≤ 2 * log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
1 ≥ ∑_{i=1..x} pi ≥ x * px   ⇒   x ≤ 1/px

How good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):

∑_{i=1..|S|} pi * |γ(i)|  ≤  ∑_{i=1..|S|} pi * [2 * log(1/pi) + 1]  =  2 * H0(X) + 1

Not much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s*c^2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 128^2 = 16512 words on at most 2 bytes
A (230,26)-dense code encodes 230 + 230*26 = 6210 words on at most 2 bytes, hence more words on 1 byte, and thus it wins if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n^2 log n), MTF = O(n log n) + n^2

Not much worse than Huffman
...but it may be far better
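A simple list-based MTF encoder for the scheme above; each symbol costs O(|S|) here, while the search-tree/hash-table organization discussed a few slides below brings it to O(log |S|). Names are illustrative.

#include <list>
#include <string>
#include <vector>

// For each input symbol: output its current 0-based position in L, then move it to the front.
std::vector<int> mtf_encode(const std::string& text, std::list<char> L) {
    std::vector<int> out;
    for (char s : text) {
        int pos = 0;
        auto it = L.begin();
        while (*it != s) { ++it; ++pos; }            // symbol assumed to be in L
        out.push_back(pos);
        L.erase(it);
        L.push_front(s);                             // temporal locality -> small integers
    }
    return out;   // mtf_encode("aaabbb", {'a','b','c'}) -> 0 0 0 1 0 0
}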

MTF: how good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2 * log i + 1
Put S at the front of the list and consider the cost of encoding: the position emitted for the i-th occurrence of symbol x is at most the gap from its previous occurrence, so the total cost is at most

O(|S| log |S|) + ∑_{x=1..|S|} ∑_{i=2..nx} γ( p_{x,i} − p_{x,i−1} )

By Jensen’s inequality:

≤ O(|S| log |S|) + ∑_{x=1..|S|} nx * [2 * log(N/nx) + 1]
= O(|S| log |S|) + N * [2 * H0(X) + 1]

Hence  La[mtf] ≤ 2 * H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code (there is a memory)

X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = Θ(n^2 log n)  >  Rle(X) = Θ(n log n)
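A sketch of the run-length transform above (pairs (char, run length); for binary strings one would keep only the lengths plus one starting bit).

#include <string>
#include <utility>
#include <vector>

// abbbaacccca -> (a,1)(b,3)(a,2)(c,4)(a,1)
std::vector<std::pair<char, int>> rle_encode(const std::string& s) {
    std::vector<std::pair<char, int>> runs;
    for (char c : s) {
        if (!runs.empty() && runs.back().first == c) ++runs.back().second;
        else runs.push_back({c, 1});
    }
    return runs;
}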

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.
p(a) = .2, p(b) = .5, p(c) = .3

f(i) = ∑_{j < i} p(j)    ⇒    f(a) = .0, f(b) = .2, f(c) = .7

a → [0.0, 0.2),   b → [0.2, 0.7),   c → [0.7, 1.0)

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
start       [0.0, 1.0)
b = .5  →   [0.2, 0.7)
a = .2  →   [0.2, 0.3)
c = .3  →   [0.27, 0.3)

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l_0 = 0,   l_i = l_(i−1) + s_(i−1) * f[c_i]
s_0 = 1,   s_i = s_(i−1) * p[c_i]

f[c] is the cumulative prob. up to symbol c (not included)

The final interval size is   s_n = ∏_{i=1..n} p[c_i]

The interval for a message sequence will be called the
sequence interval

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval
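A floating-point sketch of the interval computation defined above (real coders use the integer/scaling version described later; the example probabilities are those of the running example, and the names are illustrative).

#include <map>
#include <string>
#include <utility>

// Final sequence interval [l, l+s) of a message, given probabilities p and cumulative f.
std::pair<double, double> sequence_interval(const std::string& msg,
                                            const std::map<char, double>& p,
                                            const std::map<char, double>& f) {
    double l = 0.0, s = 1.0;
    for (char c : msg) {
        l = l + s * f.at(c);          // l_i = l_(i-1) + s_(i-1) * f[c_i]
        s = s * p.at(c);              // s_i = s_(i-1) * p[c_i]
    }
    return {l, l + s};
}
// With p = {a:.2, b:.5, c:.3} and f = {a:0, b:.2, c:.7},
// sequence_interval("bac", p, f) gives [0.27, 0.30): any number inside it
// (preferably a short dyadic one) identifies the message, given its length.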

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:

.49 ∈ [.2, .7)      →  b
.49 ∈ [.3, .55)     →  b   (the sub-interval of b inside [.2,.7))
.49 ∈ [.475, .55)   →  c   (the sub-interval of c inside [.3,.55))

The message is bbc.

Representing a real number
Binary fractional representation:
.75 = .11
1/3 = .01 01 …  (repeating)
11/16 = .1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
code     min       max       interval
.11      .110…     .111…     [.75, 1.0)
.101     .1010…    .1011…    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most
1 + ⌈log2 (1/s)⌉ = 1 + ⌈log2 ∏_i (1/p_i)⌉
≤ 2 + ∑_{j=1..n} log2 (1/p_j)
= 2 + ∑_{k=1..|S|} n p_k log2 (1/p_k)
= 2 + n H0   bits

nH0 + 0.02 n bits in practice, because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts       (String = ACCBACCACBA B,   k = 2)

Order 0 — Context Empty:  A = 4, B = 2, C = 5, $ = 3
Order 1 — Context A:   C = 3, $ = 1
          Context B:   A = 2, $ = 1
          Context C:   A = 1, B = 2, C = 2, $ = 3
Order 2 — Context AC:  B = 1, C = 2, $ = 2
          Context BA:  C = 1, $ = 1
          Context CA:  C = 1, $ = 1
          Context CB:  A = 2, $ = 1
          Context CC:  A = 1, B = 1, $ = 2
You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
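A sketch of LZW coding with a string-keyed dictionary instead of the trie-with-ids of the slides; the initial 256 single-character entries follow the usual convention, and the decoder (with its one-step-behind special case) is omitted. Names are illustrative.

#include <map>
#include <string>
#include <vector>

// Emit the dictionary id of the longest match S, then add S + next char to the dictionary.
std::vector<int> lzw_encode(const std::string& text) {
    std::map<std::string, int> dict;
    for (int c = 0; c < 256; ++c) dict[std::string(1, char(c))] = c;   // initial entries
    int next_code = 256;
    std::vector<int> out;
    std::string S;                                     // current longest dictionary match
    for (char c : text) {
        if (dict.count(S + c)) { S += c; continue; }   // keep extending the match
        out.push_back(dict[S]);
        dict[S + c] = next_code++;                     // add Sc (not emitted)
        S = std::string(1, c);
    }
    if (!S.empty()) out.push_back(dict[S]);
    return out;   // "aabaacababacb" -> codes for a, a, b, aa, c, ab, aba, c, b
}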

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: the L → F mapping

[Figure: the sorted rotations of T = mississippi# with the first column F and the last column L highlighted]
How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
[Figure: the F and L columns of the sorted rotations of T = mississippi#]
Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
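A sketch of the backward reconstruction above: LF is obtained by stable counting of L's characters; row 0 is assumed to be the rotation starting with '#' (the smallest character, occurring once), so the text is rebuilt right to left without its final '#'.

#include <string>
#include <vector>

std::string invert_bwt(const std::string& L) {
    int n = L.size();
    std::vector<int> count(256, 0), first(256, 0), seen(256, 0), LF(n);
    for (unsigned char c : L) count[c]++;
    for (int c = 1; c < 256; ++c) first[c] = first[c - 1] + count[c - 1];
    for (int i = 0; i < n; ++i) {                 // LF[i] = position of L[i] in column F
        unsigned char c = L[i];
        LF[i] = first[c] + seen[c]++;
    }
    std::string T(n - 1, ' ');
    int r = 0;                                    // row 0 starts with '#': L[0] precedes '#' in T
    for (int i = n - 2; i >= 0; --i) { T[i] = L[r]; r = LF[r]; }
    return T;   // invert_bwt("ipssm#pissii") == "mississippi"
}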

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n^2 log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in-degree(u) = k ]  ∝  1/k^a,   a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides an efficient, optimal solution




fknown is the “previously encoded text”: compress the concatenation fknown·fnew, emitting output only from fnew onward

zdelta is one of the best implementations
            Emacs size   Emacs time
uncompr     27 MB        ---
gzip         8 MB        35 secs
zdelta      1.5 MB       42 secs
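
zdelta itself is a dedicated tool; as a rough sketch of the same idea (an LZ77 parsing of fnew that may copy from fknown), one can use zlib's preset-dictionary feature, which seeds the compressor window with fknown. This only approximates zdelta (the window, hence the usable part of fknown, is limited to 32KB); the data below is a toy example.

    import zlib

    def delta_encode(f_known: bytes, f_new: bytes) -> bytes:
        co = zlib.compressobj(level=9, zdict=f_known)   # seed the LZ77 window with f_known
        return co.compress(f_new) + co.flush()

    def delta_decode(f_known: bytes, delta: bytes) -> bytes:
        do = zlib.decompressobj(zdict=f_known)
        return do.decompress(delta) + do.flush()

    old = b"the quick brown fox jumps over the lazy dog" * 100
    new = old.replace(b"lazy", b"sleepy")
    d = delta_encode(old, new)
    assert delta_decode(old, d) == new
    print(len(new), len(zlib.compress(new, 9)), len(d))   # sizes: original, plain zlib, delta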

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: example file graph with gzip/zdelta edge weights and its minimum branching]

          space    time
uncompr   30 MB    ---
tgz       20%      linear
THIS      8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n² edge calculations (zdelta executions)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F, thus saving
zdelta executions. Nonetheless, this still takes n² time.
          space     time
uncompr   260 MB    ---
tgz       12%       2 mins
THIS      8%        16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
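
A minimal sketch of the block-matching step, assuming the client has already sent, for each block of f_old, a pair (weak hash, strong hash). Here the "weak" hash is a toy additive checksum rather than rsync's real 4-byte rolling checksum, and MD5 stands in for the strong check.

    import hashlib

    def weak(block: bytes) -> int:
        return sum(block) & 0xFFFFFFFF            # stand-in for the rolling checksum

    def encode(f_new: bytes, client_hashes: dict, B: int):
        # client_hashes: weak -> (block index in f_old, md5 hex digest)
        out, i, lit = [], 0, bytearray()
        while i < len(f_new):
            blk = f_new[i:i + B]
            hit = client_hashes.get(weak(blk))
            if hit and hit[1] == hashlib.md5(blk).hexdigest():
                if lit:
                    out.append(("LIT", bytes(lit))); lit = bytearray()
                out.append(("COPY", hit[0]))       # "copy your block #idx"
                i += B
            else:
                lit.append(f_new[i]); i += 1       # no match: slide the window by one byte
        if lit:
            out.append(("LIT", bytes(lit)))
        return out

    f_old = b"abcdefgh" * 10
    hashes = {weak(f_old[j:j+8]): (j // 8, hashlib.md5(f_old[j:j+8]).hexdigest())
              for j in range(0, len(f_old), 8)}
    print(encode(f_old[:40] + b"XYZ" + f_old[40:], hashes, 8))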

Rsync: some experiments

          gcc size   emacs size
total     27288      27326
gzip       7563       8577
zdelta      227       1431
rsync       964       4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends the hashes (unlike rsync, where the client sends them), and the client checks them.
The server can then deploy the common fref to compress the new ftar (rsync compresses it on its own).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes fail to find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes fail to find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
#

0

s

i

1
ssi

ppi#

4

#

1

p

i

3

2

1

ppi#

6

pi#

7

8
5

T# = mississippi#
2 4 6 8 10

si

ppi#

i#

ppi#

11

mississippi#

12

2

1 10

9

4

3

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
5

Q(N2) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

Improvements: O(p + log2 N) [Manber-Myers, ’90]; O(p + log |S|) via Suffix Trays [Cole et al., ’06]
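
A minimal sketch of the indirect binary search on a suffix array (Python; the Θ(N² log N) construction by explicit sorting is just for illustration, not the efficient constructions cited above).

    def suffix_array(t: str):
        return sorted(range(len(t)), key=lambda i: t[i:])    # naive construction

    def search(t: str, sa, p: str):
        lo, hi = 0, len(sa)
        while lo < hi:                                       # O(p) chars compared per step
            mid = (lo + hi) // 2
            if t[sa[mid]:sa[mid] + len(p)] < p: lo = mid + 1
            else: hi = mid
        occ = []
        while lo < len(sa) and t[sa[lo]:sa[lo] + len(p)] == p:
            occ.append(sa[lo]); lo += 1
        return occ                                           # starting positions (0-based)

    t = "mississippi#"
    print(sorted(search(t, suffix_array(t), "si")))          # [3, 6] = positions 4, 7 of the slide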

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does it exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does it exist a substring of length ≥ L occurring ≥ C times ?
• Search for Lcp[i,i+C-2] whose entries are ≥ L


Slide 75

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C · p · f/(1+f)
This is at least 10^4 · f/(1+f).
If we fetch B ≈ 4KB in time C, and the algorithm uses all of them:
(1/B) · (p · f/(1+f) · C) ≈ 30 · f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
      4K    8K    16K   32K    128K   256K   512K   1M
n³    22s   3m    26m   3.5h   28h    --     --     --
n²    0     0     0     1s     26s    106s   7m     28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm
  sum = 0; max = -1;
  for i = 1, ..., n:
    if (sum + A[i] ≤ 0) sum = 0;
    else { sum += A[i]; max = MAX{max, sum}; }

Note:
• the running sum is < 0 right before OPT starts;
• the running sum stays > 0 within OPT.
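
The same scan, written out as a small runnable sketch (it returns only the best sum; the window's endpoints can be tracked alongside):

    def max_subarray_sum(A):
        best = run = 0
        for x in A:
            run = run + x if run + x > 0 else 0   # reset when the running sum would drop to <= 0
            best = max(best, run)
        return best

    print(max_subarray_sum([2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]))   # 12 = 6+1-2+4+3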

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort ⇒ Θ(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 10^9 random I/Os = 10^9 × 5ms ≈ 2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02   m = (i+j)/2;             // Divide
03   Merge-Sort(A,i,m);       // Conquer
04   Merge-Sort(A,m+1,j);
05   Merge(A,i,m,j)           // Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n = 10^9 tuples ⇒ a few GBs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Θ(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
[Figure: merge-sort recursion tree over a sample array, with the memory size M marked]

How do we deploy the disk/memory features?
N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X ≤ M/B runs  ⇒  log_{M/B}(N/M) passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
[Figure: X = M/B sorted runs, each with a current page Bf1..BfX and a pointer p1..pX, plus an output buffer Bfo with pointer po. Repeatedly emit min(Bf1[p1], Bf2[p2], ..., BfX[pX]) into Bfo; fetch the next page of a run when its pointer pi reaches B; flush Bfo to the merged output run when full; stop at EOF.]

Cost of Multi-way Merge-Sort

Number of passes = log_{M/B} #runs ≤ log_{M/B}(N/M)
Optimal cost = Θ((N/B) · log_{M/B}(N/M)) I/Os

In practice:
M/B ≈ 1000  ⇒  #passes = log_{M/B}(N/M) ≈ 1
One multiway merge  ⇒  2 passes = a few minutes
(Tuning depends on disk features)
- A large fan-out (M/B) decreases #passes
- Compression would decrease the cost of a pass!

Can compression help?

Goal: enlarge M and reduce N
- #passes = O(log_{M/B}(N/M))
- Cost of a pass = O(N/B)
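
A minimal sketch of the merge phase, assuming the runs are already sorted: in Python, heapq.merge plays the role of the X-way comparison among the current run heads; paging in B-item blocks is left implicit.

    import heapq

    def multiway_merge(runs):
        # runs: iterables over already-sorted items (think: one per disk-resident run)
        return list(heapq.merge(*runs))

    print(multiway_merge([[1, 5, 9], [2, 2, 7], [3, 8]]))   # [1, 2, 2, 3, 5, 7, 8, 9]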

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm
  Use a pair of variables <X, C>, initialized as X = A[1], C = 1
  For each subsequent item s of the stream:
    if (X == s) then C++
    else { C--; if (C == 0) { X = s; C = 1; } }
  Return X;

Proof
(The algorithm can fail only if no item occurs > N/2 times.)
If X ≠ y were returned, then every one of y's occurrences would have a distinct "negative" mate, so the mates would number ≥ #occ(y). As a result the stream would contain ≥ 2·#occ(y) > N items, a contradiction.
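
The same one-pair-of-variables scan as a runnable sketch (the answer is guaranteed only when some item really occurs more than N/2 times):

    def majority_candidate(stream):
        it = iter(stream)
        X, C = next(it), 1
        for s in it:
            if X == s:
                C += 1
            else:
                C -= 1
                if C == 0:
                    X, C = s, 1
        return X

    print(majority_candidate("bacccdcbaaaccbccc"))   # 'c'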

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 · 10^9 chars  ⇒  size ≈ 6GB
n = 10^6 documents
TotT = 10^9 term occurrences (avg term length is 6 chars)
t = 5 · 10^5 distinct terms

What kind of data structure should we build to support word-based searches?

Solution 1: Term-Doc matrix
n = 1 million docs, t = 500K terms; entry is 1 if the play contains the word, 0 otherwise:

            Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony             1                1             0          0        0        1
Brutus             1                1             0          1        0        0
Caesar             1                1             0          1        1        1
Calpurnia          0                1             0          0        0        0
Cleopatra          1                0             0          0        0        0
mercy              1                0             1          1        1        1
worser             1                0             1          1        1        0

Space is 500GB !

Solution 2: Inverted index
Brutus    → 2 4 8 16 32 64 128
Calpurnia → 13 16
Caesar    → 1 2 3 5 8 13 21 34

We can still do better: about 30÷50% of the original text.

1. Each posting typically uses about 12 bytes
2. We have 10^9 total term occurrences ⇒ at least 12GB of space
3. Compressing the 6GB of documents gets ≈1.5GB of data
A better index, but it is still >10 times the (compressed) text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string is (losslessly) compressed, some other must expand.
Take all messages of length n. Is it possible to compress ALL of them into fewer bits?
NO: there are 2^n of them, but the shorter compressed messages available are only

Σ_{i=1,…,n−1} 2^i = 2^n − 2

We need to talk about stochastic sources.

Entropy (Shannon, 1948)
For a set of symbols S with probabilities p(s), the self-information of s is:

i(s) = log2 (1/p(s)) = −log2 p(s)

Lower probability ⇒ higher information.

Entropy is the weighted average of i(s):

H(S) = Σ_{s∈S} p(s) · log2 (1/p(s))   bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword lengths L[s], the average length is defined as

La(C) = Σ_{s∈S} p(s) · L[s]

We say that a prefix code C is optimal if, for all prefix codes C’, La(C) ≤ La(C’).

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal uniquely decodable code, there exists a prefix code with the same codeword lengths, and thus the same optimal average length. And vice versa…

Theorem (golden rule). If C is an optimal prefix code for a source with probabilities {p1, …, pn}, then pi < pj ⇒ L[si] ≥ L[sj].

Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any uniquely decodable code C, we have H(S) ≤ La(C).

Theorem (upper bound, Shannon). For any probability distribution, there exists a prefix code C such that La(C) ≤ H(S) + 1.

The Shannon code takes ⌈log2 (1/p(s))⌉ bits per symbol.

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
[Huffman tree: a(.1) and b(.2) merge into (.3); (.3) and c(.2) merge into (.5); (.5) and d(.5) merge into (1)]
a=000, b=001, c=01, d=1
There are 2^(n−1) “equivalent” Huffman trees

What about ties (and thus, tree depth) ?
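
A minimal heap-based sketch of Huffman code construction for the running example; tie-breaking, and hence the exact 0/1 labels and tree depth, may differ from the figure, and only the codeword lengths are guaranteed.

    import heapq
    from itertools import count

    def huffman_code(probs):
        tie = count()                                    # breaks ties among equal probabilities
        heap = [(p, next(tie), {s: ""}) for s, p in probs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            p1, _, c1 = heapq.heappop(heap)              # the two least-probable subtrees...
            p2, _, c2 = heapq.heappop(heap)
            merged = {s: "0" + w for s, w in c1.items()}
            merged.update({s: "1" + w for s, w in c2.items()})
            heapq.heappush(heap, (p1 + p2, next(tie), merged))   # ...are merged
        return heap[0][2]

    print(huffman_code({"a": .1, "b": .2, "c": .2, "d": .5}))
    # codeword lengths: a,b -> 3 bits, c -> 2 bits, d -> 1 bit (as in a=000, b=001, c=01, d=1)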

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
[Figure: the Huffman tree of the running example]
Examples:  abc…  →  000 001 01 … = 00000101…        101001…  →  d c b …

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. Its self-information is

−log2(.999) ≈ .00144 bits

If we were to send 1000 such symbols we might hope to use 1000 · .00144 ≈ 1.44 bits.

Using Huffman, we take at least 1 bit per symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m:

H(s) = Σ_{i=1}^{m} 2^{m−i} · s[i]

Example: P = 0101  ⇒  H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5
s = s’ if and only if H(s) = H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr−1):

H(T_r) = 2·H(T_{r−1}) − 2^m·T[r−1] + T[r+m−1]

T = 10110101, m = 4
T1 = 1011, T2 = 0110
H(T1) = H(1011) = 11
H(T2) = 2·11 − 2⁴·1 + 0 = 22 − 16 + 0 = 6 = H(0110)

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example: P = 101111, q = 7
H(P) = 47, Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally, bit by bit, starting from 1:
(1·2 + 0) mod 7 = 2
(2·2 + 1) mod 7 = 5
(5·2 + 1) mod 7 = 4
(4·2 + 1) mod 7 = 2
(2·2 + 1) mod 7 = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(Tr−1), using
2^m (mod q) = 2·(2^{m−1} (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
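
A minimal sketch of the fingerprinting scan over a binary text (the prime q is fixed here instead of being drawn at random ≤ I, and candidate matches are verified, so it behaves like the deterministic variant):

    def karp_rabin(T: str, P: str, q: int = 2_147_483_647):
        n, m = len(T), len(P)
        if m > n: return []
        pow_m = pow(2, m, q)                       # 2^m mod q, used to drop the leftmost bit
        hp = ht = 0
        for i in range(m):                         # fingerprints of P and of T_1
            hp = (2 * hp + int(P[i])) % q
            ht = (2 * ht + int(T[i])) % q
        occ = []
        for r in range(n - m + 1):
            if hp == ht and T[r:r + m] == P:       # verify to rule out false matches
                occ.append(r)
            if r + m < n:                          # Hq(T_{r+1}) from Hq(T_r)
                ht = (2 * ht - pow_m * int(T[r]) + int(T[r + m])) % q
        return occ

    print(karp_rabin("10110101", "0101"))          # [4] (0-based; position 5 of the slide)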

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

[The only 1-entries of M are M(1,5), M(2,6) and M(3,7): column 7 has a 1 in its last row, i.e. an occurrence of P ends at position 7 of T.]
How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit bit-parallelism to compute the j-th column of M from the (j−1)-th one.
We define the m-length binary vector U(x) for each character x in the alphabet: U(x) is set to 1 at the positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros.
For j > 0, the j-th column is obtained as

M(j) = BitShift(M(j−1)) & U(T[j])

For i > 1, entry M(i,j) = 1 iff
(1) the first i−1 characters of P match the i−1 characters of T ending at character j−1  ⇔  M(i−1,j−1) = 1
(2) P[i] = T[j]  ⇔  the i-th bit of U(T[j]) = 1
BitShift moves bit M(i−1,j−1) into the i-th position; AND-ing with the i-th bit of U(T[j]) establishes whether both conditions hold.
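
A minimal sketch in Python (an unbounded integer stands in for the machine word; bit i of a column is stored as bit i of the integer, and the update is exactly M(j) = BitShift(M(j−1)) & U(T[j])):

    def shift_and(T: str, P: str):
        m = len(P)
        U = {}
        for i, ch in enumerate(P):               # U[ch]: bit i set iff P[i] == ch
            U[ch] = U.get(ch, 0) | (1 << i)
        M, occ = 0, []
        for j, ch in enumerate(T):
            M = ((M << 1) | 1) & U.get(ch, 0)    # BitShift(M) & U(T[j])
            if (M >> (m - 1)) & 1:               # last bit set: an occurrence ends at j
                occ.append(j - m + 2)            # report the 1-based starting position
        return occ

    print(shift_and("xabxabaaca", "abaac"))      # [5]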

Worked examples for j = 1, 2, 3 and 9 on T = xabxabaaca, P = abaac: each column M(j) is obtained as BitShift(M(j−1)) & U(T[j]); at j = 9 the last bit of the column becomes 1, i.e. an occurrence of P ends at position 9 of T.

Shift-And method: Complexity








If m ≤ w, any column and any vector U() fit in a memory word ⇒ each step requires O(1) time.
If m > w, any column and any vector U() can be divided into ⌈m/w⌉ memory words ⇒ each step requires O(m/w) time.
Overall O(n(1 + m/w) + m) time.
Thus it is very fast when the pattern length is close to the word size, which is very often the case in practice (recall that w = 64 bits in modern architectures).

Some simple extensions


We want to allow the pattern to contain special symbols, like the character class [a-f].
Example: P = [a-b]baac
U(a) = (1,0,1,1,0)ᵀ      U(b) = (1,1,0,0,0)ᵀ      U(c) = (0,0,0,0,1)ᵀ
What about ‘?’ and ‘[^…]’ (negation)?

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

- S is the concatenation of the patterns in P
- R is a bitmap of length m: R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of the Shift-And method searching for S:
- For any symbol c, U’(c) = U(c) AND R, so U’(c)[i] = 1 iff S[i] = c and it is the first symbol of a pattern
- For any step j: compute M(j), then M(j) OR U’(T[j]). Why? To set to 1 the first bit of each pattern that starts with T[j]
- Check if there are occurrences ending at j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Mˡ to be an m × n binary matrix such that:
Mˡ(i,j) = 1 iff there are at most l mismatches between the first i characters of P and the i characters of T ending at position j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

BitShift(Mˡ(j−1)) & U(T[j])

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

BitShift(Mˡ⁻¹(j−1))

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Mˡ(j), we observe that there is a match iff

Mˡ(j) = [ BitShift(Mˡ(j−1)) & U(T[j]) ]  |  BitShift(Mˡ⁻¹(j−1))

Example (T = xabxabaaca, P = abaad, k = 1): computing M⁰ and M¹ column by column, M¹(5,9) = 1, i.e. P occurs with at most one mismatch ending at position 9 of T.
How much do we pay?





The running time is O(kn(1 + m/w)).
Again, the method is practically efficient for small m.
Only O(k) columns of M are needed at any given time, hence the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding: write (Length − 1) zeros, followed by the binary representation of x (which starts with 1), where x > 0 and Length = ⌊log2 x⌋ + 1.

E.g., 9 is represented as <000, 1001>.

The γ-code for x takes 2⌊log2 x⌋ + 1 bits (i.e., a factor of 2 from optimal).

It is optimal for Pr(x) = 1/(2x²) and i.i.d. integers, and it is a prefix-free encoding…


Exercise: given the following sequence of γ-coded integers, reconstruct the original sequence:

0001000001100110000011101100111

Answer: 8, 6, 3, 59, 7
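
A minimal γ-coder/decoder sketch over bit strings (real implementations of course pack the bits into machine words):

    def gamma_encode(x: int) -> str:
        assert x > 0
        b = bin(x)[2:]                  # binary representation, starts with 1
        return "0" * (len(b) - 1) + b   # (Length-1) zeros, then the binary of x

    def gamma_decode(bits: str):
        out, i = [], 0
        while i < len(bits):
            z = 0
            while bits[i] == "0":       # count the leading zeros
                z += 1; i += 1
            out.append(int(bits[i:i + z + 1], 2))
            i += z + 1
        return out

    print(gamma_encode(9))                                     # 0001001
    print(gamma_decode("0001000001100110000011101100111"))     # [8, 6, 3, 59, 7]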

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).

Recall that |γ(i)| ≤ 2·log i + 1.
How good is this approach w.r.t. Huffman? Compression ratio ≤ 2·H0(S) + 1.

Key fact: 1 ≥ Σ_{i=1,…,x} pi ≥ x · px  ⇒  x ≤ 1/px

The cost of the encoding is therefore (recall i ≤ 1/pi):

Σ_{i=1,…,|S|} pi · |γ(i)|  ≤  Σ_{i=1,…,|S|} pi · [2·log(1/pi) + 1]  =  2·H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + …

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used s = c = 128; in general s + c = 256 (we are playing with 8 bits).
Thus s items are encoded with 1 byte, s·c with 2 bytes, s·c² with 3 bytes, …

An example: 5000 distinct words.
ETDC encodes 128 + 128² = 16,512 words on at most 2 bytes.
A (230,26)-dense code encodes 230 + 230·26 = 6,210 words on at most 2 bytes, hence more words on 1 byte, which pays off if the distribution is skewed…

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1ⁿ 2ⁿ 3ⁿ … nⁿ  ⇒  Huff = O(n² log n) bits, MTF = O(n log n) + n² bits

Not much worse than Huffman… but it may be far better.
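
A minimal MTF sketch (the naive list makes each step cost O(|S|); the search-tree/hash-table organization of a later slide brings it down to O(log |S|)):

    def mtf_encode(text, alphabet):
        L = list(alphabet)                  # the MTF list, e.g. ['a','b',...]
        out = []
        for s in text:
            i = L.index(s)                  # position of s in L (0-based here)
            out.append(i)
            L.pop(i); L.insert(0, s)        # move s to the front
        return out

    print(mtf_encode("mississippi", "imps"))   # runs of equal chars become runs of 0s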

MTF: how good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1.
Put S in front of the sequence and consider the cost of encoding:

O(|S| log |S|) + Σ_{x=1,…,|S|} Σ_{i=2,…,n_x} γ(p_{x,i} − p_{x,i−1})

By Jensen's inequality:

≤ O(|S| log |S|) + Σ_{x=1,…,|S|} n_x · [2·log(N/n_x) + 1]
= O(|S| log |S|) + N·[2·H0(X) + 1]

⇒ La[mtf] ≤ 2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




- Exploit spatial locality; it is a dynamic code (there is a memory)
- X = 1ⁿ 2ⁿ 3ⁿ … nⁿ  ⇒  Huff(X) = Θ(n² log n) > Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

E.g., with p(a) = .2, p(b) = .5, p(c) = .3 and

f(i) = Σ_{j<i} p(j),   i.e.   f(a) = .0, f(b) = .2, f(c) = .7

the ranges are a = [.0,.2), b = [.2,.7), c = [.7,1.0).
The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7)).

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
Start from [0, 1):
b  →  [0.2, 0.7)
a  →  [0.2, 0.3)
c  →  [0.27, 0.3)

The final sequence interval is [.27, .3)

Arithmetic Coding
To code a sequence of symbols c1…cn with probabilities p[c], use:

l0 = 0,  s0 = 1
li = l(i−1) + s(i−1) · f[ci]
si = s(i−1) · p[ci]

where f[c] is the cumulative probability up to symbol c (excluded).
The final interval size is sn = Π_{i=1,…,n} p[ci].

The interval for a message sequence will be called the sequence interval.
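
A small sketch that computes the sequence interval with the recurrences above (floating point is used only for illustration; real coders use the integer/scaling version discussed below):

    def sequence_interval(msg, p):
        syms = sorted(p)                        # fix an order to define f (alphabetical, as in the slides)
        f, acc = {}, 0.0
        for c in syms:
            f[c] = acc; acc += p[c]             # f[c] = cumulative probability before c
        l, s = 0.0, 1.0
        for c in msg:
            l, s = l + s * f[c], s * p[c]
        return l, l + s                         # the sequence interval [l, l+s)

    print(sequence_interval("bac", {"a": .2, "b": .5, "c": .3}))   # ~ (0.27, 0.30)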

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:
.49 ∈ [0.2, 0.7)    →  b   (rescaled symbol intervals: a = [.2,.3), b = [.3,.55), c = [.55,.7))
.49 ∈ [0.3, 0.55)   →  b   (rescaled: a = [.3,.35), b = [.35,.475), c = [.475,.55))
.49 ∈ [0.475, 0.55) →  c

The message is bbc.

Representing a real number
Binary fractional representation:
.75 = .11        1/3 = .010101…        11/16 = .1011

Algorithm
1. x = 2·x
2. if x < 1, output 0
3. else x = x − 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
E.g.  [0,.33) → .01      [.33,.66) → .1      [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions:

number   min       max       interval
.11      .110…     .111…     [.75, 1.0)
.101     .1010…    .1011…    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 (top half): output 1 followed by m 0s; m = 0; the message interval is expanded by 2.
If u < R/2 (bottom half): output 0 followed by m 1s; m = 0; the message interval is expanded by 2.
If l ≥ R/4 and u < 3R/4 (middle half): increment m; the message interval is expanded by 2.
In all other cases, just continue…

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts  —  String = ACCBACCACBAB,  k = 2

Context ∅:   A = 4, B = 2, C = 5, $ = 3
Context A:   C = 3, $ = 1
Context B:   A = 2, $ = 1
Context C:   A = 1, B = 2, C = 2, $ = 3
Context AC:  B = 1, C = 2, $ = 2
Context BA:  C = 1, $ = 1
Context CA:  C = 1, $ = 1
Context CB:  A = 2, $ = 1
Context CC:  A = 1, B = 1, $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
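
A minimal LZW encoder sketch; here the dictionary is seeded with the distinct characters of the input rather than the 256 ASCII entries, and the SSc corner case only matters for the decoder.

    def lzw_encode(text: str):
        dic = {c: i for i, c in enumerate(sorted(set(text)))}   # seed dictionary
        out, S = [], ""
        for c in text:
            if S + c in dic:
                S += c                          # keep extending the current match
            else:
                out.append(dic[S])              # emit the id of the longest match S
                dic[S + c] = len(dic)           # add Sc to the dictionary (c itself is NOT emitted)
                S = c
        out.append(dic[S])
        return out

    print(lzw_encode("aabaacababacb"))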

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
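
A runnable sketch of both directions: the forward transform naively sorts all rotations (Θ(n² log n), fine for a demo; real tools build it from the suffix array, as shown next), and the inversion follows the two properties and the InvertBWT loop above.

    def bwt(t: str) -> str:
        rot = sorted(t[i:] + t[:i] for i in range(len(t)))      # the sorted rotation matrix
        return "".join(r[-1] for r in rot)                      # its last column L

    def ibwt(L: str) -> str:
        n = len(L)
        order = sorted(range(n), key=lambda i: (L[i], i))       # stable: equal chars keep their order
        LF = [0] * n
        for f_pos, i in enumerate(order):                       # LF maps L's chars onto F's rows
            LF[i] = f_pos
        out, r = [], 0                                          # row 0 starts with the end-marker '#'
        for _ in range(n):
            out.append(L[r])                                    # L[r] precedes F[r] in T
            r = LF[r]
        t = "".join(reversed(out))                              # a rotation of T starting at '#'
        return t[1:] + t[0]

    print(bwt("mississippi#"))      # ipssm#pissii
    print(ibwt("ipssm#pissii"))     # mississippi#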

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in - degree ( u )  k ] 

a  2.1

1

k

a

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides and efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
          Emacs size   Emacs time
uncompr   27Mb         ---
gzip      8Mb          35 secs
zdelta    1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f ∈ F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.
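A hypothetical sketch of this reduction, assuming networkx is available and that gzip_size / zdelta_size are user-provided callbacks returning the corresponding compressed sizes:

import networkx as nx

def choose_references(files, gzip_size, zdelta_size):
    G = nx.DiGraph()
    for v in files:
        G.add_edge("dummy", v, weight=gzip_size(v))            # compress v on its own
        for u in files:
            if u != v:
                G.add_edge(u, v, weight=zdelta_size(u, v))     # compress v deploying u
    branching = nx.minimum_spanning_arborescence(G)            # Edmonds' algorithm
    return {v: u for (u, v) in branching.edges()}              # chosen reference for each file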

[Figure: example weighted graph G_F — node 0 is the dummy node, edge weights are the gzip/zdelta sizes; the min branching picks the cheapest reference for each file.]

          space   time
uncompr   30Mb    ---
tgz       20%     linear
THIS      8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly: n^2 edge calculations (zdelta executions)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G'_F, thus saving
zdelta executions. Nonetheless, still n^2 time
          space   time
uncompr   260Mb   ---
tgz       12%     2 mins
THIS      8%      16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
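For concreteness, a sketch of an rsync-style weak rolling checksum (the (a,b) pair idea; constants and packing here are illustrative, not the exact rsync implementation):

M = 1 << 16

def weak_checksum(block: bytes):
    a = sum(block) % M
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % M
    return a, b                                   # rsync packs them as a + (b << 16)

def roll(a, b, out_byte, in_byte, blocksize):
    # slide the block window one byte to the right, in O(1) time
    a = (a - out_byte + in_byte) % M
    b = (b - blocksize * out_byte + a) % M
    return a, b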

Rsync: some experiments

          gcc size   emacs size
total     27288      27326
gzip      7563       8577
zdelta    227        1431
rsync     964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends the hashes (in rsync it is the client that sends them); the client checks them.
Server deploys the common f_ref to compress the new f_tar (rsync compresses just f_tar on its own).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level ≤ k hashes fail to find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level ≤ k hashes fail to find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Figure: suffix tree of T# = mississippi# — edges are labeled with substrings (e.g. "ssi", "ppi#", "i", "p", "mississippi#"), and each leaf stores the starting position (1..12) of its suffix.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position in SUF(T) is the lexicographic position of P.

Storing SUF(T) explicitly takes Θ(N^2) space.
SA      SUF(T)
12      #
11      i#
 8      ippi#
 5      issippi#
 2      ississippi#
 1      mississippi#
10      pi#
 9      ppi#
 7      sippi#
 4      sissippi#
 6      ssippi#
 3      ssissippi#

T = mississippi#        (each SA entry is a suffix pointer)

P = si  ⇒  its occurrences form a contiguous range of SA

Suffix Array space:
• SA: Θ(N log2 N) bits
• Text T: N chars
⇒ In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 random accesses per step.

[Figure: one binary-search step on SA of T = mississippi# with P = si — P is larger than the probed suffix, so the search continues in the right half.]

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison.

[Figure: the next binary-search step — now P is smaller than the probed suffix, so the search continues in the left half.]

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
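A toy Python sketch of the suffix array and of the binary search described above (it materializes the suffixes only for clarity, so it is quadratic in space; a real implementation compares characters on the fly):

from bisect import bisect_left

def suffix_array(T):
    return sorted(range(len(T)), key=lambda i: T[i:])          # O(N^2 log N) toy construction

def occurrences(T, SA, P):
    suffixes = [T[i:] for i in SA]                             # for clarity only
    lo = bisect_left(suffixes, P)                              # suffixes prefixed by P are contiguous
    hi = lo
    while hi < len(SA) and suffixes[hi].startswith(P):
        hi += 1
    return sorted(SA[lo:hi])                                   # starting positions, 0-based

T = "mississippi#"
SA = suffix_array(T)
print(occurrences(T, SA, "si"))                                # -> [3, 6]  (positions 4 and 7, 1-based)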

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp = 0 0 1 4 0 0 1 0 2 1 3
SA  = 12 11 8 5 2 1 10 9 7 4 6 3

T = mississippi#

e.g. the entry 4 is lcp(issippi#, ississippi#) = |issi| = 4
• How long is the common prefix between T[i,...] and T[j,...] ?
  → It is the min of the subarray Lcp[h,k-1], where SA[h]=i and SA[k]=j.
• Is there a repeated substring of length ≥ L ?
  → Search for an entry Lcp[i] ≥ L.
• Is there a substring of length ≥ L occurring ≥ C times ?
  → Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.
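A small sketch computing the Lcp array directly from adjacent suffixes (quadratic in the worst case; Kasai's algorithm would do it in linear time) and using it for the longest repeated substring:

def lcp_array(T, SA):
    def lcp(i, j):                                  # longest common prefix of two suffixes
        k = 0
        while i + k < len(T) and j + k < len(T) and T[i + k] == T[j + k]:
            k += 1
        return k
    return [lcp(SA[i], SA[i + 1]) for i in range(len(SA) - 1)]

T = "mississippi#"
SA = sorted(range(len(T)), key=lambda i: T[i:])
Lcp = lcp_array(T, SA)
i = max(range(len(Lcp)), key=Lcp.__getitem__)       # the largest Lcp entry ...
print(T[SA[i]:SA[i] + Lcp[i]])                      # ... spells the longest repeated substring: 'issi'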


Slide 76

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)

This is at least 10^4 * f/(1+f)

If we fetch B ≈ 4Kb in time C, and the algorithm uses all of them:
(1/B) * (p * f/(1+f) * C)  ≈  30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
        4K    8K    16K   32K    128K   256K   512K   1M
n^3     22s   3m    26m   3.5h   28h    --     --     --
n^2     0     0     0     1s     26s    106s   7m     28m

An optimal solution
We assume every subsum ≠ 0

[Figure: the portion of A preceding the optimum has sum < 0, the optimum window has sum > 0.]

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm
  sum = 0; max = -1;
  for i = 1,...,n do
      if (sum + A[i] ≤ 0) sum = 0;
      else { sum += A[i]; max = MAX{max, sum}; }

Note:
• Sum < 0 when OPT starts;
• Sum > 0 within OPT
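The same scan as a runnable Python sketch (it mirrors the pseudocode above and assumes, as the slide does, that at least one element is positive):

def max_subarray(A):
    run, best = 0, -1                  # as above: max starts at -1
    for x in A:
        if run + x <= 0:
            run = 0                    # the optimum never starts after a non-positive prefix
        else:
            run += x
            best = max(best, run)
    return best

print(max_subarray([2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]))    # -> 12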

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 10^9 random I/Os = 10^9 * 5ms ≈ 2 months

Binary Merge-Sort

Merge-Sort(A,i,j)
  if (i < j) then
      m = (i+j)/2;            // Divide
      Merge-Sort(A,i,m);      // Conquer
      Merge-Sort(A,m+1,j);
      Merge(A,i,m,j)          // Combine

Cost of Mergesort on large data




Take Wikipedia in Italian and compute the word frequencies:
  n = 10^9 tuples ⇒ a few Gbs
  Typical disk (Seagate Cheetah 150Gb): seek time ≈ 5ms

Analysis of mergesort on disk:
  It is an indirect sort: Θ(n log2 n) random I/Os
  [5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
[Figure: merge-sort recursion tree on a sample array — runs are sorted in internal memory and then merged pairwise, level by level.]

How do we deploy the disk/mem features ?

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X ≤ M/B runs  ⇒  log_{M/B} N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF
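A compact Python sketch of the two passes of the multi-way merge-sort, with heapq.merge playing the role of the X-way merger (in a real setting the runs and the input stream live on disk):

import heapq
from itertools import islice

def external_sort(stream, M):
    runs = []
    while True:                                   # Pass 1: N/M runs, each sorted in memory
        chunk = list(islice(stream, M))
        if not chunk:
            break
        runs.append(sorted(chunk))                # in practice each run is written to disk
    return heapq.merge(*runs)                     # Pass 2: one X-way merge (assumes X <= M/B)

print(list(external_sort(iter([5, 1, 13, 19, 7, 9, 3, 4, 8, 15, 6, 2]), M=4)))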

Cost of Multi-way Merge-Sort


Number of passes = log_{M/B} #runs ≤ log_{M/B} N/M

Optimal cost = Θ((N/B) log_{M/B} N/M) I/Os

In practice
  M/B ≈ 1000  ⇒  #passes = log_{M/B} N/M ≈ 1
  One multiway merge ⇒ 2 passes = a few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Can compression help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm
  Use a pair of variables <X,C>
  For each item s of the stream:
      if (X == s) then C++
      else if (C == 0) { X = s; C = 1; }
      else C--;
  Return X;
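A runnable Python version of the scan (this is the classical majority-vote technique):

def majority_candidate(stream):
    X, C = None, 0                      # two memory words
    for s in stream:
        if C == 0:
            X, C = s, 1
        elif X == s:
            C += 1
        else:
            C -= 1
    return X                            # equals the mode if the mode occurs > N/2 times

print(majority_candidate("bacccdcbaaaccbccc"))     # -> 'c'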

Proof
There are problems only if #occ(y) ≤ N/2.
If the final X ≠ y, then every one of y's occurrences has a "negative" mate (a distinct item that cancelled it), hence the mates are ≥ #occ(y).
As a result we would get 2 * #occ(y) ≤ N, contradicting #occ(y) > N/2.

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 10^9 chars, size = 6Gb
n = 10^6 documents
TotT = 10^9 term occurrences (avg term length is 6 chars)
t = 5 * 10^5 distinct terms

What kind of data structure should we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million documents (plays),  t = 500K terms

             Antony and   Julius   The       Hamlet   Othello   Macbeth
             Cleopatra    Caesar   Tempest
Antony           1           1        0         0        0         1
Brutus           1           1        0         1        0         0
Caesar           1           1        0         1        1         1
Calpurnia        0           1        0         0        0         0
Cleopatra        1           0        0         0        0         0
mercy            1           0        1         1        1         1
worser           1           0        1         1        1         0

Entry = 1 if the play contains the word, 0 otherwise

Space is 500Gb !

Solution 2: Inverted index
[Figure: postings lists — each term (Brutus, Calpurnia, Caesar) points to the sorted list of docIDs of the documents containing it.]

We can still do better: i.e. 30-50% of the original text

1. Typically about 12 bytes per posting
2. We have 10^9 total term occurrences ⇒ at least 12Gb of space
3. Compressing the 6Gb of documents gets 1.5Gb of data
A better index, but it is still >10 times the (compressed) text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2^n but we have fewer compressed messages:

   Σ_{i=1}^{n-1} 2^i  =  2^n − 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i(s) = log2 (1/p(s)) = − log2 p(s)

Lower probability ⇒ higher information

Entropy is the weighted average of i(s):

H(S) = Σ_{s∈S} p(s) · log2 (1/p(s))   bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La(C) = Σ_{s∈S} p(s) · L[s]

We say that a prefix code C is optimal if for
all prefix codes C', La(C) ≤ La(C')

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  ⇒  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H(S) ≤ La(C)
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La(C) ≤ H(S) + 1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
[Figure: Huffman tree construction — merge a(.1)+b(.2) into (.3), then (.3)+c(.2) into (.5), then (.5)+d(.5) into (1).]

a=000, b=001, c=01, d=1
There are 2^(n-1) "equivalent" Huffman trees

What about ties (and thus, tree depth) ?
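A small heap-based Python sketch that builds one optimal prefix code for the running example (ties may be broken differently, yielding one of the "equivalent" trees, but the codeword lengths are the same):

import heapq
from itertools import count

def huffman_codes(probs):
    tick = count()                                      # tie-breaker for equal probabilities
    heap = [(p, next(tick), s) for s, p in probs.items()]
    heapq.heapify(heap)
    code = {s: "" for s in probs}
    while len(heap) > 1:
        p1, _, t1 = heapq.heappop(heap)                 # the two least-probable trees...
        p2, _, t2 = heapq.heappop(heap)
        for s in leaves(t1): code[s] = "0" + code[s]    # ...get merged, prepending one bit
        for s in leaves(t2): code[s] = "1" + code[s]
        heapq.heappush(heap, (p1 + p2, next(tick), (t1, t2)))
    return code

def leaves(t):
    return [t] if not isinstance(t, tuple) else leaves(t[0]) + leaves(t[1])

print(huffman_codes({"a": .1, "b": .2, "c": .2, "d": .5}))
# -> {'a': '110', 'b': '111', 'c': '10', 'd': '0'}: same codeword lengths as a=000, b=001, c=01, d=1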

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
  ⇒ 1 extra bit per macro-symbol = 1/k extra bits per symbol
  ⇒ Larger model to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:
  The model takes |S|^k * (k * log |S|) + h^2 bits  (where h might be |S|)
  It is H0(S^L) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H(s) = Σ_{i=1}^{m} 2^{m−i} · s[i]

P = 0101
H(P) = 2^3·0 + 2^2·1 + 2^1·0 + 2^0·1 = 5

s = s'  if and only if  H(s) = H(s')

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H(T_r) = 2·H(T_{r−1}) − 2^m · T(r−1) + T(r+m−1)

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H(T_1) = H(1011) = 11
H(T_2) = H(0110) = 2·11 − 2^4·1 + 0 = 22 − 16 + 0 = 6 = H(0110)

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally:
  1·2 + 0 ≡ 2   (mod 7)
  2·2 + 1 ≡ 5   (mod 7)
  5·2 + 1 ≡ 4   (mod 7)
  4·2 + 1 ≡ 2   (mod 7)
  2·2 + 1 ≡ 5   (mod 7)
  ⇒ Hq(P) = 5

We can still compute Hq(Tr) from Hq(Tr-1):
  2^m (mod q) = 2·(2^{m-1} (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
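A sketch of the fingerprint scan over a binary text, with the verification step that makes it deterministic (here q is a fixed prime for simplicity; the algorithm above picks it at random):

def karp_rabin(T, P, q=2**31 - 1):
    # T, P are sequences of bits (0/1)
    m, n = len(P), len(T)
    if m > n:
        return []
    top = pow(2, m - 1, q)                         # 2^(m-1) mod q, used to drop the leaving bit
    hp = ht = 0
    for i in range(m):
        hp = (2 * hp + P[i]) % q
        ht = (2 * ht + T[i]) % q
    occ = []
    for r in range(n - m + 1):
        if hp == ht and list(T[r:r + m]) == list(P):      # verify -> no false matches reported
            occ.append(r)
        if r + m < n:
            ht = ((ht - T[r] * top) * 2 + T[r + m]) % q   # Hq of the next window from the current one
    return occ

print(karp_rabin([1, 0, 1, 1, 0, 1, 0, 1], [0, 1, 0, 1]))   # -> [4], i.e. position 5 (1-based)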

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

[Figure: the m×n matrix M for T = california and P = for — the only 1-entries are M(1,5), M(2,6), M(3,7), marking the match of "for" that ends at position 7.]

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit bit-parallelism to
compute the j-th column of M from the (j−1)-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
U(a) = [1,0,1,1,0]     U(b) = [0,1,0,0,0]     U(c) = [0,0,0,0,1]

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M(j) = BitShift(M(j−1)) & U(T[j])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i−1,j−1) into the i-th position;
AND-ing it with the i-th bit of U(T[j]) establishes whether both conditions hold
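A bit-parallel Python sketch of the whole method, where one integer plays the role of a column of M:

def shift_and(T, P):
    m = len(P)
    U = {}
    for i, c in enumerate(P):                      # U(c): bit i set iff P[i] == c
        U[c] = U.get(c, 0) | (1 << i)
    M, occ = 0, []
    for j, c in enumerate(T):
        M = ((M << 1) | 1) & U.get(c, 0)           # M(j) = BitShift(M(j-1)) & U(T[j])
        if (M >> (m - 1)) & 1:
            occ.append(j - m + 1)                  # 0-based starting position of an occurrence
    return occ

print(shift_and("xabxabaaca", "abaac"))            # -> [4]: the match ending at column j = 9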

An example
[Figures: the columns M(1), M(2), M(3), ..., M(9) computed via M(j) = BitShift(M(j−1)) & U(T[j]) for T = xabxabaaca and P = abaac; at j = 9 the 5-th bit of M(9) is set, so an occurrence of P ends at position 9.]

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M ( j - 1))  U (T [ j ])
l

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M

l -1

( j - 1))

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example
[Figure: the matrices M0 and M1 for T = xabxabaaca and P = abaad, filled column by column with the recurrence above; row 5 of M1 is set at column 9, i.e. P occurs there with ≤ 1 mismatch.]

How much do we pay?





The running time is O(k·n·(1+m/w)).
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding
  γ(x) = 0^(Length−1) followed by x in binary,  where x > 0 and Length = ⌊log2 x⌋ + 1
  e.g., 9 is represented as <000, 1001>

  The γ-code for x takes 2⌊log2 x⌋ + 1 bits   (i.e. a factor of 2 from optimal)



Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8

6

3

59

7
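A small Python sketch of γ-encoding and decoding; the decoder reproduces the exercise above:

def gamma_encode(x):                               # x > 0
    b = bin(x)[2:]                                 # x in binary, Length = len(b)
    return "0" * (len(b) - 1) + b                  # (Length-1) zeros, then x in binary

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i + z] == "0":                  # count the zeros: Length - 1
            z += 1
        out.append(int(bits[i + z:i + 2 * z + 1], 2))
        i += 2 * z + 1
    return out

print(gamma_encode(9))                                            # -> '0001001'  i.e. <000,1001>
print(gamma_decode("0001000001100110000011101100111"))           # -> [8, 6, 3, 59, 7]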

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |γ(i)| ≤ 2 * log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
  1 ≥ Σ_{i=1,...,x} pi ≥ x * px  ⇒  x ≤ 1/px

How good is it ?
Encode the integers via γ-coding:  |γ(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):

  Σ_{i=1,...,|S|} pi · |γ(i)|  ≤  Σ_{i=1,...,|S|} pi · [2 * log(1/pi) + 1]  =  2 * H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ...

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n^2 log n), MTF = O(n log n) + n^2

No much worse than Huffman
...but it may be far better
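A minimal MTF encoder sketch in Python (the initial list order is a parameter):

def mtf_encode(text, alphabet):
    L = list(alphabet)                       # the MTF list; front = most recently seen symbol
    out = []
    for s in text:
        i = L.index(s)
        out.append(i)                        # 1) output the position of s in L
        L.insert(0, L.pop(i))                # 2) move s to the front of L
    return out

print(mtf_encode("mississippi", "imps"))     # -> [1, 1, 3, 0, 1, 1, 0, 1, 3, 0, 1]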

MTF: how good is it ?
Encode the integers via γ-coding:  |γ(i)| ≤ 2 * log i + 1
Put S in front and consider the cost of encoding (n_x = #occurrences of symbol x, p_x^i = position of its i-th occurrence, N = total length):

  O(|S| log |S|)  +  Σ_{x=1,...,|S|}  Σ_{i=2,...,n_x}  γ( p_x^i − p_x^{i−1} )

By Jensen's inequality:

  ≤ O(|S| log |S|)  +  Σ_{x=1,...,|S|}  n_x · [ 2 * log (N/n_x) + 1 ]
  = O(|S| log |S|)  +  N * [ 2 * H0(X) + 1 ]

⇒  La[mtf] ≤ 2 * H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality; it is a dynamic code (there is a memory)
X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = Θ(n^2 log n) > Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.  p(a) = .2, p(b) = .5, p(c) = .3
      f(a) = .0, f(b) = .2, f(c) = .7

  f(i) = Σ_{j=1}^{i−1} p(j)

[Figure: the unit interval [0,1) partitioned as a = [0,.2), b = [.2,.7), c = [.7,1).]

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
[Figure: coding bac — start from [0,1); restrict to b's interval [.2,.7); then to a's sub-interval [.2,.3); then to c's sub-interval [.27,.3).]

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l_0 = 0,   l_i = l_{i−1} + s_{i−1} * f(c_i)
s_0 = 1,   s_i = s_{i−1} * p(c_i)

f(c) is the cumulative prob. up to symbol c (not included)

Final interval size is   s_n = Π_{i=1}^{n} p(c_i)

The interval for a message sequence will be called the
sequence interval
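A Python sketch that computes the sequence interval with the recurrences above; on the example it returns the interval [.27,.3) of the previous slide:

def sequence_interval(msg, p, f):
    l, s = 0.0, 1.0                                  # l_0 = 0, s_0 = 1
    for c in msg:
        l, s = l + s * f[c], s * p[c]                # l_i, s_i as above
    return l, l + s                                  # the sequence interval [l, l+s)

p = {"a": .2, "b": .5, "c": .3}
f = {"a": .0, "b": .2, "c": .7}
print(sequence_interval("bac", p, f))                # -> approx (0.27, 0.30)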

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
[Figure: .49 falls in b's interval [.2,.7); within it, in b's sub-interval [.3,.55); within that, in c's sub-interval [.475,.55).]

The message is bbc.

Representing a real number
Binary fractional representation:
.75 = .11      1/3 = .010101...      11/16 = .1011

Algorithm (to emit the bits of x ∈ [0,1)):
  1. x = 2 * x
  2. if x < 1 output 0
  3. else x = x − 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
  e.g. [0,.33) → .01      [.33,.66) → .1      [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
  code    min      max      code interval
  .11     .110     .111     [.75, 1.0)
  .101    .1010    .1011    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most
  1 + ⌈log (1/s)⌉ = 1 + ⌈log Π_i (1/p_i)⌉
  ≤ 2 + Σ_{i=1,...,n} log (1/p_i)
  = 2 + Σ_{k=1,...,|S|} n·p_k · log (1/p_k)
  = 2 + n·H0   bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
Context
Empty

Counts
A
B
C
$

=
=
=
=

4
2
5
3

Context
A
B
C

Counts
C
$
A
$
A
B
C
$

=
=
=
=
=
=
=
=

3
1
2
1
1
2
2
3

Context
AC

BA
CA
CB
CC

String = ACCBACCACBA B

k=2

Counts
B
C
$
C
$
C
$
A
$
A
B
$

=
=
=
=
=
=
=
=
=
=
=
=

1
2
2
1
1
1
1
2
1
1
1
2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character
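A Python sketch of the parser with a fixed window, reproducing the example above (ties are broken towards the smallest distance; this is illustrative, not gzip's actual match policy):

def lz77_parse(T, W=6):
    i, out = 0, []
    while i < len(T):
        best_d = best_len = 0
        for d in range(1, min(i, W) + 1):            # candidate copy distances within the window
            l = 0
            while i + l < len(T) - 1 and T[i + l] == T[i - d + l]:
                l += 1                               # matches may run past the cursor (overlap)
            if l > best_len:
                best_d, best_len = d, l
        out.append((best_d, best_len, T[i + best_len]))   # <distance, length, next char>
        i += best_len + 1
    return out

print(lz77_parse("aacaacabcabaaac"))
# -> [(0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), (3, 3, 'a'), (1, 2, 'c')]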

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
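A compact Python sketch of both directions — the forward transform via sorted rotations and the inverse via the LF-mapping — assuming T ends with a unique smallest end-marker '#':

def bwt(T):
    n = len(T)
    rows = sorted(range(n), key=lambda i: T[i:] + T[:i])
    return "".join(T[(i - 1) % n] for i in rows)     # last column L of the sorted rotations

def ibwt(L, end="#"):
    n, F = len(L), sorted(L)
    first, seen, LF = {}, {}, []
    for i, c in enumerate(F):
        first.setdefault(c, i)
    for c in L:                                      # LF: the k-th c in L is the k-th c in F
        LF.append(first[c] + seen.get(c, 0))
        seen[c] = seen.get(c, 0) + 1
    r, out = L.index(end), []                        # this row's rotation starts at T[0]
    for _ in range(n):
        out.append(L[r])                             # L[r] precedes F[r] in T: emit T backwards
        r = LF[r]
    return "".join(reversed(out))

T = "mississippi#"
print(bwt(T), ibwt(bwt(T)) == T)                     # -> ipssm#pissii True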

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n^2 log n) time in the worst case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in-degree(u) = k ]  ∝  1/k^a,   a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides and efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
Emacs size

Emacs time

uncompr

27Mb

---

gzip

8Mb

35 secs

zdelta

1.5Mb

42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F
• Useful on a dynamic collection of web pages, back-ups, …
• Apply pairwise zdelta: find for each f ∈ F a good reference
• Reduction to the Min Branching problem on DAGs:
  • Build a weighted graph G_F: nodes = files, edge weights = zdelta-size
  • Insert a dummy node connected to all nodes, whose edge weights are the gzip-coded sizes
  • Compute the min branching = directed spanning tree of min total cost, covering G’s nodes
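A sketch of this reduction, using networkx's Edmonds-based minimum spanning arborescence in place of a hand-rolled min-branching solver; the file names and the gzip/zdelta sizes are made-up values, and in a real pipeline they would come from actually running the compressors.

import networkx as nx

files = ["f1", "f2", "f3"]
gzip_size  = {"f1": 620, "f2": 2000, "f3": 220}            # dummy-node edge weights
delta_size = {("f1", "f2"): 123, ("f1", "f3"): 20,         # zdelta(target given reference)
              ("f2", "f3"): 20,  ("f3", "f2"): 5}

G = nx.DiGraph()
for f in files:                     # dummy node = "compress f by itself with gzip"
    G.add_edge("dummy", f, weight=gzip_size[f])
for (ref, tgt), w in delta_size.items():
    G.add_edge(ref, tgt, weight=w)

branching = nx.minimum_spanning_arborescence(G)            # min-cost covering tree
for ref, tgt in branching.edges():
    print(f"{tgt}: encode against {ref}")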


          space   time
uncompr   30Mb    ---
tgz       20%     linear
THIS      8%      quadratic

Improvement  (what about many-to-one compression of a group of files?)

Problem: Constructing G is very costly: n² edge calculations (zdelta executions)
• We wish to exploit some pruning approach:
  • Collection analysis: cluster the files that appear similar and are thus good candidates for zdelta-compression; build a sparse weighted graph G’_F containing only edges between those pairs of files
  • Assign weights: estimate appropriate edge weights for G’_F, thus saving zdelta executions. Nonetheless, still n² time
          space   time
uncompr   260Mb   ---
tgz       12%     2 mins
THIS      8%      16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
[Figure: the Client holds f_old and sends a request; the Server holds f_new and replies with an update.]
• the client wants to update an out-dated file
• the server has the new file but does not know the old file
• update without sending the entire f_new (using similarity)
• rsync: file synch tool, distributed with Linux

Delta compression is a sort of “local” synch, since the server has both copies of the files.

The rsync algorithm
[Figure: the Client (holding f_old) sends block hashes to the Server (holding f_new); the Server replies with an encoded file that references matching blocks and includes literal bytes for the rest.]

The rsync algorithm  (contd)
• simple, widely used, single roundtrip
• optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
• choice of block size problematic (default: max{700, √n} bytes)
• not good in theory: granularity of changes may disrupt use of blocks
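A sketch of the client-side block signatures and of an rsync-style rolling (weak) checksum; the two 16-bit sums below follow the spirit of rsync's rolling hash, while the exact wire format, block size and strong hash of the real tool may differ.

import hashlib

def weak_checksum(block):
    # two 16-bit sums, cheap to update when the window slides by one byte
    a = sum(block) & 0xFFFF
    b = sum((len(block) - i) * x for i, x in enumerate(block)) & 0xFFFF
    return a, b

def roll(a, b, out_byte, in_byte, blocksize):
    # update (a, b) when the window drops out_byte on the left and gains in_byte on the right
    a = (a - out_byte + in_byte) & 0xFFFF
    b = (b - blocksize * out_byte + a) & 0xFFFF
    return a, b

def block_signatures(f_old, blocksize=700):
    # what the client would send: one (weak, strong) signature per block of f_old
    sigs = {}
    for i in range(0, len(f_old), blocksize):
        block = f_old[i:i + blocksize]
        sigs[weak_checksum(block)] = (i, hashlib.md5(block).digest())
    return sigs

print(len(block_signatures(b"some old file content " * 200)))   # number of signed blocks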

Rsync: some experiments

         gcc size   emacs size
total    27288      27326
gzip      7563       8577
zdelta     227       1431
rsync      964       4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync
• The server sends the hashes (unlike rsync, where the client does) and the client checks them
• The server deploys the common f_ref to compress the new f_tar (rsync just compresses it on its own)

A multi-round protocol
• k blocks of n/k elems
• log(n/k) levels
• If the distance is k, then on each level at most k hashes fail to find a match in the other file.
• The communication complexity is O(k lg n lg(n/k)) bits
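A toy sketch of the multi-round idea: at each level the two sides compare one hash per block, and only the (at most k) mismatching blocks are split again at the next level. Here both files live in the same process, so only the communication pattern is modeled; the block count and hash choice are arbitrary.

import hashlib

def differing_ranges(a, b, k=4, min_block=1):
    ranges = [(0, max(len(a), len(b)))]
    while ranges and (ranges[0][1] - ranges[0][0]) > min_block:
        nxt = []
        for lo, hi in ranges:
            step = max((hi - lo + k - 1) // k, min_block)   # split into (at most) k blocks
            for s in range(lo, hi, step):
                e = min(s + step, hi)
                if hashlib.sha1(a[s:e]).digest() != hashlib.sha1(b[s:e]).digest():
                    nxt.append((s, e))                      # recurse only on mismatches
        ranges = nxt
    return ranges            # byte ranges where the two files differ

old = b"abcdefghijklmnopqrstuvwxyz" * 4
new = bytearray(old); new[40:43] = b"XYZ"
print(differing_ranges(old, bytes(new)))                    # [(40, 41), (41, 42), (42, 43)]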

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located on two machines A and B, determine the difference between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size k of the difference, where k may or may not be known in advance to the parties.

Note:
• set reconciliation is “easier” than file sync [it is record-based]
• Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol
• k blocks of n/k elems
• log(n/k) levels
• If the distance is k, then on each level at most k hashes fail to find a match in the other file.
• The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
• Word-based indexes, where a notion of “word” must be devised!
  » Inverted files, Signature files, Bitmaps.
• Full-text indexes, no constraint on text and queries!
  » Suffix Array, Suffix tree, String B-tree, ...

How do we solve Prefix Search? With a trie, or with an array of string pointers!
What about Substring Search?

Basic notation and facts
Pattern P occurs at position i of T  iff  P is a prefix of the i-th suffix of T (i.e. of T[i,N]).
Occurrences of P in T = all suffixes of T having P as a prefix.
Example: P = si, T = mississippi → occurrences at positions 4 and 7.

SUF(T) = sorted set of suffixes of T

Reduction: from substring search to prefix search.

The Suffix Tree
[Figure: the suffix tree of T# = mississippi#. Edges are labeled with substrings of T# (e.g. i, s, p, si, ssi, ppi#, pi#, i#, mississippi#) and each leaf stores the starting position of the corresponding suffix.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Storing SUF(T) explicitly would take Θ(N²) space.

T = mississippi#

SA   SUF(T)
12   #
11   i#
 8   ippi#
 5   issippi#
 2   ississippi#
 1   mississippi#
10   pi#
 9   ppi#
 7   sippi#
 4   sissippi#
 6   ssippi#
 3   ssissippi#

Suffix Array space:
• SA: Θ(N log2 N) bits
• Text T: N chars
→ In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 memory accesses per step (one to SA, one to the text).
Example: P = si, T = mississippi#. Compare P with the suffix in the middle of the current SA range: if P is larger, recurse on the right half; if P is smaller, recurse on the left half.

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char comparisons
→ overall, O(p log2 N) time
Improvable to O(p + log2 N) [Manber-Myers, ’90] and to O(p + log2 |S|) [Cole et al., ’06], where S is the alphabet.

Locating the occurrences
Every entry in the SA range found for P is an occurrence.
Example: T = mississippi#, P = si → the range contains sippi# and sissippi#, i.e. occ = 2 occurrences, at positions 7 and 4.

Suffix Array search
• O(p + log2 N + occ) time
Related structures:
• Suffix Trays: O(p + log2 |S| + occ)  [Cole et al., ’06]
• String B-tree  [Ferragina-Grossi, ’95]
• Self-adjusting Suffix Arrays  [Ciriani et al., ’02]
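A self-contained sketch of suffix-array search: a naive construction plus the O(p log2 N) indirect binary search described above (positions are 0-based here, while the slides use 1-based positions).

def suffix_array(text):
    # naive O(N^2 log N) construction, for illustration only
    return sorted(range(len(text)), key=lambda i: text[i:])

def sa_range(text, sa, p):
    # return (l, r) such that sa[l:r] lists all positions where p occurs
    m = len(p)
    prefix = lambda k: text[sa[k]: sa[k] + m]   # length-m prefix of the k-th sorted suffix
    lo, hi = 0, len(sa)
    while lo < hi:                              # first suffix with prefix >= p
        mid = (lo + hi) // 2
        if prefix(mid) < p: lo = mid + 1
        else: hi = mid
    left = lo
    hi = len(sa)
    while lo < hi:                              # first suffix with prefix > p
        mid = (lo + hi) // 2
        if prefix(mid) <= p: lo = mid + 1
        else: hi = mid
    return left, lo

T = "mississippi#"
SA = suffix_array(T)
l, r = sa_range(T, SA, "si")
print(sorted(SA[l:r]))   # [3, 6] -> positions 4 and 7 in the slides' 1-based numbering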

Text mining
Lcp[1,N-1] = longest common prefix between suffixes adjacent in SA (e.g., for T = mississippi#, Lcp = 4 between the adjacent suffixes issippi# and ississippi#, whose common prefix is issi).

• How long is the common prefix between T[i,...] and T[j,...]?
  • It is the minimum of the subarray Lcp[h,k-1], where SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L?
  • Search for an entry Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times?
  • Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.
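A sketch of how the Lcp array can be computed in O(N) time from SA (Kasai et al.'s algorithm, one standard choice, not necessarily the one used in the course), together with the "repeated substring of length ≥ L" test above.

def suffix_array(text):
    # naive construction, for illustration only
    return sorted(range(len(text)), key=lambda i: text[i:])

def lcp_array(text, sa):
    # lcp[k] = length of the common prefix of the suffixes in SA positions k and k+1
    n = len(text)
    rank = [0] * n
    for k, pos in enumerate(sa):
        rank[pos] = k
    lcp = [0] * (n - 1)
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and text[i + h] == text[j + h]:
                h += 1
            lcp[rank[i] - 1] = h
            if h:
                h -= 1
        else:
            h = 0
    return lcp

def has_repeat_of_length(text, L):
    sa = suffix_array(text)
    return any(v >= L for v in lcp_array(text, sa))

print(has_repeat_of_length("mississippi#", 4))   # True: "issi" occurs twice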


Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides and efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
Emacs size

Emacs time

uncompr

27Mb

---

gzip

8Mb

35 secs

zdelta

1.5Mb

42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

0

620

2

20

123

2000

20
220

3

1
5

space

time

uncompr

30Mb

---

tgz

20%

linear

THIS

8%

quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F, thus saving
zdelta executions. Nonetheless, still Θ(n²) time
           space    time
uncompr    260Mb    ---
tgz        12%      2 mins
THIS       8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
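A minimal sketch of the rsync idea under the assumptions above (not the real protocol): the client sends one hash per block of f_old; the server scans f_new and emits block references or literals. Python's built-in hash stands in for the rolling-checksum + strong-checksum pair, so false matches are possible in this toy.

```python
BLOCK = 700

def client_signatures(f_old: bytes):
    """One hash per block of f_old, mapped to its block index."""
    return {hash(f_old[i:i + BLOCK]): i // BLOCK
            for i in range(0, len(f_old), BLOCK)}

def server_encode(f_new: bytes, sigs):
    """('B', block_index) when a window of f_new matches a block of f_old, ('L', byte) otherwise.
    A real implementation slides a rolling hash instead of rehashing each window."""
    out, i = [], 0
    while i < len(f_new):
        if len(f_new) - i >= BLOCK and hash(f_new[i:i + BLOCK]) in sigs:
            out.append(('B', sigs[hash(f_new[i:i + BLOCK])])); i += BLOCK
        else:
            out.append(('L', f_new[i])); i += 1
    return out

def client_decode(f_old: bytes, ops) -> bytes:
    res = bytearray()
    for op in ops:
        if op[0] == 'B':
            j = op[1] * BLOCK; res += f_old[j:j + BLOCK]
        else:
            res.append(op[1])
    return bytes(res)
```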

Rsync: some experiments

          gcc size   emacs size
total     27288      27326
gzip      7563       8577
zdelta    227        1431
rsync     964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends hashes (unlike the client in rsync), and the client checks them.
The server deploys the common f_ref to compress the new f_tar (rsync compresses just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes fail to find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes fail to find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Figure: suffix tree of T# = mississippi# — internal edges are labeled with substrings (e.g. i, ssi, si, ppi#, pi#, i#, #, mississippi#) and each leaf stores the starting position (1..12) of its suffix]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
5

Θ(N²) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
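A small Python sketch of the indirected binary search just described; the suffix array is built naively here (fine for slide-sized examples, not for large texts).

```python
def suffix_array(T: str):
    # O(N^2 log N) toy construction; real indexes use linear-time algorithms
    return sorted(range(len(T)), key=lambda i: T[i:])

def occurrences(T: str, SA, P: str):
    """All positions of P in T via two binary searches on SA (O(p) chars per comparison)."""
    def bound(strict):
        lo, hi = 0, len(SA)
        while lo < hi:
            mid = (lo + hi) // 2
            pref = T[SA[mid]:SA[mid] + len(P)]
            if pref < P or (strict and pref == P):
                lo = mid + 1
            else:
                hi = mid
        return lo
    l, r = bound(False), bound(True)
    return sorted(SA[l:r])

T = "mississippi#"
SA = suffix_array(T)
print(occurrences(T, SA, "si"))   # -> [3, 6] (0-based; positions 4 and 7 in the slides)
```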

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
• Search for an entry Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
• Search for a window Lcp[i,i+C-2] whose entries are all ≥ L
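A sketch of the Lcp construction (Kasai's linear-time algorithm) and of the "longest repeated substring" query above.

```python
def lcp_array(T, SA):
    """Lcp[k] = longest common prefix of the suffixes SA[k] and SA[k+1], in O(N) time."""
    n = len(T)
    rank = [0] * n
    for k, i in enumerate(SA):
        rank[i] = k
    lcp = [0] * (n - 1)
    h = 0
    for i in range(n):                      # positions in text order
        if rank[i] + 1 < n:
            j = SA[rank[i] + 1]
            while i + h < n and j + h < n and T[i + h] == T[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h:
                h -= 1                      # h drops by at most 1 per step
        else:
            h = 0
    return lcp

T = "mississippi#"
SA = sorted(range(len(T)), key=lambda i: T[i:])
lcp = lcp_array(T, SA)
k = max(range(len(lcp)), key=lambda i: lcp[i])
print(T[SA[k]:SA[k] + lcp[k]])              # -> 'issi', the longest repeated substring
```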



Algoritmi per IR

Prologo

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
   C · p · f/(1+f)

This is at least 10^4 · f/(1+f)

If we fetch B ≈ 4Kb in time C, and the algorithm uses all of it:

   (1/B) · ( p · f/(1+f) · C )  ≈  30 · f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
        4K    8K    16K   32K    128K   256K   512K   1M
n^3     22s   3m    26m   3.5h   28h    --     --     --
n^2     0     0     0     1s     26s    106s   7m     28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum = 0; max = -1;
For i = 1, ..., n do
   if (sum + A[i] ≤ 0) then sum = 0;
   else { sum += A[i]; max = MAX{max, sum}; }

Note:
• Sum < 0 just before OPT starts;
• Sum > 0 within OPT
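The same one-pass scan as runnable Python (a Kadane-style variant following the slide's assumption that no prefix sum is exactly 0), run on the example array.

```python
def max_subarray(A):
    """One pass: drop the running prefix as soon as it becomes non-positive."""
    best, cur, start, best_range = float('-inf'), 0, 0, None
    for i, x in enumerate(A):
        if cur + x <= 0:
            cur, start = 0, i + 1          # restart just after position i
        else:
            cur += x
            if cur > best:
                best, best_range = cur, (start, i)
    return best, best_range

A = [2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]
print(max_subarray(A))                     # -> (12, (2, 6)): the window 6 1 -2 4 3
```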

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort  ⇒  Θ(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 10^9 random I/Os = 10^9 × 5ms ≈ 5·10^6 secs  ≈  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02   m = (i+j)/2;            // Divide
03   Merge-Sort(A,i,m);      // Conquer
04   Merge-Sort(A,m+1,j);
05   Merge(A,i,m,j)          // Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n = 10^9 tuples  ⇒  a few GBs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Θ(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
[Figure: merge-sort recursion tree on an example input — log2 N levels, each level merging pairs of already sorted runs]

How do we deploy the disk/memory features (M, B) ?

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.


 Pass i: merge X ≤ M/B runs  ⇒  log_{M/B} (N/M) passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Keep in memory the current page Bf_i (with cursor p_i) of each of the X = M/B runs, plus one output page Bf_o:
 repeatedly move min( Bf_1[p_1], Bf_2[p_2], …, Bf_X[p_X] ) into Bf_o;
 fetch the next page of run i when p_i reaches B;
 flush Bf_o to the merged output run when it is full, until EOF
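A sketch of the in-memory part of one merge pass, using a heap over the heads of the X runs; here the runs are plain Python lists, while on disk each would be read block-by-block into its buffer.

```python
import heapq

def multiway_merge(runs):
    """Merge X sorted runs into one sorted output stream."""
    iters = [iter(r) for r in runs]
    heap = []
    for idx, it in enumerate(iters):
        first = next(it, None)
        if first is not None:
            heap.append((first, idx))
    heapq.heapify(heap)
    while heap:
        val, idx = heapq.heappop(heap)        # overall minimum among the run heads
        yield val
        nxt = next(iters[idx], None)
        if nxt is not None:
            heapq.heappush(heap, (nxt, idx))  # refill from the run we consumed

runs = [[1, 2, 5, 10], [2, 7, 8, 13, 19], [3, 4, 11, 12, 15, 17], [6, 9]]
print(list(multiway_merge(runs)))
```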

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Can compression help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables X, C (initially X = first item, C = 1)
For each subsequent item s of the stream:

   if (X == s) then C++
   else { C--; if (C == 0) { X = s; C = 1; } }

Return X;

Proof
If the majority item y occurs > N/2 times and the algorithm ended with X ≠ y, then every occurrence of y would have been cancelled by a distinct “negative” mate (a decrement caused by a different item). Hence the mates would number ≥ #occ(y), so N ≥ 2 · #occ(y), contradicting #occ(y) > N/2.
Problems arise if the most frequent item occurs ≤ N/2 times: the returned X need not be the mode.
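The stream algorithm as runnable Python (the standard Boyer–Moore majority-vote formulation, equivalent to the slide's pair of variables), run on the example stream.

```python
def majority_candidate(stream):
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1            # adopt s as the new candidate
        elif X == s:
            C += 1                 # one more supporter
        else:
            C -= 1                 # cancel one occurrence of X with s
    return X                       # the majority item, if one occurs > N/2 times

A = "b a c c c d c b a a a c c b c c c".split()
print(majority_candidate(A))       # -> 'c'
```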

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 · 10^9 chars  ⇒  size = 6Gb
n = 10^6 documents
TotT = 10^9 term occurrences (avg term length is 6 chars)
t = 5 · 10^5 distinct terms

What kind of data structure should we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million documents, t = 500K terms — entry is 1 if the play contains the word, 0 otherwise

             Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony              1                1             0          0       0        1
Brutus              1                1             0          1       0        0
Caesar              1                1             0          1       1        1
Calpurnia           0                1             0          0       0        0
Cleopatra           1                0             0          0       0        0
mercy               1                0             1          1       1        1
worser              1                0             1          1       1        0

Space is 500Gb !

Solution 2: Inverted index
Brutus     →  2  4  8  16  32  64  128
Calpurnia  →  1  2  3  5  8  13  21  34
Caesar     →  13  16

We can still do better: i.e. 30-50% of the original text

1. Typically about 12 bytes per posting
2. We have 10^9 total terms  ⇒  at least 12Gb of space
3. Compressing the 6Gb of documents gets 1.5Gb of data
Better index, but yet it is >10 times the text !!!!
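A minimal in-memory sketch of building such postings lists (sorted docIDs per term); a real index adds gap/γ compression and disk-based construction.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: dict docID -> text. Returns term -> sorted list of docIDs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {t: sorted(ids) for t, ids in index.items()}

docs = {1: "Brutus killed Caesar", 2: "Caesar married Calpurnia", 4: "Brutus fought"}
idx = build_inverted_index(docs)
print(idx["brutus"])     # -> [1, 4]
print(idx["caesar"])     # -> [1, 2]
```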

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (losslessly) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM into fewer bits ?
NO: they are 2^n, but the strings shorter than n bits are only

   Σ_{i=1}^{n-1} 2^i = 2^n − 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the self-information of s is:

   i(s) = log2 ( 1 / p(s) ) = − log2 p(s)

Lower probability  ⇒  higher information

Entropy is the weighted average of i(s):

   H(S) = Σ_{s∈S} p(s) · log2 ( 1 / p(s) )    bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
[Figure: the corresponding binary trie, with the symbols a, b, c, d at the leaves]

Average Length
For a code C with codeword lengths L[s], the average length is defined as

   La(C) = Σ_{s∈S} p(s) · L[s]

We say that a prefix code C is optimal if for all prefix codes C’,  La(C) ≤ La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal uniquely decodable code, there exists a prefix
code with the same codeword lengths, and thus the same optimal average length. And vice versa…

Theorem (golden rule). If C is an optimal prefix code for a source with probabilities {p1, …, pn},
then  pi < pj  ⇒  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any uniquely
decodable code C, we have

   H(S) ≤ La(C)

Theorem (upper bound, Shannon). For any probability distribution, there exists a prefix
code C such that

   La(C) ≤ H(S) + 1

Shannon code: symbol s takes ⌈ log2 (1/p(s)) ⌉ bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
[Huffman construction: merge a(.1)+b(.2) → (.3); merge (.3)+c(.2) → (.5); merge (.5)+d(.5) → (1)]
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?
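A compact sketch of Huffman's construction with a heap; ties are broken arbitrarily, which is exactly the "equivalent trees" issue mentioned above, so only the codeword lengths are guaranteed to match the example.

```python
import heapq
from itertools import count

def huffman_codes(probs):
    """probs: dict symbol -> probability. Returns symbol -> codeword (string of 0/1)."""
    tiebreak = count()
    heap = [(p, next(tiebreak), s) for s, p in probs.items()]      # leaves
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)       # two least probable trees
        p2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(tiebreak), (left, right)))
    codes = {}
    def assign(node, code):
        if isinstance(node, tuple):
            assign(node[0], code + "0"); assign(node[1], code + "1")
        else:
            codes[node] = code or "0"
    assign(heap[0][2], "")
    return codes

print(huffman_codes({'a': .1, 'b': .2, 'c': .2, 'd': .5}))
# -> e.g. {'d': '0', 'c': '10', 'a': '110', 'b': '111'}: same lengths (1,2,3,3)
#    as the example a=000, b=001, c=01, d=1
```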

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
   − log2 (.999) ≈ .00144

If we were to send 1000 such symbols we
might hope to use 1000 × .00144 ≈ 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
[Figure: word-based tagged Huffman over T = “bzip or not bzip” — the 128-ary tree assigns byte-aligned codewords (7 bits + 1 tag bit per byte) to the words and separators of T, and C(T) is the concatenation of those codewords]

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

   H(s) = Σ_{i=1}^{m} 2^{m−i} · s[i]

P = 0101
H(P) = 2^3·0 + 2^2·1 + 2^1·0 + 2^0·1 = 5

s = s’ if and only if H(s) = H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1):

   H(T_r) = 2 · H(T_{r−1}) − 2^m · T(r−1) + T(r+m−1)

T = 10110101
T1 = 1011,  T2 = 0110

H(T1) = H(1011) = 11
H(T2) = H(0110) = 2·11 − 2^4·1 + 0 = 22 − 16 + 0 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!

   1·2 + 0 (mod 7) = 2
   2·2 + 1 (mod 7) = 5
   5·2 + 1 (mod 7) = 4
   4·2 + 1 (mod 7) = 2
   2·2 + 1 (mod 7) = 5  =  Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1):
   2^m (mod q) = 2 · ( 2^{m−1} (mod q) ) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
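A sketch of the fingerprint scan for a binary text, with a fixed prime q (a real implementation picks q at random, as described above, and only then verifies probable matches).

```python
def karp_rabin(T: str, P: str, q: int = 2**31 - 1):
    """Report all r with T[r:r+m] == P, comparing fingerprints mod q first."""
    n, m = len(T), len(P)
    if m > n:
        return []
    pow_m = pow(2, m, q)                      # 2^m mod q, used by the rolling update
    hp = ht = 0
    for i in range(m):
        hp = (2 * hp + int(P[i])) % q
        ht = (2 * ht + int(T[i])) % q
    hits = []
    for r in range(n - m + 1):
        if ht == hp and T[r:r + m] == P:      # verify to rule out false matches
            hits.append(r)
        if r + m < n:                         # slide the window: drop T[r], add T[r+m]
            ht = (2 * ht - pow_m * int(T[r]) + int(T[r + m])) % q
    return hits

print(karp_rabin("10110101", "0101"))         # -> [4] (0-based; position 5 in the slides)
```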

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

[Figure: the m×n matrix M for T = california and P = for — M(i,j) = 1 exactly when P[1..i] = T[j−i+1..j]; e.g. M(1,5) = 1, M(2,6) = 1 and M(3,7) = 1, since “for” ends at position 7 of T]

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit bit-parallelism to compute the j-th column of M from the (j−1)-th one.
We define the m-length binary vector U(x) for each character x in the alphabet:
U(x) is set to 1 in the positions where character x appears in P.

Example: P = abaac

   U(a) = (1,0,1,1,0)ᵀ     U(b) = (0,1,0,0,0)ᵀ     U(c) = (0,0,0,0,1)ᵀ

How to construct M



Initialize column 0 of M to all zeros
For j > 0, the j-th column is obtained by

   M(j) = BitShift( M(j−1) )  &  U( T[j] )

For i > 1, entry M(i,j) = 1 iff
(1) the first i−1 characters of P match the i−1 characters of T ending at j−1   ⇔   M(i−1, j−1) = 1
(2) P[i] = T[j]   ⇔   the i-th bit of U(T[j]) = 1

BitShift moves bit M(i−1, j−1) into the i-th position; AND-ing with the i-th bit of U(T[j]) establishes whether both conditions hold
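A sketch of the Shift-And scan with a Python integer as the bit vector (bit i−1 of the word plays the role of row i of M); U and the column update follow the formula above.

```python
def shift_and(T: str, P: str):
    """Report all 0-based positions where P occurs in T."""
    m = len(P)
    U = {}
    for i, c in enumerate(P):                 # U[c] has bit i set iff P[i] == c
        U[c] = U.get(c, 0) | (1 << i)
    M = 0
    hits = []
    for j, c in enumerate(T):
        # column update: shift down, set the first bit, AND with U(T[j])
        M = ((M << 1) | 1) & U.get(c, 0)
        if M & (1 << (m - 1)):                # last row set: a full match ends at j
            hits.append(j - m + 1)
    return hits

print(shift_and("xabxabaaca", "abaac"))       # -> [4]
```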

An example
P = abaac,  T = xabxabaaca

[Figure: the columns M(1), M(2), …, M(9), one per text position, computed as M(j) = BitShift(M(j−1)) & U(T[j]); e.g. M(1) = BitShift(M(0)) & U(x) = (0,0,0,0,0)ᵀ, M(2) = BitShift(M(1)) & U(a) = (1,0,0,0,0)ᵀ, M(3) = BitShift(M(2)) & U(b) = (0,1,0,0,0)ᵀ, and at j = 9 the last bit of M(9) is set, reporting the occurrence of P ending there]
Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M ( j - 1))  U (T [ j ])
l

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M

l -1

( j - 1))

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

   M^l(j) = [ BitShift( M^l(j−1) ) & U(T[j]) ]   OR   BitShift( M^{l−1}(j−1) )

Example M1
P = abaad,  T = xabxabaaca

[Figure: the matrices M0 (exact matching) and M1 (at most one mismatch) for this pattern and text, column by column; e.g. the last row of M1 has a 1 in column 9, since abaad matches T[5..9] = abaac with one mismatch]

How much do we pay?

 The running time is O( k·n·(1 + m/w) ).
 Again, the method is practically efficient for small m.
 Only O(k) columns of M are needed at any given time, hence the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Delection: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding

   γ(x) = 0^{Length−1} followed by x in binary,   where x > 0 and Length = ⌊log2 x⌋ + 1

 e.g., 9 is represented as ⟨000, 1001⟩.

 The γ-code for x takes 2⌊log2 x⌋ + 1 bits   (i.e. a factor of 2 from optimal)

 Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…

 Given the following sequence of γ-coded integers, reconstruct the original sequence:

   0001000001100110000011101100111   →   8  6  3  59  7
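A sketch of γ-encoding and decoding; running the decoder on the exercise string reproduces 8 6 3 59 7.

```python
def gamma_encode(x: int) -> str:
    assert x > 0
    b = bin(x)[2:]                      # x in binary
    return "0" * (len(b) - 1) + b       # (Length-1) zeros, then the binary digits

def gamma_decode(bits: str):
    out, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":           # count the leading zeros = Length-1
            zeros += 1; i += 1
        out.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return out

print(gamma_encode(9))                                    # -> '0001001'
print(gamma_decode("0001000001100110000011101100111"))    # -> [8, 6, 3, 59, 7]
```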

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How much good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
1 ≥Si=1,...,x pi ≥ x * px  x ≤ 1/px

How good is it ?
Encode the integers via γ-coding:   |γ(i)| ≤ 2·log2 i + 1
The cost of the encoding is (recall i ≤ 1/pi):

   Σ_{i=1..|S|} pi · |γ(i)|   ≤   Σ_{i=1..|S|} pi · [ 2·log2 (1/pi) + 1 ]   =   2·H0(X) + 1

Not much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploits temporal locality, and it is dynamic

 X = 1^n 2^n 3^n … n^n   ⇒   Huff = O(n² log n),  MTF = O(n log n) + n²

Not much worse than Huffman
...but it may be far better
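A sketch of the MTF transform over a symbol list (positions are 0-based here); pairing its output with γ-coding gives a simple compressor.

```python
def mtf_encode(seq, alphabet):
    L = list(alphabet)
    out = []
    for s in seq:
        pos = L.index(s)                 # position of s in the current list
        out.append(pos)
        L.pop(pos); L.insert(0, s)       # move s to the front
    return out

def mtf_decode(codes, alphabet):
    L = list(alphabet)
    out = []
    for pos in codes:
        s = L[pos]
        out.append(s)
        L.pop(pos); L.insert(0, s)
    return out

print(mtf_encode("aaabbbccc", "abc"))    # -> [0, 0, 0, 1, 0, 0, 2, 0, 0]
```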

MTF: how good is it ?
Encode the integers via γ-coding:   |γ(i)| ≤ 2·log2 i + 1
Put S at the front and consider the cost of encoding:

   O(|S| log |S|) + Σ_{x=1..|S|} Σ_{i=2..nx} | γ( p_x^i − p_x^{i−1} ) |

By Jensen’s inequality:

   ≤ O(|S| log |S|) + Σ_{x=1..|S|} nx · [ 2·log2 (N/nx) + 1 ]
   = O(|S| log |S|) + N · [ 2·H0(X) + 1 ]

   ⇒   La[mtf] ≤ 2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploits spatial locality, and it is a dynamic code. There is a memory.

 X = 1^n 2^n 3^n … n^n   ⇒   Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.   p(a) = .2,  p(b) = .5,  p(c) = .3

   f(i) = Σ_{j=1}^{i−1} p(j)      ⇒      f(a) = .0,  f(b) = .2,  f(c) = .7

[Figure: the unit interval [0,1) partitioned into a = [0,.2), b = [.2,.7), c = [.7,1)]

The interval for a particular symbol will be called
the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
   start with [0,1);   b → [.2,.7);   then a → [.2,.3);   then c → [.27,.3)

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c1 … cn with probabilities p[c], use the following:

   l0 = 0,   s0 = 1
   li = l_{i−1} + s_{i−1} · f[ci]
   si = s_{i−1} · p[ci]

f[c] is the cumulative probability up to symbol c (not included).
The final interval size is   sn = Π_{i=1..n} p[ci]

The interval for a message sequence will be called the sequence interval
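A sketch of the interval computation with the recurrences above (floats for clarity; a real coder uses the integer/scaling version discussed later).

```python
def sequence_interval(msg, p):
    """Return (l, s): the sequence interval [l, l+s) of msg under probabilities p."""
    f, acc = {}, 0.0
    for c in sorted(p):               # cumulative f[c] in a fixed symbol order
        f[c] = acc
        acc += p[c]
    l, s = 0.0, 1.0
    for c in msg:
        l = l + s * f[c]
        s = s * p[c]
    return l, s

p = {'a': .2, 'b': .5, 'c': .3}
l, s = sequence_interval("bac", p)
print(l, l + s)                       # ≈ 0.27 0.3, the [.27,.3) interval of the example
```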

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:

   .49 ∈ [.2,.7)  →  b;   within it, .49 ∈ [.3,.55)  →  b;   within that, .49 ∈ [.475,.55)  →  c

The message is bbc.

Representing a real number
Binary fractional representation:
   .75 = .11      1/3 = .010101…      11/16 = .1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
          min       max       interval
 .11      .110…     .111…     [.75, 1.0)
 .101     .1010…    .1011…    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

   1 + ⌈ log2 (1/s) ⌉ = 1 + ⌈ log2 Π_i (1/pi) ⌉
                      ≤ 2 + Σ_{i=1..n} log2 (1/pi)
                      = 2 + Σ_{k=1..|S|} n·pk · log2 (1/pk)
                      = 2 + n·H0   bits

In practice ≈ nH0 + 0.02·n bits, because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts — String = ACCBACCACBA B,   k = 2

Order-0 (empty context):   A = 4   B = 2   C = 5   $ = 3

Order-1:   A: C=3 $=1      B: A=2 $=1      C: A=1 B=2 C=2 $=3

Order-2:   AC: B=1 C=2 $=2    BA: C=1 $=1    CA: C=1 $=1    CB: A=2 $=1    CC: A=1 B=1 $=2
You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
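A sketch of the decoder's copy loop, including the overlapping case just discussed (the distance d may be smaller than the copy length), run on the windowed example above.

```python
def lz77_decode(triples):
    """triples: list of (d, length, c) phrases as produced by the encoder."""
    out = []
    for d, length, c in triples:
        start = len(out) - d
        for i in range(length):          # byte-by-byte copy handles overlap (length > d)
            out.append(out[start + i])
        out.append(c)
    return "".join(out)

print(lz77_decode([(0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), (3, 3, 'a'), (1, 2, 'c')]))
# -> 'aacaacabcabaaac', the text of the windowed example
```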

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
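A sketch of the LZW encoder (the decoder, with the SSc special case, is the one-step-behind construction shown in the decoding example below); the dictionary here starts from the characters of the input rather than the 256 ASCII entries.

```python
def lzw_encode(text: str):
    """Emit dictionary indices; the dictionary is initialized with single characters."""
    dictionary = {c: i for i, c in enumerate(sorted(set(text)))}
    next_code = len(dictionary)
    S, out = "", []
    for c in text:
        if S + c in dictionary:
            S += c                        # extend the current match
        else:
            out.append(dictionary[S])     # output the longest match found
            dictionary[S + c] = next_code
            next_code += 1
            S = c
    if S:
        out.append(dictionary[S])
    return out

print(lzw_encode("aabaacababacb"))
# -> [0, 0, 1, 3, 2, 4, 8, 2, 1]: a, a, b, aa, c, ab, aba, c, b
```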

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]
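A sketch of both directions: L[i] = T[SA[i]−1] for the forward transform, and the LF-mapping walk for the inverse (the same idea as the InvertBWT pseudocode above).

```python
def bwt(T: str) -> str:
    """T must end with a unique smallest character, e.g. '#'."""
    SA = sorted(range(len(T)), key=lambda i: T[i:])
    return "".join(T[i - 1] for i in SA)          # T[-1] handles SA[i] == 0

def inverse_bwt(L: str) -> str:
    # LF[r] = row of F holding the same character occurrence as L[r]
    order = sorted(range(len(L)), key=lambda r: (L[r], r))   # stable sort = F rows
    LF = [0] * len(L)
    for f_row, l_row in enumerate(order):
        LF[l_row] = f_row
    out, r = [], 0                     # row 0 is the rotation starting with '#'
    for _ in range(len(L)):
        out.append(L[r])
        r = LF[r]
    out.reverse()                      # the walk yields T read backwards, sentinel last
    return "".join(out[1:] + out[:1])  # rotate '#' back to the end

L = bwt("mississippi#")
print(L)                   # -> 'ipssm#pissii'
print(inverse_bwt(L))      # -> 'mississippi#'
```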

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
   Pr[ in-degree(u) = k ]  ≈  1 / k^α,    with α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides and efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
Emacs size

Emacs time

uncompr

27Mb

---

gzip

8Mb

35 secs

zdelta

1.5Mb

42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

0

620

2

20

123

2000

20
220

3

1
5

space

time

uncompr

30Mb

---

tgz

20%

linear

THIS

8%

quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
space

time

uncompr

260Mb

---

tgz

12%

2 mins

THIS

8%

16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm

[Figure: the client sends the block hashes of f_old to the server; the server, which holds f_new, replies with the encoded file]

The rsync algorithm  (contd)

 Simple, widely used, single roundtrip
 Optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
 Choice of the block size is problematic (default: max{700, √n} bytes)
 Not good in theory: the granularity of changes may disrupt the use of blocks
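A simplified sketch of the block-matching idea behind rsync (hypothetical message format and hash choices, not the real protocol): the client ships per-block weak and strong hashes of f_old, and the server encodes f_new as references to those blocks plus literal bytes.

import hashlib

B = 8   # toy block size (rsync's default is max{700, sqrt(n)} bytes)

def weak_hash(block: bytes) -> int:
    # stand-in for rsync's rolling checksum (recomputed from scratch here)
    a = sum(block) % 65536
    b = sum((len(block) - i) * c for i, c in enumerate(block)) % 65536
    return (b << 16) | a

def client_hashes(f_old: bytes):
    return {weak_hash(f_old[j:j + B]):
            (j // B, hashlib.md5(f_old[j:j + B]).digest())
            for j in range(0, len(f_old) - B + 1, B)}

def server_encode(hashes, f_new: bytes):
    out, i, lit = [], 0, bytearray()
    while i < len(f_new):
        block = f_new[i:i + B]
        hit = hashes.get(weak_hash(block)) if len(block) == B else None
        if hit and hit[1] == hashlib.md5(block).digest():
            if lit:
                out.append(("lit", bytes(lit))); lit = bytearray()
            out.append(("blk", hit[0]))     # reuse a block the client has
            i += B
        else:
            lit.append(f_new[i]); i += 1    # no block matches: send a literal
    if lit:
        out.append(("lit", bytes(lit)))
    return out

def client_decode(f_old: bytes, ops) -> bytes:
    out = bytearray()
    for kind, val in ops:
        out += f_old[val * B:(val + 1) * B] if kind == "blk" else val
    return bytes(out)

f_old = b"the quick brown fox jumps over the lazy dog!"
f_new = b"the quick red fox jumps over the lazy dog!!"
ops = server_encode(client_hashes(f_old), f_new)
assert client_decode(f_old, ops) == f_new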

Rsync: some experiments

         gcc size   emacs size
total    27288      27326
gzip     7563       8577
zdelta   227        1431
rsync    964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

 The server sends the hashes (unlike rsync, where the client does) and the client checks them
 The server deploys the common f_ref to compress the new f_tar (rsync just compresses it on its own)

A multi-round protocol

 The file is cut into k blocks of n/k elements each; mismatching blocks are halved recursively, over log(n/k) levels
 If the distance is k, then on each level at most k hashes do not find a match in the other file
 The communication complexity is O(k lg n lg(n/k)) bits
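A toy sketch of the multi-round idea: recursively compare hashes of corresponding blocks and descend only into the blocks that disagree. Here both files sit in one process for simplicity; in the real protocol only the hashes (and finally the differing blocks) cross the link, and the block counts and hash sizes are tuned as on the slide.

import hashlib

def h(chunk: bytes) -> bytes:
    return hashlib.sha1(chunk).digest()

def diff_ranges(a: bytes, b: bytes, lo=0, hi=None, min_block=4):
    """Ranges of b (assumed same length as a) that differ from a."""
    hi = len(a) if hi is None else hi
    if h(a[lo:hi]) == h(b[lo:hi]):
        return []                      # hashes agree: one hash exchanged, stop
    if hi - lo <= min_block:
        return [(lo, hi)]              # small enough: ship the block itself
    mid = (lo + hi) // 2               # otherwise split and recurse one level
    return (diff_ranges(a, b, lo, mid, min_block) +
            diff_ranges(a, b, mid, hi, min_block))

old = bytearray(b"A" * 64)
new = bytearray(old); new[10] = ord("B"); new[50] = ord("C")
print(diff_ranges(bytes(old), bytes(new)))   # [(8, 12), (48, 52)]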

Next lecture

Set reconciliation

Problem: Given two sets S_A and S_B of integer values located on two machines A and B, determine the difference between the two sets, at one or both of the machines.

Requirements: The cost should be proportional to the size k of the difference, where k may or may not be known in advance to the parties.

Note:
 Set reconciliation is “easier” than file synch [it is record-based]... not perfectly true, but...
 Recurring minimum for improving the estimate + 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?
 Trie !!
 Array of string pointers !!

What about Substring Search ?

Basic notation and facts

Pattern P occurs at position i of T
iff
P is a prefix of the i-th suffix of T (i.e. T[i,N])

Occurrences of P in T = all suffixes of T having P as a prefix

Example: P = si, T = mississippi  P occurs at positions 4 and 7

SUF(T) = sorted set of suffixes of T

Reduction: from substring search to prefix search (over the suffixes of T)

The Suffix Tree

[Figure: the suffix tree of T# = mississippi#, with substring-labelled edges and the leaves storing the starting positions of the corresponding suffixes]

The Suffix Array

Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

T = mississippi#     P = si     (storing SUF(T) explicitly takes Θ(N²) space)

SA    SUF(T)
12    #
11    i#
8     ippi#
5     issippi#
2     ississippi#
1     mississippi#
10    pi#
9     ppi#
7     sippi#
4     sissippi#
6     ssippi#
3     ssissippi#

Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes
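A one-line Python sketch of the suffix array (for illustration only: sorting the suffixes this way costs far more than the specialized construction algorithms):

def suffix_array(T: str):
    # sort the starting positions of the suffixes in lexicographic order
    return sorted(range(len(T)), key=lambda i: T[i:])

print([i + 1 for i in suffix_array("mississippi#")])
# [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3], the SA shown above (1-based)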

Searching a pattern

Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step.

[Figure: two steps of the binary search for P = si on the SA of T = mississippi#; at each step P is compared with the suffix in the middle of the current range and found larger or smaller]

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char comparisons
 overall, O(p log2 N) time
» improvable to O(p + log2 N) [Manber-Myers, ’90]
» and to O(p + log2 |S|) [Cole et al., ’06]
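A sketch of the indirect binary search: two standard binary searches delimit the contiguous range of suffixes having P as a prefix, each step charging O(p) character comparisons.

def sa_range(T: str, SA, P: str):
    # leftmost suffix whose first |P| chars are >= P
    lo, hi = 0, len(SA)
    while lo < hi:
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + len(P)] < P:
            lo = mid + 1
        else:
            hi = mid
    left = lo
    # leftmost suffix whose first |P| chars are > P
    lo, hi = left, len(SA)
    while lo < hi:
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + len(P)] <= P:
            lo = mid + 1
        else:
            hi = mid
    return left, lo                       # occurrences are SA[left:lo]

T = "mississippi#"
SA = sorted(range(len(T)), key=lambda i: T[i:])
l, r = sa_range(T, SA, "si")
print([SA[k] + 1 for k in range(l, r)])   # [7, 4]: 1-based positions of "si"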

Locating the occurrences

[Figure: the SA of T = mississippi#, with the range of suffixes prefixed by si delimited by the virtual strings si# and si$ (taking # smaller and $ larger than any symbol of S); here occ = 2, at positions 4 and 7]

Suffix Array search
• O(p + log2 N + occ) time

Suffix Trays: O(p + log2 |S| + occ)   [Cole et al., ’06]
String B-tree   [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays   [Ciriani et al., ’02]

Text mining

Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

T = mississippi#
SA  = 12 11 8 5 2 1 10 9 7 4 6 3
Lcp =  0  0 1 4 0 0  1 0 2 1 3

(e.g. Lcp = 4 for the adjacent suffixes issippi# and ississippi#, which share the prefix issi)

• How long is the common prefix between T[i,...] and T[j,...] ?
  Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  Search for an entry Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  Search for a subarray Lcp[i,i+C-2] whose entries are all ≥ L.
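A sketch of the Lcp-based queries above, using a naive quadratic Lcp construction for illustration (linear-time constructions, e.g. Kasai's algorithm, exist):

def suffix_array(T: str):
    return sorted(range(len(T)), key=lambda i: T[i:])

def lcp_array(T: str, SA):
    def lcp(i, j):
        k = 0
        while i + k < len(T) and j + k < len(T) and T[i + k] == T[j + k]:
            k += 1
        return k
    # Lcp[r] = longest common prefix of the r-th and (r+1)-th suffix in SA
    return [lcp(SA[r], SA[r + 1]) for r in range(len(SA) - 1)]

T = "mississippi#"
SA = suffix_array(T)
Lcp = lcp_array(T, SA)

# Is there a repeated substring of length >= L?  Look for an entry >= L.
L = 4
r = max(range(len(Lcp)), key=Lcp.__getitem__)
print(Lcp[r] >= L, T[SA[r]:SA[r] + Lcp[r]])   # True issi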


How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides and efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
Emacs size

Emacs time

uncompr

27Mb

---

gzip

8Mb

35 secs

zdelta

1.5Mb

42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

0

620

2

20

123

2000

20
220

3

1
5

space

time

uncompr

30Mb

---

tgz

20%

linear

THIS

8%

quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
space

time

uncompr

260Mb

---

tgz

12%

2 mins

THIS

8%

16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

gcc size
total
27288
gzip
7563
zdelta
227
rsync
964

emacs size
27326
8577
1431
4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits
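A local simulation of the multi-round idea, under simplifying assumptions: the two files have equal length and differ only by in-place changes (no insertions or deletions); the hash truncation and the names are arbitrary.

import hashlib

def h(block: bytes) -> bytes:
    return hashlib.sha1(block).digest()[:8]   # each exchanged hash costs ~lg n bits

def reconcile(a: bytes, b: bytes, k: int):
    """Level 0: k blocks of n/k elements; afterwards every unmatched block is
    halved, so there are about log2(n/k) levels and, if the two files differ
    in k positions, at most ~k hashes per level fail to match."""
    n = len(a)
    size = (n + k - 1) // k
    blocks = [(lo, min(lo + size, n)) for lo in range(0, n, size)]
    diffs = []
    while blocks:
        unmatched = [(lo, hi) for lo, hi in blocks if h(a[lo:hi]) != h(b[lo:hi])]
        blocks = []
        for lo, hi in unmatched:
            if hi - lo == 1:
                diffs.append(lo)               # pinpointed a differing position
            else:
                mid = (lo + hi) // 2           # halve the block: one more level
                blocks += [(lo, mid), (mid, hi)]
    return diffs

a = b"abcdefgh" * 16
b = bytearray(a); b[5] ^= 1; b[77] ^= 1
print(reconcile(a, bytes(b), k=4))             # -> [5, 77]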

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff
P is a prefix of the i-th suffix of T (i.e. of T[i,N])

[Figure: P aligned with the beginning of the suffix T[i,N]]

Occurrences of P in T = all suffixes of T having P as a prefix

Example: P = si, T = mississippi  →  occurrences at positions 4, 7

SUF(T) = sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Figure: the suffix tree of T# = mississippi#. Edges are labeled with substrings (e.g. mississippi#, i, ssi, si, ppi#, pi#, i#, #), and the 12 leaves store the starting positions of the corresponding suffixes.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Θ(N²) space for storing SUF(T) explicitly:

SA    SUF(T)
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#

T = mississippi#   (each SA entry is a suffix pointer; the suffixes prefixed by P = si are contiguous)
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
⇒ In practice, a total of 5N bytes (4 bytes per SA entry + 1 byte per text char)

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison

[Figure: a binary-search step on SA = 12 11 8 5 2 1 10 9 7 4 6 3 for P = si over T = mississippi#; the probed suffix is compared with P (2 accesses per step) and here P is larger, so the search continues in the right half.]

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison

[Figure: a later binary-search step on the same SA for P = si; here P is smaller than the probed suffix, so the search continues in the left half.]

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char comparisons
⇒ overall, O(p log2 N) time

Improvable to O(p + log2 N) by exploiting Lcp information [Manber-Myers, '90]
and to O(p + log2 |S|) [Cole et al, '06]
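A small Python sketch of both steps (it needs Python 3.10+ for bisect's key= argument): the suffix array is built naively by sorting the suffixes, and a pattern is then located with two binary searches, each comparison costing O(p) characters; the function names are made up.

from bisect import bisect_left, bisect_right

def suffix_array(t: str):
    # naive construction: sort the suffix start positions by the suffixes themselves
    return sorted(range(len(t)), key=lambda i: t[i:])

def occurrences(t: str, sa, p: str):
    # the suffixes prefixed by p form a contiguous SA range (Prop 1)
    lo = bisect_left(sa, p, key=lambda i: t[i:i + len(p)])
    hi = bisect_right(sa, p, key=lambda i: t[i:i + len(p)])
    return sorted(sa[lo:hi])

t = "mississippi#"
sa = suffix_array(t)
print([i + 1 for i in sa])                        # 12 11 8 5 2 1 10 9 7 4 6 3 (1-based)
print([i + 1 for i in occurrences(t, sa, "si")])  # [4, 7]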

Locating the occurrences
[Figure: locating the occurrences of P = si in T = mississippi#. The suffixes prefixed by si (sippi#, sissippi#) form a contiguous SA range, delimited by the search keys si# and si$; the range contains SA entries 7 and 4, so occ = 2.]

Suffix Array search
• O(p + log2 N + occ) time   (using sentinels with # < every symbol of S < $)

Suffix Trays: O(p + log2 |S| + occ)   [Cole et al., '06]
String B-tree   [Ferragina-Grossi, '95]
Self-adjusting Suffix Arrays   [Ciriani et al., '02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp:   0  1  1  4  0  0  1  0  2  1  3
SA:   12 11  8  5  2  1 10  9  7  4  6  3
T = mississippi#

[Figure: the Lcp entry of value 4 corresponds to the adjacent suffixes issippi# (position 5) and ississippi# (position 2), which share the prefix issi.]
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
• Search for an entry Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
• Search for a window Lcp[i,i+C-2] whose entries are all ≥ L
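A sketch of how the Lcp array supports these queries, reusing suffix_array from the previous sketch; lcp_array implements Kasai's linear-time algorithm (here Lcp[r] is the lcp between the suffixes at SA[r-1] and SA[r]), and the query helpers are made-up names.

def lcp_array(t: str, sa):
    # Kasai's algorithm: O(N) overall
    n = len(t)
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    lcp = [0] * n                 # lcp[0] is unused
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]   # suffix adjacent to suffix i in SA
            while i + h < n and j + h < n and t[i + h] == t[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h:
                h -= 1            # the next lcp can drop by at most one
        else:
            h = 0
    return lcp

def repeat_of_length(lcp, L):
    # is there a substring of length >= L occurring at least twice?
    return any(v >= L for v in lcp[1:])

def occurs_at_least(lcp, L, C):
    # substring of length >= L occurring >= C times: a run of C-1 entries >= L
    run = 0
    for v in lcp[1:]:
        run = run + 1 if v >= L else 0
        if run >= C - 1:
            return True
    return False

t = "mississippi#"
sa = suffix_array(t)                              # from the sketch above
print(lcp_array(t, sa)[1:])                       # [0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3]
print(repeat_of_length(lcp_array(t, sa), 4))      # True: "issi" repeats
print(occurs_at_least(lcp_array(t, sa), L=1, C=4))  # True: "i" occurs 4 times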


Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)

This is at least 10^4 * f/(1+f)

If we fetch B ≈ 4KB in time C, and the algorithm uses all of them:

(1/B) * (p * f/(1+f) * C)  ≈  30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
Running times on arrays of increasing size:

  size   4K    8K    16K    32K    128K   256K   512K   1M
  n^3    22s   3m    26m    3.5h   28h    --     --     --
  n^2    0     0     0      1s     26s    106s   7m     28m

An optimal solution
We assume every subsum ≠ 0

(figure: A split into a prefix whose running sum is < 0, followed by the Optimum window, inside which all partial sums stay > 0)

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm
    sum = 0; max = -1;
    For i = 1, ..., n do
        if (sum + A[i] ≤ 0) then sum = 0;
        else { sum += A[i]; max = MAX{max, sum}; }

Note:
 • Sum < 0 when OPT starts;
 • Sum > 0 within OPT
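
A minimal Python sketch of this one-pass scan (the function name max_subarray_sum is illustrative; it mirrors the pseudocode above, resetting the running sum whenever it would drop to zero or below):

def max_subarray_sum(A):
    # One pass: reset the running sum when it would become <= 0,
    # otherwise extend the current window and remember the best sum seen.
    best, running = -1, 0
    for x in A:
        if running + x <= 0:
            running = 0
        else:
            running += x
            best = max(best, running)
    return best

# Example from the slide: the optimum is 6+1-2+4+3 = 12
print(max_subarray_sum([2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]))   # 12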

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort  Θ(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 10^9 random I/Os = 10^9 * 5ms ≈ 2 months

Binary Merge-Sort

Merge-Sort(A,i,j)
  if (i < j) then
      m = (i+j)/2;               // Divide
      Merge-Sort(A,i,m);         // Conquer
      Merge-Sort(A,m+1,j);
      Merge(A,i,m,j)             // Combine

Cost of Mergesort on large data

Take Wikipedia in Italian, compute word freq:
    n = 10^9 tuples  ⇒  a few GBs
    Typical Disk (Seagate Cheetah 150GB): seek time ~5ms

Analysis of mergesort on disk:
    It is an indirect sort: Θ(n log2 n) random I/Os
    [5ms] * n log2 n  ≈  1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

(figure: the merge-sort recursion tree over log2 N levels, merging ever longer sorted runs of numbers; main memory M holds only a few runs at a time)

If the run-size is larger than B (i.e. after the first step!!),
fetching all of it in memory for merging does not help.

How do we deploy the disk/memory features?

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort

The key is to balance run-size and #runs to merge.
Sort N items with main memory M and disk pages of B items:

    Pass 1: Produce (N/M) sorted runs.
    Pass i: merge X ≤ M/B runs at a time  ⇒  log_{M/B}(N/M) merge passes

(figure: X input buffers and one output buffer, each of B items, kept in main memory while the runs stream from/to disk)

Multiway Merging

(figure: the X = M/B run buffers Bf1..Bfx with cursors p1..pX; repeatedly move min(Bf1[p1], Bf2[p2], …, Bfx[pX]) to the output buffer Bfo; fetch the next page of run i when pi = B, flush Bfo when full, until EOF of the merged run)

Cost of Multi-way Merge-Sort

Number of passes = log_{M/B} #runs  ≈  log_{M/B} (N/M)

Optimal cost = Θ((N/B) · log_{M/B} (N/M)) I/Os

In practice
    M/B ≈ 1000  ⇒  #passes = log_{M/B} (N/M) ≈ 1
    One multiway merge  ⇒  2 passes = a few mins
    (tuning depends on disk features)

 A large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!
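
A small Python sketch of one multi-way merging pass, using a heap of the current run heads (the function name and the use of in-memory lists as "runs" are illustrative only):

import heapq

def multiway_merge(runs):
    # Merge X sorted runs with a min-heap over their current heads,
    # mimicking the X input buffers + 1 output buffer of the slide.
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        val, i, j = heapq.heappop(heap)
        out.append(val)                          # "flush" to the output run
        if j + 1 < len(runs[i]):                 # "fetch" the next item of run i
            heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
    return out

print(multiway_merge([[1, 5, 9], [2, 3, 10], [4, 6, 7, 8]]))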

Can compression help?

Goal: enlarge M and reduce N
    #passes = O(log_{M/B} (N/M))
    Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm

    Use a pair of variables <X, C>
    For each item s of the stream:
        if (X == s) then C++
        else { C--; if (C == 0) { X = s; C = 1; } }
    Return X;
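
The same scan in a few lines of Python (a sketch of the majority-vote idea above, in its usual formulation where the counter is tested before the comparison):

def majority_candidate(stream):
    # Keep one candidate X and a counter C, as in the slide.
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1
        elif X == s:
            C += 1
        else:
            C -= 1
    return X    # the majority item, IF some item occurs > N/2 times

print(majority_candidate("bacccdcbaaaccbccc"))   # 'c'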

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing

Consider the following TREC collection:
    N = 6 * 10^9 chars  ⇒  size = 6GB
    n = 10^6 documents
    TotT = 10^9 term occurrences (avg term length is 6 chars)
    t = 5 * 10^5 distinct terms

What kind of data structure do we build to support word-based searches?

Solution 1: Term-Doc matrix        (n = 1 million documents, t = 500K terms)

             Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
  Antony            1                1             0          0       0        1
  Brutus            1                1             0          1       0        0
  Caesar            1                1             0          1       1        1
  Calpurnia         0                1             0          0       0        0
  Cleopatra         1                0             0          0       0        0
  mercy             1                0             1          1       1        1
  worser            1                0             1          1       1        0

Entry is 1 if the play contains the word, 0 otherwise.      Space is 500Gb !

Solution 2: Inverted index

    Brutus     →  2  4  8  16  32  64  128
    Caesar     →  1  2  3  5  8  13  21  34
    Calpurnia  →  13  16

We can still do better: i.e. 30÷50% of the original text

1. Typically use about 12 bytes per posting
2. We have 10^9 total terms  ⇒  at least 12GB of space
3. Compressing the 6GB of documents gets 1.5GB of data
   A better index, but it is still >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

    ∑_{i=1}^{n-1} 2^i  =  2^n − 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
    i(s) = log2 (1 / p(s)) = − log2 p(s)

Lower probability  ⇒  higher information

Entropy is the weighted average of i(s):

    H(S) = ∑_{s∈S} p(s) · log2 (1 / p(s))   bits
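
For concreteness, a tiny Python helper computing H(S) from a probability table (purely illustrative):

from math import log2

def entropy(probs):
    # H(S) = sum over the symbols of p(s) * log2(1/p(s))
    return sum(p * log2(1.0 / p) for p in probs.values() if p > 0)

# The distribution used in the Huffman running example below:
print(entropy({"a": 0.1, "b": 0.2, "c": 0.2, "d": 0.5}))   # ~1.76 bits/symbol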

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
    La(C) = ∑_{s∈S} p(s) · L[s]

We say that a prefix code C is optimal if for all prefix codes C’, La(C) ≤ La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal uniquely decodable code, there exists a prefix code with the same codeword lengths, and thus the same optimal average length. And vice versa…

Theorem (golden rule). If C is an optimal prefix code for a source with probabilities {p1, …, pn},
then  pi < pj  ⇒  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any uniquely decodable code C, we have

    H(S) ≤ La(C)

Theorem (upper bound, Shannon). For any probability distribution, there exists a prefix code C such that

    La(C) ≤ H(S) + 1

(The Shannon code takes ⌈log2 (1/p)⌉ bits per symbol.)

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

(figure: the Huffman tree — a(.1) and b(.2) are merged into a node of weight .3; that node and c(.2) are merged into .5; finally .5 and d(.5) are merged into the root of weight 1)

a = 000, b = 001, c = 01, d = 1

There are 2^(n-1) “equivalent” Huffman trees.

What about ties (and thus, tree depth) ?
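
A compact Python sketch of the greedy construction (heap-based; ties are broken arbitrarily by an insertion counter, which is exactly why several equivalent trees exist):

import heapq
from itertools import count

def huffman_codes(probs):
    # Repeatedly merge the two least-probable trees; prepend 0/1 along the way.
    tiebreak = count()
    heap = [(p, next(tiebreak), {s: ""}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
    return heap[0][2]

print(huffman_codes({"a": 0.1, "b": 0.2, "c": 0.2, "d": 0.5}))
# codeword lengths: a, b -> 3 bits, c -> 2, d -> 1 (one of the equivalent trees)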

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
    − log2(0.999) ≈ 0.00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
(figure: the word-based, tagged Huffman tree over the dictionary {bzip, or, not, space}: each codeword is a sequence of 7-bit symbols packed into bytes, the first bit of each byte being the tag; below it, the byte-aligned codewords of C(T) for T = “bzip or not bzip”)

CGrep and other ideas...
P = bzip = 1a 0b

(figure: compressed matching — the tagged codeword of P is scanned directly over C(T), T = “bzip or not bzip”, reporting yes/no at each byte-aligned position)
Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

(figure: the dictionary {bzip, not, or} with its tagged codewords, the compressed text C(S) for S = “bzip or not bzip”, and the byte-aligned scan for the codeword of P, answering yes/no at each position)
Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which arithmetic and bit operations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

    H(s) = ∑_{i=1}^{m} 2^{m−i} · s[i]

    P = 0101
    H(P) = 2^3·0 + 2^2·1 + 2^1·0 + 2^0·1 = 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons

We can compute H(Tr) from H(Tr−1):

    H(Tr) = 2·H(Tr−1) − 2^m · T[r−1] + T[r+m−1]

T = 10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

    H(T1) = H(1011) = 11
    H(T2) = H(0110) = 2·11 − 2^4·1 + 0 = 22 − 16 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P = 101111,  q = 7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally:
    1·2 (mod 7) + 0 = 2
    2·2 (mod 7) + 1 = 5
    5·2 (mod 7) + 1 = 4
    4·2 (mod 7) + 1 = 2
    2·2 (mod 7) + 1 = 5
    5 (mod 7) = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(Tr−1), using
    2^m (mod q) = 2·(2^{m−1} (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
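
A small Python sketch of the fingerprint scan (the prime q is fixed here instead of being drawn at random, and every fingerprint hit is verified, so it behaves like the deterministic variant):

def karp_rabin(T, P, q=2_147_483_647):
    # Hq(s) = (sum_i 2^(m-i) * s[i]) mod q, maintained incrementally along T.
    n, m = len(T), len(P)
    if m > n:
        return []
    hP = hT = 0
    for i in range(m):
        hP = (2 * hP + int(P[i])) % q
        hT = (2 * hT + int(T[i])) % q
    top = pow(2, m - 1, q)                        # 2^(m-1) mod q
    occ = []
    for r in range(n - m + 1):
        if hP == hT and T[r:r + m] == P:          # verify, to rule out false matches
            occ.append(r + 1)                     # 1-based positions, as in the slides
        if r + m < n:                             # slide the window: drop T[r], add T[r+m]
            hT = (2 * (hT - int(T[r]) * top) + int(T[r + m])) % q
    return occ

print(karp_rabin("10110101", "0101"))   # [5]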

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

(figure: the 3×10 matrix M for T = california and P = for; its only 1-entries are M(1,5), M(2,6) and M(3,7), i.e. the full occurrence of “for” ending at position 7 of T)
How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M

We want to exploit bit-parallelism to compute the j-th column of M from the (j−1)-th one.
We define the m-length binary vector U(x), for each character x of the alphabet: U(x) has a 1 exactly in the positions where x appears in P.

Example:  P = abaac
    U(a) = (1,0,1,1,0)
    U(b) = (0,1,0,0,0)
    U(c) = (0,0,0,0,1)

How to construct M

Initialize column 0 of M to all zeros.
For j > 0, the j-th column is obtained as

    M(j) = BitShift( M(j−1) )  &  U( T[j] )

For i > 1, entry M(i,j) = 1 iff
  (1) the first i−1 characters of P match the i−1 characters of T ending at position j−1,  i.e.  M(i−1, j−1) = 1
  (2) P[i] = T[j],  i.e.  the i-th bit of U(T[j]) = 1

BitShift moves bit M(i−1, j−1) into the i-th position; AND-ing with the i-th bit of U(T[j]) establishes whether both conditions hold.
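
A bit-parallel Python sketch of this construction (Python integers play the role of the machine word: bit i−1 of the integer "col" stands for M(i,j)):

def shift_and(T, P):
    # Precompute U(x): bit i-1 is set iff P[i] == x.
    U = {}
    for i, x in enumerate(P):
        U[x] = U.get(x, 0) | (1 << i)
    full = 1 << (len(P) - 1)
    col, occ = 0, []
    for j, c in enumerate(T, start=1):
        # M(j) = BitShift(M(j-1)) & U(T[j]): shift down by one, set the first bit.
        col = ((col << 1) | 1) & U.get(c, 0)
        if col & full:                   # M(m,j) = 1: P ends at position j of T
            occ.append(j)
    return occ

print(shift_and("california", "for"))     # [7]
print(shift_and("xabxabaaca", "abaac"))   # [9]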

An example

(worked example for P = abaac and T = xabxabaaca: the columns M(j) are computed one after the other via M(j) = BitShift(M(j−1)) & U(T[j]); the slides show the columns for j = 1, 2, 3 and 9, and at j = 9 the 5-th bit of the column is 1, i.e. the whole pattern occurs ending at position 9 of T)

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
    U(a) = (1,0,1,1,0)
    U(b) = (1,1,0,0,0)
    U(c) = (0,0,0,0,1)

What about ‘?’, ‘[^…]’ (not) ?

Problem 1: Another solution

Dictionary: bzip, not, or

P = bzip = 1a 0b

(figure: as before, the tagged codeword of P is matched byte-aligned against C(S), S = “bzip or not bzip”)
Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

P = o

(figure: the dictionary terms containing “o” are  not = 1g 0g 0a  and  or = 1g 0a 0b;  their codewords are then searched in C(S), S = “bzip or not bzip”)
Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

    S is the concatenation of the patterns in P
    R is a bitmap of length m:  R[i] = 1 iff S[i] is the first symbol of some pattern

Use a variant of the Shift-And method searching for S:
    For any symbol c,  U’(c) = U(c) AND R
        U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
    For any step j:
        compute M(j)
        then set M(j) = M(j) OR U’(T[j]). Why? To set to 1 the first bit of each pattern that starts with T[j]
        Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

P = bot,  k = 2

(figure: the dictionary {bzip, not, or}, its tagged codewords, and the compressed text C(S) for S = “bzip or not bzip”)

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix M^l to be an m by n binary matrix such that:

    M^l(i,j) = 1  iff  the first i characters of P match the i characters of T ending at position j, with no more than l mismatches.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

    (figure: P[1..i−1] aligned against T ending at j−1 with ≤ l mismatches, followed by P[i] = T[j])

    BitShift( M^l(j−1) )  &  U( T[j] )

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

    (figure: P[1..i−1] aligned against T ending at j−1 with ≤ l−1 mismatches; the mismatch budget pays for position j)

    BitShift( M^{l−1}(j−1) )

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute M^l(j), we observe that there is a match iff

    M^l(j)  =  [ BitShift( M^l(j−1) ) & U( T[j] ) ]  ∨  BitShift( M^{l−1}(j−1) )
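
A Python sketch of this k-mismatch recurrence (same bit conventions as the Shift-And sketch earlier; the list M holds the current columns of M^0 … M^k, and only mismatches are handled, not insertions or deletions):

def agrep_mismatches(T, P, k):
    U = {}
    for i, x in enumerate(P):                 # U(x): bit i-1 set iff P[i] == x
        U[x] = U.get(x, 0) | (1 << i)
    full = 1 << (len(P) - 1)
    M, occ = [0] * (k + 1), []
    for j, c in enumerate(T, start=1):
        prev = M[:]                           # the columns at position j-1
        for l in range(k + 1):
            col = ((prev[l] << 1) | 1) & U.get(c, 0)      # case 1: the characters match
            if l > 0:
                col |= (prev[l - 1] << 1) | 1             # case 2: spend one mismatch here
            M[l] = col
        if M[k] & full:
            occ.append(j)                     # an occurrence with <= k mismatches ends at j
    return occ

print(agrep_mismatches("xabxabaaca", "abaad", 1))   # [9]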

Example

(worked example for P = abaad and T = xabxabaaca: the slides show the full matrices M^0 and M^1; e.g. M^1(5,9) = 1 because “abaad” matches T[5..9] = “abaac” with a single mismatch)

How much do we pay?

    The running time is O( k · n · (1 + m/w) ).
    Again, the method is practically efficient for small m.
    Only O(k) columns of M are needed at any given time, hence the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

P = bot,  k = 2

(figure: the dictionary term  not = 1g 0g 0a  matches P within k mismatches, and its codeword is then searched byte-aligned in C(S), S = “bzip or not bzip”)

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding

    γ(x) = 0^(Length−1) · (x in binary),   where x > 0 and Length = ⌊log2 x⌋ + 1

    e.g., 9 is represented as <000, 1001>.

    The γ-code for x takes 2⌊log2 x⌋ + 1 bits   (i.e. a factor of 2 from optimal)

    Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8

6

3

59

7
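
A short Python sketch of γ-encoding/decoding, which can be used to check the exercise above (bit strings are plain Python strings here, purely for illustration):

def gamma_encode(x):
    # gamma(x): (Length-1) zeros followed by the binary representation of x, x > 0.
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i + z] == "0":                        # count the leading zeros
            z += 1
        out.append(int(bits[i + z:i + 2 * z + 1], 2))    # the next z+1 bits are the value
        i += 2 * z + 1
    return out

print(gamma_encode(9))                                   # 0001001
print(gamma_decode("0001000001100110000011101100111"))   # [8, 6, 3, 59, 7]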

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).

Recall that |γ(i)| ≤ 2·log i + 1.
How good is this approach wrt Huffman?  Compression ratio ≤ 2·H0(s) + 1

Key fact:   1 ≥ ∑_{i=1,...,x} pi ≥ x·px   ⇒   x ≤ 1/px

How good is it ?
The cost of the encoding is (recall i ≤ 1/pi):

    ∑_{i=1,...,|S|} pi · |γ(i)|   ≤   ∑_{i=1,...,|S|} pi · [ 2·log(1/pi) + 1 ]   =   2·H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example

    5000 distinct words
    ETDC encodes 128 + 128^2 = 16512 words on at most 2 bytes
    A (230,26)-dense code encodes 230 + 230·26 = 6210 words on at most 2 bytes,
    hence more words fit on 1 byte, which pays off if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1^n 2^n 3^n … n^n   ⇒   Huff = O(n^2 log n) bits,  MTF = O(n log n) + n^2 bits

No much worse than Huffman
...but it may be far better
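
A minimal Python sketch of the MTF transform over a fixed symbol list (positions are emitted 1-based, ready to be γ-coded; the names are illustrative):

def mtf_encode(text, alphabet):
    L = list(alphabet)              # the current MTF list
    out = []
    for s in text:
        pos = L.index(s)            # position of s in L (0-based here)
        out.append(pos + 1)         # emit it 1-based, then move s to the front
        L.pop(pos)
        L.insert(0, s)
    return out

print(mtf_encode("aabbbba", "abcd"))   # [1, 1, 2, 1, 1, 1, 2]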

MTF: how good is it ?
Encode the output integers via γ-coding:  |γ(i)| ≤ 2·log i + 1
Put the alphabet S in front and consider the cost of the encoding
(p_i^x = position of the i-th occurrence of symbol x):

    O(|S| log |S|)  +  ∑_{x=1}^{|S|} ∑_{i=2}^{n_x} |γ( p_i^x − p_{i−1}^x )|

By Jensen’s inequality:

    ≤  O(|S| log |S|)  +  ∑_{x=1}^{|S|} n_x · [ 2·log(N / n_x) + 1 ]
    =  O(|S| log |S|)  +  N · [ 2·H0(X) + 1 ]

    La[mtf]  ≤  2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:
    Exploits spatial locality, and it is a dynamic code (there is a memory)
    X = 1^n 2^n 3^n … n^n   ⇒   Huff(X) = Θ(n^2 log n)  >  Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive):

    f(i) = ∑_{j=1}^{i−1} p(j)

e.g.  p(a) = .2, p(b) = .5, p(c) = .3   ⇒   f(a) = .0, f(b) = .2, f(c) = .7
      a → [0, .2),   b → [.2, .7),   c → [.7, 1.0)

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7)).

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
(figure: the nested intervals — start with [0,1); symbol b narrows it to [.2,.7); symbol a narrows it to [.2,.3); symbol c narrows it to [.27,.3))

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c1 c2 … cn with probabilities p[c], use the following:

    l0 = 0      li = li−1 + si−1 · f[ci]
    s0 = 1      si = si−1 · p[ci]

f[c] is the cumulative probability up to symbol c (not included).
The final interval size is

    sn = ∏_{i=1}^{n} p[ci]

The interval for a message sequence will be called the sequence interval.
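
A tiny Python sketch that computes the sequence interval with exactly these formulas (exact fractions avoid rounding; the probability table is the one of the running example):

from fractions import Fraction as F

def sequence_interval(msg, p, f):
    l, s = F(0), F(1)
    for c in msg:
        l = l + s * f[c]      # l_i = l_{i-1} + s_{i-1} * f[c_i]
        s = s * p[c]          # s_i = s_{i-1} * p[c_i]
    return l, s

p = {"a": F(2, 10), "b": F(5, 10), "c": F(3, 10)}
f = {"a": F(0), "b": F(2, 10), "c": F(7, 10)}
l, s = sequence_interval("bac", p, f)
print(float(l), float(l + s))     # 0.27 0.3  ->  the interval [.27, .3)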

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
(figure: .49 falls in b’s interval [.2,.7); within it, .49 falls again in b’s sub-interval [.3,.55); finally .49 falls in c’s sub-interval [.475,.55))

The message is bbc.

Representing a real number
Binary fractional representation:

    .75 = .11        1/3 = .0101…        11/16 = .1011

Algorithm
    1. x = 2·x
    2. if x < 1 output 0
    3. else x = x − 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
e.g.  [0,.33) → .01      [.33,.66) → .1      [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions:

    code    min       max       interval
    .11     .110…     .111…     [.75, 1.0)
    .101    .1010…    .1011…    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

    1 + ⌈log2 (1/s)⌉  =  1 + ⌈log2 ∏_{i=1,n} (1/pi)⌉
                      ≤  2 + ∑_{i=1,n} log2 (1/pi)
                      =  2 + ∑_{k=1,|S|} n·pk · log2 (1/pk)
                      =  2 + n·H0   bits

In practice it is nH0 + 0.02·n bits, because of rounding.

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts          String = ACCBACCACBA (next symbol: B),  k = 2

  Empty context:   A = 4   B = 2   C = 5   $ = 3

  Context A:    C = 3   $ = 1
  Context B:    A = 2   $ = 1
  Context C:    A = 1   B = 2   C = 2   $ = 3

  Context AC:   B = 1   C = 2   $ = 2
  Context BA:   C = 1   $ = 1
  Context CA:   C = 1   $ = 1
  Context CB:   A = 2   $ = 1
  Context CC:   A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
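
A simple Python sketch of LZ77 parsing and decoding (greedy longest match over an unbounded window, so the triples differ slightly from the fixed-window example above; real gzip adds the window, the hashing and the Huffman stages listed next):

def lz77_encode(s):
    out, i = [], 0
    while i < len(s):
        d = length = 0
        for j in range(i):                                   # candidate copy starts
            l = 0
            while i + l < len(s) - 1 and s[j + l] == s[i + l]:
                l += 1                                       # may run past i: overlapping copy
            if l > length:
                d, length = i - j, l
        out.append((d, length, s[i + length]))               # (distance, len, next char)
        i += length + 1
    return out

def lz77_decode(triples):
    out = []
    for d, l, c in triples:
        for _ in range(l):
            out.append(out[-d])       # copy from the already-decoded text (works even if l > d)
        out.append(c)
    return "".join(out)

code = lz77_encode("aacaacabcabaaac")
print(code)   # [(0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), (3, 3, 'a'), (9, 2, 'c')]
print(lz77_decode(code) == "aacaacabcabaaac")     # True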

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
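
A Python sketch of the LZW encoder only (a toy dictionary initialized with just the symbols that occur, instead of the 256 ASCII entries; the decoder, with the special SSc case above, is omitted):

def lzw_encode(s):
    dic = {c: i for i, c in enumerate(sorted(set(s)))}   # toy initial dictionary
    out, w = [], ""
    for c in s:
        if w + c in dic:
            w += c                      # extend the current match S
        else:
            out.append(dic[w])          # emit id(S) ...
            dic[w + c] = len(dic)       # ... and add Sc to the dictionary
            w = c
    out.append(dic[w])
    return out

print(lzw_encode("aabaacababacb"))      # [0, 0, 1, 3, 2, 4, 8, 2, 1]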

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]
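
A small Python sketch tying these facts together: build SA by explicitly sorting the suffixes (the “elegant but inefficient” way of the next slide), set L[i] = T[SA[i]−1], and invert via the LF-mapping. It assumes T ends with a unique, smallest sentinel ‘#’:

def bwt(T):
    sa = sorted(range(len(T)), key=lambda i: T[i:])       # suffix array (0-based)
    return "".join(T[i - 1] for i in sa)                  # L[i] = T[SA[i]-1]

def inverse_bwt(L):
    n = len(L)
    order = sorted(range(n), key=lambda i: (L[i], i))     # F[f] is the char L[order[f]]
    LF = [0] * n
    for f, l in enumerate(order):
        LF[l] = f                      # LF maps a char of L to its position in F
    r, chars = 0, []                   # row 0 is the rotation starting with '#'
    for _ in range(n):
        chars.append(L[r])             # L[r] precedes F[r] in T: collect T backwards
        r = LF[r]
    t = "".join(reversed(chars))       # this is '#' + T[:-1] ...
    return t[1:] + t[0]                # ... so rotate the sentinel back to the end

print(bwt("mississippi#"))             # ipssm#pissii
print(inverse_bwt("ipssm#pissii"))     # mississippi#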

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n^2 log n) time in the worst case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution

Altavista crawl, 1999 — WebBase crawl, 2001.
The indegree follows a power law distribution:

    Pr[ in-degree(u) = k ]  ∝  1 / k^α ,    α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides an efficient, optimal solution




fknown is the “previously encoded text”: compress the concatenation fknown·fnew, starting the parsing from fnew

zdelta is one of the best implementations:

            Emacs size    Emacs time
  uncompr   27MB          ---
  gzip      8MB           35 secs
  zdelta    1.5MB         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

(figure: the two proxies sit on either side of the slow link; the request and the reference page are exchanged delta-encoded over the slow link, while the web page itself is fetched over the fast link)

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

(figure: the weighted graph GF plus the dummy node; edge weights are zdelta sizes (e.g. 20, 123, 220, 620, 2000), and the min branching picks the cheapest reference for each file)

            space    time
  uncompr   30MB     ---
  tgz       20%      linear
  THIS      8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly: n^2 edge calculations (zdelta executions)

We wish to exploit some pruning approach:

    Collection analysis: Cluster the files that appear similar and are thus good candidates for
    zdelta-compression. Build a sparse weighted graph G’F containing only the edges between those pairs of files.

    Assign weights: Estimate appropriate edge weights for G’F, thus saving zdelta executions.
    Nonetheless, strictly n^2 time.

            space    time
  uncompr   260MB    ---
  tgz       12%      2 mins
  THIS      8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
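
A much-simplified Python sketch of the block-matching idea (fixed 4-byte blocks, Python’s built-in hash at every offset instead of the 4-byte rolling hash + 2-byte MD5 of real rsync, and hash collisions are ignored; purely illustrative):

BLOCK = 4

def client_hashes(f_old):
    # The client hashes its blocks and sends {hash: block index} to the server.
    return {hash(f_old[i:i + BLOCK]): i // BLOCK
            for i in range(0, len(f_old), BLOCK)}

def server_encode(f_new, hashes):
    # The server scans f_new: where a window matches an old block it emits
    # ("copy", block index), otherwise it emits a literal character.
    out, i = [], 0
    while i < len(f_new):
        if len(f_new) - i >= BLOCK and hash(f_new[i:i + BLOCK]) in hashes:
            out.append(("copy", hashes[hash(f_new[i:i + BLOCK])]))
            i += BLOCK
        else:
            out.append(("lit", f_new[i]))
            i += 1
    return out

def client_decode(ops, f_old):
    return "".join(f_old[x * BLOCK:(x + 1) * BLOCK] if op == "copy" else x
                   for op, x in ops)

f_old, f_new = "the quick brown fox!", "a quick brown fox!!!"
ops = server_encode(f_new, client_hashes(f_old))
print(client_decode(ops, f_old) == f_new)    # True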

Rsync: some experiments

            gcc size    emacs size
  total     27288       27326
  gzip      7563        8577
  zdelta    227         1431
  rsync     964         4452

Compressed sizes in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends the hashes (unlike rsync, where the client sends them), and the client checks them.
The server deploys the common f_ref to compress the new f_tar (rsync just compresses it on its own, without using f_ref).

A multi-round protocol

    k blocks of n/k elements
    log(n/k) levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k · lg n · lg(n/k)) bits.

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts

Pattern P occurs at position i of T  iff  P is a prefix of the i-th suffix of T (i.e. T[i,N])

Occurrences of P in T = all the suffixes of T having P as a prefix

Example: P = si, T = mississippi  P occurs at positions 4 and 7

SUF(T) = sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Figure: the suffix tree of T# = mississippi#. Edges are labeled with substrings of T# (e.g. i, s, si, ssi, p, pi#, ppi#, mississippi#, #) and each of the 12 leaves stores the starting position of its suffix.]

The Suffix Array

Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position (in SA) is the lexicographic position of P.

T = mississippi#      (storing SUF(T) explicitly would take Θ(N²) space)

  SA    SUF(T)
  12    #
  11    i#
   8    ippi#
   5    issippi#
   2    ississippi#
   1    mississippi#
  10    pi#
   9    ppi#
   7    sippi#
   4    sissippi#
   6    ssippi#
   3    ssissippi#

e.g. the suffixes prefixed by P = si are the two contiguous entries 7 and 4

Suffix Array space:
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes
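A minimal sketch of a suffix-array construction by direct sorting of the suffixes; simple but inefficient (quadratic-time comparisons in the worst case), yet enough to reproduce the table above.

def suffix_array(T: str) -> list[int]:
    # sort the suffix starting positions by comparing the suffixes themselves
    return sorted(range(len(T)), key=lambda i: T[i:])

T = "mississippi#"
print([i + 1 for i in suffix_array(T)])   # 1-based: [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]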

Searching a pattern

Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step
(one to SA, one to the text).

[Figure: binary search for P = si on the SA of T = mississippi#; at the probed entry
P compares larger, so the search continues in the right half.]

Searching a pattern (contd)

[Figure: the next binary-search step; here P = si compares smaller than the probed
suffix, so the search continues in the left half.]

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char comparisons
 overall, O(p log2 N) time
 improvable to O(p + log2 N) [Manber-Myers, ’90] and to O(p + log2 |Σ|) [Cole et al., ’06]
(a search sketch follows)
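A minimal sketch of the prefix search over SA via binary search: Python's bisect performs the O(log N) steps, and each comparison looks only at the first p characters of a suffix. The helper names are illustrative; the key list is materialized only to keep the sketch short (a real index compares lazily against T).

from bisect import bisect_left, bisect_right

def suffix_array(T: str) -> list[int]:
    return sorted(range(len(T)), key=lambda i: T[i:])

def occurrences(T: str, SA: list[int], P: str) -> list[int]:
    # left/right boundaries of the SA range whose suffixes start with P
    keys = [T[i:i + len(P)] for i in SA]       # p-char prefixes, in SA (sorted) order
    lo = bisect_left(keys, P)
    hi = bisect_right(keys, P)
    return sorted(SA[lo:hi])                   # starting positions of the occurrences

T = "mississippi#"
SA = suffix_array(T)
print([i + 1 for i in occurrences(T, SA, "si")])   # [4, 7] (1-based), as in the slides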

Locating the occurrences
[Figure: on the SA of T = mississippi#, the occurrences of P = si form the contiguous range delimited by the virtual strings si# and si$: the two entries 7 (sippi…) and 4 (sissippi…), hence occ = 2.]

Suffix Array search
• O(p + log2 N + occ) time, assuming # < every character of Σ < $

Related results:
• Suffix Trays: O(p + log2 |Σ| + occ)    [Cole et al., ’06]
• String B-tree                          [Ferragina-Grossi, ’95]
• Self-adjusting Suffix Arrays           [Ciriani et al., ’02]

Text mining

Lcp[1,N-1] = longest common prefix between suffixes adjacent in SA

T = mississippi#
SA  = 12 11 8 5 2 1 10 9 7 4 6 3
Lcp =  0  1 1 4 0 0  1 0 2 1 3     (e.g. issippi# and ississippi# share the prefix issi, of length 4)

• How long is the common prefix between T[i,...] and T[j,...] ?
   Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
   Search for an entry Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times ?
   Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.
(a sketch computing Lcp follows)
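A minimal sketch that computes the Lcp array naively (character-by-character comparison of adjacent suffixes, so quadratic in the worst case; Kasai's algorithm does it in linear time) and uses it to answer the repeated-substring question above. Function names are illustrative.

def suffix_array(T: str) -> list[int]:
    return sorted(range(len(T)), key=lambda i: T[i:])

def lcp_array(T: str, SA: list[int]) -> list[int]:
    """Lcp[i] = length of the longest common prefix of the i-th and (i+1)-th suffix."""
    def lcp(i: int, j: int) -> int:
        k = 0
        while i + k < len(T) and j + k < len(T) and T[i + k] == T[j + k]:
            k += 1
        return k
    return [lcp(SA[i], SA[i + 1]) for i in range(len(SA) - 1)]

def has_repeat_of_length(T: str, L: int) -> bool:
    SA = suffix_array(T)
    return any(v >= L for v in lcp_array(T, SA))   # some length-L substring occurs twice

T = "mississippi#"
print(lcp_array(T, suffix_array(T)))   # [0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3]
print(has_repeat_of_length(T, 4))      # True: "issi" occurs twice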


Slide 82

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with i-th bit of U(T[j]) to estabilish if both are true

An example j=1
1 2 3 4 5 6 7 8 9 10

n
m

1

12345

1

0

P=abaac

2

0

3

0

4

0

5

0

T=xabxabaaca

0
 
0
U (x)  0
 
0
 
0

2

3

4

5

6

7

1 
 
0
BitShift ( M ( 0 )) & U (T [1])   0  &
 
0
 
0

8

9

0 0
   
0 0
0  0
   
0 0
   
0 0

1
0

An example j=2
1 2 3 4 5 6 7 8 9 10

n
m

1

2

12345

1

0

1

P=abaac

2

0

0

3

0

0

4

0

0

5

0

0

T=xabxabaaca

1 
 
0
U (a )   1 
 
1 
 
0

3

4

5

6

7

1 
 
0
BitShift ( M (1)) & U (T [ 2 ])   0  &
 
0
 
0

8

9

1  1 
   
0 0
1    0 
   
1   0 
   
0 0

1
0

An example j=3
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

12345

1

0

1

0

P=abaac

2

0

0

1

3

0

0

0

4

0

0

0

5

0

0

0

T=xabxabaaca

0
 
1 
U (b )   0 
 
0
 
0

4

5

6

7

1 
 
1 
BitShift ( M ( 2 )) & U (T [ 3 ])   0  &
 
0
 
0

8

9

0 0
   
1  1 
0  0
   
0 0
   
0 0

1
0

An example j=9
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

4

5

6

7

8

9

12345

1

0

1

0

0

1

0

1

1

0

P=abaac

2

0

0

1

0

0

1

0

0

0

3

0

0

0

0

0

0

1

0

0

4

0

0

0

0

0

0

0

1

0

5

0

0

0

0

0

0

0

0

1

T=xabxabaaca

0
 
0
U (c )   0 
 
0
 
1 

1 
 
1 
BitShift ( M ( 8 )) & U (T [ 9 ])   0  &
 
0
 
1 

0 0
   
0 0
0  0
   
0 0
   
1  1 

1
0

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M ( j - 1))  U (T [ j ])
l

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M

l -1

( j - 1))

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1
1 2 3 4 5 6 7 8 910

M1=

T=xabxabaaca
P=
abaad

1

2

3

4

5

6

7

8

9

1
0

1

1

1

1

1

1

1

1

1

1

1

2

0

0

1

0

0

1

0

1

1

0

3

0

0

0

1

0

0

1

0

0

1

4

0

0

0

0

1

0

0

1

0

0

5

0

0

0

0

0

0

0

0

1

0

M0=

1

2

3

4

5

6

7

8

9

1
0

1

0

1

0

0

1

0

1

1

0

1

2

0

0

1

0

0

1

0

0

0

0

3

0

0

0

0

0

0

1

0

0

0

4

0

0

0

0

0

0

0

1

0

0

5

0

0

0

0

0

0

0

0

0

0

How much do we pay?





The running time is O(kn(1+m/w)
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Delection: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
0 000 ...........0 x in b in ary
Length-1


x > 0 and Length = log2 x +1
e.g., 9 represented as <000,1001>.



g-code for x takes 2 log2 x +1 bits

(ie. factor of 2 from optimal)



Optimal for Pr(x) = 1/2x2, and i.i.d integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8

6

3

59

7

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How much good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
1 ≥Si=1,...,x pi ≥ x * px  x ≤ 1/px

How good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):



p i  | g ( i ) |

i  1 ,..., S

This is:



i  1 ,.., S

p i  [ 2 * log

1

 1]

pi

 2 * H 0 ( X ) 1
No much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1n 2n 3n… nn  Huff = O(n2 log n), MTF = O(n log n) + n2

No much worse than Huffman
...but it may be far better

MTF: how good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
Put S in the front and consider the cost of encoding:
S

O ( S log S ) 



nx

g(p - p )

x 1 i  2

x

x

i

i -1

S

By Jensen’s:

 O ( S log S ) 

n
x 1

x

[ 2 * log

N

 1]

nx

 O ( S log S )  N * [ 2 * H 0 ( X )  1]
L a [ mtf ]  2 * H 0 ( X )  O (1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1n 2n 3n… nn 

There is a memory

Huff(X) = Q(n2 log n) > Rle(X) = Q(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.
1.0

c = .3

0.7

i -1

f (i ) 



p( j)

j 1

b = .5
0.2
0.0

f(a) = .0, f(b) = .2, f(c) = .7
a = .2

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
1.0

0.7
c = .3

0.7

c = .3
0.55

0.0

a = .2

0.3
0.2

c = .3

0.27
b = .5

b = .5
0.2

0.3

a = .2

b = .5
0.22

a = .2

0.2

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l 0  0 l i  l i -1  s i -1 * f c i 

s0  1

s i  s i -1 * p c i 

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

n

sn 

 p c 
i

i 1

The interval for a message sequence will be called the
sequence interval

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
1.0

0.7
c = .3

0.7
0.49

b = .5

c = .3

0.55
0.49

0.2

0.0

0.55

b = .5

0.3
a = .2

The message is bbc.

0.2

0.49
0.475

c = .3

b = .5
0.35

a = .2

0.3

a = .2

Representing a real number
Binary fractional representation:
. 75

 . 11

1/ 3

 . 01 01

11 / 16

 . 1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
m in

m ax

interval

.11

.110

.111

[.75 ,1.0 )

.101

.1010

.1011

[.625 , .75 )

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
Context
Empty

Counts
A
B
C
$

=
=
=
=

4
2
5
3

Context
A
B
C

Counts
C
$
A
$
A
B
C
$

=
=
=
=
=
=
=
=

3
1
2
1
1
2
2
3

Context
AC

BA
CA
CB
CC

String = ACCBACCACBA B

k=2

Counts
B
C
$
C
$
C
$
A
$
A
B
$

=
=
=
=
=
=
=
=
=
=
=
=

1
2
2
1
1
1
1
2
1
1
1
2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Q(n2 log n) time in the worst-case
• Q(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in - degree ( u )  k ] 

a  2.1

1

k

a

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides and efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
Emacs size

Emacs time

uncompr

27Mb

---

gzip

8Mb

35 secs

zdelta

1.5Mb

42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

0

620

2

20

123

2000

20
220

3

1
5

space

time

uncompr

30Mb

---

tgz

20%

linear

THIS

8%

quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
space

time

uncompr

260Mb

---

tgz

12%

2 mins

THIS

8%

16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

gcc size
total
27288
gzip
7563
zdelta
227
rsync
964

emacs size
27326
8577
1431
4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
#

0

s

i

1
ssi

ppi#

4

#

1

p

i

3

2

1

ppi#

6

pi#

7

8
5

T# = mississippi#
2 4 6 8 10

si

ppi#

i#

ppi#

11

mississippi#

12

2

1 10

9

4

3

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

Storing SUF(T) explicitly would take Θ(N²) space; the suffix array SA stores only the suffix pointers.

T = mississippi#

  SA    SUF(T)
  12    #
  11    i#
   8    ippi#
   5    issippi#
   2    ississippi#
   1    mississippi#
  10    pi#
   9    ppi#
   7    sippi#
   4    sissippi#
   6    ssippi#
   3    ssissippi#

(e.g. the query P = si selects a contiguous range of this table)

Suffix Array space:
• SA: Θ(N log₂ N) bits
• Text T: N chars
→ In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step.

[Figure: two steps of the binary search for P = si over the SA of T = mississippi#; comparing P with the middle suffix tells whether P is larger or smaller than it, and halves the SA range accordingly.]

Suffix Array search
• O(log₂ N) binary-search steps
• Each step takes O(p) char comparisons
→ overall, O(p log₂ N) time
Improvable to O(p + log₂ N) [Manber-Myers, '90] and to O(p + log₂ |S|) [Cole et al., '06].
(a sketch of the binary search follows)

Locating the occurrences

[Figure: for P = si in T = mississippi#, the two binary searches for the range boundaries (conceptually, for si# and si$) delimit the SA interval containing the entries 7 and 4, hence occ = 2.]

Suffix Array search
• O(p + log₂ N + occ) time  (with # smaller and $ larger than every symbol of S)

Suffix Trays: O(p + log₂ |S| + occ)   [Cole et al., '06]
String B-tree   [Ferragina-Grossi, '95]
Self-adjusting Suffix Arrays   [Ciriani et al., '02]

Text mining
Lcp[1,N-1] = longest common prefix between suffixes adjacent in SA.

T = mississippi#

  SA    suffix          Lcp
  12    #                0
  11    i#               1
   8    ippi#            1
   5    issippi#         4
   2    ississippi#      0
   1    mississippi#     0
  10    pi#              1
   9    ppi#             0
   7    sippi#           2
   4    sissippi#        1
   6    ssippi#          3
   3    ssissippi#

(Lcp[i] refers to the pair of suffixes in rows i and i+1; e.g. the 4 is the lcp of issippi# and ississippi#.)

• How long is the common prefix between T[i,...] and T[j,...]?
  → It is the minimum of the subarray Lcp[h,k-1], where SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L?
  → Search for an entry Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times?
  → Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.
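A sketch of the Lcp-based tests above, with a naive Lcp computation (a real implementation would use Kasai's linear-time algorithm); the function names are illustrative.

  def lcp_array(T, SA):
      # Lcp[i] = longest common prefix between the suffixes SA[i] and SA[i+1].
      def lcp(a, b):
          k = 0
          while a + k < len(T) and b + k < len(T) and T[a + k] == T[b + k]:
              k += 1
          return k
      return [lcp(SA[i], SA[i + 1]) for i in range(len(SA) - 1)]

  def has_repeat_of_length(Lcp, L):
      # A repeated substring of length >= L exists iff some Lcp entry is >= L.
      return any(v >= L for v in Lcp)

  def has_substring_occurring(Lcp, L, C):
      # A substring of length >= L occurring >= C times exists iff some window
      # Lcp[i .. i+C-2] has all its entries >= L.
      w = C - 1
      return any(all(v >= L for v in Lcp[i:i + w]) for i in range(len(Lcp) - w + 1))

  T = "mississippi#"
  SA = sorted(range(len(T)), key=lambda i: T[i:])
  Lcp = lcp_array(T, SA)
  print(Lcp)                                 # [0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3]
  print(has_repeat_of_length(Lcp, 4))        # True  ("issi" occurs twice)
  print(has_substring_occurring(Lcp, 1, 4))  # True  ("i" occurs 4 times)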


Slide 83

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 10⁴ * f/(1+f)
If we fetch B ≈ 4Kb in time C, and the algorithm uses all of them:
(1/B) * (p * f/(1+f) * C) ≈ 30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
Running times of the brute-force algorithms (input size vs. time):

  n      4K    8K    16K    32K    128K   256K   512K   1M
  n³     22s   3m    26m    3.5h   28h    --     --     --
  n²     0     0     0      1s     26s    106s   7m     28m

An optimal solution
We assume every subsum ≠ 0.

[Figure: the array A splits into a prefix with sum < 0 followed by the optimum window with sum > 0.]

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm:
  sum = 0; max = -1;
  for i = 1,...,n do
    if (sum + A[i] ≤ 0) sum = 0;
    else { sum += A[i]; max = MAX{max, sum}; }

Note:
• sum < 0 just before OPT starts;
• sum > 0 within OPT.
(a runnable sketch follows)
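A runnable sketch of the scan above (essentially Kadane's algorithm); it assumes, as the slide does, that the optimum sum is positive.

  def max_subarray_sum(A):
      # One left-to-right scan: reset the running sum when it would drop to <= 0.
      best, s = float("-inf"), 0
      for x in A:
          if s + x <= 0:
              s = 0
          else:
              s += x
              best = max(best, s)
      return best

  A = [2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]
  print(max_subarray_sum(A))   # 12, i.e. the window 6 1 -2 4 3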

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 10⁹ random I/Os = 10⁹ * 5ms ≈ 2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01  if (i < j) then
02    m = (i+j)/2;              // Divide
03    Merge-Sort(A,i,m);        // Conquer
04    Merge-Sort(A,m+1,j);
05    Merge(A,i,m,j)            // Combine

Cost of Mergesort on large data

Take Wikipedia in Italian, compute word frequencies:
  n = 10⁹ tuples → a few Gbs
  Typical disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:
  It is an indirect sort: Θ(n log₂ n) random I/Os
  [5ms] * n log₂ n ≈ 1.5 years

In practice it is faster, because of caching (each merge level costs 2 passes, read/write).

Merge-Sort Recursion Tree

[Figure: the recursion tree of Merge-Sort, with log₂ N levels; at each level pairs of sorted runs of doubling length are merged.]

If the run-size is larger than B (i.e. after the first step!!), fetching all of it in memory for merging does not help.

How do we deploy the disk/memory features?

With internal memory M: N/M runs, each sorted in internal memory (no I/Os)
— I/O-cost for merging is ≈ 2 (N/B) log₂ (N/M)

Multi-way Merge-Sort

The key is to balance run-size and #runs to merge.
Sort N items with main memory M and disk pages of B items:

  Pass 1: produce N/M sorted runs.
  Pass i: merge X ≤ M/B runs at a time → log_{M/B}(N/M) passes

[Figure: X input buffers and one output buffer, each of B items, kept in main memory; the runs are streamed from disk through the input buffers and merged into the output buffer, which is flushed back to disk.]

Multiway Merging

[Figure: each run r has a current page Bf_r with pointer p_r; the merger repeatedly outputs min(Bf1[p1], Bf2[p2], …, BfX[pX]) into the output buffer Bf_o, fetching the next page of run r when p_r = B, and flushing Bf_o to the merged output run when it is full, until EOF.]

Cost of Multi-way Merge-Sort

Number of passes = log_{M/B} #runs ≈ log_{M/B} (N/M)
Optimal cost = Θ((N/B) log_{M/B} (N/M)) I/Os

In practice:
  M/B ≈ 1000 → #passes = log_{M/B}(N/M) ≈ 1
  One multiway merge → 2 passes = a few mins
  (tuning depends on disk features)

→ A large fan-out (M/B) decreases #passes
→ Compression would decrease the cost of a pass!
(a sketch of the merging step follows)
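A Python sketch of one X-way merging step using a heap over the runs' current heads; here the runs are in-memory lists standing in for sorted runs on disk, so the B-item buffering and the I/O are abstracted away and only the merging logic is shown.

  import heapq

  def multiway_merge(runs):
      # Merge X sorted runs into one sorted output with a min-heap of their heads.
      heap = [(run[0], r, 0) for r, run in enumerate(runs) if run]
      heapq.heapify(heap)
      out = []
      while heap:
          val, r, i = heapq.heappop(heap)      # overall minimum among the X heads
          out.append(val)
          if i + 1 < len(runs[r]):             # advance within that run
              heapq.heappush(heap, (runs[r][i + 1], r, i + 1))
      return out

  runs = [[1, 2, 5, 10], [2, 7, 9, 13], [3, 4, 8, 15]]
  print(multiway_merge(runs))   # [1, 2, 2, 3, 4, 5, 7, 8, 9, 10, 13, 15]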

Can compression help?

Goal: enlarge M and reduce N
  #passes = O(log_{M/B} (N/M))
  Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements

Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2, using the smallest space (i.e. the mode, provided it occurs > N/2 times).

A = b a c c c d c b a a a c c b c c c

Algorithm (majority voting):
  Use a pair of variables <X,C>
  For each item s of the stream:
    if (X == s) then C++
    else { C--; if (C == 0) { X = s; C = 1; } }
  Return X;

Proof sketch
(There can be problems only if the most frequent item occurs ≤ N/2 times.)
If X ≠ y at the end, then every one of y's occurrences has a "negative" mate (a distinct item that cancelled it). Hence the mates are ≥ #occ(y), so N ≥ 2 * #occ(y) > N: a contradiction.
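A sketch of the streaming algorithm above (Boyer–Moore majority voting), written in the equivalent form that tests C == 0 first.

  def majority_candidate(stream):
      # One pass, two variables: returns the majority item if one occurs > N/2 times.
      X, C = None, 0
      for s in stream:
          if C == 0:
              X, C = s, 1
          elif X == s:
              C += 1
          else:
              C -= 1
      return X

  A = list("bacccdcbaaaccbccc")
  print(majority_candidate(A))   # 'c'  (c occurs 9 times out of 17)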

Toy problem #4: Indexing

Consider the following TREC collection:
  N = 6 * 10⁹ chars, size = 6Gb
  n = 10⁶ documents
  TotT = 10⁹ term occurrences (avg term length is 6 chars)
  t = 5 * 10⁵ distinct terms

What kind of data structure do we build to support word-based searches?

Solution 1: Term-Doc matrix
t = 500K terms, n = 1 million documents; the entry is 1 if the play contains the word, 0 otherwise.

              Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
  Antony            1                 1             0          0       0        1
  Brutus            1                 1             0          1       0        0
  Caesar            1                 1             0          1       1        1
  Calpurnia         0                 1             0          0       0        0
  Cleopatra         1                 0             0          0       0        0
  mercy             1                 0             1          1       1        1
  worser            1                 0             1          1       1        0

Space is 500Gb !

Solution 2: Inverted index

[Figure: each term (Brutus, Calpurnia, Caesar) points to the sorted list of the documents containing it, e.g. 2 → 4 → 8 → 16 → 32 → 64 → 128.]

We can still do better: i.e. 30-50% of the original text.

1. Typically we use about 12 bytes per posting
2. We have 10⁹ total terms → at least 12Gb space
3. Compressing the 6Gb of documents gets 1.5Gb of data
Better index, but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string is (losslessly) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM into fewer bits?
NO: they are 2^n, but the shorter compressed messages are fewer:

  ∑_{i=1}^{n-1} 2^i = 2^n − 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the self-information of s is:

  i(s) = log₂ (1/p(s)) = − log₂ p(s)

Lower probability → higher information.

Entropy is the weighted average of i(s):

  H(S) = ∑_{s ∈ S} p(s) · log₂ (1/p(s))   bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword lengths L[s], the average length is defined as

  La(C) = ∑_{s ∈ S} p(s) · L[s]

We say that a prefix code C is optimal if for all prefix codes C', La(C) ≤ La(C').

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any uniquely decodable code C, we have H(S) ≤ La(C).

Theorem (upper bound, Shannon). For any probability distribution, there exists a prefix code C such that La(C) ≤ H(S) + 1.

(The Shannon code assigns to symbol s a codeword of ⌈log₂ 1/p(s)⌉ bits.)

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

[Figure: the Huffman tree is built bottom-up: merge a(.1)+b(.2) → (.3); merge (.3)+c(.2) → (.5); merge (.5)+d(.5) → (1).]

Resulting code: a=000, b=001, c=01, d=1

There are 2^(n-1) "equivalent" Huffman trees.

What about ties (and thus, tree depth) ?
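A compact Huffman construction using a binary heap, run on the example's probabilities. Ties may be broken differently than in the slide, so the actual codewords can differ while the codeword lengths (3, 3, 2, 1) and the average length stay the same.

  import heapq

  def huffman_codes(probs):
      # probs: dict symbol -> probability. Returns dict symbol -> codeword string.
      heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(sorted(probs.items()))]
      heapq.heapify(heap)
      count = len(heap)
      while len(heap) > 1:
          p1, _, c1 = heapq.heappop(heap)      # the two least-probable trees
          p2, _, c2 = heapq.heappop(heap)
          merged = {s: "0" + w for s, w in c1.items()}
          merged.update({s: "1" + w for s, w in c2.items()})
          heapq.heappush(heap, (p1 + p2, count, merged))
          count += 1
      return heap[0][2]

  probs = {"a": .1, "b": .2, "c": .2, "d": .5}
  code = huffman_codes(probs)
  print(code)   # e.g. {'a': '110', 'b': '111', 'c': '10', 'd': '0'}: lengths 3,3,2,1
  print(sum(p * len(code[s]) for s, p in probs.items()))   # 1.8 = optimal average length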

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to the symbol to be encoded.
Decoding: Start at the root and take the branch for each bit received. When at a leaf, output its symbol and return to the root.

With the tree above: abc… → 00000101…, and 101001… → dcb…

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store, for every level L:
  firstcode[L] = the first (smallest) codeword of length L (e.g. 00…0 for the deepest level)
  Symbol[L,i], for each i in level L

This takes ≤ h² + |S| log |S| bits (h = tree height).

Canonical Huffman: Encoding / Decoding

[Figure: example with levels 1..5 and firstcode[1]=2, firstcode[2]=1, firstcode[3]=1, firstcode[4]=2, firstcode[5]=0; decoding of T = …00010… proceeds by reading bits and comparing the running value against firstcode of the current length.]

Problem with Huffman Coding
Consider a symbol with probability .999. Its self-information is

  −log₂(.999) ≈ .00144 bits

If we were to send 1000 such symbols we might hope to use about 1000 * .00144 ≈ 1.44 bits.
Using Huffman, we take at least 1 bit per symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
  → 1 extra bit per macro-symbol = 1/k extra bits per symbol
  → but a larger model has to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:
  The model takes |S|^k · (k · log |S|) + h² bits (where h might be |S|)
  It is H0(S^L) ≤ L · Hk(S) + O(k · log |S|), for each k ≤ L

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
  Symbols of the Huffman tree are the words of T
  The Huffman tree has fan-out 128
  Codewords are byte-aligned and tagged

[Figure: a 128-ary Huffman tree over the words of T = "bzip or not bzip" (plus the space symbol); each codeword is a sequence of 7-bit configurations packed into bytes, and the tag bit of each byte marks whether it starts a codeword.]

CGrep and other ideas...

[Figure: searching P = bzip = 1a 0b directly on the compressed text C(T) of T = "bzip or not bzip": the pattern is translated into its tagged, byte-aligned codeword and matched against C(T); the tag bits prevent false matches across codeword boundaries.]

Speed ≈ Compression ratio

You find it under my Software projects.

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary: {bzip, not, or, space}
P = bzip = 1a 0b
S = "bzip or not bzip", compressed into C(S) with the tagged word-based Huffman code above.

[Figure: the codeword of P is matched against C(S), byte by byte, answering yes/no at each candidate position.]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the pattern P[1,m] in the text T[1,n].

[Figure: the pattern P slides over the text T.]

 Naïve solution
   For any position i of T, check if T[i,i+m-1] = P[1,m]
   Complexity: O(nm) time

 (Classical) optimal solutions based on comparisons:
   Knuth-Morris-Pratt
   Boyer-Moore
   Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons

Strings are also numbers, H: strings → numbers.
Let s be a string of length m:

  H(s) = ∑_{i=1}^{m} 2^{m−i} · s[i]

Example: P = 0101
  H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5

s = s' if and only if H(s) = H(s').

Definition: let T_r denote the m-length substring of T starting at position r (i.e., T_r = T[r, r+m−1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons

We can compute H(T_r) from H(T_{r−1}):

  H(T_r) = 2·H(T_{r−1}) − 2^m·T[r−1] + T[r+m−1]

Example: T = 10110101, T1 = 1011, T2 = 0110
  H(T1) = H(1011) = 11
  H(T2) = H(0110) = 2·11 − 2⁴·1 + 0 = 22 − 16 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P = 101111, q = 7

H(P) = 47, Hq(P) = 47 mod 7 = 5

Hq(P) can be computed incrementally, bit by bit (h ← (2·h + next bit) mod 7):
  (2·1 + 0) mod 7 = 2
  (2·2 + 1) mod 7 = 5
  (2·5 + 1) mod 7 = 4
  (2·4 + 1) mod 7 = 2
  (2·2 + 1) mod 7 = 5 = Hq(P)

We can still compute Hq(T_r) from Hq(T_{r−1}), using 2^m (mod q) = 2·(2^{m−1} (mod q)) (mod q).

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm

  Choose a positive integer I.
  Pick a random prime q ≤ I, and compute P's fingerprint Hq(P).
  For each position r in T, compute Hq(T_r) and test whether it equals Hq(P). If the numbers are equal, either
    declare a probable match (randomized algorithm), or
    check and declare a definite match (deterministic algorithm).

Running time: excluding verification, O(n+m).
The randomized algorithm is correct w.h.p.
The deterministic algorithm has expected running time O(n+m).

Proof on the board. (A sketch of the randomized scan follows.)
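A Python sketch of the randomized fingerprint scan over a binary text, following the recurrences above; for simplicity the modulus q is fixed rather than drawn at random below I, and no verification step is performed.

  def karp_rabin(T: str, P: str, q: int = 2**31 - 1):
      # Report the 1-based positions r with Hq(T_r) == Hq(P): probable matches.
      n, m = len(T), len(P)
      if m > n:
          return []
      def H(s):                      # fingerprint of an m-bit string
          h = 0
          for bit in s:
              h = (2 * h + int(bit)) % q
          return h
      hp, ht = H(P), H(T[:m])
      top = pow(2, m - 1, q)         # 2^(m-1) mod q, used to drop the leading bit
      hits = []
      for r in range(1, n - m + 2):
          if ht == hp:
              hits.append(r)         # (a deterministic version would verify T[r-1:r-1+m] == P)
          if r <= n - m:
              ht = (2 * (ht - int(T[r - 1]) * top) + int(T[r - 1 + m])) % q
      return hits

  print(karp_rabin("10110101", "0101"))   # [5], as in the example above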

Problem 1: Solution

[Figure: same dictionary {bzip, not, or, space} and tagged word-based Huffman compression of S = "bzip or not bzip"; the solution matches the codeword of P = bzip = 1a 0b directly on C(S), exploiting the tag bits to align on codeword boundaries.]

Speed ≈ Compression ratio

The Shift-And method

Define M to be a binary m by n matrix such that:
  M(i,j) = 1 iff the first i characters of P exactly match the i characters of T ending at character j,
  i.e., M(i,j) = 1 iff P[1…i] = T[j−i+1…j].

Example: T = california and P = for

[Matrix M, rows f,o,r and columns c,a,l,i,f,o,r,n,i,a: M(1,5)=1 (f), M(2,6)=1 (fo), M(3,7)=1 (for); all other entries are 0.]

How does M solve the exact match problem?

How to construct M

We want to exploit bit-parallelism to compute the j-th column of M from the (j−1)-th one.
Machines can perform bit and arithmetic operations between two words in constant time. Examples:
  And(A,B) is the bit-wise AND between A and B.
  BitShift(A) is the value obtained by shifting A's bits down by one position and setting the first bit to 1,
  e.g. BitShift([0,1,1,0,1]) = [1,0,1,1,0].

Let w be the word size (e.g., 32 or 64 bits). We'll assume m ≤ w. NOTICE: any column of M fits in a memory word.

How to construct M

We want to exploit bit-parallelism to compute the j-th column of M from the (j−1)-th one.
We define the m-length binary vector U(x) for each character x in the alphabet: U(x) has a 1 in the positions where x appears in P.

Example: P = abaac
  U(a) = [1,0,1,1,0]   U(b) = [0,1,0,0,0]   U(c) = [0,0,0,0,1]

How to construct M

Initialize column 0 of M to all zeros.
For j > 0, the j-th column is obtained by

  M(j) = BitShift( M(j−1) ) & U( T[j] )

For i > 1, entry M(i,j) = 1 iff
  (1) the first i−1 characters of P match the i−1 characters of T ending at character j−1  ⇔  M(i−1,j−1) = 1
  (2) P[i] = T[j]  ⇔  the i-th bit of U(T[j]) = 1

BitShift moves bit M(i−1,j−1) into the i-th position; ANDing it with the i-th bit of U(T[j]) establishes whether both conditions hold.

An example (T = xabxabaaca, P = abaac)

[Figure: the columns M(1), M(2), M(3), …, M(9) computed step by step via M(j) = BitShift(M(j−1)) & U(T[j]); at j = 9 the 5-th bit of the column is 1, signalling an occurrence of P ending at position 9.]

Shift-And method: Complexity

  If m ≤ w, any column and any vector U() fit in a memory word → each step requires O(1) time.
  If m > w, any column and any vector U() can be split into ⌈m/w⌉ memory words → each step requires O(m/w) time.
  Overall O(n(1 + m/w) + m) time.

Thus it is very fast when the pattern length is close to the word size — very often the case in practice. Recall that w = 64 bits in modern architectures.
(a sketch in code follows)
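A Python sketch of the Shift-And scan, with each column of M kept as an integer (bit i−1 of the word plays the role of row i, so BitShift becomes a left shift OR 1).

  def shift_and(T: str, P: str):
      # Report the 1-based end positions of the occurrences of P in T.
      m = len(P)
      U = {}                                   # U[c] has bit i set iff P[i] == c
      for i, c in enumerate(P):
          U[c] = U.get(c, 0) | (1 << i)
      M, ends = 0, []
      for j, c in enumerate(T, start=1):
          M = ((M << 1) | 1) & U.get(c, 0)     # M(j) = BitShift(M(j-1)) & U(T[j])
          if M & (1 << (m - 1)):               # row m is set: P ends at position j
              ends.append(j)
      return ends

  print(shift_and("xabxabaaca", "abaac"))      # [9]
  print(shift_and("california", "for"))        # [7]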

Some simple extensions

We want to allow the pattern to contain special symbols, like the class of chars [a-f].

Example: P = [a-b]baac
  U(a) = [1,0,1,1,0]   U(b) = [1,1,0,0,0]   U(c) = [0,0,0,0,1]

What about '?', '[^…]' (not)?

Problem 1: An other solution

[Figure: same dictionary {bzip, not, or, space} and compressed text C(S) of S = "bzip or not bzip"; this time the occurrences of P = bzip = 1a 0b are found by running the Shift-And style scan directly over the bytes of C(S).]

Speed ≈ Compression ratio

Problem 2
Dictionary: {bzip, not, or, space}

Given a pattern P, find all the occurrences in S of all the terms containing P as a substring.

Example: P = o. The matching dictionary terms are "not" = 1g 0g 0a and "or" = 1g 0a 0b, whose codewords are then searched in C(S), with S = "bzip or not bzip".

Speed ≈ Compression ratio? No! Why?
A scan of C(S) is needed for each term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want to find all the occurrences of those patterns in the text T[1,n].

 Naïve solution
   Use an (optimal) exact-matching algorithm, searching for each pattern of P
   Complexity: O(nl + m) time — not good with many patterns

 Optimal solution due to Aho and Corasick
   Complexity: O(n + l + m) time

A simple extension of Shift-And

  S is the concatenation of the patterns in P.
  R is a bitmap of length m, with R[i] = 1 iff S[i] is the first symbol of a pattern.
  Use a variant of the Shift-And method searching for S:
    For any symbol c, U'(c) = U(c) AND R, so U'(c)[i] = 1 iff S[i] = c and it is the first symbol of a pattern.
    For any step j, compute M(j), then M(j) OR U'(T[j]). Why? This sets to 1 the first bit of each pattern that starts with T[j].
    Check if there are occurrences ending in j. How? Look at the bits of M(j) that correspond to the last symbol of each pattern.

Problem 3
Dictionary: {bzip, not, or, space}

Given a pattern P, find all the occurrences in S of all the terms containing P as a substring, allowing at most k mismatches.

Example: P = bot, k = 2, over S = "bzip or not bzip".

[Figure: same dictionary and compressed text C(S) as before.]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk

We compute M^l for all l = 0,…,k. For each j we compute M^0(j), M^1(j), …, M^k(j); for all l we initialize M^l(0) to the zero vector.
In order to compute M^l(j), we observe that entry (i,j) is 1 iff one of the following two cases holds.

Case 1: the first i−1 characters of P match a substring of T ending at j−1 with at most l mismatches, and the next pair of characters in P and T are equal:

  BitShift( M^l(j−1) ) & U( T[j] )

Case 2: the first i−1 characters of P match a substring of T ending at j−1 with at most l−1 mismatches (so a mismatch at T[j] can be afforded):

  BitShift( M^{l−1}(j−1) )

Putting the two cases together:

  M^l(j) = [ BitShift( M^l(j−1) ) & U(T[j]) ]  OR  BitShift( M^{l−1}(j−1) )

Example
T = xabxabaaca, P = abaad, k = 1

[Figure: the matrices M^0 and M^1 computed column by column; e.g. M^1(5,9) = 1, i.e. P occurs ending at position 9 of T with at most one mismatch.]
How much do we pay?





The running time is O(kn(1+m/w)
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution

[Figure: for P = bot, k = 2, the dictionary terms are scanned with the k-mismatch machinery; e.g. "not" = 1g 0g 0a matches within 2 mismatches, and its codeword is then searched in C(S), with S = "bzip or not bzip".]

Agrep: more sophisticated operations

The Shift-And method can solve other operations as well.
The edit distance between two strings p and s is d(p,s) = the minimum number of operations needed to transform p into s via three ops:
  Insertion: insert a symbol in p
  Deletion: delete a symbol from p
  Substitution: change a symbol of p into a different one
Example: d(ananas, banane) = 3

Search by regular expressions
Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword lengths, and thus build the tree…
This may be extremely time/space costly when you deal with Gbs of textual data.

A simple algorithm: sort the pi in decreasing order, and encode si via the variable-length code for the integer i.

γ-code for integer encoding
  γ(x) = (Length−1 zeros) followed by x in binary, where x > 0 and Length = ⌊log₂ x⌋ + 1
  e.g., 9 is represented as <000,1001>.
  The γ-code for x takes 2⌊log₂ x⌋ + 1 bits (i.e. a factor of 2 from optimal).
  It is optimal for Pr(x) = 1/(2x²), and i.i.d. integers.
It is a prefix-free encoding…


Given the following sequence of γ-coded integers, reconstruct the original sequence:

  0001000001100110000011101100111

  → 8, 6, 3, 59, 7
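A sketch of γ-encoding and decoding; run on the exercise above, it reproduces the sequence 8, 6, 3, 59, 7.

  def gamma_encode(x: int) -> str:
      assert x > 0
      b = bin(x)[2:]                  # binary representation of x (Length bits)
      return "0" * (len(b) - 1) + b   # Length-1 zeros, then x in binary

  def gamma_decode(bits: str):
      out, i = [], 0
      while i < len(bits):
          z = 0
          while bits[i] == "0":       # count the leading zeros = Length-1
              z += 1
              i += 1
          out.append(int(bits[i:i + z + 1], 2))
          i += z + 1
      return out

  print("".join(gamma_encode(x) for x in (8, 6, 3, 59, 7)))
  # -> 0001000001100110000011101100111
  print(gamma_decode("0001000001100110000011101100111"))   # [8, 6, 3, 59, 7]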

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).
Recall that |γ(i)| ≤ 2·log i + 1.
How good is this approach wrt Huffman? Compression ratio ≤ 2·H0(s) + 1.
Key fact: 1 ≥ ∑_{j=1,…,i} pj ≥ i · pi  ⟹  i ≤ 1/pi

How good is it?
The cost of the encoding is (recall i ≤ 1/pi):

  ∑_{i=1,…,|S|} pi · |γ(i)|  ≤  ∑_{i=1,…,|S|} pi · [ 2·log(1/pi) + 1 ]  =  2·H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ….

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms…. Can we do everything in one pass?

  Move-to-Front (MTF):
    as a freq-sorting approximator
    as a caching strategy
    as a compressor
  Run-Length-Encoding (RLE):
    FAX compression

Move to Front Coding
Transforms a char sequence into an integer sequence, which can then be var-length coded.
  Start with the list of symbols L = [a,b,c,d,…]
  For each input symbol s:
    1) output the position of s in L
    2) move s to the front of L
There is a memory. Properties: it exploits temporal locality, and it is dynamic.
  X = 1ⁿ 2ⁿ 3ⁿ … nⁿ  →  Huff = O(n² log n), MTF = O(n log n) + n²
Not much worse than Huffman … but it may be far better
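A sketch of MTF encoding and decoding (positions are 0-based here).

  def mtf_encode(s: str, alphabet: str):
      L = list(alphabet)
      out = []
      for c in s:
          i = L.index(c)        # 1) output the position of c in L
          out.append(i)
          L.pop(i)              # 2) move c to the front of L
          L.insert(0, c)
      return out

  def mtf_decode(codes, alphabet: str):
      L = list(alphabet)
      out = []
      for i in codes:
          c = L.pop(i)
          out.append(c)
          L.insert(0, c)
      return "".join(out)

  codes = mtf_encode("abbbaacccca", "abcd")
  print(codes)                          # [0, 1, 0, 0, 1, 0, 2, 0, 0, 0, 1]
  print(mtf_decode(codes, "abcd"))      # abbbaacccca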

MTF: how good is it ?
Encode the output integers via γ-coding: |γ(i)| ≤ 2·log i + 1.
Put S in front of the sequence; the cost of encoding is then bounded by the sum, over each symbol x, of the γ-codes of the gaps between its consecutive occurrences:

  O(|S| log |S|) + ∑_{x=1}^{|S|} ∑_{i=2}^{n_x} γ( p_{x,i} − p_{x,i−1} )

By Jensen's inequality:

  ≤ O(|S| log |S|) + ∑_{x=1}^{|S|} n_x · [ 2·log(N/n_x) + 1 ]
  = O(|S| log |S|) + N · [ 2·H0(X) + 1 ]

Hence La[mtf] ≤ 2·H0(X) + O(1).

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
  abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings → just the run lengths and one bit.
There is a memory. Properties: it exploits spatial locality, and it is a dynamic code.
  X = 1ⁿ 2ⁿ 3ⁿ … nⁿ  →  Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using "fractional" parts of bits!!
Used in PPM, JPEG/MPEG (as an option), Bzip.
More time costly than Huffman, but the integer implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol an interval in [0,1), starting at the cumulative probability

  f(i) = ∑_{j < i} p(j)

e.g. p(a) = .2, p(b) = .5, p(c) = .3  →  f(a) = .0, f(b) = .2, f(c) = .7
so a gets [0,.2), b gets [.2,.7), c gets [.7,1.0).

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7)).

Arithmetic Coding: Encoding Example
Coding the message sequence: bac

[Figure: start from [0,1); symbol b restricts it to [.2,.7); within it, symbol a restricts it to [.2,.3); within that, symbol c restricts it to [.27,.3).]

The final sequence interval is [.27,.3).

Arithmetic Coding
To code a sequence of symbols c with probabilities p[c], use the following:

  l_0 = 0,  s_0 = 1
  l_i = l_{i−1} + s_{i−1} · f[c_i]
  s_i = s_{i−1} · p[c_i]

where f[c] is the cumulative probability up to symbol c (not included). The final interval size is

  s_n = ∏_{i=1}^{n} p[c_i]

The interval for a message sequence will be called the sequence interval.

Uniquely defining an interval
Important property: the intervals for distinct messages of length n never overlap.
Therefore specifying any number in the final interval uniquely determines the message.
Decoding is similar to encoding, but at each step we need to determine what the message symbol is and then reduce the interval.
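A sketch that computes the sequence interval [l, l+s) with the recurrences above, using exact fractions to sidestep rounding; the symbol probabilities are those of the running example.

  from fractions import Fraction as F

  def sequence_interval(msg, p):
      # Return (l, s): the sequence interval of msg is [l, l+s).
      f, acc = {}, F(0)
      for c in sorted(p):             # cumulative probabilities f[c]
          f[c] = acc
          acc += p[c]
      l, s = F(0), F(1)
      for c in msg:
          l = l + s * f[c]            # l_i = l_{i-1} + s_{i-1} * f[c_i]
          s = s * p[c]                # s_i = s_{i-1} * p[c_i]
      return l, s

  p = {"a": F(2, 10), "b": F(5, 10), "c": F(3, 10)}
  l, s = sequence_interval("bac", p)
  print(float(l), float(l + s))       # 0.27 0.3  -> the interval [.27,.3) of the example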

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:

[Figure: .49 falls in b's interval [.2,.7); rescaling, it falls again in b's sub-interval [.3,.55); rescaling again, it falls in c's sub-interval [.475,.55).]

The message is bbc.

Representing a real number
Binary fractional representation:
  .75 = .11      1/3 = .0101…      11/16 = .1011

Algorithm:
  1. x = 2·x
  2. if x < 1 output 0
  3. else x = x − 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
e.g. [0,.33) → .01     [.33,.66) → .1     [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions:

  code    min      max      interval
  .11     .110     .111     [.75, 1.0)
  .101    .1010    .1011    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox (ATB), as a state machine

[Figure: the encoder state is the current interval (L,s); feeding symbol c with distribution (p1,…,pS) maps it to the sub-interval (L',s') assigned to c.]

Therefore, even the distribution can change over time.

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox

[Figure: at each step the PPM model supplies p[s | context], with s = c or esc, to the ATB, which maps the current interval (L,s) to (L',s').]

Encoder and Decoder must know the protocol for selecting the same conditional probability distribution (PPM-variant).

PPM: Example Contexts  (k = 2, String = ACCBACCACBA B)

  Context    Counts
  (empty)    A=4, B=2, C=5, $=3

  A          C=3, $=1
  B          A=2, $=1
  C          A=1, B=2, C=2, $=3

  AC         B=1, C=2, $=2
  BA         C=1, $=1
  CA         C=1, $=1
  CB         A=2, $=1
  CC         A=1, B=1, $=2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb
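A sketch of LZ78 coding and decoding with the dictionary kept as a Python dict (id 0 is the empty string, matching the (0,c) pairs of the example above); the handling of a pending match at end-of-input is simplified.

  def lz78_encode(s: str):
      D, out, w = {"": 0}, [], ""
      for c in s:
          if w + c in D:
              w += c                       # extend the current match S
          else:
              out.append((D[w], c))        # output (id of S, next char c)
              D[w + c] = len(D)            # add Sc to the dictionary
              w = ""
      if w:
          out.append((D[w[:-1]], w[-1]))   # flush a pending match (simplified)
      return out

  def lz78_decode(pairs):
      D, out = {0: ""}, []
      for i, c in pairs:
          s = D[i] + c
          out.append(s)
          D[len(D)] = s
      return "".join(out)

  pairs = lz78_encode("aabaacabcabcb")
  print(pairs)                 # [(0,'a'), (1,'b'), (1,'a'), (0,'c'), (2,'c'), (5,'b')]
  print(lz78_decode(pairs))    # aabaacabcabcb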

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows:

  F                 L
  #  mississipp     i
  i  #mississip     p
  i  ppi#missis     s
  i  ssippi#mis     s
  i  ssissippi#     m
  m  ississippi     #
  p  i#mississi     p
  p  pi#mississ     i
  s  ippi#missi     s
  s  issippi#mi     s
  s  sippi#miss     i
  s  sissippi#m     i

  T = mississippi#   (Burrows-Wheeler, 1994)

A famous example

Much
longer...

A useful tool: the L → F mapping

[Same sorted-rotation matrix as above, with first column F and last column L; T is unknown to the decoder.]

How do we map L's chars onto F's chars?
… we need to distinguish equal chars in F …
Take two equal chars of L: rotating their rows rightward by one position shows that they keep the same relative order in F.
The BWT is invertible

[Same matrix, with columns F and L; T is unknown to the decoder.]

Two key properties:
1. The LF-array maps L's chars to F's chars
2. L[i] precedes F[i] in T

Reconstruct T backward:   T = .... i p p i #

InvertBWT(L)
  Compute LF[0,n-1];
  r = 0; i = n;
  while (i > 0) {
    T[i] = L[r];
    r = LF[r]; i--;
  }
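A sketch of the forward and inverse BWT on mississippi#, using the (inefficient) sorted-rotations construction shown below and the LF-mapping for inversion.

  def bwt(T: str) -> str:
      rotations = sorted(T[i:] + T[:i] for i in range(len(T)))
      return "".join(row[-1] for row in rotations)       # last column L

  def ibwt(L: str, end: str = "#") -> str:
      # LF-mapping: the k-th occurrence of char c in L is the k-th c in F.
      F = sorted(L)
      first, count, LF = {}, {}, []
      for i, c in enumerate(F):
          first.setdefault(c, i)
      for c in L:
          LF.append(first[c] + count.get(c, 0))
          count[c] = count.get(c, 0) + 1
      r, out = L.index(end), []       # start from the row whose last char is '#'
      for _ in range(len(L)):         # walk backward through T
          out.append(L[r])
          r = LF[r]
      return "".join(reversed(out))

  L = bwt("mississippi#")
  print(L)            # ipssm#pissii
  print(ibwt(L))      # mississippi#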

How to compute the BWT ?

[Figure: the BWT matrix (sorted rotations of mississippi#) aligned with the suffix array SA = 12 11 8 5 2 1 10 9 7 4 6 3; its last column is L = i p s s m # p i s s i i.]

We said that L[i] precedes F[i] in T; e.g. L[3] = T[7].
Given SA and T, we have L[i] = T[SA[i]−1].

How to construct SA from T ?

[Figure: SA = 12 11 8 5 2 1 10 9 7 4 6 3, listed next to the sorted suffixes #, i#, ippi#, issippi#, ississippi#, mississippi, pi#, ppi#, sippi#, sissippi#, ssippi#, ssissippi#.]

Elegant but inefficient. Input: T = mississippi#. Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution

In-degree follows a power-law distribution (Altavista crawl 1999, WebBase crawl 2001):

  Pr[ in-degree(u) = k ]  ∝  1 / k^α ,    α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:
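The slide leaves the encoding of the (possibly negative) first gap implicit. Below is a small sketch of the gap transformation of a successor list; the zig-zag mapping ν and the example successor list are assumptions made here for illustration, not necessarily WebGraph's exact choices.

  def nu(x: int) -> int:
      # Hypothetical mapping of a possibly-negative integer to a non-negative one
      # (zig-zag): 0, -1, 1, -2, 2, ... -> 0, 1, 2, 3, 4, ...
      return 2 * x if x >= 0 else 2 * (-x) - 1

  def gap_encode(x: int, succ):
      # Turn the sorted successor list of node x into small non-negative gaps:
      # S(x) = { nu(s1 - x), s2 - s1 - 1, ..., sk - s_{k-1} - 1 }
      gaps = [nu(succ[0] - x)]
      gaps += [succ[i] - succ[i - 1] - 1 for i in range(1, len(succ))]
      return gaps

  print(gap_encode(15, [13, 15, 16, 17, 18, 19, 23, 24, 203]))
  # -> [3, 1, 0, 0, 0, 0, 3, 0, 178]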

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides and efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
           Emacs size   Emacs time
uncompr    27Mb         ---
gzip       8Mb          35 secs
zdelta     1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: example weighted graph G_F with a dummy node; the min branching picks, for each file, either its gzip-coding (edge from the dummy node) or its cheapest zdelta reference.]

           space   time
uncompr    30Mb    ---
tgz        20%     linear
THIS       8%      quadratic

Improvement

What about many-to-one compression (a group of files)?

Problem: Constructing G_F is very costly: n² edge calculations (i.e. zdelta executions).


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
space

time

uncompr

260Mb

---

tgz

12%

2 mins

THIS

8%

16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

gcc size
total
27288
gzip
7563
zdelta
227
rsync
964

emacs size
27326
8577
1431
4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
#

0

s

i

1
ssi

ppi#

4

#

1

p

i

3

2

1

ppi#

6

pi#

7

8
5

T# = mississippi#
2 4 6 8 10

si

ppi#

i#

ppi#

11

mississippi#

12

2

1 10

9

4

3

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
5

Q(N2) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Q(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

For T = mississippi#:
  SA  = 12 11  8  5  2  1 10  9  7  4  6  3
  Lcp =  0  0  1  4  0  0  1  0  2  1  3
  (e.g. Lcp = 4 between issippi# and ississippi#)

• How long is the common prefix between T[i,...] and T[j,...] ?
  Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Is there a repeated substring of length ≥ L ?
  Search for some Lcp[i] ≥ L.
• Is there a substring of length ≥ L occurring ≥ C times ?
  Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.
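A small sketch of how the Lcp array supports these queries, with a naive O(N²)-time Lcp construction that is enough to illustrate the idea (Kasai's algorithm would do it in O(N)).

def lcp_array(t, sa):
    """Lcp[i] = length of the longest common prefix of suffixes sa[i] and sa[i+1]."""
    def lcp(a, b):
        k = 0
        while a + k < len(t) and b + k < len(t) and t[a + k] == t[b + k]:
            k += 1
        return k
    return [lcp(sa[i], sa[i + 1]) for i in range(len(sa) - 1)]

def has_repeat(lcp, L):
    # a repeated substring of length >= L exists iff some adjacent pair shares >= L chars
    return any(x >= L for x in lcp)

def has_frequent(lcp, L, C):
    # a substring of length >= L occurring >= C times corresponds to a window
    # of C-1 consecutive Lcp entries that are all >= L
    w = C - 1
    return any(min(lcp[i:i + w]) >= L for i in range(len(lcp) - w + 1))

t = "mississippi#"
sa = sorted(range(len(t)), key=lambda i: t[i:])
lcp = lcp_array(t, sa)            # e.g. issippi# / ississippi# share "issi" (length 4)
print(has_repeat(lcp, 4), has_frequent(lcp, 3, 2))   # True True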


Slide 84

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 10^4 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and the algorithm uses all of it:
(1/B) * (p * f/(1+f) * C) ≈ 30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
Brute-force running times for growing n:

   n     4K   8K   16K   32K  128K  256K  512K   1M
  n^3   22s   3m   26m  3.5h   28h    --    --   --
  n^2    0s   0s    0s    1s   26s  106s   7m   28m

An optimal solution
We assume every subsum ≠ 0.

[figure: the array as a prefix with negative sum followed by the optimum window]
A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm
  sum = 0; max = -1;
  for i = 1,...,n do
    if (sum + A[i] ≤ 0) then sum = 0;
    else { sum += A[i]; max = MAX{max, sum}; }

Note:
• the running sum is < 0 right before OPT starts;
• the running sum stays > 0 within OPT.
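The same scan in runnable Python (essentially Kadane's algorithm, which is what the slide sketches; as in the slide, a positive optimum is assumed).

def max_subarray(A):
    """One linear scan: drop the running window as soon as its sum becomes <= 0,
    otherwise extend it, and keep the best sum seen so far."""
    best, run = 0, 0
    for x in A:
        run = 0 if run + x <= 0 else run + x
        best = max(best, run)
    return best

print(max_subarray([2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]))   # 12 = 6+1-2+4+3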

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 10^9 random I/Os = 10^9 * 5ms ≈ 2 months

Binary Merge-Sort
Merge-Sort(A, i, j)
  if (i < j) then
    m = (i+j)/2;             // Divide
    Merge-Sort(A, i, m);      // Conquer
    Merge-Sort(A, m+1, j);    // Conquer
    Merge(A, i, m, j)         // Combine

Cost of Mergesort on large data
 Take Wikipedia in Italian and compute the word frequencies:
   n = 10^9 tuples  a few Gbs
   typical disk (Seagate Cheetah 150Gb): seek time ~5ms

 Analysis of mergesort on disk:
   it is an indirect sort: Θ(n log2 n) random I/Os
   [5ms] * n log2 n ≈ 1.5 years

 In practice it is faster, because of caching (each merge level makes 2 passes, read/write, over the data).

Merge-Sort Recursion Tree

[figure: the recursion tree of merge-sort over a sample array; there are log2 N levels of merges]

If the run-size is larger than B (i.e. after the first step!!), fetching all of it in memory for merging does not help.

How do we deploy the disk/memory features?
With internal memory M we can produce N/M runs, each sorted in internal memory (no I/Os);
the I/O-cost for merging them pairwise is then ≈ 2 (N/B) log2 (N/M).

Multi-way Merge-Sort

 The key is to balance run-size and #runs to merge.
 Sort N items with main memory M and disk pages of B items:
   Pass 1: produce N/M sorted runs.
   Pass i: merge X = M/B runs at a time  ⇒  logM/B N/M passes

[figure: X input buffers (one per run) and one output buffer, each of B items, held in main memory between the input disk and the output disk]

Multiway Merging
[figure: runs 1..X = M/B feed one input buffer Bf_i each, with a cursor p_i on the current page; the minimum among Bf_1[p_1], ..., Bf_X[p_X] is repeatedly moved to the output buffer Bf_o; Bf_i is refilled from disk when p_i reaches B, and Bf_o is flushed to the merged output run when full, until EOF]
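A compact sketch of the merging pass in Python: heapq.merge plays the role of the X-way comparator, and data is consumed and produced one page of B items at a time; disk runs are modelled by plain in-memory lists, and run creation is omitted.

import heapq
from itertools import islice

def multiway_merge(runs, B=4):
    """Merge X sorted runs into one sorted run, page by page."""
    merged = heapq.merge(*runs)            # always yields the min of the X current heads
    while True:
        page = list(islice(merged, B))     # fill the output buffer Bf_o with B items
        if not page:
            break
        yield page                         # "flush" Bf_o to the merged output run

runs = [[1, 2, 5, 7, 9, 10], [2, 7, 8, 13, 19], [3, 4, 6, 11, 12, 15, 17]]
for page in multiway_merge(runs):
    print(page)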

Cost of Multi-way Merge-Sort

 Number of passes = logM/B #runs = logM/B N/M
 Optimal cost = Θ((N/B) logM/B N/M) I/Os

In practice:
 M/B ≈ 1000 ⇒ #passes = logM/B N/M ≈ 1
 one multiway merge ⇒ 2 passes = few mins  (tuning depends on disk features)

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Can compression help?
 Goal: enlarge M and reduce N
 #passes = O(logM/B N/M)
 Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements

 Goal: top queries over a stream of N items (S large).
 Math Problem: find the item y whose frequency is > N/2, using the smallest space
   (i.e. assuming the mode occurs > N/2 times).

A = b a c c c d c b a a a c c b c c c

Algorithm
 Use a pair of variables <X, C>, initialized with the first item: X = A[1], C = 1
 For each next item s of the stream:
   if (X == s) then C++
   else { C--; if (C == 0) { X = s; C = 1; } }
 Return X

Proof
If the algorithm returns X ≠ y, then every occurrence of y has been cancelled by a distinct "negative" mate (a decrement caused by a different item). Hence the mates are at least #occ(y), and mates plus occurrences of y are distinct stream items.
As a result, 2 * #occ(y) > N items would be needed: a contradiction.
(Problems arise only if no item occurs more than N/2 times.)
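The scan in runnable Python (the classic majority-vote idea the slide describes; the returned candidate is guaranteed to be correct only when some item really occurs more than N/2 times).

def majority_candidate(stream):
    """One pass, one pair of variables (X, C)."""
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1        # adopt a new candidate
        elif X == s:
            C += 1             # same as the candidate: reinforce it
        else:
            C -= 1             # different: this occurrence cancels one of X's
    return X

A = list("bacccdcbaaaccbccc")
print(majority_candidate(A))   # 'c' (it occurs 9 times out of 17)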

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 10^9 chars, size = 6Gb
n = 10^6 documents
TotT = 10^9 term occurrences (avg term length is 6 chars)
t = 5 * 10^5 distinct terms

What kind of data structure should we build to support word-based searches?

Solution 1: Term-Doc matrix     (t = 500K terms, n = 1 million documents)

             Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
  Antony            1                1             0          0       0        1
  Brutus            1                1             0          1       0        0
  Caesar            1                1             0          1       1        1
  Calpurnia         0                1             0          0       0        0
  Cleopatra         1                0             0          0       0        0
  mercy             1                0             1          1       1        1
  worser            1                0             1          1       1        0

1 if the play contains the word, 0 otherwise.
Space is 500Gb !

Solution 2: Inverted index

  Brutus    -> 2 4 8 16 32 64 128
  Caesar    -> 1 2 3 5 8 13 21 34
  Calpurnia -> 13 16

We can still do better, i.e. 30÷50% of the original text:
1. typically a posting uses about 12 bytes
2. we have 10^9 total terms ⇒ at least 12Gb of space
3. compressing the 6Gb of documents gets 1.5Gb of data
Better index, but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string is (losslessly) compressed, some other must expand.
Take all messages of length n: is it possible to compress ALL of them into fewer bits?
NO: they are 2^n, but the shorter binary strings are fewer:

  sum_{i=1}^{n-1} 2^i = 2^n - 2 < 2^n

We need to talk about stochastic sources.

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the self-information of s is:

  i(s) = log2 (1/p(s)) = - log2 p(s)

Lower probability ⇒ higher information.

Entropy is the weighted average of i(s):

  H(S) = sum_{s in S} p(s) * log2 (1/p(s))   bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword lengths L[s], the average length is defined as

  La(C) = sum_{s in S} p(s) * L[s]

We say that a prefix code C is optimal if for all prefix codes C', La(C) ≤ La(C').

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal uniquely decodable code, there exists a prefix code with the same codeword lengths, and thus the same optimal average length. And vice versa...

Theorem (golden rule). If C is an optimal prefix code for a source with probabilities {p1, ..., pn}, then pi < pj ⇒ L[si] ≥ L[sj].

Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any uniquely decodable code C, we have H(S) ≤ La(C).
Theorem (upper bound, Shannon). For any probability distribution, there exists a prefix code C such that La(C) ≤ H(S) + 1.
(The Shannon code, which assigns to s a codeword of about log2 1/p(s) bits, achieves this upper bound.)

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

Repeatedly merge the two least-probable trees:
  a(.1) + b(.2) -> (.3)
  (.3) + c(.2)  -> (.5)
  (.5) + d(.5)  -> (1)

Labelling left/right edges with 0/1 gives a = 000, b = 001, c = 01, d = 1.
There are 2^(n-1) "equivalent" Huffman trees (flip the 0/1 labels of the internal nodes).

What about ties (and thus, tree depth) ?
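Ties aside, here is a minimal heap-based Huffman construction in Python for the running example; since ties are broken arbitrarily, the 0/1 labels may differ from the slide's, but the codeword lengths are the same.

import heapq
from itertools import count

def huffman(probs):
    """probs: dict symbol -> probability; returns dict symbol -> codeword."""
    tick = count()                                   # tie-breaker for the heap
    heap = [(p, next(tick), s) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)            # the two least probable trees...
        p2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(tick), (left, right)))   # ...get merged
    codes = {}
    def walk(node, code=""):
        if isinstance(node, tuple):
            walk(node[0], code + "0"); walk(node[1], code + "1")
        else:
            codes[node] = code or "0"
    walk(heap[0][2])
    return codes

print(huffman({"a": .1, "b": .2, "c": .2, "d": .5}))
# codeword lengths: a, b -> 3 bits, c -> 2 bits, d -> 1 bit, as in the example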

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. Its self-information is
  -log2(.999) ≈ .00144
If we were to send 1000 such symbols we might hope to use 1000 * .00144 ≈ 1.44 bits.
Using Huffman, we take at least 1 bit per symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
  1 extra bit per macro-symbol = 1/k extra bits per symbol
  Larger model to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:
 the model takes |S|^k * (k * log |S|) + h^2 bits (where h might be |S|)
 it is H0(S^L) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
[figure: byte-aligned tagged Huffman over the word-based alphabet of T = "bzip or not bzip": the Huffman tree has fan-out 128, each byte of a codeword carries 7 code bits plus 1 tag bit marking whether it is the first byte of a codeword, so codeword boundaries can be recognized inside C(T)]

CGrep and other ideas...
[figure: to search P = bzip, encode P with the same word-based code (here the two tagged bytes 1a 0b) and run a classical matcher (GREP) directly on the compressed text C(T); the tag bits discard candidate matches that start in the middle of a codeword]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons
 Strings are also numbers, H: strings → numbers.
 Let s be a string of length m:

   H(s) = sum_{i=1}^{m} 2^(m-i) * s[i]

 Example: P = 0101
   H(P) = 2^3*0 + 2^2*1 + 2^1*0 + 2^0*1 = 5

 s = s' if and only if H(s) = H(s')

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons
 We can compute H(Tr) from H(Tr-1):

   H(T_r) = 2 * H(T_{r-1}) - 2^m * T(r-1) + T(r+m-1)

 Example: T = 10110101, m = 4
   T1 = 1011, T2 = 0110
   H(T1) = H(1011) = 11
   H(T2) = H(0110) = 2*11 - 2^4*1 + 0 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P = 101111, q = 7
H(P) = 47, Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally (Horner's rule, mod 7):
  0*2 + 1 = 1
  1*2 + 0 = 2
  2*2 + 1 = 5
  5*2 + 1 = 11 mod 7 = 4
  4*2 + 1 = 9 mod 7 = 2
  2*2 + 1 = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1):
  2^m (mod q) = 2 * (2^(m-1) (mod q))  (mod q)
Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
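A minimal Karp-Rabin sketch in Python; it works over any alphabet (base 256 instead of the binary alphabet of the slides), uses a fixed prime rather than a randomly chosen one, and verifies every fingerprint hit, so it reports exact occurrences.

def karp_rabin(T, P, q=2**31 - 1, B=256):
    """Rolling-fingerprint search of P in T; candidates are verified."""
    n, m = len(T), len(P)
    if m > n:
        return []
    top = pow(B, m - 1, q)                   # B^(m-1) mod q, to drop the leading char
    hp = ht = 0
    for i in range(m):                       # fingerprints of P and of T_1
        hp = (hp * B + ord(P[i])) % q
        ht = (ht * B + ord(T[i])) % q
    occ = []
    for r in range(n - m + 1):
        if hp == ht and T[r:r + m] == P:     # check to rule out false matches
            occ.append(r)
        if r + m < n:                        # Hq(T_{r+1}) from Hq(T_r) in O(1)
            ht = ((ht - ord(T[r]) * top) * B + ord(T[r + m])) % q
    return occ

print(karp_rabin("10110101", "0101"))        # [4]: the slide's match T_5 (1-based)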

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for
[figure: the 3 x 10 matrix M; M(1,5) = 1 (f matches at position 5), M(2,6) = 1 (fo), M(3,7) = 1 (for): a 1 in the last row signals an occurrence of P ending at that position]

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit bit-parallelism to
compute the j-th column of M from the (j-1)-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example: P = abaac
  U(a) = 1 0 1 1 0
  U(b) = 0 1 0 0 0
  U(c) = 0 0 0 0 1
(bit i of U(x) is set iff P[i] = x)

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M(j) = BitShift( M(j-1) ) & U( T[j] )


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) into the i-th position;
AND-ing it with the i-th bit of U(T[j]) establishes whether both conditions hold

An example (j = 1, 2, 3, ..., 9)
[figure: the columns M(1), M(2), M(3), ..., M(9) computed for P = abaac on T = xabxabaaca; each column is obtained as BitShift(M(j-1)) & U(T[j]), and M(5,9) = 1 signals the occurrence of P ending at position 9]

An example j=9
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

4

5

6

7

8

9

12345

1

0

1

0

0

1

0

1

1

0

P=abaac

2

0

0

1

0

0

1

0

0

0

3

0

0

0

0

0

0

1

0

0

4

0

0

0

0

0

0

0

1

0

5

0

0

0

0

0

0

0

0

1

T=xabxabaaca

0
 
0
U (c )   0 
 
0
 
1 

1 
 
1 
BitShift ( M ( 8 )) & U (T [ 9 ])   0  &
 
0
 
1 

0 0
   
0 0
0  0
   
0 0
   
1  1 

1
0

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.
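A runnable sketch of the Shift-And method in Python, where a column of M is a single integer and bit i-1 says "P[1..i] matches the text ending here"; it assumes m fits in a machine word (Python integers make that assumption painless).

def shift_and(T, P):
    m = len(P)
    U = {}                                   # U[c]: bit i-1 set iff P[i] == c
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    last = 1 << (m - 1)                      # this bit set <=> the whole P matched
    M, occ = 0, []
    for j, c in enumerate(T):
        M = ((M << 1) | 1) & U.get(c, 0)     # M(j) = BitShift(M(j-1)) & U(T[j])
        if M & last:
            occ.append(j - m + 1)            # 0-based starting position
    return occ

print(shift_and("xabxabaaca", "abaac"))      # [4]: occurrence ending at position 9 (1-based)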

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
  U(a) = 1 0 1 1 0
  U(b) = 1 1 0 0 0
  U(c) = 0 0 0 0 1
(a character class simply sets its bit in the U() vector of every character it contains)

What about '?', '[^…]' (not) ?

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

  contribution: BitShift( M^l(j-1) ) & U( T[j] )

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

  contribution: BitShift( M^(l-1)(j-1) )

Computing Ml
 We compute M^l for all l = 0, ..., k; for each j we compute M^0(j), M^1(j), ..., M^k(j).
 For all l, initialize M^l(0) to the zero vector.
 Combining the two cases above, there is a match iff

   M^l(j) = [ BitShift( M^l(j-1) ) & U( T[j] ) ]  OR  BitShift( M^(l-1)(j-1) )

Example M1
[figure: the matrices M^0 and M^1 for P = abaad on T = xabxabaaca; M^1(5,9) = 1, so P occurs ending at position 9 with at most 1 mismatch]

How much do we pay?
 The running time is O(k * n * (1 + m/w)).
 Again, the method is practically efficient for small m.
 Only O(k) columns of the matrices M^l are needed at any given time; hence the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Delection: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
 g(x) = (Length - 1) zeros, followed by x in binary, where x > 0 and Length = floor(log2 x) + 1
   e.g., 9 is represented as <000, 1001>
 the g-code for x takes 2*floor(log2 x) + 1 bits (i.e. a factor of 2 from optimal)
 optimal for Pr(x) = 1/(2x^2), and i.i.d. integers
 it is a prefix-free encoding…

Exercise: given the following sequence of g-coded integers, reconstruct the original sequence:
  0001000001100110000011101100111
Answer: 8, 6, 3, 59, 7
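A tiny g-coder/decoder in Python that reproduces both the example and the exercise.

def gamma_encode(x):
    """x > 0: (Length-1) zeros followed by x in binary."""
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":                  # count the leading zeros
            z += 1; i += 1
        out.append(int(bits[i:i + z + 1], 2))  # the next z+1 bits are the value
        i += z + 1
    return out

print(gamma_encode(9))                                   # 0001001
print(gamma_decode("0001000001100110000011101100111"))   # [8, 6, 3, 59, 7]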

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code g(i).
Recall that |g(i)| ≤ 2 * log i + 1.
How good is this approach wrt Huffman?

Key fact:  1 ≥ sum_{j=1..i} pj ≥ i * pi  ⇒  i ≤ 1/pi

Hence the cost of the encoding is

  sum_i pi * |g(i)|  ≤  sum_i pi * [ 2 * log (1/pi) + 1 ]  =  2 * H0(X) + 1

So the compression ratio is ≤ 2 * H0(s) + 1 bits per symbol: not much worse than Huffman, and improvable to H0(X) + 2 + ...

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory: MTF exploits temporal locality, and it is dynamic.

Properties:
 X = 1^n 2^n 3^n ... n^n  ⇒  Huff = O(n^2 log n) bits, MTF = O(n log n) + n^2 bits

Not much worse than Huffman... but it may be far better.
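A minimal MTF encoder/decoder in Python (a plain list is used for the symbol list, so each step costs O(|S|); the search-tree/hash-table organization discussed later brings this down to O(log |S|)).

def mtf_encode(text, alphabet):
    L, out = list(alphabet), []
    for s in text:
        i = L.index(s)            # 1) output the position of s in L
        out.append(i)
        L.insert(0, L.pop(i))     # 2) move s to the front of L
    return out

def mtf_decode(codes, alphabet):
    L, out = list(alphabet), []
    for i in codes:
        s = L[i]
        out.append(s)
        L.insert(0, L.pop(i))
    return "".join(out)

ab = "abcdefghijklmnopqrstuvwxyz"
print(mtf_encode("mississippi", ab))                 # [12, 9, 18, 0, 1, 1, 0, 1, 16, 0, 1]
print(mtf_decode(mtf_encode("mississippi", ab), ab)) # mississippi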

MTF: how good is it ?
Encode the output integers via g-coding: |g(i)| ≤ 2 * log i + 1.
Put the list S in front and consider the cost of encoding: the MTF position of an occurrence of x is at most its gap from the previous occurrence of x, so the cost is at most

  O(S log S) + sum_x sum_{i=2..nx} |g( p_i^x - p_{i-1}^x )|

By Jensen's inequality (the gaps of each x sum to at most N):

  ≤ O(S log S) + sum_{x=1..S} nx * [ 2 * log (N/nx) + 1 ]
  = O(S log S) + N * [ 2 * H0(X) + 1 ]

Hence La[mtf] ≤ 2 * H0(X) + O(1) bits per symbol.

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code (there is a memory)
 X = 1^n 2^n 3^n ... n^n  ⇒  Huff(X) = Θ(n^2 log n)  >  Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.  a = .2 → [0, .2),   b = .5 → [.2, .7),   c = .3 → [.7, 1.0)

  f(i) = sum_{j=1}^{i-1} p(j)     i.e.  f(a) = .0, f(b) = .2, f(c) = .7

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7)).

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
[figure: the interval is narrowed symbol by symbol:
   b: [0,1) → [.2,.7)     a: [.2,.7) → [.2,.3)     c: [.2,.3) → [.27,.3) ]

The final sequence interval is [.27, .3)

Arithmetic Coding
To code a sequence of symbols c1 c2 ... cn with probabilities p[c], use:

  l0 = 0,  s0 = 1
  li = l(i-1) + s(i-1) * f[ci]
  si = s(i-1) * p[ci]

where f[c] is the cumulative probability up to symbol c (not included).
The final interval size is

  sn = prod_{i=1..n} p[ci]

The interval [ln, ln + sn) for a message sequence will be called the sequence interval.
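In runnable form, for the running example distribution (a floating-point sketch only, so it ignores the precision issues addressed later by the integer version):

p = {"a": 0.2, "b": 0.5, "c": 0.3}          # probabilities
f = {"a": 0.0, "b": 0.2, "c": 0.7}          # cumulative probabilities

def sequence_interval(msg):
    l, s = 0.0, 1.0
    for c in msg:
        l, s = l + s * f[c], s * p[c]       # l_i, s_i from l_{i-1}, s_{i-1}
    return l, s

def decode(x, n):
    out = []
    for _ in range(n):
        c = max((c for c in p if f[c] <= x), key=lambda c: f[c])  # symbol interval of x
        out.append(c)
        x = (x - f[c]) / p[c]               # rescale and continue
    return "".join(out)

print(sequence_interval("bac"))             # ~(0.27, 0.03): the interval [.27, .30)
print(decode(0.49, 3))                      # bbc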

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
[figure: .49 falls in b's symbol interval [.2,.7); rescaled into it, it falls again in b's sub-interval, and finally in c's]

The message is bbc.

Representing a real number
Binary fractional representation:
  .75 = .11     1/3 = .010101...     11/16 = .1011

Algorithm (emit the bits of x in (0,1)):
  1. x = 2 * x
  2. if x < 1, output 0
  3. else x = x - 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
  e.g.  [0,.33) → .01     [.33,.66) → .1     [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions.

         min      max      interval
  .11    .110...  .111...  [.75, 1.0)
  .101   .1010..  .1011..  [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length
Note that -log s + 1 = log (2/s).

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most
  1 + log (1/s) = 1 + log prod_i (1/pi)
                ≤ 2 + sum_{i=1..n} log (1/pi)
                = 2 + sum_{k=1..|S|} n*pk * log (1/pk)
                = 2 + n * H0   bits

In practice ≈ n*H0 + 0.02*n bits, because of rounding.

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
Context
Empty

Counts
A
B
C
$

=
=
=
=

4
2
5
3

Context
A
B
C

Counts
C
$
A
$
A
B
C
$

=
=
=
=
=
=
=
=

3
1
2
1
1
2
2
3

Context
AC

BA
CA
CB
CC

String = ACCBACCACBA B

k=2

Counts
B
C
$
C
$
C
$
A
$
A
B
$

=
=
=
=
=
=
=
=
=
=
=
=

1
2
2
1
1
1
1
2
1
1
1
2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb
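The same parse in runnable Python, as a minimal LZ78 sketch (the only subtlety is flushing a match that is still open when the input ends).

def lz78_encode(s):
    dic = {"": 0}                      # id 0 is the empty string
    out, w = [], ""
    for c in s:
        if w + c in dic:
            w += c                     # keep extending the current match
        else:
            out.append((dic[w], c))    # emit (id of longest match, next char)
            dic[w + c] = len(dic)      # add the extended string to the dictionary
            w = ""
    if w:                              # flush a pending match at end of input
        out.append((dic[w[:-1]], w[-1]))
    return out

print(lz78_encode("aabaacabcabcb"))
# [(0,'a'), (1,'b'), (1,'a'), (0,'c'), (2,'c'), (5,'b')] -- as in the example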

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]
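Both directions in a short Python sketch: the BWT is computed through the suffix array exactly as stated above, and the inversion uses the LF mapping (built by stably sorting the positions of L).

def bwt(t):
    """t must end with a unique smallest char, e.g. '#'."""
    sa = sorted(range(len(t)), key=lambda i: t[i:])
    return "".join(t[i - 1] for i in sa)          # L[i] = T[SA[i]-1] (cyclically)

def inverse_bwt(L):
    # LF[r] = row of F holding the same character occurrence as L[r];
    # stably sorting (character, row) preserves the relative order of equal chars
    order = sorted(range(len(L)), key=lambda r: (L[r], r))
    LF = [0] * len(L)
    for f_pos, r in enumerate(order):
        LF[r] = f_pos
    r, out = L.index("#"), []          # the row ending with '#' starts the backward walk
    for _ in range(len(L)):
        out.append(L[r])
        r = LF[r]
    return "".join(reversed(out))

print(bwt("mississippi#"))               # ipssm#pissii
print(inverse_bwt(bwt("mississippi#")))  # mississippi#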

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Q(n2 log n) time in the worst-case
• Q(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in-degree(u) = k ]  ∝  1 / k^a ,   a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides and efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
Emacs size

Emacs time

uncompr

27Mb

---

gzip

8Mb

35 secs

zdelta

1.5Mb

42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

0

620

2

20

123

2000

20
220

3

1
5

space

time

uncompr

30Mb

---

tgz

20%

linear

THIS

8%

quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
space

time

uncompr

260Mb

---

tgz

12%

2 mins

THIS

8%

16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

gcc size
total
27288
gzip
7563
zdelta
227
rsync
964

emacs size
27326
8577
1431
4452

Compressed size in KB (slightly outdated numbers)


Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
   C · p · f/(1+f)

This is at least 10^4 · f/(1+f).
If we fetch B ≈ 4KB in time C, and the algorithm uses all of it:

   (1/B) · ( p · f/(1+f) · C )  ≈  30 · f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
[figure: magnetic disk — tracks, read/write head and arm, magnetic surface]

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU (registers) — L1 — L2 cache — RAM — HD — net

[figure: memory hierarchy — cache: few MBs, some nanosecs, few words fetched; RAM: few GBs, tens of nanosecs, some words fetched; disk: few TBs, few millisecs, pages of B = 32K; network: many TBs, even secs, packets]

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
Running time as a function of the input size:

          4K    8K    16K   32K    128K   256K   512K   1M
   n³     22s   3m    26m   3.5h   28h    --     --     --
   n²     0     0     0     1s     26s    106s   7m     28m

An optimal solution
We assume every subsum ≠ 0.

[figure: A is split into a prefix of sum < 0 followed by the window of sum > 0 containing the Optimum]

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm
   sum = 0; max = -1;
   For i = 1,…,n do
      If (sum + A[i] ≤ 0) then sum = 0;
      else { sum += A[i]; max = MAX{max, sum}; }

Note:
 • Sum < 0 when OPT starts;
 • Sum > 0 within OPT
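A minimal Python sketch of this one-pass scan (a slight variant of the pseudocode above: it also handles an all-negative array; names are illustrative, not from the slides):

def max_subarray_sum(A):
    best = A[0]              # best sum seen so far
    run = 0                  # best sum of a window ending at the current position
    for x in A:
        run += x
        best = max(best, run)
        if run < 0:          # a negative prefix can never start the optimum
            run = 0
    return best

print(max_subarray_sum([2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]))   # -> 12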

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB

 n insertions ⇒ data get distributed arbitrarily across the B-tree leaves (“tuple pointers”)
 What about listing the tuples in order ?

Possibly 10^9 random I/Os = 10^9 * 5ms ≈ 2 months

Binary Merge-Sort

Merge-Sort(A,i,j)
01  if (i < j) then
02     m = (i+j)/2;              // Divide
03     Merge-Sort(A,i,m);        // Conquer
04     Merge-Sort(A,m+1,j);
05     Merge(A,i,m,j)            // Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word frequencies:
  n = 10^9 tuples ⇒ a few GBs
  Typical disk (Seagate Cheetah 150GB): seek time ~5ms

Analysis of mergesort on disk:
  It is an indirect sort: Θ(n log2 n) random I/Os
  [5ms] * n log2 n ≈ 1.5 years

In practice it is faster, because of caching (2 passes of R/W per level)...

Merge-Sort Recursion Tree

[figure: the recursion tree of binary mergesort over log2 N levels; each level merges pairs of sorted runs of numbers]

If the run-size is larger than B (i.e. after the first step!!), fetching all of it in memory for merging does not help.

How do we deploy the disk/memory features ?

 N/M runs, each sorted in internal memory (no I/Os)
 ⇒ I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort

The key is to balance run-size and #runs to merge.
Sort N items with main memory M and disk pages of B items:
  Pass 1: produce N/M sorted runs.
  Pass i: merge X = M/B runs at a time  ⇒  log_{M/B}(N/M) passes

[figure: X input buffers (one per run) plus 1 output buffer, each of B items, kept in main memory; runs are streamed from disk and the merged output is streamed back to disk]

Multiway Merging

[figure: input buffers Bf1..BfX (X = M/B), one per run, each with a cursor p1..pX on its current page; one output buffer Bfo with cursor po]

 At each step, output min( Bf1[p1], Bf2[p2], …, BfX[pX] )
 Fetch the next page of run i when pi = B
 Flush Bfo to the merged output run when it is full, until EOF on every run
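A toy Python sketch of the X-way merging step (in-memory lists stand in for the disk-resident runs and buffers; heapq plays the role of the min-selection among the X current heads):

import heapq

def multiway_merge(runs):
    # runs: list of already-sorted lists (each one models a sorted run on disk)
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []                          # models the output buffer being flushed
    while heap:
        val, i, j = heapq.heappop(heap)
        out.append(val)
        if j + 1 < len(runs[i]):      # fetch the next element of run i
            heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
    return out

print(multiway_merge([[1, 2, 5, 10], [2, 7, 13, 19], [3, 4, 8, 15]]))
# -> [1, 2, 2, 3, 4, 5, 7, 8, 10, 13, 15, 19]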

Cost of Multi-way Merge-Sort

 Number of passes = log_{M/B}(#runs) ≤ log_{M/B}(N/M)
 Optimal cost = Θ( (N/B) · log_{M/B}(N/M) ) I/Os

In practice
  M/B ≈ 1000  ⇒  #passes = log_{M/B}(N/M) ≈ 1
  One multiway merge  ⇒  2 passes = few mins
  (tuning depends on disk features)

 ⇒ A large fan-out (M/B) decreases the #passes
 ⇒ Compression would decrease the cost of a pass!

Can compression help?

 Goal: enlarge M and reduce N
   #passes = O( log_{M/B}(N/M) )
   Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm

 Use a pair of variables <X, C>
 For each item s of the stream:
     if (X == s) then C++
     else { C--; if (C == 0) { X = s; C = 1; } }
 Return X;

Proof
If X ≠ y at the end, then every one of y’s occurrences has a “negative” mate (a decrement caused by some other item). Hence these mates are ≥ #occ(y), so N ≥ 2·#occ(y) — impossible, since 2·#occ(y) > N.

The algorithm may return garbage if the most frequent item occurs ≤ N/2 times.
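A Python sketch of this one-pass majority scan, run on the slide’s stream (function name is mine):

def majority_candidate(stream):
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1
        elif X == s:
            C += 1
        else:
            C -= 1
    return X          # guaranteed to be the majority item, if one occurs > N/2 times

A = list("bacccdcbaaaccbccc")       # b a c c c d c b a a a c c b c c c
print(majority_candidate(A))        # -> 'c'  (c occurs 9 times out of 17)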

Toy problem #4: Indexing


Consider the following TREC collection:
  N = 6 * 10^9 chars  ⇒  size = 6Gb
  n = 10^6 documents
  TotT = 10^9 term occurrences (avg term length is 6 chars)
  t = 5 * 10^5 distinct terms

What kind of data structure should we build to support word-based searches ?

Solution 1: Term-Doc matrix      (t = 500K terms  ×  n = 1 million documents)

             Antony&Cleo.  J.Caesar  Tempest  Hamlet  Othello  Macbeth
  Antony          1            1        0        0       0        1
  Brutus          1            1        0        1       0        0
  Caesar          1            1        0        1       1        1
  Calpurnia       0            1        0        0       0        0
  Cleopatra       1            0        0        0       0        0
  mercy           1            0        1        1       1        1
  worser          1            0        1        1       1        0

1 if the play contains the word, 0 otherwise.
Space is 500Gb !

Solution 2: Inverted index

  Brutus    → 2 4 8 16 32 64 128
  Calpurnia → 1 2 3 5 8 13 21 34
  Caesar    → 13 16

We can still do better: i.e. 30-50% of the original text.

1. Typically we use about 12 bytes per posting
2. We have 10^9 total terms ⇒ at least 12Gb of space
3. Compressing the 6Gb of documents gets 1.5Gb of data
Better index, but it is still >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM into fewer bits ?
NO: there are 2^n of them, but the compressed messages shorter than n bits are fewer:

   ∑_{i=1}^{n−1} 2^i = 2^n − 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the self-information of s is:

   i(s) = log2 ( 1 / p(s) ) = − log2 p(s)

Lower probability ⇒ higher information.

Entropy is the weighted average of i(s):

   H(S) = ∑_{s∈S} p(s) · log2 ( 1 / p(s) )   bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie:

[figure: binary trie — a hangs from the 0-branch of the root; b = 100, c = 101 and d = 11 hang from the 1-branch]

Average Length
For a code C with codeword lengths L[s], the average length is defined as

   La(C) = ∑_{s∈S} p(s) · L[s]

We say that a prefix code C is optimal if for all prefix codes C’, La(C) ≤ La(C’).

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal uniquely decodable code, there exists a prefix code with the same codeword lengths, and thus the same optimal average length. And vice versa…

Theorem (golden rule). If C is an optimal prefix code for a source with probabilities {p1, …, pn}, then pi < pj  ⇒  L[si] ≥ L[sj].

Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any uniquely decodable code C, we have

   H(S) ≤ La(C)

Theorem (upper bound, Shannon). For any probability distribution, there exists a prefix code C such that

   La(C) ≤ H(S) + 1

(the Shannon code assigns each symbol ⌈log2 1/p⌉ bits)

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

[figure: Huffman tree — a(.1) and b(.2) merge into (.3), which merges with c(.2) into (.5), which merges with d(.5) into the root (1)]

a = 000, b = 001, c = 01, d = 1
There are 2^(n−1) “equivalent” Huffman trees.

What about ties (and thus, tree depth) ?
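A small Python sketch (not from the slides) that builds a Huffman code with a heap, for the running distribution; tie-breaking may produce different — but equally optimal — codewords:

import heapq
from itertools import count

def huffman_codes(probs):
    # probs: dict symbol -> probability; returns dict symbol -> codeword (bit string)
    tick = count()                         # tie-breaker so heap tuples stay comparable
    heap = [(p, next(tick), {s: ""}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)    # two least-probable subtrees
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, next(tick), merged))
    return heap[0][2]

print(huffman_codes({"a": .1, "b": .2, "c": .2, "d": .5}))
# codeword lengths: a,b -> 3 bits, c -> 2 bits, d -> 1 bit, as in the slide (bit values may differ)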

Encoding and Decoding
Encoding: emit the root-to-leaf path leading to the symbol to be encoded.
Decoding: start at the root and take a branch for each bit received; when at a leaf, output its symbol and return to the root.

   abc…  →  00000101           101001…  →  dcb…

[figure: the same Huffman tree, traversed top-down for encoding and bit-by-bit for decoding]

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:
  firstcode[L]  — the first codeword on level L (= 00…0 on the deepest level)
  Symbol[L,i]   — for each i in level L

This is ≤ h² + |S| log |S| bits   (h = height of the tree)

Canonical Huffman: Encoding / Decoding

[figure: canonical codeword assignment over the levels 1..5 of the tree]

firstcode[1]=2   firstcode[2]=1   firstcode[3]=1   firstcode[4]=2   firstcode[5]=0

Decoding example on T = ...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 ⇒ 1 extra bit per macro-symbol = 1/k extra bits per symbol
 ⇒ but a larger model has to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:
  the model takes |S|^k · (k · log |S|) + h² bits   (where h might be |S|)
  and it is H0(SL) ≤ L · Hk(S) + O(k · log |S|), for each k ≤ L

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
[figure: the word “or” is encoded by the 128-ary Huffman tree; each output byte carries 7 bits of Huffman code plus 1 tag bit marking the first byte of a codeword; the byte-aligned, tagged codewords form C(T) for T = “bzip or not bzip”]

CGrep and other ideas...

P = bzip = 1a 0b

[figure: GREP on the compressed text — the codeword of P is searched byte-aligned directly in C(T) for T = “bzip or not bzip”; the tag bits rule out false matches (yes/no at each candidate byte)]

Speed ≈ Compression ratio

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary = { bzip, not, or, space }
P = bzip = 1a 0b
S = “bzip or not bzip”

[figure: the codeword of P is searched byte-aligned in the compressed text C(S); at each candidate byte the answer is yes/no]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
[figure: pattern P aligned at a candidate position of text T]
 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching

We show methods in which arithmetic and bit-operations replace comparisons.
We will survey two examples of such methods:
 The Random Fingerprint method due to Karp and Rabin
 The Shift-And method due to Baeza-Yates and Gonnet

Rabin-Karp Fingerprint

We will use a class of functions from strings to integers in order to obtain:
 An efficient randomized algorithm that makes an error with small probability.
 A randomized algorithm that never errs, whose running time is efficient with high probability.

We will consider a binary alphabet (i.e., T ∈ {0,1}^n).

Arithmetic replaces Comparisons

Strings are also numbers, H: strings → numbers.
Let s be a string of length m:

   H(s) = ∑_{i=1..m} 2^{m−i} · s[i]

P = 0101
H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5

s = s’ if and only if H(s) = H(s’).

Definition: let Tr denote the m-length substring of T starting at position r (i.e., Tr = T[r, r+m−1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons

We can compute H(Tr) from H(Tr−1):

   H(Tr) = 2 · H(Tr−1) − 2^m · T[r−1] + T[r+m−1]

T = 10110101
T1 = 1011,  T2 = 0110
H(T1) = H(1011) = 11
H(T2) = H(0110) = 2·11 − 2⁴·1 + 0 = 22 − 16 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P = 101111,  q = 7

H(P) = 47,   Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally (Horner’s rule, mod 7):
   1
   1·2 (mod 7) + 0 = 2
   2·2 (mod 7) + 1 = 5
   5·2 (mod 7) + 1 = 11 (mod 7) = 4
   4·2 (mod 7) + 1 = 9 (mod 7) = 2
   2·2 (mod 7) + 1 = 5  =  Hq(P)

We can still compute Hq(Tr) from Hq(Tr−1):
   2^m (mod q) = 2 · ( 2^{m−1} (mod q) ) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
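A Python sketch of the Karp-Rabin scan (the prime q and the helper names are mine; matches are verified, as in the deterministic variant, so no false matches are reported):

def karp_rabin(T, P, q=2_147_483_647):           # q: a prime chosen for the example
    n, m = len(T), len(P)
    if m > n:
        return []
    d = 2                                        # binary alphabet, as in the slides
    dm = pow(d, m - 1, q)                        # d^(m-1) mod q, used to drop the leading char
    hp = ht = 0
    for i in range(m):                           # fingerprints of P and of T[0:m]
        hp = (hp * d + int(P[i])) % q
        ht = (ht * d + int(T[i])) % q
    occ = []
    for r in range(n - m + 1):
        if hp == ht and T[r:r+m] == P:           # verification step
            occ.append(r + 1)                    # 1-based positions, as in the slides
        if r + m < n:                            # roll the window: drop T[r], add T[r+m]
            ht = ((ht - int(T[r]) * dm) * d + int(T[r + m])) % q
    return occ

print(karp_rabin("10110101", "0101"))            # -> [5]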

Problem 1: Solution
Dictionary = { bzip, not, or, space }
P = bzip = 1a 0b
S = “bzip or not bzip”

[figure: scan C(S) byte by byte, comparing the tagged bytes against the codeword of P; the tag bits mark codeword beginnings, so a match is reported only at codeword boundaries (yes/no)]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

         c a l i f o r n i a      (j = 1..10)
    f    0 0 0 0 1 0 0 0 0 0
    o    0 0 0 0 0 1 0 0 0 0
    r    0 0 0 0 0 0 1 0 0 0

M(3,7) = 1  ⇒  P occurs ending at position 7 of T.

How does M solve the exact match problem?

How to construct M

We want to exploit bit-parallelism to compute the j-th column of M from the (j−1)-th one.
Machines can perform bit and arithmetic operations between two words in constant time.
Examples:
  And(A,B) is the bit-wise AND between A and B.
  BitShift(A) is the value derived by shifting A’s bits down by one position and setting the first bit to 1.
     e.g.  BitShift( (0,1,1,0,1)ᵀ ) = (1,0,1,1,0)ᵀ

Let w be the word size (e.g., 32 or 64 bits). We’ll assume m ≤ w. NOTICE: any column of M fits in a memory word.

How to construct M

We want to exploit bit-parallelism to compute the j-th column of M from the (j−1)-th one.
We define the m-length binary vector U(x) for each character x in the alphabet: U(x) is set to 1 in the positions where x appears in P.

Example: P = abaac
   U(a) = (1,0,1,1,0)ᵀ    U(b) = (0,1,0,0,0)ᵀ    U(c) = (0,0,0,0,1)ᵀ

How to construct M

Initialize column 0 of M to all zeros.
For j > 0, the j-th column is obtained by

   M(j) = BitShift( M(j−1) ) & U( T[j] )

For i > 1, entry M(i,j) = 1 iff
 (1) the first i−1 characters of P match the i−1 characters of T ending at character j−1   ⇔  M(i−1, j−1) = 1
 (2) P[i] = T[j]   ⇔  the i-th bit of U(T[j]) = 1

BitShift moves bit M(i−1,j−1) into the i-th position; ANDing with the i-th bit of U(T[j]) establishes whether both conditions hold.
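A Python sketch of this construction (a Python integer stands in for the machine word; bit i−1 of the integer is row i of M), checked against the worked example that follows:

def shift_and(T, P):
    m = len(P)
    U = {}                                   # U[c]: bit i-1 set iff P[i] == c
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    last = 1 << (m - 1)                      # bit of the last row of M
    M = 0
    occ = []
    for j, c in enumerate(T, start=1):
        M = ((M << 1) | 1) & U.get(c, 0)     # M(j) = BitShift(M(j-1)) & U(T[j])
        if M & last:
            occ.append(j)                    # an occurrence of P ends at position j of T
    return occ

print(shift_and("xabxabaaca", "abaac"))      # -> [9]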

An example: T = xabxabaaca,  P = abaac

The whole matrix M (rows = prefixes of P, columns j = 1..10):

         x a b x a b a a c a
    a    0 1 0 0 1 0 1 1 0 1
    b    0 0 1 0 0 1 0 0 0 0
    a    0 0 0 0 0 0 1 0 0 0
    a    0 0 0 0 0 0 0 1 0 0
    c    0 0 0 0 0 0 0 0 1 0

Some steps of the computation:
  j=1:  M(1) = BitShift(M(0)) & U(x) = (1,0,0,0,0)ᵀ & (0,0,0,0,0)ᵀ = (0,0,0,0,0)ᵀ
  j=2:  M(2) = BitShift(M(1)) & U(a) = (1,0,0,0,0)ᵀ & (1,0,1,1,0)ᵀ = (1,0,0,0,0)ᵀ
  j=3:  M(3) = BitShift(M(2)) & U(b) = (1,1,0,0,0)ᵀ & (0,1,0,0,0)ᵀ = (0,1,0,0,0)ᵀ
  j=9:  M(9) = BitShift(M(8)) & U(c) = (1,1,0,0,1)ᵀ & (0,0,0,0,1)ᵀ = (0,0,0,0,1)ᵀ
        M(5,9) = 1  ⇒  an occurrence of P ends at position 9 of T.

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions

We want to allow the pattern to contain special symbols, like the [a-f] classes of chars.

P = [a-b]baac
   U(a) = (1,0,1,1,0)ᵀ    U(b) = (1,1,0,0,0)ᵀ    U(c) = (0,0,0,0,1)ᵀ

What about ‘?’, ‘[^…]’ (not) ?

Problem 1: Another solution
Dictionary = { bzip, not, or, space }
P = bzip = 1a 0b
S = “bzip or not bzip”

[figure: the same dictionary and compressed text C(S); the codeword of P is searched in C(S) with a different scanning strategy (yes/no at each candidate byte)]

Speed ≈ Compression ratio

Problem 2
Dictionary = { bzip, not, or, space }

Given a pattern P, find all the occurrences in S of all terms containing P as a substring.

P = o   ⇒  matching terms:  not = 1g 0g 0a,   or = 1g 0a 0b

[figure: the codewords of all matching terms are searched in C(S) for S = “bzip or not bzip”]

Speed ≈ Compression ratio? No! Why?
A scan of C(S) is needed for each term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1, P2, …, Pl} of total length m, we want to find all the occurrences of those patterns in the text T[1,n].

[figure: the patterns P1 and P2 aligned at their occurrences inside T]

 Naïve solution
  Use an (optimal) exact-matching algorithm to search for each pattern of P
  Complexity: O(nl + m) time — not good with many patterns

 Optimal solution due to Aho and Corasick
  Complexity: O(n + l + m) time

A simple extension of Shift-And

 S is the concatenation of the patterns in P
 R is a bitmap of length m, with R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of the Shift-And method, searching for S:
 For any symbol c, U’(c) = U(c) AND R
   ⇒ U’(c)[i] = 1 iff S[i] = c and it is the first symbol of a pattern
 For any step j:
   compute M(j)
   then set M(j) = M(j) OR U’(T[j]). Why? To set to 1 the first bit of each pattern that starts with T[j]
   Check if there are occurrences ending in j. How? By looking at the bits of M(j) in the positions where a pattern ends.

Problem 3
Dictionary = { bzip, not, or, space }

Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing at most k mismatches.

P = bot,  k = 2

[figure: the dictionary terms within distance k of P are identified, and their codewords are searched in C(S) for S = “bzip or not bzip”]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep

Our current goal: given k, find all the occurrences of P in T with up to k mismatches.
We define the matrix M^l to be an m by n binary matrix, such that:

   M^l(i,j) = 1  iff  there are no more than l mismatches between the first i characters of P and the i characters of T ending at character j.

What is M^0?
How does M^k solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1

The first i−1 characters of P match a substring of T ending at j−1 with at most l mismatches, and the next pair of characters in P and T are equal:

   BitShift( M^l(j−1) ) & U( T[j] )

Computing Ml: case 2

The first i−1 characters of P match a substring of T ending at j−1 with at most l−1 mismatches (position i may then be a mismatch):

   BitShift( M^{l−1}(j−1) )

Computing Ml

We compute M^l for all l = 0, …, k; for each j we compute M(j), M¹(j), …, M^k(j).
For all l, initialize M^l(0) to the zero vector.
In order to compute M^l(j), we observe that there is a match iff case 1 or case 2 above holds:

   M^l(j) = [ BitShift( M^l(j−1) ) & U(T[j]) ]  OR  BitShift( M^{l−1}(j−1) )

Example M1:   T = xabxabaaca,   P = abaad

         x a b x a b a a c a      (j = 1..10)
M1 =     1 1 1 1 1 1 1 1 1 1      (row a)
         0 0 1 0 0 1 0 1 1 0      (row b)
         0 0 0 1 0 0 1 0 0 1      (row a)
         0 0 0 0 1 0 0 1 0 0      (row a)
         0 0 0 0 0 0 0 0 1 0      (row d)

M0 =     0 1 0 0 1 0 1 1 0 1
         0 0 1 0 0 1 0 0 0 0
         0 0 0 0 0 0 1 0 0 0
         0 0 0 0 0 0 0 1 0 0
         0 0 0 0 0 0 0 0 0 0

M1(5,9) = 1  ⇒  P occurs in T ending at position 9 with at most 1 mismatch (T[5,9] = abaac vs P = abaad).

How much do we pay?

 The running time is O( k·n·(1 + m/w) ).
 Again, the method is practically efficient for small m.
 Only O(k) columns of M are needed at any given time, hence the space used by the algorithm is O(k) memory words.
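A Python sketch of this k-mismatch recurrence, keeping only the k+1 current columns (a direct transcription of M^l(j) = [BitShift(M^l(j−1)) & U(T[j])] OR BitShift(M^{l−1}(j−1)); names are mine):

def shift_and_k_mismatches(T, P, k):
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    last = 1 << (m - 1)
    M = [0] * (k + 1)                            # M[l] = column M^l of the current step
    occ = []
    for j, c in enumerate(T, start=1):
        prev = M[:]                              # columns of step j-1
        M[0] = ((prev[0] << 1) | 1) & U.get(c, 0)
        for l in range(1, k + 1):
            # extend with a matching char  OR  spend one extra mismatch
            M[l] = (((prev[l] << 1) | 1) & U.get(c, 0)) | ((prev[l - 1] << 1) | 1)
        if M[k] & last:
            occ.append(j)                        # P occurs ending at j with <= k mismatches
    return occ

print(shift_and_k_mismatches("xabxabaaca", "abaad", 1))    # -> [9], as in the example above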

Problem 3: Solution
Dictionary = { bzip, not, or, space }

Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing k mismatches.

P = bot,  k = 2   ⇒   matching term:  not = 1g 0g 0a

[figure: the codeword of the matching term is searched in C(S) for S = “bzip or not bzip” (yes/no at each candidate byte)]

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol of p into a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding

   g(x) = 0…0 (Length−1 zeroes) followed by x in binary,    where x > 0 and Length = ⌊log2 x⌋ + 1
   e.g., 9 is represented as <000,1001>.

 The g-code for x takes 2⌊log2 x⌋ + 1 bits   (i.e. a factor of 2 from optimal)
 Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…
Given the following sequence of g-coded integers, reconstruct the original sequence:

   0001000001100110000011101100111   =   8  6  3  59  7
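A Python sketch of g-coding and decoding (bit strings are modelled as Python strings of '0'/'1' for clarity; function names are mine):

def gamma_encode(x):                   # x > 0
    b = bin(x)[2:]                     # x in binary, Length = len(b)
    return "0" * (len(b) - 1) + b      # Length-1 zeroes, then x in binary

def gamma_decode_all(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":          # count the leading zeroes = Length - 1
            z += 1
            i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out

code = "".join(gamma_encode(x) for x in [8, 6, 3, 59, 7])
print(code)                            # -> 0001000001100110000011101100111
print(gamma_decode_all(code))          # -> [8, 6, 3, 59, 7]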

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code g(i).

Recall that: |g(i)| ≤ 2·log i + 1
How good is this approach wrt Huffman?  Compression ratio ≤ 2·H0(s) + 1

Key fact:   1 ≥ ∑_{i=1..x} pi ≥ x · px   ⇒   x ≤ 1/px

How good is it ?
Encode the integers via g-coding: |g(i)| ≤ 2·log i + 1.
The cost of the encoding is (recall i ≤ 1/pi):

   ∑_{i=1..|S|} pi · |g(i)|  ≤  ∑_{i=1..|S|} pi · [ 2·log(1/pi) + 1 ]  =  2·H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + …

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
The distribution of words is skewed: 1/i^q, where 1 < q < 2

A new concept: Continuers vs Stoppers
 Previously we used: s = c = 128

The main idea is:
 s + c = 256 (we are playing with 8 bits)
 Thus s items are encoded with 1 byte
 And s·c with 2 bytes, s·c² with 3 bytes, …

An example
 5000 distinct words
 ETDC encodes 128 + 128² = 16512 words within 2 bytes
 A (230,26)-dense code encodes 230 + 230·26 = 6210 within 2 bytes, hence more words on 1 byte — and thus it wins if the distribution is skewed…

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory: it exploits temporal locality, and it is a dynamic code.

Properties:
  X = 1^n 2^n 3^n … n^n   ⇒   Huff = O(n² log n),  MTF = O(n log n) + n²
  Not much worse than Huffman… but it may be far better.
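A Python sketch of MTF encoding/decoding (positions are 1-based, matching the description above; the initial symbol list is an assumption):

def mtf_encode(text, alphabet):
    L = list(alphabet)                   # the symbol list, e.g. [a,b,c,d,...]
    out = []
    for s in text:
        i = L.index(s)
        out.append(i + 1)                # 1) output the position of s in L
        L.insert(0, L.pop(i))            # 2) move s to the front of L
    return out

def mtf_decode(codes, alphabet):
    L = list(alphabet)
    out = []
    for i in codes:
        s = L[i - 1]
        out.append(s)
        L.insert(0, L.pop(i - 1))
    return "".join(out)

c = mtf_encode("abbbaacccca", "abcd")
print(c)                                 # -> [1, 2, 1, 1, 2, 1, 3, 1, 1, 1, 2]
print(mtf_decode(c, "abcd"))             # -> abbbaacccca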

MTF: how good is it ?
Encode the integers via g-coding: |g(i)| ≤ 2·log i + 1.
Put S in front of the list and consider the cost of encoding (p_i^x = position of the i-th occurrence of symbol x):

   O(|S| log |S|)  +  ∑_{x=1..|S|} ∑_{i=2..nx} | g( p_i^x − p_{i−1}^x ) |

By Jensen’s inequality:

   ≤  O(|S| log |S|)  +  ∑_{x=1..|S|} nx · [ 2·log(N/nx) + 1 ]
   =  O(|S| log |S|)  +  N · [ 2·H0(X) + 1 ]

   ⇒   La[mtf] ≤ 2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
   abbbaacccca  ⇒  (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings ⇒ just the run lengths and one bit.

Properties:
 There is a memory: it exploits spatial locality, and it is a dynamic code.
 X = 1^n 2^n 3^n … n^n   ⇒   Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive), of width p(symbol), where

   f(i) = ∑_{j=1..i−1} p(j)

e.g.   a = .2  →  [0, .2)      b = .5  →  [.2, .7)      c = .3  →  [.7, 1.0)
       f(a) = .0,  f(b) = .2,  f(c) = .7

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7)).

Arithmetic Coding: Encoding Example
Coding the message sequence: bac

   start:  [0, 1)
   b  →  [.2, .7)
   a  →  [.2, .3)        (the first 20% of [.2,.7))
   c  →  [.27, .3)       (the last 30% of [.2,.3))

The final sequence interval is [.27, .3)

Arithmetic Coding
To code a sequence of symbols c with probabilities p[c], use the following recurrences:

   l0 = 0       li = li−1 + si−1 · f(ci)
   s0 = 1       si = si−1 · p(ci)

f[c] is the cumulative probability up to symbol c (not included).
The final interval size is

   sn = ∏_{i=1..n} p(ci)

The interval for a message sequence will be called the sequence interval.

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:

   .49 ∈ [.2, .7)     →  b ;  within it, a = [.2,.3), b = [.3,.55), c = [.55,.7)
   .49 ∈ [.3, .55)    →  b ;  within it, a = [.3,.35), b = [.35,.475), c = [.475,.55)
   .49 ∈ [.475, .55)  →  c

The message is bbc.
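A toy Python sketch of the interval computations above, using exact fractions (so no rounding issues), for the model a = .2, b = .5, c = .3:

from fractions import Fraction as F

P   = {"a": F(2,10), "b": F(5,10), "c": F(3,10)}
CUM = {"a": F(0),    "b": F(2,10), "c": F(7,10)}    # f[c] = cumulative prob. before c

def seq_interval(msg):
    l, s = F(0), F(1)                               # l0 = 0, s0 = 1
    for c in msg:
        l = l + s * CUM[c]                          # l_i = l_{i-1} + s_{i-1} * f(c_i)
        s = s * P[c]                                # s_i = s_{i-1} * p(c_i)
    return l, s

def decode(x, n):
    msg = []
    for _ in range(n):
        for c in "abc":
            if CUM[c] <= x < CUM[c] + P[c]:
                msg.append(c)
                x = (x - CUM[c]) / P[c]             # rescale inside the symbol interval
                break
    return "".join(msg)

print(seq_interval("bac"))       # -> (27/100, 3/100), i.e. the interval [.27, .30)
print(decode(F(49,100), 3))      # -> bbc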

Representing a real number
Binary fractional representation:

   .75 = .11       1/3 = .0101…       11/16 = .1011

Algorithm:
  1. x = 2·x
  2. If x < 1, output 0
  3. else x = x − 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
e.g.   [0,.33) → .01       [.33,.66) → .1       [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions:

   code     min       max       interval
   .11      .110…     .111…     [.75, 1.0)
   .101     .1010…    .1011…    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval is contained in the sequence interval (a dyadic number).

   e.g. sequence interval [.61, .79)  ⊇  code interval (.101) = [.625, .75)

Can use L + s/2 truncated to 1 + ⌈log (1/s)⌉ bits.

Bound on Arithmetic length: note that −⌈log s⌉ + 1 = ⌈log (2/s)⌉

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

   1 + ⌈log (1/s)⌉ = 1 + ⌈log ∏_{i=1..n} (1/pi)⌉
                   ≤ 2 + ∑_{i=1..n} log (1/pi)
                   = 2 + ∑_{k=1..|S|} n·pk · log (1/pk)
                   = 2 + n·H0    bits

In practice ≈ n·H0 + 0.02·n bits, because of rounding.

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 (top half):
  output 1 followed by m 0s; set m = 0; the message interval is expanded by 2
If u < R/2 (bottom half):
  output 0 followed by m 1s; set m = 0; the message interval is expanded by 2
If l ≥ R/4 and u < 3R/4 (middle half):
  increment m; the message interval is expanded by 2
In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine: the ATB receives the current interval (L,s) and the distribution (p1,…,p|S|) of the next symbol c, and returns the sub-interval (L’,s’) of (L,s) associated with c.

[figure: the interval [L, L+s) being narrowed to [L’, L’+s’) by the ATB]

Therefore, even the distribution can change over time.

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox

The ATB is driven by p[ s | context ], where s = c or esc: at each step the current interval (L,s) is narrowed to (L’,s’) according to the conditional distribution chosen by the PPM model.

Encoder and Decoder must know the protocol for selecting the same conditional probability distribution (PPM-variant).

PPM: Example Contexts (k=2)        String = ACCBACCACBA B

  Order-0 (empty context):   A = 4   B = 2   C = 5   $ = 3

  Order-1 contexts:
     A:  C = 3   $ = 1
     B:  A = 2   $ = 1
     C:  A = 1   B = 2   C = 2   $ = 3

  Order-2 contexts:
     AC: B = 1   C = 2   $ = 2
     BA: C = 1   $ = 1
     CA: C = 1   $ = 1
     CB: A = 2   $ = 1
     CC: A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb
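A Python sketch of LZ78 coding/decoding that reproduces the example above (the dictionary is modelled as a plain dict from string to id; function names are mine):

def lz78_encode(s):
    D, out, i = {}, [], 0
    while i < len(s):
        j, last_id = i, 0
        while j < len(s) and s[i:j+1] in D:       # longest match S already in the dictionary
            last_id = D[s[i:j+1]]
            j += 1
        c = s[j] if j < len(s) else ""            # next character after the match
        out.append((last_id, c))
        D[s[i:j+1]] = len(D) + 1                  # add Sc with a fresh id
        i = j + 1
    return out

def lz78_decode(tokens):
    D, out = {0: ""}, []
    for last_id, c in tokens:
        piece = D[last_id] + c
        out.append(piece)
        D[len(D)] = piece
    return "".join(out)

code = lz78_encode("aabaacabcabcb")
print(code)                  # -> [(0,'a'), (1,'b'), (1,'a'), (0,'c'), (2,'c'), (5,'b')]
print(lz78_decode(code))     # -> aabaacabcabcb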

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us be given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: the L → F mapping
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i


How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
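A Python sketch (not the course’s implementation) of the forward and inverse BWT, following the same L→F idea as the pseudocode above; '#' is assumed to be a unique, smallest end-marker and the construction sorts all rotations naively:

def bwt(t):
    rot = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(row[-1] for row in rot)        # column L of the sorted rotations

def ibwt(L):
    # LF[i] = position in F = sorted(L) of the character occurrence L[i]
    order = sorted(range(len(L)), key=lambda i: (L[i], i))
    LF = [0] * len(L)
    for f_pos, l_pos in enumerate(order):
        LF[l_pos] = f_pos
    r, out = 0, []                 # row 0 is the rotation starting with '#'
    for _ in range(len(L)):        # L[r] precedes F[r] in T: walk T backward
        out.append(L[r])
        r = LF[r]
    s = "".join(reversed(out))     # a rotation of T starting with '#'
    return s[1:] + s[0]            # move '#' back to the end

L = bwt("mississippi#")
print(L)              # -> ipssm#pissii
print(ibwt(L))        # -> mississippi#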

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…

 Physical network graph: V = routers, E = communication links
 The “cosine” graph (undirected, weighted): V = static web pages, E = semantic distance between pages
 Query-Log graph (bipartite, weighted): V = queries and URLs, E = (q,u) if u is a result for q and has been clicked by some user who issued q
 Social graph (undirected, unweighted): V = users, E = (x,y) if x knows y (facebook, address book, email, ..)

Definition
Directed graph G = (V,E):
 V = URLs, E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN & no OUT)

Three key properties; the first one:
 Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution
 Altavista crawl (1999) and WebBase crawl (2001): the indegree follows a power-law distribution

   Pr[ in-degree(u) = k ]  ∝  1/k^α,   α ≈ 2.1
A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph

[figure: adjacency-matrix plot (axes i, j) of a crawl with 21 million pages and 150 million links]

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides an efficient, optimal solution
  fknown is the “previously encoded text”: compress the concatenation fknown·fnew, emitting output only from fnew onward

zdelta is one of the best implementations:

              Emacs size   Emacs time
  uncompr     27Mb         ---
  gzip        8Mb          35 secs
  zdelta      1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: a pair of proxies located on each side of the slow link use a proprietary protocol to increase performance over this link.

[figure: Client ⇄ client-side proxy ⇄ (slow link, delta-encoded pages wrt a shared reference) ⇄ server-side proxy ⇄ (fast link) ⇄ web page]

Use zdelta to reduce traffic:
 Old version available at both proxies
 Restricted to pages already visited (30% hits), URL-prefix match
 Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[figure: weighted graph over the files plus a dummy node; edge weights (e.g. 20, 123, 220, 620, 2000) are the pairwise zdelta sizes, edges from the dummy node carry the gzip sizes; the min branching picks the cheapest reference for each file]

              space   time
  uncompr     30Mb    ---
  tgz         20%     linear
  THIS        8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
              space    time
  uncompr     260Mb    ---
  tgz         12%      2 mins
  THIS        8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm

[figure: the Client splits f_old into blocks and sends their hashes to the Server; the Server matches them against f_new and replies with an encoded file of block references plus literal bytes]

 simple, widely used, single roundtrip
 optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
 choice of block size problematic (default: max{700, √n} bytes)
 not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

             gcc size    emacs size
  total      27288       27326
  gzip       7563        8577
  zdelta     227         1431
  rsync      964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends the hashes (unlike rsync, where the client sends them), and the client checks them.
The server deploys the common fref to compress the new ftar (rsync just compresses it on its own).

A multi-round protocol

 k blocks of n/k elements each
 log(n/k) levels
 If the distance is k, then on each level at most k hashes fail to find a match in the other file.
 The communication complexity is O(k lg n · lg(n/k)) bits.

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
   iff
P is a prefix of the i-th suffix of T (i.e. T[i,N]).

[figure: P aligned at position i of T, covering a prefix of the suffix T[i,N]]

Occurrences of P in T = all suffixes of T having P as a prefix.

   P = si,  T = mississippi  ⇒  occurrences at positions 4, 7

SUF(T) = sorted set of suffixes of T

Reduction: from substring search to prefix search.

The Suffix Tree

T# = mississippi#    (positions 1..12)

[figure: suffix tree of T# — internal edges labelled with substrings such as “i”, “s”, “si”, “ssi”, “p”, and leaf edges labelled with “#”, “i#”, “pi#”, “ppi#”, “mississippi#”; the 12 leaves store the starting positions 1..12 of the corresponding suffixes]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

T = mississippi#        (storing SUF(T) explicitly takes Θ(N²) space)

   SA    SUF(T)
   12    #
   11    i#
    8    ippi#
    5    issippi#
    2    ississippi#
    1    mississippi#
   10    pi#
    9    ppi#
    7    sippi#
    4    sissippi#
    6    ssippi#
    3    ssissippi#

Suffix Array: SA takes Θ(N log2 N) bits, the text T takes N chars
 ⇒ in practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step (one to SA, one to the text).

   T = mississippi#,   P = si,   SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]

   Compare P with the suffix starting at SA[mid]:
     P is larger   ⇒  recurse on the right half
     P is smaller  ⇒  recurse on the left half

Suffix Array search
 • O(log2 N) binary-search steps
 • Each step takes O(p) char comparisons
   ⇒ overall, O(p log2 N) time
 • The log2 N term can be made additive, O(p + log2 N)   [Manber-Myers, ’90]
 • and reduced to log2 |S| (alphabet size)               [Cole et al, ’06]

Locating the occurrences
Binary-search in SA for the positions of si# and si$ (the smallest and largest strings prefixed by si): all the suffixes prefixed by P lie between them.

   T = mississippi#,  P = si  ⇒  the range contains SA entries 7 (sippi…) and 4 (sissippi…)
   occ = 2, occurrences at positions 4 and 7

Suffix Array search
 • O(p + log2 N + occ) time      (using the sentinels # < S < $)

 Suffix Trays: O(p + log2 |S| + occ)     [Cole et al., ‘06]
 String B-tree                           [Ferragina-Grossi, ’95]
 Self-adjusting Suffix Arrays            [Ciriani et al., ’02]
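A Python sketch of the “elegant but inefficient” suffix-array construction plus the indirect binary search described above (function names and 1-based convention are mine):

def suffix_array(t):
    # naive construction: sort the suffixes explicitly (Θ(N^2 log N) worst case)
    return sorted(range(1, len(t) + 1), key=lambda i: t[i - 1:])

def search(t, SA, P):
    # indirect binary search on SA: O(p) chars compared per step, O(p log N) overall
    lo, hi = 0, len(SA)
    while lo < hi:                                     # first suffix >= P
        mid = (lo + hi) // 2
        if t[SA[mid] - 1:] < P:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    lo, hi = first, len(SA)
    while lo < hi:                                     # first suffix NOT prefixed by P
        mid = (lo + hi) // 2
        if t[SA[mid] - 1:SA[mid] - 1 + len(P)] == P:
            lo = mid + 1
        else:
            hi = mid
    return sorted(SA[first:lo])                        # starting positions of all occurrences

T = "mississippi#"
SA = suffix_array(T)
print(SA)                        # -> [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(search(T, SA, "si"))       # -> [4, 7]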

Text mining
Lcp[1,N−1] = longest-common-prefix between suffixes adjacent in SA

   T = mississippi#
   SA  = 12 11  8  5  2  1 10  9  7  4  6  3
   Lcp =     0  1  1  4  0  0  1  0  2  1  3     (e.g. Lcp = 4 between issippi# and ississippi#)

 • How long is the common prefix between T[i,...] and T[j,...] ?
   → Min of the subarray Lcp[h,k−1] s.t. SA[h] = i and SA[k] = j.
 • Does there exist a repeated substring of length ≥ L ?
   → Search for an entry Lcp[i] ≥ L.
 • Does there exist a substring of length ≥ L occurring ≥ C times ?
   → Search for a run Lcp[i, i+C−2] whose entries are all ≥ L.


Slide 86

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 10^4 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and the algorithm uses all of it:
(1/B) * (p * f/(1+f) * C) ≈ 30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
(figure: the memory hierarchy — CPU registers, L1/L2 cache (few Mbs, some nanosecs, few words fetched), RAM (few Gbs, tens of nanosecs, some words fetched), disk (few Tbs, few millisecs, B = 32K pages), network (many Tbs, even secs, packets))

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
        4K    8K    16K   32K    128K   256K   512K   1M
n^3     22s   3m    26m   3.5h   28h    --     --     --
n^2     0     0     0     1s     26s    106s   7m     28m

An optimal solution
We assume every prefix-sum inside the optimal window is ≠ 0.
(figure: the part of A preceding the optimum sums to < 0, while every prefix of the optimum sums to > 0)

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm
  sum = 0; max = -1;
  for i = 1,...,n do
    if (sum + A[i] ≤ 0) then sum = 0;
    else { sum += A[i]; max = MAX{max, sum}; }

Note:
• sum < 0 right before OPT starts;
• sum > 0 within OPT.
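Below is a minimal Python sketch of this linear-time scan (Kadane-style); like the slide's version, it assumes the best window has positive sum.

  def max_subarray_sum(A):
      best = 0            # like the slide's max = -1: meaningful only if some window has positive sum
      run = 0             # sum of the current candidate window
      for x in A:
          if run + x <= 0:
              run = 0     # a non-positive prefix can never start the optimum
          else:
              run += x
              best = max(best, run)
      return best

  print(max_subarray_sum([2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]))  # 12, i.e. the window 6 1 -2 4 3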

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 10^9 random I/Os = 10^9 * 5ms ≈ 2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01  if (i < j) then
02    m = (i+j)/2;            (Divide)
03    Merge-Sort(A,i,m);      (Conquer)
04    Merge-Sort(A,m+1,j);
05    Merge(A,i,m,j)          (Combine)

Cost of Mergesort on large data




Take Wikipedia in Italian and compute the word frequencies:
  n = 10^9 tuples ⇒ a few Gbs
  typical disk (Seagate Cheetah 150Gb): seek time ~5ms
Analysis of mergesort on disk:
  it is an indirect sort: Θ(n log2 n) random I/Os
  [5ms] * n log2 n ≈ 1.5 years
In practice it is faster because of caching... and it makes 2 passes (R/W) over the data.

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
(figure: the recursion tree of binary merge-sort over the runs, with log2 N levels of pairwise merges; the question is how to deploy the disk/memory features M and B)
N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge.
Sort N items with main memory M and disk pages of B items:
  Pass 1: produce N/M sorted runs.
  Pass i: merge X = M/B runs at a time ⇒ log_{M/B}(N/M) merge passes.
(figure: X input buffers of B items and one output buffer of B items sit in main memory, between the input runs on disk and the merged output run on disk)

Multiway Merging
(figure: X = M/B input buffers Bf1..BfX, one per run, each with a pointer pi to its current page, plus one output buffer Bfo; repeatedly output min(Bf1[p1], Bf2[p2], …, BfX[pX]); fetch a new page when pi = B, flush Bfo when it is full, until EOF; the output file is the merged run)

Cost of Multi-way Merge-Sort
Number of passes = log_{M/B}(#runs) ≈ log_{M/B}(N/M)
Optimal cost = Θ((N/B) log_{M/B}(N/M)) I/Os
In practice:
  M/B ≈ 1000 ⇒ #passes = log_{M/B}(N/M) ≈ 1
  one multiway merge ⇒ 2 passes = a few minutes (tuning depends on disk features)
▪ Large fan-out (M/B) decreases #passes
▪ Compression would decrease the cost of a pass!
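To make the merging step concrete, here is a minimal Python sketch (mine, not the course's code) of a k-way merge driven by a heap; in the external-memory setting each run would be a disk file read B items at a time.

  import heapq

  def multiway_merge(runs):
      """Merge k sorted iterables into one sorted stream."""
      heap = []
      for rid, run in enumerate(map(iter, runs)):
          first = next(run, None)
          if first is not None:
              heap.append((first, rid, run))     # rid breaks ties, so runs are never compared
      heapq.heapify(heap)
      while heap:
          value, rid, run = heapq.heappop(heap)
          yield value
          nxt = next(run, None)
          if nxt is not None:
              heapq.heappush(heap, (nxt, rid, run))

  runs = [[1, 2, 5, 10], [2, 7, 8, 13, 19], [3, 4, 11, 12]]
  print(list(multiway_merge(runs)))   # [1, 2, 2, 3, 4, 5, 7, 8, 10, 11, 12, 13, 19]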

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: top queries over a stream of N items (Σ large).
Math Problem: find the item y whose frequency is > N/2, using the smallest space (i.e., the mode, provided it occurs > N/2 times).

A=b a c c c d c b a a a c c b c c c
.

Algorithm
  Use a pair of variables <X,C> (candidate and counter)
  For each item s of the stream:
    if (X == s) then C++
    else { C--; if (C == 0) { X = s; C = 1; } }
  Return X;

Proof
If the final X ≠ y, then every occurrence of y has a distinct “negative” mate (an occurrence that cancelled it), so the mates are ≥ #occ(y). As a result the stream would contain ≥ 2·#occ(y) > N items: a contradiction.
(Problems arise if no item occurs > N/2 times: the returned X may then be wrong.)
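A minimal Python sketch of this two-variable scan (the Boyer–Moore majority vote); the answer is guaranteed only when some item really occurs more than N/2 times.

  def majority_candidate(stream):
      X, C = None, 0
      for s in stream:
          if C == 0:
              X, C = s, 1
          elif X == s:
              C += 1
          else:
              C -= 1
      return X

  A = list("bacccdcbaaaccbccc")       # the slide's stream: b a c c c d c b a a a c c b c c c
  print(majority_candidate(A))        # 'c' (which indeed occurs 9 times out of 17)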

Toy problem #4: Indexing


Consider the following TREC collection:
  N = 6 * 10^9 chars, size = 6Gb
  n = 10^6 documents
  TotT = 10^9 term occurrences (avg term length is 6 chars)
  t = 5 * 10^5 distinct terms
What kind of data structure do we build to support word-based searches?

Solution 1: Term-Doc matrix   (t = 500K terms × n = 1 million docs; entry = 1 if the play contains the word, 0 otherwise)

            Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony             1                1             0          0        0        1
Brutus             1                1             0          1        0        0
Caesar             1                1             0          1        1        1
Calpurnia          0                1             0          0        0        0
Cleopatra          1                0             0          0        0        0
mercy              1                0             1          1        1        1
worser             1                0             1          1        1        0

Space is 500Gb !

Solution 2: Inverted index
Brutus    →  2 4 8 16 32 64 128
Calpurnia →  1 2 3 5 8 13 21 34
Caesar    →  13 16
We can still do better, i.e. reach 30–50% of the original text:
1. Typically about 12 bytes are used per posting
2. We have 10^9 total terms ⇒ at least 12Gb space
3. Compressing the 6Gb documents gets 1.5Gb data
A better index, but it is still >10 times the (compressed) text !!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string is (losslessly) compressed, some other must expand.
Take all messages of length n: is it possible to compress ALL of them into fewer bits?
NO: they are 2^n, but the shorter compressed messages available are at most
  Σ_{i=1}^{n-1} 2^i = 2^n − 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probabilities p(s), the self-information of s is
  i(s) = log2 (1/p(s)) = − log2 p(s)
Lower probability ⇒ higher information.
Entropy is the weighted average of i(s):
  H(S) = Σ_{s∈S} p(s) · log2 (1/p(s))   bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword lengths L[s], the average length is defined as
  La(C) = Σ_{s∈S} p(s) · L[s]
We say that a prefix code C is optimal if for all prefix codes C’, La(C) ≤ La(C’).

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal uniquely decodable code there exists a prefix code with the same codeword lengths, and thus the same optimal average length. And vice versa…
Theorem (golden rule). If C is an optimal prefix code for a source with probabilities {p1, …, pn}, then pi < pj ⇒ L[si] ≥ L[sj].

Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any uniquely decodable code C:  H(S) ≤ La(C).
Theorem (upper bound, Shannon). For any probability distribution there exists a prefix code C such that  La(C) ≤ H(S) + 1.
(The Shannon code assigns symbol s a codeword of ⌈log2 1/p(s)⌉ bits.)

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!
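As a small illustration of the construction (mine, not the course's implementation), here is a Python sketch that builds the codewords by repeatedly merging the two least probable subtrees with a heap; ties are broken arbitrarily, so only the codeword lengths are guaranteed optimal.

  import heapq

  def huffman_codes(probs):
      """probs: dict symbol -> probability; returns dict symbol -> codeword string."""
      heap = [(p, i, [s]) for i, (s, p) in enumerate(probs.items())]
      heapq.heapify(heap)
      codes = {s: "" for s in probs}
      counter = len(heap)                      # unique tie-breaker so lists are never compared
      while len(heap) > 1:
          p1, _, syms1 = heapq.heappop(heap)   # two least probable subtrees
          p2, _, syms2 = heapq.heappop(heap)
          for s in syms1: codes[s] = "0" + codes[s]   # prepend the branch bit
          for s in syms2: codes[s] = "1" + codes[s]
          heapq.heappush(heap, (p1 + p2, counter, syms1 + syms2))
          counter += 1
      return codes

  print(huffman_codes({"a": .1, "b": .2, "c": .2, "d": .5}))
  # codeword lengths: a,b -> 3 bits, c -> 2 bits, d -> 1 bit (same as the slide's a=000, b=001, c=01, d=1)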

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5
(figure: merge a(.1) and b(.2) into (.3); merge (.3) and c(.2) into (.5); merge (.5) and d(.5) into (1))
Resulting code: a=000, b=001, c=01, d=1
There are 2^(n-1) “equivalent” Huffman trees.
What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
(examples on the tree above: encoding abc… → 000 001 01… = 00000101…;  decoding 101001… → d c b …)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999: its self-information is −log2(.999) ≈ .00144 bits.
If we were to send 1000 such symbols we might hope to use about 1000 · .00144 ≈ 1.44 bits.
Using Huffman, we take at least 1 bit per symbol, so we would require 1000 bits.
What can we do?
Macro-symbol = block of k symbols
  ≤ 1 extra bit per macro-symbol = 1/k extra bits per symbol
  but a larger model has to be transmitted
Shannon took infinite sequences, i.e. k → ∞ !!
In practice we have:
  the model takes |S|^k · (k · log |S|) + h^2 bits (where h might be |S|)
  and H0(S^L) ≤ L · Hk(S) + O(k · log |S|), for each k ≤ L

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
(figure: the word-based Huffman tree over T = “bzip or not bzip”: the symbols are the words bzip, or, not and the space; the tree has fan-out 128 and each codeword — e.g. [bzip] = 1a 0b — is byte-aligned, with the first bit of each byte used as tag; C(T) is the concatenation of the tagged codewords)

CGrep and other ideas...
(figure: searching the compressed pattern P = bzip = 1a 0b directly in C(T); the tag bits let the scan synchronize on codeword boundaries, answering yes/no at each aligned position)

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary = {bzip, not, or, space}; P = bzip = 1a 0b; S = “bzip or not bzip”.
(figure: the compressed text C(S) is scanned, comparing the compressed pattern against the byte-aligned, tagged codewords; each aligned position answers yes/no)
Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons
Strings are also numbers: H: strings → numbers.
Let s be a string of length m:
  H(s) = Σ_{i=1}^{m} 2^{m−i} · s[i]
Example: P = 0101 ⇒ H(P) = 2^3·0 + 2^2·1 + 2^1·0 + 2^0·1 = 5.
s = s’ if and only if H(s) = H(s’).
Definition: let Tr denote the m-length substring of T starting at position r (i.e., Tr = T[r, r+m−1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons
We can compute H(Tr) from H(Tr−1):
  H(Tr) = 2·H(Tr−1) − 2^m·T[r−1] + T[r+m−1]
T = 10110101, T1 = 1011, T2 = 0110:
  H(T1) = H(1011) = 11
  H(T2) = 2·11 − 2^4·1 + 0 = 22 − 16 = 6 = H(0110)

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P = 101111, q = 7
H(P) = 47, Hq(P) = 47 (mod 7) = 5
Hq(P) can be computed incrementally (scan P left to right: value ← 2·value + bit, mod 7):
  1·2 (mod 7) + 0 = 2
  2·2 (mod 7) + 1 = 5
  5·2 (mod 7) + 1 = 4
  4·2 (mod 7) + 1 = 2
  2·2 (mod 7) + 1 = 5
  5 (mod 7) = 5 = Hq(P)
We can still compute Hq(Tr) from Hq(Tr−1), using 2^m (mod q) = 2·(2^{m−1} (mod q)) (mod q).
Intermediate values stay small (< 2q)!

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm
  Choose a positive integer I.
  Pick a random prime q ≤ I, and compute P’s fingerprint Hq(P).
  For each position r in T, compute Hq(Tr) and test whether it equals Hq(P). If the numbers are equal, either
    declare a probable match (randomized algorithm), or
    check and declare a definite match (deterministic algorithm).
Running time: excluding verification, O(n+m).
The randomized algorithm is correct w.h.p.; the deterministic algorithm has expected running time O(n+m).
Proof on the board
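A minimal Python sketch of the scan just described (mine, not the course's code); a fixed prime q is used for simplicity, whereas the real algorithm picks q at random, and the verification step removes false matches.

  def karp_rabin(T, P, q=2**31 - 1):
      n, m = len(T), len(P)
      if m > n: return []
      pow_m = pow(2, m, q)                      # 2^m mod q, used to drop the leading bit
      hp = ht = 0
      for i in range(m):                        # fingerprints of P and of T[0:m]
          hp = (2 * hp + int(P[i])) % q
          ht = (2 * ht + int(T[i])) % q
      matches = []
      for r in range(n - m + 1):
          if ht == hp and T[r:r+m] == P:        # explicit verification -> no false matches
              matches.append(r)
          if r + m < n:                         # roll the window: drop T[r], add T[r+m]
              ht = (2 * ht - pow_m * int(T[r]) + int(T[r + m])) % q
      return matches

  print(karp_rabin("10110101", "0101"))   # [4], i.e. position 5 in the slides' 1-based numbering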

Problem 1: Solution
Dictionary = {bzip, not, or, space}; P = bzip = 1a 0b; S = “bzip or not bzip”.
(figure: the fingerprint/scan is run over the compressed text C(S), comparing the compressed pattern with the byte-aligned, tagged codewords; each aligned position answers yes/no)
Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

(figure: the m×n matrix M for T = california and P = for; e.g. M(1,5) = 1 because f matches at position 5, M(2,6) = 1 because fo ends at position 6, and M(3,7) = 1 because the whole pattern for ends at position 7)
How does M solve the exact match problem?

How to construct M


We want to exploit bit-parallelism to compute the j-th column of M from the (j−1)-th one. Machines can perform bit and arithmetic operations between two words in constant time. Examples:
  And(A,B) is the bit-wise AND between A and B.
  BitShift(A) is the value obtained by shifting A’s bits down by one and setting the first bit to 1, e.g. BitShift((0,1,1,0,1)ᵀ) = (1,0,1,1,0)ᵀ.

Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M
We want to exploit bit-parallelism to compute the j-th column of M from the (j−1)-th one.
We define the m-length binary vector U(x) for each character x of the alphabet: U(x) is set to 1 at the positions where x appears in P.
Example: P = abaac
  U(a) = (1,0,1,1,0)ᵀ   U(b) = (0,1,0,0,0)ᵀ   U(c) = (0,0,0,0,1)ᵀ

How to construct M
  Initialize column 0 of M to all zeros.
  For j > 0, the j-th column is obtained by
    M(j) = BitShift(M(j−1)) & U(T[j])
For i > 1, entry M(i,j) = 1 iff
  (1) the first i−1 characters of P match the i−1 characters of T ending at character j−1  ⇔  M(i−1, j−1) = 1
  (2) P[i] = T[j]  ⇔  the i-th bit of U(T[j]) = 1
BitShift moves bit M(i−1, j−1) into the i-th position; ANDing it with the i-th bit of U(T[j]) establishes whether both conditions hold.
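A minimal Python sketch of this recurrence (mine, not the course's code), representing each column M(j) as an integer bit-vector: bit i−1 of the word is row i of the column.

  def shift_and(T, P):
      m = len(P)
      U = {}                                    # U[c] has bit i set iff P[i] == c
      for i, c in enumerate(P):
          U[c] = U.get(c, 0) | (1 << i)
      last = 1 << (m - 1)                       # the bit that signals a full match
      col, occ = 0, []
      for j, c in enumerate(T):
          col = ((col << 1) | 1) & U.get(c, 0)  # M(j) = BitShift(M(j-1)) & U(T[j])
          if col & last:
              occ.append(j - m + 1)             # 0-based starting position of an occurrence
      return occ

  print(shift_and("xabxabaaca", "abaac"))   # [4], i.e. the occurrence ending at position 9 (1-based)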

Examples (P = abaac, T = xabxabaaca): the columns of M are computed step by step as M(j) = BitShift(M(j−1)) & U(T[j]) for j = 1, 2, 3, …; at j = 9 the last bit M(5,9) = 1, i.e. an occurrence of P ends at position 9 of T.

Shift-And method: Complexity
  If m ≤ w, any column and any vector U() fit in a memory word ⇒ each step requires O(1) time.
  If m > w, any column and vector U() can be split into ⌈m/w⌉ memory words ⇒ each step requires O(m/w) time.
  Overall O(n(1 + m/w) + m) time.
Thus it is very fast when the pattern length is close to the word size — very often the case in practice. Recall that w = 64 bits in modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
  U(a) = (1,0,1,1,0)ᵀ   U(b) = (1,1,0,0,0)ᵀ   U(c) = (0,0,0,0,1)ᵀ
What about ‘?’, ‘[^…]’ (not)?

Problem 1: Another solution
Dictionary = {bzip, not, or, space}; P = bzip = 1a 0b; S = “bzip or not bzip”.
(figure: a Shift-And-style scan over the compressed text C(S), comparing the byte-aligned, tagged codewords; each aligned position answers yes/no)
Speed ≈ Compression ratio

Problem 2
Dictionary = {bzip, not, or, space}. Given a pattern P, find all the occurrences in S of all terms containing P as a substring.
Example: P = o; the matching terms are not = 1g 0g 0a and or = 1g 0a 0b, and S = “bzip or not bzip” must be scanned for each of them.
Speed ≈ Compression ratio? No! Why? A scan of C(S) is needed for each term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And
  S is the concatenation of the patterns in P; R is a bitmap of length m with R[i] = 1 iff S[i] is the first symbol of a pattern.
  Use a variant of the Shift-And method searching for S:
    For any symbol c, U’(c) = U(c) & R, so U’(c)[i] = 1 iff S[i] = c and it is the first symbol of a pattern.
    At any step j: compute M(j), then take M(j) OR U’(T[j]). Why? This sets to 1 the first bit of each pattern that starts with T[j].
    Check if there are occurrences ending at j. How?

Problem 3
Dictionary = {bzip, not, or, space}. Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing at most k mismatches.
Example: P = bot, k = 2, S = “bzip or not bzip”.
(figure: the compressed text C(S) and the dictionary codewords)

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep
Our current goal: given k, find all the occurrences of P in T with up to k mismatches.
We define the matrix M^l to be an m by n binary matrix such that:
  M^l(i,j) = 1 iff there are at most l mismatches between the first i characters of P and the i characters of T ending at character j.
What is M^0? How does M^k solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing M^l: case 1
The first i−1 characters of P match a substring of T ending at j−1 with at most l mismatches, and the next pair of characters in P and T are equal:
  BitShift(M^l(j−1)) & U(T[j])

Computing M^l: case 2
The first i−1 characters of P match a substring of T ending at j−1 with at most l−1 mismatches (one more mismatch may then be spent on P[i] vs T[j]):
  BitShift(M^{l−1}(j−1))

Computing M^l
We compute M^l for all l = 0, …, k; for each j we compute M(j), M^1(j), …, M^k(j); for all l we initialize M^l(0) to the zero vector.
In order to compute M^l(j), we observe that there is a match iff case 1 or case 2 above holds:
  M^l(j) = [ BitShift(M^l(j−1)) & U(T[j]) ]  OR  BitShift(M^{l−1}(j−1))
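A minimal Python sketch of this k-mismatch recurrence (mine, not agrep's code), again with integers as bit-vectors; M[l] holds the current column of M^l.

  def shift_and_k_mismatches(T, P, k):
      m = len(P)
      U = {}
      for i, c in enumerate(P):
          U[c] = U.get(c, 0) | (1 << i)
      last = 1 << (m - 1)
      M = [0] * (k + 1)                          # columns M^0(j), ..., M^k(j)
      occ = []
      for j, c in enumerate(T):
          prev = M[:]                            # the columns at position j-1
          M[0] = ((prev[0] << 1) | 1) & U.get(c, 0)
          for l in range(1, k + 1):
              # case 1 (match) OR case 2 (spend one more mismatch)
              M[l] = (((prev[l] << 1) | 1) & U.get(c, 0)) | ((prev[l - 1] << 1) | 1)
          for l in range(k + 1):
              if M[l] & last:
                  occ.append((j - m + 1, l))     # (0-based start, mismatch bound); report once
                  break
      return occ

  print(shift_and_k_mismatches("aatatccacaa", "atcgaa", 2))
  # [(3, 2)]: the occurrence with 2 mismatches starting at position 4 (1-based), as in the example above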

Example (P = abaad, T = xabxabaaca): the columns of M^0 and M^1 computed step by step; e.g. M^1(5,9) = 1, i.e. P occurs with at most 1 mismatch ending at position 9 of T.
How much do we pay?
  The running time is O(k · n · (1 + m/w)).
  Again, the method is efficient in practice for small m.
  Only O(k) columns of M are needed at any given time, hence the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary = {bzip, not, or, space}; P = bot, k = 2; S = “bzip or not bzip”.
(figure: the Agrep scan over the dictionary/compressed text reports the term not = 1g 0g 0a, which matches P with ≤ 2 mismatches)

Agrep: more sophisticated operations
The Shift-And method can solve other ops.
The edit distance between two strings p and s is d(p,s) = the minimum number of operations needed to transform p into s via three ops:
  Insertion: insert a symbol in p
  Deletion: delete a symbol from p
  Substitution: change a symbol of p into a different one
Example: d(ananas, banane) = 3
Search by regular expressions — example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding
  γ(x) = 0^(Length−1) followed by x in binary, where x > 0 and Length = ⌊log2 x⌋ + 1
  e.g., 9 is represented as <000, 1001>.
  The γ-code for x takes 2·⌊log2 x⌋ + 1 bits (i.e. a factor of 2 from optimal).
  It is optimal for Pr(x) = 1/(2x²) and i.i.d. integers.
It is a prefix-free encoding…


Given the following sequence of γ-coded integers, reconstruct the original sequence:
  0001000001100110000011101100111
(answer: 8, 6, 3, 59, 7)
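A minimal Python sketch of γ-encoding and γ-decoding (mine, just to make the exercise mechanical).

  def gamma_encode(x):
      assert x > 0
      b = bin(x)[2:]                       # x in binary, MSB first
      return "0" * (len(b) - 1) + b        # (Length-1) zeros, then x in binary

  def gamma_decode(bits):
      out, i = [], 0
      while i < len(bits):
          l = 0
          while bits[i] == "0":            # count the leading zeros = Length-1
              l += 1; i += 1
          out.append(int(bits[i:i + l + 1], 2))
          i += l + 1
      return out

  print(gamma_encode(9))                                   # 0001001
  print(gamma_decode("0001000001100110000011101100111"))   # [8, 6, 3, 59, 7]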

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i); recall that |γ(i)| ≤ 2·log i + 1.
How good is this approach wrt Huffman? Compression ratio ≤ 2·H0(s) + 1.
Key fact:  1 ≥ Σ_{j=1..i} pj ≥ i·pi  ⇒  i ≤ 1/pi.
Hence the cost of the encoding is
  Σ_{i=1..|S|} pi·|γ(i)|  ≤  Σ_{i=1..|S|} pi·[2·log(1/pi) + 1]  =  2·H0(X) + 1
Not much worse than Huffman, and improvable to H0(X) + 2 + …

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory. Properties:
  It exploits temporal locality, and it is dynamic.
  X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n² log n) bits, MTF = O(n log n) + n² bits
Not much worse than Huffman… but it may be far better.

MTF: how good is it?
Encode the emitted integers via γ-coding: |γ(i)| ≤ 2·log i + 1.
Put S in front and bound the cost of encoding (n_x = #occurrences of symbol x, p_{x,i} = its i-th position):
  O(|S| log |S|) + Σ_{x∈S} Σ_{i=2..n_x} |γ(p_{x,i} − p_{x,i−1})|
By Jensen’s inequality this is
  ≤ O(|S| log |S|) + Σ_{x∈S} n_x · [2·log(N/n_x) + 1] = O(|S| log |S|) + N·[2·H0(X) + 1]
hence La[mtf] ≤ 2·H0(X) + O(1).

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then e.g. abbbaacccca ⇒ (a,1),(b,3),(a,2),(c,4),(a,1).
In the case of binary strings ⇒ just the run lengths and one starting bit.
There is a memory. Properties:
  It exploits spatial locality, and it is a dynamic code.
  X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = Θ(n² log n) > Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol an interval in [0,1) of width equal to its probability, starting at the cumulative probability
  f(i) = Σ_{j=1..i−1} p(j)
e.g. with p(a) = .2, p(b) = .5, p(c) = .3:  f(a) = .0, f(b) = .2, f(c) = .7, i.e. a = [0,.2), b = [.2,.7), c = [.7,1).
The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7)).

Arithmetic Coding: Encoding Example
Coding the message sequence bac:
  start with [0,1);  after b ⇒ [.2,.7);  after a ⇒ [.2,.3);  after c ⇒ [.27,.3).
The final sequence interval is [.27,.3).

Arithmetic Coding
To code a sequence of symbols c1 c2 … cn with probabilities p[c], use:
  l0 = 0,  li = li−1 + si−1 · f[ci]
  s0 = 1,  si = si−1 · p[ci]
where f[c] is the cumulative probability up to symbol c (not included).
The final interval size is  sn = Π_{i=1..n} p[ci].
The interval [ln, ln + sn) for a message sequence will be called the sequence interval.
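A minimal Python sketch of this interval computation (the real-valued version, not the integer/scaling one), on the running distribution p(a)=.2, p(b)=.5, p(c)=.3.

  def sequence_interval(msg, p):
      f, acc = {}, 0.0
      for s in p:                                  # f[c] = cumulative prob. up to c (excluded)
          f[s] = acc; acc += p[s]
      l, size = 0.0, 1.0
      for c in msg:
          l = l + size * f[c]                      # l_i = l_{i-1} + s_{i-1} * f[c_i]
          size = size * p[c]                       # s_i = s_{i-1} * p[c_i]
      return l, l + size

  print(sequence_interval("bac", {"a": .2, "b": .5, "c": .3}))   # (0.27, 0.3) up to float rounding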

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message has length 3:
  .49 ∈ [.2,.7) ⇒ b;  inside it, .49 ∈ [.3,.55) ⇒ b;  inside that, .49 ∈ [.475,.55) ⇒ c.
The message is bbc.

Representing a real number
Binary fractional representation:  .75 = .11,   1/3 = .0101…,   11/16 = .1011
Algorithm:  1. x = 2·x;   2. if x < 1 output 0;   3. else x = x − 1 and output 1.
So how about just using the shortest binary fractional representation inside the sequence interval?
e.g. [0,.33) → .01,   [.33,.66) → .1,   [.66,1) → .11

Representing a code interval
Binary fractional numbers can be viewed as intervals by considering all their completions:
  .11  → min .110…, max .111… → [.75, 1.0)
  .101 → min .1010…, max .1011… → [.625, .75)
We will call this the code interval.
Selecting the code interval: to find a prefix code, find a binary fractional number whose code interval is contained in the sequence interval (a dyadic number); e.g. sequence interval [.61,.79), code interval (.101) = [.625,.75).
One can use L + s/2 truncated to 1 + ⌈log(1/s)⌉ bits.

Bound on Arithmetic length
Note that −log s + 1 = log(2/s).
Theorem: for a text of length n, the arithmetic encoder generates at most
  1 + ⌈log(1/s)⌉ = 1 + ⌈log Π_i (1/p_i)⌉ ≤ 2 + Σ_{j=1..n} log(1/p_j) = 2 + Σ_{k=1..|S|} n·p_k·log(1/p_k) = 2 + n·H0  bits.
In practice nH0 + 0.02·n bits, because of rounding.

Integer Arithmetic Coding
The problem is that operations on arbitrary-precision real numbers are expensive. Key ideas of the integer version:
  Keep integers in the range [0..R), where R = 2^k.
  Use rounding to generate the integer interval.
  Whenever the sequence interval falls into the top, bottom or middle half, expand the interval by a factor of 2. Integer arithmetic coding is an approximation.
Integer Arithmetic (scaling):
  If l ≥ R/2 (top half): output 1 followed by m 0s; set m = 0; the interval is expanded by 2.
  If u < R/2 (bottom half): output 0 followed by m 1s; set m = 0; the interval is expanded by 2.
  If l ≥ R/4 and u < 3R/4 (middle half): increment m; the interval is expanded by 2.
  In all other cases, just continue…

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts   (String = ACCBACCACBA B, k = 2)

Context ∅:    A = 4   B = 2   C = 5   $ = 3

Context A:    C = 3   $ = 1
Context B:    A = 2   $ = 1
Context C:    A = 1   B = 2   C = 2   $ = 3

Context AC:   B = 1   C = 2   $ = 2
Context BA:   C = 1   $ = 1
Context CA:   C = 1   $ = 1
Context CB:   A = 2   $ = 1
Context CC:   A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
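A minimal Python decoder for (d, len, c) triples (mine, not gzip's): copying byte by byte makes the overlapping case len > d come out right automatically, exactly as argued above.

  def lz77_decode(triples):
      out = []
      for d, length, c in triples:
          start = len(out) - d
          for i in range(length):
              out.append(out[start + i])   # may read characters written by this same copy
          out.append(c)
      return "".join(out)

  # the windowed example from the slides:
  print(lz77_decode([(0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), (3, 3, 'a'), (1, 2, 'c')]))
  # aacaacabcabaaac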

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
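A minimal Python sketch of LZW encoding (mine); the slides' dictionary is initialized with the 256 ASCII codes, while here it starts from just the example's symbols so the emitted ids stay readable.

  def lzw_encode(text, alphabet):
      dictionary = {c: i for i, c in enumerate(alphabet)}
      out, S = [], ""
      for c in text:
          if S + c in dictionary:
              S += c                               # keep extending the current match
          else:
              out.append(dictionary[S])            # emit the id of the longest match S
              dictionary[S + c] = len(dictionary)  # add Sc to the dictionary
              S = c
      if S:
          out.append(dictionary[S])
      return out

  print(lzw_encode("aabaacabababacb", "abc"))
  # [0, 0, 1, 3, 2, 4, 8, 2, 1]  (cf. the slides' 112 112 113 256 114 257 261 114 ...)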

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform   (1994)
Let us be given a text T = mississippi#. Write all its cyclic rotations, one per row:
  mississippi#
  ississippi#m
  ssissippi#mi
  sissippi#mis
  issippi#miss
  ssippi#missi
  sippi#missis
  ippi#mississ
  ppi#mississi
  pi#mississip
  i#mississipp
  #mississippi
Sort the rows lexicographically: the first column is F = #iiiimppssss, and the last column is L = ipssm#pissii = BWT(T).

A famous example

Much
longer...

A useful tool: the L → F mapping
How do we map L’s characters onto F’s characters? We need to distinguish equal characters in F…
Take two equal characters of L and rotate their rows rightward by one: they keep the same relative order, so the k-th occurrence of a character in L corresponds to the k-th occurrence of that character in F.

The BWT is invertible
Two key properties:
  1. The LF-array maps L’s characters to F’s characters.
  2. L[i] precedes F[i] in T.
Reconstruct T backward:  T = … i p p i #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}

How to compute the BWT?
Take the suffix array of T = mississippi#: SA = 12 11 8 5 2 1 10 9 7 4 6 3. The rows of the BWT matrix, in sorted order, correspond to the suffixes in SA order.
We said that L[i] precedes F[i] in T; indeed, given SA and T, we have L[i] = T[SA[i]−1] (e.g. L[3] = T[SA[3]−1] = T[7]).

How to construct SA from T?
SA = 12 11 8 5 2 1 10 9 7 4 6 3, i.e. the starting positions of the sorted suffixes of T = mississippi#:
  #, i#, ippi#, issippi#, ississippi#, mississippi#, pi#, ppi#, sippi#, sissippi#, ssippi#, ssissippi#
Elegant but inefficient (explicitly sorting the suffixes):
  • Θ(n² log n) time in the worst case
  • Θ(n log n) cache misses or I/O faults

Many algorithms, now...
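For reference, a minimal Python sketch (mine) of the quadratic, "elegant but inefficient" route: BWT by sorting all rotations, and inversion via the LF-mapping; it assumes the text ends with a unique sentinel '#' smaller than every other character.

  def bwt(T):
      rotations = sorted(T[i:] + T[:i] for i in range(len(T)))
      return "".join(row[-1] for row in rotations)

  def inverse_bwt(L):
      n = len(L)
      order = sorted(range(n), key=lambda i: (L[i], i))  # k-th c in L maps to k-th c in F
      LF = [0] * n
      for f_row, l_pos in enumerate(order):
          LF[l_pos] = f_row                              # row of F holding this occurrence of L[l_pos]
      out, r = [], 0                                     # row 0 starts with the sentinel
      for _ in range(n):
          out.append(L[r])                               # L[r] precedes F[r] in T
          r = LF[r]
      s = "".join(reversed(out))                         # equals sentinel + T[:-1]
      return s[1:] + s[0]                                # rotate the sentinel back to the end

  L = bwt("mississippi#")
  print(L)                    # ipssm#pissii
  print(inverse_bwt(L))       # mississippi#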

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…
  Physical network graph: V = routers, E = communication links.
  The “cosine” graph (undirected, weighted): V = static web pages, E = semantic distance between pages.
  Query-Log graph (bipartite, weighted): V = queries and URLs, E = (q,u) if u is a result for q and has been clicked by some user who issued q.
  Social graph (undirected, unweighted): V = users, E = (x,y) if x knows y (facebook, address book, email, …).

Definition
Directed graph G = (V,E): V = URLs, E = (u,v) if u has a hyperlink to v. Isolated URLs are ignored (no IN & no OUT).
Three key properties; the first:
  Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1.
The In-degree distribution (Altavista crawl 1999, WebBase crawl 2001): the indegree follows a power-law distribution
  Pr[in-degree(u) = k] ∝ 1/k^α,  α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E): V = URLs, E = (u,v) if u has a hyperlink to v. Isolated URLs are ignored (no IN, no OUT).
Three key properties:
  Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1.
  Locality: usually most of the hyperlinks point to other URLs on the same host (about 80%).
  Similarity: pages close in lexicographic order tend to share many outgoing lists.

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:
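A tiny Python sketch of the gap computation in the formula above (the successor list below is made up for illustration; how the negative first entry is then mapped to a natural number is a coding detail left out here).

  def gaps(x, successors):
      out, prev = [], None
      for s in successors:
          out.append(s - x if prev is None else s - prev - 1)   # first gap vs the node id x
          prev = s
      return out

  print(gaps(15, [13, 15, 16, 17, 18, 19, 23, 24, 203]))
  # [-2, 1, 0, 0, 0, 0, 3, 0, 178]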

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression   (one-to-one)
Problem: we have two files f_known and f_new, and the goal is to compute a file f_d of minimum size such that f_new can be derived from f_known and f_d.
  Assume that block moves and copies are allowed.
  Find an optimal covering set of f_new based on f_known.
  The LZ77 scheme provides an efficient, optimal solution: f_known is the “previously encoded text”, so compress f_known·f_new starting from f_new.
  zdelta is one of the best implementations.
           Emacs size   Emacs time
  uncompr  27Mb         ---
  gzip     8Mb          35 secs
  zdelta   1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: we wish to compress a group of files F — useful on a dynamic collection of web pages, back-ups, …
  Apply pairwise zdelta: find for each f ∈ F a good reference.
  Reduction to the Min Branching problem on DAGs:
    build a weighted graph G_F with nodes = files and weights = zdelta-sizes;
    insert a dummy node connected to all files, whose edge weights are the gzip-coding sizes;
    compute the min branching = directed spanning tree of minimum total cost covering G’s nodes.
(figure: a small example graph with edge weights such as 20, 123, 220, 620, 2000)
           space   time
  uncompr  30Mb    ---
  tgz      20%     linear
  THIS     8%      quadratic

Improvement: what about many-to-one compression (a group of files)?
Problem: constructing G is very costly — n² edge calculations (zdelta executions). We wish to exploit some pruning approach:
  Collection analysis: cluster the files that appear similar and are thus good candidates for zdelta-compression; build a sparse weighted graph G’_F containing only edges between those pairs of files.
  Assign weights: estimate appropriate edge weights for G’_F, thus saving zdelta executions. Nonetheless, strictly n² time.
           space    time
  uncompr  260Mb    ---
  tgz      12%      2 mins
  THIS     8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
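A rough Python sketch of the server side of an rsync-like exchange (mine, not rsync's wire format): given the hashes of f_old's blocks, encode f_new as a mix of block references and literal bytes. Real rsync combines a 4-byte rolling hash with MD5 so the window can slide one byte at a time cheaply.

  import hashlib

  def block_hashes(data, B):
      return {hashlib.md5(data[i:i+B]).hexdigest(): i // B
              for i in range(0, len(data), B)}

  def encode_new(f_new, old_hashes, B):
      out, j = [], 0
      while j < len(f_new):
          h = hashlib.md5(f_new[j:j+B]).hexdigest()
          if len(f_new) - j >= B and h in old_hashes:
              out.append(("block", old_hashes[h]))   # reference to a block of f_old
              j += B
          else:
              out.append(("lit", f_new[j:j+1]))      # literal byte; slide by one
              j += 1
      return out

  old = b"the quick brown fox jumps over the lazy dog"
  new = b"the quick brown cat jumps over the lazy dog"
  print(encode_new(new, block_hashes(old, 8), 8))
  # block references for the unchanged 8-byte blocks, literals around the change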

Rsync: some experiments   (compressed size in KB, slightly outdated numbers)
           gcc size   emacs size
  total    27288      27326
  gzip     7563       8577
  zdelta   227        1431
  rsync    964        4452
Factor 3–5 gap between rsync and zdelta !!

A new framework: zsync
The server sends the hashes (unlike the client in rsync), and the client checks them. The server deploys the common f_ref to compress the new f_tar (rsync compresses just it).
A multi-round protocol: k blocks of n/k elements, log(n/k) levels.
If the distance is k, then on each level at most k hashes fail to find a match in the other file; the communication complexity is O(k lg n lg(n/k)) bits.

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol: k blocks of n/k elements, log(n/k) levels.
If the distance is k, then on each level at most k hashes fail to find a match in the other file; the communication complexity is O(k lg n lg(n/k)) bits.


The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
Running times of the brute-force Θ(n³) and Θ(n²) solutions for growing n:

          4K    8K    16K   32K    128K   256K   512K   1M
    n³    22s   3m    26m   3.5h   28h    --     --     --
    n²    0     0     0     1s     26s    106s   7m     28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm

  sum = 0; max = -1;
  for i = 1, ..., n do
      if (sum + A[i] ≤ 0) then sum = 0;
      else { sum += A[i]; max = MAX{max, sum}; }

Note:
• sum < 0 right before OPT starts, so the prefix preceding OPT is discarded;
• sum > 0 within OPT, so OPT itself is never truncated.
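A runnable version of this linear-time scan (Python, illustrative; names are mine):

def max_subarray(A):
    # Reset the running sum when it cannot contribute, otherwise extend it.
    best, s = float("-inf"), 0
    for x in A:
        if s + x <= 0:
            s = 0
        else:
            s += x
            best = max(best, s)
    return best

print(max_subarray([2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]))   # 12 = 6+1-2+4+3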

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!
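A compact illustrative sketch of the X-way merging step (Python; disk pages are replaced by in-memory lists, and a heap plays the role of the min over the X run fronts):

import heapq

def multiway_merge(runs):
    # runs = the sorted runs produced in pass 1; the heap holds one front element per run.
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        val, i, j = heapq.heappop(heap)      # minimum among the current fronts
        out.append(val)                      # append to the output buffer
        if j + 1 < len(runs[i]):             # "fetch" the next item of run i
            heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
    return out

print(multiway_merge([[1, 2, 5, 7, 9, 10], [2, 7, 8, 9, 13, 19], [3, 4, 11, 12, 15, 17]]))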

Can compression help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm

  Use a pair of variables: the current candidate X and a counter C (start with X = first item, C = 1).
  For each subsequent item s of the stream:
      if (X == s) then C++
      else { C--; if (C == 0) { X = s; C = 1; } }
  Return X;

Proof sketch.
If the returned X ≠ y, then every occurrence of y has been cancelled by a distinct "negative" mate (a decrement caused by a different item). Hence the number of mates is ≥ #occ(y), so N ≥ 2·#occ(y), contradicting #occ(y) > N/2.
(Problems arise only if the mode occurs ≤ N/2 times: in that case the returned X may be wrong.)
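A minimal sketch of the two-variable scan (Python; the stream is just a string here):

def majority_candidate(stream):
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1
        elif X == s:
            C += 1
        else:
            C -= 1
    return X          # guaranteed to be the mode only if the mode occurs > N/2 times

print(majority_candidate("bacccdcbaaaccbccc"))   # 'c', which indeed occurs 9 > 17/2 times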

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix   (t = 500K terms  ×  n = 1 million docs)

              Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
  Antony             1                1             0          0       0        1
  Brutus             1                1             0          1       0        0
  Caesar             1                1             0          1       1        1
  Calpurnia          0                1             0          0       0        0
  Cleopatra          1                0             0          0       0        0
  mercy              1                0             1          1       1        1
  worser             1                0             1          1       1        0

(entry = 1 if the play contains the word, 0 otherwise)

Space is 500Gb !

Solution 2: Inverted index

  Brutus      2 4 8 16 32 64 128
  Calpurnia   1 2 3 5 8 13 21 34
  Caesar      13 16

We can still do better, i.e. reach 30-50% of the original text:
1. Typically about 12 bytes per posting
2. We have 10⁹ total postings  at least 12Gb of space
3. Compressing the 6Gb of documents gets 1.5Gb of data
 A better index, but still >10 times the (compressed) text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
   H(S) = Σ_{s ∈ S} p(s) · log₂ (1/p(s))   bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
   La(C) = Σ_{s ∈ S} p(s) · L[s]

We say that a prefix code C is optimal if for all prefix codes C′, La(C) ≤ La(C′).

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
[Figure: Huffman tree for the example: a(.1) and b(.2) merge into (.3); (.3) and c(.2) merge into (.5); (.5) and d(.5) merge into the root (1); branches are labeled 0/1.]
a=000, b=001, c=01, d=1
There are 2^(n-1) “equivalent” Huffman trees

What about ties (and thus, tree depth) ?
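The example can be reproduced with a few lines of Python (an illustrative sketch using a heap; it is not the canonical-Huffman construction discussed below):

import heapq

def huffman_codes(probs):
    # Repeatedly merge the two least-probable subtrees; the counter breaks ties.
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    nxt = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, nxt, merged))
        nxt += 1
    return heap[0][2]

codes = huffman_codes({"a": .1, "b": .2, "c": .2, "d": .5})
print(codes)   # codeword lengths 3, 3, 2, 1 as in the slide (one of the equivalent optimal trees)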

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
abc...  00000101          101001...  dcb

[Figure: the same Huffman tree as above, with its 0/1 branch labels]

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
   −log₂(0.999) ≈ 0.00144 bits

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

   H(s) = Σ_{i=1..m} 2^(m−i) · s[i]

P = 0101
H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5

s = s′ if and only if H(s) = H(s′)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

   H(T_r) = 2·H(T_{r−1}) − 2^m·T[r−1] + T[r+m−1]

T = 10110101
T₁ = 1011,  T₂ = 0110

H(T₁) = H(1011) = 11
H(T₂) = H(0110) = 2·11 − 2⁴·1 + 0 = 22 − 16 + 0 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
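An illustrative Python sketch of the whole scheme on the binary example above (q = 7 is kept tiny only to mirror the slides; a real implementation picks a random prime of Θ(log n) bits):

def karp_rabin(T, P, q=7):
    n, m = len(T), len(P)
    pow_m = pow(2, m, q)                         # 2^m (mod q)
    hp = ht = 0
    for i in range(m):                           # Hq(P) and Hq(T_1), incrementally
        hp = (2 * hp + int(P[i])) % q
        ht = (2 * ht + int(T[i])) % q
    occ = []
    for r in range(n - m + 1):
        if ht == hp and T[r:r + m] == P:         # verify, to rule out false matches
            occ.append(r + 1)                    # 1-based positions, as in the slides
        if r + m < n:                            # roll: Hq of the next window
            ht = (2 * ht - pow_m * int(T[r]) + int(T[r + m])) % q
    return occ

print(karp_rabin("10110101", "0101"))   # [5]: the match found in the example above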

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with i-th bit of U(T[j]) to estabilish if both are true

An example j=1
1 2 3 4 5 6 7 8 9 10

n
m

1

12345

1

0

P=abaac

2

0

3

0

4

0

5

0

T=xabxabaaca

0
 
0
U (x)  0
 
0
 
0

2

3

4

5

6

7

1 
 
0
BitShift ( M ( 0 )) & U (T [1])   0  &
 
0
 
0

8

9

0 0
   
0 0
0  0
   
0 0
   
0 0

1
0

An example j=2
1 2 3 4 5 6 7 8 9 10

n
m

1

2

12345

1

0

1

P=abaac

2

0

0

3

0

0

4

0

0

5

0

0

T=xabxabaaca

1 
 
0
U (a )   1 
 
1 
 
0

3

4

5

6

7

1 
 
0
BitShift ( M (1)) & U (T [ 2 ])   0  &
 
0
 
0

8

9

1  1 
   
0 0
1    0 
   
1   0 
   
0 0

1
0

An example j=3
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

12345

1

0

1

0

P=abaac

2

0

0

1

3

0

0

0

4

0

0

0

5

0

0

0

T=xabxabaaca

0
 
1 
U (b )   0 
 
0
 
0

4

5

6

7

1 
 
1 
BitShift ( M ( 2 )) & U (T [ 3 ])   0  &
 
0
 
0

8

9

0 0
   
1  1 
0  0
   
0 0
   
0 0

1
0

An example j=9
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

4

5

6

7

8

9

12345

1

0

1

0

0

1

0

1

1

0

P=abaac

2

0

0

1

0

0

1

0

0

0

3

0

0

0

0

0

0

1

0

0

4

0

0

0

0

0

0

0

1

0

5

0

0

0

0

0

0

0

0

1

T=xabxabaaca

0
 
0
U (c )   0 
 
0
 
1 

1 
 
1 
BitShift ( M ( 8 )) & U (T [ 9 ])   0  &
 
0
 
1 

0 0
   
0 0
0  0
   
0 0
   
1  1 

1
0

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.
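A bit-parallel sketch of the plain Shift-And scan analyzed above (Python; an arbitrary-precision integer plays the role of the machine word, so it is illustrative rather than a faithful w-bit implementation):

def shift_and(T, P):
    m = len(P)
    U = {}                                    # U[c]: bit i-1 set iff P[i] = c
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M, occ = 0, []
    for j, c in enumerate(T, start=1):
        M = ((M << 1) | 1) & U.get(c, 0)      # BitShift(M(j-1)) & U(T[j])
        if M & (1 << (m - 1)):                # M(m,j) = 1: an occurrence ends at j
            occ.append(j - m + 1)             # 1-based starting position
    return occ

print(shift_and("xabxabaaca", "abaac"))   # [5], as in the worked example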

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M ( j - 1))  U (T [ j ])
l

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M

l -1

( j - 1))

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

   M^l(j) = [ BitShift(M^l(j−1)) & U(T[j]) ]  |  BitShift(M^(l−1)(j−1))

Example M1
1 2 3 4 5 6 7 8 910

M1=

T=xabxabaaca
P=
abaad

1

2

3

4

5

6

7

8

9

1
0

1

1

1

1

1

1

1

1

1

1

1

2

0

0

1

0

0

1

0

1

1

0

3

0

0

0

1

0

0

1

0

0

1

4

0

0

0

0

1

0

0

1

0

0

5

0

0

0

0

0

0

0

0

1

0

M0=

1

2

3

4

5

6

7

8

9

1
0

1

0

1

0

0

1

0

1

1

0

1

2

0

0

1

0

0

1

0

0

0

0

3

0

0

0

0

0

0

1

0

0

0

4

0

0

0

0

0

0

0

1

0

0

5

0

0

0

0

0

0

0

0

0

0

How much do we pay?





The running time is O(k·n·(1 + m/w)).
Again, the method is practically efficient for small m.
Moreover, only O(k) columns of M are needed at any given time, hence the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Delection: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding

   γ(x) = 0^(Length−1) · (x in binary),   for x > 0, with Length = ⌊log₂ x⌋ + 1

e.g., 9 is represented as <000, 1001>.

The γ-code for x takes 2·⌊log₂ x⌋ + 1 bits (i.e. a factor of 2 from the optimal ⌊log₂ x⌋ + 1).

It is optimal for Pr(x) = 1/(2x²) and i.i.d. integers, and it is a prefix-free encoding…


Given the following sequence of γ-coded integers, reconstruct the original sequence:

0001000001100110000011101100111

(Answer: 8, 6, 3, 59, 7)
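A small sketch of γ-encoding and decoding (Python), which also checks the exercise above:

def gamma_encode(x):                     # x > 0
    b = bin(x)[2:]                       # x in binary: Length = floor(log2 x) + 1 bits
    return "0" * (len(b) - 1) + b        # (Length - 1) zeros, then x in binary

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":            # the run of zeros gives Length - 1
            z += 1
            i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out

print(gamma_encode(9))                                       # 0001001
print(gamma_decode("0001000001100110000011101100111"))       # [8, 6, 3, 59, 7]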

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How good is this approach compared to Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
1 ≥ Σ_{i=1,...,x} pᵢ ≥ x · pₓ   ⟹   x ≤ 1/pₓ

How good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log₂ i + 1.
The cost of the encoding is (recall i ≤ 1/pᵢ):

   Σ_{i=1,...,|S|} pᵢ · |γ(i)|  ≤  Σ_{i=1,...,|S|} pᵢ · [2·log₂(1/pᵢ) + 1]  =  2·H₀(X) + 1

Not much worse than Huffman, and improvable to H₀(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1n 2n 3n… nn  Huff = O(n2 log n), MTF = O(n log n) + n2

Not much worse than Huffman,
...but it may be far better
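A tiny illustrative sketch of the MTF transform (Python):

def mtf_encode(s, alphabet):
    L, out = list(alphabet), []
    for c in s:
        i = L.index(c)            # 1) output the position of c in L (0-based here)
        out.append(i)
        L.insert(0, L.pop(i))     # 2) move c to the front of L
    return out

def mtf_decode(codes, alphabet):
    L, out = list(alphabet), []
    for i in codes:
        out.append(L[i])
        L.insert(0, L.pop(i))
    return "".join(out)

codes = mtf_encode("aaabbbbba", "abc")
print(codes)                            # [0, 0, 0, 1, 0, 0, 0, 0, 1]: runs become runs of 0
print(mtf_decode(codes, "abc"))         # aaabbbbba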

MTF: how good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log₂ i + 1.
Put S at the front of the list and consider the cost of encoding: for a symbol x occurring at positions p^x_1 < ... < p^x_{n_x}, each emitted MTF value is at most the gap from the previous occurrence of x, so the total cost is at most

   O(|S| log |S|) + Σ_{x=1..|S|} Σ_{i=2..n_x} |γ(p^x_i − p^x_{i−1})|

By Jensen's inequality this is

   ≤ O(|S| log |S|) + Σ_{x=1..|S|} n_x · [2·log₂(N/n_x) + 1]
   = O(|S| log |S|) + N · [2·H₀(X) + 1]

Hence La[mtf] ≤ 2·H₀(X) + O(1).

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1n 2n 3n… nn 

There is a memory

Huff(X) = Q(n2 log n) > Rle(X) = Q(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.
1.0

c = .3

0.7

i -1

f (i ) 



p( j)

j 1

b = .5
0.2
0.0

f(a) = .0, f(b) = .2, f(c) = .7
a = .2

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
1.0

0.7
c = .3

0.7

c = .3
0.55

0.0

a = .2

0.3
0.2

c = .3

0.27
b = .5

b = .5
0.2

0.3

a = .2

b = .5
0.22

a = .2

0.2

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l 0  0 l i  l i -1  s i -1 * f c i 

s0  1

s i  s i -1 * p c i 

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

n

sn 

 p c 
i

i 1

The interval for a message sequence will be called the
sequence interval
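A minimal sketch of the interval computation (Python, with exact fractions to avoid rounding; it reproduces the bac example above):

from fractions import Fraction as F

def sequence_interval(msg, p, f):
    # l_i = l_{i-1} + s_{i-1} * f[c_i],  s_i = s_{i-1} * p[c_i]
    l, s = F(0), F(1)
    for c in msg:
        l, s = l + s * f[c], s * p[c]
    return l, l + s                       # the sequence interval [l, l+s)

p = {"a": F(2, 10), "b": F(5, 10), "c": F(3, 10)}
f = {"a": F(0), "b": F(2, 10), "c": F(7, 10)}    # cumulative prob., symbol excluded
print(sequence_interval("bac", p, f))            # (27/100, 3/10), i.e. [.27, .3)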

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
1.0

0.7
c = .3

0.7
0.49

b = .5

c = .3

0.55
0.49

0.2

0.0

0.55

b = .5

0.3
a = .2

The message is bbc.

0.2

0.49
0.475

c = .3

b = .5
0.35

a = .2

0.3

a = .2

Representing a real number
Binary fractional representation:
. 75

 . 11

1/ 3

 . 01 01

11 / 16

 . 1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
m in

m ax

interval

.11

.110

.111

[.75 ,1.0 )

.101

.1010

.1011

[.625 , .75 )

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
Context
Empty

Counts
A
B
C
$

=
=
=
=

4
2
5
3

Context
A
B
C

Counts
C
$
A
$
A
B
C
$

=
=
=
=
=
=
=
=

3
1
2
1
1
2
2
3

Context
AC

BA
CA
CB
CC

String = ACCBACCACBA B

k=2

Counts
B
C
$
C
$
C
$
A
$
A
B
$

=
=
=
=
=
=
=
=
=
=
=
=

1
2
2
1
1
1
1
2
1
1
1
2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
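A sketch of the LZ77 decoder (Python), including the overlapping-copy case just discussed:

def lz77_decode(triples):
    out = []
    for d, length, c in triples:
        for _ in range(length):
            out.append(out[-d])     # char-by-char copy, so it works even when length > d
        out.append(c)
    return "".join(out)

# The windowed example above:
print(lz77_decode([(0, 0, "a"), (1, 1, "c"), (3, 4, "b"), (3, 3, "a"), (1, 2, "c")]))
# -> aacaacabcabaaac
# The overlap case: seen = abcd, next codeword (2, 9, e):
print(lz77_decode([(0, 0, "a"), (0, 0, "b"), (0, 0, "c"), (0, 0, "d"), (2, 9, "e")]))
# -> abcdcdcdcdcdce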

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
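An illustrative Python sketch of the full round trip (quadratic construction via sorted rotations, fine for toy inputs; inversion follows the LF-mapping idea above):

def bwt(t):
    rots = sorted(t[i:] + t[:i] for i in range(len(t)))   # BWT matrix rows
    return "".join(r[-1] for r in rots)                   # last column L

def ibwt(L):
    F = sorted(L)
    seen, rank = {}, []
    for c in L:                        # rank[i] = # of c's in L before position i
        rank.append(seen.get(c, 0))
        seen[c] = seen.get(c, 0) + 1
    first = {}
    for i, c in enumerate(F):          # first[c] = position of the first c in F
        first.setdefault(c, i)
    LF = [first[c] + rank[i] for i, c in enumerate(L)]
    r, out = L.index("#"), []          # start from the row ending with the sentinel
    for _ in range(len(L)):
        out.append(L[r])
        r = LF[r]
    return "".join(reversed(out))

L = bwt("mississippi#")
print(L)          # ipssm#pissii
print(ibwt(L))    # mississippi#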

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Q(n2 log n) time in the worst-case
• Q(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in - degree ( u )  k ] 

a  2.1

1

k

a

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides and efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
            Emacs size   Emacs time
  uncompr   27Mb         ---
  gzip      8Mb          35 secs
  zdelta    1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

0

620

2

20

123

2000

20
220

3

1
5

            space   time
  uncompr   30Mb    ---
  tgz       20%     linear
  THIS      8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
            space   time
  uncompr   260Mb   ---
  tgz       12%     2 mins
  THIS      8%      16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

            gcc size   emacs size
  total     27288      27326
  gzip       7563       8577
  zdelta      227       1431
  rsync       964       4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Slide 88

B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with i-th bit of U(T[j]) to estabilish if both are true

An example j=1
1 2 3 4 5 6 7 8 9 10

n
m

1

12345

1

0

P=abaac

2

0

3

0

4

0

5

0

T=xabxabaaca

0
 
0
U (x)  0
 
0
 
0

2

3

4

5

6

7

1 
 
0
BitShift ( M ( 0 )) & U (T [1])   0  &
 
0
 
0

8

9

0 0
   
0 0
0  0
   
0 0
   
0 0

1
0

An example j=2
1 2 3 4 5 6 7 8 9 10

n
m

1

2

12345

1

0

1

P=abaac

2

0

0

3

0

0

4

0

0

5

0

0

T=xabxabaaca

1 
 
0
U (a )   1 
 
1 
 
0

3

4

5

6

7

1 
 
0
BitShift ( M (1)) & U (T [ 2 ])   0  &
 
0
 
0

8

9

1  1 
   
0 0
1    0 
   
1   0 
   
0 0

1
0

An example j=3
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

12345

1

0

1

0

P=abaac

2

0

0

1

3

0

0

0

4

0

0

0

5

0

0

0

T=xabxabaaca

0
 
1 
U (b )   0 
 
0
 
0

4

5

6

7

1 
 
1 
BitShift ( M ( 2 )) & U (T [ 3 ])   0  &
 
0
 
0

8

9

0 0
   
1  1 
0  0
   
0 0
   
0 0

1
0

An example j=9
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

4

5

6

7

8

9

12345

1

0

1

0

0

1

0

1

1

0

P=abaac

2

0

0

1

0

0

1

0

0

0

3

0

0

0

0

0

0

1

0

0

4

0

0

0

0

0

0

0

1

0

5

0

0

0

0

0

0

0

0

1

T=xabxabaaca

0
 
0
U (c )   0 
 
0
 
1 

1 
 
1 
BitShift ( M ( 8 )) & U (T [ 9 ])   0  &
 
0
 
1 

0 0
   
0 0
0  0
   
0 0
   
1  1 

1
0

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M ( j - 1))  U (T [ j ])
l

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M

l -1

( j - 1))

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1
1 2 3 4 5 6 7 8 910

M1=

T=xabxabaaca
P=
abaad

1

2

3

4

5

6

7

8

9

1
0

1

1

1

1

1

1

1

1

1

1

1

2

0

0

1

0

0

1

0

1

1

0

3

0

0

0

1

0

0

1

0

0

1

4

0

0

0

0

1

0

0

1

0

0

5

0

0

0

0

0

0

0

0

1

0

M0=

1

2

3

4

5

6

7

8

9

1
0

1

0

1

0

0

1

0

1

1

0

1

2

0

0

1

0

0

1

0

0

0

0

3

0

0

0

0

0

0

1

0

0

0

4

0

0

0

0

0

0

0

1

0

0

5

0

0

0

0

0

0

0

0

0

0

How much do we pay?





The running time is O(kn(1+m/w)
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Delection: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
0 000 ...........0 x in b in ary
Length-1


x > 0 and Length = log2 x +1
e.g., 9 represented as <000,1001>.



g-code for x takes 2 log2 x +1 bits

(ie. factor of 2 from optimal)



Optimal for Pr(x) = 1/2x2, and i.i.d integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8

6

3

59

7

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How much good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
1 ≥Si=1,...,x pi ≥ x * px  x ≤ 1/px

How good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):



p i  | g ( i ) |

i  1 ,..., S

This is:



i  1 ,.., S

p i  [ 2 * log

1

 1]

pi

 2 * H 0 ( X ) 1
No much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1n 2n 3n… nn  Huff = O(n2 log n), MTF = O(n log n) + n2

No much worse than Huffman
...but it may be far better

MTF: how good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
Put S in the front and consider the cost of encoding:
S

O ( S log S ) 



nx

g(p - p )

x 1 i  2

x

x

i

i -1

S

By Jensen’s:

 O ( S log S ) 

n
x 1

x

[ 2 * log

N

 1]

nx

 O ( S log S )  N * [ 2 * H 0 ( X )  1]
L a [ mtf ]  2 * H 0 ( X )  O (1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1n 2n 3n… nn 

There is a memory

Huff(X) = Q(n2 log n) > Rle(X) = Q(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.
1.0

c = .3

0.7

i -1

f (i ) 



p( j)

j 1

b = .5
0.2
0.0

f(a) = .0, f(b) = .2, f(c) = .7
a = .2

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
1.0

0.7
c = .3

0.7

c = .3
0.55

0.0

a = .2

0.3
0.2

c = .3

0.27
b = .5

b = .5
0.2

0.3

a = .2

b = .5
0.22

a = .2

0.2

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l 0  0 l i  l i -1  s i -1 * f c i 

s0  1

s i  s i -1 * p c i 

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

n

sn 

 p c 
i

i 1

The interval for a message sequence will be called the
sequence interval

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
1.0

0.7
c = .3

0.7
0.49

b = .5

c = .3

0.55
0.49

0.2

0.0

0.55

b = .5

0.3
a = .2

The message is bbc.

0.2

0.49
0.475

c = .3

b = .5
0.35

a = .2

0.3

a = .2

Representing a real number
Binary fractional representation:
. 75

 . 11

1/ 3

 . 01 01

11 / 16

 . 1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
m in

m ax

interval

.11

.110

.111

[.75 ,1.0 )

.101

.1010

.1011

[.625 , .75 )

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
Context
Empty

Counts
A
B
C
$

=
=
=
=

4
2
5
3

Context
A
B
C

Counts
C
$
A
$
A
B
C
$

=
=
=
=
=
=
=
=

3
1
2
1
1
2
2
3

Context
AC

BA
CA
CB
CC

String = ACCBACCACBA B

k=2

Counts
B
C
$
C
$
C
$
A
$
A
B
$

=
=
=
=
=
=
=
=
=
=
=
=

1
2
2
1
1
1
1
2
1
1
1
2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Q(n2 log n) time in the worst-case
• Q(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in - degree ( u )  k ] 

a  2.1

1

k

a

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides and efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
Emacs size

Emacs time

uncompr

27Mb

---

gzip

8Mb

35 secs

zdelta

1.5Mb

42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

0

620

2

20

123

2000

20
220

3

1
5

space

time

uncompr

30Mb

---

tgz

20%

linear

THIS

8%

quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
space

time

uncompr

260Mb

---

tgz

12%

2 mins

THIS

8%

16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

gcc size
total
27288
gzip
7563
zdelta
227
rsync
964

emacs size
27326
8577
1431
4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
#

0

s

i

1
ssi

ppi#

4

#

1

p

i

3

2

1

ppi#

6

pi#

7

8
5

T# = mississippi#
2 4 6 8 10

si

ppi#

i#

ppi#

11

mississippi#

12

2

1 10

9

4

3

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
5

Q(N2) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Q(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does it exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does it exist a substring of length ≥ L occurring ≥ C times ?
• Search for Lcp[i,i+C-2] whose entries are ≥ L


Slide 89

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 10^4 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and the algorithm uses all of it:
(1/B) * (p * f/(1+f) * C) ≈ 30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
Running times as the input size grows:

        4K    8K    16K    32K    128K   256K   512K   1M
n^3     22s   3m    26m    3.5h   28h    --     --     --
n^2     0     0     0      1s     26s    106s   7m     28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm

 sum = 0; max = -1;
 For i = 1,...,n do
   if (sum + A[i] ≤ 0) sum = 0;
   else { sum += A[i]; max = MAX{max, sum}; }

Note:
• sum is < 0 just before OPT starts (so it gets reset to 0 there);
• sum stays > 0 within OPT
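A runnable version of the scheme above (a sketch, not the lecture's code): a
running sum is reset as soon as it would drop to ≤ 0, and the best window sum
is tracked.

#include <vector>
#include <algorithm>
#include <iostream>

long long maxSubarraySum(const std::vector<long long>& A) {
    long long sum = 0, best = 0;          // best = 0 covers the empty window
    for (long long x : A) {
        sum = std::max(0LL, sum + x);     // reset when the running sum dies
        best = std::max(best, sum);
    }
    return best;
}

int main() {
    std::vector<long long> A = {2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7};
    std::cout << maxSubarraySum(A) << "\n";   // prints 12 (subarray 6 1 -2 4 3)
}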

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort takes Θ(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02    m = (i+j)/2;          // Divide
03    Merge-Sort(A,i,m);    // Conquer
04    Merge-Sort(A,m+1,j);
05    Merge(A,i,m,j)        // Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:
 n = 10^9 tuples  few Gbs
 Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:
 It is an indirect sort: Θ(n log2 n) random I/Os
 [5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

[Figure: the log2 N levels of the merge-sort recursion over the keys 1..19,
with the internal-memory size M marked. How do we deploy the disk/memory
features?]

If the run-size is larger than B (i.e. after the first step!!),
fetching all of it in memory for merging does not help.

 N/M runs, each sorted in internal memory (no I/Os)
 I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort

The key is to balance run-size and #runs to merge.
Sort N items with main-memory M and disk-pages B:

 Pass 1: Produce (N/M) sorted runs.
 Pass i: merge X ≤ M/B runs at a time  log_{M/B} (N/M) merge passes

[Figure: X input buffers (one per run) and one output buffer, all of B items,
kept in main memory; runs are streamed from and back to disk.]

Multiway Merging

[Figure: one buffer Bf1,…,Bfx per run (X = M/B), each with a cursor p1,…,pX on
its current page, plus an output buffer Bfo with cursor po. At every step the
minimum of Bf1[p1], Bf2[p2], …, Bfx[pX] is moved to Bfo; a run buffer is
refilled from disk when its cursor reaches B (fetch, if pi = B), and Bfo is
flushed to the merged output run when full, until EOF.]
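A sketch (not the lecture's code) of the in-memory core of one multiway merge
pass: the X sorted runs are merged with a min-heap, so each item is moved with
O(log X) comparisons while the I/O pattern stays sequential. The vectors stand
in for the disk-resident runs and the output run.

#include <functional>
#include <queue>
#include <utility>
#include <vector>

std::vector<int> multiwayMerge(const std::vector<std::vector<int>>& runs) {
    using Entry = std::pair<int, size_t>;               // (value, run index)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    std::vector<size_t> pos(runs.size(), 0);             // cursor inside each run
    for (size_t r = 0; r < runs.size(); ++r)
        if (!runs[r].empty()) heap.push({runs[r][0], r});

    std::vector<int> out;                                 // the merged output run
    while (!heap.empty()) {
        auto [v, r] = heap.top(); heap.pop();
        out.push_back(v);                                 // "flush" of Bfo is implicit here
        if (++pos[r] < runs[r].size())                    // "fetch" the next item of run r
            heap.push({runs[r][pos[r]], r});
    }
    return out;
}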

Cost of Multi-way Merge-Sort

 Number of passes = log_{M/B} #runs = log_{M/B} (N/M)
 Optimal cost = Θ((N/B) log_{M/B} (N/M)) I/Os

In practice
 M/B ≈ 1000  #passes = log_{M/B} (N/M) ≈ 1
 One multiway merge  2 passes (R/W) = few mins
 Tuning depends on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Can compression help?

 Goal: enlarge M and reduce N
  #passes = O(log_{M/B} (N/M))
  Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest possible space (i.e., assuming the mode occurs > N/2 times).

A = b a c c c d c b a a a c c b c c c

Algorithm

 Use a pair of variables <X,C>, with C = 0 initially
 For each item s of the stream:
   if (C == 0) { X = s; C = 1; }
   else if (X == s) C++;
   else C--;
 Return X;

Proof sketch

If the algorithm ended with X ≠ y, then every one of y’s occurrences would have
a distinct “negative” mate, hence the mates would be ≥ #occ(y) and thus
2 * #occ(y) ≤ N, contradicting #occ(y) > N/2.
Note: if the most frequent item occurs ≤ N/2 times, the returned X may be wrong.
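A compact sketch of the streaming majority algorithm above (one pass, O(1)
words of space); the answer is guaranteed only if a true majority exists, so a
verification pass can be added when that is not known in advance.

#include <vector>
#include <iostream>

template <typename T>
T majorityCandidate(const std::vector<T>& stream) {
    T X{};            // current candidate
    long long C = 0;  // its counter
    for (const T& s : stream) {
        if (C == 0) { X = s; C = 1; }
        else if (X == s) ++C;
        else --C;
    }
    return X;         // verify with a second pass if a majority is not guaranteed
}

int main() {
    std::vector<char> A = {'b','a','c','c','c','d','c','b','a',
                           'a','a','c','c','b','c','c','c'};
    std::cout << majorityCandidate(A) << "\n";   // prints c
}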

Toy problem #4: Indexing


Consider the following TREC collection:
 N = 6 * 10^9 chars  size = 6Gb
 n = 10^6 documents
 TotT = 10^9 total terms (avg term length is 6 chars)
 t = 5 * 10^5 distinct terms

What kind of data structure do we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million documents (columns), t = 500K terms (rows)

             Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony              1                1             0          0       0        1
Brutus              1                1             0          1       0        0
Caesar              1                1             0          1       1        1
Calpurnia           0                1             0          0       0        0
Cleopatra           1                0             0          0       0        0
mercy               1                0             1          1       1        1
worser              1                0             1          1       1        0

1 if the play contains the word, 0 otherwise.
Space is 500Gb !

Solution 2: Inverted index

Brutus     2 4 8 16 32 64 128
Calpurnia  1 2 3 5 8 13 21 34
Caesar     13 16

We can still do better: i.e. 30-50% of the original text

1. Typically use about 12 bytes per posting
2. We have 10^9 total terms  at least 12Gb space
3. Compressing the 6Gb documents gets 1.5Gb of data
Better index, but yet it is >10 times the text !!!!
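A toy sketch (not from the slides) of building such an index: each term is
mapped to the sorted list of ids of the documents that contain it.

#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>
#include <iostream>

std::map<std::string, std::vector<int>> buildIndex(const std::vector<std::string>& docs) {
    std::map<std::string, std::set<int>> tmp;
    for (int id = 0; id < (int)docs.size(); ++id) {
        std::istringstream in(docs[id]);
        for (std::string w; in >> w; ) tmp[w].insert(id);   // one posting per (term, doc)
    }
    std::map<std::string, std::vector<int>> index;
    for (auto& [term, ids] : tmp)
        index[term] = std::vector<int>(ids.begin(), ids.end());
    return index;
}

int main() {
    auto idx = buildIndex({"bzip or not bzip", "to be or not to be"});
    for (int d : idx["or"]) std::cout << d << ' ';   // prints 0 1
}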

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM into fewer bits ?
NO: they are 2^n, but the shorter compressed messages are fewer:

∑i=1,n-1 2^i = 2^n - 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self-information of s is:

i(s) = log2 (1/p(s)) = - log2 p(s)

Lower probability  higher information

Entropy is the weighted average of i(s):

H(S) = ∑s∈S p(s) * log2 (1/p(s))   bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword lengths L[s], the
average length is defined as

La(C) = ∑s∈S p(s) * L[s]

We say that a prefix code C is optimal if for
all prefix codes C’, La(C) ≤ La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal uniquely decodable code, there
exists a prefix code with the same codeword lengths and thus the same
optimal average length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}
then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any
uniquely decodable code C, we have

H(S) ≤ La(C)

Theorem (upper bound, Shannon). For any probability distribution, there
exists a prefix code C such that

La(C) ≤ H(S) + 1

The Shannon code assigns to s a codeword of ⌈log2 (1/p(s))⌉ bits.

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

[Huffman tree: a(.1) and b(.2) merge into (.3); (.3) and c(.2) merge into (.5);
(.5) and d(.5) merge into the root (1).]

a = 000, b = 001, c = 01, d = 1
There are 2^(n-1) “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
Example (with the tree of the running example):
 encoding: abc…  000 001 01 … = 00000101…
 decoding: 101001…  d c b …
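A self-contained sketch (mine, not the lecture's code) of Huffman tree
construction with a min-heap followed by codeword assignment; ties may be
broken differently than in the slide, so only the codeword lengths are
guaranteed to coincide.

#include <queue>
#include <string>
#include <vector>
#include <map>
#include <utility>
#include <iostream>

struct Node { double p; int sym; Node *l, *r; };   // sym = -1 for internal nodes

std::map<int, std::string> huffman(const std::vector<double>& prob) {
    auto cmp = [](Node* a, Node* b) { return a->p > b->p; };
    std::priority_queue<Node*, std::vector<Node*>, decltype(cmp)> pq(cmp);
    for (int s = 0; s < (int)prob.size(); ++s)
        pq.push(new Node{prob[s], s, nullptr, nullptr});
    while (pq.size() > 1) {                        // repeatedly merge the two rarest trees
        Node* a = pq.top(); pq.pop();
        Node* b = pq.top(); pq.pop();
        pq.push(new Node{a->p + b->p, -1, a, b});
    }
    std::map<int, std::string> code;               // assign 0/1 along root-to-leaf paths
    std::vector<std::pair<Node*, std::string>> stack = {{pq.top(), ""}};
    while (!stack.empty()) {
        auto [n, w] = stack.back(); stack.pop_back();
        if (n->sym >= 0) code[n->sym] = w.empty() ? "0" : w;
        else { stack.push_back({n->l, w + "0"}); stack.push_back({n->r, w + "1"}); }
    }
    return code;                                   // (node cleanup omitted for brevity)
}

int main() {   // p(a)=.1 p(b)=.2 p(c)=.2 p(d)=.5 as in the running example
    for (auto& [s, w] : huffman({0.1, 0.2, 0.2, 0.5}))
        std::cout << char('a' + s) << " = " << w << "\n";
}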

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store, for every level L of the tree:
 firstcode[L]  (the first codeword of length L; at the deepest level it is 00…0)
 Symbol[L,i], for each i in level L

This is ≤ h^2 + |S| log |S| bits, where h is the height of the tree

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:
 The model takes |S|^k * (k * log |S|) + h^2 bits (where h might be |S|^k)
 It holds that H0(S^L) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
 Huffman coding of the words, with a tagging bit, on 7 bits per byte

[Figure: the word-based Huffman tree over the words {bzip, or, not, space} and
the byte-aligned, tagged codewords of C(T) for T = “bzip or not bzip”.]

CGrep and other ideas...

P = bzip = 1a 0b

[Figure: the encoded pattern is searched directly, codeword by codeword, in
C(T) for T = “bzip or not bzip”; byte alignment gives a yes/no test at each
candidate position.]

Speed ≈ Compression ratio

You find this under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary: bzip, not, or, space        P = bzip = 1a 0b

[Figure: the compressed text C(S) of S = “bzip or not bzip”; the encoded
pattern is matched, byte-aligned, against C(S), with a yes/no answer at each
candidate codeword.]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:

 An efficient randomized algorithm that makes an error with
 small probability.
 A randomized algorithm that never errs, whose running
 time is efficient with high probability.

We will consider a binary alphabet (i.e., T ∈ {0,1}^n).

Arithmetic replaces Comparisons

 Strings are also numbers, H: strings → numbers.
 Let s be a string of length m:

 H(s) = ∑i=1,m 2^(m-i) * s[i]

 P = 0101
 H(P) = 2^3*0 + 2^2*1 + 2^1*0 + 2^0*1 = 5
 s = s’ if and only if H(s) = H(s’)

Definition:
let Tr denote the m-length substring of T starting at
position r (i.e., Tr = T[r, r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons

 We can compute H(Tr) from H(Tr-1):

 H(Tr) = 2 * H(Tr-1) - 2^m * T[r-1] + T[r+m-1]

T = 10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H(T1) = H(1011) = 11
H(T2) = H(0110) = 2*11 - 2^4*1 + 0 = 22 - 16 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P = 101111, q = 7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally:
 1*2 (mod 7) + 0 = 2
 2*2 (mod 7) + 1 = 5
 5*2 (mod 7) + 1 = 4
 4*2 (mod 7) + 1 = 2
 2*2 (mod 7) + 1 = 5
 5 (mod 7) = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1), since
2^m (mod q) = 2 * (2^(m-1) (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm

 Choose a positive integer I.
 Pick a random prime q ≤ I, and compute P’s fingerprint Hq(P).
 For each position r in T, compute Hq(Tr) and test whether it equals Hq(P).
  If the two values are equal, either
   declare a probable match (randomized algorithm),
   or check and declare a definite match (deterministic algorithm).

 Running time: excluding verification, O(n+m).
 The randomized algorithm is correct w.h.p.
 The deterministic algorithm has expected running time O(n+m).

Proof on the board
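A sketch of Karp-Rabin matching (my own choice of modulus and base, not the
lecture's parameters): the fingerprint is rolled in O(1) per shift and every
fingerprint hit is verified, so no false match is ever reported.

#include <string>
#include <vector>

std::vector<int> karpRabin(const std::string& T, const std::string& P) {
    const long long q = 1000000007LL, b = 2;   // prime modulus; base 2 for a binary alphabet
    const int n = T.size(), m = P.size();
    std::vector<int> occ;
    if (m == 0 || n < m) return occ;

    long long hp = 0, ht = 0, pw = 1;          // pw = b^(m-1) mod q
    for (int i = 0; i < m; ++i) {
        hp = (hp * b + (P[i] - '0')) % q;
        ht = (ht * b + (T[i] - '0')) % q;
        if (i) pw = (pw * b) % q;
    }
    for (int r = 0; ; ++r) {
        if (hp == ht && T.compare(r, m, P) == 0) occ.push_back(r);   // verify the hit
        if (r + m >= n) break;
        ht = ((ht - (T[r] - '0') * pw % q + q) * b + (T[r + m] - '0')) % q;  // roll the window
    }
    return occ;   // e.g. karpRabin("10110101", "0101") -> {4} (0-based)
}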

Problem 1: Solution
Dictionary: bzip, not, or, space        P = bzip = 1a 0b

[Figure: the codeword of P is compared against the byte-aligned codewords of
C(S), S = “bzip or not bzip”; the two yes’s mark the two occurrences of bzip.]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M is an m x n matrix; here T = california (columns 1..10) and P = for (rows 1..3):

          c a l i f o r n i a
          1 2 3 4 5 6 7 8 9 10
  f   1   0 0 0 0 1 0 0 0 0 0
  o   2   0 0 0 0 0 1 0 0 0 0
  r   3   0 0 0 0 0 0 1 0 0 0
How does M solve the exact match problem?

How to construct M


We want to exploit bit-parallelism to compute the j-th
column of M from the (j-1)-th one.

 Machines can perform bit and arithmetic operations between
 two words in constant time.
 Examples:
  And(A,B) is the bit-wise AND between A and B.
  BitShift(A) is the value derived by shifting A’s bits down by one position
  and setting the first bit to 1: BitShift((a1, a2, …, am)) = (1, a1, …, a(m-1))

Let w be the word size (e.g., 32 or 64 bits). We’ll assume
m = w. NOTICE: any column of M fits in a memory word.

How to construct M






 We want to exploit the bit-parallelism to compute the j-th
 column of M from the (j-1)-th one.
 We define the m-length binary vector U(x) for each character x in the
 alphabet: U(x) is set to 1 in the positions where x appears in P.

 Example: P = abaac
  U(a) = (1,0,1,1,0)   U(b) = (0,1,0,0,0)   U(c) = (0,0,0,0,1)

How to construct M



 Initialize column 0 of M to all zeros.
 For j > 0, the j-th column is obtained by

 M(j) = BitShift(M(j-1)) & U(T[j])

 For i > 1, entry M(i,j) = 1 iff
  (1) the first i-1 characters of P match the i-1 characters of T
      ending at character j-1      ⇔ M(i-1,j-1) = 1
  (2) P[i] = T[j]                  ⇔ the i-th bit of U(T[j]) is 1

 BitShift moves bit M(i-1,j-1) into the i-th position;
 AND-ing it with the i-th bit of U(T[j]) establishes whether both hold.
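A word-parallel sketch of the Shift-And scan just described (assuming m ≤ 64
so a column of M fits in one machine word; bit i-1 of the word corresponds to
row i of M).

#include <cstdint>
#include <string>
#include <vector>

std::vector<int> shiftAnd(const std::string& T, const std::string& P) {
    const int m = P.size();
    uint64_t U[256] = {0};
    for (int i = 0; i < m; ++i) U[(unsigned char)P[i]] |= 1ULL << i;   // U(x): positions of x in P

    std::vector<int> occ;
    uint64_t M = 0;                                       // column j of the matrix
    for (int j = 0; j < (int)T.size(); ++j) {
        M = ((M << 1) | 1ULL) & U[(unsigned char)T[j]];   // BitShift(M(j-1)) & U(T[j])
        if (M & (1ULL << (m - 1)))                        // row m set -> a full match ends at j
            occ.push_back(j - m + 1);
    }
    return occ;   // e.g. shiftAnd("xabxabaaca", "abaac") -> {4} (0-based)
}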

An example: T = xabxabaaca, P = abaac

[Worked figure for j = 1, 2, 3, …, 9: at each step the new column
M(j) = BitShift(M(j-1)) & U(T[j]) is computed from the previous one; at j = 9
the 5-th bit of the column is 1, witnessing the occurrence of P ending at
position 9 of T.]

Shift-And method: Complexity








If m ≤ w, any column and any vector U() fit in a memory word:
 each step requires O(1) time.
If m > w, any column and any vector U() can be split into ⌈m/w⌉ memory words:
 each step requires O(m/w) time.
Overall O(n(1 + m/w) + m) time.
Thus it is very fast when the pattern length is close to the word size,
which is very often the case in practice. Recall that w = 64 bits on
modern architectures.

Some simple extensions


We want to allow the pattern to contain special symbols,
like the character class [a-f].

P = [a-b]baac
U(a) = (1,0,1,1,0)   U(b) = (1,1,0,0,0)   U(c) = (0,0,0,0,1)

What about ‘?’, ‘[^…]’ (not) ?

Problem 1: Another solution
Dictionary: bzip, not, or, space        P = bzip = 1a 0b

[Figure: as before, the encoded pattern is searched directly in C(S) for
S = “bzip or not bzip”.]

Speed ≈ Compression ratio

Problem 2
Dictionary: bzip, not, or, space        P = o

Given a pattern P, find all the occurrences in S of all the terms containing
P as a substring.

[Figure: S = “bzip or not bzip”; the dictionary terms containing P are
not = 1g 0g 0a and or = 1g 0a 0b, and their codewords are searched in C(S).]

Speed ≈ Compression ratio? No! Why?
A scan of C(S) is needed for each term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

 S is the concatenation of the patterns in P.
 R is a bitmap of length m:
  R[i] = 1 iff S[i] is the first symbol of a pattern.

 Use a variant of the Shift-And method, searching for S:
  For any symbol c, U’(c) = U(c) AND R
   U’(c)[i] = 1 iff S[i] = c and it is the first symbol of a pattern
  For any step j,
   compute M(j)
   then M(j) OR U’(T[j]). Why?
    to set to 1 the first bit of each pattern that starts with T[j]
   Check whether there are occurrences ending at j. How?

Problem 3
Dictionary: bzip, not, or, space        P = bot, k = 2

Given a pattern P, find all the occurrences in S of all the terms containing
P as a substring, allowing at most k mismatches.

[Figure: the compressed text C(S) of S = “bzip or not bzip”.]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep

 Our current goal: given k, find all the occurrences of P in T with up to k
 mismatches.
 We define the matrix Ml to be an m by n binary matrix such that:

 Ml(i,j) = 1 iff there are no more than l mismatches between the first i
 characters of P and the i characters of T ending at character j.

 What is M0?
 How does Mk solve the k-mismatch problem?

Computing Mk

 We compute Ml for all l = 0, …, k.
 For each j compute M(j), M1(j), …, Mk(j).
 For all l, initialize Ml(0) to the zero vector.
 In order to compute Ml(j), we observe that there is a match iff one of the
 following two cases holds.

Computing Ml: case 1
 The first i-1 characters of P match a substring of T ending at j-1, with at
 most l mismatches, and the next pair of characters in P and T are equal:
  contribution  BitShift(Ml(j-1)) & U(T[j])

Computing Ml: case 2
 The first i-1 characters of P match a substring of T ending at j-1, with at
 most l-1 mismatches (and T[j] is charged as one more mismatch):
  contribution  BitShift(Ml-1(j-1))

Computing Ml

 We compute Ml for all l = 0, …, k.
 For each j compute M(j), M1(j), …, Mk(j).
 For all l, initialize Ml(0) to the zero vector.
 Combining the two cases, Ml(j) is computed as:

 Ml(j) = [ BitShift(Ml(j-1)) & U(T[j]) ]  OR  BitShift(Ml-1(j-1))
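A sketch of the k-mismatch recurrence above (m ≤ 64 assumed): Ml[l] holds
column j of M^l, and the two terms of the OR are exactly case 1 and case 2.

#include <cstdint>
#include <string>
#include <vector>

std::vector<int> shiftAndMismatch(const std::string& T, const std::string& P, int k) {
    const int m = P.size();
    uint64_t U[256] = {0};
    for (int i = 0; i < m; ++i) U[(unsigned char)P[i]] |= 1ULL << i;

    std::vector<uint64_t> Ml(k + 1, 0);                  // M^0(j), ..., M^k(j)
    std::vector<int> occ;
    for (int j = 0; j < (int)T.size(); ++j) {
        uint64_t prevShifted = 0;                        // BitShift(M^{l-1}(j-1)); empty for l = 0
        for (int l = 0; l <= k; ++l) {
            uint64_t old = Ml[l];                        // M^l(j-1)
            Ml[l] = (((old << 1) | 1ULL) & U[(unsigned char)T[j]])   // case 1
                    | prevShifted;                                    // case 2
            prevShifted = (old << 1) | 1ULL;             // BitShift(M^l(j-1)) for the next l
        }
        if (Ml[k] & (1ULL << (m - 1))) occ.push_back(j - m + 1);
    }
    return occ;   // e.g. T = "aatatccacaa", P = "atcgaa", k = 2 -> {3} (0-based start)
}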

Example M1

[Figure: the matrices M0 and M1 for T = xabxabaaca and P = abaad. With k = 1,
the 1 in row 5 of M1 at column 9 witnesses an occurrence of P with at most one
mismatch ending at position 9 of T.]

How much do we pay?

 The running time is O(kn(1 + m/w)).
 Again, the method is practically efficient for small m.
 Only O(k) columns of M are needed at any given time, hence the space
 used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary: bzip, not, or, space        P = bot, k = 2

Given a pattern P, find all the occurrences in S of all the terms containing
P as a substring, allowing k mismatches.

[Figure: the term not (codeword 1g 0g 0a) matches P = bot within 2 mismatches,
so its occurrences in S = “bzip or not bzip” are located via C(S).]

Agrep: more sophisticated operations

 The Shift-And method can solve other operations.
 The edit distance between two strings p and s is
 d(p,s) = the minimum number of operations needed to transform p into s,
 via three ops:
  Insertion: insert a symbol in p
  Deletion: delete a symbol from p
  Substitution: change a symbol of p into a different one

 Example: d(ananas, banane) = 3

Search by regular expressions
 Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding

g(x) = 00…0 (Length-1 zeros), followed by x written in binary,
with x > 0 and Length = ⌊log2 x⌋ + 1
e.g., 9 is represented as <000, 1001>.

 The g-code for x takes 2⌊log2 x⌋ + 1 bits
 (i.e., a factor of 2 from the optimal ⌊log2 x⌋ + 1)

 It is optimal for Pr(x) = 1/(2x^2), and i.i.d. integers

It is a prefix-free encoding…

 Given the following sequence of g-coded integers,
 reconstruct the original sequence:

 0001000001100110000011101100111

 Answer: 8, 6, 3, 59, 7
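A small sketch of g-coding (gamma coding) as defined above; decoding counts
the leading zeros to learn the codeword length.

#include <string>
#include <vector>

std::string gammaEncode(unsigned x) {                  // x > 0
    std::string bin;
    for (unsigned v = x; v; v >>= 1) bin = char('0' + (v & 1)) + bin;
    return std::string(bin.size() - 1, '0') + bin;     // (Length-1) zeros + binary(x)
}

std::vector<unsigned> gammaDecode(const std::string& bits) {
    std::vector<unsigned> out;
    for (size_t i = 0; i < bits.size(); ) {
        size_t z = 0;
        while (bits[i + z] == '0') ++z;                // Length-1 leading zeros
        unsigned x = 0;
        for (size_t j = 0; j <= z; ++j) x = (x << 1) | (bits[i + z + j] - '0');
        out.push_back(x);
        i += 2 * z + 1;
    }
    return out;   // gammaDecode("0001000001100110000011101100111") -> {8, 6, 3, 59, 7}
}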

Analysis
Sort the pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
 1 ≥ ∑i=1,...,x pi ≥ x * px   x ≤ 1/px

How good is it ?
Encode the integers via g-coding: |g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):

 ∑i=1,..,|S| pi * |g(i)| ≤ ∑i=1,..,|S| pi * [2 * log (1/pi) + 1] = 2 * H0(X) + 1

Not much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2

 A new concept: Continuers vs Stoppers
  Previously we used: s = c = 128

 The main idea is:
  s + c = 256 (we are playing with 8 bits)
  Thus s items are encoded with 1 byte
  And s*c with 2 bytes, s*c^2 with 3 bytes, ...

An example

 5000 distinct words
 ETDC encodes 128 + 128^2 = 16512 words on at most 2 bytes
 The (230,26)-dense code encodes 230 + 230*26 = 6210 words on at most 2
 bytes, hence more words on 1 byte, and thus it is better if skewed...
Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?

 Move-to-Front (MTF):
  As a freq-sorting approximator
  As a caching strategy
  As a compressor

 Run-Length-Encoding (RLE):
  FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded

 Start with the list of symbols L = [a,b,c,d,…]
 For each input symbol s:
  1) output the position of s in L
  2) move s to the front of L

There is a memory
Properties: it exploits temporal locality, and it is dynamic

 X = 1^n 2^n 3^n … n^n   Huff = O(n^2 log n) bits, MTF = O(n log n) + n^2 bits

Not much worse than Huffman
...but it may be far better
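A sketch of MTF coding over a byte alphabet: each symbol is replaced by its
current position in the list, which is then updated by moving the symbol to
the front (positions are 0-based here; the slides count from 1).

#include <algorithm>
#include <list>
#include <string>
#include <vector>

std::vector<int> mtfEncode(const std::string& s) {
    std::list<unsigned char> L;
    for (int c = 0; c < 256; ++c) L.push_back((unsigned char)c);   // initial symbol list
    std::vector<int> out;
    for (unsigned char c : s) {
        auto it = std::find(L.begin(), L.end(), c);
        out.push_back((int)std::distance(L.begin(), it));          // emit the current position
        L.erase(it);
        L.push_front(c);                                           // move to front
    }
    return out;
}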

MTF: how good is it ?
Encode the integers via g-coding: |g(i)| ≤ 2 * log i + 1

Put S at the front of the list and consider the cost of encoding: for each
symbol x, its i-th occurrence is encoded with g of the gap from the previous
occurrence, so the cost is

 O(|S| log |S|) + ∑x=1,..,|S| ∑i=2,..,nx |g(pi^x - p(i-1)^x)|

By Jensen’s inequality:

 ≤ O(|S| log |S|) + ∑x=1,..,|S| nx * [2 * log (N/nx) + 1]
 = O(|S| log |S|) + N * [2 * H0(X) + 1]

Hence La[mtf] ≤ 2 * H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In the case of binary strings  just the run lengths and one bit

Properties: it exploits spatial locality, and it is a dynamic code.
There is a memory.

X = 1^n 2^n 3^n … n^n   Huff(X) = Θ(n^2 log n) > Rle(X) = Θ(n log n)
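A sketch of run-length encoding as in the example above: maximal runs of equal
symbols are replaced by (symbol, run length) pairs.

#include <string>
#include <utility>
#include <vector>

std::vector<std::pair<char,int>> rle(const std::string& s) {
    std::vector<std::pair<char,int>> out;
    for (size_t i = 0; i < s.size(); ) {
        size_t j = i;
        while (j < s.size() && s[j] == s[i]) ++j;     // extend the current run
        out.push_back({s[i], (int)(j - i)});
        i = j;
    }
    return out;   // rle("abbbaacccca") -> (a,1)(b,3)(a,2)(c,4)(a,1)
}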

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval of [0,1): symbol i gets [f(i), f(i)+p(i)),
where

 f(i) = ∑j=1,i-1 p(j)

e.g.  a = .2, b = .5, c = .3
      f(a) = .0, f(b) = .2, f(c) = .7
so a → [0,.2), b → [.2,.7), c → [.7,1.0)

The interval for a particular symbol will be called
the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac

 Start with [0,1). Symbol b restricts it to [.2,.7);
 within it, symbol a takes the first 20%: [.2,.3);
 within that, symbol c takes the last 30%: [.27,.3).

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities p[c] use the following:

 l0 = 0      li = l(i-1) + s(i-1) * f[ci]
 s0 = 1      si = s(i-1) * p[ci]

f[c] is the cumulative probability up to symbol c (not included).
The final interval size is

 sn = ∏i=1,n p[ci]

The interval for a message sequence will be called the
sequence interval
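A floating-point sketch of the interval computation above (adequate for short
messages; real coders use the integer renormalization described later).

#include <string>
#include <utility>
#include <vector>
#include <iostream>

struct Model { std::string syms; std::vector<double> p, f; };   // f = cumulative prob.

std::pair<double,double> encodeInterval(const Model& m, const std::string& msg) {
    double l = 0.0, s = 1.0;                       // l0 = 0, s0 = 1
    for (char c : msg) {
        size_t i = m.syms.find(c);
        l = l + s * m.f[i];                        // li = l(i-1) + s(i-1) * f[ci]
        s = s * m.p[i];                            // si = s(i-1) * p[ci]
    }
    return {l, l + s};                             // the sequence interval [l, l+s)
}

int main() {
    Model m{"abc", {0.2, 0.5, 0.3}, {0.0, 0.2, 0.7}};
    auto [lo, hi] = encodeInterval(m, "bac");
    std::cout << "[" << lo << "," << hi << ")\n";   // prints approximately [0.27,0.3)
}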

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
1.0

0.7
c = .3

0.7
0.49

b = .5

c = .3

0.55
0.49

0.2

0.0

0.55

b = .5

0.3
a = .2

The message is bbc.

0.2

0.49
0.475

c = .3

b = .5
0.35

a = .2

0.3

a = .2

Representing a real number
Binary fractional representation:
 .75 = .11      1/3 = .010101…      11/16 = .1011

Algorithm:
 1. x = 2 * x
 2. if x < 1 output 0
 3. else x = x - 1; output 1

So how about just using the shortest binary fractional
representation lying in the sequence interval?
e.g. [0,.33) → .01     [.33,.66) → .1     [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
m in

m ax

interval

.11

.110

.111

[.75 ,1.0 )

.101

.1010

.1011

[.625 , .75 )

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval is
contained in the sequence interval (a dyadic interval).

 e.g. sequence interval [.61,.79), code interval (.101) = [.625,.75)

Can use l + s/2, truncated to 1 + ⌈log2 (1/s)⌉ bits.
Note that -log2 s + 1 = log2 (2/s).

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most
 1 + ⌈log2 (1/s)⌉ = 1 + ⌈log2 ∏i (1/pi)⌉
 ≤ 2 + ∑i=1,n log2 (1/pi)
 = 2 + ∑k=1,|S| n*pk log2 (1/pk)
 = 2 + n H0 bits

In practice, nH0 + 0.02 n bits, because of rounding.

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
Context
Empty

Counts
A
B
C
$

=
=
=
=

4
2
5
3

Context
A
B
C

Counts
C
$
A
$
A
B
C
$

=
=
=
=
=
=
=
=

3
1
2
1
1
2
2
3

Context
AC

BA
CA
CB
CC

String = ACCBACCACBA B

k=2

Counts
B
C
$
C
$
C
$
A
$
A
B
$

=
=
=
=
=
=
=
=
=
=
=
=

1
2
2
1
1
1
1
2
1
1
1
2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n  ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
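A sketch of LZW coding as just described: the dictionary grows by one entry
(S + c) per emitted code. The initial codes 112/113/114 for a/b/c are chosen
only to mirror the slide's numbering, not real ASCII values.

#include <map>
#include <string>
#include <vector>
#include <iostream>

std::vector<int> lzwEncode(const std::string& text) {
    std::map<std::string, int> dict = {{"a",112},{"b",113},{"c",114}};   // toy initial dictionary
    int next = 256;
    std::vector<int> out;
    std::string S;
    for (char c : text) {
        if (dict.count(S + c)) S += c;            // extend the current match
        else {
            out.push_back(dict[S]);               // emit the code of the longest match S
            dict[S + c] = next++;                 // add S + c to the dictionary
            S = std::string(1, c);
        }
    }
    if (!S.empty()) out.push_back(dict[S]);
    return out;
}

int main() {
    for (int code : lzwEncode("aabaacababacb")) std::cout << code << ' ';
    // prints 112 112 113 256 114 257 261 114 113
    // (the slide's encoding example, plus the final flush of the pending "b")
}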

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]
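A sketch of that relation: once the suffix array is available, the BWT is read
off as the character cyclically preceding each suffix. The suffix array here is
built by naive sorting only to keep the example short.

#include <algorithm>
#include <numeric>
#include <string>
#include <vector>
#include <iostream>

std::string bwt(const std::string& T) {               // T is assumed to end with '#'
    const int n = T.size();
    std::vector<int> SA(n);
    std::iota(SA.begin(), SA.end(), 0);
    std::sort(SA.begin(), SA.end(), [&](int a, int b) {
        return T.compare(a, n - a, T, b, n - b) < 0;   // compare suffixes
    });
    std::string L(n, ' ');
    for (int i = 0; i < n; ++i)
        L[i] = T[(SA[i] + n - 1) % n];                 // char preceding suffix SA[i]
    return L;
}

int main() { std::cout << bwt("mississippi#") << "\n"; }   // prints ipssm#pissii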

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n^2 log n) time in the worst case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion pages available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: the probability that a node has x links is ∝ 1/x^a, a ≈ 2.1

The In-degree distribution

Altavista crawl (1999) and WebBase crawl (2001):
the indegree follows a power-law distribution

Pr[in-degree(u) = k] ∝ 1/k^a,  a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: the probability that a node has x links is ∝ 1/x^a, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides and efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
           Emacs size   Emacs time
uncompr    27Mb         ---
gzip       8Mb          35 secs
zdelta     1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: a small weighted graph over the files, with the dummy node and the
zdelta/gzip sizes as edge weights; the min branching picks the cheapest
reference for each file.]

           space    time
uncompr    30Mb     ---
tgz        20%      linear
THIS       8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n^2 edge calculations (zdelta executions)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta executions. Nonetheless, it still takes n^2 time
           space    time
uncompr    260Mb    ---
tgz        12%      2 mins
THIS       8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

          gcc size   emacs size
total     27288      27326
gzip      7563       8577
zdelta    227        1431
rsync     964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Figure: the suffix tree of T# = mississippi#, with edge labels
(#, i, s, p, si, ssi, ppi#, pi#, i#, mississippi#, …) and one leaf per suffix,
numbered 1..12 with its starting position in T.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

Storing SUF(T) explicitly takes Θ(N^2) space; the suffix array stores only the
suffix pointers:

T = mississippi#
SA     = 12 11 8 5 2 1 10 9 7 4 6 3
SUF(T) = #, i#, ippi#, issippi#, ississippi#, mississippi#,
         pi#, ppi#, sippi#, sissippi#, ssippi#, ssissippi#

P = si
Suffix Array space:
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time




 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with i-th bit of U(T[j]) to estabilish if both are true

An example j=1
1 2 3 4 5 6 7 8 9 10

n
m

1

12345

1

0

P=abaac

2

0

3

0

4

0

5

0

T=xabxabaaca

0
 
0
U (x)  0
 
0
 
0

2

3

4

5

6

7

1 
 
0
BitShift ( M ( 0 )) & U (T [1])   0  &
 
0
 
0

8

9

0 0
   
0 0
0  0
   
0 0
   
0 0

1
0

An example j=2
1 2 3 4 5 6 7 8 9 10

n
m

1

2

12345

1

0

1

P=abaac

2

0

0

3

0

0

4

0

0

5

0

0

T=xabxabaaca

1 
 
0
U (a )   1 
 
1 
 
0

3

4

5

6

7

1 
 
0
BitShift ( M (1)) & U (T [ 2 ])   0  &
 
0
 
0

8

9

1  1 
   
0 0
1    0 
   
1   0 
   
0 0

1
0

An example j=3
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

12345

1

0

1

0

P=abaac

2

0

0

1

3

0

0

0

4

0

0

0

5

0

0

0

T=xabxabaaca

0
 
1 
U (b )   0 
 
0
 
0

4

5

6

7

1 
 
1 
BitShift ( M ( 2 )) & U (T [ 3 ])   0  &
 
0
 
0

8

9

0 0
   
1  1 
0  0
   
0 0
   
0 0

1
0

An example j=9
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

4

5

6

7

8

9

12345

1

0

1

0

0

1

0

1

1

0

P=abaac

2

0

0

1

0

0

1

0

0

0

3

0

0

0

0

0

0

1

0

0

4

0

0

0

0

0

0

0

1

0

5

0

0

0

0

0

0

0

0

1

T=xabxabaaca

0
 
0
U (c )   0 
 
0
 
1 

1 
 
1 
BitShift ( M ( 8 )) & U (T [ 9 ])   0  &
 
0
 
1 

0 0
   
0 0
0  0
   
0 0
   
1  1 

1
0

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M ( j - 1))  U (T [ j ])
l

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M

l -1

( j - 1))

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1
1 2 3 4 5 6 7 8 910

M1=

T=xabxabaaca
P=
abaad

1

2

3

4

5

6

7

8

9

1
0

1

1

1

1

1

1

1

1

1

1

1

2

0

0

1

0

0

1

0

1

1

0

3

0

0

0

1

0

0

1

0

0

1

4

0

0

0

0

1

0

0

1

0

0

5

0

0

0

0

0

0

0

0

1

0

M0=

1

2

3

4

5

6

7

8

9

1
0

1

0

1

0

0

1

0

1

1

0

1

2

0

0

1

0

0

1

0

0

0

0

3

0

0

0

0

0

0

1

0

0

0

4

0

0

0

0

0

0

0

1

0

0

5

0

0

0

0

0

0

0

0

0

0

How much do we pay?





The running time is O(kn(1+m/w)
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Delection: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
0 000 ...........0 x in b in ary
Length-1


x > 0 and Length = log2 x +1
e.g., 9 represented as <000,1001>.



g-code for x takes 2 log2 x +1 bits

(ie. factor of 2 from optimal)



Optimal for Pr(x) = 1/2x2, and i.i.d integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8

6

3

59

7

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How much good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
1 ≥Si=1,...,x pi ≥ x * px  x ≤ 1/px

How good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):



p i  | g ( i ) |

i  1 ,..., S

This is:



i  1 ,.., S

p i  [ 2 * log

1

 1]

pi

 2 * H 0 ( X ) 1
No much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1n 2n 3n… nn  Huff = O(n2 log n), MTF = O(n log n) + n2

No much worse than Huffman
...but it may be far better

MTF: how good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
Put S in the front and consider the cost of encoding:
S

O ( S log S ) 



nx

g(p - p )

x 1 i  2

x

x

i

i -1

S

By Jensen’s:

 O ( S log S ) 

n
x 1

x

[ 2 * log

N

 1]

nx

 O ( S log S )  N * [ 2 * H 0 ( X )  1]
L a [ mtf ]  2 * H 0 ( X )  O (1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1n 2n 3n… nn 

There is a memory

Huff(X) = Q(n2 log n) > Rle(X) = Q(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.
1.0

c = .3

0.7

i -1

f (i ) 



p( j)

j 1

b = .5
0.2
0.0

f(a) = .0, f(b) = .2, f(c) = .7
a = .2

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
1.0

0.7
c = .3

0.7

c = .3
0.55

0.0

a = .2

0.3
0.2

c = .3

0.27
b = .5

b = .5
0.2

0.3

a = .2

b = .5
0.22

a = .2

0.2

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l 0  0 l i  l i -1  s i -1 * f c i 

s0  1

s i  s i -1 * p c i 

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

n

sn 

 p c 
i

i 1

The interval for a message sequence will be called the
sequence interval

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
1.0

0.7
c = .3

0.7
0.49

b = .5

c = .3

0.55
0.49

0.2

0.0

0.55

b = .5

0.3
a = .2

The message is bbc.

0.2

0.49
0.475

c = .3

b = .5
0.35

a = .2

0.3

a = .2

Representing a real number
Binary fractional representation:
. 75

 . 11

1/ 3

 . 01 01

11 / 16

 . 1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
m in

m ax

interval

.11

.110

.111

[.75 ,1.0 )

.101

.1010

.1011

[.625 , .75 )

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
Context
Empty

Counts
A
B
C
$

=
=
=
=

4
2
5
3

Context
A
B
C

Counts
C
$
A
$
A
B
C
$

=
=
=
=
=
=
=
=

3
1
2
1
1
2
2
3

Context
AC

BA
CA
CB
CC

String = ACCBACCACBA B

k=2

Counts
B
C
$
C
$
C
$
A
$
A
B
$

=
=
=
=
=
=
=
=
=
=
=
=

1
2
2
1
1
1
1
2
1
1
1
2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Q(n2 log n) time in the worst-case
• Q(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in - degree ( u )  k ] 

a  2.1

1

k

a

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides and efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
Emacs size

Emacs time

uncompr

27Mb

---

gzip

8Mb

35 secs

zdelta

1.5Mb

42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

0

620

2

20

123

2000

20
220

3

1
5

space

time

uncompr

30Mb

---

tgz

20%

linear

THIS

8%

quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
space

time

uncompr

260Mb

---

tgz

12%

2 mins

THIS

8%

16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

gcc size
total
27288
gzip
7563
zdelta
227
rsync
964

emacs size
27326
8577
1431
4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
#

0

s

i

1
ssi

ppi#

4

#

1

p

i

3

2

1

ppi#

6

pi#

7

8
5

T# = mississippi#
2 4 6 8 10

si

ppi#

i#

ppi#

11

mississippi#

12

2

1 10

9

4

3

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
5

Q(N2) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Q(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does it exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does it exist a substring of length ≥ L occurring ≥ C times ?
• Search for Lcp[i,i+C-2] whose entries are ≥ L


Slide 92

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
        4K     8K     16K    32K     128K    256K    512K    1M
n^3     22s    3m     26m    3.5h    28h     --      --      --
n^2     0      0      0      1s      26s     106s    7m      28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum = 0; max = -1;
For i = 1,...,n do
   If (sum + A[i] ≤ 0) then sum = 0;
   else sum += A[i]; max = MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT
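A minimal sketch of this scan in Python (the function name and the convention that an all-negative array yields 0 are choices of this sketch, not of the slide):

def max_subarray_sum(A):
    # single left-to-right scan: reset the running sum when it drops to <= 0,
    # otherwise extend it; keep the best value seen so far
    best, run = 0, 0
    for x in A:
        run = run + x
        if run <= 0:
            run = 0
        best = max(best, run)
    return best

print(max_subarray_sum([2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]))   # 12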

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 10^9 random I/Os = 10^9 * 5ms ≈ 2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02    m = (i+j)/2;           // Divide
03    Merge-Sort(A,i,m);     // Conquer
04    Merge-Sort(A,m+1,j);
05    Merge(A,i,m,j)         // Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n = 10^9 tuples ⇒ few GBs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Θ(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
[Figure: recursion tree of binary Merge-Sort on a sample array, with log2 N levels of merges over sorted runs of doubling size]

How do we deploy the disk/mem features?

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X = M/B runs at a time ⇒ log_{M/B} (N/M) passes

[Figure: X = M/B input runs on disk are merged through main-memory buffers of B items into one output run, written back to disk]

Multiway Merging
[Figure: multiway merging. One buffer Bf_i with pointer p_i per run (Run 1 ... Run X = M/B); repeatedly output min(Bf1[p1], Bf2[p2], ..., Bfx[pX]) into the output buffer Bfo; fetch the next page of run i when p_i = B, flush Bfo to the merged output file when it is full, until EOF]
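A small sketch of one multiway-merge pass in Python (heapq plays the role of the min-selection over the X buffer heads; disk buffering is not simulated, so treat it as illustrative only):

import heapq

def multiway_merge(runs):
    # runs: the X = M/B sorted runs; the heap holds one current element per run,
    # i.e. the items pointed to by p1..pX in the picture above
    iters = [iter(r) for r in runs]
    heap = []
    for i, it in enumerate(iters):
        v = next(it, None)
        if v is not None:
            heap.append((v, i))
    heapq.heapify(heap)
    out = []                                   # stands in for the output buffer Bfo
    while heap:
        v, i = heapq.heappop(heap)
        out.append(v)
        nxt = next(iters[i], None)             # "fetch" the next item of run i
        if nxt is not None:
            heapq.heappush(heap, (nxt, i))
    return out

print(multiway_merge([[1, 5, 9], [2, 3, 8], [4, 6, 7]]))   # [1, 2, ..., 9]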

Cost of Multi-way Merge-Sort


Number of passes = log_{M/B} #runs ≈ log_{M/B} (N/M)

Optimal cost = Θ((N/B) log_{M/B} (N/M)) I/Os

In practice

M/B ≈ 1000 ⇒ #passes = log_{M/B} (N/M) ≈ 1
One multiway merge ⇒ 2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Could compression help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables X (candidate) and C (counter)
For each item s of the stream,

   if (X == s) then C++
   else { C--; if (C == 0) { X = s; C = 1; } }

Return X;

Proof

(Note: the answer is guaranteed only if some item really occurs > N/2 times.)

If X≠y at the end, then every one of y's occurrences has been cancelled by a distinct "negative" mate, hence the mates are ≥ #occ(y).

But mates and occurrences of y together are ≤ N, while 2 * #occ(y) > N: contradiction.
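The same stream scan in Python (a sketch of the majority-vote idea above; the answer is meaningful only if some item really occurs more than N/2 times):

def majority_candidate(stream):
    X, C = None, 0
    for s in stream:
        if X == s:
            C += 1                 # a "positive" mate
        else:
            C -= 1                 # cancel one occurrence of the candidate
            if C <= 0:
                X, C = s, 1        # adopt a new candidate
    return X

print(majority_candidate("bacccdcbaaaccbccc"))   # 'c'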

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 10^9 chars, size = 6Gb
n = 10^6 documents
TotT = 10^9 term occurrences (avg term length is 6 chars)
t = 5 * 10^5 distinct terms

What kind of data structure should we build to support word-based searches?

Solution 1: Term-Doc matrix
n = 1 million documents, t = 500K terms; entry is 1 if the play contains the word, 0 otherwise

            Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony            1                1              0          0       0        1
Brutus            1                1              0          1       0        0
Caesar            1                1              0          1       1        1
Calpurnia         0                1              0          0       0        0
Cleopatra         1                0              0          0       0        0
mercy             1                0              1          1       1        1
worser            1                0              1          1       1        0

Space is 500Gb !

Solution 2: Inverted index
Brutus    → 2 4 8 16 32 64 128
Caesar    → 1 2 3 5 8 13 21 34
Calpurnia → 13 16

We can still do better: i.e. 30-50% of the original text

1. Each entry typically takes about 12 bytes
2. We have 10^9 total term occurrences ⇒ at least 12Gb space
3. Compressing the 6Gb documents gets 1.5Gb of data
A better index, but it is still >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2^n but we have fewer compressed messages:
Σ_{i=1}^{n-1} 2^i = 2^n - 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H(S) = Σ_{s∈S} p(s) * log2 (1/p(s))    bits
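For concreteness, a tiny numeric check in Python (the probabilities are the ones used in the Huffman example a few slides below; names are just for this sketch):

from math import log2

p = {'a': 0.1, 'b': 0.2, 'c': 0.2, 'd': 0.5}

i = {s: -log2(ps) for s, ps in p.items()}        # self information i(s) = -log2 p(s)
H = sum(ps * i[s] for s, ps in p.items())        # entropy = weighted average of i(s)

print(i)   # a ≈ 3.32, b = c ≈ 2.32, d = 1 bit
print(H)   # ≈ 1.76 bits per symbol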

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
[Figure: the binary trie of the code, with a at 0, b at 100, c at 101, d at 11]

Average Length
For a code C with codeword length L[s], the
average length is defined as
La(C) = Σ_{s∈S} p(s) * L[s]

We say that a prefix code C is optimal if for all prefix codes C', La(C) ≤ La(C')

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal uniquely decodable code, there exists a prefix code with the same codeword lengths, and thus the same optimal average length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj ⇒ L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H(S) ≤ La(C)
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La(C) ≤ H(S) + 1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

[Figure: Huffman tree construction - a(.1) and b(.2) merge into (.3); (.3) and c(.2) merge into (.5); (.5) and d(.5) merge into (1)]

a=000, b=001, c=01, d=1
There are 2^(n-1) “equivalent” Huffman trees

What about ties (and thus, tree depth) ?
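A compact sketch of the construction with a heap (ties are broken arbitrarily, which is exactly why several "equivalent" trees exist):

import heapq

def huffman_codes(probs):
    # each heap entry is (probability, tiebreak, {symbol: partial codeword})
    heap = [(pr, i, {s: ""}) for i, (s, pr) in enumerate(probs.items())]
    heapq.heapify(heap)
    nxt = len(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)          # two least-probable subtrees
        p1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c0.items()}
        merged.update({s: "1" + w for s, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, nxt, merged))
        nxt += 1
    return heap[0][2]

print(huffman_codes({'a': .1, 'b': .2, 'c': .2, 'd': .5}))
# codeword lengths match the slide (|a| = |b| = 3, |c| = 2, |d| = 1);
# the exact bits depend on how ties are broken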

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
[Figure: the Huffman tree with a=000, b=001, c=01, d=1]
Encoding:  abc...  →  00000101...
Decoding:  101001...  →  dcb...

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:
  firstcode[L]   (= 00.....0 on the deepest level)
  Symbol[L,i], for each i in level L

This is ≤ h² + |S| log |S| bits

Canonical Huffman: Encoding
[Figure: codeword levels 1..5 of a canonical tree]

Canonical Huffman: Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
-log2(.999) ≈ .00144

If we were to send 1000 such symbols we might hope to use 1000 * .00144 ≈ 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:

  The model takes |S|^k * (k * log |S|) + h² bits   (where h might be |S|)

  It is H0(S^L) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
[Figure: plain Huffman vs tagged coding of “or”. Each codeword is a sequence of bytes: 7 bits of code per byte plus a flag bit, which is 1 only on the first byte of a codeword]

[Figure: the compressed text C(T) for T = “bzip or not bzip”, with byte-aligned tagged codewords, e.g. bzip = 1a 0b, or = 1g 0a 0b, not = 1g 0g 0a (a, b, g denote 7-bit configurations)]

CGrep and other ideas...
P= bzip = 1a 0b

[Figure: GREP searches the pattern in the plain text T = “bzip or not bzip”; CGrep searches the codeword of P (1a 0b) directly in the compressed text C(T), checking the tag bits to avoid false byte alignments]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary = { bzip, not, or, space }     P = bzip = 1a 0b

[Figure: the codewords of the dictionary terms and the compressed text C(S), S = “bzip or not bzip”; the occurrences of P are found by scanning C(S) for the codeword 1a 0b]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H(s) = Σ_{i=1}^{m} 2^{m-i} * s[i]

P = 0101
H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5
s = s’ if and only if H(s) = H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H(T_r) = 2·H(T_{r-1}) - 2^m·T(r-1) + T(r+m-1)

T = 10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H(T1) = H(1011) = 11
H(T2) = H(0110) = 2·11 - 2⁴·1 + 0 = 22 - 16 + 0 = 6 = H(0110)

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1·2 (mod 7) + 0 = 2
2·2 (mod 7) + 1 = 5
5·2 (mod 7) + 1 = 4
4·2 (mod 7) + 1 = 2
2·2 (mod 7) + 1 = 5
5 (mod 7) = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1):
2^m (mod q) = 2·(2^{m-1} (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
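A sketch of the whole matcher in Python (here the prime q is simply fixed instead of being drawn at random below some I, so the probabilistic analysis above does not literally apply; it only shows the rolling update plus verification):

def karp_rabin(T, P, q=2_147_483_647):
    n, m = len(T), len(P)
    if m > n:
        return []
    pow_m = pow(2, m - 1, q)                   # 2^(m-1) mod q, used when a bit leaves the window
    hP = hT = 0
    for i in range(m):
        hP = (2 * hP + int(P[i])) % q          # Hq(P)
        hT = (2 * hT + int(T[i])) % q          # Hq(T1)
    occ = []
    for r in range(n - m + 1):
        if hP == hT and T[r:r + m] == P:       # verification step: definite matches only
            occ.append(r + 1)                  # 1-based positions, as in the slides
        if r + m < n:                          # Hq of the next window, in O(1) time
            hT = (2 * (hT - int(T[r]) * pow_m) + int(T[r + m])) % q
    return occ

print(karp_rabin("10110101", "0101"))          # [5]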

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

[Figure: the m×n matrix M for T = california, P = for; column j records which prefixes of P match the text ending at position j; M(3,7) = 1, i.e. the full pattern occurs ending at T[7]]
How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to compute the j-th column of M from the (j-1)-th one.
We define the m-length binary vector U(x) for each character x in the alphabet: U(x) is set to 1 in the positions where character x appears in P.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) into the i-th position;
AND this with the i-th bit of U(T[j]) to establish whether both conditions hold
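A sketch of the scan in Python, where the machine word is played by an int (bit i-1 of the mask stands for row i of M):

def shift_and(T, P):
    m = len(P)
    U = {}
    for i, c in enumerate(P):                  # U(c): bit i set iff P[i+1] = c
        U[c] = U.get(c, 0) | (1 << i)
    M, occ = 0, []
    for j, c in enumerate(T, start=1):
        M = ((M << 1) | 1) & U.get(c, 0)       # M(j) = BitShift(M(j-1)) & U(T[j])
        if M & (1 << (m - 1)):                 # last row set: P ends at position j
            occ.append(j)
    return occ

print(shift_and("california", "for"))          # [7]
print(shift_and("xabxabaaca", "abaac"))        # [9]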

Worked example (T = xabxabaaca, P = abaac):
[Figure: the columns M(1), M(2), M(3), ..., M(9) computed one text character at a time via M(j) = BitShift(M(j-1)) & U(T[j]); at j = 9 the 5th bit of M(9) is 1, i.e. P occurs in T ending at position 9]
Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: Another solution
Dictionary = { bzip, not, or, space }     P = bzip = 1a 0b

[Figure: the dictionary codewords and the compressed text C(S), S = “bzip or not bzip”; the occurrences of P's codeword in C(S) are marked yes/no]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

Dictionary terms: bzip, not, or, space      P = o

[Figure: the compressed text C(S) of S = “bzip or not bzip”; every dictionary term containing ‘o’ is searched in C(S) via its codeword]

not = 1g 0g 0a
or  = 1g 0a 0b

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of length m:
   R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of the Shift-And method searching for S:
   For any symbol c, U’(c) = U(c) AND R
     U’(c)[i] = 1 iff S[i] = c and it is the first symbol of a pattern
   For any step j,
     compute M(j)
     then M(j) OR U’(T[j]). Why?
       It sets to 1 the first bit of each pattern that starts with T[j]
     Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

Dictionary terms: bzip, not, or, space      P = bot, k = 2

[Figure: the compressed text C(S) of S = “bzip or not bzip”; the codewords of all terms within ≤ k mismatches of P are searched in C(S)]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
there are no more than l mismatches between the first i characters of P and the i characters of T ending at character j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M ( j - 1))  U (T [ j ])
l

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M

l -1

( j - 1))

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))
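A sketch of the k-mismatch scan (one mask per level l = 0..k, updated with the recurrence above; names are this sketch's own):

def shift_and_k_mismatch(T, P, k):
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M = [0] * (k + 1)
    last, occ = 1 << (m - 1), []
    for j, c in enumerate(T, start=1):
        prev = M[:]                                        # the columns M^l(j-1)
        M[0] = ((prev[0] << 1) | 1) & U.get(c, 0)          # exact level, as before
        for l in range(1, k + 1):
            keep  = ((prev[l] << 1) | 1) & U.get(c, 0)     # case 1: characters agree
            spend = (prev[l - 1] << 1) | 1                 # case 2: pay one mismatch here
            M[l] = keep | spend
        if M[k] & last:
            occ.append(j)                                  # <= k mismatches ending at j
    return occ

print(shift_and_k_mismatch("xabxabaaca", "abaad", 1))      # [9]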

Example M1
[Worked example: T = xabxabaaca, P = abaad; the matrices M0 and M1 column by column; M1(5,9) = 1, i.e. P occurs ending at position 9 with at most 1 mismatch]

How much do we pay?





The running time is O(k n (1 + m/w)).
Again, the method is practically efficient for small m.
Still, only O(k) columns of M are needed at any given time. Hence, the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

Dictionary terms: bzip, not, or, space      P = bot, k = 2

[Figure: the k-mismatch Shift-And scan over the compressed text C(S), S = “bzip or not bzip”; the term ‘not’ matches P within 1 mismatch]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding:  (Length - 1) zeros, followed by x in binary

x > 0 and Length = ⌊log2 x⌋ + 1
e.g., 9 is represented as <000,1001>

γ-code for x takes 2⌊log2 x⌋ + 1 bits
(i.e. a factor of 2 from optimal)



Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of γ-coded integers, reconstruct the original sequence:

0001000001100110000011101100111

→ 8, 6, 3, 59, 7
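A sketch of γ-encoding and decoding over strings of '0'/'1' characters (bit-level I/O is left out):

def gamma_encode(x):                    # x > 0
    b = bin(x)[2:]                      # x in binary, no leading zeros
    return "0" * (len(b) - 1) + b       # (Length-1) zeros, then the binary digits

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":           # unary part: Length-1 zeros
            z += 1; i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out

code = "".join(gamma_encode(x) for x in (8, 6, 3, 59, 7))
print(code)                 # 0001000001100110000011101100111
print(gamma_decode(code))   # [8, 6, 3, 59, 7]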

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).

Recall that: |γ(i)| ≤ 2 * log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(S) + 1
Key fact:
1 ≥ Σ_{i=1,...,x} pi ≥ x * px  ⇒  x ≤ 1/px

How good is it ?
Encode the integers via γ-coding:  |γ(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):

Σ_{i=1,...,|S|} pi * |γ(i)|

This is:

≤ Σ_{i=1,...,|S|} pi * [2 * log (1/pi) + 1] = 2 * H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 256 - s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L
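A direct sketch of the transform (and its inverse) with a plain list; fine for small alphabets, while the search-tree/hash solution described later is what you would use for large ones:

def mtf_encode(text, alphabet):
    L, out = list(alphabet), []
    for s in text:
        i = L.index(s)                  # 0-based position of s in the current list
        out.append(i)
        L.pop(i); L.insert(0, s)        # move s to the front
    return out

def mtf_decode(codes, alphabet):
    L, out = list(alphabet), []
    for i in codes:
        s = L[i]
        out.append(s)
        L.pop(i); L.insert(0, s)
    return "".join(out)

c = mtf_encode("mississippi", "imps")
print(c)                                # runs of equal symbols become runs of 0s
print(mtf_decode(c, "imps"))            # mississippi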

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1^n 2^n 3^n ... n^n  ⇒  Huff = O(n² log n), MTF = O(n log n) + n²

Not much worse than Huffman
...but it may be far better

MTF: how good is it ?
Encode the integers via γ-coding:  |γ(i)| ≤ 2 * log i + 1
Put S at the front of the list and consider the cost of encoding:

O(|S| log |S|) + Σ_{x=1,...,|S|} Σ_{i=2,...,n_x} |γ(p_i^x - p_{i-1}^x)|

By Jensen’s:

≤ O(|S| log |S|) + Σ_{x=1,...,|S|} n_x * [2 * log (N / n_x) + 1]
= O(|S| log |S|) + N * [2 * H0(X) + 1]

La[mtf] ≤ 2 * H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1^n 2^n 3^n ... n^n  ⇒

There is a memory

Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.  p(a) = .2,  p(b) = .5,  p(c) = .3

f(i) = Σ_{j=1}^{i-1} p(j)        f(a) = .0, f(b) = .2, f(c) = .7

[Figure: the unit interval split into a = [0,.2), b = [.2,.7), c = [.7,1.0)]

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
[Figure: coding bac. Start from [0,1), zoom into b = [.2,.7), then into its a-subinterval [.2,.3), then into that interval’s c-subinterval [.27,.3)]

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l0 = 0        li = l(i-1) + s(i-1) * f[ci]
s0 = 1        si = s(i-1) * p[ci]

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

sn = Π_{i=1,...,n} p[ci]
The interval for a message sequence will be called the
sequence interval
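A sketch of the interval computation with exact fractions (the "real number" view of these slides; the integer/scaling version comes later). Probabilities follow the running example:

from fractions import Fraction as F

p = {'a': F(2, 10), 'b': F(5, 10), 'c': F(3, 10)}
f = {'a': F(0), 'b': F(2, 10), 'c': F(7, 10)}        # cumulative prob., symbol excluded

def sequence_interval(msg):
    l, s = F(0), F(1)
    for c in msg:
        l = l + s * f[c]                              # l_i = l_{i-1} + s_{i-1} * f[c_i]
        s = s * p[c]                                  # s_i = s_{i-1} * p[c_i]
    return l, s

def decode(x, length):
    out = []
    for _ in range(length):
        for c in p:                                   # find the symbol interval containing x
            if f[c] <= x < f[c] + p[c]:
                out.append(c)
                x = (x - f[c]) / p[c]                 # rescale and repeat
                break
    return "".join(out)

l, s = sequence_interval("bac")
print(l, l + s)                                       # 27/100 3/10, i.e. [.27, .3)
print(decode(F(49, 100), 3))                          # bbc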

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
[Figure: .49 lies in b = [.2,.7), then in the nested b-interval [.3,.55), then in the nested c-interval [.475,.55)]

The message is bbc.

Representing a real number
Binary fractional representation:
.75 = .11
1/3 = .0101...
11/16 = .1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
m in

m ax

interval

.11

.110

.111

[.75 ,1.0 )

.101

.1010

.1011

[.625 , .75 )

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + ⌈log (1/s)⌉ = 1 + ⌈log ∏i (1/pi)⌉
≤ 2 + Σ_{i=1,...,n} log (1/pi)
= 2 + Σ_{k=1,...,|S|} n·pk · log (1/pk)
= 2 + n·H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R = 2^k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
   Output 1 followed by m 0s; m = 0
   Message interval is expanded by 2

If u < R/2 then (bottom half)
   Output 0 followed by m 1s; m = 0
   Message interval is expanded by 2

If l ≥ R/4 and u < 3R/4 then (middle half)
   Increment m
   Message interval is expanded by 2

All other cases: just continue...

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
String = ACCBACCACBA B,   k = 2

Context: Empty       Counts: A = 4, B = 2, C = 5, $ = 3

Context: A           Counts: C = 3, $ = 1
Context: B           Counts: A = 2, $ = 1
Context: C           Counts: A = 1, B = 2, C = 2, $ = 3

Context: AC          Counts: B = 1, C = 2, $ = 2
Context: BA          Counts: C = 1, $ = 1
Context: CA          Counts: C = 1, $ = 1
Context: CB          Counts: A = 2, $ = 1
Context: CC          Counts: A = 1, B = 1, $ = 2
You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
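A sketch of an encoder/decoder pair in Python (the match search is the naive O(n·W) one, where gzip would use hashing; the decoder is the overlap-friendly copy loop just described):

def lz77_encode(T, W=6):
    out, i, n = [], 0, len(T)
    while i < n:
        d = l = 0
        for j in range(max(0, i - W), i):             # candidates inside the window
            k = 0
            while i + k < n - 1 and T[j + k] == T[i + k]:
                k += 1                                 # may run past i (overlap allowed)
            if k > l:
                d, l = i - j, k
        out.append((d, l, T[i + l]))                   # (distance, length, next char)
        i += l + 1
    return out

def lz77_decode(codes):
    out = []
    for d, l, c in codes:
        for _ in range(l):
            out.append(out[-d])                        # correct even when l > d
        out.append(c)
    return "".join(out)

enc = lz77_encode("aacaacabcabaaac")
print(enc)                # [(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')]
print(lz77_decode(enc))   # aacaacabcabaaac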

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
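A sketch of LZW over ordinary byte characters (the slide's toy ids such as a = 112 are illustrative, so plain codes 0..255 are used here); the decoder's handling of the SSc case is the marked branch:

def lzw_encode(s):
    dic = {chr(i): i for i in range(256)}
    out, S = [], ""
    for c in s:
        if S + c in dic:
            S += c                                 # keep extending the current match
        else:
            out.append(dic[S])
            dic[S + c] = len(dic)                  # add Sc, but do not emit c
            S = c
    if S:
        out.append(dic[S])
    return out

def lzw_decode(codes):
    dic = {i: chr(i) for i in range(256)}
    prev = dic[codes[0]]
    out = [prev]
    for code in codes[1:]:
        cur = dic[code] if code in dic else prev + prev[0]   # SSc: code not in dict yet
        out.append(cur)
        dic[len(dic)] = prev + cur[0]              # decoder stays one step behind
        prev = cur
    return "".join(out)

enc = lzw_encode("aabaacabababacb")
print(enc)                                   # [97, 97, 98, 256, 99, 257, 261, 258, 99, 98]
print(lzw_decode(enc) == "aabaacabababacb")  # True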

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]
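A sketch in Python: the BWT obtained by sorting the rotations (the "elegant but inefficient" route of the next slide) and its inversion via the LF-mapping:

def bwt(T):                      # T must end with a unique smallest char, e.g. '#'
    n = len(T)
    rows = sorted(range(n), key=lambda i: T[i:] + T[:i])
    return "".join(T[(i - 1) % n] for i in rows)          # last column L

def inverse_bwt(L):
    n = len(L)
    # stable sort of L gives F; the k-th occurrence of c in L is the k-th in F
    order = sorted(range(n), key=lambda i: L[i])
    LF = [0] * n
    for f_pos, l_pos in enumerate(order):
        LF[l_pos] = f_pos
    r, out = 0, []                                        # row 0 starts with '#'
    for _ in range(n):
        out.append(L[r])                                  # L[r] precedes F[r] in T
        r = LF[r]
    s = "".join(reversed(out))                            # T rotated so '#' comes first
    return s[1:] + s[0]

L = bwt("mississippi#")
print(L)                         # ipssm#pissii
print(inverse_bwt(L))            # mississippi#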

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in-degree(u) = k ]  ∝  1 / k^a,    a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/x^a, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides and efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
           Emacs size    Emacs time
uncompr      27Mb           ---
gzip          8Mb          35 secs
zdelta      1.5Mb          42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: the weighted graph G_F with the dummy node 0 and zdelta/gzip edge weights; the min branching selects the cheapest reference for each file]

            space     time
uncompr     30Mb       ---
tgz         20%       linear
THIS         8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n² time

            space     time
uncompr     260Mb      ---
tgz          12%      2 mins
THIS          8%      16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

          gcc size    emacs size
total       27288        27326
gzip         7563         8577
zdelta        227         1431
rsync         964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Figure: the suffix tree of T# = mississippi#; edge labels are substrings of T#, and the 12 leaves store the starting positions of the suffixes]

T# = mississippi#

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
5

Q(N2) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Q(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Is there a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Is there a substring of length ≥ L occurring ≥ C times ?
• Search for a window Lcp[i,i+C-2] whose entries are all ≥ L


Slide 93

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort ⇒ Θ(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 10^9 random I/Os = 10^9 * 5ms ≈ 2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
  if (i < j) then
    m = (i+j)/2;           // Divide
    Merge-Sort(A,i,m);     // Conquer
    Merge-Sort(A,m+1,j);   // Conquer
    Merge(A,i,m,j)         // Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n = 10^9 tuples ⇒ a few GBs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Θ(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N levels

If the run-size is larger than B (i.e. after the first step!!), fetching all of it in memory for merging does not help.

(Figure: the recursion tree of binary Merge-Sort, with sorted runs merged pairwise, level by level.)

How do we deploy the disk/memory features?

N/M runs, each sorted in internal memory (no I/Os)
— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X = M/B runs ⇒ log_{M/B} (N/M) passes
(Figure: X input buffers of B items each, plus one output buffer, held in main memory; the runs are streamed in from disk and the merged output is written back to disk.)

Multiway Merging
(Figure: buffers Bf1 … Bfx hold the current page of runs 1 … X = M/B, with cursors p1 … pX; at each step the minimum of Bf1[p1], …, Bfx[pX] is appended to the output buffer Bfo. A buffer Bfi is refilled from its run when pi = B, and Bfo is flushed to the merged output file when full, until EOF.)
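As a sketch of the merging step only (buffer sizes, run creation and the actual disk I/O are left out), Python's heapq can merge the X sorted runs while reading each of them strictly sequentially:

import heapq

def multiway_merge(runs):
    # runs: list of X sorted iterables (in practice, sorted runs on disk read page by page).
    # A min-heap over the current head of each run yields the merged output in sorted order.
    heap = []
    iters = [iter(r) for r in runs]
    for i, it in enumerate(iters):
        first = next(it, None)
        if first is not None:
            heapq.heappush(heap, (first, i))
    while heap:
        val, i = heapq.heappop(heap)
        yield val                          # in practice: append to the output buffer, flush when full
        nxt = next(iters[i], None)
        if nxt is not None:
            heapq.heappush(heap, (nxt, i))

print(list(multiway_merge([[1, 5, 9], [2, 3, 8], [4, 6, 7]])))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]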

Cost of Multi-way Merge-Sort


Number of passes = log_{M/B} #runs ≈ log_{M/B} (N/M)

Optimal cost = Θ((N/B) log_{M/B} (N/M)) I/Os

In practice

M/B ≈ 1000 ⇒ #passes = log_{M/B} (N/M) ≈ 1
One multiway merge ⇒ 2 passes = a few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

May compression help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;
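A minimal Python sketch of this one-pass scan (the classical majority-vote formulation, which reorganizes the slide's if/else but keeps the same pair of variables; names are illustrative):

def majority_candidate(stream):
    # Keep one candidate X and a counter C; the true majority item, if it exists, survives.
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1
        elif X == s:
            C += 1
        else:
            C -= 1
    return X

print(majority_candidate("bacccdcbaaaccbccc"))  # 'c', which occurs 9 > N/2 times in this stream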

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 10^9 chars, size = 6GB
n = 10^6 documents
TotT = 10^9 total terms (avg term length is 6 chars)
t = 5 * 10^5 distinct terms

What kind of data structure should we build to support word-based searches?

Solution 1: Term-Doc matrix
n = 1 million documents, t = 500K terms

            Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony             1                1             0          0       0        1
Brutus             1                1             0          1       0        0
Caesar             1                1             0          1       1        1
Calpurnia          0                1             0          0       0        0
Cleopatra          1                0             0          0       0        0
mercy              1                0             1          1       1        1
worser             1                0             1          1       1        0

1 if the play contains the word, 0 otherwise

Space is 500Gb !

Solution 2: Inverted index
Brutus    → 2 4 8 16 32 64 128
Calpurnia → 1 2 3 5 8 13 21 34
Caesar    → 13 16

We can do still better: i.e. 30–50% of the original text

1. Typically use about 12 bytes per posting
2. We have 10^9 total terms ⇒ at least 12GB space
3. Compressing the 6GB of documents gets 1.5GB of data
Better index, but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM into fewer bits?
NO: they are 2^n, but the strictly shorter strings are fewer:

Σ_{i=1}^{n-1} 2^i = 2^n − 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La(C) = Σ_{s ∈ S} p(s) · L[s]

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj ⇒ L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

The Shannon code takes ⌈log2 (1/p(s))⌉ bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
(Figure: the Huffman tree — a(.1) and b(.2) merge into (.3); (.3) and c(.2) merge into (.5); (.5) and d(.5) merge into the root (1).)

a=000, b=001, c=01, d=1

There are 2^(n-1) “equivalent” Huffman trees

What about ties (and thus, tree depth) ?
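A small Python sketch of the construction with a heap, using the probabilities of the running example (tie-breaking is arbitrary, which is exactly why several equivalent trees exist; names are illustrative):

import heapq

def huffman_codes(probs):
    # probs: dict symbol -> probability. Repeatedly merge the two least probable trees.
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}   # left subtree gets a leading 0
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

print(huffman_codes({"a": .1, "b": .2, "c": .2, "d": .5}))
# e.g. {'d': '0', 'c': '10', 'a': '110', 'b': '111'}: same codeword lengths as a=000, b=001, c=01, d=1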

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
abc... → 000 001 01 ...
101001... → d c b ...

(Figure: the same Huffman tree is used for both encoding and decoding.)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
− log2(0.999) ≈ 0.00144

If we were to send 1000 such symbols we might hope to use 1000 · 0.00144 ≈ 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:

Model takes |S|^k · (k · log |S|) + h^2

It is H0(S^L) ≤ L · Hk(S) + O(k · log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
(Figure: a word-based, tagged Huffman tree with fan-out 128 over the words of T = “bzip or not bzip”; each codeword is a sequence of 7-bit configurations packed into bytes, and the tag bit marks the first byte of every codeword. C(T) is the concatenation of the byte-aligned codewords.)

CGrep and other ideas...
P= bzip = 1a 0b

(Figure: GREP is run directly over the bytes of C(T), T = “bzip or not bzip”, searching for the byte-aligned codeword of P; thanks to the tag bits, candidate matches are accepted or rejected without decompressing the text.)

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary: bzip, not, or, space
P = bzip = 1a 0b

(Figure: the word-based Huffman codewords of the dictionary terms and the compressed text C(S), S = “bzip or not bzip”; the codeword of P is matched against C(S) byte by byte.)

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errs, whose running time is efficient with high probability.

We will consider a binary alphabet (i.e., T ∈ {0,1}^n).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H(s) = Σ_{i=1}^{m} 2^{m−i} · s[i]

P = 0101
H(P) = 2^3·0 + 2^2·1 + 2^1·0 + 2^0·1 = 5
s = s’ if and only if H(s) = H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1):

H(T_r) = 2 · H(T_{r−1}) − 2^m · T(r−1) + T(r+m−1)

T = 10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H(T1) = H(1011) = 11
H(T2) = H(0110) = 2·11 − 2^4·1 + 0 = 22 − 16 + 0 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally:
1·2 (mod 7) + 0 = 2
2·2 (mod 7) + 1 = 5
5·2 (mod 7) + 1 = 4
4·2 (mod 7) + 1 = 2
2·2 (mod 7) + 1 = 5
5 (mod 7) = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1):
2^m (mod q) = 2·(2^{m−1} (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
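A minimal Python sketch of the matcher over a binary text (here q is a fixed prime purely for illustration; a real implementation picks q at random below I, and the explicit check of each candidate yields the deterministic variant):

def karp_rabin(T, P, q=2**31 - 1):
    # Hq(s) = (sum_i 2^(m-i) * s[i]) mod q, maintained incrementally over T's windows.
    n, m = len(T), len(P)
    if m > n:
        return []
    pow_m = pow(2, m, q)                      # 2^m mod q, used to drop the leftmost bit
    hp = ht = 0
    for i in range(m):
        hp = (2 * hp + int(P[i])) % q
        ht = (2 * ht + int(T[i])) % q
    matches = []
    for r in range(n - m + 1):
        if ht == hp and T[r:r+m] == P:        # the explicit check makes it error-free
            matches.append(r + 1)             # 1-based positions, as in the slides
        if r + m < n:
            ht = (2 * ht - pow_m * int(T[r]) + int(T[r + m])) % q
    return matches

print(karp_rabin("10110101", "0101"))  # [5]: P occurs starting at position 5 of T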

Problem 1: Solution
Dictionary: bzip, not, or, space
P = bzip = 1a 0b

(Figure: C(S), S = “bzip or not bzip”, is scanned codeword by codeword; each byte-aligned candidate is compared with the codeword of P, and the occurrences of “bzip” are reported.)

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

(The 3×10 matrix M has M(1,5) = 1 for ‘f’, M(2,6) = 1 for ‘fo’, M(3,7) = 1 for ‘for’, and 0 everywhere else.)

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to compute the j-th column of M from the (j−1)-th one.
We define the m-length binary vector U(x) for each character x in the alphabet: U(x) is set to 1 for the positions in P where character x appears.
Example:
P = abaac
U(a) = (1,0,1,1,0)
U(b) = (0,1,0,0,0)
U(c) = (0,0,0,0,1)

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i−1,j−1) into the i-th position
AND this with the i-th bit of U(T[j]) to establish if both are true

An example: T = xabxabaaca, P = abaac

(Worked example: the columns M(1), M(2), M(3), …, M(9) are computed one from the other via M(j) = BitShift(M(j−1)) & U(T[j]); at j = 9 the 5-th bit of the column is 1, i.e. P = abaac occurs ending at position 9 of T.)

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.
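A compact Python sketch of the scan (names are illustrative; Python integers stand in for the w-bit machine word, with bit i−1 of the column playing the role of row i of M):

def shift_and(T, P):
    # Precompute U(x): bit i-1 is set iff P[i] == x.
    m = len(P)
    U = {}
    for i, ch in enumerate(P):
        U[ch] = U.get(ch, 0) | (1 << i)
    col = 0                                  # column M(0): all zeros
    occ = []
    for j, ch in enumerate(T, start=1):
        # M(j) = BitShift(M(j-1)) & U(T[j]); BitShift = shift down and set the first bit
        col = ((col << 1) | 1) & U.get(ch, 0)
        if col & (1 << (m - 1)):             # row m is set: P ends at position j
            occ.append(j - m + 1)            # report 1-based starting positions
    return occ

print(shift_and("xabxabaaca", "abaac"))  # [5]: the only occurrence starts at position 5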

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
U(a) = (1,0,1,1,0)
U(b) = (1,1,0,0,0)
U(c) = (0,0,0,0,1)

What about ‘?’, ‘[^…]’ (negation)?

Problem 1: Another solution
Dictionary: bzip, not, or, space
P = bzip = 1a 0b

(Figure: the same dictionary, codewords and compressed text C(S), S = “bzip or not bzip”, with the codeword of P searched in C(S).)

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

Dictionary: bzip, not, or, space
P = o

not = 1g 0g 0a
or = 1g 0a 0b

(Figure: the dictionary terms containing P — “or” and “not” — are searched in C(S), S = “bzip or not bzip”, via their codewords.)

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
(Figure: a text T with occurrences of the patterns P1 and P2 marked.)

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

S is the concatenation of the patterns in P
R is a bitmap of length m:
  R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of the Shift-And method searching for S:
  For any symbol c, U’(c) = U(c) AND R
    U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
  For any step j,
    compute M(j)
    then M(j) OR U’(T[j]). Why?
      It sets to 1 the first bit of each pattern that starts with T[j]
    Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

Dictionary: bzip, not, or, space
P = bot, k = 2

(Figure: the dictionary codewords and the compressed text C(S), S = “bzip or not bzip”.)

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff there are no more than l mismatches between the first i characters of P and the i characters of T ending at character j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

BitShift(M^l(j−1)) & U(T[j])

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

BitShift(M^{l−1}(j−1))

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M^l(j) = [BitShift(M^l(j−1)) & U(T[j])]  OR  BitShift(M^{l−1}(j−1))

Example M1
T = xabxabaaca, P = abaad

(Worked example: the 5×10 matrices M0 and M1; a 1 in the last row of M1 marks a position of T where P occurs with at most one mismatch — e.g. abaac ending at position 9.)

How much do we pay?





The running time is O(k·n·(1+m/w)).
Again, the method is practically efficient for small m.
Still, only O(k) columns of M are needed at any given time. Hence, the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

Dictionary: bzip, not, or, space
P = bot, k = 2

not = 1g 0g 0a

(Figure: the term “not” matches P = bot within 2 mismatches, so its codeword is searched in C(S), S = “bzip or not bzip”.)

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding
γ(x) = 0^(Length−1) followed by x in binary, with x > 0 and Length = ⌊log2 x⌋ + 1
e.g., 9 is represented as <000, 1001>.

The γ-code for x takes 2·⌊log2 x⌋ + 1 bits
(i.e. a factor of 2 from optimal)

Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers

It is a prefix-free encoding…

Given the following sequence of γ-coded integers, reconstruct the original sequence:

0001000001100110000011101100111

→ 8, 6, 3, 59, 7
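A small Python sketch of the γ encoder/decoder (bit strings are ordinary Python strings here, purely for illustration):

def gamma_encode(x):
    # x > 0: (Length-1) zeros followed by x in binary, Length = floor(log2 x) + 1
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":               # count the leading zeros = Length - 1
            z += 1
            i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out

print("".join(gamma_encode(x) for x in [8, 6, 3, 59, 7]))
# 0001000001100110000011101100111, the string above
print(gamma_decode("0001000001100110000011101100111"))  # [8, 6, 3, 59, 7]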

Analysis
Sort the p_i in decreasing order, and encode s_i via the variable-length code γ(i).

Recall that: |γ(i)| ≤ 2·log2 i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2·H0(S) + 1
Key fact:
1 ≥ Σ_{i=1,…,x} p_i ≥ x·p_x  ⇒  x ≤ 1/p_x

How good is it?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log2 i + 1
The cost of the encoding is (recall i ≤ 1/p_i):

Σ_{i=1,…,|S|} p_i · |γ(i)|  ≤  Σ_{i=1,…,|S|} p_i · [2·log2 (1/p_i) + 1]  =  2·H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
and s·c with 2 bytes, s·c^2 with 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 128^2 = 16512 words on at most 2 bytes
(230,26)-dense code encodes 230 + 230·26 = 6210 words on at most 2 bytes, hence more words on 1 byte and thus better if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L
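A minimal Python sketch of the MTF transform (positions are 0-based here, and the initial list order is a parameter; the decoder simply mirrors the same list updates):

def mtf_encode(text, alphabet):
    L = list(alphabet)
    out = []
    for s in text:
        i = L.index(s)          # 1) output the position of s in the current list
        out.append(i)
        L.pop(i)
        L.insert(0, s)          # 2) move s to the front of L
    return out

print(mtf_encode("aaabbbbaab", "ab"))  # [0, 0, 0, 1, 0, 0, 0, 1, 0, 1]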

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n^2 log n), MTF = O(n log n) + n^2

Not much worse than Huffman
...but it may be far better

MTF: how good is it?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log2 i + 1
Put S in front and consider the cost of encoding (p_i^x is the position of the i-th occurrence of symbol x, and n_x its number of occurrences):

O(|S| log |S|) + Σ_{x=1}^{|S|} Σ_{i=2}^{n_x} |γ(p_i^x − p_{i−1}^x)|

By Jensen’s inequality:

≤ O(|S| log |S|) + Σ_{x=1}^{|S|} n_x · [2·log2 (N/n_x) + 1]
= O(|S| log |S|) + N·[2·H0(X) + 1]

La[mtf] ≤ 2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1^n 2^n 3^n … n^n  ⇒

There is a memory

Huff(X) = Θ(n^2 log n) > Rle(X) = Θ(n log n)
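A tiny Python sketch of the run-length transform on a character string (itertools.groupby does the run detection):

from itertools import groupby

def rle(s):
    # abbbaacccca -> [('a', 1), ('b', 3), ('a', 2), ('c', 4), ('a', 1)]
    return [(ch, len(list(run))) for ch, run in groupby(s)]

print(rle("abbbaacccca"))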

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.  p(a) = .2, p(b) = .5, p(c) = .3

f(i) = Σ_{j=1}^{i−1} p(j)

f(a) = .0, f(b) = .2, f(c) = .7

(Figure: the unit interval [0,1) split into a = [0,.2), b = [.2,.7), c = [.7,1).)

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
(Figure: starting from [0,1), symbol b restricts the interval to [.2,.7); within it, a restricts to [.2,.3); within that, c restricts to [.27,.3).)

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l 0  0 l i  l i -1  s i -1 * f c i 

s0  1

s i  s i -1 * p c i 

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

n

sn 

 p c 
i

i 1

The interval for a message sequence will be called the
sequence interval
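A small Python sketch of the interval computation just defined (it returns the final sequence interval [l, l+s); emitting bits and the integer/rescaled version are the refinements discussed below; the symbol order used for f is an assumption both encoder and decoder must share):

def sequence_interval(msg, p):
    # p: dict symbol -> probability; f[c] = cumulative probability of the symbols before c
    f, acc = {}, 0.0
    for c in sorted(p):
        f[c] = acc
        acc += p[c]
    l, s = 0.0, 1.0
    for c in msg:
        l = l + s * f[c]        # l_i = l_{i-1} + s_{i-1} * f[c_i]
        s = s * p[c]            # s_i = s_{i-1} * p[c_i]
    return l, l + s

print(sequence_interval("bac", {"a": .2, "b": .5, "c": .3}))  # (0.27, 0.3) up to float rounding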

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
(Figure: .49 falls into b = [.2,.7); rescaling, it falls into b’s sub-interval again, and then into c’s.)

The message is bbc.

Representing a real number
Binary fractional representation:
.75 = .11      1/3 = .01010101…      11/16 = .1011

Algorithm
1. x = 2·x
2. If x < 1 output 0
3. else x = x − 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
e.g.  [0,.33) → .01      [.33,.66) → .1      [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
code    min      max      interval
.11     .110…    .111…    [.75, 1.0)
.101    .1010…   .1011…   [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

(Figure: the sequence interval [.61, .79) contains the code interval of .101, which is [.625, .75).)

Can use L + s/2 truncated to 1 + ⌈log2 (1/s)⌉ bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most
1 + ⌈log2 (1/s)⌉ = 1 + ⌈log2 Π_i (1/p_i)⌉
≤ 2 + Σ_{j=1,…,n} log2 (1/p_j)
= 2 + Σ_{k=1,…,|S|} n·p_k · log2 (1/p_k)
= 2 + n·H0   bits

nH0 + 0.02·n bits in practice, because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R = 2^k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
  Output 1 followed by m 0s; m = 0; the message interval is expanded by 2

If u < R/2 then (bottom half)
  Output 0 followed by m 1s; m = 0; the message interval is expanded by 2

If l ≥ R/4 and u < 3R/4 then (middle half)
  Increment m; the message interval is expanded by 2

All other cases: just continue...

You find this at

Arithmetic ToolBox
As a state machine

(Figure: the ATB takes the current interval (L,s), the next symbol c and its distribution (p_1,…,p_|S|), and produces the new interval (L’,s’).)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
(Figure: the same ATB state machine, now fed with p[s | context], where s is either a character c or the escape symbol esc.)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
String = ACCBACCACBA B   (k = 2)

Empty context:  A = 4, B = 2, C = 5, $ = 3

Order-1 contexts:
A: C = 3, $ = 1      B: A = 2, $ = 1      C: A = 1, B = 2, C = 2, $ = 3

Order-2 contexts:
AC: B = 1, C = 2, $ = 2      BA: C = 1, $ = 1      CA: C = 1, $ = 1
CB: A = 2, $ = 1             CC: A = 1, B = 1, $ = 2
You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character
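A minimal Python sketch of the encoder with a sliding window (the brute-force O(n·W) match search is only for illustration — gzip-style implementations hash triples instead; W = 6 matches the example above):

def lz77_encode(T, W=6):
    out, i, n = [], 0, len(T)
    while i < n:
        best_len, best_d = 0, 0
        for j in range(max(0, i - W), i):             # candidate copy positions inside the window
            l = 0
            while i + l < n - 1 and T[j + l] == T[i + l]:
                l += 1                                # copies overlapping the cursor are allowed
            if l > best_len:
                best_len, best_d = l, i - j
        nxt = T[i + best_len]                         # next char beyond the longest match
        out.append((best_d, best_len, nxt))
        i += best_len + 1                             # advance by len + 1
    return out

print(lz77_encode("aacaacabcabaaac"))
# [(0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), (3, 3, 'a'), (1, 2, 'c')], as in the example above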

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
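A minimal Python sketch of the LZW encoder (here the dictionary is seeded only with the symbols that actually occur, using the slide's toy ids a=112, b=113, c=114 instead of all 256 ASCII entries; the special SSc decoder case is not shown):

def lzw_encode(T, initial):
    D = dict(initial)              # string -> id
    next_id = 256
    out = []
    S = T[0]
    for c in T[1:]:
        if S + c in D:
            S = S + c              # extend the current match
        else:
            out.append(D[S])
            D[S + c] = next_id     # add Sc to the dictionary; c itself is not emitted
            next_id += 1
            S = c
    out.append(D[S])
    return out

print(lzw_encode("aabaacababacb", {"a": 112, "b": 113, "c": 114}))
# [112, 112, 113, 256, 114, 257, 261, 114, 113] — the slide's sequence plus the final 'b'
# with 256=aa, 257=ab, 258=ba, 259=aac, 260=ca, 261=aba, 262=abac, 263=cb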

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
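A small Python sketch of both directions (the construction below sorts all rotations, i.e. the "elegant but inefficient" route discussed next; it assumes T ends with a unique, smallest end-marker such as '#'):

def bwt(T):
    # Sort all cyclic rotations of T and take the last column L.
    n = len(T)
    rots = sorted(range(n), key=lambda i: T[i:] + T[:i])
    return "".join(T[(i - 1) % n] for i in rots)

def ibwt(L):
    # Rebuild T backwards with the LF-mapping: L[r] precedes F[r] in T.
    n = len(L)
    order = sorted(range(n), key=lambda i: (L[i], i))   # stable sort: row in F of each char of L
    LF = [0] * n
    for f_row, l_row in enumerate(order):
        LF[l_row] = f_row
    out, r = [], 0                                      # row 0 is the rotation starting with '#'
    for _ in range(n):
        out.append(L[r])
        r = LF[r]
    # out = T[n-2], T[n-3], ..., T[0], '#': put the pieces back in order
    return "".join(reversed(out[:-1])) + out[-1]

print(bwt("mississippi#"))        # ipssm#pissii
print(ibwt(bwt("mississippi#")))  # mississippi#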

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n^2 log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in - degree ( u )  k ] 

a  2.1

1

k

a

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/x^a, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides and efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
           Emacs size   Emacs time
uncompr    27Mb         ---
gzip       8Mb          35 secs
zdelta     1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

(Figure: a weighted graph over the files plus a dummy node; edge weights are zdelta sizes (e.g. 20, 123, 220, 620, 2000) and the min branching selects the cheapest reference for each file.)

           space   time
uncompr    30Mb    ---
tgz        20%     linear
THIS       8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n^2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving zdelta executions. Nonetheless, strictly n^2 time.

           space   time
uncompr    260Mb   ---
tgz        12%     2 mins
THIS       8%      16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

          gcc size   emacs size
total     27288      27326
gzip      7563       8577
zdelta    227        1431
rsync     964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3–5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
(Figure: the suffix tree of T# = mississippi#, with edge labels on the arcs and the starting position of the corresponding suffix stored in each leaf.)

T# = mississippi#
The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Θ(N^2) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does it exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does it exist a substring of length ≥ L occurring ≥ C times ?
• Search for Lcp[i,i+C-2] whose entries are ≥ L


Slide 94

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:
 n = 10^9 tuples  few GBs

Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:
 It is an indirect sort: Θ(n log2 n) random I/Os
 [5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
[figure: merge-sort recursion tree on a 16-element example, showing the sorted runs produced at each of the log2 N merge levels — how do we deploy the disk/memory features?]
N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort

 The key is to balance run-size and #runs to merge
 Sort N items with main-memory M and disk-pages B:
   Pass 1: Produce (N/M) sorted runs.
   Pass i: merge X ≤ M/B runs  log_{M/B} (N/M) passes

[figure: X input buffers (INPUT 1 … INPUT X) plus one OUTPUT buffer, each of B items, kept in main memory; runs are read from disk and the merged run is written back to disk]

Multiway Merging

[figure: each run i has a buffer Bf_i with a pointer p_i to its current element; repeatedly output min(Bf_1[p_1], Bf_2[p_2], …, Bf_X[p_X]) into the output buffer Bf_o; fetch the next page of run i when p_i reaches B, flush Bf_o to the merged output file when it is full; X = M/B runs are merged until EOF]
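A minimal in-memory Python sketch of the multiway merging step, using heapq to pick the minimum head among the X runs (the page-by-page buffering of the figure is omitted):

import heapq

def multiway_merge(runs):
    # runs: list of already-sorted lists. A heap of (value, run index, position)
    # plays the role of min(Bf_1[p_1], ..., Bf_X[p_X]) in the figure above.
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        val, i, pos = heapq.heappop(heap)
        out.append(val)
        if pos + 1 < len(runs[i]):
            heapq.heappush(heap, (runs[i][pos + 1], i, pos + 1))
    return out

print(multiway_merge([[1, 5, 13, 19], [7, 9], [3, 4, 8, 15], [6, 11, 12, 17]]))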

Cost of Multi-way Merge-Sort

 Number of passes = log_{M/B} #runs  log_{M/B} (N/M)
 Optimal cost = Θ((N/B) log_{M/B} (N/M)) I/Os

In practice
 M/B ≈ 1000  #passes = log_{M/B} (N/M) ≈ 1
 One multiway merge  2 passes = few mins (tuning depends on disk features)

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Can compression help?


Goal: enlarge M and reduce N
 #passes = O(log_{M/B} (N/M))
 Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how far we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (Σ large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm

 Use a pair of variables <X,C>; set X to the first item and C = 1
 For each subsequent item s of the stream,
   if (X == s) then C++
   else { C--; if (C == 0) { X = s; C = 1; } }
 Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...
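A short Python sketch of the pair-of-variables idea, written here in the textbook form of the Boyer–Moore majority vote (it returns the majority item whenever one occurs more than N/2 times; otherwise its answer is arbitrary and should be verified with a second pass):

def majority_candidate(stream):
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1        # adopt a new candidate
        elif X == s:
            C += 1             # one more unpaired occurrence of the candidate
        else:
            C -= 1             # pair this item with one occurrence of the candidate
    return X

A = "bacccdcbaaaccbccc"
print(majority_candidate(A))   # 'c', which occurs more than half of the time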

Toy problem #4: Indexing


Consider the following TREC collection:
 N = 6 * 10^9  size = 6Gb
 n = 10^6 documents
 TotT = 10^9 (avg term length is 6 chars)
 t = 5 * 10^5 distinct terms

What kind of data structure should we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million documents, t = 500K terms

             Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
  Antony            1                1             0          0        0        1
  Brutus            1                1             0          1        0        0
  Caesar            1                1             0          1        1        1
  Calpurnia         0                1             0          0        0        0
  Cleopatra         1                0             0          0        0        0
  mercy             1                0             1          1        1        1
  worser            1                0             1          1        1        0

1 if play contains word, 0 otherwise

Space is 500Gb !

Solution 2: Inverted index
Brutus     →  2 4 8 16 32 64 128
Caesar     →  1 2 3 5 8 13 21 34
Calpurnia  →  13 16

We can still do better: i.e. 30–50% of the original text

1. Typically use about 12 bytes per posting
2. We have 10^9 total terms  at least 12Gb space
3. Compressing the 6Gb of documents gets 1.5Gb of data
Better index, but it is still >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2^n but we have fewer compressed messages:

  Σ_{i=1}^{n-1} 2^i = 2^n − 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
  H(S) = Σ_{s∈S} p(s) · log2 (1/p(s))   bits
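For concreteness, a tiny Python helper computing H(S) from a probability table (illustrative, not part of the slides; the probabilities are those of the Huffman running example further below):

from math import log2

def entropy(p):
    # p: dict symbol -> probability; H(S) = sum p(s) * log2(1/p(s))
    return sum(ps * log2(1.0 / ps) for ps in p.values() if ps > 0)

print(entropy({'a': 0.1, 'b': 0.2, 'c': 0.2, 'd': 0.5}))  # ≈ 1.76 bits/symbol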

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie

[figure: binary trie with a = 0, b = 100, c = 101, d = 11 at the leaves]

Average Length
For a code C with codeword length L[s], the
average length is defined as
  La(C) = Σ_{s∈S} p(s) · L[s]

We say that a prefix code C is optimal if for
all prefix codes C’, La(C) ≤ La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, there exists a prefix
code with the same codeword lengths and thus the
same optimal average length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}
then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

  H(S) ≤ La(C)

Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

  La(C) ≤ H(S) + 1

(the Shannon code takes ⌈log2 1/p(s)⌉ bits per symbol)

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

[figure: Huffman tree — a(.1) and b(.2) merge into (.3), then with c(.2) into (.5), then with d(.5) into (1)]

a=000, b=001, c=01, d=1
There are 2^(n-1) “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
  abc…  00000101
  101001…  dcb

[figure: the same Huffman tree as above, used to follow root-to-leaf paths]
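A compact Python sketch of Huffman tree construction with a heap, run on the probabilities of the example above (helper name is mine; ties may produce a different but equally optimal code):

import heapq

def huffman_codes(probs):
    # probs: dict symbol -> probability. Repeatedly merge the two least-probable trees.
    heap = [(p, i, (sym,)) for i, (sym, p) in enumerate(sorted(probs.items()))]
    heapq.heapify(heap)
    codes = {sym: "" for sym in probs}
    next_id = len(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)
        p2, _, right = heapq.heappop(heap)
        for sym in left:          # left subtree gets bit 0
            codes[sym] = "0" + codes[sym]
        for sym in right:         # right subtree gets bit 1
            codes[sym] = "1" + codes[sym]
        heapq.heappush(heap, (p1 + p2, next_id, left + right))
        next_id += 1
    return codes

print(huffman_codes({'a': 0.1, 'b': 0.2, 'c': 0.2, 'd': 0.5}))
# an optimal prefix code with codeword lengths {a:3, b:3, c:2, d:1},
# equivalent to the a=000, b=001, c=01, d=1 of the slide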

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store, for any level L of the codeword tree:
 firstcode[L]
 Symbol[L,i], for each i in level L

This is ≤ h^2 + |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k  ∞ !!

In practice, we have:
 Model takes |S|^k * (k * log |S|) + h^2 bits  (where h might be |S|^k)
 It is H0(S^L) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
[figure: word-based Huffman over T = “bzip or not bzip” — the tree has fan-out 128, so each codeword is a sequence of bytes carrying 7 bits of code plus a 1-bit tag; the tag marks the first byte of each codeword (e.g. the codeword of “or”)]

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary = { bzip, not, or, space }
P = bzip = 1a 0b

[figure: the byte-aligned codeword of P is matched against C(S), S = “bzip or not bzip”; the tag bits mark where codewords start, so each candidate alignment is answered yes/no]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
[figure: pattern P[1,m] slid along text T[1,n]]
 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching

 We show methods in which Arithmetic and Bit-operations replace comparisons
 We will survey two examples of such methods:
   The Random Fingerprint method due to Karp and Rabin
   The Shift-And method due to Baeza-Yates and Gonnet

Rabin-Karp Fingerprint

 We will use a class of functions from strings to integers in order to obtain:
   An efficient randomized algorithm that makes an error with small probability.
   A randomized algorithm that never errs, whose running time is efficient with high probability.

 We will consider a binary alphabet (i.e., T ∈ {0,1}^n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

  H(s) = Σ_{i=1}^{m} 2^(m−i) · s[i]

P = 0101
H(P) = 2^3·0 + 2^2·1 + 2^1·0 + 2^0·1 = 5
s = s’ if and only if H(s) = H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

  H(T_r) = 2·H(T_{r-1}) − 2^m·T[r−1] + T[r+m−1]

T = 10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H(T1) = H(1011) = 11
H(T2) = H(0110) = 2·11 − 2^4·1 + 0 = 22 − 16 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P = 101111
q = 7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
  1·2 + 0 (mod 7) = 2
  2·2 + 1 (mod 7) = 5
  5·2 + 1 (mod 7) = 4
  4·2 + 1 (mod 7) = 2
  2·2 + 1 (mod 7) = 5
  5 (mod 7) = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1):
  2^m (mod q) = 2·(2^(m-1) (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
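A short Python sketch of the fingerprint scan described above, over a binary text (the prime q is fixed here just for illustration; a real implementation would pick it at random as in the algorithm, and verify probable matches):

def karp_rabin(T, P, q=2**31 - 1):
    # T, P: strings over {0,1}. Reports positions r where Hq(Tr) == Hq(P).
    n, m = len(T), len(P)
    if m > n:
        return []
    hp = ht = 0
    for i in range(m):                       # fingerprints of P and of T[0,m-1]
        hp = (2 * hp + int(P[i])) % q
        ht = (2 * ht + int(T[i])) % q
    top = pow(2, m - 1, q)                   # 2^(m-1) mod q, kept small
    matches = []
    for r in range(n - m + 1):
        if ht == hp:
            matches.append(r)
        if r + m < n:                        # slide the window by one position
            ht = (2 * (ht - top * int(T[r])) + int(T[r + m])) % q
    return matches

print(karp_rabin("10110101", "0101"))        # [4], i.e. position 5 in the slides' 1-based numbering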

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

[figure: the 3×10 matrix M for T = california and P = for — row i has a 1 only where the first i characters of P match the text ending there: M(1,5)=1 (“f”), M(2,6)=1 (“fo”), M(3,7)=1 (“for”); all other entries are 0]

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M

 We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one
 We define the m-length binary vector U(x) for each character x in the alphabet: U(x) is set to 1 for the positions in P where character x appears.

Example: P = abaac
  U(a) = (1,0,1,1,0)ᵀ   U(b) = (0,1,0,0,0)ᵀ   U(c) = (0,0,0,0,1)ᵀ

How to construct M

 Initialize column 0 of M to all zeros
 For j > 0, the j-th column is obtained by

   M(j) = BitShift( M(j-1) ) & U( T[j] )

 For i > 1, entry M(i,j) = 1 iff
   (1) the first i-1 characters of P match the i-1 characters of T ending at character j-1  ⇔  M(i-1,j-1) = 1
   (2) P[i] = T[j]  ⇔  the i-th bit of U(T[j]) = 1

 BitShift moves bit M(i-1,j-1) into the i-th position;
 AND this with the i-th bit of U(T[j]) to establish if both are true
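A minimal Python sketch of the Shift-And scan just described, using an integer as the bit column (bit i-1 of the integer stands for row i; the function name is mine):

def shift_and(T, P):
    # Build U(x): bit i-1 is set iff P[i] == x (1-based i, as in the slides).
    U = {}
    for i, ch in enumerate(P):
        U[ch] = U.get(ch, 0) | (1 << i)
    m = len(P)
    accept = 1 << (m - 1)          # a 1 in the last row means "P ends here"
    M = 0
    occs = []
    for j, ch in enumerate(T):
        # BitShift(M) sets the first bit to 1 and shifts the others down by one.
        M = ((M << 1) | 1) & U.get(ch, 0)
        if M & accept:
            occs.append(j - m + 1) # 0-based starting position of the occurrence
    return occs

print(shift_and("xabxabaaca", "abaac"))   # [4], i.e. the occurrence ending at position 9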

An example j=1
[example: T = xabxabaaca, P = abaac; T[1] = x does not occur in P, so BitShift(M(0)) & U(x) is the all-zero column M(1)]

An example j=2
[example: T[2] = a; BitShift(M(1)) & U(a) = (1,0,0,0,0)ᵀ — only the 1-char prefix “a” matches, ending at position 2]

An example j=3
[example: T[3] = b; BitShift(M(2)) & U(b) = (0,1,0,0,0)ᵀ — the prefix “ab” matches, ending at position 3]

An example j=9
[example: T[9] = c; BitShift(M(8)) & U(c) = (0,0,0,0,1)ᵀ, i.e. M(5,9) = 1 — the whole P = abaac occurs, ending at position 9]

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions

 We want to allow the pattern to contain special symbols, like [a-f] classes of chars

Example: P = [a-b]baac
  U(a) = (1,0,1,1,0)ᵀ   U(b) = (1,1,0,0,0)ᵀ   U(c) = (0,0,0,0,1)ᵀ

 What about ‘?’, ‘[^…]’ (not) ?

Problem 1: Another solution

Dictionary = { bzip, not, or, space },  P = bzip = 1a 0b

[figure: P’s codeword is again matched against C(S), S = “bzip or not bzip”, answering yes/no at each codeword-aligned position]

Speed ≈ Compression ratio

Problem 2

Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring

Dictionary = { bzip, not, or, space },  P = o

not = 1g 0g 0a
or  = 1g 0a 0b

[figure: the terms containing “o” are not and or; each of their codewords is searched separately in C(S), S = “bzip or not bzip”]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[figure: patterns P1 and P2 slid along text T]
 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

 S is the concatenation of the patterns in P
 R is a bitmap of length m:
   R[i] = 1 iff S[i] is the first symbol of a pattern

 Use a variant of the Shift-And method searching for S:
   For any symbol c, U’(c) = U(c) AND R
    U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
   For any step j,
    compute M(j)
    then M(j) OR U’(T[j]). Why?
    Set to 1 the first bit of each pattern that starts with T[j]
    Check if there are occurrences ending in j. How?

Problem 3

Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring, allowing at most k mismatches

Dictionary = { bzip, not, or, space },  P = bot,  k = 2

[figure: the codewords of the dictionary terms are matched against C(S), S = “bzip or not bzip”, tolerating up to k mismatching symbols]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
there are no more than l mismatches between the
first i characters of P and the i characters of T
ending at character j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

[figure: P[1..i-1] aligned against T ending at j-1, followed by the equal pair P[i] = T[j]]

  BitShift( M^l(j-1) ) & U( T[j] )

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

[figure: P[1..i-1] aligned against T ending at j-1 with at most l-1 mismatches; position j then contributes the l-th mismatch]

  BitShift( M^(l-1)(j-1) )

Computing Ml

 We compute M^l for all l = 0, …, k.
 For each j compute M^0(j), M^1(j), …, M^k(j)
 For all l, initialize M^l(0) to the zero vector.
 In order to compute M^l(j), we observe that there is a match iff

   M^l(j) = [ BitShift( M^l(j-1) ) & U(T[j]) ]  OR  BitShift( M^(l-1)(j-1) )

Example M1

[figure: the matrices M^0 and M^1 for T = xabxabaaca and P = abaad; M^0 has no 1 in its last row (no exact occurrence), while M^1(5,9) = 1 — P occurs ending at position 9 with at most one mismatch]

How much do we pay?





The running time is O(kn(1+m/w))
Again, the method is practically efficient for small m.
Only O(k) columns of M are needed at any given time;
hence the space used by the algorithm is O(k) memory words.

Problem 3: Solution

Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring, allowing k mismatches

Dictionary = { bzip, not, or, space },  P = bot,  k = 2

[figure: scanning C(S), S = “bzip or not bzip”; with k = 2 the term not (codeword 1g 0g 0a) is reported]

Agrep: more sophisticated operations

 The Shift-And method can solve other ops

 The edit distance between two strings p and s is
 d(p,s) = minimum number of operations needed to transform p into s via three ops:
   Insertion: insert a symbol in p
   Deletion: delete a symbol from p
   Substitution: change a symbol in p with a different one

 Example: d(ananas,banane) = 3

 Search by regular expressions
   Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding

  γ(x) = 000…0 (Length−1 zeroes) followed by x in binary

 x > 0 and Length = ⌊log2 x⌋ + 1
 e.g., 9 is represented as <000,1001>.

 γ-code for x takes 2⌊log2 x⌋ + 1 bits
 (i.e. a factor of 2 from optimal)

 Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers

It is a prefix-free encoding…

 Given the following sequence of γ-coded integers, reconstruct the original sequence:

  0001000001100110000011101100111

  → 8, 6, 3, 59, 7
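A small Python sketch of γ-encoding and decoding as defined above (function names are mine), reproducing the exercise:

def gamma_encode(x):
    # gamma(x) = (Length - 1) zeroes followed by x in binary, x > 0
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        zeroes = 0
        while bits[i] == "0":              # count the unary length prefix
            zeroes += 1
            i += 1
        out.append(int(bits[i:i + zeroes + 1], 2))
        i += zeroes + 1
    return out

code = "".join(gamma_encode(x) for x in (8, 6, 3, 59, 7))
print(code)                 # 0001000001100110000011101100111
print(gamma_decode(code))   # [8, 6, 3, 59, 7]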

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).

Recall that: |γ(i)| ≤ 2 * log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
  1 ≥ Σ_{i=1,...,x} pi ≥ x * px   x ≤ 1/px

How good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):

  Σ_{i=1,...,|S|} pi * |γ(i)|  ≤  Σ_{i=1,...,|S|} pi * [2 * log (1/pi) + 1]  =  2 * H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding

 Byte-aligned and tagged Huffman
   128-ary Huffman tree
   First bit of the first byte is tagged
   Configurations on 7 bits: just those of Huffman

 End-tagged dense code
   The rank r is mapped to the r-th binary sequence on 7*k bits
   First bit of the last byte is tagged

Surprising changes
 It is a prefix-code
 Better compression: it uses all 7-bit configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2

 A new concept: Continuers vs Stoppers
   Previously we used: s = c = 128

 The main idea is:
   s + c = 256 (we are playing with 8 bits)
   Thus s items are encoded with 1 byte
   And s*c with 2 bytes, s*c^2 on 3 bytes, ...

An example
 5000 distinct words
 ETDC encodes 128 + 128^2 = 16512 words on at most 2 bytes
 (230,26)-dense code encodes 230 + 230*26 = 6210 on at most 2 bytes, hence more on 1 byte and thus, if skewed, better...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 256−s.

 Brute-force approach
 Binary search:
   On real distributions, it seems there is one unique minimum

 Ks = max codeword length
 Fsk = cum. prob. of symbols whose |cw| <= k

Experiments: (s,c)-DC is quite interesting…

Search is 6% faster than byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer sequence, that can then be var-length coded

 Start with the list of symbols L=[a,b,c,d,…]
 For each input symbol s
   1) output the position of s in L
   2) move s to the front of L

There is a memory
Properties:
 Exploits temporal locality, and it is dynamic
 X = 1^n 2^n 3^n … n^n  Huff = O(n^2 log n), MTF = O(n log n) + n^2

Not much worse than Huffman
...but it may be far better
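A tiny Python sketch of the MTF transform just described (positions in the list are reported 0-based here):

def mtf_encode(text, alphabet):
    L = list(alphabet)               # the MTF list, front = most recently seen
    out = []
    for s in text:
        pos = L.index(s)             # 1) output the position of s in L
        out.append(pos)
        L.pop(pos)                   # 2) move s to the front of L
        L.insert(0, s)
    return out

print(mtf_encode("abbbaacccca", "abc"))   # [0, 1, 0, 0, 1, 0, 2, 0, 0, 0, 1]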

MTF: how good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2 * log i + 1
Put Σ in the front and consider the cost of encoding:

  O(|Σ| log |Σ|) + Σ_{x=1}^{|Σ|} Σ_{i=2}^{n_x} |γ( p_i^x − p_{i-1}^x )|

By Jensen’s inequality:

  ≤ O(|Σ| log |Σ|) + Σ_{x=1}^{|Σ|} n_x * [2 * log (N/n_x) + 1]
  = O(|Σ| log |Σ|) + N * [2 * H0(X) + 1]

  La[mtf] ≤ 2 * H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and separators) as symbols to be encoded
How to maintain the MTF-list efficiently:

 Search tree
   Leaves contain the symbols, ordered as in the MTF-list
   Nodes contain the size of their descending subtree
 Hash Table
   key is a symbol
   data is a pointer to the corresponding tree leaf

 Each tree operation takes O(log |Σ|) time
 Total is O(n log |Σ|), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit

There is a memory
Properties:
 Exploits spatial locality, and it is a dynamic code
 X = 1^n 2^n 3^n … n^n  Huff(X) = Θ(n^2 log n) > Rle(X) = Θ(n log n)
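A few lines of Python for the run-length transform in the example above:

from itertools import groupby

def rle(s):
    # Collapse each maximal run of equal symbols into a (symbol, run-length) pair.
    return [(ch, len(list(group))) for ch, group in groupby(s)]

print(rle("abbbaacccca"))   # [('a', 1), ('b', 3), ('a', 2), ('c', 4), ('a', 1)]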

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.   a = .2 → [0, .2),   b = .5 → [.2, .7),   c = .3 → [.7, 1.0)

  f(i) = Σ_{j=1}^{i-1} p(j)      f(a) = .0,  f(b) = .2,  f(c) = .7

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
[figure: coding bac — start with [0,1); b restricts it to [.2,.7); a restricts it to [.2,.3); c restricts it to [.27,.3)]

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c_1 c_2 … c_n with probabilities p[c], use the following:

  l_0 = 0      l_i = l_{i-1} + s_{i-1} * f[c_i]
  s_0 = 1      s_i = s_{i-1} * p[c_i]

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

  s_n = Π_{i=1}^{n} p[c_i]

The interval for a message sequence will be called the sequence interval
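A direct Python transcription of the (l_i, s_i) recurrences above, reproducing the bac example (floating point is used only for illustration; real coders use the integer version discussed below):

p = {'a': 0.2, 'b': 0.5, 'c': 0.3}
f = {'a': 0.0, 'b': 0.2, 'c': 0.7}   # cumulative probabilities, symbol excluded

def sequence_interval(msg):
    l, s = 0.0, 1.0
    for c in msg:
        l = l + s * f[c]             # l_i = l_{i-1} + s_{i-1} * f[c_i]
        s = s * p[c]                 # s_i = s_{i-1} * p[c_i]
    return l, s

l, s = sequence_interval("bac")
print(l, l + s)                      # ≈ 0.27 0.3, i.e. the sequence interval [.27, .3)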

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
[figure: decoding .49 — .49 ∈ [.2,.7) → b; within that interval, .49 ∈ [.3,.55) → b; within that, .49 ∈ [.475,.55) → c]

The message is bbc.

Representing a real number
Binary fractional representation:
  .75 = .11     1/3 = .0101…     11/16 = .1011

Algorithm
  1. x = 2*x
  2. If x < 1 output 0
  3. else x = x − 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
e.g.  [0,.33) → .01    [.33,.66) → .1    [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
  code    min      max       interval
  .11     .110…    .111…     [.75, 1.0)
  .101    .1010…   .1011…    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

[figure: the sequence interval [.61, .79) contains the code interval of .101, which is [.625, .75)]

Can use l + s/2 truncated to 1 + ⌈log2 (1/s)⌉ bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

  1 + ⌈log2 (1/s)⌉
  = 1 + ⌈log2 Π_{i=1,n} (1/p_i)⌉
  ≤ 2 + Σ_{i=1,n} log2 (1/p_i)
  = 2 + Σ_{k=1,|Σ|} n*p_k * log2 (1/p_k)
  = 2 + n*H0   bits

nH0 + 0.02 n bits in practice, because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever the sequence interval falls into the top,
bottom or middle half, expand the interval
by a factor of 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

[figure: the ATB as a state machine — given the current interval (L,s), the symbol c and the distribution (p1,…,p|Σ|), it outputs the refined interval (L’,s’)]

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
[figure: the same ATB state machine, now fed with p[s|context], where s is either a char c or the escape symbol esc]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts                         String = ACCBACCACBA B,   k=2

  Context Empty:  A=4  B=2  C=5  $=3

  Context A:   C=3  $=1
  Context B:   A=2  $=1
  Context C:   A=1  B=2  C=2  $=3

  Context AC:  B=1  C=2  $=2
  Context BA:  C=1  $=1
  Context CA:  C=1  $=1
  Context CB:  A=2  $=1
  Context CC:  A=1  B=1  $=2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n  ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window (size = 6)
a a c a a c a b c a b a a a c

  (0,0,a)   (1,1,c)   (3,4,b)   (3,3,a)   (1,2,c)

(each triple = ⟨distance of the longest match within W, its length, next character⟩)

LZ77 Decoding
Decoder keeps the same dictionary window as the encoder.
 Finds the substring and inserts a copy of it

What if len > d? (overlap with text to be compressed)
 E.g. seen = abcd, next codeword is (2,9,e)
 Simply copy starting at the cursor:
   for (i = 0; i < len; i++)
     out[cursor+i] = out[cursor-d+i]
 Output is correct: abcdcdcdcdcdce
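A small Python sketch of the decoder loop above, including the overlapping-copy case (triples are the ⟨distance, length, char⟩ of the slides):

def lz77_decode(triples):
    out = []
    for d, length, c in triples:
        cursor = len(out)
        for i in range(length):                # works even when length > d (overlap):
            out.append(out[cursor - d + i])    # the copied text is produced on the fly
        out.append(c)
    return "".join(out)

print(lz77_decode([(0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), (3, 3, 'a'), (1, 2, 'c')]))
# aacaacabcabaaac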

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Input: a a b a a c a b c a b c b

  Output   Dict.
  (0,a)    1 = a
  (1,b)    2 = ab
  (1,a)    3 = aa
  (0,c)    4 = c
  (2,c)    5 = abc
  (5,b)    6 = abcb

LZ78: Decoding Example

  Input    Decoded so far             Dict.
  (0,a)    a                          1 = a
  (1,b)    a ab                       2 = ab
  (1,a)    a ab aa                    3 = aa
  (0,c)    a ab aa c                  4 = c
  (2,c)    a ab aa c abc              5 = abc
  (5,b)    a ab aa c abc abcb         6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Input: a a b a a c a b a b a c b    (a = 112, b = 113, c = 114)

  Output   Dict.
  112      256 = aa
  112      257 = ab
  113      258 = ba
  256      259 = aac
  114      260 = ca
  257      261 = aba
  261      262 = abac
  114      263 = cb
LZW: Decoding Example

  Input   Decoded so far          Dict.
  112     a
  112     a a                     256 = aa
  113     a a b                   257 = ab
  256     a a b a a               258 = ba
  114     a a b a a c             259 = aac
  257     a a b a a c a b ?       260 = ca
  261     a a b a a c a b a b     261 = aba   (the decoder is one step behind the coder)
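A minimal Python sketch of the LZW coding loop (it initializes the dictionary with just the three symbols a, b, c instead of the 256 ASCII codes, purely to keep the example small):

def lzw_encode(text, alphabet="abc"):
    # Dictionary maps strings -> codes; initialized with the single symbols.
    dictionary = {ch: i for i, ch in enumerate(alphabet)}
    out, S = [], ""
    for c in text:
        if S + c in dictionary:
            S += c                               # extend the current match
        else:
            out.append(dictionary[S])            # emit the longest match found...
            dictionary[S + c] = len(dictionary)  # ...and add Sc to the dictionary
            S = c
    if S:
        out.append(dictionary[S])
    return out

print(lzw_encode("aabaacababacb"))   # [0, 0, 1, 3, 2, 4, 8, 2, 1]
# the same steps as the slide's example, with codes 0,1,2 in place of 112,113,114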

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us be given the text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows                                          (1994)

  F                 L
  # mississipp   i
  i #mississip   p
  i ppi#missis   s
  i ssippi#mis   s
  i ssissippi#   m
  m ississippi   #
  p i#mississi   p
  p pi#mississ   i
  s ippi#missi   s
  s issippi#mi   s
  s sippi#miss   i
  s sissippi#m   i

  L = BWT(T) = ipssm#pissii

A famous example

Much
longer...

A useful tool: the L  F mapping

[figure: the same sorted BWT matrix, with first column F = #iiiimppssss and last column L = ipssm#pissii]

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
[figure: the sorted BWT matrix again — first column F = # i i i i m p p s s s s, last column L = i p s s m # p i s s i i]

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}

How to compute the BWT ?
  SA    sorted rotation    L
  12    #mississippi       i
  11    i#mississipp       p
   8    ippi#mississ       s
   5    issippi#miss       s
   2    ississippi#m       m
   1    mississippi#       #
  10    pi#mississip       p
   9    ppi#mississi       i
   7    sippi#missis       s
   4    sissippi#mis       s
   6    ssippi#missi       i
   3    ssissippi#mi       i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]
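A compact Python sketch that builds the suffix array by plain sorting (the “elegant but inefficient” way of the next slide) and derives L from it via L[i] = T[SA[i]-1]:

def bwt(T):
    # T must end with a unique terminator, e.g. '#', smaller than every other char.
    SA = sorted(range(len(T)), key=lambda i: T[i:])     # suffix array, 0-based
    L = "".join(T[i - 1] for i in SA)                   # L[i] = T[SA[i]-1] (wraps for i=0)
    return L, [i + 1 for i in SA]                       # 1-based SA, as in the slides

L, SA = bwt("mississippi#")
print(L)    # ipssm#pissii
print(SA)   # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]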

How to construct SA from T ?
Input: T = mississippi#

  SA    suffix
  12    #
  11    i#
   8    ippi#
   5    issippi#
   2    ississippi#
   1    mississippi#
  10    pi#
   9    ppi#
   7    sippi#
   4    sissippi#
   6    ssippi#
   3    ssissippi#

Elegant but inefficient
Obvious inefficiencies:
• Θ(n^2 log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet size |Σ|+1

Bzip2-output = Arithmetic/Huffman on |Σ|+1 symbols...
... plus γ(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…

 Physical network graph
   V = routers
   E = communication links

 The “cosine” graph (undirected, weighted)
   V = static web pages
   E = semantic distance between pages

 Query-Log graph (bipartite, weighted)
   V = queries and URLs
   E = (q,u) if u is a result for q, and has been clicked by some user who issued q

 Social graph (undirected, unweighted)
   V = users
   E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in - degree ( u )  k ] 

a  2.1

1

k

a

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
[figure: adjacency-matrix dot-plot of a web crawl — 21 million pages, 150 million links]

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77-scheme provides an efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
            Emacs size   Emacs time
  uncompr   27Mb         ---
  gzip      8Mb          35 secs
  zdelta    1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

[figure: Client ↔ (slow link, delta-encoded requests/pages) ↔ Proxy ↔ (fast link) ↔ web; both sides hold the reference page, only the delta crosses the slow link]

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[figure: example weighted graph over the files, plus a dummy node whose edge weights are the gzip sizes; the min branching picks the cheapest reference for each file]

            space   time
  uncompr   30Mb    ---
  tgz       20%     linear
  THIS      8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n^2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta executions. Nonetheless, strictly n^2 time

            space    time
  uncompr   260Mb    ---
  tgz       12%      2 mins
  THIS      8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
[figure: the Client (holding f_old) sends a request to the Server (holding f_new); the Server replies with an update rather than the whole f_new]

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
[figure: the Client sends block hashes of f_old to the Server; the Server replies with the encoded f_new]

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

            gcc size   emacs size
  total     27288      27326
  gzip      7563       8577
  zdelta    227        1431
  rsync     964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), the client checks them
Server deploys the common f_ref to compress the new f_tar (rsync compresses just it).

A multi-round protocol

 k blocks of n/k elems
 log n/k levels

If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

[figure: P aligned at position i of T, i.e. P is a prefix of the suffix T[i,N]]
Occurrences of P in T = All suffixes of T having P as a prefix

Example: P = si, T = mississippi   occurrences at positions 4, 7

SUF(T) = Sorted set of suffixes of T

Reduction: from substring search to prefix search

The Suffix Tree
[figure: the suffix tree of T# = mississippi#, with edge labels such as #, i, s, si, ssi, ppi#, pi#, mississippi#, and the 12 leaves labelled with the suffix starting positions]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position in SA is the lexicographic position of P.

Storing SUF(T) explicitly would take Θ(N^2) space  keep only the array SA of suffix pointers:

  SA    SUF(T)
  12    #
  11    i#
   8    ippi#
   5    issippi#
   2    ississippi#
   1    mississippi#
  10    pi#
   9    ppi#
   7    sippi#
   4    sissippi#
   6    ssippi#
   3    ssissippi#

T = mississippi#      P = si

Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 random accesses per step

  SA = 12 11 8 5 2 1 10 9 7 4 6 3      T = mississippi#      P = si

[figure: two binary-search probes — against one probed suffix P turns out larger, against another it turns out smaller]

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time
  improvable to O(p + log2 N) [Manber-Myers, ’90] and to O(p + log2 |Σ|) [Cole et al, ’06]

Locating the occurrences
T = mississippi#,  P = si   occ = 2

[figure: the SA range of suffixes prefixed by “si” — sippi… (SA=7) and sissippi… (SA=4); binary-searching for si# and si$ (with # < Σ < $) delimits the range]

Suffix Array search
• O(p + log2 N + occ) time

Suffix Trays: O(p + log2 |Σ| + occ)   [Cole et al., ‘06]
String B-tree   [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays   [Ciriani et al., ’02]
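A short Python sketch of the indirect binary search over SA described above (O(p log N) character comparisons; the function name and the sentinel trick for the right boundary are mine):

def sa_search(T, SA, P):
    # SA holds 1-based suffix starting positions, as in the slides.
    # Two binary searches find the contiguous SA range of suffixes prefixed by P.
    def first_geq(target):
        lo, hi = 0, len(SA)
        while lo < hi:
            mid = (lo + hi) // 2
            if T[SA[mid] - 1:] < target:      # O(p) chars compared per step
                lo = mid + 1
            else:
                hi = mid
        return lo
    left = first_geq(P)
    right = first_geq(P + "\U0010FFFF")       # just past every suffix starting with P
    return sorted(SA[left:right])             # starting positions of the occurrences

T = "mississippi#"
SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(sa_search(T, SA, "si"))                 # [4, 7]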

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

T = mississippi#

  SA    suffix          Lcp with next
  12    #               0
  11    i#              1
   8    ippi#           1
   5    issippi#        4
   2    ississippi#     0
   1    mississippi#    0
  10    pi#             1
   9    ppi#            0
   7    sippi#          2
   4    sissippi#       1
   6    ssippi#         3
   3    ssissippi#      –

• How long is the common prefix between T[i,...] and T[j,...] ?
  • Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  • Search for Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  • Search for Lcp[i,i+C-2] whose entries are ≥ L

Slide 95

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort

The key is to balance run-size and #runs to merge.
Sort N items with main-memory M and disk-pages B:

  Pass 1: Produce (N/M) sorted runs.
  Pass i: merge X ≤ M/B runs  ⇒  log_{M/B}(N/M) passes

[Figure: X input buffers (INPUT 1, INPUT 2, …, INPUT X) plus one OUTPUT buffer, each of B items, in main memory; runs stream in from disk and the merged output streams back to disk]

Multiway Merging

[Figure: X = M/B runs, each with its current page buffered in memory (Bf1,…,Bfx with cursors p1,…,pX); repeatedly move min(Bf1[p1], Bf2[p2], …, Bfx[pX]) to the output buffer Bfo. Fetch the next page of run i when pi = B; flush Bfo to the merged run when full; stop at EOF.]
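A small Python sketch of the multiway-merge step (illustrative only, not from the slides; it uses a heap to pick the minimum among the X buffer heads, and `runs` is assumed to be a list of already-sorted sequences):

import heapq

def multiway_merge(runs):
    # runs: list of sorted iterables (the N/M sorted runs).
    # Yields the globally sorted sequence while reading each run
    # sequentially, i.e. a single "pass" over the data.
    iters = [iter(r) for r in runs]
    heap = []
    for idx, it in enumerate(iters):
        first = next(it, None)
        if first is not None:
            heap.append((first, idx))
    heapq.heapify(heap)
    while heap:
        value, idx = heapq.heappop(heap)
        yield value
        nxt = next(iters[idx], None)
        if nxt is not None:
            heapq.heappush(heap, (nxt, idx))

# Example: three sorted runs merged in one pass.
print(list(multiway_merge([[1, 5, 13], [2, 7, 9], [4, 8, 15]])))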

Cost of Multi-way Merge-Sort

Number of passes = log_{M/B} #runs ≤ log_{M/B}(N/M)

Optimal cost = Θ((N/B) log_{M/B}(N/M)) I/Os

In practice
  M/B ≈ 1000 ⇒ #passes = log_{M/B}(N/M) ≈ 1
  One multiway merge ⇒ 2 passes = few mins

Tuning depends on disk features:
  Large fan-out (M/B) decreases #passes
  Compression would decrease the cost of a pass!

Can compression help?

Goal: enlarge M and reduce N
  #passes = O(log_{M/B}(N/M))
  Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how low we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (with a large alphabet S).
Math Problem: Find the item y whose frequency is > N/2, using the smallest space (i.e., assuming the mode occurs > N/2 times).

A=b a c c c d c b a a a c c b c c c
.

Algorithm

Use a pair of variables <X, C>
For each item s of the stream:
    if (X == s) then C++
    else { C--; if (C == 0) { X = s; C = 1; } }
Return X;

Proof
(Problems arise if the top frequency is ≤ N/2.)

If X ≠ y at the end, then every one of y’s occurrences has a “negative” mate, hence these mates are ≥ #occ(y).
As a result we would get 2 * #occ(y) ≤ N, contradicting #occ(y) > N/2...
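A minimal Python sketch of this voting scan (not from the slides; the function name is mine, and if no item exceeds N/2 the returned value is only a candidate that must be verified with a second pass):

def majority_candidate(stream):
    # Keep only a candidate X and a counter C, as in the slide.
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1
        elif X == s:
            C += 1
        else:
            C -= 1
    return X

A = list("bacccdcbaaaccbccc")
print(majority_candidate(A))  # 'c' (it occurs 9 times out of 17)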

Toy problem #4: Indexing


Consider the following TREC collection:
  N = 6 * 10^9 chars, size = 6GB
  n = 10^6 documents
  TotT = 10^9 total term occurrences (avg term length is 6 chars)
  t = 5 * 10^5 distinct terms

What kind of data structure do we build to support word-based searches?

Solution 1: Term-Doc matrix
n = 1 million documents, t = 500K terms
Entry is 1 if the play contains the word, 0 otherwise.

             Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
  Antony            1               1              0          0       0        1
  Brutus            1               1              0          1       0        0
  Caesar            1               1              0          1       1        1
  Calpurnia         0               1              0          0       0        0
  Cleopatra         1               0              0          0       0        0
  mercy             1               0              1          1       1        1
  worser            1               0              1          1       1        0

Space is 500GB !

Solution 2: Inverted index

[Figure: postings lists — each term (Brutus, Calpurnia, Caesar, …) points to the sorted list of the docIDs containing it]

We can still do better: i.e. 30÷50% of the original text.

1. Each posting typically uses about 12 bytes
2. We have 10^9 total terms ⇒ at least 12GB space
3. Compressing the 6GB of documents gets 1.5GB of data
Better index, but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2^n, but the compressed messages are fewer:

    ∑_{i=1}^{n-1} 2^i = 2^n − 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the self information of s is:

    i(s) = log_2 (1/p(s)) = − log_2 p(s)

Lower probability ⇒ higher information

Entropy is the weighted average of i(s):

    H(S) = ∑_{s∈S} p(s) · log_2 (1/p(s))   bits
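A tiny Python rendering of the formula (my own illustrative helper, not part of the slides), evaluated on the distribution used later in the Huffman example:

import math

def entropy(probs):
    # H(S) = sum_s p(s) * log2(1/p(s)), in bits per symbol.
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

print(round(entropy([0.1, 0.2, 0.2, 0.5]), 3))  # ≈ 1.761 bits/symbol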

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword lengths L[s], the average length is defined as

    L_a(C) = ∑_{s∈S} p(s) · L[s]

We say that a prefix code C is optimal if for all prefix codes C’, L_a(C) ≤ L_a(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal uniquely decodable code, there exists a prefix code with the same codeword lengths, and thus the same optimal average length. And vice versa…

Theorem (golden rule). If C is an optimal prefix code for a source with probabilities {p1, …, pn}, then p_i < p_j ⇒ L[s_i] ≥ L[s_j]

Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any uniquely decodable code C, we have

    H(S) ≤ L_a(C)

Theorem (upper bound, Shannon). For any probability distribution, there exists a prefix code C such that

    L_a(C) ≤ H(S) + 1

(the Shannon code takes ⌈log 1/p⌉ bits)

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

[Figure: Huffman tree — merge a(.1)+b(.2) into (.3), then (.3)+c(.2) into (.5), then (.5)+d(.5) into (1); edges labeled 0/1]

a=000, b=001, c=01, d=1

There are 2^(n-1) “equivalent” Huffman trees

What about ties (and thus, tree depth) ?
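A compact Python sketch of the Huffman construction (illustrative, not from the slides; function name and tie-breaking are my choices, so the bit assignment may differ from the tree above while the codeword lengths stay optimal):

import heapq

def huffman_codes(probs):
    # probs: dict symbol -> probability. Returns dict symbol -> codeword.
    # The integer counter only breaks ties deterministically in the heap.
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(sorted(probs.items()))]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, codes1 = heapq.heappop(heap)     # two least-probable trees
        p2, _, codes2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

print(huffman_codes({"a": .1, "b": .2, "c": .2, "d": .5}))
# {'d': '0', 'c': '10', 'a': '110', 'b': '111'} — same optimal lengths (1,2,3,3) as above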

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to the symbol to be encoded.
Decoding: Start at the root and take the branch for each bit received. When at a leaf, output its symbol and return to the root.

  abc…  →  00000101        101001…  →  dcb

[Figure: the Huffman tree above, traversed for both encoding and decoding]

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman: Encoding

[Figure: canonical Huffman tree, levels 1…5]

Canonical Huffman: Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. Its self information is

    −log_2(.999) ≈ .00144 bits

If we were to send 1000 such symbols we might hope to use 1000 * .00144 ≈ 1.44 bits.

Using Huffman, we take at least 1 bit per symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
  ⇒ 1 extra bit per macro-symbol = 1/k extra bits per symbol
  ⇒ Larger model to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:
  The model takes |S|^k entries (k * log |S| bits each) + h^2  (where h might be |S|)
  It is H_0(S^L) ≤ L * H_k(S) + O(k * log |S|), for each k ≤ L

Compress + Search ?                         [Moura et al, 98]

Compressed text derived from a word-based Huffman:
  Symbols of the Huffman tree are the words of T
  The Huffman tree has fan-out 128
  Codewords are byte-aligned and tagged

[Figure: each codeword byte carries 1 tagging bit + 7 huffman bits; example text T = “bzip or not bzip”, with the byte-aligned codewords of “bzip”, “or”, “not” and the space shown on the compressed string C(T)]

CGrep and other ideas...

Search P = bzip directly on the compressed text: P’s codeword is 1a 0b, and a GREP-like scan over C(T) matches it codeword by codeword.

[Figure: the tagged codewords of C(T), T = “bzip or not bzip”; the scan answers yes on the codeword of “bzip” and no on the others]

Speed ≈ Compression ratio

You find this at ... (under my Software projects)

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary: bzip, not, or, space

Search P = bzip (codeword 1a 0b) in S = “bzip or not bzip”, scanning the compressed C(S) codeword by codeword.

[Figure: the tagged codewords of C(S); the scan answers yes on the two occurrences of “bzip” and no elsewhere]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons

Strings are also numbers, H: strings → numbers.
Let s be a string of length m:

    H(s) = ∑_{i=1}^{m} 2^{m−i} · s[i]

P = 0101
H(P) = 2^3·0 + 2^2·1 + 2^1·0 + 2^0·1 = 5
s = s’ if and only if H(s) = H(s’)

Definition:
let T_r denote the m-length substring of T starting at position r (i.e., T_r = T[r, r+m−1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons

We can compute H(T_r) from H(T_{r−1}):

    H(T_r) = 2·H(T_{r−1}) − 2^m·T[r−1] + T[r+m−1]

T = 10110101
T1 = 1011, T2 = 0110

H(T1) = H(1011) = 11
H(T2) = H(0110) = 2·11 − 2^4·1 + 0 = 22 − 16 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P = 101111,  q = 7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally:
  1·2 (mod 7) + 0 = 2
  2·2 (mod 7) + 1 = 5
  5·2 (mod 7) + 1 = 4
  4·2 (mod 7) + 1 = 2
  2·2 (mod 7) + 1 = 5
  5 (mod 7) = 5 = Hq(P)

We can still compute Hq(T_r) from Hq(T_{r−1}):
  2^m (mod q) = 2·(2^{m−1} (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
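A minimal Python sketch of the fingerprint scan on a binary text (illustrative only; the function name and the fixed prime q are my choices — the algorithm prescribes picking q at random — and every fingerprint hit is verified, i.e. the deterministic variant):

def karp_rabin(T, P, q=2_147_483_647):
    # T, P: binary strings. Fingerprints are kept modulo the prime q.
    n, m = len(T), len(P)
    if m > n:
        return []
    hp = ht = 0
    for i in range(m):                      # H_q(P) and H_q(T_1)
        hp = (2 * hp + int(P[i])) % q
        ht = (2 * ht + int(T[i])) % q
    top = pow(2, m - 1, q)                  # 2^(m-1) mod q
    occ = []
    for r in range(n - m + 1):
        if hp == ht and T[r:r + m] == P:    # verify to rule out false matches
            occ.append(r)
        if r + m < n:                       # roll: drop T[r], append T[r+m]
            ht = ((ht - int(T[r]) * top) * 2 + int(T[r + m])) % q
    return occ

print(karp_rabin("10110101", "0101"))  # [4] (0-based; position 5 in the slides)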

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

[Figure: the m×n matrix M for T = california and P = for; the only 1-entries are M(1,5), M(2,6) and M(3,7), i.e. the full occurrence of “for” ending at position 7]

How does M solve the exact match problem?

How to construct M


We want to exploit bit-parallelism to compute the j-th column of M from the (j−1)-th one.

Machines can perform bit and arithmetic operations between two words in constant time.
Examples:
  And(A,B) is the bit-wise and between A and B.
  BitShift(A) is the value derived by shifting A’s bits down by one and setting the first bit to 1,
  e.g. BitShift([0,1,1,0,1]) = [1,0,1,1,0].

Let w be the word size (e.g., 32 or 64 bits). We’ll assume m ≤ w. NOTICE: any column of M fits in a memory word.

How to construct M

We want to exploit bit-parallelism to compute the j-th column of M from the (j−1)-th one.
We define the m-length binary vector U(x) for each character x of the alphabet: U(x) is set to 1 in the positions where x appears in P.

Example: P = abaac
  U(a) = [1,0,1,1,0]
  U(b) = [0,1,0,0,0]
  U(c) = [0,0,0,0,1]

How to construct M

Initialize column 0 of M to all zeros.
For j > 0, the j-th column is obtained by

    M(j) = BitShift( M(j−1) ) & U( T[j] )

For i > 1, entry M(i,j) = 1 iff
  (1) the first i−1 characters of P match the i−1 characters of T ending at position j−1   ⇔ M(i−1, j−1) = 1
  (2) P[i] = T[j]                                                                           ⇔ the i-th bit of U(T[j]) = 1

BitShift moves bit M(i−1, j−1) into the i-th position; AND-ing it with the i-th bit of U(T[j]) establishes whether both conditions hold.
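A runnable Python sketch of the Shift-And scan (my own rendering, not part of the slides; it uses Python integers as bit-vectors, with bit 0 playing the role of the first row of M):

def shift_and(T, P):
    # Bit i of M is 1 iff P[0..i] matches the text ending at the current position.
    m = len(P)
    U = {}                                  # U[c] has bit i set iff P[i] == c
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M, full = 0, 1 << (m - 1)
    occ = []
    for j, c in enumerate(T):
        M = ((M << 1) | 1) & U.get(c, 0)    # BitShift(M) & U(T[j])
        if M & full:                        # P ends at position j
            occ.append(j - m + 1)
    return occ

print(shift_and("xabxabaaca", "abaac"))  # [4] → P occurs at T[4..8] (0-based)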

An example (j = 1, 2, 3, …, 9)

T = xabxabaaca,  P = abaac

[Figure: the columns M(1), M(2), M(3), …, M(9) computed via M(j) = BitShift(M(j−1)) & U(T[j]); at j = 9 the 5-th bit of the column is 1, i.e. P occurs in T ending at position 9]

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions

We want to allow the pattern to contain special symbols, like the character class [a-f].

P = [a-b]baac
  U(a) = [1,0,1,1,0]
  U(b) = [1,1,0,0,0]
  U(c) = [0,0,0,0,1]

What about ‘?’, ‘[^…]’ (not) ?

Problem 1: Another solution
Dictionary: bzip, not, or, space

Search P = bzip (codeword 1a 0b) directly on C(S), S = “bzip or not bzip”, using the Shift-And scanner over the codewords.

[Figure: the tagged codewords of C(S); the scan answers yes on the two occurrences of “bzip” and no elsewhere]

Speed ≈ Compression ratio

Problem 2
Dictionary: bzip, not, or, space

Given a pattern P, find all the occurrences in S of all the terms containing P as a substring.

P = o  →  the matching terms are  not = 1g 0g 0a  and  or = 1g 0a 0b

[Figure: the tagged codewords of C(S), S = “bzip or not bzip”; the occurrences of the codewords of “or” and “not” are reported (yes)]

Speed ≈ Compression ratio? No! Why?
A scan of C(S) is needed for each term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1, P2, …, Pl} of total length m, we want to find all the occurrences of those patterns in the text T[1,n].

[Figure: text T with occurrences of P1 and P2 highlighted]

 Naïve solution
   Use an (optimal) exact-matching algorithm, searching for each pattern of P separately
   Complexity: O(nl + m) time — not good with many patterns

 Optimal solution due to Aho and Corasick
   Complexity: O(n + l + m) time

A simple extension of Shift-And

S is the concatenation of the patterns in P.
R is a bitmask of length m:
  R[i] = 1 iff S[i] is the first symbol of a pattern.

Use a variant of the Shift-And method searching for S:
  For any symbol c, U’(c) = U(c) AND R
    U’(c)[i] = 1 iff S[i] = c and it is the first symbol of a pattern
  For any step j,
    compute M(j)
    then M(j) OR U’(T[j]). Why?
       It sets to 1 the first bit of each pattern that starts with T[j]
    Check if there are occurrences ending in j. How?

Problem 3
Dictionary: bzip, not, or, space

Given a pattern P, find all the occurrences in S of all the terms containing P as a substring, allowing at most k mismatches.

P = bot,  k = 2

[Figure: the tagged codewords of C(S), S = “bzip or not bzip”]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep

Our current goal: given k, find all the occurrences of P in T with up to k mismatches.
We define the matrix M^l to be an m × n binary matrix such that:

  M^l(i,j) = 1 iff there are no more than l mismatches between the first i characters of P and the i characters of T ending at position j.

What is M^0?
How does M^k solve the k-mismatch problem?

Computing M^k

We compute M^l for all l = 0, …, k.
For each j compute M^0(j), M^1(j), …, M^k(j).
For all l, initialize M^l(0) to the zero vector.
In order to compute M^l(j), we observe that entry (i,j) is 1 iff one of two cases holds.

Computing M^l: case 1
The first i−1 characters of P match a substring of T ending at j−1 with at most l mismatches, and the next pair of characters in P and T are equal:

    BitShift( M^l(j−1) ) & U( T[j] )

Computing M^l: case 2
The first i−1 characters of P match a substring of T ending at j−1 with at most l−1 mismatches (the j-th character is charged as one more mismatch):

    BitShift( M^{l−1}(j−1) )

Computing M^l
Putting the two cases together:

    M^l(j) = [ BitShift( M^l(j−1) ) & U(T[j]) ]  OR  BitShift( M^{l−1}(j−1) )
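A Python sketch of this recurrence (illustrative, not from the slides; function name is mine, integers are used as bit-vectors as in the exact Shift-And sketch above):

def agrep_k_mismatches(T, P, k):
    # M[l] is the Shift-And column allowing up to l mismatches; recurrence:
    #   M[l](j) = (BitShift(M[l](j-1)) & U(T[j])) | BitShift(M[l-1](j-1))
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    full = 1 << (m - 1)
    M = [0] * (k + 1)
    occ = []
    for j, c in enumerate(T):
        prev = 0                            # BitShift(M[l-1](j-1)); none for l = 0
        for l in range(k + 1):
            shifted = (M[l] << 1) | 1
            M[l] = (shifted & U.get(c, 0)) | prev
            prev = shifted                  # becomes BitShift(M[l](j-1)) for l+1
        if M[k] & full:
            occ.append(j - m + 1)
    return occ

print(agrep_k_mismatches("aatatccacaa", "atcgaa", 2))
# [3] → the slide's occurrence with 2 mismatches at position 4 (1-based)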

Example: M^0 and M^1

T = xabxabaaca,  P = abaad

[Figure: the 5×10 matrices M^0 and M^1; in particular M^1(5,9) = 1, i.e. “abaad” matches T[5..9] = “abaac” with one mismatch]
How much do we pay?

The running time is O(k·n·(1 + m/w)).
Again, the method is practically efficient for small m.
Only O(k) columns of M are needed at any given time, hence the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary: bzip, not, or, space

Given a pattern P, find all the occurrences in S of all the terms containing P as a substring, allowing k mismatches.

P = bot, k = 2  →  matching term: not = 1g 0g 0a

[Figure: the tagged codewords of C(S), S = “bzip or not bzip”; the occurrence of “not” is reported (yes)]

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol of p into a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding

  γ(x) = (Length−1) zeroes, followed by x in binary
  x > 0 and Length = ⌊log2 x⌋ + 1
  e.g., 9 is represented as <000, 1001>.

  The γ-code for x takes 2⌊log2 x⌋ + 1 bits
  (i.e. a factor of 2 from the optimal)

  Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers

It is a prefix-free encoding…

Given the following sequence of γ-coded integers, reconstruct the original sequence:

  0001000001100110000011101100111

  →  8, 6, 3, 59, 7
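A small Python sketch of γ-encoding/decoding (illustrative, not from the slides; function names are mine), which reproduces the exercise above:

def gamma_encode(x):
    # (Length-1) zeroes followed by the binary representation of x; x > 0.
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":               # unary length prefix
            zeros += 1
            i += 1
        out.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return out

code = "".join(gamma_encode(x) for x in [8, 6, 3, 59, 7])
print(code)                 # 0001000001100110000011101100111
print(gamma_decode(code))   # [8, 6, 3, 59, 7]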

Analysis
Sort the p_i in decreasing order, and encode s_i via the variable-length code γ(i).

Recall that: |γ(i)| ≤ 2·log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2·H_0(s) + 1

Key fact:  1 ≥ ∑_{i=1,…,x} p_i ≥ x·p_x  ⇒  x ≤ 1/p_x

How good is it ?
The cost of the encoding is (recall i ≤ 1/p_i):

    ∑_{i=1,…,|S|} p_i · |γ(i)|  ≤  ∑_{i=1,…,|S|} p_i · [2·log(1/p_i) + 1]  =  2·H_0(X) + 1

Not much worse than Huffman, and improvable to H_0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 128^2 = 16512 words on at most 2 bytes
A (230,26)-dense code encodes 230 + 230·26 = 6210 words on at most 2 bytes, hence more words on 1 byte — and thus it wins if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move-to-Front Coding
Transforms a char sequence into an integer sequence, that can then be var-length coded.

  Start with the list of symbols L = [a,b,c,d,…]
  For each input symbol s
    1) output the position of s in L
    2) move s to the front of L

There is a memory
Properties:
  Exploits temporal locality, and it is dynamic
  X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n^2 log n),  MTF = O(n log n) + n^2

Not much worse than Huffman ... but it may be far better.
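A minimal Python sketch of the MTF transform (my own illustrative helper, not part of the slides):

def mtf_encode(text, alphabet):
    # Output the (0-based) position of each symbol in L, then move it to front.
    L = list(alphabet)
    out = []
    for s in text:
        i = L.index(s)
        out.append(i)
        L.pop(i)
        L.insert(0, s)
    return out

print(mtf_encode("abbbaacccca", "abcd"))  # [0, 1, 0, 0, 1, 0, 2, 0, 0, 0, 1]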

MTF: how good is it ?
Encode the integers via γ-coding:  |γ(i)| ≤ 2·log i + 1
Put the alphabet S in front and consider the cost of encoding:

    O(|S| log |S|)  +  ∑_{x∈S} ∑_{i=2}^{n_x} |γ( p_i^x − p_{i−1}^x )|

where p_1^x < p_2^x < … are the positions of the n_x occurrences of symbol x. By Jensen’s inequality:

    ≤ O(|S| log |S|) + ∑_{x∈S} n_x · [ 2·log(N/n_x) + 1 ]
    = O(|S| log |S|) + N · [ 2·H_0(X) + 1 ]

Hence  L_a[mtf] ≤ 2·H_0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
  abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings ⇒ just the run lengths and one starting bit

Properties:
  There is a memory: it exploits spatial locality, and it is a dynamic code
  X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = Θ(n^2 log n)  >  Rle(X) = Θ(n log n)
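A one-liner Python sketch of RLE (illustrative, not from the slides):

from itertools import groupby

def rle_encode(s):
    # (symbol, run length) pairs; for binary strings one could drop the
    # symbols and keep just the first bit plus the run lengths.
    return [(c, len(list(g))) for c, g in groupby(s)]

print(rle_encode("abbbaacccca"))  # [('a',1), ('b',3), ('a',2), ('c',4), ('a',1)]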

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol an interval range in [0, 1) (0 inclusive, 1 exclusive), based on the cumulative probability

    f(i) = ∑_{j<i} p(j)

e.g.  p(a) = .2, p(b) = .5, p(c) = .3  ⇒  f(a) = .0, f(b) = .2, f(c) = .7

[Figure: the unit interval partitioned into a = [0,.2), b = [.2,.7), c = [.7,1)]

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7)).

Arithmetic Coding: Encoding Example
Coding the message sequence: bac

[Figure: start from [0,1); after b the interval is [.2,.7); after a it is [.2,.3); after c it is [.27,.3)]

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c_1 c_2 … c_n with probabilities p[c], use:

    l_0 = 0        l_i = l_{i−1} + s_{i−1} · f[c_i]
    s_0 = 1        s_i = s_{i−1} · p[c_i]

f[c] is the cumulative probability up to symbol c (not included).
The final interval size is

    s_n = ∏_{i=1}^{n} p[c_i]

The interval for a message sequence will be called the sequence interval.
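A small Python sketch that computes the sequence interval with exactly these recurrences (illustrative; the function name and the alphabetical symbol order are my choices — any fixed order shared with the decoder works):

def sequence_interval(msg, p):
    # f[c] = cumulative probability of the symbols preceding c.
    f, acc = {}, 0.0
    for c in sorted(p):
        f[c] = acc
        acc += p[c]
    l, s = 0.0, 1.0
    for c in msg:
        l = l + s * f[c]
        s = s * p[c]
    return l, l + s

print(sequence_interval("bac", {"a": .2, "b": .5, "c": .3}))  # ≈ (0.27, 0.30)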

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:

[Figure: .49 falls in b = [.2,.7); within it, in b = [.3,.55); within that, in c = [.475,.55)]

The message is bbc.

Representing a real number
Binary fractional representation:
  .75 = .11      1/3 = .0101…      11/16 = .1011

Algorithm
  1. x = 2*x
  2. if x < 1 output 0
  3. else x = x − 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
e.g.  [0,.33) → .01     [.33,.66) → .1     [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions.

  code    min      max      interval
  .11     .110     .111     [.75, 1.0)
  .101    .1010    .1011    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval is contained in the sequence interval (a dyadic number).

[Figure: the sequence interval [.61, .79) contains the code interval of .101, i.e. [.625, .75)]

Can use  L + s/2  truncated to  1 + ⌈log (1/s)⌉  bits

Bound on Arithmetic length

Note that −⌈log s⌉ + 1 = ⌈log (2/s)⌉

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

    1 + ⌈log (1/s)⌉ = 1 + ⌈log ∏_i (1/p_i)⌉
                    ≤ 2 + ∑_{i=1,…,n} log (1/p_i)
                    = 2 + ∑_{k=1,…,|S|} n·p_k·log (1/p_k)
                    = 2 + n·H_0   bits

In practice it is nH_0 + 0.02·n bits, because of rounding.

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
  Output 1 followed by m 0s; m = 0; the message interval is expanded by 2

If u < R/2 then (bottom half)
  Output 0 followed by m 1s; m = 0; the message interval is expanded by 2

If l ≥ R/4 and u < 3R/4 then (middle half)
  Increment m; the message interval is expanded by 2

In all other cases, just continue...

You find this at ...

Arithmetic ToolBox
As a state machine

[Figure: the ATB maps the current interval (L,s) and the next symbol c, with distribution (p1,…,p|S|), to the new interval (L’,s’)]

Therefore, even the distribution can change over time.

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox

[Figure: the ATB is driven by p[s | context], where s is either a real character c or the escape symbol esc; it maps (L,s) to (L’,s’)]

Encoder and Decoder must know the protocol for selecting the same conditional probability distribution (PPM-variant).

PPM: Example Contexts    (String = ACCBACCACBA B,  k = 2)

  Empty context:   A = 4   B = 2   C = 5   $ = 3

  Order-1 contexts:
    A:  C = 3   $ = 1
    B:  A = 2   $ = 1
    C:  A = 1   B = 2   C = 2   $ = 3

  Order-2 contexts:
    AC: B = 1   C = 2   $ = 2
    BA: C = 1   $ = 1
    CA: C = 1   $ = 1
    CB: A = 2   $ = 1
    CC: A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:
  Output <d, len, c> where
    d = distance of the copied string wrt the current position
    len = length of the longest match
    c = next char in the text beyond the longest match
  Advance by len + 1

A buffer “window” has fixed length and moves with the cursor.

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
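A short Python sketch of the LZ77 decoder (illustrative, not from the slides; the copy is done byte by byte so that overlapping copies with len > d work as explained above):

def lz77_decode(triples):
    # Each triple is (d, len, c): copy `len` chars starting d positions back
    # from the cursor, then append c.
    out = []
    for d, length, c in triples:
        start = len(out) - d
        for i in range(length):
            out.append(out[start + i])
        out.append(c)
    return "".join(out)

# The windowed example from the slides:
print(lz77_decode([(0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), (3, 3, 'a'), (1, 2, 'c')]))
# aacaacabcabaaac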

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later
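A compact Python sketch of the LZW encoder (illustrative, not from the slides; to keep it self-contained the dictionary is seeded with the distinct characters of the input rather than the full 256 ASCII entries, so the emitted codes differ from the slide’s, while the dictionary growth — aa, ab, ba, aac, ca, aba, abac, cb — is the same):

def lzw_encode(text):
    # Emit the code of the longest match S in the dictionary, then add S + next char.
    dictionary = {c: i for i, c in enumerate(sorted(set(text)))}
    out, S = [], ""
    for c in text:
        if S + c in dictionary:
            S += c
        else:
            out.append(dictionary[S])
            dictionary[S + c] = len(dictionary)
            S = c
    out.append(dictionary[S])
    return out

print(lzw_encode("aabaacababacb"))
# [0, 0, 1, 3, 2, 4, 8, 2, 1]  i.e.  a, a, b, aa, c, ab, aba, c, b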

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
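A runnable Python sketch of the transform and of this backward inversion (illustrative, not from the slides; it assumes, as in the example, that ‘#’ is a unique end marker smaller than every other character):

def bwt(T):
    # Sort all rotations of T and take the last column.
    rotations = sorted(T[i:] + T[:i] for i in range(len(T)))
    return "".join(row[-1] for row in rotations)

def inverse_bwt(L):
    # LF-mapping: the k-th occurrence of char c in L corresponds to the
    # k-th occurrence of c in F = sorted(L). Rebuild T backwards.
    F = sorted(L)
    first_pos, count, LF = {}, {}, []
    for i, c in enumerate(F):
        first_pos.setdefault(c, i)
    for c in L:
        LF.append(first_pos[c] + count.get(c, 0))
        count[c] = count.get(c, 0) + 1
    r, out = L.index("#"), []          # row of the rotation equal to T
    for _ in range(len(L)):
        out.append(L[r])
        r = LF[r]
    return "".join(reversed(out))

s = "mississippi#"
print(bwt(s))                     # ipssm#pissii
print(inverse_bwt(bwt(s)) == s)   # True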

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n^2 log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion pages are available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution

[Plots: in-degree distribution of the Altavista crawl (1999) and of the WebBase crawl (2001); the in-degree follows a power-law distribution]

    Pr[in-degree(u) = k]  ∝  1/k^α,   α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 million pages, 150 million links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides and efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
             Emacs size   Emacs time
  uncompr    27MB         ---
  gzip       8MB          35 secs
  zdelta     1.5MB        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: weighted graph over the files plus a dummy node; edge weights are zdelta sizes (e.g. 20, 123, 220, 620, 2000), dummy edges carry the gzip sizes; the min branching selects the cheapest reference for each file]

             space   time
  uncompr    30MB    ---
  tgz        20%     linear
  THIS       8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F, thus saving zdelta executions. Nonetheless, strictly n^2 time.

             space    time
  uncompr    260MB    ---
  tgz        12%      2 mins
  THIS       8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

            gcc       emacs
  total     27288     27326
  gzip      7563      8577
  zdelta    227       1431
  rsync     964       4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree

T# = mississippi#

[Figure: the suffix tree of T#; edges are labeled with substrings (#, i, ssi, si, ppi#, pi#, mississippi#, …) and each leaf stores the starting position of its suffix in T#]
The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

T = mississippi#
Storing SUF(T) explicitly takes Θ(N^2) space; the suffix array keeps only the suffix pointers:

  SA     = 12 11 8 5 2 1 10 9 7 4 6 3
  SUF(T) = #, i#, ippi#, issippi#, ississippi#, mississippi#, pi#, ppi#, sippi#, sissippi#, ssippi#, ssissippi#

Suffix Array space:
  • SA: Θ(N log2 N) bits
  • Text T: N chars
  ⇒ In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step.

[Figure: binary search for P = si over the SA of T = mississippi#; at every probe P is compared with the probed suffix and found larger or smaller, halving the range]

Suffix Array search
  • O(log2 N) binary-search steps
  • Each step takes O(p) char comparisons
  ⇒ overall, O(p log2 N) time
     [improved to O(p + log2 N) by Manber-Myers ’90, and to alphabet-aware bounds by Cole et al. ’06]

Locating the occurrences

T = mississippi#, P = si: two binary searches delimit the range of suffixes prefixed by P (conceptually, search for si# and si$, where # < S < $); here the range contains sippi# and sissippi#, so occ = 2, at positions 7 and 4.

Suffix Array search
  • O(p + log2 N + occ) time

Suffix Trays: O(p + log2 |S| + occ)     [Cole et al., ‘06]
String B-tree                            [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays             [Ciriani et al., ’02]
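A runnable Python sketch of suffix-array construction and of the indirect binary search (illustrative only; the naive construction sorts suffixes explicitly, which is the “elegant but inefficient” approach, and function names are mine):

def suffix_array(T):
    # Naive construction by sorting the suffixes explicitly.
    return sorted(range(len(T)), key=lambda i: T[i:])

def search_range(T, SA, P):
    # Indirect binary search: each probe compares P against a suffix of T.
    lo, hi = 0, len(SA)
    while lo < hi:                                   # first suffix with prefix >= P
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + len(P)] < P:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    lo, hi = first, len(SA)
    while lo < hi:                                   # first suffix with prefix > P
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + len(P)] <= P:
            lo = mid + 1
        else:
            hi = mid
    return first, lo                                 # occurrences are SA[first:lo]

T = "mississippi#"
SA = suffix_array(T)
print([i + 1 for i in SA])                           # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
first, last = search_range(T, SA, "si")
print(sorted(SA[i] + 1 for i in range(first, last)))  # [4, 7]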

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA.

T = mississippi#
SA  = 12 11 8 5 2 1 10 9 7 4 6 3
Lcp =  0  0 1 4 0 0  1 0 2 1 3

(e.g. the Lcp between issippi# and ississippi# is 4)

• How long is the common prefix between T[i,...] and T[j,...] ?
  • Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  • Search for Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  • Search for a window Lcp[i,i+C-2] whose entries are all ≥ L


Slide 96

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!
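
A toy in-memory sketch of one merge pass (Python): a heap plays the role of the min over the X input buffers Bf1..BfX; real disk-page handling is omitted, so this only illustrates the merging logic.

import heapq

def multiway_merge(runs):
    # runs: X already-sorted lists; each element is read once and written once
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        val, i, j = heapq.heappop(heap)      # min among the current run heads
        out.append(val)                      # "flush" to the output run
        if j + 1 < len(runs[i]):
            heapq.heappush(heap, (runs[i][j + 1], i, j + 1))   # "fetch" next item of run i
    return out

print(multiway_merge([[1, 5, 13, 19], [7, 9], [3, 4, 8, 15], [6, 11, 12, 17]]))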

Does compression help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;
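
A runnable sketch of the two-variable scan (Python). Note the returned item is only a candidate: if no item occurs more than N/2 times, a second pass is needed to verify it.

def majority_candidate(stream):
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1
        elif X == s:
            C += 1
        else:
            C -= 1
    return X

A = "bacccdcbaaaccbccc"            # the stream of the slide
print(majority_candidate(A))       # 'c', which indeed occurs 9 times out of 17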

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 10^9  size = 6Gb
n = 10^6 documents
TotT = 10^9 (avg term length is 6 chars)
t = 5 * 10^5 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million documents, t = 500K terms

             Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony              1               1             0           0       0        1
Brutus              1               1             0           1       0        0
Caesar              1               1             0           1       1        1
Calpurnia           0               1             0           0       0        0
Cleopatra           1               0             0           0       0        0
mercy               1               0             1           1       1        1
worser              1               0             1           1       1        0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus     →  2  4  8  16  32  64  128
Calpurnia  →  2  13  16
Caesar     →  1  2  3  5  8  13  21  34

We can still do better: i.e. 30–50% of the original text

1. Typically use about 12 bytes
2. We have 10^9 total terms ⇒ at least 12Gb space
3. Compressing the 6Gb of documents gets 1.5Gb of data
Better index but yet it is >10 times the text !!!!
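
A minimal sketch of how such posting lists are built (Python); the document ids and the toy collection are made up for illustration.

from collections import defaultdict

def build_inverted_index(docs):
    # docs: dict doc_id -> text; returns term -> sorted list of doc ids (the posting list)
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "brutus killed caesar", 2: "caesar and calpurnia", 4: "brutus again"}
print(build_inverted_index(docs)["brutus"])    # [1, 4]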

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

Σ_{i=1}^{n-1} 2^i  =  2^n − 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i(s) = log2 (1/p(s)) = −log2 p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H(S) = Σ_{s∈S} p(s) · log2 (1/p(s))    bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La(C) = Σ_{s∈S} p(s) · L[s]

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H(S) ≤ La(C)
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La(C) ≤ H(S) + 1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1) + b(.2) → (.3);    (.3) + c(.2) → (.5);    (.5) + d(.5) → (1)
a=000, b=001, c=01, d=1
There are 2^(n-1) “equivalent” Huffman trees

What about ties (and thus, tree depth) ?
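
A small sketch of the construction (Python), repeatedly merging the two least-probable trees with a heap; it reproduces the codeword lengths of the running example (the actual 0/1 labels may differ, as noted above).

import heapq

def huffman_code(probs):
    # probs: dict symbol -> probability; returns symbol -> codeword (bit string)
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)      # the two smallest probabilities
        p1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c0.items()}
        merged.update({s: "1" + w for s, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

code = huffman_code({"a": 0.1, "b": 0.2, "c": 0.2, "d": 0.5})
print(code)        # lengths: |a| = |b| = 3, |c| = 2, |d| = 1, as in the example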

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
abc...  →  000 001 01  =  00000101
101001...  →  d c b

[figure: the Huffman tree of the running example]

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...
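
A hedged sketch of the standard canonical decoding loop (Python): extend the current code value until it reaches firstcode of its level. The tables used in the usage line are a toy example of mine, not the ones of the slide.

def canonical_decode(bits, firstcode, symbol):
    # bits: list of 0/1; firstcode[l]: value of the numerically smallest codeword of length l;
    # symbol[l]: the symbols with codeword length l, in codeword order
    out, i = [], 0
    while i < len(bits):
        v, l = bits[i], 1
        i += 1
        while v < firstcode[l]:        # not yet a valid codeword of length l
            v = 2 * v + bits[i]
            i += 1
            l += 1
        out.append(symbol[l][v - firstcode[l]])
    return out

# toy code: a=1, b=01, c=000, d=001  (the deepest level starts at value 0, as above)
firstcode = {1: 1, 2: 1, 3: 0}
symbol = {1: ["a"], 2: ["b"], 3: ["c", "d"]}
print(canonical_decode([1, 0, 1, 0, 0, 0], firstcode, symbol))   # ['a', 'b', 'c']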

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
[figure: word-based Huffman tree with fan-out 128 over the words of T = “bzip or not bzip”; each codeword is a sequence of bytes, and the tag bit marks the first byte of a codeword]

CGrep and other ideas...
P= bzip = 1a 0b

[figure: GREP run directly on the compressed text C(T), matching the byte-aligned codeword of P against the codewords of T; the tag bits mark codeword boundaries, so matches (yes/no) are reported without decompression]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H(s) = Σ_{i=1}^{m} 2^{m−i} · s[i]

P = 0101
H(P) = 2^3·0 + 2^2·1 + 2^1·0 + 2^0·1 = 5
s = s’ if and only if H(s) = H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H(T_r) = 2·H(T_{r−1}) − 2^m·T[r−1] + T[r+m−1]

T = 10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H(T1) = H(1011) = 11
H(T2) = H(0110) = 2·11 − 2^4·1 + 0 = 22 − 16 + 0 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
(1·2 mod 7) + 0 = 2
(2·2 mod 7) + 1 = 5
(5·2 mod 7) + 1 = 4
(4·2 mod 7) + 1 = 2
(2·2 mod 7) + 1 = 5
5 (mod 7) = 5 = Hq(P)

We can still compute Hq(T_r) from Hq(T_{r−1}):
2^m (mod q) = 2·( 2^{m−1} (mod q) ) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
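
A runnable sketch of the fingerprint scan (Python). It verifies every fingerprint hit, so it never errs; the prime is picked from a small hard-coded list rather than drawn at random up to I as in the analysis, and characters are hashed by their code, so this only illustrates the rolling update.

import random

def karp_rabin(T, P):
    n, m = len(T), len(P)
    q = random.choice([10**9 + 7, 10**9 + 9, 998244353])   # some large primes (assumption)
    base = 2
    top = pow(base, m - 1, q)              # base^(m-1) mod q, used to drop the leftmost char
    hp = ht = 0
    for i in range(m):
        hp = (hp * base + ord(P[i])) % q
        ht = (ht * base + ord(T[i])) % q
    occ = []
    for r in range(n - m + 1):
        if hp == ht and T[r:r + m] == P:   # fingerprint hit, then explicit check
            occ.append(r)
        if r + m < n:                      # roll the window: remove T[r], append T[r+m]
            ht = ((ht - ord(T[r]) * top) * base + ord(T[r + m])) % q
    return occ

print(karp_rabin("10110101", "0101"))      # [4]  (0-based; position 5 of the slide)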

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit bit-parallelism to compute the j-th column of M from the (j−1)-th one.
We define the m-length binary vector U(x) for each character x in the alphabet: U(x) is set to 1 in the positions where x appears in P.

Example:
P = abaac
U(a) = (1,0,1,1,0)^T    U(b) = (0,1,0,0,0)^T    U(c) = (0,0,0,0,1)^T

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M(j) = BitShift( M(j−1) ) & U(T[j])


For i > 1, entry M(i,j) = 1 iff
(1) the first i−1 characters of P match the i−1 characters of T ending at character j−1   ⇔ M(i−1, j−1) = 1
(2) P[i] = T[j]   ⇔ the i-th bit of U(T[j]) = 1

BitShift moves bit M(i−1, j−1) into the i-th position; the AND with the i-th bit of U(T[j]) establishes whether both conditions hold.
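
A bit-parallel sketch of the whole method (Python). Columns are kept as integers, with bit i−1 standing for row i of M, so the slide’s downward shift becomes a left shift here; names are mine.

def shift_and(T, P):
    m = len(P)
    U = {}
    for i, c in enumerate(P):              # U[c]: bits set at the positions of c in P
        U[c] = U.get(c, 0) | (1 << i)
    M, occ = 0, []
    for j, c in enumerate(T):
        M = ((M << 1) | 1) & U.get(c, 0)   # BitShift(M(j-1)) & U(T[j])
        if M & (1 << (m - 1)):             # last row set: an occurrence ends at j
            occ.append(j - m + 1)
    return occ

print(shift_and("xabxabaaca", "abaac"))    # [4]  (0-based start; the match ends at position 9)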

An example (T = xabxabaaca, P = abaac)

j=1:  M(1) = BitShift(M(0)) & U(x) = (1,0,0,0,0) & (0,0,0,0,0) = (0,0,0,0,0)
j=2:  M(2) = BitShift(M(1)) & U(a) = (1,0,0,0,0) & (1,0,1,1,0) = (1,0,0,0,0)
j=3:  M(3) = BitShift(M(2)) & U(b) = (1,1,0,0,0) & (0,1,0,0,0) = (0,1,0,0,0)
...
j=9:  M(9) = BitShift(M(8)) & U(c) = (1,1,0,0,1) & (0,0,0,0,1) = (0,0,0,0,1)
      the 5-th bit of M(9) is 1  ⇒  an occurrence of P ends at position 9 of T

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
the first i characters of P match the i characters of T ending at character j, with no more than l mismatches.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

BitShift( M^l(j−1) ) & U(T[j])

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

BitShift( M^{l−1}(j−1) )

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1
1 2 3 4 5 6 7 8 910

M1=

T=xabxabaaca
P=
abaad

1

2

3

4

5

6

7

8

9

1
0

1

1

1

1

1

1

1

1

1

1

1

2

0

0

1

0

0

1

0

1

1

0

3

0

0

0

1

0

0

1

0

0

1

4

0

0

0

0

1

0

0

1

0

0

5

0

0

0

0

0

0

0

0

1

0

M0=

1

2

3

4

5

6

7

8

9

1
0

1

0

1

0

0

1

0

1

1

0

1

2

0

0

1

0

0

1

0

0

0

0

3

0

0

0

0

0

0

1

0

0

0

4

0

0

0

0

0

0

0

1

0

0

5

0

0

0

0

0

0

0

0

0

0

How much do we pay?





The running time is O(kn(1+m/w))
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
g(x) = 0 0 ... 0 (Length−1 zeros) followed by x in binary

x > 0 and Length = ⌊log2 x⌋ + 1
e.g., 9 is represented as <000,1001>.

g-code for x takes 2⌊log2 x⌋ + 1 bits

(i.e. factor of 2 from optimal)

Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

Answer: 8, 6, 3, 59, 7
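
A small encoder/decoder sketch for the g-code (Python), usable to check the exercise above.

def gamma_encode(x):
    # x > 0: (Length-1) zeros, then x in binary
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        l = 0
        while bits[i] == "0":              # count the zeros = Length-1
            l += 1
            i += 1
        out.append(int(bits[i:i + l + 1], 2))
        i += l + 1
    return out

print(gamma_encode(9))                                     # 0001001
print(gamma_decode("0001000001100110000011101100111"))     # [8, 6, 3, 59, 7]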

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2 · H0(s) + 1
Key fact:
1 ≥ Σ_{i=1,…,x} p_i ≥ x · p_x   ⇒   x ≤ 1/p_x

How good is it ?
Encode the integers via g-coding:
|g(i)| ≤ 2 · log i + 1
The cost of the encoding is (recall i ≤ 1/p_i):

Σ_{i=1,…,|S|} p_i · |g(i)|   ≤   Σ_{i=1,…,|S|} p_i · [ 2 · log(1/p_i) + 1 ]   =   2 · H0(X) + 1

Not much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n^2 log n), MTF = O(n log n) + n^2

Not much worse than Huffman
...but it may be far better
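
A direct sketch of the transform (Python), with the symbol list kept as a plain Python list (so each step costs O(|S|); the efficient search-tree / hash-table variants are discussed in a later slide).

def mtf_encode(text, alphabet):
    L, out = list(alphabet), []
    for s in text:
        i = L.index(s)             # position of s in the current list (0-based here)
        out.append(i)
        L.insert(0, L.pop(i))      # move s to the front
    return out

def mtf_decode(codes, alphabet):
    L, out = list(alphabet), []
    for i in codes:
        s = L[i]
        out.append(s)
        L.insert(0, L.pop(i))
    return "".join(out)

abc = "abcdefghijklmnopqrstuvwxyz"
codes = mtf_encode("mississippi", abc)
print(codes)                              # repeated/recent symbols get small integers
print(mtf_decode(codes, abc))             # mississippi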

MTF: how good is it ?
Encode the integers via g-coding:
|g(i)| ≤ 2 · log i + 1
Put S in front and consider the cost of encoding:

O(|S| log |S|)  +  Σ_{x=1}^{|S|}  Σ_{i=2}^{n_x}  |g( p_x^i − p_x^{i−1} )|

By Jensen’s:

≤  O(|S| log |S|)  +  Σ_{x=1}^{|S|}  n_x · [ 2 · log(N/n_x) + 1 ]
=  O(|S| log |S|)  +  N · [ 2 · H0(X) + 1 ]

La[mtf]  ≤  2 · H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1^n 2^n 3^n … n^n  ⇒

There is a memory

Huff(X) = Θ(n^2 log n)  >  Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.  a = .2 → [0.0, 0.2),   b = .5 → [0.2, 0.7),   c = .3 → [0.7, 1.0)

f(i) = Σ_{j=1}^{i−1} p(j)

f(a) = .0,  f(b) = .2,  f(c) = .7

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
b → [0.2, 0.7);   then a → [0.2, 0.3);   then c → [0.27, 0.3)

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l_0 = 0     l_i = l_{i−1} + s_{i−1} · f[c_i]
s_0 = 1     s_i = s_{i−1} · p[c_i]

f[c] is the cumulative prob. up to symbol c (not included)

Final interval size is   s_n = ∏_{i=1}^{n} p[c_i]

The interval for a message sequence will be called the
sequence interval
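
A tiny sketch of the interval computation (Python), using the model of the running example; it reproduces the sequence interval [.27, .3) of "bac". Floating point is used only for illustration: a real coder works with integers, as discussed later.

def sequence_interval(msg, p, f):
    l, s = 0.0, 1.0
    for c in msg:
        l = l + s * f[c]           # l_i = l_{i-1} + s_{i-1} * f[c_i]
        s = s * p[c]               # s_i = s_{i-1} * p[c_i]
    return l, s

p = {"a": 0.2, "b": 0.5, "c": 0.3}
f = {"a": 0.0, "b": 0.2, "c": 0.7}
l, s = sequence_interval("bac", p, f)
print(l, l + s)                    # ~0.27 0.3  ->  the interval [.27, .3)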

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
.49 ∈ [0.2, 0.7)  ⇒ b;    .49 ∈ [0.3, 0.55)  ⇒ b;    .49 ∈ [0.475, 0.55)  ⇒ c

The message is bbc.

Representing a real number
Binary fractional representation:
.75 = .11      1/3 = .0101…      11/16 = .1011

Algorithm
1. x = 2 * x
2. If x < 1 output 0
3. else x = x − 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
number   min      max      interval
.11      .110…    .111…    [.75, 1.0)
.101     .1010…   .1011…   [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts      (String = ACCBACCACBA B,   k = 2)

Context Empty:   A = 4,  B = 2,  C = 5,  $ = 3
Context A:       C = 3,  $ = 1
Context B:       A = 2,  $ = 1
Context C:       A = 1,  B = 2,  C = 2,  $ = 3
Context AC:      B = 1,  C = 2,  $ = 2
Context BA:      C = 1,  $ = 1
Context CA:      C = 1,  $ = 1
Context CB:      A = 2,  $ = 1
Context CC:      A = 1,  B = 1,  $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character
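
A naive sketch of the windowed encoder (Python): for every position it tries all starting points in the window, allows the copy to overlap the lookahead, and emits (distance, length, next char). On the text above it reproduces the five triples of the example; real implementations (e.g. gzip) use hashing instead of this quadratic scan.

def lz77_encode(text, window=6):
    i, out = 0, []
    while i < len(text):
        best_d, best_len = 0, 0
        for j in range(max(0, i - window), i):          # candidate copy sources
            l = 0
            while i + l < len(text) - 1 and text[j + l] == text[i + l]:
                l += 1                                  # overlap with the part to come is fine
            if l > best_len:
                best_d, best_len = i - j, l
        out.append((best_d, best_len, text[i + best_len]))
        i += best_len + 1
    return out

print(lz77_encode("aacaacabcabaaac"))
# [(0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), (3, 3, 'a'), (1, 2, 'c')]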

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb
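
A compact sketch of the coder (Python), storing the dictionary as a plain map from string to id (a trie would be used in practice); it reproduces the pairs of the coding example above.

def lz78_encode(text):
    dictionary = {"": 0}
    out, cur = [], ""
    for c in text:
        if cur + c in dictionary:
            cur += c                                  # extend the current match
        else:
            out.append((dictionary[cur], c))          # (id of longest match S, next char c)
            dictionary[cur + c] = len(dictionary)     # add Sc to the dictionary
            cur = ""
    if cur:
        out.append((dictionary[cur], ""))             # pending match at end of input
    return out

print(lz78_encode("aabaacabcabcb"))
# [(0, 'a'), (1, 'b'), (1, 'a'), (0, 'c'), (2, 'c'), (5, 'b')]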

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
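
A small end-to-end sketch (Python): the forward transform is the naive sort of all rotations (quadratic, as the next slide points out), and the inversion follows the two properties above via the LF-array.

def bwt(text):
    # text must end with a unique smallest sentinel, e.g. '#'
    rot = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(r[-1] for r in rot)

def ibwt(L):
    n = len(L)
    order = sorted(range(n), key=lambda i: (L[i], i))   # positions of L-chars in F (stable)
    LF = [0] * n
    for f_pos, l_pos in enumerate(order):
        LF[l_pos] = f_pos
    out, r = [], 0                  # row 0 is the rotation starting with the sentinel
    for _ in range(n):
        out.append(L[r])            # L[r] precedes F[r] in T: collect T backward
        r = LF[r]
    s = "".join(reversed(out))      # this is sentinel + T without its final sentinel
    return s[1:] + s[0]

L = bwt("mississippi#")
print(L)                            # ipssm#pissii
print(ibwt(L))                      # mississippi#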

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n^2 log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in-degree(u) = k ]  ≈  1 / k^a ,    a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries: map them to unsigned values before coding (v ≥ 0 → 2v, v < 0 → 2|v|−1), as in the sketch below
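
A gap-encoding sketch (Python) for the successor list S(x) above, together with the sign mapping just mentioned; the example node and list are made up, they are not the ones of the figures.

def encode_gaps(x, successors):
    # successors sorted increasingly: gaps are s1-x, s2-s1-1, ..., sk-s(k-1)-1
    gaps, prev = [], None
    for s in successors:
        gaps.append(s - x if prev is None else s - prev - 1)
        prev = s
    return gaps

def to_unsigned(v):
    return 2 * v if v >= 0 else 2 * (-v) - 1    # only the first gap can be negative

print(encode_gaps(15, [13, 15, 16, 17, 20, 21, 22]))   # [-2, 1, 0, 0, 2, 0, 0]
print(to_unsigned(-2))                                 # 3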

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides and efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
            Emacs size    Emacs time
uncompr     27Mb          ---
gzip        8Mb           35 secs
zdelta      1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[figure: weighted graph over the files, plus a dummy node whose edge weights are the gzip sizes; the min branching picks the cheapest reference for each file]

            space    time
uncompr     30Mb     ---
tgz         20%      linear
THIS        8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
            space    time
uncompr     260Mb    ---
tgz         12%      2 mins
THIS        8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

          gcc size    emacs size
total     27288       27326
gzip       7563        8577
zdelta      227        1431
rsync       964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[figure: the suffix tree of T# = mississippi#; edges are labelled with substrings, leaves with the starting positions 1..12 of the suffixes]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
5

Q(N2) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Q(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
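
A runnable sketch of the indirect binary search (Python): the suffix array is built naively, and the two searches return the SA range of the suffixes prefixed by P.

def sa_range(T, SA, P):
    def pref(i):
        return T[SA[i]:SA[i] + len(P)]       # O(p) chars compared per step
    lo, hi = 0, len(SA)
    while lo < hi:                           # leftmost suffix with prefix >= P
        mid = (lo + hi) // 2
        if pref(mid) < P: lo = mid + 1
        else: hi = mid
    left = lo
    hi = len(SA)
    while lo < hi:                           # leftmost suffix with prefix > P
        mid = (lo + hi) // 2
        if pref(mid) <= P: lo = mid + 1
        else: hi = mid
    return left, lo

T = "mississippi#"
SA = sorted(range(len(T)), key=lambda i: T[i:])    # naive construction, fine for the example
l, r = sa_range(T, SA, "si")
print([SA[i] + 1 for i in range(l, r)])            # [7, 4]: the two occurrences of "si"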

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
• Search for Lcp[i,i+C-2] whose entries are ≥ L


Slide 97

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm

  Use a pair of variables: X (the candidate) and C (a counter), with C = 0 initially
  For each item s of the stream:
     if (X == s) then C++
     else if (C == 0) then { X = s; C = 1; }
     else C--
  Return X;
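
A minimal Python sketch of this majority-vote scan (one pass, two variables):

def majority_candidate(stream):
    X, C = None, 0                  # X = current candidate, C = its counter
    for s in stream:
        if C == 0:
            X, C = s, 1
        elif X == s:
            C += 1
        else:
            C -= 1
    return X                        # meaningful only if some item occurs > N/2 times

print(majority_candidate("bacccdcbaaaccbccc"))   # 'c' (it occurs 9 times out of 17)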

Proof
If the returned X ≠ y, then every one of y's occurrences has a distinct "negative" mate
(an occurrence of some other item that cancelled it). Hence the mates are ≥ #occ(y).
As a result, #occ(y) + #mates ≥ 2 * #occ(y) > N, which exceeds the stream length: contradiction.

Note: the answer is guaranteed only if some item really occurs > N/2 times; otherwise X can be arbitrary.

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 10^9 chars, size = 6GB
n = 10^6 documents
TotT = 10^9 total term occurrences (avg term length is 6 chars)
t = 5 * 10^5 distinct terms

What kind of data structure should we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million documents (columns), t = 500K terms (rows); entry is 1 if the play contains the word, 0 otherwise.

             Antony&Cleopatra  JuliusCaesar  TheTempest  Hamlet  Othello  Macbeth
Antony              1               1            0         0        0        1
Brutus              1               1            0         1        0        0
Caesar              1               1            0         1        1        1
Calpurnia           0               1            0         0        0        0
Cleopatra           1               0            0         0        0        0
mercy               1               0            1         1        1        1
worser              1               0            1         1        1        0

Space is 500Gb !

Solution 2: Inverted index

Each term points to the sorted list of the docIDs containing it; for the terms Brutus, Calpurnia, Caesar the slide shows lists such as

   2  4  8  16  32  64  128
   1  2  3  5  8  13  21  34
   13 16

We can still do better: i.e. 30÷50% of the original text

1. Typically each posting uses about 12 bytes
2. We have 10^9 total terms → at least 12GB space
3. Compressing the 6GB of documents gets 1.5GB of data
A better index, but yet it is >10 times the text !!!!
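
For concreteness, a tiny Python sketch of an in-memory inverted index (term → sorted list of docIDs); the documents are hypothetical, and a real construction over 10^9 term occurrences would of course use the disk-based sorting seen above:

from collections import defaultdict

def build_inverted_index(docs):
    # docs: dict docID -> text
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

idx = build_inverted_index({1: "brutus killed caesar", 2: "caesar and calpurnia"})
print(idx["caesar"])   # [1, 2]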

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

   ∑_{i=1}^{n−1} 2^i  =  2^n − 2   <   2^n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
   i(s) = log2 (1/p(s)) = − log2 p(s)

Lower probability → higher information

Entropy is the weighted average of i(s):

   H(S) = ∑_{s∈S} p(s) · log2 (1/p(s))    bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ? (it decodes both as "ad" and as "ca")

With a uniquely decodable code, every encoded sequence can be
decomposed into codewords in only one way.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
[Figure: the binary trie of the code — a at 0, b at 100, c at 101, d at 11]

Average Length
For a code C with codeword length L[s], the
average length is defined as
   La(C) = ∑_{s∈S} p(s) · L[s]

We say that a prefix code C is optimal if for all prefix codes C', La(C) ≤ La(C')

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal uniquely decodable code,
there exists a prefix code with the same codeword lengths, and thus
the same optimal average length. And vice versa…

Theorem (golden rule). If C is an optimal prefix code for a source
with probabilities {p1, …, pn}, then pi < pj  →  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

   H(S) ≤ La(C)
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

   La(C) ≤ H(S) + 1

The Shannon code takes ⌈log2 (1/p)⌉ bits per symbol

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
[Figure: Huffman tree — a(.1) and b(.2) merge into (.3); (.3) and c(.2) merge into (.5); (.5) and d(.5) merge into (1)]

a=000, b=001, c=01, d=1
There are 2^{n-1} "equivalent" Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
Example (with the code above):   abc → 00000101        101001... → dcb...

[Figure: encoding walks root-to-leaf, decoding walks the tree bit by bit]
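
A compact Python sketch of the construction and codeword assignment (heap-based, ties broken by insertion order — one of the many "equivalent" trees):

import heapq

def huffman_codes(probs):
    # probs: dict symbol -> probability; returns dict symbol -> bit string
    heap = [(p, i, s) for i, (s, p) in enumerate(probs.items())]   # i breaks ties
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)
        p2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next_id, (left, right)))
        next_id += 1
    codes = {}
    def assign(node, prefix):
        if isinstance(node, tuple):            # internal node
            assign(node[0], prefix + "0")
            assign(node[1], prefix + "1")
        else:                                  # leaf = symbol
            codes[node] = prefix or "0"
    assign(heap[0][2], "")
    return codes

print(huffman_codes({"a": .1, "b": .2, "c": .2, "d": .5}))
# {'d': '0', 'c': '10', 'a': '110', 'b': '111'}: same codeword lengths as the
# slide's a=000, b=001, c=01, d=1, just with 0/1 labels swapped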

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:
  firstcode[L]
  Symbol[L,i], for each i in level L

This is ≤ h² + |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
   − log2(.999) ≈ .00144

If we were to send 1000 such symbols we
might hope to use 1000 * .00144 ≈ 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
  → 1 extra bit per macro-symbol = 1/k extra bits per symbol
  → larger model to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:
  The model takes |S|^k · (k · log |S|) + h² bits   (where h might be |S|)
  It is H0(S^L) ≤ L · Hk(S) + O(k · log |S|), for each k ≤ L

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
[Figure: the tagged Huffman encoding of T = "bzip or not bzip" — each codeword is a sequence of bytes, 7 bits coming from the 128-ary Huffman tree plus 1 tag bit; the tags mark the byte-aligned codeword boundaries in C(T)]

CGrep and other ideas...
P= bzip = 1a 0b

[Figure: GREP on the compressed text — scan C(T) for the codeword 1a 0b of P = bzip; byte-aligned candidates are checked against the tag bits, reporting the true occurrences (yes) and discarding the spurious ones (no)]
Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary: {bzip, not, or, space}.     P = bzip, whose codeword is 1a 0b.
S = "bzip or not bzip"

[Figure: scan the compressed text C(S) for the codeword of P, checking the tag bits so that only codeword-aligned occurrences are reported (yes/no)]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
[Figure: a text T and a pattern P = AB being slid along it]
 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which arithmetic and bit operations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with small probability.
A randomized algorithm that never errs, whose running time is efficient with high probability.

We will consider a binary alphabet (i.e., T ∈ {0,1}^n).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

   H(s) = ∑_{i=1}^{m} 2^{m−i} · s[i]

P = 0101
H(P) = 2^3·0 + 2^2·1 + 2^1·0 + 2^0·1 = 5

s = s'  if and only if  H(s) = H(s')

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

   H(T_r) = 2·H(T_{r−1}) − 2^m·T[r−1] + T[r+m−1]

T = 10110101
T1 = 1011,  T2 = 0110
H(T1) = H(1011) = 11
H(T2) = H(0110) = 2·11 − 2^4·1 + 0 = 22 − 16 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
   1·2 + 0 (mod 7) = 2
   2·2 + 1 (mod 7) = 5
   5·2 + 1 (mod 7) = 4
   4·2 + 1 (mod 7) = 2
   2·2 + 1 (mod 7) = 5
   5 (mod 7) = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(Tr−1), since
   2^m (mod q) = 2·(2^{m−1} (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
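
A small Python sketch of the whole scheme (a fixed prime q is used here for reproducibility — the real algorithm picks it at random; this version verifies every candidate, so it never errs):

def karp_rabin_matches(T, P, q=2_147_483_647, d=256):
    n, m = len(T), len(P)
    if m > n:
        return []
    h = pow(d, m - 1, q)                        # d^(m-1) mod q, used when sliding
    hp = ht = 0
    for i in range(m):                          # fingerprints of P and of T[0..m-1]
        hp = (hp * d + ord(P[i])) % q
        ht = (ht * d + ord(T[i])) % q
    out = []
    for r in range(n - m + 1):
        if hp == ht and T[r:r + m] == P:        # check, to rule out false matches
            out.append(r)
        if r < n - m:                           # slide the window: drop T[r], add T[r+m]
            ht = ((ht - ord(T[r]) * h) * d + ord(T[r + m])) % q
    return out

print(karp_rabin_matches("10110101", "0101"))   # [4]: the slide's match at (1-based) position 5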

Problem 1: Solution
Dictionary: {bzip, not, or, space};   P = bzip → codeword 1a 0b;   S = "bzip or not bzip"

[Figure: run the single-pattern matcher over C(S) for the codeword of P; the tag bits rule out the non-aligned (false) matches]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

[Figure: the m×n matrix M for T = california and P = for — the only 1-entries are M(1,5), M(2,6), M(3,7), i.e. "f", "fo", "for" ending at positions 5, 6, 7 of T]

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to compute the j-th column of M from the (j−1)-th one.
We define the m-length binary vector U(x) for each character x in the alphabet:
U(x) is set to 1 for the positions in P where character x appears.
Example:
P = abaac
U(a) = (1,0,1,1,0)     U(b) = (0,1,0,0,0)     U(c) = (0,0,0,0,1)

How to construct M



Initialize column 0 of M to all zeros
For j > 0, the j-th column is obtained by

   M(j) = BitShift( M(j−1) )  &  U( T[j] )

For i > 1, entry M(i,j) = 1 iff
 (1) the first i−1 characters of P match the i−1 characters of T ending at character j−1   ⇔   M(i−1, j−1) = 1
 (2) P[i] = T[j]   ⇔   the i-th bit of U(T[j]) = 1

BitShift moves bit M(i−1, j−1) into the i-th position;
AND-ing it with the i-th bit of U(T[j]) establishes whether both conditions hold.
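
A minimal Python sketch, using integers as bit masks (bit i of the mask plays the role of row i+1 of M; Python integers are unbounded, so the m ≤ w issue disappears here):

def shift_and(T, P):
    m = len(P)
    U = {}
    for i, c in enumerate(P):                 # U[c]: bit i set iff P[i] == c
        U[c] = U.get(c, 0) | (1 << i)
    M, last = 0, 1 << (m - 1)
    occs = []
    for j, c in enumerate(T):
        M = ((M << 1) | 1) & U.get(c, 0)      # M(j) = BitShift(M(j-1)) & U(T[j])
        if M & last:                          # P ends at position j
            occs.append(j - m + 1)
    return occs

print(shift_and("xabxabaaca", "abaac"))       # [4]: the occurrence ending at the slide's position 9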

An example, j=1:    T = xabxabaaca,  P = abaac

   M(1) = BitShift(M(0)) & U(T[1]) = (1,0,0,0,0) & U(x) = (0,0,0,0,0)

An example, j=2:

   M(2) = BitShift(M(1)) & U(T[2]) = (1,0,0,0,0) & U(a) = (1,0,0,0,0)

An example, j=3:

   M(3) = BitShift(M(2)) & U(T[3]) = (1,1,0,0,0) & U(b) = (0,1,0,0,0)

An example, j=9:

   M(9) = BitShift(M(8)) & U(T[9]) = (1,1,0,0,1) & U(c) = (0,0,0,0,1)

The 1 in the last row says that P = abaac occurs in T ending at position 9.

Shift-And method: Complexity








If m ≤ w, any column and vector U() fit in a memory word
  → any step requires O(1) time.
If m > w, any column and vector U() can be divided into m/w memory words
  → any step requires O(m/w) time.
Overall O(n(1 + m/w) + m) time.
Thus, it is very fast when the pattern length is close to the word size —
very often the case in practice. Recall that w = 64 bits in modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: Another solution
Dictionary: {bzip, not, or, space};   P = bzip → codeword 1a 0b;   S = "bzip or not bzip"

[Figure: run Shift-And over the compressed text C(S), searching for the codeword of P]

Speed ≈ Compression ratio

Problem 2
Dictionary: {bzip, not, or, space}.    P = o.
Given a pattern P, find all the occurrences in S of all the dictionary terms containing P as a substring.
S = "bzip or not bzip"

[Figure: the terms containing "o" are "not" and "or", with codewords not = 1g 0g 0a and or = 1g 0a 0b; C(S) is scanned for each of them]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: a text T with the occurrences of the patterns P1 and P2 marked]
 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

 S is the concatenation of the patterns in P
 R is a bitmap of length m:   R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of the Shift-And method searching for S:
 For any symbol c, U'(c) = U(c) AND R
   → U'(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
 For any step j:
   compute M(j);
   then OR it with U'(T[j]). Why? To set to 1 the first bit of each pattern that starts with T[j];
   finally, check if there are occurrences ending in j. How?

Problem 3
Dictionary: {bzip, not, or, space}.
Given a pattern P, find all the occurrences in S of all the dictionary terms containing P as a substring, allowing at most k mismatches.
S = "bzip or not bzip";    P = bot,  k = 2

[Figure: the compressed text C(S) and the dictionary terms to be matched approximately]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff there are no more than l mismatches between the
first i characters of P and the i characters of T ending at character j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

[Figure: P[1..i−1] aligned, with ≤ l mismatches, against T ending at j−1; then P[i] = T[j]]

   BitShift( M^l(j−1) ) & U(T[j])

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

[Figure: P[1..i−1] aligned, with ≤ l−1 mismatches, against T ending at j−1; position j is charged as one more mismatch]

   BitShift( M^{l−1}(j−1) )

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1
T = xabxabaaca,    P = abaad

          1  2  3  4  5  6  7  8  9  10
M1 =  1:  1  1  1  1  1  1  1  1  1  1
      2:  0  0  1  0  0  1  0  1  1  0
      3:  0  0  0  1  0  0  1  0  0  1
      4:  0  0  0  0  1  0  0  1  0  0
      5:  0  0  0  0  0  0  0  0  1  0

M0 =  1:  0  1  0  0  1  0  1  1  0  1
      2:  0  0  1  0  0  1  0  0  0  0
      3:  0  0  0  0  0  0  1  0  0  0
      4:  0  0  0  0  0  0  0  1  0  0
      5:  0  0  0  0  0  0  0  0  0  0

How much do we pay?





The running time is O(k·n·(1 + m/w))
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.
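
A sketch of the k-mismatch variant in the same bit-parallel style, keeping the k+1 masks M^0..M^k of the recurrence above (toy Python, unbounded integers as bit masks):

def shift_and_k_mismatches(T, P, k):
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M = [0] * (k + 1)
    last = 1 << (m - 1)
    occs = []
    for j, c in enumerate(T):
        prev = M[:]                                        # the columns M^l(j-1)
        M[0] = ((prev[0] << 1) | 1) & U.get(c, 0)
        for l in range(1, k + 1):
            # extend with an equal char, OR extend an (l-1)-mismatch prefix with one more mismatch
            M[l] = (((prev[l] << 1) | 1) & U.get(c, 0)) | ((prev[l - 1] << 1) | 1)
        if M[k] & last:
            occs.append(j - m + 1)
    return occs

print(shift_and_k_mismatches("aatatccacaa", "atcgaa", 2))  # [3]: the slide's occurrence at (1-based) position 4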

Problem 3: Solution
Dictionary: {bzip, not, or, space};   S = "bzip or not bzip";   P = bot,  k = 2

[Figure: run the k-mismatch matcher over C(S); the dictionary term reported within 2 mismatches of "bot" is "not", whose codeword is 1g 0g 0a]

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol into p
Deletion: delete a symbol from p
Substitution: change a symbol of p into a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding

   γ(x) = 0 0 … 0  x in binary        (Length−1 zeros, then x in binary)

x > 0 and Length = ⌊log2 x⌋ + 1
e.g., 9 is represented as <000, 1001>.

γ-code for x takes 2⌊log2 x⌋ + 1 bits   (i.e. a factor of 2 from optimal)

Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8

6

3

59

7
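
A tiny Python sketch of the γ-encoder/decoder, checked on the exercise above:

def gamma_encode(x):
    b = bin(x)[2:]                        # x >= 1 in binary
    return "0" * (len(b) - 1) + b         # Length-1 zeros, then the binary digits

def gamma_decode_stream(bits):
    out, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":             # unary part: the length prefix
            zeros += 1
            i += 1
        out.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return out

code = "".join(gamma_encode(x) for x in (8, 6, 3, 59, 7))
print(code)                               # 0001000001100110000011101100111
print(gamma_decode_stream(code))          # [8, 6, 3, 59, 7]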

Analysis
Sort the p_i in decreasing order, and encode s_i via the variable-length code γ(i).

Recall that:  |γ(i)| ≤ 2·log i + 1
How good is this approach wrt Huffman?   Compression ratio ≤ 2·H0(S) + 1
Key fact:   1 ≥ ∑_{i=1,...,x} p_i ≥ x·p_x   →   x ≤ 1/p_x

How good is it ?
Encode the integers via γ-coding:  |γ(i)| ≤ 2·log i + 1
The cost of the encoding is (recall i ≤ 1/p_i):

   ∑_{i=1,...,|S|} p_i · |γ(i)|  ≤  ∑_{i=1,...,|S|} p_i · [ 2·log(1/p_i) + 1 ]  =  2·H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2

A new concept: Continuers vs Stoppers
  Previously we used: s = c = 128

The main idea is:
  s + c = 256 (we are playing with 8 bits)
  Thus s items are encoded with 1 byte
  And s·c with 2 bytes, s·c² with 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 128² = 16512 words on 2 bytes
A (230,26)-dense code encodes 230 + 230·26 = 6210 words on 2
bytes, hence more words on 1 byte, and thus it wins if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1^n 2^n 3^n … n^n   →   Huff = O(n² log n),  MTF = O(n log n) + n²

No much worse than Huffman
...but it may be far better
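
A minimal Python sketch of the MTF encoder (the list L is kept naively here; the search-tree / hash-table organization of a later slide makes each step O(log |Σ|)):

def mtf_encode(text, alphabet):
    L = list(alphabet)
    out = []
    for s in text:
        i = L.index(s)        # 1) output the position of s in L
        out.append(i)
        L.pop(i)              # 2) move s to the front of L
        L.insert(0, s)
    return out

print(mtf_encode("aaabbbccc", "abc"))   # [0, 0, 0, 1, 0, 0, 2, 0, 0]: runs become runs of 0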

MTF: how good is it ?
Encode the integers via γ-coding:  |γ(i)| ≤ 2·log i + 1
Put S at the front and consider the cost of encoding:

   O(S log S)  +  ∑_{x=1}^{S} ∑_{i=2}^{n_x} γ( p_i^x − p_{i−1}^x )

where p_1^x < p_2^x < … are the positions of the occurrences of symbol x. By Jensen's inequality:

   ≤  O(S log S)  +  ∑_{x=1}^{S} n_x · [ 2·log(N/n_x) + 1 ]
   =  O(S log S)  +  N · [ 2·H0(X) + 1 ]

   La[mtf]  ≤  2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code (there is a memory)
X = 1^n 2^n 3^n … n^n   →   Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.   p(a) = .2,  p(b) = .5,  p(c) = .3

   f(i) = ∑_{j<i} p(j)          f(a) = .0,   f(b) = .2,   f(c) = .7

[Figure: the unit interval [0,1) split into a = [0,.2), b = [.2,.7), c = [.7,1)]

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
[Figure: nesting of intervals — b → [.2,.7), then a → [.2,.3), then c → [.27,.3)]
The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
   l_0 = 0        l_i = l_{i−1} + s_{i−1} · f[c_i]
   s_0 = 1        s_i = s_{i−1} · p[c_i]

f[c] is the cumulative prob. up to symbol c (not included)

Final interval size is    s_n = ∏_{i=1}^{n} p[c_i]

The interval for a message sequence will be called the
sequence interval
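
A toy Python sketch of the interval computation (exact fractions, to sidestep rounding; the integer version discussed later in these slides is what real coders use):

from fractions import Fraction

def sequence_interval(msg, p):
    # p: dict symbol -> probability; f[c] = cumulative probability of the symbols before c
    f, acc = {}, Fraction(0)
    for c in sorted(p):
        f[c] = acc
        acc += p[c]
    l, s = Fraction(0), Fraction(1)
    for c in msg:                       # l_i = l_{i-1} + s_{i-1}*f[c_i];  s_i = s_{i-1}*p[c_i]
        l = l + s * f[c]
        s = s * p[c]
    return l, s

p = {"a": Fraction(2, 10), "b": Fraction(5, 10), "c": Fraction(3, 10)}
l, s = sequence_interval("bac", p)
print(float(l), float(l + s))           # 0.27 0.3  -> the sequence interval [.27, .3)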

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
[Figure: nested intervals — .49 ∈ [.2,.7) → b;  then .49 ∈ [.3,.55) → b;  then .49 ∈ [.475,.55) → c]

The message is bbc.

Representing a real number
Binary fractional representation:
   .75 = .11        1/3 = .010101...        11/16 = .1011
Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
   code     min       max       interval
   .11      .110      .111      [.75, 1.0)
   .101     .1010     .1011     [.625, .75)
We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

[Figure: the sequence interval [.61, .79) contains the code interval [.625, .75) of the dyadic number .101]

Can use L + s/2 truncated to 1 + ⌈log (1/s)⌉ bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

   1 + ⌈ log (1/s) ⌉  =  1 + ⌈ log ∏_i (1/p_i) ⌉
                      ≤  2 + ∑_{i=1,...,n} log (1/p_i)
                      =  2 + ∑_{k=1,...,|S|} n·p_k · log (1/p_k)
                      =  2 + n·H0     bits

In practice it takes nH0 + 0.02·n bits, because of rounding.

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R), where R = 2^k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
  Output 1 followed by m 0s;  m = 0;  the interval is expanded by 2

If u < R/2 then (bottom half)
  Output 0 followed by m 1s;  m = 0;  the interval is expanded by 2

If l ≥ R/4 and u < 3R/4 then (middle half)
  Increment m;  the interval is expanded by 2

In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine

[Figure: the arithmetic coder as a state machine (ATB): given the current interval (L,s), the distribution (p1,....,pS) and the next symbol c, it outputs the refined interval (L',s')]

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
[Figure: the same ATB state machine, now driven by the conditional probability p[ s | context ], where s is either a character c or the escape symbol esc]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts                  String = ACCBACCACBA B,    k = 2

Context Empty:    A = 4    B = 2    C = 5    $ = 3

Order-1 contexts:
   A:    C = 3    $ = 1
   B:    A = 2    $ = 1
   C:    A = 1    B = 2    C = 2    $ = 3

Order-2 contexts:
   AC:   B = 1    C = 2    $ = 2
   BA:   C = 1    $ = 1
   CA:   C = 1    $ = 1
   CB:   A = 2    $ = 1
   CC:   A = 1    B = 1    $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later
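
A small Python sketch of LZW encoding on the slide's string (dictionary seeded with just the characters that occur, rather than the 256 ASCII entries, to keep the codes readable):

def lzw_encode(text):
    dictionary = {c: i for i, c in enumerate(sorted(set(text)))}
    out, S = [], ""
    for c in text:
        if S + c in dictionary:
            S += c                                  # extend the current match
        else:
            out.append(dictionary[S])               # emit the code of S (no extra char is sent)
            dictionary[S + c] = len(dictionary)     # ...but S+c enters the dictionary
            S = c
    out.append(dictionary[S])
    return out

print(lzw_encode("aabaacababacb"))
# [0, 0, 1, 3, 2, 4, 8, 2, 1] with a=0, b=1, c=2: the same steps as the slide's
# 112, 112, 113, 256, 114, 257, 261, 114, ... up to the different seeding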

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us be given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: the L → F mapping

[The sorted BWT matrix again: first column F = # i i i i m p p s s s s, last column L = i p s s m # p i s s i i]

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
[The BWT matrix of mississippi#, with its F and L columns]

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]
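
A tiny Python sketch of both directions (the quadratic "sort all rotations" construction, plus the LF-based inversion; it assumes T ends with the unique smallest sentinel '#'):

def bwt(T):
    rots = sorted(T[i:] + T[:i] for i in range(len(T)))   # all rotations, sorted
    return "".join(row[-1] for row in rots)               # last column L

def ibwt(L):
    n = len(L)
    # LF mapping: equal chars keep in F the same relative order they have in L
    order = sorted(range(n), key=lambda i: (L[i], i))
    LF = [0] * n
    for f_pos, l_pos in enumerate(order):
        LF[l_pos] = f_pos
    T = [""] * n
    r = L.index("#")                   # the row of the sorted matrix equal to T itself
    for i in range(n - 1, -1, -1):     # rebuild T backwards: L[r] precedes F[r] in T
        T[i] = L[r]
        r = LF[r]
    return "".join(T)

L = bwt("mississippi#")
print(L)          # ipssm#pissii
print(ibwt(L))    # mississippi#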

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
 • Θ(n² log n) time in the worst case
 • Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion pages are available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


The largest artifact ever conceived by humans

Exploit the structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…

 Physical network graph
   V = routers;   E = communication links

 The "cosine" graph (undirected, weighted)
   V = static web pages;   E = semantic distance between pages

 Query-Log graph (bipartite, weighted)
   V = queries and URLs;   E = (q,u) if u is a result for q, and has been clicked by some user who issued q

 Social graph (undirected, unweighted)
   V = users;   E = (x,y) if x knows y (facebook, address book, email, ..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
   Pr[ in-degree(u) = k ]  ∝  1/k^α ,      α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = { s1 − x,  s2 − s1 − 1,  ...,  sk − s_{k−1} − 1 }

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77-scheme provides an efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
            Emacs size    Emacs time
uncompr     27Mb          ---
gzip        8Mb           35 secs
zdelta      1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: the weighted graph G_F over the files, plus a dummy node connected to all of them; edge weights are zdelta / gzip sizes; the min branching picks the cheapest reference for each file]

            space    time
uncompr     30Mb     ---
tgz         20%      linear
THIS        8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly: n² edge calculations (zdelta executions)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta executions. Nonetheless, strictly n² time.

            space    time
uncompr     260Mb    ---
tgz         12%      2 mins
THIS        8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

          gcc size    emacs size
total     27288       27326
gzip      7563        8577
zdelta    227         1431
rsync     964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Figure: the suffix tree of T# = mississippi#, with edge labels and with each leaf labelled by the starting position of its suffix]

T# = mississippi#

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
5

Θ(N²) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
 • SA: Θ(N log2 N) bits
 • Text T: N chars
 → In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does it exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does it exist a substring of length ≥ L occurring ≥ C times ?
• Search for Lcp[i,i+C-2] whose entries are ≥ L


Slide 98

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix   (t = 500K terms, n = 1 million docs)

             Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
  Antony            1               1              0          0       0        1
  Brutus            1               1              0          1       0        0
  Caesar            1               1              0          1       1        1
  Calpurnia         0               1              0          0       0        0
  Cleopatra         1               0              0          0       0        0
  mercy             1               0              1          1       1        1
  worser            1               0              1          1       1        0

Entry = 1 if the play contains the word, 0 otherwise.
Space is 500Gb !

Solution 2: Inverted index

  Brutus    → 2 4 8 16 32 64 128
  Calpurnia → 1 2 3 5 8 13 21 34
  Caesar    → 13 16

1. Each posting typically takes about 12 bytes
2. We have 10^9 total terms → at least 12Gb of space
3. Compressing the 6Gb of documents gets 1.5Gb of data
Better index, but it is still >10 times the text !!!!
We can still do better: i.e. 30÷50% of the original text.

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string is (lossless) compressed, some other must expand.
Take all messages of length n. Is it possible to compress ALL OF THEM into fewer bits ?
NO: they are 2^n, but the distinct shorter compressed messages are only

  Σ_{i=1}^{n-1} 2^i = 2^n - 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probabilities p(s), the self-information of s is:

  i(s) = log_2 (1/p(s)) = -log_2 p(s)

Lower probability → higher information.

Entropy is the weighted average of i(s):

  H(S) = Σ_{s∈S} p(s) · log_2 (1/p(s))   bits
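A one-line check of this formula (plain Python, with the probabilities used later in the Huffman running example):

import math

def entropy(probs):
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)   # bits/symbol

print(entropy([0.1, 0.2, 0.2, 0.5]))    # ≈ 1.76 bits/symbol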

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable-length code in which no codeword is a prefix of another one
e.g. a = 0, b = 100, c = 101, d = 11
It can be viewed as a binary trie.

[Figure: binary trie with leaf a reached by 0, b by 100, c by 101, d by 11.]

Average Length
For a code C with codeword lengths L[s], the average length is defined as

  L_a(C) = Σ_{s∈S} p(s) · L[s]

We say that a prefix code C is optimal if for all prefix codes C’, L_a(C) ≤ L_a(C’).

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix code for a source with probabilities {p_1, …, p_n},
then p_i < p_j ⇒ L[s_i] ≥ L[s_j].

Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any uniquely decodable code C, we have

  H(S) ≤ L_a(C)

Theorem (upper bound, Shannon). For any probability distribution, there exists a prefix code C such that

  L_a(C) ≤ H(S) + 1

(The Shannon code takes ⌈log_2 1/p⌉ bits.)

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

[Figure: Huffman tree — a(.1) and b(.2) merge into (.3); (.3) and c(.2) merge into (.5); (.5) and d(.5) merge into the root (1).]

a = 000, b = 001, c = 01, d = 1
There are 2^(n-1) “equivalent” Huffman trees.

What about ties (and thus, tree depth) ?
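A minimal heap-based construction of such a code (a sketch: ties are broken arbitrarily, so it may return any of the equivalent trees, all with codeword lengths 3, 3, 2, 1 on this input):

import heapq

def huffman_code(probs):
    # one heap entry per partial tree: (probability, tie-breaker, {symbol: codeword})
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    nxt = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)          # the two least-probable trees
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, nxt, merged))
        nxt += 1
    return heap[0][2]

print(huffman_code({"a": .1, "b": .2, "c": .2, "d": .5}))   # codeword lengths 3, 3, 2, 1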

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
abc → 000 001 01 = 00000101
101001… → d c b

[Figure: the Huffman tree of the running example, with 0/1 labels on its edges.]

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store, for each level L of the tree:
  firstcode[L] — the value of the first (all-zeros side) codeword of length L
  Symbol[L,i], for each i in level L

This takes ≤ h^2 + |S| log |S| bits.

Canonical Huffman Encoding

[Figure: the 5 levels of a canonical Huffman tree.]

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...
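A sketch of the standard canonical-Huffman decoding loop driven by firstcode. The slide’s 5-level table is not fully shown, so the example below uses the running example’s code a=000, b=001, c=01, d=1 and a firstcode/Symbol table built for it (an assumption, not the slide’s data):

def canonical_decode(bits, firstcode, symbol):
    out, i = [], 0
    while i < len(bits):
        v, l = int(bits[i]), 1
        i += 1
        while v < firstcode[l]:          # extend the codeword by one bit
            v = 2 * v + int(bits[i])
            i += 1
            l += 1
        out.append(symbol[l][v - firstcode[l]])
    return out

# a=000, b=001, c=01, d=1  ->  firstcode = {1:1, 2:1, 3:0}, Symbol = {1:"d", 2:"c", 3:"ab"}
print(canonical_decode("00000101", {1: 1, 2: 1, 3: 0}, {1: "d", 2: "c", 3: "ab"}))   # ['a','b','c']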

Problem with Huffman Coding
Consider a symbol with probability .999. Its self-information is

  -log_2(.999) ≈ .00144 bits

If we were to send 1000 such symbols we might hope to use 1000 * .00144 ≈ 1.44 bits.
Using Huffman, we take at least 1 bit per symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
  ≤ 1 extra bit per macro-symbol = 1/k extra bits per symbol
  but a larger model has to be transmitted

Shannon took infinite sequences, i.e. k → ∞ !!
In practice we have:
  the model takes |S|^k (k * log |S|) + h^2 bits (where h might be |S|)
  and H_0(S^L) ≤ L · H_k(S) + O(k · log |S|), for each k ≤ L

Compress + Search ?   [Moura et al, 98]

Compressed text derived from a word-based Huffman:
  Symbols of the Huffman tree are the words of T
  The Huffman tree has fan-out 128
  Codewords are byte-aligned and tagged: in each byte, 1 tag bit marks whether it is the first byte of a codeword, and the other 7 bits carry the Huffman code.

[Figure: the 128-ary codeword tree for the words of T = “bzip or not bzip” (bzip, or, not, space) and the resulting byte-aligned, tagged compressed text C(T).]

CGrep and other ideas...
P = bzip = 1a 0b

[Figure: the tagged codeword of P is searched directly (GREP-like) inside C(T), T = “bzip or not bzip”; the tag bits rule out false matches at non-codeword boundaries (yes/no marks).]

Speed ≈ Compression ratio

You find this under my Software projects.

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary = {bzip, not, or, space};  S = “bzip or not bzip”
P = bzip = 1a 0b

[Figure: the tagged codeword of P is scanned for in the compressed text C(S); the tag bits prevent false matches (yes/no marks).]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
[Figure: pattern P = A B slid along the text T = A B C A B D A B.]
 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons

Strings are also numbers, H: strings → numbers.
Let s be a string of length m:

  H(s) = Σ_{i=1}^{m} 2^{m-i} · s[i]

P = 0101  →  H(P) = 2^3·0 + 2^2·1 + 2^1·0 + 2^0·1 = 5
s = s’ if and only if H(s) = H(s’).

Definition: let T_r denote the m-length substring of T starting at position r (i.e., T_r = T[r, r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons

We can compute H(T_r) from H(T_{r-1}):

  H(T_r) = 2·H(T_{r-1}) - 2^m·T[r-1] + T[r+m-1]

T = 10110101:  T_1 = 1011, T_2 = 0110
H(T_1) = H(1011) = 11
H(T_2) = H(0110) = 2·11 - 2^4·1 + 0 = 22 - 16 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P = 101111, q = 7

H(P) = 47,  Hq(P) = 47 mod 7 = 5

Hq(P) can be computed incrementally (Horner’s rule, reducing mod 7 at each step):
  1·2 mod 7 + 0 = 2
  2·2 mod 7 + 1 = 5
  5·2 mod 7 + 1 = 4
  4·2 mod 7 + 1 = 2
  2·2 mod 7 + 1 = 5   →   Hq(P) = 5

We can still compute Hq(T_r) from Hq(T_{r-1}), since 2^m mod q = 2·(2^{m-1} mod q) mod q.
Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
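A compact sketch of the whole scheme over ASCII text (here every candidate is verified, so the output is exact; drop the verification to get the purely randomized variant):

def karp_rabin(T, P, q=2**31 - 1):
    n, m = len(T), len(P)
    if m > n:
        return []
    top = pow(2, m - 1, q)                    # 2^(m-1) mod q, used to drop the leftmost char
    hp = ht = 0
    for i in range(m):                        # fingerprints of P and of T_1
        hp = (2 * hp + ord(P[i])) % q
        ht = (2 * ht + ord(T[i])) % q
    occ = []
    for r in range(n - m + 1):
        if hp == ht and T[r:r + m] == P:      # fingerprint match + verification
            occ.append(r)
        if r + m < n:                         # roll the window: drop T[r], add T[r+m]
            ht = (2 * (ht - ord(T[r]) * top) + ord(T[r + m])) % q
    return occ

print(karp_rabin("10110101", "0101"))         # [4]  (0-based; the slide’s T_5)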

Problem 1: Solution
Dictionary = {bzip, not, or, space};  S = “bzip or not bzip”
P = bzip = 1a 0b

[Figure: the tagged codeword of P is scanned for in C(S); both occurrences of bzip are reported (yes marks).]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

[Figure: the 3×10 matrix M for T = california and P = for. Its only 1-entries are M(1,5), M(2,6), M(3,7): the occurrence of “for” ending at position 7.]

How does M solve the exact match problem?

How to construct M

We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one.
Machines can perform bit and arithmetic operations between two words in constant time. Examples:
  And(A,B) is the bit-wise AND between A and B.
  BitShift(A) is the value derived by shifting A’s bits down by one and setting the first bit to 1, e.g. BitShift((0,1,1,0,1)ᵀ) = (1,0,1,1,0)ᵀ.

Let w be the word size (e.g., 32 or 64 bits). We’ll assume m ≤ w. NOTICE: any column of M fits in a memory word.

How to construct M

We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one.
We define the m-length binary vector U(x) for each character x of the alphabet: U(x) has a 1 in the positions where x appears in P.

Example: P = abaac
  U(a) = (1,0,1,1,0)ᵀ    U(b) = (0,1,0,0,0)ᵀ    U(c) = (0,0,0,0,1)ᵀ

How to construct M

Initialize column 0 of M to all zeros.
For j > 0, the j-th column is obtained as

  M(j) = BitShift(M(j-1)) & U(T[j])

Indeed, for i > 1, entry M(i,j) = 1 iff
  (1) the first i-1 characters of P match the i-1 characters of T ending at character j-1, i.e. M(i-1,j-1) = 1;
  (2) P[i] = T[j], i.e. the i-th bit of U(T[j]) = 1.
BitShift moves bit M(i-1,j-1) into the i-th position; ANDing it with the i-th bit of U(T[j]) establishes whether both conditions hold.

An example (P = abaac, T = xabxabaaca)

[Figure: the columns M(1), M(2), M(3), …, M(9) computed step by step as M(j) = BitShift(M(j-1)) & U(T[j]). At j = 9 the 5-th bit of the column is 1: the occurrence of P ending at position 9 (i.e. starting at position 5 of T).]

Shift-And method: Complexity

If m ≤ w, any column and any vector U() fit in a memory word:
  any step requires O(1) time.
If m > w, any column and any vector U() can be divided into ⌈m/w⌉ memory words:
  any step requires O(m/w) time.
Overall O(n(1 + m/w) + m) time.
Thus it is very fast when the pattern length is close to the word size — very often in practice, since w = 64 bits in modern architectures.
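A word-sized sketch of the method (bit i-1 of the integer M plays the role of row i of the column; it assumes m ≤ 64):

def shift_and(T, P):
    m = len(P)
    U = {}                                   # U[c]: bit i set iff P[i] == c
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M, occ = 0, []
    for j, c in enumerate(T):
        M = ((M << 1) | 1) & U.get(c, 0)     # BitShift(M) & U(T[j])
        if M & (1 << (m - 1)):               # the full pattern ends at position j
            occ.append(j - m + 1)
    return occ

print(shift_and("xabxabaaca", "abaac"))      # [4]  (0-based start of the occurrence)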

Some simple extensions

We want to allow the pattern to contain special symbols, like the character class [a-f].
Example: P = [a-b]baac
  U(a) = (1,0,1,1,0)ᵀ    U(b) = (1,1,0,0,0)ᵀ    U(c) = (0,0,0,0,1)ᵀ

What about ‘?’, ‘[^…]’ (negation) ?

Problem 1: another solution
Dictionary = {bzip, not, or, space};  S = “bzip or not bzip”
P = bzip = 1a 0b

[Figure: the same scan of C(S), now driven by a Shift-And automaton built on the codeword of P (yes/no marks).]

Speed ≈ Compression ratio

Problem 2
Dictionary = {bzip, not, or, space};  S = “bzip or not bzip”

Given a pattern P, find all the occurrences in S of all the dictionary terms containing P as a substring.
P = o  →  matching terms: not = 1g 0g 0a, or = 1g 0a 0b

[Figure: each matching codeword is then searched for in C(S).]

Speed ≈ Compression ratio? No! Why?
A scan of C(S) for each term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1, P2, …, Pl} of total length m, we want to find all the occurrences of those patterns in the text T[1,n].

[Figure: patterns P1 and P2 matched over the text T.]

Naïve solution
  Use an (optimal) exact-matching algorithm to search for each pattern of P
  Complexity: O(nl + m) time, not good with many patterns

Optimal solution due to Aho and Corasick
  Complexity: O(n + l + m) time

A simple extension of Shift-And

S is the concatenation of the patterns in P.
R is a bitmap of length m: R[i] = 1 iff S[i] is the first symbol of a pattern.

Use a variant of the Shift-And method searching for S:
  For any symbol c, U’(c) = U(c) AND R, so U’(c)[i] = 1 iff S[i] = c and it is the first symbol of a pattern.
  For any step j:
    compute M(j);
    then OR it with U’(T[j]). Why? It sets to 1 the first bit of each pattern that starts with T[j].
    Check if there are occurrences ending in j. How?

Problem 3
Dictionary = {bzip, not, or, space};  S = “bzip or not bzip”

Given a pattern P, find all the occurrences in S of all the dictionary terms containing P as a substring, allowing at most k mismatches.
P = bot, k = 2

[Figure: the codeword tree of the dictionary and the compressed text C(S).]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep

Our current goal: given k, find all the occurrences of P in T with up to k mismatches.
We define the matrix M^l to be an m by n binary matrix such that:
  M^l(i,j) = 1 iff there are at most l mismatches between the first i characters of P and the i characters of T ending at character j.

What is M^0?
How does M^k solve the k-mismatch problem?

Computing M^k

We compute M^l for all l = 0, …, k.
For each j we compute M^0(j), M^1(j), …, M^k(j).
For all l we initialize M^l(0) to the zero vector.
In order to compute M^l(j), we observe that there is a match iff one of the two cases below holds.

Computing M^l: case 1
The first i-1 characters of P match a substring of T ending at j-1 with at most l mismatches, and the next pair of characters in P and T are equal:

  BitShift(M^l(j-1)) & U(T[j])

Computing M^l: case 2
The first i-1 characters of P match a substring of T ending at j-1 with at most l-1 mismatches (P[i] is then aligned to T[j], matching or not):

  BitShift(M^(l-1)(j-1))

Computing M^l

Combining the two cases, for l = 1, …, k:

  M^l(j) = [ BitShift(M^l(j-1)) & U(T[j]) ]  OR  BitShift(M^(l-1)(j-1))

Example

T = xabxabaaca, P = abaad

[Figure: the 5×10 matrices M^0 and M^1. Row 1 of M^1 is all 1s (a single character always matches with ≤ 1 mismatch); the 1 in row 5, column 9 of M^1 marks the occurrence of P ending at position 9 with one mismatch (abaac vs abaad).]

How much do we pay?

The running time is O(k·n·(1 + m/w)).
Again, the method is practically efficient for small m.
Only O(k) columns of M are needed at any given time, hence the space used by the algorithm is O(k) memory words.
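A sketch of the k-mismatch recurrence with one machine word per column (assumes m ≤ 64; substitutions only, as in the slides):

def shift_and_k_mismatches(T, P, k):
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M = [0] * (k + 1)                         # M[l] = current column of M^l
    ends = []
    for j, c in enumerate(T):
        prev = M[:]                           # columns at position j-1
        M[0] = ((prev[0] << 1) | 1) & U.get(c, 0)
        for l in range(1, k + 1):
            # case 1: extend an l-mismatch prefix with a matching character
            # case 2: extend an (l-1)-mismatch prefix, charging T[j] as a possible mismatch
            M[l] = (((prev[l] << 1) | 1) & U.get(c, 0)) | ((prev[l - 1] << 1) | 1)
        if M[k] & (1 << (m - 1)):
            ends.append(j)                    # an occurrence with <= k mismatches ends here
    return ends

print(shift_and_k_mismatches("aatatccacaa", "atcgaa", 2))   # [8]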

Problem 3: Solution
Dictionary = {bzip, not, or, space};  S = “bzip or not bzip”

Given a pattern P, find all the occurrences in S of all the dictionary terms containing P as a substring, allowing k mismatches.
P = bot, k = 2  →  not = 1g 0g 0a matches with 2 mismatches

[Figure: the codeword of not is then searched for in C(S) (yes marks).]

Agrep: more sophisticated operations

The Shift-And method can solve other operations as well.

The edit distance between two strings p and s is d(p,s) = the minimum number of operations needed to transform p into s via three ops:
  Insertion: insert a symbol in p
  Deletion: delete a symbol from p
  Substitution: change a symbol of p into a different one
Example: d(ananas, banane) = 3

Search by regular expressions
  Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword lengths, and thus build the tree…
This may be extremely time/space costly when you deal with Gbs of textual data.

A simple algorithm
Sort the p_i in decreasing order, and encode s_i via the variable-length code for the integer i.

γ-code for integer encoding

  γ(x) = 0^(Length-1) followed by x in binary,   where x > 0 and Length = ⌊log_2 x⌋ + 1

e.g., 9 is represented as <000, 1001>.
The γ-code for x takes 2⌊log_2 x⌋ + 1 bits (i.e. a factor of 2 from optimal).
It is optimal for Pr(x) = 1/(2x^2), and i.i.d. integers.
It is a prefix-free encoding…

Exercise: given the following sequence of γ-coded integers, reconstruct the original sequence:

  0001000001100110000011101100111   →   8, 6, 3, 59, 7
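A small encoder/decoder pair for this code (plain Python over bit strings), which also checks the exercise above:

def gamma_encode(x):
    b = bin(x)[2:]                      # binary representation of x (x >= 1)
    return "0" * (len(b) - 1) + b       # |b|-1 zeros, then b

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":           # count the leading zeros -> Length-1
            z += 1
            i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out

print(gamma_encode(9))                                   # 0001001
print(gamma_decode("0001000001100110000011101100111"))   # [8, 6, 3, 59, 7]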

Analysis
Sort the p_i in decreasing order, and encode s_i via the variable-length code γ(i).
Recall that |γ(i)| ≤ 2·log i + 1.
How good is this approach wrt Huffman? Compression ratio ≤ 2·H_0(S) + 1.
Key fact (p sorted decreasingly):  1 ≥ Σ_{j=1,…,i} p_j ≥ i·p_i  ⇒  i ≤ 1/p_i

How good is it ?
The cost of the encoding is (recall i ≤ 1/p_i):

  Σ_i p_i · |γ(i)|  ≤  Σ_i p_i · [2·log(1/p_i) + 1]  =  2·H_0(S) + 1

Not much worse than Huffman, and improvable to H_0(S) + 2 + …

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
The distribution of words is skewed: 1/i^q, where 1 < q < 2.

A new concept: Continuers vs Stoppers
  Previously we used s = c = 128.

The main idea is:
  s + c = 256 (we are playing with 8 bits)
  thus s items are encoded with 1 byte,
  s·c with 2 bytes, s·c^2 with 3 bytes, ...

An example

5000 distinct words.
ETDC encodes 128 + 128^2 = 16512 words on ≤ 2 bytes.
A (230,26)-dense code encodes 230 + 230·26 = 6210 words on ≤ 2 bytes, hence more words on 1 byte — better if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory effect.
Properties:
  It exploits temporal locality, and it is dynamic.
  X = 1^n 2^n 3^n … n^n  →  Huff = Θ(n^2 log n), MTF = O(n log n) + n^2
Not much worse than Huffman... but it may be far better.
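A few lines make the encoder concrete (a sketch over characters; the MTF-list is kept as a plain Python list):

def mtf_encode(text, alphabet):
    L = list(alphabet)                 # current MTF-list
    out = []
    for s in text:
        i = L.index(s)                 # 1) output the position of s in L
        out.append(i)
        L.insert(0, L.pop(i))          # 2) move s to the front of L
    return out

print(mtf_encode("mississippi", "imps"))   # [1, 1, 3, 0, 1, 1, 0, 1, 3, 0, 1]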

MTF: how good is it ?
Encode the output integers via γ-coding: |γ(i)| ≤ 2·log i + 1.
Put the list S in front of the output and consider the cost of encoding: if p_x^i denotes the position of the i-th occurrence of symbol x, the cost is at most

  O(|S| log |S|) + Σ_{x=1}^{|S|} Σ_{i=2}^{n_x} |γ(p_x^i - p_x^{i-1})|

By Jensen’s inequality (the gaps of each x sum to ≤ N):

  ≤ O(|S| log |S|) + Σ_{x=1}^{|S|} n_x · [2·log(N/n_x) + 1]  =  O(|S| log |S|) + N·[2·H_0(X) + 1]

hence L_a[mtf] ≤ 2·H_0(X) + O(1) bits per symbol.

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
  abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings → just the run lengths and one starting bit.
There is a memory effect. Properties:
  It exploits spatial locality, and it is a dynamic code.
  X = 1^n 2^n 3^n … n^n  →  Huff(X) = Θ(n^2 log n) > Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive), with

  f(i) = Σ_{j=1}^{i-1} p(j)

e.g. p(a) = .2, p(b) = .5, p(c) = .3  →  f(a) = .0, f(b) = .2, f(c) = .7,
i.e. a = [0, .2), b = [.2, .7), c = [.7, 1.0).

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7)).

Arithmetic Coding: Encoding Example
Coding the message sequence: bac

[Figure: starting from [0,1), after b the interval becomes [.2,.7); after a it becomes [.2,.3); after c it becomes [.27,.3).]

The final sequence interval is [.27, .3).

Arithmetic Coding
To code a sequence of symbols c_1 c_2 … c_n with probabilities p[c], use:

  l_0 = 0,  s_0 = 1
  l_i = l_{i-1} + s_{i-1} · f[c_i]
  s_i = s_{i-1} · p[c_i]

f[c] is the cumulative probability up to symbol c (not included).
The final interval size is s_n = Π_{i=1}^{n} p[c_i].
The interval for a message sequence will be called the sequence interval.
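The recurrence in code form, reproducing the bac example (floats stand in here for the exact arithmetic):

def sequence_interval(msg, p, f):
    l, s = 0.0, 1.0
    for c in msg:
        l, s = l + s * f[c], s * p[c]    # shrink [l, l+s) to the sub-interval of c
    return l, s

p = {"a": .2, "b": .5, "c": .3}
f = {"a": .0, "b": .2, "c": .7}
print(sequence_interval("bac", p, f))    # (0.27, 0.03), i.e. the interval [.27, .3)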

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:

[Figure: .49 falls in b = [.2,.7); within that interval it falls again in b = [.3,.55); within that one it falls in c = [.475,.55).]

The message is bbc.

Representing a real number
Binary fractional representation:

  .75 = .11      1/3 = .010101…      11/16 = .1011

Algorithm:
  1. x = 2 * x
  2. if x < 1 output 0
  3. else x = x - 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
e.g. [0,.33) → .01,   [.33,.66) → .1,   [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions:

  number .11   →  min .110…, max .111…   →  interval [.75, 1.0)
  number .101  →  min .1010…, max .1011… →  interval [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval is contained in the sequence interval (a dyadic number).

[Figure: the sequence interval [.61, .79) contains the code interval of .101, i.e. [.625, .75).]

Can use l + s/2 truncated to 1 + ⌈log (1/s)⌉ bits.
Note that ⌈-log s⌉ + 1 = ⌈log (2/s)⌉.

Bound on Length
Theorem: for a text of length n, the Arithmetic encoder generates at most

  1 + ⌈log (1/s)⌉ = 1 + ⌈log Π_i (1/p_i)⌉
                  ≤ 2 + Σ_{j=1,…,n} log (1/p_j)
                  = 2 + Σ_{k=1,…,|S|} n·p_k·log (1/p_k)
                  = 2 + n·H_0   bits

In practice it takes nH_0 + 0.02·n bits, because of rounding.

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 (top half):
  output 1 followed by m 0s; m = 0; the message interval is expanded by 2.
If u < R/2 (bottom half):
  output 0 followed by m 1s; m = 0; the message interval is expanded by 2.
If l ≥ R/4 and u < 3R/4 (middle half):
  increment m; the message interval is expanded by 2.
In all other cases, just continue...

You find this at

Arithmetic ToolBox as a state machine

[Figure: the coder keeps the pair (L,s); feeding a symbol c with distribution (p_1,…,p_|S|) maps (L,s) into (L’,s’), the sub-interval of [L, L+s) assigned to c.]

Therefore, even the distribution can change over time.

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox

[Figure: the ATB is driven by the conditional probability p[s | context], where s is either a character c or the escape symbol esc.]

Encoder and Decoder must know the protocol for selecting the same conditional probability distribution (PPM-variant).

PPM: Example Contexts (k = 2),  String = ACCBACCACBA B  (the last B is the symbol being encoded)

  Empty context:   A = 4,  B = 2,  C = 5,  $ = 3

  Order-1 contexts:
    A:  C = 3, $ = 1
    B:  A = 2, $ = 1
    C:  A = 1, B = 2, C = 2, $ = 3

  Order-2 contexts:
    AC: B = 1, C = 2, $ = 2
    BA: C = 1, $ = 1
    CA: C = 1, $ = 1
    CB: A = 2, $ = 1
    CC: A = 1, B = 1, $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
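A compact sketch of the LZW encoder (with real ASCII codes, so ord('a') = 97 rather than the slide’s toy value 112):

def lzw_encode(s):
    dict_ = {chr(i): i for i in range(256)}   # dictionary initialized with the 256 ASCII entries
    nxt, out, S = 256, [], ""
    for c in s:
        if S + c in dict_:
            S += c                            # extend the current longest match
        else:
            out.append(dict_[S])              # emit only the id of S ...
            dict_[S + c] = nxt                # ... but still add Sc to the dictionary
            nxt += 1
            S = c
    if S:
        out.append(dict_[S])
    return out

print(lzw_encode("aabaacababacb"))            # [97, 97, 98, 256, 99, 257, 261, 99, 98]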

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us be given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: the L → F mapping

[The sorted-rotations matrix again: first column F = # i i i i m p p s s s s, last column L = i p s s m # p i s s i i.]

How do we map L’s chars onto F’s chars ?
... We need to distinguish equal chars in F...
Take two equal chars of L: rotating their rows rightward by one position shows that they keep the same relative order in F !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
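A tiny forward/backward BWT in code form (naive rotation sort, fine for small texts; T is assumed to end with the unique sentinel '#'):

def bwt(T):
    rots = sorted(T[i:] + T[:i] for i in range(len(T)))   # sort all rotations
    return "".join(r[-1] for r in rots)                    # take the last column L

def ibwt(L):
    F = sorted(L)
    first, seen, LF = {}, {}, []
    for i, c in enumerate(F):                  # first row of each char in F
        first.setdefault(c, i)
    for c in L:                                # LF: equal chars keep their relative order
        LF.append(first[c] + seen.get(c, 0))
        seen[c] = seen.get(c, 0) + 1
    out, r = ["#"], 0                          # row 0 starts with '#', and T ends with '#'
    for _ in range(len(L) - 1):
        out.append(L[r])                       # L[r] precedes F[r] in T
        r = LF[r]
    return "".join(reversed(out))

print(bwt("mississippi#"))                     # ipssm#pissii
print(ibwt("ipssm#pissii"))                    # mississippi#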

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n^2 log n) time in the worst case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii     (# at position 16)
MTF-list = [i,m,p,s]

Mtf  = 020030000030030200300300000100000
Mtf’ = 030040000040040300400400000200000    (non-zero symbols shifted up by 1)
RLE0 = 03141041403141410210                 (0-runs length-encoded; Bin(6) = 110, Wheeler’s code)

Alphabet of |S|+1 symbols.
Bzip2-output = Arithmetic/Huffman on the |S|+1 symbols...
... plus γ(16), plus the original MTF-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution

Altavista crawl (1999) and WebBase crawl (2001): the indegree follows a power-law distribution

  Pr[in-degree(u) = k]  ∝  1/k^a,   a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:
  Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1
  Locality: usually most of the hyperlinks point to other URLs on the same host (about 80%).
  Similarity: pages close in lexicographic order tend to share many outgoing lists.

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s_1 - x, s_2 - s_1 - 1, ..., s_k - s_{k-1} - 1}

For negative entries (only the first gap can be negative): map a gap g ≥ 0 to 2g and a gap g < 0 to 2|g| - 1.
Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides and efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
            Emacs size   Emacs time
  uncompr      27Mb          ---
  gzip          8Mb        35 secs
  zdelta      1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: a small weighted graph over the files, plus a dummy node connected to all of them with gzip sizes as weights; the min branching picks the cheapest reference for each file.]

            space    time
  uncompr    30Mb     ---
  tgz        20%     linear
  THIS        8%     quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
            space    time
  uncompr   260Mb     ---
  tgz        12%     2 mins
  THIS        8%     16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

           gcc size   emacs size
  total     27288       27326
  gzip       7563        8577
  zdelta      227        1431
  rsync       964        4452

Compressed size in KB (slightly outdated numbers).
Factor 3–5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree

T# = mississippi#

[Figure: the suffix tree of T#, with edge labels such as #, i, s, si, ssi, p, pi#, ppi#, mississippi#, and the 12 leaves storing the starting positions of the suffixes.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Storing SUF(T) explicitly would take Θ(N^2) space.
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Θ(N log_2 N) bits
• Text T: N chars
→ In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log_2 N) binary-search steps
• Each step takes O(p) char comparisons
→ overall, O(p log_2 N) time
  [improvable to O(p + log_2 N) with Lcp information (Manber-Myers, ’90); see also the suffix trays of Cole et al., ’06]
Locating the occurrences

[Figure: the binary searches for si$ and si# delimit, in SA, the contiguous range of suffixes prefixed by P = si; here they point to text positions 7 and 4, so occ = 2.]

Suffix Array search
• O(p + log_2 N + occ) time

Suffix Trays: O(p + log_2 |S| + occ)   [Cole et al., ‘06]
String B-tree   [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays   [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does it exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does it exist a substring of length ≥ L occurring ≥ C times ?
• Search for Lcp[i,i+C-2] whose entries are ≥ L


Slide 99

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

Σ_{i=1,…,n−1} 2^i  =  2^n − 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i(s) = log2 (1 / p(s)) = − log2 p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H(S) = Σ_{s∈S} p(s) · log2 (1 / p(s))   bits
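A tiny Python sketch (not from the slides) of the two formulas above, computed on the distribution used later in the Huffman running example.

from math import log2

def entropy(p):
    # weighted average of the self informations log2(1/p(s))
    return sum(ps * log2(1.0 / ps) for ps in p.values() if ps > 0)

p = {'a': 0.1, 'b': 0.2, 'c': 0.2, 'd': 0.5}
print(entropy(p))        # ≈ 1.76 bits per symbol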

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
[Figure: the binary trie of the code — a = 0, b = 100, c = 101, d = 11.]

Average Length
For a code C with codeword length L[s], the
average length is defined as
La(C) = Σ_{s∈S} p(s) · L[s]

We say that a prefix code C is optimal if for
all prefix codes C’, La(C) ≤ La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, there exists a prefix
code with the same codeword lengths and thus the
same (optimal) average length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  ⇒  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H(S) ≤ La(C)

Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La(C) ≤ H(S) + 1

Shannon code: symbol s takes ⌈log2 1/p(s)⌉ bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
[Figure: Huffman tree construction — merge a(.1)+b(.2) → (.3), then (.3)+c(.2) → (.5), then (.5)+d(.5) → (1).]

a=000, b=001, c=01, d=1
There are 2^(n-1) “equivalent” Huffman trees

What about ties (and thus, tree depth) ?
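A minimal Python sketch (not from the slides) of the heap-based construction; the codeword lengths match the running example (a,b: 3 bits, c: 2 bits, d: 1 bit), although tie-breaking may produce one of the equivalent trees.

import heapq
from itertools import count

def huffman_codes(probs):
    tiebreak = count()                            # avoids comparing the dicts on ties
    heap = [(p, next(tiebreak), {s: ""}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)           # two least probable subtrees
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
    return heap[0][2]

print(huffman_codes({'a': .1, 'b': .2, 'c': .2, 'd': .5}))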

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
abc...  00000101          101001...  dcb

[Figure: the Huffman tree of the running example, traversed root-to-leaf for encoding and bit-by-bit for decoding.]

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h² + |S| log |S| bits

Canonical Huffman
Encoding

[Figure: canonical codeword assignment, levels 1–5.]

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...
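A minimal Python sketch (not from the slides) of the level-by-level canonical decoding, assuming the firstcode[] values of the slide and a hypothetical Symbol[level][offset] table listing, per level, the symbols in canonical order.

def canonical_decode(bits, firstcode, Symbol):
    it = iter(bits)
    v, level = next(it), 1
    while v < firstcode[level]:
        v = 2 * v + next(it)       # extend the candidate codeword by one bit
        level += 1
    return Symbol[level][v - firstcode[level]]

firstcode = {1: 2, 2: 1, 3: 1, 4: 2, 5: 0}        # values from the slide
Symbol = {5: ['x0', 'x1', 'x2', 'x3']}            # hypothetical symbols at level 5
print(canonical_decode([0, 0, 0, 1, 0], firstcode, Symbol))   # codeword 00010 -> 'x2'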

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
− log2(.999) ≈ .00144

If we were to send 1000 such symbols we
might hope to use 1000 * .00144 ≈ 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
⇒ 1 extra bit per macro-symbol = 1/k extra bits per symbol
⇒ Larger model to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:

Model takes |S|^k · (k * log |S|) + h² bits

It is H0(S^L) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
[Figure: word-based Huffman with fan-out 128 — codewords are sequences of bytes; the first bit of every byte is a tag, the remaining 7 bits come from the 128-ary Huffman tree. Example text T = “bzip or not bzip”, where the codeword of “bzip” is the byte pair 1a 0b.]

CGrep and other ideas...
P= bzip = 1a 0b

[Figure: GREP-like scan of the compressed text C(T): the tagged, byte-aligned codeword 1a 0b of P = bzip is searched directly in C(T), matching the first and last word of T = “bzip or not bzip”.]

Speed ≈ Compression ratio

You find this under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary = { bzip, not, or, space }
P = bzip = 1a 0b
S = “bzip or not bzip”

[Figure: the codeword of P is searched directly in the compressed text C(S).]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
[Figure: the pattern P slid over the text T, checking T[i, i+m−1] = P[1, m] at each position i.]
 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bit-operations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errs, whose running
time is efficient with high probability.

We will consider a binary alphabet (i.e., T ∈ {0,1}^n).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H(s) = Σ_{i=1,…,m} 2^{m−i} · s[i]

P = 0101
H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H(T_r) = 2·H(T_{r−1}) − 2^m·T[r−1] + T[r+m−1]

T = 10110101
T1 = 1011, T2 = 0110

H(T1) = H(1011) = 11
H(T2) = H(0110) = 2·11 − 2⁴·1 + 0 = 22 − 16 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally (Horner’s rule, all steps mod 7):
1·2 + 0 = 2
2·2 + 1 = 5
5·2 + 1 = 4
4·2 + 1 = 2
2·2 + 1 = 5  = Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1):
2^m (mod q) = 2·(2^{m−1} (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
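A minimal Python sketch (not from the slides) of the fingerprint scan on a binary text; here q is a fixed prime and candidate matches are verified, whereas the algorithm above picks q at random below I.

def karp_rabin(T, P, q=2**31 - 1):
    n, m = len(T), len(P)
    if m > n:
        return []
    two_m = pow(2, m, q)                       # 2^m (mod q), used to drop the leftmost bit
    hp = ht = 0
    for i in range(m):                         # fingerprints of P and of T_1
        hp = (2 * hp + int(P[i])) % q
        ht = (2 * ht + int(T[i])) % q
    occ = [0] if hp == ht else []
    for r in range(1, n - m + 1):              # Hq(T_r) from Hq(T_{r-1})
        ht = (2 * ht - two_m * int(T[r - 1]) + int(T[r + m - 1])) % q
        if ht == hp and T[r:r + m] == P:       # verification (deterministic variant)
            occ.append(r)
    return occ                                 # 0-based starting positions

print(karp_rabin("10110101", "0101"))          # [4] — the match found in the slides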

Problem 1: Solution
Dictionary = { bzip, not, or, space }
P = bzip = 1a 0b
S = “bzip or not bzip”

[Figure: the codeword of P is searched directly in the compressed text C(S), reporting the two occurrences of bzip.]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

[Figure: the m×n matrix M for T = california, P = for; e.g. M(1,5) = 1 since f occurs at position 5 of T, and M(3,7) = 1 since P ends at position 7 of T.]

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the (j−1)-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1 for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M(j) = BitShift(M(j−1)) & U(T[j])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






(1) ⇔ M(i-1,j-1) = 1
(2) ⇔ the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) into the i-th position;
AND this with the i-th bit of U(T[j]) to establish if both are true
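A minimal Python sketch (not from the slides) of the Shift-And scan, with each column of M kept as one integer (bit i plays the role of M(i+1, j)); it reports the 0-based ending positions of P in T.

def shift_and(T, P):
    m = len(P)
    U = {}
    for i, c in enumerate(P):                  # U(c): bits set at P's positions holding c
        U[c] = U.get(c, 0) | (1 << i)
    M, last = 0, 1 << (m - 1)
    occ = []
    for j, c in enumerate(T):
        M = ((M << 1) | 1) & U.get(c, 0)       # BitShift(M(j-1)) & U(T[j])
        if M & last:                           # M(m, j) = 1: an occurrence ends at j
            occ.append(j)
    return occ

print(shift_and("xabxabaaca", "abaac"))        # [8] — P ends at position 9 (1-based)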

An example
P = abaac,  T = xabxabaaca

[Figure: the columns M(1), M(2), …, M(9), each computed as BitShift(M(j−1)) & U(T[j]); at j = 9 the last bit of the column becomes 1, so an occurrence of P ends at position 9 of T.]

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: Another solution
Dictionary = { bzip, not, or, space }
P = bzip = 1a 0b
S = “bzip or not bzip”

[Figure: the codeword of P searched in the compressed text C(S).]

Speed ≈ Compression ratio

Problem 2
Dictionary = { bzip, not, or, space }

Given a pattern P find all the occurrences in S
of all terms containing P as substring

P = o
S = “bzip or not bzip”

[Figure: the dictionary terms containing ‘o’ are or and not, whose codewords
or  = 1 g 0 a 0 b
not = 1 g 0 g 0 a
are then searched in C(S).]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: the patterns P1 and P2 slid over the text T.]

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(n·l + m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of length m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) AND R
 U’(c)[i] = 1 iff S[i]=c and S[i] is the first symbol of a pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 It sets to 1 the first bit of each pattern that starts with T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary = { bzip, not, or, space }

Given a pattern P find all the occurrences in S
of all terms containing P as substring, allowing
at most k mismatches

P = bot, k = 2
S = “bzip or not bzip”

[Figure: the compressed text C(S) and the codewords of the dictionary terms.]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
there are no more than l mismatches between the
first i characters of P and the i characters of T
ending at character j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

[Figure: P[1 … i−1] aligned with T ending at j−1, followed by the equal pair P[i] = T[j].]

BitShift( M^l(j−1) ) & U(T[j])

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

[Figure: P[1 … i−1] aligned with T ending at j−1, with at most l−1 mismatches.]

BitShift( M^{l−1}(j−1) )

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))
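A minimal Python sketch (not from the slides) of this k-mismatch extension: the columns M^0, …, M^k are kept as integers and updated with the recurrence above.

def shift_and_k_mismatch(T, P, k):
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M = [0] * (k + 1)
    last = 1 << (m - 1)
    occ = []
    for j, c in enumerate(T):
        prev = M[:]                                        # the columns M^l(j-1)
        M[0] = ((prev[0] << 1) | 1) & U.get(c, 0)
        for l in range(1, k + 1):                          # case 1 OR case 2
            M[l] = (((prev[l] << 1) | 1) & U.get(c, 0)) | ((prev[l - 1] << 1) | 1)
        if M[k] & last:
            occ.append(j)                                  # <= k mismatches ending at j
    return occ

print(shift_and_k_mismatch("aatatccacaa", "atcgaa", 2))    # [8]: ends at T[9] (1-based)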

Example M1
T = xabxabaaca,  P = abaad

[Figure: the 5×10 matrices M0 and M1 — M0 marks exact prefix matches, M1 marks prefix matches with at most one mismatch; e.g. the last row of M1 has a 1 at column 9, where abaac aligns with abaad with a single mismatch.]

How much do we pay?





The running time is O(kn(1+m/w)).
Again, the method is practically efficient for small m.
Only O(k) columns of M are needed at any given time.
Hence, the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary = { bzip, not, or, space }

Given a pattern P find all the occurrences in S
of all terms containing P as substring, allowing
k mismatches

P = bot, k = 2
S = “bzip or not bzip”

[Figure: the only term matching bot with ≤ 2 mismatches is not, whose codeword
not = 1 g 0 g 0 a
is then searched in C(S).]

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding
γ(x) = (Length − 1) zeroes, followed by x in binary

x > 0 and Length = ⌊log2 x⌋ + 1
e.g., 9 is represented as <000, 1001>.

γ-code for x takes 2⌊log2 x⌋ + 1 bits
(i.e. a factor of 2 from optimal)

Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…

Given the following sequence of γ-coded integers, reconstruct the original sequence:

0001000001100110000011101100111

Answer: 8, 6, 3, 59, 7
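A minimal Python sketch (not from the slides) of γ-encoding and γ-decoding; decoding the exercise's bitstring returns 8 6 3 59 7.

def gamma_encode(x):                       # x > 0
    b = bin(x)[2:]                         # x in binary, |b| = floor(log2 x) + 1 bits
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":              # count the leading zeroes = Length - 1
            z += 1; i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out

print(gamma_encode(9))                                     # 0001001
print(gamma_decode("0001000001100110000011101100111"))     # [8, 6, 3, 59, 7]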

Analysis
Sort the p_i in decreasing order, and encode s_i via the variable-length code γ(i).

Recall that: |γ(i)| ≤ 2·log2 i + 1
How good is this approach w.r.t. Huffman?
Compression ratio ≤ 2·H0(s) + 1
Key fact:
1 ≥ Σ_{i=1,…,x} p_i ≥ x·p_x  ⇒  x ≤ 1/p_x

How good is it ?
The cost of the encoding is (recall i ≤ 1/p_i):

Σ_{i=1,…,|S|} p_i · |γ(i)|  ≤  Σ_{i=1,…,|S|} p_i · [2·log2(1/p_i) + 1]  =  2·H0(X) + 1

Not much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s·c with 2 bytes, s·c² on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 128² = 16512 words on at most 2 bytes
(230,26)-dense code encodes 230 + 230·26 = 6210 words on at most 2
bytes, hence more words (230 vs 128) on 1 byte — better if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n² log n) bits, MTF = O(n log n) + n² bits

Not much worse than Huffman
...but it may be far better
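A minimal Python sketch (not from the slides) of MTF encoding/decoding over a fixed initial symbol list; on a prefix of the BWT output used later it reproduces the integer sequence of the encoding example.

def mtf_encode(text, alphabet):
    L = list(alphabet)
    out = []
    for s in text:
        i = L.index(s)                 # position of s in the current list
        out.append(i)
        L.insert(0, L.pop(i))          # move s to the front
    return out

def mtf_decode(codes, alphabet):
    L = list(alphabet)
    out = []
    for i in codes:
        s = L[i]
        out.append(s)
        L.insert(0, L.pop(i))
    return "".join(out)

c = mtf_encode("ipppssssssmmmii", "imps")
print(c, mtf_decode(c, "imps") == "ipppssssssmmmii")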

MTF: how good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log2 i + 1
Put S in front of the sequence and consider the cost of encoding
(n_x = #occurrences of symbol x, p_x^i = position of its i-th occurrence, N = total length):

O(|S| log |S|) + Σ_{x=1,…,|S|} Σ_{i=2,…,n_x} |γ(p_x^i − p_x^{i−1})|

By Jensen’s inequality:

≤ O(|S| log |S|) + Σ_{x=1,…,|S|} n_x · [2·log2(N/n_x) + 1]
= O(|S| log |S|) + N · [2·H0(X) + 1]

⇒ La[mtf] ≤ 2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code

There is a memory

X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.  p(a) = .2, p(b) = .5, p(c) = .3

f(i) = Σ_{j<i} p(j)   ⇒   f(a) = .0, f(b) = .2, f(c) = .7

[Figure: the unit interval [0,1) partitioned as a = [.0,.2), b = [.2,.7), c = [.7,1).]

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
[Figure: coding bac — start from [0,1); b restricts it to [.2,.7); a restricts it to [.2,.3); c restricts it to [.27,.3).]

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l_0 = 0,  l_i = l_{i−1} + s_{i−1} · f[c_i]
s_0 = 1,  s_i = s_{i−1} · p[c_i]

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is  s_n = Π_{i=1,…,n} p[c_i]

The interval for a message sequence will be called the
sequence interval
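A minimal Python sketch (not from the slides) of the interval update above; this is the real-valued version, for illustration only — the integer version with scaling is what is used in practice.

def sequence_interval(msg, p, f):
    l, s = 0.0, 1.0
    for c in msg:
        l = l + s * f[c]          # l_i = l_{i-1} + s_{i-1} * f[c_i]
        s = s * p[c]              # s_i = s_{i-1} * p[c_i]
    return l, s                   # the sequence interval is [l, l+s)

p = {'a': .2, 'b': .5, 'c': .3}
f = {'a': .0, 'b': .2, 'c': .7}
l, s = sequence_interval("bac", p, f)
print(l, l + s)                   # ≈ 0.27, 0.30 as in the encoding example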

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
[Figure: decoding .49 — .49 ∈ [.2,.7) ⇒ b; within it, .49 ∈ [.3,.55) ⇒ b; within it, .49 ∈ [.475,.55) ⇒ c.]

The message is bbc.

Representing a real number
Binary fractional representation:
.75 = .11        1/3 = .0101…        11/16 = .1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
         min      max      interval
.11      .110     .111     [.75, 1.0)
.101     .1010    .1011    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

[Figure: sequence interval [.61, .79) contains the code interval [.625, .75) of the binary fraction .101.]

Can use L + s/2 truncated to 1 + ⌈log2 (1/s)⌉ bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R = 2^k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

[Figure: the ATB as a state machine — from state (L,s), the next symbol c with distribution (p1,…,p|S|) produces the new state (L’,s’).]

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
[Figure: the PPM model feeds p[s | context] (with s = c or esc) to the ATB, which updates its state (L,s) → (L’,s’).]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts          String = ACCBACCACBA B,  k = 2

Context Empty:  A = 4,  B = 2,  C = 5,  $ = 3

Context A:  C = 3, $ = 1        Context B:  A = 2, $ = 1        Context C:  A = 1, B = 2, C = 2, $ = 3

Context AC: B = 1, C = 2, $ = 2     Context BA: C = 1, $ = 1     Context CA: C = 1, $ = 1
Context CB: A = 2, $ = 1            Context CC: A = 1, B = 1, $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:
 Output <d, len, c> where
   d   = distance of the copied string w.r.t. the current position
   len = length of the longest match
   c   = next char in the text beyond the longest match
 Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
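A minimal Python decoding sketch (not from the slides) for (d, len, c) triples, handling the overlapping case exactly as the copy loop above does.

def lz77_decode(triples):
    out = []
    for d, length, c in triples:
        start = len(out) - d
        for i in range(length):                # the copy may overlap the part being written
            out.append(out[start + i])
        out.append(c)
    return "".join(out)

# The triples emitted in the windowed example above:
print(lz77_decode([(0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), (3, 3, 'a'), (1, 2, 'c')]))
# -> aacaacabcabaaac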

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids
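A minimal Python sketch (not from the slides) of the LZ78 coding loop, with the trie kept as a dictionary keyed by (node id, next char); on the example string below it emits the same pairs as the coding example.

def lz78_encode(text):
    dictionary = {}                        # (parent id, char) -> id; id 0 is the empty string
    out, node, next_id = [], 0, 1
    for c in text:
        if (node, c) in dictionary:        # extend the current longest match
            node = dictionary[(node, c)]
        else:
            out.append((node, c))          # emit (id of longest match, next char)
            dictionary[(node, c)] = next_id
            next_id += 1
            node = 0
    if node:                               # input ended inside a dictionary phrase
        out.append((node, ""))
    return out

print(lz78_encode("aabaacabcabcb"))
# -> [(0,'a'), (1,'b'), (1,'a'), (0,'c'), (2,'c'), (5,'b')]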

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

(1994)

F               L
#  mississipp  i
i  #mississip  p
i  ppi#missis  s
i  ssippi#mis  s
i  ssissippi#  m
m  ississippi  #
p  i#mississi  p
p  pi#mississ  i
s  ippi#missi  s
s  issippi#mi  s
s  sippi#miss  i
s  sissippi#m  i

A famous example

Much
longer...

A useful tool: the L → F mapping

[Figure: the sorted rotation matrix again, with arrows from each char of L to its copy in F.]

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
[Figure: the sorted rotation matrix of mississippi#, with F = # i i i i m p p s s s s and L = i p s s m # p i s s i i.]

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
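A minimal (quadratic-time) Python sketch (not from the slides) of the transform and of its inversion via the LF mapping; it assumes '#' is a unique end-marker smaller than every other character.

def bwt(T):
    rots = sorted(T[i:] + T[:i] for i in range(len(T)))    # the sorted rotations
    return "".join(row[-1] for row in rots)                 # L = last column

def ibwt(L):
    n = len(L)
    # LF[r] = position in F of the char L[r]; the stable sort keeps equal chars
    # in the same relative order, which is exactly the LF property above
    F = sorted(range(n), key=lambda r: (L[r], r))
    LF = [0] * n
    for f_pos, r in enumerate(F):
        LF[r] = f_pos
    r = L.index("#")               # the row of the rotation that is T itself
    out = []
    for _ in range(n):
        out.append(L[r])           # emits T[n-1], T[n-2], ..., T[0]
        r = LF[r]
    return "".join(reversed(out))

L = bwt("mississippi#")
print(L)             # ipssm#pissii
print(ibwt(L))       # mississippi#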

How to compute the BWT ?
SA = 12 11 8 5 2 1 10 9 7 4 6 3          L = i p s s m # p i s s i i

[Figure: the rows of the BWT matrix listed in SA order.]

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA:  12 → #,  11 → i#,  8 → ippi#,  5 → issippi#,  2 → ississippi#,  1 → mississippi#,
     10 → pi#,  9 → ppi#,  7 → sippi#,  4 → sissippi#,  6 → ssippi#,  3 → ssissippi#

Input: T = mississippi#

Elegant but inefficient. Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution

Altavista crawl, 1999 — WebBase crawl, 2001
Indegree follows a power law distribution:

Pr[ in-degree(u) = k ]  ∝  1/k^α,   α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/x^α, α ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 million pages, 150 million links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides an efficient, optimal solution




fknown is the “previously encoded text”: compress the concatenation fknown·fnew, emitting output only from fnew onwards

zdelta is one of the best implementations
          Emacs size   Emacs time
uncompr   27MB         ---
gzip      8MB          35 secs
zdelta    1.5MB        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: weighted graph over the files plus a dummy node; edge weights are zdelta sizes (dummy edges are gzip sizes); the min branching picks the best reference for each file.]

          space   time
uncompr   30MB    ---
tgz       20%     linear
THIS      8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
          space   time
uncompr   260MB   ---
tgz       12%     2 mins
THIS      8%      16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

         gcc size   emacs size
total    27288      27326
gzip     7563       8577
zdelta   227        1431
rsync    964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
T# = mississippi#

[Figure: the suffix tree of T#; edges are labeled with substrings (e.g. #, i, mississippi#, p, s, si, ssi, ppi#, pi#, …), and each leaf stores the starting position of its suffix.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
SA = 12 11 8 5 2 1 10 9 7 4 6 3
SUF(T) = #, i#, ippi#, issippi#, ississippi#, mississippi#, pi#, ppi#, sippi#, sissippi#, ssippi#, ssissippi#

T = mississippi#          P = si

Storing SUF(T) explicitly would take Θ(N²) space; we keep only the suffix pointers.
Suffix Array:
• SA: Θ(N log2 N) bits
• Text T: N chars
⇒ In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

[Figure: binary search over SA = 12 11 8 5 2 1 10 9 7 4 6 3 for P = si on T = mississippi#; here P is larger than the compared suffix; 2 accesses per step.]

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

[Figure: the search continues on one half of SA; here P is smaller than the compared suffix.]

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
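A minimal Python sketch (not from the slides) of the indirect binary search, using a toy O(n² log n) suffix-array construction by plain sorting; positions are reported 1-based as in the slides.

def suffix_array(T):
    return sorted(range(len(T)), key=lambda i: T[i:])   # toy construction

def search(T, SA, P):
    lo, hi = 0, len(SA)
    while lo < hi:                                       # leftmost suffix >= P
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + len(P)] < P:              # O(p) chars compared per step
            lo = mid + 1
        else:
            hi = mid
    occ = []
    while lo < len(SA) and T[SA[lo]:SA[lo] + len(P)] == P:
        occ.append(SA[lo] + 1)                           # 1-based starting positions
        lo += 1
    return occ

T = "mississippi#"
SA = suffix_array(T)
print(sorted(search(T, SA, "si")))                       # [4, 7]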

Locating the occurrences
occ = 2          T = mississippi#

[Figure: the occurrences of P = si are contiguous in SA; scanning SA over the suffixes having si as a prefix reports the text positions 4 and 7.]

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp = 0 1 1 4 0 0 1 0 2 1 3        SA = 12 11 8 5 2 1 10 9 7 4 6 3        T = mississippi#

e.g. the longest common prefix of the adjacent suffixes issippi and ississippi is issi, of length 4.
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does it exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does it exist a substring of length ≥ L occurring ≥ C times ?
• Search for Lcp[i,i+C-2] whose entries are ≥ L


Slide 100

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the self-information of s is:

     i(s) = log2 (1 / p(s)) = − log2 p(s)

Lower probability  ⇒  higher information

Entropy is the weighted average of i(s):

     H(S) = Σ_{s ∈ S} p(s) · log2 (1 / p(s))     bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code is one whose encoded sequences can always be
decomposed into codewords in exactly one way.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie whose leaves are the symbols: each codeword is a
root-to-leaf path (0 = left branch, 1 = right branch).
[Figure: binary trie with leaves a (0), b (100), c (101), d (11)]

Average Length
For a code C with codeword lengths L[s], the average length is defined as

     La(C) = Σ_{s ∈ S} p(s) · L[s]

We say that a prefix code C is optimal if for all prefix codes C’, La(C) ≤ La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal uniquely decodable code, there exists
a prefix code with the same codeword lengths, and thus the same (optimal)
average length. And vice versa…

Theorem (golden rule). If C is an optimal prefix code for a source with
probabilities {p1, …, pn}, then  pi < pj  ⇒  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any
uniquely decodable code C, we have

     H(S) ≤ La(C)

Theorem (upper bound, Shannon). For any probability distribution, there exists
a prefix code C such that

     La(C) ≤ H(S) + 1

(The Shannon code, which assigns ⌈log2 1/p(s)⌉ bits to symbol s, achieves this bound.)

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

[Figure: Huffman tree; a(.1) and b(.2) merge into (.3), which merges with c(.2)
into (.5), which merges with d(.5) into the root (1)]

a=000, b=001, c=01, d=1
There are 2^(n-1) “equivalent” Huffman trees

What about ties (and thus, tree depth) ?
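A sketch of the standard construction with a min-priority queue, run on the probabilities of the running example. The codeword lengths it produces are optimal; the actual bit patterns depend on how ties are broken, so they may differ from a=000, b=001, c=01, d=1:

#include <cstdio>
#include <queue>
#include <string>
#include <utility>
#include <vector>
using namespace std;

struct Node { double p; int left, right; char sym; };

// Walk the tree and collect the root-to-leaf paths as codewords.
void assign(const vector<Node>& t, int i, string code, vector<string>& out) {
    if (t[i].left < 0) { out[t[i].sym - 'a'] = code; return; }   // leaf
    assign(t, t[i].left,  code + "0", out);
    assign(t, t[i].right, code + "1", out);
}

int main() {
    vector<Node> t = { {.1,-1,-1,'a'}, {.2,-1,-1,'b'}, {.2,-1,-1,'c'}, {.5,-1,-1,'d'} };
    using Q = pair<double,int>;                       // (probability, node index)
    priority_queue<Q, vector<Q>, greater<Q>> pq;
    for (int i = 0; i < 4; ++i) pq.push({t[i].p, i});
    while (pq.size() > 1) {                           // repeatedly merge the two lightest trees
        auto [pa, a] = pq.top(); pq.pop();
        auto [pb, b] = pq.top(); pq.pop();
        t.push_back({pa + pb, a, b, '\0'});
        pq.push({pa + pb, (int)t.size() - 1});
    }
    vector<string> code(4);
    assign(t, pq.top().second, "", code);
    for (int i = 0; i < 4; ++i) printf("%c = %s\n", 'a' + i, code[i].c_str());
}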

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
abc...   ⇒   000 001 01 = 00000101
101001...   ⇒   d c b

[Figure: the Huffman tree of the running example, used for both encoding and decoding]

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store, for any level L of the codeword tree:
  firstcode[L]  (the first codeword on level L; the deepest level starts with 00.....0)
  Symbol[L,i], for each i in level L

This is ≤ h^2 + |S| log |S| bits  (h = tree height)

Canonical Huffman
Encoding
[Figure: canonical code assignment, level by level, over levels 1..5]

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
     − log2(.999) ≈ .00144

If we were to send 1000 such symbols we might hope to use 1000 · .00144 ≈ 1.44 bits.

Using Huffman, we take at least 1 bit per symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
  ≈ 1 extra bit per macro-symbol = 1/k extra bits per symbol
  but a larger model has to be transmitted

Shannon took infinite sequences, letting k → ∞ !!

In practice, we have:
  The model takes |S|^k · (k · log |S|) + h^2 bits   (where h might be |S|)
  It is H0(S^L) ≤ L · Hk(S) + O(k · log |S|), for each k ≤ L

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
[Figure: word-based Huffman with fan-out 128 over T = “bzip or not bzip”.
Each codeword is a sequence of bytes: 7 bits per byte come from the Huffman
code and the 8th bit tags the first byte of a codeword, so codewords in C(T)
are byte-aligned and self-synchronizing.]

CGrep and other ideas...
P = bzip = 1a 0b

[Figure: GREP on the uncompressed T = “bzip or not bzip” vs. compressed matching
directly on C(T): the byte-aligned, tagged codeword of P is searched in C(T),
and the tag bits accept (yes) or reject (no) each candidate alignment.]

Speed ≈ Compression ratio

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary = { bzip, not, or, space }        P = bzip = 1a 0b

[Figure: the compressed text C(S) of S = “bzip or not bzip”, built from the
byte-aligned, tagged codewords of the dictionary words; the codeword of P is
matched directly against C(S), the tag bits telling where codewords start
(yes/no marks in the figure).]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
[Figure: text T and pattern P drawn as arrays of characters]

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching
  We show methods in which arithmetic and bit operations replace comparisons.
  We will survey two examples of such methods:
    The Random Fingerprint method due to Karp and Rabin
    The Shift-And method due to Baeza-Yates and Gonnet

Rabin-Karp Fingerprint
We will use a class of functions from strings to integers in order to obtain:
  An efficient randomized algorithm that makes an error with small probability.
  A randomized algorithm that never errs, whose running time is efficient with
  high probability.

We will consider a binary alphabet (i.e., T ∈ {0,1}^n).

Arithmetic replaces Comparisons
Strings are also numbers, H: strings → numbers.
Let s be a string of length m:

     H(s) = Σ_{i=1..m} 2^(m−i) · s[i]

Example: P = 0101
     H(P) = 2^3·0 + 2^2·1 + 2^1·0 + 2^0·1 = 5

s = s’ if and only if H(s) = H(s’)

Definition: let Tr denote the m-length substring of T starting at position r
(i.e., Tr = T[r, r+m−1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons
We can compute H(Tr) from H(Tr-1) in constant time:

     H(Tr) = 2·H(Tr-1) − 2^m·T(r−1) + T(r+m−1)

T = 10110101,  m = 4:
     T1 = 1011,  T2 = 0110
     H(T1) = H(1011) = 11
     H(T2) = 2·11 − 2^4·1 + 0 = 22 − 16 = 6 = H(0110)

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P = 101111,  q = 7
H(P) = 47,   Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally, one bit at a time:
     1
     1·2 + 0 = 2           (mod 7)
     2·2 + 1 = 5           (mod 7)
     5·2 + 1 = 11 ≡ 4      (mod 7)
     4·2 + 1 = 9  ≡ 2      (mod 7)
     2·2 + 1 = 5           (mod 7)   = Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1), since
     2^m (mod q) = 2 · (2^(m-1) (mod q))  (mod q)

Intermediate values are also small!  (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm
  Choose a positive integer I
  Pick a random prime q ≤ I, and compute P’s fingerprint Hq(P)
  For each position r in T, compute Hq(Tr) and test whether it equals Hq(P).
  If the two numbers are equal, either
    declare a probable match (randomized algorithm), or
    check and declare a definite match (deterministic algorithm)

  Running time, excluding verification: O(n+m)
  The randomized algorithm is correct w.h.p.
  The deterministic algorithm has expected running time O(n+m)

Proof on the board
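A sketch of the whole scan on the binary example of these slides; q = 17 is a hypothetical small prime chosen only for illustration (a real implementation picks a random prime as described above), and every probable match is verified, so this behaves as the deterministic variant:

#include <cstdio>
#include <cstring>
#include <vector>
using namespace std;

// Karp-Rabin scan: report all r such that Tr = P, comparing Hq(Tr) with Hq(P).
vector<int> kr_matches(const char* T, const char* P) {
    const long long q = 17;                        // hypothetical prime
    int n = strlen(T), m = strlen(P);
    vector<int> occ;
    if (m > n) return occ;
    long long hp = 0, ht = 0, pw = 1;              // pw = 2^(m-1) mod q
    for (int i = 0; i < m; ++i) {
        hp = (2 * hp + (P[i] - '0')) % q;
        ht = (2 * ht + (T[i] - '0')) % q;
        if (i) pw = (2 * pw) % q;
    }
    for (int r = 0; ; ++r) {
        if (hp == ht && memcmp(T + r, P, m) == 0)  // verify, so no false match survives
            occ.push_back(r + 1);                  // 1-based position, as in the slides
        if (r + m >= n) break;
        // Hq(T_{r+1}) = 2*Hq(T_r) - 2^m*T(r) + T(r+m)  (mod q), kept non-negative
        ht = (2 * (ht - pw * (T[r] - '0') % q + 2 * q) + (T[r + m] - '0')) % q;
    }
    return occ;
}

int main() {
    for (int r : kr_matches("10110101", "0101"))
        printf("occurrence at position %d\n", r);  // reports position 5
}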

Problem 1: Solution
Dictionary = { bzip, not, or, space }        P = bzip = 1a 0b

[Figure: the tagged, byte-aligned codeword of P is searched directly in C(S),
S = “bzip or not bzip”; the tag bits rule out false alignments (yes/no marks).]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

        T:   c  a  l  i  f  o  r  n  i  a
             1  2  3  4  5  6  7  8  9  10
  P:  f      0  0  0  0  1  0  0  0  0  0
      o      0  0  0  0  0  1  0  0  0  0
      r      0  0  0  0  0  0  1  0  0  0

(the m x n matrix M for T = california and P = for)

How does M solve the exact match problem?

How to construct M
We want to exploit bit-parallelism to compute the j-th column of M from the
(j-1)-th one.
  Machines can perform bit and arithmetic operations between two words in
  constant time. Examples:
    And(A,B) is the bit-wise and between A and B.
    BitShift(A) is the value derived by shifting A’s bits down by one position
    and setting the first bit to 1, e.g. BitShift( (0,1,1,0,1) ) = (1,0,1,1,0).

Let w be the word size (e.g., 32 or 64 bits). We’ll assume m ≤ w.
NOTICE: any column of M fits in a memory word.

How to construct M
  We want to exploit bit-parallelism to compute the j-th column of M from the
  (j-1)-th one.
  We define the m-length binary vector U(x) for each character x of the
  alphabet: U(x) is set to 1 exactly at the positions of P where x appears.

Example: P = abaac
     U(a) = (1,0,1,1,0)     U(b) = (0,1,0,0,0)     U(c) = (0,0,0,0,1)

How to construct M
  Initialize column 0 of M to all zeros.
  For j > 0, the j-th column is obtained as

     M(j) = BitShift( M(j-1) )  &  U( T[j] )

  For i > 1, entry M(i,j) = 1 iff
    (1) the first i-1 characters of P match the i-1 characters of T ending at
        character j-1    ⇔    M(i-1, j-1) = 1
    (2) P[i] = T[j]      ⇔    the i-th bit of U(T[j]) = 1

  BitShift moves bit M(i-1,j-1) into the i-th position;
  AND-ing with the i-th bit of U(T[j]) establishes whether both conditions hold.

An example
T = xabxabaaca,  P = abaac

[Figure: the columns M(1), M(2), M(3), …, M(9) computed step by step as
M(j) = BitShift(M(j-1)) & U(T[j]), using the vectors U(x), U(a), U(b), U(c).
At j = 9 the last bit of the column becomes 1, i.e. an occurrence of P ends
at position 9 of T.]

Shift-And method: Complexity
  If m ≤ w, any column and any vector U() fit in one memory word
    ⇒ any step requires O(1) time.
  If m > w, any column and any vector U() can be divided into ⌈m/w⌉ memory words
    ⇒ any step requires O(m/w) time.
  Overall O(n·(1 + m/w) + m) time.
  Thus it is very fast when the pattern length is close to the word size, which
  is very often the case in practice. Recall that w = 64 bits in modern architectures.
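A sketch for the case m ≤ w = 64: one machine word per column M(j), with bit i−1 playing the role of row i (the function name shift_and is hypothetical):

#include <cstdio>
#include <cstdint>
#include <cstring>

// Shift-And: M(j) = BitShift(M(j-1)) & U(T[j]); an occurrence of P ends at
// position j exactly when the last bit (row m) of the column is set.
void shift_and(const char* T, const char* P) {
    int n = strlen(T), m = strlen(P);
    uint64_t U[256] = {0};
    for (int i = 0; i < m; ++i) U[(unsigned char)P[i]] |= 1ULL << i;   // vectors U(x)
    uint64_t M = 0;
    for (int j = 0; j < n; ++j) {
        M = ((M << 1) | 1ULL) & U[(unsigned char)T[j]];
        if (M & (1ULL << (m - 1)))
            printf("occurrence ending at position %d\n", j + 1);       // 1-based
    }
}

int main() { shift_and("xabxabaaca", "abaac"); }   // reports position 9, as in the example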

Some simple extensions
We want to allow the pattern to contain special symbols, like the character
class [a-f].

Example: P = [a-b]baac
     U(a) = (1,0,1,1,0)     U(b) = (1,1,0,0,0)     U(c) = (0,0,0,0,1)

What about ‘?’, ‘[^…]’ (not) ?

Problem 1: Another solution
Dictionary = { bzip, not, or, space }        P = bzip = 1a 0b

[Figure: the same compressed text C(S) of S = “bzip or not bzip”, with the
codeword 1a 0b of P searched in it; yes/no marks show accepted and rejected
candidate alignments.]

Speed ≈ Compression ratio

Problem 2
Dictionary = { bzip, not, or, space }
Given a pattern P, find all the occurrences in S of all dictionary terms
containing P as a substring.      Example: P = o

[Figure: the codewords of the terms containing P, here not = 1g 0g 0a and
or = 1g 0a 0b, are searched in C(S), S = “bzip or not bzip”.]

Speed ≈ Compression ratio?  No! Why?
A scan of C(S) is needed for each term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: text T and a set of patterns P1, P2, … drawn as arrays of characters]

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And
  S is the concatenation of the patterns in P
  R is a bitmap of length m:
    R[i] = 1 iff S[i] is the first symbol of a pattern
  Use a variant of the Shift-And method searching for S:
    For any symbol c,  U’(c) = U(c) AND R
      ⇒ U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
    For any step j:
      compute M(j)
      then take M(j) OR U’(T[j]).  Why?
        ⇒ it sets to 1 the first bit of each pattern that starts with T[j]
      check whether there are occurrences ending in j.  How?

Problem 3
Dictionary = { bzip, not, or, space }
Given a pattern P, find all the occurrences in S of all dictionary terms
containing P as a substring, allowing at most k mismatches.
Example: P = bot,  k = 2

[Figure: the codewords of the candidate terms are searched in C(S),
S = “bzip or not bzip”.]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep
  Our current goal: given k, find all the occurrences of P in T with up to k
  mismatches.
  We define the matrix Ml to be an m by n binary matrix such that:
    Ml(i,j) = 1 iff there are no more than l mismatches between the first i
    characters of P and the i characters of T ending at character j.

  What is M0?
  How does Mk solve the k-mismatch problem?

Computing Mk
  We compute Ml for all l = 0, …, k.
  For each j we compute M0(j), M1(j), …, Mk(j).
  For all l, initialize Ml(0) to the zero vector.
  In order to compute Ml(j), we observe that there is a match iff one of the
  two cases below holds.

Computing Ml: case 1
The first i-1 characters of P match a substring of T ending at j-1 with at most
l mismatches, and the next pair of characters in P and T are equal:

     BitShift( Ml(j-1) )  &  U( T[j] )

Computing Ml: case 2
The first i-1 characters of P match a substring of T ending at j-1 with at most
l-1 mismatches (the pair P[i], T[j] is then allowed to disagree):

     BitShift( Ml-1(j-1) )

Computing Ml
  We compute Ml for all l = 0, …, k; for each j we compute M0(j), M1(j), …, Mk(j),
  with Ml(0) initialized to the zero vector.
  Combining the two cases:

     Ml(j) = [ BitShift( Ml(j-1) ) & U( T[j] ) ]  OR  BitShift( Ml-1(j-1) )

Example M1
T = xabxabaaca,  P = abaad

[Figure: the matrices M0 and M1 (5 rows, 10 columns) computed column by column;
M1(5,9) = 1, i.e. P occurs ending at position 9 of T with at most 1 mismatch.]

How much do we pay?
  The running time is O(k·n·(1 + m/w)).
  Again, the method is practically efficient for small m.
  Only O(k) columns of M are needed at any given time, hence the space used by
  the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary = { bzip, not, or, space }
Given a pattern P, find all the occurrences in S of all dictionary terms
containing P as a substring, allowing k mismatches.   Example: P = bot, k = 2

[Figure: the candidate codeword, here not = 1g 0g 0a, is matched in C(S),
S = “bzip or not bzip”, with the agrep machinery.]

Agrep: more sophisticated operations
  The Shift-And method can solve other operations as well.
  The edit distance between two strings p and s is d(p,s) = minimum number of
  operations needed to transform p into s via three ops:
    Insertion: insert a symbol into p
    Deletion: delete a symbol from p
    Substitution: change a symbol of p into a different one
  Example: d(ananas, banane) = 3

Search by regular expressions
  Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword lengths, and thus build the
tree… This may be extremely time/space costly when you deal with Gbs of
textual data.

A simple algorithm
Sort the pi in decreasing order, and encode si via the variable-length code for
the integer i (its rank).

γ-code for integer encoding
  γ(x) = (Length−1) zeros, followed by x written in binary,
         where x > 0 and Length = ⌊log2 x⌋ + 1
  e.g., 9 is represented as <000, 1001>.
  The γ-code for x takes 2⌊log2 x⌋ + 1 bits  (i.e., a factor of 2 from optimal)
  Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers
  It is a prefix-free encoding…

Given the following sequence of γ-coded integers, reconstruct the original
sequence:

0001000001100110000011101100111

8

6

3

59

7
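A sketch of γ-encoding and γ-decoding; running it on 8, 6, 3, 59, 7 reproduces exactly the bit sequence of the exercise above:

#include <cstdio>
#include <string>
using namespace std;

// γ(x) = (Length-1) zeros followed by x in binary, Length = floor(log2 x) + 1.
string gamma_encode(unsigned x) {                    // assumes x > 0
    int len = 0;
    for (unsigned t = x; t; t >>= 1) ++len;          // number of bits of x
    string out(len - 1, '0');                        // unary part: Length-1 zeros
    for (int i = len - 1; i >= 0; --i) out += ((x >> i) & 1) ? '1' : '0';
    return out;
}

unsigned gamma_decode(const string& bits, size_t& pos) {
    int zeros = 0;
    while (bits[pos] == '0') { ++zeros; ++pos; }     // read the unary part back
    unsigned x = 0;
    for (int i = 0; i <= zeros; ++i) x = (x << 1) | (unsigned)(bits[pos++] - '0');
    return x;
}

int main() {
    string s = gamma_encode(8) + gamma_encode(6) + gamma_encode(3)
             + gamma_encode(59) + gamma_encode(7);
    printf("%s\n", s.c_str());                       // 0001000001100110000011101100111
    for (size_t pos = 0; pos < s.size(); )
        printf("%u ", gamma_decode(s, pos));         // 8 6 3 59 7
    printf("\n");
}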

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).
Recall that |γ(i)| ≤ 2·log2 i + 1.
How good is this approach w.r.t. Huffman?  Compression ratio ≤ 2·H0(S) + 1
Key fact:   1 ≥ Σ_{i=1..x} pi ≥ x·px   ⇒   x ≤ 1/px

How good is it ?
The cost of the encoding is (recall i ≤ 1/pi):

     Σ_{i=1..|S|} pi · |γ(i)|  ≤  Σ_{i=1..|S|} pi · [ 2·log2(1/pi) + 1 ]  =  2·H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
The distribution of words is skewed: roughly 1/i^q, where 1 < q < 2.

A new concept: Continuers vs Stoppers
  Previously we used s = c = 128.
The main idea is:
  s + c = 256  (we are playing with 8 bits)
  Thus s items are encoded with 1 byte,
  s·c with 2 bytes, s·c^2 with 3 bytes, ...

An example
  5000 distinct words
  ETDC encodes 128 + 128^2 = 16512 words within 2 bytes
  A (230,26)-dense code encodes 230 + 230·26 = 6210 within 2 bytes, hence more
  words on 1 byte, and thus wins if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:
  Exploits temporal locality, and it is dynamic
  X = 1^n 2^n 3^n … n^n   ⇒   Huff = O(n^2 log n),   MTF = O(n log n) + n^2

Not much worse than Huffman ...but it may be far better.
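A minimal sketch of the transform itself; the list is kept as a plain linked list, so each step costs O(|S|), while the search-tree/hash-table organization discussed below brings it to O(log |S|):

#include <cstdio>
#include <list>
#include <string>
using namespace std;

// Move-to-Front: replace each symbol by its current position in L, then move it to the front.
void mtf(const string& s) {
    list<unsigned char> L;
    for (int c = 0; c < 256; ++c) L.push_back((unsigned char)c);   // initial list [0..255]
    for (unsigned char ch : s) {
        int pos = 0;
        auto it = L.begin();
        while (*it != ch) { ++it; ++pos; }     // 1) output the position of ch in L
        printf("%d ", pos);
        L.erase(it);
        L.push_front(ch);                      // 2) move ch to the front of L
    }
    printf("\n");
}

int main() { mtf("aaabbbccc"); }   // prints 97 0 0 98 0 0 99 0 0: runs become small integers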

MTF: how good is it ?
Encode the output integers via γ-coding:  |γ(i)| ≤ 2·log2 i + 1.
Put S in front of the sequence and consider the cost of encoding:

     O(|S| log |S|)  +  Σ_{x=1..|S|}  Σ_{i=2..n_x}  |γ( p_i^x − p_{i-1}^x )|

where p_1^x < p_2^x < … are the positions of the occurrences of symbol x, and
n_x is their number. By Jensen’s inequality:

     ≤  O(|S| log |S|)  +  Σ_{x=1..|S|}  n_x · [ 2·log2( N / n_x ) + 1 ]
     =  O(|S| log |S|)  +  N · [ 2·H0(X) + 1 ]

     ⇒   La[mtf]  ≤  2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and separators) as the symbols
to be encoded.
How to keep the MTF-list efficiently:
  Search tree
    Leaves contain the symbols, ordered as in the MTF-list
    Nodes contain the size of their descending subtree
  Hash Table
    key is a symbol
    data is a pointer to the corresponding tree leaf

Each tree operation takes O(log |S|) time;
the total is O(n log |S|), where n = #symbols to be compressed.

Run Length Encoding (RLE)
If spatial locality is very high, then
     abbbaacccca  =>  (a,1),(b,3),(a,2),(c,4),(a,1)
In the case of binary strings ⇒ just the run lengths and one starting bit.
Properties:
  There is a memory: it exploits spatial locality, and it is a dynamic code
  X = 1^n 2^n 3^n … n^n   ⇒   Huff(X) = Θ(n^2 log n)  >  Rle(X) = Θ(n log n)
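A sketch of the char-level encoder matching the example above:

#include <cstdio>
#include <cstring>

// Run-Length Encoding: emit (symbol, run length) pairs.
void rle(const char* s) {
    int n = strlen(s);
    for (int i = 0; i < n; ) {
        int j = i;
        while (j < n && s[j] == s[i]) ++j;     // extend the current run
        printf("(%c,%d)", s[i], j - i);
        i = j;
    }
    printf("\n");
}

int main() { rle("abbbaacccca"); }   // prints (a,1)(b,3)(a,2)(c,4)(a,1)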

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive):

     f(i) = Σ_{j < i} p(j)

e.g.  p(a) = .2, p(b) = .5, p(c) = .3   ⇒   f(a) = .0, f(b) = .2, f(c) = .7
      a = [0, .2),    b = [.2, .7),    c = [.7, 1.0)

The interval for a particular symbol will be called the symbol interval
(e.g. for b it is [.2,.7)).

Arithmetic Coding: Encoding Example
Coding the message sequence: bac

[Figure: starting from [0,1), symbol b narrows the interval to [.2,.7);
within it, a narrows it to [.2,.3); within that, c narrows it to [.27,.3).]

The final sequence interval is [.27, .3)

Arithmetic Coding
To code a sequence of symbols c with probabilities p[c] use the following:

     l_0 = 0      l_i = l_{i-1} + s_{i-1} · f[c_i]
     s_0 = 1      s_i = s_{i-1} · p[c_i]

f[c] is the cumulative probability up to symbol c (not included).
The final interval size is

     s_n = Π_{i=1..n} p[c_i]

The interval for a message sequence will be called the sequence interval.
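A floating-point sketch of these two recurrences on the running example (p(a)=.2, p(b)=.5, p(c)=.3); real coders use the integer/scaling version shown later to avoid precision loss:

#include <cstdio>
#include <map>
#include <string>
using namespace std;

int main() {
    map<char, double> p = {{'a', .2}, {'b', .5}, {'c', .3}};
    map<char, double> f = {{'a', .0}, {'b', .2}, {'c', .7}};   // cumulative probabilities
    double l = 0.0, s = 1.0;                                   // l_0 = 0, s_0 = 1
    for (char c : string("bac")) {
        l = l + s * f[c];                                      // l_i = l_{i-1} + s_{i-1} * f[c_i]
        s = s * p[c];                                          // s_i = s_{i-1} * p[c_i]
    }
    printf("sequence interval = [%.3f, %.3f)\n", l, l + s);    // [0.270, 0.300)
}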

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:

[Figure: .49 falls into b = [.2,.7); within it, into b = [.3,.55); within that,
into c = [.475,.55).]

The message is bbc.

Representing a real number
Binary fractional representation:

     .75 = .11        1/3 = .0101...        11/16 = .1011

Algorithm (to emit the bits of x ∈ [0,1)):
  1.  x = 2·x
  2.  if x < 1, output 0
  3.  else x = x − 1; output 1

So how about just using the shortest binary fractional representation within
the sequence interval?
e.g.  [0,.33) → .01       [.33,.66) → .1       [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions:

     number    min       max       interval
     .11       .110...   .111...   [.75, 1.0)
     .101      .1010...  .1011...  [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval is
contained in the sequence interval (a dyadic interval).

[Figure: the sequence interval [.61, .79) contains the code interval of .101,
i.e. [.625, .75).]

Can use l + s/2 truncated to 1 + ⌈log2 (1/s)⌉ bits.

Bound on Arithmetic length
Note that  −log2 s + 1 = log2 (2/s).

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

     1 + ⌈log2 (1/s)⌉  =  1 + ⌈log2 Π_i (1/p_i)⌉
                       ≤  2 + Σ_{i=1..n} log2 (1/p_i)
                       =  2 + Σ_{k=1..|S|} n·p_k · log2 (1/p_k)
                       =  2 + n·H0     bits

In practice it takes nH0 + 0.02·n bits, because of rounding.

Integer Arithmetic Coding
The problem is that operations on arbitrary-precision real numbers are expensive.
Key ideas of the integer version:
  Keep integers in the range [0, R), where R = 2^k
  Use rounding to generate the integer interval
  Whenever the sequence interval falls into the top, bottom or middle half,
  expand the interval by a factor 2

Integer Arithmetic is an approximation.

Integer Arithmetic (scaling)
  If l ≥ R/2 (top half): output 1 followed by m 0s; m = 0; expand the interval by 2
  If u < R/2 (bottom half): output 0 followed by m 1s; m = 0; expand the interval by 2
  If l ≥ R/4 and u < 3R/4 (middle half): increment m; expand the interval by 2
  In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine: given the current interval (L, s) and a symbol c with
distribution (p1, …, p|S|), the ATB outputs the new interval (L’, s’), with
L’ = L + s·f(c) and s’ = s·p(c).

Therefore, even the distribution can change over time.

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts        String = ACCBACCACBA B,   k = 2

  Context: Empty        Context (order 1)           Context (order 2)
    A = 4                 A:  C = 3   $ = 1           AC:  B = 1   C = 2   $ = 2
    B = 2                 B:  A = 2   $ = 1           BA:  C = 1   $ = 1
    C = 5                 C:  A = 1   B = 2           CA:  C = 1   $ = 1
    $ = 3                      C = 2   $ = 3          CB:  A = 2   $ = 1
                                                      CC:  A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S)
as n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:
  Output <d, len, c>, where
    d   = distance of the copied string w.r.t. the current position
    len = length of the longest match
    c   = next char in the text beyond the longest match
  Advance by len + 1

A buffer “window” has fixed length and moves with the cursor.

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
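A sketch of the parsing step with a window of W previous characters (the match search is quadratic here for clarity; gzip uses hashing on triplets, as noted below). On the windowed example string a few slides above it emits exactly the five triples shown there:

#include <algorithm>
#include <cstdio>
#include <cstring>
using namespace std;

// LZ77 parse: at each step emit (d, len, c) and advance by len+1.
void lz77(const char* T, int W) {
    int n = strlen(T), cur = 0;
    while (cur < n) {
        int best_len = 0, best_d = 0;
        for (int s = max(0, cur - W); s < cur; ++s) {          // candidate copy sources
            int len = 0;
            while (cur + len < n - 1 && T[s + len] == T[cur + len]) ++len;  // may overlap the cursor
            if (len > best_len) { best_len = len; best_d = cur - s; }
        }
        printf("(%d,%d,%c) ", best_d, best_len, T[cur + best_len]);
        cur += best_len + 1;
    }
    printf("\n");
}

int main() { lz77("aacaacabcabaaac", 6); }   // (0,0,a) (1,1,c) (3,4,b) (3,3,a) (1,2,c)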

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]
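The "elegant but inefficient" construction below, sketched as code: sort the suffixes with std::sort and read L off the suffix array, which is exactly the Θ(n^2 log n)-in-the-worst-case approach criticized on the next slide:

#include <algorithm>
#include <cstdio>
#include <numeric>
#include <string>
#include <vector>
using namespace std;

int main() {
    string T = "mississippi#";                 // '#' is the end-marker, smaller than any letter
    int n = (int)T.size();
    vector<int> SA(n);
    iota(SA.begin(), SA.end(), 0);             // 0-based suffix starting positions
    sort(SA.begin(), SA.end(), [&](int a, int b) {
        return T.compare(a, n - a, T, b, n - b) < 0;    // lexicographic order of suffixes
    });
    string L;
    for (int i = 0; i < n; ++i)
        L += T[(SA[i] + n - 1) % n];           // character preceding the i-th smallest suffix
    printf("L = %s\n", L.c_str());             // prints ipssm#pissii, the BWT column above
}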

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
  • Θ(n^2 log n) time in the worst case
  • Θ(n log n) cache misses or I/O faults

Many algorithms are now available...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion pages are available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: the probability that a node has x links is ∝ 1/x^a, a ≈ 2.1

The In-degree distribution

Altavista crawl (1999) and WebBase crawl (2001): the indegree follows a
power-law distribution

     Pr[ in-degree(u) = k ]  ∝  1 / k^a ,     a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: the probability that a node has x links is ∝ 1/x^a, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 million pages, 150 million links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77-scheme provides an efficient, optimal solution




fknown acts as the “previously encoded text”: compress the concatenation fknown·fnew, emitting output only from fnew onwards

zdelta is one of the best implementations
              Emacs size     Emacs time
  uncompr        27Mb           ---
  gzip            8Mb          35 secs
  zdelta        1.5Mb          42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: weighted graph over the files plus a dummy node 0; edge weights are
zdelta sizes (e.g. 620, 2000, 220, 123, 20, …), and the min branching picks the
cheapest reference for each file.]

              space        time
  uncompr      30Mb         ---
  tgz          20%        linear
  THIS          8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
              space        time
  uncompr     260Mb         ---
  tgz          12%        2 mins
  THIS          8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
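A sketch of a rolling block checksum in the spirit of rsync's weak hash: a is the sum of the bytes in the block, b the sum of the prefix sums, and both are updated in O(1) when the block slides by one byte. Constants, bit widths and the strong (MD5) hash of the real implementation are omitted, so this is only illustrative:

#include <cstdio>

struct Rolling {
    unsigned a = 0, b = 0, len = 0;
    void init(const unsigned char* s, unsigned n) {      // checksum of s[0..n-1]
        a = b = 0; len = n;
        for (unsigned i = 0; i < n; ++i) { a += s[i]; b += (n - i) * s[i]; }
    }
    void roll(unsigned char out, unsigned char in) {     // slide the window by one byte
        a = a - out + in;
        b = b - len * out + a;
    }
    unsigned hash() const { return ((b & 0xffff) << 16) | (a & 0xffff); }
};

int main() {
    const unsigned char* s = (const unsigned char*)"abcdefgh";
    Rolling r, fresh;
    r.init(s, 4);                      // block "abcd"
    r.roll(s[0], s[4]);                // now covers "bcde"
    fresh.init(s + 1, 4);              // same block, recomputed from scratch
    printf("%u %u\n", r.hash(), fresh.hash());   // the two values coincide
}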

Rsync: some experiments

            gcc size    emacs size
  total       27288       27326
  gzip         7563        8577
  zdelta        227        1431
  rsync         964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync
The server sends the hashes (unlike rsync, where the client sends them); the
client checks them.
The server deploys the common fref to compress the new ftar (rsync just
compresses ftar on its own).

A multi-round protocol
  k blocks of n/k elements,  log(n/k) levels.
  If the distance is k, then on each level ≤ k hashes do not find a match in
  the other file.
  The communication complexity is O(k lg n lg(n/k)) bits.

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree                         T# = mississippi#
                                             1 2 3 4 5 6 7 8 9 10 11 12

[Figure: the suffix tree of T#; edges are labeled with substrings
(#, i, s, si, ssi, p, pi#, ppi#, mississippi#, …) and the 12 leaves store the
starting positions of the corresponding suffixes.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
5

Θ(N^2) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
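A sketch of the indirect binary search, with SA hard-coded to the (1-based) suffix array of T = mississippi# shown in these slides and P = si; two lower-bound style searches delimit the contiguous range of suffixes prefixed by P:

#include <cstdio>
#include <cstring>

int main() {
    const char* T = "mississippi#";
    int SA[12] = {12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3};
    const char* P = "si";
    int p = strlen(P), N = 12;

    int lo = 0, hi = N;                                   // first suffix whose prefix >= P
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (strncmp(T + SA[mid] - 1, P, p) < 0) lo = mid + 1; else hi = mid;
    }
    int first = lo;
    hi = N;                                               // first suffix whose prefix > P
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (strncmp(T + SA[mid] - 1, P, p) <= 0) lo = mid + 1; else hi = mid;
    }
    for (int i = first; i < lo; ++i)                      // O(p log2 N + occ) overall
        printf("occurrence at position %d\n", SA[i]);     // positions 7 and 4
}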

Locating the occurrences
T = mississippi#,  P = si,  occ = 2

[Figure: the occurrences of P form a contiguous range of SA (here the suffixes
sippi# and sissippi#, at text positions 7 and 4); the range is delimited by two
binary searches.]

Suffix Array search
• O(p + log2 N + occ) time

Suffix Trays: O(p + log2 |S| + occ)   [Cole et al., ‘06]
String B-tree   [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays   [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
  • It is the min of the subarray Lcp[h,k-1], where SA[h]=i and SA[k]=j (h<k).
• Does there exist a repeated substring of length ≥ L ?
  • Search for an entry Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  • Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.


Slide 101

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword lengths L[s], the average length is defined as

La(C) = ∑_{s ∈ S} p(s) · L[s]

We say that a prefix code C is optimal if for all prefix codes C’, La(C) ≤ La(C’)

A property of optimal codes
Theorem (Kraft–McMillan). For any optimal uniquely decodable code, there exists a prefix code with the same codeword lengths, and thus the same optimal average length. And vice versa…

Theorem (golden rule). If C is an optimal prefix code for a source with probabilities {p1, …, pn},

then pi < pj ⇒ L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any uniquely decodable code C, we have

H(S) ≤ La(C)

Theorem (upper bound, Shannon). For any probability distribution, there exists a prefix code C such that

La(C) ≤ H(S) + 1

(The Shannon code takes ⌈log_2 1/p(s)⌉ bits per symbol.)

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
Merge the two least-probable subtrees repeatedly:
a(.1) + b(.2) → (.3);   (.3) + c(.2) → (.5);   (.5) + d(.5) → (1)

a = 000, b = 001, c = 01, d = 1
There are 2^(n−1) “equivalent” Huffman trees (flip the 0/1 labels at any internal node)

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
Encoding example: abc… → 000 001 01 … = 00000101…
Decoding example: 101001… → 1 = d, 01 = c, 001 = b ⇒ dcb…
[Figure: the Huffman tree of the running example]

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store, for every level L of the tree:

firstcode[L]  (the value of the first, i.e. smallest, codeword of length L; the deepest one is 00…0)
Symbol[L,i], for each i in level L

This takes ≤ h² + |S| log |S| bits, where h is the tree height.

Canonical Huffman: Encoding
[Figure: codeword levels 1…5 of a canonical tree]

Canonical Huffman: Decoding
firstcode[1]=2, firstcode[2]=1, firstcode[3]=1, firstcode[4]=2, firstcode[5]=0

T = …00010…

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
− log_2 (.999) ≈ .00144

If we were to send 1000 such symbols we
might hope to use 1000 · .00144 ≈ 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:

Model takes |S|^k · (k · log |S|) + h² bits

It is H0(S_L) ≤ L · Hk(S) + O(k · log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
[Figure: word-based Huffman vs. tagged Huffman for T = “bzip or not bzip”. The tree has fan-out 128, so each codeword is a sequence of 7-bit chunks packed into bytes; the first bit of the first byte of each codeword is tagged, making codewords byte-aligned and self-synchronizing. C(T) is the resulting compressed text.]

CGrep and other ideas...
P = bzip, encoded as the tagged codeword “1a 0b”. GREP is run directly on the compressed text C(T), T = “bzip or not bzip”: a candidate match is accepted (yes) only if it starts at a tagged, byte-aligned position, and rejected (no) otherwise.

Speed ≈ Compression ratio

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary = { bzip, not, or, space }

P = bzip = 1a 0b

Search for the tagged codeword of P directly in C(S), S = “bzip or not bzip”: candidate matches at non-aligned positions are discarded (no), the byte-aligned tagged ones are reported (yes).

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
[Figure: the pattern P aligned against the text T = A B C A B D A B …]
 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which arithmetic and bit operations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with small probability.
A randomized algorithm that never errs, whose running time is efficient with high probability.

We will consider a binary alphabet (i.e., T ∈ {0,1}^n).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H(s) = ∑_{i=1}^{m} 2^{m−i} · s[i]

P = 0101
H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5
s = s’ if and only if H(s) = H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H(T_r) = 2 · H(T_{r−1}) − 2^m · T[r−1] + T[r+m−1]

T = 10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H(T1) = H(1011) = 11
H(T2) = H(0110) = 2·11 − 2⁴·1 + 0 = 22 − 16 + 0 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally (Horner’s rule, reducing mod 7 at each step):
1·2 + 0 = 2 (mod 7)
2·2 + 1 = 5 (mod 7)
5·2 + 1 = 4 (mod 7)
4·2 + 1 = 2 (mod 7)
2·2 + 1 = 5 (mod 7) = Hq(P)

We can still compute Hq(Tr) from Hq(Tr−1):
2^m (mod q) = 2 · (2^{m−1} (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary = { bzip, not, or, space }

P = bzip = 1a 0b

Run an exact string-matching scan over the compressed text C(S), S = “bzip or not bzip”: the two byte-aligned, tagged candidates are accepted (yes), the others rejected (no).

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for (m = 3, n = 10)

        c a l i f o r n i a
   f    0 0 0 0 1 0 0 0 0 0
   o    0 0 0 0 0 1 0 0 0 0
   r    0 0 0 0 0 0 1 0 0 0

M(3,7) = 1 because P = for occurs ending at position 7 of T.
How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size (e.g., 32 or 64 bits). We’ll assume m ≤ w. NOTICE: any column of M then fits in a memory word.

How to construct M






We want to exploit bit-parallelism to compute the j-th column of M from the (j−1)-th one.
We define the m-length binary vector U(x) for each character x in the alphabet: U(x) is set to 1 for the positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M(j) = BitShift( M(j−1) ) & U(T[j])


For i > 1, entry M(i,j) = 1 iff
(1) the first i−1 characters of P match the i−1 characters of T ending at character j−1   ⇔ M(i−1, j−1) = 1
(2) P[i] = T[j]   ⇔ the i-th bit of U(T[j]) = 1

BitShift moves bit M(i−1, j−1) into the i-th position;
ANDing with the i-th bit of U(T[j]) establishes whether both conditions hold.

Examples (P = abaac, T = xabxabaaca; U(x) = (0,0,0,0,0)ᵀ)

j=1: M(1) = BitShift(M(0)) & U(T[1]) = (1,0,0,0,0)ᵀ & U(x) = (0,0,0,0,0)ᵀ
j=2: M(2) = BitShift(M(1)) & U(T[2]) = (1,0,0,0,0)ᵀ & U(a) = (1,0,0,0,0)ᵀ
j=3: M(3) = BitShift(M(2)) & U(T[3]) = (1,1,0,0,0)ᵀ & U(b) = (0,1,0,0,0)ᵀ
…
j=9: M(9) = BitShift(M(8)) & U(T[9]) = (1,1,0,0,1)ᵀ & U(c) = (0,0,0,0,1)ᵀ

The full matrix M (rows i = 1..5, columns j = 1..10):
  1: 0 1 0 0 1 0 1 1 0 1
  2: 0 0 1 0 0 1 0 0 0 0
  3: 0 0 0 0 0 0 1 0 0 0
  4: 0 0 0 0 0 0 0 1 0 0
  5: 0 0 0 0 0 0 0 0 1 0

The 1 in row m = 5 at column j = 9 reports the occurrence of P ending at position 9 of T.

Shift-And method: Complexity








If m ≤ w, any column and any vector U() fit in a memory word ⇒ any step requires O(1) time.
If m > w, any column and any vector U() can be divided into ⌈m/w⌉ memory words ⇒ any step requires O(m/w) time.
Overall O(n(1 + m/w) + m) time.
Thus it is very fast when the pattern length is close to the word size — very often the case in practice. Recall that w = 64 bits in modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
U(a) = (1,0,1,1,0)ᵀ   U(b) = (1,1,0,0,0)ᵀ   U(c) = (0,0,0,0,1)ᵀ

What about ‘?’, ‘[^…]’ (not)?

Problem 1: Another solution
Dictionary = { bzip, not, or, space }

P = bzip = 1a 0b

Scan the compressed text C(S), S = “bzip or not bzip”, matching P’s codeword (e.g. with the Shift-And machinery just described): the two tagged, byte-aligned occurrences are reported (yes), the spurious ones discarded (no).

Speed ≈ Compression ratio

Problem 2
Dictionary = { bzip, not, or, space }

Given a pattern P, find all the occurrences in S of all terms containing P as a substring.

P = o
not = 1g 0g 0a
or  = 1g 0a 0b

S = “bzip or not bzip”, scanned as C(S).

Speed ≈ Compression ratio? No! Why?
A scan of C(S) is needed for each term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: the patterns P1 and P2 located at their occurrences inside the text T]
 Naïve solution
 Use an (optimal) exact-matching algorithm to search for each pattern of P
 Complexity: O(n·l + m) time — not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

S is the concatenation of the patterns in P.
R is a bitmap of length m: R[i] = 1 iff S[i] is the first symbol of a pattern.

Use a variant of the Shift-And method searching for S:

For any symbol c, U’(c) = U(c) AND R
 U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
For any step j,
 compute M(j)
 then M(j) = M(j) OR U’(T[j]). Why?
 This sets to 1 the first bit of each pattern that starts with T[j]
 Check if there are occurrences ending in j. How? (look at the bits of the last symbol of each pattern)

Problem 3
Dictionary = { bzip, not, or, space }

Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing at most k mismatches.

P = bot, k = 2
S = “bzip or not bzip”, scanned as C(S).

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal: given k, find all the occurrences of P in T with up to k mismatches.
We define the matrix M^l to be an m by n binary matrix such that:

M^l(i,j) = 1 iff there are no more than l mismatches between the first i characters of P and the i characters of T ending at character j.

What is M^0?
How does M^k solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

BitShift( M^l(j−1) ) & U(T[j])

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

BitShift( M^{l−1}(j−1) )

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1 (P = abaad, T = xabxabaaca)

       j: 1 2 3 4 5 6 7 8 9 10
M1 =  1:  1 1 1 1 1 1 1 1 1 1
      2:  0 0 1 0 0 1 0 1 1 0
      3:  0 0 0 1 0 0 1 0 0 1
      4:  0 0 0 0 1 0 0 1 0 0
      5:  0 0 0 0 0 0 0 0 1 0

M0 =  1:  0 1 0 0 1 0 1 1 0 1
      2:  0 0 1 0 0 1 0 0 0 0
      3:  0 0 0 0 0 0 1 0 0 0
      4:  0 0 0 0 0 0 0 1 0 0
      5:  0 0 0 0 0 0 0 0 0 0

M1(5,9) = 1: P occurs ending at position 9 of T with at most one mismatch (abaac vs. abaad).

How much do we pay?





The running time is O(k·n·(1 + m/w)).
Again, the method is practically efficient for small m.
Only O(k) columns of M are needed at any given time; hence the space used by the algorithm is O(k) memory words (for m ≤ w).

Problem 3: Solution
Dictionary = { bzip, not, or, space }

Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing k mismatches.

P = bot, k = 2
not = 1g 0g 0a matches P within 2 mismatches, so its occurrences in S = “bzip or not bzip” are reported.

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol into p
Deletion: delete a symbol from p
Substitution: change a symbol of p into a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding:  (Length − 1) zeroes, followed by x in binary

x > 0 and Length = ⌊log2 x⌋ + 1
e.g., 9 is represented as <000, 1001>.

The γ-code for x takes 2⌊log2 x⌋ + 1 bits

(i.e. a factor of 2 from optimal)



Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of γ-coded integers, reconstruct the original sequence:

0001000001100110000011101100111

→ 8, 6, 3, 59, 7  (a small encoder/decoder sketch follows)

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).

Recall that |γ(i)| ≤ 2·log i + 1
How good is this approach w.r.t. Huffman?
Compression ratio ≤ 2·H0(s) + 1
Key fact:
1 ≥ ∑_{i=1,…,x} pi ≥ x·px  ⇒  x ≤ 1/px

How good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1
The cost of the encoding is (recall i ≤ 1/pi):

∑_{i=1,…,|S|} pi · |γ(i)|  ≤  ∑_{i=1,…,|S|} pi · [ 2·log(1/pi) + 1 ]  =  2·H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + …

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s·c with 2 bytes, s·c² with 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, assuming c = 256 − s.


Brute-force approach



Binary search:


On real distributions, there seems to be a unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC is quite interesting…

Search is 6% faster than on byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

It exploits temporal locality, and it is dynamic
X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n² log n) bits, MTF = O(n log n) + n² bits

Not much worse than Huffman…
…but it may be far better (a small encoder sketch follows)

MTF: how good is it ?
Encode the output integers via γ-coding: |γ(i)| ≤ 2·log i + 1
Put S at the front and consider the cost of encoding (p^x_i = position of the i-th occurrence of symbol x, n_x = #occurrences of x):

O(|S| log |S|) + ∑_{x=1,…,|S|} ∑_{i=2,…,n_x} |γ( p^x_i − p^x_{i−1} )|

By Jensen’s inequality:

≤ O(|S| log |S|) + ∑_{x=1,…,|S|} n_x · [ 2·log(N/n_x) + 1 ]
= O(|S| log |S|) + N · [ 2·H0(X) + 1 ]

La[mtf] ≤ 2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




It exploits spatial locality, and it is a dynamic code.
X = 1^n 2^n 3^n … n^n  ⇒

There is a memory

Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.  p(a) = .2, p(b) = .5, p(c) = .3  ⇒  f(a) = .0, f(b) = .2, f(c) = .7
a → [0, .2),  b → [.2, .7),  c → [.7, 1.0)

f(i) = ∑_{j < i} p(j)

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac

start:    [0, 1)
after b:  [.2, .7)    (l = 0 + 1·.2,  s = 1·.5)
after a:  [.2, .3)    (l = .2 + .5·.0,  s = .5·.2)
after c:  [.27, .3)   (l = .2 + .1·.7,  s = .1·.3)

The final sequence interval is [.27, .3)

Arithmetic Coding
To code a sequence of symbols c with probabilities p[c] use the following:

l_0 = 0,  l_i = l_{i−1} + s_{i−1} · f[c_i]
s_0 = 1,  s_i = s_{i−1} · p[c_i]

f[c] is the cumulative probability up to symbol c (not included).
Final interval size is

s_n = ∏_{i=1,…,n} p[c_i]

The interval for a message sequence will be called the sequence interval.

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:

.49 ∈ [.2, .7)    ⇒ b
.49 ∈ [.3, .55)   ⇒ b
.49 ∈ [.475, .55) ⇒ c

The message is bbc.

Representing a real number
Binary fractional representation:

.75 = .11      1/3 = .0101…      11/16 = .1011

Algorithm
1. x = 2·x
2. If x < 1 output 0
3. else x = x − 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
e.g. [0,.33) → .01     [.33,.66) → .1     [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions.

  number   min      max      interval
  .11      .110     .111     [.75, 1.0)
  .101     .1010    .1011    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval is contained in the sequence interval (a dyadic number).

Sequence interval [.61, .79);  code interval of .101 is [.625, .75) ⊆ [.61, .79)

Can use L + s/2 truncated to 1 + ⌈log (1/s)⌉ bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

1 + ⌈log (1/s)⌉ = 1 + ⌈log ∏_{i=1,…,n} (1/p_i)⌉
              ≤ 2 + ∑_{i=1,…,n} log (1/p_i)
              = 2 + ∑_{k=1,…,|S|} n·p_k · log (1/p_k)
              = 2 + n·H0   bits

nH0 + 0.02·n bits in practice, because of rounding.

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in the range [0, R) where R = 2^k
Use rounding to generate integer intervals
Whenever the sequence interval falls into the top, bottom or middle half, expand the interval by a factor of 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
  Output 1 followed by m 0s; m = 0; the message interval is expanded by 2

If u < R/2 then (bottom half)
  Output 0 followed by m 1s; m = 0; the message interval is expanded by 2

If l ≥ R/4 and u < 3R/4 then (middle half)
  Increment m; the message interval is expanded by 2

In all other cases, just continue…

You find this at

Arithmetic ToolBox
As a state machine: given the current interval (L, s) and a symbol c with distribution (p1, …, p|S|), the ATB returns the new interval (L’, s’).

Therefore, even the distribution can change over time.

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
At each step the ATB is fed with the pair (s, p[s | context]), where s is either a plain character c or the escape symbol esc; it maps the current interval (L, s) to the new one (L’, s’).

Encoder and Decoder must know the protocol for selecting the same conditional probability distribution (PPM-variant).

PPM: Example Contexts (String = ACCBACCACBA B — the next symbol to encode is B; k = 2)

Context Empty:  A = 4, B = 2, C = 5, $ = 3

Context A:  C = 3, $ = 1
Context B:  A = 2, $ = 1
Context C:  A = 1, B = 2, C = 2, $ = 3

Context AC: B = 1, C = 2, $ = 2
Context BA: C = 1, $ = 1
Context CA: C = 1, $ = 1
Context CB: A = 2, $ = 1
Context CC: A = 1, B = 1, $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:

Output the triple <d, len, c>, where
d = distance of the copied string w.r.t. the current position
len = length of the longest match
c = next char in the text beyond the longest match

Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
We are given the text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

(1994)

F                  L
#  mississipp      i
i  #mississip      p
i  ppi#missis      s
i  ssippi#mis      s
i  ssissippi#      m
m  ississippi      #
p  i#mississi      p
p  pi#mississ      i
s  ippi#missi      s
s  issippi#mi      s
s  sippi#miss      i
s  sissippi#m      i

L is the BWT of T.

A famous example

Much
longer...

A useful tool: the L → F mapping

How do we map L’s chars onto F’s chars?
… we need to distinguish equal chars in F…

Take two equal chars of L and rotate their rows rightward by one position: the two rows keep the same relative order. Hence the k-th occurrence of a char in L corresponds to the k-th occurrence of that char in F.

The BWT is invertible
[The sorted-rotation matrix of mississippi#, with its F and L columns, as above.]

Two key properties:
1. The LF-array maps L’s chars to F’s chars.
2. L[i] precedes F[i] in T.
Reconstruct T backward:

T = … i p p i #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}

How to compute the BWT ?

SA (suffix array)   BWT matrix row        L
12                  #mississipp           i
11                  i#mississip           p
8                   ippi#missis           s
5                   issippi#mis           s
2                   ississippi#           m
1                   mississippi           #
10                  pi#mississi           p
9                   ppi#mississ           i
7                   sippi#missi           s
4                   sissippi#mi           s
6                   ssippi#miss           i
3                   ssissippi#m           i

We said that L[i] precedes F[i] in T.
e.g. L[3] = s = T[7]
Given SA and T, we have L[i] = T[SA[i] − 1] (where T[0] wraps around to T[n] = #).

How to construct SA from T ?

SA
12  #
11  i#
8   ippi#
5   issippi#
2   ississippi#
1   mississippi#
10  pi#
9   ppi#
7   sippi#
4   sissippi#
6   ssippi#
3   ssissippi#

Input: T = mississippi#

Elegant but inefficient — obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion pages are available (Google 7/08)
5–40KB per page ⇒ hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)

Set of nodes such that from any node one can reach any other node via an undirected path.

Strongly connected components (SCC)

Set of nodes such that from any node one can reach any other node via a directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by humankind



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution

Altavista crawl, 1999 — WebBase crawl, 2001: the indegree follows a power-law distribution

Pr[ in-degree(u) = k ]  ∝  1 / k^α,   α ≈ 2.1
A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 million pages, 150 million links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y’s copy-list tells whether the corresponding successor of y is also a successor of the reference x;
the reference index is chosen in [0, W] so as to give the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77-scheme provides an efficient, optimal solution

fknown is the “previously encoded text”: compress the concatenation fknown·fnew starting from fnew

zdelta is one of the best implementations

           Emacs size   Emacs time
uncompr      27Mb          ---
gzip          8Mb         35 secs
zdelta      1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: the weighted graph GF with the dummy node 0; edge weights are the zdelta sizes (e.g. 20, 123, 220, 620, 2000), and the min branching picks the cheapest reference for each file.]

           space    time
uncompr    30Mb      ---
tgz        20%      linear
THIS        8%      quadratic
Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
           space    time
uncompr    260Mb     ---
tgz         12%     2 mins
THIS         8%     16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

          gcc size   emacs size
total       27288      27326
gzip         7563       8577
zdelta        227       1431
rsync         964       4452

Compressed size in KB (slightly outdated numbers)

Factor of 3–5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), the client checks them.
Server deploys the common fref to compress the new ftar (rsync compresses just the latter).

A multi-round protocol

k blocks of n/k elements each — log(n/k) levels

If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
The communication complexity is O(k · lg n · lg(n/k)) bits.

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

[Figure: P laid over T at position i, i.e. over the suffix T[i,N]]

Occurrences of P in T = all suffixes of T having P as a prefix

P = si occurs in T = mississippi at positions 4 and 7

SUF(T) = sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree

[Figure: the suffix tree of T# = mississippi#. Edges are labeled with substrings (#, i, ssi, ppi#, si, p, mississippi#, …) and the 12 leaves store the starting positions 1…12 of the suffixes.]

T# = mississippi#

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

Storing SUF(T) explicitly would take Θ(N²) space; store only the suffix pointers:

SA      SUF(T)
12      #
11      i#
8       ippi#
5       issippi#
2       ississippi#
1       mississippi#
10      pi#
9       ppi#
7       sippi#
4       sissippi#
6       ssippi#
3       ssissippi#

T = mississippi#,   P = si

Suffix Array:
• SA: Θ(N log2 N) bits
• Text T: N chars
⇒ In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison

SA = 12 11 8 5 2 1 10 9 7 4 6 3
T = mississippi#,  P = si

P is larger than the middle suffix ⇒ recurse on the right half.
2 accesses per step.

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison

SA = 12 11 8 5 2 1 10 9 7 4 6 3
T = mississippi#,  P = si

P is smaller than the middle suffix ⇒ recurse on the left half.

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char comparisons
⇒ overall, O(p log2 N) time
Improvable to O(p + log2 N) [Manber-Myers, ’90] and to O(p + log2 |S|) [Cole et al, ’06]

Locating the occurrences

T = mississippi#,  P = si,  SA = 12 11 8 5 2 1 10 9 7 4 6 3

Binary-search for P followed by the smallest symbol (si#) and for P followed by the largest one (si$): the two positions delimit the portion of SA whose suffixes are prefixed by P — here the entries 7 and 4, so occ = 2.

Suffix Array search:  O(p + log2 N + occ) time
Suffix Trays:  O(p + log2 |S| + occ)   [Cole et al., ’06]
String B-tree   [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays   [Ciriani et al., ’02]

Text mining
Lcp[1, N−1] = longest common prefix between suffixes adjacent in SA

Lcp = 0 0 1 4 0 0 1 0 2 1 3
SA  = 12 11 8 5 2 1 10 9 7 4 6 3
T = mississippi#

e.g. lcp(issippi#, ississippi#) = 4
• How long is the common prefix between T[i,…] and T[j,…]?
  • It is the minimum of the subarray Lcp[h, k−1] s.t. SA[h] = i and SA[k] = j.
• Does there exist a repeated substring of length ≥ L?
  • Search for some Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times?
  • Search for a window Lcp[i, i+C−2] whose entries are all ≥ L.


Slide 102

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any uniquely decodable code C, we have
    H(S) ≤ La(C)
Theorem (upper bound, Shannon). For any probability distribution, there exists a prefix code C such that
    La(C) ≤ H(S) + 1
(The Shannon code takes ⌈log_2 1/p(s)⌉ bits per symbol s.)

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5
Merge a(.1) + b(.2) → (.3);  merge (.3) + c(.2) → (.5);  merge (.5) + d(.5) → (1).
Resulting code: a = 000, b = 001, c = 01, d = 1
There are 2^(n−1) “equivalent” Huffman trees.

What about ties (and thus, tree depth)?
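A compact Python sketch of the greedy heap-based construction (names are mine); on the running example it produces the same codeword lengths as the code above:

    import heapq

    def huffman_code(probs):
        # probs: dict symbol -> probability; returns dict symbol -> codeword
        heap = [(p, [s]) for s, p in probs.items()]   # (weight, symbols in the subtree)
        code = {s: "" for s in probs}
        heapq.heapify(heap)
        while len(heap) > 1:
            p0, s0 = heapq.heappop(heap)   # two least-probable subtrees
            p1, s1 = heapq.heappop(heap)
            for s in s0: code[s] = "0" + code[s]   # prepend the branch bit
            for s in s1: code[s] = "1" + code[s]
            heapq.heappush(heap, (p0 + p1, s0 + s1))
        return code

    print(huffman_code({"a": .1, "b": .2, "c": .2, "d": .5}))
    # e.g. {'a': '010', 'b': '011', 'c': '00', 'd': '1'}:
    # same codeword lengths as the slide's a=000, b=001, c=01, d=1 (labels may differ)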

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
Example (with the code a=000, b=001, c=01, d=1):
  Encoding:  abc…  →  000 001 01 …  =  00000101…
  Decoding:  101001…  →  d c b  (walk the tree bit by bit, output the leaf, return to the root)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store, for every level L of the tree:
  firstcode[L] : the (numeric value of the) first codeword of length L, e.g. 00…0
  Symbol[L,i]  : the i-th symbol lying at level L

This takes ≤ h² + |S| log |S| bits, where h is the height of the tree.

Canonical Huffman: Encoding
[Figure: the canonical codeword tree, organized by levels 1…5]

Canonical Huffman: Decoding
  firstcode[1] = 2
  firstcode[2] = 1
  firstcode[3] = 1
  firstcode[4] = 2
  firstcode[5] = 0

  T = …00010…
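A hedged sketch of the standard canonical-Huffman decoding loop (as in the Managing Gigabytes book, which is among the course references); it assumes firstcode[] follows that convention and that symbol[l][j] gives the j-th symbol of level l:

    def canonical_decode(bits, firstcode, symbol):
        # bits: iterator over 0/1 ints; firstcode[l], symbol[l][j] indexed by level l >= 1
        it = iter(bits)
        v, l = next(it), 1
        while v < firstcode[l]:          # not yet a valid codeword of length l
            v = 2 * v + next(it)         # extend the code by one more bit
            l += 1
        return symbol[l][v - firstcode[l]]

    # With the slide's values firstcode[1..5] = 2,1,1,2,0, the bits 0,0,0,1,0
    # descend down to level 5 and return symbol[5][2].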

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
    − log_2(.999) ≈ .00144

If we were to send 1000 such symbols we might hope to use 1000 · .00144 ≈ 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
  ≈ 1 extra bit per macro-symbol = 1/k extra bits per symbol
  but a larger model has to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:
  The model takes |S|^k · (k · log |S|) + h² bits (where h might be |S|)
  It is H_0(S^L) ≤ L · H_k(S) + O(k · log |S|), for each k ≤ L

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the Huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged

[Figure: a word-based “tagged Huffman” tree over the words of T = “bzip or not bzip” (plus the space symbol); each codeword is a sequence of bytes carrying 7 bits of Huffman code plus 1 tag bit marking codeword boundaries, and C(T) is the resulting byte-aligned encoding.]

CGrep and other ideas...
P = bzip  →  codeword 1a 0b

[Figure: compressed pattern matching (GREP on T vs. CGrep on C(T)): the codeword of P is searched directly, byte-aligned, inside C(T) for T = “bzip or not bzip”; the tag bits prevent false matches across codeword boundaries, and each aligned codeword yields a yes/no answer.]
Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary = { bzip, not, or } (plus the space symbol)
P = bzip  →  codeword 1a 0b
S = “bzip or not bzip”

[Figure: the codeword of P is compared, byte-aligned, against C(S); each aligned codeword gives a yes/no answer.]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
[Example: T = A B C A B D A B …,  P = A B]

 Naïve solution
   For any position i of T, check if T[i, i+m−1] = P[1, m]
   Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
   Knuth-Morris-Pratt
   Boyer-Moore
   Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which arithmetic and bit operations replace comparisons.
We will survey two examples of such methods:
  The Random Fingerprint method, due to Karp and Rabin
  The Shift-And method, due to Baeza-Yates and Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in order to obtain:
  An efficient randomized algorithm that makes an error with small probability.
  A randomized algorithm that never errs, whose running time is efficient with high probability.

We will consider a binary alphabet (i.e., T ∈ {0,1}^n).

Arithmetic replaces Comparisons


Strings are also numbers: H : strings → numbers.
Let s be a string of length m:

    H(s) = sum_{i=1}^{m} 2^{m−i} · s[i]

Example: P = 0101  →  H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5
s = s’ if and only if H(s) = H(s’).

Definition: let T_r denote the m-length substring of T starting at position r (i.e., T_r = T[r, r+m−1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(T_r) from H(T_{r−1}):

    H(T_r) = 2·H(T_{r−1}) − 2^m·T[r−1] + T[r+m−1]

Example: T = 10110101, m = 4
  T_1 = 1011, T_2 = 0110
  H(T_1) = H(1011) = 11
  H(T_2) = 2·11 − 2⁴·1 + 0 = 22 − 16 = 6 = H(0110)

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally (Horner’s rule + mod):
  1·2 (mod 7) + 0 = 2
  2·2 (mod 7) + 1 = 5
  5·2 (mod 7) + 1 = 11 (mod 7) = 4
  4·2 (mod 7) + 1 = 9 (mod 7) = 2
  2·2 (mod 7) + 1 = 5
  5 (mod 7) = 5 = Hq(P)

We can still compute Hq(T_r) from Hq(T_{r−1}), since
  2^m (mod q) = 2·(2^{m−1} (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm
 Choose a positive integer I
 Pick a random prime q ≤ I, and compute P’s fingerprint Hq(P).
 For each position r in T, compute Hq(T_r) and test whether it equals Hq(P). If the numbers are equal, either
   declare a probable match (randomized algorithm), or
   check and declare a definite match (deterministic algorithm).

 Running time: excluding verification, O(n+m).
 The randomized algorithm is correct w.h.p.
 The deterministic algorithm has expected running time O(n+m).

Proof on the board
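A minimal Python sketch of the fingerprint scan (names are mine; here q is fixed for illustration, whereas the algorithm above picks q as a random prime, and the verification step makes the output exact):

    def karp_rabin(T, P, q=2**31 - 1):
        n, m = len(T), len(P)
        if m > n:
            return []
        high = pow(2, m - 1, q)               # 2^(m-1) mod q, weight of the char to drop
        hp = ht = 0
        for i in range(m):                    # Horner: Hq(P) and Hq(T_1)
            hp = (2 * hp + ord(P[i])) % q
            ht = (2 * ht + ord(T[i])) % q
        hits = []
        for r in range(n - m + 1):
            if ht == hp and T[r:r+m] == P:    # verification -> no false matches reported
                hits.append(r)
            if r + m < n:                     # roll the window by one position
                ht = (2 * (ht - ord(T[r]) * high) + ord(T[r+m])) % q
        return hits

    print(karp_rabin("10110101", "0101"))     # [4]  (position 5 in 1-based indexing)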

Problem 1: Solution
Dictionary = { bzip, not, or } (plus the space symbol);  P = bzip → codeword 1a 0b;  S = “bzip or not bzip”

[Figure: scan C(S) byte by byte; thanks to the tag bits, the codeword of P is compared only at codeword boundaries, yielding a yes/no answer at each aligned position.]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

          c  a  l  i  f  o  r  n  i  a
          1  2  3  4  5  6  7  8  9  10
    f     0  0  0  0  1  0  0  0  0  0
    o     0  0  0  0  0  1  0  0  0  0
    r     0  0  0  0  0  0  1  0  0  0

M(3,7) = 1: P = for occurs in T ending at position 7.

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit bit-parallelism to compute the j-th column of M from the (j−1)-th one.
We define the m-length binary vector U(x) for each character x of the alphabet: U(x) has a 1 exactly in the positions where x appears in P.

Example: P = abaac
  U(a) = (1,0,1,1,0)ᵀ     U(b) = (0,1,0,0,0)ᵀ     U(c) = (0,0,0,0,1)ᵀ

How to construct M



Initialize column 0 of M to all zeros.
For j > 0, the j-th column is obtained as

    M(j) = BitShift(M(j−1)) & U(T[j])


For i > 1, entry M(i,j) = 1 iff
  (1) the first i−1 characters of P match the i−1 characters of T ending at position j−1   ⇔  M(i−1, j−1) = 1
  (2) P[i] = T[j]   ⇔  the i-th bit of U(T[j]) = 1

BitShift moves bit M(i−1, j−1) into the i-th position; ANDing it with the i-th bit of U(T[j]) establishes whether both conditions hold.
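A small Python sketch of the Shift-And scan, using integers as bit vectors (bit i−1 of the word represents row i; names are mine):

    def shift_and(T, P):
        # Build U(x): bit i set iff P[i] == x (bits numbered from the pattern start).
        m = len(P)
        U = {}
        for i, ch in enumerate(P):
            U[ch] = U.get(ch, 0) | (1 << i)
        M = 0
        occ = []
        for j, ch in enumerate(T):
            # BitShift: shift the previous column by one and set the first bit to 1.
            M = ((M << 1) | 1) & U.get(ch, 0)
            if M & (1 << (m - 1)):            # row m set -> P ends at position j
                occ.append(j - m + 2)         # 1-based starting position
        return occ

    print(shift_and("xabxabaaca", "abaac"))   # [5]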

An example, j = 1   (T = xabxabaaca, P = abaac)
  M(1) = BitShift(M(0)) & U(T[1]) = (1,0,0,0,0)ᵀ & U(x) = (1,0,0,0,0)ᵀ & (0,0,0,0,0)ᵀ = (0,0,0,0,0)ᵀ

An example, j = 2
  M(2) = BitShift(M(1)) & U(T[2]) = (1,0,0,0,0)ᵀ & U(a) = (1,0,0,0,0)ᵀ & (1,0,1,1,0)ᵀ = (1,0,0,0,0)ᵀ

An example, j = 3
  M(3) = BitShift(M(2)) & U(T[3]) = (1,1,0,0,0)ᵀ & U(b) = (1,1,0,0,0)ᵀ & (0,1,0,0,0)ᵀ = (0,1,0,0,0)ᵀ

An example, j = 9
  M(8) = (1,0,0,1,0)ᵀ, hence
  M(9) = BitShift(M(8)) & U(T[9]) = (1,1,0,0,1)ᵀ & U(c) = (1,1,0,0,1)ᵀ & (0,0,0,0,1)ᵀ = (0,0,0,0,1)ᵀ
  The m-th bit of M(9) is 1: P occurs in T ending at position 9, i.e. starting at position 5.

Shift-And method: Complexity








If m ≤ w, any column and any vector U() fit in a memory word → any step requires O(1) time.
If m > w, any column and any vector U() can be divided into ⌈m/w⌉ memory words → any step requires O(m/w) time.
Overall O(n·(1 + m/w) + m) time.
Thus it is very fast when the pattern length is close to the word size, which is very often the case in practice. Recall that w = 64 bits in modern architectures.

Some simple extensions


We want to allow the pattern to contain special symbols, like the character class [a-f].

Example: P = [a-b]baac
  U(a) = (1,0,1,1,0)ᵀ     U(b) = (1,1,0,0,0)ᵀ     U(c) = (0,0,0,0,1)ᵀ

What about ‘?’, ‘[^…]’ (negation)?

Problem 1: Another solution
Dictionary = { bzip, not, or } (plus the space symbol);  P = bzip → codeword 1a 0b;  S = “bzip or not bzip”

[Figure: the codeword of P is searched byte-aligned in C(S); each candidate alignment yields a yes/no answer.]

Speed ≈ Compression ratio

Problem 2
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring.

Dictionary = { bzip, not, or } (plus the space symbol);  P = o;  S = “bzip or not bzip”
  not = 1g 0g 0a
  or  = 1g 0a 0b

[Figure: both terms containing “o” are located inside C(S).]

Speed ≈ Compression ratio?  No! Why?
A scan of C(S) is needed for each term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

 S is the concatenation of the patterns in P.
 R is a bitmap of length m, with R[i] = 1 iff S[i] is the first symbol of a pattern.

Use a variant of the Shift-And method searching for S:
 For any symbol c, U’(c) = U(c) AND R, so that U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern.
 For any step j:
   compute M(j);
   then OR it with U’(T[j]). Why? This sets to 1 the first bit of each pattern that starts with T[j].
   Check if there are occurrences ending at j. How?

Problem 3
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring, allowing at most k mismatches.

Dictionary = { bzip, not, or } (plus the space symbol);  S = “bzip or not bzip”;  P = bot, k = 2

[Figure: the terms of C(S) matching “bot” with at most 2 mismatches are located.]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal: given k, find all the occurrences of P in T with up to k mismatches.
We define the matrix M^l to be an m × n binary matrix such that:
  M^l(i,j) = 1 iff there are no more than l mismatches between the first i characters of P and the i characters of T ending at position j.

What is M^0?
How does M^k solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

[Alignment sketch: P[1..i−1] aligned with T ending at j−1, with at most l mismatches, and P[i] = T[j]]

    BitShift( M^l(j−1) ) & U(T[j])

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

[Alignment sketch: P[1..i−1] aligned with T ending at j−1, with at most l−1 mismatches; the pair (P[i], T[j]) may mismatch]

    BitShift( M^{l−1}(j−1) )

Computing Ml


 We compute M^l for all l = 0, …, k: for each j compute M^0(j), M^1(j), …, M^k(j).
 For all l, initialize M^l(0) to the zero vector.
 To compute M^l(j), we observe that there is a match iff one of the two cases above holds:

    M^l(j)  =  [ BitShift(M^l(j−1)) & U(T[j]) ]  OR  BitShift(M^{l−1}(j−1))
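A hedged Python sketch of the k-mismatch recurrence with integer bit vectors (names are mine):

    def agrep_mismatch(T, P, k):
        m = len(P)
        U = {}
        for i, ch in enumerate(P):
            U[ch] = U.get(ch, 0) | (1 << i)
        M = [0] * (k + 1)                      # M[l] = current column of M^l
        occ = []
        for j, ch in enumerate(T):
            prev = M[:]                        # columns at position j-1
            for l in range(k + 1):
                col = ((prev[l] << 1) | 1) & U.get(ch, 0)        # case 1: chars match
                if l > 0:
                    col |= (prev[l - 1] << 1) | 1                # case 2: spend one mismatch
                M[l] = col
            if M[k] & (1 << (m - 1)):
                occ.append(j - m + 2)          # 1-based start of an occurrence with <= k mismatches
        return occ

    print(agrep_mismatch("aatatccacaa", "atcgaa", 2))
    # [4] : the occurrence with 2 mismatches starting at position 4, as in the Agrep example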

Example M^1 (and M^0)
T = xabxabaaca,  P = abaad  (m = 5, n = 10)

The 5×10 matrices M^0 (exact matches) and M^1 (at most 1 mismatch) are filled column by column with the recurrence above. In M^1 the first row is all 1s (a single character always “matches” within 1 mismatch), and M^1(5,9) = 1: indeed T[5..9] = abaac matches P = abaad with exactly one mismatch.

How much do we pay?





The running time is O(k·n·(1 + m/w)).
Again, the method is practically efficient for small m.
Only O(k) columns of M are needed at any given time; hence the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring, allowing k mismatches.

Dictionary = { bzip, not, or } (plus the space symbol);  S = “bzip or not bzip”;  P = bot, k = 2

[Figure: the k-mismatch scan over C(S) marks the matching term.]

not = 1g 0g 0a

Agrep: more sophisticated operations


 The Shift-And method can solve other operations as well.

 The edit distance between two strings p and s is d(p,s) = the minimum number of operations needed to transform p into s via three operations:
   Insertion: insert a symbol in p
   Deletion: delete a symbol from p
   Substitution: change a symbol of p into a different one
 Example: d(ananas, banane) = 3

 Search by regular expressions
   Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding

  γ(x) =  0 0 … 0  followed by  x in binary
          (Length−1 zeros)

  x > 0 and Length = ⌊log₂ x⌋ + 1
  e.g., 9 is represented as <000, 1001>.

  The γ-code for x takes 2⌊log₂ x⌋ + 1 bits (i.e. a factor of 2 from optimal).
  It is optimal for Pr(x) = 1/(2x²), and i.i.d. integers.
  It is a prefix-free encoding…

Exercise: given the following sequence of γ-coded integers, reconstruct the original sequence:
  0001000001100110000011101100111   →   8, 6, 3, 59, 7
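A small Python sketch of γ-encoding/decoding (names are mine), which reproduces the exercise above:

    def gamma_encode(x):
        b = bin(x)[2:]                 # binary representation of x > 0
        return "0" * (len(b) - 1) + b  # (Length-1) zeros, then x in binary

    def gamma_decode(bits):
        out, i = [], 0
        while i < len(bits):
            z = 0
            while bits[i] == "0":      # count the leading zeros = Length-1
                z += 1; i += 1
            out.append(int(bits[i:i+z+1], 2))
            i += z + 1
        return out

    print(gamma_decode("0001000001100110000011101100111"))      # [8, 6, 3, 59, 7]
    print("".join(gamma_encode(x) for x in [8, 6, 3, 59, 7]))    # the same bit string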

Analysis
Sort the p_i in decreasing order, and encode symbol s_i via the variable-length code γ(i).

Recall that |γ(i)| ≤ 2·log i + 1.
How good is this approach w.r.t. Huffman?  Compression ratio ≤ 2·H_0(S) + 1.

Key fact:   1 ≥ sum_{i=1,…,x} p_i ≥ x·p_x   ⇒   x ≤ 1/p_x

How good is it?
The cost of the encoding is (recall i ≤ 1/p_i):

    sum_{i=1,…,|S|} p_i · |γ(i)|  ≤  sum_{i=1,…,|S|} p_i · [ 2·log(1/p_i) + 1 ]  =  2·H_0(X) + 1

Not much worse than Huffman, and improvable to H_0(X) + 2 + …

A better encoding

 Byte-aligned and tagged Huffman
   128-ary Huffman tree
   The first bit of the first byte is tagged
   Configurations on 7 bits: just those of Huffman

 End-tagged dense code
   The rank r is mapped to the r-th binary sequence on 7·k bits
   The first bit of the last byte is tagged

Surprising changes:
 It is a prefix code
 Better compression: it uses all 7-bit configurations

(s,c)-dense codes
The distribution of words is skewed: 1/i^q, where 1 < q < 2.

A new concept: Continuers vs Stoppers
 Previously we used: s = c = 128
 The main idea is: s + c = 256 (we are playing with 8 bits)
   Thus s items are encoded with 1 byte
   s·c items with 2 bytes, s·c² with 3 bytes, …

An example
 5000 distinct words
 ETDC encodes 128 + 128² = 16512 words on at most 2 bytes
 The (230,26)-dense code encodes 230 + 230·26 = 6210 words on at most 2 bytes, hence more words fit on 1 byte, and thus it wins if the distribution is skewed…

Optimal (s,c)-dense codes
Find the optimal s, assuming c = 256 − s.

 Brute-force approach
 Binary search: on real distributions, there seems to be one unique minimum
   (K_s = max codeword length; F_s^k = cumulative probability of the symbols whose codeword length is ≤ k)

Experiments: (s,c)-DC is quite interesting…
  Search is 6% faster than byte-aligned Huffword.

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move-to-Front Coding
Transforms a char sequence into an integer sequence, which can then be var-length coded.

 Start with the list of symbols L = [a,b,c,d,…]
 For each input symbol s:
   1) output the position of s in L
   2) move s to the front of L

There is a memory. Properties:
 It exploits temporal locality, and it is dynamic.
 X = 1ⁿ 2ⁿ 3ⁿ … nⁿ  →  Huff = O(n² log n) bits, MTF = O(n log n) + n² bits

Not much worse than Huffman… but it may be far better.
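A tiny Python sketch of the MTF transform (list-based, O(|S|) per symbol; names are mine):

    def mtf_encode(text, alphabet):
        L = list(alphabet)                 # e.g. ['a','b','c',...]
        out = []
        for s in text:
            pos = L.index(s)               # 1) output the position of s in L
            out.append(pos)
            L.insert(0, L.pop(pos))        # 2) move s to the front of L
        return out

    def mtf_decode(codes, alphabet):
        L = list(alphabet)
        out = []
        for pos in codes:
            s = L[pos]
            out.append(s)
            L.insert(0, L.pop(pos))
        return "".join(out)

    codes = mtf_encode("aaabbbba", "abc")
    print(codes)                           # [0, 0, 0, 1, 0, 0, 0, 1]
    print(mtf_decode(codes, "abc"))        # 'aaabbbba'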

MTF: how good is it?
Encode the output integers via γ-coding:  |γ(i)| ≤ 2·log i + 1.
Put S in front of the sequence and consider the cost of encoding (p_x^i = position of the i-th occurrence of symbol x, n_x = #occurrences of x, N = total length):

    cost  ≤  O(|S| log |S|)  +  sum_{x=1,…,|S|}  sum_{i=2,…,n_x}  |γ( p_x^i − p_x^{i−1} )|

By Jensen’s inequality:

    ≤  O(|S| log |S|)  +  sum_{x=1,…,|S|} n_x · [ 2·log(N/n_x) + 1 ]
    =  O(|S| log |S|)  +  N · [ 2·H_0(X) + 1 ]

Hence   L_a[mtf]  ≤  2·H_0(X) + O(1)   bits per symbol.

MTF: higher compression
To achieve higher compression, we consider words (and separators) as the symbols to be encoded.
How to maintain the MTF-list efficiently:

 Search tree
   Leaves contain the symbols, ordered as in the MTF-list
   Nodes contain the size of their descending subtree

 Hash table
   key = a symbol
   data = a pointer to the corresponding tree leaf

Each tree operation takes O(log |S|) time.
Total is O(n log |S|), where n = #symbols to be compressed.

Run-Length Encoding (RLE)
If spatial locality is very high, then
  abbbaacccca  ⇒  (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings → just the run lengths and one starting bit.

There is a memory. Properties:
 It exploits spatial locality, and it is a dynamic code.
 X = 1ⁿ 2ⁿ 3ⁿ … nⁿ  →  Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval of [0,1): symbol i gets [f(i), f(i)+p(i)), where

    f(i) = sum_{j=1}^{i−1} p(j)

e.g. with p(a) = .2, p(b) = .5, p(c) = .3:
    f(a) = .0,  f(b) = .2,  f(c) = .7
    a → [0.0, 0.2),   b → [0.2, 0.7),   c → [0.7, 1.0)

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7)).

Arithmetic Coding: Encoding Example
Coding the message sequence: bac

  start            [0, 1)
  after b (= .5)   [0.2, 0.7)
  after a (= .2)   [0.2, 0.3)
  after c (= .3)   [0.27, 0.3)

The final sequence interval is [.27, .3).

Arithmetic Coding
To code a sequence of symbols c_1 c_2 … c_n with probabilities p[c], use:

    l_0 = 0,   s_0 = 1
    l_i = l_{i−1} + s_{i−1} · f[c_i]
    s_i = s_{i−1} · p[c_i]

f[c] is the cumulative probability up to symbol c (not included).
The final interval size is

    s_n = prod_{i=1}^{n} p[c_i]

The interval for a message sequence will be called the sequence interval.
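A short Python sketch of the interval computation (floating point, for illustration only; real coders use the integer version described later; names are mine, and the symbol order is alphabetical so that f matches the example above):

    def sequence_interval(msg, p):
        # p: dict symbol -> probability; returns the sequence interval [l, l+s)
        syms = sorted(p)
        f, acc = {}, 0.0
        for c in syms:                 # f[c] = cumulative prob. of the symbols before c
            f[c] = acc
            acc += p[c]
        l, s = 0.0, 1.0
        for c in msg:
            l = l + s * f[c]           # l_i = l_{i-1} + s_{i-1} * f[c_i]
            s = s * p[c]               # s_i = s_{i-1} * p[c_i]
        return l, l + s

    print(sequence_interval("bac", {"a": .2, "b": .5, "c": .3}))   # ~ (0.27, 0.30)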

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message has length 3:

  1st symbol: .49 ∈ [.2, .7)   = interval of b                      → b
  2nd symbol: within [.2,.7),  .49 ∈ [.3, .55)   = sub-interval of b → b
  3rd symbol: within [.3,.55), .49 ∈ [.475, .55) = sub-interval of c → c

The message is bbc.

Representing a real number
Binary fractional representations:
    .75 = .11       1/3 = .010101…       11/16 = .1011

Algorithm (to emit the binary expansion of x ∈ [0,1)):
  1.  x = 2·x
  2.  if x < 1 output 0
  3.  else x = x − 1; output 1

So how about just using the shortest binary fractional representation inside the sequence interval?
  e.g. [0,.33) → .01      [.33,.66) → .1      [.66,1) → .11

Representing a code interval
Binary fractional numbers can be viewed as intervals by considering all their completions:

  number   min      max      interval
  .11      .110     .111     [.75, 1.0)
  .101     .1010    .1011    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval is contained in the sequence interval (a dyadic number).

  Sequence interval:  [.61, .79)
  Code interval of .101:  [.625, .75)  ⊆  [.61, .79)

One can use l + s/2 truncated to 1 + ⌈log₂(1/s)⌉ bits.

Bound on the Arithmetic length: note that ⌈−log₂ s⌉ + 1 = ⌈log₂(2/s)⌉.

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

    1 + ⌈log₂(1/s)⌉  =  1 + ⌈log₂ prod_i (1/p_i)⌉
                     ≤  2 + sum_{i=1,…,n} log₂(1/p_i)
                     =  2 + sum_{k=1,…,|S|} n·p_k · log₂(1/p_k)
                     =  2 + n·H_0   bits

In practice ≈ n·H_0 + 0.02·n bits, because of rounding.

Integer Arithmetic Coding
The problem is that operations on arbitrary-precision real numbers are expensive.
Key ideas of the integer version:
 Keep integers in the range [0..R), where R = 2^k
 Use rounding to generate integer intervals
 Whenever the sequence interval falls into the top, bottom or middle half, expand the interval by a factor of 2

Integer Arithmetic is an approximation.

Integer Arithmetic (scaling)
 If l ≥ R/2 (top half): output 1 followed by m 0s; set m = 0; the interval is expanded by 2.
 If u < R/2 (bottom half): output 0 followed by m 1s; set m = 0; the interval is expanded by 2.
 If l ≥ R/4 and u < 3R/4 (middle half): increment m; the interval is expanded by 2.
 In all other cases, just continue…

You find this at

Arithmetic ToolBox
As a state machine: the ATB maintains the current interval (L,s); given the distribution (p_1,…,p_|S|) and the next symbol c, it moves to the sub-interval (L’,s’) of (L,s) assigned to c.

[Figure: (L,s) → ATB, on symbol c → (L’,s’)]

Therefore, even the distribution can change over time.

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.
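A small Python sketch (names are mine) of the statistics a PPM coder keeps: for every context of length ≤ k, the counts of the characters that followed it. It does not model the escape probabilities, which depend on the PPM variant:

    from collections import defaultdict

    def ppm_counts(text, k):
        # counts[context][char] = how many times `char` followed `context` (|context| <= k)
        counts = defaultdict(lambda: defaultdict(int))
        for i, ch in enumerate(text):
            for l in range(0, k + 1):      # contexts of length 0..k ending just before position i
                if i - l >= 0:
                    counts[text[i-l:i]][ch] += 1
        return counts

    c = ppm_counts("ACCBACCACBA", 2)
    print(dict(c["AC"]))    # {'C': 2, 'B': 1}  -- compare with the example table below
    print(dict(c[""]))      # {'A': 4, 'C': 5, 'B': 2}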

PPM + Arithmetic ToolBox

[Figure: the ATB is driven by p[s | context], where s is either a plain character c or the escape symbol esc; it maps the current interval (L,s) to (L’,s’).]

Encoder and Decoder must know the protocol for selecting the same conditional probability distribution (PPM-variant).

PPM: Example Contexts    (String = ACCBACCACBA B, k = 2; the counts refer to the first 11 characters)

  Context “” (empty):   A = 4   B = 2   C = 5   $ = 3

  Context A:    C = 3   $ = 1
  Context B:    A = 2   $ = 1
  Context C:    A = 1   B = 2   C = 2   $ = 3

  Context AC:   B = 1   C = 2   $ = 2
  Context BA:   C = 1   $ = 1
  Context CA:   C = 1   $ = 1
  Context CB:   A = 2   $ = 1
  Context CC:   A = 1   B = 1   $ = 2
You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings. The differences are:
  How the dictionary is stored
  How it is extended
  How it is indexed
  How elements are removed

No explicit frequency estimation.

LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) for n → ∞ !!

LZ77
  a a c a a c a b c a b a b a c
[Figure: the part already scanned is the dictionary (all substrings starting there); the cursor marks the current position; a possible output triple is <2,3,c>.]

Algorithm’s step:
  Output <d, len, c>, where
    d   = distance of the copied string w.r.t. the current position
    len = length of the longest match
    c   = next char in the text beyond the longest match
  Advance by len + 1

A buffer “window” has fixed length and moves.

Example: LZ77 with window (window size = 6)

  a a c a a c a b c a b a a a c    →  (0,0,a)
  a a c a a c a b c a b a a a c    →  (1,1,c)
  a a c a a c a b c a b a a a c    →  (3,4,b)
  a a c a a c a b c a b a a a c    →  (3,3,a)
  a a c a a c a b c a b a a a c    →  (1,2,c)

Each triple is (distance of the longest match within W, its length, next character).

LZ77 Decoding
The decoder keeps the same dictionary window as the encoder:
  it finds the referenced substring and inserts a copy of it.

What if len > d? (overlap with the text still to be decompressed)
  E.g. seen = abcd, next codeword is (2,9,e)
  Simply copy starting at the cursor:
    for (i = 0; i < len; i++)
        out[cursor+i] = out[cursor-d+i];
  Output is correct: abcdcdcdcdcdce
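A compact Python sketch (names are mine) of a naive, quadratic LZ77 greedy parser with a sliding window, plus the overlapping-copy decoder shown above; on the example text it reproduces the triples of the windowed example:

    def lz77_encode(text, window=6):
        out, i, n = [], 0, len(text)
        while i < n:
            best_d, best_len = 0, 0
            for start in range(max(0, i - window), i):     # copy sources inside the window
                l = 0
                while i + l + 1 < n and text[start + l] == text[i + l]:
                    l += 1                                  # overlapping copies are allowed
                if l > best_len:
                    best_d, best_len = i - start, l
            out.append((best_d, best_len, text[i + best_len]))
            i += best_len + 1
        return out

    def lz77_decode(triples):
        out = []
        for d, l, c in triples:
            for _ in range(l):
                out.append(out[len(out) - d])               # copy, possibly overlapping
            out.append(c)
        return "".join(out)

    enc = lz77_encode("aacaacabcabaaac")
    print(enc)                 # [(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')]
    print(lz77_decode(enc))    # 'aacaacabcabaaac'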

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later
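A hedged Python sketch of LZW (encoder plus a decoder handling the special SSc case); the numeric codes follow the slides' toy convention a=112, b=113, c=114, and the names are mine:

    def lzw_encode(text, first_codes={"a": 112, "b": 113, "c": 114}):
        D = dict(first_codes)            # dictionary: string -> code
        nxt = 256                        # first free code, as in the slides
        out, S = [], ""
        for ch in text:
            if S + ch in D:
                S += ch                  # extend the current match
            else:
                out.append(D[S])
                D[S + ch] = nxt          # add Sc to the dictionary
                nxt += 1
                S = ch
        out.append(D[S])
        return out

    def lzw_decode(codes, first_codes={"a": 112, "b": 113, "c": 114}):
        D = {v: k for k, v in first_codes.items()}
        nxt = 256
        prev = D[codes[0]]
        out = [prev]
        for code in codes[1:]:
            cur = D[code] if code in D else prev + prev[0]   # special SSc case
            out.append(cur)
            D[nxt] = prev + cur[0]       # the decoder is one step behind the encoder
            nxt += 1
            prev = cur
        return "".join(out)

    enc = lzw_encode("aabaacababacb")
    print(enc)                           # [112, 112, 113, 256, 114, 257, 261, 114, 113]
    print(lzw_decode(enc))               # 'aabaacababacb'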

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us be given the text T = mississippi#

All cyclic rotations of T:
  mississippi#   ississippi#m   ssissippi#mi   sissippi#mis
  issippi#miss   ssippi#missi   sippi#missis   ippi#mississ
  ppi#mississi   pi#mississip   i#mississipp   #mississippi

Sort the rows (1994); F = first column, L = last column:

  F                L
  #  mississipp  i
  i  #mississip  p
  i  ppi#missis  s
  i  ssippi#mis  s
  i  ssissippi#  m
  m  ississippi  #
  p  i#mississi  p
  p  pi#mississ  i
  s  ippi#missi  s
  s  issippi#mi  s
  s  sippi#miss  i
  s  sissippi#m  i

L = ipssm#pissii is the Burrows-Wheeler Transform of T.

A famous example

Much
longer...

A useful tool: the L → F mapping

[Figure: the sorted BWT matrix with its F and L columns, as above.]

How do we map L’s chars onto F’s chars?
… we need to distinguish equal chars in F…

Take two equal chars of L: rotating their rows rightward by one position shows that they keep the same relative order in F.

The BWT is invertible

[Figure: the sorted BWT matrix with its F and L columns, as above.]

Two key properties:
  1. The LF-array maps L’s chars to F’s chars
  2. L[i] precedes F[i] in T

Reconstruct T backward:   T = …  i p p i #

InvertBWT(L)
  Compute LF[0,n-1];
  r = 0; i = n;
  while (i > 0) {
      T[i] = L[r];
      r = LF[r]; i--;
  }
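A short Python sketch (names are mine) of the forward transform via sorting, O(n² log n) as in the naive construction, and of the LF-based inversion of the pseudocode above:

    def bwt(T):
        # T must end with a unique smallest terminator, e.g. '#'
        rots = sorted(T[i:] + T[:i] for i in range(len(T)))
        return "".join(row[-1] for row in rots)

    def ibwt(L):
        n = len(L)
        # LF[i] = row of F holding the same text character as L[i] (stable sort keeps ranks)
        order = sorted(range(n), key=lambda i: (L[i], i))
        LF = [0] * n
        for f_pos, l_pos in enumerate(order):
            LF[l_pos] = f_pos
        r, out = 0, []                    # row 0 is the rotation starting with the terminator
        for _ in range(n):
            out.append(L[r])              # L[r] precedes F[r] in T
            r = LF[r]
        rot = "".join(reversed(out))      # T rotated so that it starts with the terminator
        return rot[1:] + rot[0]

    L = bwt("mississippi#")
    print(L)            # ipssm#pissii
    print(ibwt(L))      # mississippi#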

How to compute the BWT?

  SA = 12 11 8 5 2 1 10 9 7 4 6 3          L = i p s s m # p i s s i i

We said that L[i] precedes F[i] in T; for example, L[3] = T[7].
Given SA and T, we have   L[i] = T[SA[i] − 1].

How to construct SA from T?

  SA     suffix
  12     #
  11     i#
   8     ippi#
   5     issippi#
   2     ississippi#
   1     mississippi#
  10     pi#
   9     ppi#
   7     sippi#
   4     sissippi#
   6     ssippi#
   3     ssissippi#

Input: T = mississippi#
Elegant but inefficient. Obvious inefficiencies:
  • Θ(n² log n) time in the worst case
  • Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising…
Key observation:
  L is locally homogeneous  →  L is highly compressible

Algorithm Bzip:
   Move-to-Front coding of L
   Run-Length coding
   Statistical coder
 Bzip vs. Gzip: 20% vs. 33% compression ratio, but Bzip is slower in (de)compression!

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics

 Size
   1 trillion pages are available (Google, 7/2008)
   5-40K per page => hundreds of terabytes
   Size grows every day!!

 Change
   8% new pages, 25% new links change weekly
   Life time of about 10 days

The Bow Tie

Some definitions

 Weakly connected components (WCC)
   Set of nodes such that from any node one can reach any other node via an undirected path.

 Strongly connected components (SCC)
   Set of nodes such that from any node one can reach any other node via a directed path.

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?

 The largest artifact ever conceived by humankind
 Exploit the structure of the Web for:
   Crawl strategies
   Search
   Spam detection
   Discovering communities on the web
   Classification/organization
 Predict the evolution of the Web
   Sociological understanding

Many other large graphs…

 Physical network graph
   V = routers, E = communication links

 The “cosine” graph (undirected, weighted)
   V = static web pages, E = semantic distance between pages

 Query-Log graph (bipartite, weighted)
   V = queries and URLs, E = (q,u) if u is a result for q and has been clicked by some user who issued q

 Social graph (undirected, unweighted)
   V = users, E = (x,y) if x knows y (facebook, address book, email, …)

Definition
Directed graph G = (V,E):
  V = URLs, E = (u,v) if u has a hyperlink to v
Isolated URLs are ignored (no IN & no OUT).

Three key properties:
 Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution (Altavista crawl 1999, WebBase crawl 2001):
the indegree follows a power-law distribution

    Pr[ in-degree(u) = k ]  ≈  1/k^α ,    α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E):
  V = URLs, E = (u,v) if u has a hyperlink to v
Isolated URLs are ignored (no IN, no OUT).

Three key properties:
 Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1
 Locality: usually most of the hyperlinks point to other URLs on the same host (about 80%).
 Similarity: pages close in lexicographic order tend to share many outgoing lists.

A Picture of the Web Graph: 21 millions of pages, 150 millions of links.

URL-sorting (e.g. Berkeley…, Stanford…)  →  URL compression + Delta encoding

The library WebGraph

 Uncompressed adjacency list
 Adjacency list with compressed gaps (exploits locality):
   Successor list S(x) = {s_1−x, s_2−s_1−1, …, s_k−s_{k−1}−1}
   For negative entries a special encoding is used.

 Copy-lists (exploit similarity); reference chains are possibly limited in length:
   Each bit of y’s copy-list informs whether the corresponding successor of the reference x is also a successor of y;
   the reference index is chosen in [0,W] so as to give the best compression.

 Copy-blocks = RLE(copy-list):
   The first copy-block is 0 if the copy-list starts with 0;
   the last block is omitted (we know the length…);
   the length is decremented by one for all blocks.

This is a Java and C++ lib (≈3 bits/edge).
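A toy Python sketch (names are mine) of the gap transformation applied to a successor list, the first step of the format described above:

    def gaps(successors, x):
        # successors: sorted adjacency list of node x
        out, prev = [], None
        for i, s in enumerate(successors):
            if i == 0:
                out.append(s - x)            # may be negative: a special encoding handles it
            else:
                out.append(s - prev - 1)     # consecutive successors -> small (even zero) gaps
            prev = s
        return out

    # an illustrative node 15 whose successors exhibit locality:
    print(gaps([13, 15, 16, 17, 18, 19, 23, 24, 203], 15))
    # [-2, 1, 0, 0, 0, 0, 3, 0, 178]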

Extra-nodes: Compressing Intervals
Adjacency list with copy-blocks: runs of consecutive values in the extra-nodes are coded as intervals.
 Intervals: use their left extreme and length
 Interval length: decremented by Lmin = 2
 Residuals: differences between consecutive residuals, or w.r.t. the source

Examples from the figure:
  0    = (15−15)·2        (positive)
  2    = (23−19)−2        (jump ≥ 2)
  600  = (316−16)·2
  3    = |13−15|·2−1      (negative)
  3018 = 3041−22−1

Algoritmi per IR

Compression of file collections

Background
[Figure: sender → (network link) → receiver; the receiver already holds some knowledge about the data.]

 Network links are getting faster and faster, but
 many clients are still connected by fairly slow links (mobile?)
 people wish to send more and more data

How can we make this transparent to the user?

Two standard techniques

 Caching: “avoid sending the same object again”
   done on the basis of whole objects
   only works if objects are completely unchanged
   How about objects that are slightly changed?

 Compression: “remove redundancy in transmitted data”  (overhead)
   avoid repeated substrings in the data
   can be extended to the history of past transmissions
   What if the sender has never seen the data at the receiver?

Types of Techniques

 Common knowledge between sender & receiver
   Unstructured file: delta compression
 “Partial” knowledge
   Unstructured files: file synchronization
   Record-based data: set reconciliation

Formalization

 Delta compression  [diff, zdelta, REBL, …]
   Compress file f deploying file f’
   Compress a group of files
   Speed-up web access by sending differences between the requested page and the ones available in cache

 File synchronization  [rsync, zsync]
   Client updates its old file f_old with f_new available on a server
   Mirroring, shared crawling, Content Distribution Networks

 Set reconciliation
   Client updates its structured old file f_old with f_new available on a server
   Update of contacts or appointments, intersecting inverted lists in a P2P search engine

Z-delta compression   (one-to-one)

Problem: we have two files f_known and f_new and the goal is to compute a file f_d of minimum size such that f_new can be derived from f_known and f_d.
 Assume that block moves and copies are allowed
 Find an optimal covering set of f_new based on f_known
 The LZ77-scheme provides an efficient, optimal solution:
   f_known is the “previously encoded text”; compress f_known·f_new starting from f_new
 zdelta is one of the best implementations

            Emacs size    Emacs time
  uncompr   27Mb          ---
  gzip      8Mb           35 secs
  zdelta    1.5Mb         42 secs

Efficient Web Access
Dual-proxy architecture: a pair of proxies located on the two sides of the slow link use a proprietary protocol to increase performance over that link.

[Figure: Client ↔ client-side proxy —(slow link, delta-encoding)— server-side proxy ↔ (fast link) ↔ web; both proxies keep the reference page.]

Use zdelta to reduce traffic:
 The old version is available at both proxies
 Restricted to pages already visited (30% hits), URL-prefix match
 Small cache

Cluster-based delta compression
Problem: we wish to compress a group of files F
 Useful on a dynamic collection of web pages, back-ups, …
 Apply pairwise zdelta: find for each f ∈ F a good reference

Reduction to the Min Branching problem on DAGs:
 Build a weighted graph G_F: nodes = files, edge weights = zdelta-sizes
 Insert a dummy node connected to all files, whose edge weights are the gzip-sizes
 Compute the min branching = directed spanning tree of minimum total cost covering G’s nodes

[Figure: example graph with a dummy node and edge weights such as 620, 2000, 220, 123, 20, …]

            space    time
  uncompr   30Mb     ---
  tgz       20%      linear
  THIS      8%       quadratic

Improvement?

What about many-to-one compression?   (group of files)

Problem: constructing G is very costly: n² edge calculations (zdelta executions).
We wish to exploit some pruning approach:
 Collection analysis: cluster the files that appear similar and are thus good candidates for zdelta-compression; build a sparse weighted graph G’_F containing only edges between those pairs of files.
 Assign weights: estimate appropriate edge weights for G’_F, thus saving zdelta executions. Nonetheless, still n² time.

            space    time
  uncompr   260Mb    ---
  tgz       12%      2 mins
  THIS      8%       16 mins

Algoritmi per IR

File Synchronization

File synch: the problem

[Figure: the Client sends a request; the Server holds f_new and sends an update; the Client holds f_old.]

 The client wants to update an out-dated file
 The server has the new file but does not know the old file
 Update without sending the entire f_new (using similarity)
 rsync: file synch tool, distributed with Linux

Delta compression is a sort of “local” synch, since the server has both copies of the files.

The rsync algorithm

[Figure: the Client sends the block hashes of f_old; the Server uses them to send an encoded version of f_new.]

 Simple, widely used, single roundtrip
 Optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
 Choice of the block size is problematic (default: max{700, √n} bytes)
 Not good in theory: the granularity of the changes may disrupt the use of blocks
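A simplified Python sketch of the core block-matching idea (names are mine; a real rsync uses a rolling weak hash plus a strong hash so that the per-position check is O(1), while this sketch recomputes a single strong hash at every position):

    import hashlib

    def block_hashes(f_old, B):
        # the client hashes its non-overlapping blocks and sends these to the server
        return {hashlib.md5(f_old[i:i+B]).hexdigest(): i
                for i in range(0, len(f_old), B)}

    def encode(f_new, hashes, B):
        # the server scans f_new: ("copy", offset_in_old) for known blocks, literals otherwise
        out, j = [], 0
        while j < len(f_new):
            h = hashlib.md5(f_new[j:j+B]).hexdigest()
            if h in hashes:
                out.append(("copy", hashes[h])); j += B
            else:
                out.append(("lit", f_new[j:j+1])); j += 1
        return out

    def decode(ops, f_old, B):
        return b"".join(f_old[o:o+B] if kind == "copy" else o for kind, o in ops)

    f_old = b"the quick brown fox jumps over the lazy dog"
    f_new = b"the quick brown cat jumps over the lazy dog!"
    B = 8
    ops = encode(f_new, block_hashes(f_old, B), B)
    print(decode(ops, f_old, B) == f_new)     # True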

Rsync: some experiments

            gcc size    emacs size
  total     27288       27326
  gzip      7563        8577
  zdelta    227         1431
  rsync     964         4452

Compressed size in KB (slightly outdated numbers).
Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends the hashes (unlike rsync, where the client does), and the client checks them.
The server deploys the common f_ref to compress the new f_tar (rsync compresses just f_tar).

A multi-round protocol
 k blocks of n/k elements, log(n/k) levels
 If the distance is k, then on each level at most k hashes do not find a match in the other file.
 The communication complexity is O(k · lg n · lg(n/k)) bits.

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
  iff  P is a prefix of the i-th suffix of T (i.e. T[i,N]).

Occurrences of P in T = all suffixes of T having P as a prefix.
Example: P = si, T = mississippi  →  occurrences at positions 4, 7.

SUF(T) = sorted set of suffixes of T.

Reduction: from substring search to prefix search (over SUF(T)).

The Suffix Tree

[Figure: the suffix tree of T# = mississippi#; edges are labeled with substrings of T#, and the 12 leaves carry the starting positions 1…12 of the corresponding suffixes.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

T = mississippi#        (storing SUF(T) explicitly would take Θ(N²) space)

  SA     SUF(T)
  12     #
  11     i#
   8     ippi#
   5     issippi#
   2     ississippi#
   1     mississippi#
  10     pi#
   9     ppi#
   7     sippi#
   4     sissippi#
   6     ssippi#
   3     ssissippi#

Suffix Array
 • SA: Θ(N log₂ N) bits
 • Text T: N chars
 → In practice, a total of 5N bytes
Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step.

Example: P = si on T = mississippi#
  compare P against the suffix in the middle of SA: if P is larger go right, if P is smaller go left.

Suffix Array search
 • O(log₂ N) binary-search steps
 • Each step takes O(p) char comparisons
 → overall, O(p log₂ N) time
   [improvable to O(p + log₂ N) with extra LCP information, Manber-Myers ’90; see also Cole et al. ’06]

Locating the occurrences
Example: P = si, T = mississippi#: the binary search isolates the contiguous SA range of the suffixes prefixed by si (entries 7 and 4), hence occ = 2 occurrences, at positions 4 and 7.
(The range can be delimited by searching the two extremes si# and si$, with # < every character of Σ < $.)

Suffix Array search:  O(p + log₂ N + occ) time

Suffix Trays: O(p + log₂ |Σ| + occ)   [Cole et al., ’06]
String B-tree   [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays   [Ciriani et al., ’02]

Text mining
Lcp[1, N−1] = longest common prefix between suffixes adjacent in SA.

T = mississippi#

  SA    suffix          Lcp with the next suffix
  12    #               0
  11    i#              1
   8    ippi#           1
   5    issippi#        4
   2    ississippi#     0
   1    mississippi#    0
  10    pi#             1
   9    ppi#            0
   7    sippi#          2
   4    sissippi#       1
   6    ssippi#         3
   3    ssissippi#      -

(e.g. the Lcp between issippi# and ississippi# is 4)

• How long is the common prefix between T[i,…] and T[j,…]?
  It is the min of the subarray Lcp[h, k−1] such that SA[h] = i and SA[k] = j.
• Does there exist a repeated substring of length ≥ L?
  Search for an entry Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times?
  Search for a window Lcp[i, i+C−2] whose entries are all ≥ L.


Slide 103

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
Example (with the tree of the running example, a=000, b=001, c=01, d=1):
  abc…  encodes to  00000101…
  decoding 101001…  yields  dcb…

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store, for every level L of the tree:
  firstcode[L]  (the numeric value of the first codeword of length L; on the deepest level it is 00…0)
  Symbol[L,i], for each i in level L

This is ≤ h^2 + |S| log |S| bits, where h is the tree height

Canonical Huffman
Encoding
[Figure: canonical codeword tree with levels 1–5]
Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...
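A minimal Python sketch (not from the slides) of the standard canonical-Huffman decoding loop, using the firstcode[] values above; the layout of Symbol[] (symbols of each level listed in canonical order) is an assumption:

    def canonical_decode(bits, firstcode, symbol):
        # bits: iterable of 0/1; firstcode[l]: numeric value of the first codeword
        # of length l; symbol[l]: symbols with codeword length l, in canonical order.
        out, it = [], iter(bits)
        try:
            while True:
                v, l = next(it), 1
                # Extend the current codeword until its value falls into the
                # range assigned to length l.
                while v < firstcode[l]:
                    v = 2 * v + next(it)
                    l += 1
                out.append(symbol[l][v - firstcode[l]])
        except StopIteration:
            return out

    # With the slides' table firstcode = {1: 2, 2: 1, 3: 1, 4: 2, 5: 0},
    # the stream 0 0 0 1 0 is consumed as a single 5-bit codeword (v ends at 2 on level 5).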

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
  -log_2(.999) ≈ .00144 bits

If we were to send 1000 such symbols we
might hope to use 1000 * .00144 ≈ 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
  ⇒ 1 extra bit per macro-symbol = 1/k extra bits per symbol
  ⇒ Larger model to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:
  The model takes |S|^k * (k * log |S|) + h^2 bits (where h might be |S|)
  It is H_0(S^L) ≤ L * H_k(S) + O(k * log |S|), for each k ≤ L

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
  Symbols of the Huffman tree are the words of T
  The Huffman tree has fan-out 128
  Codewords are byte-aligned and tagged: each byte carries 7 bits of the Huffman code plus 1 tag bit marking whether the byte starts a codeword

[Figure: word-based Huffman tree for T = "bzip or not bzip"; leaves are the words "bzip", "or", "not" and the space; each word is emitted as a sequence of byte-aligned, tagged bytes, and C(T) is the concatenation of these codewords]

CGrep and other ideas...
P = bzip, whose codeword is C(P) = 1a 0b (two tagged bytes)

GREP on the compressed text: compress P into C(P) and scan C(T) = C("bzip or not bzip") for the byte-aligned codeword C(P); the tag bits guarantee that a match cannot start in the middle of another codeword.

[Figure: scanning C(T) for C(P); yes/no marks indicate accepted/rejected alignments]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary = {bzip, not, or, space}
P = bzip, with codeword C(P) = 1a 0b
S = "bzip or not bzip"

[Figure: the word-based Huffman tree over the dictionary and the compressed text C(S); C(P) is matched directly against C(S), with yes/no marks at the tested alignments]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
[Figure: a text T and a pattern P over the letters A, B, C, D, with P aligned against T at a candidate position]
 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching

We show methods in which arithmetic and bit operations replace comparisons.
We will survey two examples of such methods:
  The Random Fingerprint method due to Karp and Rabin
  The Shift-And method due to Baeza-Yates and Gonnet

Rabin-Karp Fingerprint

We will use a class of functions from strings to integers in order to obtain:
  An efficient randomized algorithm that makes an error with small probability.
  A randomized algorithm that never errs, whose running time is efficient with high probability.

We will consider a binary alphabet (i.e., T ∈ {0,1}^n).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a binary string of length m

  H(s) = sum_{i=1}^{m} 2^{m-i} * s[i]

P = 0101
H(P) = 2^3*0 + 2^2*1 + 2^1*0 + 2^0*1 = 5
s = s' if and only if H(s) = H(s')

Definition:
let T_r denote the m-length substring of T starting at
position r (i.e., T_r = T[r, r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = scan T and compare H(T_r) with H(P):
there is an occurrence of P starting at position r of T if and
only if H(P) = H(T_r)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(T_r) from H(T_{r-1}):

  H(T_r) = 2 * H(T_{r-1}) - 2^m * T[r-1] + T[r+m-1]

T = 10110101
T_1 = 1011, T_2 = 0110

H(T_1) = H(1011) = 11
H(T_2) = H(0110) = 2*11 - 2^4*1 + 0 = 22 - 16 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
  Compute H(P) and H(T_1)
  Run over T: compute H(T_r) from H(T_{r-1}) in constant time,
  and make the comparison (i.e., H(P) = H(T_r)).

Total running time O(n+m)?
  NO! Why? The problem is that when m is large, it is unreasonable to
  assume that each arithmetic operation can be done in O(1) time:
  values of H() are m-bit numbers, in general too BIG to fit in a machine word.

IDEA! Let's use modular arithmetic:
for some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P = 101111, q = 7
H(P) = 47, Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally (Horner's rule, reducing mod 7 at each step):
  1
  1*2 + 0 = 2 (mod 7)
  2*2 + 1 = 5 (mod 7)
  5*2 + 1 = 11 = 4 (mod 7)
  4*2 + 1 = 9 = 2 (mod 7)
  2*2 + 1 = 5 (mod 7) = Hq(P)

We can still compute Hq(T_r) from Hq(T_{r-1}), since
  2^m (mod q) = 2*(2^{m-1} (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that
  q is small enough to keep computations efficient (i.e., Hq() values fit in a machine word)
  q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm

  Choose a positive integer I
  Pick a random prime q less than or equal to I, and compute P's fingerprint Hq(P).
  For each position r in T, compute Hq(T_r) and test whether it equals Hq(P).
  If the numbers are equal, either
    declare a probable match (randomized algorithm), or
    check the characters and declare a definite match (deterministic algorithm)

Running time: excluding verification, O(n+m).
The randomized algorithm is correct w.h.p.
The deterministic algorithm has expected running time O(n+m).

Proof on the board
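A minimal Python sketch (not the slides' code) of the fingerprint scan on a binary text, with a verification step so no false match is ever reported; for simplicity it uses a fixed prime instead of a randomly chosen one:

    def karp_rabin(T, P, q=2_147_483_647):
        """Candidate occurrences of binary pattern P in binary text T, verified."""
        n, m = len(T), len(P)
        pow_m = pow(2, m - 1, q)           # 2^(m-1) mod q, used to drop the leading bit
        hp = ht = 0
        for i in range(m):                 # fingerprints of P and of T_1
            hp = (2 * hp + int(P[i])) % q
            ht = (2 * ht + int(T[i])) % q
        occ = []
        for r in range(n - m + 1):
            if hp == ht and T[r:r + m] == P:      # verification (deterministic variant)
                occ.append(r)
            if r + m < n:                  # rolling update: Hq(T_{r+1}) from Hq(T_r)
                ht = (2 * (ht - int(T[r]) * pow_m) + int(T[r + m])) % q
        return occ

    print(karp_rabin("10110101", "0101"))  # [4] with 0-based indexing (position 5 in the slides)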

Problem 1: Solution
Dictionary = {bzip, not, or, space}, S = "bzip or not bzip", P = bzip with C(P) = 1a 0b

[Figure: scan C(S) byte by byte and compare it with C(P); the tag bits make the comparison sound, so the search runs directly on the compressed text]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m-by-n matrix such that:
  M(i,j) = 1 iff the first i characters of P exactly match the i characters of T ending at position j,
  i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1 … j]
Example: T = california and P = for

          c a l i f o r n i a
          1 2 3 4 5 6 7 8 9 10
  f       0 0 0 0 1 0 0 0 0 0
  fo      0 0 0 0 0 1 0 0 0 0
  for     0 0 0 0 0 0 1 0 0 0

(the 1 in the last row marks the occurrence of P ending at position 7)

How does M solve the exact match problem?

How to construct M


We want to exploit bit-parallelism to compute the j-th
column of M from the (j-1)-th one.
  Machines can perform bit and arithmetic operations between two words in constant time.
  Examples:
    And(A,B) is the bit-wise AND between A and B.
    BitShift(A) is the value derived by shifting A's bits down by one position and setting the first bit to 1,
    e.g. BitShift([0,1,1,0,1]) = [1,0,1,1,0].

Let w be the word size (e.g., 32 or 64 bits). We'll assume m ≤ w.
NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one.
We define the m-length binary vector U(x) for each character x of the alphabet:
U(x) is set to 1 in the positions where character x appears in P.
Example: P = abaac
  U(a) = [1,0,1,1,0]    U(b) = [0,1,0,0,0]    U(c) = [0,0,0,0,1]

How to construct M



Initialize column 0 of M to all zeros.
For j > 0, the j-th column is obtained as

  M(j) = BitShift( M(j-1) ) & U( T[j] )

For i > 1, entry M(i,j) = 1 iff
  (1) the first i-1 characters of P match the i-1 characters of T ending at position j-1, i.e. M(i-1, j-1) = 1
  (2) P[i] = T[j], i.e. the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1, j-1) into the i-th position;
ANDing it with the i-th bit of U(T[j]) establishes whether both conditions hold.
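A minimal Python sketch (not the slides' code) of the recurrence, with each column packed into one integer (bit i-1 of the integer represents row i):

    def shift_and(T, P):
        """Shift-And exact matching: M(j) = BitShift(M(j-1)) & U(T[j])."""
        m = len(P)
        U = {}                                   # U[c]: bitmask of positions of c in P
        for i, c in enumerate(P):
            U[c] = U.get(c, 0) | (1 << i)
        occ, col, last = [], 0, 1 << (m - 1)     # 'last' isolates row m
        for j, c in enumerate(T):
            # BitShift: shift the previous column by one and set the first bit to 1
            col = ((col << 1) | 1) & U.get(c, 0)
            if col & last:                       # M(m, j) = 1: an occurrence ends at j
                occ.append(j - m + 1)            # 0-based starting position
        return occ

    print(shift_and("xabxabaaca", "abaac"))      # [4]: the match ending at position 9 of the slides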

An example (T = xabxabaaca, P = abaac, U as above)

j=1, T[1]=x:  M(1) = BitShift(M(0)) & U(x) = [1,0,0,0,0] & [0,0,0,0,0] = [0,0,0,0,0]
j=2, T[2]=a:  M(2) = BitShift(M(1)) & U(a) = [1,0,0,0,0] & [1,0,1,1,0] = [1,0,0,0,0]
j=3, T[3]=b:  M(3) = BitShift(M(2)) & U(b) = [1,1,0,0,0] & [0,1,0,0,0] = [0,1,0,0,0]
...
j=9, T[9]=c:  M(9) = BitShift(M(8)) & U(c) = [1,1,0,0,1] & [0,0,0,0,1] = [0,0,0,0,1]

M(5,9) = 1, so P occurs in T ending at position 9 (i.e., starting at position 5).

Shift-And method: Complexity








If m ≤ w, any column and any vector U() fit in a memory word
  ⇒ each step requires O(1) time.
If m > w, any column and vector U() can be divided into ⌈m/w⌉ memory words
  ⇒ each step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when the pattern length is close to the word size —
very often the case in practice. Recall that w = 64 bits in modern architectures.

Some simple extensions


We want to allow the pattern to contain special symbols, like the character class [a-f].
Example: P = [a-b]baac
  U(a) = [1,0,1,1,0]    U(b) = [1,1,0,0,0]    U(c) = [0,0,0,0,1]
(the first position accepts both a and b, so it is set in both U(a) and U(b))

What about '?', '[^…]' (negation)?

Problem 1: Another solution
Dictionary = {bzip, not, or, space}, S = "bzip or not bzip", P = bzip with C(P) = 1a 0b

[Figure: searching C(P) = 1a 0b directly in C(S) = C("bzip or not bzip"), with yes/no marks at the tested alignments]

Speed ≈ Compression ratio

Problem 2
Dictionary = {bzip, not, or, space}

Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring.
P = o
S = "bzip or not bzip"

[Figure: the terms "not" and "or" both contain P = o; their codewords are
  not = 1g 0g 0a
  or  = 1g 0a 0b
and each of them is then searched in C(S)]

Speed ≈ Compression ratio? No! Why?
A scan of C(S) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1, P2, …, Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].

[Figure: two patterns P1 and P2 aligned at their occurrences in T]

 Naïve solution
  Use an (optimal) exact matching algorithm, searching for each pattern of P separately
  Complexity: O(nl + m) time — not good with many patterns

 Optimal solution due to Aho and Corasick
  Complexity: O(n + l + m) time

A simple extension of Shift-And

  S is the concatenation of the patterns in P
  R is a bitmap of length m, with R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of the Shift-And method, searching for S:
  For any symbol c, U'(c) = U(c) AND R
    ⇒ U'(c)[i] = 1 iff S[i] = c and it is the first symbol of a pattern
  For any step j,
    compute M(j)
    then OR it with U'(T[j]). Why? To set to 1 the first bit of each pattern that starts with T[j]
    check if there are occurrences ending at j. How?

Problem 3
Dictionary = {bzip, not, or, space}

Given a pattern P, find all the occurrences in S of all dictionary terms
containing P as a substring, allowing at most k mismatches.
P = bot, k = 2
S = "bzip or not bzip"

[Figure: the compressed text C(S) and the dictionary tree, as in the previous problems]

Agrep: Shift-And method with errors

We extend the Shift-And method to find inexact occurrences of a pattern in a text.
Example: T = aatatccacaa, P = atcgaa
P appears in T with 2 mismatches starting at position 4;
it also occurs with 4 mismatches starting at position 2.

  aatatccacaa          aatatccacaa
     atcgaa              atcgaa

Agrep

Our current goal: given k, find all the occurrences of P in T with up to k mismatches.
We define the matrix M^l to be an m-by-n binary matrix such that:

  M^l(i,j) = 1 iff there are no more than l mismatches between the
  first i characters of P and the i characters of T ending at position j.

What is M^0?
How does M^k solve the k-mismatch problem?

Computing Mk
  We compute M^l for all l = 0, …, k.
  For each j we compute M^0(j), M^1(j), …, M^k(j).
  For all l we initialize M^l(0) to the zero vector.
  In order to compute M^l(j), we observe that entry (i,j) is 1 iff one of the following two cases holds.

Computing Ml: case 1
  The first i-1 characters of P match a substring of T ending at j-1 with at most l mismatches,
  and the next pair of characters in P and T are equal:

    BitShift( M^l(j-1) ) & U( T[j] )

Computing Ml: case 2
  The first i-1 characters of P match a substring of T ending at j-1 with at most l-1 mismatches
  (the j-th character may then mismatch):

    BitShift( M^{l-1}(j-1) )

Computing Ml
  Putting the two cases together:

    M^l(j) = [ BitShift( M^l(j-1) ) & U( T[j] ) ]  OR  BitShift( M^{l-1}(j-1) )
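A minimal Python sketch (not the slides' code) of this recurrence, keeping one integer column per error level l = 0..k:

    def agrep_mismatches(T, P, k):
        """M^l(j) = [BitShift(M^l(j-1)) & U(T[j])] | BitShift(M^{l-1}(j-1))."""
        m = len(P)
        U = {}
        for i, c in enumerate(P):
            U[c] = U.get(c, 0) | (1 << i)
        last, cols, occ = 1 << (m - 1), [0] * (k + 1), []
        for j, c in enumerate(T):
            prev, hit = cols[:], None            # columns at step j-1
            for l in range(k + 1):
                exact = ((prev[l] << 1) | 1) & U.get(c, 0)
                cols[l] = exact if l == 0 else exact | ((prev[l - 1] << 1) | 1)
                if hit is None and cols[l] & last:
                    hit = l                      # smallest #mismatches of a match ending at j
            if hit is not None:
                occ.append((j - m + 1, hit))     # 0-based start, #mismatches
        return occ

    print(agrep_mismatches("aatatccacaa", "atcgaa", 4))
    # includes (1, 4) and (3, 2): 0-based starts 1 and 3 = positions 2 and 4 of the slides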

Example M1
T = xabxabaaca, P = abaad

        1 2 3 4 5 6 7 8 9 10
M1 =  1 1 1 1 1 1 1 1 1 1 1
      2 0 0 1 0 0 1 0 1 1 0
      3 0 0 0 1 0 0 1 0 0 1
      4 0 0 0 0 1 0 0 1 0 0
      5 0 0 0 0 0 0 0 0 1 0

M0 =  1 0 1 0 0 1 0 1 1 0 1
      2 0 0 1 0 0 1 0 0 0 0
      3 0 0 0 0 0 0 1 0 0 0
      4 0 0 0 0 0 0 0 1 0 0
      5 0 0 0 0 0 0 0 0 0 0

(the 1 in row 5 of M1 at column 9 marks an occurrence of P with at most one mismatch ending at position 9)

How much do we pay?

  The running time is O(k * n * (1 + m/w)).
  Again, the method is practically efficient for small m.
  Only O(k) columns of the M^l matrices are needed at any given time;
  hence the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary = {bzip, not, or, space}, S = "bzip or not bzip"
Given a pattern P, find all the occurrences in S of all dictionary terms
containing P as a substring, allowing k mismatches. P = bot, k = 2.

[Figure: "not" matches P = bot with at most 2 mismatches; its codeword not = 1g 0g 0a is then searched in C(S)]

Agrep: more sophisticated operations

The Shift-And method can solve other ops too.

The edit distance between two strings p and s is d(p,s) = the minimum
number of operations needed to transform p into s via three ops:
  Insertion: insert a symbol in p
  Deletion: delete a symbol from p
  Substitution: change a symbol in p into a different one
Example: d(ananas, banane) = 3

Search by regular expressions
  Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword lengths, and thus to build the tree…
This may be extremely time/space costly when you deal with Gbs of textual data.

A simple algorithm:
Sort the p_i in decreasing order, and encode symbol s_i via
the variable-length code for the integer i (its rank).
g-code for integer encoding
  g(x) = 0^{Length-1} followed by x in binary, where x > 0 and Length = ⌊log_2 x⌋ + 1
  e.g., 9 is represented as <000, 1001>.
  The g-code for x takes 2⌊log_2 x⌋ + 1 bits (i.e. a factor of 2 from optimal)
  Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers
  It is a prefix-free encoding…


Given the following sequence of g-coded integers, reconstruct the original sequence:

  0001000001100110000011101100111   →   8, 6, 3, 59, 7
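A minimal Python sketch (not from the slides) of g-encoding and g-decoding, which reproduces the exercise above:

    def gamma_encode(x):
        # g(x): (Length-1) zeros, then x in binary; defined for x >= 1
        b = bin(x)[2:]
        return "0" * (len(b) - 1) + b

    def gamma_decode(bits):
        out, i = [], 0
        while i < len(bits):
            z = 0
            while bits[i] == "0":              # count the leading zeros = Length - 1
                z += 1
                i += 1
            out.append(int(bits[i:i + z + 1], 2))   # the next z+1 bits are x in binary
            i += z + 1
        return out

    print(gamma_encode(9))                                  # 0001001
    print(gamma_decode("0001000001100110000011101100111"))  # [8, 6, 3, 59, 7]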

Analysis
Sort the p_i in decreasing order, and encode s_i via the variable-length code g(i).

Recall that |g(i)| ≤ 2 * log i + 1.
How good is this approach w.r.t. Huffman? Compression ratio ≤ 2 * H_0(S) + 1
Key fact:
  1 ≥ sum_{j=1,…,i} p_j ≥ i * p_i   ⇒   i ≤ 1/p_i

How good is it ?
Encode the integers via g-coding: |g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/p_i):

  sum_{i=1,…,|S|} p_i * |g(i)|  ≤  sum_{i=1,…,|S|} p_i * [2 * log(1/p_i) + 1]  =  2 * H_0(X) + 1

Not much worse than Huffman,
and improvable to H_0(X) + 2 + ....

A better encoding

Byte-aligned and tagged Huffman
  128-ary Huffman tree
  The first bit of the first byte of each codeword is tagged
  Configurations on 7 bits: just those produced by Huffman

End-tagged dense code
  The rank r is mapped to the r-th binary sequence on 7*k bits
  The first bit of the last byte of each codeword is tagged

Surprising changes
  It is a prefix code
  Better compression: it uses all 7-bit configurations

(s,c)-dense codes
The distribution of words is skewed: 1/i^q, where 1 < q < 2
A new concept: Continuers vs Stoppers
  Previously we used: s = c = 128
The main idea is:
  s + c = 256 (we are playing with 8 bits)
  Thus s items are encoded with 1 byte
  And s*c with 2 bytes, s*c^2 with 3 bytes, ...

An example
  5000 distinct words
  ETDC encodes 128 + 128^2 = 16512 words on ≤2 bytes
  A (230,26)-dense code encodes 230 + 230*26 = 6210 words on ≤2 bytes,
  hence more words on 1 byte; thus, if the distribution is skewed, it compresses better...

Optimal (s,c)-dense codes
Find the optimal s, assuming c = 256 - s.
  Brute-force approach
  Binary search: on real distributions, it seems there is one unique minimum
    K_s = max codeword length
    F_s^k = cumulative probability of the symbols whose |cw| ≤ k

Experiments: (s,c)-DC is quite interesting…
  Search is 6% faster than byte-aligned Huffword

Streaming compression
Still, you need to determine and sort all terms….
Can we do everything in one pass ?

  Move-to-Front (MTF):
    as a freq-sorting approximator
    as a caching strategy
    as a compressor
  Run-Length-Encoding (RLE):
    FAX compression

Move-to-Front Coding
Transforms a char sequence into an integer sequence, that can then be var-length coded.
  Start with the list of symbols L = [a,b,c,d,…]
  For each input symbol s:
    1) output the position of s in L
    2) move s to the front of L

There is a memory: MTF exploits temporal locality, and it is dynamic.
  X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n^2 log n) bits, MTF = O(n log n) + n^2 bits

Not much worse than Huffman
...but it may be far better
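A minimal Python sketch (not from the slides) of the MTF transform, using 1-based positions; the input string is the one used later in the RLE example:

    def mtf_encode(text, alphabet):
        """Move-to-Front: output the 1-based position of each symbol, then move it to the front."""
        L, out = list(alphabet), []
        for s in text:
            i = L.index(s)              # position of s in the current list (0-based)
            out.append(i + 1)
            L.insert(0, L.pop(i))       # move s to the front
        return out

    print(mtf_encode("abbbaacccca", "abcd"))
    # [1, 2, 1, 1, 2, 1, 3, 1, 1, 1, 2] — runs of equal symbols become runs of 1s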

MTF: how good is it ?
Encode the integers via g-coding: |g(i)| ≤ 2 * log i + 1
Put S in front and consider the cost of encoding:

  O(|S| log |S|) + sum_{x=1,…,|S|} sum_{i=2,…,n_x} |g( p_i^x - p_{i-1}^x )|

By Jensen's inequality:

  ≤ O(|S| log |S|) + sum_{x=1,…,|S|} n_x * [2 * log(N/n_x) + 1]
  = O(|S| log |S|) + N * [2 * H_0(X) + 1]

  L_a[mtf] ≤ 2 * H_0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and separators) as the symbols to be encoded.
How to keep the MTF-list efficiently:
  Search tree
    Leaves contain the symbols, ordered as in the MTF-list
    Nodes contain the size of their descending subtree
  Hash Table
    key is a symbol
    data is a pointer to the corresponding tree leaf

Each tree operation takes O(log |S|) time.
Total is O(n log |S|), where n = #symbols to be compressed.

Run Length Encoding (RLE)
If spatial locality is very high, then
  abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings ⇒ just the run lengths and one starting bit.
There is a memory: RLE exploits spatial locality, and it is a dynamic code.
  X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = Θ(n^2 log n)  >  Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using "fractional" parts of bits!!
Used in PPM, JPEG/MPEG (as an option), Bzip.
More time-costly than Huffman, but the integer implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive), with

  f(i) = sum_{j=1}^{i-1} p(j)

e.g. p(a) = .2, p(b) = .5, p(c) = .3 gives f(a) = .0, f(b) = .2, f(c) = .7:
  a = [.0, .2)    b = [.2, .7)    c = [.7, 1.0)

The interval for a particular symbol will be called
the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac (with p(a)=.2, p(b)=.5, p(c)=.3)

  start:     [0, 1)
  after b:   [.2, .7)       (width .5)
  after a:   [.2, .3)       (width .5 * .2 = .1)
  after c:   [.27, .3)      (width .1 * .3 = .03)

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c_1 c_2 … c_n with probabilities p[c], use the following:

  l_0 = 0      l_i = l_{i-1} + s_{i-1} * f[c_i]
  s_0 = 1      s_i = s_{i-1} * p[c_i]

f[c] is the cumulative probability up to symbol c (not included).
The final interval size is

  s_n = prod_{i=1}^{n} p[c_i]

The interval for a message sequence will be called the sequence interval.

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval
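A minimal floating-point Python sketch (not from the slides; real coders use the integer/scaling version discussed later) of the sequence-interval recurrences:

    def sequence_interval(msg, p):
        """Return [l, l+s) of msg using l_i = l_{i-1} + s_{i-1}*f[c_i], s_i = s_{i-1}*p[c_i]."""
        syms, f, acc = sorted(p), {}, 0.0
        for c in syms:
            f[c] = acc                    # cumulative probability up to c, excluded
            acc += p[c]
        l, s = 0.0, 1.0
        for c in msg:
            l, s = l + s * f[c], s * p[c]
        return l, l + s

    print(sequence_interval("bac", {'a': .2, 'b': .5, 'c': .3}))   # ≈ (0.27, 0.30)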

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3 (same probabilities as above):

  .49 ∈ [.2, .7)      → first symbol is b;   sub-intervals of [.2,.7):  a = [.2,.3),  b = [.3,.55),   c = [.55,.7)
  .49 ∈ [.3, .55)     → second symbol is b;  sub-intervals of [.3,.55): a = [.3,.35), b = [.35,.475), c = [.475,.55)
  .49 ∈ [.475, .55)   → third symbol is c

The message is bbc.

Representing a real number
Binary fractional representation:
  .75 = .11      1/3 = .0101… (periodic)      11/16 = .1011

Algorithm
  1. x = 2*x
  2. if x < 1 output 0
  3. else x = x - 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
  e.g. [0,.33) → .01     [.33,.66) → .1     [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions.

  code    min      max       interval
  .11     .110     .111…     [.75, 1.0)
  .101    .1010    .1011…    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval
is contained in the sequence interval (a dyadic number).

  Sequence interval: [.61, .79)      Code interval (.101): [.625, .75)

Can use L + s/2 truncated to 1 + ⌈log (1/s)⌉ bits.

Bound on Arithmetic length
Note that 1 + ⌈log (1/s)⌉ = ⌈log (2/s)⌉

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

  1 + ⌈log (1/s)⌉ = 1 + ⌈log prod_i (1/p_i)⌉
  ≤ 2 + sum_{i=1,…,n} log (1/p_i)
  = 2 + sum_{k=1,…,|S|} n * p_k * log (1/p_k)
  = 2 + n * H_0   bits

In practice it is nH_0 + 0.02 n bits, because of rounding.

Integer Arithmetic Coding
Problem: operations on arbitrary-precision real numbers are expensive.
Key ideas of the integer version:
  Keep integers in the range [0..R) where R = 2^k
  Use rounding to generate the integer interval
  Whenever the sequence interval falls into the top, bottom or middle half,
  expand the interval by a factor of 2

Integer Arithmetic is an approximation.

Integer Arithmetic (scaling)
  If l ≥ R/2 (top half): output 1 followed by m 0s; m = 0; the interval is expanded by 2
  If u < R/2 (bottom half): output 0 followed by m 1s; m = 0; the interval is expanded by 2
  If l ≥ R/4 and u < 3R/4 (middle half): increment m; the interval is expanded by 2
  In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine: the coder keeps the current interval (L, s); given the next symbol c
and the current distribution (p_1, …, p_|S|), the ATB maps (L, s) to the new interval (L', s').

Therefore, even the distribution can change over time.

K-th order models: PPM
Use the previous k characters as the context.
  Makes use of conditional probabilities — this is the changing distribution.
Base probabilities on counts:
  e.g. if "th" has been seen 12 times, followed by "e" 7 times, then
  the conditional probability p(e|th) = 7/12.
Need to keep k small so that the dictionary does not get too large (typically less than 8).

PPM: Partial Matching
Problem: what do we do if we have not seen the current context followed by the character before?
  We cannot code 0 probabilities!
The key idea of PPM is to reduce the context size if the previous match has not been seen:
  If the character has not been seen before with the current context of size 3,
  send an escape-msg and then try the context of size 2,
  then again an escape-msg and the context of size 1, ….
Keep statistics for each context size < k.
The escape is a special character with some probability;
different variants of PPM use different heuristics for this probability.

PPM + Arithmetic ToolBox
At each step the model feeds p[s | context] to the ATB, where s is either the next character c
or the escape symbol esc; the ATB then shrinks the current interval (L, s) to (L', s') accordingly.

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant).

PPM: Example Contexts  (k = 2)
String = ACCBACCACBA B

  Context: empty       Contexts of size 1            Contexts of size 2
    A = 4                A:  C = 3, $ = 1              AC:  B = 1, C = 2, $ = 2
    B = 2                B:  A = 2, $ = 1              BA:  C = 1, $ = 1
    C = 5                C:  A = 1, B = 2,             CA:  C = 1, $ = 1
    $ = 3                    C = 2, $ = 3              CB:  A = 2, $ = 1
                                                       CC:  A = 1, B = 1, $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a "dictionary" of recently-seen strings. The differences among the variants are:
  How the dictionary is stored
  How it is extended
  How it is indexed
  How elements are removed

No explicit frequency estimation!
LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) for n → ∞ !!

LZ77
  a a c a a c a b c a b a b a c
  Dictionary = all substrings starting in the prefix before the Cursor; output at this step: <2,3,c>

Algorithm's step:
  Output <d, len, c> where
    d = distance of the copied string w.r.t. the current position
    len = length of the longest match
    c = next char in the text beyond the longest match
  Advance by len + 1

A buffer "window" of fixed length bounds the dictionary and moves with the cursor.
Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
The decoder keeps the same dictionary window as the encoder.
  It finds the substring and inserts a copy of it.
What if len > d? (overlap with the text still to be decompressed)
  E.g. seen = abcd, next codeword is (2,9,e)
  Simply copy starting at the cursor:
    for (i = 0; i < len; i++) out[cursor+i] = out[cursor-d+i];
  Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: output one of the following formats: (0, position, length) or (1, char)
  Typically uses the second format if length < 3.
Special greedy: possibly use a shorter match so that the next match is better.
Hash table to speed up searches on triplets.
Triples are coded with Huffman's code.

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don't send the extra character c, but still add Sc to the dictionary.
Dictionary: initialized with 256 ascii entries (e.g. a = 112)
The decoder is one step behind the coder since it does not know c.
  There is an issue for strings of the form SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
We are given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

  F                 L
  #  mississipp  →  i
  i  #mississip  →  p
  i  ppi#missis  →  s
  i  ssippi#mis  →  s
  i  ssissippi#  →  m
  m  ississippi  →  #
  p  i#mississi  →  p
  p  pi#mississ  →  i
  s  ippi#missi  →  s
  s  issippi#mi  →  s
  s  sippi#miss  →  i
  s  sissippi#m  →  i

L is the Burrows-Wheeler Transform of T   (1994)

A famous example: [figure showing a much longer text and its transform]

A useful tool: L → F mapping
[Figure: the F and L columns of the sorted rotation matrix; the middle of each row is unknown]

How do we map L's chars onto F's chars ?
... we need to distinguish equal chars in F...
Take two equal chars of L and rotate their rows rightward by one position:
they keep the same relative order !!

The BWT is invertible
[Figure: the F and L columns of the sorted rotation matrix; the middle of each row is unknown]

Two key properties:
1. The LF-array maps L's chars to F's chars
2. L[ i ] precedes F[ i ] in T

Reconstruct T backward:
T = .... i p p i #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}

How to compute the BWT ?
  SA    sorted rotations    L
  12    #mississippi        i
  11    i#mississipp        p
   8    ippi#mississ        s
   5    issippi#miss        s
   2    ississippi#m        m
   1    mississippi#        #
  10    pi#mississip        p
   9    ppi#mississi        i
   7    sippi#missis        s
   4    sissippi#mis        s
   6    ssippi#missi        i
   3    ssissippi#mi        i

We said that L[i] precedes F[i] in T; e.g. L[3] = T[7].
Given SA and T, we have L[i] = T[SA[i]-1]
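A minimal Python sketch (not the slides' code) of the transform and its inversion; it assumes T ends with a unique terminator '#', as in the example:

    def bwt(T):
        """BWT by explicitly sorting the rotations of T: the 'elegant but inefficient' way."""
        n = len(T)
        rot = sorted(range(n), key=lambda i: T[i:] + T[:i])   # ~ suffix array of T
        return "".join(T[i - 1] for i in rot)                 # L[i] = T[SA[i]-1], wrapping around

    def ibwt(L):
        """Invert the BWT using the LF mapping (L[i] precedes F[i] in T)."""
        n = len(L)
        # LF[i] = row whose first (F) char is L[i]; equal chars of L keep their relative order.
        order = sorted(range(n), key=lambda i: (L[i], i))
        LF = [0] * n
        for f, l in enumerate(order):
            LF[l] = f
        out, r = [], L.index("#")     # start from the row whose L char is '#', i.e. the row equal to T
        for _ in range(n):
            out.append(L[r])          # reconstruct T backwards
            r = LF[r]
        return "".join(reversed(out))

    T = "mississippi#"
    L = bwt(T)
    print(L)          # ipssm#pissii
    print(ibwt(L))    # mississippi#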

How to construct SA from T ?
Input: T = mississippi#
  SA = 12 11 8 5 2 1 10 9 7 4 6 3, i.e. the sorted suffixes
  #, i#, ippi#, issippi#, ississippi#, mississippi#, pi#, ppi#, sippi#, sissippi#, ssippi#, ssissippi#

Elegant but inefficient.
Obvious inefficiencies:
  • Θ(n^2 log n) time in the worst case
  • Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:
  L is locally homogeneous  ⇒  L is highly compressible

Algorithm Bzip:
  Move-to-Front coding of L
  Run-Length coding
  Statistical coder
  Bzip vs. Gzip: 20% vs. 33% compression, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii          (# at position 16)

Mtf-list = [i,m,p,s]
Mtf(L)       = 020030000030030200300300000100000
shifted by 1 = 030040000040040300400400000200000     (Bin(6)=110, Wheeler's code for the 0-runs)
RLE0         = 03141041403141410210                  (alphabet of size |S|+1)

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web's Characteristics
Size
  1 trillion pages are available (Google, 7/08)
  5-40K per page => hundreds of terabytes
  Size grows every day!!
Change
  8% new pages, 25% new links change weekly
  Life time of a page is about 10 days

The Bow Tie

Some definitions
  Weakly connected components (WCC):
    set of nodes such that from any node you can reach any other node via an undirected path.
  Strongly connected components (SCC):
    set of nodes such that from any node you can reach any other node via a directed path.

Observing the Web Graph
  We do not know which percentage of it we know
  The only way to discover the graph structure of the web as hypertext is via large scale crawls
  Warning: the picture might be distorted by
    size limitation of the crawl
    crawling rules
    perturbations of the "natural" process of birth and death of nodes and links

Why is it interesting?
  The largest artifact ever conceived by humans
  Exploit the structure of the Web for
    crawl strategies
    search
    spam detection
    discovering communities on the web
    classification/organization
  Predict the evolution of the Web
    sociological understanding

Many other large graphs…
  Physical network graph
    V = Routers
    E = communication links
  The "cosine" graph (undirected, weighted)
    V = static web pages
    E = semantic distance between pages
  Query-Log graph (bipartite, weighted)
    V = queries and URLs
    E = (q,u) if u is a result for q, and has been clicked by some user who issued q
  Social graph (undirected, unweighted)
    V = users
    E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)
  V = URLs, E = (u,v) if u has a hyperlink to v
  Isolated URLs are ignored (no IN & no OUT)
Three key properties:
  Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution
Altavista crawl (1999) and WebBase crawl (2001): the indegree follows a power law distribution

  Pr[ in-degree(u) = k ]  ∝  1 / k^a,    a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)
  V = URLs, E = (u,v) if u has a hyperlink to v
  Isolated URLs are ignored (no IN, no OUT)
Three key properties:
  Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1
  Locality: usually most of the hyperlinks point to other URLs on the same host (about 80%).
  Similarity: pages close to each other in lexicographic order tend to share many outgoing lists

A Picture of the Web Graph
[Figure: adjacency matrix of a crawl with 21 million pages and 150 million links;
URL-sorting (e.g. Berkeley, Stanford hosts) makes the locality visible as blocks near the diagonal]

URL compression + Delta encoding

The library WebGraph
  Uncompressed adjacency list → adjacency list with compressed gaps (exploits locality)
  Successor list S(x) = {s1 - x, s2 - s1 - 1, ..., sk - s(k-1) - 1}
  For negative entries: …

Copy-lists (exploits similarity)
  Uncompressed adjacency list → adjacency list with copy lists; reference chains possibly limited
  Each bit of y's copy-list tells whether the corresponding successor of the reference x is also a successor of y;
  the reference index is chosen in [0,W] so as to give the best compression.

Copy-blocks = RLE(Copy-list)
  Adjacency list with copy lists → adjacency list with copy blocks (RLE on the bit sequences)
  The first copy block is 0 if the copy list starts with 0;
  the last block is omitted (we know the length…);
  the length is decremented by one for all blocks.

This is a Java and C++ lib (≈3 bits/edge)

Extra-nodes: Compressing Intervals
  Adjacency list with copy blocks → intervals + residuals (exploits consecutivity among the extra-nodes)
  Intervals: use their left extreme and length; the interval length is decremented by Lmin = 2
  Residuals: differences between consecutive residuals, or w.r.t. the source
  Examples:
    0 = (15-15)*2 (positive)
    2 = (23-19)-2 (jump >= 2)
    600 = (316-16)*2
    3 = |13-15|*2-1 (negative)
    3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
[Figure: sender and receiver connected by a network link; the receiver holds some prior knowledge about the data]
  network links are getting faster and faster, but
  many clients are still connected by fairly slow links (mobile?)
  people wish to send more and more data
How can we make this transparent to the user?

Two standard techniques
  caching: "avoid sending the same object again"
    done on the basis of objects
    only works if objects are completely unchanged
    How about objects that are slightly changed?
  compression: "remove redundancy in transmitted data"
    avoid repeated substrings in data (at some overhead)
    can be extended to the history of past transmissions
    What if the sender has never seen the data at the receiver ?

Types of Techniques
  Common knowledge between sender & receiver
    Unstructured file: delta compression        [diff, zdelta, REBL,…]
  "Partial" knowledge
    Unstructured files: file synchronization    [rsync, zsync]
    Record-based data: set reconciliation

Formalization
  Delta compression  (compress file f deploying file f')
    Compress a group of files
    Speed-up web access by sending differences between the requested page and the ones available in cache
  File synchronization
    Client updates an old file f_old with f_new available on a server
    Mirroring, Shared Crawling, Content Distribution Networks
  Set reconciliation
    Client updates a structured old file f_old with f_new available on a server
    Update of contacts or appointments, intersecting inverted lists in a P2P search engine

Z-delta compression   (one-to-one)

Problem: we have two files f_known and f_new and the goal is to compute a file f_d of
minimum size such that f_new can be derived from f_known and f_d.
  Assume that block moves and copies are allowed
  Find an optimal covering set of f_new based on f_known
  The LZ77-scheme provides an efficient, optimal solution:
    f_known is the "previously encoded text"; compress the concatenation f_known·f_new starting from f_new
  zdelta is one of the best implementations

             Emacs size    Emacs time
  uncompr    27Mb          ---
  gzip       8Mb           35 secs
  zdelta     1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: a pair of proxies located on each side of the slow link
use a proprietary protocol to increase performance over this link.

[Figure: Client ↔ client-side proxy ⇐ slow link, delta-encoding ⇒ server-side proxy ↔ web page, over the fast link; both proxies hold the reference version and exchange requests]

Use zdelta to reduce traffic:
  Old version available at both proxies
  Restricted to pages already visited (30% hits), URL-prefix match
  Small cache

Cluster-based delta compression
Problem: we wish to compress a group of files F
  Useful on a dynamic collection of web pages, back-ups, …
  Apply pairwise zdelta: find for each f ∈ F a good reference
  Reduction to the Min Branching problem on DAGs:
    Build a weighted graph G_F: nodes = files, weights = zdelta-sizes
    Insert a dummy node connected to all files, whose edge weights are the gzip-sizes
    Compute the min branching = directed spanning tree of minimum total cost, covering G's nodes

[Figure: a small example graph with edge weights 620, 2000, 220, 123, 20, …]

             space    time
  uncompr    30Mb     ---
  tgz        20%      linear
  THIS       8%       quadratic

Improvement   (what about many-to-one compression? group of files)
Problem: constructing G is very costly, n^2 edge calculations (zdelta executions).
We wish to exploit some pruning approach:
  Collection analysis: cluster the files that appear similar and are thus good candidates
    for zdelta-compression; build a sparse weighted graph G'_F containing only edges between those pairs of files
  Assign weights: estimate appropriate edge weights for G'_F, thus saving zdelta executions.
    Nonetheless, strictly n^2 time

             space    time
  uncompr    260Mb    ---
  tgz        12%      2 mins
  THIS       8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
[Figure: the Client holds f_old, the Server holds f_new; the client sends an update request and the server must answer without knowing f_old]
  client wants to update an out-dated file
  server has the new file but does not know the old file
  update without sending the entire f_new (using similarity)
  rsync: file synch tool, distributed with Linux

Delta compression is a sort of "local" synch, since the server has both copies of the files.

The rsync algorithm
[Figure: the Client sends block hashes of f_old; the Server replies with the encoded file]

The rsync algorithm (contd)
  simple, widely used, single roundtrip
  optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
  choice of block size is problematic (default: max{700, √n} bytes)
  not good in theory: the granularity of changes may disrupt the use of blocks

Rsync: some experiments

            gcc size   emacs size
  total     27288      27326
  gzip      7563       8577
  zdelta    227        1431
  rsync     964        4452

Compressed size in KB (slightly outdated numbers).
Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync
  The server sends the hashes (unlike the client in rsync); the client checks them.
  The server deploys the common f_ref to compress the new f_tar (rsync compresses just it).

A multi-round protocol
  k blocks of n/k elements, log(n/k) levels
  If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
  The communication complexity is O(k log n log(n/k)) bits.

Next lecture

Set reconciliation
Problem: given two sets S_A and S_B of integer values located on two machines A and B,
determine the difference between the two sets at one or both of the machines.

Requirements: the cost should be proportional to the size k of the difference,
where k may or may not be known in advance to the parties.

Note:
  set reconciliation is "easier" than file sync [it is record-based]
  Not perfectly true but...

Recurring minimum for improving the estimate + 2 SBF

A multi-round protocol
  k blocks of n/k elements, log(n/k) levels
  If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
  The communication complexity is O(k log n log(n/k)) bits.

Algoritmi per IR

Text Indexing

What do we mean by "Indexing" ?
  Word-based indexes, here a notion of "word" must be devised !
    » Inverted files, Signature files, Bitmaps.
  Full-text indexes, no constraint on text and queries !
    » Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?
  A trie !!  Or an array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
  iff P is a prefix of the i-th suffix of T (i.e. T[i,N])

Occurrences of P in T = all suffixes of T having P as a prefix

  P = si,  T = mississippi   →   occurrences at positions 4, 7

SUF(T) = sorted set of suffixes of T

Reduction: from substring search to prefix search (over the set of suffixes)

The Suffix Tree
T# = mississippi#
     1 2 3 4 5 6 7 8 9 10 11 12

[Figure: the suffix tree of T#. Edge labels are substrings of T# (e.g. "i", "s", "si", "ssi", "ppi#", "pi#", "i#", "#", "mississippi#"); the 12 leaves store the starting positions 1…12 of the corresponding suffixes.]
The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

T = mississippi#       (storing SUF(T) explicitly would take Θ(N^2) space)

  SA    SUF(T)
  12    #
  11    i#
   8    ippi#
   5    issippi#
   2    ississippi#
   1    mississippi#
  10    pi#
   9    ppi#
   7    sippi#
   4    sissippi#
   6    ssippi#
   3    ssissippi#

Suffix Array:
  • SA: Θ(N log_2 N) bits
  • Text T: N chars
  ⇒ in practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison.

  P = si,  T = mississippi#
  [Figure: each binary-search step compares P against the suffix pointed to by the middle SA entry
  (2 accesses per step); the comparison tells whether P is larger or smaller, halving the range]

Suffix Array search
  • O(log_2 N) binary-search steps
  • each step takes O(p) char comparisons
  ⇒ overall, O(p log_2 N) time

  Improvable to O(p + log_2 N) [Manber-Myers, '90];
  log_2 N can be replaced by a term in |S| [Cole et al, '06].
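A minimal Python sketch (not the slides' code) of the binary search over SA; the construction simply sorts the suffixes, which is fine only for small texts:

    def sa_build(T):
        """Suffix array by direct sorting of the suffixes (the inefficient construction)."""
        return sorted(range(len(T)), key=lambda i: T[i:])

    def sa_search(T, SA, P):
        """Binary search the contiguous SA range of suffixes having P as a prefix (Prop 1)."""
        n, p = len(SA), len(P)
        lo, hi = 0, n
        while lo < hi:                              # leftmost suffix whose p-prefix is >= P
            mid = (lo + hi) // 2
            if T[SA[mid]:SA[mid] + p] < P: lo = mid + 1
            else: hi = mid
        first, hi = lo, n
        while lo < hi:                              # first suffix whose p-prefix is > P
            mid = (lo + hi) // 2
            if T[SA[mid]:SA[mid] + p] <= P: lo = mid + 1
            else: hi = mid
        return [SA[k] + 1 for k in range(first, lo)]    # 1-based occurrence positions

    T = "mississippi#"
    SA = sa_build(T)
    print([s + 1 for s in SA])              # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
    print(sorted(sa_search(T, SA, "si")))   # [4, 7]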

Locating the occurrences
T = mississippi#, P = si

[Figure: the binary searches for P# and P$ (P followed by the smallest / largest character, with # < S < $)
delimit the SA range containing the suffixes sippi# and sissippi#, i.e. positions 7 and 4; occ = 2]

Suffix Array search: O(p + log_2 N + occ) time

Suffix Trays: O(p + log_2 |S| + occ)     [Cole et al., '06]
String B-tree                            [Ferragina-Grossi, '95]
Self-adjusting Suffix Arrays             [Ciriani et al., '02]

Text mining
Lcp[1, N-1] = longest-common-prefix between suffixes adjacent in SA

T = mississippi#

  SA    suffix           Lcp
  12    #
  11    i#                0
   8    ippi#             1
   5    issippi#          1
   2    ississippi#       4
   1    mississippi#      0
  10    pi#               0
   9    ppi#              1
   7    sippi#            0
   4    sissippi#         2
   6    ssippi#           1
   3    ssissippi#        3

(e.g. Lcp = 4 between issippi# and ississippi#, the suffixes starting at 5 and 2)

• How long is the common prefix between T[i,…] and T[j,…] ?
  It is the minimum of the subarray Lcp[h, k-1] s.t. SA[h] = i and SA[k] = j.
• Does there exist a repeated substring of length ≥ L ?
  Search for Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  Search for a window Lcp[i, i+C-2] whose entries are all ≥ L.


The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
  C * p * f/(1+f)
This is at least 10^4 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and the algorithm uses all of them:
  (1/B) * (p * f/(1+f)) * C  ≈  30 * f/(1+f)

Space-conscious Algorithms
  I/Os: searching and accessing
  Compressed data structures

Streaming Algorithms
[Figure: a disk with tracks, read/write head and arm, magnetic surface]
Data arrive continuously, or we wish FEW scans.
  Streaming algorithms:
    use few scans
    handle each element fast
    use small space

Cache-Oblivious Algorithms
[Figure: the memory hierarchy: CPU registers – L1 – L2 caches (few Mbs, some nanosecs, few words fetched) – RAM (few Gbs, tens of nanosecs) – HD (few Tbs, few millisecs, B = 32K pages) – net (many Tbs, even secs, packets)]

Unknown and/or changing devices
  Block access is important on all levels of the memory hierarchy
  But memory hierarchies are very diverse
Cache-oblivious algorithms:
  Explicitly, the algorithm does not assume any model parameters
  Implicitly, it uses blocks efficiently on all memory levels

Toy problem #1: Max Subarray
  Goal: given a stock and its D-performance over time, find the time window in which it achieved the best "market performance".
  Math Problem: find the subarray of maximum sum.

  A = 2 -5 6 1 -2 4 3 -13 9 -6 7

  n      4K    8K   16K   32K   128K   256K   512K   1M
  n^3    22s   3m   26m   3.5h  28h    --     --     --
  n^2    0     0    0     1s    26s    106s   7m     28m

An optimal solution
We assume every subsum ≠ 0.
[Figure: A split around the optimum subarray; the running sum is < 0 just before OPT starts and stays > 0 within OPT]

  A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm
  sum = 0; max = -1;
  for i = 1,...,n do
    if (sum + A[i] ≤ 0) sum = 0;
    else { sum += A[i]; max = MAX{max, sum}; }

Note:
  • sum < 0 when OPT starts;
  • sum > 0 within OPT
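A minimal Python sketch (not the slides' code) of the linear scan above, also returning the window it found:

    def max_subarray(A):
        """Reset the running sum when it would drop to <= 0; track the best sum seen."""
        best, run = float("-inf"), 0
        best_range, start = None, 0
        for i, x in enumerate(A):
            if run + x <= 0:
                run, start = 0, i + 1      # the optimum cannot start before i+1
            else:
                run += x
                if run > best:
                    best, best_range = run, (start, i)
        return best, best_range

    print(max_subarray([2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]))
    # (12, (2, 6)) → the window 6 1 -2 4 3 of sum 12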

Toy problem #2: sorting
  How to sort tuples (objects) on disk
  [Figure: the memory containing the tuples, and an array A of pointers to them]
  Key observation:
    Array A is an "array of pointers to objects"
    For each object-to-object comparison A[i] vs A[j]: 2 random accesses to the memory locations A[i] and A[j]
    MergeSort ⇒ Θ(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB
  n insertions ⇒ data get distributed arbitrarily !!!
  [Figure: B-tree internal nodes, leaves with "tuple pointers", and the tuples below]
  What about listing the tuples in order ?
  Possibly 10^9 random I/Os = 10^9 * 5ms ⇒ 2 months

Binary Merge-Sort
  Merge-Sort(A,i,j)
  01 if (i < j) then
  02    m = (i+j)/2;              // Divide
  03    Merge-Sort(A,i,m);        // Conquer
  04    Merge-Sort(A,m+1,j);
  05    Merge(A,i,m,j)            // Combine

Cost of Mergesort on large data
  Take Wikipedia in Italian, compute word freq: n = 10^9 tuples ⇒ a few Gbs
  Typical disk (Seagate Cheetah 150Gb): seek time ~5ms
  Analysis of mergesort on disk:
    It is an indirect sort: Θ(n log_2 n) random I/Os
    [5ms] * n log_2 n ≈ 1.5 years
  In practice it is faster because of caching... (each merge pass is 2 sequential passes, R/W)

Merge-Sort Recursion Tree
[Figure: the log_2 N levels of the mergesort recursion on a small example; runs are merged pairwise level by level.
If the run-size is larger than B (i.e. after the first step!!), fetching all of it in memory for merging does not help.]

How do we deploy the disk/memory features ?
  N/M runs, each sorted in internal memory (no I/Os)
  — I/O-cost for merging is ≈ 2 (N/B) log_2 (N/M)

Multi-way Merge-Sort
  The key is to balance run-size and #runs to merge
  Sort N items with main memory M and disk pages of size B:
    Pass 1: produce N/M sorted runs.
    Pass i: merge X ≤ M/B runs  ⇒  log_{M/B} (N/M) passes
  [Figure: X input buffers (one per run) and one output buffer, each of B items, in main memory; runs on disk are merged into a single run on disk]

Multiway Merging
  [Figure: buffers Bf1…Bfx with pointers p1…pX; at each step output min(Bf1[p1], Bf2[p2], …, Bfx[pX]) to the output buffer Bfo (pointer po); fetch a new page of run i when pi = B, flush Bfo when it is full; stop at EOF]

Cost of Multi-way Merge-Sort
  Number of passes = log_{M/B} #runs  ≤  log_{M/B} (N/M)
  Optimal cost = Θ( (N/B) log_{M/B} (N/M) ) I/Os

In practice
  M/B ≈ 1000  ⇒  #passes = log_{M/B} (N/M) ≈ 1
  One multiway merge ⇒ 2 passes = a few mins   (tuning depends on disk features)
  Large fan-out (M/B) decreases #passes
  Compression would decrease the cost of a pass!
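A toy Python sketch (not the slides' code) of the multiway merging step, with one iterator per sorted run standing in for the input buffers and a block-wise flushed output buffer:

    import heapq, itertools

    def multiway_merge(runs, out, block=4):
        """Repeatedly output min(Bf1[p1], ..., Bfx[pX]); flush the output buffer in blocks."""
        iters = [iter(r) for r in runs]
        heap = [(x, i) for i, it in enumerate(iters) for x in itertools.islice(it, 1)]
        heapq.heapify(heap)
        buf = []
        while heap:
            x, i = heapq.heappop(heap)     # smallest current head among the runs
            buf.append(x)
            if len(buf) == block:          # flush Bfo when full
                out.extend(buf); buf = []
            nxt = next(iters[i], None)     # advance pointer p_i in run i
            if nxt is not None:
                heapq.heappush(heap, (nxt, i))
        out.extend(buf)
        return out

    print(multiway_merge([[1, 2, 5, 10], [2, 7, 9, 13], [3, 4, 8, 19]], []))
    # [1, 2, 2, 3, 4, 5, 7, 8, 9, 10, 13, 19]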

Does compression help?
  Goal: enlarge M and reduce N
  #passes = O(log_{M/B} (N/M))
  Cost of a pass = O(N/B)

Part of Vitter's paper…
In order to address issues related to:
  Disk Striping: sorting easily on D disks
  Distribution sort: top-down sorting
  Lower Bounds: how far we can go
Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...
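A minimal Python sketch (not the slides' code) of the single-pair scan above (the Boyer-Moore majority vote); the returned candidate is guaranteed correct only if some item really occurs > N/2 times:

    def majority_candidate(stream):
        X, C = None, 0
        for s in stream:
            if X == s:
                C += 1
            else:
                C -= 1
                if C <= 0:
                    X, C = s, 1
        return X

    print(majority_candidate("bacccdcbaaaccbccc"))   # 'c' (it occurs 9 times out of 17)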

TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store, for every level L of the codeword tree:
 firstcode[L] = the first (numerically smallest) codeword of length L
 Symbol[L,i], for each i in level L

This takes ≤ h^2 + |S| log |S| bits, where h is the tree height

Canonical Huffman: Encoding
[figure: canonical codeword assignment over the levels 1…5 of the tree]

Canonical Huffman: Decoding
firstcode[1]=2, firstcode[2]=1, firstcode[3]=1, firstcode[4]=2, firstcode[5]=0

T = ...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. Its
self information is
-log2(.999) ≈ .00144 bits

If we were to send 1000 such symbols we
might hope to use only 1000 × .00144 ≈ 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:
 The model takes |S|^k · (k · log |S|) + h^2 bits (where h might be |S|)
 It is H0(SL) ≤ L · Hk(S) + O(k · log |S|), for each k ≤ L

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
[figure: word-based tagged Huffman coding of T = “bzip or not bzip” — the Huffman tree has
 fan-out 128, each codeword is a sequence of 7-bit symbols packed into bytes, and the tag bit
 of a byte marks whether it is the first byte of a codeword]
CGrep and other ideas...
P= bzip = 1a 0b

[figure: compressed GREP — the codeword of P (= bzip = 1a 0b) is searched directly in C(T),
 for T = “bzip or not bzip”; the tag bits identify codeword boundaries, and each candidate is answered yes/no]

Speed ≈ Compression ratio

You find this under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary = {bzip, not, or, space};  P = bzip = 1a 0b;  S = “bzip or not bzip”

[figure: the pattern’s codeword is matched directly against the compressed text C(S),
 answering yes/no at each codeword boundary]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
[figure: the pattern P aligned below the text T at one of its occurrences]

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching
 We show methods in which arithmetic and bit operations replace comparisons
 We will survey two examples of such methods:
   The Random Fingerprint method due to Karp and Rabin
   The Shift-And method due to Baeza-Yates and Gonnet

Rabin-Karp Fingerprint
 We will use a class of functions from strings to integers in order to obtain:
   An efficient randomized algorithm that makes an error with small probability.
   A randomized algorithm that never errs, whose running time is efficient with high probability.

 We will consider a binary alphabet (i.e., T ∈ {0,1}^n).

Arithmetic replaces Comparisons


 Strings are also numbers, H: strings → numbers.
 Let s be a string of length m:

   H(s) = Σ_{i=1,...,m} 2^{m-i} · s[i]

 P = 0101
 H(P) = 2^3·0 + 2^2·1 + 2^1·0 + 2^0·1 = 5
 s = s’ if and only if H(s) = H(s’)

Definition:
let Tr denote the m-length substring of T starting at
position r (i.e., Tr = T[r, r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


 We can compute H(Tr) from H(Tr-1):

   H(T_r) = 2·H(T_{r-1}) − 2^m · T[r−1] + T[r+m−1]

 T = 10110101
 T1 = 1011, T2 = 0110
 H(T1) = H(1011) = 11
 H(T2) = 2·11 − 2^4·1 + 0 = 22 − 16 = 6 = H(0110)

Arithmetic replaces Comparisons
 A simple efficient algorithm:
   Compute H(P) and H(T1)
   Run over T: compute H(Tr) from H(Tr-1) in constant time,
    and make the comparison H(P) = H(Tr).

 Total running time O(n+m)?
   NO! Why?
   The problem is that when m is large, it is unreasonable to
    assume that each arithmetic operation can be done in O(1) time.
   Values of H() are m-bit numbers: in general, too BIG to fit in a machine word.

 IDEA! Let’s use modular arithmetic:
 For some prime q, the Karp-Rabin fingerprint of a string s
 is defined by Hq(s) = H(s) (mod q)

An example
P = 101111, q = 7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally, one bit at a time:
 (0·2 + 1) mod 7 = 1
 (1·2 + 0) mod 7 = 2
 (2·2 + 1) mod 7 = 5
 (5·2 + 1) mod 7 = 4
 (4·2 + 1) mod 7 = 2
 (2·2 + 1) mod 7 = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1), since
 2^m (mod q) = 2·( 2^{m-1} (mod q) ) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm
 Choose a positive integer I
 Pick a random prime q less than or equal to I, and
  compute P’s fingerprint Hq(P).
 For each position r in T, compute Hq(Tr) and test whether
  it equals Hq(P). If the numbers are equal, either
   declare a probable match (randomized algorithm), or
   check and declare a definite match (deterministic algorithm)

 Running time: excluding verification, O(n+m).
 The randomized algorithm is correct w.h.p.
 The deterministic algorithm has expected running time O(n+m)

Proof on the board
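A compact sketch (my illustration, with an arbitrarily fixed prime q rather than a randomly drawn one) of the rolling fingerprint over a binary text, verifying every fingerprint hit so that no false match is ever reported:

def karp_rabin(T, P, q=2**31 - 1):
    # Positions (0-based) of the occurrences of P in T, both strings over '0'/'1'.
    n, m = len(T), len(P)
    if m > n: return []
    high = pow(2, m - 1, q)                  # 2^(m-1) mod q, used to drop the leftmost bit
    hp = ht = 0
    for i in range(m):                       # fingerprints of P and of T[0:m]
        hp = (2 * hp + int(P[i])) % q
        ht = (2 * ht + int(T[i])) % q
    hits = []
    for r in range(n - m + 1):
        if ht == hp and T[r:r + m] == P:     # verify, so false matches are filtered out
            hits.append(r)
        if r + m < n:                        # roll the window one position to the right
            ht = (2 * (ht - int(T[r]) * high) + int(T[r + m])) % q
    return hits

print(karp_rabin("10110101", "0101"))        # [4], i.e. position 5 in the slides’ 1-based indexing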

Problem 1: Solution
Dictionary = {bzip, not, or, space};  P = bzip = 1a 0b;  S = “bzip or not bzip”

[figure: the pattern’s codeword is scanned against C(S); each candidate codeword of S is answered yes/no]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

[figure: the 3×10 matrix M for T = california, P = for; its only 1s are
 M(1,5), M(2,6), M(3,7) — the 1 in the last row at column 7 certifies the
 occurrence of “for” ending at position 7]

How does M solve the exact match problem?

How to construct M
 We want to exploit bit-parallelism to compute the j-th
  column of M from the (j-1)-th one.
 Machines can perform bit and arithmetic operations between
  two words in constant time.
 Examples:
   And(A,B) is the bit-wise AND between A and B.
   BitShift(A) is the value derived by shifting A’s bits down by
    one and setting the first bit to 1.
    e.g. BitShift( (0,1,1,0,1)^T ) = (1,0,1,1,0)^T


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M
 We want to exploit bit-parallelism to compute the j-th
  column of M from the (j-1)-th one.
 We define the m-length binary vector U(x) for each
  character x of the alphabet: U(x) is set to 1 at the
  positions in P where character x appears.
 Example: P = abaac
   U(a) = (1,0,1,1,0)^T    U(b) = (0,1,0,0,0)^T    U(c) = (0,0,0,0,1)^T

How to construct M
 Initialize column 0 of M to all zeros
 For j > 0, the j-th column is obtained as

   M(j) = BitShift( M(j-1) ) & U( T[j] )

 For i > 1, entry M(i,j) = 1 iff
  (1) the first i-1 characters of P match the i-1 characters of T
      ending at character j-1   ⇔  M(i-1,j-1) = 1
  (2) P[i] = T[j]   ⇔  the i-th bit of U(T[j]) = 1

 BitShift moves bit M(i-1,j-1) into the i-th position;
 ANDing with the i-th bit of U(T[j]) establishes whether both conditions hold.
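A short sketch (mine) of the resulting scan, using Python integers as bit vectors with bit i-1 standing for row i of the current column:

def shift_and(T, P):
    # End positions (1-based) of the exact occurrences of P in T.
    m = len(P)
    U = {}
    for i, x in enumerate(P):                # U[x]: bit i-1 set iff P[i] == x
        U[x] = U.get(x, 0) | (1 << i)
    col, ends = 0, []
    for j, x in enumerate(T, start=1):
        col = ((col << 1) | 1) & U.get(x, 0) # M(j) = BitShift(M(j-1)) & U(T[j])
        if col & (1 << (m - 1)):             # last row set: an occurrence ends at j
            ends.append(j)
    return ends

print(shift_and("california", "for"))        # [7]
print(shift_and("xabxabaaca", "abaac"))      # [9]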

An example, j=1:   T = xabxabaaca,  P = abaac
 M(1) = BitShift(M(0)) & U(T[1]) = (1,0,0,0,0)^T & U(x) = (1,0,0,0,0)^T & (0,0,0,0,0)^T = (0,0,0,0,0)^T
 (x does not occur in P, so the column stays all zeros)

An example, j=2:   T = xabxabaaca,  P = abaac
 M(2) = BitShift(M(1)) & U(T[2]) = (1,0,0,0,0)^T & U(a) = (1,0,0,0,0)^T & (1,0,1,1,0)^T = (1,0,0,0,0)^T

An example, j=3:   T = xabxabaaca,  P = abaac
 M(3) = BitShift(M(2)) & U(T[3]) = (1,1,0,0,0)^T & U(b) = (1,1,0,0,0)^T & (0,1,0,0,0)^T = (0,1,0,0,0)^T

An example, j=9:   T = xabxabaaca,  P = abaac
 The first 9 columns of M are (rows 1..5):
   row 1: 0 1 0 0 1 0 1 1 0
   row 2: 0 0 1 0 0 1 0 0 0
   row 3: 0 0 0 0 0 0 1 0 0
   row 4: 0 0 0 0 0 0 0 1 0
   row 5: 0 0 0 0 0 0 0 0 1
 M(9) = BitShift(M(8)) & U(T[9]) = (1,1,0,0,1)^T & U(c) = (1,1,0,0,1)^T & (0,0,0,0,1)^T = (0,0,0,0,1)^T
 M(5,9) = 1: an occurrence of P ends at position 9.

Shift-And method: Complexity
 If m ≤ w, any column and any vector U() fit in a memory word,
  so any step requires O(1) time.
 If m > w, any column and any vector U() can be divided into m/w memory words,
  so any step requires O(m/w) time.
 Overall O(n(1+m/w)+m) time.
 Thus, it is very fast when the pattern length is close to the word size —
  very often the case in practice. Recall that w = 64 bits in modern architectures.

Some simple extensions
 We want to allow the pattern to contain special
  symbols, like the character class [a-f]
 P = [a-b]baac
   U(a) = (1,0,1,1,0)^T    U(b) = (1,1,0,0,0)^T    U(c) = (0,0,0,0,1)^T

 What about ‘?’, ‘[^…]’ (not) ?

Problem 1: Another solution
Dictionary = {bzip, not, or, space};  P = bzip = 1a 0b;  S = “bzip or not bzip”

[figure: the search runs directly over C(S), answering yes/no at each codeword of S]

Speed ≈ Compression ratio

Problem 2
Dictionary = {bzip, not, or, space};  S = “bzip or not bzip”;  P = o

Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring.

[figure: the terms containing “o” are  not = 1g 0g 0a  and  or = 1g 0a 0b;
 their codewords are then searched for in C(S)]

Speed ≈ Compression ratio? No! Why?
A scan of C(S) is needed for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[figure: the patterns P1 and P2 aligned at their occurrences in the text T]

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And
 S is the concatenation of the patterns in P
 R is a bitmap of length m: R[i] = 1 iff S[i] is the first symbol of a pattern
 Use a variant of the Shift-And method searching for S:
   For any symbol c, U’(c) = U(c) AND R
     U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
   For any step j,
     compute M(j)
     then OR it with U’(T[j]). Why?
      This sets to 1 the first bit of each pattern that starts with T[j]
     Check if there are occurrences ending in j. How?

Problem 3
Dictionary = {bzip, not, or, space};  S = “bzip or not bzip”;  P = bot,  k = 2

Given a pattern P, find all the occurrences in S of all dictionary terms containing
P as a substring, allowing at most k mismatches.

[figure: the dictionary codewords and the compressed text C(S), as in the previous problems]

Agrep: Shift-And method with errors
 We extend the Shift-And method for finding inexact
  occurrences of a pattern in a text.
 Example:
  T = aatatccacaa
  P = atcgaa
  P appears in T with 2 mismatches starting at position 4;
  it also occurs with 4 mismatches starting at position 2.

   aatatccacaa        aatatccacaa
      atcgaa            atcgaa

Agrep
 Our current goal: given k, find all the occurrences of P
  in T with up to k mismatches.
 We define the matrix M^l to be an m by n binary matrix such that:
  M^l(i,j) = 1 iff there are no more than l mismatches between the
  first i characters of P and the i characters of T ending at character j.

 What is M^0?
 How does M^k solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing M^l: case 1
 The first i-1 characters of P match a substring of T ending
  at j-1 with at most l mismatches, and the next pair of
  characters in P and T are equal:

   BitShift( M^l(j-1) ) & U( T[j] )

Computing M^l: case 2
 The first i-1 characters of P match a substring of T ending
  at j-1 with at most l-1 mismatches (T[j] is charged as the l-th mismatch):

   BitShift( M^(l-1)(j-1) )

Computing M^l
 We compute M^l for all l = 0, …, k; for each j we compute M^0(j), M^1(j), …, M^k(j).
 For all l, initialize M^l(0) to the zero vector.
 Combining the two cases, there is a match iff

   M^l(j) = [ BitShift( M^l(j-1) ) & U( T[j] ) ]  OR  BitShift( M^(l-1)(j-1) )
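A compact sketch (my own, under the same bit-vector convention as the exact Shift-And above) of the k-mismatch scan; M[l] holds the current column of M^l as a Python integer:

def agrep_mismatch(T, P, k):
    # End positions (1-based) of occurrences of P in T with at most k mismatches.
    m = len(P)
    U = {}
    for i, x in enumerate(P):
        U[x] = U.get(x, 0) | (1 << i)
    M = [0] * (k + 1)
    ends = []
    for j, x in enumerate(T, start=1):
        prev, u = M[:], U.get(x, 0)
        M[0] = ((prev[0] << 1) | 1) & u                          # exact matches
        for l in range(1, k + 1):                                # case 1 OR case 2
            M[l] = (((prev[l] << 1) | 1) & u) | ((prev[l - 1] << 1) | 1)
        if M[k] & (1 << (m - 1)):
            ends.append(j)
    return ends

print(agrep_mismatch("aatatccacaa", "atcgaa", 2))   # [9]: the occurrence starting at position 4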

Example: M^1
 T = xabxabaaca,  P = abaad,  k = 1

[figure: the 5×10 matrices M^0 (no 1 in its last row: no exact occurrence) and M^1;
 M^1(5,9) = 1 reveals an occurrence of P ending at position 9 with one mismatch
 (T[5..9] = abaac vs P = abaad)]

How much do we pay?
 The running time is O( k n (1 + m/w) ).
 Again, the method is practically efficient for small m.
 Only O(k) columns of M are needed at any given time;
  hence the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary = {bzip, not, or, space};  S = “bzip or not bzip”;  P = bot,  k = 2

[figure: the k-mismatch scan over the dictionary terms finds not = 1g 0g 0a
 (within 2 mismatches of “bot”); its codeword is then searched for in C(S)]

Agrep: more sophisticated operations
 The Shift-And method can solve other ops as well.
 The edit distance between two strings p and s is
  d(p,s) = minimum number of operations needed to transform p into s via three ops:
   Insertion: insert a symbol in p
   Deletion: delete a symbol from p
   Substitution: change a symbol of p into a different one
  Example: d(ananas, banane) = 3

 Search by regular expressions
  Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus to build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data.

A simple algorithm:
Sort the pi in decreasing order, and encode si via
the variable-length code for the integer i (its rank).

g-code for integer encoding (Elias’ γ-code)
 g(x) = 0^(Length-1) followed by x in binary, where x > 0 and Length = ⌊log2 x⌋ + 1
 e.g., 9 is represented as <000, 1001>.

 The g-code for x takes 2⌊log2 x⌋ + 1 bits
  (i.e. a factor of 2 from the optimal ⌊log2 x⌋ + 1)

 Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

 = 0001000 | 00110 | 011 | 00000111011 | 00111   →   8, 6, 3, 59, 7
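A tiny sketch (mine, following the convention above) of g-encoding and of decoding a concatenation of g-codes:

def gamma_encode(x):
    # (Length-1) zeros followed by the binary representation of x, x >= 1.
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":                # count the leading zeros
            z += 1; i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out

print(gamma_encode(9))                                   # 0001001
print(gamma_decode("0001000001100110000011101100111"))   # [8, 6, 3, 59, 7]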

Analysis
Sort the pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that |g(i)| ≤ 2 · log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2 · H0(S) + 1 bits per symbol
Key fact:
 1 ≥ Σ_{i=1,...,x} pi ≥ x · px   ⇒   x ≤ 1/px

How good is it ?
Encode the integers via g-coding: |g(i)| ≤ 2 · log i + 1
The cost of the encoding is (recall i ≤ 1/pi):

  Σ_{i=1,...,|S|} pi · |g(i)|  ≤  Σ_{i=1,...,|S|} pi · [ 2 · log(1/pi) + 1 ]  =  2 · H0(X) + 1

Not much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding
 Byte-aligned and tagged Huffman
   128-ary Huffman tree
   The first bit of the first byte is tagged
   Configurations on 7 bits: just those of Huffman

 End-tagged dense code
   The rank r is mapped to the r-th binary sequence on 7·k bits
   The first bit of the last byte is tagged

Surprising changes:
 It is a prefix-code
 Better compression: it uses all the 7-bit configurations

(s,c)-dense codes
The distribution of words is skewed: 1/i^q, where 1 < q < 2

A new concept: Continuers vs Stoppers
 Previously we used: s = c = 128
 The main idea is:
   s + c = 256 (we are playing with 8 bits)
   Thus s items are encoded with 1 byte,
   s·c with 2 bytes, s·c^2 with 3 bytes, ...

An example
 5000 distinct words
 ETDC encodes 128 + 128^2 = 16512 words within 2 bytes
 A (230,26)-dense code encodes 230 + 230·26 = 6210 words within 2 bytes,
  hence more words on 1 byte; thus, if the distribution is skewed, it compresses better…

Optimal (s,c)-dense codes
Find the optimal s, assuming c = 256 - s.
 Brute-force approach
 Binary search: on real distributions, there seems to be a unique minimum
  (Ks = max codeword length; Fsk = cumulative probability of the symbols whose |cw| ≤ k)

Experiments: (s,c)-DC is quite interesting…
 Search is 6% faster than byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move-to-Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded
 Start with the list of symbols L = [a,b,c,d,…]
 For each input symbol s:
  1) output the position of s in L
  2) move s to the front of L

There is a memory
Properties:
 It exploits temporal locality, and it is dynamic
 X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n^2 log n) bits, MTF = O(n log n) + n^2 bits

Not much worse than Huffman
...but it may be far better
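A minimal sketch (mine) of the transform; the list here is a plain Python list, whereas the slides below discuss a balanced tree plus a hash table to make each step O(log |S|):

def mtf_encode(text, alphabet):
    # For each symbol: output its 1-based position in L, then move it to the front.
    L, out = list(alphabet), []
    for s in text:
        i = L.index(s)
        out.append(i + 1)
        L.insert(0, L.pop(i))
    return out

def mtf_decode(ranks, alphabet):
    L, out = list(alphabet), []
    for r in ranks:
        out.append(L[r - 1])
        L.insert(0, L.pop(r - 1))
    return "".join(out)

r = mtf_encode("aabbbaccc", "abc")
print(r)                           # [1, 1, 2, 1, 1, 2, 3, 1, 1]
print(mtf_decode(r, "abc"))        # aabbbaccc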

MTF: how good is it ?
Encode the integers via g-coding: |g(i)| ≤ 2 · log i + 1
Put the alphabet S in front of the sequence and consider the cost of encoding
(n_x = #occurrences of symbol x, p_i^x = position of its i-th occurrence, N = total length):

  O(|S| log |S|)  +  Σ_{x=1,...,|S|}  Σ_{i=2,...,n_x}  | g( p_i^x - p_{i-1}^x ) |

By Jensen’s inequality this is

  ≤  O(|S| log |S|)  +  Σ_{x=1,...,|S|}  n_x · [ 2 · log (N / n_x) + 1 ]
  =  O(|S| log |S|)  +  N · [ 2 · H0(X) + 1 ]

Hence La[mtf] ≤ 2 · H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as the symbols to be encoded.
How to maintain the MTF-list efficiently:
 Search tree
   Leaves contain the symbols, ordered as in the MTF-list
   Nodes contain the size of their descending subtree
 Hash Table
   key is a symbol
   data is a pointer to the corresponding tree leaf

Each tree operation takes O(log |S|) time
Total is O(n log |S|), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings → just the run lengths and one initial bit

There is a memory
Properties:
 It exploits spatial locality, and it is a dynamic code
 X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = Θ(n^2 log n) > Rle(X) = Θ(n log n)
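A tiny sketch (mine) of the run-length transform; the run lengths would then be fed to a g-coder or to a statistical coder:

from itertools import groupby

def rle(text):
    # abbbaacccca -> [('a',1), ('b',3), ('a',2), ('c',4), ('a',1)]
    return [(ch, len(list(run))) for ch, run in groupby(text)]

print(rle("abbbaacccca"))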

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive):

  f(i) = Σ_{j=1,...,i-1} p(j)

e.g.  p(a) = .2, p(b) = .5, p(c) = .3  ⇒  f(a) = .0, f(b) = .2, f(c) = .7
[figure: the unit interval split into a = [0,.2), b = [.2,.7), c = [.7,1.0)]

The interval for a particular symbol will be called
the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac

[figure: start from [0,1); after b the interval is [.2,.7); after a it becomes [.2,.3); after c it becomes [.27,.3)]

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c1 c2 … cn with probabilities
p[c], use the following:

  l_0 = 0      l_i = l_{i-1} + s_{i-1} · f[c_i]
  s_0 = 1      s_i = s_{i-1} · p[c_i]

f[c] is the cumulative probability up to symbol c (not included).
The final interval size is

  s_n = Π_{i=1,...,n} p[c_i]

The interval for a message sequence will be called the sequence interval.

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval
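A small sketch (mine) of the interval computation defined above, run on the example message; a real coder would of course emit bits incrementally, using the integer renormalization discussed later:

def sequence_interval(msg, p):
    # Returns (l, s): the sequence interval [l, l+s) of msg under the model p.
    f, acc = {}, 0.0                  # f[c] = cumulative probability of the symbols before c
    for c, pc in p.items():
        f[c] = acc
        acc += pc
    l, s = 0.0, 1.0
    for c in msg:
        l = l + s * f[c]              # l_i = l_{i-1} + s_{i-1} * f[c_i]
        s = s * p[c]                  # s_i = s_{i-1} * p[c_i]
    return l, s

l, s = sequence_interval("bac", {'a': .2, 'b': .5, 'c': .3})
print(l, l + s)                       # ≈ 0.27 0.30, i.e. the sequence interval [.27, .3)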

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:

[figure: .49 lies in b’s interval [.2,.7); within it, .49 lies in b’s sub-interval [.3,.55);
 within that, .49 lies in c’s sub-interval [.475,.55)]

The message is bbc.

Representing a real number
Binary fractional representation:
  .75 = .11      1/3 = .010101…      11/16 = .1011

Algorithm (emit the binary fraction of x ∈ [0,1)):
 1. x = 2 · x
 2. If x < 1, output 0
 3. else x = x - 1; output 1  (and repeat)

So how about just using the shortest binary
fractional representation within the sequence interval?
e.g. [0,.33) → .01     [.33,.66) → .1     [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions:

  number   min     max     interval
  .11      .110    .111    [.75, 1.0)
  .101     .1010   .1011   [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (a dyadic interval).

[figure: sequence interval [.61, .79) contains the code interval of .101 = [.625, .75)]

Can use L + s/2 truncated to 1 + ⌈log (1/s)⌉ bits

Bound on Arithmetic length: note that ⌈-log s⌉ + 1 = ⌈log (2/s)⌉

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

  1 + ⌈log (1/s)⌉ = 1 + ⌈log Π_{i=1,...,n} (1/p_i)⌉
                  ≤ 2 + Σ_{i=1,...,n} log (1/p_i)
                  = 2 + Σ_{k=1,...,|S|} n·p_k · log (1/p_k)
                  = 2 + n·H0   bits

In practice it takes nH0 + 0.02·n bits, because of rounding.

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
 Output 1 followed by m 0s; set m = 0
 The message interval is expanded by 2

If u < R/2 then (bottom half)
 Output 0 followed by m 1s; set m = 0
 The message interval is expanded by 2

If l ≥ R/4 and u < 3R/4 then (middle half)
 Increment m
 The message interval is expanded by 2

In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine: given the current interval (L,s) and the distribution
(p1,....,pS), coding the symbol c maps (L,s) to the sub-interval (L’,s’).

  ATB:  (L,s) , c , (p1,....,pS)   →   (L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
The ATB is driven by the conditional distribution p[ s | context ], where s = c or esc:
it maps the current interval (L,s) to the sub-interval (L’,s’).

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts  (k = 2)
String = ACCBACCACBA B

Order-0 context:
 Empty:  A = 4   B = 2   C = 5   $ = 3

Order-1 contexts:
 A:  C = 3   $ = 1
 B:  A = 2   $ = 1
 C:  A = 1   B = 2   C = 2   $ = 3

Order-2 contexts:
 AC:  B = 1   C = 2   $ = 2
 BA:  C = 1   $ = 1
 CA:  C = 1   $ = 1
 CB:  A = 2   $ = 1
 CC:  A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary = the already-scanned text (all substrings starting before the Cursor)
Cursor = current position

Algorithm’s step:
 Output <d, len, c>, e.g. <2,3,c>, where
   d = distance of the copied string wrt the current position
   len = length of the longest match
   c = next char in the text beyond the longest match
 Advance by len + 1

A buffer “window” of fixed length slides over the text

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids
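A small sketch (mine) of the LZ78 loop, with the trie kept as a dictionary keyed by (node id, next char); the ids follow the numbering used in the coding example below:

def lz78_encode(text):
    # Returns a list of (id, char) pairs; id 0 denotes the empty string.
    trie = {}                                  # (parent_id, char) -> id
    out, node, next_id = [], 0, 1
    for c in text:
        if (node, c) in trie:                  # keep extending the current match S
            node = trie[(node, c)]
        else:                                  # longest match found: emit <id(S), c> and add S·c
            out.append((node, c))
            trie[(node, c)] = next_id
            next_id += 1
            node = 0
    if node:                                   # a pending match at end-of-text (not needed below)
        out.append((node, ""))
    return out

print(lz78_encode("aabaacabcabcb"))
# [(0,'a'), (1,'b'), (1,'a'), (0,'c'), (2,'c'), (5,'b')] — as in the example that follows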

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

→ F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
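A compact sketch (mine, deliberately naive and quadratic — the efficient route is via the suffix array, as discussed next) of the transform and of a simple inversion; the LF-mapping loop above is the linear-time way to invert:

def bwt(T):
    # T must end with a unique smallest character, e.g. '#'. Returns the last column L.
    rot = sorted(T[i:] + T[:i] for i in range(len(T)))     # sorted rotations
    return "".join(r[-1] for r in rot)

def ibwt(L):
    # Naive inversion: repeatedly prepend L to the table and re-sort.
    table = [""] * len(L)
    for _ in range(len(L)):
        table = sorted(L[i] + table[i] for i in range(len(L)))
    return next(row for row in table if row.endswith("#"))

s = bwt("mississippi#")
print(s)           # ipssm#pissii
print(ibwt(s))     # mississippi#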

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n^2 log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…
 Physical network graph
   V = Routers
   E = communication links
 The “cosine” graph (undirected, weighted)
   V = static web pages
   E = semantic distance between pages
 Query-Log graph (bipartite, weighted)
   V = queries and URLs
   E = (q,u) if u is a result for q, and has been clicked by some user who issued q
 Social graph (undirected, unweighted)
   V = users
   E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)
 V = URLs, E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN & no OUT)

Key property:
 Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution
 Altavista crawl (1999) and WebBase crawl (2001): the indegree follows a power-law distribution

  Pr[ in-degree(u) = k ]  ∝  1 / k^a ,   a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)
 V = URLs, E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN, no OUT)

Three key properties:
 Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1
 Locality: usually most of the hyperlinks point to other URLs on
  the same host (about 80%).
 Similarity: pages close in lexicographic order tend to share
  many outgoing lists
A Picture of the Web Graph
[figure: i–j plot of the Web graph’s links, showing strong locality]
21 million pages, 150 million links

URL-sorting (e.g. Berkeley, Stanford, …)
 URL compression + Delta encoding

The library WebGraph
 Uncompressed adjacency list
 Adjacency list with compressed gaps (exploits locality)
 Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}
 For negative entries: map v ≥ 0 to 2v and v < 0 to 2|v|-1 (cf. the interval example below)

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



 The LZ77-scheme provides an efficient, optimal solution:
   fknown is the “previously encoded text”; compress fknown·fnew starting from fnew
 zdelta is one of the best implementations

           Emacs size   Emacs time
 uncompr   27Mb         ---
 gzip      8Mb          35 secs
 zdelta    1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: a pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

[figure: Client ↔ client-side proxy ↔ (slow link, delta-encoding) ↔ server-side proxy ↔ (fast link) ↔ web;
 both proxies hold the reference page]

Use zdelta to reduce traffic:
 The old version is available at both proxies
 Restricted to pages already visited (30% hits), URL-prefix match
 Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F
 Useful on a dynamic collection of web pages, back-ups, …
 Apply pairwise zdelta: find for each f ∈ F a good reference

 Reduction to the Min Branching problem on DAGs:
   Build a weighted graph G_F: nodes = files, weights = zdelta-sizes
   Insert a dummy node connected to all, whose edge weights are the gzip-coded sizes
   Compute the min branching = directed spanning tree of minimum total cost covering G’s nodes

[figure: a small weighted graph with the dummy node 0 and its min branching]

           space   time
 uncompr   30Mb    ---
 tgz       20%     linear
 THIS      8%      quadratic

Improvement   (what about many-to-one compression of a group of files?)

Problem: Constructing G is very costly: n^2 edge calculations (zdelta executions)
 We wish to exploit some pruning approach
   Collection analysis: cluster the files that appear similar and are thus
    good candidates for zdelta-compression; build a sparse weighted graph
    G’_F containing only edges between those pairs of files
   Assign weights: estimate appropriate edge weights for G’_F, thus saving
    zdelta executions. Nonetheless, strictly n^2 time

           space   time
 uncompr   260Mb   ---
 tgz       12%     2 mins
 THIS      8%      16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
[figure: the Client holds f_old and sends a request; the Server holds f_new and sends back an update]

 client wants to update an out-dated file
 server has the new file but does not know the old file
 update without sending the entire f_new (using similarity)
 rsync: file synch tool, distributed with Linux

Delta compression is a sort of “local” synch,
since the server has both copies of the files

The rsync algorithm
[figure: the Client sends the block hashes of f_old; the Server replies with the encoded file]

The rsync algorithm (contd)
 simple, widely used, single roundtrip
 optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
 choice of block size is problematic (default: max{700, √n} bytes)
 not good in theory: the granularity of changes may disrupt the use of blocks

Rsync: some experiments

          gcc size   emacs size
 total    27288      27326
 gzip     7563       8577
 zdelta   227        1431
 rsync    964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync
 The Server sends the hashes (unlike rsync, where the client does); the client checks them
 The Server deploys the common f_ref to compress the new f_tar (rsync compresses just it).

A multi-round protocol
 k blocks of n/k elements, log(n/k) levels
 If the distance is k, then on each level at most k hashes do not find a match in the other file.
 The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference
between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]
 Not perfectly true but...

 Recurring minimum for improving the estimate + 2 SBF

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

[figure: P aligned at position i of T, covering a prefix of the suffix T[i,N]]

Occurrences of P in T = all suffixes of T having P as a prefix

 P = si,  T = mississippi  →  occurrences at positions 4, 7
SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
T# = mississippi#

[figure: the suffix tree of T#; edge labels are substrings of T# (e.g. “ssi”, “ppi#”, “mississippi#”),
 and the leaves store the 12 starting positions of the suffixes]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
SUF(T) stored explicitly takes Θ(N^2) space

SA        SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array space:
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]

Locating the occurrences
[figure: the SA interval of P = si contains the two entries 7 and 4, i.e. occ = 2]

Suffix Array search:
• O(p + log2 N + occ) time

Suffix Trays: O(p + log2 |S| + occ)   [Cole et al., ’06]
String B-tree   [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays   [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

 T = mississippi#
 SA  = 12 11 8 5 2 1 10 9 7 4 6 3
 Lcp =    0  1 1 4 0 0  1 0 2 1 3

 e.g. the suffixes issippi# (position 5) and ississippi# (position 2) are adjacent in SA
 and share the prefix “issi” of length 4

• How long is the common prefix between T[i,...] and T[j,...] ?
  • Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  • Search for some Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  • Search for a window Lcp[i,i+C-2] whose entries are all ≥ L


1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
  ≤ 1 extra bit per macro-symbol = 1/k extra bits per symbol
  but a larger model has to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:

  the model takes |S|^k entries (k · log |S| bits each) + h^2 bits

  and H0(S^L) ≤ L · Hk(S) + O(k · log |S|), for each k ≤ L

  (where h might be as large as |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
[figure: word-based Huffman tree for T = "bzip or not bzip"; each tree symbol is a word
(e.g. "or"), the tree has fan-out 128, and every codeword is a sequence of 7-bit symbols
packed into bytes, with the first bit of the first byte used as a tag]

CGrep and other ideas...
P= bzip = 1a 0b

[figure: GREP on T vs compressed search on C(T), T = "bzip or not bzip"; the coded
pattern P = bzip = 1a 0b is compared against the byte-aligned, tagged codewords of
C(T), answering yes/no at each codeword]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary = { bzip, not, or, space }

P = bzip = 1a 0b

[figure: S = "bzip or not bzip" and its compressed form C(S); C(S) is scanned codeword
by codeword and each codeword is compared with the code of P, answering yes/no]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
[figure: pattern P = AB slid along the text T = ABCABDAB...]

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errs, whose running
time is efficient with high probability.

We will consider a binary alphabet (i.e., T ∈ {0,1}^n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

  H(s) = Σ_{i=1..m} 2^(m-i) · s[i]

P = 0101
H(P) = 2^3·0 + 2^2·1 + 2^1·0 + 2^0·1 = 5
s = s’  if and only if  H(s) = H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1):

  H(Tr) = 2·H(Tr-1) - 2^m·T(r-1) + T(r+m-1)

T = 10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H(T1) = H(1011) = 11
H(T2) = 2·11 - 2^4·1 + 0 = 22 - 16 + 0 = 6 = H(0110)

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally (Horner, reducing mod q at each step):
  1·2 (mod 7) + 0 = 2
  2·2 (mod 7) + 1 = 5
  5·2 (mod 7) + 1 = 4
  4·2 (mod 7) + 1 = 2
  2·2 (mod 7) + 1 = 5
  5 (mod 7) = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1), since
  2^m (mod q) = 2·(2^(m-1) (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
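A minimal Python sketch of the fingerprint search over a binary text; q is fixed to a small prime here just for illustration, whereas the algorithm above picks it at random below I:

def karp_rabin(T, P, q=101):
    """Report all 1-based positions r where P occurs in T; verification makes it deterministic."""
    n, m = len(T), len(P)
    if m > n:
        return []
    high = pow(2, m - 1, q)                     # 2^(m-1) mod q, used to drop the leading bit
    hp = ht = 0
    for i in range(m):                          # Horner, reducing mod q at each step
        hp = (2 * hp + int(P[i])) % q
        ht = (2 * ht + int(T[i])) % q
    occ = []
    for r in range(n - m + 1):
        if hp == ht and T[r:r + m] == P:        # fingerprints match -> verify (no false matches)
            occ.append(r + 1)
        if r + m < n:                           # slide the window: drop T[r], append T[r+m]
            ht = (2 * (ht - int(T[r]) * high) + int(T[r + m])) % q
    return occ

print(karp_rabin("10110101", "0101"))   # [5], as in the example above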

Problem 1: Solution
Dictionary = { bzip, not, or, space }

P = bzip = 1a 0b

[figure: the codeword stream C(S), S = "bzip or not bzip", scanned with the
string-matching algorithm just described, comparing each codeword against the code
of P and answering yes/no]
Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

[matrix figure: M has m rows (indexed by P = f, o, r) and n columns (indexed by
T = c, a, l, i, f, o, r, n, i, a); M(1,5) = M(2,6) = M(3,7) = 1, marking the occurrence
of "for" ending at position 7, and all other entries are 0]

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit bit-parallelism to
compute the j-th column of M from the (j-1)-th one.
We define the m-length binary vector U(x) for each
character x in the alphabet: U(x) has a 1 exactly in the
positions where character x appears in P.
Example:
P = abaac
  U(a) = (1,0,1,1,0)^T     U(b) = (0,1,0,0,0)^T     U(c) = (0,0,0,0,1)^T

How to construct M



Initialize column 0 of M to all zeros.
For j > 0, the j-th column is obtained as

  M(j) = BitShift( M(j-1) ) & U( T[j] )

For i > 1, entry M(i,j) = 1 iff
  (1) the first i-1 characters of P match the i-1 characters of T
      ending at character j-1, i.e. M(i-1,j-1) = 1
  (2) P[i] = T[j], i.e. the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) into the i-th position;
AND-ing it with the i-th bit of U(T[j]) establishes whether both conditions hold.
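The whole method fits in a few lines when one machine word (here an unbounded Python int) holds a column of M; a rough sketch:

def shift_and(T, P):
    """Return the 1-based ending positions of the occurrences of P in T."""
    m = len(P)
    # U[c] has bit i-1 set iff P[i] = c (bit 0 = first character of P).
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M, occ = 0, []
    for j, c in enumerate(T, start=1):
        # BitShift: shift the column down by one position and set the first bit to 1.
        M = ((M << 1) | 1) & U.get(c, 0)
        if M & (1 << (m - 1)):          # M(m, j) = 1  ->  an occurrence ends at j
            occ.append(j)
    return occ

print(shift_and("xabxabaaca", "abaac"))   # [9], as in the running example below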

An example j=1
T = xabxabaaca,  P = abaac,  M(0) = (0,0,0,0,0)^T

M(1) = BitShift(M(0)) & U(T[1]) = (1,0,0,0,0)^T & U(x) = (1,0,0,0,0)^T & (0,0,0,0,0)^T = (0,0,0,0,0)^T

An example j=2
T = xabxabaaca,  P = abaac

M(2) = BitShift(M(1)) & U(T[2]) = (1,0,0,0,0)^T & U(a) = (1,0,0,0,0)^T & (1,0,1,1,0)^T = (1,0,0,0,0)^T

An example j=3
T = xabxabaaca,  P = abaac

M(3) = BitShift(M(2)) & U(T[3]) = (1,1,0,0,0)^T & U(b) = (1,1,0,0,0)^T & (0,1,0,0,0)^T = (0,1,0,0,0)^T

An example j=9
T = xabxabaaca,  P = abaac

M(9) = BitShift(M(8)) & U(T[9]) = (1,1,0,0,1)^T & U(c) = (1,1,0,0,1)^T & (0,0,0,0,1)^T = (0,0,0,0,1)^T

The 5th (last) bit of column 9 is 1: an occurrence of P ends at position 9.

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
  U(a) = (1,0,1,1,0)^T     U(b) = (1,1,0,0,0)^T     U(c) = (0,0,0,0,1)^T

What about ‘?’ (single wildcard) and ‘[^…]’ (negated class)?

Problem 1: Another solution
Dictionary = { bzip, not, or, space }

P = bzip = 1a 0b

[figure: S = "bzip or not bzip"; the coded pattern is searched directly inside C(S)
with the Shift-And method, accepting matches only at tagged (codeword-beginning) bytes]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

Dictionary = { bzip, not, or, space },   P = o

[figure: S = "bzip or not bzip"; the dictionary terms containing "o" are
not = 1g 0g 0a  and  or = 1g 0a 0b, and C(S) is scanned once for each of them]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[figure: text T with the occurrences of the patterns P1 and P2 of the set P marked]

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

  S is the concatenation of the patterns in P
  R is a bitmap of length m:
    R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of the Shift-And method searching for S:

  For any symbol c, U’(c) = U(c) AND R
    U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
  For any step j,
    compute M(j)
    then set M(j) = M(j) OR U’(T[j]). Why?
      it sets to 1 the first bit of each pattern that starts with T[j]
    Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

Dictionary = { bzip, not, or, space },   P = bot, k = 2

[figure: S = "bzip or not bzip" and its compressed form C(S); each dictionary term is
compared against P allowing up to k mismatches]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
there are at most l mismatches between the
first i characters of P and the i characters of T
ending at character j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1

The first i-1 characters of P match a substring of T ending
at j-1 with at most l mismatches, and the next pair of
characters in P and T are equal:

  BitShift( M^l(j-1) ) & U( T[j] )

Computing Ml: case 2

The first i-1 characters of P match a substring of T ending
at j-1 with at most l-1 mismatches (the j-th pair may mismatch):

  BitShift( M^(l-1)(j-1) )

Computing Ml

We compute Ml for all l = 0, …, k.
For each j compute M(j), M1(j), …, Mk(j).
For all l, initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff case 1 or case 2 holds:

  M^l(j) = [ BitShift( M^l(j-1) ) & U( T[j] ) ]  OR  BitShift( M^(l-1)(j-1) )
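A bit-parallel sketch of this recurrence (again with Python ints as bit columns); it only illustrates the formula above and is not a tuned agrep implementation:

def shift_and_k_mismatches(T, P, k):
    """Return the 1-based ending positions where P occurs in T with at most k mismatches."""
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    last = 1 << (m - 1)
    M = [0] * (k + 1)                  # M[l] holds the current column of M^l
    occ = []
    for j, c in enumerate(T, start=1):
        prev = M[:]                    # columns of step j-1
        for l in range(k + 1):
            col = ((prev[l] << 1) | 1) & U.get(c, 0)      # case 1: characters match
            if l > 0:
                col |= ((prev[l - 1] << 1) | 1)           # case 2: spend one mismatch
            M[l] = col
        if M[k] & last:
            occ.append(j)
    return occ

print(shift_and_k_mismatches("aatatccacaa", "atcgaa", 2))   # [9]: the match with 2 mismatches starting at position 4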

Example M1
T = xabxabaaca,  P = abaad

[matrix figure: the 5×10 matrices M0 and M1 for this T and P; row 1 of M1 is all 1s
(with one allowed mismatch a one-character prefix always matches), and M1(5,9) = 1
because P = abaad occurs at T[5..9] = abaac with one mismatch]

How much do we pay?





The running time is O(k·n·(1+m/w)).
Again, the method is practically efficient for
small m.
Only O(k) columns of M are needed at any
given time; hence the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

Dictionary = { bzip, not, or, space },   P = bot, k = 2

[figure: the k-mismatch scan of C(S), S = "bzip or not bzip"; the term matching
P = bot with at most 2 mismatches is not = 1g 0g 0a]

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding
  write (Length - 1) zeroes, followed by x in binary
  where x > 0 and Length = ⌊log2 x⌋ + 1
  e.g., 9 is represented as <000, 1001>.

  the γ-code for x takes 2⌊log2 x⌋ + 1 bits
  (i.e. a factor of 2 from the optimal ⌊log2 x⌋ + 1)

  Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111   →   8, 6, 3, 59, 7
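A small Python sketch of γ-encoding and decoding, which can be used to check the exercise above:

def gamma_encode(x):
    """gamma-code of an integer x > 0: (len-1) zeroes followed by x in binary."""
    b = bin(x)[2:]                    # x in binary, floor(log2 x) + 1 bits
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    """Decode a concatenation of gamma-codes back into the list of integers."""
    out, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":         # count the unary prefix
            zeros += 1
            i += 1
        out.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return out

print("".join(gamma_encode(x) for x in [8, 6, 3, 59, 7]))
print(gamma_decode("0001000001100110000011101100111"))    # [8, 6, 3, 59, 7]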

Analysis
Sort the pi in decreasing order, and encode si via
the variable-length code γ(i).

Recall that: |γ(i)| ≤ 2·log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2·H0(s) + 1
Key fact:
  1 ≥ Σ_{i=1..x} pi ≥ x·px   ⇒   x ≤ 1/px

How good is it ?
Encode the integers via γ-coding:
  |γ(i)| ≤ 2·log i + 1
The cost of the encoding is (recall i ≤ 1/pi):

  Σ_{i=1..|S|} pi · |γ(i)|  ≤  Σ_{i=1..|S|} pi · [ 2·log(1/pi) + 1 ]  =  2·H0(X) + 1

Not much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n^2 log n),  MTF = O(n log n) + n^2

Not much worse than Huffman
...but it may be far better
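A minimal MTF encoder/decoder sketch (symbols here are single characters; the course later applies the same idea to words):

def mtf_encode(text, alphabet):
    L, out = list(alphabet), []
    for s in text:
        i = L.index(s)             # 1) output the position of s in L
        out.append(i)
        L.insert(0, L.pop(i))      # 2) move s to the front of L
    return out

def mtf_decode(codes, alphabet):
    L, out = list(alphabet), []
    for i in codes:
        s = L[i]
        out.append(s)
        L.insert(0, L.pop(i))
    return "".join(out)

enc = mtf_encode("mississippi", "imps")
print(enc)                         # [1, 1, 3, 0, 1, 1, 0, 1, 3, 0, 1]: repeats become small integers
print(mtf_decode(enc, "imps"))     # 'mississippi'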

MTF: how good is it ?
Encode the integers via γ-coding:
  |γ(i)| ≤ 2·log i + 1
Put the alphabet S in front and consider the cost of encoding
(p_x^i denotes the position of the i-th occurrence of symbol x, so the MTF value
emitted for it is at most the gap p_x^i - p_x^{i-1}):

  O(|S| log |S|) + Σ_{x=1..|S|} Σ_{i=2..nx} |γ( p_x^i - p_x^{i-1} )|

By Jensen’s inequality this is:

  ≤ O(|S| log |S|) + Σ_{x=1..|S|} nx · [ 2·log(N/nx) + 1 ]
  = O(|S| log |S|) + N · [ 2·H0(X) + 1 ]

Hence  La[mtf] ≤ 2·H0(X) + O(1)  bits per symbol.

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings ⇒ just the run lengths and one initial bit
Properties:

  Exploits spatial locality, and it is a dynamic code
  (there is a memory)

  X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = Θ(n^2 log n)  >  Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.  p(a) = .2, p(b) = .5, p(c) = .3

  f(i) = Σ_{j<i} p(j)     ⇒     f(a) = .0, f(b) = .2, f(c) = .7

[figure: the unit interval [0,1) split into a = [0,.2), b = [.2,.7), c = [.7,1)]

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
[figure: start from [0,1); coding b restricts to [.2,.7); then a restricts to [.2,.3);
then c restricts to [.27,.3)]

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c1 c2 ... cn with probabilities
p[c], use the following:

  l0 = 0      li = l(i-1) + s(i-1) · f[ci]
  s0 = 1      si = s(i-1) · p[ci]

f[c] is the cumulative probability up to symbol c (not included).
The final interval size is

  sn = Π_{i=1..n} p[ci]

The interval for a message sequence will be called the
sequence interval
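The recurrence above in a few lines of Python, reproducing the bac example:

def sequence_interval(msg, p, f):
    """Return (l, s): the sequence interval [l, l+s) of msg under probabilities p and cumulative f."""
    l, s = 0.0, 1.0
    for c in msg:
        l = l + s * f[c]      # shift to the symbol interval inside the current one
        s = s * p[c]          # shrink by the symbol probability
    return l, s

p = {"a": 0.2, "b": 0.5, "c": 0.3}
f = {"a": 0.0, "b": 0.2, "c": 0.7}
l, s = sequence_interval("bac", p, f)
print(l, l + s)               # about 0.27 and 0.30 -> the sequence interval [.27, .3)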

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
[figure: .49 falls in the symbol interval of b = [.2,.7); within it, in b = [.3,.55);
within that, in c = [.475,.55)]

The message is bbc.

Representing a real number
Binary fractional representation:
  .75 = .11      1/3 = .010101...      11/16 = .1011

Algorithm (emit the binary expansion of x ∈ [0,1)):
  1. x = 2·x
  2. if x < 1 output 0
  3. else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval?
e.g.  [0,.33) → .01      [.33,.66) → .1      [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions:

  code     min        max        interval
  .11      .110...    .111...    [.75, 1.0)
  .101     .1010...   .1011...   [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence interval = [.61, .79)
Code interval of .101 = [.625, .75)  ⊆  [.61, .79)

Can use l + s/2 truncated to 1 + ⌈log2 (1/s)⌉ bits

Bound on Arithmetic length

Note that  -log2 s + 1 = log2 (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most

  1 + ⌈log2 (1/s)⌉ = 1 + ⌈log2 Π_{i=1..n} (1/pi)⌉
                   ≤ 2 + Σ_{i=1..n} log2 (1/pi)
                   = 2 + Σ_{k=1..|S|} n·pk · log2 (1/pk)
                   = 2 + n·H0   bits

In practice it takes n·H0 + 0.02·n bits,
because of rounding.

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

[figure: the ATB takes the current interval (L,s) and a symbol c with distribution
(p1,...,p|S|), and outputs the refined interval (L’,s’); the decoder mirrors the same steps]

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
[figure: the same ATB state machine, now driven by p[ s | context ], where s is a
character c or the escape symbol esc; it maps (L,s) to (L’,s’)]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts          String = ACCBACCACBA, next symbol B,  k = 2

  Context: empty        Context (order 1)         Context (order 2)
    A = 4                 A:  C = 3, $ = 1          AC:  B = 1, C = 2, $ = 2
    B = 2                 B:  A = 2, $ = 1          BA:  C = 1, $ = 1
    C = 5                 C:  A = 1, B = 2,         CA:  C = 1, $ = 1
    $ = 3                     C = 2, $ = 3          CB:  A = 2, $ = 1
                                                    CC:  A = 1, B = 1, $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves
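A naive LZ77 sketch with a sliding window (quadratic scanning just for clarity; gzip uses hashing on triplets, as noted later):

def lz77_encode(text, window=6):
    """Emit (d, len, c) triples: copy `len` chars from `d` positions back, then append c."""
    i, out = 0, []
    while i < len(text):
        best_d, best_len = 0, 0
        for j in range(max(0, i - window), i):            # candidate copy sources in the window
            length = 0
            while i + length < len(text) - 1 and text[j + length] == text[i + length]:
                length += 1                               # may run past i: overlapping copies allowed
            if length > best_len:
                best_d, best_len = i - j, length
        nxt = text[i + best_len]                          # next char beyond the longest match
        out.append((best_d, best_len, nxt))
        i += best_len + 1
    return out

print(lz77_encode("aacaacabcabaaac"))
# [(0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), (3, 3, 'a'), (1, 2, 'c')], as in the example below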

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
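A compact LZW sketch; to keep it self-contained the initial dictionary holds only the three symbols of the example (with the same toy ids used in the slides), instead of the full 256 ASCII entries:

def lzw_encode(text, init):
    dic = dict(init)                      # string -> id
    nxt = 256                             # new entries start after the 256 ASCII codes, as in the slides
    out, S = [], ""
    for c in text:
        if S + c in dic:
            S += c                        # keep extending the current phrase
        else:
            out.append(dic[S])
            dic[S + c] = nxt              # add Sc to the dictionary (c itself is NOT emitted)
            nxt += 1
            S = c
    out.append(dic[S])                    # flush the last pending phrase
    return out

def lzw_decode(codes, init):
    inv = {v: k for k, v in dict(init).items()}     # id -> string
    nxt = 256
    prev = inv[codes[0]]
    out = [prev]
    for code in codes[1:]:
        # the special SSc case: the id was created by the encoder but not yet by the decoder
        cur = inv[code] if code in inv else prev + prev[0]
        out.append(cur)
        inv[nxt] = prev + cur[0]          # the decoder is one step behind the encoder
        nxt += 1
        prev = cur
    return "".join(out)

init = {"a": 112, "b": 113, "c": 114}
codes = lzw_encode("aabaacababacb", init)
print(codes)                   # [112, 112, 113, 256, 114, 257, 261, 114, 113] (the last 113 flushes the final 'b')
print(lzw_decode(codes, init)) # 'aabaacababacb'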

LZW: Encoding Example        T = a a b a a c a b a b a c b

  output 112 (a)     add 256 = aa
  output 112 (a)     add 257 = ab
  output 113 (b)     add 258 = ba
  output 256 (aa)    add 259 = aac
  output 114 (c)     add 260 = ca
  output 257 (ab)    add 261 = aba
  output 261 (aba)   add 262 = abac
  output 114 (c)     add 263 = cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows:

  F                      L
  #  mississipp  i
  i  #mississip  p
  i  ppi#missis  s
  i  ssippi#mis  s
  i  ssissippi#  m
  m  ississippi  #
  p  i#mississi  p
  p  pi#mississ  i
  s  ippi#missi  s
  s  issippi#mi  s
  s  sippi#miss  i
  s  sissippi#m  i

A famous example

Much
longer...

A useful tool: the L → F mapping

[same sorted-rotation matrix as above, with first column F and last column L]

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
[same sorted-rotation matrix as above, with columns F and L]

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
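A small Python sketch of the forward transform (the quadratic, "elegant but inefficient" sorted-rotations way) and of an inversion; note this sketch walks the text forward via the inverse mapping, while the InvertBWT pseudocode above walks it backward via LF:

def bwt(T):
    """T must end with a unique smallest sentinel such as '#'. Returns the last column L."""
    rotations = sorted(T[i:] + T[:i] for i in range(len(T)))
    return "".join(row[-1] for row in rotations)

def inverse_bwt(L):
    """Rebuild T from L using the correspondence between the F and L columns."""
    # table[r] = (F[r], row of the rotation starting one text position later), via a stable sort of L
    table = sorted((c, i) for i, c in enumerate(L))
    r = L.index("#")              # the row ending with the sentinel is T itself
    out = []
    for _ in range(len(L)):
        c, r = table[r]
        out.append(c)
    return "".join(out)

L = bwt("mississippi#")
print(L)                 # 'ipssm#pissii'
print(inverse_bwt(L))    # 'mississippi#'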

How to compute the BWT ?

  SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]

[figure: the rows of the BWT matrix are the suffixes of T in lexicographic order,
so row i starts at position SA[i] and its last character is L[i]]

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Q(n2 log n) time in the worst-case
• Q(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in - degree ( u )  k ] 

a  2.1

1

k

a

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides and efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
            Emacs size   Emacs time
  uncompr     27Mb          ---
  gzip         8Mb         35 secs
  zdelta     1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[figure: weighted graph over the files, plus a dummy node 0; edge weights are zdelta
sizes (620, 123, 2000, 220, 20, …) and the min branching picks the cheapest reference
for each file]

            space    time
  uncompr   30Mb     ---
  tgz       20%      linear
  THIS       8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
            space    time
  uncompr   260Mb    ---
  tgz       12%      2 mins
  THIS       8%      16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

            gcc size   emacs size
  total      27288       27326
  gzip        7563        8577
  zdelta       227        1431
  rsync        964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree

T# = mississippi#

[figure: the suffix tree of T#, with 12 leaves (one per suffix, labelled with its
starting position) and edge labels such as "si", "ssi", "ppi#", "i#", "p", "mississippi#"]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Storing SUF(T) explicitly would take Θ(N^2) space; the suffix array keeps only the starting positions:

  SA     = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
  SUF(T) = #, i#, ippi#, issippi#, ississippi#, mississippi#, pi#, ppi#, sippi#, sissippi#, ssippi#, ssissippi#

T = mississippi#          P = si

Suffix Array space:
  • SA: Θ(N log2 N) bits
  • Text T: N chars
   In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
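A sketch of the indirect binary search over SA (1-based positions as in the slides; SA is built naively here by sorting suffixes, which is fine for a toy text):

def suffix_array(T):
    # 1-based starting positions of the suffixes, sorted lexicographically (naive build)
    return sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])

def occurrences(T, SA, P):
    """All 1-based positions where P occurs in T: the suffixes prefixed by P are contiguous in SA."""
    n, m = len(SA), len(P)
    lo, hi = 0, n
    while lo < hi:                                        # leftmost suffix >= P
        mid = (lo + hi) // 2
        if T[SA[mid] - 1:] < P:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = n
    while lo < hi:                                        # leftmost suffix whose first m chars are > P
        mid = (lo + hi) // 2
        if T[SA[mid] - 1:SA[mid] - 1 + m] <= P:
            lo = mid + 1
        else:
            hi = mid
    return sorted(SA[first:lo])

T = "mississippi#"
SA = suffix_array(T)
print(SA)                         # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(occurrences(T, SA, "si"))   # [4, 7]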

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does it exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does it exist a substring of length ≥ L occurring ≥ C times ?
• Search for Lcp[i,i+C-2] whose entries are ≥ L



The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
  C · p · f/(1+f)
This is at least 10^4 · f/(1+f)
If we fetch B ≈ 4Kb in time C, and the algorithm uses all of them:
  (1/B) · (p · f/(1+f) · C)  ≈  30 · f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
  n      4K    8K    16K   32K    128K   256K   512K   1M
  n^3    22s   3m    26m   3.5h   28h    --     --     --
  n^2    0     0     0     1s     26s    106s   7m     28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT
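The scan above in runnable form (a sketch of the one-pass solution, using the array from the slide):

def max_subarray_sum(A):
    best, running = 0, 0
    for x in A:
        if running + x <= 0:
            running = 0               # the optimum never starts with a non-positive prefix
        else:
            running += x
            best = max(best, running)
    return best

A = [2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]
print(max_subarray_sum(A))            # 12, from the subarray 6 1 -2 4 3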

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort ⇒ Θ(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A, i, j)
01  if (i < j) then
02      m = (i + j) / 2;            // Divide
03      Merge-Sort(A, i, m);        // Conquer
04      Merge-Sort(A, m+1, j);
05      Merge(A, i, m, j)           // Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Θ(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N levels

If the run-size is larger than B (i.e. after the first step!!),
fetching all of it in memory for merging does not help.

[figure: the recursion tree of mergesort over a small key set; runs at one level are
pairwise merged into runs of double length at the level above]

How do we deploy the disk/memory features?

With internal memory M: N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging

[figure: X = M/B input runs are read through one B-sized buffer each (Bf1 .. BfX);
at every step the minimum of the buffer heads min(Bf1[p1], Bf2[p2], …, BfX[pX]) is
moved to the output buffer Bfo; a buffer is refilled from disk when its pointer
reaches B, and Bfo is flushed to the merged output run when full]

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Θ( (N/B) · log_{M/B} (N/M) ) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) { X=s; C=1; } }

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...
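The majority-candidate scan in code (a sketch; the answer is guaranteed correct only if some item really occurs more than N/2 times):

def majority_candidate(stream):
    X, C = None, 0
    for s in stream:
        if X == s:
            C += 1
        elif C == 0:
            X, C = s, 1           # adopt the new item as the current candidate
        else:
            C -= 1                # pair off one candidate occurrence with a different item
    return X

A = "bacccdcbaaaccbccc"
print(majority_candidate(A))      # 'c' (it occurs 9 times out of 17, more than N/2)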

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
t = 500K terms  ×  n = 1 million docs

              Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
  Antony             1                1             0          0        0        1
  Brutus             1                1             0          1        0        0
  Caesar             1                1             0          1        1        1
  Calpurnia          0                1             0          0        0        0
  Cleopatra          1                0             0          0        0        0
  mercy              1                0             1          1        1        1
  worser             1                0             1          1        1        0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
(figure: the word-based Huffman tree with fan-out 128 over the words of T = “bzip or not bzip”; each codeword is a sequence of 7-bit symbols packed into bytes, and the tag bit marks the first byte of each codeword — e.g. bzip ⟶ 1a 0b in C(T))

CGrep and other ideas...
P= bzip = 1a 0b

(figure: compressed search, GREP-style — scan C(T), T = “bzip or not bzip”, for the codeword of P; tagged bytes mark codeword beginnings, so candidate positions are checked and reported yes/no)

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary = {bzip, not, or, space}
P = bzip = 1a 0b
(figure: scan of C(S), S = “bzip or not bzip”, for the codeword of P; each candidate position is marked yes/no)

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
(figure: pattern P = A B aligned against text T = … A B C A B D A B …, checking each position of T)

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which arithmetic and bit operations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errs, whose running
time is efficient with high probability.

We will consider a binary alphabet (i.e., T ∈ {0,1}ⁿ)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H(s) = Σ_{i=1..m} 2^{m−i} · s[i]

P = 0101
H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5
s = s’ if and only if H(s) = H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr−1):

H(Tr) = 2·H(Tr−1) − 2^m·T(r−1) + T(r+m−1)

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H(T1) = H(1011) = 11
H(T2) = H(0110) = 2·11 − 2⁴·1 + 0 = 22 − 16 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally (Horner's rule, mod 7):
1·2 + 0 = 2
2·2 + 1 = 5
5·2 + 1 = 11 ≡ 4
4·2 + 1 = 9 ≡ 2
2·2 + 1 = 5  ⟹  Hq(P) = 5

We can still compute Hq(Tr) from Hq(Tr−1), since
2^m (mod q) = 2·(2^{m−1} (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
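Putting the pieces together, a minimal Python sketch of the fingerprint scan over a binary text; here q is a fixed prime chosen for illustration, whereas the algorithm above picks q as a random prime ≤ I, and every fingerprint hit is verified, so no false match is reported:

def karp_rabin(T, P, q=2_147_483_647):
    # T and P are strings over {0,1}; fingerprints are taken modulo the prime q.
    n, m = len(T), len(P)
    if m > n:
        return []
    hP = hT = 0
    for i in range(m):                    # Hq of P and of T_1, via Horner's rule
        hP = (2 * hP + int(P[i])) % q
        hT = (2 * hT + int(T[i])) % q
    top = pow(2, m - 1, q)                # 2^(m-1) mod q, used to drop the leading bit
    occ = []
    for r in range(n - m + 1):
        if hT == hP and T[r:r+m] == P:    # verification step => definite matches only
            occ.append(r + 1)             # 1-based positions, as in the slides
        if r + m < n:                     # roll the window: drop T[r], append T[r+m]
            hT = (2 * (hT - int(T[r]) * top) + int(T[r + m])) % q
    return occ

print(karp_rabin("10110101", "0101"))     # [5], the slides' example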

Problem 1: Solution
Dictionary = {bzip, not, or, space}
P = bzip = 1a 0b
(figure: scan of C(S), S = “bzip or not bzip”; candidate codeword-aligned positions are checked and the two true occurrences of bzip are reported)

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

(figure: the m×n binary matrix M for T = california and P = for; the only 1 in the last row is M(3,7), because P = for matches the 3 characters of T ending at position 7)

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size (e.g., 32 or 64 bits). We’ll assume
m ≤ w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the (j−1)-th one.
We define the m-length binary vector U(x) for each
character x in the alphabet: U(x) is set to 1 for the
positions in P where character x appears.
Example:
P = abaac
U(a) = (1,0,1,1,0)ᵀ   U(b) = (0,1,0,0,0)ᵀ   U(c) = (0,0,0,0,1)ᵀ

How to construct M



Initialize column 0 of M to all zeros
For j > 0, the j-th column is obtained by

M(j) = BitShift(M(j−1)) & U(T[j])

For i > 1, entry M(i,j) = 1 iff
(1) the first i−1 characters of P match the i−1 characters of T
ending at character j−1   ⇔ M(i−1,j−1) = 1
(2) P[i] = T[j]           ⇔ the i-th bit of U(T[j]) = 1

BitShift moves bit M(i−1,j−1) into the i-th position;
ANDing it with the i-th bit of U(T[j]) establishes whether both hold.
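A minimal Python sketch of the whole method, keeping each column of M as an integer bit-mask (bit i−1 stands for row i); the pattern is assumed short enough that this integer plays the role of the memory word:

def shift_and(T, P):
    m = len(P)
    U = {}                                   # U[c] has bit i set iff P[i+1] == c
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    full = 1 << (m - 1)                       # bit of row m: an occurrence ends here
    col = 0
    occ = []
    for j, c in enumerate(T, 1):
        col = ((col << 1) | 1) & U.get(c, 0)  # M(j) = BitShift(M(j-1)) & U(T[j])
        if col & full:
            occ.append(j - m + 1)             # occurrence starting here (1-based)
    return occ

print(shift_and("xabxabaaca", "abaac"))       # [5]: P ends at position 9 of T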

An example
T = xabxabaaca, P = abaac
(figures: the columns M(1), M(2), M(3), …, M(9), each computed as M(j) = BitShift(M(j−1)) & U(T[j]); at j = 9 the last bit of the column becomes 1, i.e. M(5,9) = 1, signalling an occurrence of P ending at position 9)

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: Another solution
Dictionary = {bzip, not, or, space}
P = bzip = 1a 0b
(figure: scan of C(S), S = “bzip or not bzip”, for the codeword of P; candidate positions are marked yes/no)

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

Dictionary = {bzip, not, or, space}
P = o
Terms containing P: or = 1g 0a 0b, not = 1g 0g 0a
(figure: scan of C(S), S = “bzip or not bzip”, marking the occurrences of both codewords)

Speed ≈ Compression ratio? No! Why?
A scan of C(S) is needed for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
(figure: a text T with the occurrences of two patterns P1 and P2 marked)

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

S is the concatenation of the patterns in P
R is a bitmap of length m:
 R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of the Shift-And method, searching for S:
 For any symbol c, U’(c) = U(c) AND R
  U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
 For any step j,
  compute M(j)
  then M(j) OR U’(T[j]). Why?
  It sets to 1 the first bit of each pattern that starts with T[j]
  Check if there are occurrences ending in j. How?

Problem 3
Dictionary = {bzip, not, or, space}

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

P = bot, k = 2
(figure: C(S) for S = “bzip or not bzip”)

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal: given k, find all the occurrences of P
in T with up to k mismatches.
We define the matrix M^l to be an m by n binary
matrix, such that:

M^l(i,j) = 1 iff
there are no more than l mismatches between the
first i characters of P and the i characters of T
ending at character j.

What is M⁰?
How does M^k solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

(figure: the first i−1 characters of P aligned with the characters of T ending at j−1)

BitShift(M^l(j−1)) & U(T[j])

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

(figure: the first i−1 characters of P aligned with the characters of T ending at j−1)

BitShift(M^{l−1}(j−1))

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute M^l(j), we combine the two cases above:

M^l(j) = [BitShift(M^l(j−1)) & U(T[j])]  OR  BitShift(M^{l−1}(j−1))
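A minimal sketch of the resulting k-mismatch search, keeping one bit-column per error level and applying the recurrence above; the example is the one of the previous slides:

def agrep_mismatches(T, P, k):
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    full = 1 << (m - 1)
    M = [0] * (k + 1)                        # M[l] is the current column of M^l
    occ = []
    for j, c in enumerate(T, 1):
        prev = list(M)                       # columns at position j-1
        M[0] = ((prev[0] << 1) | 1) & U.get(c, 0)
        for l in range(1, k + 1):
            # [BitShift(M^l(j-1)) & U(T[j])] OR BitShift(M^(l-1)(j-1))
            M[l] = (((prev[l] << 1) | 1) & U.get(c, 0)) | ((prev[l - 1] << 1) | 1)
        if M[k] & full:
            occ.append(j - m + 1)            # P occurs here with <= k mismatches
    return occ

print(agrep_mismatches("aatatccacaa", "atcgaa", 2))   # [4]: the slides' 2-mismatch occurrence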

Example
T = xabxabaaca, P = abaad, k = 1
(figures: the matrices M⁰ and M¹ column by column; M¹(5,9) = 1, so P occurs ending at position 9 of T with at most one mismatch)

How much do we pay?

The running time is O(k·n·(1 + m/w)).
Again, the method is practically efficient for small m.
Only O(k) columns of M are needed at any given time.
Hence, the space used by the algorithm is O(k) memory words (for m ≤ w).

Problem 3: Solution
Dictionary = {bzip, not, or, space}

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

P = bot, k = 2
(figure: scan of C(S), S = “bzip or not bzip”; the term not = 1g 0g 0a matches P = bot within k mismatches)

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = the minimum number of operations
needed to transform p into s via three ops:
 Insertion: insert a symbol in p
 Deletion: delete a symbol from p
 Substitution: change a symbol of p into a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding
γ(x) = 0^{Length−1} followed by x in binary,
where x > 0 and Length = ⌊log₂ x⌋ + 1
e.g., 9 is represented as <000, 1001>.

γ-code for x takes 2·⌊log₂ x⌋ + 1 bits
(i.e. a factor of 2 from optimal)

Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…

Given the following sequence of γ-coded
integers, reconstruct the original sequence:
0001000001100110000011101100111
Answer: 8, 6, 3, 59, 7
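A small sketch of γ-encoding and of the decoding exercise above (integers are assumed positive; bit strings are handled as Python strings for clarity):

def gamma_encode(x):
    # gamma(x) = (Length-1) zeros, then x in binary; Length = floor(log2 x) + 1
    assert x > 0
    b = bin(x)[2:]
    return '0' * (len(b) - 1) + b

def gamma_decode(bits):
    # Decodes a whole sequence of concatenated gamma codes.
    out, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == '0':
            zeros += 1
            i += 1
        out.append(int(bits[i:i + zeros + 1], 2))   # zeros+1 bits give the value
        i += zeros + 1
    return out

print(gamma_encode(9))                                     # 0001001
print(gamma_decode("0001000001100110000011101100111"))     # [8, 6, 3, 59, 7]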

Analysis
Sort the pi in decreasing order, and encode si via
the variable-length code γ(i).

Recall that: |γ(i)| ≤ 2·log₂ i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2·H0(S) + 1
Key fact:
1 ≥ Σ_{i=1..x} pi ≥ x·px  ⟹  x ≤ 1/px

How good is it ?
Encode the integers via γ-coding:
|γ(i)| ≤ 2·log₂ i + 1
The cost of the encoding is (recall i ≤ 1/pi):

Σ_{i=1..|S|} pi · |γ(i)| ≤ Σ_{i=1..|S|} pi · [2·log₂(1/pi) + 1] = 2·H0(X) + 1

Not much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte,
s·c with 2 bytes, s·c² with 3 bytes, ...

An example

5000 distinct words
ETDC encodes 128 + 128² = 16512 words within 2 bytes
(230,26)-dense code encodes 230 + 230·26 = 6210 within 2
bytes, hence more words fit on 1 byte, which pays off on skewed distributions...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 256 − s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:
 Exploits temporal locality, and it is dynamic
 X = 1ⁿ2ⁿ3ⁿ…nⁿ  ⟹  Huff = O(n² log n) bits, MTF = O(n log n) + n² bits

Not much worse than Huffman
...but it may be far better
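A minimal sketch of the MTF encoder (1-based positions, a plain list instead of the search tree / hash table discussed a few slides below):

def mtf_encode(text, alphabet):
    # Outputs, for each symbol, its current position (1-based) in the list L,
    # then moves the symbol to the front of L.
    L = list(alphabet)
    out = []
    for s in text:
        pos = L.index(s)           # O(|alphabet|) here; a balanced tree gives O(log |S|)
        out.append(pos + 1)
        L.pop(pos)
        L.insert(0, s)
    return out

print(mtf_encode("aaabbbccc", "abc"))   # [1, 1, 1, 2, 1, 1, 3, 1, 1]: runs become runs of 1s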

MTF: how good is it ?
Encode the integers via the γ-code: |γ(i)| ≤ 2·log₂ i + 1
Put S in front of the list and consider the cost of encoding; if p_x^1 < p_x^2 < … are the positions of the n_x occurrences of symbol x, the cost is at most

O(|S| log |S|) + Σ_{x=1..|S|} Σ_{i=2..n_x} |γ(p_x^i − p_x^{i−1})|

By Jensen’s inequality:
≤ O(|S| log |S|) + Σ_{x=1..|S|} n_x · [2·log₂(N/n_x) + 1]
= O(|S| log |S|) + N·[2·H0(X) + 1]

Hence La[mtf] ≤ 2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep the MTF-list efficiently:
 Search tree
  Leaves contain the symbols, ordered as in the MTF-list
  Nodes contain the size of their descending subtree
 Hash table
  key is a symbol
  data is a pointer to the corresponding tree leaf

Each tree operation takes O(log |S|) time
Total is O(n log |S|), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings, just the run lengths and one bit suffice
Properties:
 Exploits spatial locality, and it is a dynamic code (there is a memory)
 X = 1ⁿ2ⁿ3ⁿ…nⁿ  ⟹  Huff(X) = Θ(n² log n) > Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.
f(i) = Σ_{j<i} p(j)

e.g. p(a) = .2, p(b) = .5, p(c) = .3  ⟹  f(a) = .0, f(b) = .2, f(c) = .7
(figure: the unit interval [0,1) split into a = [0,.2), b = [.2,.7), c = [.7,1))

The interval for a particular symbol will be called
the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
(figure: nested intervals — start from [0,1); after b the interval is [.2,.7); after a it is [.2,.3); after c it is [.27,.3))
The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l₀ = 0,  lᵢ = lᵢ₋₁ + sᵢ₋₁ · f[cᵢ]
s₀ = 1,  sᵢ = sᵢ₋₁ · p[cᵢ]

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is
sₙ = ∏_{i=1..n} p[cᵢ]

The interval for a message sequence will be called the
sequence interval
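A tiny floating-point sketch of the interval computation (real coders use the integer version discussed later; rounding is ignored here, and the probability tables are the ones of the example):

def arith_interval(msg, p, f):
    # l_i = l_{i-1} + s_{i-1} * f[c_i],  s_i = s_{i-1} * p[c_i]
    l, s = 0.0, 1.0
    for c in msg:
        l = l + s * f[c]
        s = s * p[c]
    return l, s

p = {'a': .2, 'b': .5, 'c': .3}
f = {'a': .0, 'b': .2, 'c': .7}
print(arith_interval("bac", p, f))   # about (0.27, 0.03), i.e. the interval [.27, .3)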

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
(figure: .49 falls in b = [.2,.7); within it, .49 falls in b’s sub-interval [.3,.55); within that, in c’s sub-interval [.475,.55))

The message is bbc.

Representing a real number
Binary fractional representation:
.75 = .11     1/3 = .010101…     11/16 = .1011

Algorithm
1. x = 2·x
2. If x < 1 output 0
3. else x = x − 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval?
e.g. [0,.33) → .01    [.33,.66) → .1    [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
        min       max       interval
.11     .110…     .111…     [.75, 1.0)
.101    .1010…    .1011…    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

(figure: sequence interval [.61, .79); the code interval of .101 is [.625, .75), which is contained in it)

Can use L + s/2 truncated to 1 + ⌈log₂ (1/s)⌉ bits

Bound on Arithmetic length

Note that −log₂ s + 1 = log₂ (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + ⌈log₂ (1/s)⌉ = 1 + ⌈log₂ ∏_{i=1..n} (1/pᵢ)⌉
≤ 2 + Σ_{i=1..n} log₂ (1/pᵢ)
= 2 + Σ_{k=1..|S|} n·p_k·log₂ (1/p_k)
= 2 + n·H0 bits

≈ n·H0 + 0.02·n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 (top half):
 Output 1 followed by m 0s; set m = 0
 Message interval is expanded by 2

If u < R/2 (bottom half):
 Output 0 followed by m 1s; set m = 0
 Message interval is expanded by 2

If l ≥ R/4 and u < 3R/4 (middle half):
 Increment m
 Message interval is expanded by 2

All other cases: just continue...

You find this at

Arithmetic ToolBox
As a state machine
(figure: the ATB maps the current state (L,s) and a symbol c, with distribution (p₁,…,p_{|S|}), to the new state (L’,s’), i.e. the sub-interval of [L, L+s) assigned to c)
Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
(figure: the ATB is driven by the conditional distribution p[s | context], where s = c or esc; state (L,s) ⟶ (L’,s’))

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
String = ACCBACCACBAB,  k = 2

Context Empty:  A = 4, B = 2, C = 5, $ = 3
Context A:   C = 3, $ = 1
Context B:   A = 2, $ = 1
Context C:   A = 1, B = 2, C = 2, $ = 3
Context AC:  B = 1, C = 2, $ = 2
Context BA:  C = 1, $ = 1
Context CA:  C = 1, $ = 1
Context CB:  A = 2, $ = 1
Context CC:  A = 1, B = 1, $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
T = a a c a a c a b c a b a a a c,   window size = 6

(0,0,a)   (1,1,c)   (3,4,b)   (3,3,a)   (1,2,c)

Each triple is (distance of the longest match within W, its length, next character).
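A minimal sketch of the encoder with a fixed-size window, reproducing the example above; the window constraint is interpreted as “the match must start within the last W characters”, and overlapping copies are allowed:

def lz77_encode(T, W=6):
    # Emits (d, len, c): d = backward distance of the longest match starting in the
    # last W characters, len = its length, c = the next character beyond the match.
    out, i, n = [], 0, len(T)
    while i < n:
        best_d = best_len = 0
        for d in range(1, min(i, W) + 1):          # candidate starting points in the window
            l = 0
            while i + l < n - 1 and T[i + l] == T[i - d + l]:
                l += 1                              # overlap with the part being encoded is allowed
            if l > best_len:
                best_d, best_len = d, l
        out.append((best_d, best_len, T[i + best_len]))
        i += best_len + 1                           # advance by len + 1
    return out

print(lz77_encode("aacaacabcabaaac"))
# [(0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), (3, 3, 'a'), (1, 2, 'c')] as in the example above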

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
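A minimal sketch of the LZW decoder; the dictionary is initialized with the 256 byte values (the slide's id a = 112 is just an illustrative toy mapping), codes are assumed to be plain integers (no variable bit-width, no dictionary reset), and the special self-referential case is the `code == next_code` branch:

def lzw_decode(codes):
    dic = {i: chr(i) for i in range(256)}
    next_code = 256
    prev = dic[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in dic:
            entry = dic[code]
        elif code == next_code:          # the S...S[0] case: decoder is one step behind
            entry = prev + prev[0]
        else:
            raise ValueError("invalid LZW code")
        out.append(entry)
        dic[next_code] = prev + entry[0]  # what the encoder added one step earlier
        next_code += 1
        prev = entry
    return "".join(out)

# On the slides' example the special branch fires exactly at code 261 ("one step later").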

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
We are given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows  (1994)
(figure: the sorted rotation matrix of T = mississippi#; its first column is F = # i i i i m p p s s s s and its last column — the BWT output — is L = i p s s m # p i s s i i)

A famous example

Much
longer...

A useful tool: the L → F mapping
(figure: the sorted matrix again, with F = # i i i i m p p s s s s (known, since it is the sorted L) and L = i p s s m # p i s s i i; the rest of the matrix is unknown)

How do we map L’s chars onto F’s chars?
... We need to distinguish equal chars in F...

Take two equal chars of L and rotate their rows rightward by one position:
they keep the same relative order !!

The BWT is invertible
(figure: the sorted matrix with F = # i i i i m p p s s s s and L = i p s s m # p i s s i i; the middle columns are unknown)
Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
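A naive Python sketch of both directions, assuming T ends with a unique smallest terminator ‘#’; the construction sorts all rotations explicitly (the Θ(n² log n) approach criticised on the next slides), and the inversion follows the two key properties above:

def bwt(T):
    # Sort all rotations of T and take the last column.
    n = len(T)
    rots = sorted(T[i:] + T[:i] for i in range(n))
    return ''.join(r[-1] for r in rots)

def ibwt(L):
    n = len(L)
    order = sorted(range(n), key=lambda i: (L[i], i))   # stable sort of L's chars = F
    LF = [0] * n
    for rank, i in enumerate(order):
        LF[i] = rank                                    # row of F holding the char L[i]
    out = []
    r = 0                                               # row 0 starts with the terminator
    for _ in range(n):
        out.append(L[r])                                # L[r] precedes F[r] in T
        r = LF[r]
    T = ''.join(reversed(out))                          # "#mississippi"
    return T[1:] + T[0]                                 # rotate the terminator to the end

L = bwt("mississippi#")
print(L)          # ipssm#pissii
print(ibwt(L))    # mississippi#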

How to compute the BWT ?
SA = 12 11 8 5 2 1 10 9 7 4 6 3
(figure: the BWT matrix, whose rows are the rotations sorted by their starting suffix, together with the column L = i p s s m # p i s s i i)

We said that: L[i] precedes F[i] in T
e.g. L[3] = T[SA[3] − 1] = T[7]
Given SA and T, we have L[i] = T[SA[i] − 1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion pages available (Google 7/08)
5–40K per page ⟹ hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


The largest artifact ever conceived by humankind



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution

Altavista crawl, 1999 — WebBase crawl, 2001
Indegree follows a power-law distribution:

Pr[in-degree(u) = k] ∝ 1/k^α,   α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:
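A tiny sketch of the gap encoding of a successor list according to the formula above; the node id and its successors are hypothetical, and the remapping of negative entries (as well as the copy-lists/blocks below) is not shown:

def gap_encode(x, successors):
    # S(x) = {s1 - x, s2 - s1 - 1, ..., sk - s_{k-1} - 1}:
    # the first gap is relative to the source node x, the others to the previous successor.
    s = sorted(successors)
    gaps = [s[0] - x] + [s[i] - s[i - 1] - 1 for i in range(1, len(s))]
    return gaps        # the first entry may be negative; WebGraph remaps it (not shown here)

print(gap_encode(15, [13, 15, 16, 17, 18, 19, 23, 24, 203]))
# [-2, 1, 0, 0, 0, 0, 3, 0, 178] — locality makes most gaps tiny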

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77-scheme provides an efficient, optimal solution:

fknown is the “previously encoded text”: compress the concatenation fknown·fnew, starting from fnew

zdelta is one of the best implementations
          Emacs size    Emacs time
uncompr   27Mb          ---
gzip      8Mb           35 secs
zdelta    1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

(figure: the weighted graph GF plus the dummy node 0; edge weights are zdelta sizes, e.g. 620, 2000, 220, 123, 20, …; the min branching picks the cheapest reference for each file)

          space    time
uncompr   30Mb     ---
tgz       20%      linear
THIS      8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strictly n² time.

          space    time
uncompr   260Mb    ---
tgz       12%      2 mins
THIS      8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

          gcc size    emacs size
total     27288       27326
gzip      7563        8577
zdelta    227         1431
rsync     964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends hashes (unlike the client in rsync), and the client checks them.
The server deploys the common fref to compress the new ftar (rsync just compresses it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

(figure: P aligned at position i of T, i.e. P is a prefix of the suffix T[i,N])
Occurrences of P in T = All suffixes of T having P as a prefix
Example: P = si, T = mississippi ⟹ occurrences at positions 4, 7
SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
(figure: the suffix tree of T# = mississippi#; edges are labelled with substrings such as ssi, ppi#, si, i#, mississippi#, and the 12 leaves store the starting positions 1..12 of the suffixes)

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
5

Θ(N²) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Θ(N log₂ N) bits
• Text T: N chars
⟹ In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirect binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log₂ N) binary-search steps
• Each step takes O(p) char cmps
⟹ overall, O(p · log₂ N) time
Improvable to O(p + log₂ N) [Manber-Myers, ’90] and to O(p + log |S|) [Cole et al, ’06]
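A minimal sketch of the indirect binary search on a (naively built) suffix array; positions are 1-based as in the slides:

def sa_build(T):
    # Simple O(N^2 log N) construction, enough for small texts.
    return sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])

def sa_search(T, SA, P):
    # Indirect binary search on SA: each comparison costs O(p) char cmps.
    def pref(i):                       # first |P| chars of the i-th suffix
        return T[i - 1:i - 1 + len(P)]
    lo, hi = 0, len(SA)
    while lo < hi:                     # leftmost suffix whose prefix is >= P
        mid = (lo + hi) // 2
        if pref(SA[mid]) < P: lo = mid + 1
        else: hi = mid
    first = lo
    lo, hi = first, len(SA)
    while lo < hi:                     # leftmost suffix whose prefix is > P
        mid = (lo + hi) // 2
        if pref(SA[mid]) <= P: lo = mid + 1
        else: hi = mid
    return sorted(SA[first:lo])        # starting positions of all occurrences

T = "mississippi#"
SA = sa_build(T)
print(SA)                        # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(sa_search(T, SA, "si"))    # [4, 7]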

Locating the occurrences
(figure: the SA of T = mississippi#; the suffixes prefixed by si — sippi… at position 7 and sissippi… at position 4 — are contiguous in SA, between the lexicographic positions of si# and si$)
occ = 2, at positions 4 and 7

Suffix Array search
• O(p + log₂ N + occ) time

Suffix Trays: O(p + log₂ |S| + occ)  [Cole et al., ’06]
String B-tree  [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays  [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
(e.g. the suffixes issippi… and ississippi… share a common prefix of length 4)
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does it exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does it exist a substring of length ≥ L occurring ≥ C times ?
• Search for Lcp[i,i+C-2] whose entries are ≥ L


Slide 107

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary = {bzip, not, or, space}
P = bzip = 1a 0b
S = “bzip or not bzip”

[figure: P’s codeword is searched directly, byte-aligned, in C(S), answering yes/no at each codeword boundary]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
[figure: pattern P = AB slid along the text T = ABCABDAB...]

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching

We show methods in which Arithmetic and Bit-operations replace comparisons.
We will survey two examples of such methods:
 The Random Fingerprint method due to Karp and Rabin
 The Shift-And method due to Baeza-Yates and Gonnet

Rabin-Karp Fingerprint

We will use a class of functions from strings to integers in order to obtain:
 An efficient randomized algorithm that makes an error with small probability.
 A randomized algorithm that never errs, whose running time is efficient with high probability.

We will consider a binary alphabet (i.e., T ∈ {0,1}^n).

Arithmetic replaces Comparisons

Strings are also numbers, H: strings → numbers.
Let s be a string of length m:

 H(s) = Σ_{i=1..m} 2^(m-i) * s[i]

P = 0101
H(P) = 2³*0 + 2²*1 + 2¹*0 + 2⁰*1 = 5
s = s’ if and only if H(s) = H(s’)

Definition: let Tr denote the m-length substring of T starting at position r (i.e., Tr = T[r, r+m-1]).

Arithmetic replaces Comparisons

Strings are also numbers, H: strings → numbers.
 Exact match = scan T and compare H(Ti) and H(P)
 There is an occurrence of P starting at position r of T if and only if H(P) = H(Tr)

T = 10110101, P = 0101, H(P) = 5
 P aligned at position 2: H(T2) = H(0110) = 6 ≠ H(P)
 P aligned at position 5: H(T5) = H(0101) = 5 = H(P)  → Match!

Arithmetic replaces Comparisons

We can compute H(Tr) from H(Tr-1):

 H(Tr) = 2 * H(Tr-1) - 2^m * T(r-1) + T(r+m-1)

T = 10110101
T1 = 1011, T2 = 0110
H(T1) = H(1011) = 11
H(T2) = 2*11 - 2⁴*1 + 0 = 22 - 16 + 0 = 6 = H(0110)

Arithmetic replaces Comparisons

A simple efficient algorithm:
 Compute H(P) and H(T1)
 Run over T, computing H(Tr) from H(Tr-1) in constant time, and make the comparisons (i.e., H(P) = H(Tr)).

Total running time O(n+m)?
 NO! Why? When m is large, it is unreasonable to assume that each arithmetic operation can be done in O(1) time.
 Values of H() are m-bit long numbers. In general, they are too BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s is defined by Hq(s) = H(s) (mod q)

An example
P = 101111, q = 7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally:
 1*2 + 0 ≡ 2 (mod 7)
 2*2 + 1 ≡ 5 (mod 7)
 5*2 + 1 ≡ 4 (mod 7)
 4*2 + 1 ≡ 2 (mod 7)
 2*2 + 1 ≡ 5 (mod 7)  = Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1), since
 2^m (mod q) = 2 * (2^(m-1) (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint

How about the comparisons?
 Arithmetic: there is an occurrence of P starting at position r of T if and only if H(P) = H(Tr).
 Modular arithmetic: if there is an occurrence of P starting at position r of T, then Hq(P) = Hq(Tr).
 False match! There are values of q for which the converse is not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!

Our goal will be to choose a modulus q such that
 q is small enough to keep computations efficient (i.e., Hq()s fit in a machine word)
 q is large enough so that the probability of a false match is kept small
Karp-Rabin fingerprint algorithm

 Choose a positive integer I
 Pick a random prime q less than or equal to I, and compute P’s fingerprint Hq(P).
 For each position r in T, compute Hq(Tr) and test whether it equals Hq(P). If the numbers are equal, either
  declare a probable match (randomized algorithm),
  or check and declare a definite match (deterministic algorithm).

Running time: excluding verification, O(n+m).
 The randomized algorithm is correct w.h.p.
 The deterministic algorithm has expected running time O(n+m).

Proof on the board
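A small sketch of the scanner with verification (a fixed modulus q is hard-coded here just to keep the example short; the slides pick a random prime instead):

def karp_rabin(T, P, q=2**31 - 1):
    # T, P: strings over {0,1} (or any alphabet mapped to small integers)
    n, m = len(T), len(P)
    if m > n:
        return []
    hp = ht = 0
    for i in range(m):
        hp = (2 * hp + int(P[i])) % q      # Hq(P)
        ht = (2 * ht + int(T[i])) % q      # Hq(T1)
    pow_m = pow(2, m - 1, q)               # 2^(m-1) mod q, used to drop the leading bit
    matches = []
    for r in range(n - m + 1):
        if ht == hp and T[r:r+m] == P:     # probable match, then verify
            matches.append(r + 1)          # 1-based positions, as in the slides
        if r + m < n:
            ht = (2 * (ht - int(T[r]) * pow_m) + int(T[r + m])) % q
    return matches

print(karp_rabin("10110101", "0101"))      # -> [5]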

Problem 1: Solution
Dictionary = {bzip, not, or, space}
P = bzip = 1a 0b
S = “bzip or not bzip”

[figure: P’s codeword is searched byte-aligned in C(S); the tag bit prevents false matches inside other codewords]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m × n matrix such that:
 M(i,j) = 1 iff the first i characters of P exactly match the i characters of T ending at character j,
 i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1 … j]

Example: T = california and P = for

[figure: the 3 × 10 matrix M; M(1,5)=1, M(2,6)=1 and M(3,7)=1, i.e. the full pattern ends at position 7]
How does M solve the exact match problem?

How to construct M


We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one.
 Machines can perform bit and arithmetic operations between two words in constant time.
 Examples:
  And(A,B) is the bit-wise and between A and B.
  BitShift(A) is the value derived by shifting A’s bits down by one and setting the first bit to 1.

[figure: BitShift applied to a 5-bit column vector]

Let w be the word size (e.g., 32 or 64 bits). We’ll assume m ≤ w. NOTICE: any column of M fits in a memory word.

How to construct M

 We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one.
 We define the m-length binary vector U(x) for each character x in the alphabet: U(x) is set to 1 in the positions where character x appears in P.

Example: P = abaac
 U(a) = (1,0,1,1,0)
 U(b) = (0,1,0,0,0)
 U(c) = (0,0,0,0,1)

How to construct M

 Initialize column 0 of M to all zeros.
 For j > 0, the j-th column is obtained by

  M(j) = BitShift(M(j-1)) & U(T[j])

 For i > 1, entry M(i,j) = 1 iff
  (1) the first i-1 characters of P match the i-1 characters of T ending at character j-1  ⇔ M(i-1, j-1) = 1
  (2) P[i] = T[j]  ⇔ the i-th bit of U(T[j]) = 1

 BitShift moves bit M(i-1, j-1) into the i-th position; ANDing it with the i-th bit of U(T[j]) establishes whether both conditions hold.

Worked examples (P = abaac, T = xabxabaaca):
 j=1: M(1) = BitShift(M(0)) & U(T[1]) = (1,0,0,0,0) & U(x) = (0,0,0,0,0)
 j=2: M(2) = BitShift(M(1)) & U(T[2]) = (1,0,0,0,0) & U(a) = (1,0,0,0,0)
 j=3: M(3) = BitShift(M(2)) & U(T[3]) = (1,1,0,0,0) & U(b) = (0,1,0,0,0)
 ...
 j=9: M(9) = BitShift(M(8)) & U(T[9]) = (1,1,0,0,1) & U(c) = (0,0,0,0,1)
      The m-th (5th) bit of M(9) is 1: an occurrence of P ends at position 9 of T.

Shift-And method: Complexity

 If m ≤ w, any column and any vector U() fit in a memory word, so any step requires O(1) time.
 If m > w, any column and any vector U() can be divided into ⌈m/w⌉ memory words, so any step requires O(m/w) time.
 Overall O(n(1 + m/w) + m) time.
 Thus it is very fast when the pattern length is close to the word size, which is very often the case in practice. Recall that w = 64 bits in modern architectures.
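A minimal sketch of the method (not the slides' code), using Python integers as bit-masks, bit k of a column standing for row k+1 of M:

def shift_and(T, P):
    m = len(P)
    U = {}
    for i, c in enumerate(P):          # U[c] has bit i set iff P[i+1] == c
        U[c] = U.get(c, 0) | (1 << i)
    M = 0
    occ = []
    for j, c in enumerate(T, start=1):
        # M(j) = BitShift(M(j-1)) & U(T[j]); BitShift = shift and set the first bit
        M = ((M << 1) | 1) & U.get(c, 0)
        if M & (1 << (m - 1)):         # m-th bit set: P ends at position j
            occ.append(j - m + 1)      # starting position, 1-based
    return occ

print(shift_and("xabxabaaca", "abaac"))   # -> [5], i.e. the occurrence ending at 9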

Some simple extensions

We want to allow the pattern to contain special symbols, like the class of chars [a-f].

P = [a-b]baac
 U(a) = (1,0,1,1,0)
 U(b) = (1,1,0,0,0)
 U(c) = (0,0,0,0,1)

What about ‘?’, ‘[^…]’ (not)?

Problem 1: Another solution
Dictionary = {bzip, not, or, space}
P = bzip = 1a 0b
S = “bzip or not bzip”

[figure: C(S) is scanned with the Shift-And machinery, answering yes/no at each byte-aligned position]

Speed ≈ Compression ratio

Problem 2
Dictionary = {bzip, not, or, space}

Given a pattern P, find all the occurrences in S of all terms containing P as a substring.

P = o
S = “bzip or not bzip”

[figure: both dictionary terms containing ‘o’ match: not = 1g 0g 0a and or = 1g 0a 0b; their codewords are then searched in C(S)]

Speed ≈ Compression ratio? No! Why?
A scan of C(S) is needed for each term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1, P2, …, Pl} of total length m, we want to find all the occurrences of those patterns in the text T[1,n].

[figure: text T with occurrences of P1 and P2 marked]

 Naïve solution
  Use an (optimal) exact matching algorithm, searching for each pattern in P
  Complexity: O(nl + m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
  Complexity: O(n + l + m) time

A simple extension of Shift-And

 S is the concatenation of the patterns in P
 R is a bitmap of length m: R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of the Shift-And method searching for S:
 For any symbol c, U’(c) = U(c) AND R
  U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
 For any step j:
  compute M(j)
  then OR it with U’(T[j]). Why? To set to 1 the first bit of each pattern that starts with T[j]
  check if there are occurrences ending in j. How?

Problem 3
Dictionary = {bzip, not, or, space}

Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing at most k mismatches.

P = bot, k = 2
S = “bzip or not bzip”

[figure: the dictionary terms and C(S), as in the previous problems]

Agrep: Shift-And method with errors

We extend the Shift-And method for finding inexact occurrences of a pattern in a text.
Example:
 T = aatatccacaa
 P = atcgaa
P appears in T with 2 mismatches starting at position 4; it also occurs with 4 mismatches starting at position 2.

 aatatccacaa        aatatccacaa
    atcgaa            atcgaa

Agrep

Our current goal: given k, find all the occurrences of P in T with up to k mismatches.
We define the matrix M^l to be an m × n binary matrix such that:
 M^l(i,j) = 1 iff there are no more than l mismatches between the first i characters of P and the i characters of T ending at character j.

What is M^0?
How does M^k solve the k-mismatch problem?

Computing M^k

 We compute M^l for all l = 0, …, k.
 For each j compute M^0(j), M^1(j), …, M^k(j).
 For all l, initialize M^l(0) to the zero vector.
 In order to compute M^l(j), we observe that there is a match iff one of the two cases below holds.

Computing M^l: case 1

The first i-1 characters of P match a substring of T ending at j-1 with at most l mismatches, and the next pair of characters in P and T are equal:

 BitShift(M^l(j-1)) & U(T[j])
Computing M^l: case 2

The first i-1 characters of P match a substring of T ending at j-1 with at most l-1 mismatches (and one mismatch is spent on T[j]):

 BitShift(M^(l-1)(j-1))
Computing M^l

 We compute M^l for all l = 0, …, k; for each j compute M^0(j), M^1(j), …, M^k(j); for all l, initialize M^l(0) to the zero vector.
 Combining the two cases:

 M^l(j) = [ BitShift(M^l(j-1)) & U(T[j]) ]  OR  BitShift(M^(l-1)(j-1))

Example M^1
T = xabxabaaca, P = abaad

[table: the 5 × 10 matrices M^0 and M^1; the last row of M^1 has a 1 only at column 9, i.e. P occurs ending at position 9 with at most 1 mismatch]
How much do we pay?

 The running time is O(k * n * (1 + m/w)).
 Again, the method is practically efficient for small m.
 Only O(k) columns of M are needed at any given time. Hence, the space used by the algorithm is O(k) memory words.
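A sketch of the k-mismatch variant in the same bit-mask style as before (a toy helper, not the slides' code):

def agrep_mismatches(T, P, k):
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M = [0] * (k + 1)                  # current columns of M^0 .. M^k
    occ = []
    for j, c in enumerate(T, start=1):
        Mprev = M[:]
        M[0] = ((Mprev[0] << 1) | 1) & U.get(c, 0)          # plain Shift-And
        for l in range(1, k + 1):
            case1 = ((Mprev[l] << 1) | 1) & U.get(c, 0)     # T[j] matches P
            case2 = (Mprev[l - 1] << 1) | 1                 # spend one mismatch on T[j]
            M[l] = case1 | case2
        if M[k] & (1 << (m - 1)):
            occ.append(j - m + 1)
    return occ

print(agrep_mismatches("xabxabaaca", "abaad", 1))   # -> [5], as in the M^1 example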

Problem 3: Solution
Dictionary = {bzip, not, or, space}

Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing k mismatches.

P = bot, k = 2
S = “bzip or not bzip”

[figure: only not = 1g 0g 0a matches P with at most 2 mismatches; its codeword is then searched in C(S)]

Agrep: more sophisticated operations

 The Shift-And method can solve other ops.

 The edit distance between two strings p and s is d(p,s) = minimum number of operations needed to transform p into s via three ops:
  Insertion: insert a symbol in p
  Deletion: delete a symbol from p
  Substitution: change a symbol in p into a different one

 Example: d(ananas, banane) = 3

Search by regular expressions
 Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword lengths, and thus build the tree…
This may be extremely time/space costly when you deal with GBs of textual data.

A simple algorithm: sort the pi in decreasing order, and encode si via the variable-length code for the integer i.

γ-code for integer encoding
 γ(x) = (Length-1) zeroes, followed by x in binary, where x > 0 and Length = ⌊log2 x⌋ + 1
 e.g., 9 is represented as <000, 1001>.
 The γ-code for x takes 2⌊log2 x⌋ + 1 bits (i.e., a factor of 2 from optimal).
 Optimal for Pr(x) = 1/(2x²), and i.i.d. integers.

It is a prefix-free encoding…


Given the following sequence of γ-coded integers, reconstruct the original sequence:

 0001000 00110 011 00000111011 00111  →  8, 6, 3, 59, 7
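A tiny sketch of γ-encoding/decoding (illustrative helpers, not from the slides):

def gamma_encode(x):
    # x > 0: (Length-1) zeroes followed by x in binary
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":                        # unary part
            zeros += 1
            i += 1
        out.append(int(bits[i:i + zeros + 1], 2))    # read Length = zeros+1 bits
        i += zeros + 1
    return out

print(gamma_encode(9))                                    # 0001001
print(gamma_decode("0001000001100110000011101100111"))    # [8, 6, 3, 59, 7]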

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).

Recall that |γ(i)| ≤ 2 * log i + 1.
How good is this approach with respect to Huffman?
 Compression ratio ≤ 2 * H0(S) + 1
Key fact:
 1 ≥ Σ_{i=1..x} pi ≥ x * px  →  x ≤ 1/px

How good is it?
Encode the integers via γ-coding: |γ(i)| ≤ 2 * log i + 1.
The cost of the encoding is (recall i ≤ 1/pi):

 Σ_{i=1..|S|} pi * |γ(i)|  ≤  Σ_{i=1..|S|} pi * [ 2 * log(1/pi) + 1 ]  =  2 * H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding

 Byte-aligned and tagged Huffman
  128-ary Huffman tree
  First bit of the first byte is tagged
  Configurations on 7 bits: just those of Huffman

 End-tagged dense code (ETDC)
  The rank r is mapped to the r-th binary sequence on 7*k bits
  First bit of the last byte is tagged

Surprising changes
 It is a prefix code
 Better compression: it uses all 7-bit configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2

A new concept: Continuers vs Stoppers
 Previously we used s = c = 128
The main idea is:
 s + c = 256 (we are playing with 8 bits)
 Thus s items are encoded with 1 byte
 s*c with 2 bytes, s*c² with 3 bytes, ...

An example
 5000 distinct words
 ETDC encodes 128 + 128² = 16512 words on 2 bytes
 A (230,26)-dense code encodes 230 + 230*26 = 6210 words on 2 bytes, hence more words fit on 1 byte; if the distribution is skewed this pays off...

Optimal (s,c)-dense codes
Find the optimal s, assuming c = 256 - s.

 Brute-force approach
 Binary search: on real distributions, there seems to be a unique minimum

 Ks = max codeword length
 Fsk = cumulative probability of the symbols whose |cw| ≤ k

Experiments: (s,c)-DC is quite interesting…
 Search is 6% faster than byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass?

 Move-to-Front (MTF):
  as a freq-sorting approximator
  as a caching strategy
  as a compressor

 Run-Length-Encoding (RLE):
  FAX compression

Move to Front Coding
Transforms a char sequence into an integer sequence, that can then be var-length coded.
 Start with the list of symbols L = [a,b,c,d,…]
 For each input symbol s:
  1) output the position of s in L
  2) move s to the front of L

There is a memory. Properties:
 Exploits temporal locality, and it is dynamic
 X = 1ⁿ2ⁿ3ⁿ…nⁿ  →  Huff = O(n² log n), MTF = O(n log n) + n²

Not much worse than Huffman... but it may be far better.
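A tiny sketch of the MTF transform (list-based; the O(log |S|) tree/hash machinery mentioned later is omitted):

def mtf_encode(text, alphabet):
    L = list(alphabet)                  # e.g. ['a','b','c','d']
    out = []
    for s in text:
        pos = L.index(s)                # position of s in L (0-based here)
        out.append(pos)
        L.pop(pos)
        L.insert(0, s)                  # move s to the front
    return out

def mtf_decode(codes, alphabet):
    L = list(alphabet)
    out = []
    for pos in codes:
        s = L[pos]
        out.append(s)
        L.pop(pos)
        L.insert(0, s)
    return "".join(out)

codes = mtf_encode("abbbaacccca", "abcd")
print(codes)                            # [0, 1, 0, 0, 1, 0, 2, 0, 0, 0, 1]
print(mtf_decode(codes, "abcd"))        # abbbaacccca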

MTF: how good is it?
Encode the integers via γ-coding: |γ(i)| ≤ 2 * log i + 1.
Put S in front and consider the cost of encoding:

 O(|S| log |S|) + Σ_{x=1..|S|} Σ_{i=2..n_x} |γ(p_i^x - p_{i-1}^x)|

By Jensen’s inequality:

 ≤ O(|S| log |S|) + Σ_{x=1..|S|} n_x * [ 2 * log(N/n_x) + 1 ]
 = O(|S| log |S|) + N * [ 2 * H0(X) + 1 ]

 La[mtf] ≤ 2 * H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and separators) as the symbols to be encoded.
How to keep the MTF-list efficiently:
 Search tree
  leaves contain the symbols, ordered as in the MTF-list
  nodes contain the size of their descending subtree
 Hash table
  key is a symbol
  data is a pointer to the corresponding tree leaf

Each tree operation takes O(log |S|) time.
Total is O(n log |S|), where n = #symbols to be compressed.

Run Length Encoding (RLE)
If spatial locality is very high, then
 abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings, just the run lengths and one starting bit suffice.
There is a memory. Properties:
 Exploits spatial locality, and it is a dynamic code
 X = 1ⁿ2ⁿ3ⁿ…nⁿ  →  Huff(X) = Θ(n² log n) > RLE(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip.
More time costly than Huffman, but the integer implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol an interval range in [0, 1), of width p(symbol), starting at the cumulative probability of the previous symbols:

 f(i) = Σ_{j=1..i-1} p(j)

e.g. with p(a) = .2, p(b) = .5, p(c) = .3:
 a = [0.0, 0.2),  b = [0.2, 0.7),  c = [0.7, 1.0)
 f(a) = .0, f(b) = .2, f(c) = .7

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2, .7)).

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
 Start with [0, 1)
 After b: [.2, .7)
 After a: [.2, .3)
 After c: [.27, .3)

The final sequence interval is [.27, .3).

Arithmetic Coding
To code a sequence of symbols c1 c2 … cn with probabilities p[c], use:

 l0 = 0,  li = li-1 + si-1 * f[ci]
 s0 = 1,  si = si-1 * p[ci]

f[c] is the cumulative probability up to symbol c (not included).
The final interval size is

 sn = Π_{i=1..n} p[ci]

The interval for a message sequence will be called the sequence interval.
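A small sketch of the real-number version from the slides (not the integer implementation discussed later):

def sequence_interval(msg, p, f):
    # p[c]: probability of symbol c; f[c]: cumulative probability up to c (excluded)
    l, s = 0.0, 1.0
    for c in msg:
        l = l + s * f[c]
        s = s * p[c]
    return l, l + s                       # the sequence interval [l, l+s)

def decode(x, n, p, f, symbols):
    # recover n symbols from a number x lying in the final interval
    out = []
    for _ in range(n):
        for c in symbols:
            if f[c] <= x < f[c] + p[c]:
                out.append(c)
                x = (x - f[c]) / p[c]     # rescale and continue
                break
    return "".join(out)

p = {"a": .2, "b": .5, "c": .3}
f = {"a": .0, "b": .2, "c": .7}
print(sequence_interval("bac", p, f))     # -> (0.27, 0.3) up to float rounding
print(decode(.49, 3, p, f, "abc"))        # -> 'bbc', as in the decoding example below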

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:
 .49 ∈ [.2, .7)    → b;   new interval [.2, .7)
 .49 ∈ [.3, .55)   → b;   new interval [.3, .55)
 .49 ∈ [.475, .55) → c

The message is bbc.

Representing a real number
Binary fractional representation:
 .75 = .11     1/3 = .0101…     11/16 = .1011

Algorithm
 1. x = 2*x
 2. If x < 1 output 0
 3. else x = x - 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
 e.g. [0, .33) → .01    [.33, .66) → .1    [.66, 1) → .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions.

 code    min     max     interval
 .11     .110    .111    [.75, 1.0)
 .101    .1010   .1011   [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval is contained in the sequence interval (a dyadic number).

 e.g. sequence interval [.61, .79); the code interval of .101 is [.625, .75), which is contained in it.

Can use l + s/2 truncated to 1 + ⌈log (1/s)⌉ bits.

Bound on Arithmetic length
Note that -log s + 1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

 1 + ⌈log (1/s)⌉ = 1 + ⌈log Π_i (1/pi)⌉
 ≤ 2 + Σ_{i=1..n} log (1/pi)
 = 2 + Σ_{k=1..|S|} n pk log (1/pk)
 = 2 + n H0  bits

In practice it takes nH0 + 0.02 n bits, because of rounding.

Integer Arithmetic Coding
The problem is that operations on arbitrary-precision real numbers are expensive.
Key ideas of the integer version:
 Keep integers in range [0, R) where R = 2^k
 Use rounding to generate integer intervals
 Whenever the sequence interval falls into the top, bottom or middle half, expand the interval by a factor 2

Integer Arithmetic is an approximation.

Integer Arithmetic (scaling)
 If l ≥ R/2 (top half): output 1 followed by m 0s; m = 0; the message interval is expanded by 2.
 If u < R/2 (bottom half): output 0 followed by m 1s; m = 0; the message interval is expanded by 2.
 If l ≥ R/4 and u < 3R/4 (middle half): increment m; the message interval is expanded by 2.
 In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine:

[figure: the ArithmeticToolBox ATB takes the current interval (L,s), a symbol c and a distribution (p1,…,p|S|), and returns the new interval (L’,s’)]

Therefore, even the distribution can change over time.

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox

[figure: the ATB is driven with the conditional distribution p[s | context], where s is a character c or the escape esc; it maps (L,s) to (L’,s’)]

Encoder and Decoder must know the protocol for selecting the same conditional probability distribution (PPM-variant).

PPM: Example Contexts
String = ACCBACCACBA B,  k = 2

 Context (order 0)    Counts
 Empty                A=4  B=2  C=5  $=3

 Context (order 1)    Counts
 A                    C=3  $=1
 B                    A=2  $=1
 C                    A=1  B=2  C=2  $=3

 Context (order 2)    Counts
 AC                   B=1  C=2  $=2
 BA                   C=1  $=1
 CA                   C=1  $=1
 CB                   A=2  $=1
 CC                   A=1  B=1  $=2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit frequency estimation

LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
 Dictionary ??? (all the substrings starting to the left of the Cursor)      Output: <2,3,c>

Algorithm’s step:
 Output <d, len, c> where
  d = distance of the copied string from the current position
  len = length of the longest match
  c = next char in the text beyond the longest match
 Advance by len + 1

A buffer “window” of fixed length slides over the text.

Example: LZ77 with window (window size = 6)
T = a a c a a c a b c a b a a a c

 (0,0,a)  (1,1,c)  (3,4,b)  (3,3,a)  (1,2,c)

Each triple is <distance of the longest match within the window, its length, next character>.

LZ77 Decoding
The decoder keeps the same dictionary window as the encoder:
 it finds the substring and inserts a copy of it.

What if len > d? (overlap with the text still to be decompressed)
 E.g. seen = abcd, next codeword is (2,9,e)
 Simply copy starting at the cursor:
  for (i = 0; i < len; i++)
      out[cursor+i] = out[cursor-d+i];
 Output is correct: abcdcdcdcdcdce
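A toy sketch of the parser and of the decoder (quadratic search over the window, no hashing; just to make the triples above concrete):

def lz77_encode(T, window=6):
    out, cur, n = [], 0, len(T)
    while cur < n:
        best_d, best_len = 0, 0
        for start in range(max(0, cur - window), cur):      # candidate copy sources
            l = 0
            while cur + l < n - 1 and T[start + l] == T[cur + l]:
                l += 1                                       # overlap beyond cur is allowed
            if l > best_len:
                best_d, best_len = cur - start, l
        out.append((best_d, best_len, T[cur + best_len]))
        cur += best_len + 1
    return out

def lz77_decode(triples):
    out = []
    for d, l, c in triples:
        for _ in range(l):
            out.append(out[len(out) - d])                    # copy, possibly overlapping
        out.append(c)
    return "".join(out)

enc = lz77_encode("aacaacabcabaaac")
print(enc)                    # [(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')]
print(lz77_decode(enc))       # aacaacabcabaaac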

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Input: a a b a a c a b c a b c b

 Output (0,a)  → dict: 1 = a
 Output (1,b)  → dict: 2 = ab
 Output (1,a)  → dict: 3 = aa
 Output (0,c)  → dict: 4 = c
 Output (2,c)  → dict: 5 = abc
 Output (5,b)  → dict: 6 = abcb

LZ78: Decoding Example
Input: (0,a) (1,b) (1,a) (0,c) (2,c) (5,b)

 (0,a) → a       dict: 1 = a
 (1,b) → ab      dict: 2 = ab
 (1,a) → aa      dict: 3 = aa
 (0,c) → c       dict: 4 = c
 (2,c) → abc     dict: 5 = abc
 (5,b) → abcb    dict: 6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Input: a a b a a c a b a b a c b    (a = 112, b = 113, c = 114)

 output 112  → add 256 = aa
 output 112  → add 257 = ab
 output 113  → add 258 = ba
 output 256  → add 259 = aac
 output 114  → add 260 = ca
 output 257  → add 261 = aba
 output 261  → add 262 = abac
 output 114  → add 263 = cb

LZW: Decoding Example
Input: 112 112 113 256 114 257 261 114

 112 → a
 112 → a        add 256 = aa
 113 → b        add 257 = ab
 256 → aa       add 258 = ba
 114 → c        add 259 = aac
 257 → ab       add 260 = ca
 261 → ?        (261 is not defined yet: the special SSc case) → aba;  add 261 = aba  (one step later)
 114 → c        add 262 = abac
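A small sketch of LZW with the one-step-behind decoder, including the special SSc case (ASCII initialization, so the printed codes differ from the slide's 112/113/114 labels):

def lzw_encode(s):
    dictionary = {chr(i): i for i in range(256)}
    next_code, out, S = 256, [], ""
    for c in s:
        if S + c in dictionary:
            S = S + c                        # extend the current match
        else:
            out.append(dictionary[S])        # emit the longest match...
            dictionary[S + c] = next_code    # ...and add Sc to the dictionary
            next_code += 1
            S = c
    if S:
        out.append(dictionary[S])
    return out

def lzw_decode(codes):
    dictionary = {i: chr(i) for i in range(256)}
    next_code = 256
    prev = dictionary[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in dictionary:
            cur = dictionary[code]
        else:                                # the tricky SSc case: code not defined yet
            cur = prev + prev[0]
        dictionary[next_code] = prev + cur[0]
        next_code += 1
        out.append(cur)
        prev = cur
    return "".join(out)

codes = lzw_encode("aabaacababacb")
print(codes)                  # [97, 97, 98, 256, 99, 257, 261, 99, 98]
print(lzw_decode(codes))      # aabaacababacb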

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform (1994)
Given a text T = mississippi#, write down all its cyclic rotations and sort them:

 F              L
 #mississipp | i
 i#mississip | p
 ippi#missis | s
 issippi#mis | s
 ississippi# | m
 mississippi | #
 pi#mississi | p
 ppi#mississ | i
 sippi#missi | s
 sissippi#mi | s
 ssippi#miss | i
 ssissippi#m | i

F = first column of the sorted matrix; L = last column = BWT(T)

A famous example

Much longer...

A useful tool: the L → F mapping

[figure: the sorted rotation matrix again, with the columns F and L highlighted]

How do we map L’s chars onto F’s chars?
 ... we need to distinguish equal chars in F...

 Take two equal chars of L
 Rotate their rows rightward by one position
 Same relative order !!

The BWT is invertible

[figure: the sorted rotation matrix with columns F and L]

Two key properties:
 1. The LF-array maps L’s chars to F’s chars
 2. L[i] precedes F[i] in T

Reconstruct T backward:   T = .... i p p i #

InvertBWT(L)
  Compute LF[0,n-1];
  r = 0; i = n;
  while (i > 0) {
      T[i] = L[r];
      r = LF[r]; i--;
  }

How to compute the BWT ?

[figure: the rows of the BWT matrix are the sorted rotations; listing their starting positions gives the suffix array]

 SA = 12 11 8 5 2 1 10 9 7 4 6 3
 L  =  i  p s s m #  p i s s i i

We said that L[i] precedes F[i] in T.
 e.g. L[3] = T[7]
Given SA and T, we have L[i] = T[SA[i]-1].

How to construct SA from T ?
Input: T = mississippi#

 SA = 12 11 8 5 2 1 10 9 7 4 6 3, i.e. the sorted suffixes:
  #, i#, ippi#, issippi#, ississippi#, mississippi#, pi#, ppi#, sippi#, sissippi#, ssippi#, ssissippi#

Elegant but inefficient: sort the suffixes directly.
Obvious inefficiencies:
 • Θ(n² log n) time in the worst case
 • Θ(n log n) cache misses or I/O faults

Many algorithms, now...
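A demo-sized sketch of the transform and of its inversion via the LF mapping (the quadratic construction above, not one of the efficient algorithms):

def bwt(T):
    # T must end with a unique sentinel, e.g. '#', smaller than every other char
    n = len(T)
    rotations = sorted(T[i:] + T[:i] for i in range(n))    # Θ(n² log n): fine for a demo
    return "".join(row[-1] for row in rotations)

def inverse_bwt(L):
    n = len(L)
    # LF[r] = position in F = sorted(L) of the char L[r]; equal chars keep their relative order
    order = sorted(range(n), key=lambda r: (L[r], r))
    LF = [0] * n
    for f_pos, r in enumerate(order):
        LF[r] = f_pos
    out = []
    r = 0                      # row 0 starts with '#', so L[0] is the char preceding '#' in T
    for _ in range(n - 1):
        out.append(L[r])       # walk backward through T via the LF mapping
        r = LF[r]
    return "".join(reversed(out)) + "#"

L = bwt("mississippi#")
print(L)                       # ipssm#pissii
print(inverse_bwt(L))          # mississippi#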

Compressing L seems promising...
Key observation:
 L is locally homogeneous → L is highly compressible

Algorithm Bzip:
 Move-to-Front coding of L
 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33% compression ratio, but it is slower in (de)compression!

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii      (# at 16)
MTF-list = [i,m,p,s]

MTF  = 020030000030030200300300000100000
MTF' = 030040000040040300400400000200000      (Bin(6)=110, Wheeler’s code)
RLE0 = 03141041403141410210

Alphabet = |S|+1
Bzip2 output = Arithmetic/Huffman on |S|+1 symbols...
... plus γ(16), plus the original MTF-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics

 Size
  1 trillion pages available (Google, 7/2008)
  5-40 KB per page => hundreds of terabytes
  Size grows every day!!

 Change
  8% new pages, 25% new links change weekly
  Life time of about 10 days

The Bow Tie

Some definitions

 Weakly connected components (WCC)
  Set of nodes such that from any node you can reach any other node via an undirected path.
 Strongly connected components (SCC)
  Set of nodes such that from any node you can reach any other node via a directed path.

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?

 Largest artifact ever conceived by humans
 Exploit the structure of the Web for
  Crawl strategies
  Search
  Spam detection
  Discovering communities on the web
  Classification/organization
 Predict the evolution of the Web
  Sociological understanding

Many other large graphs…

 Physical network graph
  V = Routers
  E = communication links
 The “cosine” graph (undirected, weighted)
  V = static web pages
  E = semantic distance between pages
 Query-Log graph (bipartite, weighted)
  V = queries and URLs
  E = (q,u) if u is a result for q and has been clicked by some user who issued q
 Social graph (undirected, unweighted)
  V = users
  E = (x,y) if x knows y (facebook, address book, email, ..)

Definition
Directed graph G = (V,E)
 V = URLs, E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN & no OUT)

Three key properties:
 Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution
Altavista crawl (1999) and WebBase crawl (2001): the indegree follows a power-law distribution

 Pr[in-degree(u) = k] ∝ 1/k^a,   a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)
 V = URLs, E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN, no OUT)

Three key properties:
 Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1
 Locality: usually most of the hyperlinks point to other URLs on the same host (about 80%).
 Similarity: pages close to each other in lexicographic order tend to share many outgoing lists.

A Picture of the Web Graph
 [figure: adjacency matrix of a crawl with 21 million pages and 150 million links]

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries (only the first gap s1-x can be negative): they are remapped to non-negative integers before coding.
Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with copy blocks.

Consecutivity in extra-nodes:
 Intervals: use their left extreme and length
 Interval length: decremented by Lmin = 2
 Residuals: differences between residuals, or from the source

 0 = (15-15)*2 (positive)
 2 = (23-19)-2 (jump >= 2)
 600 = (316-16)*2
 3 = |13-15|*2-1 (negative)
 3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77-scheme provides an efficient, optimal solution
 fknown is the “previously encoded text”: compress the concatenation fknown·fnew, starting from fnew

zdelta is one of the best implementations

           Emacs size   Emacs time
 uncompr   27 MB        ---
 gzip      8 MB         35 secs
 zdelta    1.5 MB       42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[figure: weighted graph over the files plus a dummy node 0; edge weights are zdelta sizes (e.g. 20, 123, 220, 620, 2000), dummy edges carry the gzip sizes; the min branching picks the cheapest reference for each file]

           space   time
 uncompr   30 MB   ---
 tgz       20%     linear
 THIS      8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta executions. Nonetheless, this is still strictly n² time.
space

time

uncompr

260Mb

---

tgz

12%

2 mins

THIS

8%

16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

          gcc size   emacs size
 total    27288      27326
 gzip     7563       8577
 zdelta   227        1431
 rsync    964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync
 The server sends the hashes (unlike rsync, where the client does), and the client checks them.
 The server deploys the common fref to compress the new ftar (rsync compresses it on its own).

A multi-round protocol
 k blocks of n/k elements, log(n/k) levels
 If the distance is k, then on each level at most k hashes do not find a match in the other file.
 The communication complexity is O(k lg n lg(n/k)) bits.

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located on two machines A and B, determine the difference between the two sets, at one or both of the machines.

Requirements: The cost should be proportional to the size k of the difference, where k may or may not be known in advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]
 Not perfectly true but...


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
 iff P is a prefix of the i-th suffix of T (i.e. T[i,N])

Occurrences of P in T = all suffixes of T having P as a prefix

 P = si,  T = mississippi  →  occurrences at 4, 7

SUF(T) = sorted set of suffixes of T

Reduction: from substring search to prefix search.

The Suffix Tree

T# = mississippi#
      1 2 3 4 5 6 7 8 9 10 11 12

[figure: the suffix tree of T#; edges are labelled with substrings (e.g. mississippi#, i, s, si, ssi, ppi#, pi#, #) and each leaf stores the starting position of its suffix]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

Storing SUF(T) explicitly takes Θ(N²) space; the suffix array stores just the starting positions:

 SA = 12 11 8 5 2 1 10 9 7 4 6 3
 SUF(T) = #, i#, ippi#, issippi#, ississippi#, mississippi#, pi#, ppi#, sippi#, sissippi#, ssippi#, ssissippi#

T = mississippi#

Suffix Array space:
 • SA: Θ(N log2 N) bits
 • Text T: N chars
 → in practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison.

 SA = 12 11 8 5 2 1 10 9 7 4 6 3,  T = mississippi#,  P = si
 At each step, compare P against the suffix pointed to by the middle SA entry: “P is larger” or “P is smaller” drives the search; 2 accesses per step.

Suffix Array search
 • O(log2 N) binary-search steps
 • Each step takes O(p) char comparisons
 → overall, O(p log2 N) time

 Improvable to O(p + log2 N) [Manber-Myers, ’90], and to bounds depending on |S| [Cole et al, ’06]

Locating the occurrences

 Once the binary search has found the SA interval of the suffixes prefixed by P, list them all:
  P = si  →  occ = 2, positions 4 and 7

Suffix Array search
 • O(p + log2 N + occ) time

 Suffix Trays: O(p + log2 |S| + occ)   [Cole et al., ’06]
 String B-tree   [Ferragina-Grossi, ’95]
 Self-adjusting Suffix Arrays   [Ciriani et al., ’02]

Text mining
Lcp[1, N-1] = longest-common-prefix between suffixes adjacent in SA

 Lcp = 0 0 1 4 0 0 1 0 2 1 3
 SA  = 12 11 8 5 2 1 10 9 7 4 6 3
 T = mississippi#

 e.g. the entry 4 is the lcp between issippi# and ississippi#.

• How long is the common prefix between T[i,...] and T[j,...]?
 • It is the min of the subarray Lcp[h, k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L?
 • Search for an entry Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times?
 • Search for a window Lcp[i, i+C-2] whose entries are all ≥ L.


Slide 108

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching

 We show methods in which arithmetic and bit-operations replace comparisons
 We will survey two examples of such methods:
  The Random Fingerprint method due to Karp and Rabin
  The Shift-And method due to Baeza-Yates and Gonnet

Rabin-Karp Fingerprint

 We will use a class of functions from strings to integers in order to obtain:
  An efficient randomized algorithm that makes an error with small probability.
  A randomized algorithm that never errs, whose running time is efficient with high probability.

 We will consider a binary alphabet (i.e., T ∈ {0,1}^n).

Arithmetic replaces Comparisons

 Strings are also numbers, H: strings → numbers.
 Let s be a string of length m:

   H(s) = Σ_{i=1..m} 2^(m-i) · s[i]

 Example: P = 0101
   H(P) = 2^3·0 + 2^2·1 + 2^1·0 + 2^0·1 = 5
 s = s' if and only if H(s) = H(s')

Definition: let Tr denote the m-length substring of T starting at position r (i.e., Tr = T[r, r+m-1]).

Arithmetic replaces Comparisons

 Strings are also numbers, H: strings → numbers
 Exact match = scan T and compare H(Ti) and H(P):
  there is an occurrence of P starting at position r of T if and only if H(P) = H(Tr)

  T = 10110101,  P = 0101,  H(P) = 5

  T = 10110101
  P =  0101         H(T2) = 6 ≠ H(P)

  T = 10110101
  P =     0101      H(T5) = 5 = H(P)    Match!

Arithmetic replaces Comparisons

 We can compute H(Tr) from H(Tr-1):

   H(Tr) = 2 · H(Tr-1) - 2^m · T(r-1) + T(r+m-1)

 Example (T = 10110101, m = 4):
   T1 = 1011, T2 = 0110
   H(T1) = H(1011) = 11
   H(T2) = 2·11 - 2^4·1 + 0 = 22 - 16 = 6 = H(0110)

Arithmetic replaces Comparisons

 A simple efficient algorithm:
  Compute H(P) and H(T1)
  Run over T, computing H(Tr) from H(Tr-1) in constant time and making the comparison H(P) = H(Tr)

 Total running time O(n+m)?
  NO! Why?
  The problem is that when m is large, it is unreasonable to assume that each arithmetic operation can be done in O(1) time: values of H() are m-bit long numbers and, in general, they are too BIG to fit in a machine word.

 IDEA! Let's use modular arithmetic:
  for some prime q, the Karp-Rabin fingerprint of a string s is defined by Hq(s) = H(s) (mod q)

An example
P = 101111, q = 7
H(P) = 47, Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally (reducing mod 7 at each step):
  1
  1·2 (mod 7) + 0 = 2
  2·2 (mod 7) + 1 = 5
  5·2 (mod 7) + 1 = 4
  4·2 (mod 7) + 1 = 2
  2·2 (mod 7) + 1 = 5
  5 (mod 7) = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1), since
  2^m (mod q) = 2·(2^(m-1) (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint

 How about the comparisons?

 Arithmetic: there is an occurrence of P starting at position r of T if and only if H(P) = H(Tr).

 Modular arithmetic: if there is an occurrence of P starting at position r of T, then Hq(P) = Hq(Tr).
  False match! There are values of q for which the converse is not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!

 Our goal will be to choose a modulus q such that
  q is small enough to keep computations efficient (i.e., the Hq()s fit in a machine word)
  q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm

 Choose a positive integer I
 Pick a random prime q less than or equal to I, and compute P's fingerprint Hq(P)
 For each position r in T, compute Hq(Tr) and test whether it equals Hq(P). If the numbers are equal, either
  declare a probable match (randomized algorithm),
  or check and declare a definite match (deterministic algorithm)

 Running time: excluding verification, O(n+m)
 The randomized algorithm is correct w.h.p.
 The deterministic algorithm has expected running time O(n+m)

Proof on the board
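A minimal C sketch of the scan just described, on the binary example used above. Here q is a small fixed prime chosen only for illustration (the algorithm would pick a random prime ≤ I), and every fingerprint hit is verified, so this is the deterministic variant.

#include <stdio.h>
#include <string.h>

int main(void) {
    const char *T = "10110101", *P = "0101";
    long n = strlen(T), m = strlen(P), q = 101;

    long hp = 0, ht = 0, pw = 1;               /* pw = 2^(m-1) mod q        */
    for (long i = 0; i < m; i++) {
        hp = (2 * hp + (P[i] - '0')) % q;      /* Hq(P)                     */
        ht = (2 * ht + (T[i] - '0')) % q;      /* Hq(T1)                    */
        if (i > 0) pw = (2 * pw) % q;
    }
    for (long r = 0; r + m <= n; r++) {
        if (hp == ht && memcmp(T + r, P, m) == 0)   /* verify: no false match */
            printf("occurrence at position %ld\n", r + 1);
        if (r + m < n)                              /* roll the fingerprint   */
            ht = ((2 * ((ht - (T[r] - '0') * pw % q + q) % q)) + (T[r + m] - '0')) % q;
    }
    return 0;
}

On T = 10110101, P = 0101 it reports the occurrence at position 5, as in the example.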

Problem 1: Solution
Dictionary = {bzip, not, or} (plus the separator "space")
P = bzip = 1a 0b

[Figure: the codeword 1a 0b of P is matched directly against C(S), S = "bzip or not bzip"; each candidate alignment is marked yes/no.]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

[Figure: the 3×10 matrix M for P = for and T = california. Its only 1-entries are M(1,5), M(2,6), M(3,7): the occurrence of "for" ends at position 7 of T, where the last row of M holds a 1.]
How does M solve the exact match problem?

How to construct M

 We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one.
 Machines can perform bit and arithmetic operations between two words in constant time. Examples:
  And(A,B) is the bit-wise and between A and B.
  BitShift(A) is the value obtained by shifting A's bits down by one position and setting the first bit to 1.

[Example: BitShift applied to a 5-bit column vector moves every bit down by one position, drops the last bit, and sets the first bit to 1.]

 Let w be the word size (e.g., 32 or 64 bits). We'll assume m = w. NOTICE: any column of M fits in a memory word.

How to construct M

 We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one.
 We define the m-length binary vector U(x) for each character x in the alphabet: U(x) is set to 1 at the positions in P where character x appears.

 Example: P = abaac
   U(a) = (1,0,1,1,0)
   U(b) = (0,1,0,0,0)
   U(c) = (0,0,0,0,1)

How to construct M

 Initialize column 0 of M to all zeros.
 For j > 0, the j-th column is obtained by

   M(j) = BitShift(M(j-1)) & U(T[j])

 For i > 1, entry M(i,j) = 1 iff
  (1) the first i-1 characters of P match the i-1 characters of T ending at character j-1, i.e. M(i-1,j-1) = 1;
  (2) P[i] = T[j], i.e. the i-th bit of U(T[j]) = 1.

 BitShift moves bit M(i-1,j-1) into the i-th position; AND-ing this with the i-th bit of U(T[j]) establishes whether both conditions hold. (A small C sketch follows below.)
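A minimal C sketch of the construction above, assuming m ≤ w so that a column of M fits in one 64-bit word (bit i-1 of the word is row i of the column); the example values are the ones used in the next slides.

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    const char *T = "xabxabaaca", *P = "abaac";
    size_t n = strlen(T), m = strlen(P);

    uint64_t U[256] = {0};                       /* U[c] has bit i-1 set iff P[i] = c */
    for (size_t i = 0; i < m; i++)
        U[(unsigned char)P[i]] |= (uint64_t)1 << i;

    uint64_t M = 0;                              /* column M(0) = all zeros           */
    for (size_t j = 0; j < n; j++) {
        M = ((M << 1) | 1) & U[(unsigned char)T[j]];   /* BitShift, then AND          */
        if (M & ((uint64_t)1 << (m - 1)))              /* last row set: P ends here   */
            printf("occurrence ending at position %zu\n", j + 1);
    }
    return 0;
}

On T = xabxabaaca and P = abaac it reports the occurrence ending at position 9, as computed column by column in the example that follows.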

An example: T = xabxabaaca, P = abaac

[Figure: the columns M(1), ..., M(9) computed step by step as M(j) = BitShift(M(j-1)) & U(T[j]). For instance M(1) = BitShift(M(0)) & U(x) = 00000, M(2) = BitShift(M(1)) & U(a) = 10000, M(3) = BitShift(M(2)) & U(b) = 01000; at j = 9 the last row is set, M(5,9) = 1, i.e. an occurrence of P ends at position 9 of T.]

Shift-And method: Complexity

 If m ≤ w, any column and any vector U() fit in a memory word: each step requires O(1) time.
 If m > w, any column and any vector U() can be divided into ⌈m/w⌉ memory words: each step requires O(m/w) time.
 Overall O(n(1+m/w)+m) time.
 Thus, it is very fast when the pattern length is close to the word size, which is very often the case in practice. Recall that w = 64 bits in modern architectures.

Some simple extensions

 We want to allow the pattern to contain special symbols, like the character class [a-f].
 Example: P = [a-b]baac
   U(a) = (1,0,1,1,0)
   U(b) = (1,1,0,0,0)
   U(c) = (0,0,0,0,1)

 What about '?', '[^…]' (not)?

Problem 1: Another solution
Dictionary = {bzip, not, or} (plus the separator "space")
P = bzip = 1a 0b

[Figure: as before, the codeword of P is matched against C(S), S = "bzip or not bzip"; the scan can now be carried out with the Shift-And method, and candidate alignments are marked yes/no.]

Speed ≈ Compression ratio

Problem 2
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring.

[Figure: dictionary = {bzip, not, or}, P = o; the terms containing P are "not" (codeword 1g 0g 0a) and "or" (codeword 1g 0a 0b), and their codewords are searched in C(S), S = "bzip or not bzip".]

Speed ≈ Compression ratio? No! Why?
A scan of C(S) is needed for each term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: a text T containing occurrences of two patterns P1 and P2.]
 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

 S is the concatenation of the patterns in P
 R is a bitmap of length m: R[i] = 1 iff S[i] is the first symbol of a pattern
 Use a variant of the Shift-And method searching for S:
  for any symbol c, U'(c) = U(c) AND R, so U'(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
  at any step j, compute M(j) and then OR it with U'(T[j]). Why? This sets to 1 the first bit of each pattern that starts with T[j].
  Check if there are occurrences ending at j. How? (See the sketch below.)
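One possible realization of this variant, as a hedged C sketch: S is the concatenation of the patterns, U'(c) marks the pattern-start positions where c occurs, and an extra end-mask (called E here, a name not in the slides) marks pattern-end positions so that occurrences can be reported.

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    const char *pat[2] = {"abaac", "or"};          /* the pattern set P          */
    const char *T = "xabxabaacaor";
    uint64_t U[256] = {0}, Uprime[256] = {0}, E = 0;
    size_t len = 0;

    for (int p = 0; p < 2; p++) {                  /* build U, U', E over S      */
        size_t start = len;
        for (size_t i = 0; pat[p][i]; i++, len++) {
            U[(unsigned char)pat[p][i]] |= (uint64_t)1 << len;
            if (len == start)
                Uprime[(unsigned char)pat[p][i]] |= (uint64_t)1 << len;
        }
        E |= (uint64_t)1 << (len - 1);             /* last position of pattern p */
    }
    uint64_t M = 0;
    for (size_t j = 0; T[j]; j++) {
        unsigned char c = (unsigned char)T[j];
        M = ((M << 1) & U[c]) | Uprime[c];         /* shift, AND, then OR starts */
        if (M & E) printf("some pattern ends at position %zu\n", j + 1);
    }
    return 0;
}

On T = xabxabaacaor it reports matches ending at positions 9 (abaac) and 12 (or).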

Problem 3
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring, allowing at most k mismatches.

[Figure: dictionary = {bzip, not, or}, P = bot, k = 2, S = "bzip or not bzip"; the matching terms are then searched in C(S).]

Agrep: Shift-And method with errors

 We extend the Shift-And method to find inexact occurrences of a pattern in a text.
 Example: T = aatatccacaa, P = atcgaa.
  P appears in T with 2 mismatches starting at position 4; it also occurs with 4 mismatches starting at position 2.

   T = a a t a t c c a c a a        T = a a t a t c c a c a a
   P =       a t c g a a            P =   a t c g a a
       (position 4, 2 mismatches)       (position 2, 4 mismatches)

Agrep

 Our current goal: given k, find all the occurrences of P in T with up to k mismatches.
 We define the matrix M^l to be an m by n binary matrix such that:
   M^l(i,j) = 1 iff there are at most l mismatches between the first i characters of P and the i characters of T ending at position j.

 What is M^0?
 How does M^k solve the k-mismatch problem?

Computing M^k

 We compute M^l for all l = 0, …, k.
 For each j we compute M^0(j), M^1(j), …, M^k(j).
 For all l we initialize M^l(0) to the zero vector.
 In order to compute M^l(j), we observe that there is a match iff one of the two cases below holds.

Computing M^l: case 1

 The first i-1 characters of P match a substring of T ending at j-1 with at most l mismatches, and the next pair of characters in P and T are equal:

   BitShift(M^l(j-1)) & U(T[j])

Computing M^l: case 2

 The first i-1 characters of P match a substring of T ending at j-1 with at most l-1 mismatches (and the new pair of characters may mismatch):

   BitShift(M^(l-1)(j-1))

Computing M^l

 We compute M^l for all l = 0, …, k; for each j we compute M^0(j), M^1(j), …, M^k(j), initializing every M^l(0) to the zero vector.
 Combining the two cases, there is a match iff

   M^l(j) = [ BitShift(M^l(j-1)) & U(T[j]) ]  OR  BitShift(M^(l-1)(j-1))

Example M^1 (T = xabxabaaca, P = abaad)

[Figure: the matrices M^0 and M^1 for this instance, column by column. In M^1 the first row is all 1s, and M^1(5,9) = 1: P occurs with at most one mismatch ending at position 9 of T (T[5,9] = abaac vs P = abaad).]

How much do we pay?

 The running time is O(k·n·(1+m/w)).
 Again, the method is practically efficient for small m.
 Only O(k) columns of M are needed at any given time, hence the space used by the algorithm is O(k) memory words.
 (A small C sketch follows below.)
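A small C sketch of the k-mismatch recurrence M^l(j) = [BitShift(M^l(j-1)) & U(T[j])] OR BitShift(M^(l-1)(j-1)), keeping the k+1 current columns in machine words (m ≤ w assumed); it runs the M^1 example instance above.

#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define K 1   /* number of allowed mismatches, as in the example above */

int main(void) {
    const char *T = "xabxabaaca", *P = "abaad";
    size_t n = strlen(T), m = strlen(P);

    uint64_t U[256] = {0};
    for (size_t i = 0; i < m; i++)
        U[(unsigned char)P[i]] |= (uint64_t)1 << i;

    uint64_t M[K + 1] = {0};                        /* M[l] = current column of M^l   */
    for (size_t j = 0; j < n; j++) {
        unsigned char c = (unsigned char)T[j];
        uint64_t prev_shifted = 0;                  /* BitShift(M^(l-1)(j-1))         */
        for (int l = 0; l <= K; l++) {
            uint64_t old = M[l];                    /* M^l(j-1), reused at level l+1  */
            M[l] = (((old << 1) | 1) & U[c]) | prev_shifted;
            prev_shifted = (old << 1) | 1;
        }
        if (M[K] & ((uint64_t)1 << (m - 1)))
            printf("occurrence with <= %d mismatch(es) ending at %zu\n", K, j + 1);
    }
    return 0;
}

On this instance it reports the single occurrence ending at position 9, i.e. M^1(5,9) = 1.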

Problem 3: Solution
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring, allowing k mismatches.

[Figure: dictionary = {bzip, not, or}, P = bot, k = 2, S = "bzip or not bzip". The term "not" matches P with at most 2 mismatches; its codeword not = 1g 0g 0a is then searched in C(S).]

Agrep: more sophisticated operations

 The Shift-And method can solve other operations as well.

 The edit distance between two strings p and s is d(p,s) = the minimum number of operations needed to transform p into s via three ops:
  Insertion: insert a symbol in p
  Deletion: delete a symbol from p
  Substitution: change a symbol of p into a different one
 Example: d(ananas, banane) = 3

Search by regular expressions
 Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code (Elias γ) for integer encoding

  g(x) = 0^(Length-1) · (x in binary),   where x > 0 and Length = ⌊log2 x⌋ + 1

  e.g., 9 is represented as <000, 1001>

 The g-code for x takes 2·⌊log2 x⌋ + 1 bits (i.e. a factor of 2 from optimal)
 Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers

It is a prefix-free encoding…


 Given the following sequence of g-coded integers, reconstruct the original sequence:

   0001000001100110000011101100111

 Answer: 8, 6, 3, 59, 7
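A small C sketch of g-encoding and g-decoding, keeping the bits in a plain character buffer just for readability; run on the exercise above it reproduces the bit string and then the sequence 8 6 3 59 7.

#include <stdio.h>

static int gamma_encode(unsigned x, char *out) {        /* returns #bits written */
    int len = 0;
    while ((1u << (len + 1)) <= x) len++;                /* len = floor(log2 x)   */
    int k = 0;
    for (int i = 0; i < len; i++) out[k++] = '0';        /* Length-1 zeros        */
    for (int i = len; i >= 0; i--) out[k++] = ((x >> i) & 1) ? '1' : '0';
    return k;
}

static unsigned gamma_decode(const char *in, int *pos) {
    int len = 0;
    while (in[*pos] == '0') { len++; (*pos)++; }          /* count leading zeros  */
    unsigned x = 0;
    for (int i = 0; i <= len; i++) x = 2 * x + (in[(*pos)++] - '0');
    return x;
}

int main(void) {
    char buf[256]; int k = 0, pos = 0;
    unsigned seq[] = {8, 6, 3, 59, 7};
    for (int i = 0; i < 5; i++) k += gamma_encode(seq[i], buf + k);
    buf[k] = '\0';
    printf("%s\n", buf);                  /* 0001000001100110000011101100111 */
    while (pos < k) printf("%u ", gamma_decode(buf, &pos));
    printf("\n");
    return 0;
}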

Analysis
Sort the pi in decreasing order, and encode symbol si via the variable-length code g(i).

Recall that |g(i)| ≤ 2·log i + 1.
How good is this approach wrt Huffman? Compression ratio ≤ 2·H0(s) + 1.
Key fact:  1 ≥ Σ_{i=1..x} pi ≥ x·px   ⟹   x ≤ 1/px

How good is it ?
Encode the integers via g-coding, |g(i)| ≤ 2·log i + 1. The cost of the encoding is (recall i ≤ 1/pi):

  Σ_{i=1..|S|} pi·|g(i)|  ≤  Σ_{i=1..|S|} pi·[2·log(1/pi) + 1]  =  2·H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding

 Byte-aligned and tagged Huffman:
  128-ary Huffman tree
  the first bit of the first byte is tagged
  configurations on 7 bits: just those of Huffman

 End-tagged dense code:
  the rank r is mapped to the r-th binary sequence on 7·k bits
  the first bit of the last byte is tagged

Surprising changes:
 it is a prefix-code
 better compression: it uses all the 7-bit configurations

(s,c)-dense codes
The distribution of words is skewed: 1/i^q, where 1 < q < 2.

 A new concept: Continuers vs Stoppers
  previously we used s = c = 128
 The main idea is:
  s + c = 256 (we are playing with 8 bits)
  thus s items are encoded with 1 byte
  and s·c with 2 bytes, s·c^2 with 3 bytes, ...

An example
 5000 distinct words
 ETDC encodes 128 + 128^2 = 16512 words within 2 bytes
 the (230,26)-dense code encodes 230 + 230·26 = 6210 within 2 bytes, hence more words on 1 byte, and thus it wins if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, assuming c = 256 - s.

 Brute-force approach
 Binary search: on real distributions there seems to be one unique minimum
  (K_s = max codeword length, F_s^k = cumulative probability of the symbols whose codeword length is ≤ k)

Experiments: (s,c)-DC is quite interesting…
Search is 6% faster than byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms…. Can we do everything in one pass ?

 Move-to-Front (MTF):
  as a freq-sorting approximator
  as a caching strategy
  as a compressor

 Run-Length-Encoding (RLE):
  FAX compression

Move to Front Coding
Transforms a char sequence into an integer sequence, which can then be var-length coded.

 Start with the list of symbols L = [a,b,c,d,…]
 For each input symbol s:
  1) output the position of s in L
  2) move s to the front of L

There is a memory. Properties:
 it exploits temporal locality, and it is dynamic
 X = 1^n 2^n 3^n … n^n  ⟹  Huff = O(n^2 log n), MTF = O(n log n) + n^2

Not much worse than Huffman ...but it may be far better.
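A toy C sketch of the MTF transform just described, with the list kept as a plain array (O(|S|) work per symbol; faster list maintenance is discussed a couple of slides below).

#include <stdio.h>
#include <string.h>

int main(void) {
    char L[] = "abcd";                        /* initial MTF list             */
    const char *input = "abbbccccdd";
    for (const char *p = input; *p; p++) {
        int i = 0;
        while (L[i] != *p) i++;               /* 1) position of the symbol    */
        printf("%d ", i + 1);
        for (; i > 0; i--) L[i] = L[i - 1];   /* 2) move it to the front      */
        L[0] = *p;
    }
    printf("\n");                             /* prints: 1 2 1 1 3 1 1 1 4 1  */
    return 0;
}

Note how the repeated symbols are mapped to small integers (mostly 1s), which the g-code then encodes in very few bits.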

MTF: how good is it ?
Encode the integers via g-coding: |g(i)| ≤ 2·log i + 1.
Put S in front of the sequence; if p_1^x < p_2^x < … are the positions of the occurrences of symbol x, the cost of the encoding is at most

  O(|S| log |S|) + Σ_{x=1..|S|} Σ_{i=2..n_x} |g(p_i^x - p_(i-1)^x)|

By Jensen's inequality this is

  ≤ O(|S| log |S|) + Σ_{x=1..|S|} n_x · [2·log(N/n_x) + 1]
  = O(|S| log |S|) + N·[2·H0(X) + 1]

so La[mtf] ≤ 2·H0(X) + O(1).

MTF: higher compression
To achieve higher compression we consider words (and separators) as the symbols to be encoded.
How to keep the MTF-list efficiently:

 Search tree:
  leaves contain the symbols, ordered as in the MTF-list
  nodes contain the size of their descending subtree
 Hash table:
  the key is a symbol
  the data is a pointer to the corresponding tree leaf

Each tree operation takes O(log |S|) time; the total is O(n log |S|), where n = #symbols to be compressed.

Run Length Encoding (RLE)
If spatial locality is very high, then
  abbbaacccca  =>  (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings, just the run lengths and one bit suffice.

Properties:
 it exploits spatial locality, there is a memory, and it is a dynamic code
 X = 1^n 2^n 3^n … n^n  ⟹  Huff(X) = Θ(n^2 log n) > Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive):

  f(i) = Σ_{j=1..i-1} p(j)

e.g. p(a) = .2, p(b) = .5, p(c) = .3:
  a = [0.0, 0.2),  b = [0.2, 0.7),  c = [0.7, 1.0)
  i.e. f(a) = .0, f(b) = .2, f(c) = .7

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7)).

Arithmetic Coding: Encoding Example
Coding the message sequence: bac

[Figure: nested intervals. b restricts [0,1) to [.2,.7); a restricts [.2,.7) to [.2,.3); c restricts [.2,.3) to [.27,.3).]

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c_1 ... c_n with probabilities p[c], use:

  l_0 = 0,   l_i = l_(i-1) + s_(i-1) · f[c_i]
  s_0 = 1,   s_i = s_(i-1) · p[c_i]

where f[c] is the cumulative probability up to symbol c (not included).
The final interval size is

  s_n = ∏_{i=1..n} p[c_i]

The interval for a message sequence will be called the sequence interval. (A small numeric sketch follows below.)
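A floating-point C sketch of the recurrence above, on the running example p(a)=.2, p(b)=.5, p(c)=.3; a real coder uses the scaled-integer version discussed a few slides later, this is only to make the interval arithmetic concrete.

#include <stdio.h>

int main(void) {
    const char *msg = "bac";
    double p[3] = {0.2, 0.5, 0.3};          /* p[a], p[b], p[c]          */
    double f[3] = {0.0, 0.2, 0.7};          /* cumulative probabilities  */

    double l = 0.0, s = 1.0;                /* l_0 = 0, s_0 = 1          */
    for (const char *m = msg; *m; m++) {
        int c = *m - 'a';
        l = l + s * f[c];                   /* l_i = l_(i-1) + s_(i-1)*f[c_i] */
        s = s * p[c];                       /* s_i = s_(i-1) * p[c_i]         */
    }
    printf("sequence interval = [%.4f, %.4f)\n", l, l + s);   /* [.2700, .3000) */
    return 0;
}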

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore, specifying any number in the final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:

[Figure: .49 falls in b's interval [.2,.7); within it, .49 falls in b's sub-interval [.3,.55); within that, it falls in c's sub-interval [.475,.55).]

The message is bbc.

Representing a real number
Binary fractional representation:

  .75 = .11      1/3 = .0101...      11/16 = .1011

Algorithm (to emit the bits of x in [0,1)):
  1. x = 2·x
  2. if x < 1 output 0
  3. else x = x - 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
  e.g.  [0,.33) = .01     [.33,.66) = .1     [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions:

  code    min      max      interval
  .11     .110     .111     [.75, 1.0)
  .101    .1010    .1011    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval is contained in the sequence interval (a dyadic number).

[Figure: sequence interval [.61, .79); the code interval of .101, namely [.625, .75), is contained in it.]

Can use L + s/2 truncated to 1 + ⌈log(1/s)⌉ bits.

Bound on Arithmetic length
Note that -log s + 1 = log(2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

  1 + ⌈log(1/s)⌉ = 1 + ⌈log ∏_i (1/p_i)⌉
                 ≤ 2 + Σ_{j=1..n} log(1/p_j)
                 = 2 + Σ_{k=1..|S|} n·p_k·log(1/p_k)
                 = 2 + n·H0   bits

In practice ≈ n·H0 + 0.02·n bits, because of rounding.

Integer Arithmetic Coding
The problem is that operations on arbitrary-precision real numbers are expensive.
Key ideas of the integer version:
 keep integers in range [0..R) where R = 2^k
 use rounding to generate the integer interval
 whenever the sequence interval falls into the top, bottom or middle half, expand the interval by a factor 2

Integer Arithmetic is an approximation.

Integer Arithmetic (scaling)
 If l ≥ R/2 (top half): output 1 followed by m 0s; m = 0; the message interval is expanded by 2
 If u < R/2 (bottom half): output 0 followed by m 1s; m = 0; the message interval is expanded by 2
 If l ≥ R/4 and u < 3R/4 (middle half): increment m; the message interval is expanded by 2
 In all other cases, just continue...


Arithmetic ToolBox
As a state machine: given the current interval (L,s) and a symbol c with distribution (p1,...,p|S|), the ATB returns the new interval (L',s').

[Figure: the ATB state machine mapping (L,s) and c to (L',s').]

Therefore, even the distribution can change over time.

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox

[Figure: the ATB driven by PPM: at each step the interval (L,s) is refined to (L',s') using p[s|context], where s is either the next character c or the escape symbol esc.]

Encoder and Decoder must know the protocol for selecting the same conditional probability distribution (PPM-variant).

PPM: Example Contexts (k = 2)
String = ACCBACCACBA, next symbol = B

  Empty context        Order-1 contexts              Order-2 contexts
   A = 4                A:  C = 3, $ = 1              AC:  B = 1, C = 2, $ = 2
   B = 2                B:  A = 2, $ = 1              BA:  C = 1, $ = 1
   C = 5                C:  A = 1, B = 2,             CA:  C = 1, $ = 1
   $ = 3                    C = 2, $ = 3              CB:  A = 2, $ = 1
                                                      CC:  A = 1, B = 1, $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) for n → ∞ !!

LZ77

  a a c a a c a b c a b a b a c
  [dictionary = all substrings of the text seen so far | cursor]        e.g. output <2,3,c>

Algorithm's step:
 output <d, len, c> where
  d = distance of the copied string wrt the current position
  len = length of the longest match
  c = next char in the text beyond the longest match
 advance by len + 1

A buffer "window" of fixed length slides over the text.
Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
The decoder keeps the same dictionary window as the encoder:
 it finds the substring and inserts a copy of it.

What if len > d? (overlap with the text still to be decompressed)
 E.g. seen = abcd, next codeword is (2,9,e):
  simply copy starting at the cursor
    for (i = 0; i < len; i++) out[cursor+i] = out[cursor-d+i];
  the output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Input: a a b a a c a b c a b c b

  parsed phrase    output    dictionary
  a                (0,a)     1 = a
  a b              (1,b)     2 = ab
  a a              (1,a)     3 = aa
  c                (0,c)     4 = c
  a b c            (2,c)     5 = abc
  a b c b          (5,b)     6 = abcb

LZ78: Decoding Example

  input    decoded so far              dictionary
  (0,a)    a                           1 = a
  (1,b)    a a b                       2 = ab
  (1,a)    a a b a a                   3 = aa
  (0,c)    a a b a a c                 4 = c
  (2,c)    a a b a a c a b c           5 = abc
  (5,b)    a a b a a c a b c a b c b   6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
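A toy C sketch of an LZW encoder, consistent with the example that follows (single-character codes a = 112, b = 113, c = 114 as in the slides); the longest match is found here by linear search over the dictionary, which is enough for a sketch but not how a real implementation (trie- or hash-based) would do it.

#include <stdio.h>
#include <string.h>

#define MAXD 4096
static char dict[MAXD][64];
static int  code[MAXD], ndict = 0;
static void add(const char *s, int c) { strcpy(dict[ndict], s); code[ndict] = c; ndict++; }

int main(void) {
    const char *T = "aabaacababacb";
    add("a", 112); add("b", 113); add("c", 114);     /* initial single-char entries */
    int next_code = 256;

    size_t i = 0, n = strlen(T);
    while (i < n) {
        int best = 0; size_t bestlen = 0;
        for (int d = 0; d < ndict; d++) {            /* longest dictionary match at i */
            size_t len = strlen(dict[d]);
            if (len > bestlen && i + len <= n && strncmp(T + i, dict[d], len) == 0) {
                best = d; bestlen = len;
            }
        }
        printf("%d ", code[best]);                   /* emit the code of the match    */
        if (i + bestlen < n && ndict < MAXD) {       /* add match + next char         */
            char news[64];
            memcpy(news, T + i, bestlen + 1); news[bestlen + 1] = '\0';
            add(news, next_code++);
        }
        i += bestlen;
    }
    printf("\n");      /* prints: 112 112 113 256 114 257 261 114 113 */
    return 0;
}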

LZW: Encoding Example
Input: a a b a a c a b a b a c b   (a = 112, b = 113, c = 114)

  longest match    output    new dictionary entry
  a                112       256 = aa
  a                112       257 = ab
  b                113       258 = ba
  aa               256       259 = aac
  c                114       260 = ca
  ab               257       261 = aba
  aba              261       262 = abac
  c                114       263 = cb

LZW: Decoding Example

  input    decoded so far            new dictionary entry
  112      a
  112      a a                       256 = aa
  113      a a b                     257 = ab
  256      a a b a a                 258 = ba
  114      a a b a a c               259 = aac
  257      a a b a a c a b           260 = ca
  261      a a b a a c a b ?         the code 261 is not yet in the dictionary!
  261      a a b a a c a b a b a     261 = aba   (the special SSc case: 261 = previous match "ab" + its first char)
  114      a a b a a c a b a b a c   262 = abac

The decoder is one step behind the coder, and completes each dictionary entry one step later.

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform (1994)
Let us be given a text T = mississippi#. Write all its cyclic rotations:

  mississippi#
  ississippi#m
  ssissippi#mi
  sissippi#mis
  issippi#miss
  ssippi#missi
  sippi#missis
  ippi#mississ
  ppi#mississi
  pi#mississip
  i#mississipp
  #mississippi

Sort the rows; F is the first column, L = BWT(T) is the last column:

  F                 L
  #  mississipp     i
  i  #mississip     p
  i  ppi#missis     s
  i  ssippi#mis     s
  i  ssissippi#     m
  m  ississippi     #
  p  i#mississi     p
  p  pi#mississ     i
  s  ippi#missi     s
  s  issippi#mi     s
  s  sippi#miss     i
  s  sissippi#m     i

A famous example: on much longer (real) texts the effect is far more visible.

A useful tool: the L → F mapping

[Figure: the F and L columns of the sorted rotations of mississippi#; the middle of the matrix is unknown.]

How do we map L's chars onto F's chars ?
... We need to distinguish equal chars in F...

Take two equal chars of L and rotate their rows rightward by one position: the two rows now start with that char and are still sorted between themselves, hence equal chars keep in F the same relative order they have in L.

The BWT is invertible

[Figure: the F and L columns of the sorted rotations of mississippi#.]

Two key properties:
1. The LF-array maps L's chars to F's chars (same char, same rank).
2. L[i] precedes F[i] in T.

Reconstruct T backward:

  T = .... i p p i #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
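A runnable C sketch of InvertBWT: LF is built by counting, using the property that equal chars keep in F the same relative order they have in L; the string L and the starting row are those of the mississippi# example.

#include <stdio.h>
#include <string.h>

int main(void) {
    const char *L = "ipssm#pissii";                   /* BWT of mississippi#        */
    int n = strlen(L), LF[64], count[256] = {0}, first[256], seen[256] = {0};
    char T[64];

    for (int i = 0; i < n; i++) count[(unsigned char)L[i]]++;
    for (int c = 0, tot = 0; c < 256; c++) {          /* first[c] = #chars < c      */
        first[c] = tot; tot += count[c];
    }
    for (int i = 0; i < n; i++) {                     /* k-th c in L -> k-th c in F */
        unsigned char c = (unsigned char)L[i];
        LF[i] = first[c] + seen[c]++;
    }
    int r = 0;                                        /* row 0 starts with #        */
    T[n - 1] = '#';                                   /* the end-marker closes T    */
    for (int i = n - 2; i >= 0; i--) {                /* L[r] precedes F[r] in T    */
        T[i] = L[r];
        r = LF[r];
    }
    T[n] = '\0';
    printf("%s\n", T);                                /* mississippi#               */
    return 0;
}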

How to compute the BWT ?

  SA = 12 11 8 5 2 1 10 9 7 4 6 3      (suffix array of T = mississippi#)
  L  =  i  p s s m #  p i s s i i

[Figure: the BWT matrix rows are the sorted rotations; the i-th row starts at position SA[i] of T.]

We said that L[i] precedes F[i] in T. E.g. L[3] = T[SA[3]-1] = T[7].
Given SA and T, we have L[i] = T[SA[i]-1] (taking T cyclically when SA[i] = 1).

How to construct SA from T ?
Input: T = mississippi#

  SA
  12   #
  11   i#
   8   ippi#
   5   issippi#
   2   ississippi#
   1   mississippi#
  10   pi#
   9   ppi#
   7   sippi#
   4   sissippi#
   6   ssippi#
   3   ssissippi#

Sorting the suffixes by comparison is elegant but inefficient. Obvious inefficiencies:
 • Θ(n^2 log n) time in the worst case
 • Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:
 L is locally homogeneous, hence L is highly compressible

Algorithm Bzip:
 Move-to-Front coding of L
 Run-Length coding
 Statistical coder

Bzip vs. Gzip: 20% vs. 33% compression ratio, but Bzip is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii      (# at position 16)

MTF-list = [i,m,p,s]
MTF  = 020030000030030200300300000100000
MTF' = 030040000040040300400400000200000   (non-zero values shifted up by 1, zeros reserved for the run-length step)
Bin(6) = 110, Wheeler's code
RLE0 = 03141041403141410210                (runs of 0s are run-length coded; the alphabet becomes |S|+1)

Bzip2-output = Arithmetic/Huffman on the |S|+1 symbols... plus g(16), plus the original MTF-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…

 Physical network graph
  V = routers, E = communication links

 The "cosine" graph (undirected, weighted)
  V = static web pages, E = semantic distance between pages

 Query-Log graph (bipartite, weighted)
  V = queries and URLs, E = (q,u) if u is a result for q and has been clicked by some user who issued q

 Social graph (undirected, unweighted)
  V = users, E = (x,y) if x knows y (facebook, address book, email, ..)

Definition
Directed graph G = (V,E)
 V = URLs, E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN & no OUT)

Three key properties:
 Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution

Altavista crawl (1999) and WebBase crawl (2001): the indegree follows a power law distribution

  Pr[in-degree(u) = k]  ∝  1/k^a,    a ≈ 2.1
A Picture of the Web Graph

Definition
Directed graph G = (V,E)
 V = URLs, E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN, no OUT)

Three key properties:
 Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1
 Locality: usually most of the hyperlinks point to other URLs on the same host (about 80%).
 Similarity: pages close in lexicographic order tend to share many outgoing lists.

A Picture of the Web Graph

[Figure: dot-plot of the adjacency matrix of a crawl with 21 million pages and 150 million links, with URLs sorted lexicographically (Berkeley, Stanford).]

URL compression + Delta encoding

The library WebGraph

Uncompressed adjacency list  →  adjacency list with compressed gaps (exploits locality):

  Successor list S(x) = {s1 - x, s2 - s1 - 1, ..., sk - s(k-1) - 1}

For negative entries:
Copy-lists (reference chains, possibly limited)

Uncompressed adjacency list  →  adjacency list with copy lists (exploits similarity):
 each bit of y's copy list tells whether the corresponding successor of the reference node x is also a successor of y;
 the reference index is chosen in [0,W] so as to give the best compression.

Copy-blocks = RLE(Copy-list)

Adjacency list with copy lists  →  adjacency list with copy blocks (RLE on the bit sequences):
 the first copy block is 0 if the copy list starts with 0;
 the last block is omitted (we know the length…);
 the length is decremented by one for all blocks.

This is a Java and C++ lib (≈ 3 bits/edge)

Extra-nodes: Compressing Intervals

Adjacency list with copy blocks  →  consecutivity in the extra-nodes is exploited:
 Intervals: represented by their left extreme and length
 Interval length: decremented by Lmin = 2
 Residuals: differences between consecutive residuals, or wrt the source

Examples from the figure:
  0 = (15-15)*2       (positive)
  2 = (23-19)-2       (jump >= 2)
  600 = (316-16)*2
  3 = |13-15|*2-1     (negative)
  3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background

[Figure: a sender and a receiver connected by a network link; the receiver already holds some knowledge about the data.]

 network links are getting faster and faster, but
 many clients are still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques

 caching: "avoid sending the same object again"
  done on the basis of objects
  only works if objects are completely unchanged
  How about objects that are slightly changed?

 compression: "remove redundancy in transmitted data"
  avoid repeated substrings in data (at some overhead)
  can be extended to the history of past transmissions
  What if the sender has never seen the data at the receiver ?

Types of Techniques

 Common knowledge between sender & receiver
  unstructured file: delta compression

 "Partial" knowledge
  unstructured files: file synchronization
  record-based data: set reconciliation
Formalization

 Delta compression   [diff, zdelta, REBL,…]
  compress file f deploying file f'
  compress a group of files
  speed up web access by sending differences between the requested page and the ones available in cache

 File synchronization   [rsync, zsync]
  a client updates an old file f_old with f_new available on a server
  mirroring, shared crawling, content distribution networks

 Set reconciliation
  a client updates a structured old file f_old with f_new available on a server
  update of contacts or appointments, intersecting inverted lists in a P2P search engine

Z-delta compression (one-to-one)

Problem: we have two files f_known and f_new, and the goal is to compute a file f_d of minimum size such that f_new can be derived from f_known and f_d.
 Assume that block moves and copies are allowed
 Find an optimal covering set of f_new based on f_known
 The LZ77-scheme provides an efficient, optimal solution:
  f_known is the "previously encoded text"; compress the concatenation f_known·f_new starting from f_new
 zdelta is one of the best implementations

            Emacs size    Emacs time
  uncompr     27Mb           ---
  gzip          8Mb         35 secs
  zdelta       1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: a pair of proxies located on each side of the slow link use a proprietary protocol to increase performance over this link.

[Figure: Client ↔ client-side proxy ↔ (slow link, delta-encoding) ↔ server-side proxy ↔ (fast link) ↔ web; both proxies hold the reference page.]

Use zdelta to reduce traffic:
 the old version is available at both proxies
 restricted to pages already visited (30% hits), URL-prefix match
 small cache

Cluster-based delta compression
Problem: we wish to compress a group of files F
 Useful on a dynamic collection of web pages, back-ups, …

 Apply pairwise zdelta: find for each f ∈ F a good reference
 Reduction to the Min Branching problem on DAGs:
  build a weighted graph G_F, nodes = files, weights = zdelta-sizes
  insert a dummy node connected to all files, whose edge weights are the gzip-sizes
  compute the min branching = directed spanning tree of minimum total cost covering G's nodes

[Figure: a small example graph with a dummy node 0 and edge weights such as 20, 123, 220, 620, 2000.]

            space    time
  uncompr    30Mb     ---
  tgz        20%      linear
  THIS        8%      quadratic
Improvement (many-to-one compression of a group of files)

Problem: constructing G is very costly, n^2 edge calculations (zdelta executions). We wish to exploit some pruning approach:

 Collection analysis: cluster the files that appear similar and thus are good candidates for zdelta-compression; build a sparse weighted graph G'_F containing only edges between those pairs of files
 Assign weights: estimate appropriate edge weights for G'_F, thus saving zdelta executions. Nonetheless, strictly n^2 time

            space    time
  uncompr   260Mb     ---
  tgz        12%      2 mins
  THIS        8%      16 mins

Algoritmi per IR

File Synchronization

File synch: The problem

[Figure: a Client holding f_old sends a request to a Server holding f_new; the Server answers with an update.]

 the client wants to update an out-dated file
 the server has the new file but does not know the old file
 update without sending the entire f_new (using similarity)
 rsync: file-synch tool, distributed with Linux

Delta compression is a sort of local synch, since the server has both copies of the files.
The rsync algorithm

[Figure: the Client sends the block hashes of f_old; the Server replies with the encoded file built from f_new.]
The rsync algorithm (contd)

 simple, widely used, single roundtrip
 optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
 choice of block size is problematic (default: max{700, √n} bytes)
 not good in theory: granularity of changes may disrupt the use of blocks
Rsync: some experiments

           gcc size    emacs size
  total     27288        27326
  gzip       7563         8577
  zdelta      227         1431
  rsync       964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends the hashes (unlike the client in rsync), and the client checks them.
The server deploys the common f_ref to compress the new f_tar (rsync compresses just it).

A multi-round protocol
 k blocks of n/k elements, log(n/k) levels
 if the distance is k, then on each level ≤ k hashes do not find a match in the other file
 the communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts

Pattern P occurs at position i of T
  iff
P is a prefix of the i-th suffix of T (i.e. T[i,N])

[Figure: T with the suffix T[i,N] starting at position i, and P aligned on top of it.]

Occurrences of P in T = all suffixes of T having P as a prefix

  Example: T = mississippi, P = si  →  occurrences at positions 4, 7

SUF(T) = sorted set of suffixes of T

Reduction: from substring search to prefix search (over the set of suffixes).

The Suffix Tree

T# = mississippi#

[Figure: the suffix tree of T#. Edges are labeled with substrings of T# (e.g. "ssi", "ppi#", "mississippi#"); the 12 leaves store the starting positions 1..12 of the suffixes; internal nodes branch on #, i, m, p, s.]

The Suffix Array

Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

T = mississippi#,  P = si

  SA    SUF(T)
  12    #
  11    i#
   8    ippi#
   5    issippi#
   2    ississippi#
   1    mississippi#
  10    pi#
   9    ppi#
   7    sippi#
   4    sissippi#
   6    ssippi#
   3    ssissippi#

Storing SUF(T) explicitly would take Θ(N^2) space; we keep only the suffix pointers:
 • SA: Θ(N log2 N) bits
 • text T: N chars
  In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step.

  T = mississippi#,  P = si
  SA = 12 11 8 5 2 1 10 9 7 4 6 3

[Figure: two binary-search steps; comparing P against the suffix pointed to by the middle entry of SA tells whether P is larger or smaller, and the search continues in the corresponding half.]

Suffix Array search
 • O(log2 N) binary-search steps
 • each step takes O(p) char comparisons
  overall, O(p log2 N) time
 • improvable to O(p + log2 N) [Manber-Myers, '90] and to O(p + log2 |S|) [Cole et al, '06]
 (a C sketch of the binary search follows below)

Locating the occurrences

T = mississippi#,  P = si,  occ = 2

[Figure: the two binary searches delimit the contiguous range of SA holding the suffixes sippi... and sissippi..., i.e. the occurrences at positions 7 and 4.]

Suffix Array search: O(p + log2 N + occ) time
 Suffix Trays: O(p + log2 |S| + occ)    [Cole et al., '06]
 String B-tree                          [Ferragina-Grossi, '95]
 Self-adjusting Suffix Arrays           [Ciriani et al., '02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

  T = mississippi#
  SA  = 12 11 8 5 2 1 10 9 7 4 6 3
  Lcp =   0  1 1 4 0 0  1 0 2 1 3

  (e.g. Lcp = 4 between the adjacent suffixes issippi# and ississippi#)

• How long is the common prefix between T[i,...] and T[j,...] ?
  • Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  • Search for Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  • Search for a run Lcp[i,i+C-2] whose entries are all ≥ L.


Slide 109

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons

We can compute H(Tr) from H(Tr−1):

H(Tr) = 2·H(Tr−1) − 2^m·T(r−1) + T(r+m−1)

T = 10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H(T1) = H(1011) = 11
H(T2) = H(0110) = 2·11 − 2⁴·1 + 0 = 22 − 16 = 6 = H(0110)

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P = 101111
q = 7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally:
1·2 + 0 (mod 7) = 2
2·2 + 1 (mod 7) = 5
5·2 + 1 (mod 7) = 4
4·2 + 1 (mod 7) = 2
2·2 + 1 (mod 7) = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(Tr−1), since
2^m (mod q) = 2·(2^(m−1) (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
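
A minimal Python sketch of the scheme just described (an illustration, not the slides'
code): here q is a fixed prime rather than one drawn at random below a threshold I, and
every fingerprint hit is verified, so only definite matches are reported.

def karp_rabin(T, P, q=999983):
    # T, P: strings over {0,1}; q: a prime (the real algorithm picks it at random <= I)
    n, m = len(T), len(P)
    if m > n:
        return []
    top = pow(2, m - 1, q)                    # 2^(m-1) mod q, used to drop the leading bit
    hp = ht = 0
    for i in range(m):                        # fingerprints Hq(P) and Hq(T1)
        hp = (2 * hp + int(P[i])) % q
        ht = (2 * ht + int(T[i])) % q
    occ = []
    for r in range(n - m + 1):
        if ht == hp and T[r:r + m] == P:      # verify, so no false match is reported
            occ.append(r + 1)                 # 1-based position, as on the slides
        if r + m < n:                         # rolling update of the fingerprint
            ht = (2 * (ht - int(T[r]) * top) + int(T[r + m])) % q
    return occ

# karp_rabin("10110101", "0101") returns [5], as in the example above.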

Problem 1: Solution
Dictionary: bzip, not, or, space. P = bzip = 1a 0b, S = “bzip or not bzip”.
The tagged, byte-aligned codewords turn the compressed scan of C(S) into a sequence of yes/no codeword comparisons.
[Figure: scanning C(S) codeword by codeword; the two occurrences of bzip are found.]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

[Figure: the m×n matrix M for T = california, P = for; the only column with M(3,j) = 1 is j = 7, where P = for matches T[5..7].]

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M

We want to exploit bit-parallelism to
compute the j-th column of M from the (j−1)-th one.
We define the m-length binary vector U(x) for each
character x in the alphabet: U(x) is set to 1 at the
positions in P where character x appears.
Example:
P = abaac
U(a) = (1,0,1,1,0)   U(b) = (0,1,0,0,0)   U(c) = (0,0,0,0,1)

How to construct M

Initialize column 0 of M to all zeros.
For j > 0, the j-th column is obtained by

M(j) = BitShift(M(j−1)) & U(T[j])

For i > 1, entry M(i,j) = 1 iff
(1) the first i−1 characters of P match the i−1 characters of T
ending at character j−1   ⇔ M(i−1, j−1) = 1
(2) P[i] = T[j]   ⇔ the i-th bit of U(T[j]) = 1

BitShift moves bit M(i−1, j−1) into the i-th position;
AND-ing this with the i-th bit of U(T[j]) establishes whether both conditions hold.
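
A minimal Python sketch of the whole Shift-And scanner (assuming m ≤ w, so a column of
M is kept in a single integer; bit i−1 of the integer plays the role of M(i,j)):

def shift_and(T, P):
    m = len(P)
    U = {}
    for i, c in enumerate(P):                  # U(c): bits set at the positions of c in P
        U[c] = U.get(c, 0) | (1 << i)
    col, occ = 0, []
    for j, c in enumerate(T, start=1):
        col = ((col << 1) | 1) & U.get(c, 0)   # M(j) = BitShift(M(j-1)) & U(T[j])
        if col & (1 << (m - 1)):               # M(m,j) = 1: occurrence ending at j
            occ.append(j - m + 1)              # 1-based starting position
    return occ

# shift_and("california", "for") returns [5]: the single column with M(3,j)=1 is j=7.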

An example, j=1
T = xabxabaaca, P = abaac (positions 1..10 of T, 1..5 of P).
Since T[1] = x and U(x) = (0,0,0,0,0), the new column is
M(1) = BitShift(M(0)) & U(T[1]) = (1,0,0,0,0) & (0,0,0,0,0) = (0,0,0,0,0).

An example, j=2
T = xabxabaaca, P = abaac. Since T[2] = a and U(a) = (1,0,1,1,0),
M(2) = BitShift(M(1)) & U(T[2]) = (1,0,0,0,0) & (1,0,1,1,0) = (1,0,0,0,0):
the prefix a of P matches the single character T[2].

An example, j=3
T = xabxabaaca, P = abaac. Since T[3] = b and U(b) = (0,1,0,0,0),
M(3) = BitShift(M(2)) & U(T[3]) = (1,1,0,0,0) & (0,1,0,0,0) = (0,1,0,0,0):
the prefix ab of P matches T[2..3].

An example, j=9
T = xabxabaaca, P = abaac. Since T[9] = c and U(c) = (0,0,0,0,1),
M(9) = BitShift(M(8)) & U(T[9]) = (1,1,0,0,1) & (0,0,0,0,1) = (0,0,0,0,1):
M(5,9) = 1, i.e. the whole pattern abaac occurs in T ending at position 9.

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions

We want to allow the pattern to contain special
symbols, like the class of chars [a-f].
P = [a-b]baac
U(a) = (1,0,1,1,0)   U(b) = (1,1,0,0,0)   U(c) = (0,0,0,0,1)

What about ‘?’ and ‘[^…]’ (not)?
Problem 1: Another solution
Dictionary: bzip, not, or, space. P = bzip = 1a 0b, S = “bzip or not bzip”.
[Figure: searching the codeword of P directly in C(S), reporting yes/no at each byte-aligned codeword.]

Speed ≈ Compression ratio

Problem 2
Dictionary: bzip, not, or, space. S = “bzip or not bzip”.
Given a pattern P, find all the occurrences in S of all terms containing P as a substring.
Example: P = o matches the dictionary terms
not = 1 g 0 g 0 a
or = 1 g 0 a 0 b
[Figure: the occurrences of those codewords located in C(S).]

Speed ≈ Compression ratio? No! Why?
A scan of C(S) is needed for each term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1, P2, …, Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: two patterns P1 and P2 aligned at their occurrences in T.]

 Naïve solution
 Use an (optimal) exact-matching algorithm to search for each
pattern of P
 Complexity: O(nl + m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

S is the concatenation of the patterns in P.
R is a bitmap of length m:
 R[i] = 1 iff S[i] is the first symbol of a pattern.

Use a variant of the Shift-And method searching for S:
 For any symbol c, U’(c) = U(c) AND R
  U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
 For any step j,
  compute M(j)
  then M(j) OR U’(T[j]). Why?
  This sets to 1 the first bit of each pattern that starts with T[j]
  Check if there are occurrences ending in j. How?
A sketch is given below.
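
A minimal Python sketch of this multi-pattern variant (assuming the total length m of
the concatenation S fits in one machine word; names and the reporting format are
illustrative):

def multi_shift_and(T, patterns):
    S = "".join(patterns)
    U, R, ends, pos = {}, 0, [], 0
    for P in patterns:
        R |= 1 << pos                          # first symbol of each pattern
        ends.append(pos + len(P) - 1)          # last symbol of each pattern
        pos += len(P)
    for i, c in enumerate(S):
        U[c] = U.get(c, 0) | (1 << i)
    col, occ = 0, []
    for j, c in enumerate(T, start=1):
        Uc = U.get(c, 0)
        col = ((col << 1) & Uc) | (Uc & R)     # shift-and step, then OR with U'(T[j])
        for k, e in enumerate(ends):           # occurrences ending at j?
            if col & (1 << e):
                occ.append((j - len(patterns[k]) + 1, k))
    return occ

# multi_shift_and("abab", ["ab", "ba"]) returns [(1, 0), (2, 1), (3, 0)]:
# pattern 0 at positions 1 and 3, pattern 1 at position 2.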

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep

Our current goal: given k, find all the occurrences of P
in T with up to k mismatches.
We define the matrix Ml to be an m by n binary
matrix such that:

Ml(i,j) = 1 iff
there are no more than l mismatches between the
first i characters of P and the i characters of T ending
at character j.

What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1

The first i−1 characters of P match a substring of T ending
at j−1 with at most l mismatches, and the next pair of
characters in P and T are equal:

BitShift(M^l(j−1)) & U(T[j])

Computing Ml: case 2

The first i−1 characters of P match a substring of T ending
at j−1 with at most l−1 mismatches (and P[i] is charged as a mismatch against T[j]):

BitShift(M^(l−1)(j−1))

Computing Ml

We compute Ml for all l = 0, …, k.
For each j compute M(j), M1(j), …, Mk(j).
For all l, initialize Ml(0) to the zero vector.
In order to compute Ml(j), we combine the two cases:

M^l(j) = [ BitShift(M^l(j−1)) & U(T[j]) ]  OR  BitShift(M^(l−1)(j−1))

A bit-parallel sketch follows.
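
A minimal Python sketch of the resulting bit-parallel scanner for up to k mismatches
(assuming m ≤ w; M[l] holds the current column of M^l as an integer):

def agrep_mismatch(T, P, k):
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M, occ = [0] * (k + 1), []
    for j, c in enumerate(T, start=1):
        prev = M[:]                                           # the columns at position j-1
        M[0] = ((prev[0] << 1) | 1) & U.get(c, 0)             # exact-match column
        for l in range(1, k + 1):                             # case 1 OR case 2
            M[l] = ((((prev[l] << 1) | 1) & U.get(c, 0))
                    | ((prev[l - 1] << 1) | 1))
        if M[k] & (1 << (m - 1)):
            occ.append(j - m + 1)       # P occurs ending at j with <= k mismatches
    return occ

# agrep_mismatch("aatatccacaa", "atcgaa", 2) returns [4]: as in the example above,
# P appears with 2 mismatches starting at position 4.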

Example M1
[Figure: the 5×10 matrices M0 and M1 for T = xabxabaaca and P = abaad; M1(5,9) = 1, i.e. abaad occurs ending at position 9 with at most 1 mismatch.]

How much do we pay?

The running time is O(kn(1 + m/w)).
Again, the method is practically efficient for small m.
Only the O(k) columns of M for the previous position are needed at any
given time; hence the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary: bzip, not, or, space. S = “bzip or not bzip”, P = bot, k = 2.
Given a pattern P, find all the occurrences in S of all terms containing
P as a substring allowing k mismatches: here the matching term is
not = 1 g 0 g 0 a
[Figure: the occurrences of that codeword located in C(S).]

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding
γ(x) = 0^(Length−1) followed by x in binary,  for x > 0 and Length = ⌊log2 x⌋ + 1
e.g., 9 is represented as <000, 1001>.

The γ-code for x takes 2⌊log2 x⌋ + 1 bits
(i.e. a factor of 2 from optimal)

It is optimal for Pr(x) = 1/(2x²), and i.i.d. integers.

It is a prefix-free encoding…

Given the following sequence of γ-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

Answer: 8, 6, 3, 59, 7.  A coder/decoder sketch follows.
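
A minimal Python sketch of γ-encoding and γ-decoding (strings of '0'/'1' are used as
the bitstream for readability):

def gamma_encode(x):
    b = bin(x)[2:]                        # x > 0, in binary: Length = floor(log2 x) + 1 bits
    return "0" * (len(b) - 1) + b         # (Length-1) zeros, then x in binary

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        l = 0
        while bits[i] == "0":             # count the leading zeros = Length - 1
            l += 1
            i += 1
        out.append(int(bits[i:i + l + 1], 2))
        i += l + 1
    return out

# gamma_encode(9) = "0001001", i.e. <000,1001>;
# gamma_decode("0001000001100110000011101100111") = [8, 6, 3, 59, 7] (the exercise above).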

Analysis
Sort the pi in decreasing order, and encode si via
the variable-length code γ(i).

Recall that: |γ(i)| ≤ 2·log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2·H0(S) + 1
Key fact:
1 ≥ Σ_{i=1..x} pi ≥ x·px   ⟹   x ≤ 1/px

How good is it?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1.
The cost of the encoding is (recall i ≤ 1/pi):

Σ_{i=1..|S|} pi · |γ(i)|  ≤  Σ_{i=1..|S|} pi · [2·log(1/pi) + 1]  =  2·H0(X) + 1

Not much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:
 It exploits temporal locality, and it is a dynamic code
 X = 1^n 2^n 3^n … n^n  ⟹  Huff = O(n² log n) bits, MTF = O(n log n) + n² bits

Not much worse than Huffman
...but it may be far better. A sketch of the encoder follows.
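
A minimal Python sketch of the MTF encoder as described above (positions are reported
1-based here, so that they can be fed directly to the γ-coder; a plain list is used for
L, while faster data structures are discussed a couple of slides below):

def mtf_encode(text, alphabet):
    L = list(alphabet)                    # start with the list of symbols L = [a,b,c,d,...]
    out = []
    for s in text:
        i = L.index(s)
        out.append(i + 1)                 # 1) output the position of s in L
        L.pop(i)                          # 2) move s to the front of L
        L.insert(0, s)
    return out

# mtf_encode("banana", "abn") = [2, 2, 3, 2, 2, 2]: after the first accesses,
# the repeated symbols sit near the front of L (temporal locality).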

MTF: how good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1.
Put the alphabet S at the front (its |S| initial positions cost O(|S| log |S|)) and
consider the cost of encoding the gaps between consecutive occurrences of each symbol x
(nx = #occurrences of x, N = total length):

O(|S| log |S|)  +  Σ_{x=1..|S|} Σ_{i=2..nx} |γ(p_i^x − p_{i−1}^x)|

By Jensen’s inequality:

≤  O(|S| log |S|)  +  Σ_{x=1..|S|} nx · [2·log(N/nx) + 1]
=  O(|S| log |S|)  +  N·[2·H0(X) + 1]

Hence La[mtf] ≤ 2·H0(X) + O(1) bits per symbol.

MTF: higher compression
To achieve higher compression we consider words (and
separators) as the symbols to be encoded.
How to keep the MTF-list efficiently:

 Search tree
  Leaves contain the symbols, ordered as in the MTF-list
  Nodes contain the size of their descending subtree
 Hash Table
  key is a symbol
  data is a pointer to the corresponding tree leaf

Each tree operation takes O(log |S|) time.
The total is O(n log |S|), where n = #symbols to be compressed.

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just the run lengths and one bit
There is a memory
Properties:
 It exploits spatial locality, and it is a dynamic code
 X = 1^n 2^n 3^n … n^n  ⟹  Huff(X) = Θ(n² log n) > Rle(X) = Θ(n log n)
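
A minimal Python sketch of RLE producing the (char, run-length) pairs of the example
above:

def rle(s):
    runs, i = [], 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1
        runs.append((s[i], j - i))        # one pair per maximal run
        i = j
    return runs

# rle("abbbaacccca") = [('a',1), ('b',3), ('a',2), ('c',4), ('a',1)]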

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive), of width equal to its probability:

f(i) = Σ_{j=1..i−1} p(j)

e.g. with p(a) = .2, p(b) = .5, p(c) = .3:
f(a) = .0, f(b) = .2, f(c) = .7
a = [0.0, 0.2),  b = [0.2, 0.7),  c = [0.7, 1.0)

The interval for a particular symbol will be called
the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac  (with a = .2, b = .5, c = .3 as above)
 start:     [0.0, 1.0)
 after b:   [0.2, 0.7)
 after a:   [0.2, 0.3)
 after c:   [0.27, 0.3)

The final sequence interval is [.27, .3)

Arithmetic Coding
To code a sequence of symbols c with probabilities p[c] use the following:

l_0 = 0,  l_i = l_{i−1} + s_{i−1} · f[c_i]
s_0 = 1,  s_i = s_{i−1} · p[c_i]

f[c] is the cumulative probability up to symbol c (not included).
The final interval size is

s_n = Π_{i=1..n} p[c_i]

The interval for a message sequence will be called the
sequence interval.
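
A minimal Python sketch of the sequence-interval computation with real arithmetic (only
an illustration of the formulas above; a practical coder uses the integer/scaling
version discussed later):

def sequence_interval(msg, p):
    # p: dict symbol -> probability; f[c] = cumulative probability of the symbols before c
    f, acc = {}, 0.0
    for c in sorted(p):                   # any fixed order shared by encoder and decoder
        f[c], acc = acc, acc + p[c]
    l, s = 0.0, 1.0
    for c in msg:
        l, s = l + s * f[c], s * p[c]     # l_i = l_{i-1} + s_{i-1}*f[c_i], s_i = s_{i-1}*p[c_i]
    return l, l + s

# sequence_interval("bac", {"a": .2, "b": .5, "c": .3}) returns (approximately)
# the sequence interval [.27, .3) of the encoding example.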

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3 (a = .2, b = .5, c = .3):
 .49 ∈ [.2, .7)    = symbol interval of b  →  output b
 .49 ∈ [.3, .55)   = sub-interval of b inside [.2, .7)  →  output b
 .49 ∈ [.475, .55) = sub-interval of c inside [.3, .55)  →  output c

The message is bbc.

Representing a real number
Binary fractional representation:
 .75 = .11       1/3 = .010101...       11/16 = .1011

Algorithm
1. x = 2·x
2. If x < 1 output 0
3. else x = x − 1; output 1

So how about just using the shortest binary
fractional representation lying in the sequence
interval?
e.g. [0,.33) → .01     [.33,.66) → .1     [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.

 code    min      max      interval
 .11     .110     .111     [.75, 1.0)
 .101    .1010    .1011    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (a dyadic number).

[Figure: sequence interval [.61, .79) containing the code interval of .101 = [.625, .75)]

Can use L + s/2 truncated to 1 + ⌈log (1/s)⌉ bits

Bound on Arithmetic length

Note that −log s + 1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + ⌈log (1/s)⌉ =
= 1 + ⌈log Π_i (1/p_i)⌉
≤ 2 + Σ_{i=1..n} log (1/p_i)
= 2 + Σ_{k=1..|S|} n·p_k · log (1/p_k)
= 2 + n·H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
 Output 1 followed by m 0s
 m = 0
 Message interval is expanded by 2

If u < R/2 then (bottom half)
 Output 0 followed by m 1s
 m = 0
 Message interval is expanded by 2

If l ≥ R/4 and u < 3R/4 then (middle half)
 Increment m
 Message interval is expanded by 2

In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine
[Figure: the arithmetic coder ATB as a state machine: given the current interval (L,s), a symbol c and the distribution (p1,....,pS), it moves to the new interval (L’,s’) with L’ = L + s·f(c) and s’ = s·p(c).]

Therefore, even the distribution can change over time.

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
[Figure: PPM feeds the arithmetic coder ATB with p[ s | context ], where s is a char c or the escape symbol esc; the state (L,s) is updated to (L’,s’) as before.]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant).

PPM: Example Contexts
String = ACCBACCACBA B,  k = 2

 Context: Empty      A = 4   B = 2   C = 5   $ = 3
 Context: A          C = 3   $ = 1
 Context: B          A = 2   $ = 1
 Context: C          A = 1   B = 2   C = 2   $ = 3
 Context: AC         B = 1   C = 2   $ = 2
 Context: BA         C = 1   $ = 1
 Context: CA         C = 1   $ = 1
 Context: CB         A = 2   $ = 1
 Context: CC         A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
T = a a c a a c a b c a b a b a c
[Figure: the dictionary is the already-scanned prefix; the cursor marks the current position (all substrings starting here are candidates). Output example: <2,3,c>]

Algorithm’s step:
 Output <d, len, c> where
  d = distance of the copied string wrt the current position
  len = length of the longest match
  c = next char in the text beyond the longest match
 Advance by len + 1

A buffer “window” has fixed length and moves with the cursor.

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if len > d? (the copy overlaps the text still to be produced)

E.g. seen = abcd, next codeword is (2,9,e)

Simply copy starting at the cursor:
for (i = 0; i < len; i++)
  out[cursor+i] = out[cursor-d+i]

Output is correct: abcdcdcdcdcdce
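
A minimal Python sketch of the decoder on ⟨d, len, c⟩ triples; the character-by-character
copy handles the overlapping case (len > d) exactly as in the loop above:

def lz77_decode(triples):
    out = []
    for d, length, c in triples:
        start = len(out) - d
        for i in range(length):           # copy, possibly overlapping the part just written
            out.append(out[start + i])
        out.append(c)                     # the explicit next character
    return "".join(out)

# lz77_decode([(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')]) = "aacaacabcabaaac",
# i.e. the text of the window example above.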

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids
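
A minimal Python sketch of the LZ78 coding loop (the dictionary is kept as a plain map
from strings to ids instead of a trie; if the input ends inside a dictionary phrase,
the last pair carries an empty next character):

def lz78_encode(s):
    D, out, i, next_id = {"": 0}, [], 0, 1
    while i < len(s):
        j = i
        while j < len(s) and s[i:j + 1] in D:   # longest match S already in the dictionary
            j += 1
        S = s[i:j]
        c = s[j] if j < len(s) else ""
        out.append((D[S], c))                   # output (id of S, next char c)
        D[S + c] = next_id                      # add Sc to the dictionary
        next_id += 1
        i = j + 1
    return out

# lz78_encode("aabaacabcabcb") =
#   [(0,'a'), (1,'b'), (1,'a'), (0,'c'), (2,'c'), (5,'b')]   (the coding example below)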

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
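
A minimal Python sketch of LZW decoding, including the one-step-behind issue: when the
received code is the one the encoder created at the previous step (the SSc situation),
the missing entry is prev + prev[0]. Here the dictionary is initialized with the symbols
of a small alphabet at ids 0..|alphabet|−1, while the slides initialize it with the 256
ASCII codes.

def lzw_decode(codes, alphabet):
    D = {i: ch for i, ch in enumerate(alphabet)}
    next_id = len(alphabet)
    prev = D[codes[0]]
    out = [prev]
    for code in codes[1:]:
        cur = D[code] if code in D else prev + prev[0]   # code not yet in the dictionary
        D[next_id] = prev + cur[0]                       # what the encoder added one step earlier
        next_id += 1
        out.append(cur)
        prev = cur
    return "".join(out)

# lzw_decode([0, 1], "a") = "aaa": the code 1 (= "aa") is used before the decoder has built it.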

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us be given the text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

The L → F mapping

How do we map L’s chars onto F’s chars?
... We need to distinguish equal chars in F...

Take two equal chars of L,
rotate their rows rightward by one position:
they keep the same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
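
A minimal Python sketch of both directions (naive construction by sorting the rotations,
inversion via the LF-mapping; it assumes the text ends with a unique smallest terminator
such as #):

def bwt(T):
    n = len(T)
    rows = sorted(range(n), key=lambda i: T[i:] + T[:i])   # sort the cyclic rotations
    return "".join(T[(i - 1) % n] for i in rows)           # last column L

def inverse_bwt(L):
    n = len(L)
    order = sorted(range(n), key=lambda r: (L[r], r))      # stable sort of L = column F
    LF = [0] * n
    for f_pos, r in enumerate(order):
        LF[r] = f_pos                                      # LF maps L's chars onto F's chars
    r, out = order[0], []                                  # order[0] = row ending with '#'
    for _ in range(n):
        out.append(L[r])                                   # L[r] precedes F[r] in T
        r = LF[r]
    return "".join(reversed(out))

# bwt("mississippi#") = "ipssm#pissii" and inverse_bwt("ipssm#pissii") = "mississippi#".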

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in-degree(u) = k ]  ∝  1 / k^α,   α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:
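For negative entries the library maps v ≥ 0 to 2v and v < 0 to 2|v|−1 (this is the rule
used in the interval/residual examples further below). A minimal Python sketch of the
gap encoding of one successor list (the input list is illustrative):

def encode_gaps(x, successors):
    s = sorted(successors)
    gaps = [s[0] - x] + [s[i] - s[i - 1] - 1 for i in range(1, len(s))]
    # only the first gap can be negative: map it to a non-negative integer
    gaps[0] = 2 * gaps[0] if gaps[0] >= 0 else 2 * (-gaps[0]) - 1
    return gaps

# e.g. encode_gaps(15, [13, 15, 16, 17, 18, 19, 23, 24, 203]) = [3, 1, 0, 0, 0, 0, 3, 0, 178]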

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides and efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
             Emacs size    Emacs time
 uncompr     27Mb          ---
 gzip        8Mb           35 secs
 zdelta      1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: the weighted graph GF with a dummy node 0 and zdelta/gzip edge weights (e.g. 20, 123, 220, 620, 2000); the min branching picks the cheapest reference for each file.]

             space    time
 uncompr     30Mb     ---
 tgz         20%      linear
 THIS        8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
             space    time
 uncompr     260Mb    ---
 tgz         12%      2 mins
 THIS        8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

           gcc size    emacs size
 total     27288       27326
 gzip      7563        8577
 zdelta    227         1431
 rsync     964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems, log n/k levels.

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits.

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

[Figure: P drawn as a prefix of the suffix T[i,N].]
Occurrences of P in T = all suffixes of T having P as a prefix

P = si
T = mississippi  →  occurrences at positions 4, 7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
T# = mississippi#
[Figure: the suffix tree of T#; edges are labelled with substrings (e.g. ssi, ppi#, si, i#, mississippi#), leaves with the starting positions 1..12 of the corresponding suffixes.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

Storing SUF(T) explicitly would take Θ(N²) space.
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
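
A minimal Python sketch of this indirect binary search (SA holds 0-based starting
positions; comparing P only against the first |P| chars of each probed suffix gives the
O(p log2 N) bound):

def sa_search(T, SA, P):
    def pref(i):                              # the |P|-char prefix of the i-th suffix
        return T[i:i + len(P)]
    lo, hi = 0, len(SA)
    while lo < hi:                            # leftmost suffix whose prefix is >= P
        mid = (lo + hi) // 2
        if pref(SA[mid]) < P: lo = mid + 1
        else: hi = mid
    first, hi = lo, len(SA)
    while lo < hi:                            # leftmost suffix whose prefix is > P
        mid = (lo + hi) // 2
        if pref(SA[mid]) <= P: lo = mid + 1
        else: hi = mid
    return SA[first:lo]                       # starting positions of the occ occurrences

# SA = sorted(range(len(T)), key=lambda i: T[i:])    (naive construction)
# With T = "mississippi#", sa_search(T, SA, "si") returns the two positions of "si" in T.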

Locating the occurrences
T = mississippi#, P = si  →  occ = 2, at positions 4 and 7.
[Figure: the SA range of P delimited by binary-searching where si# and si$ would fall (with # smaller and $ larger than every character of S).]

Suffix Array search
• O(p + log2 N + occ) time

Suffix Trays: O(p + log2 |S| + occ)   [Cole et al., ’06]
String B-tree   [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays   [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
• It is the min of the subarray Lcp[h,k-1], where SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
• Search for an entry Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
• Search for a window Lcp[i,i+C-2] whose entries are all ≥ L


Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
(1·2 + 0) mod 7 = 2
(2·2 + 1) mod 7 = 5
(5·2 + 1) mod 7 = 4
(4·2 + 1) mod 7 = 2

We can still compute Hq(Tr) from Hq(Tr-1):
2^m (mod q) = 2·( 2^(m-1) (mod q) ) (mod q)

(2·2 + 1) mod 7 = 5
5 mod 7 = 5 = Hq(P)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
The randomized algorithm is correct w.h.p.
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
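
As a side illustration (not the course's code), here is a minimal Python sketch of the Karp-Rabin scan: it uses base-256 fingerprints with a fixed prime q, whereas the algorithm above picks q at random, and it verifies equal fingerprints, which makes it the deterministic (never-erring) variant.

    # Karp-Rabin sketch: rolling fingerprint modulo a prime q (illustrative only).
    def karp_rabin(T, P, q=2_147_483_647):
        n, m = len(T), len(P)
        if m > n:
            return []
        B = 256                      # base of H(): treat characters as digits
        Bm = pow(B, m - 1, q)        # B^(m-1) mod q, used to drop the leading char
        hP = hT = 0
        for i in range(m):           # fingerprints of P and of T[0, m-1]
            hP = (hP * B + ord(P[i])) % q
            hT = (hT * B + ord(T[i])) % q
        occ = []
        for r in range(n - m + 1):
            if hP == hT and T[r:r + m] == P:   # verify, so no false matches survive
                occ.append(r)
            if r + m < n:            # roll the window: drop T[r], add T[r+m]
                hT = ((hT - ord(T[r]) * Bm) * B + ord(T[r + m])) % q
        return occ

    # karp_rabin("10110101", "0101") -> [4]   (0-based starting position)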

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

[Figure: the solution to Problem 1 — P = bzip is encoded with the dictionary’s code (1a 0b) and its byte-aligned, tagged codeword is scanned for in C(S), where S = “bzip or not bzip”; yes/no marks show the candidate alignments.]
Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

[Figure: the m×n matrix M for T = california and P = for. Row 1 has a 1 only at column 5 (“f”), row 2 only at column 6 (“fo”), row 3 only at column 7: the unique occurrence of P ends at position 7 of T.]

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one.
We define the m-length binary vector U(x) for each character x in the alphabet: U(x) is set to 1 in the positions of P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M(j) = BitShift( M(j-1) ) & U( T[j] )


For i > 1, entry M(i,j) = 1 iff
(1) the first i-1 characters of P match the i-1 characters of T ending at character j-1   ⇔ M(i-1,j-1) = 1
(2) P[i] = T[j]   ⇔ the i-th bit of U(T[j]) is 1

BitShift moves bit M(i-1,j-1) into the i-th position;
AND-ing it with the i-th bit of U(T[j]) establishes whether both conditions hold.

Examples (j = 1, 2, 3 and 9), with T = xabxabaaca and P = abaac:

[Figures: the columns M(1), M(2), M(3) and M(9), each obtained as BitShift(M(j-1)) & U(T[j]). At j = 9 the 5th bit of the column is 1, i.e. P occurs ending at position 9 of T.]

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.
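
As an illustration (not from the slides), a compact Python sketch of the Shift-And scan: the column M(j) is kept in an integer bitmask, BitShift becomes a left shift with the lowest bit set to 1, and an occurrence is reported when the bit of the last row turns to 1.

    # Shift-And sketch, assuming m <= word size (Python ints simply grow beyond that).
    def shift_and(T, P):
        m = len(P)
        U = {}                                  # U[x] has bit i set iff P[i] == x
        for i, x in enumerate(P):
            U[x] = U.get(x, 0) | (1 << i)
        M, occ = 0, []
        for j, c in enumerate(T):
            M = ((M << 1) | 1) & U.get(c, 0)    # M(j) = BitShift(M(j-1)) & U(T[j])
            if M & (1 << (m - 1)):              # last row set => occurrence ends at j
                occ.append(j - m + 1)
        return occ

    # shift_and("xabxabaaca", "abaac") -> [4]   (0-based starting position)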

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: Another solution
Dictionary = {bzip, not, or, space},  S = “bzip or not bzip”,  P = bzip = 1a 0b

[Figure: the encoded pattern P is again matched directly against the byte-aligned codewords of C(S); yes/no marks show the candidate alignments.]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

P = o
Dictionary = {bzip, not, or, space},  S = “bzip or not bzip”

[Figure: the dictionary terms containing P = o are “not” and “or”; their codewords (not = 1g 0g 0a, or = 1g 0a 0b) are searched in C(S), and yes marks show their occurrences.]

Speed ≈ Compression ratio? No! Why?
One scan of C(S) is needed for each term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: a text T with occurrences of the two patterns P1 and P2 highlighted.]

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of length m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) AND R
 U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
For any step j:
 compute M(j)
 then set M(j) = M(j) OR U’(T[j]). Why?
 This sets to 1 the first bit of each pattern that starts with T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

P = bot,  k = 2
Dictionary = {bzip, not, or, space},  S = “bzip or not bzip”

[Figure: the dictionary, C(S) and the codeword alignments examined when up to k = 2 mismatches are allowed.]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position 4; it also occurs with 4 mismatches starting at position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal: given k, find all the occurrences of P in T with up to k mismatches.
We define M^l to be an m by n binary matrix such that:

M^l(i,j) = 1 iff there are at most l mismatches between the first i characters of P and the i characters of T ending at character j.

What is M^0?
How does M^k solve the k-mismatch problem?

Computing Mk






We compute M^l for all l = 0, …, k.
For each j we compute M^0(j), M^1(j), …, M^k(j).
For all l, initialize M^l(0) to the zero vector.
In order to compute M^l(j), we observe that M^l(i,j) = 1 iff one of the two cases below holds.

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

[Figure: the first i-1 characters of P aligned against T ending at position j-1.]

Case 1 term:  BitShift( M^l(j-1) ) & U( T[j] )

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

[Figure: the first i-1 characters of P aligned against T ending at position j-1.]

Case 2 term:  BitShift( M^(l-1)(j-1) )

Computing M^l

We compute M^l for all l = 0, …, k; for each j we compute M^0(j), M^1(j), …, M^k(j), after initializing every M^l(0) to the zero vector. Combining the two cases:

M^l(j) = [ BitShift( M^l(j-1) ) & U( T[j] ) ]  OR  BitShift( M^(l-1)(j-1) )

Example (k = 1):  T = xabxabaaca,  P = abaad

[Figures: the matrices M^0 and M^1 for this text and pattern. In M^1, row 5 has a 1 in column 9: P occurs ending at position 9 of T with one mismatch.]

How much do we pay?





The running time is O( k·n·(1 + m/w) ).
Again, the method is practically efficient for small m.
Only O(k) columns of M are needed at any given time; hence the space used by the algorithm is O(k) memory words.
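
A sketch (mine) of the k-mismatch recurrence in Python: one bitmask per error level l, updated exactly as M^l(j) = [BitShift(M^l(j-1)) & U(T[j])] OR BitShift(M^(l-1)(j-1)); only the k+1 current columns are kept, matching the O(k)-word space bound.

    # Agrep-style Shift-And with up to k mismatches (assuming m <= word size).
    def shift_and_k_mismatch(T, P, k):
        m = len(P)
        U = {}
        for i, x in enumerate(P):
            U[x] = U.get(x, 0) | (1 << i)
        M = [0] * (k + 1)                       # M[l] = current column of M^l
        occ = []
        for j, c in enumerate(T):
            prev = M[:]                         # the columns M^l(j-1)
            M[0] = ((prev[0] << 1) | 1) & U.get(c, 0)
            for l in range(1, k + 1):
                exact = ((prev[l] << 1) | 1) & U.get(c, 0)   # case 1: characters match
                mism = (prev[l - 1] << 1) | 1                # case 2: spend one mismatch
                M[l] = exact | mism
            if M[k] & (1 << (m - 1)):
                occ.append(j - m + 1)           # occurrence with <= k mismatches ends at j
        return occ

    # shift_and_k_mismatch("xabxabaaca", "abaad", 1) -> [4]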

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

P = bot,  k = 2
Dictionary = {bzip, not, or, space},  S = “bzip or not bzip”

[Figure: the term “not” (codeword 1g 0g 0a) is within 2 mismatches of P = bot, so its codeword is searched in C(S); yes marks show its occurrence.]

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is d(p,s) = the minimum number of operations needed to transform p into s via three ops:

 Insertion: insert a symbol in p
 Deletion: delete a symbol from p
 Substitution: change a symbol of p into a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding
 γ(x) = 0^(Length−1) followed by x in binary, where x > 0 and Length = ⌊log₂ x⌋ + 1
 e.g., 9 is represented as <000, 1001>.

 The γ-code for x takes 2·⌊log₂ x⌋ + 1 bits
 (i.e., a factor of 2 from optimal)



Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

→  8, 6, 3, 59, 7
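
A small sketch of the γ-coder (mine, handy for checking the exercise above): Length−1 zeros followed by the binary representation of x.

    # gamma-code sketch: encode/decode positive integers.
    def gamma_encode(x):
        assert x > 0
        b = bin(x)[2:]                          # x in binary (starts with 1)
        return "0" * (len(b) - 1) + b

    def gamma_decode(bits):
        out, i = [], 0
        while i < len(bits):
            l = 0
            while bits[i] == "0":               # Length-1 leading zeros
                l += 1
                i += 1
            out.append(int(bits[i:i + l + 1], 2))
            i += l + 1
        return out

    # gamma_encode(9) -> '0001001'
    # gamma_decode("0001000001100110000011101100111") -> [8, 6, 3, 59, 7]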

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |γ(i)| ≤ 2·log i + 1
How good is this approach w.r.t. Huffman?
Compression ratio ≤ 2·H0(S) + 1
Key fact:
1 ≥ Σ_{i=1,…,x} p_i ≥ x·p_x   ⟹   x ≤ 1/p_x

How good is it ?
Encode the integers via γ-coding:  |γ(i)| ≤ 2·log i + 1
The cost of the encoding is (recall i ≤ 1/p_i):

Σ_{i=1,…,|S|} p_i · |γ(i)|   ≤   Σ_{i=1,…,|S|} p_i · [ 2·log(1/p_i) + 1 ]   =   2·H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + …

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s·c items with 2 bytes, s·c² with 3 bytes, …

An example




5000 distinct words
ETDC encodes 128 + 128² = 16512 words within 2 bytes
A (230,26)-dense code encodes 230 + 230·26 = 6210 words within 2 bytes, hence more words on a single byte — a win if the distribution is skewed…

Optimal (s,c)-dense codes
Find the optimal s, assuming c = 256 − s.


Brute-force approach



Binary search:


On real distributions, there seems to be a unique minimum.
 K_s = max codeword length
 F_{s,k} = cumulative probability of the symbols whose codeword length is ≤ k

Experiments: (s,c)-DC is quite interesting…

Search is 6% faster than byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1^n 2^n 3^n … n^n   ⟹   Huff = O(n² log n), MTF = O(n log n) + n²

Not much worse than Huffman…
…but it may be far better
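
A tiny sketch of Move-to-Front coding (illustrative only; positions are 0-based here, and the symbol list is a plain Python list, whereas the next slides discuss faster data structures):

    # Move-to-Front: turn a symbol sequence into an integer sequence.
    def mtf_encode(text, alphabet):
        L = list(alphabet)
        out = []
        for s in text:
            i = L.index(s)                 # 1) output the position of s in L
            out.append(i)
            L.pop(i); L.insert(0, s)       # 2) move s to the front of L
        return out

    def mtf_decode(codes, alphabet):
        L = list(alphabet)
        out = []
        for i in codes:
            s = L[i]
            out.append(s)
            L.pop(i); L.insert(0, s)
        return "".join(out)

    # mtf_encode("abbbaa", "abcd") -> [0, 1, 0, 0, 1, 0]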

MTF: how good is it ?
Encode the integers via γ-coding:  |γ(i)| ≤ 2·log i + 1
Put the alphabet S at the front and consider the cost of encoding (p_x^i = position of the i-th occurrence of symbol x):

O(|S| log |S|)  +  Σ_{x=1,…,|S|} Σ_{i=2,…,n_x} | γ( p_x^i − p_x^{i-1} ) |

By Jensen’s inequality this is

≤ O(|S| log |S|) + Σ_{x=1,…,|S|} n_x · [ 2·log(N/n_x) + 1 ]
= O(|S| log |S|) + N·[ 2·H0(X) + 1 ]

⟹  La[mtf] ≤ 2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to maintain the MTF-list efficiently:

 Search tree
  Leaves contain the symbols, ordered as in the MTF-list
  Nodes contain the size of their descending subtree

 Hash table
  key = a symbol
  data = a pointer to the corresponding tree leaf

Each tree operation takes O(log |S|) time
Total is O(n log |S|), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings ⟹ just the run lengths and one starting bit
Properties:
 Exploits spatial locality, and it is a dynamic code
 There is a memory
 X = 1^n 2^n 3^n … n^n   ⟹   Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)
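
And a matching run-length encoder sketch for the example above:

    # Run-Length Encoding: (symbol, run length) pairs.
    def rle(s):
        out, i = [], 0
        while i < len(s):
            j = i
            while j < len(s) and s[j] == s[i]:
                j += 1
            out.append((s[i], j - i))
            i = j
        return out

    # rle("abbbaacccca") -> [('a',1), ('b',3), ('a',2), ('c',4), ('a',1)]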

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.   p(a) = .2,  p(b) = .5,  p(c) = .3
       f(a) = .0,  f(b) = .2,  f(c) = .7

f(i) = Σ_{j=1,…,i-1} p(j)

[Figure: the unit interval [0,1) partitioned as a = [0,.2), b = [.2,.7), c = [.7,1).]

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
[Figure: coding bac — start with [0,1); after b the interval is [.2,.7); after a it is [.2,.3); after c it is [.27,.3).]

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l 0  0 l i  l i -1  s i -1 * f c i 

s0  1

s i  s i -1 * p c i 

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

n

sn 

 p c 
i

i 1

The interval for a message sequence will be called the
sequence interval
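
A sketch of the (l_i, s_i) computation above, in floating point purely for illustration (a real coder uses the integer version discussed later):

    # Sequence-interval computation for arithmetic coding.
    def sequence_interval(msg, p, f):
        l, s = 0.0, 1.0
        for c in msg:
            l = l + s * f[c]      # l_i = l_{i-1} + s_{i-1} * f[c_i]
            s = s * p[c]          # s_i = s_{i-1} * p[c_i]
        return l, s               # the sequence interval is [l, l+s)

    p = {"a": .2, "b": .5, "c": .3}
    f = {"a": .0, "b": .2, "c": .7}
    # sequence_interval("bac", p, f) -> (0.27, 0.03), i.e. the interval [.27, .30)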

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
[Figure: decoding .49 — .49 ∈ [.2,.7) ⟹ b; within [.2,.7), .49 ∈ [.3,.55) ⟹ b; within [.3,.55), .49 ∈ [.475,.55) ⟹ c.]

The message is bbc.

Representing a real number
Binary fractional representation:
 .75 = .11      1/3 = .010101…      11/16 = .1011

Algorithm (emitting the bits of x ∈ [0,1)):
 1.  x = 2·x
 2.  if x < 1, output 0
 3.  else x = x − 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
        min      max      interval
 .11    .110     .111     [.75, 1.0)
 .101   .1010    .1011    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

[Figure: sequence interval [.61, .79); the code interval of .101, i.e. [.625, .75), is contained in it.]

Can use L + s/2 truncated to 1 + ⌈log(1/s)⌉ bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
The problem is that operations on arbitrary-precision real numbers are expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever the sequence interval falls into the top, bottom or middle half, expand the interval by a factor of 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine: given the current interval (L,s) and a distribution (p_1,…,p_|S|), encoding the next symbol c maps (L,s) to the new interval (L’,s’).

[Figure: the ATB as a state machine, (L,s) → (L’,s’) on input c.]

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
[Figure: PPM feeds the Arithmetic ToolBox with p[ s | context ], where s is either a plain character c or the escape symbol; (L,s) → (L’,s’).]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
String = ACCBACCACBA B,   k = 2

Context: (empty)     A = 4,  B = 2,  C = 5,  $ = 3

Contexts of length 1:
 A :  C = 3,  $ = 1
 B :  A = 2,  $ = 1
 C :  A = 1,  B = 2,  C = 2,  $ = 3

Contexts of length 2:
 AC:  B = 1,  C = 2,  $ = 2
 BA:  C = 1,  $ = 1
 CA:  C = 1,  $ = 1
 CB:  A = 2,  $ = 1
 CC:  A = 1,  B = 1,  $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
[Figure: the dictionary consists of all substrings of the text before the cursor; at the cursor shown, the step outputs <2,3,c>.]

Algorithm’s step — output ⟨d, len, c⟩ where
 d   = distance of the copied string w.r.t. the current position
 len = length of the longest match
 c   = next char in the text beyond the longest match
then advance by len + 1.

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
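
For concreteness, a toy greedy LZ77 parser (mine, with an unbounded dictionary and no window, unlike gzip):

    # Greedy LZ77 parse, emitting (d, len, c) triples; overlaps (len > d) are allowed.
    def lz77_parse(T):
        out, i, n = [], 0, len(T)
        while i < n:
            best_d, best_len = 0, 0
            for d in range(1, i + 1):                 # candidate copy distances
                l = 0
                while i + l < n - 1 and T[i + l - d] == T[i + l]:
                    l += 1
                if l > best_len:
                    best_d, best_len = d, l
            out.append((best_d, best_len, T[i + best_len]))
            i += best_len + 1                         # advance by len + 1
        return out

    # lz77_parse("aabaac") -> [(0, 0, 'a'), (1, 1, 'b'), (3, 2, 'c')]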

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash table to speed up the search for matching triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
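
A compact LZW encoder sketch (mine; the dictionary is seeded with the distinct characters of the input instead of the full 256 ASCII entries):

    # LZW: emit the code of the longest match S, then add Sc to the dictionary.
    def lzw_encode(T):
        D = {c: i for i, c in enumerate(sorted(set(T)))}
        next_code = len(D)
        out, S = [], ""
        for c in T:
            if S + c in D:
                S = S + c                   # extend the current match
            else:
                out.append(D[S])            # the extra char c is NOT sent...
                D[S + c] = next_code        # ...but Sc is still added to the dictionary
                next_code += 1
                S = c
        if S:
            out.append(D[S])
        return out

    # lzw_encode("aabaacababacb") emits one code per longest match, as in the example below
    # (the numbering differs because the dictionary here is not seeded with all of ASCII).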

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
We are given the text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

(1994)

F                L
#  mississipp   i
i  #mississip   p
i  ppi#missis   s
i  ssippi#mis   s
i  ssissippi#   m
m  ississippi   #
p  i#mississi   p
p  pi#mississ   i
s  ippi#missi   s
s  issippi#mi   s
s  sippi#miss   i
s  sissippi#m   i

A famous example

Much
longer...

A useful tool: the L → F mapping

[Figure: the same sorted-rotation matrix, with the columns F and L highlighted.]

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
[Figure: the sorted-rotation matrix with columns F and L, used to illustrate the two key properties below.]

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
  Compute LF[0, n-1];
  r = 0;  i = n;
  while (i > 0) {
    T[i] = L[r];
    r = LF[r];  i--;
  }

How to compute the BWT ?
SA     BWT matrix       L
12     #mississipp      i
11     i#mississip      p
 8     ippi#missis      s
 5     issippi#mis      s
 2     ississippi#      m
 1     mississippi      #
10     pi#mississi      p
 9     ppi#mississ      i
 7     sippi#missi      s
 4     sissippi#mi      s
 6     ssippi#miss      i
 3     ssissippi#m      i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]
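
A direct transcription of that identity into Python (the suffix array is built naively here, exactly the “elegant but inefficient” approach of the next slide):

    # BWT via the suffix array: L[i] = T[SA[i]-1], wrapping around for SA[i] = 1.
    def bwt(T):
        assert T.endswith("#")              # unique, lexicographically smallest terminator
        n = len(T)
        SA = sorted(range(1, n + 1), key=lambda i: T[i - 1:])   # 1-based suffix positions
        return "".join(T[i - 2] if i > 1 else T[n - 1] for i in SA)

    # bwt("mississippi#") -> 'ipssm#pissii'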

How to construct SA from T ?
SA     suffix
12     #
11     i#
 8     ippi#
 5     issippi#
 2     ississippi#
 1     mississippi#
10     pi#
 9     ppi#
 7     sippi#
 4     sissippi#
 6     ssippi#
 3     ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet size |S|+1

Bzip2 output = Arithmetic/Huffman on |S|+1 symbols…
… plus γ(16) (the position of #), plus the original MTF-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion pages available (Google, 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Lifetime of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by humans



Exploit the structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph
 V = routers
 E = communication links

The “cosine” graph (undirected, weighted)
 V = static web pages
 E = semantic distance between pages

Query-Log graph (bipartite, weighted)
 V = queries and URLs
 E = (q,u) if u is a result for q, and has been clicked by some user who issued q

Social graph (undirected, unweighted)
 V = users
 E = (x,y) if x knows y (facebook, address book, email, …)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has a hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pr that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in-degree(u) = k ]  ∝  1 / k^α,    α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has a hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pr that a node has x links is 1/x^α, α ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
[Figure: adjacency matrix of a web crawl with 21 million pages and 150 million links.]

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = { s₁ − x,  s₂ − s₁ − 1,  …,  s_k − s_{k-1} − 1 }

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
[Figure: a sender and a receiver connected by a network link; the receiver already holds some knowledge about the data.]

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77 scheme provides an efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
           Emacs size   Emacs time
uncompr    27Mb         ---
gzip       8Mb          35 secs
zdelta     1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

[Figure: dual-proxy architecture — the two proxies sit on either side of the slow link and exchange delta-encoded pages relative to a cached reference version, while the web-side proxy fetches the requested page over the fast link.]

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: a small weighted graph over the files plus a dummy node; edge weights are zdelta sizes and dummy-node edges carry gzip sizes; the minimum branching selects the cheapest reference for each file.]
           space   time
uncompr    30Mb    ---
tgz        20%     linear
THIS       8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly: n² edge calculations (zdelta executions)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: estimate appropriate edge weights for G’_F, thus saving zdelta executions. Nonetheless, strictly n² time.
           space    time
uncompr    260Mb    ---
tgz        12%      2 mins
THIS       8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
[Figure: the client holds f_old and sends a request; the server holds f_new and must send back enough information for the client to update f_old into f_new.]
client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
[Figure: the client splits f_old into blocks and sends their hashes; the server matches them against f_new and returns an encoded file of copy instructions and literal data.]

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
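
A toy sketch of the block-matching idea (mine, not rsync's code; the helper names are made up). The real tool slides the window one byte at a time with the cheap 4-byte rolling hash and only then checks the stronger checksum, as noted above; here MD5 is simply recomputed at every position.

    import hashlib

    def block_hashes(f_old, B):                 # client side: one hash per block of f_old (bytes)
        return {hashlib.md5(f_old[i:i + B]).hexdigest(): i // B
                for i in range(0, len(f_old), B)}

    def rsync_encode(f_new, hashes, B):         # server side: emit block ids (copies) or literals
        out, i = [], 0
        while i < len(f_new):
            h = hashlib.md5(f_new[i:i + B]).hexdigest()
            if len(f_new) - i >= B and h in hashes:
                out.append(("copy", hashes[h]))
                i += B
            else:
                out.append(("lit", f_new[i:i + 1]))
                i += 1
        return out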

Rsync: some experiments

          gcc size   emacs size
total     27288      27326
gzip      7563       8577
zdelta    227        1431
rsync     964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), and the client checks them.
Server deploys the common f_ref to compress the new f_tar (rsync just compresses it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes fail to find a match in the other file.
The communication complexity is O(k · lg n · lg(n/k)) bits.

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes fail to find a match in the other file.
The communication complexity is O(k · lg n · lg(n/k)) bits.

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

[Figure: P compared against the suffix T[i,N].]

Occurrences of P in T = all suffixes of T having P as a prefix

Example: P = si, T = mississippi  ⟹  occurrences at positions 4 and 7

SUF(T) = sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Figure: the suffix tree of T# = mississippi#; internal edges are labelled with substrings (e.g. ssi, ppi#, si, …) and leaves carry the starting positions of the corresponding suffixes.]
The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Storing SUF(T) explicitly takes Θ(N²) space, so we store only the suffix pointers:

SA     SUF(T)
12     #
11     i#
 8     ippi#
 5     issippi#
 2     ississippi#
 1     mississippi#
10     pi#
 9     ppi#
 7     sippi#
 4     sissippi#
 6     ssippi#
 3     ssissippi#

T = mississippi#      (query: P = si)

Suffix Array:
• SA: Θ(N log₂ N) bits
• Text T: N chars
⟹ In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison

SA = 12 11 8 5 2 1 10 9 7 4 6 3,   T = mississippi#,   P = si

[Figure: a binary-search step in which P is larger than the probed suffix; 2 accesses per step.]

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison

[Figure: a later binary-search step in which P is smaller than the probed suffix.]

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
⟹ overall, O(p log₂ N) time,
improvable to O(p + log₂ N) [Manber-Myers, ’90] and to bounds in terms of |S| [Cole et al., ’06]
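
A short sketch of the indirect binary search (mine): each comparison looks at the P-length prefix of the probed suffix, hence O(p) time per step.

    # Suffix-array construction (naive) and pattern search returning all occurrences.
    def sa_build(T):
        return sorted(range(len(T)), key=lambda i: T[i:])    # 0-based, O(N^2 log N) toy build

    def sa_search(T, SA, P):
        lo, hi = 0, len(SA)
        while lo < hi:                                        # leftmost suffix >= P
            mid = (lo + hi) // 2
            if T[SA[mid]:SA[mid] + len(P)] < P:
                lo = mid + 1
            else:
                hi = mid
        first = lo
        lo, hi = first, len(SA)
        while lo < hi:                                        # first suffix whose P-prefix is > P
            mid = (lo + hi) // 2
            if T[SA[mid]:SA[mid] + len(P)] <= P:
                lo = mid + 1
            else:
                hi = mid
        return SA[first:lo]                                   # starting positions of P in T

    T = "mississippi#"
    # sa_search(T, sa_build(T), "si") -> [6, 3]   (0-based; positions 7 and 4 in the slides)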

Locating the occurrences
[Figure: the occ = 2 occurrences of P = si are the contiguous SA entries pointing to sippi# and sissippi#, i.e. positions 7 and 4 of T = mississippi#. The boundaries of this range can be found by searching for si# and si$, exploiting # < Σ < $.]

Suffix Array search: O(p + log₂ N + occ) time
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp = 0 1 1 4 0 0 1 0 2 1 3
SA  = 12 11 8 5 2 1 10 9 7 4 6 3
T = mississippi#

[Figure: e.g. the entry Lcp = 4 is the length of the common prefix issi of the adjacent suffixes issippi# (SA entry 5) and ississippi# (SA entry 2).]
• How long is the common prefix between T[i,…] and T[j,…] ?
  • It is the min of the subarray Lcp[h, k-1], where SA[h] = i and SA[k] = j.
• Does there exist a repeated substring of length ≥ L ?
  • Search for an entry Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  • Search for a window Lcp[i, i+C-2] whose entries are all ≥ L.


Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P = 101111, q = 7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1·2 + 0 (mod 7) = 2
2·2 + 1 (mod 7) = 5
5·2 + 1 (mod 7) = 4
4·2 + 1 (mod 7) = 2
2·2 + 1 (mod 7) = 5
5 (mod 7) = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1), since
2^m (mod q) = 2·(2^(m-1) (mod q)) (mod q).

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm

 Choose a positive integer I.
 Pick a random prime q less than or equal to I, and compute P’s fingerprint Hq(P).
 For each position r in T, compute Hq(Tr) and test whether it equals Hq(P). If the numbers are equal, either
   declare a probable match (randomized algorithm),
   or check and declare a definite match (deterministic algorithm).

 Running time: excluding verification, O(n+m).
 The randomized algorithm is correct w.h.p.
 The deterministic algorithm has expected running time O(n+m).

Proof on the board
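A minimal sketch of the algorithm just described, in Python (the function name and the fixed prime q are mine, chosen only for illustration; in the real algorithm q is a random prime):

def karp_rabin(T, P, q=2**31 - 1):
    n, m = len(T), len(P)
    if m > n:
        return []
    pow_m = pow(2, m - 1, q)              # 2^(m-1) mod q, used to drop the leading bit
    hp = ht = 0
    for i in range(m):                    # fingerprints of P and of T[0, m-1]
        hp = (2 * hp + int(P[i])) % q
        ht = (2 * ht + int(T[i])) % q
    occ = []
    for r in range(n - m + 1):
        if hp == ht and T[r:r + m] == P:  # verify, to rule out false matches
            occ.append(r)
        if r + m < n:                     # slide the window: Hq(T_{r+1}) from Hq(T_r)
            ht = (2 * (ht - int(T[r]) * pow_m) + int(T[r + m])) % q
    return occ

print(karp_rabin("10110101", "0101"))     # -> [4]  (0-based starting position)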

Problem 1: Solution
Dictionary = {bzip, not, or, space}
P = bzip = 1a 0b
S = "bzip or not bzip"

[Figure: the compressed text C(S) is scanned comparing the codeword of P against each byte-aligned codeword, reporting yes/no per position.]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M is an m x n binary matrix (one row per prefix of P, one column per position of T):

        c  a  l  i  f  o  r  n  i  a
        1  2  3  4  5  6  7  8  9  10
  f  1  0  0  0  0  1  0  0  0  0  0
  o  2  0  0  0  0  0  1  0  0  0  0
  r  3  0  0  0  0  0  0  1  0  0  0

How does M solve the exact match problem?

How to construct M

We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one.
Machines can perform bit and arithmetic operations between two words in constant time.
Examples:
 And(A,B) is the bit-wise AND between A and B.
 BitShift(A) is the value derived by shifting A’s bits down by one and setting the first bit to 1.

BitShift( (0,1,1,0,1) ) = (1,0,1,1,0)

Let w be the word size (e.g., 32 or 64 bits). We’ll assume m = w. NOTICE: any column of M fits in a memory word.

How to construct M

We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one.
We define the m-length binary vector U(x) for each character x in the alphabet: U(x) is set to 1 at the positions in P where character x appears.

Example: P = abaac
U(a) = (1,0,1,1,0)   U(b) = (0,1,0,0,0)   U(c) = (0,0,0,0,1)

How to construct M

 Initialize column 0 of M to all zeros.
 For j > 0, the j-th column is obtained by

M(j) = BitShift( M(j-1) ) & U( T[j] )

For i > 1, entry M(i,j) = 1 iff
 (1) the first i-1 characters of P match the i-1 characters of T ending at character j-1, i.e. M(i-1,j-1) = 1;
 (2) P[i] = T[j], i.e. the i-th bit of U(T[j]) = 1.

BitShift moves bit M(i-1,j-1) into the i-th position; ANDing it with the i-th bit of U(T[j]) establishes whether both conditions hold.
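A minimal Python sketch of this recurrence (the columns of M are kept as machine integers; the function name is mine):

def shift_and(T, P):
    m = len(P)
    U = {}                                   # U[x] has bit i set iff P[i] == x
    for i, x in enumerate(P):
        U[x] = U.get(x, 0) | (1 << i)
    accept = 1 << (m - 1)                    # bit of the last row of M
    col, occ = 0, []
    for j, x in enumerate(T):
        # M(j) = BitShift(M(j-1)) & U(T[j]); the shift also sets the first bit to 1
        col = ((col << 1) | 1) & U.get(x, 0)
        if col & accept:
            occ.append(j - m + 1)            # occurrence ending at position j (0-based)
    return occ

print(shift_and("xabxabaaca", "abaac"))      # -> [4]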

An example, j=1
P = abaac, T = xabxabaaca

U(x) = (0,0,0,0,0)

M(1) = BitShift(M(0)) & U(T[1]) = (1,0,0,0,0) & (0,0,0,0,0) = (0,0,0,0,0)

An example, j=2
P = abaac, T = xabxabaaca

U(a) = (1,0,1,1,0)

M(2) = BitShift(M(1)) & U(T[2]) = (1,0,0,0,0) & (1,0,1,1,0) = (1,0,0,0,0)

An example, j=3
P = abaac, T = xabxabaaca

U(b) = (0,1,0,0,0)

M(3) = BitShift(M(2)) & U(T[3]) = (1,1,0,0,0) & (0,1,0,0,0) = (0,1,0,0,0)

An example, j=9
P = abaac, T = xabxabaaca

M after column 9 (rows = prefixes of P, columns = positions 1..9):

        x  a  b  x  a  b  a  a  c
  a  1  0  1  0  0  1  0  1  1  0
  b  2  0  0  1  0  0  1  0  0  0
  a  3  0  0  0  0  0  0  1  0  0
  a  4  0  0  0  0  0  0  0  1  0
  c  5  0  0  0  0  0  0  0  0  1

U(c) = (0,0,0,0,1)

M(9) = BitShift(M(8)) & U(T[9]) = (1,1,0,0,1) & (0,0,0,0,1) = (0,0,0,0,1)

M(5,9) = 1: an occurrence of P ends at position 9.

Shift-And method: Complexity

 If m ≤ w, any column and any vector U() fit in a memory word: each step requires O(1) time.
 If m > w, any column and vector U() can be split into m/w memory words: each step requires O(m/w) time.
 Overall O(n(1 + m/w) + m) time.
 Thus, it is very fast when the pattern length is close to the word size, which is very often the case in practice. Recall that w = 64 bits in modern architectures.

Some simple extensions

We want to allow the pattern to contain special symbols, like the character class [a-f].

P = [a-b]baac
U(a) = (1,0,1,1,0)   U(b) = (1,1,0,0,0)   U(c) = (0,0,0,0,1)

What about ‘?’ and ‘[^…]’ (negated classes)?

Problem 1: Another solution
Dictionary = {bzip, not, or, space}
P = bzip = 1a 0b
S = "bzip or not bzip"

[Figure: the Shift-And method is run directly over the compressed text C(S), checking byte-aligned matches of P’s codeword (yes/no).]

Speed ≈ Compression ratio

Problem 2
Dictionary = {bzip, not, or, space}

Given a pattern P, find all the occurrences in S of all terms containing P as a substring.

P = o
S = "bzip or not bzip"

[Figure: the dictionary terms containing P are identified (not = 1g 0g 0a, or = 1g 0a 0b) and their codewords are searched in the compressed text C(S), marking matches yes/no.]

Speed ≈ Compression ratio? No! Why?
A scan of C(S) is needed for each term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: a text T with the occurrences of two patterns P1 and P2 highlighted.]

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

 S is the concatenation of the patterns in P.
 R is a bitmap of length m: R[i] = 1 iff S[i] is the first symbol of a pattern.

Use a variant of the Shift-And method searching for S:
 For any symbol c, U’(c) = U(c) AND R, so U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern.
 For any step j:
   compute M(j);
   then OR it with U’(T[j]). Why? This sets to 1 the first bit of each pattern that starts with T[j].
   Check if there are occurrences ending in j. How?
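A minimal Python sketch of this multi-pattern variant (names are mine; F is an extra bitmap marking the last position of each pattern, used to answer the "How?" above):

def multi_shift_and(T, patterns):
    S = "".join(patterns)
    U, R, F, pos, ends = {}, 0, 0, 0, {}
    for k, P in enumerate(patterns):
        R |= 1 << pos                           # first symbol of pattern k
        F |= 1 << (pos + len(P) - 1)            # last symbol of pattern k
        ends[pos + len(P) - 1] = k
        pos += len(P)
    for i, x in enumerate(S):
        U[x] = U.get(x, 0) | (1 << i)
    col, occ = 0, []
    for j, x in enumerate(T):
        u = U.get(x, 0)
        col = (((col << 1) | 1) & u) | (u & R)  # shift-and, then restart at pattern heads
        hits = col & F
        while hits:                             # report every pattern ending at j
            b = hits & -hits
            occ.append((j, ends[b.bit_length() - 1]))
            hits ^= b
    return occ

# Each reported pair is (ending position in T, pattern index)
print(multi_shift_and("abcacab", ["ca", "ab"]))   # -> [(1, 1), (3, 0), (5, 0), (6, 1)]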

Problem 3
Dictionary = {bzip, not, or, space}

Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing at most k mismatches.

P = bot, k = 2
S = "bzip or not bzip"

[Figure: the dictionary terms within k mismatches of P are identified and their codewords are searched in the compressed text C(S).]

Agrep: Shift-And method with errors

We extend the Shift-And method for finding inexact occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position 4; it also occurs with 4 mismatches starting at position 2.

aatatccacaa
   atcgaa      (2 mismatches, position 4)

aatatccacaa
 atcgaa        (4 mismatches, position 2)

Agrep

Our current goal: given k, find all the occurrences of P in T with up to k mismatches.
We define the matrix M^l to be an m by n binary matrix such that:

M^l(i,j) = 1 iff there are no more than l mismatches between the first i characters of P and the i characters of T ending at character j.

What is M^0?
How does M^k solve the k-mismatch problem?

Computing M^k

 We compute M^l for all l = 0, …, k.
 For each j compute M^0(j), M^1(j), …, M^k(j).
 For all l, initialize M^l(0) to the zero vector.
 In order to compute M^l(j), we observe that there is a match iff one of the two cases below holds.

Computing M^l: case 1

The first i-1 characters of P match a substring of T ending at j-1 with at most l mismatches, and the next pair of characters in P and T are equal:

BitShift( M^l(j-1) ) & U( T[j] )

Computing M^l: case 2

The first i-1 characters of P match a substring of T ending at j-1 with at most l-1 mismatches (the j-th pair may mismatch):

BitShift( M^(l-1)(j-1) )

Computing M^l

 We compute M^l for all l = 0, …, k: for each j compute M^0(j), M^1(j), …, M^k(j).
 For all l, initialize M^l(0) to the zero vector.
 Combining the two cases, for l ≥ 1:

M^l(j) = [ BitShift( M^l(j-1) ) & U( T[j] ) ]  OR  BitShift( M^(l-1)(j-1) )
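A minimal Python sketch of this k-mismatch recurrence (names are mine; M[l] holds column j of M^l as an integer bit-vector):

def shift_and_k_mismatches(T, P, k):
    m = len(P)
    U = {}
    for i, x in enumerate(P):
        U[x] = U.get(x, 0) | (1 << i)
    accept = 1 << (m - 1)
    M = [0] * (k + 1)                          # columns M^0(j), ..., M^k(j)
    occ = []
    for j, x in enumerate(T):
        u = U.get(x, 0)
        prev = M[:]                            # columns of step j-1
        M[0] = ((prev[0] << 1) | 1) & u        # exact-match recurrence
        for l in range(1, k + 1):
            # case 1: extend with a matching char; case 2: spend one mismatch
            M[l] = (((prev[l] << 1) | 1) & u) | ((prev[l - 1] << 1) | 1)
        if M[k] & accept:
            occ.append(j - m + 1)              # occurrence with <= k mismatches ending at j
    return occ

print(shift_and_k_mismatches("xabxabaaca", "abaad", 1))   # -> [4]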

Example: M^0 and M^1 for T = xabxabaaca, P = abaad

M^0 =            1  2  3  4  5  6  7  8  9  10
          a  1   0  1  0  0  1  0  1  1  0  1
          b  2   0  0  1  0  0  1  0  0  0  0
          a  3   0  0  0  0  0  0  1  0  0  0
          a  4   0  0  0  0  0  0  0  1  0  0
          d  5   0  0  0  0  0  0  0  0  0  0

M^1 =            1  2  3  4  5  6  7  8  9  10
          a  1   1  1  1  1  1  1  1  1  1  1
          b  2   0  0  1  0  0  1  0  1  1  0
          a  3   0  0  0  1  0  0  1  0  0  1
          a  4   0  0  0  0  1  0  0  1  0  0
          d  5   0  0  0  0  0  0  0  0  1  0

M^1(5,9) = 1: P occurs in T with at most 1 mismatch, ending at position 9.

How much do we pay?

 The running time is O(k·n·(1 + m/w)).
 Again, the method is practically efficient for small m.
 Only O(k) columns of M are needed at any given time, hence the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary = {bzip, not, or, space}

Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing k mismatches.

P = bot, k = 2
S = "bzip or not bzip"

[Figure: agrep over the dictionary finds the matching term (not = 1g 0g 0a); its codeword is then searched in the compressed text C(S).]

Agrep: more sophisticated operations

The Shift-And method can solve other operations.

The edit distance between two strings p and s is d(p,s) = the minimum number of operations needed to transform p into s via three ops:
 Insertion: insert a symbol in p
 Deletion: delete a symbol from p
 Substitution: change a symbol of p into a different one

Example: d(ananas, banane) = 3

Search by regular expressions
 Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding

g(x) = 000...0 followed by x in binary, with Length-1 leading zeros,
where x > 0 and Length = floor(log2 x) + 1.
e.g., 9 is represented as <000, 1001>.

The g-code for x takes 2·floor(log2 x) + 1 bits (i.e., a factor of 2 from optimal).
It is optimal for Pr(x) = 1/(2x^2), and i.i.d. integers.

It is a prefix-free encoding…

Given the following sequence of g-coded integers, reconstruct the original sequence:

0001000 00110 011 00000111011 00111

8   6   3   59   7
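A minimal Python sketch of g-encoding/decoding as defined above (function names are mine):

def gamma_encode(x):
    assert x > 0
    b = bin(x)[2:]                   # binary representation, Length = len(b) bits
    return "0" * (len(b) - 1) + b    # Length-1 zeros, then x in binary

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":        # count the Length-1 leading zeros
            z += 1
            i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out

code = "".join(gamma_encode(x) for x in [8, 6, 3, 59, 7])
print(code)                          # -> 0001000001100110000011101100111
print(gamma_decode(code))            # -> [8, 6, 3, 59, 7]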

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code g(i).

Recall that |g(i)| ≤ 2·log i + 1.
How good is this approach wrt Huffman? Compression ratio ≤ 2·H0(s) + 1.

Key fact:
1 ≥ Σ_{i=1..x} pi ≥ x · px   so   x ≤ 1/px

How good is it?
The cost of the encoding is (recall i ≤ 1/pi):

Σ_{i=1..|S|} pi · |g(i)|  ≤  Σ_{i=1..|S|} pi · [ 2·log(1/pi) + 1 ]  =  2·H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding

Byte-aligned and tagged Huffman:
 128-ary Huffman tree
 the first bit of the first byte is tagged
 configurations on 7 bits: just those of Huffman

End-tagged dense code:
 the rank r is mapped to the r-th binary sequence on 7·k bits
 the first bit of the last byte is tagged

Surprising changes:
 it is a prefix-code
 better compression: it uses all the 7-bit configurations

(s,c)-dense codes
The distribution of words is skewed: 1/i^q, where 1 < q < 2.

A new concept: Continuers vs Stoppers
 Previously we used s = c = 128.

The main idea is:
 s + c = 256 (we are playing with 8 bits)
 Thus s items are encoded with 1 byte,
 s·c with 2 bytes, s·c^2 with 3 bytes, ...

An example
 5000 distinct words
 ETDC encodes 128 + 128^2 = 16512 words within 2 bytes
 A (230,26)-dense code encodes 230 + 230·26 = 6210 words within 2 bytes, hence more words on 1 byte, and thus it wins if the distribution is skewed...
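A minimal Python sketch of (s,c)-dense coding of a frequency rank i (0-based); byte values 0..s-1 are stoppers (last byte), values s..s+c-1 are continuers. The function name is mine:

def sc_encode(i, s=128, c=128):
    assert s + c == 256
    out = [i % s]                 # stopper: the last byte of the codeword
    i //= s
    while i > 0:
        i -= 1
        out.append(s + i % c)     # continuer bytes
        i //= c
    return bytes(reversed(out))

# With s = c = 128 this reproduces the End-Tagged Dense Code sizes:
print(len(sc_encode(127)))        # 1 byte   (ranks 0..127)
print(len(sc_encode(16511)))      # 2 bytes  (128 + 128^2 = 16512 words within 2 bytes)
print(len(sc_encode(16512)))      # 3 bytes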

Optimal (s,c)-dense codes
Find the optimal s, assuming c = 256 − s.

 Brute-force approach
 Binary search: on real distributions, there seems to be one unique minimum
   (K_s = max codeword length; F_s,k = cumulative probability of the symbols whose codeword length is ≤ k)

Experiments: (s,c)-DC is quite interesting…
Search is 6% faster than byte-aligned Huffword.

Streaming compression
Still you need to determine and sort all terms…. Can we do everything in one pass?

 Move-to-Front (MTF):
   as a freq-sorting approximator
   as a caching strategy
   as a compressor
 Run-Length-Encoding (RLE):
   FAX compression

Move-to-Front Coding
Transforms a char sequence into an integer sequence, which can then be var-length coded.

 Start with the list of symbols L = [a,b,c,d,…]
 For each input symbol s:
   1) output the position of s in L
   2) move s to the front of L

There is a memory.
Properties: it exploits temporal locality, and it is dynamic.

X = 1^n 2^n 3^n … n^n   gives   Huff = O(n^2 log n) bits, MTF = O(n log n) + n^2 bits

Not much worse than Huffman
...but it may be far better
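A minimal Python sketch of MTF encoding/decoding (the output positions would then be g-coded; names are mine):

def mtf_encode(data, alphabet):
    L = list(alphabet)
    out = []
    for s in data:
        i = L.index(s)            # position of s in the current list
        out.append(i)
        L.pop(i)                  # move s to the front
        L.insert(0, s)
    return out

def mtf_decode(codes, alphabet):
    L, out = list(alphabet), []
    for i in codes:
        s = L.pop(i)
        out.append(s)
        L.insert(0, s)
    return out

codes = mtf_encode("aabbbaccc", "abc")
print(codes)                               # -> [0, 0, 1, 0, 0, 1, 2, 0, 0]
print("".join(mtf_decode(codes, "abc")))   # -> aabbbaccc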

MTF: how good is it?
Encode the integers via g-coding: |g(i)| ≤ 2·log i + 1.
Put the alphabet S in front of the output and consider the cost of encoding:

O(|S| log |S|) + Σ_{x in S} Σ_{i ≥ 2} |g( p_i^x − p_{i-1}^x )|

where p_1^x < p_2^x < ... are the positions of the occurrences of symbol x (the MTF position of an occurrence is at most the gap from the previous one). By Jensen’s inequality:

≤ O(|S| log |S|) + Σ_{x in S} n_x · [ 2·log(N/n_x) + 1 ]
= O(|S| log |S|) + N · [ 2·H0(X) + 1 ]

Hence La[mtf] ≤ 2·H0(X) + O(1).

MTF: higher compression
To achieve higher compression we consider words (and separators) as the symbols to be encoded.
How to keep the MTF-list efficiently:

 Search tree
   leaves contain the symbols, ordered as in the MTF-list
   nodes contain the size of their descending subtree
 Hash table
   key is a symbol
   data is a pointer to the corresponding tree leaf

Each tree operation takes O(log |S|) time.
Total is O(n log |S|), where n = #symbols to be compressed.

Run-Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings, just the run lengths and one starting bit suffice.
There is a memory.
Properties: it exploits spatial locality, and it is a dynamic code.

X = 1^n 2^n 3^n … n^n   gives   Huff(X) = Θ(n^2 log n) > Rle(X) = Θ(n log n)
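A minimal Python sketch of RLE on the example above (the run lengths could then be g-coded; names are mine):

def rle_encode(s):
    out, i = [], 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1
        out.append((s[i], j - i))        # (symbol, run length)
        i = j
    return out

def rle_decode(pairs):
    return "".join(ch * k for ch, k in pairs)

pairs = rle_encode("abbbaacccca")
print(pairs)                 # -> [('a', 1), ('b', 3), ('a', 2), ('c', 4), ('a', 1)]
print(rle_decode(pairs))     # -> abbbaacccca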

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol an interval within [0,1), based on its cumulative probability:

f(i) = Σ_{j < i} p(j)

e.g. with p(a) = .2, p(b) = .5, p(c) = .3:
f(a) = .0, f(b) = .2, f(c) = .7
a → [0.0, 0.2),  b → [0.2, 0.7),  c → [0.7, 1.0)

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7)).

Arithmetic Coding: Encoding Example
Coding the message sequence: bac  (p(a)=.2, p(b)=.5, p(c)=.3)

 b: restrict [0, 1) to its symbol interval → [0.2, 0.7)
 a: restrict [0.2, 0.7) to its a-subinterval → [0.2, 0.3)
 c: restrict [0.2, 0.3) to its c-subinterval → [0.27, 0.3)

The final sequence interval is [.27, .3).

Arithmetic Coding
To code a sequence of symbols c1 c2 ... cn with probabilities p[c], use:

l_0 = 0,   l_i = l_{i-1} + s_{i-1} · f[c_i]
s_0 = 1,   s_i = s_{i-1} · p[c_i]

f[c] is the cumulative probability up to symbol c (not included).
The final interval size is

s_n = Π_{i=1..n} p[c_i]

The interval [l_n, l_n + s_n) for a message sequence will be called the sequence interval.
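A minimal, purely didactic Python sketch of the interval computation above, using floating point (real coders use the integer/scaling version presented later; the function name is mine):

def sequence_interval(msg, p):
    f, acc = {}, 0.0
    for c in p:                              # cumulative probabilities f[c]
        f[c] = acc
        acc += p[c]
    l, s = 0.0, 1.0
    for c in msg:
        l = l + s * f[c]                     # l_i = l_{i-1} + s_{i-1} * f[c_i]
        s = s * p[c]                         # s_i = s_{i-1} * p[c_i]
    return l, l + s

print(sequence_interval("bac", {"a": 0.2, "b": 0.5, "c": 0.3}))   # ≈ (0.27, 0.3)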

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing that the message has length 3 (p(a)=.2, p(b)=.5, p(c)=.3):

 .49 falls in [0.2, 0.7) → output b, and restrict to that interval
 .49 falls in the b-subinterval [0.3, 0.55) → output b, and restrict again
 .49 falls in the c-subinterval [0.475, 0.55) → output c

The message is bbc.

Representing a real number
Binary fractional representation:

.75 = .11     1/3 = .01010101...     11/16 = .1011

Algorithm to extract the bits of x in [0,1):
 1. x = 2·x
 2. if x < 1, output 0
 3. else x = x − 1; output 1

So how about just using the shortest binary fractional representation lying in the sequence interval?
e.g. [0,.33) → .01     [.33,.66) → .1     [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions.

  code    min      max      interval
  .11     .110     .111     [.75, 1.0)
  .101    .1010    .1011    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval is contained in the sequence interval (a dyadic number).

Example: sequence interval [.61, .79); the code interval of .101 is [.625, .75), which is contained in it.

Can use L + s/2 truncated to 1 + ⌈log (1/s)⌉ bits.

Bound on Arithmetic length

Note that −log s + 1 = log (2/s).

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most
1 + ⌈log (1/s)⌉ = 1 + ⌈log Π_i (1/p_i)⌉
≤ 2 + Σ_{i=1..n} log (1/p_i)
= 2 + Σ_{k=1..|S|} n·p_k · log (1/p_k)
= 2 + n·H0 bits

In practice it takes nH0 + 0.02·n bits, because of rounding.

Integer Arithmetic Coding
The problem is that operations on arbitrary-precision real numbers are expensive.
Key ideas of the integer version:
 keep integers in the range [0..R) where R = 2^k
 use rounding to generate integer intervals
 whenever the sequence interval falls into the top, bottom or middle half, expand the interval by a factor of 2

Integer Arithmetic is an approximation.

Integer Arithmetic (scaling)
If l ≥ R/2 (top half):
 output 1 followed by m 0s; set m = 0; the message interval is expanded by 2.
If u < R/2 (bottom half):
 output 0 followed by m 1s; set m = 0; the message interval is expanded by 2.
If l ≥ R/4 and u < 3R/4 (middle half):
 increment m; the message interval is expanded by 2.
In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine: the ATB keeps the current interval (L,s); given the next symbol c and the distribution (p1,....,pS), it outputs the new interval (L’,s’) = (L + s·f[c], s·p[c]).

(L,s)  --[ c, (p1,....,pS) ]-->  (L’,s’)

Therefore, even the distribution can change over time.

K-th order models: PPM
Use the previous k characters as the context.
 Makes use of conditional probabilities: this is the changing distribution.

Base probabilities on counts:
e.g. if "th" has been seen 12 times, followed by "e" 7 times, then the conditional probability p(e|th) = 7/12.

Need to keep k small so that the dictionary does not get too large (typically less than 8).
PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
The ATB is driven by p[ s | context ], where s = c or esc:

(L,s)  --[ s, p[ s|context ] ]-->  (L’,s’)

Encoder and Decoder must know the protocol for selecting the same conditional probability distribution (PPM-variant).

PPM: Example Contexts   (String = ACCBACCACBA, next symbol B, k = 2)

Context Empty:   A = 4   B = 2   C = 5   $ = 3

Context A:   C = 3   $ = 1
Context B:   A = 2   $ = 1
Context C:   A = 1   B = 2   C = 2   $ = 3

Context AC:  B = 1   C = 2   $ = 2
Context BA:  C = 1   $ = 1
Context CA:  C = 1   $ = 1
Context CB:  A = 2   $ = 1
Context CC:  A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary = the already-scanned part of the text (all substrings starting before the cursor); output triples like <2,3,c>.

Algorithm’s step:
 Output <d, len, c> where
   d = distance of the copied string wrt the current position
   len = length of the longest match
   c = next char in the text beyond the longest match
 Advance by len + 1

A buffer “window” has fixed length and moves with the cursor.

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
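A minimal Python sketch of the LZ77 decoder, including the overlapping-copy case (len > d) handled byte-by-byte exactly as in the loop above (the function name is mine):

def lz77_decode(triples):
    out = []
    for d, length, c in triples:
        start = len(out) - d
        for i in range(length):            # copying one char at a time allows overlap
            out.append(out[start + i])
        out.append(c)
    return "".join(out)

# The triples from the windowed example above:
triples = [(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')]
print(lz77_decode(triples))                # -> aacaacabcabaaac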

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb
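A minimal Python sketch of the LZ78 coder described above (the function name is mine; the trailing-match flush at the end is one possible convention):

def lz78_encode(text):
    dictionary = {"": 0}
    out, phrase = [], ""
    for c in text:
        if phrase + c in dictionary:
            phrase += c                       # extend the longest match S
        else:
            out.append((dictionary[phrase], c))          # output (id of S, next char)
            dictionary[phrase + c] = len(dictionary)     # add Sc to the dictionary
            phrase = ""
    if phrase:                                           # flush a trailing match, if any
        out.append((dictionary[phrase[:-1]], phrase[-1]))
    return out

print(lz78_encode("aabaacabcabcb"))
# -> [(0,'a'), (1,'b'), (1,'a'), (0,'c'), (2,'c'), (5,'b')], as in the example above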

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later
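A minimal Python sketch of LZW; here the initial dictionary contains just the symbols of the chosen alphabet (the slides assume the 256 ASCII codes), and the decoder, which stays one step behind, handles the special "SSc with S[0]=c" case explicitly. Names are mine:

def lzw_encode(text, alphabet):
    dictionary = {c: i for i, c in enumerate(alphabet)}
    out, phrase = [], ""
    for c in text:
        if phrase + c in dictionary:
            phrase += c
        else:
            out.append(dictionary[phrase])
            dictionary[phrase + c] = len(dictionary)   # add Sc, but do not emit c
            phrase = c
    out.append(dictionary[phrase])
    return out

def lzw_decode(codes, alphabet):
    dictionary = {i: c for i, c in enumerate(alphabet)}
    prev = dictionary[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in dictionary:
            cur = dictionary[code]
        else:                                          # special case: code not known yet
            cur = prev + prev[0]
        out.append(cur)
        dictionary[len(dictionary)] = prev + cur[0]
        prev = cur
    return "".join(out)

codes = lzw_encode("aabaacabababa", "abc")
print(codes, lzw_decode(codes, "abc") == "aabaacabababa")   # decoder exercises the special case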

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us be given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: the L → F mapping

[Figure: the sorted BWT matrix for T = mississippi# (the middle columns are unknown to the decoder); first column F = # i i i i m p p s s s s, last column L = i p s s m # p i s s i i.]

How do we map L’s chars onto F’s chars?
... we need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible

[Figure: the same sorted matrix, with F = # i i i i m p p s s s s and L = i p s s m # p i s s i i.]
Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
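A minimal Python sketch of the same LF-based inversion, following the two key properties above (0-based indices; names are mine):

def invert_bwt(L, end="#"):
    # LF[r] = row of F holding the char L[r], distinguishing equal chars by rank
    counts, rank = {}, []
    for c in L:
        rank.append(counts.get(c, 0))
        counts[c] = counts.get(c, 0) + 1
    first, tot = {}, 0                          # first row of each char in F
    for c in sorted(counts):
        first[c] = tot
        tot += counts[c]
    LF = [first[c] + rank[r] for r, c in enumerate(L)]
    out, r = [], L.index(end)                   # row whose last char is the end-marker
    for _ in range(len(L)):
        out.append(L[r])                        # L[r] precedes F[r] in T
        r = LF[r]
    return "".join(reversed(out))

print(invert_bwt("ipssm#pissii"))               # -> mississippi#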

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Q(n2 log n) time in the worst-case
• Q(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics

Size
 1 trillion pages available (Google, 7/2008)
 5-40K per page => hundreds of terabytes
 Size grows every day!!

Change
 8% new pages and 25% new links change weekly
 Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?

 It is the largest artifact ever conceived by humans.
 Exploit the structure of the Web for:
   crawl strategies
   search
   spam detection
   discovering communities on the web
   classification/organization
 Predict the evolution of the Web:
   sociological understanding

Many other large graphs…

 Physical network graph
   V = routers, E = communication links
 The “cosine” graph (undirected, weighted)
   V = static web pages, E = semantic distance between pages
 Query-Log graph (bipartite, weighted)
   V = queries and URLs, E = (q,u) if u is a result for q and has been clicked by some user who issued q
 Social graph (undirected, unweighted)
   V = users, E = (x,y) if x knows y (facebook, address book, email, ..)

Definition
Directed graph G = (V,E):
 V = URLs, E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN & no OUT)

Three key properties; the first one:
 Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution
Altavista crawl (1999) and WebBase crawl (2001): the in-degree follows a power-law distribution

Pr[ in-degree(u) = k ]  proportional to  1 / k^a,   a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph

[Figure: the adjacency matrix (i,j) of a web crawl with 21 million pages and 150 million links.]

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The WebGraph library
From the uncompressed adjacency list to an adjacency list with compressed gaps (exploiting locality).

Successor list S(x) = {s1 - x, s2 - s1 - 1, ..., sk - s(k-1) - 1}

For negative entries:
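A minimal Python sketch of the gap encoding above for one adjacency list; the successors are assumed sorted and the gaps would then be variable-length coded (the mapping of a possibly negative first entry is the detail left open above and is omitted here). The node id and successor values are made up for illustration:

def gaps(x, successors):
    out = [successors[0] - x]                      # s1 - x (may be negative)
    for prev, cur in zip(successors, successors[1:]):
        out.append(cur - prev - 1)                 # s_i - s_{i-1} - 1
    return out

print(gaps(15, [13, 15, 16, 17, 18, 19, 23, 24, 203]))
# -> [-2, 1, 0, 0, 0, 0, 3, 0, 178]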

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



 The LZ77-scheme provides an efficient, optimal solution:
   fknown is the “previously encoded text”; compress the concatenation fknown·fnew, starting from fnew.
 zdelta is one of the best implementations.
           Emacs size   Emacs time
uncompr    27Mb         ---
gzip       8Mb          35 secs
zdelta     1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

[Figure: Client and client-side proxy on one end, server-side proxy and web on the other; delta-encoded pages cross the slow link, full pages the fast link, and each proxy keeps a reference version of the requested page.]

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: a weighted graph over the files plus a dummy node connected to all of them; edge weights are zdelta sizes (gzip sizes on the dummy edges), and the min branching picks the best reference for each file.]

           space   time
uncompr    30Mb    ---
tgz        20%     linear
THIS       8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
           space    time
uncompr    260Mb    ---
tgz        12%      2 mins
THIS       8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
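A much-simplified Python sketch of the rsync idea: the client sends the hashes of the fixed-size blocks of f_old; the server scans f_new and emits either a block reference or a literal byte. Real rsync uses a rolling weak hash plus MD5, as noted above; here a plain hash over each window stands in for both, and the names are mine:

def rsync_encode(f_old, f_new, B=4):
    block_of = {}
    for i in range(0, len(f_old) - B + 1, B):
        block_of.setdefault(hash(f_old[i:i + B]), i // B)   # client-side block hashes
    out, j = [], 0
    while j < len(f_new):
        k = block_of.get(hash(f_new[j:j + B])) if j + B <= len(f_new) else None
        if k is not None and f_old[k * B:(k + 1) * B] == f_new[j:j + B]:
            out.append(("block", k)); j += B                # reference to an old block
        else:
            out.append(("lit", f_new[j])); j += 1           # literal byte
    return out

def rsync_decode(encoded, f_old, B=4):
    out = []
    for t, v in encoded:
        out.append(f_old[v * B:(v + 1) * B] if t == "block" else v)
    return "".join(out)

old = "the quick brown fox jumps over the lazy dog."
new = "the quick brown cat jumps over the lazy dog!"
print(rsync_decode(rsync_encode(old, new), old) == new)     # -> True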

Rsync: some experiments

           gcc size   emacs size
total      27288      27326
gzip       7563       8577
zdelta     227        1431
rsync      964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends the hashes (unlike the client in rsync), and the client checks them.
The server deploys the common fref to compress the new ftar (rsync compresses just ftar).

A multi-round protocol
 k blocks of n/k elements
 log(n/k) levels
 If the distance is k, then on each level at most k hashes do not find a match in the other file.
 The communication complexity is O(k lg n lg(n/k)) bits.

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree

[Figure: the suffix tree of T# = mississippi#, with edge labels such as i, si, ssi, ppi#, pi#, i#, mississippi#, and leaves storing the starting positions 1..12 of the suffixes.]
The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

T = mississippi#    (storing SUF(T) explicitly would take Θ(N^2) space)

SA    SUF(T)
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#

Suffix Array space:
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison.

T = mississippi#, P = si

[Figure: binary search over SA = 12 11 8 5 2 1 10 9 7 4 6 3; at each step the middle suffix is compared with P (2 accesses per step), moving right if P is larger and left if P is smaller.]

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char comparisons
 overall, O(p log2 N) time
  (improvable to O(p + log2 N) [Manber-Myers, ’90], and to an additive log2 |S| term [Cole et al, ’06])
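A minimal Python sketch of the binary search above (the suffix-array construction shown is the naive quadratic one, for illustration only; names are mine):

def build_sa(T):
    return sorted(range(len(T)), key=lambda i: T[i:])        # didactic, not the efficient way

def search(T, SA, P):
    lo, hi = 0, len(SA)
    while lo < hi:                                           # leftmost suffix >= P
        mid = (lo + hi) // 2
        if T[SA[mid]:] < P:
            lo = mid + 1
        else:
            hi = mid
    occ = []
    while lo < len(SA) and T[SA[lo]:].startswith(P):         # contiguous block (Prop 1)
        occ.append(SA[lo] + 1)                               # report 1-based positions
        lo += 1
    return sorted(occ)

T = "mississippi#"
print(search(T, build_sa(T), "si"))                          # -> [4, 7]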

Locating the occurrences

T = mississippi#, P = si, occ = 2

[Figure: the occurrences of P form a contiguous range of SA (entries 7 and 4), delimited by binary searches for the smallest and largest extensions of P (si followed by the smallest and by the largest symbol).]

Suffix Array search: O(p + log2 N + occ) time.
Suffix Trays: O(p + log2 |S| + occ)   [Cole et al., ‘06]
String B-tree   [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays   [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest common prefix between suffixes adjacent in SA.

T = mississippi#
SA  = 12 11  8  5  2  1 10  9  7  4  6  3
Lcp =   0  1  1  4  0  0  1  0  2  1  3

• How long is the common prefix between T[i,...] and T[j,...]?
  Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L?
  Search for Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times?
  Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.

Slide 112

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
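A hedged Python sketch of the algorithm over the binary alphabet of the slides; here q is a fixed modulus rather than a random prime ≤ I, and every fingerprint hit is verified, so it behaves like the deterministic variant:

def karp_rabin(T, P, q=2**31 - 1):
    # T and P are strings of '0'/'1'; fingerprint Hq(s) = (sum 2^(m-i) s[i]) mod q
    n, m = len(T), len(P)
    hP = hT = 0
    for i in range(m):
        hP = (2 * hP + int(P[i])) % q      # Hq(P), computed incrementally
        hT = (2 * hT + int(T[i])) % q      # Hq(T_1)
    top = pow(2, m - 1, q)                 # 2^(m-1) mod q
    occ = []
    for r in range(n - m + 1):
        if hT == hP and T[r:r + m] == P:   # verification rules out false matches
            occ.append(r + 1)              # 1-based position, as in the slides
        if r + m < n:                      # roll the fingerprint one position right
            hT = (2 * (hT - int(T[r]) * top) + int(T[r + m])) % q
    return occ

# karp_rabin("10110101", "0101") -> [5]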

Problem 1: Solution
Dictionary = { bzip, not, or, space }
P = bzip = 1a 0b
S = “bzip or not bzip”

[Figure: the codeword of P is searched directly in C(S); tag bits identify codeword beginnings, so candidate alignments are checked (yes/no) without decompressing]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

      j:  1  2  3  4  5  6  7  8  9  10
      T:  c  a  l  i  f  o  r  n  i  a
 i=1 (f): 0  0  0  0  1  0  0  0  0  0
 i=2 (o): 0  0  0  0  0  1  0  0  0  0
 i=3 (r): 0  0  0  0  0  0  1  0  0  0
How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the (j-1)-th one.
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1 for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M(j) = BitShift( M(j−1) ) & U(T[j])


For i > 1, entry M(i,j) = 1 iff
(1) the first i−1 characters of P match the i−1 characters of T
ending at character j−1   ⇔  M(i−1, j−1) = 1
(2) P[i] = T[j]   ⇔  the i-th bit of U(T[j]) = 1

BitShift moves bit M(i−1, j−1) into the i-th position;
AND-ing this with the i-th bit of U(T[j]) establishes whether both conditions hold.
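A minimal Python sketch of this construction, keeping each column M(j) in a single integer (illustrative, not the lecture's code):

def shift_and(T, P):
    # bit i-1 (0-based) of M is set iff P[1..i] ends at the current position of T
    m = len(P)
    U = {}                                   # U[c] has bit i-1 set iff P[i] == c
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    last = 1 << (m - 1)                      # mask of the m-th bit: a full match
    M, occ = 0, []
    for j, c in enumerate(T, start=1):
        M = ((M << 1) | 1) & U.get(c, 0)     # BitShift(M(j-1)) & U(T[j])
        if M & last:
            occ.append(j - m + 1)            # P occurs ending at position j
    return occ

# shift_and("xabxabaaca", "abaac") -> [5]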

An example: P = abaac, T = xabxabaaca
U(x) = (0,0,0,0,0)   U(a) = (1,0,1,1,0)   U(b) = (0,1,0,0,0)   U(c) = (0,0,0,0,1)

j=1 (T[1]=x):  M(1) = BitShift(M(0)) & U(x) = (1,0,0,0,0) & (0,0,0,0,0) = (0,0,0,0,0)
j=2 (T[2]=a):  M(2) = BitShift(M(1)) & U(a) = (1,0,0,0,0) & (1,0,1,1,0) = (1,0,0,0,0)
j=3 (T[3]=b):  M(3) = BitShift(M(2)) & U(b) = (1,1,0,0,0) & (0,1,0,0,0) = (0,1,0,0,0)
…
j=9 (T[9]=c):  M(9) = BitShift(M(8)) & U(c) = (1,1,0,0,1) & (0,0,0,0,1) = (0,0,0,0,1)

The m-th bit of M(9) is set: P occurs in T ending at position 9.

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: Another solution
Dictionary = { bzip, not, or, space }
P = bzip = 1a 0b
S = “bzip or not bzip”

[Figure: the codeword of P is searched directly in C(S), checking the candidate alignments (yes/no) at tagged byte boundaries]

Speed ≈ Compression ratio

Problem 2
Dictionary = { bzip, not, or, space }
Given a pattern P, find all the occurrences in S of all terms containing P as a substring.
P = o

[Figure: the dictionary terms containing P are “or” (= 1g 0a 0b) and “not” (= 1g 0g 0a); each of their codewords is then searched in C(S), S = “bzip or not bzip”]

Speed ≈ Compression ratio? No! Why?
A scan of C(S) is needed for each term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: text T with the occurrences of the patterns P1 and P2 highlighted]

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

S is the concatenation of the patterns in P.
R is a bitmap of length m: R[i] = 1 iff S[i] is the first symbol of a pattern.

Use a variant of the Shift-And method searching for S:
 For any symbol c, U’(c) = U(c) AND R
   U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
 For any step j:
   compute M(j)
   then OR it with U’(T[j]). Why? This sets to 1 the first bit of each pattern that starts with T[j]
   check if there are occurrences ending in j. How?
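A hedged Python sketch of this variant; the end-of-pattern mask E used to answer the final “How?” is an addition of this sketch, not from the slides:

def multi_shift_and(T, patterns):
    S = "".join(patterns)
    U, R, E, pos = {}, 0, 0, 0
    for P in patterns:
        R |= 1 << pos                        # first symbol of P in S
        E |= 1 << (pos + len(P) - 1)         # last symbol of P in S (my addition)
        pos += len(P)
    for i, c in enumerate(S):
        U[c] = U.get(c, 0) | (1 << i)
    M, occ = 0, []
    for j, c in enumerate(T, start=1):
        Uc = U.get(c, 0)
        M = ((M << 1) & Uc) | (Uc & R)       # shift-and, then restart patterns beginning with c
        if M & E:
            occ.append(j)                    # some pattern ends at position j
    return occ

# multi_shift_and("aba", ["ab", "ba"]) -> [2, 3]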

Problem 3
Dictionary = { bzip, not, or, space }
Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing at most k mismatches.
P = bot, k = 2

[Figure: C(S) for S = “bzip or not bzip”, with the candidate dictionary terms to be checked against P]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal: given k, find all the occurrences of P
in T with up to k mismatches.
We define the matrix M^l to be an m by n binary
matrix, such that:

M^l(i,j) = 1 iff
there are no more than l mismatches between the
first i characters of P and the i characters of T
ending at character j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

[Figure: P[1..i−1] aligned with T[..j−1] with at most l mismatches, and P[i] = T[j]]

BitShift( M^l(j−1) ) & U(T[j])

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

[Figure: P[1..i−1] aligned with T[..j−1] with at most l−1 mismatches]

BitShift( M^(l−1)(j−1) )

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))
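A hedged Python sketch of this recurrence, with one integer per level l (names are illustrative):

def agrep(T, P, k):
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    last = 1 << (m - 1)
    M = [0] * (k + 1)                        # M[l] = current column of M^l
    occ = []
    for j, c in enumerate(T, start=1):
        Uc = U.get(c, 0)
        prev = M[:]                          # columns M^l(j-1)
        M[0] = ((prev[0] << 1) | 1) & Uc     # exact case
        for l in range(1, k + 1):
            # case 1: extend with an equal char; case 2: spend one more mismatch
            M[l] = (((prev[l] << 1) | 1) & Uc) | ((prev[l - 1] << 1) | 1)
        if M[k] & last:
            occ.append(j - m + 1)            # P matches ending at j with <= k mismatches
    return occ

# agrep("xabxabaaca", "abaad", 1) -> [5]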

Example M^1
T = xabxabaaca, P = abaad

[Figure: the 5×10 matrices M^0 and M^1, column by column; e.g. M^1(5,9) = 1, i.e. P matches T ending at position 9 with at most 1 mismatch (abaac vs abaad)]

How much do we pay?

The running time is O(k·n·(1 + m/w)).
Again, the method is practically efficient for small m.
Still, only O(k) columns of M are needed at any given time.
Hence, the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary = { bzip, not, or, space }
Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing k mismatches.
P = bot, k = 2

[Figure: the term “not” (= 1g 0g 0a) matches P with at most 2 mismatches; its codeword is then searched in C(S), S = “bzip or not bzip”]

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code (Elias γ) for integer encoding

g(x) = 0^(Length−1) followed by x in binary,  where x > 0 and Length = ⌊log2 x⌋ + 1
e.g., 9 is represented as <000, 1001>

The g-code for x takes 2·⌊log2 x⌋ + 1 bits  (i.e. a factor of 2 from optimal)
Optimal for Pr(x) = 1/(2x²), and i.i.d. integers
It is a prefix-free encoding…

Exercise: given the following sequence of g-coded integers, reconstruct the original sequence:
0001000001100110000011101100111
→ 8, 6, 3, 59, 7
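A small Python sketch of g-encoding and g-decoding over bit strings (for illustration only):

def gamma_encode(x):
    # gamma code of x > 0: (Length-1) zeros, then x in binary
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    # decode a concatenation of gamma codes
    out, i = [], 0
    while i < len(bits):
        l = 0
        while bits[i] == "0":                # count the leading zeros
            l += 1; i += 1
        out.append(int(bits[i:i + l + 1], 2))
        i += l + 1
    return out

# gamma_encode(9) -> "0001001"
# gamma_decode("0001000001100110000011101100111") -> [8, 6, 3, 59, 7]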

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code g(i).

Recall that: |g(i)| ≤ 2·log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2·H0(S) + 1
Key fact:
1 ≥ Σ_{i=1,…,x} pi ≥ x·px   ⟹   x ≤ 1/px

How good is it?
Encode the integers via g-coding:  |g(i)| ≤ 2·log i + 1
The cost of the encoding is (recall i ≤ 1/pi):

Σ_{i=1,…,|S|} pi·|g(i)|  ≤  Σ_{i=1,…,|S|} pi·[ 2·log(1/pi) + 1 ]  =  2·H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + …

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s·c with 2 bytes, s·c² with 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 128² = 16512 words within 2 bytes
A (230,26)-dense code encodes 230 + 230·26 = 6210 words within 2
bytes, hence more words on 1 byte, and thus does better if the distribution is skewed...
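A hedged Python sketch of (s,c)-dense encoding of a 0-based rank; the convention that byte values below s are the stoppers (and the rest continuers) is an assumption of this sketch, not necessarily the layout of the actual codes:

def sc_encode(rank, s, c):
    assert s + c == 256
    out = [rank % s]                 # final stopper byte, value in [0, s)
    rank //= s
    while rank > 0:
        rank -= 1
        out.append(s + rank % c)     # continuer byte, value in [s, 256)
        rank //= c
    return bytes(reversed(out))

# with s = c = 128 this is ETDC: ranks 0..127 take 1 byte, 128..16511 take 2 bytes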

Optimal (s,c)-dense codes
Find the optimal s, assuming c = 256 − s.

 Brute-force approach
 Binary search: on real distributions, it seems there is a unique minimum

Ks = max codeword length
Fs,k = cumulative probability of the symbols whose |codeword| ≤ k
Experiments: (s,c)-DC is quite interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:
 It exploits temporal locality, and it is dynamic
 X = 1^n 2^n 3^n … n^n  ⟹  Huff = O(n² log n) bits, MTF = O(n log n) + n² bits
 Not much worse than Huffman… but it may be far better
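A minimal Python sketch of MTF encoding (list-based, O(|S|) per symbol; the tree/hash solution below makes each step O(log |S|)):

def mtf_encode(text, alphabet):
    # output the current position of each symbol, then move it to the front
    L = list(alphabet)
    out = []
    for ch in text:
        i = L.index(ch)
        out.append(i)
        L.insert(0, L.pop(i))
    return out

# mtf_encode("aabbb", "abcd") -> [0, 0, 1, 0, 0]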

MTF: how good is it?
Encode the integers via g-coding:  |g(i)| ≤ 2·log i + 1
Put the alphabet S in front and consider the cost of encoding (p_{x,i} is the position of the i-th occurrence of symbol x):

O(|S| log |S|) + Σ_{x=1,…,|S|} Σ_{i=2,…,n_x} |g( p_{x,i} − p_{x,i−1} )|

By Jensen’s inequality:

≤ O(|S| log |S|) + Σ_{x=1,…,|S|} n_x · [ 2·log(N/n_x) + 1 ]
= O(|S| log |S|) + N·[ 2·H0(X) + 1 ]

La[mtf] ≤ 2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and separators) as the symbols to be encoded.
How to maintain the MTF-list efficiently:

 Search tree
   Leaves contain the symbols, ordered as in the MTF-list
   Nodes contain the size of their descending subtree
 Hash table
   key is a symbol
   data is a pointer to the corresponding tree leaf

Each tree operation takes O(log |S|) time
Total is O(n log |S|), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploits spatial locality, and it is a dynamic code (there is a memory)
X = 1^n 2^n 3^n … n^n  ⟹  Huff(X) = Θ(n² log n) > Rle(X) = Θ(n log n)
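A tiny Python sketch of RLE, matching the example above:

def rle_encode(s):
    # run-length encoding: (symbol, run length) pairs
    out = []
    for ch in s:
        if out and out[-1][0] == ch:
            out[-1] = (ch, out[-1][1] + 1)
        else:
            out.append((ch, 1))
    return out

# rle_encode("abbbaacccca") -> [('a',1),('b',3),('a',2),('c',4),('a',1)]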

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.  p(a) = .2,  p(b) = .5,  p(c) = .3

f(i) = Σ_{j<i} p(j)       f(a) = .0,  f(b) = .2,  f(c) = .7

[Figure: the unit interval [0,1) partitioned into a = [0,.2), b = [.2,.7), c = [.7,1)]

The interval for a particular symbol will be called
the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac

start:  [0, 1)
b  →  [.2, .7)      (the b-fraction of [0, 1))
a  →  [.2, .3)      (the a-fraction of [.2, .7))
c  →  [.27, .3)     (the c-fraction of [.2, .3))

The final sequence interval is [.27, .3)

Arithmetic Coding
To code a sequence of symbols c1 … cn with probabilities p[c], use the following:

l0 = 0,   s0 = 1
li = li−1 + si−1 · f[ci]
si = si−1 · p[ci]

f[c] is the cumulative probability up to symbol c (not included).
The final interval size is  sn = Π_{i=1,…,n} p[ci]

The interval for a message sequence will be called the
sequence interval
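A tiny Python sketch of these recurrences (floating point, so only for illustration; a real coder uses the integer version shown later):

def sequence_interval(msg, p, f):
    # returns (l, s): the sequence interval is [l, l+s)
    l, s = 0.0, 1.0
    for c in msg:
        l = l + s * f[c]
        s = s * p[c]
    return l, s

p = {'a': .2, 'b': .5, 'c': .3}
f = {'a': .0, 'b': .2, 'c': .7}
# sequence_interval("bac", p, f) -> (0.27, 0.03), i.e. the interval [.27, .3)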

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:

[0, 1):       .49 ∈ [.2, .7)      → b
[.2, .7):     .49 ∈ [.3, .55)     → b
[.3, .55):    .49 ∈ [.475, .55)   → c

The message is bbc.

Representing a real number
Binary fractional representation:
.75 = .11      1/3 = .010101…      11/16 = .1011

Algorithm
1. x = 2·x
2. If x < 1, output 0
3. else x = x − 1; output 1

So how about just using the shortest binary fractional
representation within the sequence interval?
e.g.  [0, .33) → .01      [.33, .66) → .1      [.66, 1) → .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions.

 code    min       max       interval
 .11     .110…     .111…     [.75, 1.0)
 .101    .1010…    .1011…    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (a dyadic number).

 Sequence interval: [.61, .79)
 Code interval (.101): [.625, .75)  ⊆  [.61, .79)

Can use l + s/2 truncated to 1 + ⌈log (1/s)⌉ bits

Bound on Arithmetic length

Note that −log s + 1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most
1 + ⌈log (1/s)⌉
= 1 + ⌈log Π_{i=1,…,n} (1/pi)⌉
≤ 2 + Σ_{i=1,…,n} log (1/pi)
= 2 + Σ_{k=1,…,|S|} n·pk·log (1/pk)
= 2 + n·H0 bits

≈ n·H0 + 0.02·n bits in practice, because of rounding

Integer Arithmetic Coding
Problem: operations on arbitrary-precision real numbers are expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 (top half):
 Output 1 followed by m 0s; set m = 0; the message interval is expanded by 2
If u < R/2 (bottom half):
 Output 0 followed by m 1s; set m = 0; the message interval is expanded by 2
If l ≥ R/4 and u < 3R/4 (middle half):
 Increment m; the message interval is expanded by 2
In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine: ATB keeps the current interval (L, s); given the next symbol c and the distribution (p1,…,p|S|), it returns the new interval (L’, s’), i.e. the sub-interval of [L, L+s) assigned to c.

[Figure: ATB as a state machine, (L,s) → (L’,s’) on input c]

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).
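A tiny Python sketch of the count-based conditional probabilities described above (escape handling, discussed next, is omitted; names are illustrative):

from collections import defaultdict

def build_counts(text, k):
    # counts[context][ch] = number of times ch followed that length-k context
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(k, len(text)):
        counts[text[i - k:i]][text[i]] += 1
    return counts

def cond_prob(counts, context, ch):
    total = sum(counts[context].values())
    return counts[context][ch] / total if total else 0.0

# e.g. if "th" was seen 12 times, followed by "e" 7 times, cond_prob(counts, "th", "e") == 7/12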

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox

[Figure: at each step the symbol s (a char c or esc) is fed to ATB together with the conditional distribution p[ s | context ], mapping (L,s) → (L’,s’)]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
String = ACCBACCACBA, next symbol B,  k = 2

Context: Empty     Counts:  A = 4   B = 2   C = 5   $ = 3

Context: A         Counts:  C = 3   $ = 1
Context: B         Counts:  A = 2   $ = 1
Context: C         Counts:  A = 1   B = 2   C = 2   $ = 3

Context: AC        Counts:  B = 1   C = 2   $ = 2
Context: BA        Counts:  C = 1   $ = 1
Context: CA        Counts:  C = 1   $ = 1
Context: CB        Counts:  A = 2   $ = 1
Context: CC        Counts:  A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
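A small Python version of the decoder, including the overlapping-copy case (illustrative, not gzip's actual code):

def lz77_decode(triples):
    # each triple is (d, len, c): copy `len` chars starting d back, then append c
    out = []
    for d, length, c in triples:
        start = len(out) - d
        for i in range(length):              # works also when length > d (overlapping copy)
            out.append(out[start + i])
        out.append(c)
    return "".join(out)

# lz77_decode([(0,0,'a'),(1,1,'c'),(3,4,'b'),(3,3,'a'),(1,2,'c')]) -> "aacaacabcabaaac"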

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
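A minimal Python sketch of the LZW encoder; the dictionary is seeded with the 256 single-byte symbols, and ids are illustrative:

def lzw_encode(text):
    dic = {chr(i): i for i in range(256)}
    next_id = 256
    out, S = [], ""
    for c in text:
        if S + c in dic:
            S += c                    # extend the current match
        else:
            out.append(dic[S])        # emit the id of the longest match
            dic[S + c] = next_id      # add S+c to the dictionary (c itself is not sent)
            next_id += 1
            S = c
    if S:
        out.append(dic[S])
    return out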

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform (1994)
We are given a text T = mississippi#. Form all its cyclic rotations:

mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows; F is the first column, L the last column:

F               L
# mississipp   i
i #mississip   p
i ppi#missis   s
i ssippi#mis   s
i ssissippi#   m
m ississippi   #
p i#mississi   p
p pi#mississ   i
s ippi#missi   s
s issippi#mi   s
s sippi#miss   i
s sissippi#m   i

L = i p s s m # p i s s i i

A famous example

Much
longer...

A useful tool: L → F mapping

[Figure: the sorted BWT matrix again, with each char of L linked to its copy in F]

How do we map L’s chars onto F’s chars?
… We need to distinguish equal chars in F…

Take two equal chars of L and rotate their rows rightward by one position:
the two rows keep the same relative order, so equal chars occur in L and in F in the same relative order !!

The BWT is invertible

[Figure: the sorted BWT matrix with its F and L columns, as above]

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
  Compute LF[0,n-1];      // LF[i] = position in F of the character L[i]
  r = 0; i = n;           // row 0 is the row starting with #
  while (i > 0) {
    T[i] = L[r];          // L[r] precedes F[r] in T
    r = LF[r]; i--;
  }
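A runnable Python sketch of the inversion; LF is built with a stable sort, which is valid precisely because equal chars keep their relative order in L and in F (property 1 above):

def invert_bwt(L):
    # F is L sorted; a stable sort of the indices of L gives, for each F position, the L position it came from
    order = sorted(range(len(L)), key=lambda i: L[i])
    LF = [0] * len(L)
    for f_pos, l_pos in enumerate(order):
        LF[l_pos] = f_pos
    out, r = [], 0                     # row 0 starts with the end-marker #
    for _ in range(len(L)):
        out.append(L[r])               # L[r] precedes F[r] in T
        r = LF[r]
    out.reverse()                      # now out = '#' + T without its final '#'
    return "".join(out[1:] + out[:1])  # rotate the '#' back to the end

# invert_bwt("ipssm#pissii") -> "mississippi#"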

How to compute the BWT?
The sorted rows of the BWT matrix correspond to the suffixes of T in lexicographic order, i.e. to the suffix array SA:

SA:  12  11   8   5   2   1  10   9   7   4   6   3
L:    i   p   s   s   m   #   p   i   s   s   i   i

We said that L[i] precedes F[i] in T; e.g. L[3] = T[7].
Given SA and T, we have  L[i] = T[SA[i] − 1]

How to construct SA from T?
Input: T = mississippi#

SA    suffix
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#

Elegant but inefficient. Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion pages are available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by humankind



Exploit the structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…

Physical network graph (undirected)
  V = routers
  E = communication links

The “cosine” graph (undirected, weighted)
  V = static web pages
  E = semantic distance between pages

Query-Log graph (bipartite, weighted)
  V = queries and URLs
  E = (q,u) if u is a result for q, and has been clicked by some user who issued q

Social graph (undirected, unweighted)
  V = users
  E = (x,y) if x knows y (facebook, address book, email, …)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has a hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution

[Figure: in-degree distributions measured on the Altavista crawl (1999) and the WebBase crawl (2001); indegree follows a power-law distribution]

Pr[ in-degree(u) = k ]  ∝  1 / k^α,    α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has a hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph

[Figure: adjacency-matrix picture of a crawl, with a dot at (i,j) for each link]

21 million pages, 150 million links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77-scheme provides an efficient, optimal solution




fknown is the “previously encoded text”: compress the concatenation fknown·fnew, starting from fnew

zdelta is one of the best implementations
           Emacs size    Emacs time
uncompr    27Mb          ---
gzip       8Mb           35 secs
zdelta     1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: a pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link.

[Figure: Client ↔ proxy ↔ (slow link, delta-encoding) ↔ proxy ↔ (fast link) ↔ web; both proxies keep the reference page]

Use zdelta to reduce traffic:
  the old version is available at both proxies
  restricted to pages already visited (30% hits), URL-prefix match
  small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: example weighted graph GF with a dummy node; edge weights are zdelta sizes and the min branching picks the best reference for each file]

           space    time
uncompr    30Mb     ---
tgz        20%      linear
THIS       8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F, thus saving
zdelta executions. Nonetheless, strictly n² time
           space    time
uncompr    260Mb    ---
tgz        12%      2 mins
THIS       8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
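A much-simplified Python sketch of the block-matching idea only; this is not the real rsync protocol (it hashes whole blocks with MD5 at every offset instead of using the rolling 4-byte hash plus 2-byte MD5 pair):

import hashlib

def block_hash(b):
    return hashlib.md5(b).digest()

def make_block_hashes(f_old, block):
    # client side: hash every non-overlapping block of the old file
    return {block_hash(f_old[i:i + block]): i // block
            for i in range(0, len(f_old), block)}

def rsync_delta(old_hashes, f_new, block):
    # server side: match blocks of f_new against the client's hashes; unmatched bytes become literals
    out, lit, i = [], bytearray(), 0
    while i + block <= len(f_new):
        j = old_hashes.get(block_hash(f_new[i:i + block]))
        if j is not None:
            if lit:
                out.append(("literal", bytes(lit))); lit = bytearray()
            out.append(("copy", j))          # the client already owns this block of f_old
            i += block
        else:
            lit.append(f_new[i]); i += 1     # no match: emit one literal byte, slide by 1
    lit.extend(f_new[i:])
    if lit:
        out.append(("literal", bytes(lit)))
    return out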

Rsync: some experiments

           gcc size    emacs size
total      27288       27326
gzip       7563        8577
zdelta     227         1431
rsync      964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), and the client checks them.
Server deploys the common fref to compress the new ftar (rsync compresses just it).

A multi-round protocol

k blocks of n/k elements each, log(n/k) levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

[Figure: P is a prefix of the suffix T[i,N]]

Occurrences of P in T = all suffixes of T having P as a prefix

P = si,  T = mississippi  →  occurrences at positions 4, 7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree

T# = mississippi#
     1 2 3 4 5 6 7 8 9 10 11 12

[Figure: the suffix tree of T#, with edge labels (#, i, s, si, ssi, ppi#, pi#, mississippi#, …) and leaves storing the starting positions 1..12 of the suffixes]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

T = mississippi#,  P = si

Storing SUF(T) explicitly would take Θ(N²) space; the suffix array stores only the suffix pointers:

SA    SUF(T)
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#

Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step

T = mississippi#,  P = si

[Figure: binary search over SA = 12 11 8 5 2 1 10 9 7 4 6 3; P is larger than the first probed suffix, smaller than a later one, and the range narrows to the suffixes prefixed by “si”]

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
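A minimal Python sketch of the indirect binary search (0-based positions; the suffix array below is the 0-based version of the one in the figures):

def sa_range(T, SA, P):
    # suffixes with prefix P are contiguous in SA; O(p) chars compared per step
    lo, hi = 0, len(SA)
    while lo < hi:                                        # leftmost suffix >= P
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + len(P)] < P: lo = mid + 1
        else: hi = mid
    left = lo
    lo, hi = left, len(SA)
    while lo < hi:                                        # leftmost suffix not prefixed by P
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + len(P)] == P: lo = mid + 1
        else: hi = mid
    return [SA[i] for i in range(left, lo)]               # starting positions of the occurrences

T = "mississippi#"
SA = [11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]
# sa_range(T, SA, "si") -> [6, 3], i.e. 1-based positions 7 and 4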

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O(p + log2 N + occ) time

Suffix Trays: O(p + log2 |S| + occ)   [Cole et al., ’06]
String B-tree   [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays   [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp:      0   1   1   4   0   0   1   0   2   1   3
SA:  12  11   8   5   2   1  10   9   7   4   6   3

T = mississippi#
e.g. Lcp[4] = 4 is the lcp of the adjacent suffixes issippi… (pos 5) and ississippi… (pos 2)

• How long is the common prefix between T[i,...] and T[j,...]?
  Min of the subarray Lcp[h,k-1] such that SA[h] = i and SA[k] = j.
• Does there exist a repeated substring of length ≥ L?
  Search for some Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times?
  Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.


Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution

[Figure: the word-based dictionary {bzip, not, or, space} and the compressed text C(S) of S = “bzip or not bzip”; the codeword of P = bzip (1a 0b) is searched directly inside C(S), marking “yes” at its two occurrences.]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

[Figure: the m×n matrix M for T = california and P = for; the only 1-entries are M(1,5), M(2,6), M(3,7), i.e. the prefixes f, fo, for of P match the text ending at positions 5, 6, 7.]
How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the (j-1)-th one.
We define the m-length binary vector U(x) for each
character x in the alphabet: U(x) is set to 1 at the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M(j) = BitShift(M(j-1)) & U(T[j])


For i > 1, entry M(i,j) = 1 iff
(1) the first i-1 characters of P match the i-1 characters of T
ending at character j-1        ⇔  M(i-1,j-1) = 1
(2) P[i] = T[j]                ⇔  the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) into the i-th position;
ANDing it with the i-th bit of U(T[j]) establishes whether both conditions hold.
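A bit-parallel sketch in Python (names are ours): each column of M is kept as an integer whose bit i represents row i+1, so BitShift becomes (M << 1) | 1 and a full match is signalled by bit m-1:

def shift_and(T, P):
    """Return the 0-based positions where an occurrence of P ends in T."""
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)        # U(c): positions of c in P
    M, occ = 0, []
    for j, c in enumerate(T):
        M = ((M << 1) | 1) & U.get(c, 0)     # M(j) = BitShift(M(j-1)) & U(T[j])
        if M & (1 << (m - 1)):               # M(m, j) = 1: P ends at position j
            occ.append(j)
    return occ

For instance shift_and("xabxabaaca", "abaac") returns [8], i.e. the occurrence ending at position 9 in the slides' 1-based numbering.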

An example (j = 1, 2, 3, 9)

[Figure: four snapshots of M for P = abaac and T = xabxabaaca, each column computed as BitShift(M(j-1)) & U(T[j]): column 1 is all zeros (T[1] = x does not occur in P), column 2 is [1,0,0,0,0]ᵀ, column 3 is [0,1,0,0,0]ᵀ, and in column 9 the last bit is set, i.e. M(5,9) = 1: P occurs in T ending at position 9.]

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac

U(a) = [1,0,1,1,0]ᵀ    U(b) = [1,1,0,0,0]ᵀ    U(c) = [0,0,0,0,1]ᵀ

What about ‘?’ and ‘[^…]’ (not)?

Problem 1: Another solution

[Figure: again the dictionary {bzip, not, or, space} and the compressed text C(S) of S = “bzip or not bzip”; the codeword of P = bzip (1a 0b) is matched against C(S), with a yes/no outcome at each candidate position.]

Speed ≈ Compression ratio

Problem 2

Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring.

[Figure: the dictionary {bzip, not, or, space} and C(S) for S = “bzip or not bzip”; for P = o the matching terms are not = 1g 0g 0a and or = 1g 0a 0b, whose codewords are then searched in C(S).]

Speed ≈ Compression ratio? No! Why?
A scan of C(S) is needed for each term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: a text T with the occurrences of patterns P1 and P2 highlighted.]

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And


S is the concatenation of the patterns in P
R is a bitmap of length m:
  R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of the Shift-And method searching for S:

  For any symbol c, U’(c) = U(c) AND R
   U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
  For any step j,
   compute M(j)
   then set M(j) = M(j) OR U’(T[j]). Why?
   This sets to 1 the first bit of each pattern that starts with T[j]
   Check if there are occurrences ending at j. How? (a sketch follows below)
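A sketch of this multi-pattern variant (the bookkeeping of where each pattern starts and ends inside S is ours):

def multi_shift_and(T, patterns):
    """Return (pattern index, 0-based end position) for every occurrence."""
    S = "".join(patterns)
    U, R, ends, pos = {}, 0, [], 0
    for P in patterns:
        R |= 1 << pos                    # first symbol of this pattern
        ends.append(pos + len(P) - 1)    # last symbol of this pattern
        pos += len(P)
    for i, c in enumerate(S):
        U[c] = U.get(c, 0) | (1 << i)
    M, occ = 0, []
    for j, c in enumerate(T):
        # usual step, plus re-injecting the starting bits via U'(c) = U(c) AND R
        M = ((M << 1) & U.get(c, 0)) | (U.get(c, 0) & R)
        for k, e in enumerate(ends):     # occurrences end where a last-symbol bit is set
            if M & (1 << e):
                occ.append((k, j))
    return occ

For example multi_shift_and("abab", ["ab", "ba"]) reports (0,1), (1,2), (0,3).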

Problem 3

Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring, allowing at most k mismatches.

[Figure: the dictionary {bzip, not, or, space} and C(S) for S = “bzip or not bzip”, with the query P = bot and k = 2.]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4; it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
there are no more than l mismatches between the
first i characters of P and the i characters of T
ending at character j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

[Figure: the first i-1 characters of P aligned, with at most l mismatches, against the text ending at position j-1; the pair at position j matches.]

BitShift(M^l(j-1)) & U(T[j])

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

[Figure: the first i-1 characters of P aligned, with at most l-1 mismatches, against the text ending at position j-1; the pair at position j is allowed to mismatch.]

BitShift(M^(l-1)(j-1))

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M^l(j) = [ BitShift(M^l(j-1)) & U(T[j]) ]  OR  BitShift(M^(l-1)(j-1))
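A sketch of the k-mismatch search built on this recurrence (in Python; variable names are ours):

def agrep_mismatch(T, P, k):
    """Return (l, j): P ends at 0-based position j with at most l mismatches."""
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    top = 1 << (m - 1)
    M = [0] * (k + 1)                        # M[l] holds the current column of M^l
    occ = []
    for j, c in enumerate(T):
        prev = M[:]                          # the columns for position j-1
        for l in range(k + 1):
            col = ((prev[l] << 1) | 1) & U.get(c, 0)      # case 1: T[j] matches
            if l > 0:
                col |= (prev[l - 1] << 1) | 1             # case 2: spend a mismatch at j
            M[l] = col
        for l in range(k + 1):
            if M[l] & top:
                occ.append((l, j))
                break                        # report the smallest such l
    return occ

On the slides' example, agrep_mismatch("xabxabaaca", "abaad", 1) reports (1, 8): one mismatch, occurrence ending at position 9 (1-based).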

Example M1

[Figure: the matrices M0 and M1 for T = xabxabaaca and P = abaad. Row 1 of M1 is all 1s (a single mismatch always covers the first character); M1(5,9) = 1, i.e. P occurs ending at position 9 with at most one mismatch, while the last row of M0 contains no 1.]

How much do we pay?





The running time is O(kn(1+m/w)).
Again, the method is practically efficient for
small m.
Only O(k) columns of M are needed at any
given time, hence the space used by the
algorithm is O(k) memory words.

Problem 3: Solution

Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring, allowing k mismatches.

[Figure: the dictionary {bzip, not, or, space} and C(S) for S = “bzip or not bzip”; for P = bot with k = 2 the matching term is not = 1g 0g 0a, whose codeword is then searched in C(S).]

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code (Elias γ) for integer encoding
g(x) = 0 0 ... 0 (Length-1 zeros) followed by x in binary

x > 0 and Length = ⌊log2 x⌋ + 1
e.g., 9 is represented as <000, 1001>.

g-code for x takes 2⌊log2 x⌋ + 1 bits

(i.e., a factor of 2 from the optimal)

Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

Answer: 8, 6, 3, 59, 7
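A small Python sketch of the γ encoder/decoder; running gamma_decode on the bit string above indeed returns [8, 6, 3, 59, 7]:

def gamma_encode(x):
    """γ-code of a positive integer: Length-1 zeros, then x in binary."""
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    """Decode a concatenation of γ-codes back into the list of integers."""
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":          # count the Length-1 leading zeros
            z += 1
            i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out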

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How good is this approach compared to Huffman?
Compression ratio ≤ 2 * H0(S) + 1
Key fact:
1 ≥ Σ_{i=1..x} pi ≥ x · px   ⇒   x ≤ 1/px

How good is it ?
Encode the integers via g-coding:
|g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):

Σ_{i=1..|S|} pi · |g(i)|

This is:

Σ_{i=1..|S|} pi · [ 2 · log(1/pi) + 1 ]  ≤  2 · H0(X) + 1

Not much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: ∝ 1/i^q, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 128^2 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230·26 = 6210 words on 2
bytes, hence more fit on 1 byte, which pays off on skewed distributions...
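A sketch of one natural (s,c)-dense encoder in Python (conventions for ordering the codewords vary; this mixed-radix mapping is our choice):

def sc_dense_encode(rank, s=230, c=26):
    """(s,c)-dense code of a 0-based rank: the last byte is a stopper in [0, s),
    every earlier byte is a continuer in [s, s+c).  With s + c = 256 there are
    s one-byte codewords, s·c two-byte ones, s·c^2 three-byte ones, and so on."""
    assert s + c <= 256
    block, length = s, 1
    while rank >= block:            # find the codeword length
        rank -= block
        block *= c
        length += 1
    out = [rank % s]                # stopper byte
    rank //= s
    for _ in range(length - 1):
        out.append(s + rank % c)    # continuer bytes
        rank //= c
    return bytes(reversed(out))

With s = c = 128 this degenerates to the End-Tagged Dense Code of the previous slide.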

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, there seems to be a unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC is quite interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n^2 log n), MTF = O(n log n) + n^2

Not much worse than Huffman
...but it may be far better
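A minimal MTF sketch in Python (0-based positions; the slides count from 1):

def mtf_encode(text, alphabet):
    """Replace each symbol by its current position in the list L,
    then move that symbol to the front of L."""
    L, out = list(alphabet), []
    for s in text:
        i = L.index(s)
        out.append(i)
        L.insert(0, L.pop(i))
    return out

def mtf_decode(codes, alphabet):
    L, out = list(alphabet), []
    for i in codes:
        out.append(L[i])
        L.insert(0, L.pop(i))
    return "".join(out)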

MTF: how good is it ?
Encode the integers via g-coding:
|g(i)| ≤ 2 * log i + 1
Put S in front and consider the cost of encoding:

O(|S| log |S|)  +  Σ_{x=1..|S|} Σ_{i=2..n_x} | g( p_i^x - p_{i-1}^x ) |

where p_1^x < p_2^x < ... are the positions of the n_x occurrences of symbol x.

By Jensen’s inequality:

≤  O(|S| log |S|)  +  Σ_{x=1..|S|} n_x · [ 2 · log(N/n_x) + 1 ]

=  O(|S| log |S|)  +  N · [ 2 · H0(X) + 1 ]

⇒  La[mtf]  ≤  2 · H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings ⇒ just the run lengths and one initial bit
Properties:




Exploit spatial locality, and it is a dynamic code
There is a memory

X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = Θ(n^2 log n)  >  Rle(X) = Θ(n log n)
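A one-function RLE sketch in Python, reproducing the example above:

def rle(s):
    """abbbaacccca -> [('a', 1), ('b', 3), ('a', 2), ('c', 4), ('a', 1)]"""
    runs = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs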

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g., p(a) = .2, p(b) = .5, p(c) = .3

f(i) = Σ_{j=1..i-1} p(j)

f(a) = .0, f(b) = .2, f(c) = .7

[Figure: the unit interval [0,1) split into a = [0,.2), b = [.2,.7), c = [.7,1).]

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
[Figure: three nested refinements of [0,1): coding b gives [.2,.7), then a gives [.2,.3), then c gives [.27,.3).]

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l_0 = 0      l_i = l_{i-1} + s_{i-1} · f[c_i]
s_0 = 1      s_i = s_{i-1} · p[c_i]

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

s_n = Π_{i=1..n} p[c_i]

The interval for a message sequence will be called the
sequence interval
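A small Python sketch of these recurrences (the symbol order used for f() is our choice, the alphabetical one):

def sequence_interval(msg, p):
    """Return the sequence interval [l, l+s) of msg under probabilities p."""
    f, acc = {}, 0.0
    for c in sorted(p):                 # f(c) = cumulative prob. of preceding symbols
        f[c] = acc
        acc += p[c]
    l, s = 0.0, 1.0
    for c in msg:
        l = l + s * f[c]                # l_i = l_{i-1} + s_{i-1} * f(c_i)
        s = s * p[c]                    # s_i = s_{i-1} * p(c_i)
    return l, l + s

sequence_interval("bac", {"a": .2, "b": .5, "c": .3}) returns ≈ (0.27, 0.30), the interval computed in the example above.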

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
1.0

0.7
c = .3

0.7
0.49

b = .5

c = .3

0.55
0.49

0.2

0.0

0.55

b = .5

0.3
a = .2

The message is bbc.

0.2

0.49
0.475

c = .3

b = .5
0.35

a = .2

0.3

a = .2

Representing a real number
Binary fractional representation:
.75 = .11      1/3 = .010101...      11/16 = .1011

Algorithm:
1. x = 2*x
2. If x < 1 output 0
3. else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
code      min        max        interval
.11       .110...    .111...    [.75, 1.0)
.101      .1010...   .1011...   [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

[Figure: a sequence interval [.61, .79) containing the code interval of .101, i.e. [.625, .75).]

Can use L + s/2 truncated to 1 + ⌈log (1/s)⌉ bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l ≥ R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

[Figure: the Arithmetic ToolBox (ATB) as a state machine: given the current interval (L, s) and a symbol c drawn from the distribution (p1, ..., p|S|), it outputs the refined interval (L', s').]

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
[Figure: the same ATB state machine, now driven by PPM: the probability fed to it is p[s | context], where s is either a plain symbol c or the escape symbol esc; (L, s) is mapped to (L', s').]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts                    String = ACCBACCACBA B      k=2

Context    Counts
Empty      A = 4    B = 2    C = 5    $ = 3

Context    Counts
A          C = 3    $ = 1
B          A = 2    $ = 1
C          A = 1    B = 2    C = 2    $ = 3

Context    Counts
AC         B = 1    C = 2    $ = 2
BA         C = 1    $ = 1
CA         C = 1    $ = 1
CB         A = 2    $ = 1
CC         A = 1    B = 1    $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves
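A toy LZ77 encoder in Python that reproduces the windowed example below (function name and the brute-force match search are ours):

def lz77_encode(text, window=6):
    """Emit (d, length, next char) triples and advance by length + 1.
    d is the backward distance of the longest match found inside the
    last `window` characters; matches may overlap the cursor."""
    out, i, n = [], 0, len(text)
    while i < n:
        best_len, best_d = 0, 0
        for d in range(1, min(i, window) + 1):
            l = 0
            while i + l < n - 1 and text[i + l - d] == text[i + l]:
                l += 1                    # overlapping copies are allowed
            if l > best_len:
                best_len, best_d = l, d
        out.append((best_d, best_len, text[i + best_len]))
        i += best_len + 1
    return out

lz77_encode("aacaacabcabaaac") returns (0,0,a) (1,1,c) (3,4,b) (3,3,a) (1,2,c), as in the next slide.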

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
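A toy LZW encoder in Python (the dictionary is seeded with the distinct characters of the input rather than the full 256 ASCII codes, just to keep the sketch short):

def lzw_encode(text):
    """Emit the code of the longest dictionary prefix S, then add S + next-char."""
    dic = {ch: i for i, ch in enumerate(sorted(set(text)))}
    out, S = [], ""
    for ch in text:
        if S + ch in dic:
            S += ch
        else:
            out.append(dic[S])
            dic[S + ch] = len(dic)     # new entry: longest match + next character
            S = ch
    if S:
        out.append(dic[S])
    return out

The decoder rebuilds the same dictionary one step behind, which is exactly where the special SSc case above shows up.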

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
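A compact Python sketch of both directions, using the quadratic sorted-rotations construction of the slides and the LF walk above (T must end with a unique smallest sentinel such as '#'):

def bwt(T):
    """Last column of the sorted cyclic rotations of T."""
    rot = sorted(T[i:] + T[:i] for i in range(len(T)))
    return "".join(r[-1] for r in rot)

def ibwt(L):
    """Invert the BWT via the LF mapping: L[i] precedes F[i] in T."""
    n = len(L)
    # stable sort of L gives F; equal characters keep their relative order
    order = sorted(range(n), key=lambda i: (L[i], i))
    LF = [0] * n
    for f_pos, l_pos in enumerate(order):
        LF[l_pos] = f_pos
    r, out = 0, []                 # row 0 is the rotation starting with '#'
    for _ in range(n):
        out.append(L[r])           # the character preceding the current one in T
        r = LF[r]
    T = "".join(reversed(out))
    return T[1:] + T[0]            # rotate so that '#' ends up last

ibwt(bwt("mississippi#")) gives back "mississippi#".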

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n^2 log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion pages are available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pr that a node has x links is ∝ 1/x^a, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in-degree(u) = k ]  ∝  1 / k^a,   a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pr that a node has x links is ∝ 1/x^a, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 million pages, 150 million links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1 - x, s2 - s1 - 1, ..., sk - s(k-1) - 1}

For negative entries:
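A small sketch of this gap transformation in Python (the treatment of a possibly negative first gap is left to the caller, as on the slide):

def gap_encode(x, successors):
    """WebGraph-style gaps for the sorted adjacency list of node x:
    the first entry is s1 - x, the following ones are s_i - s_(i-1) - 1."""
    s = sorted(successors)
    return [s[0] - x] + [s[i] - s[i - 1] - 1 for i in range(1, len(s))]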

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides an efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
            Emacs size    Emacs time
uncompr     27Mb          ---
gzip        8Mb           35 secs
zdelta      1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: a small weighted graph over the files, with the dummy node and zdelta/gzip sizes as edge weights; the min branching picks the cheapest reference for each file.]

            space    time
uncompr     30Mb     ---
tgz         20%      linear
THIS        8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strictly n^2 time

            space    time
uncompr     260Mb    ---
tgz         12%      2 mins
THIS        8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

           gcc size    emacs size
total      27288       27326
gzip       7563        8577
zdelta     227         1431
rsync      964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), the client checks them
Server deploys the common fref to compress the new ftar (rsync compresses just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree

[Figure: the suffix tree of T# = mississippi#, with edge labels such as #, i, s, p, si, ssi, ppi#, pi#, i#, mississippi#, and 12 leaves storing the starting positions of the suffixes.]
The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
5

Θ(N^2) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
⇒ In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
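A direct Python sketch of this indirect binary search (quadratic-time SA construction, as in the slides; function names are ours):

def suffix_array(T):
    """Sort the suffixes of T by brute force (fine for small texts)."""
    return sorted(range(len(T)), key=lambda i: T[i:])

def sa_search(T, SA, P):
    """All 0-based starting positions of P in T: two binary searches
    delimit the contiguous SA interval of suffixes prefixed by P,
    each comparison costing O(|P|) characters."""
    p = len(P)

    def boundary(upper):
        lo, hi = 0, len(SA)
        while lo < hi:
            mid = (lo + hi) // 2
            pref = T[SA[mid]:SA[mid] + p]
            if pref < P or (upper and pref == P):
                lo = mid + 1
            else:
                hi = mid
        return lo

    lo, hi = boundary(False), boundary(True)
    return sorted(SA[lo:hi])

For T = mississippi# and P = si this returns [3, 6], i.e. positions 4 and 7 in the slides' 1-based numbering.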

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does it exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does it exist a substring of length ≥ L occurring ≥ C times ?
• Search for Lcp[i,i+C-2] whose entries are ≥ L


Slide 114

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

[Figure: the m×n matrix M for T = california and P = for. The only 1-entries are M(1,5), M(2,6), M(3,7): column 7 has its last bit set, i.e. an occurrence of P ends at position 7 of T.]

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size (e.g., 32 or 64 bits). We’ll assume m ≤ w. NOTICE: any column of M then fits in a memory word.

How to construct M






We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one.
We define the m-length binary vector U(x) for each character x in the alphabet: U(x) is set to 1 at the positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

    M(j) = BitShift( M(j-1) ) & U( T[j] )


For i > 1, entry M(i,j) = 1 iff
(1) the first i-1 characters of P match the i-1 characters of T ending at character j-1   ⇔ M(i-1,j-1) = 1
(2) P[i] = T[j]   ⇔ the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) into the i-th position;
AND-ing it with the i-th bit of U(T[j]) establishes whether both conditions hold.

An example (j = 1, 2, 3, 9)
[Figure: four snapshots of the scan for P = abaac over T = xabxabaaca, showing the column M(j) = BitShift(M(j-1)) & U(T[j]) for j = 1, 2, 3 and 9, together with the vectors U(x), U(a), U(b), U(c). At j = 9 the 5th (last) bit of the column is 1, so an occurrence of P ends at position 9 of T.]

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.
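A minimal Python sketch of the Shift-And scan just described, assuming m ≤ w so that a column of M fits in one machine word (here a Python int used as a bit mask).

  def shift_and(T, P):
      m = len(P)
      U = {}                                          # U[c] has bit i-1 set iff P[i] = c
      for i, c in enumerate(P):
          U[c] = U.get(c, 0) | (1 << i)
      col, occ = 0, []
      for j, c in enumerate(T, start=1):
          # M(j) = BitShift(M(j-1)) & U(T[j]); the shift also sets the first bit to 1
          col = ((col << 1) | 1) & U.get(c, 0)
          if col & (1 << (m - 1)):                    # last bit set -> an occurrence ends at j
              occ.append(j - m + 1)
      return occ

  print(shift_and("xabxabaaca", "abaac"))             # [5]: P occurs at T[5,9]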

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
    U(a) = (1,0,1,1,0)ᵀ      U(b) = (1,1,0,0,0)ᵀ      U(c) = (0,0,0,0,1)ᵀ

What about ‘?’, ‘[^…]’ (not)?

Problem 1: Another solution
Dictionary = {bzip, not, or, space},  P = bzip = 1a 0b,  S = “bzip or not bzip”

[Figure: the codewords of C(S) are scanned one by one and checked against the compressed pattern (no / no / yes … yes); the two occurrences of [bzip] are reported.]

Speed ≈ Compression ratio

Problem 2
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring.

Dictionary = {bzip, not, or, space},  P = o,  S = “bzip or not bzip”
Terms containing P:  not = 1g 0g 0a,   or = 1g 0a 0b

[Figure: both codeword sequences are searched in C(S); the occurrences of [or] and [not] are reported (yes / yes).]

Speed ≈ Compression ratio? No! Why?
A scan of C(S) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: a text T over {A,B,C,D} with the occurrences of patterns P1 and P2 highlighted.]

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

S is the concatenation of the patterns in P.
R is a bitmap of length m:
  R[i] = 1 iff S[i] is the first symbol of a pattern.

Use a variant of the Shift-And method searching for S:
  For any symbol c, U’(c) = U(c) AND R
    ⇒ U’(c)[i] = 1 iff S[i] = c and it is the first symbol of a pattern
  For any step j,
    compute M(j)
    then M(j) OR U’(T[j]). Why?
      ⇒ it sets to 1 the first bit of each pattern that starts with T[j]
    Check if there are occurrences ending in j. How?

Problem 3
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring, allowing at most k mismatches.

Dictionary = {bzip, not, or, space},  P = bot, k = 2,  S = “bzip or not bzip”

[Figure: the compressed dictionary and the compressed text C(S) of the usual example.]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal: given k, find all the occurrences of P in T with up to k mismatches.
We define the matrix Mˡ to be an m by n binary matrix such that:

  Mˡ(i,j) = 1 iff there are no more than l mismatches between the first i characters of P and the i characters of T ending at character j.

What is M⁰?
How does Mᵏ solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

[Figure: P aligned against T ending at position j; the first i-1 characters match T up to j-1 with at most l mismatches, and P[i] = T[j].]

    BitShift( Mˡ(j-1) ) & U( T[j] )

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

[Figure: P aligned against T ending at position j; the first i-1 characters match T up to j-1 with at most l-1 mismatches, so T[j] may mismatch P[i].]

    BitShift( Mˡ⁻¹(j-1) )

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))
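A minimal Python sketch of the recurrence above, again assuming m ≤ w; cols[l] holds the current column of Mˡ.

  def shift_and_mismatches(T, P, k):
      m = len(P)
      U = {}
      for i, c in enumerate(P):
          U[c] = U.get(c, 0) | (1 << i)
      cols = [0] * (k + 1)                            # columns of M^0 ... M^k at position j-1
      occ = []
      for j, c in enumerate(T, start=1):
          prev = cols[:]
          for l in range(k + 1):
              new = ((prev[l] << 1) | 1) & U.get(c, 0)        # case 1: P[i] = T[j]
              if l > 0:
                  new |= (prev[l - 1] << 1) | 1               # case 2: spend one more mismatch
              cols[l] = new
          if cols[k] & (1 << (m - 1)):
              occ.append(j - m + 1)                   # occurrence with <= k mismatches ends at j
      return occ

  print(shift_and_mismatches("xabxabaaca", "abaad", 1))       # [5]: abaac vs abaad, 1 mismatch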

Example
[Figure: the matrices M⁰ and M¹ for T = xabxabaaca and P = abaad (columns j = 1…10, rows i = 1…5). Row 5 of M¹ has a 1 in column 9: P occurs ending at position 9 of T with at most one mismatch (abaac vs abaad).]

How much do we pay?





The running time is O(k·n·(1+m/w)).
Again, the method is practically efficient for small m.
Only O(k) columns of M are needed at any given time, hence the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring, allowing k mismatches.

Dictionary = {bzip, not, or, space},  P = bot, k = 2,  S = “bzip or not bzip”

[Figure: the codeword sequence of each candidate term is checked against P with the k-mismatch scan; not = 1g 0g 0a matches within 2 mismatches and its occurrence in C(S) is reported (yes).]

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p into a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding
    γ(x) = 0…0 (Length-1 zeros) followed by x in binary
    x > 0 and Length = ⌊log₂ x⌋ + 1
    e.g., 9 is represented as <000, 1001>.

γ-code for x takes 2⌊log₂ x⌋ + 1 bits  (i.e., a factor of 2 from optimal)

Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8, 6, 3, 59, 7
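A minimal Python sketch of γ-encoding and of the decoding of a γ-coded bit string, matching the exercise above.

  def gamma_encode(x):                                # x > 0
      b = bin(x)[2:]                                  # x in binary, Length = floor(log2 x) + 1 bits
      return "0" * (len(b) - 1) + b                   # (Length-1) zeros, then x in binary

  def gamma_decode_all(bits):
      out, i = [], 0
      while i < len(bits):
          zeros = 0
          while bits[i] == "0":                       # the unary part gives Length-1
              zeros += 1
              i += 1
          out.append(int(bits[i:i + zeros + 1], 2))   # read Length bits
          i += zeros + 1
      return out

  print(gamma_encode(9))                                           # 0001001
  print(gamma_decode_all("0001000001100110000011101100111"))       # [8, 6, 3, 59, 7]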

Analysis
Sort the pᵢ in decreasing order, and encode sᵢ via the variable-length code γ(i).

Recall that: |γ(i)| ≤ 2·log₂ i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2·H₀(s) + 1
Key fact:
    1 ≥ Σ_{i=1,…,x} pᵢ ≥ x·pₓ   ⇒   x ≤ 1/pₓ

How good is it ?
Encode the integers via γ-coding:  |γ(i)| ≤ 2·log₂ i + 1
The cost of the encoding is (recall i ≤ 1/pᵢ):

    Σ_{i=1,…,|S|} pᵢ · |γ(i)|  ≤  Σ_{i=1,…,|S|} pᵢ · [ 2·log₂(1/pᵢ) + 1 ]  =  2·H₀(X) + 1

Not much worse than Huffman, and improvable to H₀(X) + 2 + …

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2

A new concept: Continuers vs Stoppers
  Previously we used: s = c = 128

The main idea is:
  s + c = 256 (we are playing with 8 bits)
  Thus s items are encoded with 1 byte,
  s·c with 2 bytes, s·c² with 3 bytes, ...

An example
  5000 distinct words
  ETDC encodes 128 + 128² = 16512 words within 2 bytes
  A (230,26)-dense code encodes 230 + 230·26 = 6210 words within 2 bytes, hence more words on 1 byte, and thus it wins if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:
  It exploits temporal locality, and it is dynamic
  X = 1ⁿ 2ⁿ 3ⁿ … nⁿ  ⇒  Huff = O(n² log n), MTF = O(n log n) + n²

Not much worse than Huffman ... but it may be far better
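A minimal Python sketch of the MTF transform (and its inverse); the list L is kept as a plain Python list, not the balanced-tree / hash-table combination discussed a few slides below.

  def mtf_encode(s, alphabet):
      L = list(alphabet)                              # the MTF list
      out = []
      for c in s:
          r = L.index(c)                              # 1) output the position of c in L
          out.append(r)
          L.insert(0, L.pop(r))                       # 2) move c to the front of L
      return out

  def mtf_decode(ranks, alphabet):
      L = list(alphabet)
      out = []
      for r in ranks:
          c = L[r]
          out.append(c)
          L.insert(0, L.pop(r))
      return "".join(out)

  codes = mtf_encode("bananaaa", "abn")
  print(codes, mtf_decode(codes, "abn"))              # the ranks, then "bananaaa" back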

MTF: how good is it ?
Encode the integers via γ-coding:  |γ(i)| ≤ 2·log₂ i + 1
Put S in front and consider the cost of encoding:

    O(|S| log |S|)  +  Σ_{x=1,…,|S|} Σ_{i=2,…,n_x} |γ( pᵢˣ - pᵢ₋₁ˣ )|

By Jensen’s inequality:

    ≤  O(|S| log |S|)  +  Σ_{x=1,…,|S|} n_x · [ 2·log₂(N/n_x) + 1 ]
    =  O(|S| log |S|)  +  N · [ 2·H₀(X) + 1 ]

    L_a[mtf]  ≤  2·H₀(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:
  It exploits spatial locality, and it is a dynamic code
  There is a memory
  X = 1ⁿ 2ⁿ 3ⁿ … nⁿ  ⇒  Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)
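A minimal Python sketch of RLE, reproducing the example above.

  def rle(s):
      runs = []
      for c in s:
          if runs and runs[-1][0] == c:
              runs[-1][1] += 1                        # extend the current run
          else:
              runs.append([c, 1])                     # start a new run
      return [(c, l) for c, l in runs]

  print(rle("abbbaacccca"))     # [('a',1), ('b',3), ('a',2), ('c',4), ('a',1)]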

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.  p(a) = .2,  p(b) = .5,  p(c) = .3

    f(i) = Σ_{j=1,…,i-1} p(j)        f(a) = .0,  f(b) = .2,  f(c) = .7

[Figure: the unit interval [0,1) split into a = [0,.2), b = [.2,.7), c = [.7,1).]

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
[Figure: starting from [0,1), after b the interval is [.2,.7); after a it is [.2,.3); after c it is [.27,.3).]

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
    l₀ = 0        lᵢ = lᵢ₋₁ + sᵢ₋₁ · f[cᵢ]
    s₀ = 1        sᵢ = sᵢ₋₁ · p[cᵢ]

f[c] is the cumulative prob. up to symbol c (not included).
Final interval size is

    sₙ = Π_{i=1,…,n} p[cᵢ]

The interval for a message sequence will be called the
sequence interval
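A minimal Python sketch of the sequence-interval computation with real numbers (the integer implementation comes later); the model is the slides' p(a)=.2, p(b)=.5, p(c)=.3.

  p = {"a": 0.2, "b": 0.5, "c": 0.3}
  f = {"a": 0.0, "b": 0.2, "c": 0.7}                  # cumulative prob. up to the symbol (excluded)

  def sequence_interval(msg):
      l, s = 0.0, 1.0                                 # l0 = 0, s0 = 1
      for c in msg:
          l = l + s * f[c]                            # l_i = l_{i-1} + s_{i-1} * f[c_i]
          s = s * p[c]                                # s_i = s_{i-1} * p[c_i]
      return l, s

  l, s = sequence_interval("bac")
  print(l, l + s)                                     # ≈ .27 and .30, i.e. the interval [.27, .3)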

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
[Figure: .49 falls in b = [.2,.7); within it, in b = [.3,.55); within that, in c = [.475,.55).]

The message is bbc.

Representing a real number
Binary fractional representation:
    .75 = .11        1/3 = .0101…        11/16 = .1011

Algorithm
  x = 2 * x
  If x < 1 output 0
  else x = x - 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
e.g.  [0,.33) → .01      [.33,.66) → .1      [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
           min      max       interval
  .11      .110…    .111…     [.75, 1.0)
  .101     .1010…   .1011…    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

[Figure: the sequence interval [.61, .79) contains the code interval of .101, i.e. [.625, .75).]

Can use L + s/2 truncated to 1 + ⌈log₂(1/s)⌉ bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

    1 + ⌈log₂(1/s)⌉ = 1 + ⌈log₂ Π_{i=1,…,n} (1/pᵢ)⌉
                    ≤ 2 + Σ_{i=1,…,n} log₂(1/pᵢ)
                    = 2 + Σ_{k=1,…,|S|} n·pₖ·log₂(1/pₖ)
                    = 2 + n·H₀   bits

In practice ≈ nH₀ + 0.02·n bits, because of rounding.

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

[Figure: the Arithmetic ToolBox as a state machine: given the current interval (L,s), the model (p₁,…,p_|S|) and the next symbol c, ATB outputs the new interval (L’,s’).]

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
[Figure: PPM feeds p[ s | context ] (with s = c or esc) into the Arithmetic ToolBox, which maps (L,s) to (L’,s’).]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts                      String = ACCBACCACBAB,   k = 2

  Empty context:    A = 4,  B = 2,  C = 5,  $ = 3

  Order-1 contexts:
    A:   C = 3,  $ = 1
    B:   A = 2,  $ = 1
    C:   A = 1,  B = 2,  C = 2,  $ = 3

  Order-2 contexts:
    AC:  B = 1,  C = 2,  $ = 2
    BA:  C = 1,  $ = 1
    CA:  C = 1,  $ = 1
    CB:  A = 2,  $ = 1
    CC:  A = 1,  B = 1,  $ = 2
You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character
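A minimal Python sketch of the LZ77 parsing step with a backward window of W characters (quadratic search for the longest match, no output encoding); it reproduces the triples of the example above.

  def lz77_parse(T, W=6):
      i, out = 0, []
      while i < len(T):
          best_d = best_len = 0
          for j in range(max(0, i - W), i):           # candidate copy positions in the window
              l = 0
              while i + l < len(T) - 1 and T[j + l] == T[i + l]:
                  l += 1                              # overlapping copies (len > d) are fine
              if l > best_len:
                  best_d, best_len = i - j, l
          out.append((best_d, best_len, T[i + best_len]))     # <d, len, next char>
          i += best_len + 1                           # advance by len + 1
      return out

  print(lz77_parse("aacaacabcabaaac"))
  # [(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')]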

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
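A minimal Python sketch of LZW coding; for readability the dictionary is seeded with the distinct characters of the input rather than with the 256 byte values, and the decoder's special SSc case is not shown.

  def lzw_encode(T):
      dic = {c: i for i, c in enumerate(sorted(set(T)))}
      out, S = [], ""
      for c in T:
          if S + c in dic:
              S += c                                  # extend the current match
          else:
              out.append(dic[S])                      # emit the id of the longest match S
              dic[S + c] = len(dic)                   # add Sc to the dictionary
              S = c
      out.append(dic[S])
      return out

  print(lzw_encode("aabaacababacb"))                  # a list of dictionary ids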

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
  Compute LF[0,n-1];          // LF[i] = position in F of the char L[i]
  r = 0; i = n;               // row 0 is the rotation starting with #
  while (i > 0) {
    T[i] = L[r];              // L[r] precedes F[r] in T
    r = LF[r]; i--;
  }

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]
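A minimal Python sketch: the BWT built by sorting all rotations (the "elegant but inefficient" construction of the next slide) and the backward inversion via the LF-mapping.

  def bwt(T):
      T = T + "#"                                     # unique, lexicographically smallest end-marker
      rots = sorted(T[i:] + T[:i] for i in range(len(T)))
      return "".join(r[-1] for r in rots)             # L = last column

  def inverse_bwt(L):
      n = len(L)
      order = sorted(range(n), key=lambda i: (L[i], i))   # stable: equal chars keep their order
      LF = [0] * n
      for f, l in enumerate(order):                   # LF maps an L-position to its F-position
          LF[l] = f
      out, r = [], 0                                  # row 0 is the rotation starting with #
      for _ in range(n - 1):                          # rebuild T backwards, dropping the #
          out.append(L[r])
          r = LF[r]
      return "".join(reversed(out))

  L = bwt("mississippi")
  print(L, inverse_bwt(L))                            # ipssm#pissii  mississippi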

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001

    Pr[ in-degree(u) = k ]  ∝  1/kᵃ,    a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1
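A minimal Python sketch of the gap encoding of a successor list, S(x) = {s1-x, s2-s1-1, …, sk-s_{k-1}-1}; the signed mapping for a negative first gap and the copy-lists are omitted.

  def successor_gaps(x, successors):                  # successors sorted increasingly
      out, prev = [], None
      for s in successors:
          out.append(s - x if prev is None else s - prev - 1)
          prev = s
      return out

  print(successor_gaps(15, [16, 17, 19, 23, 24]))     # [1, 0, 1, 3, 0]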

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides an efficient, optimal solution
  fknown is the “previously encoded text”: compress the concatenation fknown·fnew, starting from fnew

zdelta is one of the best implementations

             Emacs size    Emacs time
  uncompr    27Mb          ---
  gzip       8Mb           35 secs
  zdelta     1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: weighted graph over the files plus a dummy node; edge weights are zdelta sizes, dummy-node edges are gzip sizes; the min branching picks the cheapest reference for each file.]

             space    time
  uncompr    30Mb     ---
  tgz        20%      linear
  THIS       8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta executions. Nonetheless, this is still n² time overall.

             space    time
  uncompr    260Mb    ---
  tgz        12%      2 mins
  THIS       8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
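A toy Python sketch of the block-matching idea behind rsync (not the real protocol): the receiver's side hashes fixed-size blocks of f_old, the sender scans f_new and emits either a block reference or a literal byte; the real tool pairs a rolling weak hash with MD5 instead of rehashing every offset.

  import hashlib

  def block_hashes(f_old, B):
      return {hashlib.md5(f_old[i:i + B]).hexdigest(): i // B
              for i in range(0, len(f_old), B)}

  def delta(f_new, hashes, B):
      out, i = [], 0
      while i < len(f_new):
          h = hashlib.md5(f_new[i:i + B]).hexdigest()
          if len(f_new) - i >= B and h in hashes:
              out.append(("block", hashes[h]))        # copy block #id of f_old
              i += B
          else:
              out.append(("lit", f_new[i:i + 1]))     # literal byte
              i += 1
      return out

  old = b"the quick brown fox jumps over the lazy dog!!!!"
  new = b"the quick brown cat jumps over the lazy dog!!!!"
  print(delta(new, block_hashes(old, 8), 8))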

Rsync: some experiments

            gcc size    emacs size
  total     27288       27326
  gzip      7563        8577
  zdelta    227         1431
  rsync     964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends the hashes (unlike the client in rsync), and the clients check them.
The server deploys the common fref to compress the new ftar (rsync compresses just the latter).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes fail to find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits.

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes fail to find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits.

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Figure: the suffix tree of T# = mississippi# (positions 1…12); edge labels are substrings of T# (#, i, ssi, ppi#, pi#, si, p, mississippi#, …) and the 12 leaves store the starting positions of the corresponding suffixes.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Θ(N²) space, if SUF(T) is stored explicitly
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Θ(N log₂ N) bits
• Text T: N chars
⇒ In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]

Locating the occurrences
[Figure: the binary search on SA for P = si delimits the contiguous range of suffixes prefixed by si, i.e. positions 7 (sippi…) and 4 (sissippi…) of T = mississippi#, so occ = 2.]

Suffix Array search
• O(p + log₂ N + occ) time

Suffix Trays: O(p + log₂ |S| + occ)   [Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]
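A minimal Python sketch: SA built by sorting the suffixes directly (fine for small N, with the same obvious inefficiencies noted for the BWT construction), then the indirected binary search for a pattern.

  def suffix_array(T):
      return sorted(range(len(T)), key=lambda i: T[i:])

  def occurrences(T, SA, P):
      lo, hi = 0, len(SA)
      while lo < hi:                                  # leftmost suffix whose prefix is >= P
          mid = (lo + hi) // 2
          if T[SA[mid]:SA[mid] + len(P)] < P:
              lo = mid + 1
          else:
              hi = mid
      left, hi = lo, len(SA)
      while lo < hi:                                  # leftmost suffix whose prefix is > P
          mid = (lo + hi) // 2
          if T[SA[mid]:SA[mid] + len(P)] <= P:
              lo = mid + 1
          else:
              hi = mid
      return sorted(SA[i] + 1 for i in range(left, lo))       # 1-based starting positions

  T = "mississippi#"
  SA = suffix_array(T)
  print(occurrences(T, SA, "si"))                     # [4, 7]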

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does it exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does it exist a substring of length ≥ L occurring ≥ C times ?
• Search for Lcp[i,i+C-2] whose entries are ≥ L


Slide 115

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
    C · p · f/(1+f)
This is at least 10⁴ · f/(1+f)
If we fetch B ≈ 4Kb in time C, and the algorithm uses all of them:
    (1/B) · (p · f/(1+f) · C)  ≈  30 · f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

        4K    8K    16K   32K    128K   256K   512K   1M
  n³    22s   3m    26m   3.5h   28h    --     --     --
  n²    0     0     0     1s     26s    106s   7m     28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT
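A minimal Python sketch of the one-pass scan above (the variant below returns 0 on an all-negative array, since it allows the empty window).

  def max_subarray_sum(A):
      best = cur = 0
      for x in A:
          if cur + x <= 0:
              cur = 0                   # a window with non-positive sum never starts the optimum
          else:
              cur += x
              best = max(best, cur)
      return best

  A = [2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]
  print(max_subarray_sum(A))            # 12 = 6 + 1 - 2 + 4 + 3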

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 10⁹ random I/Os = 10⁹ × 5ms ≈ 2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
  if (i < j) then
    m = (i+j)/2;              // Divide
    Merge-Sort(A,i,m);        // Conquer
    Merge-Sort(A,m+1,j);
    Merge(A,i,m,j)            // Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:
  n = 10⁹ tuples  ⇒  few Gbs
  Typical disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:
  It is an indirect sort: Θ(n log₂ n) random I/Os
  [5ms] × n log₂ n  ≈  1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree
[Figure: the log₂ N levels of the merge-sort recursion tree; once the run size exceeds B (i.e. after the first pass), fetching a whole run in memory for merging does not help.]

How do we deploy the disk/mem features?
N/M runs, each sorted in internal memory (no I/Os)
I/O-cost for merging is ≈ 2·(N/B)·log₂(N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.
Pass i: merge X ≤ M/B runs  ⇒  log_{M/B}(N/M) passes

[Figure: X = M/B input buffers of B items each plus one output buffer; runs are streamed from disk, merged in main memory, and the merged run is written back to disk.]

Multiway Merging
[Figure: one current page per run (Bf1,…,Bfx) with cursors p1,…,pX, plus an output page Bfo; repeatedly move min(Bf1[p1], Bf2[p2], …, Bfx[pX]) to Bfo, fetch a new page when a cursor reaches B, flush Bfo when it is full, until EOF of all runs.]

Cost of Multi-way Merge-Sort


Number of passes = log_{M/B}(#runs)  ≈  log_{M/B}(N/M)
⇒ Optimal cost = Θ( (N/B) · log_{M/B}(N/M) ) I/Os

In practice
  M/B ≈ 1000  ⇒  #passes = log_{M/B}(N/M) ≈ 1
  One multiway merge ⇒ 2 passes = few mins
  (tuning depends on disk features)

  Large fan-out (M/B) decreases #passes
  Compression would decrease the cost of a pass!
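A minimal Python sketch of the in-memory core of one merging pass: k sorted runs merged with a heap of their current heads (the disk buffering of B-sized pages is omitted).

  import heapq

  def multiway_merge(runs):
      heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
      heapq.heapify(heap)
      out = []
      while heap:
          val, i, j = heapq.heappop(heap)             # overall minimum among the current heads
          out.append(val)
          if j + 1 < len(runs[i]):
              heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
      return out

  print(multiway_merge([[1, 2, 5, 10], [2, 7, 9, 13], [3, 4, 8, 19]]))    # one sorted run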

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm
  Use a pair of variables <X, C>
  For each item s of the stream:
    if (X == s) then C++
    else { C--; if (C == 0) { X = s; C = 1; } }
  Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...
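A minimal Python sketch of the two-variable scan above (the Boyer-Moore majority vote): it returns the correct item whenever some item occurs more than N/2 times.

  def majority_candidate(stream):
      X, C = None, 0
      for s in stream:
          if C == 0:
              X, C = s, 1                             # adopt the current item
          elif X == s:
              C += 1
          else:
              C -= 1
      return X

  print(majority_candidate("bacccdcbaaaccbccc"))      # 'c' (9 occurrences out of 17 > N/2)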

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6·10⁹, size = 6Gb
n = 10⁶ documents
TotT = 10⁹ (avg term length is 6 chars)
t = 5·10⁵ distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix            (n = 1 million docs, t = 500K terms)

              Antony&Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
  Antony             1                 1              0           0        0         1
  Brutus             1                 1              0           1        0         0
  Caesar             1                 1              0           1        1         1
  Calpurnia          0                 1              0           0        0         0
  Cleopatra          1                 0              0           0        0         0
  mercy              1                 0              1           1        1         1
  worser             1                 0              1           1        1         0

1 if the play contains the word, 0 otherwise.           Space is 500Gb !

Solution 2: Inverted index

  Brutus    → 2, 4, 8, 16, 32, 64, 128
  Calpurnia → 2, 3, 5, 8, 13, 21, 34
  Caesar    → 13, 16

We can do still better, i.e. 30-50% of the original text:
1. Typically about 12 bytes per posting
2. We have 10⁹ total terms  ⇒  at least 12Gb space
3. Compressing the 6Gb of documents gets 1.5Gb of data
Better index, but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

    Σ_{i=1,…,n-1} 2ⁱ  =  2ⁿ - 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
    La(C) = Σ_{s∈S} p(s) · L[s]

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pᵢ < pⱼ  ⇒  L[sᵢ] ≥ L[sⱼ]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
[Figure: Huffman tree construction: merge a(.1) and b(.2) into (.3); merge (.3) and c(.2) into (.5); merge (.5) and d(.5) into (1).]

a = 000, b = 001, c = 01, d = 1
There are 2^(n-1) “equivalent” Huffman trees

What about ties (and thus, tree depth) ?
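A minimal Python sketch of the Huffman construction with a heap, on the distribution above; ties are broken by insertion order, so the bit assignment may differ from the slides while the codeword lengths (3, 3, 2, 1) are the same.

  import heapq
  from itertools import count

  def huffman_codes(probs):
      tick = count()                                  # tie-breaker for equal weights
      heap = [(p, next(tick), s) for s, p in probs.items()]
      heapq.heapify(heap)
      while len(heap) > 1:
          p1, _, left = heapq.heappop(heap)           # the two least-probable trees...
          p2, _, right = heapq.heappop(heap)
          heapq.heappush(heap, (p1 + p2, next(tick), (left, right)))    # ...are merged
      codes = {}
      def walk(node, prefix):
          if isinstance(node, tuple):
              walk(node[0], prefix + "0")
              walk(node[1], prefix + "1")
          else:
              codes[node] = prefix
      walk(heap[0][2], "")
      return codes

  print(huffman_codes({"a": .1, "b": .2, "c": .2, "d": .5}))
  # e.g. {'d': '0', 'c': '10', 'a': '110', 'b': '111'}: lengths 1, 2, 3, 3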

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
    abc…  →  00000101            101001…  →  dcb

[Figure: the codeword tree with a = 000, b = 001, c = 01, d = 1, used both for encoding (emit the root-to-leaf path) and for decoding (follow branches until a leaf, then restart from the root).]

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
[Figure: the word-based Huffman tree with fan-out 128 over the words of T = “bzip or not bzip”; each codeword is a sequence of 7-bit configurations packed into bytes, and the first bit of the first byte acts as the tag. C(T) is the resulting byte-aligned, tagged encoding.]
CGrep and other ideas...
P= bzip = 1a 0b

[Figure: GREP is run directly over C(T) with the compressed pattern; byte alignment and tagging make codeword boundaries unambiguous, so only the true occurrences of [bzip] in T = “bzip or not bzip” are reported (yes / yes), while partial overlaps are discarded (no).]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary = {bzip, not, or, space},  P = bzip = 1a 0b,  S = “bzip or not bzip”

[Figure: the compressed pattern is searched directly in the codeword stream C(S); the two tagged occurrences of [bzip] are reported (yes / yes), the others are discarded (no / no).]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which arithmetic and bit operations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M

We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one.
We define the m-length binary vector U(x) for each character x in the alphabet: U(x) is set
to 1 at the positions in P where character x appears.

Example: P = abaac
U(a) = (1,0,1,1,0)^T    U(b) = (0,1,0,0,0)^T    U(c) = (0,0,0,0,1)^T

How to construct M

Initialize column 0 of M to all zeros.
For j > 0, the j-th column is obtained by

  M(j) = BitShift(M(j-1)) & U(T[j])

For i > 1, entry M(i,j) = 1 iff
(1) the first i-1 characters of P match the i-1 characters of T ending at character j-1
    ⇔ M(i-1,j-1) = 1
(2) P[i] = T[j]  ⇔  the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) into the i-th position;
AND-ing this with the i-th bit of U(T[j]) establishes whether both conditions hold.
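Before walking through the example columns, here is a compact Python sketch of the whole Shift-And scan (ours), under the assumption m ≤ w, using an integer as the bit-column: bit i-1 of the integer plays the role of row i.

def shift_and(T, P):
    m = len(P)
    U = {}                              # U[x]: bitmask of the positions of x in P
    for i, x in enumerate(P):
        U[x] = U.get(x, 0) | (1 << i)
    last = 1 << (m - 1)                 # bit of row m: an occurrence ends here
    M = 0                               # column j of the matrix, initially all zeros
    occ = []
    for j, c in enumerate(T, start=1):
        # BitShift(M): shift down by one and set the first bit to 1
        M = ((M << 1) | 1) & U.get(c, 0)
        if M & last:
            occ.append(j - m + 1)       # occurrence starting position (1-based)
    return occ

# e.g. shift_and("xabxabaaca", "abaac") -> [5]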

An example  (T = xabxabaaca, P = abaac, m = 5)

U(a) = (1,0,1,1,0)^T   U(b) = (0,1,0,0,0)^T   U(c) = (0,0,0,0,1)^T   U(x) = (0,0,0,0,0)^T

j=1 (T[1]=x):  M(1) = BitShift(M(0)) & U(x) = (1,0,0,0,0)^T & (0,0,0,0,0)^T = (0,0,0,0,0)^T
j=2 (T[2]=a):  M(2) = BitShift(M(1)) & U(a) = (1,0,0,0,0)^T & (1,0,1,1,0)^T = (1,0,0,0,0)^T
j=3 (T[3]=b):  M(3) = BitShift(M(2)) & U(b) = (1,1,0,0,0)^T & (0,1,0,0,0)^T = (0,1,0,0,0)^T
...
j=9 (T[9]=c):  M(9) = BitShift(M(8)) & U(c) = (1,1,0,0,1)^T & (0,0,0,0,1)^T = (0,0,0,0,1)^T

The full matrix M after column j=9:

        x a b x a b a a c
   a    0 1 0 0 1 0 1 1 0
   b    0 0 1 0 0 1 0 0 0
   a    0 0 0 0 0 0 1 0 0
   a    0 0 0 0 0 0 0 1 0
   c    0 0 0 0 0 0 0 0 1

The 1 in row m = 5 at column 9 signals the occurrence of P ending at position 9 of T.

Shift-And method: Complexity

If m ≤ w, any column and any vector U() fit in a memory word
  ⇒ any step requires O(1) time.
If m > w, any column and any vector U() can be divided into ⌈m/w⌉ memory words
  ⇒ any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when the pattern length is close to the word size;
this is very often the case in practice (recall that w = 64 bits in modern architectures).

Some simple extensions

We want to allow the pattern to contain special symbols, like the class of characters [a-f].

P = [a-b]baac
U(a) = (1,0,1,1,0)^T    U(b) = (1,1,0,0,0)^T    U(c) = (0,0,0,0,1)^T

What about '?', '[^…]' (not)?

Problem 1: Another solution
[Figure: the same dictionary and compressed text C(S), S = "bzip or not bzip"; the compressed pattern P = bzip = 1a 0b is searched in C(S) with the Shift-And method, and the tag bits rule out the false candidates (yes/no at each alignment).]

Speed ≈ Compression ratio

Problem 2
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring.

[Figure: dictionary {bzip, not, or, space} and compressed text C(S), S = "bzip or not bzip", with P = o. The terms containing "o" are not = 1g 0g 0a and or = 1g 0a 0b, and their codewords are then searched in C(S).]

Speed ≈ Compression ratio? No! Why?
A scan of C(S) is needed for each term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1, P2, …, Pl} of total length m, we want to find all the occurrences of those patterns in the text T[1,n].

 Naïve solution
 Use an (optimal) exact-matching algorithm to search for each pattern of P
 Complexity: O(nl + m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

 S is the concatenation of the patterns in P
 R is a bitmap of length m: R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of the Shift-And method, searching for S:
 For any symbol c, U'(c) = U(c) AND R
   U'(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
 For any step j:
   compute M(j)
   then OR it with U'(T[j]). Why?
    this sets to 1 the first bit of each pattern that starts with T[j]
   check if there are occurrences ending in j. How?

A sketch of this variant is given below.
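Here is a minimal Python sketch of that variant (our own illustration, with hypothetical helper names): S is the concatenation of the patterns, R marks the first position of each pattern, and an analogous mask F marks the last position of each pattern so that occurrences can be detected, which answers the final "How?".

def multi_shift_and(T, patterns):
    S = "".join(patterns)
    U = {}
    for i, x in enumerate(S):
        U[x] = U.get(x, 0) | (1 << i)
    R = F = 0                       # R: first bit of each pattern, F: last bit of each pattern
    pos = 0
    for p in patterns:
        R |= 1 << pos
        F |= 1 << (pos + len(p) - 1)
        pos += len(p)
    Uprime = {c: mask & R for c, mask in U.items()}    # U'(c) = U(c) AND R
    M = 0
    occ = []
    for j, c in enumerate(T, start=1):
        M = ((M << 1) & U.get(c, 0)) | Uprime.get(c, 0)
        hits = M & F
        k = 0
        for idx, p in enumerate(patterns):             # which patterns end at position j?
            k += len(p)
            if hits & (1 << (k - 1)):
                occ.append((idx, j - len(p) + 1))      # (pattern index, starting position)
    return occ

# e.g. multi_shift_and("abcacab", ["ab", "ca"]) -> [(0, 1), (1, 3), (1, 5), (0, 6)]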

Problem 3
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring, allowing at most k mismatches.

[Figure: dictionary {bzip, not, or, space} and compressed text C(S), S = "bzip or not bzip", with P = bot and k = 2.]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep

Our current goal: given k, find all the occurrences of P in T with up to k mismatches.
We define the matrix Ml to be an m by n binary matrix such that:

Ml(i,j) = 1 iff there are no more than l mismatches between the first i characters of P
and the i characters of T ending at character j.

What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk

We compute Ml for all l = 0, …, k.
For each j, compute M0(j), M1(j), …, Mk(j).
For all l, initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that Ml(i,j) = 1 iff one of the two cases below holds.

Computing Ml: case 1

The first i-1 characters of P match a substring of T ending at j-1, with at most l mismatches,
and the next pair of characters in P and T are equal. In vector form this contributes

  BitShift( Ml(j-1) ) & U(T[j])

Computing Ml: case 2

The first i-1 characters of P match a substring of T ending at j-1, with at most l-1 mismatches
(and we charge one mismatch to the pair P[i], T[j]). In vector form this contributes

  BitShift( Ml-1(j-1) )

Computing Ml

We compute Ml for all l = 0, …, k; for each j we compute M0(j), M1(j), …, Mk(j),
after initializing every Ml(0) to the zero vector. Combining the two cases:

  Ml(j) = [ BitShift( Ml(j-1) ) & U(T[j]) ]  OR  BitShift( Ml-1(j-1) )
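The recurrence translates directly into bit-parallel code; the following Python sketch (ours, assuming m ≤ w and mismatches only, no insertions or deletions) keeps k+1 column masks and reports every position where P occurs with at most k mismatches.

def agrep_mismatches(T, P, k):
    m = len(P)
    U = {}
    for i, x in enumerate(P):
        U[x] = U.get(x, 0) | (1 << i)
    last = 1 << (m - 1)
    M = [0] * (k + 1)                   # M[l] = current column of the matrix Ml
    occ = []
    for j, c in enumerate(T, start=1):
        prev = M[:]                     # columns j-1 of M0..Mk
        M[0] = ((prev[0] << 1) | 1) & U.get(c, 0)
        for l in range(1, k + 1):
            # case 1: extend an l-mismatch prefix with a matching character
            # case 2: extend an (l-1)-mismatch prefix, charging a mismatch to T[j]
            M[l] = (((prev[l] << 1) | 1) & U.get(c, 0)) | ((prev[l - 1] << 1) | 1)
        if M[k] & last:
            occ.append(j - m + 1)       # occurrence (with <= k mismatches) starting here
    return occ

# e.g. agrep_mismatches("xabxabaaca", "abaad", 1) -> [5]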

Example  (T = xabxabaaca, P = abaad)

          x a b x a b a a c a
M1:  a    1 1 1 1 1 1 1 1 1 1
     b    0 0 1 0 0 1 0 1 1 0
     a    0 0 0 1 0 0 1 0 0 1
     a    0 0 0 0 1 0 0 1 0 0
     d    0 0 0 0 0 0 0 0 1 0

M0:  a    0 1 0 0 1 0 1 1 0 1
     b    0 0 1 0 0 1 0 0 0 0
     a    0 0 0 0 0 0 1 0 0 0
     a    0 0 0 0 0 0 0 1 0 0
     d    0 0 0 0 0 0 0 0 0 0

How much do we pay?

The running time is O(kn(1+m/w)).
Again, the method is practically efficient for small m.
Still, only O(k) columns of M are needed at any given time; hence the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring, allowing k mismatches.

[Figure: dictionary {bzip, not, or, space} and compressed text C(S), S = "bzip or not bzip", with P = bot and k = 2; the term not = 1g 0g 0a is reported, since it matches P within 2 mismatches.]

Agrep: more sophisticated operations

The Shift-And method can solve other operations as well.

The edit distance between two strings p and s is d(p,s) = the minimum number of operations
needed to transform p into s via three operations:
 Insertion: insert a symbol in p
 Deletion: delete a symbol from p
 Substitution: change a symbol of p into a different one

Example: d(ananas, banane) = 3
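For reference, here is a short Python sketch (ours) of the classical dynamic-programming computation of this edit distance; it is not the bit-parallel Agrep variant, just the textbook O(|p|·|s|) recurrence over the three operations.

def edit_distance(p, s):
    n, m = len(p), len(s)
    # D[i][j] = edit distance between p[:i] and s[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i                             # delete all of p[:i]
    for j in range(m + 1):
        D[0][j] = j                             # insert all of s[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if p[i - 1] == s[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,          # deletion
                          D[i][j - 1] + 1,          # insertion
                          D[i - 1][j - 1] + cost)   # match / substitution
    return D[n][m]

# e.g. edit_distance("ananas", "banane") -> 3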

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
g(x), for x > 0: write (Length - 1) zeros followed by the binary representation of x,
where Length = floor(log2 x) + 1.
e.g., 9 is represented as <000, 1001>.

The g-code for x takes 2·floor(log2 x) + 1 bits (i.e., a factor of 2 from optimal).

It is optimal for Pr(x) = 1/(2x^2), and i.i.d. integers.

It is a prefix-free encoding…

Given the following sequence of g-coded integers, reconstruct the original sequence:

  0001000 00110 011 00000111011 00111
  =  8      6    3      59        7
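A tiny Python sketch (ours) of g-encoding and g-decoding over a bit string, matching the definition above; it can be used to check the exercise.

def gamma_encode(x):                    # x > 0
    b = bin(x)[2:]                      # binary representation of x, Length = len(b)
    return "0" * (len(b) - 1) + b       # (Length-1) zeros, then x in binary

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":           # count the leading zeros = Length - 1
            z += 1
            i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out

# gamma_encode(9) -> "0001001"
# gamma_decode("0001000001100110000011101100111") -> [8, 6, 3, 59, 7]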

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code g(i).

Recall that: |g(i)| ≤ 2 log i + 1
How good is this approach wrt Huffman? Compression ratio ≤ 2 H0(s) + 1
Key fact:  1 ≥ Σ i=1,...,x pi ≥ x·px   ⇒   x ≤ 1/px

How good is it ?
Encode the integers via g-coding: |g(i)| ≤ 2 log i + 1
The cost of the encoding is (recall i ≤ 1/pi):

  Σ i=1,...,|S| pi·|g(i)|  ≤  Σ i=1,...,|S| pi·[ 2 log(1/pi) + 1 ]  =  2 H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes:
 It is a prefix code
 Better compression: it uses all the 7-bit configurations

(s,c)-dense codes
The distribution of words is skewed: 1/i^q, where 1 < q < 2

A new concept: Continuers vs Stoppers
 Previously we used: s = c = 128

The main idea is:
 s + c = 256 (we are playing with 8 bits)
 Thus s items are encoded with 1 byte
 and s·c with 2 bytes, s·c^2 with 3 bytes, ...

An example
 5000 distinct words
 ETDC encodes 128 + 128^2 = 16512 words within 2 bytes
 A (230,26)-dense code encodes 230 + 230·26 = 6210 words within 2 bytes,
   hence more words on 1 byte, and thus it wins if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, assuming c = 256 - s.
 Brute-force approach
 Binary search: on real distributions, there seems to be a unique minimum

Ks = max codeword length
Fsk = cumulative probability of the symbols whose |cw| ≤ k

Experiments: (s,c)-DC is quite interesting…
Search is 6% faster than byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move-to-Front Coding
Transforms a char sequence into an integer sequence, that can then be var-length coded.
 Start with the list of symbols L = [a,b,c,d,…]
 For each input symbol s:
   1) output the position of s in L
   2) move s to the front of L

There is a memory
Properties: it exploits temporal locality, and it is dynamic.

 X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n^2 log n), MTF = O(n log n) + n^2 bits

Not much worse than Huffman... but it may be far better.
A minimal sketch of the transform follows.
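A minimal Python sketch (ours) of the MTF transform and its inverse; the symbol list is kept as a plain Python list, so each step costs O(|S|) here rather than the O(log |S|) of the search-tree solution discussed below.

def mtf_encode(text, alphabet):
    L = list(alphabet)                  # current MTF list, front = most recently seen symbol
    out = []
    for s in text:
        i = L.index(s)                  # position of s in L (0-based here)
        out.append(i + 1)               # emit the 1-based rank, then move s to the front
        L.pop(i)
        L.insert(0, s)
    return out

def mtf_decode(ranks, alphabet):
    L = list(alphabet)
    out = []
    for r in ranks:
        s = L.pop(r - 1)
        out.append(s)
        L.insert(0, s)
    return "".join(out)

# e.g. mtf_encode("aabbbb", "ab") -> [1, 1, 2, 1, 1, 1]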

MTF: how good is it ?
Encode the output integers via g-coding:  |g(i)| ≤ 2 log i + 1
Put S in front of the sequence and consider the cost of encoding:

  O(|S| log |S|) + Σ x=1,...,|S|  Σ i=2,...,n_x  |g( p_x^i - p_x^(i-1) )|

where p_x^i is the position of the i-th occurrence of symbol x and n_x its number of
occurrences. By Jensen's inequality this is

  ≤ O(|S| log |S|) + Σ x=1,...,|S|  n_x · [ 2 log(N/n_x) + 1 ]
  = O(|S| log |S|) + N · [ 2 H0(X) + 1 ]

so  La[mtf] ≤ 2 H0(X) + O(1) bits per symbol.

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
  abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings ⇒ just the run lengths and one bit

There is a memory
Properties: it exploits spatial locality, and it is a dynamic code.

  X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = Θ(n^2 log n) > Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol an interval range in [0, 1), of width equal to its probability:

  f(i) = Σ j<i p(j)

e.g. with p(a) = .2, p(b) = .5, p(c) = .3:
  f(a) = .0, f(b) = .2, f(c) = .7
  a -> [0.0, 0.2),  b -> [0.2, 0.7),  c -> [0.7, 1.0)

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7)).

Arithmetic Coding: Encoding Example
Coding the message sequence: bac   (p(a) = .2, p(b) = .5, p(c) = .3)

  start:    [0.0, 1.0)
  emit b:   [0.2, 0.7)      (width .5)
  emit a:   [0.2, 0.3)      (width .5 · .2 = .1)
  emit c:   [0.27, 0.3)     (width .1 · .3 = .03)

The final sequence interval is [.27, .3).

Arithmetic Coding
To code a sequence of symbols c1 c2 … cn with probabilities p[c], use the following:

  l_0 = 0        l_i = l_(i-1) + s_(i-1) · f[c_i]
  s_0 = 1        s_i = s_(i-1) · p[c_i]

f[c] is the cumulative probability up to symbol c (not included).
The final interval size is

  s_n = Π i=1..n  p[c_i]

The interval for a message sequence will be called the
sequence interval
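These recurrences are easy to run by hand or in code; the short Python sketch below (ours, using exact fractions to avoid rounding) computes the sequence interval [l_n, l_n + s_n) for a message, and mirrors the bac example above.

from fractions import Fraction as F

def sequence_interval(msg, p):
    # p: dict symbol -> probability (given here as decimal strings for exactness);
    # f[c] = cumulative probability of the symbols preceding c in a fixed order
    symbols = sorted(p)
    f, acc = {}, F(0)
    for c in symbols:
        f[c] = acc
        acc += F(p[c])
    l, s = F(0), F(1)                           # l_0 = 0, s_0 = 1
    for c in msg:
        l = l + s * f[c]                        # l_i = l_{i-1} + s_{i-1} * f[c_i]
        s = s * F(p[c])                         # s_i = s_{i-1} * p[c_i]
    return l, l + s                             # the sequence interval [l_n, l_n + s_n)

# sequence_interval("bac", {"a": "0.2", "b": "0.5", "c": "0.3"}) -> (27/100, 3/10)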

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3 (p(a) = .2, p(b) = .5, p(c) = .3):

  [0.0, 1.0):   .49 falls in b's sub-interval [0.2, 0.7)     -> output b
  [0.2, 0.7):   .49 falls in b's sub-interval [0.3, 0.55)    -> output b
  [0.3, 0.55):  .49 falls in c's sub-interval [0.475, 0.55)  -> output c

The message is bbc.

Representing a real number
Binary fractional representation:
  .75   = .11
  1/3   = .010101...
  11/16 = .1011

Algorithm
  1. x = 2 · x
  2. if x < 1 output 0
  3. else x = x - 1; output 1

So how about just using the shortest binary fractional representation lying in the
sequence interval?
  e.g.  [0, .33) = .01      [.33, .66) = .1      [.66, 1) = .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions.

  code     min       max       interval
  .11      .110...   .111...   [.75, 1.0)
  .101     .1010...  .1011...  [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval is contained
in the sequence interval (a dyadic number).

  Sequence interval [.61, .79)   ⊇   code interval of .101 = [.625, .75)

Can use L + s/2 truncated to 1 + ceil(log (1/s)) bits.

Bound on Arithmetic length
Note that -log s + 1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

  1 + ceil( log (1/s) )
  = 1 + ceil( log Π i (1/p_i) )
  ≤ 2 + Σ i=1..n log (1/p_i)
  = 2 + Σ k=1..|S| n·p_k · log (1/p_k)
  = 2 + n·H0  bits

In practice ≈ n·H0 + 0.02·n bits, because of rounding.

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
 If l ≥ R/2 (top half): output 1 followed by m 0s; set m = 0; the message interval is expanded by 2.
 If u < R/2 (bottom half): output 0 followed by m 1s; set m = 0; the message interval is expanded by 2.
 If l ≥ R/4 and u < 3R/4 (middle half): increment m; the message interval is expanded by 2.
 In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine

[Figure: the Arithmetic ToolBox (ATB) as a state machine: given the current interval (L, s) and a
symbol c with distribution (p1, …, p|S|), it outputs the refined interval (L', s').]

Therefore, even the distribution can change over time.

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).
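As a concrete illustration of "base probabilities on counts", here is a tiny Python sketch (ours) that collects the order-k context statistics a PPM coder would use; escape handling and the actual coding are omitted.

from collections import defaultdict

def context_counts(text, k):
    # counts[ctx][c] = how many times character c followed the length-k context ctx
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(k, len(text)):
        counts[text[i - k:i]][text[i]] += 1
    return counts

# e.g. with text "the three thieves" and k = 2:
#   counts["th"] -> {'e': 1, 'r': 1, 'i': 1}, so p(e|th) = 1/3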

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox

[Figure: the PPM model feeds the ATB: at each step the symbol s (either a char c or esc) is coded
with the conditional probability p[s | context], refining the interval (L, s) into (L', s').]

Encoder and Decoder must know the protocol for selecting the same conditional probability
distribution (PPM-variant).

PPM: Example Contexts
String = ACCBACCACBA B     k = 2

  Context    Counts
  Empty      A = 4   B = 2   C = 5   $ = 3

  A          C = 3   $ = 1
  B          A = 2   $ = 1
  C          A = 1   B = 2   C = 2   $ = 3

  AC         B = 1   C = 2   $ = 2
  BA         C = 1   $ = 1
  CA         C = 1   $ = 1
  CB         A = 2   $ = 1
  CC         A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary = the already-scanned text (all substrings starting there); Cursor = current position.

Algorithm's step: output <d, len, c>, e.g. <2,3,c>, where
 d   = distance of the copied string wrt the current position
 len = length of the longest match
 c   = next char in the text beyond the longest match
then advance by len + 1.

A buffer "window" of fixed length bounds the dictionary and moves with the cursor.
A minimal parser along these lines is sketched below.
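A minimal Python sketch (ours) of an LZ77 parser with a sliding window; it searches the window naively, so it is much slower than the hash-based parsing used by real implementations such as gzip.

def lz77_parse(T, W=6):
    out, i, n = [], 0, len(T)
    while i < n:
        best_d, best_len = 0, 0
        lo = max(0, i - W)                       # the window: the last W scanned characters
        for start in range(lo, i):               # try every copy source in the window
            l = 0
            # the copy may overlap the cursor (self-referencing copies are allowed)
            while i + l < n - 1 and T[start + l] == T[i + l]:
                l += 1
            if l > best_len:
                best_d, best_len = i - start, l
        nxt = T[i + best_len]                    # next char beyond the longest match
        out.append((best_d, best_len, nxt))
        i += best_len + 1                        # advance by len + 1
    return out

# e.g. lz77_parse("aacaacabcabaaac", W=6)
#      -> [(0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), (3, 3, 'a'), (1, 2, 'c')]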

Example: LZ77 with window (window size = 6)

  a a c a a c a b c a b a a a c     <0,0,a>
  a a c a a c a b c a b a a a c     <1,1,c>
  a a c a a c a b c a b a a a c     <3,4,b>
  a a c a a c a b c a b a a a c     <3,3,a>
  a a c a a c a b c a b a a a c     <1,2,c>

At each step: longest match within the window W, then the next character.

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Input: a a b a a c a b c a b c b

  phrase    output    dictionary
  a         (0,a)     1 = a
  ab        (1,b)     2 = ab
  aa        (1,a)     3 = aa
  c         (0,c)     4 = c
  abc       (2,c)     5 = abc
  abcb      (5,b)     6 = abcb

LZ78: Decoding Example

  input     decoded so far           dictionary
  (0,a)     a                        1 = a
  (1,b)     a ab                     2 = ab
  (1,a)     a ab aa                  3 = aa
  (0,c)     a ab aa c                4 = c
  (2,c)     a ab aa c abc            5 = abc
  (5,b)     a ab aa c abc abcb       6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
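A compact Python sketch (ours) of the LZW encoder; for readability the dictionary is seeded only with the characters that actually occur, using the same example codes as the slides (a = 112, b = 113, c = 114) rather than the full 256 ASCII entries.

def lzw_encode(T, first_codes=None):
    if first_codes is None:
        first_codes = {"a": 112, "b": 113, "c": 114}
    D = dict(first_codes)               # dictionary: string -> code
    next_code = 256
    out, S = [], ""
    for c in T:
        if S + c in D:                  # extend the current match
            S += c
        else:
            out.append(D[S])            # emit the code of the longest match S
            D[S + c] = next_code        # add Sc to the dictionary (c itself is NOT emitted)
            next_code += 1
            S = c
    out.append(D[S])                    # flush the last match
    return out

# e.g. lzw_encode("aabaacababacb") -> [112, 112, 113, 256, 114, 257, 261, 114, 113]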

LZW: Encoding Example
Input: a a b a a c a b a b a c b   (a = 112, b = 113, c = 114)

  match     output    new dictionary entry
  a         112       256 = aa
  a         112       257 = ab
  b         113       258 = ba
  aa        256       259 = aac
  c         114       260 = ca
  ab        257       261 = aba
  aba       261       262 = abac
  c         114       263 = cb

LZW: Decoding Example

  input     decoded text so far          new dictionary entry
  112       a
  112       a a                          256 = aa
  113       a a b                        257 = ab
  256       a a b a a                    258 = ba
  114       a a b a a c                  259 = aac
  257       a a b a a c a b              260 = ca
  261       a a b a a c a b ?            261 = aba   (resolved one step later)
  114       a a b a a c a b a b a c      ...

When code 261 arrives it is not in the dictionary yet (the decoder is one step behind the
coder): this is the SSc case, so 261 must be the previous match "ab" extended with its own
first character, i.e. aba.

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform   (1994)
Let us be given a text T = mississippi#

  Rotations        Sort the rows -->     F ........ L
  mississippi#                           # mississipp i
  ississippi#m                           i #mississip p
  ssissippi#mi                           i ppi#missis s
  sissippi#mis                           i ssippi#mis s
  issippi#miss                           i ssissippi# m
  ssippi#missi                           m ississippi #
  sippi#missis                           p i#mississi p
  ippi#mississ                           p pi#mississ i
  ppi#mississi                           s ippi#missi s
  pi#mississip                           s issippi#mi s
  i#mississipp                           s sippi#miss i
  #mississippi                           s sissippi#m i

L = ipssm#pissii is the last column of the sorted rotation matrix, i.e. the BWT of T.

A famous example
Much longer...

A useful tool: the L -> F mapping

How do we map L's chars onto F's chars?
... we need to distinguish equal chars in F...

Take two equal chars of L and rotate their rows rightward by one position: they keep the same
relative order. Hence the k-th occurrence of a char c in L corresponds to the k-th occurrence
of c in F.

The BWT is invertible

Two key properties:
1. The LF-array maps L's chars to F's chars
2. L[ i ] precedes F[ i ] in T

Reconstruct T backward:  T = .... i p p i #

InvertBWT(L)
  Compute LF[0,n-1];
  r = 0; i = n;
  while (i > 0) {
    T[i] = L[r];
    r = LF[r]; i--;
  }
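A short Python sketch (ours) of both directions: the forward BWT built naively by sorting all rotations (fine for small inputs, not the suffix-array construction discussed next), and the backward reconstruction via the LF-mapping, in the spirit of InvertBWT above (index conventions differ slightly).

def bwt(T):                               # T must end with a unique smallest char, e.g. '#'
    n = len(T)
    rotations = sorted(T[i:] + T[:i] for i in range(n))
    return "".join(row[-1] for row in rotations)

def ibwt(L):
    n = len(L)
    # LF[r] = the position in F of the char L[r], counting equal chars in the same relative order
    order = sorted(range(n), key=lambda r: (L[r], r))
    LF = [0] * n
    for f_pos, r in enumerate(order):
        LF[r] = f_pos
    # row 0 of the sorted matrix starts with the sentinel '#';
    # walking the LF chain from it yields T backward: T[n-2], T[n-3], ..., T[0], then '#'
    T = [""] * n
    r = 0
    for i in range(n - 2, -1, -1):
        T[i] = L[r]
        r = LF[r]
    T[n - 1] = L[r]                       # the last step yields the sentinel itself
    return "".join(T)

# bwt("mississippi#")  -> "ipssm#pissii"
# ibwt("ipssm#pissii") -> "mississippi#"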

How to compute the BWT ?

  SA      BWT matrix                  L
  12      # mississipp                i
  11      i #mississip                p
   8      i ppi#missis                s
   5      i ssippi#mis                s
   2      i ssissippi#                m
   1      m ississippi                #
  10      p i#mississi                p
   9      p pi#mississ                i
   7      s ippi#missi                s
   4      s issippi#mi                s
   6      s sippi#miss                i
   3      s sissippi#m                i

We said that: L[i] precedes F[i] in T.     e.g. L[3] = T[7]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
Input: T = mississippi#

  SA    suffix
  12    #
  11    i#
   8    ippi#
   5    issippi#
   2    ississippi#
   1    mississippi#
  10    pi#
   9    ppi#
   7    sippi#
   4    sissippi#
   6    ssippi#
   3    ssissippi#

Elegant but inefficient. Obvious inefficiencies:
 • Θ(n^2 log n) time in the worst case
 • Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…

 Physical network graph
   V = routers,  E = communication links

 The "cosine" graph (undirected, weighted)
   V = static web pages,  E = semantic distance between pages

 Query-Log graph (bipartite, weighted)
   V = queries and URLs,  E = (q,u) if u is a result for q and has been clicked by some
   user who issued q

 Social graph (undirected, unweighted)
   V = users,  E = (x,y) if x knows y (facebook, address book, email, ...)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution

Altavista crawl (1999) and WebBase crawl (2001):
the indegree follows a power-law distribution

  Pr[ in-degree(u) = k ]  ≈  1 / k^a ,   a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
[Figure: adjacency matrix of a crawl, with a dot at (i,j) for each link from page i to page j;
21 million pages, 150 million links.]

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = { s1 - x, s2 - s1 - 1, ..., sk - s(k-1) - 1 }

For negative entries (only the first gap can be negative): map a value v ≥ 0 to 2v and a value
v < 0 to 2|v| - 1 before coding (cf. the residual examples below).
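A small Python sketch (ours) of this gap transformation for one adjacency list, using a successor list in the spirit of the slides' running example; the positive/negative mapping is applied only to the first gap, which is the one that can be negative.

def encode_gaps(x, successors):
    # successors: the sorted adjacency list of node x
    gaps = []
    prev = None
    for i, s in enumerate(successors):
        if i == 0:
            d = s - x                      # first gap, may be negative
            gaps.append(2 * d if d >= 0 else 2 * (-d) - 1)
        else:
            gaps.append(s - prev - 1)      # remaining gaps are >= 0
        prev = s
    return gaps

# e.g. encode_gaps(15, [13, 15, 16, 17, 18, 19, 23, 24, 203])
#      -> [3, 1, 0, 0, 0, 0, 3, 0, 178]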

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression   (one-to-one)

Problem: We have two files fknown and fnew, and the goal is to compute a file fd of minimum
size such that fnew can be derived from fknown and fd.
 Assume that block moves and copies are allowed
 Find an optimal covering set of fnew based on fknown
 The LZ77-scheme provides an efficient, optimal solution:
   fknown is the "previously encoded text"; compress the concatenation fknown·fnew, emitting
   output only from fnew onward.
 zdelta is one of the best implementations

              Emacs size     Emacs time
  uncompr     27Mb           ---
  gzip        8Mb            35 secs
  zdelta      1.5Mb          42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F
 Useful on a dynamic collection of web pages, back-ups, …
 Apply pairwise zdelta: find for each f ∈ F a good reference

Reduction to the Min Branching problem on DAGs:
 Build a weighted graph GF: nodes = files, weights = zdelta-sizes
 Insert a dummy node connected to all, whose edge weights are the gzip-coding sizes
 Compute the min branching = directed spanning tree of minimum total cost, covering G's nodes

[Figure: a small example graph with a dummy root and zdelta/gzip edge weights.]

              space     time
  uncompr     30Mb      ---
  tgz         20%       linear
  THIS        8%        quadratic

Improvement: what about many-to-one compression (a group of files)?

Problem: Constructing G is very costly: n^2 edge calculations (zdelta executions).
We wish to exploit some pruning approach:
 Collection analysis: cluster the files that appear similar and are thus good candidates for
   zdelta-compression; build a sparse weighted graph G'F containing only edges between those
   pairs of files.
 Assign weights: estimate appropriate edge weights for G'F, thus saving zdelta executions.
   Nonetheless, still n^2 time.

              space     time
  uncompr     260Mb     ---
  tgz         12%       2 mins
  THIS        8%        16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
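To make the "4-byte rolling hash" concrete, here is a small Python sketch (ours) of an Adler/Fletcher-style rolling checksum of the kind rsync uses: removing the outgoing byte and adding the incoming one updates the two sums in O(1), so every window of f_old can be checked against the server's block hashes cheaply. The constants and names are illustrative, not rsync's actual code.

M = 1 << 16

def weak_hash(block):
    a = sum(block) % M                          # sum of the bytes
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % M
    return a, b                                 # the pair (a, b) is the 4-byte checksum

def roll(a, b, out_byte, in_byte, blocksize):
    a = (a - out_byte + in_byte) % M            # slide the window by one byte
    b = (b - blocksize * out_byte + a) % M
    return a, b

# check that rolling agrees with recomputing from scratch:
data = b"the quick brown fox jumps over the lazy dog"
k = 8
a, b = weak_hash(data[0:k])
for i in range(1, len(data) - k + 1):
    a, b = roll(a, b, data[i - 1], data[i + k - 1], k)
    assert (a, b) == weak_hash(data[i:i + k])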

Rsync: some experiments

             gcc size     emacs size
  total      27288        27326
  gzip        7563         8577
  zdelta       227         1431
  rsync        964         4452

Compressed size in KB (slightly outdated numbers).
Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
  iff
P is a prefix of the i-th suffix of T (i.e. of T[i,N])

Occurrences of P in T = all suffixes of T having P as a prefix

  P = si,  T = mississippi  ->  occurrences at positions 4, 7

SUF(T) = sorted set of suffixes of T

Reduction: from substring search to prefix search.

The Suffix Tree

[Figure: the suffix tree of T# = mississippi#; edges are labelled with substrings (e.g. "ssi",
"ppi#", "pi#"), and each leaf is labelled with the starting position (1..12) of the corresponding
suffix.]
The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

T = mississippi#       (storing SUF(T) explicitly would take Θ(N^2) space)

  SA      SUF(T)
  12      #
  11      i#
   8      ippi#
   5      issippi#
   2      ississippi#
   1      mississippi#
  10      pi#
   9      ppi#
   7      sippi#
   4      sissippi#
   6      ssippi#
   3      ssissippi#

P = si is a prefix of the suffixes pointed to by the contiguous SA entries 7 and 4.

Suffix Array space:
 • SA: Θ(N log2 N) bits
 • Text T: N chars
 ⇒ in practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 random accesses per step.

T = mississippi#    P = si

  SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]

Compare P against the suffix T[SA[mid], N]: if P is larger, recurse on the right half;
if P is smaller, recurse on the left half.

Suffix Array search
 • O(log2 N) binary-search steps
 • Each step takes O(p) char comparisons
 ⇒ overall, O(p log2 N) time

Improvable to O(p + log2 N) [Manber-Myers, '90] and to O(p + log2 |S|) [Cole et al, '06].
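A compact Python sketch (ours) of the two ingredients just described: a naive suffix-array construction by sorting (fine for small texts, Θ(n^2 log n) in the worst case as noted elsewhere in these slides) and the indirect binary search for the pattern's range of occurrences.

def build_sa(T):
    # sort the suffix starting positions by the suffixes they point to (1-based, as in the slides)
    return sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])

def search(T, SA, P):
    def prefix(i):                       # first |P| chars of the suffix pointed to by SA[i]
        s = SA[i] - 1
        return T[s:s + len(P)]
    lo, hi = 0, len(SA)
    while lo < hi:                       # leftmost suffix whose prefix is >= P
        mid = (lo + hi) // 2
        if prefix(mid) < P:
            lo = mid + 1
        else:
            hi = mid
    first, hi = lo, len(SA)
    while lo < hi:                       # leftmost suffix whose prefix is > P
        mid = (lo + hi) // 2
        if prefix(mid) <= P:
            lo = mid + 1
        else:
            hi = mid
    return sorted(SA[first:lo])          # starting positions of the occurrences

# T = "mississippi#"
# SA = build_sa(T)       -> [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
# search(T, SA, "si")    -> [4, 7]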

Locating the occurrences
T = mississippi#,  P = si  ->  occ = 2, at positions 4 and 7
(to delimit the range, compare against si# and si$, i.e. the smallest and largest extensions
of P, since # < S < $)

Suffix Array search:  O(p + log2 N + occ) time

Suffix Trays:  O(p + log2 |S| + occ)   [Cole et al., '06]
String B-tree                          [Ferragina-Grossi, '95]
Self-adjusting Suffix Arrays           [Ciriani et al., '02]

Text mining
Lcp[1,N-1] = longest common prefix between suffixes adjacent in SA.

T = mississippi#

  SA   suffix          Lcp (with the next suffix)
  12   #               0
  11   i#              1
   8   ippi#           1
   5   issippi#        4
   2   ississippi#     0
   1   mississippi#    0
  10   pi#             1
   9   ppi#            0
   7   sippi#          2
   4   sissippi#       1
   6   ssippi#         3
   3   ssissippi#      -

e.g. the common prefix between issippi# and ississippi# has length 4.

 • How long is the common prefix between T[i,...] and T[j,...]?
   It is the min of the subarray Lcp[h,k-1] such that SA[h] = i and SA[k] = j.
 • Is there a repeated substring of length ≥ L?
   Search for an entry Lcp[i] ≥ L.
 • Is there a substring of length ≥ L occurring ≥ C times?
   Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.

Slide 116

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros.
For j > 0, the j-th column is obtained by

  M(j) = BitShift( M(j-1) ) & U( T[j] )

For i > 1, entry M(i,j) = 1 iff
 (1) the first i-1 characters of P match the i-1 characters of T ending at character j-1, i.e. M(i-1,j-1) = 1
 (2) P[i] = T[j], i.e. the i-th bit of U(T[j]) = 1
BitShift moves bit M(i-1,j-1) into the i-th position; ANDing it with the i-th bit of U(T[j]) establishes whether both conditions hold. (A sketch of the resulting scanner follows.)
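A minimal Python sketch of the Shift-And scanner just described (an assumption of this note: names and the 1-based output convention are mine, not the slides'); column M(j) is kept in a single integer, with its i-th least significant bit playing the role of row i.

def shift_and(T, P):
    # Bit-parallel Shift-And: bit i-1 of M tells whether P[1..i] ends at j.
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M, occ = 0, []
    for j, c in enumerate(T):
        M = ((M << 1) | 1) & U.get(c, 0)   # BitShift, then AND with U(T[j])
        if M & (1 << (m - 1)):             # bit m set: P ends at position j
            occ.append(j - m + 2)          # 1-based starting position
    return occ

# shift_and("xabxabaaca", "abaac") -> [5]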

An example: T = xabxabaaca, P = abaac
(so U(a)=(1,0,1,1,0), U(b)=(0,1,0,0,0), U(c)=(0,0,0,0,1), U(x)=(0,0,0,0,0))

j=1: T[1]=x.  M(1) = BitShift(M(0)) & U(x) = (1,0,0,0,0) & (0,0,0,0,0) = (0,0,0,0,0)
j=2: T[2]=a.  M(2) = BitShift(M(1)) & U(a) = (1,0,0,0,0) & (1,0,1,1,0) = (1,0,0,0,0)
j=3: T[3]=b.  M(3) = BitShift(M(2)) & U(b) = (1,1,0,0,0) & (0,1,0,0,0) = (0,1,0,0,0)
…
j=9: T[9]=c.  M(9) = BitShift(M(8)) & U(c) = (1,1,0,0,1) & (0,0,0,0,1) = (0,0,0,0,1)

The 5-th (last) bit of M(9) is 1, so P occurs in T ending at position 9, i.e. starting at position 5.

Shift-And method: Complexity








If m ≤ w, any column and any vector U() fit in a memory word, so any step requires O(1) time.
If m > w, any column and any vector U() can be divided into m/w memory words, so any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when the pattern length is close to the word size; this is very often the case in practice. Recall that w = 64 bits in modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: Another solution
Dictionary = {bzip, not, or}, P = bzip = 1a 0b, S = “bzip or not bzip”.
The slide runs the Shift-And scan over the compressed text C(S): the automaton answers “no” at the codewords that do not match and “yes” at the two occurrences of P’s codeword.

Speed ≈ Compression ratio

Problem 2
Dictionary = {bzip, not, or}. Given a pattern P, find all the occurrences in S of all terms containing P as a substring.
Example: P = o, S = “bzip or not bzip”. The dictionary terms containing P are not = 1g 0g 0a and or = 1g 0a 0b, and each of their codewords is searched for in C(S).

Speed ≈ Compression ratio? No! Why?
A scan of C(S) is needed for each term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T = A B C A C A B D D A B A, with the occurrences of P1 and P2 marked on the slide.
 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

 S is the concatenation of the patterns in P; R is a bitmap of length m, with R[i] = 1 iff S[i] is the first symbol of a pattern.
 Use a variant of the Shift-And method searching for S:
  For any symbol c, U’(c) = U(c) AND R, so U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern.
  For any step j, compute M(j) and then M(j) OR U’(T[j]). Why? This sets to 1 the first bit of each pattern that starts with T[j].
  Check if there are occurrences ending in j. How? (A sketch follows.)
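A Python sketch of this multi-pattern variant (not the slides' code). The end-of-pattern mask E and the ends map are my additions, one possible answer to the final "How?": an occurrence ends at j whenever a bit of M at a pattern's last position is set.

def multi_shift_and(T, patterns):
    # S = concatenation of the patterns; R marks their first positions,
    # E their last ones.  Start bits are (re)injected at every step by
    # OR-ing with U'(T[j]) = U(T[j]) & R, as described above.
    S = "".join(patterns)
    U, R, E, ends, pos = {}, 0, 0, {}, 0
    for P in patterns:
        R |= 1 << pos
        E |= 1 << (pos + len(P) - 1)
        ends[pos + len(P) - 1] = P
        pos += len(P)
    for i, c in enumerate(S):
        U[c] = U.get(c, 0) | (1 << i)
    M, occ = 0, []
    for j, c in enumerate(T):
        u = U.get(c, 0)
        M = ((M << 1) & u) | (u & R)
        for b, P in ends.items():
            if M & E & (1 << b):
                occ.append((P, j - len(P) + 2))   # (pattern, 1-based start)
    return occ

# multi_shift_and("abcacabdda", ["ca", "ab"])
#   -> [('ab', 1), ('ca', 3), ('ca', 5), ('ab', 6)]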

Problem 3
Dictionary = {bzip, not, or}. Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing at most k mismatches.
Example: P = bot, k = 2, S = “bzip or not bzip”.

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
   atcgaa        (2 mismatches, starting at position 4)

aatatccacaa
 atcgaa          (4 mismatches, starting at position 2)

Agrep




Our current goal: given k, find all the occurrences of P in T with up to k mismatches.
We define the matrix Ml to be an m by n binary matrix such that:
Ml(i,j) = 1 iff there are no more than l mismatches between the first i characters of P and the i characters of T ending at character j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1
The first i-1 characters of P match a substring of T ending at j-1, with at most l mismatches, and the next pair of characters in P and T are equal:

  BitShift( M^l(j-1) ) & U( T[j] )

Computing Ml: case 2
The first i-1 characters of P match a substring of T ending at j-1, with at most l-1 mismatches (the i-th position may then mismatch):

  BitShift( M^(l-1)(j-1) )

Computing Ml

 We compute M^l for all l = 0, …, k; for each j we compute M(j), M^1(j), …, M^k(j).
 For all l, initialize M^l(0) to the zero vector.
 In order to compute M^l(j), we observe that there is a match iff

  M^l(j) = [ BitShift( M^l(j-1) ) & U( T[j] ) ]  OR  BitShift( M^(l-1)(j-1) )

(a sketch of the resulting k-mismatch scanner follows)
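A Python sketch of the recurrence above (Wu-Manber style, substitutions only; not the slides' code), taking BitShift(x) = (x << 1) | 1 on the integer encoding of a column.

def agrep_mismatch(T, P, k):
    # R[l] holds column j-1 of M^l: bit i-1 is 1 iff P[1..i] matches the
    # text ending at the previous position with at most l mismatches.
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    R, occ = [0] * (k + 1), []
    for j, c in enumerate(T):
        u = U.get(c, 0)
        prev = R[:]                          # keep column j-1 intact
        R[0] = ((prev[0] << 1) | 1) & u
        for l in range(1, k + 1):
            R[l] = ((((prev[l] << 1) | 1) & u)
                    | ((prev[l - 1] << 1) | 1))
        if R[k] & (1 << (m - 1)):
            occ.append(j - m + 2)            # 1-based starting position
    return occ

# agrep_mismatch("xabxabaaca", "abaad", 1) -> [5]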

Example M1
T = xabxabaaca, P = abaad. The slide shows the full 5×10 matrices M0 and M1: row 5 of M1 has its only 1 in column 9, i.e. P occurs ending at position 9 with (at most) one mismatch (T[5,9] = abaac vs P = abaad).

How much do we pay?

 The running time is O( k n (1 + m/w) ).
 Again, the method is practically efficient for small m.
 Still, only O(k) columns of M are needed at any given time; hence the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary = {bzip, not, or}. Given a pattern P, find all the occurrences in S of all terms containing P as a substring allowing k mismatches.
Example: P = bot, k = 2, S = “bzip or not bzip”: the term not (codeword 1g 0g 0a) is within 2 mismatches of P, and its codeword is then searched for in C(S).

Agrep: more sophisticated operations

 The Shift-And method can solve other ops.
 The edit distance between two strings p and s is d(p,s) = minimum number of operations needed to transform p into s via three ops:
  Insertion: insert a symbol in p
  Deletion: delete a symbol from p
  Substitution: change a symbol in p into a different one
 Example: d(ananas, banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
g(x) = (Length-1) zeroes, followed by x in binary, where x > 0 and Length = ⌊log2 x⌋ + 1.
e.g., 9 is represented as <000, 1001>.

 The g-code for x takes 2⌊log2 x⌋ + 1 bits (i.e. a factor of 2 from optimal).
 Optimal for Pr(x) = 1/(2x²), and i.i.d. integers.

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

0001000 | 00110 | 011 | 00000111011 | 00111   →   8, 6, 3, 59, 7
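A small Python sketch of the g-encoder and g-decoder just exercised (function names are mine).

def gamma_encode(x):
    # gamma(x) = (Length-1) zeroes followed by x in binary,
    # with Length = floor(log2 x) + 1.   E.g. gamma_encode(9) = "0001001".
    b = bin(x)[2:]
    return '0' * (len(b) - 1) + b

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        l = 0
        while bits[i] == '0':          # count the leading zeroes
            l += 1; i += 1
        out.append(int(bits[i:i + l + 1], 2))
        i += l + 1
    return out

# gamma_decode("0001000001100110000011101100111") -> [8, 6, 3, 59, 7]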

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code g(i).

Recall that |g(i)| ≤ 2 log i + 1.
How good is this approach wrt Huffman? Compression ratio ≤ 2 H0(s) + 1.
Key fact:  1 ≥ Σi=1,...,x pi ≥ x · px  ⇒  x ≤ 1/px

How good is it ?
Encode the integers via g-coding:  |g(i)| ≤ 2 log i + 1.
The cost of the encoding is (recall i ≤ 1/pi):

  Σi=1,…,|S| pi |g(i)|  ≤  Σi=1,…,|S| pi [ 2 log(1/pi) + 1 ]  =  2 H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + …

A better encoding

 Byte-aligned and tagged Huffman
  128-ary Huffman tree
  First bit of the first byte is tagged
  Configurations on 7 bits: just those of Huffman
 End-tagged dense code
  The rank r is mapped to the r-th binary sequence on 7·k bits
  First bit of the last byte is tagged

Surprising changes
 It is a prefix-code
 Better compression: it uses all 7-bit configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2.

 A new concept: Continuers vs Stoppers (previously we used s = c = 128).
 The main idea is:
  s + c = 256 (we are playing with 8 bits)
  thus s items are encoded with 1 byte,
  s·c with 2 bytes, s·c² with 3 bytes, …

An example

 5000 distinct words.
 ETDC encodes 128 + 128² = 16512 words on up to 2 bytes.
 A (230,26)-dense code encodes 230 + 230·26 = 6210 words on up to 2 bytes, hence more words on 1 byte; thus, if the distribution is skewed, it compresses better. (A sketch of the encoder follows.)
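A Python sketch of an (s,c)-dense encoder, under the assumption that stopper byte values are 0..s-1 and continuer values are s..255 (the papers also use the opposite convention); the function name and the sample ranks are mine.

def sc_encode(rank, s=230, c=26):
    # Find how many continuer digits the codeword needs: the first s
    # ranks take 1 byte, the next s*c take 2 bytes, then s*c^2, ...
    i, block, j = rank, s, 0
    while i >= block:
        i -= block
        block *= c
        j += 1
    t, q, cont = i % s, i // s, []
    for _ in range(j):                 # j continuer digits, base c
        cont.append(s + q % c)
        q //= c
    return bytes(reversed(cont)) + bytes([t])

# len(sc_encode(100)) == 1, len(sc_encode(300)) == 2, len(sc_encode(6300)) == 3
# Setting s = c = 128 gives the dense-code counterpart of the ETDC setting above.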

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 256 - s.

 Brute-force approach
 Binary search: on real distributions, there seems to be one unique minimum

Here Ks = max codeword length and Fsk = cumulative probability of the symbols whose codeword length is ≤ k.

Experiments: (s,c)-DC is quite interesting: search is 6% faster than byte-aligned Huffword.

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move-to-Front Coding
Transforms a char sequence into an integer sequence, that can then be var-length coded.

 Start with the list of symbols L = [a,b,c,d,…]
 For each input symbol s: (1) output the position of s in L, (2) move s to the front of L

There is a memory. Properties: it exploits temporal locality, and it is dynamic.

 X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n² log n) bits, MTF = O(n log n) + n² bits

Not much worse than Huffman... but it may be far better. (A sketch of the coder follows.)
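A tiny Python sketch of the MTF coder (positions are output 0-based, as in the Bzip example later in these notes; names are mine).

def mtf_encode(s, alphabet):
    # Output the current 0-based position of each symbol in the list L,
    # then move that symbol to the front of L.
    L, out = list(alphabet), []
    for c in s:
        p = L.index(c)
        out.append(p)
        L.pop(p); L.insert(0, c)
    return out

# mtf_encode("aaabbbba", "abcd") -> [0, 0, 0, 1, 0, 0, 0, 1]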

MTF: how good is it ?
Encode the integers via g-coding: |g(i)| ≤ 2 log i + 1.
Put S at the front and consider the cost of encoding (px_i denotes the position of the i-th occurrence of symbol x, so consecutive differences are the MTF gaps):

  O(|S| log |S|)  +  Σx=1,…,|S| Σi=2,…,nx | g( px_i - px_i-1 ) |

By Jensen’s inequality:

  ≤  O(|S| log |S|)  +  Σx=1,…,|S| nx [ 2 log(N/nx) + 1 ]
  =  O(|S| log |S|)  +  N [ 2 H0(X) + 1 ]

Hence  La[mtf] ≤ 2 H0(X) + O(1).

MTF: higher compression
To achieve higher compression we consider words (and separators) as the symbols to be encoded.
How to keep the MTF-list efficiently:

 Search tree
  Leaves contain the symbols, ordered as in the MTF-list
  Nodes contain the size of their descending subtree
 Hash Table
  key is a symbol
  data is a pointer to the corresponding tree leaf

Each tree operation takes O(log |S|) time; the total is O(n log |S|), where n = #symbols to be compressed.

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In the case of binary strings: just the run lengths and one bit (for the first symbol).
Properties: there is a memory; it exploits spatial locality, and it is a dynamic code.

 X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive):

  f(i) = Σj=1,…,i-1 p(j)

e.g. with p(a) = .2, p(b) = .5, p(c) = .3:  f(a) = .0, f(b) = .2, f(c) = .7, so a → [0,.2), b → [.2,.7), c → [.7,1.0).
The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7)).

Arithmetic Coding: Encoding Example
Coding the message sequence: bac (with the model above).
Start from [0,1); after b the interval is [.2,.7); after a it is [.2,.3); after c it is [.27,.3).
The final sequence interval is [.27,.3).

Arithmetic Coding
To code a sequence of symbols ci with probabilities p[ci], use the following:

  l0 = 0,  li = li-1 + si-1 · f[ci]
  s0 = 1,  si = si-1 · p[ci]

where f[c] is the cumulative probability up to symbol c (not included).
The final interval size is  sn = Πi=1,…,n p[ci].
The interval for a message sequence will be called the sequence interval.

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval
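A minimal Python sketch of the interval computation with the recurrences above (floating point, so only suitable for short messages; names and the model values repeat the running example).

def sequence_interval(msg, p, f):
    # l_i = l_{i-1} + s_{i-1} * f[c_i],   s_i = s_{i-1} * p[c_i]
    l, s = 0.0, 1.0
    for c in msg:
        l, s = l + s * f[c], s * p[c]
    return l, l + s

p = {'a': .2, 'b': .5, 'c': .3}
f = {'a': .0, 'b': .2, 'c': .7}
# sequence_interval("bac", p, f) -> (0.27, 0.3)   (up to float rounding)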

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3 (same model as above):
.49 ∈ [.2,.7) → b;  then .49 ∈ [.3,.55) → b;  then .49 ∈ [.475,.55) → c.
The message is bbc.

Representing a real number
Binary fractional representation:  .75 = .11    1/3 = .0101 01…    11/16 = .1011
Algorithm:  1. x = 2·x;  2. if x < 1 output 0;  3. else x = x - 1 and output 1.
So how about just using the shortest binary fractional representation in the sequence interval?
e.g.  [0,.33) → .01    [.33,.66) → .1    [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions:

  number   min       max       interval
  .11      .110…     .111…     [.75, 1.0)
  .101     .1010…    .1011…    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval is contained in the sequence interval (a dyadic number).
Example: sequence interval [.61, .79); the code interval of .101 is [.625, .75), which is contained in it.
Can use L + s/2 truncated to 1 + ⌈log (1/s)⌉ bits.

Bound on Arithmetic length: note that ⌈-log s⌉ + 1 = ⌈log (2/s)⌉.

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

  1 + ⌈log (1/s)⌉ = 1 + ⌈log Πi (1/pi)⌉ ≤ 2 + Σi=1,…,n log (1/pi) = 2 + Σk=1,…,|S| n pk log (1/pk) = 2 + n H0  bits

In practice it takes nH0 + 0.02 n bits, because of rounding.

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)

 If l ≥ R/2 (top half): output 1 followed by m 0s; set m = 0; the message interval is expanded by 2.
 If u < R/2 (bottom half): output 0 followed by m 1s; set m = 0; the message interval is expanded by 2.
 If l ≥ R/4 and u < 3R/4 (middle half): increment m; the message interval is expanded by 2.
 In all other cases, just continue…
You find this at

Arithmetic ToolBox
As a state machine: given the current interval (L,s), the next symbol c and the distribution (p1,…,pS), the ATB returns the new interval (L’,s’) with L’ = L + s·f(c) and s’ = s·p(c).
Therefore, even the distribution can change over time.

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
The ATB is driven by p[ s | context ], where s = c or esc: at each step it maps (L,s) to (L’,s’) as before.
Encoder and Decoder must know the protocol for selecting the same conditional probability distribution (PPM-variant).

PPM: Example Contexts
String = ACCBACCACBA B,   k = 2

  Context Empty   Counts:  A = 4, B = 2, C = 5, $ = 3
  Context A       Counts:  C = 3, $ = 1
  Context B       Counts:  A = 2, $ = 1
  Context C       Counts:  A = 1, B = 2, C = 2, $ = 3
  Context AC      Counts:  B = 1, C = 2, $ = 2
  Context BA      Counts:  C = 1, $ = 1
  Context CA      Counts:  C = 1, $ = 1
  Context CB      Counts:  A = 2, $ = 1
  Context CC      Counts:  A = 1, B = 1, $ = 2

You find this at: compression.ru/ds/
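A Python sketch that rebuilds the count tables of the example above. The escape '$' is charged once per distinct character seen in each context, which is one common PPM choice and matches the numbers in the table; the function name is mine.

def ppm_counts(S, k=2):
    # tables[l][ctx][c] = how many times c followed the length-l context ctx.
    from collections import defaultdict
    tables = [defaultdict(lambda: defaultdict(int)) for _ in range(k + 1)]
    for j in range(len(S)):
        for l in range(k + 1):
            if j < l:
                continue
            ctx = S[j - l:j]
            if S[j] not in tables[l][ctx]:
                tables[l][ctx]['$'] += 1      # first time seen: one escape
            tables[l][ctx][S[j]] += 1
    return tables

# dict(ppm_counts("ACCBACCACBA")[0]['']) == {'$': 3, 'A': 4, 'C': 5, 'B': 2}
# dict(ppm_counts("ACCBACCACBA")[2]['AC']) == {'$': 2, 'B': 1, 'C': 2}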

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary = all the substrings starting before the cursor; e.g. the codeword <2,3,c>.

Algorithm’s step:
 Output <d, len, c> where
  d = distance of the copied string wrt the current position
  len = length of the longest match
  c = next char in the text beyond the longest match
 Advance by len + 1

A buffer “window” has fixed length and moves over the text.

Example: LZ77 with window (window size = 6)
T = a a c a a c a b c a b a a a c
Parsing, step by step (longest match within W, plus the next character):
  (0,0,a)  (1,1,c)  (3,4,b)  (3,3,a)  (1,2,c)

(A sketch of this greedy parser follows.)
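A Python sketch of a greedy LZ77 parser with a sliding window (not the slides' code); it reproduces the parse above, overlaps included.

def lz77_parse(T, W=6):
    # Emits triples (d, len, c): copy `len` chars from distance `d`
    # back, then append the literal `c`.  Matches may overlap the cursor.
    i, out = 0, []
    while i < len(T):
        best_d = best_len = 0
        for start in range(max(0, i - W), i):      # candidate match start
            l = 0
            while i + l < len(T) - 1 and T[start + l] == T[i + l]:
                l += 1                              # may run past i (overlap)
            if l > best_len:
                best_len, best_d = l, i - start
        out.append((best_d, best_len, T[i + best_len]))
        i += best_len + 1
    return out

# lz77_parse("aacaacabcabaaac")
#   -> [(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')]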

LZ77 Decoding
The decoder keeps the same dictionary window as the encoder: it finds the substring and inserts a copy of it.
What if len > d? (overlap with the text still to be decompressed)
E.g. seen = abcd, next codeword is (2,9,e): simply copy starting at the cursor,
  for (i = 0; i < len; i++)
    out[cursor+i] = out[cursor-d+i]
Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
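A Python sketch of the LZW encoder. To mirror the toy numbering of the next slides, the dictionary is seeded with just a = 112, b = 113, c = 114 (an assumption for the demo); a real coder would start from the full 256 ASCII entries.

def lzw_encode(T, first_code=256):
    D = {"a": 112, "b": 113, "c": 114}
    nxt = first_code
    out, S = [], ""
    for c in T:
        if S + c in D:
            S += c                   # keep extending the current match
        else:
            out.append(D[S])         # emit the id of S, not the extra char
            D[S + c] = nxt           # ... but still add Sc to the dictionary
            nxt += 1
            S = c
    out.append(D[S])
    return out

# lzw_encode("aabaacababacb") -> [112, 112, 113, 256, 114, 257, 261, 114, 113]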

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Consider the text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: the L → F mapping
How do we map L’s chars onto F’s chars? …We need to distinguish equal chars in F…
Take two equal chars of L and rotate their rows rightward by one position: they end up in F in the same relative order!

The BWT is invertible
(same F and L columns as above)
Two key properties:
1. The LF-array maps L’s chars to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:  T = .... i p p i #

InvertBWT(L)
  Compute LF[0,n-1];
  r = 0; i = n;
  while (i>0) {
    T[i] = L[r];
    r = LF[r]; i--;
  }
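A Python sketch of the transform and its inversion (a quadratic rotation-sorting demo, not the efficient SA-based construction discussed next; names are mine).

def bwt(T):
    # Sort all cyclic rotations and take their last characters.
    n = len(T)
    rots = sorted(T[i:] + T[:i] for i in range(n))
    return "".join(r[-1] for r in rots)

def ibwt(L):
    # LF-mapping via a stable sort of L's positions by character.
    n = len(L)
    order = sorted(range(n), key=lambda i: L[i])
    LF = [0] * n
    for j, i in enumerate(order):
        LF[i] = j
    out, r = [], 0                 # row 0 is the rotation starting with '#'
    for _ in range(n):
        out.append(L[r])           # L[r] precedes F[r] in T
        r = LF[r]
    t = "".join(reversed(out))     # this is '#' + T without its terminator,
    return t[1:] + t[0]            # so rotate the terminator back to the end

# bwt("mississippi#")  == "ipssm#pissii"
# ibwt("ipssm#pissii") == "mississippi#"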

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults
Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics

 Size
  1 trillion pages available (Google, 7/08)
  5-40K per page => hundreds of terabytes
  Size grows every day!!
 Change
  8% new pages, 25% new links change weekly
  Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…

 Physical network graph: V = Routers, E = communication links
 The “cosine” graph (undirected, weighted): V = static web pages, E = semantic distance between pages
 Query-Log graph (bipartite, weighted): V = queries and URLs, E = (q,u) if u is a result for q and has been clicked by some user who issued q
 Social graph (undirected, unweighted): V = users, E = (x,y) if x knows y (facebook, address book, email, …)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution
Altavista crawl (1999) and WebBase crawl (2001): the indegree follows a power-law distribution

  Pr[ in-degree(u) = k ]  ∝  1/k^a,   a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
21 million pages, 150 million links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list encoded by gaps:  S(x) = { s1 - x, s2 - s1 - 1, …, sk - sk-1 - 1 }

For negative entries (only the first gap s1 - x can be negative): map v ≥ 0 to 2v and v < 0 to 2|v| - 1. (A sketch follows.)
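A small Python sketch of this gap encoding with sign-folding (the node id and the toy successor list in the comment are made up; a real coder would then feed the resulting integers to a universal code such as g).

def encode_successors(x, succs):
    # First gap is s1 - x (possibly negative), the others si - s(i-1) - 1 >= 0;
    # then fold the sign: v >= 0 -> 2v, v < 0 -> 2|v| - 1.
    gaps, prev = [], None
    for i, s in enumerate(sorted(succs)):
        g = s - x if i == 0 else s - prev - 1
        prev = s
        gaps.append(2 * g if g >= 0 else 2 * (-g) - 1)
    return gaps

# encode_successors(15, [13, 15, 16, 17, 19]) -> [3, 2, 0, 0, 2]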

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit in y’s copy-list tells whether the corresponding successor of the reference x is also a successor of y;
the reference index is chosen in [0,W] so as to give the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77-scheme provides an efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
           Emacs size   Emacs time
uncompr    27Mb         ---
gzip       8Mb          35 secs
zdelta     1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: a pair of proxies located on each side of the slow link use a proprietary protocol to increase performance over this link. Both proxies hold a reference page; the request travels to the web server over the fast link, and only the delta-encoding of the answer crosses the slow link back to the client.

Use zdelta to reduce traffic:
 Old version available at both proxies
 Restricted to pages already visited (30% hits), URL-prefix match
 Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

(The slide shows a small weighted graph with the dummy node 0 and zdelta/gzip edge weights such as 20, 123, 220, 620, 2000.)

           space    time
uncompr    30Mb     ---
tgz        20%      linear
THIS       8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly: n² edge calculations (zdelta executions). We wish to exploit some pruning approach:

 Collection analysis: cluster the files that appear similar and are thus good candidates for zdelta-compression; build a sparse weighted graph G’F containing only the edges between those pairs of files.
 Assign weights: estimate appropriate edge weights for G’F, thus saving zdelta executions. Nonetheless, strictly n² time.
           space    time
uncompr    260Mb    ---
tgz        12%      2 mins
THIS       8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
The Client holds the out-dated file f_old and sends a request; the Server holds f_new and sends back an update.

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
The Client sends the hashes of f_old’s blocks; the Server compares them against f_new and sends back the encoded file (references to matching blocks plus literal data).

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
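A Python sketch of an Adler-style rolling checksum of the kind rsync uses for its weak 4-byte hash (this is an illustrative formula, not the exact rsync one): the checksum of the next window is obtained in O(1) from the previous one.

def weak_checksum(block):
    # a = sum of the bytes, b = position-weighted sum; both kept on 16 bits.
    l = len(block)
    a = sum(block) % 65536
    b = sum((l - i) * x for i, x in enumerate(block)) % 65536
    return a, b

def roll(a, b, l, old, new):
    # Slide the window one byte to the right.
    a2 = (a - old + new) % 65536
    b2 = (b - l * old + a2) % 65536
    return a2, b2

# data = b"abcdefgh"; l = 4
# weak_checksum(data[1:5]) == roll(*weak_checksum(data[0:4]), l, data[0], data[4])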

Rsync: some experiments

           gcc size   emacs size
total      27288      27326
gzip       7563       8577
zdelta     227        1431
rsync      964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
(figure) The compacted trie of all the suffixes of T# = mississippi#: edges are labeled with substrings (e.g. si, ssi, ppi#, mississippi#), and the 12 leaves store the starting positions 1…12 of the suffixes.

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Θ(N²) space if SUF(T) is stored explicitly
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array space:
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step.
T = mississippi#, P = si: at every step P is compared with the suffix starting at SA[mid]; if P is larger the search continues in the right half, if P is smaller in the left half.

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
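A Python sketch of the suffix array and of the indirect binary search just described (names are mine; the naive construction is quadratic, as the slides warn, and is only meant as a demo).

def suffix_array(T):
    # Plain sorting of the suffixes: Theta(N^2 log N) in the worst case.
    return sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])

def locate(T, SA, P):
    # Indirect binary search on SA: O(p log N) character comparisons.
    def pref(i):                        # first |P| chars of the i-th suffix
        return T[i - 1:i - 1 + len(P)]
    lo, hi = 0, len(SA)
    while lo < hi:                      # leftmost suffix with prefix >= P
        mid = (lo + hi) // 2
        if pref(SA[mid]) < P: lo = mid + 1
        else: hi = mid
    first = lo
    lo, hi = first, len(SA)
    while lo < hi:                      # leftmost suffix with prefix > P
        mid = (lo + hi) // 2
        if pref(SA[mid]) <= P: lo = mid + 1
        else: hi = mid
    return SA[first:lo]                 # 1-based starting positions

# suffix_array("mississippi#") == [12,11,8,5,2,1,10,9,7,4,6,3]
# locate("mississippi#", suffix_array("mississippi#"), "si") == [7, 4]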

Locating the occurrences
T = mississippi#, P = si: the suffixes prefixed by P occupy a contiguous range of SA (here the entries 7 and 4, i.e. sippi# and sissippi#), so occ = 2; the range can be delimited by searching for si# and si$, where # is smaller and $ larger than every character of S.

Suffix Array search: O(p + log2 N + occ) time.
Suffix Trays: O(p + log2 |S| + occ)  [Cole et al., ‘06]
String B-tree  [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays  [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does it exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does it exist a substring of length ≥ L occurring ≥ C times ?
• Search for Lcp[i,i+C-2] whose entries are ≥ L


For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with i-th bit of U(T[j]) to estabilish if both are true

An example j=1
1 2 3 4 5 6 7 8 9 10

n
m

1

12345

1

0

P=abaac

2

0

3

0

4

0

5

0

T=xabxabaaca

0
 
0
U (x)  0
 
0
 
0

2

3

4

5

6

7

1 
 
0
BitShift ( M ( 0 )) & U (T [1])   0  &
 
0
 
0

8

9

0 0
   
0 0
0  0
   
0 0
   
0 0

1
0

An example j=2
T = xabxabaaca,  P = abaac,  U(a) = (1,0,1,1,0)ᵀ

M(2) = BitShift( M(1) ) & U(T[2]) = (1,0,0,0,0)ᵀ & (1,0,1,1,0)ᵀ = (1,0,0,0,0)ᵀ

An example j=3
T = xabxabaaca,  P = abaac,  U(b) = (0,1,0,0,0)ᵀ

M(3) = BitShift( M(2) ) & U(T[3]) = (1,1,0,0,0)ᵀ & (0,1,0,0,0)ᵀ = (0,1,0,0,0)ᵀ

An example j=9
T = xabxabaaca,  P = abaac,  U(c) = (0,0,0,0,1)ᵀ

M, columns j = 1…9:
  i=1   0 1 0 0 1 0 1 1 0
  i=2   0 0 1 0 0 1 0 0 0
  i=3   0 0 0 0 0 0 1 0 0
  i=4   0 0 0 0 0 0 0 1 0
  i=5   0 0 0 0 0 0 0 0 1

M(9) = BitShift( M(8) ) & U(T[9]) = (1,1,0,0,1)ᵀ & (0,0,0,0,1)ᵀ = (0,0,0,0,1)ᵀ
Since M(5,9) = 1, P occurs in T ending at position 9.

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.
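
As a sanity check of the method, here is a minimal C sketch for the case m ≤ w, with one 64-bit word per column of M; text and pattern are the running example of the previous slides.

  #include <stdio.h>
  #include <stdint.h>
  #include <string.h>

  int main(void) {
      const char *T = "xabxabaaca", *P = "abaac";
      size_t n = strlen(T), m = strlen(P);
      uint64_t U[256] = {0};
      for (size_t i = 0; i < m; i++)                  /* bit i of U[x] = 1 iff P[i+1] = x */
          U[(unsigned char)P[i]] |= (uint64_t)1 << i;
      uint64_t M = 0;                                 /* column M(0) = all zeros          */
      for (size_t j = 0; j < n; j++) {
          M = ((M << 1) | 1) & U[(unsigned char)T[j]];    /* BitShift, then AND U(T[j])   */
          if (M & ((uint64_t)1 << (m - 1)))           /* last row set: occurrence ends at j */
              printf("occurrence ending at position %zu\n", j + 1);
      }
      return 0;
  }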

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
U(a) = (1,0,1,1,0)ᵀ      U(b) = (1,1,0,0,0)ᵀ      U(c) = (0,0,0,0,1)ᵀ

What about ‘?’, ‘[^…]’ (negation)?

Problem 1: Another solution
Dictionary = {bzip, not, or, space};  P = bzip = 1a 0b;  S = “bzip or not bzip”

[figure: as in the previous solution, the compressed text C(S) is scanned and compared against the codeword of P; yes/no marks show where the codeword matches]

Speed ≈ Compression ratio

Problem 2
Given a pattern P, find all the occurrences in S of all terms containing P as a substring.
Dictionary = {bzip, not, or, space};  S = “bzip or not bzip”;  P = o

[figure: the dictionary terms containing P are not = 1g 0g 0a and or = 1g 0a 0b; their codewords are searched in the compressed text C(S)]

Speed ≈ Compression ratio? No! Why?
A scan of C(S) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[figure: text T with an occurrence of P1 and an occurrence of P2 highlighted]
 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

S is the concatenation of the patterns in P
R is a bitmap of length m
  R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of the Shift-And method searching for S
  For any symbol c, U’(c) = U(c) AND R
    U’(c)[i] = 1 iff S[i] = c and it is the first symbol of a pattern
  For any step j,
    compute M(j)
    then M(j) OR U’(T[j]). Why?
      it sets to 1 the first bit of each pattern that starts with T[j]
    check if there are occurrences ending in j. How? (see the sketch below)
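
A hedged C sketch of this variant, assuming the concatenation S of the patterns fits in one 64-bit word; the two patterns below are only illustrative. The extra mask E marks the last position of each pattern, so M & E ≠ 0 reveals an occurrence ending at j.

  #include <stdio.h>
  #include <stdint.h>
  #include <string.h>

  int main(void) {
      const char *pats[] = {"ab", "baa"};             /* illustrative pattern set      */
      const char *T = "xabxabaaca";
      uint64_t U[256] = {0}, Up[256] = {0}, R = 0, E = 0;
      size_t pos = 0;
      for (int k = 0; k < 2; k++) {                   /* lay the patterns down in S    */
          R |= (uint64_t)1 << pos;                    /* first symbol of the pattern   */
          for (size_t i = 0; i < strlen(pats[k]); i++, pos++)
              U[(unsigned char)pats[k][i]] |= (uint64_t)1 << pos;
          E |= (uint64_t)1 << (pos - 1);              /* last symbol of the pattern    */
      }
      for (int c = 0; c < 256; c++) Up[c] = U[c] & R; /* U'(c) = U(c) AND R            */
      uint64_t M = 0;
      for (size_t j = 0; j < strlen(T); j++) {
          M = ((M << 1) & U[(unsigned char)T[j]]) | Up[(unsigned char)T[j]];
          if (M & E)                                  /* some pattern ends at j        */
              printf("a pattern ends at position %zu\n", j + 1);
      }
      return 0;
  }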

Problem 3
Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing at most k mismatches.
Dictionary = {bzip, not, or, space};  S = “bzip or not bzip”;  P = bot, k = 2

[figure: the tagged-Huffman dictionary and the compressed text C(S), as in the previous problems]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal: given k, find all the occurrences of P
in T with up to k mismatches.
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
there are no more than l mismatches between the
first i characters of P and the i characters of T
ending at character j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

[figure: P[1..i-1] aligned against T[..j-1] with at most l mismatches, and P[i] = T[j]]

This case contributes the term   BitShift( Ml(j-1) ) & U(T[j])

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

[figure: P[1..i-1] aligned against T[..j-1] with at most l-1 mismatches; the pair P[i], T[j] is charged as the l-th mismatch]

This case contributes the term   BitShift( Ml-1(j-1) )

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1
T = xabxabaaca,  P = abaad,  k = 1

M0 =        j:  1 2 3 4 5 6 7 8 9 10
   i=1          0 1 0 0 1 0 1 1 0 1
   i=2          0 0 1 0 0 1 0 0 0 0
   i=3          0 0 0 0 0 0 1 0 0 0
   i=4          0 0 0 0 0 0 0 1 0 0
   i=5          0 0 0 0 0 0 0 0 0 0

M1 =        j:  1 2 3 4 5 6 7 8 9 10
   i=1          1 1 1 1 1 1 1 1 1 1
   i=2          0 0 1 0 0 1 0 1 1 0
   i=3          0 0 0 1 0 0 1 0 0 1
   i=4          0 0 0 0 1 0 0 1 0 0
   i=5          0 0 0 0 0 0 0 0 1 0

M1(5,9) = 1: P occurs ending at position 9 with at most one mismatch (abaac vs. abaad).

How much do we pay?





The running time is O(kn(1+m/w)).
Again, the method is practically efficient for small m.
Still, only O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.
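
A minimal C sketch of the k-mismatch recurrence (one 64-bit word per column Ml, assuming m ≤ w and k < 8), run on the example of the previous slides.

  #include <stdio.h>
  #include <stdint.h>
  #include <string.h>

  int main(void) {
      const char *T = "xabxabaaca", *P = "abaad";
      int k = 1;                                      /* number of allowed mismatches  */
      size_t n = strlen(T), m = strlen(P);
      uint64_t U[256] = {0}, M[8] = {0};              /* M[l] = current column of Ml   */
      for (size_t i = 0; i < m; i++)
          U[(unsigned char)P[i]] |= (uint64_t)1 << i;
      for (size_t j = 0; j < n; j++) {
          uint64_t prev = M[0];                       /* old column of M(l-1)          */
          M[0] = ((M[0] << 1) | 1) & U[(unsigned char)T[j]];
          for (int l = 1; l <= k; l++) {
              uint64_t old = M[l];
              /* extend an exact step OR pay one more mismatch */
              M[l] = (((old << 1) | 1) & U[(unsigned char)T[j]]) | ((prev << 1) | 1);
              prev = old;
          }
          if (M[k] & ((uint64_t)1 << (m - 1)))
              printf("occurrence with <= %d mismatches ending at %zu\n", k, j + 1);
      }
      return 0;
  }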

Problem 3: Solution
Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing k mismatches.
Dictionary = {bzip, not, or, space};  S = “bzip or not bzip”;  P = bot, k = 2

[figure: the k-mismatch search is run over the dictionary terms (e.g. not = 1g 0g 0a), and the codewords of the matching terms are then searched in C(S)]

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3
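
For reference, the classical O(|p|·|s|) dynamic program computing this distance (a plain sketch, not the bit-parallel extension used by agrep):

  #include <stdio.h>
  #include <string.h>

  int edit_distance(const char *p, const char *s) {
      int m = strlen(p), n = strlen(s), D[64][64];        /* assumes short strings    */
      for (int i = 0; i <= m; i++) D[i][0] = i;           /* i deletions              */
      for (int j = 0; j <= n; j++) D[0][j] = j;           /* j insertions             */
      for (int i = 1; i <= m; i++)
          for (int j = 1; j <= n; j++) {
              int sub = D[i-1][j-1] + (p[i-1] != s[j-1]); /* substitution or match    */
              int del = D[i-1][j] + 1, ins = D[i][j-1] + 1;
              D[i][j] = sub < del ? (sub < ins ? sub : ins) : (del < ins ? del : ins);
          }
      return D[m][n];
  }

  int main(void) {
      printf("d(ananas,banane) = %d\n", edit_distance("ananas", "banane"));  /* 3 */
      return 0;
  }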

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding

γ(x) =  000…0  x in binary        (Length-1 zeros, then x written in binary)
        where x > 0 and Length = ⌊log2 x⌋ + 1
        e.g., 9 is represented as <000,1001>

The γ-code for x takes 2⌊log2 x⌋ + 1 bits   (i.e. a factor of 2 from optimal)

Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000 00110 011 00000111011 00111   →   8, 6, 3, 59, 7
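
A small C sketch of γ-encoding and γ-decoding over a string of '0'/'1' characters, checked against the exercise above; the bit-string representation is just for illustration.

  #include <stdio.h>

  void gamma_encode(unsigned x, char *out) {               /* x > 0                   */
      int len = 0, k = 0;
      for (unsigned t = x; t; t >>= 1) len++;              /* len = floor(log2 x) + 1 */
      for (int i = 0; i < len - 1; i++) out[k++] = '0';    /* unary part              */
      for (int i = len - 1; i >= 0; i--) out[k++] = (x >> i & 1) ? '1' : '0';
      out[k] = '\0';
  }

  unsigned gamma_decode(const char **bits) {               /* advances *bits          */
      int len = 1;
      while (**bits == '0') { len++; (*bits)++; }
      unsigned x = 0;
      for (int i = 0; i < len; i++) x = (x << 1) | (unsigned)(*(*bits)++ - '0');
      return x;
  }

  int main(void) {
      char buf[64];
      gamma_encode(9, buf);
      printf("gamma(9) = %s\n", buf);                      /* 0001001                 */
      const char *s = "0001000001100110000011101100111";
      while (*s) printf("%u ", gamma_decode(&s));          /* 8 6 3 59 7              */
      printf("\n");
      return 0;
  }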

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).

Recall that |γ(i)| ≤ 2·log2 i + 1
How good is this approach w.r.t. Huffman?
Compression ratio ≤ 2·H0(s) + 1
Key fact:   1 ≥ Σi=1..x pi ≥ x·px   ⟹   x ≤ 1/px

How good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log2 i + 1
The cost of the encoding is (recall i ≤ 1/pi):

   Σi=1..|S| pi · |γ(i)|   ≤   Σi=1..|S| pi · [ 2·log2(1/pi) + 1 ]   =   2·H0(X) + 1

Not much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations
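
A sketch of ETDC encoding in C: ranks 0..127 take one byte, the next 128² ranks take two bytes, and so on, with the tag bit set on the last byte. The byte ordering chosen here is an assumption made for illustration.

  #include <stdio.h>

  int etdc_encode(unsigned long r, unsigned char *out) {
      unsigned long base = 128;
      int k = 1;
      while (r >= base) { r -= base; base *= 128; k++; }   /* codeword length in bytes */
      for (int i = k - 1; i >= 0; i--) { out[i] = r % 128; r /= 128; }
      out[k - 1] |= 0x80;                                  /* tag the last byte        */
      return k;
  }

  int main(void) {
      unsigned char cw[8];
      for (unsigned long r = 126; r <= 130; r++) {         /* around the 1/2-byte edge */
          int k = etdc_encode(r, cw);
          printf("rank %lu ->", r);
          for (int i = 0; i < k; i++) printf(" %02x", cw[i]);
          printf("\n");
      }
      return 0;
  }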

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 128² = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230·26 = 6210 on 2
bytes, hence more on 1 byte and thus better if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1^n 2^n 3^n … n^n   ⟹   Huff = O(n² log n),  MTF = O(n log n) + n²

Not much worse than Huffman
...but it may be far better
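
A minimal MTF sketch in C over bytes: for each input symbol it outputs the current position in the list and then moves the symbol to the front (the output integers would then be γ-coded).

  #include <stdio.h>
  #include <string.h>

  int main(void) {
      unsigned char L[256];
      for (int i = 0; i < 256; i++) L[i] = (unsigned char)i;   /* L = [0,1,...,255]   */
      const char *in = "aaabbbaaacc";                          /* illustrative input  */
      for (size_t j = 0; j < strlen(in); j++) {
          unsigned char c = in[j];
          int pos = 0;
          while (L[pos] != c) pos++;                           /* 1) position of c    */
          printf("%d ", pos);
          for (int i = pos; i > 0; i--) L[i] = L[i-1];         /* 2) move c to front  */
          L[0] = c;
      }
      printf("\n");
      return 0;
  }

Runs of repeated symbols turn into runs of small integers (mostly 0s), which is exactly what the later RLE / statistical stage exploits.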

MTF: how good is it ?
Encode the integers via γ-coding:  |γ(i)| ≤ 2·log2 i + 1
Put the alphabet S in front and consider the cost of encoding
(pxi denotes the position of the i-th occurrence of symbol x):

   O(|S| log |S|)  +  Σx=1..|S| Σi≥2 | γ( pxi − pxi-1 ) |

By Jensen’s inequality:

   ≤  O(|S| log |S|)  +  Σx=1..|S| nx · [ 2·log2(N/nx) + 1 ]
   =  O(|S| log |S|)  +  N · [ 2·H0(X) + 1 ]

Hence  La[mtf]  ≤  2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
There is a memory
X = 1^n 2^n 3^n … n^n   ⟹   Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.   p(a) = .2,  p(b) = .5,  p(c) = .3
       f(a) = .0,  f(b) = .2,  f(c) = .7        where  f(i) = Σj<i p(j)

[figure: the unit interval [0,1) split into a = [0,.2), b = [.2,.7), c = [.7,1.0)]

The interval for a particular symbol will be called
the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
[figure: nested intervals — start from [0,1); symbol b narrows it to [.2,.7); then a to [.2,.3); then c to [.27,.3)]

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c1…cn with probabilities p[c], use the following:

   l0 = 0        li = li-1 + si-1 · f[ci]
   s0 = 1        si = si-1 · p[ci]

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is    sn = ∏i=1..n p[ci]

The interval for a message sequence will be called the
sequence interval
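
A toy floating-point rendering of these recurrences (real coders use the integer version discussed below); it reproduces the [.27,.3) interval of the previous example.

  #include <stdio.h>
  #include <string.h>

  int main(void) {
      const char *sym = "abc";                        /* p(a)=.2, p(b)=.5, p(c)=.3     */
      double p[3] = {0.2, 0.5, 0.3};
      double f[3] = {0.0, 0.2, 0.7};                  /* cumulative, symbol excluded   */
      const char *msg = "bac";
      double l = 0.0, s = 1.0;
      for (size_t i = 0; i < strlen(msg); i++) {
          int c = (int)(strchr(sym, msg[i]) - sym);
          l = l + s * f[c];                           /* l_i = l_{i-1} + s_{i-1}*f[ci] */
          s = s * p[c];                               /* s_i = s_{i-1} * p[ci]         */
      }
      printf("sequence interval = [%g, %g)\n", l, l + s);   /* [0.27, 0.3)             */
      return 0;
  }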

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
[figure: .49 falls in b = [.2,.7); within that interval it falls in b = [.3,.55); within that it falls in c = [.475,.55)]

The message is bbc.
Representing a real number
Binary fractional representation:
.75 = .11        1/3 = .0101…        11/16 = .1011

Algorithm:
  1. x = 2·x
  2. if x < 1, output 0
  3. else x = x − 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval?
e.g.  [0,.33) → .01      [.33,.66) → .1      [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
            min       max       interval
  .11       .110…     .111…     [.75, 1.0)
  .101      .1010…    .1011…    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most
   1 + ⌈log2(1/s)⌉  =  1 + ⌈log2 ∏i (1/pi)⌉
   ≤  2 + Σi=1..n log2(1/pi)
   =  2 + Σk=1..|S| n·pk·log2(1/pk)
   =  2 + n·H0    bits

In practice it is nH0 + 0.02·n bits,
because of rounding.

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

[figure: the ATB as a state machine — from the current interval (L,s) and the input pair (symbol c, distribution (p1,…,p|S|)) it produces the new interval (L’,s’)]

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
[figure: PPM feeds the ATB with the pair (s, p[s | context]), where s is either a character c or esc, turning (L,s) into (L’,s’)]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
String = ACCBACCACBA B        k = 2

Context ∅:    A = 4   B = 2   C = 5   $ = 3

Context A:    C = 3   $ = 1
Context B:    A = 2   $ = 1
Context C:    A = 1   B = 2   C = 2   $ = 3

Context AC:   B = 1   C = 2   $ = 2
Context BA:   C = 1   $ = 1
Context CA:   C = 1   $ = 1
Context CB:   A = 2   $ = 1
Context CC:   A = 1   B = 1   $ = 2
You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
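
A sketch of the LZW coding loop in C with a tiny linear-scan dictionary. To keep the output readable, the dictionary is seeded with just a, b, c (instead of all 256 ASCII codes), so the emitted ids differ from those in the next slide.

  #include <stdio.h>
  #include <string.h>

  char dict[64][16];
  int  ndict = 0;

  int find(const char *s) {
      for (int i = 0; i < ndict; i++)
          if (strcmp(dict[i], s) == 0) return i;
      return -1;
  }

  int main(void) {
      const char *T = "aabaacababacb";
      strcpy(dict[ndict++], "a"); strcpy(dict[ndict++], "b"); strcpy(dict[ndict++], "c");
      char S[16] = "", Sc[16];
      for (size_t i = 0; i < strlen(T); i++) {
          snprintf(Sc, sizeof Sc, "%s%c", S, T[i]);
          if (find(Sc) >= 0) { strcpy(S, Sc); continue; }   /* extend the match       */
          printf("%d ", find(S));                           /* emit the id of S       */
          strcpy(dict[ndict++], Sc);                        /* add Sc to dictionary   */
          S[0] = T[i]; S[1] = '\0';                         /* restart from T[i]      */
      }
      printf("%d\n", find(S));                              /* flush the last match   */
      return 0;
  }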

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
We are given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
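
The same inversion in C for the running example: LF is computed by counting, and the text is rebuilt backward starting from the final # (a sketch under the assumption that # is the unique, smallest terminator).

  #include <stdio.h>
  #include <string.h>

  int main(void) {
      const char *L = "ipssm#pissii";                 /* BWT of mississippi#           */
      int n = strlen(L), cnt[256] = {0}, first[256] = {0}, seen[256] = {0}, LF[64];
      for (int i = 0; i < n; i++) cnt[(unsigned char)L[i]]++;
      for (int c = 1, sum = 0; c < 256; c++)          /* first[c] = #chars < c in L    */
          first[c] = (sum += cnt[c-1]);
      for (int i = 0; i < n; i++) {                   /* LF[i] = row of F holding L[i] */
          unsigned char c = (unsigned char)L[i];
          LF[i] = first[c] + seen[c]++;
      }
      char T[64];
      T[n] = '\0'; T[n-1] = '#';                      /* the text ends with #          */
      int r = 0;                                      /* row 0 is "#mississippi"       */
      for (int i = n - 2; i >= 0; i--) {              /* L[r] precedes F[r] in T       */
          T[i] = L[r];
          r = LF[r];
      }
      printf("%s\n", T);                              /* mississippi#                  */
      return 0;
  }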

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Q(n2 log n) time in the worst-case
• Q(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in-degree(u) = k ]  ∝  1/kα ,   α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1
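
A tiny sketch of the gap encoding of a successor list with this convention: the first gap s1−x may be negative and is mapped to a natural number (2v if v ≥ 0, 2|v|−1 otherwise), the following gaps are si−si-1−1; the node and its list below are illustrative. The resulting integers would then be fed to a γ/δ-like integer coder.

  #include <stdio.h>

  unsigned long to_nat(long v) {                      /* signed gap -> unsigned value  */
      return v >= 0 ? 2UL * (unsigned long)v : 2UL * (unsigned long)(-v) - 1;
  }

  int main(void) {
      long x = 15;                                             /* source node          */
      long succ[] = {13, 15, 16, 17, 18, 19, 23, 24, 203};     /* illustrative list    */
      int  k = (int)(sizeof succ / sizeof *succ);
      printf("%lu ", to_nat(succ[0] - x));                     /* first gap may be < 0 */
      for (int i = 1; i < k; i++)
          printf("%ld ", succ[i] - succ[i-1] - 1);             /* remaining gaps >= 0  */
      printf("\n");
      return 0;
  }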

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides an efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
            Emacs size    Emacs time
uncompr     27Mb          ---
gzip        8Mb           35 secs
zdelta      1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[figure: weighted graph GF over the files plus the dummy node 0; edge weights (e.g. 20, 123, 220, 620, 2000) are zdelta sizes, and the min branching picks one reference per file]

            space    time
uncompr     30Mb     ---
tgz         20%      linear
THIS        8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
            space     time
uncompr     260Mb     ---
tgz         12%       2 mins
THIS        8%        16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
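
For intuition, a sketch of a 32-bit rolling checksum in the spirit of rsync's weak hash: a is the sum of the bytes in the window, b the sum of the running a's, and both slide in O(1) per position. Constants and details are illustrative, not rsync's exact implementation.

  #include <stdio.h>
  #include <stdint.h>
  #include <string.h>

  int main(void) {
      const unsigned char *buf = (const unsigned char *)"the quick brown fox jumps";
      size_t n = strlen((const char *)buf), B = 8;             /* block size           */
      uint32_t a = 0, b = 0;
      for (size_t i = 0; i < B; i++) { a += buf[i]; b += a; }  /* first window         */
      printf("window 0: a=%u b=%u\n", a, b);
      for (size_t k = 1; k + B <= n; k++) {                    /* slide by one byte    */
          a = a - buf[k-1] + buf[k+B-1];
          b = b - (uint32_t)B * buf[k-1] + a;
          printf("window %zu: a=%u b=%u\n", k, a, b);
      }
      return 0;
  }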

Rsync: some experiments

            gcc size    emacs size
total       27288       27326
gzip        7563        8577
zdelta      227         1431
rsync       964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

[figure: P drawn as a prefix of the suffix T[i,N]]
Occurrences of P in T = All suffixes of T having P as a prefix
Example:  P = si,  T = mississippi  ⟹  occurrences at positions 4 and 7
SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[figure: the suffix tree of T# = mississippi# (positions 1…12); leaves store the starting positions of the suffixes, and edges carry substring labels such as “ssi”, “ppi#”, “si”, “i#”]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
5

Θ(N²) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
⟹ In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
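
A C sketch of the indirect binary search (lower-bound flavour) on the suffix array of the running example; each comparison costs O(p) character comparisons.

  #include <stdio.h>
  #include <string.h>

  int main(void) {
      const char *T = "mississippi#", *P = "si";
      int SA[] = {12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3};      /* 1-based positions    */
      int n = 12, lo = 0, hi = n;
      size_t p = strlen(P);
      while (lo < hi) {                                        /* first suffix >= P    */
          int mid = (lo + hi) / 2;
          if (strncmp(T + SA[mid] - 1, P, p) < 0) lo = mid + 1;
          else hi = mid;
      }
      for (int i = lo; i < n && strncmp(T + SA[i] - 1, P, p) == 0; i++)
          printf("occurrence at position %d\n", SA[i]);        /* contiguous in SA     */
      return 0;
  }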

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does it exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does it exist a substring of length ≥ L occurring ≥ C times ?
• Search for Lcp[i,i+C-2] whose entries are ≥ L


Slide 118

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with i-th bit of U(T[j]) to estabilish if both are true

An example j=1
1 2 3 4 5 6 7 8 9 10

n
m

1

12345

1

0

P=abaac

2

0

3

0

4

0

5

0

T=xabxabaaca

0
 
0
U (x)  0
 
0
 
0

2

3

4

5

6

7

1 
 
0
BitShift ( M ( 0 )) & U (T [1])   0  &
 
0
 
0

8

9

0 0
   
0 0
0  0
   
0 0
   
0 0

1
0

An example j=2
1 2 3 4 5 6 7 8 9 10

n
m

1

2

12345

1

0

1

P=abaac

2

0

0

3

0

0

4

0

0

5

0

0

T=xabxabaaca

1 
 
0
U (a )   1 
 
1 
 
0

3

4

5

6

7

1 
 
0
BitShift ( M (1)) & U (T [ 2 ])   0  &
 
0
 
0

8

9

1  1 
   
0 0
1    0 
   
1   0 
   
0 0

1
0

An example j=3
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

12345

1

0

1

0

P=abaac

2

0

0

1

3

0

0

0

4

0

0

0

5

0

0

0

T=xabxabaaca

0
 
1 
U (b )   0 
 
0
 
0

4

5

6

7

1 
 
1 
BitShift ( M ( 2 )) & U (T [ 3 ])   0  &
 
0
 
0

8

9

0 0
   
1  1 
0  0
   
0 0
   
0 0

1
0

An example j=9
T = xabxabaaca,  P = abaac,  T[9] = c,  U(c) = (0,0,0,0,1)^T

   M(9) = BitShift( M(8) ) & U(T[9]) = (1,1,0,0,1)^T & (0,0,0,0,1)^T = (0,0,0,0,1)^T

M(1..9):    j = 1 2 3 4 5 6 7 8 9
       i=1:     0 1 0 0 1 0 1 1 0
       i=2:     0 0 1 0 0 1 0 0 0
       i=3:     0 0 0 0 0 0 1 0 0
       i=4:     0 0 0 0 0 0 0 1 0
       i=5:     0 0 0 0 0 0 0 0 1

M(5,9) = 1: an occurrence of P ends at position 9.

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.
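As a reference, a minimal Shift-And sketch (Python integers play the role of the w-bit
machine word, so the same code also covers m > w, only more slowly; names are illustrative):

def shift_and(P, T):
    m = len(P)
    U = {}
    for i, c in enumerate(P):                # U(c): bit i set iff P[i+1] == c
        U[c] = U.get(c, 0) | (1 << i)
    last = 1 << (m - 1)                      # bit of row m: a set bit = full occurrence
    M, occ = 0, []
    for j, c in enumerate(T):
        M = ((M << 1) | 1) & U.get(c, 0)     # M(j) = BitShift(M(j-1)) & U(T[j])
        if M & last:
            occ.append(j - m + 2)            # 1-based starting position
    return occ

# shift_and("abaac", "xabxabaaca") -> [5], the occurrence found in the example above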

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: Another solution
Dictionary = { bzip, not, or, space },  P = bzip = 1a 0b,  S = "bzip or not bzip"
[Slide figure: the Shift-And automaton is run directly over the compressed text C(S);
byte-aligned codewords are checked one at a time (yes/no at each candidate position).]
Speed ≈ Compression ratio

Problem 2
Dictionary = { bzip, not, or, space },  S = "bzip or not bzip"
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring.
P = o  →  matching terms:  not = 1g 0g 0a,   or = 1g 0a 0b
[Slide figure: each matching term's codeword is then searched in the compressed text C(S).]
Speed ≈ Compression ratio? No! Why? A scan of C(S) is needed for each term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Slide figure: a text T = A B C A C A B D D A B A with the occurrences of P1 and P2 marked.]

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

S is the concatenation of the patterns in P.
R is a bitmap of length m: R[i] = 1 iff S[i] is the first symbol of a pattern.

Use a variant of the Shift-And method searching for S:
  For any symbol c, U'(c) = U(c) AND R
    U'(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern.
  For any step j,
    compute M(j);
    then OR it with U'(T[j]). Why? To set to 1 the first bit of each pattern that starts with T[j].
    Check if there are occurrences ending in j. How? (Test the bits of the last position of
    each pattern; see the sketch below.)
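A sketch of this multi-pattern variant (the extra bitmap F, marking the last symbol of each
pattern and used to answer the final question, is an assumption of this sketch, not in the slides):

def multi_shift_and(patterns, T):
    S = "".join(patterns)
    U, R, F, ends, pos = {}, 0, 0, {}, 0
    for k, P in enumerate(patterns):
        R |= 1 << pos                        # first symbol of pattern k
        F |= 1 << (pos + len(P) - 1)         # last symbol of pattern k
        ends[pos + len(P) - 1] = k
        pos += len(P)
    for i, c in enumerate(S):
        U[c] = U.get(c, 0) | (1 << i)
    M, occ = 0, []
    for j, c in enumerate(T):
        Uc = U.get(c, 0)
        M = (((M << 1) | 1) & Uc) | (Uc & R)     # OR with U'(T[j]) = U(T[j]) AND R
        hits = M & F                             # occurrences ending at position j
        while hits:
            b = hits & -hits
            occ.append((ends[b.bit_length() - 1], j + 1))   # (pattern index, 1-based end)
            hits ^= b
    return occ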

Problem 3
Dictionary = { bzip, not, or, space },  S = "bzip or not bzip"
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a
substring, allowing at most k mismatches.
P = bot,  k = 2
[Slide figure: the dictionary terms and the compressed text C(S), as in the previous problems.]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal: given k, find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
there are no more than l mismatches between the first i characters of P
and the i characters of T ending at character j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

[Slide figure: P[1..i-1] aligned with T[j-i+1..j-1], with at most l mismatches (marked *);
then P[i] = T[j].]

This case contributes:   BitShift( M^l(j-1) )  AND  U( T[j] )

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

[Slide figure: P[1..i-1] aligned with T[j-i+1..j-1], with at most l-1 mismatches (marked *);
the pair (P[i], T[j]) is a mismatch.]

This case contributes:   BitShift( M^(l-1)(j-1) )

Computing Ml

We compute M^l for all l = 0, …, k.
For each j we compute M^0(j), M^1(j), …, M^k(j).
For all l, initialize M^l(0) to the zero vector.
In order to compute M^l(j), we observe that there is a match iff case 1 or case 2 above holds:

   M^l(j)  =  [ BitShift( M^l(j-1) ) AND U( T[j] ) ]  OR  BitShift( M^(l-1)(j-1) )

Example M1
T = xabxabaaca,  P = abaad,  k = 1

M1 =   j: 1 2 3 4 5 6 7 8 9 10
     i=1: 1 1 1 1 1 1 1 1 1 1
     i=2: 0 0 1 0 0 1 0 1 1 0
     i=3: 0 0 0 1 0 0 1 0 0 1
     i=4: 0 0 0 0 1 0 0 1 0 0
     i=5: 0 0 0 0 0 0 0 0 1 0

M0 =   j: 1 2 3 4 5 6 7 8 9 10
     i=1: 0 1 0 0 1 0 1 1 0 1
     i=2: 0 0 1 0 0 1 0 0 0 0
     i=3: 0 0 0 0 0 0 1 0 0 0
     i=4: 0 0 0 0 0 0 0 1 0 0
     i=5: 0 0 0 0 0 0 0 0 0 0

M1(5,9) = 1: P occurs at T[5..9] = abaac with one mismatch.

How much do we pay?

The running time is O( k n (1 + m/w) ).
Again, the method is practically efficient for small m.
Only O(k) columns of the matrices are needed at any given time: the space used by the
algorithm is O(k) memory words. (See the sketch below.)
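A bit-parallel sketch of the k-mismatch recurrence above (again Python ints as machine
words; names are illustrative):

def agrep_mismatches(P, T, k):
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    last = 1 << (m - 1)
    M = [0] * (k + 1)                        # M[l] holds the column M^l(j)
    occ = []
    for j, c in enumerate(T):
        Uc = U.get(c, 0)
        prev = M[0]                          # M^(l-1)(j-1), initially for l = 1
        M[0] = ((M[0] << 1) | 1) & Uc
        for l in range(1, k + 1):
            cur = M[l]
            # case 1: extend with a matching char; case 2: spend one more mismatch
            M[l] = (((M[l] << 1) | 1) & Uc) | ((prev << 1) | 1)
            prev = cur
        if M[k] & last:
            occ.append(j - m + 2)            # 1-based start of a match with <= k mismatches
    return occ

# agrep_mismatches("abaad", "xabxabaaca", 1) -> [5], as in the M1 example above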

Problem 3: Solution
Dictionary = { bzip, not, or, space },  S = "bzip or not bzip",  P = bot, k = 2
not = 1g 0g 0a
[Slide figure: the dictionary term "not" matches P within k mismatches; its codeword is then
searched in the compressed text C(S), as in the previous problems.]

Agrep: more sophisticated operations

The Shift-And method can solve other ops.

The edit distance between two strings p and s is d(p,s) = the minimum number of operations
needed to transform p into s via three ops:
  Insertion: insert a symbol in p
  Deletion: delete a symbol from p
  Substitution: change a symbol of p into a different one

Example: d(ananas, banane) = 3

Search by regular expressions
  Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding

   g(x) = 000...0  followed by  x in binary
          \_____/
          Length-1 zeroes

x > 0 and Length = floor(log2 x) + 1
e.g., 9 is represented as <000,1001>.

The g-code for x takes 2*floor(log2 x) + 1 bits   (i.e. a factor of 2 from optimal)
It is optimal for Pr(x) = 1/(2x^2), and i.i.d. integers.
It is a prefix-free encoding…


Given the following sequence of g-coded integers, reconstruct the original sequence:

   0001000001100110000011101100111

   →  8, 6, 3, 59, 7
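A small g-code sketch matching the definition above (bit-strings for clarity):

def gamma_encode(x):                         # x > 0
    b = bin(x)[2:]                           # x in binary, Length = len(b)
    return "0" * (len(b) - 1) + b            # Length-1 zeroes, then x in binary

def gamma_decode_stream(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":                # leading zeroes = Length - 1
            z += 1; i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out

# gamma_encode(9) == "0001001"
# gamma_decode_stream("0001000001100110000011101100111") == [8, 6, 3, 59, 7]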

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code g(i).

Recall that:  |g(i)| ≤ 2*log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2*H0(s) + 1
Key fact:
   1 ≥ Si=1,...,x pi ≥ x * px   ⟹   x ≤ 1/px

How good is it ?
Encode the integers via g-coding:  |g(i)| ≤ 2*log i + 1
The cost of the encoding is (recall i ≤ 1/pi):

   Si=1,...,|S| pi * |g(i)|  ≤  Si=1,...,|S| pi * [ 2*log(1/pi) + 1 ]  =  2*H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...
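A sketch of End-Tagged Dense Coding for a rank r ≥ 0 (assumptions of this sketch: 7 payload
bits per byte and the tag bit set on the last byte; the exact byte layout in the original
papers may differ):

def etdc_encode(r):
    digits = []
    while True:
        digits.append(r % 128)
        r //= 128
        if r == 0:
            break
        r -= 1                   # dense: 2-byte codewords start right after the 128 one-byte ones
    digits.reverse()
    digits[-1] |= 0x80           # tag the last byte
    return bytes(digits)

# ranks 0..127 take 1 byte; ranks 128..16511 take 2 bytes (128 + 128^2 codewords), and so on.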

Optimal (s,c)-dense codes
Find the optimal s, assuming c = 256 - s.

Brute-force approach
Binary search: on real distributions, there seems to be a unique minimum

  Ks  = max codeword length
  Fsk = cumulative probability of the symbols whose |codeword| ≤ k

Experiments: (s,c)-DC is quite interesting…

  Search is 6% faster than byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1^n 2^n 3^n … n^n  ⟹  Huff = O(n^2 log n),  MTF = O(n log n) + n^2

Not much worse than Huffman ...but it may be far better
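A direct sketch of the MTF transform just described (list implementation, O(|S|) per symbol;
the search-tree or hash-table variants of the next slides bring this down):

def mtf_encode(msg, alphabet):
    L = list(alphabet)                       # start with the list of symbols L = [a,b,c,...]
    out = []
    for s in msg:
        i = L.index(s)                       # 1) output the position of s in L (0-based here)
        out.append(i)
        L.pop(i); L.insert(0, s)             # 2) move s to the front of L
    return out

# mtf_encode("mississippi", "imps") -> [1, 1, 3, 0, 1, 1, 0, 1, 3, 0, 1]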

MTF: how good is it ?
Encode the integers via g-coding:  |g(i)| ≤ 2*log i + 1
Put S in front of the sequence and consider the cost of encoding:

   O(|S| log |S|)  +  Sx=1,...,|S|  Si=2,...,nx  | g( p_i^x - p_{i-1}^x ) |

By Jensen's inequality:

   ≤  O(|S| log |S|)  +  Sx=1,...,|S|  nx * [ 2*log(N/nx) + 1 ]
   =  O(|S| log |S|)  +  N * [ 2*H0(X) + 1 ]

Hence  La[mtf]  ≤  2*H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
   abbbaacccca  =>  (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings ⟹ just the run lengths and one starting bit

Properties:
  It exploits spatial locality, and it is a dynamic code (there is a memory)
  X = 1^n 2^n 3^n … n^n  ⟹  Huff(X) = Θ(n^2 log n)  >  Rle(X) = Θ(n log n)
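A one-pass RLE sketch, matching the example above:

def rle(s):
    out, i = [], 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:   # extend the current run
            j += 1
        out.append((s[i], j - i))
        i = j
    return out

# rle("abbbaacccca") -> [('a',1), ('b',3), ('a',2), ('c',4), ('a',1)]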

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive):

   f(i) = Sj<i p(j)

e.g.  p(a) = .2, p(b) = .5, p(c) = .3   ⟹   f(a) = .0, f(b) = .2, f(c) = .7
      a → [0, .2)     b → [.2, .7)     c → [.7, 1.0)

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac   (p(a)=.2, p(b)=.5, p(c)=.3)

   start:    [0, 1)
   after b:  [.2, .7)
   after a:  [.2, .3)
   after c:  [.27, .3)

The final sequence interval is [.27, .3)

Arithmetic Coding
To code a sequence of symbols c1 c2 ... cn with probabilities p[c], use:

   l0 = 0        li = l(i-1) + s(i-1) * f[ci]
   s0 = 1        si = s(i-1) * p[ci]

f[c] is the cumulative probability up to symbol c (not included).
The final interval size is  sn = Pi=1,...,n p[ci]

The interval for a message sequence will be called the sequence interval
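A sketch of the interval computation above, with exact real arithmetic replaced by floats
(illustrative only, rounding is ignored):

def sequence_interval(msg, p, f):
    l, s = 0.0, 1.0                          # l0 = 0, s0 = 1
    for c in msg:
        l = l + s * f[c]                     # li = l(i-1) + s(i-1) * f[ci]
        s = s * p[c]                         # si = s(i-1) * p[ci]
    return l, s                              # the sequence interval is [l, l+s)

p = {'a': .2, 'b': .5, 'c': .3}
f = {'a': .0, 'b': .2, 'c': .7}
# sequence_interval("bac", p, f) -> approximately (0.27, 0.03), i.e. the interval [.27, .3)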

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3 (p(a)=.2, p(b)=.5, p(c)=.3):

   in [0, 1):      .49 ∈ [.2, .7)    →  b
   in [.2, .7):    .49 ∈ [.3, .55)   →  b
   in [.3, .55):   .49 ∈ [.475, .55) →  c

The message is bbc.

Representing a real number
Binary fractional representation:
   .75 = .11       1/3 = .01010101...       11/16 = .1011

Algorithm (binary expansion of x in [0,1)):
   1. x = 2*x
   2. If x < 1 output 0
   3. else x = x - 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
e.g.  [0,.33) → .01      [.33,.66) → .1      [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions:

   code    min       max       interval
   .11     .110...   .111...   [.75, 1.0)
   .101    .1010...  .1011...  [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval is contained in
the sequence interval (a dyadic number).

   Sequence interval: [.61, .79)      Code interval (.101): [.625, .75)  ⊂  [.61, .79)

Can use L + s/2 truncated to 1 + ⌈log(1/s)⌉ bits

Bound on Arithmetic length

Note that  -log s + 1 = log(2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most
   1 + ⌈log(1/s)⌉
   = 1 + ⌈log ∏i (1/pi)⌉
   ≤ 2 + Sj=1,...,n log(1/pj)
   = 2 + Sk=1,...,|S| n pk log(1/pk)
   = 2 + n H0   bits

In practice ≈ n H0 + 0.02 n bits, because of rounding.

Integer Arithmetic Coding
Problem: operations on arbitrary-precision real numbers are expensive.
Key ideas of the integer version:
  Keep integers in the range [0..R) where R = 2^k
  Use rounding to generate the integer interval
  Whenever the sequence interval falls into the top, bottom or middle half, expand the
  interval by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
  If l ≥ R/2 (top half): output 1 followed by m 0s; m = 0; the interval is expanded by 2
  If u < R/2 (bottom half): output 0 followed by m 1s; m = 0; the interval is expanded by 2
  If l ≥ R/4 and u < 3R/4 (middle half): increment m; the interval is expanded by 2
  In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine: given the current interval (L,s), a symbol c, and a distribution
(p1,....,pS), the ATB outputs the new interval (L',s').

   (L,s)  -- c, (p1,....,pS) -->  ATB  -->  (L',s')

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox

   (L,s)  -- s, p[ s | context ] -->  ATB  -->  (L',s')        with  s = c or esc

Encoder and Decoder must know the protocol for selecting the same conditional probability
distribution (PPM-variant)

PPM: Example Contexts
String = ACCBACCACBA B        k = 2

   Context (order 0)      Context (order 1)          Context (order 2)
   Empty: A = 4           A:  C = 3, $ = 1           AC: B = 1, C = 2, $ = 2
          B = 2           B:  A = 2, $ = 1           BA: C = 1, $ = 1
          C = 5           C:  A = 1, B = 2,          CA: C = 1, $ = 1
          $ = 3               C = 2, $ = 3           CB: A = 2, $ = 1
                                                     CC: A = 1, B = 1, $ = 2
You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
[Slide figure: a cursor separates the dictionary (all substrings of the text already scanned)
from the part still to be compressed; the step shown outputs <2,3,c>.]

Algorithm's step:
  Output <d, len, c> where
    d   = distance of the copied string wrt the current position
    len = length of the longest match
    c   = next char in the text beyond the longest match
  Advance by len + 1

A buffer "window" has fixed length and moves over the text

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
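The decoding loop above, made runnable (triple = (d, len, c) as in the previous slides):

def lz77_decode(triples):
    out = []
    for d, length, c in triples:
        for _ in range(length):              # the copy may overlap the part being written
            out.append(out[-d])
        out.append(c)
    return "".join(out)

# lz77_decode([(0,0,'a'),(1,1,'c'),(3,4,'b'),(3,3,'a'),(1,2,'c')]) == "aacaacabcabaaac"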

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
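A compact LZW sketch over bytes, with the one-step-behind decoder and the SSc special case
handled explicitly (dictionary-size limits and code widths are omitted):

def lzw_encode(data):
    dic = {bytes([i]): i for i in range(256)}
    out, S = [], b""
    for x in data:
        Sx = S + bytes([x])
        if Sx in dic:
            S = Sx
        else:
            out.append(dic[S])
            dic[Sx] = len(dic)               # add Sc, but do not send the extra char c
            S = bytes([x])
    if S:
        out.append(dic[S])
    return out

def lzw_decode(codes):
    dic = {i: bytes([i]) for i in range(256)}
    prev = dic[codes[0]]
    out = [prev]
    for code in codes[1:]:
        entry = dic[code] if code in dic else prev + prev[:1]   # the SSc case
        dic[len(dic)] = prev + entry[:1]     # decoder is one step behind the encoder
        out.append(entry)
        prev = entry
    return b"".join(out)

# lzw_decode(lzw_encode(b"aabaacababacb")) == b"aabaacababacb"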

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows                                                      (1994)

   F                 L
   #  mississipp   i
   i  #mississip   p
   i  ppi#missis   s
   i  ssippi#mis   s
   i  ssissippi#   m
   m  ississippi   #
   p  i#mississi   p
   p  pi#mississ   i
   s  ippi#missi   s
   s  issippi#mi   s
   s  sippi#miss   i
   s  sissippi#m   i

L is the last column of the sorted rotation matrix: the BWT of T.

A famous example

Much
longer...

A useful tool: L → F mapping

[Same sorted-rotation matrix as before: the decoder knows the two columns F and L, while
the middle of each row is unknown.]

How do we map L's chars onto F's chars ?
... we need to distinguish equal chars in F...

Take two equal chars of L and rotate their rows rightward by one position:
they keep the same relative order in F !!

The BWT is invertible

[Same F/L matrix as above; the middle of each row is unknown.]

Two key properties:
1. The LF-array maps L's chars to F's chars
2. L[ i ] precedes F[ i ] in T

Reconstruct T backward:   T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}

How to compute the BWT ?

   SA    BWT matrix rows    L
   12    #mississipp        i
   11    i#mississip        p
    8    ippi#missis        s
    5    issippi#mis        s
    2    ississippi#        m
    1    mississippi        #
   10    pi#mississi        p
    9    ppi#mississ        i
    7    sippi#missi        s
    4    sissippi#mi        s
    6    ssippi#miss        i
    3    ssissippi#m        i
We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]
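A tiny sketch tying the two slides together: build SA by plain suffix sorting, derive L via
L[i] = T[SA[i]-1], and invert L with the LF mapping (assumes T ends with a unique smallest
char, here '#'):

def bwt(T):
    SA = sorted(range(len(T)), key=lambda i: T[i:])      # elegant but Theta(n^2 log n)
    L = "".join(T[i - 1] for i in SA)                     # L[i] = T[SA[i]-1], wrapping at i=0
    return L, SA

def inverse_bwt(L, sentinel="#"):
    order = sorted(range(len(L)), key=lambda r: (L[r], r))   # stable sort gives the LF mapping
    LF = [0] * len(L)
    for f_row, r in enumerate(order):
        LF[r] = f_row
    out, r = [], 0                           # row 0 of the sorted matrix starts with the sentinel
    for _ in range(len(L) - 1):              # reconstruct T backward, sentinel excluded
        out.append(L[r])
        r = LF[r]
    return "".join(reversed(out)) + sentinel

# bwt("mississippi#")[0] == "ipssm#pissii"
# inverse_bwt("ipssm#pissii") == "mississippi#"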

How to construct SA from T ?
Input: T = mississippi#

   SA
   12   #
   11   i#
    8   ippi#
    5   issippi#
    2   ississippi#
    1   mississippi#
   10   pi#
    9   ppi#
    7   sippi#
    4   sissippi#
    6   ssippi#
    3   ssissippi#

Elegant but inefficient. Obvious inefficiencies:
  • Θ(n² log n) time in the worst-case
  • Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion pages are available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999 and WebBase crawl, 2001: the indegree follows a power-law distribution

   Pr[ in-degree(u) = k ]  ≈  1 / k^a        a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
[Slide figure: adjacency-matrix plot of a crawl with 21 million pages and 150 million links;
the axes are the URL ids i and j.]

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77 scheme provides an efficient, optimal solution




fknown is the “previously encoded text”: compress the concatenation fknown·fnew, emitting output only from fnew onwards

zdelta is one of the best implementations
             Emacs size   Emacs time
   uncompr   27Mb         ---
   gzip      8Mb          35 secs
   zdelta    1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Slide figure: a small weighted graph over the files plus a dummy node; edge weights are the
zdelta sizes (e.g. 20, 123, 220, 620, 2000), dummy edges are the gzip sizes, and the min
branching picks the cheapest reference for each file.]

             space   time
   uncompr   30Mb    ---
   tgz       20%     linear
   THIS      8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
             space   time
   uncompr   260Mb   ---
   tgz       12%     2 mins
   THIS      8%      16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
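A simplified sketch of the block-matching idea (the real rsync uses the 4-byte rolling
checksum as a cheap pre-filter before the strong hash; here every window is hashed with MD5,
which is correct but slower):

import hashlib

def rsync_delta(f_old, f_new, B=700):
    # client side: one strong hash per block of the old file
    blocks = {hashlib.md5(f_old[i:i+B]).digest(): i // B
              for i in range(0, len(f_old), B)}
    # server side: scan the new file, emitting block references or literal bytes
    delta, i, lit = [], 0, bytearray()
    while i < len(f_new):
        h = hashlib.md5(f_new[i:i+B]).digest()
        if h in blocks:
            if lit:
                delta.append(("lit", bytes(lit))); lit = bytearray()
            delta.append(("copy", blocks[h]))
            i += B
        else:
            lit.append(f_new[i]); i += 1
    if lit:
        delta.append(("lit", bytes(lit)))
    return delta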

Rsync: some experiments

             gcc size   emacs size
   total     27288      27326
   gzip      7563       8577
   zdelta    227        1431
   rsync     964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), and the client checks them.
Server deploys the common fref to compress the new ftar (rsync compresses it on its own).

A multi-round protocol
  k blocks of n/k elems
  log n/k levels
  If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
  The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

[Slide figure: P aligned at position i of T, as a prefix of the suffix T[i,N].]

Occurrences of P in T = all suffixes of T having P as a prefix
   e.g.  T = mississippi,  P = si  →  occurrences at positions 4, 7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
T# = mississippi#
     1 2 3 4 5 6 7 8 9 10 11 12

[Slide figure: the suffix tree of T#; edges are labeled with substrings of T# and the 12
leaves are labeled with the starting positions of the corresponding suffixes.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

SUF(T) takes Θ(N²) space if stored explicitly; store suffix pointers instead:

   SA    SUF(T)
   12    #
   11    i#
    8    ippi#
    5    issippi#
    2    ississippi#
    1    mississippi#
   10    pi#
    9    ppi#
    7    sippi#          ← P = si
    4    sissippi#       ← P = si
    6    ssippi#
    3    ssissippi#

T = mississippi#

Suffix Array space:
 • SA: Θ(N log2 N) bits
 • Text T: N chars
 ⟹ In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step.
At each step P is compared with the suffix T[SA[mid], N]: if P is larger the search continues
in the right half, if P is smaller in the left half.

   T = mississippi#,  P = si

Suffix Array search
 • O(log2 N) binary-search steps
 • Each step takes O(p) char cmp
 ⟹ overall, O(p log2 N) time
   (improvable to O(p + log2 N) [Manber-Myers, ’90] and to O(p + log2 |S|) [Cole et al, ’06])
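A sketch of the indirect binary search over SA (O(p) chars compared per step; the second
search delimits the contiguous range of occurrences, as in the next slide):

def sa_range(T, SA, P):
    lo, hi = 0, len(SA)
    while lo < hi:                           # leftmost suffix whose first |P| chars are >= P
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + len(P)] < P: lo = mid + 1
        else: hi = mid
    left = lo
    lo, hi = left, len(SA)
    while lo < hi:                           # leftmost suffix whose first |P| chars are > P
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + len(P)] <= P: lo = mid + 1
        else: hi = mid
    return left, lo                          # SA[left:lo] are exactly the suffixes prefixed by P

T = "mississippi#"
SA = sorted(range(len(T)), key=lambda i: T[i:])
l, r = sa_range(T, SA, "si")
# sorted(SA[l:r]) == [3, 6], i.e. the occurrences at positions 4 and 7 (1-based)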

Locating the occurrences
occ = 2:  T = mississippi#,  P = si  →  the SA rows of sippi# and sissippi#, i.e. text
positions 4 and 7.
The occurrences form a contiguous range of SA, delimited by the binary searches for si# and
si$ (where # is smaller and $ larger than every character of S).

Suffix Array search:  O(p + log2 N + occ) time
Suffix Trays:  O(p + log2 |S| + occ)   [Cole et al., ‘06]
String B-tree   [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays   [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

   SA    suffix          Lcp with the next suffix
   12    #               0
   11    i#              1
    8    ippi#           1
    5    issippi#        4
    2    ississippi#     0
    1    mississippi#    0
   10    pi#             1
    9    ppi#            0
    7    sippi#          2
    4    sissippi#       1
    6    ssippi#         3
    3    ssissippi#

T = mississippi#   (e.g. issippi and ississippi share the prefix "issi", of length 4)

• How long is the common prefix between T[i,...] and T[j,...] ?
  • Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does it exist a repeated substring of length ≥ L ?
  • Search for Lcp[i] ≥ L
• Does it exist a substring of length ≥ L occurring ≥ C times ?
  • Search for a run Lcp[i,i+C-2] whose entries are all ≥ L


or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with i-th bit of U(T[j]) to estabilish if both are true

An example j=1
1 2 3 4 5 6 7 8 9 10

n
m

1

12345

1

0

P=abaac

2

0

3

0

4

0

5

0

T=xabxabaaca

0
 
0
U (x)  0
 
0
 
0

2

3

4

5

6

7

1 
 
0
BitShift ( M ( 0 )) & U (T [1])   0  &
 
0
 
0

8

9

0 0
   
0 0
0  0
   
0 0
   
0 0

1
0

An example j=2
1 2 3 4 5 6 7 8 9 10

n
m

1

2

12345

1

0

1

P=abaac

2

0

0

3

0

0

4

0

0

5

0

0

T=xabxabaaca

1 
 
0
U (a )   1 
 
1 
 
0

3

4

5

6

7

1 
 
0
BitShift ( M (1)) & U (T [ 2 ])   0  &
 
0
 
0

8

9

1  1 
   
0 0
1    0 
   
1   0 
   
0 0

1
0

An example j=3
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

12345

1

0

1

0

P=abaac

2

0

0

1

3

0

0

0

4

0

0

0

5

0

0

0

T=xabxabaaca

0
 
1 
U (b )   0 
 
0
 
0

4

5

6

7

1 
 
1 
BitShift ( M ( 2 )) & U (T [ 3 ])   0  &
 
0
 
0

8

9

0 0
   
1  1 
0  0
   
0 0
   
0 0

1
0

An example j=9
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

4

5

6

7

8

9

12345

1

0

1

0

0

1

0

1

1

0

P=abaac

2

0

0

1

0

0

1

0

0

0

3

0

0

0

0

0

0

1

0

0

4

0

0

0

0

0

0

0

1

0

5

0

0

0

0

0

0

0

0

1

T=xabxabaaca

0
 
0
U (c )   0 
 
0
 
1 

1 
 
1 
BitShift ( M ( 8 )) & U (T [ 9 ])   0  &
 
0
 
1 

0 0
   
0 0
0  0
   
0 0
   
1  1 

1
0

Shift-And method: Complexity








If m ≤ w, any column and any vector U() fit in a memory word, so any step requires O(1) time.
If m > w, any column and any vector U() can be divided into ⌈m/w⌉ memory words, so any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when the pattern length is close to the word size, which is very often the case in practice. Recall that w = 64 bits in modern architectures.

Some simple extensions


We want to allow the pattern to contain special symbols, like [a-f] classes of chars.
Example: P = [a-b]baac

  U(a) = (1,0,1,1,0)    U(b) = (1,1,0,0,0)    U(c) = (0,0,0,0,1)

What about '?', '[^…]' (not)?

Problem 1: Another solution

Dictionary = {bzip, not, or, space};  S = "bzip or not bzip";  P = bzip = 1a 0b

[Figure: the compressed text C(S) is scanned directly for the tagged codeword of P (1a 0b); each candidate alignment is marked yes/no.]

Speed ≈ Compression ratio

Problem 2
Dictionary = {bzip, not, or, space};  S = "bzip or not bzip";  P = o

Given a pattern P, find all the occurrences in S of all terms containing P as a substring.

[Figure: C(S) is scanned once per dictionary term containing P: here not = 1g 0g 0a and or = 1g 0a 0b.]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: a text T = ABCACABDDABA with the occurrences of P1 and P2 highlighted.]

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

S is the concatenation of the patterns in P.
R is a bitmap of length m: R[i] = 1 iff S[i] is the first symbol of a pattern.
Use a variant of the Shift-And method searching for S:
- For any symbol c, U'(c) = U(c) AND R, so U'(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern.
- For any step j:
  compute M(j);
  then OR it with U'(T[j]). Why? This sets to 1 the first bit of each pattern that starts with T[j].
  Check if there are occurrences ending in j. How? (A sketch follows below.)
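A sketch of this multi-pattern variant in C, assuming the total pattern length fits in one 64-bit word; R marks the first position of each pattern and F (an extra mask introduced here for the final check) marks the last one. Names are illustrative.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

void multi_shift_and(const char *T, const char **P, int npat) {
    uint64_t U[256] = {0}, R = 0, F = 0;    /* R: first positions, F: last positions  */
    int m = 0;
    for (int p = 0; p < npat; p++) {        /* S = P[0] P[1] ... concatenated         */
        R |= 1ULL << m;
        for (size_t i = 0; i < strlen(P[p]); i++, m++)
            U[(unsigned char)P[p][i]] |= 1ULL << m;
        F |= 1ULL << (m - 1);
    }
    uint64_t M = 0;
    for (size_t j = 0; T[j]; j++) {
        uint64_t Uc = U[(unsigned char)T[j]];
        M = ((M << 1) & Uc) | (Uc & R);     /* shift-and, then restart every pattern
                                               whose first symbol equals T[j]          */
        if (M & F)                          /* some pattern ends at position j+1       */
            printf("a pattern ends at position %zu\n", j + 1);
    }
}

int main(void) {
    const char *pats[] = {"abaac", "ba"};
    multi_shift_and("xabxabaaca", pats, 2); /* reports positions 7 (ba) and 9 (abaac)  */
    return 0;
}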

Problem 3
Dictionary = {bzip, not, or, space};  S = "bzip or not bzip";  P = bot, k = 2

Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing at most k mismatches.

[Figure: the dictionary and the compressed text C(S), as in the previous problems.]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position 4; it also occurs with 4 mismatches starting at position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal: given k, find all the occurrences of P in T with up to k mismatches.
We define the matrix M^l to be an m by n binary matrix such that:

M^l(i,j) = 1 iff the first i characters of P match the i characters of T ending at character j with no more than l mismatches.

What is M^0?
How does M^k solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending at j-1, with at most l mismatches, and the next pair of characters in P and T are equal. This contributes the term

  BitShift(M^l(j-1)) & U(T[j])

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending at j-1, with at most l-1 mismatches (and T[j] is charged as the l-th mismatch). This contributes the term

  BitShift(M^{l-1}(j-1))

Computing Ml


We compute M^l for all l = 0, …, k.
For each j compute M^0(j), M^1(j), …, M^k(j).
For all l initialize M^l(0) to the zero vector.
In order to compute M^l(j), we observe that there is a match iff

  M^l(j) = [ BitShift(M^l(j-1)) & U(T[j]) ]  OR  BitShift(M^{l-1}(j-1))
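A sketch in C of this k-mismatch recurrence (Hamming distance only), again assuming m ≤ 64; the bound KMAX, the function name and the test in main are illustrative.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define KMAX 8

void agrep_mismatch(const char *T, const char *P, int k) {
    size_t n = strlen(T), m = strlen(P);
    uint64_t U[256] = {0};
    for (size_t i = 0; i < m; i++)
        U[(unsigned char)P[i]] |= 1ULL << i;

    uint64_t M[KMAX + 1] = {0};             /* M[l] holds the current column of M^l   */
    uint64_t last = 1ULL << (m - 1);
    for (size_t j = 0; j < n; j++) {
        uint64_t prev = 0;                  /* M^{l-1}(j-1); 0 when l = 0             */
        for (int l = 0; l <= k; l++) {
            uint64_t old = M[l];            /* M^l(j-1)                               */
            M[l] = (((old << 1) | 1ULL) & U[(unsigned char)T[j]])   /* case 1 */
                 | ((prev << 1) | (l ? 1ULL : 0));                  /* case 2 */
            prev = old;
        }
        if (M[k] & last)
            printf("occurrence with at most %d mismatches ends at %zu\n", k, j + 1);
    }
}

int main(void) {
    agrep_mismatch("xabxabaaca", "abaad", 1);   /* reports position 9, as in the example */
    return 0;
}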

Example M1
T = xabxabaaca, P = abaad

M1 =
        x a b x a b a a c a
        1 2 3 4 5 6 7 8 9 10
  a 1   1 1 1 1 1 1 1 1 1 1
  b 2   0 0 1 0 0 1 0 1 1 0
  a 3   0 0 0 1 0 0 1 0 0 1
  a 4   0 0 0 0 1 0 0 1 0 0
  d 5   0 0 0 0 0 0 0 0 1 0

M0 =
        x a b x a b a a c a
        1 2 3 4 5 6 7 8 9 10
  a 1   0 1 0 0 1 0 1 1 0 1
  b 2   0 0 1 0 0 1 0 0 0 0
  a 3   0 0 0 0 0 0 1 0 0 0
  a 4   0 0 0 0 0 0 0 1 0 0
  d 5   0 0 0 0 0 0 0 0 0 0

How much do we pay?





The running time is O(kn(1+m/w)).
Again, the method is practically efficient for small m.
Still, only O(k) columns of M are needed at any given time. Hence, the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary = {bzip, not, or, space};  S = "bzip or not bzip";  P = bot, k = 2

Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing k mismatches.

[Figure: the dictionary terms within k mismatches of P are identified (here not = 1g 0g 0a) and their codewords are searched for in C(S).]

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Delection: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding:  000……0 x-in-binary, with Length-1 leading zeros,
where x > 0 and Length = ⌊log2 x⌋ + 1.
e.g., 9 is represented as <000,1001>.

The γ-code for x takes 2⌊log2 x⌋ + 1 bits (i.e., a factor of 2 from optimal).

Optimal for Pr(x) = 1/(2x²), and i.i.d. integers.

It is a prefix-free encoding…


Given the following sequence of γ-coded integers, reconstruct the original sequence:

0001000001100110000011101100111

Answer: 8, 6, 3, 59, 7
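A small decoding sketch in C; reading the γ-codes from a '0'/'1' character string keeps the example self-contained (names are illustrative).

#include <stdio.h>
#include <string.h>

void gamma_decode(const char *bits) {
    size_t i = 0, n = strlen(bits);
    while (i < n) {
        int len = 0;
        while (i < n && bits[i] == '0') { len++; i++; }     /* Length-1 leading zeros */
        unsigned long x = 0;
        for (int b = 0; b <= len && i < n; b++, i++)        /* read Length value bits */
            x = (x << 1) | (unsigned long)(bits[i] - '0');
        printf("%lu ", x);
    }
    printf("\n");
}

int main(void) {
    gamma_decode("0001000001100110000011101100111");        /* prints: 8 6 3 59 7 */
    return 0;
}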

Analysis
Sort the p_i in decreasing order, and encode s_i via the variable-length code γ(i).

Recall that: |γ(i)| ≤ 2 log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2 H0(s) + 1
Key fact:  1 ≥ Σ_{i=1,…,x} p_i ≥ x · p_x   ⟹   x ≤ 1/p_x

How good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2 log i + 1.
The cost of the encoding is (recall i ≤ 1/p_i):

  Σ_{i=1,…,S} p_i · |γ(i)|  ≤  Σ_{i=1,…,S} p_i · [ 2 log(1/p_i) + 1 ]  =  2 H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + …

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 128² = 16512 words on at most 2 bytes
(230,26)-dense code encodes 230 + 230·26 = 6210 words on at most 2 bytes, hence more words on 1 byte; thus, if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, there seems to be a unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC is quite interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1^n 2^n 3^n … n^n  ⟹  Huff = O(n² log n), MTF = O(n log n) + n²

Not much worse than Huffman
...but it may be far better
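A minimal MTF encoder sketch in C over a byte alphabet; the list L is kept as a plain array, which costs O(|Σ|) per symbol but is enough to illustrate the transform (names and the test string are illustrative).

#include <stdio.h>
#include <string.h>

void mtf_encode(const char *s) {
    unsigned char L[256];
    for (int i = 0; i < 256; i++) L[i] = (unsigned char)i;   /* initial symbol list   */
    for (; *s; s++) {
        unsigned char c = (unsigned char)*s;
        int pos = 0;
        while (L[pos] != c) pos++;          /* 1) output the position of s in L       */
        printf("%d ", pos);
        memmove(L + 1, L, (size_t)pos);     /* 2) move s to the front of L            */
        L[0] = c;
    }
    printf("\n");
}

int main(void) {
    mtf_encode("aabbbba");                  /* temporal locality gives small integers */
    return 0;
}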

MTF: how good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2 log i + 1.
Put S in front (i.e. pay O(|S| log |S|) for the first occurrences) and consider the cost of encoding:

  O(|S| log |S|)  +  Σ_{x=1,…,|S|}  Σ_{i=2,…,n_x}  |γ( p_i^x − p_{i−1}^x )|

where p_1^x < p_2^x < … are the positions of the occurrences of symbol x. By Jensen's inequality:

  ≤  O(|S| log |S|)  +  Σ_{x=1,…,|S|}  n_x · [ 2 log(N/n_x) + 1 ]
  =  O(|S| log |S|)  +  N · [ 2 H0(X) + 1 ]

so  L_a[mtf] ≤ 2 H0(X) + O(1).

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
There is a memory
X = 1^n 2^n 3^n … n^n  ⟹  Huff(X) = Θ(n² log n) > Rle(X) = Θ(n log n)
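A tiny run-length encoder sketch in C matching the example above (illustrative).

#include <stdio.h>

void rle(const char *s) {
    while (*s) {
        char c = *s;
        int run = 0;
        while (*s == c) { run++; s++; }     /* length of the current run of c */
        printf("(%c,%d)", c, run);
    }
    printf("\n");
}

int main(void) {
    rle("abbbaacccca");                     /* prints (a,1)(b,3)(a,2)(c,4)(a,1) */
    return 0;
}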

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g. with p(a) = .2, p(b) = .5, p(c) = .3, the cumulative function

  f(i) = Σ_{j < i} p(j)

gives f(a) = .0, f(b) = .2, f(c) = .7, i.e. a → [0,.2), b → [.2,.7), c → [.7,1.0).

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
Start from [0,1). Coding 'b' restricts to its symbol interval [.2,.7); within it, 'a' restricts to [.2,.3); within that, 'c' restricts to [.27,.3).
The final sequence interval is [.27,.3)
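A sketch in C of the interval computation for this example, with the model p(a)=.2, p(b)=.5, p(c)=.3 and f(a)=0, f(b)=.2, f(c)=.7 hard-coded; real coders use the integer version discussed later.

#include <stdio.h>
#include <string.h>

int main(void) {
    double p[] = {0.2, 0.5, 0.3};           /* p(a), p(b), p(c)                        */
    double f[] = {0.0, 0.2, 0.7};           /* cumulative probabilities                */
    const char *msg = "bac";

    double l = 0.0, s = 1.0;                /* l_0 = 0, s_0 = 1                        */
    for (size_t i = 0; i < strlen(msg); i++) {
        int c = msg[i] - 'a';
        l = l + s * f[c];                   /* l_i = l_{i-1} + s_{i-1} * f(c_i)        */
        s = s * p[c];                       /* s_i = s_{i-1} * p(c_i)                  */
        printf("after '%c': [%g, %g)\n", msg[i], l, l + s);
    }
    return 0;                               /* final interval for "bac": [0.27, 0.3)   */
}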

Arithmetic Coding
To code a sequence of symbols c_1 c_2 … c_n with probabilities p[c] use the following:

  l_0 = 0,   s_0 = 1
  l_i = l_{i-1} + s_{i-1} · f[c_i]
  s_i = s_{i-1} · p[c_i]

f[c] is the cumulative prob. up to symbol c (not included).
The final interval size is

  s_n = Π_{i=1,…,n} p[c_i]
The interval for a message sequence will be called the
sequence interval

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:
.49 lies in the symbol interval of b, [.2,.7); rescaling, (.49 − .2)/.5 = .58 again lies in [.2,.7), giving b; rescaling once more, (.58 − .2)/.5 = .76 lies in [.7,1), giving c.
The message is bbc.

Representing a real number
Binary fractional representation:

  .75 = .11      1/3 = .01 01 01 …      11/16 = .1011

Algorithm:
1. x = 2 * x
2. If x < 1 output 0
3. else x = x - 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
e.g.  [0,.33) → .01     [.33,.66) → .1     [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions.

  code     min       max       interval
  .11      .110…     .111…     [.75, 1.0)
  .101     .1010…    .1011…    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval is contained in the sequence interval (a dyadic number).

[Figure: a sequence interval [.61, .79) containing the code interval of .101, i.e. [.625, .75).]

Can use L + s/2 truncated to 1 + ⌈log (1/s)⌉ bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

  1 + ⌈log (1/s)⌉ = 1 + ⌈log Π_i (1/p_i)⌉
                  ≤ 2 + Σ_{j=1,…,n} log (1/p_j)
                  = 2 + Σ_{k=1,…,|Σ|} n p_k log (1/p_k)
                  = 2 + n H0   bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
The problem is that operations on arbitrary-precision real numbers are expensive.
Key ideas of the integer version:
- Keep integers in range [0..R) where R = 2^k
- Use rounding to generate integer intervals
- Whenever the sequence interval falls into the top, bottom or middle half, expand the interval by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

[Figure: the ATB as a state machine: given the current interval (L,s) and the next symbol c with distribution (p_1,…,p_|Σ|), it outputs the new interval (L',s') with L' = L + s·f(c) and s' = s·p(c).]

Therefore, even the distribution can change over time.

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
[Figure: the ATB is driven with the conditional probability p[s | context], where the coded symbol s is either a character c or the escape esc; it maps (L,s) to (L',s').]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
String = ACCBACCACBA B    (k = 2)

  Context Empty:   A = 4   B = 2   C = 5   $ = 3

  Context A:   C = 3   $ = 1
  Context B:   A = 2   $ = 1
  Context C:   A = 1   B = 2   C = 2   $ = 3

  Context AC:  B = 1   C = 2   $ = 2
  Context BA:  C = 1   $ = 1
  Context CA:  C = 1   $ = 1
  Context CB:  A = 2   $ = 1
  Context CC:  A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
We are given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
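A sketch in C of InvertBWT: it computes the LF mapping from L by counting and then walks backward, assuming '#' is the unique smallest character (and, for the fixed-size LF array, n ≤ 64). Names are illustrative.

#include <stdio.h>
#include <string.h>

void invert_bwt(const char *L, char *T) {
    int n = (int)strlen(L);                              /* assumes n <= 64            */
    int count[256] = {0}, first[256] = {0}, seen[256] = {0}, LF[64];

    for (int i = 0; i < n; i++) count[(unsigned char)L[i]]++;
    for (int c = 1; c < 256; c++)                        /* first[c]: row of the first */
        first[c] = first[c - 1] + count[c - 1];          /* c in the F column          */
    for (int i = 0; i < n; i++) {                        /* LF[i]: row of F holding    */
        unsigned char c = (unsigned char)L[i];           /* the same occurrence as L[i]*/
        LF[i] = first[c] + seen[c]++;
    }
    int r = 0;                                           /* row 0 starts with '#'      */
    T[n - 1] = '#';                                      /* the terminator is known    */
    for (int i = n - 2; i >= 0; i--) {                   /* reconstruct T backward     */
        T[i] = L[r];
        r = LF[r];
    }
    T[n] = '\0';
}

int main(void) {
    char T[64];
    invert_bwt("ipssm#pissii", T);                       /* L column of mississippi#   */
    printf("%s\n", T);                                   /* prints mississippi#        */
    return 0;
}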

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]
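The relation L[i] = T[SA[i]-1] as a two-line sketch (1-based SA values as in the slides, 0-based C strings; illustrative).

/* L[i] = character preceding the suffix starting at SA[i]; SA[i] == 1 wraps to T[n] */
void bwt_from_sa(const char *T, const int *SA, int n, char *L) {
    for (int i = 0; i < n; i++)
        L[i] = (SA[i] == 1) ? T[n - 1] : T[SA[i] - 2];
}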

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion pages are available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)
 Set of nodes such that from any node one can go to any other via an undirected path.

Strongly connected components (SCC)
 Set of nodes such that from any node one can go to any other via a directed path.

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by humans

Exploit the structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph
 V = routers, E = communication links

The "cosine" graph (undirected, weighted)
 V = static web pages, E = semantic distance between pages

Query-Log graph (bipartite, weighted)
 V = queries and URLs, E = (q,u) if u is a result for q and has been clicked by some user who issued q

Social graph (undirected, unweighted)
 V = users, E = (x,y) if x knows y (facebook, address book, email, …)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has a hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution

Altavista crawl (1999) and WebBase crawl (2001): the indegree follows a power-law distribution

  Pr[ in-degree(u) = k ]  ∝  1/k^α,   α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has a hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 million pages, 150 million links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77-scheme provides an efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
             Emacs size    Emacs time
  uncompr    27Mb          ---
  gzip       8Mb           35 secs
  zdelta     1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: a small weighted graph G_F with a dummy node 0; edge weights are the zdelta sizes (e.g. 20, 123, 220, 620, 2000) and the min branching picks the cheapest reference for each file.]

             space    time
  uncompr    30Mb     ---
  tgz        20%      linear
  THIS       8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G'_F, thus saving zdelta executions. Nonetheless, strictly n² time.

             space    time
  uncompr    260Mb    ---
  tgz        12%      2 mins
  THIS       8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

             gcc size    emacs size
  total      27288       27326
  gzip       7563        8577
  zdelta     227         1431
  rsync      964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends the hashes (unlike the client in rsync), and the client checks them.
The server deploys the common f_ref to compress the new f_tar (rsync compresses just it).

A multi-round protocol

k blocks of n/k elems
log n/k levels

If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits.

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Figure: the suffix tree of T# = mississippi#, with edge labels such as #, i, s, p, si, ssi, i#, pi#, ppi#, mississippi#, and leaves labeled by the starting positions 1..12 of the corresponding suffixes.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
(Storing SUF(T) explicitly would take Θ(N²) space.)
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Q(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirect binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log₂ N) binary-search steps
• Each step takes O(p) char cmp
⟹ overall, O(p log₂ N) time

(improvable to O(p + log₂ N) [Manber-Myers, '90], and further [Cole et al, '06])
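A sketch in C of the indirect binary search on SA: each step compares P against the suffix T[SA[mid],...] in O(p) time (SA is 1-based as in the slides; names and the test are illustrative, and the neighbouring SA entries of a hit give the remaining occurrences).

#include <stdio.h>
#include <string.h>

int sa_search(const char *T, const int *SA, int n, const char *P) {
    size_t p = strlen(P);
    int lo = 0, hi = n - 1;
    while (lo <= hi) {                            /* O(log n) steps, O(p) chars each  */
        int mid = (lo + hi) / 2;
        int c = strncmp(P, T + SA[mid] - 1, p);   /* P vs the suffix at SA[mid]       */
        if (c == 0) return SA[mid];               /* an occurrence; its neighbours in
                                                     SA give the others (Prop 1)      */
        if (c < 0) hi = mid - 1; else lo = mid + 1;
    }
    return -1;
}

int main(void) {
    int SA[] = {12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3};
    printf("%d\n", sa_search("mississippi#", SA, 12, "si"));   /* prints 7; 4 is the other */
    return 0;
}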

Locating the occurrences
[Figure: the SA of T = mississippi#; the occurrences of P = si form the contiguous SA range delimited by the binary searches for si# and si$, namely the entries 7 and 4 (suffixes sippi and sissippi), so occ = 2.]

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
(e.g. the adjacent suffixes issippi and ississippi share a common prefix of length 4)

• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
• Search for Lcp[i,i+C-2] whose entries are ≥ L


Slide 120

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with i-th bit of U(T[j]) to estabilish if both are true

An example j=1
1 2 3 4 5 6 7 8 9 10

n
m

1

12345

1

0

P=abaac

2

0

3

0

4

0

5

0

T=xabxabaaca

0
 
0
U (x)  0
 
0
 
0

2

3

4

5

6

7

1 
 
0
BitShift ( M ( 0 )) & U (T [1])   0  &
 
0
 
0

8

9

0 0
   
0 0
0  0
   
0 0
   
0 0

1
0

An example j=2
1 2 3 4 5 6 7 8 9 10

n
m

1

2

12345

1

0

1

P=abaac

2

0

0

3

0

0

4

0

0

5

0

0

T=xabxabaaca

1 
 
0
U (a )   1 
 
1 
 
0

3

4

5

6

7

1 
 
0
BitShift ( M (1)) & U (T [ 2 ])   0  &
 
0
 
0

8

9

1  1 
   
0 0
1    0 
   
1   0 
   
0 0

1
0

An example j=3
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

12345

1

0

1

0

P=abaac

2

0

0

1

3

0

0

0

4

0

0

0

5

0

0

0

T=xabxabaaca

0
 
1 
U (b )   0 
 
0
 
0

4

5

6

7

1 
 
1 
BitShift ( M ( 2 )) & U (T [ 3 ])   0  &
 
0
 
0

8

9

0 0
   
1  1 
0  0
   
0 0
   
0 0

1
0

An example j=9
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

4

5

6

7

8

9

12345

1

0

1

0

0

1

0

1

1

0

P=abaac

2

0

0

1

0

0

1

0

0

0

3

0

0

0

0

0

0

1

0

0

4

0

0

0

0

0

0

0

1

0

5

0

0

0

0

0

0

0

0

1

T=xabxabaaca

0
 
0
U (c )   0 
 
0
 
1 

1 
 
1 
BitShift ( M ( 8 )) & U (T [ 9 ])   0  &
 
0
 
1 

0 0
   
0 0
0  0
   
0 0
   
1  1 

1
0

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary = { bzip, not, or } (plus space)

Given a pattern P, find all the occurrences in S of all terms containing P as
substring.

P = o
S = “bzip or not bzip”
Terms containing P:   not = 1g 0g 0a     or = 1g 0a 0b

[Figure: the dictionary tree and the compressed text C(S), with the matching
codewords marked]

Speed ≈ Compression ratio? No! Why?
A scan of C(S) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T = A B C A C A B D D A B A      (with the occurrences of P1 and P2 marked)

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And


S is the concatenation of the patterns in P.
R is a bitmap of length m:

R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of the Shift-And method searching for S:


For any symbol c, U’(c) = U(c) AND R
 U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
For any step j,
 compute M(j)
 then set M(j) = M(j) OR U’(T[j]). Why?
 It sets to 1 the first bit of each pattern that starts with T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary = { bzip, not, or } (plus space)

Given a pattern P, find all the occurrences in S of all terms containing P as
substring, allowing at most k mismatches.

P = bot   k = 2
S = “bzip or not bzip”

[Figure: the dictionary tree and the compressed text C(S)]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position 4; it also occurs with
4 mismatches starting at position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal: given k, find all the occurrences of P in T with up to k
mismatches.
We define the matrix Ml to be an m by n binary matrix, such that:

Ml(i,j) = 1 iff
there are no more than l mismatches between the first i characters of P and
the i characters of T ending at character j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M0(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

(P[1 .. i-1] aligned with the i-1 characters of T ending at j-1, with ≤ l mismatches)

BitShift( Ml(j-1) ) & U(T[j])

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

(P[1 .. i-1] aligned with the i-1 characters of T ending at j-1, with ≤ l-1 mismatches)

BitShift( Ml-1(j-1) )

Computing Ml


We compute Ml for all l = 0, … , k.
For each j compute M0(j), M1(j), … , Mk(j).

For all l, initialize Ml(0) to the zero vector.

In order to compute Ml(j), we observe that there is a match iff

Ml(j) = [ BitShift( Ml(j-1) ) & U(T[j]) ]  OR  BitShift( Ml-1(j-1) )
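
A minimal C sketch of this recurrence for m ≤ 64, keeping one 64-bit word per
value of l (names are mine; this handles mismatches only, not insertions or
deletions).

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MAXK 8   /* sketch only: assume k <= MAXK */

/* Report positions j where P occurs in T with at most k mismatches (m <= 64). */
void shift_and_mismatch(const char *T, const char *P, int k) {
    size_t n = strlen(T), m = strlen(P);
    uint64_t U[256] = {0};
    for (size_t i = 0; i < m; i++)
        U[(unsigned char)P[i]] |= 1ULL << i;

    uint64_t M[MAXK + 1] = {0};           /* M[l] = current column of M^l      */
    uint64_t last = 1ULL << (m - 1);
    for (size_t j = 0; j < n; j++) {
        uint64_t prev_above = 0;          /* holds M^{l-1}(j-1)                */
        for (int l = 0; l <= k; l++) {
            uint64_t old = M[l];          /* M^l(j-1)                          */
            /* M^l(j) = (BitShift(M^l(j-1)) & U(T[j])) OR BitShift(M^{l-1}(j-1)) */
            M[l] = ((old << 1) | 1ULL) & U[(unsigned char)T[j]];
            if (l > 0)
                M[l] |= (prev_above << 1) | 1ULL;
            prev_above = old;
        }
        if (M[k] & last)
            printf("match with <= %d mismatches ending at %zu\n", k, j + 1);
    }
}

int main(void) {
    shift_and_mismatch("xabxabaaca", "abaad", 1);  /* ends at 9, as in Example M1 */
    return 0;
}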

Example M1
T = xabxabaaca       P = abaad

            j = 1 2 3 4 5 6 7 8 9 10

M1 =   i=1      1 1 1 1 1 1 1 1 1 1
       i=2      0 0 1 0 0 1 0 1 1 0
       i=3      0 0 0 1 0 0 1 0 0 1
       i=4      0 0 0 0 1 0 0 1 0 0
       i=5      0 0 0 0 0 0 0 0 1 0

M0 =   i=1      0 1 0 0 1 0 1 1 0 1
       i=2      0 0 1 0 0 1 0 0 0 0
       i=3      0 0 0 0 0 0 1 0 0 0
       i=4      0 0 0 0 0 0 0 1 0 0
       i=5      0 0 0 0 0 0 0 0 0 0

How much do we pay?





The running time is O(kn(1+m/w)).
Again, the method is practically efficient for small m.
Only O(k) columns of M are needed at any given time. Hence, the space used by
the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary = { bzip, not, or } (plus space)

Given a pattern P, find all the occurrences in S of all terms containing P as
substring, allowing k mismatches.

P = bot   k = 2
S = “bzip or not bzip”
Matching term:   not = 1g 0g 0a

[Figure: the dictionary tree and the compressed text C(S), with the occurrences
of “not” marked]

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding

  g(x) =  0 0 ... 0  x in binary        (Length-1 zeros, then the Length bits of x)

x > 0 and Length = floor(log2 x) + 1
e.g., 9 is represented as <000,1001>.

g-code for x takes 2*floor(log2 x) + 1 bits
(i.e., a factor of 2 from optimal)

Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers

It is a prefix-free encoding…

Given the following sequence of g-coded integers, reconstruct the original
sequence:

0001000001100110000011101100111

8   6   3   59   7
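
A minimal sketch of g-coding in C, writing bits MSB-first into a byte buffer;
the helper names are mine.

#include <stdint.h>
#include <stdio.h>

/* Append one bit to buf (MSB-first); *nbits counts the bits written so far. */
static void put_bit(uint8_t *buf, size_t *nbits, int bit) {
    if (bit) buf[*nbits / 8] |= (uint8_t)(0x80u >> (*nbits % 8));
    (*nbits)++;
}

/* gamma-encode x > 0: (Length-1) zeros followed by the Length binary digits of x. */
void gamma_encode(uint8_t *buf, size_t *nbits, uint64_t x) {
    int len = 0;
    for (uint64_t t = x; t; t >>= 1) len++;            /* len = floor(log2 x)+1 */
    for (int i = 0; i < len - 1; i++) put_bit(buf, nbits, 0);
    for (int i = len - 1; i >= 0; i--) put_bit(buf, nbits, (x >> i) & 1);
}

/* Decode one gamma-coded integer starting at bit position *pos. */
uint64_t gamma_decode(const uint8_t *buf, size_t *pos) {
    int len = 1;
    while (!((buf[*pos / 8] >> (7 - *pos % 8)) & 1)) { len++; (*pos)++; }
    uint64_t x = 0;
    for (int i = 0; i < len; i++) {
        x = (x << 1) | ((buf[*pos / 8] >> (7 - *pos % 8)) & 1);
        (*pos)++;
    }
    return x;
}

int main(void) {
    uint8_t buf[16] = {0};
    size_t nbits = 0, pos = 0;
    uint64_t in[] = {8, 6, 3, 59, 7};
    for (int i = 0; i < 5; i++) gamma_encode(buf, &nbits, in[i]);
    for (int i = 0; i < 5; i++) printf("%llu ", (unsigned long long)gamma_decode(buf, &pos));
    printf("(%zu bits)\n", nbits);   /* the slide's exercise sequence takes 31 bits */
    return 0;
}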

Analysis
Sort pi in decreasing order, and encode si via the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
1 ≥ Σ i=1,...,x  pi ≥ x * px    ⇒    x ≤ 1/px

How good is it ?
Encode the integers via g-coding:
|g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):

  Σ i=1,...,|S|  pi * |g(i)|  ≤  Σ i=1,...,|S|  pi * [ 2 * log (1/pi) + 1 ]  =  2 * H0(X) + 1

Not much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to the r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s*c^2 with 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 128^2 = 16512 words on at most 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 words on at most 2 bytes,
hence more words fit on 1 byte and thus, if the distribution is skewed, it compresses better...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 256 - s.


Brute-force approach



Binary search:


On real distributions, it seems that there is a unique minimum

Ks = max codeword length
Fs,k = cumulative probability of the symbols whose |cw| ≤ k

Experiments: (s,c)-DC is quite interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

It exploits temporal locality, and it is a dynamic code


X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n^2 log n), MTF = O(n log n) + n^2

Not much worse than Huffman
...but it may be far better
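
A minimal MTF coder in C over the byte alphabet, using the naive O(|S|) list
update; the function names are mine.

#include <stdio.h>
#include <string.h>

/* Encode src[0..n-1] into ranks out[0..n-1] with Move-to-Front over bytes. */
void mtf_encode(const unsigned char *src, size_t n, int *out) {
    unsigned char list[256];
    for (int i = 0; i < 256; i++) list[i] = (unsigned char)i;  /* L = [0,1,...,255] */
    for (size_t k = 0; k < n; k++) {
        int pos = 0;
        while (list[pos] != src[k]) pos++;        /* 1) output the position of s in L */
        out[k] = pos;
        memmove(list + 1, list, (size_t)pos);     /* 2) move s to the front of L      */
        list[0] = src[k];
    }
}

int main(void) {
    const char *s = "mississippi";
    int out[32];
    mtf_encode((const unsigned char *)s, strlen(s), out);
    for (size_t k = 0; k < strlen(s); k++) printf("%d ", out[k]);
    printf("\n");
    return 0;
}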

MTF: how good is it ?
Encode the integers via g-coding:
|g(i)| ≤ 2 * log i + 1
Put S at the front of the list and consider the cost of encoding:

  O(|S| log |S|)  +  Σ x=1,...,|S|   Σ i≥2   |g( px,i - px,i-1 )|

(px,i is the position of the i-th occurrence of symbol x, nx its number of
occurrences, N the total length)

By Jensen’s inequality:

  ≤  O(|S| log |S|)  +  Σ x=1,...,|S|   nx * [ 2 * log (N / nx) + 1 ]
  =  O(|S| log |S|)  +  N * [ 2 * H0(X) + 1 ]

  La[mtf]  ≤  2 * H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to maintain the MTF-list efficiently:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  ⇒  just the run lengths and one bit (the first symbol)
Properties:



It exploits spatial locality, and it is a dynamic code
X = 1^n 2^n 3^n … n^n  ⇒

There is a memory

Huff(X) = Θ(n^2 log n)  >  Rle(X) = Θ(n log n)
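
A tiny RLE sketch in C producing (symbol, run-length) pairs, as in the example
above; the function name is mine.

#include <stdio.h>
#include <string.h>

/* Print the (symbol, run-length) pairs of s, e.g. abbbaacccca -> (a,1)(b,3)... */
void rle(const char *s) {
    size_t n = strlen(s);
    for (size_t i = 0; i < n; ) {
        size_t j = i;
        while (j < n && s[j] == s[i]) j++;       /* extend the current run */
        printf("(%c,%zu)", s[i], j - i);
        i = j;
    }
    printf("\n");
}

int main(void) {
    rle("abbbaacccca");   /* prints (a,1)(b,3)(a,2)(c,4)(a,1) */
    return 0;
}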

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.  p(a) = .2,  p(b) = .5,  p(c) = .3

  a → [0.0, 0.2)     b → [0.2, 0.7)     c → [0.7, 1.0)

  f(i) = Σ j=1,...,i-1  p(j)        f(a) = .0,  f(b) = .2,  f(c) = .7

The interval for a particular symbol will be called
the symbol interval (e.g., for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac

  start:    [0.0, 1.0)
  after b:  [0.2, 0.7)
  after a:  [0.2, 0.3)
  after c:  [0.27, 0.3)

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:

  l0 = 0        li = li-1 + si-1 * f(ci)
  s0 = 1        si = si-1 * p(ci)

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

  sn = Π i=1,...,n  p(ci)

The interval for a message sequence will be called the
sequence interval
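
A floating-point sketch of the interval computation in C, for illustration
only (a real coder uses the integer renormalization described later; names
are mine).

#include <stdio.h>
#include <string.h>

/* Symbol model from the slides: p(a)=.2, p(b)=.5, p(c)=.3  ->  f(a)=0, f(b)=.2, f(c)=.7 */
static const double p[3] = {0.2, 0.5, 0.3};
static const double f[3] = {0.0, 0.2, 0.7};

/* Compute the sequence interval [l, l+s) of msg (over the alphabet {a,b,c}). */
void sequence_interval(const char *msg, double *l, double *s) {
    *l = 0.0; *s = 1.0;                       /* l0 = 0, s0 = 1              */
    for (size_t i = 0; msg[i]; i++) {
        int c = msg[i] - 'a';
        *l = *l + *s * f[c];                  /* li = li-1 + si-1 * f(ci)    */
        *s = *s * p[c];                       /* si = si-1 * p(ci)           */
    }
}

int main(void) {
    double l, s;
    sequence_interval("bac", &l, &s);
    printf("[%g, %g)\n", l, l + s);           /* prints [0.27, 0.3)          */
    return 0;
}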

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:

  .49 ∈ [.2, .7)     → b     (split [.2,.7):   a [.2,.3),  b [.3,.55),   c [.55,.7))
  .49 ∈ [.3, .55)    → b     (split [.3,.55):  a [.3,.35), b [.35,.475), c [.475,.55))
  .49 ∈ [.475, .55)  → c

The message is bbc.

Representing a real number
Binary fractional representation:

  .75   = .11
  1/3   = .0101 0101 ...
  11/16 = .1011

Algorithm
1.  x = 2 * x
2.  If x < 1 output 0
3.  else x = x - 1; output 1

So how about just using the shortest binary fractional representation in the
sequence interval?
e.g.  [0,.33) = .01      [.33,.66) = .1      [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions.

          min      max      interval
  .11     .110     .111     [.75, 1.0)
  .101    .1010    .1011    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

  Sequence interval: [.61, .79)
  Code interval (.101) = [.625, .75), contained in it

Can use L + s/2 truncated to 1 + ⌈ log (1/s) ⌉ bits

Bound on Arithmetic length

Note that –log s + 1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most
  1 + ⌈ log (1/s) ⌉ =
  = 1 + ⌈ log Π i=1,...,n (1/pi) ⌉
  ≤ 2 + Σ i=1,...,n  log (1/pi)
  = 2 + Σ k=1,...,|S|  n pk log (1/pk)
  = 2 + n H0   bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers are expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R = 2^k
Use rounding to generate integer intervals
Whenever the sequence interval falls into the top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l ≥ R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

[Figure: the ATB maps the current interval (L,s) and the next symbol c, drawn
from the distribution (p1,....,pS), into the new interval (L’,s’)]
Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
[Figure: at each step the PPM model supplies p[ s | context ] for s = c or esc,
and the ATB maps the current interval (L,s) into the new interval (L’,s’)]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts                     String = ACCBACCACBA B        k=2

  Context   Counts
  Empty     A = 4   B = 2   C = 5   $ = 3
  A         C = 3   $ = 1
  B         A = 2   $ = 1
  C         A = 1   B = 2   C = 2   $ = 3
  AC        B = 1   C = 2   $ = 2
  BA        C = 1   $ = 1
  CA        C = 1   $ = 1
  CB        A = 2   $ = 1
  CC        A = 1   B = 1   $ = 2
You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if len > d? (the copy overlaps the part of the text still being written)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor:
for (i = 0; i < len; i++)
    out[cursor + i] = out[cursor - d + i];



Output is correct: abcdcdcdcdcdce
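
A self-contained sketch of the decoding loop in C, including the overlapping
copy discussed above; the triple layout and the names are mine.

#include <stdio.h>

/* One LZ77 triple: copy `len` chars from distance `d` back, then append `c`. */
struct triple { int d, len; char c; };

/* Decode triples into out[], starting at position `cursor`; return the new length. */
size_t lz77_decode(const struct triple *t, size_t ntriples,
                   char *out, size_t cursor) {
    for (size_t k = 0; k < ntriples; k++) {
        /* Works also when len > d: earlier output bytes are re-read as we go. */
        for (int i = 0; i < t[k].len; i++)
            out[cursor + i] = out[cursor - (size_t)t[k].d + i];
        cursor += (size_t)t[k].len;
        out[cursor++] = t[k].c;
    }
    out[cursor] = '\0';
    return cursor;
}

int main(void) {
    char out[64] = "abcd";                 /* already-decoded prefix            */
    struct triple t[] = { {2, 9, 'e'} };   /* the slide's codeword (2,9,e)      */
    lz77_decode(t, 1, out, 4);
    printf("%s\n", out);                   /* prints abcdcdcdcdcdce             */
    return 0;
}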

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table to speed up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids
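
A naive LZ78 encoder sketch in C: the dictionary is an array of (parent id,
char) pairs searched linearly, which is enough to reproduce the coding example
below (a real implementation would use a trie; names are mine).

#include <stdio.h>

#define MAXD 1024

static int  par[MAXD];   /* dictionary entry id = string of entry par[id] + ch[id] */
static char ch[MAXD];

/* Return the id of (p, c) if already in the dictionary, else 0. */
static int find(int ndict, int p, char c) {
    for (int id = 1; id <= ndict; id++)
        if (par[id] == p && ch[id] == c) return id;
    return 0;
}

/* Emit the LZ78 parsing of s as pairs (id, char), adding entries 1,2,3,... */
void lz78_encode(const char *s) {
    int ndict = 0;
    for (size_t i = 0; s[i]; ) {
        int p = 0, q;
        /* find the longest match S already in the dictionary */
        while (s[i] && (q = find(ndict, p, s[i])) != 0) { p = q; i++; }
        char c = s[i] ? s[i++] : '-';       /* '-' if the input ends on a match */
        printf("(%d,%c) ", p, c);
        if (ndict + 1 < MAXD) { ndict++; par[ndict] = p; ch[ndict] = c; }
    }
    printf("\n");
}

int main(void) {
    lz78_encode("aabaacabcabcb");  /* (0,a)(1,b)(1,a)(0,c)(2,c)(5,b), as on the slide */
    return 0;
}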

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

(1994)

  F                    L
  #   mississipp       i
  i   #mississip       p
  i   ppi#missis       s
  i   ssippi#mis       s
  i   ssissippi#       m
  m   ississippi       #
  p   i#mississi       p
  p   pi#mississ       i
  s   ippi#missi       s
  s   issippi#mi       s
  s   sippi#miss       i
  s   sissippi#m       i

A famous example

Much
longer...

A useful tool: L → F mapping

How do we map L’s chars onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal chars of L
Rotate rightward their rows
Same relative order in F !!

The BWT is invertible

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
  Compute LF[0,n-1];
  r = 0; i = n;
  while (i > 0) {
    T[i] = L[r];
    r = LF[r]; i--;
  }
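
A hedged C sketch of this backward reconstruction, computing LF by counting;
it assumes '#' is the unique, smallest end-marker of T, and all names are mine.

#include <stdio.h>
#include <string.h>

/* Invert the BWT string L[0..N-1] into out[0..N-1] (sketch: assumes N <= 4096). */
void invert_bwt(const char *L, size_t N, char *out) {
    size_t count[256] = {0}, C[256], LF[4096], occ[256] = {0};
    for (size_t i = 0; i < N; i++) count[(unsigned char)L[i]]++;
    size_t sum = 0;
    for (int c = 0; c < 256; c++) { C[c] = sum; sum += count[c]; } /* C[c] = #chars < c */
    for (size_t i = 0; i < N; i++) {                /* LF[i]: row of F holding L[i]     */
        unsigned char c = (unsigned char)L[i];
        LF[i] = C[c] + occ[c]++;
    }
    size_t r = 0;                                   /* row 0 is the one starting with # */
    out[N - 1] = '#';
    for (size_t i = N - 1; i > 0; i--) {            /* reconstruct T backward           */
        out[i - 1] = L[r];
        r = LF[r];
    }
}

int main(void) {
    const char *L = "ipssm#pissii";                 /* BWT of mississippi#              */
    char out[64];
    invert_bwt(L, strlen(L), out);
    out[strlen(L)] = '\0';
    printf("%s\n", out);                            /* prints mississippi#              */
    return 0;
}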

How to compute the BWT ?
  SA    BWT matrix (sorted rotations)    L
  12    #mississippi                     i
  11    i#mississipp                     p
   8    ippi#mississ                     s
   5    issippi#miss                     s
   2    ississippi#m                     m
   1    mississippi#                     #
  10    pi#mississip                     p
   9    ppi#mississi                     i
   7    sippi#missis                     s
   4    sissippi#mis                     s
   6    ssippi#missi                     i
   3    ssissippi#mi                     i
We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
  SA
  12    #
  11    i#
   8    ippi#
   5    issippi#
   2    ississippi#
   1    mississippi#
  10    pi#
   9    ppi#
   7    sippi#
   4    sissippi#
   6    ssippi#
   3    ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n^2 log n) time in the worst case
• Θ(n log n) cache misses or I/O faults
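
The “elegant but inefficient” approach as a C sketch: sort the suffixes with
qsort/strcmp, which is exactly the Θ(n^2 log n) worst-case method criticized
above. Since '#' is a unique smallest end-marker, sorting suffixes and sorting
rotations give the same order; names are mine.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static const char *text;                  /* global text, for the comparator  */

static int cmp_suffix(const void *a, const void *b) {
    return strcmp(text + *(const int *)a, text + *(const int *)b);
}

/* Build SA by sorting suffix start positions, then derive L[i] = T[SA[i]-1]. */
void naive_bwt(const char *T, int *SA, char *L) {
    int n = (int)strlen(T);
    text = T;
    for (int i = 0; i < n; i++) SA[i] = i;            /* 0-based suffix starts */
    qsort(SA, (size_t)n, sizeof(int), cmp_suffix);
    for (int i = 0; i < n; i++)
        L[i] = SA[i] ? T[SA[i] - 1] : T[n - 1];       /* wrap around for SA=0  */
    L[n] = '\0';
}

int main(void) {
    const char *T = "mississippi#";
    int SA[64]; char L[64];
    naive_bwt(T, SA, L);
    printf("%s\n", L);                                 /* prints ipssm#pissii  */
    return 0;
}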

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion pages are available (Google, 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node one can go to any other node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node one can go to any other node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by humans



Exploit the structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph
  V = Routers
  E = communication links

The “cosine” graph (undirected, weighted)
  V = static web pages
  E = semantic distance between pages

Query-Log graph (bipartite, weighted)
  V = queries and URLs
  E = (q,u) if u is a result for q, and has been clicked by some user who issued q

Social graph (undirected, unweighted)
  V = users
  E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999 — WebBase crawl, 2001
The indegree follows a power-law distribution:

  Pr[ in-degree(u) = k ]  ∝  1 / k^a ,     a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
[Figure: the adjacency matrix of a web crawl — 21 million pages, 150 million links]

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:
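
A small C sketch of the gap transform above (only the transform itself: the
slide's handling of negative entries and the subsequent integer coding are not
shown, and the node id, successors and names below are hypothetical).

#include <stdio.h>

/* Turn the sorted successor list s[0..k-1] of node x into the gap list of the
   slide: {s1 - x, s2 - s1 - 1, ..., sk - s(k-1) - 1}.                          */
void successor_gaps(int x, const int *s, int k, int *gap) {
    for (int i = 0; i < k; i++)
        gap[i] = (i == 0) ? s[0] - x : s[i] - s[i - 1] - 1;
}

int main(void) {
    /* hypothetical node 17: note the negative first gap, which motivates the
       special treatment of negative entries mentioned on the slide             */
    int s[] = {13, 15, 16, 17, 18, 19, 23, 24, 203};
    int gap[9];
    successor_gaps(17, s, 9, gap);
    for (int i = 0; i < 9; i++) printf("%d ", gap[i]);  /* -4 1 0 0 0 0 3 0 178 */
    printf("\n");
    return 0;
}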

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides an efficient, optimal solution




fknown is the “previously encoded text”: compress fknown·fnew starting from fnew

zdelta is one of the best implementations
            Emacs size   Emacs time
  uncompr   27Mb         ---
  gzip      8Mb          35 secs
  zdelta    1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f ∈ F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: the weighted graph GF on the files, plus a dummy node connected to all;
edge weights are zdelta/gzip sizes, and the min branching picks the references]

            space     time
  uncompr   30Mb      ---
  tgz       20%       linear
  THIS      8%        quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta executions. Nonetheless, still n^2 time.

            space     time
  uncompr   260Mb     ---
  tgz       12%       2 mins
  THIS      8%        16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
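
To make the “rolling hash” idea concrete, here is a small Adler-32-style
rolling checksum in C, in the spirit of (but not identical to) the 4-byte
rolling hash mentioned above; names are mine.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MOD 65521u   /* largest prime < 2^16, as in Adler-32 */

/* Checksum of buf[0..len-1]: a = sum of bytes, b = sum of the running prefix sums. */
static uint32_t weak_sum(const unsigned char *buf, size_t len,
                         uint32_t *a, uint32_t *b) {
    *a = 0; *b = 0;
    for (size_t k = 0; k < len; k++) {
        *a = (*a + buf[k]) % MOD;
        *b = (*b + *a) % MOD;
    }
    return (*b << 16) | *a;
}

/* Slide the window one byte to the right in O(1): drop `out`, add `in`. */
static uint32_t roll(uint32_t *a, uint32_t *b, size_t len,
                     unsigned char out, unsigned char in) {
    *a = (*a + MOD - out + in) % MOD;
    *b = (*b + MOD - (uint32_t)((len * out) % MOD) + *a) % MOD;
    return (*b << 16) | *a;
}

int main(void) {
    const unsigned char *s = (const unsigned char *)"abcdefgh";
    size_t len = 4;
    uint32_t a, b;
    uint32_t h = weak_sum(s, len, &a, &b);               /* window "abcd"        */
    for (size_t i = 1; i + len <= strlen((const char *)s); i++) {
        h = roll(&a, &b, len, s[i - 1], s[i + len - 1]); /* "bcde", "cdef", ...  */
        uint32_t a2, b2, h2 = weak_sum(s + i, len, &a2, &b2);
        printf("window %zu: rolled=%08x direct=%08x\n", i, h, h2); /* they match */
    }
    return 0;
}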

Rsync: some experiments

            gcc size   emacs size
  total     27288      27326
  gzip      7563       8577
  zdelta    227        1431
  rsync     964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), and the client checks them.
Server deploys the common fref to compress the new ftar (rsync just compresses it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

Occurrences of P in T = all the suffixes of T having P as a prefix

P = si     T = mississippi     occurrences at positions 4, 7

SUF(T) = sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree

T# = mississippi#
     (positions 1 2 3 4 5 6 7 8 9 10 11 12)

[Figure: the suffix tree of T#; edges are labeled with substrings of T# (e.g.
“ssi”, “ppi#”, “mississippi#”) and each leaf stores the starting position of
the corresponding suffix, 1 … 12]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

Θ(N^2) space, if SUF(T) is stored explicitly

  SA    SUF(T)
  12    #
  11    i#
   8    ippi#
   5    issippi#
   2    ississippi#
   1    mississippi#
  10    pi#
   9    ppi#
   7    sippi#
   4    sissippi#
   6    ssippi#
   3    ssissippi#

T = mississippi#        P = si

Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison

T = mississippi#        P = si

At each step compare P against the suffix starting at SA[mid]: if P is larger
go right, if P is smaller go left (2 accesses per step).

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

(improvable: [Manber-Myers, ’90]; [Cole et al, ’06])
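
A minimal C sketch of the indirect binary search, returning the number of
suffixes prefixed by P via two binary searches; names are mine.

#include <stdio.h>
#include <string.h>

/* Count suffixes of T (given via SA, 0-based starts) that have P as a prefix;
   each comparison costs O(|P|).                                                */
int sa_count(const char *T, const int *SA, int n, const char *P) {
    int m = (int)strlen(P);
    int lo = 0, hi = n;                       /* first suffix whose prefix >= P */
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (strncmp(T + SA[mid], P, (size_t)m) < 0) lo = mid + 1; else hi = mid;
    }
    int first = lo;
    lo = 0; hi = n;                           /* first suffix whose prefix > P  */
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (strncmp(T + SA[mid], P, (size_t)m) <= 0) lo = mid + 1; else hi = mid;
    }
    return lo - first;                        /* = occ                          */
}

int main(void) {
    const char *T = "mississippi#";
    /* the SA of the slides, converted to 0-based suffix starts */
    int SA[] = {11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2};
    printf("occ(si) = %d\n", sa_count(T, SA, 12, "si"));   /* prints 2 */
    return 0;
}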

Locating the occurrences

T = mississippi#        P = si
The occurrences of P form a contiguous range of SA: here the entries 7 (sippi#)
and 4 (sissippi#), so occ = 2 and P occurs at positions 4 and 7 of T.

Suffix Array search
• O(p + log2 N + occ) time

Suffix Trays: O(p + log2 |S| + occ)      [Cole et al., ‘06]
String B-tree                            [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays             [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

  Lcp   SA
   -    12   #
   0    11   i#
   1     8   ippi#
   1     5   issippi#
   4     2   ississippi#
   0     1   mississippi#
   0    10   pi#
   1     9   ppi#
   0     7   sippi#
   2     4   sissippi#
   1     6   ssippi#
   3     3   ssissippi#

T = mississippi#
(e.g., Lcp = 4 between the adjacent suffixes issippi# and ississippi#)
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
• Search for an entry Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
• Search for a window Lcp[i,i+C-2] whose entries are all ≥ L


Slide 121

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with i-th bit of U(T[j]) to estabilish if both are true

An example j=1
1 2 3 4 5 6 7 8 9 10

n
m

1

12345

1

0

P=abaac

2

0

3

0

4

0

5

0

T=xabxabaaca

0
 
0
U (x)  0
 
0
 
0

2

3

4

5

6

7

1 
 
0
BitShift ( M ( 0 )) & U (T [1])   0  &
 
0
 
0

8

9

0 0
   
0 0
0  0
   
0 0
   
0 0

1
0

An example j=2
1 2 3 4 5 6 7 8 9 10

n
m

1

2

12345

1

0

1

P=abaac

2

0

0

3

0

0

4

0

0

5

0

0

T=xabxabaaca

1 
 
0
U (a )   1 
 
1 
 
0

3

4

5

6

7

1 
 
0
BitShift ( M (1)) & U (T [ 2 ])   0  &
 
0
 
0

8

9

1  1 
   
0 0
1    0 
   
1   0 
   
0 0

1
0

An example j=3
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

12345

1

0

1

0

P=abaac

2

0

0

1

3

0

0

0

4

0

0

0

5

0

0

0

T=xabxabaaca

0
 
1 
U (b )   0 
 
0
 
0

4

5

6

7

1 
 
1 
BitShift ( M ( 2 )) & U (T [ 3 ])   0  &
 
0
 
0

8

9

0 0
   
1  1 
0  0
   
0 0
   
0 0

1
0

An example j=9
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

4

5

6

7

8

9

12345

1

0

1

0

0

1

0

1

1

0

P=abaac

2

0

0

1

0

0

1

0

0

0

3

0

0

0

0

0

0

1

0

0

4

0

0

0

0

0

0

0

1

0

5

0

0

0

0

0

0

0

0

1

T=xabxabaaca

0
 
0
U (c )   0 
 
0
 
1 

1 
 
1 
BitShift ( M ( 8 )) & U (T [ 9 ])   0  &
 
0
 
1 

0 0
   
0 0
0  0
   
0 0
   
1  1 

1
0

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

S is the concatenation of the patterns in P.
R is a bitmap of length m: R[i] = 1 iff S[i] is the first symbol of a pattern.

Use a variant of the Shift-And method searching for S (a sketch follows below):
 For any symbol c, U’(c) = U(c) AND R, i.e. U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern.
 For any step j, compute M(j) and then M(j) OR U’(T[j]). Why? It sets to 1 the first bit of each pattern that starts with T[j].
 Check if there are occurrences ending in j. How?
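A possible sketch of this multi-pattern variant, under the same assumption that the total length of S fits in a 64-bit word; the masks R (pattern starts) and E (pattern ends, added here to answer the final “How?”) are illustrative names, not from the slides.

#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Shift-And over the concatenation S of all patterns.
void multi_shift_and(const std::vector<std::string>& pats, const std::string& T) {
    std::string S;
    uint64_t R = 0, E = 0, U[256] = {0};
    for (const std::string& p : pats) {
        R |= 1ULL << S.size();                         // first symbol of this pattern
        for (char c : p) { U[(unsigned char)c] |= 1ULL << S.size(); S += c; }
        E |= 1ULL << (S.size() - 1);                   // last symbol of this pattern
    }
    uint64_t M = 0;
    for (size_t j = 0; j < T.size(); j++) {
        uint64_t Uc = U[(unsigned char)T[j]];
        // usual step, then OR in U'(T[j]) = U(T[j]) AND R to restart every pattern beginning with T[j]
        M = ((M << 1) & Uc) | (Uc & R);
        if (M & E) std::printf("some pattern ends at position %zu\n", j + 1);
    }
}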

Problem 3
Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing at most k mismatches.
Dictionary: bzip, not, or, space    P = bot, k = 2    S = “bzip or not bzip”
[figure: as before, the candidate terms’ codewords are matched against C(S)]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff there are no more than l mismatches between the first i characters of P and the i characters of T ending at character j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

[figure: P[1..i-1] aligned against a substring of T ending at j-1, with at most l mismatches (marked by stars), and P[i] = T[j]]

BitShift( M^l(j-1) ) & U(T[j])

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

[figure: P[1..i-1] aligned against a substring of T ending at j-1, with at most l-1 mismatches (marked by stars)]

BitShift( M^(l-1)(j-1) )

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M^l(j) = [ BitShift( M^l(j-1) ) & U(T[j]) ]  OR  BitShift( M^(l-1)(j-1) )
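A hedged sketch of this k-mismatch recurrence, in the same style as the exact matcher above (m ≤ 64; only substitutions are handled, as in these slides):

#include <cstdint>
#include <string>
#include <vector>

// Ml[l] holds the current column of M^l, for l = 0..k.
std::vector<int> shift_and_k_mismatch(const std::string& P, const std::string& T, int k) {
    int m = P.size();
    uint64_t U[256] = {0};
    for (int i = 0; i < m; i++) U[(unsigned char)P[i]] |= 1ULL << i;
    std::vector<uint64_t> Ml(k + 1, 0), prev(k + 1, 0);
    std::vector<int> occ;
    for (int j = 0; j < (int)T.size(); j++) {
        uint64_t Uc = U[(unsigned char)T[j]];
        prev = Ml;
        for (int l = 0; l <= k; l++) {
            uint64_t eq   = ((prev[l] << 1) | 1ULL) & Uc;              // case 1: P[i] = T[j]
            uint64_t mism = (l > 0) ? ((prev[l - 1] << 1) | 1ULL) : 0; // case 2: spend one mismatch
            Ml[l] = eq | mism;
        }
        if (Ml[k] & (1ULL << (m - 1))) occ.push_back(j + 1);  // occurrence with <= k mismatches ends at j
    }
    return occ;
}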

Example M1
T = xabxabaaca,  P = abaad    (columns j = 1..10, rows i = 1..5)

M1 =
row 1:  1 1 1 1 1 1 1 1 1 1
row 2:  0 0 1 0 0 1 0 1 1 0
row 3:  0 0 0 1 0 0 1 0 0 1
row 4:  0 0 0 0 1 0 0 1 0 0
row 5:  0 0 0 0 0 0 0 0 1 0

M0 =
row 1:  0 1 0 0 1 0 1 1 0 1
row 2:  0 0 1 0 0 1 0 0 0 0
row 3:  0 0 0 0 0 0 1 0 0 0
row 4:  0 0 0 0 0 0 0 1 0 0
row 5:  0 0 0 0 0 0 0 0 0 0

M1(5,9) = 1: P occurs ending at position 9 of T with at most one mismatch.

How much do we pay?





The running time is O(kn(1+m/w)).
Again, the method is practically efficient for small m.
Still, only O(k) columns of M are needed at any given time; hence the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing k mismatches.
Dictionary: bzip, not, or, space    P = bot, k = 2    S = “bzip or not bzip”
not = 1g 0g 0a
[figure: the codewords of the dictionary terms matching P with ≤ k mismatches (e.g. not) are located in C(S)]

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding
γ(x) = (Length-1) zeros, followed by x in binary, where x > 0 and Length = ⌊log2 x⌋ + 1
e.g., 9 is represented as <000,1001>.

The γ-code for x takes 2⌊log2 x⌋ + 1 bits (i.e. a factor of 2 from optimal).

Optimal for Pr(x) = 1/(2x²), and i.i.d. integers.

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8

6

3

59

7
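A small sketch of γ-encoding and decoding (bits are kept as a '0'/'1' string for readability, and no input validation is done):

#include <cstdint>
#include <string>
#include <vector>

// gamma(x), x > 0: (Length-1) zeros followed by x in binary, Length = floor(log2 x) + 1.
std::string gamma_encode(uint64_t x) {
    int len = 0;
    for (uint64_t t = x; t > 0; t >>= 1) len++;
    std::string out(len - 1, '0');
    for (int i = len - 1; i >= 0; i--) out += ((x >> i) & 1) ? '1' : '0';
    return out;
}

// Decode a concatenation of gamma-codes.
std::vector<uint64_t> gamma_decode(const std::string& bits) {
    std::vector<uint64_t> out;
    size_t i = 0;
    while (i < bits.size()) {
        int zeros = 0;
        while (bits[i] == '0') { zeros++; i++; }          // unary part gives the length
        uint64_t x = 0;
        for (int j = 0; j <= zeros; j++) x = (x << 1) | (uint64_t)(bits[i++] - '0');
        out.push_back(x);
    }
    return out;
}
// gamma_decode("0001000001100110000011101100111") yields 8, 6, 3, 59, 7 as above.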

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).

Recall that |γ(i)| ≤ 2·log i + 1.
How good is this approach with respect to Huffman?
Compression ratio ≤ 2·H0(S) + 1
Key fact:
1 ≥ Σ_{i=1,...,x} pi ≥ x·px  ⇒  x ≤ 1/px

How good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1
The cost of the encoding is (recall i ≤ 1/pi):

Σ_{i=1,...,|S|} pi · |γ(i)|  ≤  Σ_{i=1,...,|S|} pi · [ 2·log(1/pi) + 1 ]  =  2·H0(X) + 1

Not much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory.
Properties:
 Exploits temporal locality, and it is dynamic
 X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n² log n) bits, MTF = O(n log n) + n² bits

Not much worse than Huffman,
...but it may be far better.
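A minimal MTF-encoder sketch, using a plain list (so O(|S|) per symbol; the tree/hash-table solution on the next slides brings this down to O(log |S|)); positions are 1-based as in the slides, and the initial list must contain every symbol of the input:

#include <string>
#include <vector>

std::vector<int> mtf_encode(const std::string& text, std::string list) {
    std::vector<int> out;
    for (char c : text) {
        int pos = (int)list.find(c);            // position of c in the current MTF list
        out.push_back(pos + 1);                 // 1) output the position of c
        list.erase(list.begin() + pos);         // 2) move c to the front
        list.insert(list.begin(), c);
    }
    return out;
}
// e.g. mtf_encode("aaabbbccc", "abc") = 1 1 1 2 1 1 3 1 1: runs of equal symbols become runs of 1s.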

MTF: how good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1.
Put S in front and consider the cost of encoding:

O(|S| log |S|) + Σ_{x=1,...,|S|} Σ_{i=2,...,nx} | γ( p_i^x − p_{i−1}^x ) |

By Jensen’s inequality:

≤ O(|S| log |S|) + Σ_{x=1,...,|S|} nx · [ 2·log(N/nx) + 1 ]
= O(|S| log |S|) + N · [ 2·H0(X) + 1 ]

hence  La[mtf] ≤ 2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to efficiently maintain the MTF-list:
 Search tree: leaves contain the symbols, ordered as in the MTF-list; internal nodes contain the size of their descending subtree.
 Hash table: the key is a symbol, the data is a pointer to the corresponding tree leaf.

Each tree operation takes O(log |S|) time.
Total is O(n log |S|), where n = #symbols to be compressed.

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings ⇒ just the run lengths and one bit.
There is a memory.
Properties:
 Exploits spatial locality, and it is a dynamic code
 X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = Θ(n² log n) > Rle(X) = Θ(n log n)
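A tiny RLE sketch matching the example above:

#include <string>
#include <utility>
#include <vector>

std::vector<std::pair<char,int>> rle_encode(const std::string& s) {
    std::vector<std::pair<char,int>> out;
    for (char c : s) {
        if (!out.empty() && out.back().first == c) out.back().second++;   // extend the current run
        else out.push_back({c, 1});                                       // start a new run
    }
    return out;
}
// rle_encode("abbbaacccca") = (a,1)(b,3)(a,2)(c,4)(a,1), as in the slide.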

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.  p(a) = .2,  p(b) = .5,  p(c) = .3
f(i) = Σ_{j<i} p(j),  so  f(a) = .0,  f(b) = .2,  f(c) = .7
[figure: the interval [0,1) split into a = [0,.2), b = [.2,.7), c = [.7,1)]

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7)).

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
b: [0,1) → [.2,.7)    a: [.2,.7) → [.2,.3)    c: [.2,.3) → [.27,.3)
The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l0 = 0,   li = li−1 + si−1 · f[ci]
s0 = 1,   si = si−1 · p[ci]

f[c] is the cumulative prob. up to symbol c (not included).
The final interval size is   sn = Π_{i=1,...,n} p[ci]

The interval for a message sequence will be called the
sequence interval
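The interval computation can be sketched directly from these formulas; the Model struct and the function name below are illustrative, and a real coder uses the integer version with scaling shown later rather than doubles:

#include <string>
#include <utility>

struct Model { double p[256]; double f[256]; };   // per-symbol probability and cumulative f[c]

// Returns (l, s): the sequence interval of msg is [l, l+s).
std::pair<double,double> sequence_interval(const std::string& msg, const Model& m) {
    double l = 0.0, s = 1.0;
    for (unsigned char c : msg) {
        l = l + s * m.f[c];      // l_i = l_{i-1} + s_{i-1} * f[c_i]
        s = s * m.p[c];          // s_i = s_{i-1} * p[c_i]
    }
    return {l, s};
}
// With p(a)=.2, p(b)=.5, p(c)=.3 and f(a)=0, f(b)=.2, f(c)=.7,
// sequence_interval("bac", m) gives l = 0.27 and s = 0.03, i.e. the interval [.27,.3).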

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:
.49 ∈ [.2,.7) → b;   within [.2,.7): .49 ∈ [.3,.55) → b;   within [.3,.55): .49 ∈ [.475,.55) → c
The message is bbc.

Representing a real number
Binary fractional representation:   .75 = .11    1/3 = .010101…    11/16 = .1011
Algorithm:  1. x = 2·x;   2. if x < 1 output 0;   3. else x = x − 1 and output 1

So how about just using the shortest binary fractional representation in the sequence interval?
e.g.  [0,.33) → .01    [.33,.66) → .1    [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions.

number   min       max       interval
.11      .110…     .111…     [.75, 1.0)
.101     .1010…    .1011…    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval is contained in the sequence interval (a dyadic number).
[figure: sequence interval [.61,.79) containing the code interval [.625,.75) of .101]

Can use L + s/2 truncated to 1 + ⌈log (1/s)⌉ bits

Bound on Arithmetic length
Note that −log s + 1 = log (2/s).

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most
1 + ⌈log (1/s)⌉ = 1 + ⌈log Π_i (1/pi)⌉ ≤ 2 + Σ_{j=1,...,n} log (1/pj) = 2 + Σ_{k=1,...,|S|} n·pk·log (1/pk) = 2 + n·H0 bits

In practice nH0 + 0.02·n bits, because of rounding.

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 (top half): output 1 followed by m 0s; set m = 0; the message interval is expanded by 2.
If u < R/2 (bottom half): output 0 followed by m 1s; set m = 0; the message interval is expanded by 2.
If l ≥ R/4 and u < 3R/4 (middle half): increment m; the message interval is expanded by 2.
In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine:
[figure: the ATB maps the current interval (L,s) and a symbol c, drawn from the distribution (p1,…,p|S|), to the sub-interval (L’,s’) assigned to c]

Therefore, even the distribution can change over time.

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
[figure: at each step the ATB is fed p[ s | context ], with s = c or esc, and maps (L,s) to (L’,s’)]

Encoder and Decoder must know the protocol for selecting the same conditional probability distribution (PPM-variant).

PPM: Example Contexts
String = ACCBACCACBA B,  k = 2   ($ is the escape symbol)

Context    Counts
Empty      A = 4   B = 2   C = 5   $ = 3

Context    Counts
A          C = 3   $ = 1
B          A = 2   $ = 1
C          A = 1   B = 2   C = 2   $ = 3

Context    Counts
AC         B = 1   C = 2   $ = 2
BA         C = 1   $ = 1
CA         C = 1   $ = 1
CB         A = 2   $ = 1
CC         A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window (window size = 6)
T = a a c a a c a b c a b a a a c
Output triples (d, len, next char):
(0,0,a)   (1,1,c)   (3,4,b)   (3,3,a)   (1,2,c)
At each step: longest match within the window W, followed by the next character.

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
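Filling in the copy loop above, a minimal LZ77 decoder might look as follows (triple format (d, len, c) as in these slides):

#include <string>
#include <tuple>
#include <vector>

std::string lz77_decode(const std::vector<std::tuple<int,int,char>>& triples) {
    std::string out;
    for (const auto& t : triples) {
        int d = std::get<0>(t), len = std::get<1>(t);
        size_t cursor = out.size();
        for (int i = 0; i < len; i++)
            out += out[cursor - d + i];          // left-to-right copy handles the l > d overlap
        out += std::get<2>(t);                   // the explicit next character
    }
    return out;
}
// lz77_decode({{0,0,'a'},{1,1,'c'},{3,4,'b'},{3,3,'a'},{1,2,'c'}}) == "aacaacabcabaaac"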

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Input: a a b a a c a b c a b c b

Output   Dict.
(0,a)    1 = a
(1,b)    2 = ab
(1,a)    3 = aa
(0,c)    4 = c
(2,c)    5 = abc
(5,b)    6 = abcb

LZ78: Decoding Example
Input    Output so far                   Dict.
(0,a)    a                               1 = a
(1,b)    a a b                           2 = ab
(1,a)    a a b a a                       3 = aa
(0,c)    a a b a a c                     4 = c
(2,c)    a a b a a c a b c               5 = abc
(5,b)    a a b a a c a b c a b c b       6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
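A compact LZW-encoder sketch; here the dictionary is initialized with the real character codes, while the slides' example numbers the entries starting from a = 112, so the emitted codes differ even though the dictionary grows in the same way:

#include <map>
#include <string>
#include <vector>

std::vector<int> lzw_encode(const std::string& text) {
    std::map<std::string,int> dict;
    for (int c = 0; c < 256; c++) dict[std::string(1, (char)c)] = c;   // 256 initial entries
    int next_code = 256;
    std::vector<int> out;
    std::string S;
    for (char c : text) {
        if (dict.count(S + c)) { S += c; continue; }   // extend the longest match S
        out.push_back(dict[S]);                        // emit the id of S (no extra char)
        dict[S + c] = next_code++;                     // still add Sc to the dictionary
        S = std::string(1, c);
    }
    if (!S.empty()) out.push_back(dict[S]);
    return out;
}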

LZW: Encoding Example
Input: a a b a a c a b a b a c b   (a = 112, b = 113, c = 114)

Output   Dict.
112      256 = aa
112      257 = ab
113      258 = ba
256      259 = aac
114      260 = ca
257      261 = aba
261      262 = abac
114      263 = cb

LZW: Decoding Example
Input   Output so far            Dict. (added one step later)
112     a
112     a a                      256 = aa
113     a a b                    257 = ab
256     a a b a a                258 = ba
114     a a b a a c              259 = aac
257     a a b a a c a b ?        260 = ca
261     a a b a a c a b a b a    261 = aba   (261 not yet in the dictionary: the SSc special case)
114     …                        262 = abac

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform (1994)
Let us be given a text T = mississippi#.
Build all cyclic rotations of T:
mississippi#, ississippi#m, ssissippi#mi, sissippi#mis, issippi#miss, ssippi#missi, sippi#missis, ippi#mississ, ppi#mississi, pi#mississip, i#mississipp, #mississippi
Sort the rows; the first column is F, the last column is L:
F = # i i i i m p p s s s s
L = i p s s m # p i s s i i

A famous example

Much
longer...

A useful tool: the L → F mapping
[figure: the sorted-rotation matrix again, with its (unknown) middle part, first column F and last column L]

How do we map L’s chars onto F’s chars ?
... we need to distinguish equal chars in F...

Take two equal chars of L and rotate their rows rightward by one position: they appear in F in the same relative order !!

The BWT is invertible
[figure: F and L columns of the sorted-rotation matrix, as above]

Two key properties:
1. The LF-array maps L’s chars to F’s chars
2. L[i] precedes F[i] in T

Reconstruct T backward:   T = .... i p p i #

InvertBWT(L)
  Compute LF[0,n-1];
  r = 0; i = n;
  while (i > 0) {
    T[i] = L[r];
    r = LF[r]; i--;
  }

How to compute the BWT ?
SA = 12 11 8 5 2 1 10 9 7 4 6 3   (the rows of the BWT matrix, in sorted order, start at these text positions)
L  =  i  p  s  s  m  #  p  i  s  s  i  i

We said that L[i] precedes F[i] in T; e.g. L[3] = T[7].
Given SA and T, we have L[i] = T[SA[i]-1].
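That one-line relation gives an immediate SA-based way to obtain L; a small sketch, with SA holding 1-based starting positions as in the slides:

#include <string>
#include <vector>

// L[i] = T[SA[i]-1]; when the suffix starts at position 1 the preceding char wraps to T[n] = '#'.
std::string bwt_from_sa(const std::string& T, const std::vector<int>& SA) {
    int n = T.size();
    std::string L(n, ' ');
    for (int i = 0; i < n; i++) {
        int j = SA[i];                                  // 1-based start of the i-th smallest suffix
        L[i] = (j == 1) ? T[n - 1] : T[j - 2];          // 0-based indexing into T
    }
    return L;
}
// bwt_from_sa("mississippi#", {12,11,8,5,2,1,10,9,7,4,6,3}) == "ipssm#pissii"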

How to construct SA from T ?
Input: T = mississippi#
Sort the suffixes:  #, i#, ippi#, issippi#, ississippi#, mississippi#, pi#, ppi#, sippi#, sissippi#, ssippi#, ssissippi#
SA = 12 11 8 5 2 1 10 9 7 4 6 3

Elegant but inefficient. Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution
Altavista crawl, 1999 / WebBase crawl, 2001
The indegree follows a power-law distribution:
Pr[ in-degree(u) = k ]  ≈  1 / k^a,   a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
[figure: adjacency-matrix plot, rows i and columns j]
21 million pages, 150 million links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides and efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
          Emacs size   Emacs time
uncompr   27Mb         ---
gzip      8Mb          35 secs
zdelta    1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: a pair of proxies located on each side of the slow link use a proprietary protocol to increase performance over this link.
[figure: Client ↔ client-side proxy ↔ (slow link, delta-encoding) ↔ server-side proxy ↔ (fast link) ↔ web]

Use zdelta to reduce traffic:
 Old version available at both proxies
 Restricted to pages already visited (30% hits), URL-prefix match
 Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[figure: an example weighted graph GF (edge weights are zdelta sizes), with a dummy node connected to all files; the min branching picks one reference per file]

           space    time
uncompr    30Mb     ---
tgz        20%      linear
THIS       8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strictly n² time.

           space    time
uncompr    260Mb    ---
tgz        12%      2 mins
THIS       8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
[figure: the Client holds f_old and sends a request; the Server holds f_new and sends back an update]

 the client wants to update an out-dated file
 the server has the new file but does not know the old file
 update without sending the entire f_new (exploiting similarity)
 rsync: file-synch tool, distributed with Linux

Delta compression is a sort of local synch, since the server has both copies of the files.

The rsync algorithm
[figure: the Client sends block hashes of f_old; the Server, which holds f_new, replies with an encoded file built from block references and literals]

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
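The key ingredient is the rolling hash: sliding the block window by one character must cost O(1), so the block hashes can be tested at every offset of the other file. The sketch below uses a generic polynomial rolling hash with illustrative constants; the actual rsync weak checksum is an Adler-32-like sum.

#include <cstdint>
#include <string>

struct RollingHash {
    static const uint64_t BASE = 256, MOD = 1000000007ULL;
    int B;                       // block size
    uint64_t h = 0, pw = 1;      // current hash and BASE^(B-1) mod MOD
    explicit RollingHash(int block) : B(block) {
        for (int i = 1; i < B; i++) pw = pw * BASE % MOD;
    }
    void init(const std::string& s, int start) {          // hash of s[start, start+B)
        h = 0;
        for (int i = 0; i < B; i++) h = (h * BASE + (unsigned char)s[start + i]) % MOD;
    }
    void roll(unsigned char out, unsigned char in) {       // slide the window by one char
        h = ((h + MOD - out * pw % MOD) * BASE + in) % MOD;
    }
};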

Rsync: some experiments

         gcc size   emacs size
total    27288      27326
gzip     7563       8577
zdelta   227        1431
rsync    964        4452

Compressed size in KB (slightly outdated numbers)
Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends the hashes (in rsync it is the client that sends them), and the client checks them.
The server deploys the common fref to compress the new ftar (rsync compresses just it).

A multi-round protocol
k blocks of n/k elements, log(n/k) levels.
If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits.

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
T# = mississippi#   (positions 1 2 3 4 5 6 7 8 9 10 11 12)
[figure: the suffix tree of T#, with edge labels such as mississippi#, i, #, si, ssi, p, pi#, ppi#, i#; each of the 12 leaves stores the starting position of its suffix]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

T = mississippi#,  P = si
SA = 12 11 8 5 2 1 10 9 7 4 6 3
SUF(T) = #, i#, ippi#, issippi#, ississippi#, mississippi#, pi#, ppi#, sippi#, sissippi#, ssippi#, ssissippi#

Storing the sorted suffixes explicitly would take Θ(N²) space; the suffix array keeps only the suffix pointers:
• SA: Θ(N log2 N) bits
• Text T: N chars
⇒ in practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step.
T = mississippi#,  P = si,  SA = 12 11 8 5 2 1 10 9 7 4 6 3
Compare P against the suffix pointed to by the middle entry of the current SA range: if P is larger, recurse on the right half; if P is smaller, recurse on the left half.

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
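A sketch of the plain O(p log2 N) search (without the lcp speed-ups cited above); SA holds 1-based suffix starting positions, and the function name is illustrative:

#include <string>
#include <vector>

// Count the suffixes of T having P as a prefix (i.e. the occurrences of P).
int count_occurrences(const std::string& T, const std::vector<int>& SA, const std::string& P) {
    auto suffix_geq   = [&](int pos) { return T.compare(pos - 1, std::string::npos, P) >= 0; };
    auto has_prefix_P = [&](int pos) { return T.compare(pos - 1, P.size(), P) == 0; };
    int lo = 0, hi = (int)SA.size();
    while (lo < hi) {                        // binary search: first suffix >= P
        int mid = (lo + hi) / 2;
        if (suffix_geq(SA[mid])) hi = mid; else lo = mid + 1;
    }
    int cnt = 0;                             // Prop 1: the matching suffixes are contiguous
    while (lo + cnt < (int)SA.size() && has_prefix_P(SA[lo + cnt])) cnt++;
    return cnt;
}
// count_occurrences("mississippi#", {12,11,8,5,2,1,10,9,7,4,6,3}, "si") == 2
// (a second binary search for the right boundary would avoid the final linear scan)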

Locating the occurrences
T = mississippi#,  P = si,  SA = 12 11 8 5 2 1 10 9 7 4 6 3
The occurrences of P form a contiguous range of SA (here the entries pointing to sippi# and sissippi#), delimited by binary searches for the range boundaries:  occ = 2, at text positions 7 and 4.

Suffix Array search: O(p + log2 N + occ) time
Suffix Trays: O(p + log2 |S| + occ)   [Cole et al., ’06]
String B-tree   [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays   [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest common prefix between suffixes adjacent in SA

T = mississippi#
SA  = 12 11 8 5 2 1 10 9 7 4 6 3
Lcp =   0  1 1 4 0 0  1 0 2 1 3

• How long is the common prefix between T[i,...] and T[j,...] ?
  It is the minimum of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  Search for some Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.
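The last two questions reduce to a single scan of the Lcp array; a small sketch:

#include <vector>

// True iff some substring of length >= L occurs at least C times
// (C = 2 answers the simpler "is there a repeat of length >= L?" question).
bool has_repeat(const std::vector<int>& lcp, int L, int C) {
    int run = 0;                               // current run of consecutive Lcp entries >= L
    for (int v : lcp) {
        run = (v >= L) ? run + 1 : 0;
        if (run >= C - 1) return true;         // C suffixes share a prefix of length >= L
    }
    return false;
}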


Slide 122

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with i-th bit of U(T[j]) to estabilish if both are true

An example j=1
1 2 3 4 5 6 7 8 9 10

n
m

1

12345

1

0

P=abaac

2

0

3

0

4

0

5

0

T=xabxabaaca

0
 
0
U (x)  0
 
0
 
0

2

3

4

5

6

7

1 
 
0
BitShift ( M ( 0 )) & U (T [1])   0  &
 
0
 
0

8

9

0 0
   
0 0
0  0
   
0 0
   
0 0

1
0

An example j=2
1 2 3 4 5 6 7 8 9 10

n
m

1

2

12345

1

0

1

P=abaac

2

0

0

3

0

0

4

0

0

5

0

0

T=xabxabaaca

1 
 
0
U (a )   1 
 
1 
 
0

3

4

5

6

7

1 
 
0
BitShift ( M (1)) & U (T [ 2 ])   0  &
 
0
 
0

8

9

1  1 
   
0 0
1    0 
   
1   0 
   
0 0

1
0

An example j=3
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

12345

1

0

1

0

P=abaac

2

0

0

1

3

0

0

0

4

0

0

0

5

0

0

0

T=xabxabaaca

0
 
1 
U (b )   0 
 
0
 
0

4

5

6

7

1 
 
1 
BitShift ( M ( 2 )) & U (T [ 3 ])   0  &
 
0
 
0

8

9

0 0
   
1  1 
0  0
   
0 0
   
0 0

1
0

An example j=9
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

4

5

6

7

8

9

12345

1

0

1

0

0

1

0

1

1

0

P=abaac

2

0

0

1

0

0

1

0

0

0

3

0

0

0

0

0

0

1

0

0

4

0

0

0

0

0

0

0

1

0

5

0

0

0

0

0

0

0

0

1

T=xabxabaaca

0
 
0
U (c )   0 
 
0
 
1 

1 
 
1 
BitShift ( M ( 8 )) & U (T [ 9 ])   0  &
 
0
 
1 

0 0
   
0 0
0  0
   
0 0
   
1  1 

1
0

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring, allowing at most k mismatches.
Dictionary = {bzip, not, or, …},   S = “bzip or not bzip”,   P = bot,  k = 2
[Figure: the compressed text C(S) with the tagged codewords of the dictionary terms.]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position 4; it also occurs with 4 mismatches starting at position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal: given k, find all the occurrences of P in T with up to k mismatches.
We define the matrix Ml to be an m by n binary matrix such that:

Ml(i,j) = 1 iff there are no more than l mismatches between the first i characters of P and the i characters of T ending at position j.

What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l = 0, …, k.
For each j we compute M0(j), M1(j), …, Mk(j).
For all l, initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a match iff …

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

[Figure: P[1..i-1] aligned with the substring of T ending at j-1, with at most l mismatches; P[i] = T[j].]

This case contributes:   BitShift( Ml(j-1) ) & U( T[j] )

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

[Figure: P[1..i-1] aligned with the substring of T ending at j-1, with at most l-1 mismatches; P[i] is charged as one extra mismatch against T[j].]

This case contributes:   BitShift( Ml-1(j-1) )

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M0(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

Ml(j)  =  [ BitShift( Ml(j-1) ) & U(T[j]) ]   OR   BitShift( Ml-1(j-1) )
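A minimal sketch of the resulting k-mismatch search (Agrep-style), directly implementing the recurrence above; as before, bit i-1 of a Python integer stands for row i of the column, and the function name is illustrative.

def agrep_mismatches(T, P, k):
    """Shift-And with up to k mismatches: report (end position, mismatch level) pairs."""
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    occ = 1 << (m - 1)
    M = [0] * (k + 1)                       # M[l] = current column of Ml
    out = []
    for j, c in enumerate(T, 1):
        Uc = U.get(c, 0)
        prev = M[:]                         # columns at position j-1
        M[0] = ((prev[0] << 1) | 1) & Uc
        for l in range(1, k + 1):
            # exact extension with <= l mismatches, OR one extra mismatch on P[i]
            M[l] = (((prev[l] << 1) | 1) & Uc) | ((prev[l - 1] << 1) | 1)
        for l in range(k + 1):
            if M[l] & occ:
                out.append((j, l))          # occurrence ending at j with <= l mismatches
                break
    return out

# agrep_mismatches("xabxabaaca", "abaad", 1) -> [(9, 1)]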

Example M1
T = xabxabaaca,   P = abaad

M1 =    j: 1 2 3 4 5 6 7 8 9 10
   i=1:    1 1 1 1 1 1 1 1 1 1
   i=2:    0 0 1 0 0 1 0 1 1 0
   i=3:    0 0 0 1 0 0 1 0 0 1
   i=4:    0 0 0 0 1 0 0 1 0 0
   i=5:    0 0 0 0 0 0 0 0 1 0

M0 =    j: 1 2 3 4 5 6 7 8 9 10
   i=1:    0 1 0 0 1 0 1 1 0 1
   i=2:    0 0 1 0 0 1 0 0 0 0
   i=3:    0 0 0 0 0 0 1 0 0 0
   i=4:    0 0 0 0 0 0 0 1 0 0
   i=5:    0 0 0 0 0 0 0 0 0 0

M1(5,9) = 1: P occurs ending at position 9 with one mismatch (T[5..9] = abaac vs P = abaad).

How much do we pay?





The running time is O(kn(1+m/w)).
Again, the method is practically efficient for small m.
Only O(k) columns of M are needed at any given time; hence the space used by the algorithm is O(k) memory words (for m ≤ w).

Problem 3: Solution
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring, allowing k mismatches.
Dictionary = {bzip, not, or, …},   S = “bzip or not bzip”,   P = bot,  k = 2
not = 1g 0g 0a
[Figure: the term not matches P = bot within k = 2 mismatches; its codeword 1g 0g 0a is then searched in C(S) and its occurrence is marked yes.]

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding
   γ(x) = 0 0 … 0 followed by x in binary      (Length-1 zeros)

 x > 0 and Length = ⌊log2 x⌋ + 1
   e.g., 9 is represented as <000, 1001>.

 The γ-code for x takes 2⌊log2 x⌋ + 1 bits
   (i.e. a factor of 2 from optimal)

 Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

Answer: 8, 6, 3, 59, 7
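A small sketch of γ-encoding/decoding that can be used to check the exercise above (function names are illustrative):

def gamma_encode(x):
    """Gamma code of a positive integer: (Length-1) zeros followed by x in binary."""
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    """Decode a concatenation of gamma codes back into the integer sequence."""
    out, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":          # count the unary prefix
            zeros += 1
            i += 1
        out.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return out

# gamma_decode("0001000001100110000011101100111") -> [8, 6, 3, 59, 7]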

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that |γ(i)| ≤ 2·log2 i + 1.
How good is this approach w.r.t. Huffman?
Compression ratio ≤ 2·H0(S) + 1
Key fact:    1 ≥ Σ_{i=1,…,x} p_i ≥ x · p_x   ⇒   x ≤ 1/p_x

How good is it ?
Encode the integers via γ-coding:  |γ(i)| ≤ 2·log2 i + 1
The cost of the encoding is (recall that i ≤ 1/p_i):

   Σ_{i=1,…,|S|} p_i · |γ(i)|   ≤   Σ_{i=1,…,|S|} p_i · [ 2·log2 (1/p_i) + 1 ]   =   2·H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding


 Byte-aligned and tagged Huffman
    128-ary Huffman tree
    the first bit of the first byte is tagged
    configurations on 7 bits: just those of Huffman
 End-tagged dense code
    the rank r is mapped to the r-th binary sequence on 7·k bits
    the first bit of the last byte is tagged

A better encoding: surprising changes
 It is a prefix-code
 Better compression: it uses all 7-bit configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2

 A new concept: Continuers vs Stoppers
    previously we used s = c = 128
 The main idea is:
    s + c = 256 (we are playing with 8 bits)
    thus s items are encoded with 1 byte
    s·c items with 2 bytes, s·c^2 with 3 bytes, ...

An example
 5000 distinct words
 ETDC encodes 128 + 128^2 = 16512 words on (at most) 2 bytes
 A (230,26)-dense code encodes 230 + 230·26 = 6210 words on (at most) 2 bytes, hence more words on 1 byte — a win if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, there seems to be a unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC is very interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n^2 log n), MTF = O(n log n) + n^2

Not much worse than Huffman
...but it may be far better
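A minimal MTF encoder/decoder sketch, assuming the initial list L is the alphabet in some agreed order:

def mtf_encode(text, alphabet):
    """Move-to-Front: turn a symbol sequence into a sequence of list positions."""
    L = list(alphabet)                  # e.g. ['a','b','c','d',...]
    out = []
    for s in text:
        i = L.index(s)                  # 1) output the position of s in L (0-based here)
        out.append(i)
        L.pop(i); L.insert(0, s)        # 2) move s to the front
    return out

def mtf_decode(codes, alphabet):
    L = list(alphabet)
    out = []
    for i in codes:
        s = L[i]
        out.append(s)
        L.pop(i); L.insert(0, s)
    return "".join(out)

# mtf_encode("abba", "abcd") -> [0, 1, 0, 1]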

MTF: how good is it ?
Encode the integers via γ-coding:  |γ(i)| ≤ 2·log2 i + 1
Put the alphabet S at the front of the list and consider the cost of encoding:

   O(|S| log|S|)  +  Σ_{x=1,…,|S|}  Σ_{i=2,…,n_x}  |γ( p_i^x − p_{i−1}^x )|

where p_i^x is the position of the i-th occurrence of symbol x.
By Jensen’s inequality:

   ≤  O(|S| log|S|)  +  Σ_{x=1,…,|S|}  n_x · [ 2·log2 (N/n_x) + 1 ]
   =  O(|S| log|S|)  +  N · [ 2·H0(X) + 1 ]

Hence   La[mtf]  ≤  2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to maintain the MTF-list efficiently:

 Search tree
    leaves contain the symbols, ordered as in the MTF-list
    nodes contain the size of their descending subtree
 Hash table
    key is a symbol
    data is a pointer to the corresponding tree leaf

Each tree operation takes O(log |S|) time.
Total is O(n log |S|), where n = #symbols to be compressed.

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1^n 2^n 3^n … n^n    (there is a memory)

Huff(X) = Θ(n^2 log n)  >  Rle(X) = Θ(n log n)
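A tiny RLE sketch matching the example above:

def rle_encode(s):
    """Run-Length Encoding: (symbol, run length) pairs."""
    out, i = [], 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1
        out.append((s[i], j - i))
        i = j
    return out

# rle_encode("abbbaacccca") -> [('a',1),('b',3),('a',2),('c',4),('a',1)]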

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.   p(a) = .2,  p(b) = .5,  p(c) = .3
       f(a) = .0,  f(b) = .2,  f(c) = .7,     where  f(i) = Σ_{j=1,…,i-1} p(j)

[Figure: the unit interval [0,1) split into a = [0,.2), b = [.2,.7), c = [.7,1).]

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7)).

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
   start:     [0, 1)
   after b:   [.2, .7)
   after a:   [.2, .3)
   after c:   [.27, .3)

The final sequence interval is [.27, .3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
   l_0 = 0,   l_i = l_{i-1} + s_{i-1} · f[c_i]
   s_0 = 1,   s_i = s_{i-1} · p[c_i]

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is    s_n = ∏_{i=1,…,n} p[c_i]

The interval for a message sequence will be called the
sequence interval
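A sketch of the interval computation, directly applying the recurrences above (floating-point arithmetic, so results are exact only up to rounding; real coders use the integer version described later):

def sequence_interval(msg, p, f):
    """Compute the sequence interval [l, l+s) of a message, given p[c] and cumulative f[c]."""
    l, s = 0.0, 1.0
    for c in msg:
        l = l + s * f[c]        # l_i = l_{i-1} + s_{i-1} * f[c_i]
        s = s * p[c]            # s_i = s_{i-1} * p[c_i]
    return l, s

p = {'a': .2, 'b': .5, 'c': .3}
f = {'a': .0, 'b': .2, 'c': .7}
# sequence_interval("bac", p, f) -> approximately (0.27, 0.03), i.e. the interval [.27, .30)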

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
   .49 ∈ [.2, .7)  →  b ;   rescale: (.49 − .2)/.5 = .58
   .58 ∈ [.2, .7)  →  b ;   rescale: (.58 − .2)/.5 = .76
   .76 ∈ [.7, 1)   →  c

The message is bbc.

Representing a real number
Binary fractional representation:
   .75 = .11      1/3 = .010101…      11/16 = .1011

Algorithm (to emit the bits of x ∈ [0,1)):
   1. x = 2·x
   2. if x < 1 output 0
   3. else x = x − 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
   e.g.  [0,.33) → .01      [.33,.66) → .1      [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
   code    min       max       interval
   .11     .110      .111      [.75, 1.0)
   .101    .1010     .1011     [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

[Figure: the Sequence Interval [.61, .79) contains the Code Interval of .101, i.e. [.625, .75).]

Can use L + s/2 truncated to its first 1 + ⌈log2 (1/s)⌉ bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most
   1 + ⌈log2 (1/s)⌉  =  1 + ⌈log2 ∏_{i=1,…,n} (1/p_i)⌉
   ≤  2 + Σ_{i=1,…,n} log2 (1/p_i)
   =  2 + Σ_{k=1,…,|Σ|} n·p_k · log2 (1/p_k)
   =  2 + n·H0   bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R = 2^k
Use rounding to generate integer intervals
Whenever the sequence interval falls into the top, bottom or middle half, expand the interval by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 (top half):
   Output 1 followed by m 0s;  m = 0
   Message interval is expanded by 2

If u < R/2 (bottom half):
   Output 0 followed by m 1s;  m = 0
   Message interval is expanded by 2

If l ≥ R/4 and u < 3R/4 (middle half):
   Increment m
   Message interval is expanded by 2

All other cases: just continue...

You find this at

Arithmetic ToolBox
As a state machine

[Figure: the Arithmetic ToolBox as a state machine: given the current interval (L,s), the distribution (p1,…,p|Σ|) and the next symbol c, it returns the narrowed interval (L’,s’) with L’ = L + s·f(c) and s’ = s·p(c).]

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
[Figure: PPM drives the ATB: at each step it supplies p[ s | context ], where s is either the next character c or the escape symbol esc; the ATB maps (L,s) to (L’,s’).]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts       String = ACCBACCACBA B,    k = 2

Context Empty:    A = 4    B = 2    C = 5    $ = 3

Order-1 contexts:
   A:   C = 3   $ = 1
   B:   A = 2   $ = 1
   C:   A = 1   B = 2   C = 2   $ = 3

Order-2 contexts:
   AC:  B = 1   C = 2   $ = 2
   BA:  C = 1   $ = 1
   CA:  C = 1   $ = 1
   CB:  A = 2   $ = 1
   CC:  A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
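A toy LZ77 sketch (quadratic-time search inside a window of W characters; real implementations such as gzip use hashing, as the next slide notes). On the windowed example above it reproduces exactly the triples (0,0,a)(1,1,c)(3,4,b)(3,3,a)(1,2,c).

def lz77_encode(T, W=6):
    """Toy LZ77 with a sliding window of W characters: emit (distance, length, next-char) triples."""
    i, out = 0, []
    while i < len(T):
        best_d, best_len = 0, 0
        for d in range(1, min(i, W) + 1):
            l = 0
            # the copy may overlap the current position (l > d is allowed)
            while i + l < len(T) - 1 and T[i + l - d] == T[i + l]:
                l += 1
            if l > best_len:
                best_d, best_len = d, l
        out.append((best_d, best_len, T[i + best_len]))
        i += best_len + 1                 # advance by len + 1
    return out

def lz77_decode(triples):
    out = []
    for d, l, c in triples:
        for _ in range(l):
            out.append(out[-d])           # copy starting at the cursor, handles overlap
        out.append(c)
    return "".join(out)

# lz77_encode("aacaacabcabaaac") -> [(0,0,'a'),(1,1,'c'),(3,4,'b'),(3,3,'a'),(1,2,'c')]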

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb
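A compact LZ78 sketch; the handling of a pending match at end-of-input is an implementation choice of this sketch, not specified in the slides.

def lz78_encode(T):
    """LZ78: emit (dictionary-id, next-char) pairs; id 0 is the empty string."""
    D = {"": 0}
    out, w = [], ""
    for c in T:
        if w + c in D:
            w += c                        # extend the current match
        else:
            out.append((D[w], c))         # id of longest match S + next char c
            D[w + c] = len(D)             # add Sc to the dictionary
            w = ""
    if w:                                 # flush a pending match (no next char left)
        out.append((D[w[:-1]], w[-1]))
    return out

def lz78_decode(pairs):
    D = {0: ""}
    out = []
    for i, c in pairs:
        s = D[i] + c
        out.append(s)
        D[len(D)] = s                     # builds the same dictionary as the coder
    return "".join(out)

# lz78_encode("aabaacabcabcb") -> [(0,'a'),(1,'b'),(1,'a'),(0,'c'),(2,'c'),(5,'b')]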

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later
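A toy LZW sketch following the slides' conventions (single characters numbered from 112, multi-character entries from 256); the else-branch in the decoder is the SSc special case mentioned above, where a code arrives one step before the decoder has completed that dictionary entry.

def lzw_encode(T, first_code=112):
    """Toy LZW: the dictionary initially maps the single chars of T to codes 112, 113, ..."""
    D = {c: first_code + i for i, c in enumerate(sorted(set(T)))}
    nxt = 256
    out, w = [], ""
    for c in T:
        if w + c in D:
            w += c
        else:
            out.append(D[w])
            D[w + c] = nxt; nxt += 1      # add Sc, but do NOT send c
            w = c
    if w:
        out.append(D[w])
    return out

def lzw_decode(codes, alphabet, first_code=112):
    D = {first_code + i: c for i, c in enumerate(alphabet)}
    nxt = 256
    w = D[codes[0]]
    out = [w]
    for k in codes[1:]:
        if k in D:
            s = D[k]
        else:                             # SSc special case: code not yet in the dictionary
            s = w + w[0]
        out.append(s)
        D[nxt] = w + s[0]; nxt += 1       # decoder is one step behind the coder
        w = s
    return "".join(out)

# lzw_encode("aabaacababacb") -> [112, 112, 113, 256, 114, 257, 261, 114, 113]
# lzw_decode([112,112,113,256,114,257,261,114,113], "abc") -> "aabaacababacb"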

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
We are given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F                  L
#   mississipp    i
i   #mississip    p
i   ppi#missis    s
i   ssippi#mis    s
i   ssissippi#    m
m   ississippi    #
p   i#mississi    p
p   pi#mississ    i
s   ippi#missi    s
s   issippi#mi    s
s   sippi#miss    i
s   sissippi#m    i

(1994)
L is a permutation of T: a famous example — real texts are of course much longer...

A useful tool: the L → F mapping
[Figure: the same sorted-rotations matrix, with the F and L columns highlighted; the text T is unknown to the decoder.]

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
[Figure: the F and L columns of the sorted-rotations matrix, with arrows showing the LF mapping between equal characters.]

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
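A runnable sketch of the transform and of its inversion; inverse_bwt is a 0-indexed variant of the InvertBWT pseudocode above, which pins the end-marker (assumed to be the smallest character, here '#') at the last position and builds the LF array by stably sorting L, i.e. by the "same relative order" property.

def bwt(T):
    """Burrows-Wheeler Transform by sorting all rotations (T must end with a unique '#')."""
    n = len(T)
    rotations = sorted(T[i:] + T[:i] for i in range(n))
    return "".join(row[-1] for row in rotations)

def inverse_bwt(L):
    """Invert the BWT using the LF mapping: L[i] precedes F[i] in T."""
    n = len(L)
    # LF[i] = position in F (the sorted L) of the character L[i],
    # preserving the relative order of equal characters
    order = sorted(range(n), key=lambda i: (L[i], i))
    LF = [0] * n
    for f_pos, l_pos in enumerate(order):
        LF[l_pos] = f_pos
    T = [""] * n
    T[-1] = min(L)                 # end-marker, assumed smallest character
    r, i = 0, n - 2                # row 0 is "#....", whose last char precedes '#' in T
    while i >= 0:
        T[i] = L[r]
        r = LF[r]
        i -= 1
    return "".join(T)

# bwt("mississippi#")         -> "ipssm#pissii"
# inverse_bwt("ipssm#pissii") -> "mississippi#"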

How to compute the BWT ?
SA = 12 11 8 5 2 1 10 9 7 4 6 3     (starting positions of the sorted suffixes / rows of the BWT matrix)
L  =  i  p  s  s  m  #  p  i  s  s  i  i

We said that L[i] precedes F[i] in T; e.g. L[3] = s = T[7].
Given SA and T, we have   L[i] = T[ SA[i] − 1 ]

How to construct SA from T ?
SA     suffix
12     #
11     i#
 8     ippi#
 5     issippi#
 2     ississippi#
 1     mississippi#
10     pi#
 9     ppi#
 7     sippi#
 4     sissippi#
 6     ssippi#
 3     ssissippi#

Elegant but inefficient.      Input: T = mississippi#

Obvious inefficiencies:
• Θ(n^2 log n) time in the worst case
• Θ(n log n) cache misses or I/O faults
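The same construction as a sketch: a naive suffix-array build by suffix sorting, and L obtained as T[SA[i]-1] with the wrap-around for the suffix starting at position 1 (positions are 1-based, as in the slides).

def suffix_array(T):
    """Naive suffix array: sort the suffixes of T (the Theta(n^2 log n) approach noted above)."""
    return sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])

def bwt_from_sa(T, SA):
    """L[i] = T[SA[i]-1]; the suffix starting at position 1 is preceded by the last char of T."""
    return "".join(T[i - 2] if i > 1 else T[-1] for i in SA)

# T = "mississippi#"
# suffix_array(T)              -> [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
# bwt_from_sa(T, suffix_array(T)) -> "ipssm#pissii"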

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion pages available (Google, 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


The largest artifact ever conceived by humankind

Exploit the structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


 Physical network graph: V = routers, E = communication links
 The “cosine” graph (undirected, weighted): V = static web pages, E = semantic distance between pages
 Query-Log graph (bipartite, weighted): V = queries and URLs, E = (q,u) if u is a result for q and has been clicked by some user who issued q
 Social graph (undirected, unweighted): V = users, E = (x,y) if x knows y (facebook, address book, email, …)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: the probability that a node has x links is ∝ 1/x^α, α ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in-degree(u) = k ]  ∝  1 / k^α ,    α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: the probability that a node has x links is ∝ 1/x^α, α ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
[Figure: a picture of the web graph as an adjacency matrix (rows i, columns j).]
21 million pages, 150 million links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



What if the sender has never seen the data at the receiver?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77-scheme provides an efficient, optimal solution:
fknown is the “previously encoded text”; compress the concatenation fknown·fnew starting from fnew

zdelta is one of the best implementations
             Emacs size    Emacs time
 uncompr     27Mb          ---
 gzip        8Mb           35 secs
 zdelta      1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

[Figure: the Client (holding a reference page) sends its request over the slow link and receives the delta-encoded page from the Proxy (also holding the reference), which fetches the full page from the web over the fast link.]

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: example weighted graph over the files 1, 2, 3, 5 plus a dummy node 0; edge weights are zdelta sizes (620, 2000, 220, 123, 20, …); the min branching picks the cheapest reference for each file.]

             space    time
 uncompr     30Mb     ---
 tgz         20%      linear
 THIS        8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
             space    time
 uncompr     260Mb    ---
 tgz         12%      2 mins
 THIS        8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
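For illustration, a simple Adler-style rolling checksum of the kind rsync's weak hash is based on — a sketch, not rsync's exact on-the-wire format:

def weak_hash(block):
    """A simple Adler-like checksum of a block of byte values: (a, b) mod 2^16."""
    M = 1 << 16
    a = sum(block) % M
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % M
    return a, b

def roll(a, b, out_byte, in_byte, blocksize):
    """Slide the window one byte in O(1) time: drop out_byte, add in_byte."""
    M = 1 << 16
    a = (a - out_byte + in_byte) % M
    b = (b - blocksize * out_byte + a) % M
    return a, b

# weak_hash([1, 2, 3]) == (6, 10);  roll(6, 10, 1, 4, 3) == weak_hash([2, 3, 4]) == (9, 16)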

Rsync: some experiments

            gcc size    emacs size
 total      27288       27326
 gzip        7563        8577
 zdelta       227        1431
 rsync        964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends the hashes (unlike rsync, where the client sends them), and the client checks them.
The server deploys the common fref to compress the new ftar (rsync just compresses ftar on its own).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k log n log(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k log n log(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

[Figure: P aligned at position i of T, i.e. P is a prefix of the suffix T[i,N].]

Occurrences of P in T = all suffixes of T having P as a prefix
   P = si,  T = mississippi   →   occurrences at positions 4, 7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Figure: the suffix tree of T# = mississippi# (positions 1–12): edges are labelled with substrings such as #, i, s, p, si, ssi, i#, pi#, ppi#, mississippi#, and the 12 leaves store the starting positions of the suffixes.]

T# = mississippi#

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Θ(N^2) space if SUF(T) is stored explicitly

SA      SUF(T)
12      #
11      i#
 8      ippi#
 5      issippi#
 2      ississippi#
 1      mississippi#
10      pi#
 9      ppi#
 7      sippi#
 4      sissippi#
 6      ssippi#
 3      ssissippi#

T = mississippi#     (each SA entry is a suffix pointer)      P = si

Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison
[Figure: the SA of mississippi#; the binary search compares P = si with the suffix at the middle entry: P is larger. 2 accesses per step.]

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison
[Figure: the next binary-search step: now P = si is smaller than the suffix at the middle entry.]

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

Improvable to O(p + log2 N) [Manber-Myers, ’90] and to O(p + log |S|) [Cole et al, ’06]
Locating the occurrences
[Figure: the occurrences of P = si in T = mississippi# are the contiguous SA entries 7 (sippi…) and 4 (sissippi…); they are delimited by binary-searching the two extremes si# and si$ (with # < Σ < $), hence occ = 2.]

Suffix Array search
• O (p + log2 N + occ) time

Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]
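A sketch of the indirect binary search on SA (it reuses the suffix_array sketch given earlier; suffix positions are 1-based):

def sa_search(T, SA, P):
    """Return the starting positions of all occurrences of P in T, via binary search on SA."""
    def suffix(k):                       # the k-th smallest suffix, truncated to |P| chars
        return T[SA[k] - 1 : SA[k] - 1 + len(P)]
    lo, hi = 0, len(SA)
    while lo < hi:                       # leftmost suffix with prefix >= P
        mid = (lo + hi) // 2
        if suffix(mid) < P: lo = mid + 1
        else: hi = mid
    first = lo
    lo, hi = first, len(SA)
    while lo < hi:                       # leftmost suffix with prefix > P
        mid = (lo + hi) // 2
        if suffix(mid) <= P: lo = mid + 1
        else: hi = mid
    return sorted(SA[first:lo])          # the occ occurrences, O(p log N + occ) time

# T = "mississippi#"; SA = suffix_array(T)
# sa_search(T, SA, "si") -> [4, 7]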

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

 SA  = 12 11 8 5 2 1 10 9 7 4 6 3
 Lcp =  0  1 1 4 0 0  1 0 2 1 3      (e.g. the Lcp between issippi# and ississippi# is 4)

T = mississippi#

• How long is the common prefix between T[i,…] and T[j,…]?
   • It is the min of the subarray Lcp[h,k-1] s.t. SA[h] = i and SA[k] = j.
• Does there exist a repeated substring of length ≥ L?
   • Search for some Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times?
   • Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.
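A small sketch that computes Lcp naively and answers the last query above by sliding a window of C-1 entries over it (function names are illustrative; positions are 1-based as before):

def lcp_array(T, SA):
    """Lcp[i] = length of the longest common prefix of the i-th and (i+1)-th smallest suffixes."""
    def lcp(a, b):
        l = 0
        while a + l <= len(T) and b + l <= len(T) and T[a + l - 1] == T[b + l - 1]:
            l += 1
        return l
    return [lcp(SA[i], SA[i + 1]) for i in range(len(SA) - 1)]

def has_repeat(T, SA, L, C=2):
    """Is there a substring of length >= L occurring >= C times? (a window of C-1 Lcp entries >= L)"""
    Lcp = lcp_array(T, SA)
    return any(all(x >= L for x in Lcp[i:i + C - 1]) for i in range(len(Lcp) - (C - 1) + 1))

# T = "mississippi#"; SA = suffix_array(T)
# lcp_array(T, SA)      -> [0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3]
# has_repeat(T, SA, 3)  -> True   (e.g. "issi" of length 4 occurs twice)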


Slide 123

An example j=2
1 2 3 4 5 6 7 8 9 10

n
m

1

2

12345

1

0

1

P=abaac

2

0

0

3

0

0

4

0

0

5

0

0

T=xabxabaaca

1 
 
0
U (a )   1 
 
1 
 
0

3

4

5

6

7

1 
 
0
BitShift ( M (1)) & U (T [ 2 ])   0  &
 
0
 
0

8

9

1  1 
   
0 0
1    0 
   
1   0 
   
0 0

1
0

An example j=3
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

12345

1

0

1

0

P=abaac

2

0

0

1

3

0

0

0

4

0

0

0

5

0

0

0

T=xabxabaaca

0
 
1 
U (b )   0 
 
0
 
0

4

5

6

7

1 
 
1 
BitShift ( M ( 2 )) & U (T [ 3 ])   0  &
 
0
 
0

8

9

0 0
   
1  1 
0  0
   
0 0
   
0 0

1
0

An example j=9
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

4

5

6

7

8

9

12345

1

0

1

0

0

1

0

1

1

0

P=abaac

2

0

0

1

0

0

1

0

0

0

3

0

0

0

0

0

0

1

0

0

4

0

0

0

0

0

0

0

1

0

5

0

0

0

0

0

0

0

0

1

T=xabxabaaca

0
 
0
U (c )   0 
 
0
 
1 

1 
 
1 
BitShift ( M ( 8 )) & U (T [ 9 ])   0  &
 
0
 
1 

0 0
   
0 0
0  0
   
0 0
   
1  1 

1
0

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal: given k, find all the occurrences of P in T with up to k mismatches.
We define the matrix M^l to be an m-by-n binary matrix such that:

 M^l(i,j) = 1 iff the first i characters of P match the i characters of T ending at
 character j with no more than l mismatches.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

[Figure: the first i−1 characters of P aligned under T up to position j−1, and P[i] = T[j].]

 BitShift( M^l(j−1) ) & U(T[j])

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

[Figure: the first i−1 characters of P aligned under T up to position j−1, with at most
l−1 mismatches.]

 BitShift( M^(l−1)(j−1) )

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))
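
A C sketch of this recurrence (assuming m ≤ 64 and a small fixed bound on k; all names are illustrative). It reports the positions j where M^k(m,j) = 1:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define KMAX 8

void agrep(const char *T, const char *P, int k) {
    uint64_t U[256] = {0}, M[KMAX + 1] = {0};     /* M[l] = current column of M^l     */
    size_t m = strlen(P);
    for (size_t i = 0; i < m; i++) U[(unsigned char)P[i]] |= 1ULL << i;
    uint64_t last = 1ULL << (m - 1);
    for (size_t j = 0; T[j]; j++) {
        uint64_t prev_old = M[0];                 /* old column of the level below    */
        M[0] = ((M[0] << 1) | 1ULL) & U[(unsigned char)T[j]];
        for (int l = 1; l <= k; l++) {
            uint64_t old = M[l];
            /* case 1 (equal chars) OR case 2 (spend one more mismatch)               */
            M[l] = (((M[l] << 1) | 1ULL) & U[(unsigned char)T[j]])
                 | ((prev_old << 1) | 1ULL);
            prev_old = old;
        }
        if (M[k] & last)
            printf("occurrence with <= %d mismatches ending at %zu\n", k, j + 1);
    }
}

int main(void) {
    agrep("aatatccacaa", "atcgaa", 2);            /* finds the match ending at 9      */
    return 0;
}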

Example M1

T = xabxabaaca, P = abaad   (columns j = 1…10)

M1 (at most 1 mismatch):
 i=1: 1 1 1 1 1 1 1 1 1 1
 i=2: 0 0 1 0 0 1 0 1 1 0
 i=3: 0 0 0 1 0 0 1 0 0 1
 i=4: 0 0 0 0 1 0 0 1 0 0
 i=5: 0 0 0 0 0 0 0 0 1 0

M0 (exact matches):
 i=1: 0 1 0 0 1 0 1 1 0 1
 i=2: 0 0 1 0 0 1 0 0 0 0
 i=3: 0 0 0 0 0 0 1 0 0 0
 i=4: 0 0 0 0 0 0 0 1 0 0
 i=5: 0 0 0 0 0 0 0 0 0 0

M1(5,9) = 1: P occurs at T[5,9] = abaac with one mismatch.

How much do we pay?

 The running time is O(kn(1+m/w)).
 Again, the method is practically efficient for small m.
 Only O(k) columns of M are needed at any given time; hence, the space used by the
 algorithm is O(k) memory words.

Problem 3: Solution

Given a pattern P, find all the occurrences in S of all terms containing P as a substring,
allowing k mismatches.

Dictionary = { bzip, not, or }     P = bot, k = 2     S = “bzip or not bzip”
 not = 1g 0g 0a

[Figure: the k-mismatch Shift-And is run over the dictionary terms while scanning C(S) once.]

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is d(p,s) = the minimum number of
operations needed to transform p into s via three ops:

 Insertion: insert a symbol in p
 Deletion: delete a symbol from p
 Substitution: change a symbol of p into a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding

 γ(x) = 000…0 (Length−1 zeros) followed by x in binary,
 where x > 0 and Length = ⌊log2 x⌋ + 1.
  e.g., 9 is represented as <000,1001>.

 The γ-code for x takes 2⌊log2 x⌋ + 1 bits (i.e. a factor of 2 from optimal).

 Optimal for Pr(x) = 1/(2x²), and i.i.d. integers.

 It is a prefix-free encoding…


Given the following sequence of γ-coded integers, reconstruct the original sequence:

0001000001100110000011101100111

8

6

3

59

7
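
A small C sketch of γ-encoding/decoding over strings of '0'/'1' characters (function names are illustrative); it reproduces the exercise above:

#include <stdio.h>

void gamma_encode(unsigned x, char *out) {             /* x > 0                       */
    int len = 0;
    for (unsigned t = x; t; t >>= 1) len++;            /* Length = floor(log2 x) + 1  */
    for (int i = 0; i < len - 1; i++) *out++ = '0';    /* Length-1 zeros              */
    for (int i = len - 1; i >= 0; i--)                 /* x in binary                 */
        *out++ = (x >> i & 1) ? '1' : '0';
    *out = '\0';
}

unsigned gamma_decode(const char **s) {                /* advances *s past one code   */
    int len = 1;
    while (**s == '0') { len++; (*s)++; }              /* count the leading zeros     */
    unsigned x = 0;
    for (int i = 0; i < len; i++) x = (x << 1) | (unsigned)(*(*s)++ - '0');
    return x;
}

int main(void) {
    char buf[64];
    gamma_encode(9, buf);
    printf("gamma(9) = %s\n", buf);                    /* 0001001                     */
    const char *bits = "0001000001100110000011101100111";
    while (*bits) printf("%u ", gamma_decode(&bits));  /* 8 6 3 59 7                  */
    printf("\n");
    return 0;
}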

Analysis
Sort the p_i in decreasing order, and encode s_i via the variable-length code γ(i).

 Recall that: |γ(i)| ≤ 2·log2 i + 1
 How good is this approach wrt Huffman?
 Compression ratio ≤ 2·H0(s) + 1
 Key fact:
  1 ≥ Σ_{i=1..x} p_i ≥ x·p_x  ⇒  x ≤ 1/p_x

How good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log2 i + 1
The cost of the encoding is (recall i ≤ 1/p_i):

 Σ_{i=1..|S|} p_i · |γ(i)|  ≤  Σ_{i=1..|S|} p_i · [ 2·log2 (1/p_i) + 1 ]  =  2·H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2

 A new concept: Continuers vs Stoppers
  Previously we used: s = c = 128

 The main idea is:
  s + c = 256 (we are playing with 8 bits)
  Thus s items are encoded with 1 byte
  And s·c with 2 bytes, s·c² on 3 bytes, ...

An example

 5000 distinct words
 ETDC encodes 128 + 128² = 16,512 words within 2 bytes
 A (230,26)-dense code encodes 230 + 230·26 = 6,210 words within 2 bytes, hence more
 words (230 vs 128) on a single byte; thus, if the distribution is skewed, it compresses better...

Optimal (s,c)-dense codes
Find the optimal s, assuming c = 256 − s.

 Brute-force approach
 Binary search:
  On real distributions, there seems to be a unique minimum

 K_s = max codeword length
 F_s^k = cumulative probability of the symbols whose codeword length is ≤ k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

 Exploits temporal locality, and it is dynamic
 X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n² log n) bits, MTF = O(n log n) + n² bits

Not much worse than Huffman
...but it may be far better
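
A C sketch of the MTF transform over the byte alphabet, with the list kept as a plain array (names are illustrative); the output integers would then be var-length coded:

#include <stdio.h>
#include <string.h>

void mtf_encode(const char *in, int *out) {
    unsigned char list[256];
    for (int i = 0; i < 256; i++) list[i] = (unsigned char)i;   /* L = [0,1,...,255]   */
    for (size_t j = 0; in[j]; j++) {
        unsigned char c = (unsigned char)in[j];
        int pos = 0;
        while (list[pos] != c) pos++;              /* 1) output the position of c in L */
        out[j] = pos + 1;
        memmove(list + 1, list, (size_t)pos);      /* 2) move c to the front of L      */
        list[0] = c;
    }
}

int main(void) {
    const char *s = "aaabbbbccc";
    int out[32];
    mtf_encode(s, out);
    for (size_t j = 0; s[j]; j++) printf("%d ", out[j]);
    printf("\n");   /* first occurrences are costly, repeats cost 1: temporal locality */
    return 0;
}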

MTF: how good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log2 i + 1
Put Σ in front of the sequence and consider the cost of encoding:

 O(|Σ| log |Σ|) + Σ_{x=1..|Σ|} Σ_{i=2..n_x} |γ( p_i^x − p_{i−1}^x )|

where p_i^x is the position of the i-th occurrence of symbol x. By Jensen’s:

 ≤ O(|Σ| log |Σ|) + Σ_{x=1..|Σ|} n_x · [ 2·log2 (N/n_x) + 1 ]
 = O(|Σ| log |Σ|) + N·[ 2·H0(X) + 1 ]

 L_a[mtf] ≤ 2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings ⇒ just the run lengths and one bit
Properties:

 Exploits spatial locality, and it is a dynamic code
 There is a memory
 X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)
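
A tiny C sketch of RLE reproducing the example above (names are illustrative):

#include <stdio.h>

void rle(const char *s) {
    for (size_t i = 0; s[i]; ) {
        size_t j = i;
        while (s[j] == s[i]) j++;            /* extend the current run          */
        printf("(%c,%zu) ", s[i], j - i);    /* emit (symbol, run length)       */
        i = j;
    }
    printf("\n");
}

int main(void) {
    rle("abbbaacccca");                      /* (a,1) (b,3) (a,2) (c,4) (a,1)   */
    return 0;
}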

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.  p(a) = .2, p(b) = .5, p(c) = .3  →  f(a) = .0, f(b) = .2, f(c) = .7

 a → [0.0, 0.2)     b → [0.2, 0.7)     c → [0.7, 1.0)

 f(i) = Σ_{j=1..i−1} p(j)

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac

 start:     [0.0, 1.0)
 after b:   [0.2, 0.7)
 after a:   [0.2, 0.3)
 after c:   [0.27, 0.3)

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
 l_0 = 0      l_i = l_{i−1} + s_{i−1} · f[c_i]
 s_0 = 1      s_i = s_{i−1} · p[c_i]

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

 s_n = Π_{i=1..n} p[c_i]

The interval for a message sequence will be called the
sequence interval
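
A toy C sketch of the sequence-interval computation with doubles, using the probabilities of the running example (real coders use the integer version described later; names are illustrative):

#include <stdio.h>

int main(void) {
    double p[3] = { .2, .5, .3 };            /* p[a], p[b], p[c]                  */
    double f[3] = { .0, .2, .7 };            /* f[c] = cumulative prob. below c   */
    const char *msg = "bac";
    double l = 0.0, s = 1.0;                 /* l_0 = 0, s_0 = 1                  */
    for (size_t i = 0; msg[i]; i++) {
        int c = msg[i] - 'a';
        l = l + s * f[c];                    /* l_i = l_{i-1} + s_{i-1} * f[c_i]  */
        s = s * p[c];                        /* s_i = s_{i-1} * p[c_i]            */
    }
    printf("sequence interval = [%.4f, %.4f)\n", l, l + s);   /* [0.2700, 0.3000) */
    return 0;
}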

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:

 in [0.0, 1.0):   .49 ∈ [.2, .7)     → b
 in [0.2, 0.7):   .49 ∈ [.3, .55)    → b
 in [0.3, 0.55):  .49 ∈ [.475, .55)  → c

The message is bbc.

Representing a real number
Binary fractional representation:

 .75 = .11      1/3 = .01 01 01 …      11/16 = .1011

Algorithm
 1. x = 2·x
 2. If x < 1 output 0
 3. else x = x − 1; output 1

So how about just using the shortest binary fractional representation in the sequence
interval?
 e.g.  [0,.33) → .01     [.33,.66) → .1     [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions.

 number   min     max     interval
 .11      .110    .111    [.75, 1.0)
 .101     .1010   .1011   [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

 Sequence Interval: [.61, .79)
 Code Interval of .101: [.625, .75)  ⊆  [.61, .79)

Can use L + s/2 truncated to 1 + ⌈log2 (1/s)⌉ bits

Bound on Arithmetic length

 Note that −log2 s + 1 = log2 (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

 1 + ⌈ log2 (1/s) ⌉
 = 1 + ⌈ log2 Π_{i=1..n} (1/p_i) ⌉
 ≤ 2 + Σ_{i=1..n} log2 (1/p_i)
 = 2 + Σ_{k=1..|Σ|} n·p_k·log2 (1/p_k)
 = 2 + n·H0   bits

 nH0 + 0.02·n bits in practice, because of rounding

Integer Arithmetic Coding
The problem is that operations on arbitrary-precision real numbers are expensive.
Key ideas of the integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l ≥ R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine: the ATB maps the current interval (L,s) and the next symbol c, with
distribution (p1,....,pS), to the new interval (L’,s’).

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts                 String = ACCBACCACBA B      k=2

 Context (empty):   A = 4   B = 2   C = 5   $ = 3

 Context A:    C = 3   $ = 1
 Context B:    A = 2   $ = 1
 Context C:    A = 1   B = 2   C = 2   $ = 3

 Context AC:   B = 1   C = 2   $ = 2
 Context BA:   C = 1   $ = 1
 Context CA:   C = 1   $ = 1
 Context CB:   A = 2   $ = 1
 Context CC:   A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves
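
A C sketch of the greedy LZ77 parser with a fixed window (window size and names are illustrative); on the string of the next slide it emits exactly the triples shown there:

#include <stdio.h>
#include <string.h>

#define W 6                                       /* window size, as in the example   */

void lz77(const char *T) {
    size_t n = strlen(T), cur = 0;
    while (cur < n) {
        size_t best_len = 0, best_d = 0;
        size_t start = cur > W ? cur - W : 0;
        for (size_t i = start; i < cur; i++) {    /* try every copy source in the window */
            size_t len = 0;
            while (cur + len < n - 1 && T[i + len] == T[cur + len]) len++;  /* may overlap */
            if (len > best_len) { best_len = len; best_d = cur - i; }
        }
        printf("(%zu,%zu,%c) ", best_d, best_len, T[cur + best_len]);
        cur += best_len + 1;                      /* advance by len + 1               */
    }
    printf("\n");
}

int main(void) {
    lz77("aacaacabcabaaac");    /* (0,0,a) (1,1,c) (3,4,b) (3,3,a) (1,2,c) */
    return 0;
}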

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids
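
A C sketch of the LZ78 coding loop, keeping the dictionary as a plain array of phrases instead of a trie (names and sizes are illustrative); on the string of the next slide it emits the pairs shown there:

#include <stdio.h>
#include <string.h>

int main(void) {
    const char *T = "aabaacabcabcb";
    char dict[64][32] = { "" };                  /* id 0 = the empty phrase               */
    int ndict = 1;
    size_t i = 0, n = strlen(T);
    while (i < n) {
        int best = 0; size_t best_len = 0;
        for (int d = 1; d < ndict; d++) {        /* longest dictionary phrase S matching  */
            size_t len = strlen(dict[d]);
            if (len > best_len && i + len < n && strncmp(dict[d], T + i, len) == 0) {
                best = d; best_len = len;
            }
        }
        char c = T[i + best_len];                /* next char after the match             */
        printf("(%d,%c) ", best, c);
        snprintf(dict[ndict++], 32, "%s%c", dict[best], c);   /* add phrase S.c           */
        i += best_len + 1;
    }
    printf("\n");   /* (0,a) (1,b) (1,a) (0,c) (2,c) (5,b) */
    return 0;
}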

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
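
A concrete C version of InvertBWT: LF is computed by counting symbol occurrences, and (unlike the pseudocode above, which uses a slightly different indexing convention) the walk starts from the row whose last character is ‘#’ (names are illustrative):

#include <stdio.h>
#include <string.h>

int main(void) {
    const char *L = "ipssm#pissii";             /* BWT(mississippi#), from the slide     */
    int n = (int)strlen(L);
    int count[256] = {0}, first[256], seen[256] = {0}, LF[64], start = 0;
    for (int i = 0; i < n; i++) count[(unsigned char)L[i]]++;
    for (int c = 0, sum = 0; c < 256; c++) {    /* first[c] = row of the first c in F    */
        first[c] = sum; sum += count[c];
    }
    for (int i = 0; i < n; i++) {               /* equal chars keep their relative order */
        unsigned char c = (unsigned char)L[i];
        LF[i] = first[c] + seen[c]++;
        if (L[i] == '#') start = i;             /* row whose last char is '#'            */
    }
    char T[64]; T[n] = '\0';
    int r = start;
    for (int i = n - 1; i >= 0; i--) {          /* reconstruct T backward                */
        T[i] = L[r];                            /* L[r] precedes F[r] in T               */
        r = LF[r];
    }
    printf("%s\n", T);                          /* mississippi#                          */
    return 0;
}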

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
 • Θ(n² log n) time in the worst case
 • Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)
 Set of nodes such that from any node one can reach any other node via an undirected path.

Strongly connected components (SCC)
 Set of nodes such that from any node one can reach any other node via a directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by humans

Exploit the structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…

 Physical network graph
  V = Routers
  E = communication links

 The “cosine” graph (undirected, weighted)
  V = static web pages
  E = semantic distance between pages

 Query-Log graph (bipartite, weighted)
  V = queries and URLs
  E = (q,u) if u is a result for q and has been clicked by some user who issued q

 Social graph (undirected, unweighted)
  V = users
  E = (x,y) if x knows y (facebook, address book, email, ...)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pr that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution

 Altavista crawl (1999) and WebBase crawl (2001): the indegree follows a power-law
 distribution

  Pr[ in-degree(u) = k ]  ≈  1/k^α,    α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pr that a node has x links is 1/x^α, α ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides an efficient, optimal solution

 f_known is the “previously encoded text”: compress the concatenation f_known·f_new,
 emitting codewords only from f_new onwards

zdelta is one of the best implementations
          Emacs size   Emacs time
 uncompr  27Mb         ---
 gzip     8Mb          35 secs
 zdelta   1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: weighted graph on the files plus a dummy node connected to all; edge weights are
zdelta sizes (gzip sizes on the dummy’s edges); the min branching picks one reference per file.]

          space   time
 uncompr  30Mb    ---
 tgz      20%     linear
 THIS     8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
          space   time
 uncompr  260Mb   ---
 tgz      12%     2 mins
 THIS     8%      16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

          gcc size   emacs size
 total    27288      27326
 gzip     7563       8577
 zdelta   227        1431
 rsync    964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends hashes (unlike the client in rsync), and the client checks them.
The server deploys the common f_ref to compress the new f_tar (rsync compresses just it).

A multi-round protocol

 k blocks of n/k elems,  log(n/k) levels

If the distance is k, then on each level at most k hashes do not find a match in the other
file. The communication complexity is O(k lg n lg(n/k)) bits.

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree

 T# = mississippi#   (positions 1 2 3 4 5 6 7 8 9 10 11 12)

[Figure: the suffix tree of T#; edges carry substrings (#, i, s, p, si, ssi, ppi#, pi#, i#,
mississippi#, …) and each leaf stores the starting position of its suffix.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
5

Θ(N²) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
⇒ In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
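
A C sketch of the whole pipeline on the running example: build SA by sorting suffix pointers (the “elegant but inefficient” construction, Θ(n² log n) worst case), then run the indirect binary search for P (names are illustrative):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static const char *text;                            /* shared with the qsort comparator */
static int cmp_suf(const void *a, const void *b) {
    return strcmp(text + *(const int *)a, text + *(const int *)b);
}

int main(void) {
    const char *T = "mississippi#";
    const char *P = "si";
    int n = (int)strlen(T), m = (int)strlen(P);
    int SA[64];
    for (int i = 0; i < n; i++) SA[i] = i;
    text = T;
    qsort(SA, (size_t)n, sizeof(int), cmp_suf);     /* each comparison costs O(n) chars */

    int lo = 0, hi = n;                             /* first suffix whose prefix >= P   */
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (strncmp(T + SA[mid], P, (size_t)m) < 0) lo = mid + 1; else hi = mid;
    }
    /* all occurrences are contiguous in SA (Prop 1): report while P is still a prefix  */
    for (int i = lo; i < n && strncmp(T + SA[i], P, (size_t)m) == 0; i++)
        printf("occurrence at position %d\n", SA[i] + 1);    /* positions 7 and 4       */
    return 0;
}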

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
  • It is the min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  • Search for some Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  • Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.


Slide 124

C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with i-th bit of U(T[j]) to estabilish if both are true

An example j=1
1 2 3 4 5 6 7 8 9 10

n
m

1

12345

1

0

P=abaac

2

0

3

0

4

0

5

0

T=xabxabaaca

0
 
0
U (x)  0
 
0
 
0

2

3

4

5

6

7

1 
 
0
BitShift ( M ( 0 )) & U (T [1])   0  &
 
0
 
0

8

9

0 0
   
0 0
0  0
   
0 0
   
0 0

1
0

An example j=2
1 2 3 4 5 6 7 8 9 10

n
m

1

2

12345

1

0

1

P=abaac

2

0

0

3

0

0

4

0

0

5

0

0

T=xabxabaaca

1 
 
0
U (a )   1 
 
1 
 
0

3

4

5

6

7

1 
 
0
BitShift ( M (1)) & U (T [ 2 ])   0  &
 
0
 
0

8

9

1  1 
   
0 0
1    0 
   
1   0 
   
0 0

1
0

An example j=3
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

12345

1

0

1

0

P=abaac

2

0

0

1

3

0

0

0

4

0

0

0

5

0

0

0

T=xabxabaaca

0
 
1 
U (b )   0 
 
0
 
0

4

5

6

7

1 
 
1 
BitShift ( M ( 2 )) & U (T [ 3 ])   0  &
 
0
 
0

8

9

0 0
   
1  1 
0  0
   
0 0
   
0 0

1
0

An example j=9
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

4

5

6

7

8

9

12345

1

0

1

0

0

1

0

1

1

0

P=abaac

2

0

0

1

0

0

1

0

0

0

3

0

0

0

0

0

0

1

0

0

4

0

0

0

0

0

0

0

1

0

5

0

0

0

0

0

0

0

0

1

T=xabxabaaca

0
 
0
U (c )   0 
 
0
 
1 

1 
 
1 
BitShift ( M ( 8 )) & U (T [ 9 ])   0  &
 
0
 
1 

0 0
   
0 0
0  0
   
0 0
   
1  1 

1
0

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

[Figure: P[1..i-1] aligned with T ending at position j-1, and P[i] = T[j].]

  Contribution to M_l(j):  BitShift( M_l(j-1) )  AND  U(T[j])

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

[Figure: P[1..i-1] aligned with T ending at position j-1.]

  Contribution to M_l(j):  BitShift( M_{l-1}(j-1) )

Computing Ml


We compute Ml for all l = 0, …, k.
For each j compute M0(j), M1(j), …, Mk(j).
For all l, initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a match iff

  M_l(j) = [ BitShift( M_l(j-1) ) AND U(T[j]) ]  OR  BitShift( M_{l-1}(j-1) )
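A minimal C sketch of this k-mismatch recurrence, assuming m <= 64 and k < 64; one word per column M_0 ... M_k:

#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define BITSHIFT(x) (((x) << 1) | 1ULL)

void agrep_mismatch(const char *P, const char *T, int k) {
    size_t m = strlen(P);
    uint64_t U[256] = {0};
    for (size_t i = 0; i < m; i++) U[(unsigned char)P[i]] |= 1ULL << i;
    uint64_t Ml[64] = {0};                   /* columns M_0 ... M_k, all zero    */
    for (size_t j = 0; T[j]; j++) {
        uint64_t c = U[(unsigned char)T[j]];
        for (int l = k; l >= 1; l--)         /* top-down: Ml[l-1] still holds j-1 */
            Ml[l] = (BITSHIFT(Ml[l]) & c) | BITSHIFT(Ml[l - 1]);
        Ml[0] = BITSHIFT(Ml[0]) & c;
        if (Ml[k] & (1ULL << (m - 1)))       /* P ends at j with <= k mismatches  */
            printf("occurrence with <= %d mismatches ending at %zu\n", k, j + 1);
    }
}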

Example M1
T = xabxabaaca, P = abaad

M1 =        1 2 3 4 5 6 7 8 9 10
       1    1 1 1 1 1 1 1 1 1 1
       2    0 0 1 0 0 1 0 1 1 0
       3    0 0 0 1 0 0 1 0 0 1
       4    0 0 0 0 1 0 0 1 0 0
       5    0 0 0 0 0 0 0 0 1 0

M0 =        1 2 3 4 5 6 7 8 9 10
       1    0 1 0 0 1 0 1 1 0 1
       2    0 0 1 0 0 1 0 0 0 0
       3    0 0 0 0 0 0 1 0 0 0
       4    0 0 0 0 0 0 0 1 0 0
       5    0 0 0 0 0 0 0 0 0 0

How much do we pay?





The running time is O(kn(1 + m/w)).
Again, the method is practically efficient for small m.
Still, only O(k) columns of M are needed at any given time. Hence, the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary = {bzip, not, or}, S = “bzip or not bzip”, P = bot, k = 2.

[Figure: scan of C(S) marking the matching codewords; here the term “not” (codeword not = 1g 0g 0a) matches bot with 1 mismatch, hence yes, while the others do not.]

Agrep: more sophisticated operations


The Shift-And method can solve other ops.

The edit distance between two strings p and s is d(p,s) = minimum number of operations needed to transform p into s via three ops:
 Insertion: insert a symbol in p
 Deletion: delete a symbol from p
 Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3
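For reference, the edit distance just defined as a standard dynamic program (not the bit-parallel Shift-And variant the slide alludes to); a sketch assuming short strings:

#include <stdio.h>
#include <string.h>

int edit_distance(const char *p, const char *s) {
    int m = (int)strlen(p), n = (int)strlen(s);
    int D[65][65];                             /* assumes |p|,|s| <= 64        */
    for (int i = 0; i <= m; i++) D[i][0] = i;  /* i deletions                  */
    for (int j = 0; j <= n; j++) D[0][j] = j;  /* j insertions                 */
    for (int i = 1; i <= m; i++)
        for (int j = 1; j <= n; j++) {
            int sub = D[i-1][j-1] + (p[i-1] != s[j-1]);  /* substitution/match */
            int del = D[i-1][j] + 1, ins = D[i][j-1] + 1;
            D[i][j] = sub < del ? (sub < ins ? sub : ins) : (del < ins ? del : ins);
        }
    return D[m][n];
}
/* edit_distance("ananas", "banane") returns 3, as in the example. */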

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
The g-code of x > 0 is (Length - 1) zeros followed by the binary representation of x, where Length = floor(log2 x) + 1.
e.g., 9 is represented as <000, 1001>.

The g-code for x takes 2*floor(log2 x) + 1 bits (i.e. a factor of 2 from optimal).

Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

Answers: 8, 6, 3, 59, 7
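A small C sketch of the g-coder described above (codes emitted as characters, just for illustration):

#include <stdio.h>

void gamma_encode(unsigned x) {                 /* x > 0                      */
    int len = 0;
    for (unsigned t = x; t; t >>= 1) len++;     /* len = floor(log2 x) + 1    */
    for (int i = 0; i < len - 1; i++) putchar('0');
    for (int i = len - 1; i >= 0; i--) putchar((x >> i) & 1 ? '1' : '0');
}

unsigned gamma_decode(const char *bits, int *pos) {   /* reads one integer   */
    int zeros = 0;
    while (bits[*pos] == '0') { zeros++; (*pos)++; }
    unsigned x = 0;
    for (int i = 0; i <= zeros; i++) x = (x << 1) | (unsigned)(bits[(*pos)++] - '0');
    return x;
}
/* Decoding "0001000001100110000011101100111" this way yields 8 6 3 59 7. */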

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How good is this approach with respect to Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
  1 ≥ Σ_{i=1,...,x} p_i ≥ x * p_x   hence   x ≤ 1/p_x

How good is it ?
Encode the integers via g-coding:
|g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/p_i):

  Σ_{i=1,...,|S|} p_i * |g(i)|  ≤  Σ_{i=1,...,|S|} p_i * [ 2 * log (1/p_i) + 1 ]  =  2 * H0(X) + 1

Not much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte, s*c with 2 bytes, s*c^2 with 3 bytes, ...

An example

5000 distinct words
ETDC encodes 128 + 128^2 = 16512 words on at most 2 bytes
A (230,26)-dense code encodes 230 + 230*26 = 6210 words on at most 2 bytes, hence more words on 1 byte, and thus it wins if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, there seems to be a unique minimum

Ks = max codeword length
Fsk = cumulative probability of the symbols whose |cw| <= k

Experiments: (s,c)-DC is quite interesting…

Search is 6% faster than byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:
 Exploits temporal locality, and it is dynamic
 X = 1^n 2^n 3^n … n^n  ⟹  Huff = O(n^2 log n), MTF = O(n log n) + n^2

Not much worse than Huffman
...but it may be far better
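A minimal MTF encoder following the two steps above, on the byte alphabet (output positions are 0-based here):

#include <stdio.h>
#include <string.h>

void mtf_encode(const unsigned char *in, size_t n, int *out) {
    unsigned char L[256];
    for (int i = 0; i < 256; i++) L[i] = (unsigned char)i;   /* initial list  */
    for (size_t k = 0; k < n; k++) {
        int pos = 0;
        while (L[pos] != in[k]) pos++;          /* 1) position of the symbol  */
        out[k] = pos;
        memmove(L + 1, L, (size_t)pos);         /* 2) move it to the front    */
        L[0] = in[k];
    }
}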

MTF: how good is it ?
Encode the integers via g-coding: |g(i)| ≤ 2 * log i + 1
Put S at the front and consider the cost of encoding:

  O(|S| log |S|)  +  Σ_{x=1,...,|S|}  Σ_{i=2,...,n_x}  | g( p_x^i - p_x^{i-1} ) |

By Jensen's inequality:

  ≤  O(|S| log |S|)  +  Σ_{x=1,...,|S|}  n_x * [ 2 * log (N / n_x) + 1 ]
  =  O(|S| log |S|)  +  N * [ 2 * H0(X) + 1 ]

Hence  L_a[mtf]  ≤  2 * H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep the MTF-list efficiently:
 Search tree: leaves contain the symbols, ordered as in the MTF-list; nodes contain the size of their descending subtree
 Hash table: the key is a symbol, the data is a pointer to the corresponding tree leaf

Each tree operation takes O(log |S|) time
Total is O(n log |S|), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings, just the run lengths and one initial bit are needed
Properties:
 There is a memory
 Exploits spatial locality, and it is a dynamic code
 X = 1^n 2^n 3^n … n^n  ⟹  Huff(X) = Θ(n^2 log n)  >  Rle(X) = Θ(n log n)
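A sketch of RLE as above: abbbaacccca => (a,1)(b,3)(a,2)(c,4)(a,1); the run lengths could then be g-coded.

#include <stdio.h>

void rle_encode(const char *s) {
    for (size_t i = 0; s[i]; ) {
        size_t j = i;
        while (s[j] == s[i]) j++;               /* extend the current run    */
        printf("(%c,%zu)", s[i], j - i);
        i = j;
    }
    putchar('\n');
}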

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.  p(a) = .2, p(b) = .5, p(c) = .3, and

  f(a) = .0, f(b) = .2, f(c) = .7,   where   f(i) = Σ_{j=1,...,i-1} p(j)

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
Start from [0,1). After b the interval becomes [.2,.7); after a it becomes [.2,.3); after c it becomes [.27,.3).

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c_1 c_2 … c_n with probabilities p[c], use the following:

  l_0 = 0,   l_i = l_{i-1} + s_{i-1} * f(c_i)
  s_0 = 1,   s_i = s_{i-1} * p(c_i)

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

  s_n = Π_{i=1,...,n} p(c_i)
The interval for a message sequence will be called the
sequence interval
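A floating-point sketch of the recurrences above (a real coder uses the integer version discussed later); the symbol set and probabilities are those of the running example:

#include <stdio.h>

void sequence_interval(const char *msg) {       /* msg over {a,b,c}           */
    double p[3] = {0.2, 0.5, 0.3}, f[3] = {0.0, 0.2, 0.7};
    double l = 0.0, s = 1.0;                    /* l_0 = 0, s_0 = 1           */
    for (size_t i = 0; msg[i]; i++) {
        int c = msg[i] - 'a';
        l = l + s * f[c];                       /* l_i = l_{i-1} + s_{i-1}*f(c_i) */
        s = s * p[c];                           /* s_i = s_{i-1} * p(c_i)         */
    }
    printf("sequence interval = [%g, %g)\n", l, l + s);
}
/* sequence_interval("bac") prints [0.27, 0.3), as in the example. */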

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
.49 falls in the symbol interval of b, [.2,.7); within that interval it falls again in the sub-interval of b, [.3,.55); within that it falls in the sub-interval of c, [.475,.55).

The message is bbc.

Representing a real number
Binary fractional representation:

  .75 = .11      1/3 = .010101...      11/16 = .1011

Algorithm
  1. x = 2 * x
  2. If x < 1 output 0
  3. else x = x - 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
e.g.  [0,.33) → .01     [.33,.66) → .1     [.66,1) → .11
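The doubling algorithm above, as a tiny C routine emitting the first nbits bits of x in [0,1):

#include <stdio.h>

void binary_fraction(double x, int nbits) {
    printf(".");
    for (int i = 0; i < nbits; i++) {
        x = 2 * x;
        if (x < 1) putchar('0');
        else { x = x - 1; putchar('1'); }
    }
    putchar('\n');
}
/* binary_fraction(0.75, 2) -> .11     binary_fraction(1.0/3, 4) -> .0101 */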

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
  code    min      max      interval
  .11     .110     .111     [.75, 1.0)
  .101    .1010    .1011    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

[Figure: a sequence interval [.61, .79) containing the code interval of .101, i.e. [.625, .75).]

Can use L + s/2 truncated to 1 + ceil(log (1/s)) bits

Bound on Arithmetic length

Note that  ceil(-log s) + 1 = ceil(log (2/s))

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

  1 + ceil( log (1/s) )
  = 1 + ceil( log Π_{i=1,n} (1/p_i) )
  ≤ 2 + Σ_{i=1,n} log (1/p_i)
  = 2 + Σ_{k=1,|S|} n p_k log (1/p_k)
  = 2 + n H0   bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 (top half): output 1 followed by m 0s; set m = 0; the message interval is expanded by 2.

If u < R/2 (bottom half): output 0 followed by m 1s; set m = 0; the message interval is expanded by 2.

If l ≥ R/4 and u < 3R/4 (middle half): increment m; the message interval is expanded by 2.

In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine

[Figure: the ATB maps the current interval (L,s) and the next symbol c, drawn from the distribution (p1,...,p|S|), to the new interval (L',s') with L' = L + s*f(c) and s' = s*p(c).]

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
[Figure: the ATB is driven by p[ s | context ], where s = c or esc; it maps (L,s) to (L',s').]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
String = ACCBACCACBA B,   k = 2

  Context     Counts
  Empty       A = 4   B = 2   C = 5   $ = 3
  A           C = 3   $ = 1
  B           A = 2   $ = 1
  C           A = 1   B = 2   C = 2   $ = 3
  AC          B = 1   C = 2   $ = 2
  BA          C = 1   $ = 1
  CA          C = 1   $ = 1
  CB          A = 2   $ = 1
  CC          A = 1   B = 1   $ = 2
You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c          Window size = 6

Step-by-step output (longest match within W, next character):
  (0,0,a)   (1,1,c)   (3,4,b)   (3,3,a)   (1,2,c)

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids
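A quadratic but simple LZ78 coder along the lines above; the trie is kept as (parent, char) pairs and searched linearly, so this is for illustration only:

#include <stdio.h>
#include <string.h>

#define MAXD 4096
static int  par[MAXD];        /* parent id of each dictionary entry (0 = root) */
static char ch[MAXD];         /* edge character of each entry                  */

void lz78_encode(const char *s) {
    int dsize = 1;            /* id 0 is the empty string                      */
    for (size_t i = 0; s[i]; ) {
        int cur = 0;          /* longest match found so far                    */
        while (s[i]) {        /* try to extend the match with s[i]             */
            int next = -1;
            for (int e = 1; e < dsize; e++)
                if (par[e] == cur && ch[e] == s[i]) { next = e; break; }
            if (next < 0) break;
            cur = next; i++;
        }
        printf("(%d,%c) ", cur, s[i] ? s[i] : '$');   /* output (id, next char) */
        if (dsize < MAXD && s[i]) {                   /* add match+char to dict */
            par[dsize] = cur; ch[dsize] = s[i]; dsize++;
        }
        if (s[i]) i++;        /* advance past the extra character              */
    }
    putchar('\n');
}
/* lz78_encode("aabaacabcabcb") prints (0,a) (1,b) (1,a) (0,c) (2,c) (5,b). */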

LZ78: Coding Example
Input: a a b a a c a b c a b c b

  Output   Dict.
  (0,a)    1 = a
  (1,b)    2 = ab
  (1,a)    3 = aa
  (0,c)    4 = c
  (2,c)    5 = abc
  (5,b)    6 = abcb

LZ78: Decoding Example

  Input    Decoded so far            Dict.
  (0,a)    a                         1 = a
  (1,b)    a ab                      2 = ab
  (1,a)    a ab aa                   3 = aa
  (0,c)    a ab aa c                 4 = c
  (2,c)    a ab aa c abc             5 = abc
  (5,b)    a ab aa c abc abcb        6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Input: a a b a a c a b a b a c b    (dictionary initialized with the single chars, e.g. a = 112, b = 113, c = 114)

  Output   New dict. entry
  112      256 = aa
  112      257 = ab
  113      258 = ba
  256      259 = aac
  114      260 = ca
  257      261 = aba
  261      262 = abac
  114      263 = cb

LZW: Decoding Example

Input: 112 112 113 256 114 257 261 114 ...
The decoder rebuilds the same dictionary (256 = aa, 257 = ab, 258 = ba, 259 = aac, 260 = ca, 261 = aba, ...) but runs one step behind the encoder: when code 261 arrives its entry is not complete yet, and it is resolved one step later (the special SSc case).

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows    (1994)

  F               L
  #  mississipp   i
  i  #mississip   p
  i  ppi#missis   s
  i  ssippi#mis   s
  i  ssissippi#   m
  m  ississippi   #
  p  i#mississi   p
  p  pi#mississ   i
  s  ippi#missi   s
  s  issippi#mi   s
  s  sippi#miss   i
  s  sissippi#m   i

The last column L is the Burrows-Wheeler Transform of T.

A famous example

Much
longer...

A useful tool: L → F mapping

[Figure: the F and L columns of the sorted rotations; the middle of each row is unknown.]

How do we map L's chars onto F's chars?
... We need to distinguish equal chars in F...

Take two equal chars of L and rotate their rows rightward by one position: they keep the same relative order!!

The BWT is invertible
[Figure: F and L columns of the sorted rotations; the rows themselves are unknown.]

Two key properties:
1. The LF-array maps L's chars to F's chars
2. L[ i ] precedes F[ i ] in T

Reconstruct T backward:   T = .... i p p i #

InvertBWT(L)
  Compute LF[0,n-1];
  r = 0; i = n;            // row 0 is the one starting with #; its L char is the last char of T
  while (i > 0) {
    T[i] = L[r];           // L[r] precedes F[r] in T, hence precedes what has been output so far
    r = LF[r]; i--;
  }

How to compute the BWT ?
SA = 12 11 8 5 2 1 10 9 7 4 6 3

[BWT matrix: the sorted rotations of T, whose last column is L = i p s s m # p i s s i i]

We said that: L[i] precedes F[i] in T
For example, L[3] = T[7]
Given SA and T, we have L[i] = T[SA[i]-1]
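Computing L from T and SA exactly as stated (a small C sketch; SA entries are 1-based as in the slides, and the row whose suffix starts at T[1] takes the sentinel-preceding last character):

#include <stdio.h>

void bwt_from_sa(const char *T, const int *SA, int n, char *L) {
    for (int i = 0; i < n; i++)
        L[i] = (SA[i] == 1) ? T[n - 1] : T[SA[i] - 2];   /* 0-based array access */
    L[n] = '\0';
}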

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n^2 log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph
  V = Routers, E = communication links

The “cosine” graph (undirected, weighted)
  V = static web pages, E = semantic distance between pages

Query-Log graph (bipartite, weighted)
  V = queries and URLs, E = (q,u) if u is a result for q and has been clicked by some user who issued q

Social graph (undirected, unweighted)
  V = users, E = (x,y) if x knows y (facebook, address book, email, ...)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999        WebBase Crawl 2001
Indegree follows a power law distribution:

  Pr[ in-degree(u) = k ]  ∝  1 / k^a,    a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
[Figure: a snapshot of the Web adjacency matrix, 21 million pages and 150 million links.]

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:
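A sketch of the gap transform above: only the first gap s1-x can be negative, and (as in the worked example later on) it is folded into a non-negative code v(z) = 2z if z >= 0, 2|z|-1 otherwise:

#include <stdio.h>

unsigned fold(long z) { return z >= 0 ? (unsigned)(2 * z) : (unsigned)(-2 * z - 1); }

void encode_successors(int x, const int *succ, int k, unsigned *out) {
    out[0] = fold((long)succ[0] - x);                    /* first gap may be negative */
    for (int i = 1; i < k; i++)
        out[i] = (unsigned)(succ[i] - succ[i - 1] - 1);  /* lists are increasing      */
}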

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides an efficient, optimal solution:

  fknown is the “previously encoded text”: compress the concatenation fknown·fnew, emitting output only from fnew onwards

zdelta is one of the best implementations
            Emacs size   Emacs time
  uncompr   27Mb         ---
  gzip      8Mb          35 secs
  zdelta    1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: an example graph GF with a dummy node 0; edge weights (e.g. 20, 123, 220, 620, 2000) are zdelta/gzip sizes, and the min branching selects the cheapest reference for each file.]

            space   time
  uncompr   30Mb    ---
  tgz       20%     linear
  THIS      8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strictly n^2 time

            space    time
  uncompr   260Mb    ---
  tgz       12%      2 mins
  THIS      8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
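A sketch of an rsync-style rolling checksum over blocks of length B (the actual constants and mixing in rsync differ; this just shows why the window can slide in O(1)):

#include <stdint.h>
#include <stddef.h>

typedef struct { uint32_t a, b; } Roll;         /* a = sum of bytes, b = sum of prefix sums */

Roll roll_init(const unsigned char *x, size_t B) {
    Roll r = {0, 0};
    for (size_t i = 0; i < B; i++) { r.a += x[i]; r.b += (uint32_t)(B - i) * x[i]; }
    return r;
}

Roll roll_step(Roll r, unsigned char out, unsigned char in, size_t B) {
    r.a = r.a - out + in;                       /* drop byte 'out', append byte 'in' */
    r.b = r.b - (uint32_t)B * out + r.a;
    return r;
}

uint32_t roll_value(Roll r) { return (r.b << 16) | (r.a & 0xffff); }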

Rsync: some experiments

            gcc size   emacs size
  total     27288      27326
  gzip      7563       8577
  zdelta    227        1431
  rsync     964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends the hashes (unlike the client in rsync), and the client checks them.
The server deploys the common fref to compress the new ftar (rsync compresses just it).

A multi-round protocol

k blocks of n/k elements, log(n/k) levels.
If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Figure: the suffix tree of T# = mississippi#; edges are labeled with substrings (e.g. i, s, p, si, ssi, ppi#, pi#, i#, mississippi#) and the 12 leaves store the starting positions 1..12 of the suffixes.]

T# = mississippi#

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Storing SUF(T) explicitly takes Θ(N^2) space.

  SA    SUF(T)
  12    #
  11    i#
   8    ippi#
   5    issippi#
   2    ississippi#
   1    mississippi#
  10    pi#
   9    ppi#
   7    sippi#
   4    sissippi#
   6    ssippi#
   3    ssissippi#

T = mississippi#    (each SA entry is a suffix pointer)

P = si
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
⟹ In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step.

P = si: compare P with the suffix pointed to by the middle SA entry; if P is larger recurse on the right half, if smaller on the left half.

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char comparisons
⟹ overall, O(p log2 N) time
  (improvable to O(p + log2 N) [Manber-Myers, ’90]; see also [Cole et al, ’06])
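A minimal C sketch of the indirect binary search just described (0-based arrays, SA storing 0-based suffix starts; the function name is illustrative). It returns the first SA position whose suffix has P as a prefix, or -1:

#include <string.h>

int sa_search(const char *T, int n, const int *SA, const char *P) {
    int p = (int)strlen(P), lo = 0, hi = n;          /* invariant: answer in [lo,hi) */
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (strncmp(T + SA[mid], P, (size_t)p) < 0) lo = mid + 1;
        else hi = mid;
    }
    if (lo < n && strncmp(T + SA[lo], P, (size_t)p) == 0) return lo;
    return -1;
}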

Locating the occurrences
[Figure: the occ = 2 occurrences of P = si correspond to contiguous SA entries, here the suffixes sippi... (position 7) and sissippi... (position 4).]

T = mississippi#

Suffix Array search: O(p + log2 N + occ) time
Suffix Trays: O(p + log2 |S| + occ)   [Cole et al., ’06]
String B-tree   [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays   [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

  Lcp = 0 0 1 4 0 0 1 0 2 1 3
  SA  = 12 11 8 5 2 1 10 9 7 4 6 3
  T   = mississippi#

(e.g. Lcp = 4 between the adjacent suffixes issippi... and ississippi...)

• How long is the common prefix between T[i,...] and T[j,...]?
  • It is the min of the subarray Lcp[h,k-1] such that SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L?
  • Search for an entry Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times?
  • Search for a window Lcp[i,i+C-2] whose entries are all ≥ L


Slide 125

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with i-th bit of U(T[j]) to estabilish if both are true

An example j=1
1 2 3 4 5 6 7 8 9 10

n
m

1

12345

1

0

P=abaac

2

0

3

0

4

0

5

0

T=xabxabaaca

0
 
0
U (x)  0
 
0
 
0

2

3

4

5

6

7

1 
 
0
BitShift ( M ( 0 )) & U (T [1])   0  &
 
0
 
0

8

9

0 0
   
0 0
0  0
   
0 0
   
0 0

1
0

An example j=2
1 2 3 4 5 6 7 8 9 10

n
m

1

2

12345

1

0

1

P=abaac

2

0

0

3

0

0

4

0

0

5

0

0

T=xabxabaaca

1 
 
0
U (a )   1 
 
1 
 
0

3

4

5

6

7

1 
 
0
BitShift ( M (1)) & U (T [ 2 ])   0  &
 
0
 
0

8

9

1  1 
   
0 0
1    0 
   
1   0 
   
0 0

1
0

An example j=3
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

12345

1

0

1

0

P=abaac

2

0

0

1

3

0

0

0

4

0

0

0

5

0

0

0

T=xabxabaaca

0
 
1 
U (b )   0 
 
0
 
0

4

5

6

7

1 
 
1 
BitShift ( M ( 2 )) & U (T [ 3 ])   0  &
 
0
 
0

8

9

0 0
   
1  1 
0  0
   
0 0
   
0 0

1
0

An example j=9
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

4

5

6

7

8

9

12345

1

0

1

0

0

1

0

1

1

0

P=abaac

2

0

0

1

0

0

1

0

0

0

3

0

0

0

0

0

0

1

0

0

4

0

0

0

0

0

0

0

1

0

5

0

0

0

0

0

0

0

0

1

T=xabxabaaca

0
 
0
U (c )   0 
 
0
 
1 

1 
 
1 
BitShift ( M ( 8 )) & U (T [ 9 ])   0  &
 
0
 
1 

0 0
   
0 0
0  0
   
0 0
   
1  1 

1
0

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M ( j - 1))  U (T [ j ])
l

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M

l -1

( j - 1))

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1
T = xabxabaaca      P = abaad

         j:  1 2 3 4 5 6 7 8 9 10
M1 =
   i=1:      1 1 1 1 1 1 1 1 1 1
   i=2:      0 0 1 0 0 1 0 1 1 0
   i=3:      0 0 0 1 0 0 1 0 0 1
   i=4:      0 0 0 0 1 0 0 1 0 0
   i=5:      0 0 0 0 0 0 0 0 1 0

M0 =
   i=1:      0 1 0 0 1 0 1 1 0 1
   i=2:      0 0 1 0 0 1 0 0 0 0
   i=3:      0 0 0 0 0 0 1 0 0 0
   i=4:      0 0 0 0 0 0 0 1 0 0
   i=5:      0 0 0 0 0 0 0 0 0 0
How much do we pay?

The running time is O( k n (1 + m/w) ).
Again, the method is practically efficient for small m.
Still, only O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.
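A small Python sketch of the k-mismatch recurrence above (the function name and the 0-based reporting are my own choices; under those assumptions it reproduces the M1 example):

def shift_and_mismatches(T, P, k):
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    last = 1 << (m - 1)
    M = [0] * (k + 1)                    # M[l] = current column of the matrix Ml
    occ = []
    for j, c in enumerate(T):
        Uc = U.get(c, 0)
        prev = M[:]                      # the columns for position j-1
        M[0] = ((prev[0] << 1) | 1) & Uc
        for l in range(1, k + 1):
            # case 1: <= l mismatches so far and P[i] = T[j]
            # case 2: <= l-1 mismatches so far, so one more mismatch is allowed here
            M[l] = (((prev[l] << 1) | 1) & Uc) | ((prev[l - 1] << 1) | 1)
        if M[k] & last:
            occ.append(j)                # P ends at j with <= k mismatches
    return occ

# shift_and_mismatches("xabxabaaca", "abaad", 1) -> [8]
# i.e. the single 1 in row 5 of M1 above (column 9 in 1-based counting)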

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

Dictionary = {bzip, not, or}     P = bot   k = 2     S = “bzip or not bzip”

(Diagram: the tagged-Huffman tree and the compressed text C(S); with k = 2 mismatches the only matching term is “not”, whose codeword is
not = 1 g 0 g 0 a)

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
g(x) = 000...0 (Length-1 zeroes) followed by x in binary

x > 0 and Length = ⌊log2 x⌋ + 1
e.g., 9 is represented as <000,1001>.

g-code for x takes 2⌊log2 x⌋ + 1 bits
(i.e. a factor of 2 from optimal)

Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

→  8, 6, 3, 59, 7
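A minimal Python sketch of g-encoding/decoding (function names are illustrative); it reproduces the exercise above:

def gamma_encode(x):
    assert x > 0
    b = bin(x)[2:]                    # x in binary, len(b) = floor(log2 x) + 1
    return "0" * (len(b) - 1) + b     # (Length-1) zeroes, then x in binary

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":         # count the leading zeroes
            z += 1
            i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out

# "".join(gamma_encode(x) for x in [8, 6, 3, 59, 7])
#    == "0001000001100110000011101100111"
# and gamma_decode of that string gives back [8, 6, 3, 59, 7]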

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
1 ≥ Σi=1,...,x pi ≥ x * px   ⇒   x ≤ 1/px

How good is it ?
Encode the integers via g-coding:
|g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):

Σi=1,...,S pi * |g(i)|  ≤  Σi=1,...,S pi * [ 2 * log (1/pi) + 1 ]  =  2 * H0(X) + 1

Not much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2

A new concept: Continuers vs Stoppers
Previously we used: s = c = 128

The main idea is:
s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s*c^2 on 3 bytes, ...
(a sketch follows below)
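A small Python sketch of an (s,c)-dense encoder (0-based ranks, most frequent word first). The exact ordering of the "digits" inside a codeword is my own illustrative choice, not necessarily the one of the original papers; what matters is that s words get 1 byte, s*c get 2 bytes, s*c^2 get 3 bytes, and codewords are prefix-free because they end at the first stopper byte.

def sc_encode(rank, s, c):
    # byte values 0..s-1 are "stoppers", s..s+c-1 are "continuers";
    # a codeword is a run of continuers followed by exactly one stopper.
    assert s + c == 256 and rank >= 0
    k, first, count = 1, 0, s            # k-byte codewords encode s * c^(k-1) ranks
    while rank >= first + count:
        first += count
        count *= c
        k += 1
    x = rank - first                     # x in [0, s * c^(k-1))
    stopper = x % s
    x //= s
    body = []
    for _ in range(k - 1):               # continuer "digits", base c
        body.append(s + x % c)
        x //= c
    return list(reversed(body)) + [stopper]

# with (s,c) = (230,26): ranks 0..229 take 1 byte, ranks 230..6209 take 2 bytes,
# matching the 230 + 230*26 = 6210 words of the example below.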

An example

5000 distinct words
ETDC encodes 128 + 128^2 = 16512 words on 1 or 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 words on 1 or 2
bytes, hence more on 1 byte, and thus it is better if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:
On real distributions, it seems that there is one unique minimum

Ks = max codeword length
Fsk = cum. prob. of the symbols whose |cw| <= k

Experiments: (s,c)-DC is quite interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:
Exploits temporal locality, and it is dynamic

X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n^2 log n), MTF = O(n log n) + n^2

Not much worse than Huffman
...but it may be far better
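A minimal Python sketch of MTF coding with a plain list (O(|alphabet|) per symbol; the tree-based speed-up is discussed a few slides below). Function names and the 0-based positions are illustrative choices.

def mtf_encode(text, alphabet):
    L = list(alphabet)                  # e.g. ['a','b','c','d',...]
    out = []
    for ch in text:
        i = L.index(ch)                 # position of ch in the current list (0-based)
        out.append(i)
        L.pop(i)
        L.insert(0, ch)                 # move ch to the front
    return out

def mtf_decode(codes, alphabet):
    L, text = list(alphabet), []
    for i in codes:
        ch = L.pop(i)
        text.append(ch)
        L.insert(0, ch)
    return "".join(text)

# mtf_encode("aabbbba", "abc") -> [0, 0, 1, 0, 0, 0, 1]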

MTF: how good is it ?
Encode the integers via g-coding:
|g(i)| ≤ 2 * log i + 1
Put S in front of the list and consider the cost of encoding
(p(x,i) = position of the i-th occurrence of symbol x):

O(S log S) + Σx=1,...,S Σi=2,...,nx g( p(x,i) - p(x,i-1) )

By Jensen’s inequality:

≤ O(S log S) + Σx=1,...,S nx * [ 2 * log (N/nx) + 1 ]
= O(S log S) + N * [ 2 * H0(X) + 1 ]

Hence  La[mtf] ≤ 2 * H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploits spatial locality, and it is a dynamic code
X = 1^n 2^n 3^n … n^n  ⇒

There is a memory

Huff(X) = Θ(n^2 log n)  >  Rle(X) = Θ(n log n)
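A tiny Python sketch of the run-length transform above (for a binary string one would emit only the run lengths and the first bit, as noted); names are illustrative.

def rle_encode(s):
    runs = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs

def rle_decode(runs):
    return "".join(ch * n for ch, n in runs)

# rle_encode("abbbaacccca") -> [('a',1),('b',3),('a',2),('c',4),('a',1)]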

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.   p(a) = .2, p(b) = .5, p(c) = .3

f(i) = Σj<i p(j)   ⇒   f(a) = .0, f(b) = .2, f(c) = .7

(Figure: the unit interval [0.0, 1.0) partitioned into a = [0, .2), b = [.2, .7), c = [.7, 1.0))

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
(Figure: nested intervals)
start [0.0, 1.0)  →  b: [.2, .7)  →  a: [.2, .3)  →  c: [.27, .3)
The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l0 = 0      li = li-1 + si-1 * f[ci]
s0 = 1      si = si-1 * p[ci]

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is   sn = Πi=1,...,n p[ci]

The interval for a message sequence will be called the
sequence interval
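A float-based Python sketch of the interval update above (a real coder uses the integer renormalization described later; names are illustrative):

def sequence_interval(msg, p, f):
    # p[c] = probability of c, f[c] = cumulative probability of the symbols before c
    l, s = 0.0, 1.0
    for c in msg:
        l = l + s * f[c]
        s = s * p[c]
    return l, l + s                     # the sequence interval [l, l+s)

# with p = {'a': .2, 'b': .5, 'c': .3} and f = {'a': 0.0, 'b': .2, 'c': .7},
# sequence_interval("bac", p, f) ≈ (0.27, 0.30), i.e. the interval [.27, .3)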

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore, specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
start [0.0, 1.0): .49 ∈ [.2, .7) = b’s interval  →  b
within [.2, .7): a = [.2, .3), b = [.3, .55), c = [.55, .7); .49 ∈ [.3, .55)  →  b
within [.3, .55): a = [.3, .35), b = [.35, .475), c = [.475, .55); .49 ∈ [.475, .55)  →  c

The message is bbc.

Representing a real number
Binary fractional representation:
.75 = .11      1/3 = .0101…      11/16 = .1011

Algorithm
1. x = 2 * x
2. If x < 1 output 0
3. else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
code    min      max      interval
.11     .110     .111     [.75, 1.0)
.101    .1010    .1011    [.625, .75)
We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

(Figure: a sequence interval [.61, .79) containing the code interval of .101, i.e. [.625, .75))
Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever the sequence interval falls into the top,
bottom or middle half, expand the interval
by a factor of 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
Output 1 followed by m 0s
m = 0
Message interval is expanded by 2

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m = 0
Message interval is expanded by 2

If l ≥ R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

All other cases: just continue...

You find this at

Arithmetic ToolBox
As a state machine

(Figure: the Arithmetic ToolBox as a state machine: given the current interval (L,s), the distribution (p1,....,pS) and the next symbol c, ATB outputs the new interval (L’,s’))

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
(Figure: the PPM model feeds the ATB with p[ s | context ], where s = c or esc; the ATB maps (L,s) to (L’,s’))

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts        String = ACCBACCACBA B        k=2

Empty context:     A = 4,  B = 2,  C = 5,  $ = 3
Order-1 contexts:  A: C = 3, $ = 1     B: A = 2, $ = 1     C: A = 1, B = 2, C = 2, $ = 3
Order-2 contexts:  AC: B = 1, C = 2, $ = 2     BA: C = 1, $ = 1     CA: C = 1, $ = 1     CB: A = 2, $ = 1     CC: A = 1, B = 1, $ = 2
You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:

Output <d, len, c> where
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match

Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character
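A brute-force Python sketch of the windowed LZ77 parser (the function name and the O(nW) matching loop are illustrative simplifications); it reproduces the example above:

def lz77_encode(T, W):
    out, i, n = [], 0, len(T)
    while i < n:
        best_d, best_len = 0, 0
        for s in range(max(0, i - W), i):            # candidate copy start in the window
            l = 0
            while i + l < n - 1 and T[s + l] == T[i + l]:
                l += 1                               # overlap with the part being encoded is allowed
            if l > best_len:
                best_d, best_len = i - s, l
        out.append((best_d, best_len, T[i + best_len]))
        i += best_len + 1                            # advance by len + 1
    return out

# lz77_encode("aacaacabcabaaac", 6)
#   -> [(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')]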

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids
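A minimal Python sketch of the LZ78 coding loop (a dict of strings stands in for the trie; names are illustrative); it reproduces the coding example below:

def lz78_encode(T):
    dic, out = {"": 0}, []              # phrase -> id; id 0 is the empty phrase
    S = ""
    for c in T:
        if S + c in dic:
            S += c                       # keep extending the longest match
        else:
            out.append((dic[S], c))      # emit (id of S, next char)
            dic[S + c] = len(dic)        # add Sc to the dictionary
            S = ""
    if S:
        out.append((dic[S[:-1]], S[-1])) # flush what is left, if any
    return out

# lz78_encode("aabaacabcabcb") -> [(0,'a'), (1,'b'), (1,'a'), (0,'c'), (2,'c'), (5,'b')]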

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
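A minimal Python sketch of LZW decoding, including the "one step behind" special case (the initial dictionary is passed explicitly, as in the slides' a = 112 convention; names are illustrative):

def lzw_decode(codes, init):
    dic = dict(init)                    # e.g. {112:'a', 113:'b', 114:'c'}
    next_code = 256
    prev = dic[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in dic:
            cur = dic[code]
        else:
            cur = prev + prev[0]        # special case: code not yet known,
                                        # it must be prev followed by prev's first char
        out.append(cur)
        dic[next_code] = prev + cur[0]  # the decoder is one step behind the coder
        next_code += 1
        prev = cur
    return "".join(out)

# lzw_decode([112,112,113,256,114,257,261,114,113], {112:'a',113:'b',114:'c'})
#   -> "aabaacababacb", i.e. the text of the encoding example
#   (the slides' output plus the final code 113 for the trailing b)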

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us be given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: the L → F mapping

(Figure: the sorted rotations of T = mississippi#, with first column F and last column L, as above)
How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
(Figure: the F and L columns of the BWT matrix for mississippi#; the middle of each row is unknown)

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
T[n] = ‘#’;          // row 0 of the sorted matrix starts with #,
r = 0; i = n-1;      // so L[0] is the char that precedes # in T
while (i > 0) {
T[i] = L[r];
r = LF[r]; i--;
}

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]
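A tiny Python sketch of that relation (building L from SA; SA construction itself is the topic of the next slide). The 1-based SA of the slide is used as-is; Python's negative indexing conveniently handles the wrap-around for the suffix starting at position 1.

def bwt_from_sa(T, SA):
    # L[i] = T[SA[i]-1] in 1-based notation; 0-based this is T[sa-2],
    # and sa == 1 gives T[-1], i.e. the final '#', which is the wrap-around case
    return "".join(T[sa - 2] for sa in SA)

T = "mississippi#"
SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
assert bwt_from_sa(T, SA) == "ipssm#pissii"      # the L column shown above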

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n^2 log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:

L is locally homogeneous

⇒ L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion pages available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)

Set of nodes such that from any node one can go to any other node via an
undirected path.

Strongly connected components (SCC)

Set of nodes such that from any node one can go to any other node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by humans



Exploit the structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…

Physical network graph
V = Routers
E = communication links

The “cosine” graph (undirected, weighted)
V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)
V = queries and URLs
E = (q,u) if u is a result for q, and has been clicked by some user who issued q

Social graph (undirected, unweighted)
V = users
E = (x,y) if x knows y (facebook, address book, email, ..)

Definition
Directed graph G = (V,E)

V = URLs, E = (u,v) if u has a hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:

Skewed distribution: Pr that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999 — WebBase Crawl 2001
Indegree follows a power law distribution:

Pr[ in-degree(u) = k ]  ∝  1/k^a,   a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)

V = URLs, E = (u,v) if u has a hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:

Skewed distribution: Pr that a node has x links is 1/x^a, a ≈ 2.1

Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).

Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 million pages, 150 million links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides an efficient, optimal solution




fknown is the “previously encoded text”; compress the concatenation fknown · fnew starting from fnew

zdelta is one of the best implementations
           Emacs size    Emacs time
uncompr    27Mb          ---
gzip       8Mb           35 secs
zdelta     1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

(Figure: a weighted graph over the files plus a dummy node 0; edge weights are the zdelta/gzip sizes, e.g. 20, 123, 220, 620, 2000; the min branching picks, for each file, the cheapest reference)
          space    time
uncompr   30Mb     ---
tgz       20%      linear
THIS      8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n^2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strictly n^2 time.

          space    time
uncompr   260Mb    ---
tgz       12%      2 mins
THIS      8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
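A highly simplified Python sketch of the block-matching idea behind rsync. It is a sketch under several assumptions: Python's built-in hash stands in for the rolling hash + MD5 pair (and is recomputed at every offset instead of rolled), literals are not gzip-compressed, and all names are illustrative.

def rsync_delta(f_old, f_new, B):
    # "server side": encode f_new against the block hashes of f_old
    old_hash = {hash(f_old[i:i + B]): i // B
                for i in range(0, len(f_old) - B + 1, B)}
    out, i = [], 0
    while i < len(f_new):
        k = old_hash.get(hash(f_new[i:i + B]))
        if k is not None and len(f_new) - i >= B:
            out.append(("copy", k))          # the client already has this block
            i += B
        else:
            out.append(("lit", f_new[i]))    # send one literal and slide by 1
            i += 1
    return out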

Rsync: some experiments

         gcc size    emacs size
total    27288       27326
gzip     7563        8577
zdelta   227         1431
rsync    964         4452
Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), the client checks them
Server deploys the common fref to compress the new ftar (rsync compresses just it).

A multi-round protocol

k blocks of n/k elems

log n/k levels

If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
T# = mississippi#

(Figure: the suffix tree of T#; edges are labeled with substrings of T#, and the 12 leaves are labeled with the starting positions of the corresponding suffixes)

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Θ(N^2) space, if the suffixes were stored explicitly
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirect binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
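A small Python sketch of the indirect binary search (0-based suffix array; function names are illustrative). It returns the SA interval of the suffixes having P as a prefix, from which positions and occ follow:

def sa_range(T, SA, P):
    def lower(strictly_greater):
        lo, hi = 0, len(SA)
        while lo < hi:
            mid = (lo + hi) // 2
            pref = T[SA[mid]:SA[mid] + len(P)]       # O(p) chars compared per step
            if pref < P or (strictly_greater and pref == P):
                lo = mid + 1
            else:
                hi = mid
        return lo
    return lower(False), lower(True)

T = "mississippi#"
SA0 = [11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]         # 0-based version of the SA above
lo, hi = sa_range(T, SA0, "si")
# (lo, hi) == (8, 10): occ = 2 occurrences, at text positions SA0[8] = 6 and SA0[9] = 3
# (i.e. positions 7 and 4 in the slides' 1-based notation)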

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
• Search for a run Lcp[i,i+C-2] whose entries are all ≥ L


Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
(Figure: the memory hierarchy — CPU registers, L1/L2 caches (few Mbs, some nanosecs, few words fetched), RAM (few Gbs, tens of nanosecs), disk (few Tbs, few millisecs, B = 32K pages), network (many Tbs, even secs, packets))

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

n          4K    8K    16K   32K    128K   256K   512K   1M
T(n)=n^3   22s   3m    26m   3.5h   28h    --     --     --
T(n)=n^2   0     0     0     1s     26s    106s   7m     28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm

sum = 0; max = -1;
For i = 1, ..., n do
   if (sum + A[i] ≤ 0) sum = 0;
   else { sum += A[i]; max = MAX{max, sum}; }

Note:
• Sum < 0 when OPT starts;
• Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02    m = (i+j)/2;              // Divide
03    Merge-Sort(A,i,m);        // Conquer
04    Merge-Sort(A,m+1,j);
05    Merge(A,i,m,j)            // Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:

n = 10^9 tuples  ⇒  a few Gbs

Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:

It is an indirect sort: Θ(n log2 n) random I/Os

[5ms] * n log2 n  ≈  1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

(Figure: the recursion tree of binary mergesort — log2 N levels of pairwise run merging)

If the run-size is larger than B (i.e. after the first step!!), fetching all of it in memory for merging does not help.
How do we deploy the disk/mem features ?
N/M runs, each sorted in internal memory (no I/Os)
— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
(Figure: X = M/B sorted runs, one input buffer page Bf1..BfX per run with a pointer p1..pX; repeatedly move min(Bf1[p1], Bf2[p2], …, BfX[pX]) to the output buffer Bfo; fetch a new page of run i when pi = B, flush Bfo when it is full, stop at EOF)

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm

Use a pair of variables <X,C>, with C initially 0
For each item s of the stream,

if (C == 0) { X = s; C = 1; }
else if (X == s) C++;
else C--;

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 10^9  ⇒  size = 6Gb
n = 10^6 documents
TotT = 10^9 (avg term length is 6 chars)
t = 5 * 10^5 distinct terms

What kind of data structure should we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million plays (columns), t = 500K terms (rows); 1 if the play contains the word, 0 otherwise

            Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony             1                1             0          0       0        1
Brutus             1                1             0          1       0        0
Caesar             1                1             0          1       1        1
Calpurnia          0                1             0          0       0        0
Cleopatra          1                0             0          0       0        0
mercy              1                0             1          1       1        1
worser             1                0             1          1       1        0

Space is 500Gb !

Solution 2: Inverted index

Brutus     →  2 4 8 16 32 64 128
Calpurnia  →  1 2 3 5 8 13 21 34
Caesar     →  13 16

We can do still better: i.e. 30÷50% of the original text

1. Typically use about 12 bytes per posting
2. We have 10^9 total terms  ⇒  at least 12Gb space
3. Compressing the 6Gb documents gets 1.5Gb data
Better index, but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

Σi=1,...,n-1 2^i  =  2^n - 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:

i(s) = log2 (1/p(s)) = - log2 p(s)

Lower probability  ⇒  higher information

Entropy is the weighted average of i(s):

H(S) = Σs∈S p(s) * log2 (1/p(s))   bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
(Figure: the binary trie of the code; a hangs off 0, b off 100, c off 101, d off 11)

Average Length
For a code C with codeword length L[s], the
average length is defined as
La(C) = Σs∈S p(s) * L[s]

We say that a prefix code C is optimal if for
all prefix codes C’, La(C) ≤ La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, there exists a prefix
code with the same symbol lengths and thus the
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}
then  pi < pj  ⇒  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H(S) ≤ La(C)

Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La(C) ≤ H(S) + 1

(the Shannon code takes ⌈log 1/p⌉ bits)

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

(Figure: the Huffman tree built by merging a(.1)+b(.2) = (.3), then (.3)+c(.2) = (.5), then (.5)+d(.5) = (1))

a=000, b=001, c=01, d=1
There are 2^(n-1) “equivalent” Huffman trees

What about ties (and thus, tree depth) ?
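A small Python sketch of the construction (heap-based, codeword lengths only up to the usual 0/1 relabeling; ties are broken arbitrarily, which is exactly the source of the "equivalent trees" remark above). Names are illustrative.

import heapq

def huffman_codes(probs):
    # probs: dict symbol -> probability; returns dict symbol -> codeword string
    heap = [(p, i, [s]) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    code = {s: "" for s in probs}
    while len(heap) > 1:
        p1, _, syms1 = heapq.heappop(heap)        # the two least probable subtrees
        p2, i2, syms2 = heapq.heappop(heap)
        for s in syms1: code[s] = "0" + code[s]   # prepend one bit to every leaf below
        for s in syms2: code[s] = "1" + code[s]
        heapq.heappush(heap, (p1 + p2, i2, syms1 + syms2))
    return code

# huffman_codes({'a': .1, 'b': .2, 'c': .2, 'd': .5}) gives codeword lengths 3,3,2,1,
# i.e. one of the trees equivalent to a=000, b=001, c=01, d=1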

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
abc...  →  000 001 01  =  00000101
101001...  →  d c b ...

(Figure: the codeword tree of the running example, used for encoding and decoding)
A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:
 firstcode[L]     (= 00.....0 for the deepest level)
 Symbol[L,i], for each i in level L

This is ≤ h^2 + |S| log |S| bits

Canonical Huffman: Encoding / Decoding

firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000 * .00144 ≈ 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:

Model takes |S|^k * (k * log |S|) + h^2   (where h might be |S|)

It is H0(S^L) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
(Figure: word-based tagged Huffman — a 128-ary tree over the words of T, 7-bit configurations plus a tagging bit; the byte-aligned, tagged codewords form the compressed text C(T) for T = “bzip or not bzip”)
CGrep and other ideas...
P= bzip = 1a 0b

P = bzip = 1a 0b      T = “bzip or not bzip”

(Diagram: GREP on compressed text — the compressed pattern is matched directly against C(T), with yes/no at each codeword)
Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

Dictionary = {bzip, not, or}     S = “bzip or not bzip”

(Diagram: the tagged-Huffman tree and the compressed text C(S); the compressed pattern 1a 0b is matched against C(S), with yes/no at each codeword)
Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
(Figure: a text T and a pattern P, with P slid along T)
 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bit operations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errs, whose running
time is efficient with high probability.

We will consider a binary alphabet (i.e., T ∈ {0,1}^n).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H(s) = Σi=1,...,m 2^(m-i) * s[i]

P = 0101
H(P) = 2^3·0 + 2^2·1 + 2^1·0 + 2^0·1 = 5
s = s’ if and only if H(s) = H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H(Tr) = 2·H(Tr-1) - 2^m·T(r-1) + T(r+m-1)

T = 10110101
T1 = 1011
T2 = 0110

H(T1) = H(1011) = 11
H(T2) = H(0110) = 2·11 - 2^4·1 + 0 = 22 - 16 + 0 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
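A minimal Python sketch of the matcher over a binary text, with explicit verification so it never errs; the fixed prime q stands in for the random prime of the real algorithm, and the names are illustrative.

def karp_rabin(T, P, q=2_147_483_647):
    n, m = len(T), len(P)
    if m > n: return []
    top = pow(2, m - 1, q)                    # 2^(m-1) mod q, used to drop the leaving bit
    hp = ht = 0
    for i in range(m):                        # fingerprints of P and of T_1
        hp = (2 * hp + int(P[i])) % q
        ht = (2 * ht + int(T[i])) % q
    occ = []
    for r in range(n - m + 1):
        if hp == ht and T[r:r + m] == P:      # verify, to rule out false matches
            occ.append(r)
        if r + m < n:                         # Hq(T_{r+1}) from Hq(T_r)
            ht = (2 * (ht - int(T[r]) * top) + int(T[r + m])) % q
    return occ

# karp_rabin("10110101", "0101") -> [4]   (position 5 in the slides' 1-based notation)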

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

Dictionary = {bzip, not, or}     S = “bzip or not bzip”

(Diagram: the compressed pattern 1a 0b is matched against the compressed text C(S); the two matching codewords are those of the two occurrences of “bzip”)
Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M is an m x n matrix (m = |P|, n = |T|):

           c a l i f o r n i a
      j:   1 2 3 4 5 6 7 8 9 10
f  i=1:    0 0 0 0 1 0 0 0 0 0
o  i=2:    0 0 0 0 0 1 0 0 0 0
r  i=3:    0 0 0 0 0 0 1 0 0 0
How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the (j-1)-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with the i-th bit of U(T[j]) to establish if both are true
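A compact Python sketch of the whole method (0-based positions; the function name is illustrative); it reproduces the abaac example worked out next:

def shift_and(T, P):
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)      # U(c): 1s where c occurs in P
    last = 1 << (len(P) - 1)
    M, occ = 0, []
    for j, c in enumerate(T):
        M = ((M << 1) | 1) & U.get(c, 0)   # M(j) = BitShift(M(j-1)) & U(T[j])
        if M & last:                       # P ends at position j
            occ.append(j)
    return occ

# shift_and("xabxabaaca", "abaac") -> [8], i.e. the occurrence ending at column 9 of the example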

An example j=1

T = xabxabaaca      P = abaac      M(0) = (0,0,0,0,0)
T[1] = x,  U(x) = (0,0,0,0,0)

M(1) = BitShift(M(0)) & U(T[1]) = (1,0,0,0,0) & (0,0,0,0,0) = (0,0,0,0,0)

An example j=2

T[2] = a,  U(a) = (1,0,1,1,0)

M(2) = BitShift(M(1)) & U(T[2]) = (1,0,0,0,0) & (1,0,1,1,0) = (1,0,0,0,0)

An example j=3

T[3] = b,  U(b) = (0,1,0,0,0)

M(3) = BitShift(M(2)) & U(T[3]) = (1,1,0,0,0) & (0,1,0,0,0) = (0,1,0,0,0)

An example j=9

T[9] = c,  U(c) = (0,0,0,0,1)

M(9) = BitShift(M(8)) & U(T[9]) = (1,1,0,0,1) & (0,0,0,0,1) = (0,0,0,0,1)

The whole matrix, columns j = 1..9:
  i=1:  0 1 0 0 1 0 1 1 0
  i=2:  0 0 1 0 0 1 0 0 0
  i=3:  0 0 0 0 0 0 1 0 0
  i=4:  0 0 0 0 0 0 0 1 0
  i=5:  0 0 0 0 0 0 0 0 1

The 5-th bit of M(9) is 1, so an occurrence of P = abaac ends at position 9 (T[5..9] = abaac).
Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac

U(a) = (1,0,1,1,0)     U(b) = (1,1,0,0,0)     U(c) = (0,0,0,0,1)

What about ‘?’, ‘[^…]’ (not).

Problem 1: Another solution
Dictionary = {bzip, not, or}     P = bzip = 1a 0b     S = “bzip or not bzip”

(Diagram: as before, the compressed pattern 1a 0b is matched directly against the compressed text C(S), with yes/no at each codeword)

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M ( j - 1))  U (T [ j ])
l

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M

l -1

( j - 1))

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1
1 2 3 4 5 6 7 8 910

M1=

T=xabxabaaca
P=
abaad

1

2

3

4

5

6

7

8

9

1
0

1

1

1

1

1

1

1

1

1

1

1

2

0

0

1

0

0

1

0

1

1

0

3

0

0

0

1

0

0

1

0

0

1

4

0

0

0

0

1

0

0

1

0

0

5

0

0

0

0

0

0

0

0

1

0

M0=

1

2

3

4

5

6

7

8

9

1
0

1

0

1

0

0

1

0

1

1

0

1

2

0

0

1

0

0

1

0

0

0

0

3

0

0

0

0

0

0

1

0

0

0

4

0

0

0

0

0

0

0

1

0

0

5

0

0

0

0

0

0

0

0

0

0

How much do we pay?





The running time is O(kn(1+m/w)
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary = { bzip, not, or }

Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing at most k mismatches.

S = “bzip or not bzip”    P = bot    k = 2

[Figure: the dictionary trie and the compressed text C(S); searching the codeword of each dictionary term inside C(S) with at most 2 mismatches marks the matching terms (“yes”)]

not = 1g 0g 0a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding
γ(x) = 0 0 ... 0 followed by x in binary      (Length-1 zeros)

x > 0 and Length = ⌊log2 x⌋ + 1
e.g., 9 is represented as <000,1001>.

The γ-code for x takes 2⌊log2 x⌋ + 1 bits
(i.e. a factor of 2 from optimal)

Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8

6

3

59

7
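For concreteness, a tiny Python sketch of γ-encoding/decoding (our own code) that reproduces the exercise above:

def gamma_encode(x):
    b = bin(x)[2:]                 # binary representation of x > 0
    return "0" * (len(b) - 1) + b

def gamma_decode_all(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":      # count the leading zeros
            z += 1
            i += 1
        out.append(int(bits[i:i + z + 1], 2))   # read z+1 bits of the value
        i += z + 1
    return out

print(gamma_encode(9))                                      # 0001001
print(gamma_decode_all("0001000001100110000011101100111"))  # [8, 6, 3, 59, 7]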

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).

Recall that: |γ(i)| ≤ 2·log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2·H0(S) + 1
Key fact:
1 ≥ Σi=1,...,x pi ≥ x·px   ⟹   x ≤ 1/px

How good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1
The cost of the encoding is (recall i ≤ 1/pi):

Σi=1,...,|S| pi · |γ(i)|  ≤  Σi=1,...,|S| pi · [ 2·log(1/pi) + 1 ]  =  2·H0(X) + 1

Not much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2

A new concept: Continuers vs Stoppers
  Previously we used: s = c = 128

The main idea is:
  s + c = 256 (we are playing with 8 bits)
  Thus s items are encoded with 1 byte
  And s·c with 2 bytes, s·c² with 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 128² = 16512 words on at most 2 bytes
A (230,26)-dense code encodes only 230 + 230·26 = 6210 words on at most 2 bytes, but more of them on 1 byte, and thus it wins if the distribution is skewed...
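A rough Python sketch of the idea (our own code, with one possible byte assignment: values 0..s-1 act as stoppers, values s..255 as continuers):

def sc_encode(rank, s, c):
    k, block = 1, s
    while rank >= block:          # find the codeword length k
        rank -= block
        block *= c
        k += 1
    stopper = rank % s            # last byte
    rank //= s
    cont = []
    for _ in range(k - 1):        # k-1 continuer bytes, base c
        cont.append(s + rank % c)
        rank //= c
    return list(reversed(cont)) + [stopper]

# With s = c = 128 (ETDC) the first 128 + 128*128 = 16512 ranks fit in <= 2 bytes;
# with (s,c) = (230,26) only 230 + 230*26 = 6210 do, but 230 of them use 1 byte.
print(sc_encode(5000, 230, 26), sc_encode(5000, 128, 128))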

Optimal (s,c)-dense codes
Find the optimal s, assuming c = 256 − s.

  Brute-force approach

  Binary search:
    On real distributions, it seems there is one unique minimum

Ks = max codeword length
Fsk = cumulative probability of the symbols whose |cw| ≤ k

Experiments: (s,c)-DC is quite interesting…

Search is 6% faster than byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1^n 2^n 3^n … n^n  ⟹  Huff = O(n² log n), MTF = O(n log n) + n²

Not much worse than Huffman
...but it may be far better
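A minimal Python sketch of the transform and its inverse (our own code):

def mtf_encode(text, alphabet):
    L = list(alphabet)
    out = []
    for s in text:
        i = L.index(s)       # 1) output the position of s in L
        out.append(i)
        L.pop(i)             # 2) move s to the front of L
        L.insert(0, s)
    return out

def mtf_decode(codes, alphabet):
    L = list(alphabet)
    out = []
    for i in codes:
        s = L.pop(i)
        out.append(s)
        L.insert(0, s)
    return "".join(out)

c = mtf_encode("aaabbbbccba", "abc")
print(c, mtf_decode(c, "abc"))   # [0, 0, 0, 1, 0, 0, 0, 2, 0, 1, 2] aaabbbbccba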

MTF: how good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1
Put S in the front and consider the cost of encoding (pix = position of the i-th occurrence of symbol x):

O(|S| log |S|) + Σx=1,...,|S| Σi=2,...,nx |γ( pix − pi-1x )|

By Jensen’s inequality:

≤ O(|S| log |S|) + Σx=1,...,|S| nx · [ 2·log(N/nx) + 1 ]
= O(|S| log |S|) + N·[ 2·H0(X) + 1 ]

⟹  La[mtf] ≤ 2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to maintain the MTF-list efficiently:

  Search tree
    Leaves contain the symbols, ordered as in the MTF-list
    Nodes contain the size of their descending subtree

  Hash Table
    key is a symbol
    data is a pointer to the corresponding tree leaf

Each tree operation takes O(log |S|) time
Total is O(n log |S|), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings, just the run lengths and one starting bit are needed.
Properties:
  Exploits spatial locality, and it is a dynamic code (there is a memory)
  X = 1^n 2^n 3^n … n^n  ⟹  Huff(X) = Θ(n² log n) > Rle(X) = Θ(n log n)
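A minimal Python sketch reproducing the example above (our own code):

def rle_encode(s):
    out, i = [], 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1
        out.append((s[i], j - i))   # (symbol, run length)
        i = j
    return out

print(rle_encode("abbbaacccca"))
# [('a', 1), ('b', 3), ('a', 2), ('c', 4), ('a', 1)]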

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g. p(a) = .2, p(b) = .5, p(c) = .3

f(i) = Σj=1,...,i-1 p(j), so f(a) = .0, f(b) = .2, f(c) = .7

[Figure: the unit interval [0,1) split into a = [0,.2), b = [.2,.7), c = [.7,1)]

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac

[Figure: interval refinement. b restricts [0,1) to [.2,.7); a restricts it to [.2,.3); c restricts it to [.27,.3)]

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c1...cn with probabilities p[c], use the following:

l0 = 0      li = li-1 + si-1 · f[ci]
s0 = 1      si = si-1 · p[ci]

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is  sn = Πi=1,...,n p[ci]

The interval for a message sequence will be called the sequence interval
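A toy floating-point Python sketch of these formulas (our own code; a real coder uses the integer/scaling version described later), reproducing the “bac” example:

p = {"a": 0.2, "b": 0.5, "c": 0.3}
f = {"a": 0.0, "b": 0.2, "c": 0.7}       # cumulative prob. up to the symbol

def sequence_interval(msg):
    l, s = 0.0, 1.0
    for c in msg:
        l = l + s * f[c]
        s = s * p[c]
    return l, s                           # the interval is [l, l+s)

l, s = sequence_interval("bac")
print(l, l + s)                           # approx. 0.27 0.3, i.e. the interval [.27, .3)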

Uniquely defining an interval
Important property: the intervals for distinct messages of length n will never overlap.
Therefore, specifying any number in the final interval uniquely determines the message.
Decoding is similar to encoding, but at each step we need to determine the symbol and then reduce the interval.

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:

[Figure: .49 falls in b = [.2,.7), then in b = [.3,.55), then in c = [.475,.55)]

The message is bbc.

Representing a real number
Binary fractional representation:
.75 = .11      1/3 = .0101...      11/16 = .1011

Algorithm
1. x = 2·x
2. If x < 1 output 0
3. else x = x − 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
e.g. [0,.33) → .01      [.33,.66) → .1      [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions.

          min       max       interval
.11       .110...   .111...   [.75, 1.0)
.101      .1010...  .1011...  [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval is contained in the sequence interval (dyadic number).

[Figure: the sequence interval [.61,.79) contains the code interval of .101, i.e. [.625,.75)]

Can use L + s/2 truncated to 1 + ⌈log(1/s)⌉ bits

Bound on Arithmetic length

Note that ⌈−log s⌉ + 1 = ⌈log(2/s)⌉

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most
1 + ⌈log(1/s)⌉ = 1 + ⌈log Πi (1/pi)⌉ ≤ 2 + Σj=1,...,n log(1/pj) = 2 + Σk=1,...,|S| n·pk·log(1/pk) = 2 + n·H0 bits

In practice nH0 + 0.02·n bits, because of rounding

Integer Arithmetic Coding
The problem is that operations on arbitrary-precision real numbers are expensive.
Key ideas of the integer version:
  Keep integers in range [0..R) where R = 2^k
  Use rounding to generate integer intervals
  Whenever the sequence interval falls into the top, bottom or middle half, expand the interval by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 (top half):
  Output 1 followed by m 0s; m = 0; the interval is expanded by 2

If u < R/2 (bottom half):
  Output 0 followed by m 1s; m = 0; the interval is expanded by 2

If l ≥ R/4 and u < 3R/4 (middle half):
  Increment m; the interval is expanded by 2

In all other cases, just continue...
You find this at

Arithmetic ToolBox
As a state machine

[Figure: the ATB maps the current interval (L,s), the distribution (p1,...,p|S|) and the next symbol c into the new interval (L',s')]

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox

[Figure: at each step the ATB is fed with p[ s | context ], where s = c or esc, and maps (L,s) into (L',s')]

Encoder and Decoder must know the protocol for selecting the same conditional probability distribution (PPM-variant)

PPM: Example Contexts            String = ACCBACCACBA B        k = 2

Context: Empty    Counts: A = 4, B = 2, C = 5, $ = 3

Context: A        Counts: C = 3, $ = 1
Context: B        Counts: A = 2, $ = 1
Context: C        Counts: A = 1, B = 2, C = 2, $ = 3

Context: AC       Counts: B = 1, C = 2, $ = 2
Context: BA       Counts: C = 1, $ = 1
Context: CA       Counts: C = 1, $ = 1
Context: CB       Counts: A = 2, $ = 1
Context: CC       Counts: A = 1, B = 1, $ = 2

You find this at: compression.ru/ds/
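A small Python sketch (our own code) that rebuilds the count tables above from the string ACCBACCACBA, taking the escape count $ as the number of distinct successors of each context (a PPM-C-like choice):

from collections import defaultdict

def ppm_counts(s, kmax):
    tables = {k: defaultdict(lambda: defaultdict(int)) for k in range(kmax + 1)}
    for i, c in enumerate(s):
        for k in range(kmax + 1):
            if i >= k:
                tables[k][s[i - k:i]][c] += 1   # context of length k -> next char
    return tables

tables = ppm_counts("ACCBACCACBA", 2)
for k in range(3):
    for ctx, cnt in sorted(tables[k].items()):
        esc = len(cnt)                           # '$' = number of distinct successors
        print(k, repr(ctx), dict(cnt), "$ =", esc)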

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c

[Figure: the dictionary is the set of all substrings starting before the cursor; here the emitted triple is <2,3,c>]

Algorithm’s step:
  Output <d, len, c> where
    d = distance of the copied string wrt the current position
    len = length of the longest match
    c = next char in the text beyond the longest match
  Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character
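A naive Python sketch of the parsing above (our own code; a real implementation would use hashing instead of this quadratic scan), reproducing the example:

def lz77_encode(T, W=6):
    out, i, n = [], 0, len(T)
    while i < n:
        best_len, best_d = 0, 0
        for d in range(1, min(i, W) + 1):          # candidate copy distances in the window
            l = 0
            while i + l < n - 1 and T[i + l - d] == T[i + l]:
                l += 1                              # overlap with the text is allowed
            if l > best_len:
                best_len, best_d = l, d
        out.append((best_d, best_len, T[i + best_len]))
        i += best_len + 1
    return out

print(lz77_encode("aacaacabcabaaac"))
# [(0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), (3, 3, 'a'), (1, 2, 'c')]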

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids
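A compact Python sketch of LZ78 coding (our own code), reproducing the coding example shown next:

def lz78_encode(T):
    D, out, i = {}, [], 0
    while i < len(T):
        phrase, j = "", i
        while j < len(T) and phrase + T[j] in D:   # longest match S in the dictionary
            phrase += T[j]
            j += 1
        c = T[j] if j < len(T) else ""
        out.append((D.get(phrase, 0), c))          # (id of the match, next char)
        D[phrase + c] = len(D) + 1                 # add S·c with a fresh id
        i = j + 1
    return out

print(lz78_encode("aabaacabcabcb"))
# [(0, 'a'), (1, 'b'), (1, 'a'), (0, 'c'), (2, 'c'), (5, 'b')]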

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
We are given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
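A tiny Python sketch (our own code) of the forward transform, plus the LF-based inversion corresponding to InvertBWT above:

def bwt(T):                              # T must end with a unique '#'
    rots = sorted(T[i:] + T[:i] for i in range(len(T)))
    return "".join(r[-1] for r in rots)

def ibwt(L):
    # LF[r] = row of the sorted matrix whose first char is L[r],
    # preserving the relative order of equal characters (key property 1)
    order = sorted(range(len(L)), key=lambda r: (L[r], r))
    LF = [0] * len(L)
    for f_row, r in enumerate(order):
        LF[r] = f_row
    out, r = [], 0                       # row 0 is the rotation starting with '#'
    for _ in range(len(L)):
        out.append(L[r])                 # L[r] precedes F[r] in T (key property 2)
        r = LF[r]
    t = "".join(reversed(out))           # T rotated so that '#' comes first
    return t[1:] + t[0]

L = bwt("mississippi#")
print(L, ibwt(L))                        # ipssm#pissii mississippi#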

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion pages are available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)
  Set of nodes such that from any node one can reach any other node via an undirected path.

Strongly connected components (SCC)
  Set of nodes such that from any node one can reach any other node via a directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by humans

Exploit the structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…

  Physical network graph
    V = routers, E = communication links

  The “cosine” graph (undirected, weighted)
    V = static web pages, E = semantic distance between pages

  Query-Log graph (bipartite, weighted)
    V = queries and URLs, E = (q,u) if u is a result for q and has been clicked by some user who issued q

  Social graph (undirected, unweighted)
    V = users, E = (x,y) if x knows y (facebook, address book, email, ...)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pr that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows a power law distribution
WebBase Crawl 2001

Pr[ in-degree(u) = k ] ∝ 1/k^α, with α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:
  Skewed distribution: Pr that a node has x links is 1/x^α, α ≈ 2.1
  Locality: usually most of the hyperlinks point to other URLs on the same host (about 80%).
  Similarity: pages close in lexicographic order tend to share many outgoing lists

A Picture of the Web Graph

[Figure: adjacency matrix of the web graph; locality and similarity show up as clustered entries around (i,j)]

21 million pages, 150 million links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y’s copy-list tells whether the corresponding successor of the reference x is also a successor of y;
The reference index is chosen in [0,W] so as to give the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides an efficient, optimal solution

  fknown is the “previously encoded text”: compress the concatenation fknown·fnew, starting the parsing from fnew

zdelta is one of the best implementations
            Emacs size    Emacs time
uncompr     27Mb          ---
gzip        8Mb           35 secs
zdelta      1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: weighted graph over the files, plus a dummy node 0; edge weights (e.g. 20, 123, 220, 620, 2000) are the zdelta sizes, and the min branching picks the cheapest reference for each file]

            space    time
uncompr     30Mb     ---
tgz         20%      linear
THIS        8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
            space    time
uncompr     260Mb    ---
tgz         12%      2 mins
THIS        8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
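A much-simplified Python sketch of the block-matching idea (our own code: Python’s hash() stands in for the rolling hash + MD5 pair, and no rolling update is implemented). The client hashes fixed-size blocks of f_old; the server slides over f_new and emits either a block id or a literal.

def block_hashes(f_old, B):
    return {hash(f_old[i:i + B]): i // B for i in range(0, len(f_old), B)}

def rsync_encode(f_new, hashes, B):
    out, i = [], 0
    while i < len(f_new):
        h = hash(f_new[i:i + B])
        if len(f_new) - i >= B and h in hashes:
            out.append(("block", hashes[h]))      # reuse a block of f_old
            i += B
        else:
            out.append(("char", f_new[i]))        # literal byte
            i += 1
    return out

f_old = "the quick brown fox jumps over the lazy dog"
f_new = "the quick brown fox jumped over the lazy dog"
print(rsync_encode(f_new, block_hashes(f_old, 4), 4))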

Rsync: some experiments

gcc size
total
27288
gzip
7563
zdelta
227
rsync
964

emacs size
27326
8577
1431
4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), and the client checks them
Server deploys the common fref to compress the new ftar (rsync compresses just it).

A multi-round protocol

k blocks of n/k elems

log(n/k) levels

If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Figure: the suffix tree of T# = mississippi#; edges are labelled with substrings (#, i, s, p, si, ssi, ppi#, pi#, i#, mississippi#, ...) and the 12 leaves store the starting positions of the suffixes]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Θ(N²) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
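A small Python sketch (our own code) of the naive construction and of the binary search; it materializes the suffixes only to keep the sketch short, which is exactly the Θ(N²)-space inefficiency noted earlier.

from bisect import bisect_left, bisect_right

def suffix_array(T):
    return sorted(range(len(T)), key=lambda i: T[i:])

def search(T, SA, P):
    suff = [T[i:] for i in SA]                      # only to keep the sketch short
    lo = bisect_left(suff, P)
    hi = bisect_right(suff, P + chr(0x10FFFF))      # upper bound for strings prefixed by P
    return sorted(SA[lo:hi])

T = "mississippi#"
SA = suffix_array(T)
print([s + 1 for s in search(T, SA, "si")])         # [4, 7] (1-based, as in the slides)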

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp = 0 1 1 4 0 0 1 0 2 1 3
SA  = 12 11 8 5 2 1 10 9 7 4 6 3
T = mississippi#

[Figure: the suffixes issippi (pos 4) and ississippi (pos 2) are adjacent in SA and share the prefix “issi” of length 4]
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does it exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does it exist a substring of length ≥ L occurring ≥ C times ?
• Search for Lcp[i,i+C-2] whose entries are ≥ L
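A small Python sketch (our own code) of the Lcp array and of the two queries above:

def lcp_array(T, SA):
    def lcp(a, b):
        l = 0
        while a + l < len(T) and b + l < len(T) and T[a + l] == T[b + l]:
            l += 1
        return l
    return [lcp(SA[i], SA[i + 1]) for i in range(len(SA) - 1)]

def longest_repeat_at_least(Lcp, L):
    return any(v >= L for v in Lcp)           # a substring of length >= L occurs twice

def repeat_with_C_occurrences(Lcp, L, C):
    # a substring of length >= L occurring >= C times needs C-1 consecutive Lcp entries >= L
    run = 0
    for v in Lcp:
        run = run + 1 if v >= L else 0
        if run >= C - 1:
            return True
    return False

T = "mississippi#"
SA = sorted(range(len(T)), key=lambda i: T[i:])
Lcp = lcp_array(T, SA)
print(Lcp, longest_repeat_at_least(Lcp, 4), repeat_with_C_occurrences(Lcp, 3, 2))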


Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M ( j - 1))  U (T [ j ])
l

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M

l -1

( j - 1))

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1
1 2 3 4 5 6 7 8 910

M1=

T=xabxabaaca
P=
abaad

1

2

3

4

5

6

7

8

9

1
0

1

1

1

1

1

1

1

1

1

1

1

2

0

0

1

0

0

1

0

1

1

0

3

0

0

0

1

0

0

1

0

0

1

4

0

0

0

0

1

0

0

1

0

0

5

0

0

0

0

0

0

0

0

1

0

M0=

1

2

3

4

5

6

7

8

9

1
0

1

0

1

0

0

1

0

1

1

0

1

2

0

0

1

0

0

1

0

0

0

0

3

0

0

0

0

0

0

1

0

0

0

4

0

0

0

0

0

0

0

1

0

0

5

0

0

0

0

0

0

0

0

0

0

How much do we pay?





The running time is O(kn(1+m/w)
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Delection: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
  g(x) = 00...0 (Length-1 zeroes) followed by x in binary,  with x > 0 and Length = ⌊log2 x⌋ + 1
  e.g., 9 is represented as <000, 1001>.

  g-code for x takes 2⌊log2 x⌋ + 1 bits   (i.e., a factor of 2 from optimal)

  Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of g-coded integers, reconstruct the original sequence:

  0001000001100110000011101100111   →   8, 6, 3, 59, 7
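
A small Python sketch of g-encoding and g-decoding (illustrative helper names), which reproduces the exercise above:

def gamma_encode(x):                  # x > 0
    b = bin(x)[2:]                    # x in binary; len(b) = floor(log2 x) + 1
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":         # Length-1 leading zeroes
            z += 1; i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out

# gamma_encode(9) -> "0001001"
# gamma_decode("0001000001100110000011101100111") -> [8, 6, 3, 59, 7]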

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How good is this approach with respect to Huffman?
Compression ratio ≤ 2 H0(s) + 1
Key fact:   1 ≥ Σi=1,...,x pi ≥ x · px   ⇒   x ≤ 1/px

How good is it ?
Encode the integers via g-coding:  |g(i)| ≤ 2 log i + 1
The cost of the encoding is (recall i ≤ 1/pi):

  Σi=1,...,|S| pi · |g(i)|   ≤   Σi=1,...,|S| pi · [ 2 log(1/pi) + 1 ]   =   2 H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 128² = 16512 words within 2 bytes
(230,26)-dense code encodes 230 + 230·26 = 6210 words within 2 bytes, hence more words on 1 byte: if the distribution is skewed, it wins...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic

  X = 1^n 2^n 3^n … n^n   ⇒   Huff = O(n² log n),  MTF = O(n log n) + n²

Not much worse than Huffman
...but it may be far better
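
A minimal Python sketch of the MTF coder just described (illustrative names; the starting list is passed explicitly):

def mtf_encode(text, alphabet):
    L = list(alphabet)                # the list of symbols
    out = []
    for s in text:
        i = L.index(s)                # 1) output the position of s in L
        out.append(i)
        L.insert(0, L.pop(i))         # 2) move s to the front of L
    return out

# mtf_encode("banana", "abcdn") -> [1, 1, 4, 1, 1, 1]: temporal locality turns into small integers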

MTF: how good is it ?
Encode the integers via g-coding:  |g(i)| ≤ 2 log i + 1
Put S in front of the sequence and consider the cost of encoding
(here p_i^x denotes the position of the i-th occurrence of symbol x):

  O(|S| log |S|)  +  Σx=1,...,|S|  Σi=2,...,nx  |g( p_i^x − p_(i-1)^x )|

By Jensen’s inequality:

  ≤  O(|S| log |S|)  +  Σx=1,...,|S|  nx · [ 2 log (N/nx) + 1 ]
  =  O(|S| log |S|)  +  N · [ 2 H0(X) + 1 ]

  ⇒   La[mtf]  ≤  2 H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
There is a memory

  X = 1^n 2^n 3^n … n^n   ⇒   Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)
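
A short Python sketch of RLE (illustrative), reproducing the example above:

def rle(s):
    out, i = [], 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1
        out.append((s[i], j - i))     # (symbol, run length)
        i = j
    return out

# rle("abbbaacccca") -> [('a',1), ('b',3), ('a',2), ('c',4), ('a',1)]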

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.  p(a) = .2,  p(b) = .5,  p(c) = .3

  f(i) = Σj<i p(j)   ⇒   f(a) = .0,  f(b) = .2,  f(c) = .7

[figure: the unit interval split into a = [0,.2), b = [.2,.7), c = [.7,1.0)]

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
[figure: coding b, then a, then c narrows the interval: [0,1) → [.2,.7) → [.2,.3) → [.27,.3)]
The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols ci with probabilities p[ci] use the following:

  l0 = 0,   s0 = 1
  li = li-1 + si-1 · f[ci]
  si = si-1 · p[ci]

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is   sn = Πi=1,...,n p[ci]

The interval for a message sequence will be called the sequence interval
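
A minimal Python sketch of these formulas (floating point is used only for illustration; a real coder uses the integer version discussed later):

def sequence_interval(msg, p, f):
    l, s = 0.0, 1.0                   # l0 = 0, s0 = 1
    for c in msg:
        l = l + s * f[c]              # li = l(i-1) + s(i-1) * f[ci]
        s = s * p[c]                  # si = s(i-1) * p[ci]
    return l, s                       # the sequence interval is [l, l+s)

p = {'a': .2, 'b': .5, 'c': .3}
f = {'a': .0, 'b': .2, 'c': .7}
# sequence_interval("bac", p, f) -> approximately (0.27, 0.03), i.e. the interval [.27, .3) of the example below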

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore, specifying any number in the final interval uniquely determines the msg.
Decoding is similar to encoding, but at each step we need to determine the next message symbol and then reduce the interval.

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
[figure: locate .49 inside b = [.2,.7), rescale; it falls again in b’s sub-interval, then in c’s]

The message is bbc.
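
A matching Python sketch of the decoding side (illustrative; it assumes the message length is known, as in the example above):

def arith_decode(x, n, p, f):
    syms = sorted(f, key=f.get)       # symbols in order of cumulative probability
    out = []
    for _ in range(n):
        for c in reversed(syms):      # find the symbol interval containing x
            if x >= f[c]:
                out.append(c)
                x = (x - f[c]) / p[c] # rescale x and decode the next symbol
                break
    return "".join(out)

p = {'a': .2, 'b': .5, 'c': .3}
f = {'a': .0, 'b': .2, 'c': .7}
# arith_decode(0.49, 3, p, f) -> 'bbc'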

Representing a real number
Binary fractional representation:
  .75 = .11      1/3 = .010101...      11/16 = .1011

Algorithm (emit the bits of x):
  1. x = 2*x
  2. If x < 1 output 0
  3. else x = x - 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
  e.g.  [0,.33) → .01      [.33,.66) → .1      [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions.

          min      max      interval
  .11     .110     .111     [.75, 1.0)
  .101    .1010    .1011    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most
  1 + ⌈log (1/s)⌉  =  1 + ⌈log Πi (1/pi)⌉
  ≤  2 + Σi=1,...,n log (1/pi)
  =  2 + Σk=1,...,|S| n·pk log (1/pk)
  =  2 + n H0   bits

In practice ≈ n·H0 + 0.02·n bits, because of rounding.

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

[figure: ATB as a state machine, mapping the current interval (L,s) and the next symbol c, drawn from the distribution (p1,...,p|S|), to the new interval (L’,s’)]

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
[figure: PPM feeds p[ s | context ] (with s = c or esc) to the Arithmetic ToolBox, which maps (L,s) to (L’,s’)]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts      String = ACCBACCACBA B,   k = 2

Context Empty:  A = 4,  B = 2,  C = 5,  $ = 3

Context A:   C = 3,  $ = 1
Context B:   A = 2,  $ = 1
Context C:   A = 1,  B = 2,  C = 2,  $ = 3

Context AC:  B = 1,  C = 2,  $ = 2
Context BA:  C = 1,  $ = 1
Context CA:  C = 1,  $ = 1
Context CB:  A = 2,  $ = 1
Context CC:  A = 1,  B = 1,  $ = 2
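
The table above can be reproduced with a small Python sketch (hypothetical helper; counting the escape $ as the number of distinct symbols seen in the context is a PPMC-style assumption):

from collections import defaultdict

def ppm_counts(s, k):
    counts = defaultdict(lambda: defaultdict(int))
    for i, c in enumerate(s):
        for order in range(0, k + 1):         # keep statistics for every context size <= k
            if i - order < 0:
                continue
            ctx = s[i - order:i]
            if counts[ctx][c] == 0:
                counts[ctx]["$"] += 1         # first time c follows ctx: bump the escape count
            counts[ctx][c] += 1
    return counts

# dict(ppm_counts("ACCBACCACBA", 2)[""])  gives A=4, B=2, C=5, $=3
# dict(ppm_counts("ACCBACCACBA", 2)["AC"]) gives B=1, C=2, $=2   (as in the table)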

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
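
A Python sketch of this decoder (illustrative names), including the overlapping-copy case; it reproduces the windowed example above:

def lz77_decode(triples):
    out = []
    for d, length, c in triples:
        start = len(out) - d
        for i in range(length):       # works even when length > d: the copy overlaps itself
            out.append(out[start + i])
        out.append(c)                 # the explicit next character
    return "".join(out)

# lz77_decode([(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')]) -> 'aacaacabcabaaac'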

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids
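
A compact Python sketch of the LZ78 coding loop (illustrative; how a pending match is flushed at the end of the input is an assumption):

def lz78_encode(text):
    dic = {"": 0}                     # id 0 is the empty string
    out, w = [], ""
    for c in text:
        if w + c in dic:
            w += c                    # extend the longest match S
        else:
            out.append((dic[w], c))   # output (id of S, next char c)
            dic[w + c] = len(dic)     # add Sc to the dictionary
            w = ""
    if w:                             # flush a pending match (assumed convention)
        out.append((dic[w[:-1]], w[-1]))
    return out

# lz78_encode("aabaacabcabcb") -> [(0,'a'), (1,'b'), (1,'a'), (0,'c'), (2,'c'), (5,'b')]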

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
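
A minimal Python sketch of the LZW coder (illustrative; the initial codes follow the toy convention of the next slide, with new entries starting at 256):

def lzw_encode(text, first_codes):
    dic = dict(first_codes)           # e.g. {'a': 112, 'b': 113, 'c': 114}
    nxt, out, w = 256, [], text[0]
    for c in text[1:]:
        if w + c in dic:
            w += c
        else:
            out.append(dic[w])        # emit only the id: c is NOT transmitted
            dic[w + c] = nxt; nxt += 1
            w = c
    out.append(dic[w])                # flush the last pending match
    return out

# lzw_encode("aabaacababacb", {'a': 112, 'b': 113, 'c': 114})
# -> [112, 112, 113, 256, 114, 257, 261, 114, 113]; the trailing 113 is the final flush,
#    which the example on the next slide stops just before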

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform    (1994)
Let us be given a text T = mississippi#

Rotations of T:       Sorted rows:   F ......... L
mississippi#          # mississipp i
ississippi#m          i #mississip p
ssissippi#mi          i ppi#missis s
sissippi#mis          i ssippi#mis s
issippi#miss          i ssissippi# m
ssippi#missi          m ississippi #
sippi#missis          p i#mississi p
ippi#mississ          p pi#mississ i
ppi#mississi          s ippi#missi s
pi#mississip          s issippi#mi s
i#mississipp          s sippi#miss i
#mississippi          s sissippi#m i

A famous example

Much
longer...

A useful tool: L → F mapping
[figure: the F and L columns of the sorted rotations; the middle of each row is unknown]

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...
Take two equal L’s chars and rotate rightward their rows: same relative order !!

The BWT is invertible
[figure: the F and L columns of the sorted rotations, as above; the rest of the matrix is unknown]
Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
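
A compact Python sketch of both directions (illustrative, quadratic-time; it assumes T ends with a unique sentinel, here ‘#’, that sorts before every other character):

def bwt(t):
    rot = sorted(t[i:] + t[:i] for i in range(len(t)))    # the sorted rotations
    return "".join(r[-1] for r in rot)                    # L = last column

def ibwt(L):
    F = sorted(L)
    smaller = {c: F.index(c) for c in set(L)}             # chars of L smaller than c
    seen, LF = {}, []
    for c in L:                                           # LF-array: equal chars keep their order
        LF.append(smaller[c] + seen.get(c, 0))
        seen[c] = seen.get(c, 0) + 1
    r, out = 0, []                                        # row 0 starts with the sentinel
    for _ in range(len(L) - 1):
        out.append(L[r])                                  # L[r] precedes F[r] in T
        r = LF[r]
    return "".join(reversed(out)) + min(L)                # reconstruct T backward

# bwt("mississippi#")  -> 'ipssm#pissii'
# ibwt("ipssm#pissii") -> 'mississippi#'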

How to compute the BWT ?
SA    BWT matrix     L
12    #mississipp    i
11    i#mississip    p
 8    ippi#missis    s
 5    issippi#mis    s
 2    ississippi#    m
 1    mississippi    #
10    pi#mississi    p
 9    ppi#mississ    i
 7    sippi#missi    s
 4    sissippi#mi    s
 6    ssippi#miss    i
 3    ssissippi#m    i

We said that: L[i] precedes F[i] in T
L[3] = T[7]
Given SA and T, we have L[i] = T[SA[i]-1]
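
As a one-liner sketch of the identity L[i] = T[SA[i]-1] (SA is 1-based as in the slides; when SA[i] = 1 the character wraps around to the final ‘#’):

def bwt_from_sa(T, SA):
    n = len(T)
    return "".join(T[(sa - 2) % n] for sa in SA)   # T[SA[i]-1], with wrap-around for SA[i]=1

SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
# bwt_from_sa("mississippi#", SA) -> 'ipssm#pissii'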

How to construct SA from T ?
Input: T = mississippi#

SA    suffix
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#

Elegant but inefficient. Obvious inefficiencies:
• Θ(n² log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion pages available (Google, 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Lifetime of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node one can reach any other node via an undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node one can reach any other node via a directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by humans


Exploit the structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…

Physical network graph
  V = Routers
  E = communication links

The “cosine” graph (undirected, weighted)
  V = static web pages
  E = semantic distance between pages

Query-Log graph (bipartite, weighted)
  V = queries and URLs
  E = (q,u) if u is a result for q, and has been clicked by some user who issued q

Social graph (undirected, unweighted)
  V = users
  E = (x,y) if x knows y (facebook, address book, email, ..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution

[figure: in-degree distributions for the Altavista crawl (1999) and the WebBase crawl (2001); indegree follows a power-law distribution]

  Pr[ in-degree(u) = k ]  ∝  1 / k^a ,   with a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
[figure: adjacency-matrix snapshot of a crawl with 21 million pages and 150 million links]

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:
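
A tiny Python sketch of the gap encoding of a successor list (the node id and the successors are hypothetical numbers; folding the possibly negative first gap into a non-negative integer is a common convention, stated here as an assumption):

def gap_encode(x, succ):              # succ = sorted successor list of node x
    return [succ[0] - x] + [succ[i] - succ[i - 1] - 1 for i in range(1, len(succ))]

def fold(v):                          # map v >= 0 to 2v and v < 0 to 2|v|-1
    return 2 * v if v >= 0 else 2 * (-v) - 1

# gap_encode(15, [13, 15, 16, 17, 18, 19, 23, 24, 203]) -> [-2, 1, 0, 0, 0, 0, 3, 0, 178]
# fold(-2) -> 3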

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides an efficient, optimal solution

  fknown is the “previously encoded text”: compress the concatenation fknown·fnew, emitting output only from fnew onwards

zdelta is one of the best implementations

              Emacs size   Emacs time
  uncompr     27Mb         ---
  gzip        8Mb          35 secs
  zdelta      1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[figure: the weighted graph GF with a dummy node; the min branching picks, for each file, its best reference]

              space   time
  uncompr     30Mb    ---
  tgz         20%     linear
  THIS        8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
              space    time
  uncompr     260Mb    ---
  tgz         12%      2 mins
  THIS        8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
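
A sketch of a Karp-Rabin-style rolling hash over fixed-size blocks, the idea behind rsync’s weak checksum (this is not rsync’s exact 4-byte checksum, just an illustration):

def rolling_hashes(data, block):
    B, q = 256, (1 << 31) - 1
    h = 0
    for c in data[:block]:
        h = (h * B + c) % q
    out = [h]                          # out[i] = hash of data[i : i+block]
    top = pow(B, block - 1, q)
    for i in range(block, len(data)):
        h = ((h - data[i - block] * top) * B + data[i]) % q   # slide the window by one byte
        out.append(h)
    return out

# rolling_hashes(b"abracadabra", 4)[0] == rolling_hashes(b"abracadabra", 4)[7]   # both hash "abra"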

Rsync: some experiments

            gcc size   emacs size
  total     27288      27326
  gzip      7563       8577
  zdelta    227        1431
  rsync     964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes fail to find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

[figure: P aligned against the suffix T[i,N]]

Occurrences of P in T = all suffixes of T having P as a prefix
  P = si,  T = mississippi   →   occurrences at positions 4, 7

SUF(T) = sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
T# = mississippi#   (positions 1..12)

[figure: the suffix tree of T#; edges are labelled with substrings of T# (e.g. #, i, s, p, si, ssi, ppi#, pi#, i#, mississippi#) and each leaf stores the starting position of its suffix]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

T = mississippi#        (storing SUF(T) explicitly would take Θ(N²) space)

SA    SUF(T)
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#

P = si

Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
⇒ In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step
T = mississippi#,   P = si

[figure: two binary-search steps over SA; at the first probed suffix P is larger, at the second P is smaller]

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]

Locating the occurrences
T = mississippi#,  P = si   →   occ = 2, at positions 4 and 7
[figure: the occurrences form a contiguous range of SA, delimited by binary-searching si# and si$]

Suffix Array search
• O(p + log2 N + occ) time
Suffix Trays: O(p + log2 |S| + occ)   [Cole et al., ‘06]
String B-tree   [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays   [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

T = mississippi#
Lcp = 0 0 1 4 0 0 1 0 2 1 3
SA  = 12 11 8 5 2 1 10 9 7 4 6 3

• How long is the common prefix between T[i,...] and T[j,...] ?
  Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  Search for Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  Search for a window Lcp[i,i+C-2] whose entries are all ≥ L

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M ( j - 1))  U (T [ j ])
l

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M

l -1

( j - 1))

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1
1 2 3 4 5 6 7 8 910

M1=

T=xabxabaaca
P=
abaad

1

2

3

4

5

6

7

8

9

1
0

1

1

1

1

1

1

1

1

1

1

1

2

0

0

1

0

0

1

0

1

1

0

3

0

0

0

1

0

0

1

0

0

1

4

0

0

0

0

1

0

0

1

0

0

5

0

0

0

0

0

0

0

0

1

0

M0=

1

2

3

4

5

6

7

8

9

1
0

1

0

1

0

0

1

0

1

1

0

1

2

0

0

1

0

0

1

0

0

0

0

3

0

0

0

0

0

0

1

0

0

0

4

0

0

0

0

0

0

0

1

0

0

5

0

0

0

0

0

0

0

0

0

0

How much do we pay?





The running time is O(kn(1+m/w)
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Delection: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
0 000 ...........0 x in b in ary
Length-1


x > 0 and Length = log2 x +1
e.g., 9 represented as <000,1001>.



g-code for x takes 2 log2 x +1 bits

(ie. factor of 2 from optimal)



Optimal for Pr(x) = 1/2x2, and i.i.d integers

It is a prefix-free encoding…


Given the following sequence of γ-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

Answer: 8, 6, 3, 59, 7
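A small Python sketch of γ-encoding/decoding (function names are mine), which also checks the exercise above:

def gamma_encode(x):
    assert x > 0
    b = bin(x)[2:]                       # binary representation of x, MSB first
    return "0" * (len(b) - 1) + b        # (Length-1) zeros, then x in binary

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":            # count the leading zeros...
            z += 1; i += 1
        out.append(int(bits[i:i + z + 1], 2))   # ...then read z+1 binary digits
        i += z + 1
    return out

print(gamma_encode(9))                                        # 0001001
print(gamma_decode("0001000001100110000011101100111"))        # [8, 6, 3, 59, 7]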

Analysis
Sort the pi in decreasing order, and encode si via
the variable-length code γ(i).

Recall that: |γ(i)| ≤ 2 * log i + 1
How good is this approach w.r.t. Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
  1 ≥ Σi=1,...,x pi ≥ x * px   ⟹   x ≤ 1/px

How good is it ?
Encode the integers via γ-coding:  |γ(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):

  Σi=1,...,|Σ| pi * |γ(i)|  ≤  Σi=1,...,|Σ| pi * [ 2 * log(1/pi) + 1 ]  =  2 * H0(X) + 1

Not much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2

A new concept: Continuers vs Stoppers
  Previously we used: s = c = 128

The main idea is:
  s + c = 256 (we are playing with 8 bits)
  Thus s items are encoded with 1 byte
  And s*c with 2 bytes, s*c^2 with 3 bytes, ...

An example

5000 distinct words
ETDC encodes 128 + 128^2 = 16512 words on at most 2 bytes
A (230,26)-dense code encodes 230 + 230*26 = 6210 words on at most 2
bytes, hence more words fit in a single byte, which pays off if the
distribution is skewed...
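A hedged sketch of an (s,c)-dense encoder (one possible digit order; names and layout are mine, not the canonical ETDC/(s,c)-DC format):

def sc_encode(rank, s, c):
    # codeword for the word of given rank (0-based, most frequent first):
    # the last byte is a stopper in [0,s), the others are continuers in [s, s+c)
    k, base = 1, s                       # 1 byte covers s ranks, 2 bytes s*c, 3 bytes s*c^2, ...
    while rank >= base:
        rank -= base
        base *= c
        k += 1
    stopper = rank % s
    rank //= s
    cont = []
    for _ in range(k - 1):               # k-1 continuer digits, written in base c
        cont.append(s + rank % c)
        rank //= c
    return list(reversed(cont)) + [stopper]

print(sc_encode(100, 230, 26))           # [100]      : 1 byte
print(sc_encode(6209, 230, 26))          # [255, 229] : the last rank that fits in 2 bytes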

Optimal (s,c)-dense codes
Find the optimal s, assuming c = 256 - s.

  Brute-force approach

  Binary search:
    on real distributions there seems to be a unique minimum
    (Ks = max codeword length; Fsk = cumulative probability of the symbols whose |cw| <= k)

Experiments: (s,c)-DC is quite interesting…
Search is 6% faster than byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:
  Exploits temporal locality, and it is dynamic
  X = 1^n 2^n 3^n … n^n  ⟹  Huff = O(n^2 log n),  MTF = O(n log n) + n^2
  Not much worse than Huffman...
  ...but it may be far better
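A direct sketch of the two MTF steps above (names are mine; positions are 0-based):

def mtf_encode(text, alphabet):
    L = list(alphabet)                   # the MTF list, front = position 0
    out = []
    for s in text:
        pos = L.index(s)                 # 1) output the position of s in L
        out.append(pos)
        L.pop(pos); L.insert(0, s)       # 2) move s to the front of L
    return out

def mtf_decode(codes, alphabet):
    L = list(alphabet)
    out = []
    for pos in codes:
        s = L[pos]
        out.append(s)
        L.pop(pos); L.insert(0, s)
    return "".join(out)

print(mtf_encode("aaabbbbccc", "abcd"))  # [0, 0, 0, 1, 0, 0, 0, 2, 0, 0]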

MTF: how good is it ?
Encode the integers via γ-coding:  |γ(i)| ≤ 2 * log i + 1
Accounting O(|Σ| log |Σ|) bits for the first occurrence of each symbol
(as if Σ were initially listed in front), the cost of encoding is
(n_x = #occurrences of symbol x, p_i^x = position of its i-th occurrence):

  O(|Σ| log |Σ|) + Σx=1..|Σ| Σi=2..n_x |γ( p_i^x - p_(i-1)^x )|

By Jensen's inequality this is

  ≤ O(|Σ| log |Σ|) + Σx=1..|Σ| n_x * [ 2 * log(N/n_x) + 1 ]
  =  O(|Σ| log |Σ|) + N * [ 2 * H0(X) + 1 ]

Hence  La[mtf] ≤ 2 * H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
  abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings, just the run lengths and one starting bit suffice.
Properties:
  There is a memory
  Exploits spatial locality, and it is a dynamic code
  X = 1^n 2^n 3^n … n^n  ⟹  Huff(X) = Θ(n^2 log n)  >  Rle(X) = Θ(n log n)
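A tiny sketch of RLE as used above (names are mine):

def rle_encode(s):
    runs, i = [], 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1
        runs.append((s[i], j - i))       # (symbol, run length)
        i = j
    return runs

print(rle_encode("abbbaacccca"))         # [('a', 1), ('b', 3), ('a', 2), ('c', 4), ('a', 1)]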

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.  p(a) = .2, p(b) = .5, p(c) = .3
      f(a) = .0, f(b) = .2, f(c) = .7
      a -> [0, .2)    b -> [.2, .7)    c -> [.7, 1.0)

      where  f(i) = Σj=1..i-1 p(j)

The interval for a particular symbol will be called
the symbol interval (e.g. for b it is [.2,.7)).

Arithmetic Coding: Encoding Example
Coding the message sequence: bac

  start:     [0, 1)
  after b:   [.2, .7)      (b's symbol interval)
  after a:   [.2, .3)      (the a-portion of [.2, .7))
  after c:   [.27, .3)     (the c-portion of [.2, .3))

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c1...cn with probabilities p[c],
use the following:

  l0 = 0      li = l(i-1) + s(i-1) * f[ci]
  s0 = 1      si = s(i-1) * p[ci]

f[c] is the cumulative probability up to symbol c (not included).
The final interval size is

  sn = Πi=1..n p[ci]

The interval [ln, ln+sn) for a message sequence will be called the
sequence interval.
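A sketch of this interval computation on the running example p(a)=.2, p(b)=.5, p(c)=.3 (floating point is used only for illustration; a real coder uses the integer version described later):

p = {"a": 0.2, "b": 0.5, "c": 0.3}
f = {"a": 0.0, "b": 0.2, "c": 0.7}        # cumulative probability up to (not including) c

def sequence_interval(msg):
    l, s = 0.0, 1.0                       # l0 = 0, s0 = 1
    for c in msg:
        l = l + s * f[c]                  # li = l(i-1) + s(i-1) * f[ci]
        s = s * p[c]                      # si = s(i-1) * p[ci]
    return l, s

l, s = sequence_interval("bac")
print(l, l + s)                           # ~0.27 ~0.3 : the sequence interval [.27, .3)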

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:

  .49 ∈ [.2, .7)    = b's interval                 -> first symbol b
  .49 ∈ [.3, .55)   = b's portion of [.2, .7)      -> second symbol b
  .49 ∈ [.475, .55) = c's portion of [.3, .55)     -> third symbol c

The message is bbc.

Representing a real number
Binary fractional representations:

  .75   = .11
  1/3   = .010101...
  11/16 = .1011

Algorithm (emit the binary expansion of x in [0,1)):
  1. x = 2*x
  2. If x < 1 output 0
  3. else x = x - 1; output 1

So how about just using the shortest binary fractional
representation lying in the sequence interval?
e.g. [0,.33) -> .01      [.33,.66) -> .1      [.66,1) -> .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.

           min      max      interval
  .11      .110     .111     [.75, 1.0)
  .101     .1010    .1011    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (a dyadic number).

  Sequence interval: [.61, .79)
  Code interval (.101): [.625, .75)  ⊆  [.61, .79)

Can use L + s/2 truncated to 1 + ⌈log (1/s)⌉ bits
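A sketch of this truncation rule in Python, applied to the sequence interval [.27,.3) of the bac example above:

import math

def code_bits(L, s):
    nbits = 1 + math.ceil(math.log2(1.0 / s))   # 1 + ceil(log(1/s)) bits
    x, bits = L + s / 2, []
    for _ in range(nbits):                      # binary fractional expansion of L + s/2
        x *= 2
        if x < 1: bits.append("0")
        else:     bits.append("1"); x -= 1
    return "".join(bits)

print(code_bits(0.27, 0.03))   # 0100100 : code interval [.28125, .2890625) ⊆ [.27, .3)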

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

  1 + ⌈log (1/s)⌉ = 1 + ⌈log Πi (1/pi)⌉
                  ≤ 2 + Σi=1..n log (1/pi)
                  = 2 + Σk=1..|Σ| n pk log (1/pk)
                  = 2 + n H0   bits

In practice it takes nH0 + 0.02 n bits, because of rounding.

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine: fed the current interval (L,s), a symbol c and the current
distribution (p1,....,pS), the ATB produces the next interval (L',s'):

  (L,s)  --- c, (p1,....,pS) --->  ATB  --->  (L',s')

Therefore, even the distribution can change over time.

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.
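A much simplified sketch of this fallback mechanism (this is not any specific PPM variant: here the escape simply gets one extra count per context, and the final order-(-1) fallback is left out):

ESC = "$"

def ppm_events(counts, context, symbol, k):
    # counts[ctx][sym] = how many times sym followed context ctx;
    # returns the (context, symbol, probability) events fed to the arithmetic coder
    events = []
    for l in range(min(k, len(context)), -1, -1):       # contexts of size k, k-1, ..., 0
        ctx = context[len(context) - l:]
        seen = counts.get(ctx, {})
        total = sum(seen.values()) + 1                   # +1 for the escape
        if symbol in seen:
            events.append((ctx, symbol, seen[symbol] / total))
            return events
        events.append((ctx, ESC, 1 / total))             # escape to a shorter context
    events.append(("", symbol, 0.0))                     # a real coder falls back to a uniform model here
    return events

counts = {"th": {"e": 7, "a": 5}, "h": {"e": 9}, "": {"e": 20, "t": 30, "h": 10}}
print(ppm_events(counts, "th", "e", 2))   # [('th', 'e', 0.538...)]: 7/13; without the escape it would be the slide's 7/12
print(ppm_events(counts, "th", "o", 2))   # escapes through 'th', 'h' and the empty context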

PPM + Arithmetic ToolBox

  (L,s)  --- s = c or esc, with probability p[ s | context ] --->  ATB  --->  (L',s')

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant).

PPM: Example Contexts         String = ACCBACCACBA B        k = 2

  Empty context:   A = 4   B = 2   C = 5   $ = 3

  Context A:  C = 3, $ = 1          Context AC:  B = 1, C = 2, $ = 2
  Context B:  A = 2, $ = 1          Context BA:  C = 1, $ = 1
  Context C:  A = 1, B = 2,         Context CA:  C = 1, $ = 1
              C = 2, $ = 3          Context CB:  A = 2, $ = 1
                                    Context CC:  A = 1, B = 1, $ = 2
You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm's step:
  Output the triple (d, len, c), where
    d   = distance of the copied string w.r.t. the current position
    len = length of the longest match
    c   = next char in the text beyond the longest match
  Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
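A compact Python sketch of both directions (window handling is simplified and names are mine); on the string of the window example above it reproduces exactly the five triples shown:

def lz77_encode(text, window=6):
    i, out = 0, []
    while i < len(text):
        best_d, best_len = 0, 0
        for start in range(max(0, i - window), i):       # candidate copy sources in the window
            l = 0
            while i + l < len(text) - 1 and text[start + l] == text[i + l]:
                l += 1                                   # the match may run past the cursor (overlap)
            if l > best_len:
                best_d, best_len = i - start, l
        out.append((best_d, best_len, text[i + best_len]))   # (d, len, next char)
        i += best_len + 1
    return out

def lz77_decode(triples):
    out = []
    for d, l, c in triples:
        for _ in range(l):
            out.append(out[-d])                          # works also when len > d (overlap)
        out.append(c)
    return "".join(out)

s = "aacaacabcabaaac"
code = lz77_encode(s)
print(code)                     # [(0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), (3, 3, 'a'), (1, 2, 'c')]
print(lz77_decode(code) == s)   # True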

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
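A Python sketch of the decoder, including that special case (ids follow the slide's hypothetical convention a = 112, b = 113, c = 114, new entries from 256):

def lzw_decode(ids, initial, next_id=256):
    dic = dict(initial)                   # id -> string
    prev = dic[ids[0]]
    out = [prev]
    for code in ids[1:]:
        if code in dic:
            cur = dic[code]
        else:                             # the SSc case: the id refers to the entry
            cur = prev + prev[0]          # the decoder is just about to create
        dic[next_id] = prev + cur[0]      # add S + first char of the next phrase
        next_id += 1
        out.append(cur)
        prev = cur
    return "".join(out)

ids = [112, 112, 113, 256, 114, 257, 261, 114]            # from the encoding example above
print(lzw_decode(ids, {112: "a", 113: "b", 114: "c"}))    # aabaacababac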

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
(1994)

Consider the text T = mississippi# and all its cyclic rotations:

  mississippi#                      F                L
  ississippi#m                      # mississipp     i
  ssissippi#mi                      i #mississip     p
  sissippi#mis                      i ppi#missis     s
  issippi#miss      sort the        i ssippi#mis     s
  ssippi#missi      rows            i ssissippi#     m
  sippi#missis     =========>       m ississippi     #
  ippi#mississ                      p i#mississi     p
  ppi#mississi                      p pi#mississ     i
  pi#mississip                      s ippi#missi     s
  i#mississipp                      s issippi#mi     s
  #mississippi                      s sippi#miss     i
                                    s sissippi#m     i

  F = first column = #iiiimppssss        L = last column = ipssm#pissii = BWT(T)

(This is the famous toy example; on a real text the matrix is of course much longer.)

A useful tool: the L → F mapping

How do we map L's chars onto F's chars?
... we need to distinguish equal chars in F ...

Take two equal chars of L and rotate their rows rightward by one position:
they become two equal chars of F, and they keep the same relative order !!

The BWT is invertible

Two key properties:
1. The LF-array maps L's chars to F's chars
2. L[ i ] precedes F[ i ] in T

Reconstruct T backward:   T = .... i ppi #

InvertBWT(L)
  Compute LF[0,n-1];
  r = 0; i = n;
  while (i>0) {
    T[i] = L[r];
    r = LF[r]; i--;
  }
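A direct (quadratic) Python sketch of the transform and of this backward reconstruction; fine for toy strings such as mississippi#, not for real inputs:

def bwt(t):
    rot = sorted(t[i:] + t[:i] for i in range(len(t)))    # the sorted rotations
    return "".join(r[-1] for r in rot)                    # last column L

def ibwt(L):
    n = len(L)
    order = sorted(range(n), key=lambda i: (L[i], i))     # L-position of the j-th char of F
    LF = [0] * n
    for j, i in enumerate(order):
        LF[i] = j                                         # F-position of L[i] (stable on ties)
    r, chars = 0, []                                      # row 0 is the rotation starting with '#'
    for _ in range(n):
        chars.append(L[r])                                # L[r] precedes F[r] in T
        r = LF[r]
    t = "".join(reversed(chars))                          # this is '#' followed by T without its sentinel
    return t[1:] + t[0]                                   # rotate the sentinel back to the end

L = bwt("mississippi#")
print(L)                      # ipssm#pissii , as in the slide
print(ibwt(L))                # mississippi#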

How to compute the BWT ?

The rows of the BWT matrix, in sorted order, correspond to the suffix array SA of T:

  SA = 12 11 8 5 2 1 10 9 7 4 6 3
  L  =  i  p s s m #  p i s s i i

We said that L[i] precedes F[i] in T; e.g. L[3] = T[7].
Given SA and T, we have  L[i] = T[ SA[i] - 1 ]

How to construct SA from T ?

  SA = 12 11 8 5 2 1 10 9 7 4 6 3   <->   the sorted suffixes
       #, i#, ippi#, issippi#, ississippi#, mississippi#,
       pi#, ppi#, sippi#, sissippi#, ssippi#, ssissippi#

Elegant but inefficient.        Input: T = mississippi#

Obvious inefficiencies:
  • Θ(n^2 log n) time in the worst case
  • Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion pages are available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?

It is the largest artifact ever conceived by humankind.

Exploit the structure of the Web for:









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)
  V = URLs, E = (u,v) if u has a hyperlink to v
  Isolated URLs are ignored (no IN & no OUT)
Three key properties:
  Skewed distribution: Prob. that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution

Altavista crawl (1999) and WebBase crawl (2001):
the indegree follows a power-law distribution

  Pr[ in-degree(u) = k ]  ∝  1 / k^α ,    α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)
  V = URLs, E = (u,v) if u has a hyperlink to v
  Isolated URLs are ignored (no IN, no OUT)

Three key properties:
  Skewed distribution: Prob. that a node has x links is 1/x^α, α ≈ 2.1
  Locality: usually most of the hyperlinks point to other URLs on
  the same host (about 80%).
  Similarity: pages close in lexicographic order tend to share
  large parts of their outgoing lists.

A Picture of the Web Graph

[adjacency-matrix plot: 21 million pages, 150 million links]

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph                        (a Java and C++ lib, ≈3 bits/edge)

Adjacency list with compressed gaps (exploits locality):
the uncompressed adjacency list of x is replaced by the successor list
  S(x) = {s1-x, s2-s1-1, ..., sk-s(k-1)-1}
For negative entries (only the first one can be negative) the sign is folded
into an even/odd value, as for the residuals below.

Adjacency list with copy-lists (exploits similarity), with reference chains
possibly limited to length W:
  each bit of y's copy-list tells whether the corresponding successor of the
  reference x is also a successor of y;
  the reference index is chosen in [0,W] so as to give the best compression.

Copy-blocks = RLE(Copy-list): the copy-list is turned into run lengths.
  The first copy block is 0 if the copy list starts with 0;
  the last block is omitted (we know the length…);
  the length is decremented by one for all blocks.

Extra-nodes: Compressing Intervals

The successors left out of the copy blocks (the extra nodes) often contain runs
of consecutive integers (consecutivity), so:
  Intervals: encoded with their left extreme and their length
  Interval length: decremented by Lmin = 2
  Residuals: encoded as differences between consecutive residuals, or w.r.t. the source node

Examples (from the figure):
  0    = (15-15)*2       (positive)
  2    = (23-19)-2       (jump >= 2)
  600  = (316-16)*2
  3    = |13-15|*2-1     (negative)
  3018 = 3041-22-1
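A hedged sketch of the basic gap transform alone (a simplification of what WebGraph actually stores; the numbers below just echo the slide's example). The resulting small integers would then be written with a universal code such as γ:

def encode_gaps(x, succ):
    first = succ[0] - x                                    # only the first gap may be negative
    first = 2 * first if first >= 0 else 2 * (-first) - 1  # fold the sign, as in the slide
    return [first] + [succ[i] - succ[i - 1] - 1 for i in range(1, len(succ))]

def decode_gaps(x, gaps):
    g = gaps[0]
    s = x + (g // 2 if g % 2 == 0 else -(g + 1) // 2)
    out = [s]
    for d in gaps[1:]:
        s = s + d + 1
        out.append(s)
    return out

succ = [13, 15, 16, 17, 18, 19, 23, 24, 203]               # hypothetical out-links of node x = 15
print(encode_gaps(15, succ))                               # [3, 1, 0, 0, 0, 0, 3, 0, 178]
print(decode_gaps(15, encode_gaps(15, succ)) == succ)      # True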

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77-scheme provides an efficient, optimal solution:
fknown is the "previously encoded text", and we compress the
concatenation fknown·fnew starting from fnew.

zdelta is one of the best implementations:

             Emacs size   Emacs time
  uncompr    27Mb         ---
  gzip       8Mb          35 secs
  zdelta     1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[figure: the weighted graph GF for a small file collection, with the dummy node
and gzip/zdelta sizes as edge weights; the min branching picks one reference per file]

             space   time
  uncompr    30Mb    ---
  tgz        20%     linear
  THIS       8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly: n^2 edge calculations (zdelta executions)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G'F, thus saving
zdelta executions. Nonetheless, this still takes n^2 time.

             space    time
  uncompr    260Mb    ---
  tgz        12%      2 mins
  THIS       8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
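A much simplified sketch of the block-matching idea (not the real rsync protocol: the weak hash is recomputed rather than rolled, and the hash choices are placeholders):

import hashlib

B = 4                                                      # toy block size

def weak(block):                                           # stand-in for the 4-byte rolling hash
    return sum(block) % (1 << 16)

def strong(block):                                         # stand-in for the MD5 check
    return hashlib.md5(block).hexdigest()

def server_encode(f_new, old_hashes):
    out, i = [], 0
    while i < len(f_new):
        blk = f_new[i:i + B]
        if len(blk) == B and (weak(blk), strong(blk)) in old_hashes:
            out.append(("copy", old_hashes[(weak(blk), strong(blk))]))  # reference to a block of f_old
            i += B
        else:
            out.append(("lit", f_new[i:i + 1]))            # literal byte (gzip'd in practice)
            i += 1
    return out

f_old = b"the quick brown fox jumps over the lazy dog"
f_new = b"the quick brown cat jumps over the lazy dog!"
old_hashes = {(weak(f_old[j:j + B]), strong(f_old[j:j + B])): j // B
              for j in range(0, len(f_old) - B + 1, B)}
print(sum(1 for op, _ in server_encode(f_new, old_hashes) if op == "copy"), "blocks copied")  # 9 of 10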

Rsync: some experiments

            gcc size   emacs size
  total     27288      27326
  gzip      7563       8577
  zdelta    227        1431
  rsync     964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends hashes (unlike the client in rsync), and the client checks them.
The server deploys the common fref to compress the new ftar (rsync just compresses it on its own).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts

Pattern P occurs at position i of T
iff P is a prefix of the i-th suffix of T (i.e. T[i,N]).

Occurrences of P in T = all suffixes of T having P as a prefix.

  P = si,  T = mississippi   ->   occurrences at positions 4, 7

SUF(T) = sorted set of suffixes of T

Reduction: from substring search to prefix search (over SUF(T)).

The Suffix Tree

T# = mississippi#

[figure: the suffix tree of T#, i.e. the compacted trie of all its suffixes;
edges are labelled with substrings of T# (e.g. i, s, p, ssi, ppi#, pi#, i#, #,
mississippi#), internal nodes store string depths, and the 12 leaves store the
starting positions of the corresponding suffixes]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

T = mississippi#

  SA    SUF(T)
  12    #
  11    i#
   8    ippi#
   5    issippi#
   2    ississippi#
   1    mississippi#
  10    pi#
   9    ppi#
   7    sippi#
   4    sissippi#
   6    ssippi#
   3    ssissippi#

Storing SUF(T) explicitly would take Θ(N^2) space; the suffix array keeps only
the suffix pointers:
  • SA: Θ(N log2 N) bits
  • Text T: N chars
  ⟹ in practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison,
with 2 random accesses per step (one to SA, one to the text).

  T = mississippi#,  P = si
  Compare P against the suffix in the middle of the current SA range:
  if P is larger move right, if P is smaller move left.

Suffix Array search
  • O(log2 N) binary-search steps
  • Each step takes O(p) char comparisons
  ⟹ overall, O(p log2 N) time
  (improvable to O(p + log2 N) [Manber-Myers, '90], and further via
   suffix trays [Cole et al, '06])
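A sketch of the plain O(p log N) search (no LCP tricks), on the running example:

def sa_build(t):
    return sorted(range(len(t)), key=lambda i: t[i:])      # toy quadratic construction

def sa_search(t, sa, p):
    lo, hi = 0, len(sa)                                    # leftmost suffix >= p
    while lo < hi:
        mid = (lo + hi) // 2
        if t[sa[mid]:sa[mid] + len(p)] < p: lo = mid + 1
        else: hi = mid
    first = lo
    lo, hi = first, len(sa)                                # leftmost suffix not prefixed by p
    while lo < hi:
        mid = (lo + hi) // 2
        if t[sa[mid]:sa[mid] + len(p)] <= p: lo = mid + 1
        else: hi = mid
    return [sa[i] for i in range(first, lo)]               # starting positions, 0-based

t = "mississippi#"
sa = sa_build(t)
print([x + 1 for x in sa])                                 # 12 11 8 5 2 1 10 9 7 4 6 3 (1-based, as in the slide)
print(sorted(x + 1 for x in sa_search(t, sa, "si")))       # [4, 7]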

Locating the occurrences

  P = si, T = mississippi#: binary search for the SA range of the suffixes
  prefixed by si (conceptually, between si# and si$, where # is smaller and $
  larger than every char of Σ).  Here occ = 2, at positions 4 and 7.

Suffix Array search
  • O(p + log2 N + occ) time

Suffix Trays: O(p + log2 |Σ| + occ)        [Cole et al., '06]
String B-tree                              [Ferragina-Grossi, '95]
Self-adjusting Suffix Arrays               [Ciriani et al., '02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

T = mississippi#

  SA  = 12 11 8 5 2 1 10 9 7 4 6 3
  Lcp =  0  0 1 4 0 0  1 0 2 1 3
  (e.g. the adjacent suffixes issippi# and ississippi# share a prefix of length 4)

• How long is the common prefix between T[i,...] and T[j,...] ?
  It is the min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  Search for some Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.


P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M ( j - 1))  U (T [ j ])
l

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M

l -1

( j - 1))

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1
1 2 3 4 5 6 7 8 910

M1=

T=xabxabaaca
P=
abaad

1

2

3

4

5

6

7

8

9

1
0

1

1

1

1

1

1

1

1

1

1

1

2

0

0

1

0

0

1

0

1

1

0

3

0

0

0

1

0

0

1

0

0

1

4

0

0

0

0

1

0

0

1

0

0

5

0

0

0

0

0

0

0

0

1

0

M0=

1

2

3

4

5

6

7

8

9

1
0

1

0

1

0

0

1

0

1

1

0

1

2

0

0

1

0

0

1

0

0

0

0

3

0

0

0

0

0

0

1

0

0

0

4

0

0

0

0

0

0

0

1

0

0

5

0

0

0

0

0

0

0

0

0

0

How much do we pay?





The running time is O(kn(1+m/w)
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Delection: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
0 000 ...........0 x in b in ary
Length-1


x > 0 and Length = log2 x +1
e.g., 9 represented as <000,1001>.



g-code for x takes 2 log2 x +1 bits

(ie. factor of 2 from optimal)



Optimal for Pr(x) = 1/2x2, and i.i.d integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8

6

3

59

7

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How much good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
1 ≥Si=1,...,x pi ≥ x * px  x ≤ 1/px

How good is it ?
Encode the integers via the g-code:
   |g(i)| ≤ 2·log i + 1
The cost of the encoding is (recall i ≤ 1/pi):

   Σi=1,...,|S| pi · |g(i)|   ≤   Σi=1,...,|S| pi · [ 2·log(1/pi) + 1 ]   =   2·H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2
A new concept: Continuers vs Stoppers
 Previously we used: s = c = 128
The main idea is:
 s + c = 256 (we are playing with 8 bits)
 Thus s items are encoded with 1 byte
 And s·c with 2 bytes, s·c² with 3 bytes, ...

An example
 5000 distinct words
 ETDC encodes 128 + 128² = 16512 words within 2 bytes
 A (230,26)-dense code encodes 230 + 230·26 = 6210 words within 2 bytes, hence more words on 1 byte, and thus it wins if the distribution is skewed...
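A tiny C sketch (illustrative, not the authors' code) that reproduces these counts for any split s + c = 256:

  /* How many words receive 1-byte and 2-byte codewords under an (s,c)-dense code. */
  #include <stdio.h>

  static void report(long s) {
      long c = 256 - s;                       /* stoppers + continuers = 256 (8 bits) */
      printf("(s=%ld, c=%ld): %ld words on 1 byte, %ld more on 2 bytes (total %ld)\n",
             s, c, s, s * c, s + s * c);
  }

  int main(void) {
      report(128);   /* ETDC:         128 + 128*128 = 16512 words within 2 bytes */
      report(230);   /* (230,26)-DC:  230 + 230*26  =  6210 words within 2 bytes */
      return 0;
  }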

Optimal (s,c)-dense codes
Find the optimal s, assuming c = 256 - s.
 Brute-force approach
 Binary search: on real distributions, it seems there is a unique minimum
Ks = max codeword length
Fs,k = cumulative probability of the symbols whose codeword length is ≤ k

Experiments: (s,c)-DC is quite interesting…
Search is 6% faster than byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L
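A minimal C sketch of these two steps over a toy 4-symbol alphabet (the initial list and the input are just illustrative):

  /* Move-to-Front coding (a sketch of the two steps above). */
  #include <stdio.h>

  int main(void) {
      char L[4] = {'a', 'b', 'c', 'd'};                   /* initial symbol list  */
      const char *in = "abcddcbaaa";                      /* toy input            */
      for (int i = 0; in[i]; i++) {
          int pos = 0;
          while (L[pos] != in[i]) pos++;                  /* 1) output position   */
          printf("%d ", pos);
          for (int k = pos; k > 0; k--) L[k] = L[k - 1];  /* 2) move to front     */
          L[0] = in[i];
      }
      printf("\n");                           /* prints: 0 1 2 3 0 1 2 3 0 0      */
      return 0;
  }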

There is a memory
Properties: it exploits temporal locality, and it is dynamic
 X = 1^n 2^n 3^n … n^n   =>   Huff = O(n² log n),  MTF = O(n log n) + n²

Not much worse than Huffman ...but it may be far better

MTF: how good is it ?
Encode the integers via the g-code:
   |g(i)| ≤ 2·log i + 1
Put S in front of the list and consider the cost of encoding:

   O(|S| log |S|)  +  Σx=1,...,|S|  Σi=2,...,nx  | g( pi^x - p(i-1)^x ) |

By Jensen's inequality, this is

   ≤  O(|S| log |S|)  +  Σx=1,...,|S|  nx · [ 2·log(N/nx) + 1 ]
   =  O(|S| log |S|)  +  N·[ 2·H0(X) + 1 ]

Hence   La[mtf]  ≤  2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings => just the run lengths and one bit
Properties:
 It exploits spatial locality, and it is a dynamic code
 There is a memory
 X = 1^n 2^n 3^n … n^n   =>   Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.  p(a) = .2,  p(b) = .5,  p(c) = .3

   f(i) = Σj=1,...,i-1 p(j)      so   f(a) = .0,  f(b) = .2,  f(c) = .7

[figure: the unit interval [0,1) split into the symbol intervals  a = [0,.2),  b = [.2,.7),  c = [.7,1.0)]

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
[figure: successive subdivisions of the unit interval]
   start:               [0, 1.0)
   after b (p = .5):    [.2, .7)
   after a (p = .2):    [.2, .3)
   after c (p = .3):    [.27, .3)

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
   l0 = 0        li = l(i-1) + s(i-1) · f[ci]
   s0 = 1        si = s(i-1) · p[ci]

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

   sn = Πi=1,...,n p[ci]

The interval for a message sequence will be called the
sequence interval
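A small floating-point C sketch of the l/s recurrences above on the running example (real coders use the integer version discussed later; symbols and probabilities are the ones of the slides):

  /* Sequence interval of a message under arithmetic coding (floating-point sketch). */
  #include <stdio.h>
  #include <string.h>

  int main(void) {
      const char *sym = "abc";
      const double p[] = {0.2, 0.5, 0.3};        /* p[c]                             */
      const double f[] = {0.0, 0.2, 0.7};        /* f[c]: cumulative prob. before c  */
      const char *msg = "bac";
      double l = 0.0, s = 1.0;                   /* l0 = 0, s0 = 1                   */
      for (int i = 0; msg[i]; i++) {
          int c = (int)(strchr(sym, msg[i]) - sym);
          l = l + s * f[c];                      /* li = l(i-1) + s(i-1)*f[ci]       */
          s = s * p[c];                          /* si = s(i-1)*p[ci]                */
          printf("after '%c': [%.4f, %.4f)\n", msg[i], l, l + s);
      }
      return 0;                                  /* final interval: [0.2700, 0.3000) */
  }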

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore, specifying any number in the final interval uniquely determines the message.
Decoding is similar to encoding, but at each step we determine the next message symbol and then reduce the interval accordingly.

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
[figure: successive subdivisions of the unit interval]
   .49 lies in [.2, .7)                          ->  b
   within [.2, .7),   .49 lies in [.3, .55)      ->  b
   within [.3, .55),  .49 lies in [.475, .55)    ->  c

The message is bbc.

Representing a real number
Binary fractional representation:
   .75 = .11        1/3 = .010101...        11/16 = .1011

Algorithm
 1.  x = 2·x
 2.  if x < 1 output 0
 3.  else x = x - 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
   e.g.  [0,.33) = .01       [.33,.66) = .1       [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
   number     min         max         interval
   .11        .110...     .111...     [.75, 1.0)
   .101       .1010...    .1011...    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

[figure: the sequence interval [.61, .79) contains the code interval [.625, .75) of the dyadic number .101]

Can use l + s/2 truncated to 1 + ⌈log2 (1/s)⌉ bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most
   1 + ⌈ log2 (1/s) ⌉  =  1 + ⌈ log2 Πi=1,...,n (1/pi) ⌉
   ≤  2 + Σi=1,...,n log2 (1/pi)
   =  2 + Σk=1,...,|S| n·pk · log2 (1/pk)
   =  2 + n·H0   bits

In practice  n·H0 + 0.02·n bits, because of rounding

Integer Arithmetic Coding
Problem: operations on arbitrary-precision real numbers are expensive.
Key ideas of the integer version:
 Keep integers in the range [0, R) where R = 2^k
 Use rounding to generate integer intervals
 Whenever the sequence interval falls into the top, bottom or middle half, expand the interval by a factor of 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
 If l ≥ R/2 (top half):    output 1 followed by m 0s;  m = 0;  the message interval is expanded by 2
 If u < R/2 (bottom half):  output 0 followed by m 1s;  m = 0;  the message interval is expanded by 2
 If l ≥ R/4 and u < 3R/4 (middle half):  increment m;  the message interval is expanded by 2
 In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine

[figure: the ATB as a state machine: given the current interval (L,s), the distribution (p1,...,pS) and the next symbol c, it outputs the new interval (L',s')]

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
[figure: the same ATB state machine, now driven by p[ s | context ], where s = c or esc:  (L,s) -> (L',s')]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts                              String = ACCBACCACBA B        k = 2

  Context Empty        Context (order 1)              Context (order 2)
    A = 4                A :  C = 3   $ = 1             AC :  B = 1   C = 2   $ = 2
    B = 2                B :  A = 2   $ = 1             BA :  C = 1   $ = 1
    C = 5                C :  A = 1   B = 2             CA :  C = 1   $ = 1
    $ = 3                     C = 2   $ = 3             CB :  A = 2   $ = 1
                                                        CC :  A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/
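A small C sketch (illustrative, unrelated to the PPMd code linked above) that recomputes the order-1 counts of this table from the string; the order-0 and order-2 counts are obtained analogously, and the escape count is taken as the number of distinct successors (one common PPM heuristic):

  /* Recompute the order-1 PPM counts for S = ACCBACCACBA (a sketch). */
  #include <stdio.h>

  int main(void) {
      const char *S = "ACCBACCACBA";
      int cnt[3][3] = {{0}};                     /* cnt[prev][next], alphabet A,B,C */
      for (int i = 1; S[i]; i++)
          cnt[S[i - 1] - 'A'][S[i] - 'A']++;
      for (int a = 0; a < 3; a++) {
          int esc = 0;
          printf("context %c :", 'A' + a);
          for (int b = 0; b < 3; b++)
              if (cnt[a][b]) { printf("  %c = %d", 'A' + b, cnt[a][b]); esc++; }
          printf("  $ = %d\n", esc);             /* escape = #distinct successors   */
      }
      return 0;   /* prints   A: C=3 $=1    B: A=2 $=1    C: A=1 B=2 C=2 $=3        */
  }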

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit frequency estimation

LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
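A self-contained C sketch of such a decoder (illustrative), using the (distance, length, next-char) triples of the windowed example above; the copy loop works verbatim even when length > distance:

  /* LZ77 decoding with possibly overlapping copies (sketch). */
  #include <stdio.h>

  typedef struct { int d, len; char c; } Triple;     /* (distance, length, next char) */

  int main(void) {
      Triple code[] = { {0,0,'a'}, {1,1,'c'}, {3,4,'b'}, {3,3,'a'}, {1,2,'c'} };
      char out[64];
      int cursor = 0;
      for (int t = 0; t < 5; t++) {
          for (int i = 0; i < code[t].len; i++)       /* works even if len > d       */
              out[cursor + i] = out[cursor - code[t].d + i];
          cursor += code[t].len;
          out[cursor++] = code[t].c;                  /* append the explicit char    */
      }
      out[cursor] = '\0';
      printf("%s\n", out);                            /* prints: aacaacabcabaaac     */
      return 0;
  }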

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us be given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

   F                        L
   # mississipp             i
   i #mississip             p
   i ppi#missis             s
   i ssippi#mis             s
   i ssissippi#             m
   m ississippi             #
   p i#mississi             p
   p pi#mississ             s
   s ippi#missi             s
   s issippi#mi             s
   s sippi#miss             i
   s sissippi#m             i

L is the last column of the sorted rotation matrix: the Burrows-Wheeler Transform of T  (1994)
A famous example ... much longer...

A useful tool: L -> F mapping

[the same sorted-rotation matrix as above: the first column F, the (unknown) middle part, and the last column L]

How do we map L's chars onto F's chars ?
... We need to distinguish equal chars in F...

Take two equal chars of L and rotate their rows rightward by one position: they keep the same relative order !!

The BWT is invertible
[the same sorted-rotation matrix as above, with its columns F and L]
Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
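A compact C sketch of this inversion (illustrative): LF is built by counting, and the backward walk starts from the row whose L-character is the unique end-marker #; on L = ipssm#pissii it prints mississippi#.

  /* Invert the BWT: build LF by counting, then reconstruct T backward (sketch). */
  #include <stdio.h>
  #include <string.h>

  int main(void) {
      const char *L = "ipssm#pissii";                 /* BWT of mississippi#          */
      int n = strlen(L);
      int count[256] = {0}, first[256] = {0}, seen[256] = {0}, LF[64];
      char T[64];
      for (int i = 0; i < n; i++) count[(unsigned char)L[i]]++;
      for (int c = 0, tot = 0; c < 256; c++) { first[c] = tot; tot += count[c]; }
      for (int i = 0; i < n; i++) {                   /* LF[i]: row of F holding the  */
          int c = (unsigned char)L[i];                /* same char occurrence as L[i] */
          LF[i] = first[c] + seen[c]++;
      }
      int r = (int)(strchr(L, '#') - L);              /* row whose L-char is '#',     */
      for (int i = n; i > 0; i--) {                   /* i.e. T's last character      */
          T[i - 1] = L[r];
          r = LF[r];
      }
      T[n] = '\0';
      printf("%s\n", T);                              /* prints: mississippi#         */
      return 0;
  }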

How to compute the BWT ?
  SA       BWT matrix (first 11 chars of each sorted rotation)       L
  12       #mississipp                                               i
  11       i#mississip                                               p
   8       ippi#missis                                               s
   5       issippi#mis                                               s
   2       ississippi#                                               m
   1       mississippi                                               #
  10       pi#mississi                                               p
   9       ppi#mississ                                               i
   7       sippi#missi                                               s
   4       sissippi#mi                                               s
   6       ssippi#miss                                               i
   3       ssissippi#m                                               i

We said that: L[i] precedes F[i] in T
Given SA and T, we have  L[i] = T[ SA[i] - 1 ]        (e.g.  L[3] = T[ 7 ])

How to construct SA from T ?
  SA
  12     #
  11     i#
   8     ippi#
   5     issippi#
   2     ississippi#
   1     mississippi#
  10     pi#
   9     ppi#
   7     sippi#
   4     sissippi#
   6     ssippi#
   3     ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:
 L is locally homogeneous   =>   L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L
 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…
 Physical network graph
    V = routers,  E = communication links
 The “cosine” graph (undirected, weighted)
    V = static web pages,  E = semantic distance between pages
 Query-Log graph (bipartite, weighted)
    V = queries and URLs,  E = (q,u) if u is a result for q, and has been clicked by some user who issued q
 Social graph (undirected, unweighted)
    V = users,  E = (x,y) if x knows y (facebook, address book, email, ..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution

[figure: in-degree distributions measured on the Altavista crawl (1999) and on the WebBase crawl (2001); the in-degree follows a power-law distribution]

   Pr[ in-degree(u) = k ]  ∝  1 / k^a ,     a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph

[figure: dot-plot of the adjacency matrix (i,j) of a crawl with 21 millions of pages and 150 millions of links; rows and columns are URL-sorted, and the plot is labeled with the hosts Berkeley and Stanford]

URL compression + Delta encoding

The library WebGraph
 Exploiting locality: from the uncompressed adjacency list to an adjacency list with compressed gaps
    Successor list S(x) = { s1 - x,  s2 - s1 - 1,  ...,  sk - s(k-1) - 1 }
    For negative entries:
 Exploiting similarity: copy-lists (reference chains, possibly limited)
    Each bit of y's copy-list tells whether the corresponding successor of the reference x is also a successor of y;
    The reference index is chosen in [0,W] so as to give the best compression.
 Copy-blocks = RLE(Copy-list)
    The first copy block is 0 if the copy list starts with 0;
    The last block is omitted (we know the length…);
    The length is decremented by one for all blocks

This is a Java and C++ lib (≈3 bits/edge)
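A minimal C sketch of the gap encoding of a successor list (locality only; the node and its successors are made-up numbers, and a negative first gap would still need the signed-to-unsigned mapping mentioned above):

  /* Gap-encode the successor list S(x) = { s1-x, s2-s1-1, ..., sk-s(k-1)-1 }. */
  #include <stdio.h>

  int main(void) {
      int x = 15;                                            /* source node            */
      int succ[] = {13, 15, 16, 17, 18, 19, 23, 24, 203};    /* sorted successor list  */
      int k = (int)(sizeof(succ) / sizeof(succ[0]));
      printf("gaps of node %d:", x);
      for (int i = 0; i < k; i++) {
          int gap = (i == 0) ? succ[0] - x : succ[i] - succ[i - 1] - 1;
          printf(" %d", gap);                                /* first gap may be < 0   */
      }
      printf("\n");                            /* prints: -2 1 0 0 0 0 3 0 178        */
      return 0;
  }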

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization
 Delta compression   [diff, zdelta, REBL, …]
    Compress file f deploying file f’
    Compress a group of files
    Speed-up web access by sending differences between the requested page and the ones available in cache
 File synchronization   [rsync, zsync]
    Client updates its old file f_old with f_new available on a server
    Mirroring, Shared Crawling, Content Distribution Networks
 Set reconciliation
    Client updates a structured old file f_old with f_new available on a server
    Update of contacts or appointments, intersection of inverted lists in a P2P search engine

Z-delta compression    (one-to-one)

Problem: We have two files f_known and f_new, and the goal is to compute a file f_d of minimum size such that f_new can be derived from f_known and f_d
 Assume that block moves and copies are allowed
 Find an optimal covering set of f_new based on f_known
 The LZ77-scheme provides an efficient, optimal solution
    f_known is the “previously encoded text”: compress the concatenation f_known·f_new, starting from f_new
 zdelta is one of the best implementations

              Emacs size     Emacs time
  uncompr     27Mb           ---
  gzip        8Mb            35 secs
  zdelta      1.5Mb          42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

[figure: the client-side proxy and the ISP-side proxy keep shared reference pages; the requested page travels over the slow link delta-encoded against the shared reference, while the proxy fetches it from the web over the fast link]

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[figure: a small weighted graph GF with a dummy node 0 connected to all files; edge weights (e.g. 20, 123, 220, 620, 2000) are zdelta sizes, and the min branching picks the cheapest reference for each file]

              space     time
  uncompr     30Mb      ---
  tgz         20%       linear
  THIS        8%        quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
              space     time
  uncompr     260Mb     ---
  tgz         12%       2 mins
  THIS        8%        16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
[figure: the client holds f_old, the server holds f_new; the client sends an update request and the server answers with the information needed to build f_new]

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
[figure: the client sends the block hashes of f_old to the server, which answers with the encoded f_new (block references plus literals)]

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
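The 4-byte weak checksum relies on a rolling hash: the hash of the next block is obtained in O(1) time from the previous one. A C sketch of the idea (a simple polynomial hash, illustrative only, not rsync's actual Adler-style checksum):

  /* Rolling hash over a sliding window of B bytes: O(1) update per shift (sketch). */
  #include <stdio.h>
  #include <string.h>
  #include <stdint.h>

  int main(void) {
      const char *f = "the quick brown fox jumps over the lazy dog";
      const uint64_t BASE = 256, MOD = 1000000007;
      int n = (int)strlen(f), B = 8;                    /* block (window) size          */
      uint64_t h = 0, top = 1;                          /* top = BASE^(B-1) mod MOD     */
      for (int i = 0; i < B - 1; i++) top = (top * BASE) % MOD;
      for (int i = 0; i < B; i++) h = (h * BASE + (unsigned char)f[i]) % MOD;
      printf("hash of block at 0: %llu\n", (unsigned long long)h);
      for (int i = B; i < n; i++) {                     /* slide the window by one byte */
          h = (h + MOD - (top * (unsigned char)f[i - B]) % MOD) % MOD;
          h = (h * BASE + (unsigned char)f[i]) % MOD;
      }
      printf("hash of last block: %llu\n", (unsigned long long)h);
      return 0;
  }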

Rsync: some experiments

              gcc size     emacs size
  total       27288        27326
  gzip        7563         8577
  zdelta      227          1431
  rsync       964          4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends the hashes (unlike the client in rsync), and the client checks them.
The server deploys the common f_ref to compress the new f_tar (rsync just compresses it).

A multi-round protocol
 k blocks of n/k elements
 log(n/k) levels
 If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
 The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

[figure: P aligned with the prefix of the suffix T[i,N]]

Occurrences of P in T = all suffixes of T having P as a prefix
   e.g.  P = si,  T = mississippi   ->   occurrences at positions 4 and 7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[figure: the suffix tree of T# = mississippi# — internal edges carry substring labels such as i, si, ssi, ppi#, mississippi#, and its 12 leaves store the starting positions of the corresponding suffixes]

T# = mississippi#

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

Storing SUF(T) explicitly takes Θ(N²) space; store only the suffix pointers:

   SA       SUF(T)
   12       #
   11       i#
    8       ippi#
    5       issippi#
    2       ississippi#
    1       mississippi#
   10       pi#
    9       ppi#
    7       sippi#
    4       sissippi#
    6       ssippi#
    3       ssissippi#

T = mississippi#          (query: P = si)

Suffix Array space:
 • SA: Θ(N log2 N) bits
 • Text T: N chars
 => In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison (2 accesses per step)

T = mississippi#,   P = si
At each step, compare P against the suffix pointed to by the middle SA entry: if P is larger move to the right half, if P is smaller move to the left half.

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
=> overall, O(p log2 N) time
   (improvable to O(p + log2 N) [Manber-Myers, ’90], and to bounds in terms of |S| [Cole et al, ’06])
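A compact C sketch (illustrative) that builds SA the "elegant but inefficient" way, by sorting suffix pointers, and then binary-searches for P = si; it prints the occurrences at positions 7 and 4 of the example:

  /* Suffix array by sorting suffix pointers, then indirect binary search. */
  #include <stdio.h>
  #include <string.h>
  #include <stdlib.h>

  static const char *T;                                 /* text shared with comparator */
  static int cmp(const void *a, const void *b) {
      return strcmp(T + *(const int *)a, T + *(const int *)b);
  }

  int main(void) {
      T = "mississippi#";
      const char *P = "si";
      int n = (int)strlen(T), p = (int)strlen(P), SA[64];
      for (int i = 0; i < n; i++) SA[i] = i;
      qsort(SA, n, sizeof(int), cmp);                   /* Theta(n^2 log n) worst case */
      int lo = 0, hi = n;                               /* leftmost suffix >= P        */
      while (lo < hi) {
          int mid = (lo + hi) / 2;
          if (strncmp(T + SA[mid], P, p) < 0) lo = mid + 1; else hi = mid;
      }
      while (lo < n && strncmp(T + SA[lo], P, p) == 0)  /* scan the contiguous range   */
          printf("occurrence at position %d\n", SA[lo++] + 1);   /* prints 7 then 4    */
      return 0;
  }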

Locating the occurrences
All the occurrences of P = si lie in a contiguous range of SA: delimit it by binary-searching also for the upper bound (conceptually, for si$ where $ is larger than any character of S, and # is smaller). Here occ = 2, at positions 4 and 7 of T = mississippi#.

Suffix Array search:  O(p + log2 N + occ) time
Suffix Trays:  O(p + log2 |S| + occ)   [Cole et al., ‘06]
String B-tree   [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays   [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

T = mississippi#

   i     SA     suffix              Lcp[i] (with the next suffix)
   1     12     #                   0
   2     11     i#                  1
   3      8     ippi#               1
   4      5     issippi#            4
   5      2     ississippi#         0
   6      1     mississippi#        0
   7     10     pi#                 1
   8      9     ppi#                0
   9      7     sippi#              2
  10      4     sissippi#           1
  11      6     ssippi#             3
  12      3     ssissippi#          -

(e.g. Lcp = 4 between the adjacent suffixes issippi# and ississippi#)

• How long is the common prefix between T[i,...] and T[j,...] ?
   Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Is there a repeated substring of length ≥ L ?
   Search for some Lcp[i] ≥ L.
• Is there a substring of length ≥ L occurring ≥ C times ?
   Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.

Slide 130

In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M ( j - 1))  U (T [ j ])
l

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M

l -1

( j - 1))

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1
1 2 3 4 5 6 7 8 910

M1=

T=xabxabaaca
P=
abaad

1

2

3

4

5

6

7

8

9

1
0

1

1

1

1

1

1

1

1

1

1

1

2

0

0

1

0

0

1

0

1

1

0

3

0

0

0

1

0

0

1

0

0

1

4

0

0

0

0

1

0

0

1

0

0

5

0

0

0

0

0

0

0

0

1

0

M0=

1

2

3

4

5

6

7

8

9

1
0

1

0

1

0

0

1

0

1

1

0

1

2

0

0

1

0

0

1

0

0

0

0

3

0

0

0

0

0

0

1

0

0

0

4

0

0

0

0

0

0

0

1

0

0

5

0

0

0

0

0

0

0

0

0

0

How much do we pay?





The running time is O(kn(1+m/w)
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Delection: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
0 000 ...........0 x in b in ary
Length-1


x > 0 and Length = log2 x +1
e.g., 9 represented as <000,1001>.



g-code for x takes 2 log2 x +1 bits

(ie. factor of 2 from optimal)



Optimal for Pr(x) = 1/2x2, and i.i.d integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8

6

3

59

7

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How much good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
1 ≥Si=1,...,x pi ≥ x * px  x ≤ 1/px

How good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):



p i  | g ( i ) |

i  1 ,..., S

This is:



i  1 ,.., S

p i  [ 2 * log

1

 1]

pi

 2 * H 0 ( X ) 1
No much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2

A new concept: Continuers vs Stoppers

The main idea is:
  Previously we used: s = c = 128
  s + c = 256 (we are playing with 8 bits)
  Thus s items are encoded with 1 byte
  And s*c with 2 bytes, s*c^2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 128^2 = 16512 words on at most 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 words on at most 2
bytes, hence more words on 1 byte, and thus it wins if skewed...
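
A minimal sketch (illustrative, not the slides' code) of how a 0-based rank r is mapped to bytes in an (s,c)-dense code:

# Sketch of (s,c)-dense coding of a rank r (most frequent word first): the
# last byte is a "stopper" in [0, s), earlier bytes are "continuers" in
# [s, s+c). With s + c = 256: s words on 1 byte, s*c more on 2 bytes, ...
def sc_encode(r, s, c):
    out = [r % s]                   # stopper byte
    r //= s
    while r > 0:
        r -= 1
        out.append(s + (r % c))     # continuer bytes
        r //= c
    return bytes(reversed(out))

s, c = 230, 26                          # the (230,26)-dense code of the example
for r in (229, 230, 6209, 6210):
    print(r, len(sc_encode(r, s, c)))   # 229->1 byte, 230->2, 6209->2, 6210->3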

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 256 - s.


Brute-force approach



Binary search:


On real distributions, it seems there is a unique minimum

Ks = max codeword length
Fs,k = cumulative probability of the symbols whose codeword length ≤ k

Experiments: (s,c)-DC is quite interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:
  It exploits temporal locality, and it is dynamic
  X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n^2 log n) bits, MTF = O(n log n) + n^2 bits
  Not much worse than Huffman
  ...but it may be far better
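
A small sketch (assumed, not from the slides) of the MTF coder just described; positions are 0-based, as in the Bzip example later on:

# Sketch of Move-to-Front coding: output the current (0-based) position of
# each symbol in the list L, then move that symbol to the front of L.
def mtf_encode(text, alphabet):
    L, out = list(alphabet), []
    for ch in text:
        i = L.index(ch)
        out.append(i)
        L.pop(i); L.insert(0, ch)      # move to front
    return out

def mtf_decode(codes, alphabet):
    L, out = list(alphabet), []
    for i in codes:
        ch = L.pop(i)
        out.append(ch); L.insert(0, ch)
    return "".join(out)

codes = mtf_encode("abbbaabbb", "abcd")
print(codes)                           # [0, 1, 0, 0, 1, 0, 1, 0, 0]
print(mtf_decode(codes, "abcd"))       # abbbaabbb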

MTF: how good is it ?
Encode the integers via g-coding: |g(i)| ≤ 2 * log i + 1
Put S in front of the list and consider the cost of encoding:

   O(|S| log |S|)  +  Σx=1,...,|S|  Σi=2,...,nx  | g( p_i^x - p_(i-1)^x ) |

where p_i^x is the position of the i-th occurrence of symbol x, and nx is its
number of occurrences.  By Jensen’s inequality:

   ≤  O(|S| log |S|)  +  Σx=1,...,|S|  nx * [ 2 * log (N/nx) + 1 ]

   =  O(|S| log |S|)  +  N * [ 2 * H0(X) + 1 ]

Hence   La[mtf]  ≤  2 * H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to maintain the MTF-list efficiently:

Search tree
  Leaves contain the symbols, ordered as in the MTF-list
  Nodes contain the size of their descending subtree

Hash Table
  key is a symbol
  data is a pointer to the corresponding tree leaf

Each tree operation takes O(log |S|) time
Total is O(n log |S|), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings ⇒ just the run lengths and one bit
Properties:
  It exploits spatial locality, and it is a dynamic code (there is a memory)
  X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = Θ(n^2 log n)  >  Rle(X) = Θ(n log n)
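
A tiny sketch (assumed) of the run-length encoding in the example above:

# Sketch of run-length encoding.
def rle_encode(s):
    runs, i = [], 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1
        runs.append((s[i], j - i))     # (symbol, run length)
        i = j
    return runs

print(rle_encode("abbbaacccca"))   # [('a',1), ('b',3), ('a',2), ('c',4), ('a',1)]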

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive).

   f(i) = Σj=1,...,i-1 p(j)

e.g.  p(a) = .2, p(b) = .5, p(c) = .3   ⇒   f(a) = .0, f(b) = .2, f(c) = .7

[Figure: the unit interval split into a = [0,.2), b = [.2,.7), c = [.7,1.0)]

The interval for a particular symbol will be called
the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac

[Figure: nested intervals. Start from [0,1); symbol b narrows it to [.2,.7);
then a narrows it to [.2,.3); then c narrows it to [.27,.3).]

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c1...cn with probabilities p[c] use the following:

   l0 = 0      li = l(i-1) + s(i-1) * f[ci]
   s0 = 1      si = s(i-1) * p[ci]

f[c] is the cumulative prob. up to symbol c (not included)

Final interval size is    sn = Πi=1,...,n p[ci]

The interval for a message sequence will be called the
sequence interval
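
A tiny sketch (assumed) of the recurrence above, on the slides' example probabilities:

# Sketch of the sequence-interval recurrence: l_i = l_(i-1) + s_(i-1)*f[c_i],
# s_i = s_(i-1)*p[c_i], with p(a)=.2, p(b)=.5, p(c)=.3.
p = {"a": 0.2, "b": 0.5, "c": 0.3}
f = {"a": 0.0, "b": 0.2, "c": 0.7}        # cumulative prob. up to c (excluded)

def sequence_interval(msg):
    l, s = 0.0, 1.0
    for c in msg:
        l, s = l + s * f[c], s * p[c]
    return l, s

l, s = sequence_interval("bac")
print(l, l + s)   # ~0.27 ~0.3 : the sequence interval [.27,.3), up to rounding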

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore, specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:

[Figure: .49 falls into b = [.2,.7); within it, into b = [.3,.55); within
that, into c = [.475,.55).]

The message is bbc.

Representing a real number
Binary fractional representation:

   .75   = .11
   1/3   = .010101...
   11/16 = .1011

Algorithm (to emit the bits of x in [0,1)):
  1.  x = 2*x
  2.  If x < 1 output 0
  3.  else x = x - 1; output 1

So how about just using the shortest binary fractional
representation in the sequence interval.
e.g.  [0,.33) = .01     [.33,.66) = .1     [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions.

          min       max       interval
   .11    .110...   .111...   [.75, 1.0)
   .101   .1010...  .1011...  [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number (a dyadic number)
whose code interval is contained in the sequence interval.

[Figure: sequence interval [.61,.79); the code interval of .101 is
[.625,.75), which is contained in it.]

Can use L + s/2 truncated to 1 + ⌈log (1/s)⌉ bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

   1 + ⌈ log (1/s) ⌉
   = 1 + ⌈ log Πi=1,...,n (1/pi) ⌉
   ≤ 2 + Σi=1,...,n log (1/pi)
   = 2 + Σk=1,...,|S| n*pk * log (1/pk)
   = 2 + n H0   bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
The problem is that operations on arbitrary-precision real numbers are expensive.
Key ideas of the integer version:
  Keep integers in range [0..R) where R = 2^k
  Use rounding to generate integer intervals
  Whenever the sequence interval falls into the top, bottom or middle half,
  expand the interval by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 (top half):
  Output 1 followed by m 0s; set m = 0; the interval is expanded by 2

If u < R/2 (bottom half):
  Output 0 followed by m 1s; set m = 0; the interval is expanded by 2

If l ≥ R/4 and u < 3R/4 (middle half):
  Increment m; the interval is expanded by 2

All other cases: just continue...

You find this at

Arithmetic ToolBox
As a state machine:

[Figure: the ATB takes the current interval (L,s), the next symbol c and its
distribution (p1,...,pS), and returns the new interval (L',s') inside [L, L+s).]
Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox

[Figure: the ATB is fed with p[ s | context ], where s is either a char c or
the escape symbol esc; it maps the interval (L,s) to (L',s').]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts        String = ACCBACCACBA B        k = 2

Context: Empty      Counts:  A = 4   B = 2   C = 5   $ = 3

Context: A          Counts:  C = 3   $ = 1
Context: B          Counts:  A = 2   $ = 1
Context: C          Counts:  A = 1   B = 2   C = 2   $ = 3

Context: AC         Counts:  B = 1   C = 2   $ = 2
Context: BA         Counts:  C = 1   $ = 1
Context: CA         Counts:  C = 1   $ = 1
Context: CB         Counts:  A = 2   $ = 1
Context: CC         Counts:  A = 1   B = 1   $ = 2
You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!

LZ77
T = a a c a a c a b c a b a b a c

[Figure: the Dictionary is the already-scanned part of T (all substrings
starting there); the Cursor marks the current position; a step outputs a
triple such as <2,3,c>.]

Algorithm’s step:
  Output <d, len, c> where
    d   = distance of copied string wrt current position
    len = length of longest match
    c   = next char in text beyond longest match
  Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
[Legend: at each step, the longest match within W is shown, followed by the next character.]
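
A small Python sketch (assumed, not the slides' code) of the windowed parsing above; it reproduces the five triples of the example:

# Sketch of one-pass LZ77 parsing with a sliding window of size W, emitting
# <distance, length, next char> triples.
def lz77_encode(T, W=6):
    out, i, n = [], 0, len(T)
    while i < n:
        best_d, best_len = 0, 0
        for j in range(max(0, i - W), i):          # candidate copy positions
            l = 0
            while i + l < n - 1 and T[j + l] == T[i + l]:
                l += 1                             # overlap past the cursor is allowed
            if l > best_len:
                best_d, best_len = i - j, l
        out.append((best_d, best_len, T[i + best_len]))
        i += best_len + 1
    return out

print(lz77_encode("aacaacabcabaaac"))
# [(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')] as in the slides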

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
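
A compact sketch (assumed) of LZW encoding, using the slides' toy initial dictionary a=112, b=113, c=114; it reproduces the codes of the encoding example that follows (plus a final code that flushes the last phrase):

# Sketch of LZW encoding: emit the dictionary id of the longest match S,
# then add S + next char to the dictionary (no extra char is emitted).
def lzw_encode(T):
    D = {"a": 112, "b": 113, "c": 114}     # toy initial dictionary, as in the slides
    next_id = 256
    out, S = [], T[0]
    for ch in T[1:]:
        if S + ch in D:
            S += ch                        # extend the current match
        else:
            out.append(D[S])
            D[S + ch] = next_id            # add Sc to the dictionary
            next_id += 1
            S = ch
    out.append(D[S])                       # flush the last phrase
    return out

print(lzw_encode("aabaacababacb"))
# [112, 112, 113, 256, 114, 257, 261, 114, 113]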

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
We are given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows                                   (1994)

F                  L
#  mississipp   i
i  #mississip   p
i  ppi#missis   s
i  ssippi#mis   s
i  ssissippi#   m
m  ississippi   #
p  i#mississi   p
p  pi#mississ   i
s  ippi#missi   s
s  issippi#mi   s
s  sippi#miss   i
s  sissippi#m   i

(the 6th row is T itself)

A famous example

Much
longer...

A useful tool: the L → F mapping

[Same sorted matrix as above: F = # i i i i m p p s s s s (known),
the middle part is unknown, L = i p s s m # p i s s i i]

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
[Same matrix: F = # i i i i m p p s s s s (known), middle unknown,
L = i p s s m # p i s s i i]

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
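
A runnable sketch (assumed, with slightly different bookkeeping from the pseudocode above: the backward walk starts from the row whose last character is the terminator '#'):

# Sketch of the transform and of backward reconstruction via the LF-mapping.
def bwt(T):
    rots = sorted(T[i:] + T[:i] for i in range(len(T)))
    return "".join(row[-1] for row in rots)

def inverse_bwt(L, end="#"):
    # LF[r] = row of the sorted matrix starting with the char L[r]
    #       = stable-sort position of the pair (L[r], r)
    order = sorted(range(len(L)), key=lambda r: (L[r], r))
    LF = [0] * len(L)
    for f_row, r in enumerate(order):
        LF[r] = f_row
    r, out = L.index(end), []       # the row ending with '#' is T itself
    for _ in range(len(L)):
        out.append(L[r])            # L[r] precedes F[r] in T: walk backwards
        r = LF[r]
    return "".join(reversed(out))

L = bwt("mississippi#")
print(L)                # ipssm#pissii
print(inverse_bwt(L))   # mississippi#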

How to compute the BWT ?

SA     BWT matrix      L
12     #mississipp     i
11     i#mississip     p
 8     ippi#missis     s
 5     issippi#mis     s
 2     ississippi#     m
 1     mississippi     #
10     pi#mississi     p
 9     ppi#mississ     i
 7     sippi#missi     s
 4     sissippi#mi     s
 6     ssippi#miss     i
 3     ssissippi#m     i

We said that: L[i] precedes F[i] in T
e.g.  L[3] = T[7]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?

SA
12   #
11   i#
 8   ippi#
 5   issippi#
 2   ississippi#
 1   mississippi#
10   pi#
 9   ppi#
 7   sippi#
 4   sissippi#
 6   ssippi#
 3   ssissippi#

Input: T = mississippi#

Elegant but inefficient. Obvious inefficiencies:
• Θ(n^2 log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion pages available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)
  Set of nodes such that from any node one can reach any other node via an
  undirected path.

Strongly connected components (SCC)
  Set of nodes such that from any node one can reach any other node via a
  directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by humans

Exploit the structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…

Physical network graph
  V = routers
  E = communication links

The “cosine” graph (undirected, weighted)
  V = static web pages
  E = semantic distance between pages

Query-Log graph (bipartite, weighted)
  V = queries and URLs
  E = (q,u) if u is a result for q, and has been clicked by some user who issued q

Social graph (undirected, unweighted)
  V = users
  E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution

Altavista crawl (1999) and WebBase crawl (2001):
the indegree follows a power law distribution

   Pr[ in-degree(u) = k ]  ∝  1 / k^a ,    a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/x^a, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph

[Figure: adjacency-matrix plot (i vs j) of a crawl with 21 million pages and
150 million links]

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = { s1 - x, s2 - s1 - 1, ..., sk - s(k-1) - 1 }

For negative entries: fold the sign (v ≥ 0 → 2v,  v < 0 → 2|v| - 1)

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y’s copy-list tells whether the corresponding successor of the
reference x is also a successor of y;
The reference index is chosen in [0,W] so as to give the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with copy blocks.

Consecutivity (runs of length ≥ 3) in extra-nodes:
  Intervals: use their left extreme and length
  Interval length: decremented by Lmin = 2
  Residuals: differences between residuals, or wrt the source

  0    = (15-15)*2       (positive)
  2    = (23-19)-2       (jump >= 2)
  600  = (316-16)*2
  3    = |13-15|*2-1     (negative)
  3018 = 3041-22-1
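
A toy sketch (assumed; the node and successor values below are illustrative, not those of the figure) of the gap encoding with sign folding used for successor lists:

# Sketch of the successor-list gap encoding above:
# S(x) = { s1 - x, s2 - s1 - 1, ..., sk - s(k-1) - 1 }, where the first gap
# may be negative and is folded as  v >= 0 -> 2v,  v < 0 -> 2|v| - 1.
def fold(v):
    return 2 * v if v >= 0 else 2 * abs(v) - 1

def encode_successors(x, succ):
    succ = sorted(succ)
    gaps = [fold(succ[0] - x)]                                # first gap
    gaps += [succ[i] - succ[i - 1] - 1 for i in range(1, len(succ))]
    return gaps

# toy node x = 15 with successors {13, 15, 16, 17, 18, 19, 23, 24, 203}
print(encode_successors(15, [13, 15, 16, 17, 18, 19, 23, 24, 203]))
# -> [3, 1, 0, 0, 0, 0, 3, 0, 178]   (3 = |13-15|*2-1, as in the fold rule)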

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



What if the sender has never seen the data at the receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides an efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
            Emacs size    Emacs time
uncompr     27Mb          ---
gzip        8Mb           35 secs
zdelta      1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

[Figure: Client ↔ (slow link, delta-encoded pages) ↔ Proxy ↔ (fast link) ↔ Web;
both Client and Proxy keep a reference (cached) page, and only the request and
the delta travel over the slow link.]

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: example weighted graph GF with a dummy node 0; edge weights are
zdelta/gzip sizes such as 620, 2000, 220, 123, 20, ...]

            space    time
uncompr     30Mb     ---
tgz         20%      linear
THIS        8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n^2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strictly n^2 time

            space    time
uncompr     260Mb    ---
tgz         12%      2 mins
THIS        8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
[Figure: the Client (holding f_old) sends block hashes to the Server (holding
f_new); the Server replies with the encoded file: copy instructions for
matching blocks plus literal bytes.]
The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
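
The "4-byte rolling hash" above can be illustrated with a minimal polynomial rolling hash (an assumed illustration, not rsync's actual checksum):

# Sketch of a rolling hash over blocks of size B: sliding the window by one
# position costs O(1).
MOD, BASE = (1 << 32) - 5, 256

def block_hash(block):
    h = 0
    for byte in block:
        h = (h * BASE + byte) % MOD
    return h

def roll(h, out_byte, in_byte, power):       # power = BASE**(B-1) % MOD
    return ((h - out_byte * power) * BASE + in_byte) % MOD

data, B = b"the quick brown fox", 4
power = pow(BASE, B - 1, MOD)
h = block_hash(data[:B])
for i in range(1, len(data) - B + 1):
    h = roll(h, data[i - 1], data[i + B - 1], power)
    assert h == block_hash(data[i:i + B])    # same value as hashing from scratch
print("rolling hash ok")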

Rsync: some experiments

            gcc size    emacs size
total       27288       27326
gzip        7563        8577
zdelta      227         1431
rsync       964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), the client checks them.
Server deploys the common fref to compress the new ftar (rsync just compresses it).

A multi-round protocol

k blocks of n/k elems

log(n/k) levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree

T# = mississippi#
     positions 1 2 3 4 5 6 7 8 9 10 11 12

[Figure: the suffix tree of T#. Edges carry substrings of T# (e.g. “i”, “si”,
“ssi”, “ppi#”, “mississippi#”, ...); its 12 leaves store the starting
positions 1..12 of the suffixes of T#.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.

T = mississippi#          (storing SUF(T) explicitly takes Θ(N^2) space)

SA      SUF(T)
12      #
11      i#
 8      ippi#
 5      issippi#
 2      ississippi#
 1      mississippi#
10      pi#
 9      ppi#
 7      sippi#
 4      sissippi#
 6      ssippi#
 3      ssissippi#

[Figure: P = si points, via suffix pointers, to the contiguous range of
suffixes prefixed by si: sippi# and sissippi#.]

Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
⇒ In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix cmp

T = mississippi#       P = si

[Figure: two steps of the binary search over SA = 12 11 8 5 2 1 10 9 7 4 6 3;
comparing P with the suffix pointed to by the middle entry tells whether P is
larger or smaller; 2 accesses per step.]

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
⇒ overall, O(p log2 N) time

Improvable to O(p + log2 N)    [Manber-Myers, ’90]
and to O(p + log2 |S|)         [Cole et al, ’06]
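
A compact sketch (assumed) of the indirect binary search above, on the running example:

# Sketch of the O(p log N) search: binary search on SA comparing P against the
# first |P| chars of the middle suffix.
def sa_range(T, SA, P):
    lo, hi = 0, len(SA)
    while lo < hi:                                  # leftmost suffix >= P
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + len(P)] < P: lo = mid + 1
        else: hi = mid
    first = lo
    lo, hi = first, len(SA)
    while lo < hi:                                  # leftmost suffix whose prefix > P
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + len(P)] <= P: lo = mid + 1
        else: hi = mid
    return first, lo                                # SA[first:lo] = occurrences

T = "mississippi#"
SA = sorted(range(len(T)), key=lambda i: T[i:])     # naive Θ(N^2 log N) build
l, r = sa_range(T, SA, "si")
print([SA[i] + 1 for i in range(l, r)])             # [7, 4]: 1-based starts of "si"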

Locating the occurrences

T = mississippi#        P = si        occ = 2

[Figure: the binary search delimits the range of SA whose suffixes are
prefixed by P (here SA entries 7 = sippi# and 4 = sissippi#), e.g. by
searching for si# and si$ with # < S < $.]

Suffix Array search
• O(p + log2 N + occ) time

Suffix Trays: O(p + log2 |S| + occ)      [Cole et al., ’06]
String B-tree                            [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays             [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

T = mississippi#

SA  = 12 11 8 5 2 1 10 9 7 4 6 3
Lcp =  0  1 1 4 0 0  1 0 2 1 3       (Lcp[i] refers to suffixes SA[i] and SA[i+1])

e.g. Lcp[4] = 4 is the lcp of issippi# (SA[4]=5) and ississippi# (SA[5]=2)

• How long is the common prefix between T[i,...] and T[j,...] ?
  • Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  • Search for some Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  • Search for a window Lcp[i,i+C-2] whose entries are all ≥ L
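
A short sketch (assumed) that computes Lcp naively and answers the last two queries above on the running example:

# Sketch: naive Lcp from SA, plus the window test for "a substring of length
# >= L occurring >= C times" described above.
def lcp_array(T, SA):
    def lcp(a, b):
        l = 0
        while a + l < len(T) and b + l < len(T) and T[a + l] == T[b + l]:
            l += 1
        return l
    return [lcp(SA[i], SA[i + 1]) for i in range(len(SA) - 1)]

def has_repeat(Lcp, L, C=2):
    w = C - 1                      # C occurrences <=> C-1 adjacent Lcp entries >= L
    return any(min(Lcp[i:i + w]) >= L for i in range(len(Lcp) - w + 1))

T = "mississippi#"
SA = sorted(range(len(T)), key=lambda i: T[i:])
Lcp = lcp_array(T, SA)
print(Lcp)                         # [0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3]
print(has_repeat(Lcp, 4))          # True:  "issi" occurs twice
print(has_repeat(Lcp, 3, 3))       # False: no length-3 substring occurs 3 times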


Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1
1 2 3 4 5 6 7 8 910

M1=

T=xabxabaaca
P=
abaad

1

2

3

4

5

6

7

8

9

1
0

1

1

1

1

1

1

1

1

1

1

1

2

0

0

1

0

0

1

0

1

1

0

3

0

0

0

1

0

0

1

0

0

1

4

0

0

0

0

1

0

0

1

0

0

5

0

0

0

0

0

0

0

0

1

0

M0=

1

2

3

4

5

6

7

8

9

1
0

1

0

1

0

0

1

0

1

1

0

1

2

0

0

1

0

0

1

0

0

0

0

3

0

0

0

0

0

0

1

0

0

0

4

0

0

0

0

0

0

0

1

0

0

5

0

0

0

0

0

0

0

0

0

0

How much do we pay?





The running time is O(kn(1+m/w)
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Delection: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
0 000 ...........0 x in b in ary
Length-1


x > 0 and Length = log2 x +1
e.g., 9 represented as <000,1001>.



g-code for x takes 2 log2 x +1 bits

(ie. factor of 2 from optimal)



Optimal for Pr(x) = 1/2x2, and i.i.d integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8

6

3

59

7

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How much good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
1 ≥Si=1,...,x pi ≥ x * px  x ≤ 1/px

How good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):



p i  | g ( i ) |

i  1 ,..., S

This is:



i  1 ,.., S

p i  [ 2 * log

1

 1]

pi

 2 * H 0 ( X ) 1
No much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?

Move-to-Front (MTF):
 As a freq-sorting approximator
 As a caching strategy
 As a compressor

Run-Length-Encoding (RLE):
 FAX compression

Move to Front Coding
Transforms a char sequence into an integer sequence, that can then be var-length coded.
 Start with the list of symbols L=[a,b,c,d,…]
 For each input symbol s
  1) output the position of s in L
  2) move s to the front of L
There is a memory.
Properties:
 Exploits temporal locality, and it is dynamic
 X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n² log n) bits, MTF = O(n log n) + n² bits
Not much worse than Huffman ...but it may be far better
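A minimal MTF encoder/decoder following the two steps above (positions here are 0-based; the helper names are ours):

def mtf_encode(text: str, alphabet: list[str]) -> list[int]:
    L = list(alphabet)
    out = []
    for ch in text:
        pos = L.index(ch)        # 1) output the position of ch in L
        out.append(pos)
        L.insert(0, L.pop(pos))  # 2) move ch to the front of L
    return out

def mtf_decode(codes: list[int], alphabet: list[str]) -> str:
    L = list(alphabet)
    out = []
    for pos in codes:
        ch = L[pos]
        out.append(ch)
        L.insert(0, L.pop(pos))
    return "".join(out)

codes = mtf_encode("abbbaa", ["a", "b", "c", "d"])
assert codes == [0, 1, 0, 0, 1, 0]
assert mtf_decode(codes, ["a", "b", "c", "d"]) == "abbbaa"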

MTF: how good is it ?
Encode the integers via γ-coding:  |γ(i)| ≤ 2·log i + 1
Let n_x be the number of occurrences of symbol x, at positions p_1^x < p_2^x < ... ; put the alphabet S in front and consider the cost of encoding:

 O(|S| log |S|)  +  Σ_{x=1,...,|S|}  Σ_{i=2,...,n_x}  |γ( p_i^x − p_{i−1}^x )|

By Jensen’s inequality:

 ≤  O(|S| log |S|)  +  Σ_{x=1,...,|S|}  n_x · [ 2·log (N/n_x) + 1 ]
 =  O(|S| log |S|)  +  N · [ 2·H0(X) + 1 ]

Hence  L_a[mtf] ≤ 2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and separators) as the symbols to be encoded.
How to keep the MTF-list efficiently:
 Search tree
  Leaves contain the symbols, ordered as in the MTF-list
  Nodes contain the size of their descending subtree
 Hash Table
  key is a symbol
  data is a pointer to the corresponding tree leaf
Each tree operation takes O(log |S|) time
Total is O(n log |S|), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings ⇒ just the run lengths and one bit
Properties:
 There is a memory
 Exploits spatial locality, and it is a dynamic code
 X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as an option), Bzip.
More time costly than Huffman, but the integer implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive), according to the cumulative probabilities

 f(i) = Σ_{j=1,...,i−1} p(j)

e.g., with p(a) = .2, p(b) = .5, p(c) = .3:
 f(a) = .0, f(b) = .2, f(c) = .7
 so  a = [0,.2),  b = [.2,.7),  c = [.7,1.0)

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
 start with [0, 1)
 after b: [.2, .7)
 after a: [.2, .3)
 after c: [.27, .3)
The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c_1 ... c_n with probabilities p[c], use the following recurrences:

 l_0 = 0     l_i = l_{i−1} + s_{i−1} · f[c_i]
 s_0 = 1     s_i = s_{i−1} · p[c_i]

f[c] is the cumulative prob. up to symbol c (not included).
The final interval size is

 s_n = Π_{i=1,...,n} p[c_i]

The interval for a message sequence will be called the sequence interval.
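A toy sketch of the recurrences above on plain floats (fine for short messages; a real coder uses the integer/scaling version discussed later). The probabilities reproduce the running example; names are ours.

P = {"a": 0.2, "b": 0.5, "c": 0.3}                 # p[c]
F = {"a": 0.0, "b": 0.2, "c": 0.7}                 # f[c] = cumulative prob., c excluded

def sequence_interval(msg: str):
    l, s = 0.0, 1.0                                # l_0 = 0, s_0 = 1
    for ch in msg:
        l = l + s * F[ch]                          # l_i = l_{i-1} + s_{i-1} * f[c_i]
        s = s * P[ch]                              # s_i = s_{i-1} * p[c_i]
    return l, s

def decode(x: float, n: int) -> str:
    out = []
    for _ in range(n):                             # n = known message length
        for ch in "abc":
            lo, hi = F[ch], F[ch] + P[ch]
            if lo <= x < hi:                       # find the symbol interval containing x
                out.append(ch)
                x = (x - lo) / P[ch]               # rescale and continue
                break
    return "".join(out)

l, s = sequence_interval("bac")
assert abs(l - 0.27) < 1e-9 and abs(s - 0.03) < 1e-9   # final interval [.27, .3)
assert decode(0.49, 3) == "bbc"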

Uniquely defining an interval
Important property: the intervals for distinct messages of length n will never overlap.
Therefore, specifying any number in the final interval uniquely determines the msg.
Decoding is similar to encoding, but at each step we need to determine what the message value is and then reduce the interval.

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:
 .49 ∈ [.2,.7), the interval of b  ⇒  1st symbol is b
 within [.2,.7): a = [.2,.3), b = [.3,.55), c = [.55,.7);  .49 ∈ [.3,.55)  ⇒  2nd symbol is b
 within [.3,.55): a = [.3,.35), b = [.35,.475), c = [.475,.55);  .49 ∈ [.475,.55)  ⇒  3rd symbol is c
The message is bbc.

Representing a real number
Binary fractional representation:
 .75 = .11      1/3 = .010101...      11/16 = .1011

Algorithm
 1. x = 2·x
 2. If x < 1, output 0
 3. else x = x − 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
e.g.  [0,.33) → .01      [.33,.66) → .1      [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions.

 number   min      max      interval
 .11      .110     .111     [.75, 1.0)
 .101     .1010    .1011    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval is contained in the sequence interval (a dyadic number).
Example: sequence interval [.61, .79); code interval of .101 = [.625, .75) ⊆ [.61, .79).
Can use L + s/2 truncated to 1 + ⌈log (1/s)⌉ bits.

Bound on Arithmetic length
Note that ⌈−log s⌉ + 1 = ⌈log (2/s)⌉

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most
 1 + ⌈log (1/s)⌉
 = 1 + ⌈log Π_{i=1,n} (1/p_i)⌉
 ≤ 2 + Σ_{i=1,n} log (1/p_i)
 = 2 + Σ_{k=1,|S|} n·p_k·log (1/p_k)
 = 2 + n·H0  bits

In practice ≈ nH0 + 0.02·n bits, because of rounding.

Integer Arithmetic Coding
Problem: operations on arbitrary-precision real numbers are expensive.
Key ideas of the integer version:
 Keep the integers in a range [0..R) where R = 2^k
 Use rounding to generate integer intervals
 Whenever the sequence interval falls into the top, bottom or middle half, expand the interval by a factor 2
Integer Arithmetic is an approximation.

Integer Arithmetic (scaling)
 If l ≥ R/2 (top half): output 1 followed by m 0s; set m = 0; the message interval is expanded by 2
 If u < R/2 (bottom half): output 0 followed by m 1s; set m = 0; the message interval is expanded by 2
 If l ≥ R/4 and u < 3R/4 (middle half): increment m; the message interval is expanded by 2
 In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine: the ATB takes the current interval (L,s), the next symbol c and its distribution (p1,....,p|S|), and produces the new interval (L’,s’).

Therefore, even the distribution can change over time.

K-th order models: PPM
Use the previous k characters as the context.
 Makes use of conditional probabilities
 This is the changing distribution

Base probabilities on counts:
e.g. if th has been seen 12 times, followed by e 7 times, then the conditional probability p(e|th) = 7/12.

Need to keep k small so that the dictionary does not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen the context followed by the character before?
 Cannot code 0 probabilities!
The key idea of PPM is to reduce the context size if the previous match has not been seen.
 If the character has not been seen before with the current context of size 3, send an escape-msg and then try the context of size 2, and then again an escape-msg and the context of size 1, ….
Keep statistics for each context size < k.
The escape is a special character with some probability.
 Different variants of PPM use different heuristics for this probability.

PPM + Arithmetic ToolBox
As above, the ATB maps (L,s) to (L’,s’) using p[ s|context ], where s = c or esc.

Encoder and Decoder must know the protocol for selecting the same conditional probability distribution (PPM-variant)

PPM: Example Contexts   (String = ACCBACCACBA B, k = 2)

Context: Empty   Counts: A = 4, B = 2, C = 5, $ = 3

Context: A       Counts: C = 3, $ = 1
Context: B       Counts: A = 2, $ = 1
Context: C       Counts: A = 1, B = 2, C = 2, $ = 3

Context: AC      Counts: B = 1, C = 2, $ = 2
Context: BA      Counts: C = 1, $ = 1
Context: CA      Counts: C = 1, $ = 1
Context: CB      Counts: A = 2, $ = 1
Context: CC      Counts: A = 1, B = 1, $ = 2

You find this at: compression.ru/ds/
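A small sketch that rebuilds context-count tables like those in the example above; the escape count is taken here as the number of distinct symbols seen in the context, which is just one of the possible PPM heuristics. Names are ours.

from collections import defaultdict

def ppm_counts(s: str, k: int):
    # counts[context][symbol] for every context of length 0..k occurring in s
    counts = defaultdict(lambda: defaultdict(int))
    for i, ch in enumerate(s):
        for order in range(k + 1):
            if i - order >= 0:
                counts[s[i - order:i]][ch] += 1
    # '$' = escape; here its count = number of distinct symbols per context
    for ctx, tab in counts.items():
        tab["$"] = len(tab)
    return counts

tabs = ppm_counts("ACCBACCACBA", k=2)
assert dict(tabs[""])   == {"A": 4, "C": 5, "B": 2, "$": 3}
assert dict(tabs["C"])  == {"C": 2, "B": 2, "A": 1, "$": 3}
assert dict(tabs["AC"]) == {"C": 2, "B": 1, "$": 2}
# e.g. p(C|AC) = 2 / (2 + 1 + 2) once the escape gets its own probability mass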

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit frequency estimation

LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) for n → ∞ !!

LZ77
T = a a c a a c a b c a b a b a c
The Dictionary consists of all the substrings starting before the Cursor; an output triple looks like <2,3,c>.

Algorithm’s step:
 Output <d, len, c> where
  d = distance of the copied string w.r.t. the current position
  len = length of the longest match
  c = next char in the text beyond the longest match
 Advance by len + 1
A buffer “window” of fixed length slides over the text.

Example: LZ77 with window (window size = 6)
Parsing of  a a c a a c a b c a b a a a c :
 (0,0,a)  (1,1,c)  (3,4,b)  (3,3,a)  (1,2,c)
(each triple = ⟨distance of the longest match within W, its length, the next character⟩)
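A naïve LZ77 parser with a sliding window, reproducing the example above (quadratic match search, just to make the triple format concrete; names are ours):

def lz77(text: str, window: int):
    out, cur, n = [], 0, len(text)
    while cur < n:
        best_d, best_len = 0, 0
        lo = max(0, cur - window)
        for start in range(lo, cur):                 # candidate copy positions in W
            length = 0
            while (cur + length < n - 1 and          # keep one char for the literal
                   text[start + length] == text[cur + length]):
                length += 1
            if length > best_len:
                best_d, best_len = cur - start, length
        out.append((best_d, best_len, text[cur + best_len]))
        cur += best_len + 1                          # advance by len + 1
    return out

assert lz77("aacaacabcabaaac", 6) == \
    [(0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), (3, 3, 'a'), (1, 2, 'c')]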

LZ77 Decoding
The decoder keeps the same dictionary window as the encoder:
 it finds the substring and inserts a copy of it.
What if len > d? (overlap with the text to be compressed)
 E.g. seen = abcd, next codeword is (2,9,e)
 Simply copy starting at the cursor:
   for (i = 0; i < len; i++)
     out[cursor+i] = out[cursor-d+i];
 Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: output one of the following formats:
 (0, position, length)  or  (1, char)
 Typically uses the second format if length < 3.
Special greedy: possibly use a shorter match so that the next match is better.
Hash table to speed up the searches on triplets.
Triples are coded with Huffman’s code.

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example   (input = a a b a a c a b c a b c b)
 Output   Dict.
 (0,a)    1 = a
 (1,b)    2 = ab
 (1,a)    3 = aa
 (0,c)    4 = c
 (2,c)    5 = abc
 (5,b)    6 = abcb

LZ78: Decoding Example
 Input    Decoded so far              Dict.
 (0,a)    a                           1 = a
 (1,b)    a a b                       2 = ab
 (1,a)    a a b a a                   3 = aa
 (0,c)    a a b a a c                 4 = c
 (2,c)    a a b a a c a b c           5 = abc
 (5,b)    a a b a a c a b c a b c b   6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send the extra character c, but still add Sc to the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
The decoder is one step behind the coder, since it does not know c.
 There is an issue for strings of the form SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example   (input = a a b a a c a b a b a c b;  a = 112, b = 113, c = 114)
 Output   Dict.
 112      256 = aa
 112      257 = ab
 113      258 = ba
 256      259 = aac
 114      260 = ca
 257      261 = aba
 261      262 = abac
 114      263 = cb

LZW: Decoding Example
 Input   Decoded so far            Dict.
 112     a
 112     a a                       256 = aa
 113     a a b                     257 = ab
 256     a a b a a                 258 = ba
 114     a a b a a c               259 = aac
 257     a a b a a c a b ?         260 = ca
 261     a a b a a c a b a b a     261 = aba   (the decoder is one step behind: 261 is not yet in its dictionary, the SSc special case)
 114     a a b a a c a b a b a c   262 = abac
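A compact LZW sketch with the SSc special case, numbered as in the slides (a = 112, b = 113, c = 114, new entries from 256); it reproduces the example string above, also emitting the code for the final ‘b’ that the slide leaves implicit. Names are ours.

def lzw_encode(text: str, alphabet: dict[str, int]) -> list[int]:
    dic = dict(alphabet)            # string -> code
    next_code = 256                 # new entries start here
    out, S = [], ""
    for ch in text:
        if S + ch in dic:
            S += ch                 # extend the current match
        else:
            out.append(dic[S])      # emit the code of the longest match S
            dic[S + ch] = next_code # add Sc to the dictionary (c is NOT emitted)
            next_code += 1
            S = ch
    out.append(dic[S])
    return out

def lzw_decode(codes: list[int], alphabet: dict[str, int]) -> str:
    inv = {v: k for k, v in alphabet.items()}   # code -> string
    next_code = 256
    prev = inv[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in inv:
            cur = inv[code]
        else:                       # SSc special case: code not in the dict yet
            cur = prev + prev[0]
        out.append(cur)
        inv[next_code] = prev + cur[0]          # the decoder is one step behind
        next_code += 1
        prev = cur
    return "".join(out)

ABC = {"a": 112, "b": 113, "c": 114}
codes = lzw_encode("aabaacababacb", ABC)
assert codes == [112, 112, 113, 256, 114, 257, 261, 114, 113]
assert lzw_decode(codes, ABC) == "aabaacababacb"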

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a certain size (used in GIF)
 Throw the dictionary away when it is no longer effective at compressing (e.g. compress)
 Throw the least-recently-used (LRU) entry away when it reaches a certain size (used in BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform (1994)
Take the text T = mississippi# and form all its cyclic rotations:

mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows; F is the first column, L the last one:

F              L
# mississipp i
i #mississip p
i ppi#missis s
i ssippi#mis s
i ssissippi# m
m ississippi #
p i#mississi p
p pi#mississ i
s ippi#missi s
s issippi#mi s
s sippi#miss i
s sissippi#m i

The output is L = ipssm#pissii.
A famous example: [a much longer text, figure omitted]

A useful tool: L → F mapping
In the sorted matrix, F = #iiiimppssss and L = ipssm#pissii are known, while the middle of each row is unknown.

How do we map L’s chars onto F’s chars ?
... we need to distinguish equal chars in F...
 Take two equal chars of L
 Rotate their rows rightward by one position
 They keep the same relative order !!  (so the k-th occurrence of a char in L corresponds to the k-th occurrence of that char in F)

The BWT is invertible
Two key properties:
1. The LF-array maps L’s chars to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:
T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
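A small sketch following the slides above: build the BWT by sorting the rotations, and invert it with an LF-walk (quadratic/naïve, just to fix the ideas; names are ours, and the sentinel ‘#’ is assumed to appear exactly once, at the end of T).

def bwt(t: str) -> str:
    # sort all cyclic rotations and take the last column L
    n = len(t)
    rotations = sorted(t[i:] + t[:i] for i in range(n))
    return "".join(row[-1] for row in rotations)

def ibwt(L: str, sentinel: str = "#") -> str:
    n = len(L)
    # F = first column = sorted L; LF maps the k-th occurrence of a char in L
    # to the k-th occurrence of the same char in F (same relative order).
    order = sorted(range(n), key=lambda r: (L[r], r))   # order[f] = r  (stable)
    LF = [0] * n
    for f, r in enumerate(order):
        LF[r] = f
    out = [sentinel] * n           # T ends with the sentinel; row 0 starts with it
    r = 0
    for i in range(n - 2, -1, -1): # reconstruct T backward: L[r] precedes F[r]
        out[i] = L[r]
        r = LF[r]
    return "".join(out)

assert bwt("mississippi#") == "ipssm#pissii"
assert ibwt("ipssm#pissii") == "mississippi#"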

How to compute the BWT ?
Use the suffix array SA of T. For T = mississippi#:
SA = 12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3
and the i-th row of the BWT matrix is the rotation starting at position SA[i].
We said that: L[i] precedes F[i] in T.
Given SA and T, we have L[i] = T[SA[i] − 1]   (e.g. L[3] = T[SA[3] − 1] = T[7])

How to construct SA from T ?
Input: T = mississippi#
Sort the suffixes #, i#, ippi#, issippi#, ississippi#, mississippi#, pi#, ppi#, sippi#, sissippi#, ssippi#, ssissippi# and record their starting positions: SA = 12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3.

Elegant but inefficient. Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults
Many algorithms, now...

Compressing L seems promising...
Key observation:
 L is locally homogeneous
 ⇒ L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L
 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33% compression ratio, but Bzip is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii    (# at position 16)
Mtf-list = [i,m,p,s]
Mtf = 020030000030030200300300000100000
Shift every value up by one, so that 0 and 1 can encode the runs of zeros: 030040000040040300400400000200000
Runs of 0s are then encoded with Wheeler’s code, i.e. run-length+1 in binary without its MSB (e.g. a run of 5 zeros: Bin(6) = 110 → 10):
RLE0 = 03141041403141410210    (alphabet of |S|+1 symbols)

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus γ(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics
Size
 1 trillion pages available (Google, 7/2008)
 5-40K per page => hundreds of terabytes
 Size grows every day!!
Change
 8% new pages, 25% new links change weekly
 Life time of about 10 days

The Bow Tie

Some definitions
 Weakly connected components (WCC)
  Set of nodes such that from any node you can reach any other node via an undirected path.
 Strongly connected components (SCC)
  Set of nodes such that from any node you can reach any other node via a directed path.

Observing the Web Graph
 We do not know which percentage of it we know
 The only way to discover the graph structure of the web as hypertext is via large scale crawls
 Warning: the picture might be distorted by
  Size limitation of the crawl
  Crawling rules
  Perturbations of the "natural" process of birth and death of nodes and links

Why is it interesting?
 The largest artifact ever conceived by humans
 Exploit the structure of the Web for
  Crawl strategies
  Search
  Spam detection
  Discovering communities on the web
  Classification/organization
 Predict the evolution of the Web
  Sociological understanding

Many other large graphs…
 Physical network graph
  V = Routers
  E = communication links
 The “cosine” graph (undirected, weighted)
  V = static web pages
  E = semantic distance between pages
 Query-Log graph (bipartite, weighted)
  V = queries and URLs
  E = (q,u) if u is a result for q, and has been clicked by some user who issued q
 Social graph (undirected, unweighted)
  V = users
  E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)
 V = URLs, E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN & no OUT)
Three key properties, the first one being:
 Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution
Altavista crawl (1999) and WebBase crawl (2001): the indegree follows a power-law distribution

 Pr[ in-degree(u) = k ]  ∝  1/k^a,    a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)
 V = URLs, E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN, no OUT)
Three key properties:
 Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1
 Locality: usually most of the hyperlinks point to other URLs on the same host (about 80%).
 Similarity: pages close in lexicographic order tend to share many outgoing lists

A Picture of the Web Graph
[Figure: the adjacency matrix (entry (i,j) = link from page i to page j) of a crawl with 21 million pages and 150 million links.]

URL-sorting (pages from the same host, e.g. Berkeley or Stanford, end up close to each other)
URL compression + Delta encoding

The library WebGraph
From the uncompressed adjacency list to an adjacency list with compressed gaps (exploiting locality):

Successor list S(x) = {s1 − x, s2 − s1 − 1, ..., sk − s(k−1) − 1}

For negative entries:

Copy-lists   (reference chains, possibly limited)
From the uncompressed adjacency list to an adjacency list with copy lists (exploiting similarity):
 Each bit of y’s copy-list tells whether the corresponding successor of the reference x is also a successor of y;
 The reference index is chosen in [0,W] so as to give the best compression.

Copy-blocks = RLE(Copy-list)
From the adjacency list with copy lists to an adjacency list with copy blocks (RLE on the bit sequences):
 The first copy-block is 0 if the copy-list starts with 0;
 The last block is omitted (we know the length…);
 The length is decremented by one for all blocks.

This is a Java and C++ lib (≈3 bits/edge)

Extra-nodes: Compressing Intervals
From the adjacency list with copy blocks, exploit consecutivity in the extra-nodes:
 Intervals: use their left extreme and length
 Interval length: decremented by Lmin = 2
 Residuals: differences between consecutive residuals, or w.r.t. the source node
Examples:
 0 = (15-15)*2 (positive)
 2 = (23-19)-2 (jump >= 2)
 600 = (316-16)*2
 3 = |13-15|*2-1 (negative)
 3018 = 3041-22-1
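A toy sketch of the gap (delta) encoding of successor lists described above, with γ-codes for the gaps and the signed mapping used in the slides’ examples (positive v → 2v, negative v → 2|v|−1) for the first, possibly negative, entry. This only illustrates the “locality” part of the scheme, not copy-lists or intervals; all names are ours.

def gamma(x: int) -> str:                 # gamma-code of x >= 1
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def signed(v: int) -> int:                # positive -> 2v, negative -> 2|v|-1
    return 2 * v if v >= 0 else 2 * (-v) - 1

def encode_successors(x: int, succ: list[int]) -> str:
    # S(x) = { s1 - x, s2 - s1 - 1, ..., sk - s(k-1) - 1 }
    gaps = [signed(succ[0] - x)] + \
           [succ[i] - succ[i - 1] - 1 for i in range(1, len(succ))]
    return "".join(gamma(g + 1) for g in gaps)   # +1 so that a gap of 0 is codable

# node 15 with successors 13, 16, 17, 18: gaps 3, 2, 0, 0 -> gamma(4,3,1,1)
assert encode_successors(15, [13, 16, 17, 18]) == "0010001111"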

Algoritmi per IR

Compression of file collections

Background
[Setting: a sender transmits data over a network to a receiver that already holds some knowledge about that data.]
 network links are getting faster and faster but
 many clients are still connected by fairly slow links (mobile?)
 people wish to send more and more data
How can we make this transparent to the user?

Two standard techniques
 caching: “avoid sending the same object again”
  done on the basis of objects
  only works if objects are completely unchanged
  What about objects that are slightly changed?
 compression: “remove redundancy in transmitted data”
  avoid repeated substrings in data
  can be extended to the history of past transmissions (overhead)
  What if the sender has never seen the data at the receiver ?

Types of Techniques
 Common knowledge between sender & receiver
  Unstructured file: delta compression
 “Partial” knowledge
  Unstructured files: file synchronization
  Record-based data: set reconciliation

Formalization
 Delta compression   [diff, zdelta, REBL,…]
  Compress file f deploying file f’
  Compress a group of files
  Speed up web access by sending the differences between the requested page and the ones available in cache
 File synchronization   [rsynch, zsync]
  Client updates an old file f_old with f_new available on a server
  Mirroring, Shared Crawling, Content Distr. Net
 Set reconciliation
  Client updates a structured old file f_old with f_new available on a server
  Update of contacts or appointments, intersect inverted lists in a P2P search engine

Z-delta compression   (one-to-one)
Problem: We have two files f_known and f_new, and the goal is to compute a file f_d of minimum size such that f_new can be derived from f_known and f_d.
 Assume that block moves and copies are allowed
 Find an optimal covering set of f_new based on f_known
 The LZ77-scheme provides an efficient, optimal solution:
  f_known is the “previously encoded text”; compress the concatenation f_known·f_new starting from f_new
 zdelta is one of the best implementations

          Emacs size   Emacs time
uncompr   27Mb         ---
gzip      8Mb          35 secs
zdelta    1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: a pair of proxies located on the two sides of the slow link use a proprietary protocol to increase performance over it.
[Figure: Client ⇄ client-side proxy ⇄ (slow link, delta-encoded pages) ⇄ remote proxy ⇄ (fast link) ⇄ web; requests and references flow towards the web, pages come back delta-encoded against the shared reference.]

Use zdelta to reduce traffic:
 Old version available at both proxies
 Restricted to pages already visited (30% hits), URL-prefix match
 Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F
 Useful on a dynamic collection of web pages, back-ups, …
 Apply pairwise zdelta: find for each f ∈ F a good reference
 Reduction to the Min Branching problem on DAGs
  Build a weighted graph G_F: nodes = files, weights = zdelta-size
  Insert a dummy node connected to all, whose edge weights are the gzip-coding sizes
  Compute the min branching = directed spanning tree of min total cost, covering G’s nodes.
[Figure: an example graph with edge weights such as 20, 123, 220, 620, 2000; the branching picks cheap delta edges instead of gzipping each file from scratch.]

          space   time
uncompr   30Mb    ---
tgz       20%     linear
THIS      8%      quadratic

Improvement   (many-to-one: group of files)
What about many-to-one compression?
Problem: Constructing G is very costly: n² edge calculations (zdelta executions)
 We wish to exploit some pruning approach
 Collection analysis: cluster the files that appear similar and are thus good candidates for zdelta-compression. Build a sparse weighted graph G’_F containing only the edges between those pairs of files
 Assign weights: estimate appropriate edge weights for G’_F, thus saving zdelta executions. Nonetheless, strictly n² time

          space   time
uncompr   260Mb   ---
tgz       12%     2 mins
THIS      8%      16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
[Setting: the client sends a request to the server, which replies with an update that lets the client turn its f_old into f_new.]
 client wants to update an out-dated file
 server has the new file but does not know the old file
 update without sending the entire f_new (using similarity)
 rsync: file synch tool, distributed with Linux
Delta compression is a sort of local synch, since the server has both copies of the files.

The rsync algorithm
[Protocol: the client sends the hashes of the blocks of its f_old; the server replies with the encoded file, i.e. block references plus literals, from which the client rebuilds f_new.]

The rsync algorithm (contd)
 simple, widely used, single roundtrip
 optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
 choice of block size is problematic (default: max{700, √n} bytes)
 not good in theory: the granularity of changes may disrupt the use of blocks
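A stripped-down sketch of the rsync idea: hash fixed-size blocks of f_old, then scan f_new with a rolling hash and emit either block references or literal bytes. This uses a toy additive rolling hash and no MD5 confirmation (so collisions would go undetected); it is only an illustration of the mechanism, and all names are ours.

BLOCK = 4

def block_hashes(old: bytes) -> dict[int, int]:
    # client side: weak hash of every aligned block of f_old
    return {sum(old[i:i + BLOCK]): i // BLOCK
            for i in range(0, len(old) - BLOCK + 1, BLOCK)}

def rsync_encode(new: bytes, hashes: dict[int, int]):
    # server side: slide a window over f_new; on a hash hit emit ("copy", block),
    # otherwise emit one literal byte and roll the hash by one position
    out, i = [], 0
    h = sum(new[:BLOCK]) if len(new) >= BLOCK else None
    while i + BLOCK <= len(new):
        if h in hashes:
            out.append(("copy", hashes[h]))
            i += BLOCK
            h = sum(new[i:i + BLOCK]) if i + BLOCK <= len(new) else None
        else:
            out.append(("lit", new[i]))
            h = h - new[i] + new[i + BLOCK] if i + BLOCK < len(new) else None
            i += 1
    out.extend(("lit", b) for b in new[i:])
    return out

old = b"the quick brown fox jumps"
new = b"the quick red fox jumps"
enc = rsync_encode(new, block_hashes(old))
print(enc)   # a few ("copy", k) instructions plus literal bytes around the change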

Rsync: some experiments

         gcc size   emacs size
total    27288      27326
gzip     7563       8577
zdelta   227        1431
rsync    964        4452

Compressed sizes in KB (slightly outdated numbers)
Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync
The server sends the hashes (unlike the client in rsync), and the client checks them.
The server deploys the common f_ref to compress the new f_tar (rsync compresses it on its own).

A multi-round protocol
 k blocks of n/k elems
 log(n/k) levels
 If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
 The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets S_A and S_B of integer values located on two machines A and B, determine the difference between the two sets at one or both of the machines.
Requirements: The cost should be proportional to the size k of the difference, where k may or may not be known in advance to the parties.
Note:
 set reconciliation is “easier” than file sync [it is record-based]
 Not perfectly true but...



Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff P is a prefix of the i-th suffix of T (i.e. T[i,N])

Occurrences of P in T = all suffixes of T having P as a prefix
Example: P = si, T = mississippi  ⇒  occurrences at positions 4, 7

SUF(T) = sorted set of suffixes of T

Reduction: from substring search to prefix search.

The Suffix Tree
T# = mississippi#   (positions 1 2 ... 12)
[Figure: the suffix tree of T#, i.e. the compacted trie of its 12 suffixes; edges carry substring labels such as #, i, ssi, ppi#, si, pi#, i#, mississippi#, and the leaves store the starting positions 1..12 of the suffixes.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.
Storing SUF(T) explicitly would take Θ(N²) space; store instead the suffix array SA of the starting positions. For T = mississippi#:

SA = 12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3
SUF(T) = #, i#, ippi#, issippi#, ississippi#, mississippi#, pi#, ppi#, sippi#, sissippi#, ssippi#, ssissippi#

Each SA entry is a suffix pointer into T.
Suffix Array space:
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step.
Example: T = mississippi#, P = si. Compare P against the suffix pointed to by the middle entry of SA: if P is larger, recurse on the right half; if P is smaller, recurse on the left half.

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmps
 overall, O(p log2 N) time — improved to O(p + log2 N) by [Manber-Myers, ’90], and with |S|-aware bounds by [Cole et al, ’06]

Locating the occurrences
The occurrences of P are the SA entries whose suffixes lie between P# and P$ (since # < S < $): for P = si in T = mississippi# the range spans the suffixes sippi# and sissippi#, hence occ = 2, at positions 4 and 7.

Suffix Array search
• O(p + log2 N + occ) time
Suffix Trays: O(p + log2 |S| + occ)   [Cole et al., ‘06]
String B-tree   [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays   [Ciriani et al., ’02]
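A naïve suffix-array construction (the “elegant but inefficient” sort of all suffixes) and the O(p log n) indirect binary search sketched above; names are ours.

def suffix_array(t: str) -> list[int]:
    # sort the suffixes directly (Theta(n^2 log n) in the worst case)
    return sorted(range(len(t)), key=lambda i: t[i:])

def sa_range(t: str, sa: list[int], p: str) -> tuple[int, int]:
    # indirect binary search on SA: O(p) chars compared per step
    lo, hi = 0, len(sa)
    while lo < hi:                                   # first suffix >= p
        mid = (lo + hi) // 2
        if t[sa[mid]:] < p:
            lo = mid + 1
        else:
            hi = mid
    left = lo
    lo, hi = left, len(sa)
    while lo < hi:                                   # first suffix whose |p|-prefix is > p
        mid = (lo + hi) // 2
        if t[sa[mid]:sa[mid] + len(p)] <= p:
            lo = mid + 1
        else:
            hi = mid
    return left, lo                                  # occurrences of p are sa[left:lo]

t = "mississippi#"
sa = suffix_array(t)
assert [i + 1 for i in sa] == [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]  # 1-based, as in the slides
l, r = sa_range(t, sa, "si")
assert sorted(sa[l:r]) == [3, 6]   # 0-based; positions 4 and 7 with 1-based indexing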

Text mining
Lcp[1,N−1] = longest-common-prefix between suffixes adjacent in SA

For T = mississippi#  (SA = 12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3):
Lcp = 0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3
(e.g. Lcp = 4 for the adjacent suffixes issippi# and ississippi#)

• How long is the common prefix between T[i,...] and T[j,...] ?
 Min of the subarray Lcp[h,k−1] s.t. SA[h] = i and SA[k] = j.
• Does there exist a repeated substring of length ≥ L ?
 Search for an entry Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
 Search for a window Lcp[i,i+C−2] whose entries are all ≥ L

Slide 132

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Delection: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
0 000 ...........0 x in b in ary
Length-1


x > 0 and Length = log2 x +1
e.g., 9 represented as <000,1001>.



g-code for x takes 2 log2 x +1 bits

(ie. factor of 2 from optimal)



Optimal for Pr(x) = 1/2x2, and i.i.d integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8

6

3

59

7

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How much good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
1 ≥Si=1,...,x pi ≥ x * px  x ≤ 1/px

How good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):



p i  | g ( i ) |

i  1 ,..., S

This is:



i  1 ,.., S

p i  [ 2 * log

1

 1]

pi

 2 * H 0 ( X ) 1
No much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1n 2n 3n… nn  Huff = O(n2 log n), MTF = O(n log n) + n2

No much worse than Huffman
...but it may be far better

MTF: how good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2 · log i + 1
Write the alphabet S at the front and consider the cost of encoding:

   O(|S| log |S|)  +  Σ_{x=1..|S|}  Σ_{i=2..n_x}  | γ( p_x^i − p_x^{i−1} ) |

By Jensen’s inequality:

   ≤  O(|S| log |S|)  +  Σ_{x=1..|S|}  n_x · [ 2 · log(N/n_x) + 1 ]
   =  O(|S| log |S|)  +  N · [ 2 · H0(X) + 1 ]

Hence  La[mtf]  ≤  2 · H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to maintain the MTF-list efficiently:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1^n 2^n 3^n … n^n  ⟹

There is a memory

Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)
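A one-function sketch of RLE as used above:

    def rle_encode(text):
        out = []
        for ch in text:
            if out and out[-1][0] == ch:
                out[-1] = (ch, out[-1][1] + 1)
            else:
                out.append((ch, 1))
        return out

    print(rle_encode("abbbaacccca"))   # -> [('a',1), ('b',3), ('a',2), ('c',4), ('a',1)]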

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.  p(a) = .2,  p(b) = .5,  p(c) = .3

f(i) = Σ_{j=1..i−1} p(j),  so  f(a) = .0,  f(b) = .2,  f(c) = .7

[figure: the interval [0,1) partitioned as  a = [0,.2),  b = [.2,.7),  c = [.7,1.0)]

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
Start from [0,1).  Symbol b restricts it to [.2,.7);  then a restricts [.2,.7) to [.2,.3);  then c restricts [.2,.3) to [.27,.3).

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l_0 = 0        l_i = l_{i−1} + s_{i−1} · f[c_i]
s_0 = 1        s_i = s_{i−1} · p[c_i]

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

   s_n = ∏_{i=1..n} p[c_i]

The interval for a message sequence will be called the
sequence interval
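A tiny floating-point sketch of these recurrences (only for illustration — real coders use the integer version described later); it reproduces the interval [.27,.3) for bac and, run in reverse, recovers bbc from .49.

    def sequence_interval(msg, p, f):
        l, s = 0.0, 1.0
        for c in msg:
            l, s = l + s * f[c], s * p[c]
        return l, s                                   # the sequence interval is [l, l+s)

    def decode(x, length, p, f):
        out, l, s = [], 0.0, 1.0
        for _ in range(length):
            t = (x - l) / s
            c = max((c for c in p if f[c] <= t), key=lambda c: f[c])  # symbol interval containing t
            out.append(c)
            l, s = l + s * f[c], s * p[c]
        return "".join(out)

    p = {'a': .2, 'b': .5, 'c': .3}
    f = {'a': .0, 'b': .2, 'c': .7}
    print(sequence_interval("bac", p, f))   # -> (~0.27, ~0.03), i.e. the interval [.27, .30)
    print(decode(.49, 3, p, f))             # -> 'bbc'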

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
.49 lies in b’s interval [.2,.7)  →  b;   within [.2,.7), .49 lies in [.3,.55)  →  b;   within [.3,.55), .49 lies in [.475,.55)  →  c.

The message is bbc.

Representing a real number
Binary fractional representation:
.75 = .11        1/3 = .010101...        11/16 = .1011

Algorithm
1. x = 2·x
2. If x < 1, output 0
3. else x = x − 1; output 1
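The same algorithm in a few lines of Python (ours):

    def binary_fraction(x, nbits):
        # emit nbits of the binary fractional representation of x in [0,1]
        bits = []
        for _ in range(nbits):
            x *= 2
            if x < 1:
                bits.append("0")
            else:
                bits.append("1"); x -= 1
        return "." + "".join(bits)

    print(binary_fraction(0.75, 2))    # -> .11
    print(binary_fraction(1/3, 6))     # -> .010101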

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
number    min       max       interval
.11       .110      .111      [.75, 1.0)
.101      .1010     .1011     [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

[figure: sequence interval [.61, .79) containing the code interval of .101, i.e. [.625, .75)]

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

   1 + ⌈ log (1/s) ⌉
   = 1 + ⌈ log ∏_{i=1..n} (1/p_i) ⌉
   ≤ 2 + Σ_{i=1..n} log (1/p_i)
   = 2 + Σ_{k=1..|S|} n·p_k · log (1/p_k)
   = 2 + n·H0   bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 (top half):
   Output 1 followed by m 0s;  m = 0;  the message interval is expanded by 2
If u < R/2 (bottom half):
   Output 0 followed by m 1s;  m = 0;  the message interval is expanded by 2
If l ≥ R/4 and u < 3R/4 (middle half):
   Increment m;  the message interval is expanded by 2
In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine

[figure: the ATB as a black box — given the current interval (L, s), the distribution (p1,...,p|S|) and the next symbol c, it outputs the refined interval (L', s')]

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.
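A toy sketch of this counting scheme, in the style of PPM method C (escape count = number of distinct symbols seen in the context, which is exactly the $ count in the example tables below); real PPMs also apply exclusions and subtler escape heuristics, omitted here.

    from collections import defaultdict, Counter

    class PPMC:
        def __init__(self, k, alphabet_size=256):
            self.k, self.A = k, alphabet_size
            self.counts = defaultdict(Counter)        # context string -> symbol counts

        def update(self, history, sym):
            for j in range(min(self.k, len(history)) + 1):
                self.counts[history[len(history) - j:]][sym] += 1

        def prob(self, history, sym):
            p = 1.0
            for j in range(min(self.k, len(history)), -1, -1):   # back off from the longest context
                ctx = self.counts[history[len(history) - j:]]
                total, esc = sum(ctx.values()), len(ctx)
                if total == 0:
                    continue
                if ctx[sym] > 0:
                    return p * ctx[sym] / (total + esc)
                p *= esc / (total + esc)              # pay an escape and shorten the context
            return p / self.A                         # order -1: uniform over the alphabet

    # typical use: for each position i, feed prob(T[:i][-k:], T[i]) to the ATB, then call update(T[:i][-k:], T[i])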

PPM + Arithmetic ToolBox
[figure: the same ATB black box, now driven by the conditional distribution p[ s | context ], where s is either a real symbol c or the escape esc]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
String = ACCBACCACBA B      k = 2

Context    Counts
Empty      A = 4   B = 2   C = 5   $ = 3
A          C = 3   $ = 1
B          A = 2   $ = 1
C          A = 1   B = 2   C = 2   $ = 3
AC         B = 1   C = 2   $ = 2
BA         C = 1   $ = 1
CA         C = 1   $ = 1
CB         A = 2   $ = 1
CC         A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6  (each triple reports the longest match within W and the next character)
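A compact sketch of this step (greedy longest match inside a window of size W, where copies may overlap the cursor); on the text above with W = 6 it outputs exactly the five triples shown.

    def lz77_encode(T, W):
        out, i, n = [], 0, len(T)
        while i < n:
            best_d, best_len = 0, 0
            for d in range(1, min(i, W) + 1):                        # candidate copy distances
                l = 0
                while i + l < n - 1 and T[i + l] == T[i - d + l]:    # overlap with the copy is allowed
                    l += 1
                if l > best_len:
                    best_d, best_len = d, l
            out.append((best_d, best_len, T[i + best_len]))          # <d, len, next char>
            i += best_len + 1                                        # advance by len + 1
        return out

    print(lz77_encode("aacaacabcabaaac", 6))
    # -> [(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')]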

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy parsing: possibly use a shorter match so that the next match is better
Hash table to speed up the search for matching triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb
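A sketch of the LZW coding loop; initialized with the slide's toy codes a=112, b=113, c=114 it reproduces the outputs above (plus a final flush of 113 for the pending phrase). A real implementation would start from the 256 byte values instead.

    def lzw_encode(T, dictionary):
        # dictionary: string -> code; it is extended (in place) while coding
        next_code = 256
        w, out = "", []
        for ch in T:
            if w + ch in dictionary:
                w += ch                          # extend the current phrase
            else:
                out.append(dictionary[w])        # emit the phrase id
                dictionary[w + ch] = next_code   # add Sc to the dictionary
                next_code += 1
                w = ch
        if w:
            out.append(dictionary[w])            # flush the last phrase
        return out

    print(lzw_encode("aabaacababacb", {"a": 112, "b": 113, "c": 114}))
    # -> [112, 112, 113, 256, 114, 257, 261, 114, 113]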

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
We are given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows (Burrows-Wheeler, 1994):

F                 L
#  mississipp     i
i  #mississip     p
i  ppi#missis     s
i  ssippi#mis     s
i  ssissippi#     m
m  ississippi     #
p  i#mississi     p
p  pi#mississ     i
s  ippi#missi     s
s  issippi#mi     s
s  sippi#miss     i
s  sissippi#m     i

The last column L is the output of the BWT of T.

A famous example

Much
longer...

A useful tool: the L → F mapping

[figure: the same sorted-rotation matrix; F = # i i i i m p p s s s s is known, L = i p s s m # p i s s i i is known, the middle of each row is unknown]

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
[figure: again the sorted-rotation matrix, with columns F and L known and the rest unknown]

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
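The same backward reconstruction in Python (a sketch of ours; it assumes the text ends with a unique smallest terminator, here '#'). On L = ipssm#pissii it returns mississippi#.

    def inverse_bwt(L):
        n = len(L)
        # LF-mapping: the j-th occurrence of a char in L corresponds to its j-th occurrence in F
        LF = [0] * n
        for f_pos, l_pos in enumerate(sorted(range(n), key=lambda i: (L[i], i))):
            LF[l_pos] = f_pos
        # walk T backwards starting from row 0, i.e. the rotation that begins with '#'
        out, r = [], 0
        for _ in range(n):
            out.append(L[r])
            r = LF[r]
        t = "".join(reversed(out))     # this is T rotated so that '#' comes first
        return t[1:] + t[0]

    print(inverse_bwt("ipssm#pissii"))   # -> mississippi#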

How to compute the BWT ?
SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
BWT matrix: the rows are the rotations of T in sorted order; row i starts at position SA[i]
L = i p s s m # p i s s i i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]
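A direct (and, as the next slide points out, inefficient) sketch of this computation, with 0-based positions instead of the slides' 1-based ones:

    def sa_and_bwt(T):
        # T must end with a unique smallest terminator, e.g. '#'
        SA = sorted(range(len(T)), key=lambda i: T[i:])   # sort the suffixes
        L = "".join(T[i - 1] for i in SA)                 # L[i] = T[SA[i]-1]; T[-1] wraps to the terminator
        return SA, L

    SA, L = sa_and_bwt("mississippi#")
    # SA (0-based) = [11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]      L = "ipssm#pissii"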

How to construct SA from T ?
SA     suffix
12     #
11     i#
8      ippi#
5      issippi#
2      ississippi#
1      mississippi#
10     pi#
9      ppi#
7      sippi#
4      sissippi#
6      ssippi#
3      ssissippi#
Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous  ⟹  L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion pages available (Google, 7/2008)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by humankind



Exploit the structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pr that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in-degree(u) = k ]  ≈  1/k^α ,    α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pr that a node has x links is 1/x^α, α ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph

[figure: adjacency-matrix plot (i, j) — 21 million pages, 150 million links]

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:
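A minimal sketch of this gap transform (the handling of the possibly negative first entry shown here — a zig-zag map — is just one simple choice of ours; WebGraph uses its own ν-codes):

    def gap_encode(x, successors):
        # successors of node x, sorted increasingly; by locality most gaps are tiny
        s = sorted(successors)
        return [s[0] - x] + [s[i] - s[i - 1] - 1 for i in range(1, len(s))]

    def to_natural(g):
        # hypothetical map for the signed first entry: 0,-1,1,-2,2,... -> 0,1,2,3,4,...
        return 2 * g if g >= 0 else -2 * g - 1

    print(gap_encode(15, [13, 16, 17, 18, 23]))   # -> [-2, 2, 0, 0, 4]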

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



What if the sender has never seen the data at the receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77-scheme provides an efficient, optimal solution




fknown is the “previously encoded text”: compress the concatenation fknown·fnew, starting from fnew

zdelta is one of the best implementations
           Emacs size    Emacs time
uncompr    27Mb          ---
gzip       8Mb           35 secs
zdelta     1.5Mb         42 secs
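The same idea can be played with in Python through zlib's preset-dictionary support (this is not zdelta, and zlib exploits only the last 32KB of the dictionary, so it is only a rough stand-in for the LZ77-on-fknown·fnew scheme above):

    import zlib

    def delta_compress(f_new: bytes, f_known: bytes) -> bytes:
        co = zlib.compressobj(level=9, zdict=f_known)   # f_known plays the role of the previously encoded text
        return co.compress(f_new) + co.flush()

    def delta_decompress(delta: bytes, f_known: bytes) -> bytes:
        do = zlib.decompressobj(zdict=f_known)
        return do.decompress(delta) + do.flush()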

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[figure: example weighted graph GF with a dummy node 0; edge weights (e.g. 620, 123, 2000, 220, 20) are the zdelta sizes, and the min branching is highlighted]

           space    time
uncompr    30Mb     ---
tgz        20%      linear
THIS       8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
           space    time
uncompr    260Mb    ---
tgz        12%      2 mins
THIS       8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
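A sketch in the spirit of rsync's weak rolling checksum (the real tool's constants and details differ; candidate matches are then confirmed with a strong hash such as MD5, and unmatched bytes are sent as literals):

    class RollingChecksum:
        def __init__(self, block: bytes):
            self.n = len(block)
            self.a = sum(block) % 65536
            self.b = sum((self.n - i) * x for i, x in enumerate(block)) % 65536

        def digest(self) -> int:
            return (self.b << 16) | self.a    # 4-byte weak fingerprint of the current window

        def roll(self, out_byte: int, in_byte: int):
            # slide the window by one byte in O(1): drop out_byte on the left, add in_byte on the right
            self.a = (self.a - out_byte + in_byte) % 65536
            self.b = (self.b - self.n * out_byte + self.a) % 65536

    # one side hashes the blocks of its old file once; the side holding f_new rolls this checksum
    # over f_new and, whenever weak + strong hashes match a block, emits a block reference instead of raw bytes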

Rsync: some experiments

           gcc size    emacs size
total      27288       27326
gzip       7563        8577
zdelta     227         1431
rsync      964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), the client checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

[figure: P aligned at position i of T, matching the prefix of T[i,N]]

Occurrences of P in T = all suffixes of T having P as a prefix

Example: P = si, T = mississippi  →  occurrences at positions 4, 7

SUF(T) = sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[figure: the suffix tree of T# = mississippi#. Edges are labeled with substrings (e.g. i, s, p, si, ssi, i#, pi#, ppi#, mississippi#) and the 12 leaves store the starting positions 1..12 of the suffixes]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Θ(N²) space for SUF(T)

SA      SUF(T)
12      #
11      i#
8       ippi#
5       issippi#
2       ississippi#
1       mississippi#
10      pi#
9       ppi#
7       sippi#
4       sissippi#
6       ssippi#
3       ssissippi#

T = mississippi#      (each SA entry is a suffix pointer;  P = si)

Suffix Array space:
• SA: Θ(N log2 N) bits
• Text T: N chars
⟹ In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]      T = mississippi#      P = si
At each binary-search step: compare P with the suffix pointed to by the middle SA entry (2 accesses per step); here P is larger, so recurse on the right half.

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]      T = mississippi#      P = si
At the next step P is smaller than the middle suffix, so recurse on the left half.

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
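A sketch of the indirected binary search (0-based positions; each comparison looks at at most p characters of a suffix, hence O(p log2 N) time, plus O(occ) to report the occurrences):

    def sa_range(T, SA, P):
        p = len(P)
        def lower(strict):
            lo, hi = 0, len(SA)
            while lo < hi:
                mid = (lo + hi) // 2
                pref = T[SA[mid]:SA[mid] + p]
                if pref < P or (strict and pref == P):
                    lo = mid + 1
                else:
                    hi = mid
            return lo
        l, r = lower(False), lower(True)
        return l, r                         # the occurrences of P are at positions SA[l:r]

    T = "mississippi#"
    SA = sorted(range(len(T)), key=lambda i: T[i:])
    l, r = sa_range(T, SA, "si")
    print(SA[l:r])                          # -> [6, 3]  (positions 7 and 4 in the slides' 1-based numbering)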

Locating the occurrences
occ = 2:  the suffixes having P = si as a prefix occupy a contiguous range of SA, found by searching for si# and si$ (P padded with the smallest and the largest symbol); here the range contains SA entries 7 (sippi#) and 4 (sissippi#).      T = mississippi#

Suffix Array search
• O (p + log2 N + occ) time

(assuming the ordering # < Σ < $)
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

T = mississippi#
SA  = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
Lcp = [ 0,  1, 1, 4, 0, 0,  1, 0, 2, 1, 3]     (e.g. Lcp = 4 between issippi# and ississippi#)

• How long is the common prefix between T[i,...] and T[j,...] ?
  • Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  • Search for Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  • Search for a window Lcp[i,i+C-2] whose entries are all ≥ L


Slide 133

d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Delection: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
0 000 ...........0 x in b in ary
Length-1


x > 0 and Length = log2 x +1
e.g., 9 represented as <000,1001>.



g-code for x takes 2 log2 x +1 bits

(ie. factor of 2 from optimal)



Optimal for Pr(x) = 1/2x2, and i.i.d integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8

6

3

59

7

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How much good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
1 ≥Si=1,...,x pi ≥ x * px  x ≤ 1/px

How good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):



p i  | g ( i ) |

i  1 ,..., S

This is:



i  1 ,.., S

p i  [ 2 * log

1

 1]

pi

 2 * H 0 ( X ) 1
No much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1n 2n 3n… nn  Huff = O(n2 log n), MTF = O(n log n) + n2

No much worse than Huffman
...but it may be far better

MTF: how good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
Put S in the front and consider the cost of encoding:
S

O ( S log S ) 



nx

g(p - p )

x 1 i  2

x

x

i

i -1

S

By Jensen’s:

 O ( S log S ) 

n
x 1

x

[ 2 * log

N

 1]

nx

 O ( S log S )  N * [ 2 * H 0 ( X )  1]
L a [ mtf ]  2 * H 0 ( X )  O (1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log |S|) time
Total is O(n log |S|), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings ⟹ just the run lengths and one bit
Properties:

  Exploits spatial locality, and it is a dynamic code (there is a memory)

  X = 1^n 2^n 3^n … n^n  ⟹  Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)
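A minimal sketch (Python) reproducing the example above:

def rle_encode(s):
    out = []
    for ch in s:                          # extend the current run or open a new one
        if out and out[-1][0] == ch:
            out[-1] = (ch, out[-1][1] + 1)
        else:
            out.append((ch, 1))
    return out

assert rle_encode("abbbaacccca") == [('a', 1), ('b', 3), ('a', 2), ('c', 4), ('a', 1)]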

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g. p(a) = .2, p(b) = .5, p(c) = .3, with cumulative values f(a) = .0, f(b) = .2, f(c) = .7, where

  f(i) = Σ_{j=1}^{i−1} p(j)

so a ↦ [.0, .2), b ↦ [.2, .7), c ↦ [.7, 1.0).

The interval for a particular symbol will be called
the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
Start from [0, 1):
  b  ⟹  [.2, .7)    (b's symbol interval)
  a  ⟹  [.2, .3)    (a's sub-interval of [.2, .7))
  c  ⟹  [.27, .3)   (c's sub-interval of [.2, .3))

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c1 c2 … cn with probabilities p[c], use:

  l_0 = 0,   s_0 = 1
  l_i = l_{i−1} + s_{i−1} · f[c_i]
  s_i = s_{i−1} · p[c_i]

f[c] is the cumulative prob. up to symbol c (not included)
The final interval size is

  s_n = Π_{i=1}^{n} p[c_i]

The interval for a message sequence will be called the
sequence interval
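A minimal sketch (Python, exact rationals) that applies the recurrences above; on the message bac with p(a)=.2, p(b)=.5, p(c)=.3 it returns the sequence interval [.27, .3):

from fractions import Fraction as F

def sequence_interval(msg, p):
    syms = sorted(p)                       # fixed symbol order (a < b < c here)
    f, acc = {}, F(0)
    for x in syms:                         # f[c] = cumulative prob. up to c, c excluded
        f[x] = acc
        acc += p[x]
    l, s = F(0), F(1)                      # l_0 = 0, s_0 = 1
    for c in msg:
        l, s = l + s * f[c], s * p[c]      # l_i = l_{i-1} + s_{i-1}*f[c_i],  s_i = s_{i-1}*p[c_i]
    return l, l + s                        # the sequence interval [l, l+s)

p = {'a': F(2, 10), 'b': F(5, 10), 'c': F(3, 10)}
print(sequence_interval("bac", p))         # (Fraction(27, 100), Fraction(3, 10))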

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore, specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but at each
step we determine the next message symbol
and then reduce the interval accordingly

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
  .49 ∈ [.2, .7)    ⟹  b   (the interval becomes [.2, .7))
  .49 ∈ [.3, .55)   ⟹  b   (the interval becomes [.3, .55))
  .49 ∈ [.475, .55) ⟹  c

The message is bbc.
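A matching decoding sketch (Python, exact rationals again); decode(.49, 3, p) returns bbc as above:

from fractions import Fraction as F

def decode(x, n, p):
    syms = sorted(p)
    out, l, s = [], F(0), F(1)
    for _ in range(n):
        acc = F(0)
        for c in syms:                        # find the symbol interval containing x
            lo, hi = l + s * acc, l + s * (acc + p[c])
            if lo <= x < hi:
                out.append(c)
                l, s = lo, hi - lo            # reduce the interval and continue
                break
            acc += p[c]
    return "".join(out)

p = {'a': F(2, 10), 'b': F(5, 10), 'c': F(3, 10)}
print(decode(F(49, 100), 3, p))               # bbc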

Representing a real number
Binary fractional representation:
  .75   = .11
  1/3   = .010101…
  11/16 = .1011

Algorithm (emit the binary expansion of x ∈ [0,1)):
  1. x = 2·x
  2. if x < 1, output 0
  3. else x = x − 1; output 1
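The same three steps as a small Python sketch (x is assumed to lie in [0,1)):

def binary_expansion(x, bits):
    out = []
    for _ in range(bits):
        x *= 2                     # 1. x = 2*x
        if x < 1:
            out.append("0")        # 2. if x < 1 output 0
        else:
            x -= 1                 # 3. else x = x - 1; output 1
            out.append("1")
    return "." + "".join(out)

print(binary_expansion(0.75, 2), binary_expansion(11 / 16, 4))   # .11 .1011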

So how about just using the shortest binary
fractional representation of a number in the sequence
interval?
e.g. [0,.33) → .01     [.33,.66) → .1     [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.

  number   min     max     interval
  .11      .110    .111    [.75, 1.0)
  .101     .1010   .1011   [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Example: sequence interval [.61, .79); the code interval of .101 is [.625, .75) ⊆ [.61, .79).

Can use l + s/2 truncated to 1 + ⌈log2 (1/s)⌉ bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
  1 + ⌈ log2 (1/s_n) ⌉
  = 1 + ⌈ log2 Π_{i=1}^{n} (1/p(c_i)) ⌉
  ≤ 2 + Σ_{i=1}^{n} log2 (1/p(c_i))
  = 2 + Σ_{k=1}^{|S|} n·p_k · log2 (1/p_k)
  = 2 + n·H0  bits

In practice ≈ n·H0 + 0.02·n bits,
because of rounding

Integer Arithmetic Coding
Problem: operations on arbitrary-precision
real numbers are expensive.
Key ideas of the integer version:

  Keep integers in the range [0..R), where R = 2^k
  Use rounding to generate integer intervals
  Whenever the sequence interval falls into the top,
  bottom or middle half, expand the interval
  by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 (top half):
  output 1 followed by m 0s; set m = 0; the interval is expanded by 2

If u < R/2 (bottom half):
  output 0 followed by m 1s; set m = 0; the interval is expanded by 2

If l ≥ R/4 and u < 3R/4 (middle half):
  increment m; the interval is expanded by 2

In all other cases, just continue...
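A sketch of this scaling step (Python). Here l and u are the current integer interval endpoints in [0, R), m counts the pending middle-half expansions, and emit outputs bits; the names and the exclusive upper endpoint are our assumptions, not the slides':

def scale(l, u, m, R, emit):
    while True:
        if u < R // 2:                          # bottom half: output 0, then m pending 1s
            emit("0" + "1" * m); m = 0
        elif l >= R // 2:                       # top half: output 1, then m pending 0s
            emit("1" + "0" * m); m = 0
            l -= R // 2; u -= R // 2
        elif l >= R // 4 and u < 3 * R // 4:    # middle half: defer the bit, remember it in m
            m += 1
            l -= R // 4; u -= R // 4
        else:
            return l, u, m                      # all other cases: just continue
        l *= 2; u *= 2                          # the interval is expanded by a factor 2

bits = []
print(scale(l=6, u=7, m=0, R=16, emit=bits.append), bits)   # [6,7) of [0,16) emits 0 1 1, becomes [0,8)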

You find this at

Arithmetic ToolBox
As a state machine: the Arithmetic ToolBox (ATB) takes the current interval (L, s) together with a pair ⟨(p1,...,p_|S|), c⟩, i.e. a distribution and the next symbol, and returns the new interval (L', s') = (L + s·f(c), s·p(c)).

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce the context size when the
current context/character pair has not been seen:

  if the character has not been seen before with the current context of
  size 3, send an escape-msg and then try the context of size 2,
  then again an escape-msg and the context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.
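A minimal sketch of the statistics a PPM-variant keeps (Python). The escape estimate used here (escape count = number of distinct symbols seen in the context, PPMC-style) is our choice among the heuristics mentioned above:

from collections import defaultdict, Counter

class PPM:
    def __init__(self, k):
        self.k = k
        self.counts = defaultdict(Counter)       # context string -> symbol counts

    def update(self, history, c):
        for j in range(min(self.k, len(history)) + 1):   # statistics for each context size <= k
            self.counts[history[len(history) - j:]][c] += 1

    def prob(self, history, c):
        p = 1.0                                  # escape probabilities paid so far
        for j in range(min(self.k, len(history)), -1, -1):   # longest context first
            cnt = self.counts[history[len(history) - j:]]
            if not cnt:
                continue
            total = sum(cnt.values()) + len(cnt)   # +len(cnt): mass reserved for the escape
            if cnt[c]:
                return p * cnt[c] / total
            p *= len(cnt) / total                  # escape-msg, shorten the context
        return p / 256                             # order(-1): uniform fallback

m = PPM(2)
t = "ACCBACCACBA"
for i, ch in enumerate(t):
    m.update(t[:i], ch)
print(m.prob(t, "B"))   # mass used for the next char B: escape from "BA", escape from "A", then B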

PPM + Arithmetic ToolBox
At each step the model feeds the ATB the pair ⟨ p[ s | context ], s ⟩, with s = c or esc; the ATB maps the current interval (L, s) onto (L', s').

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts      (String = ACCBACCACBA B,   k = 2)

Context   Counts
(empty)   A = 4,  B = 2,  C = 5,  $ = 3
A         C = 3,  $ = 1
B         A = 2,  $ = 1
C         A = 1,  B = 2,  C = 2,  $ = 3
AC        B = 1,  C = 2,  $ = 2
BA        C = 1,  $ = 1
CA        C = 1,  $ = 1
CB        A = 2,  $ = 1
CC        A = 1,  B = 1,  $ = 2

($ is the escape symbol; here its count equals the number of distinct symbols seen in that context.)

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
[Figure: the dictionary is the set of all substrings starting before the cursor; the emitted triple here is ⟨2,3,c⟩]

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor:
  for (i = 0; i < len; i++)
      out[cursor + i] = out[cursor - d + i];



Output is correct: abcdcdcdcdcdce
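A sketch of windowed LZ77 (Python, naive quadratic matching); it reproduces the triples of the window example above, and the decoder handles the overlapping-copy case by copying one character at a time:

def lz77_encode(t, w):
    out, i = [], 0
    while i < len(t):
        d = l = 0
        for j in range(max(0, i - w), i):                 # candidates inside the window
            k = 0
            while i + k < len(t) - 1 and t[j + k] == t[i + k]:
                k += 1                                    # the match may run past the cursor
            if k > l:
                d, l = i - j, k
        out.append((d, l, t[i + l]))                      # (distance, length, next char)
        i += l + 1                                        # advance by len + 1
    return out

def lz77_decode(triples):
    out = []
    for d, l, c in triples:
        for _ in range(l):
            out.append(out[-d])                           # works even when l > d (overlap)
        out.append(c)
    return "".join(out)

t = "aacaacabcabaaac"
assert lz77_encode(t, 6) == [(0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), (3, 3, 'a'), (1, 2, 'c')]
assert lz77_decode(lz77_encode(t, 6)) == t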

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids
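A minimal sketch of the coding loop (Python); on the string of the next slide it outputs exactly (0,a)(1,b)(1,a)(0,c)(2,c)(5,b):

def lz78_encode(t):
    dic, out, i = {}, [], 0                     # phrase -> id (ids start at 1, 0 = empty phrase)
    while i < len(t):
        s, j = "", i
        while j < len(t) and s + t[j] in dic:   # longest match S in the dictionary
            s += t[j]
            j += 1
        c = t[j] if j < len(t) else ""          # next character after the match
        out.append((dic.get(s, 0), c))          # output (id of S, c)
        dic[s + c] = len(dic) + 1               # add Sc to the dictionary
        i = j + 1
    return out

assert lz78_encode("aabaacabcabcb") == \
       [(0, 'a'), (1, 'b'), (1, 'a'), (0, 'c'), (2, 'c'), (5, 'b')]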

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
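A sketch of LZW (Python). The toy dictionary numbering a = 112, b = 113, c = 114 mirrors the slides; real implementations start from the 256 ASCII codes. The decoder's special case is exactly the SSc situation mentioned above:

def lzw_encode(t):
    d = {ch: 112 + i for i, ch in enumerate("abc")}   # toy initial dictionary, as in the slides
    nxt, out, s = 256, [], t[0]
    for c in t[1:]:
        if s + c in d:
            s += c
        else:
            out.append(d[s])
            d[s + c] = nxt; nxt += 1                  # add Sc, but do not transmit c
            s = c
    out.append(d[s])
    return out

def lzw_decode(codes):
    d = {112 + i: ch for i, ch in enumerate("abc")}
    nxt, prev = 256, d[codes[0]]
    out = [prev]
    for code in codes[1:]:
        cur = d[code] if code in d else prev + prev[0]   # special case: SSc with S[0] = c
        out.append(cur)
        d[nxt] = prev + cur[0]; nxt += 1                 # the decoder is one step behind
        prev = cur
    return "".join(out)

t = "aabaacababacb"
codes = lzw_encode(t)          # 112 112 113 256 114 257 261 114 (plus a final 113 flushing 'b')
assert lzw_decode(codes) == t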

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us be given a text T = mississippi#. Write all its cyclic rotations, one per row, and sort the rows (Burrows-Wheeler, 1994). The first column of the sorted matrix is F = #iiiimppssss, the last column is L = ipssm#pissii: L is the BWT of T.

A famous example

Much
longer...

A useful tool: the L → F mapping
(Same example: F = #iiiimppssss, L = ipssm#pissii; the text T is unknown to the decoder.)

How do we map L's chars onto F's chars?
... we need to distinguish equal chars in F...

Take two equal chars in L and rotate their rows rightward by one position:
they keep the same relative order !!

The BWT is invertible
F = #iiiimppssss,  L = ipssm#pissii   (the text is unknown to the decoder)

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
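A minimal sketch (Python) of the forward transform and of the LF-based inversion in the pseudocode above; it maps mississippi# to ipssm#pissii and back:

def bwt(t):
    # t must end with a unique, smallest sentinel (here '#')
    rows = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(r[-1] for r in rows)

def ibwt(L):
    n = len(L)
    order = sorted(range(n), key=lambda i: (L[i], i))   # stable sort of L gives F
    LF = [0] * n
    for f_row, i in enumerate(order):
        LF[i] = f_row                                   # LF maps L's chars onto F's chars
    r, out = 0, []                                      # row 0 starts with the sentinel
    for _ in range(n):
        out.append(L[r])                                # L[r] precedes F[r] in T
        r = LF[r]
    t = "".join(reversed(out))                          # this is T rotated to start with '#'
    return t[1:] + t[0]

assert bwt("mississippi#") == "ipssm#pissii"
assert ibwt("ipssm#pissii") == "mississippi#"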

How to compute the BWT ?
SA = 12 11 8 5 2 1 10 9 7 4 6 3 for T = mississippi#
(the rows of the BWT matrix are the suffixes of T in lexicographic order), and L = ipssm#pissii.

We said that L[i] precedes F[i] in T: e.g. L[3] = T[7].
Given SA and T, we have L[i] = T[ SA[i] − 1 ].

How to construct SA from T ?
SA   suffix
12   #
11   i#
 8   ippi#
 5   issippi#
 2   ississippi#
 1   mississippi#
10   pi#
 9   ppi#
 7   sippi#
 4   sissippi#
 6   ssippi#
 3   ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus γ(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)

  Set of nodes such that from any node we can reach any other node via an
  undirected path.

Strongly connected components (SCC)

  Set of nodes such that from any node we can reach any other node via a
  directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by humans

Exploit the structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph: V = routers, E = communication links

The "cosine" graph (undirected, weighted): V = static web pages, E weighted by the semantic distance between pages

Query-Log graph (bipartite, weighted): V = queries and URLs, E = (q,u) if u is a result for q and has been clicked by some user who issued q

Social graph (undirected, unweighted): V = users, E = (x,y) if x knows y (facebook, address book, email, ...)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: the probability that a node has x links is ≈ 1/x^α, α ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase crawl, 2001:   Pr[ in-degree(u) = k ]  ∝  1/k^α,   with α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: the probability that a node has x links is ≈ 1/x^α, α ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to have
similar (largely overlapping) lists of outgoing links

A Picture of the Web Graph
j

i

21 million pages, 150 million links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = { s_1 − x,  s_2 − s_1 − 1,  ...,  s_k − s_{k−1} − 1 }
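A minimal sketch of this gap (delta) encoding of a successor list (Python); the node id 15 and its successors below are made-up numbers, and the signed mapping needed when the first gap is negative is left out, as on the slide:

def gaps(x, successors):
    s = sorted(successors)
    return [s[0] - x] + [s[i] - s[i - 1] - 1 for i in range(1, len(s))]

print(gaps(15, [17, 20, 21, 30]))    # [2, 2, 0, 8]: small integers, good for gamma/delta codes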

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of the copy-list says whether the corresponding successor of the
reference list is also a successor of the current node;
the reference index is chosen in [0,W] so as to give the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



What if the sender has never seen the data at the receiver?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress a file f exploiting another file f'



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates an old file f_old with f_new available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates a structured old file f_old with f_new available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files f_known and f_new, and the goal is
to compute a file f_d of minimum size such that f_new can
be derived from f_known and f_d

  Assume that block moves and copies are allowed

  Find an optimal covering set of f_new based on f_known

  The LZ77-scheme provides an efficient, optimal solution:
  f_known is the "previously encoded text"; compress f_known·f_new starting from f_new

zdelta is one of the best implementations
           Emacs size   Emacs time
uncompr    27MB         ---
gzip        8MB         35 secs
zdelta     1.5MB        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

[Figure: client-side proxy ⇄ (slow link, delta-encoding) ⇄ server-side proxy ⇄ (fast link) ⇄ web; both proxies keep the reference page, requests flow towards the web and pages flow back.]

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f ∈ F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: example graph G_F with a dummy node 0 and zdelta-sizes as edge weights (620, 2000, 220, 123, 20, ...); the min branching picks, for each file, its cheapest reference.]

          space   time
uncompr   30MB    ---
tgz       20%     linear
THIS       8%     quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly: n² edge calculations (zdelta executions)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: estimate appropriate edge weights for G'_F, thus saving
zdelta executions. Nonetheless, still n² time
          space    time
uncompr   260MB    ---
tgz       12%      2 mins
THIS       8%      16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
[Figure: the Client holds f_old and sends a request; the Server holds f_new and returns an update.]

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch,
since there the sender has both copies of the files

The rsync algorithm
[Figure: the Client (holding f_old) sends block hashes; the Server (holding f_new) replies with the encoded file.]

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
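A sketch of the rolling-hash idea behind rsync's weak checksum (Python). This is a plain sliding sum, not rsync's actual Adler-style checksum; it only shows the O(1) update per shifted block:

def rolling_hashes(data, b, M=1 << 16):
    a = sum(data[:b]) % M                      # weak hash of the first length-b block
    out = [a]
    for i in range(b, len(data)):
        a = (a - data[i - b] + data[i]) % M    # slide the window by one byte in O(1)
        out.append(a)
    return out

print(rolling_hashes(b"abracadabra", 4)[:3])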

Rsync: some experiments

         gcc     emacs
total    27288   27326
gzip      7563    8577
zdelta     227    1431
rsync      964    4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends the hashes (unlike the client in rsync), and the client checks them.
The server exploits the common f_ref to compress the new f_tar (rsync compresses f_tar on its own).

A multi-round protocol

k blocks of n/k elements, log(n/k) levels.

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k · lg n · lg(n/k)) bits.

Next lecture

Set reconciliation
Problem: Given two sets S_A and S_B of integer values located
on two machines A and B, determine the difference
between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

[Figure: P aligned at position i of T, i.e. against the suffix T[i,N].]
Occurrences of P in T = All suffixes of T having P as a prefix

Example: P = si, T = mississippi ⟹ occurrences at positions 4, 7.

SUF(T) = sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
T# = mississippi#

[Figure: the suffix tree of T#; edges are labelled with substrings (e.g. "i", "s", "si", "ssi", "ppi#", "pi#", "i#", "mississippi#") and the 12 leaves are labelled with the starting positions 1..12 of the corresponding suffixes.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Θ(N²) space, if the suffixes are stored explicitly
SA   SUF(T)
12   #
11   i#
 8   ippi#
 5   issippi#
 2   ississippi#
 1   mississippi#
10   pi#
 9   ppi#
 7   sippi#
 4   sissippi#
 6   ssippi#
 3   ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
⟹ In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step.

[Figure: two steps of the binary search for P = si over the SA of T = mississippi#; each comparison tells whether P is larger or smaller than the probed suffix.]

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char comparisons
⟹ overall, O(p log2 N) time
  improvable to O(p + log2 N) [Manber-Myers, '90]
  and to O(p + log2 |S|) [Cole et al, '06]
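A minimal sketch (Python): the "elegant but inefficient" SA construction plus the indirect binary search; on T = mississippi# and P = si it reports the occurrences 4 and 7 (1-based), as above:

def sa_build(t):
    return sorted(range(len(t)), key=lambda i: t[i:])    # O(n^2 log n): fine only for small texts

def sa_search(t, sa, p):
    lo, hi = 0, len(sa)
    while lo < hi:                                       # first suffix >= p (O(p) per comparison)
        mid = (lo + hi) // 2
        if t[sa[mid]:sa[mid] + len(p)] < p:
            lo = mid + 1
        else:
            hi = mid
    occ = []
    while lo < len(sa) and t[sa[lo]:sa[lo] + len(p)] == p:
        occ.append(sa[lo] + 1)                           # 1-based positions, as in the slides
        lo += 1
    return sorted(occ)

t = "mississippi#"
sa = sa_build(t)
print([i + 1 for i in sa])        # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(sa_search(t, sa, "si"))     # [4, 7]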

Locating the occurrences
[Figure: in the SA of T = mississippi#, all suffixes prefixed by si are contiguous; binary-searching for si# and si$ delimits the range, here occ = 2, at positions 4 and 7.]

Suffix Array search:
• O(p + log2 N + occ) time

Suffix Trays: O(p + log2 |S| + occ)   [Cole et al., '06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp = 0 0 1 4 0 0 1 0 2 1 3
SA  = 12 11 8 5 2 1 10 9 7 4 6 3

T = mississippi#
(e.g. the entry Lcp = 4 is the lcp between issippi… and ississippi…)
• How long is the common prefix between T[i,...] and T[j,...] ?
  • It is the minimum of the subarray Lcp[h,k-1], where SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  • Search for an entry Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  • Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.


Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
0 000 ...........0 x in b in ary
Length-1


x > 0 and Length = log2 x +1
e.g., 9 represented as <000,1001>.



g-code for x takes 2 log2 x +1 bits

(ie. factor of 2 from optimal)



Optimal for Pr(x) = 1/2x2, and i.i.d integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8

6

3

59

7

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How much good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
1 ≥Si=1,...,x pi ≥ x * px  x ≤ 1/px

How good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):



p i  | g ( i ) |

i  1 ,..., S

This is:



i  1 ,.., S

p i  [ 2 * log

1

 1]

pi

 2 * H 0 ( X ) 1
No much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1n 2n 3n… nn  Huff = O(n2 log n), MTF = O(n log n) + n2

No much worse than Huffman
...but it may be far better

MTF: how good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
Put S in the front and consider the cost of encoding:
S

O ( S log S ) 



nx

g(p - p )

x 1 i  2

x

x

i

i -1

S

By Jensen’s:

 O ( S log S ) 

n
x 1

x

[ 2 * log

N

 1]

nx

 O ( S log S )  N * [ 2 * H 0 ( X )  1]
L a [ mtf ]  2 * H 0 ( X )  O (1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1n 2n 3n… nn 

There is a memory

Huff(X) = Q(n2 log n) > Rle(X) = Q(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive), with

 f(i) = ∑j=1,...,i-1 p(j)

e.g. for p(a) = .2, p(b) = .5, p(c) = .3:
 f(a) = .0, f(b) = .2, f(c) = .7
so a = [0, .2), b = [.2, .7), c = [.7, 1.0).

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac

[Figure: the interval [0,1) is narrowed symbol by symbol:
 b  [.2, .7)   (size .5)
 a  [.2, .3)   (size .1)
 c  [.27, .3)  (size .03)]

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities p[c] use the following:

 l0 = 0,  li = li-1 + si-1 · f[ci]
 s0 = 1,  si = si-1 · p[ci]

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

 sn = ∏i=1,...,n p[ci]

The interval for a message sequence will be called the sequence interval
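A minimal sketch of the sequence-interval computation, run on the toy model p(a) = .2, p(b) = .5, p(c) = .3; floating point is used only for illustration (a real coder uses the integer version shown later):

  def sequence_interval(msg, p):
      # f[c] = cumulative probability of the symbols preceding c (c not included)
      f, acc = {}, 0.0
      for s in sorted(p):
          f[s] = acc
          acc += p[s]
      l, size = 0.0, 1.0                      # l0 = 0, s0 = 1
      for c in msg:
          l = l + size * f[c]                 # li = li-1 + si-1 * f[ci]
          size = size * p[c]                  # si = si-1 * p[ci]
      return l, size

  # sequence_interval("bac", {'a': .2, 'b': .5, 'c': .3}) ≈ (0.27, 0.03), i.e. [.27, .3)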

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:

[Figure: .49 lies in the symbol interval of b, [.2,.7); within it, .49 lies in the sub-interval of b, [.3,.55); within that, .49 lies in the sub-interval of c, [.475,.55).]

The message is bbc.

Representing a real number
Binary fractional representation:
 .75 = .11      1/3 = .010101...      11/16 = .1011

Algorithm
 1. x = 2·x
 2. If x < 1 output 0
 3. else x = x - 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
e.g. [0,.33) → .01      [.33,.66) → .1      [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions.

          min       max       interval
 .11      .110...   .111...   [.75, 1.0)
 .101     .1010...  .1011...  [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval is contained in the sequence interval (a dyadic number).

[Figure: sequence interval [.61, .79); the code interval of .101, i.e. [.625, .75), is contained in it.]

Can use L + s/2 truncated to 1 + ⌈log (1/s)⌉ bits

Bound on Arithmetic length
Note that ⌈-log s⌉ + 1 = ⌈log (2/s)⌉

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most
 1 + ⌈log (1/s)⌉ =
 = 1 + ⌈log ∏i (1/pi)⌉
 ≤ 2 + ∑i=1,n log (1/pi)
 = 2 + ∑k=1,|S| n·pk·log (1/pk)
 = 2 + n·H0 bits

nH0 + 0.02·n bits in practice, because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

[Figure: the ATB takes the current interval (L,s) and a symbol c drawn from the distribution (p1,....,pS), and produces the new interval (L',s') = (L + s·f(c), s·p(c)).]

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox

[Figure: each symbol s (= c or esc) is fed to the ATB with probability p[ s | context ], mapping the current interval (L,s) to (L',s').]

Encoder and Decoder must know the protocol for selecting the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
String = ACCBACCACBA B,   k = 2

 Context: empty      Counts: A = 4, B = 2, C = 5, $ = 3

 Context: A          Counts: C = 3, $ = 1
 Context: B          Counts: A = 2, $ = 1
 Context: C          Counts: A = 1, B = 2, C = 2, $ = 3

 Context: AC         Counts: B = 1, C = 2, $ = 2
 Context: BA         Counts: C = 1, $ = 1
 Context: CA         Counts: C = 1, $ = 1
 Context: CB         Counts: A = 2, $ = 1
 Context: CC         Counts: A = 1, B = 1, $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary: ???      Cursor (all substrings starting here)      <2,3,c>

Algorithm's step:
 Output <d, len, c>, where
   d = distance of the copied string wrt the current position
   len = length of the longest match
   c = next char in the text beyond the longest match
 Advance by len + 1

A buffer "window" of fixed length moves along the text

Example: LZ77 with window,  window size = 6
T = a a c a a c a b c a b a a a c

 cursor at 1   (0,0,a)
 cursor at 2   (1,1,c)
 cursor at 4   (3,4,b)
 cursor at 9   (3,3,a)
 cursor at 13  (1,2,c)

Each triple gives (distance of the longest match within W, its length, the next character)
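A minimal sketch of LZ77 parsing with a bounded window, following the step described above; it reproduces the triples of the example (window handling is simplified, and the names are ours):

  def lz77_encode(t, w=6):
      i, out = 0, []
      while i < len(t):
          best_d, best_len = 0, 0
          for d in range(1, min(w, i) + 1):          # candidate copy distances within W
              l = 0
              while i + l < len(t) - 1 and t[i + l] == t[i - d + l]:
                  l += 1                             # the match may run past the cursor (overlap)
              if l > best_len:
                  best_d, best_len = d, l
          out.append((best_d, best_len, t[i + best_len]))   # next char beyond the match
          i += best_len + 1                          # advance by len + 1
      return out

  # lz77_encode("aacaacabcabaaac") ==
  #   [(0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), (3, 3, 'a'), (1, 2, 'c')]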

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if len > d ? (overlap with the part of the text still to be written)
 E.g. seen = abcd, next codeword is (2,9,e)
 Simply copy starting at the cursor:
   for (i = 0; i < len; i++)
     out[cursor+i] = out[cursor-d+i];
 Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
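A minimal sketch of LZW, with the dictionary initialized to the 256 one-byte strings; lzw_decode shows the "one step behind" bookkeeping and the special SSc case (this is a plain textbook version, not the slides' toy ids):

  def lzw_encode(text):
      dic = {chr(i): i for i in range(256)}
      nxt, out, S = 256, [], ""
      for c in text:
          if S + c in dic:
              S += c                          # keep extending the longest match
          else:
              out.append(dic[S])              # emit the id of S (the extra char c is NOT sent)
              dic[S + c] = nxt                # ...but Sc is still added to the dictionary
              nxt += 1
              S = c
      if S:
          out.append(dic[S])
      return out

  def lzw_decode(codes):
      if not codes:
          return ""
      dic = {i: chr(i) for i in range(256)}
      nxt, prev = 256, chr(codes[0])
      out = [prev]
      for code in codes[1:]:
          cur = dic[code] if code in dic else prev + prev[0]   # the SSc special case
          out.append(cur)
          dic[nxt] = prev + cur[0]            # the decoder is one step behind the coder
          nxt += 1
          prev = cur
      return "".join(out)

  # lzw_decode(lzw_encode("aabaacababacb")) == "aabaacababacb"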

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example

 Input    Dict        Output so far
 112                  a
 112      256=aa      a a
 113      257=ab      a a b
 256      258=ba      a a b a a
 114      259=aac     a a b a a c
 257      260=ca      a a b a a c a b
 261      261=aba     a a b a a c a b a b a     (261 is used one step before the
                                                 decoder creates it: the SSc case)
 114      262=abac    a a b a a c a b a b a c

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform (1994)
Let us be given a text T = mississippi#

Write down all the cyclic rotations of T:
 mississippi#
 ississippi#m
 ssissippi#mi
 sissippi#mis
 issippi#miss
 ssippi#missi
 sippi#missis
 ippi#mississ
 ppi#mississi
 pi#mississip
 i#mississipp
 #mississippi

Sort the rows lexicographically. The first column is
 F = # i i i i m p p s s s s
and the last column is
 L = i p s s m # p i s s i i

A famous example: the same transform on a much longer text...

A useful tool: the L → F mapping

[Figure: the sorted rotations again, with first column F = # i i i i m p p s s s s and last column L = i p s s m # p i s s i i; the text T is unknown to the decoder.]

How do we map L's chars onto F's chars ?
... we need to distinguish equal chars in F...

Take two equal chars of L; rotate their rows rightward: they keep the same relative order !!

The BWT is invertible

[Figure: F = # i i i i m p p s s s s, L = i p s s m # p i s s i i; the text T is unknown.]

Two key properties:
1. The LF-array maps L's chars to F's chars
2. L[ i ] precedes F[ i ] in T

Reconstruct T backward:
 T = .... i p p i #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
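A minimal sketch of the transform and of its inversion via the LF-mapping, directly following the two key properties above (quadratic-space construction, only to fix ideas; it assumes T ends with a unique smallest char '#'):

  def bwt(t):
      rot = sorted(t[i:] + t[:i] for i in range(len(t)))   # sort all cyclic rotations
      return "".join(r[-1] for r in rot)                   # L = last column

  def inverse_bwt(L):
      n = len(L)
      order = sorted(range(n), key=lambda i: (L[i], i))    # where each L-char goes in F;
      LF = [0] * n                                         # equal chars keep their order
      for f_pos, l_pos in enumerate(order):
          LF[l_pos] = f_pos
      out, r = [min(L)], 0                                 # row 0 starts with the smallest char '#'
      for _ in range(n - 1):
          out.append(L[r])                                 # L[r] precedes F[r] in T
          r = LF[r]
      return "".join(reversed(out))

  # bwt("mississippi#") == "ipssm#pissii"
  # inverse_bwt("ipssm#pissii") == "mississippi#"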

How to compute the BWT ?

 SA    BWT matrix      L
 12    #mississipp     i
 11    i#mississip     p
  8    ippi#missis     s
  5    issippi#mis     s
  2    ississippi#     m
  1    mississippi     #
 10    pi#mississi     p
  9    ppi#mississ     i
  7    sippi#missi     s
  4    sissippi#mi     s
  6    ssippi#miss     i
  3    ssissippi#m     i

We said that: L[i] precedes F[i] in T
Given SA and T, we have L[i] = T[SA[i]-1]   (e.g. L[3] = T[SA[3]-1] = T[7])

How to construct SA from T ?
Input: T = mississippi#

 SA
 12   #
 11   i#
  8   ippi#
  5   issippi#
  2   ississippi#
  1   mississippi#
 10   pi#
  9   ppi#
  7   sippi#
  4   sissippi#
  6   ssippi#
  3   ssissippi#

Elegant but inefficient. Obvious inefficiencies:
 • Θ(n² log n) time in the worst case
 • Θ(n log n) cache misses or I/O faults

Many algorithms, now...
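A minimal sketch of that "elegant but inefficient" construction, plus the L[i] = T[SA[i]-1] rule (1-based positions as in the slides):

  def suffix_array(T):
      # sort the suffixes' starting positions: Θ(n² log n) time in the worst case
      return sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])

  def bwt_from_sa(T, SA):
      # L[i] = T[SA[i]-1], i.e. the char (cyclically) preceding each suffix
      return "".join(T[s - 2] if s > 1 else T[-1] for s in SA)

  # T = "mississippi#"
  # suffix_array(T) == [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
  # bwt_from_sa(T, suffix_array(T)) == "ipssm#pissii"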

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution
Altavista crawl (1999) and WebBase crawl (2001): the indegree follows a power law distribution

 Pr[ in-degree(u) = k ]  ∝  1 / k^a,    a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph

Adjacency list with compressed gaps (locality):
 Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}
 For negative entries: ...

Adjacency list with copy-lists (similarity), reference chains possibly limited:
 Each bit of y's copy-list tells whether the corresponding successor of y is also a successor of the reference x;
 the reference index is chosen in [0,W] so as to give the best compression.

Copy-blocks = RLE(Copy-list), i.e. RLE on the bit sequences:
 The first copy block is 0 if the copy list starts with 0;
 the last block is omitted (we know the length…);
 the length is decremented by one for all blocks.

This is a Java and C++ lib   (≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides and efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
            Emacs size   Emacs time
 uncompr    27Mb         ---
 gzip       8Mb          35 secs
 zdelta     1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: a weighted graph over the files plus a dummy node 0; edge weights are zdelta sizes (e.g. 20, 123, 220, 620, 2000) and gzip sizes from the dummy node; the min branching picks the cheapest reference for each file.]

            space   time
 uncompr    30Mb    ---
 tgz        20%     linear
 THIS       8%      quadratic

Improvement (what about many-to-one compression of a group of files?)

Problem: Constructing G is very costly: n² edge calculations (zdelta executions)
 We wish to exploit some pruning approach
 Collection analysis: Cluster the files that appear similar and thus are good candidates for zdelta-compression. Build a sparse weighted graph G'F containing only edges between those pairs of files
 Assign weights: Estimate appropriate edge weights for G'F, thus saving zdelta executions. Nonetheless, strictly n² time

            space   time
 uncompr    260Mb   ---
 tgz        12%     2 mins
 THIS       8%      16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

          gcc size   emacs size
 total    27288      27326
 gzip     7563       8577
 zdelta   227        1431
 rsync    964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree

T# = mississippi#
     1 2 3 4 5 6 7 8 9 10 11 12

[Figure: the suffix tree of T#, with edge labels such as #, i, s, p, si, ssi, i#, pi#, ppi#, mississippi#, and one leaf per suffix, labeled with its starting position 1..12.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

Storing SUF(T) explicitly would take Θ(N²) space; store only the suffix pointers SA:

 SA    SUF(T)
 12    #
 11    i#
  8    ippi#
  5    issippi#
  2    ississippi#
  1    mississippi#
 10    pi#
  9    ppi#
  7    sippi#
  4    sissippi#
  6    ssippi#
  3    ssissippi#

T = mississippi#

Suffix Array
 • SA: Θ(N log2 N) bits
 • Text T: N chars
  In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step.

[Figure: binary search for P = si over SA = 12 11 8 5 2 1 10 9 7 4 6 3 on T = mississippi#; at some steps P is larger than the middle suffix, at others it is smaller.]

Suffix Array search
 • O(log2 N) binary-search steps
 • Each step takes O(p) char comparisons
  overall, O(p log2 N) time
   [reduced to O(p + log2 N) by Manber-Myers, '90; |S| in place of N by Cole et al., '06]

Locating the occurrences

[Figure: the suffixes prefixed by P = si are contiguous in SA; occ = 2, at positions 4 and 7 of T.]

Suffix Array search
 • O(p + log2 N + occ) time
Suffix Trays: O(p + log2 |S| + occ)   [Cole et al., '06]
String B-tree   [Ferragina-Grossi, '95]
Self-adjusting Suffix Arrays   [Ciriani et al., '02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does it exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does it exist a substring of length ≥ L occurring ≥ C times ?
• Search for Lcp[i,i+C-2] whose entries are ≥ L
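A minimal sketch of the Lcp-based test for repeated substrings described above (quadratic, only to fix ideas; the suffix array is built by the naive sorted-suffixes construction):

  def lcp_array(T, SA):
      # Lcp[i] = longest common prefix of the suffixes starting at SA[i] and SA[i+1]
      def lcp(i, j):
          a, b, l = T[i - 1:], T[j - 1:], 0
          while l < min(len(a), len(b)) and a[l] == b[l]:
              l += 1
          return l
      return [lcp(SA[i], SA[i + 1]) for i in range(len(SA) - 1)]

  def has_repeat_of_length(T, L):
      # a substring of length >= L occurs at least twice iff some Lcp entry is >= L
      SA = sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])
      return any(x >= L for x in lcp_array(T, SA))

  # has_repeat_of_length("mississippi#", 4) == True    ("issi" occurs twice)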


Slide 135

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
 C · p · f/(1+f)
This is at least 10^4 · f/(1+f)
If we fetch B ≈ 4KB in time C, and the algorithm uses all of them:
 (1/B) · (p · f/(1+f) · C)  ≈  30 · f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray
 Goal: Given a stock, and its D-performance over time, find the time window in which it achieved the best "market performance".
 Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

       4K    8K   16K   32K    128K   256K   512K   1M
 n³    22s   3m   26m   3.5h   28h    --     --     --
 n²    0     0    0     1s     26s    106s   7m     28m
An optimal solution
We assume every subsum ≠ 0

[Figure: A is split into a prefix with negative sum followed by the part, with positive running sum, that contains the Optimum.]

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm
 sum = 0; max = -1;
 For i = 1,...,n do
   If (sum + A[i] ≤ 0) then sum = 0;
   else { sum += A[i]; max = MAX{max, sum}; }

Note:
 • sum < 0 when OPT starts;
 • sum > 0 within OPT
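A minimal sketch of the one-pass scan above (as in the slide, it assumes the optimum is positive):

  def max_subarray_sum(A):
      best, cur = -1, 0
      for x in A:
          if cur + x <= 0:
              cur = 0                    # the optimum never starts inside a negative prefix
          else:
              cur += x
              best = max(best, cur)
      return best

  # max_subarray_sum([2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]) == 12   (window 6 1 -2 4 3)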

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB
 n insertions  data get distributed arbitrarily !!!

[Figure: B-tree internal nodes, B-tree leaves ("tuple pointers"), and the tuples on disk. What about listing the tuples in order ?]

Possibly 10^9 random I/Os = 10^9 × 5ms  2 months

Binary Merge-Sort

Merge-Sort(A,i,j)
  if (i < j) then
    m = (i+j)/2;             // Divide
    Merge-Sort(A,i,m);       // Conquer
    Merge-Sort(A,m+1,j);
    Merge(A,i,m,j)           // Combine

Cost of Mergesort on large data
 Take Wikipedia in Italian, compute word freq:
   n = 10^9 tuples  few GBs
 Typical Disk (Seagate Cheetah 150GB): seek time ~5ms
 Analysis of mergesort on disk:
   It is an indirect sort: Θ(n log2 n) random I/Os
   [5ms] × n log2 n ≈ 1.5 years

In practice, it is faster because of caching... (2 passes (R/W))

Merge-Sort Recursion Tree

[Figure: the recursion tree of merge-sort on a small example; log2 N levels, with sorted runs merged pairwise level by level. How do we deploy the disk/memory features ?]

If the run-size is larger than B (i.e. after the first step!!), fetching all of it in memory for merging does not help.

With internal memory M: N/M runs, each sorted in internal memory (no I/Os)
 I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort
 The key is to balance run-size and #runs to merge
 Sort N items with main memory M and disk pages of B items:
   Pass 1: Produce (N/M) sorted runs.
   Pass i: merge X ≤ M/B runs at a time   log_{M/B} N/M passes

[Figure: X input buffers (INPUT 1, INPUT 2, ..., INPUT X) of B items each and one OUTPUT buffer in main memory, streaming runs from disk to disk.]

Multiway Merging

[Figure: buffers Bf1, ..., Bfx hold the current page of runs 1..X = M/B, with pointers p1, ..., pX; repeatedly move min(Bf1[p1], Bf2[p2], …, Bfx[pX]) to the output buffer Bfo; fetch the next page of run i when pi = B, and flush Bfo to the merged output run when it is full, until EOF.]

Cost of Multi-way Merge-Sort
 Number of passes = log_{M/B} #runs ≤ log_{M/B} N/M
 Optimal cost = Θ((N/B) log_{M/B} N/M) I/Os

In practice
 M/B ≈ 1000  #passes = log_{M/B} N/M ≈ 1
 One multiway merge  2 passes = few mins   (tuning depends on disk features)

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!
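A minimal sketch of the X-way merge at the heart of each pass, with a heap over the current heads of the runs (in the external-memory version each run is streamed page by page, B items at a time):

  import heapq

  def multiway_merge(runs):
      # runs: already-sorted lists; returns their merged sorted sequence
      heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
      heapq.heapify(heap)
      out = []
      while heap:
          x, i, j = heapq.heappop(heap)        # the smallest among the X current heads
          out.append(x)
          if j + 1 < len(runs[i]):
              heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
      return out

  # multiway_merge([[1, 2, 5, 10], [2, 7, 9, 13], [3, 4, 8, 15]])
  #   == [1, 2, 2, 3, 4, 5, 7, 8, 9, 10, 13, 15]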

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements
 Goal: Top queries over a stream of N items (S large).
 Math Problem: Find the item y whose frequency is > N/2, using the smallest space (i.e. if the mode occurs > N/2 times).

A = b a c c c d c b a a a c c b c c c

Algorithm
 Use a pair of variables <X,C>, initialized with the first item and C = 1
 For each subsequent item s of the stream:
   if (X == s) then C++
   else { C--; if (C == 0) { X = s; C = 1; } }
 Return X;

Proof
Problems could arise only if the claimed majority y occurs ≤ N/2 times.
If at the end X ≠ y, then every one of y's occurrences has a distinct "negative" mate (an item it was cancelled against). Hence these mates are ≥ #occ(y), so N ≥ 2·#occ(y), contradicting #occ(y) > N/2.

Toy problem #4: Indexing
 Consider the following TREC collection:
   N = 6 × 10^9 chars, size = 6GB
   n = 10^6 documents
   TotT = 10^9 terms (avg term length is 6 chars)
   t = 5 × 10^5 distinct terms

What kind of data structure do we build to support word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million documents, t = 500K terms
(1 if the play contains the word, 0 otherwise)

            Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
 Antony              1                  1             0          0       0        1
 Brutus              1                  1             0          1       0        0
 Caesar              1                  1             0          1       1        1
 Calpurnia           0                  1             0          0       0        0
 Cleopatra           1                  0             0          0       0        0
 mercy               1                  0             1          1       1        1
 worser              1                  0             1          1       1        0

Space is 500GB !

Solution 2: Inverted index

 Brutus      2 4 8 16 32 64 128
 Caesar      1 2 3 5 8 13 21 34
 Calpurnia   13 16

We can still do better, i.e. 30-50% of the original text:
 1. Typically we use about 12 bytes per posting
 2. We have 10^9 total terms  at least 12GB of space
 3. Compressing the 6GB of documents gets 1.5GB of data
Better index, but yet it is >10 times the text !!!!

Please !!
Do not underestimate the features of disks in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM into fewer bits ?
NO, they are 2^n but we have fewer compressed msgs:

 ∑i=1,...,n-1 2^i = 2^n - 2

We need to talk about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the self information of s is:

 i(s) = log2 (1/p(s)) = - log2 p(s)

Lower probability  higher information

Entropy is the weighted average of i(s):

 H(S) = ∑s∈S p(s) · log2 (1/p(s))   bits
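A minimal sketch computing the (empirical, 0-th order) entropy of a string, i.e. the formula above with p(s) taken as the observed frequencies:

  from math import log2
  from collections import Counter

  def empirical_entropy(s):
      n = len(s)
      return sum((c / n) * log2(n / c) for c in Counter(s).values())   # sum of p(s) log2 1/p(s)

  # empirical_entropy("aabb") == 1.0     # two equiprobable symbols: 1 bit each
  # empirical_entropy("aaaa") == 0.0     # a certain symbol carries no information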

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword lengths L[s], the average length is defined as

 La(C) = ∑s∈S p(s) · L[s]

We say that a prefix code C is optimal if for all prefix codes C', La(C) ≤ La(C')

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

[Figure: the Huffman tree built by merging a(.1) and b(.2) into a node of weight .3, merging it with c(.2) into .5, and finally with d(.5) into the root (1).]

a = 000, b = 001, c = 01, d = 1
There are 2^(n-1) "equivalent" Huffman trees

What about ties (and thus, tree depth) ?
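A minimal sketch of the construction (repeatedly merge the two least-probable trees); how ties are broken is exactly what produces the many "equivalent" trees, so only the codeword lengths below are guaranteed:

  import heapq
  from itertools import count

  def huffman_codes(probs):
      tie = count()                            # tie-breaker, so trees are never compared
      heap = [(p, next(tie), {s: ""}) for s, p in probs.items()]
      heapq.heapify(heap)
      while len(heap) > 1:
          p1, _, c1 = heapq.heappop(heap)      # the two least-probable trees...
          p2, _, c2 = heapq.heappop(heap)
          merged = {s: "0" + w for s, w in c1.items()}
          merged.update({s: "1" + w for s, w in c2.items()})
          heapq.heappush(heap, (p1 + p2, next(tie), merged))   # ...are merged
      return heap[0][2]

  # huffman_codes({'a': .1, 'b': .2, 'c': .2, 'd': .5}) gives codeword lengths
  #   |a| = 3, |b| = 3, |c| = 2, |d| = 1, as in a = 000, b = 001, c = 01, d = 1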

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. Its self information is

 -log(.999) ≈ .00144

If we were to send 1000 such symbols we might hope to use 1000 × .00144 ≈ 1.44 bits.
Using Huffman, we take at least 1 bit per symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
  1 extra bit per macro-symbol = 1/k extra bits per symbol
  Larger model to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:
 The model takes |S|^k · (k · log |S|) + h² bits   (where h might be |S|)
 It is H0(S^L) ≤ L · Hk(S) + O(k · log |S|), for each k ≤ L

Compress + Search ?   [Moura et al, 98]

Compressed text derived from a word-based Huffman:
  Symbols of the Huffman tree are the words of T
  The Huffman tree has fan-out 128
  Codewords are byte-aligned and tagged

[Figure: T = "bzip or not bzip"; each word and separator (bzip, or, not, space) gets a byte-aligned codeword of 7-bit pieces, with the first bit of each byte used as a tag; C(T) is the concatenation of these tagged codewords.]

CGrep and other ideas...
P = bzip = 1a 0b

[Figure: GREP is run directly on the compressed text C(T) of T = "bzip or not bzip": the codeword of P is compared against C(T) codeword by codeword, the tag bits prevent false matches inside other codewords, and each comparison answers yes/no.]

Speed ≈ Compression ratio

You find this at
You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary = { bzip, not, or, space }
P = bzip = 1a 0b
S = "bzip or not bzip"

[Figure: the codeword of P is matched directly against the compressed text C(S), answering yes/no at each codeword boundary; the two occurrences of bzip are found.]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the pattern P[1,m] in the text T[1,n].

 T = A B C A B D A B ...      P = A B

 Naïve solution
  For any position i of T, check if T[i,i+m-1] = P[1,m]
  Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
  Knuth-Morris-Pratt
  Boyer-Moore
  Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons
 Strings are also numbers, H: strings → numbers.
 Let s be a string of length m:

 H(s) = ∑i=1,...,m 2^(m-i) · s[i]

 P = 0101    H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5
 s = s' if and only if H(s) = H(s')

Definition: let Tr denote the m-length substring of T starting at position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons
 Exact match = Scan T and compare H(Ti) and H(P)
 There is an occurrence of P starting at position r of T if and only if H(P) = H(Tr)

 T = 10110101,  P = 0101,  H(P) = 5
 H(T2) = H(0110) = 6 ≠ H(P)
 H(T5) = H(0101) = 5 = H(P)    Match!

Arithmetic replaces Comparisons
 We can compute H(Tr) from H(Tr-1):

 H(Tr) = 2·H(Tr-1) - 2^m · T(r-1) + T(r+m-1)

 T = 10110101
 T1 = 1 0 1 1    H(T1) = H(1011) = 11
 T2 = 0 1 1 0    H(T2) = H(0110) = 2·11 - 2⁴·1 + 0 = 22 - 16 = 6
Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P = 101111,  q = 7
H(P) = 47,  Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally (keeping every intermediate value mod 7):
 1·2 (mod 7) + 0 = 2
 2·2 (mod 7) + 1 = 5
 5·2 (mod 7) + 1 = 11 (mod 7) = 4
 4·2 (mod 7) + 1 = 9 (mod 7) = 2
 2·2 (mod 7) + 1 = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1), since
 2^m (mod q) = 2·(2^(m-1) (mod q)) (mod q)
Intermediate values are also small! (< 2q)
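A minimal sketch of the fingerprint scan on a binary text, with the toy prime q = 7 of the example; matches are re-checked here, so false matches are filtered out (a real implementation picks a random large prime):

  def karp_rabin(T, P, q=7):
      n, m = len(T), len(P)
      h = lambda s: sum(int(c) << (m - 1 - i) for i, c in enumerate(s)) % q
      hp, ht = h(P), h(T[:m])
      top = pow(2, m - 1, q)                    # 2^(m-1) mod q, kept small
      hits = []
      for r in range(n - m + 1):
          if ht == hp and T[r:r + m] == P:      # verify, to rule out false matches
              hits.append(r + 1)                # 1-based positions, as in the slides
          if r + m < n:                         # roll: drop T[r], append T[r+m], all mod q
              ht = ((ht - int(T[r]) * top) * 2 + int(T[r + m])) % q
      return hits

  # karp_rabin("10110101", "0101") == [5]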

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary = { bzip, not, or, space }
P = bzip = 1a 0b
S = "bzip or not bzip"

[Figure: the codeword of P is compared against C(S) codeword by codeword, using the exact-matching machinery just described; both occurrences of bzip are found.]

Speed ≈ Compression ratio

The Shift-And method
 Define M to be a binary m by n matrix such that:
   M(i,j) = 1 iff the first i characters of P exactly match the i characters of T ending at character j,
   i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1 ... j]

 Example: T = california and P = for

[Figure: the 3×10 matrix M; its only 1-entries are M(1,5), M(2,6), M(3,7), since "f", "fo", "for" end at positions 5, 6, 7 of T.]

How does M solve the exact match problem?

How to construct M
 We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one
   Machines can perform bit and arithmetic operations between two words in constant time.
 Examples:
   And(A,B) is the bit-wise and between A and B.
   BitShift(A) is the value derived by shifting A's bits down by one and setting the first bit to 1,
   e.g. BitShift( [0,1,1,0,1]ᵀ ) = [1,0,1,1,0]ᵀ

 Let w be the word size (e.g., 32 or 64 bits). We'll assume m = w. NOTICE: any column of M fits in a memory word.
How to construct M
 We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one
 We define the m-length binary vector U(x) for each character x in the alphabet: U(x) is set to 1 at the positions in P where character x appears.
 Example: P = abaac
   U(a) = [1,0,1,1,0]ᵀ    U(b) = [0,1,0,0,0]ᵀ    U(c) = [0,0,0,0,1]ᵀ
How to construct M
 Initialize column 0 of M to all zeros
 For j > 0, the j-th column is obtained by

 M(j) = BitShift( M(j-1) ) & U( T[j] )

 For i > 1, entry M(i,j) = 1 iff
  (1) the first i-1 characters of P match the i-1 characters of T ending at character j-1   ⇔ M(i-1,j-1) = 1
  (2) P[i] = T[j]   ⇔ the i-th bit of U(T[j]) = 1
 BitShift moves bit M(i-1,j-1) into the i-th position;
 AND-ing this with the i-th bit of U(T[j]) establishes whether both conditions hold.
An example
T = x a b x a b a a c a,  P = a b a a c

[Figure: the columns M(1), ..., M(10) computed step by step via M(j) = BitShift(M(j-1)) & U(T[j]); e.g. M(1) = 0 (T[1] = x does not appear in P), M(2) has a 1 only in row 1, M(3) only in row 2, and M(9) has a 1 in row 5, signalling the occurrence of P = abaac ending at position 9.]

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: Another solution
Dictionary = { bzip, not, or, space }
P = bzip = 1a 0b
S = "bzip or not bzip"

[Figure: the codeword of P is searched in C(S) with the Shift-And machinery, answering yes/no at each codeword boundary; both occurrences of bzip are found.]

Speed ≈ Compression ratio

Problem 2
Dictionary = { bzip, not, or, space }
Given a pattern P, find all the occurrences in S of all the terms containing P as a substring.
P = o
S = "bzip or not bzip"
 not = 1g 0g 0a
 or  = 1g 0a 0b

[Figure: the codewords of both terms containing 'o' (or, not) must be searched in the compressed text C(S).]

Speed ≈ Compression ratio? No! Why?
A scan of C(S) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want to find all the occurrences of those patterns in the text T[1,n].

 Naïve solution
  Use an (optimal) exact-matching algorithm, searching for each pattern in P
  Complexity: O(nl + m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
  Complexity: O(n + l + m) time

A simple extension of Shift-And
 S is the concatenation of the patterns in P
 R is a bitmap of length m: R[i] = 1 iff S[i] is the first symbol of a pattern
 Use a variant of the Shift-And method searching for S:
   For any symbol c, U'(c) = U(c) AND R, so U'(c)[i] = 1 iff S[i] = c and it is the first symbol of a pattern
   For any step j, compute M(j), then M(j) OR U'(T[j]). Why? This sets to 1 the first bit of each pattern that starts with T[j]
   Check if there are occurrences ending in j. How?

Problem 3
Dictionary = { bzip, not, or, space }
Given a pattern P, find all the occurrences in S of all the terms containing P as a substring, allowing at most k mismatches.
P = bot,  k = 2
S = "bzip or not bzip"

[Figure: the same compressed text C(S); e.g. the term "not" matches P = bot with 1 mismatch.]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep
 Our current goal: given k, find all the occurrences of P in T with up to k mismatches
 We define the matrix Ml to be an m by n binary matrix, such that:
   Ml(i,j) = 1 iff there are no more than l mismatches between the first i characters of P and the i characters of T ending at character j.
 What is M0?
 How does Mk solve the k-mismatch problem?

Computing Mk
 We compute Ml for all l = 0, … , k.
 For each j compute M(j), M1(j), … , Mk(j)
 For all l, initialize Ml(0) to the zero vector.
 In order to compute Ml(j), we observe that there is a match iff one of the two cases below holds.

Computing Ml: case 1
 The first i-1 characters of P match a substring of T ending at j-1 with at most l mismatches, and the next pair of characters in P and T are equal:

 BitShift( Ml(j-1) ) & U( T[j] )

Computing Ml: case 2
 The first i-1 characters of P match a substring of T ending at j-1 with at most l-1 mismatches (one mismatch is spent on the current position):

 BitShift( Ml-1(j-1) )

Computing Ml
 We compute Ml for all l = 0, … , k; for each j compute M(j), M1(j), … , Mk(j); for all l, initialize Ml(0) to the zero vector.
 Putting the two cases together:

 Ml(j) = [ BitShift( Ml(j-1) ) & U( T(j) ) ]  OR  BitShift( Ml-1(j-1) )
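A minimal sketch of the k-mismatch scan, keeping one bit-column per allowed number of mismatches l = 0..k and applying the recurrence above:

  def shift_and_kmismatch(T, P, k):
      m = len(P)
      U = {}
      for i, c in enumerate(P):
          U[c] = U.get(c, 0) | (1 << i)
      M = [0] * (k + 1)                         # M[l] = current column of M^l
      hits = []
      for j, c in enumerate(T, start=1):
          prev = M[:]                           # the columns of step j-1
          for l in range(k + 1):
              col = ((prev[l] << 1) | 1) & U.get(c, 0)       # case 1: P[i] = T[j]
              if l > 0:
                  col |= (prev[l - 1] << 1) | 1              # case 2: spend one mismatch
              M[l] = col
          if M[k] & (1 << (m - 1)):             # P ends at j with <= k mismatches
              hits.append(j - m + 1)
      return hits

  # shift_and_kmismatch("aatatccacaa", "atcgaa", 2) == [4]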

Example
T = x a b x a b a a c a,  P = a b a a d

      j = 1 2 3 4 5 6 7 8 9 10
 M0 =     0 1 0 0 1 0 1 1 0 1
          0 0 1 0 0 1 0 0 0 0
          0 0 0 0 0 0 1 0 0 0
          0 0 0 0 0 0 0 1 0 0
          0 0 0 0 0 0 0 0 0 0

 M1 =     1 1 1 1 1 1 1 1 1 1
          0 0 1 0 0 1 0 1 1 0
          0 0 0 1 0 0 1 0 0 1
          0 0 0 0 1 0 0 1 0 0
          0 0 0 0 0 0 0 0 1 0

(rows are i = 1,...,5; M1(5,9) = 1: P occurs at positions 5..9 of T with one mismatch)
How much do we pay?
 The running time is O(k·n·(1 + m/w))
 Again, the method is practically efficient for small m.
 Still, only O(k) columns of M are needed at any given time. Hence, the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary = { bzip, not, or, space }
Given a pattern P, find all the occurrences in S of all the terms containing P as a substring, allowing k mismatches.
P = bot,  k = 2
S = "bzip or not bzip"
 not = 1g 0g 0a

[Figure: the agrep machinery is run over the compressed text C(S); the codeword of "not" is reported, since not matches bot within 2 mismatches.]

Agrep: more sophisticated operations
 The Shift-And method can solve other ops
 The edit distance between two strings p and s is d(p,s) = the minimum number of operations needed to transform p into s via three ops:
   Insertion: insert a symbol in p
   Deletion: delete a symbol from p
   Substitution: change a symbol in p into a different one
 Example: d(ananas,banane) = 3
 Search by regular expressions
   Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword lengths, and thus build the tree…
This may be extremely time/space costly when you deal with GBs of textual data.
A simple algorithm: sort the pi in decreasing order, and encode si via the variable-length code for the integer i.

γ-code for integer encoding

 γ(x) = 0^(Length-1) followed by x in binary,   where x > 0 and Length = ⌊log2 x⌋ + 1
 e.g., 9 is represented as <000,1001>.

 γ-code for x takes 2⌊log2 x⌋ + 1 bits   (i.e. a factor of 2 from optimal)
 Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…
 Given the following sequence of γ-coded integers, reconstruct the original sequence:

 0001000001100110000011101100111

  8   6   3   59   7
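A minimal sketch of γ-decoding, which reproduces the exercise above:

  def gamma_decode(bits):
      out, i = [], 0
      while i < len(bits):
          zeros = 0
          while bits[i] == "0":                 # count the leading zeros: Length - 1
              zeros += 1
              i += 1
          out.append(int(bits[i:i + zeros + 1], 2))   # the next Length bits are x in binary
          i += zeros + 1
      return out

  # gamma_decode("0001000001100110000011101100111") == [8, 6, 3, 59, 7]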

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).
Recall that: |γ(i)| ≤ 2·log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2·H0(s) + 1
Key fact:  1 ≥ ∑i=1,...,x pi ≥ x·px   x ≤ 1/px
How good is it ?
Encode the integers via γ-coding:
|γ(i)| ≤ 2·log i + 1
The cost of the encoding is (recall i ≤ 1/pi):

 ∑i=1,...,|S| pi · |γ(i)|

This is:

 ∑i=1,...,|S| pi · [2·log(1/pi) + 1]  ≤  2·H0(X) + 1

Not much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1n 2n 3n… nn  Huff = O(n2 log n), MTF = O(n log n) + n2

No much worse than Huffman
...but it may be far better

MTF: how good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1
Put the alphabet S in front of the sequence, and consider the cost of encoding:

 O(|S| log |S|) + ∑x=1,...,|S| ∑i=2,...,nx |γ( px,i - px,i-1 )|

where px,i is the position of the i-th occurrence of symbol x, and nx its number of occurrences.
By Jensen's inequality:

 ≤ O(|S| log |S|) + ∑x=1,...,|S| nx · [2·log(N/nx) + 1]
 ≤ O(|S| log |S|) + N·[2·H0(X) + 1]

La[mtf] ≤ 2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1n 2n 3n… nn 

There is a memory

Huff(X) = Q(n2 log n) > Rle(X) = Q(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive), with

 f(i) = ∑j=1,...,i-1 p(j)

e.g. for p(a) = .2, p(b) = .5, p(c) = .3:
 f(a) = .0, f(b) = .2, f(c) = .7
so a = [0, .2), b = [.2, .7), c = [.7, 1.0).

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac

[Figure: the interval [0,1) is narrowed symbol by symbol:
 b  [.2, .7)   (size .5)
 a  [.2, .3)   (size .1)
 c  [.27, .3)  (size .03)]

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l_0 = 0,   l_i = l_{i−1} + s_{i−1} · f[c_i]
s_0 = 1,   s_i = s_{i−1} · p[c_i]

f[c] is the cumulative prob. up to symbol c (not included)

Final interval size is   s_n = Π_{i=1..n} p[c_i]

The interval for a message sequence will be called the
sequence interval
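
A minimal sketch of this interval computation (real-valued, purely for illustration; the integer version comes later):

    # p and f map each symbol to its probability and cumulative probability
    def sequence_interval(msg, p, f):
        l, s = 0.0, 1.0
        for c in msg:
            l = l + s * f[c]
            s = s * p[c]
        return l, s           # the message is identified by any number in [l, l+s)

    p = {'a': .2, 'b': .5, 'c': .3}
    f = {'a': .0, 'b': .2, 'c': .7}
    # sequence_interval("bac", p, f) ≈ (0.27, 0.03), i.e. the interval [.27, .30)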

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore, specifying any number in the final interval uniquely determines the msg.
Decoding is similar to encoding, but at each step we need to determine the next message symbol and then reduce the interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
[Figure: .49 falls in b = [.2,.7); within it, .49 falls in b = [.3,.55); within that, .49 falls in c = [.475,.55)]

The message is bbc.

Representing a real number
Binary fractional representation:
.75 = .11
1/3 = .01 01 01 …
11/16 = .1011

Algorithm (repeat to emit the bits of x ∈ [0,1)):
1. x = 2·x
2. If x < 1, output 0
3. else x = x − 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
number    min       max       interval
.11       .110…     .111…     [.75, 1.0)
.101      .1010…    .1011…    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

[Figure: the sequence interval [.61, .79) contains the code interval of .101, i.e. [.625, .75)]

Can use l + s/2 truncated to 1 + ⌈log (1/s)⌉ bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem: operations on arbitrary-precision real numbers are expensive.
Key Ideas of integer version:




Keep integers in the range [0..R) where R = 2^k
Use rounding to generate integer intervals
Whenever the sequence interval falls into the top, bottom or middle half, expand the interval by a factor of 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
Output 1 followed by m 0s; m = 0
Message interval is expanded by 2

If u < R/2 then (bottom half)
Output 0 followed by m 1s; m = 0
Message interval is expanded by 2

If l ≥ R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

All other cases: just continue...
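
A minimal sketch of this renormalization loop (treating u as the exclusive upper end of the current integer interval; a simplifying assumption, not the exact code of any specific coder):

    def renormalize(l, u, R, m, emit):
        while True:
            if u <= R // 2:                          # bottom half
                emit("0" + "1" * m); m = 0
            elif l >= R // 2:                        # top half
                emit("1" + "0" * m); m = 0
                l -= R // 2; u -= R // 2
            elif l >= R // 4 and u <= 3 * R // 4:    # middle half
                m += 1
                l -= R // 4; u -= R // 4
            else:
                return l, u, m                       # interval wide enough: just continue
            l *= 2; u *= 2                           # expand by a factor of 2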

You find this at

Arithmetic ToolBox
As a state machine

[Figure: the ATB, given the current interval (L,s), the distribution (p_1,…,p_|S|) and a symbol c, produces the new interval (L’,s’)]

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.
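
A minimal sketch of this counting-plus-escape idea (one simplistic variant, with escape probability 1/(total+1); real PPM variants use more refined heuristics):

    from collections import defaultdict

    def build_counts(text, k):
        counts = defaultdict(lambda: defaultdict(int))   # context -> next char -> count
        for i, ch in enumerate(text):
            for j in range(k + 1):                       # contexts of size 0..k
                if i - j >= 0:
                    counts[text[i - j:i]][ch] += 1
        return counts

    def prob(counts, context, ch):
        p = 1.0
        while True:
            seen = counts.get(context, {})
            total = sum(seen.values())
            if ch in seen:
                return p * seen[ch] / (total + 1)
            p *= 1.0 / (total + 1)                       # emit an escape
            if not context:
                return p / 256.0                         # order -1: uniform over bytes (assumption)
            context = context[1:]                        # shrink the context size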

PPM + Arithmetic ToolBox
[Figure: the ATB is now driven by p[ s | context ], where s is either a char c or esc; it still maps (L,s) to (L’,s’)]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
String = ACCBACCACBA B,   k = 2

Order-0 context (Empty):   A = 4, B = 2, C = 5, $ = 3

Order-1 contexts:
  A:  C = 3, $ = 1
  B:  A = 2, $ = 1
  C:  A = 1, B = 2, C = 2, $ = 3

Order-2 contexts:
  AC: B = 1, C = 2, $ = 2
  BA: C = 1, $ = 1
  CA: C = 1, $ = 1
  CB: A = 2, $ = 1
  CC: A = 1, B = 1, $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
[Figure: the dictionary is the already-scanned prefix (all substrings starting there); the cursor marks the current position; a possible output triple is <2,3,c>]

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6. At each step: the longest match within W, then the next character.

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
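
A minimal sketch of the whole decoder (the overlap case needs no special handling, exactly as the copy loop above shows):

    # each triple is (d, len, next_char)
    def lz77_decode(triples):
        out = []
        for d, length, c in triples:
            cursor = len(out)
            for i in range(length):              # works even when length > d (overlap)
                out.append(out[cursor - d + i])
            out.append(c)
        return "".join(out)

    # lz77_decode([(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')]) == "aacaacabcabaaac"
    # i.e. the windowed example above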

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash table to speed up the search for matches (indexed by char triplets)
Triples are then coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids
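
A minimal sketch of this coding loop (trie replaced by a plain dict for brevity):

    def lz78_encode(text):
        dictionary = {"": 0}                     # string -> id; id 0 = empty string
        out, S, next_id = [], "", 1
        for c in text:
            if S + c in dictionary:              # keep extending the current match
                S += c
            else:
                out.append((dictionary[S], c))   # output (id of S, next char)
                dictionary[S + c] = next_id      # add Sc to the dictionary
                next_id += 1
                S = ""
        if S:                                    # pending match at end of input
            out.append((dictionary[S[:-1]], S[-1]))
        return out

    # lz78_encode("aabaacabcabcb") ==
    #   [(0,'a'), (1,'b'), (1,'a'), (0,'c'), (2,'c'), (5,'b')]   (the example below)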

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
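
A minimal sketch of the decoder, including that special case (single-char codes pre-loaded with the slide’s convention a = 112, b = 113, c = 114):

    def lzw_decode(codes, base):
        dictionary = dict(base)                  # id -> string
        next_id = 256
        prev = dictionary[codes[0]]
        out = [prev]
        for code in codes[1:]:
            if code in dictionary:
                cur = dictionary[code]
            else:                                # code not yet known: the special case
                cur = prev + prev[0]             # it must be prev + first char of prev
            out.append(cur)
            dictionary[next_id] = prev + cur[0]  # the entry the encoder just added
            next_id += 1
            prev = cur
        return "".join(out)

    base = {112: "a", 113: "b", 114: "c"}
    # lzw_decode([112,112,113,256,114,257,261,114,113], base) == "aabaacababacb"
    # (the trailing 113 is the encoder's final flush, not shown in the example below)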

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
We are given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

(1994)

F                L
#  mississipp  i
i  #mississip  p
i  ppi#missis  s
i  ssippi#mis  s
i  ssissippi#  m
m  ississippi  #
p  i#mississi  p
p  pi#mississ  i
s  ippi#missi  s
s  issippi#mi  s
s  sippi#miss  i
s  sissippi#m  i

T = mississippi# is a famous example; real texts are much longer...

A useful tool: the L → F mapping

F: # i i i i m p p s s s s      (the middle of each row is unknown to the decoder)
L: i p s s m # p i s s i i

How do we map L’s chars onto F’s chars ?
... We need to distinguish equal chars in F...

Take two equal chars of L
Rotate their rows rightward
Same relative order !!

The BWT is invertible
F: # i i i i m p p s s s s      (middle unknown)
L: i p s s m # p i s s i i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
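
A minimal Python sketch of this inversion (assuming ‘#’ is the unique, smallest end-marker; the LF-array is built by stably sorting L):

    def invert_bwt(L, end="#"):
        n = len(L)
        order = sorted(range(n), key=lambda i: (L[i], i))   # stable: equal chars keep order
        LF = [0] * n
        for f_pos, l_pos in enumerate(order):
            LF[l_pos] = f_pos                # LF maps an L-position to its F-position
        T = [""] * n
        r = L.index(end)                     # row of the rotation equal to T itself
        for i in range(n - 1, -1, -1):       # reconstruct T backward
            T[i] = L[r]
            r = LF[r]
        return "".join(T)

    # invert_bwt("ipssm#pissii") == "mississippi#"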

How to compute the BWT ?
SA    BWT matrix       L
12    #mississipp      i
11    i#mississip      p
 8    ippi#missis      s
 5    issippi#mis      s
 2    ississippi#      m
 1    mississippi      #
10    pi#mississi      p
 9    ppi#mississ      i
 7    sippi#missi      s
 4    sissippi#mi      s
 6    ssippi#miss      i
 3    ssissippi#m      i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]
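
A minimal sketch of exactly this formula (0-indexed, with a toy SA construction):

    def bwt_from_sa(T):
        n = len(T)
        sa = sorted(range(n), key=lambda i: T[i:])       # the "elegant but inefficient" sort
        return "".join(T[(i - 1) % n] for i in sa)       # the char preceding each suffix

    # bwt_from_sa("mississippi#") == "ipssm#pissii"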

How to construct SA from T ?
SA
12   #
11   i#
 8   ippi#
 5   issippi#
 2   ississippi#
 1   mississippi#
10   pi#
 9   ppi#
 7   sippi#
 4   sissippi#
 6   ssippi#
 3   ssissippi#
Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous ⇒ L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion pages available (Google, 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node one can reach any other node via an undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node one can reach any other node via a directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by humans



Exploit the structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph
  V = routers
  E = communication links

The “cosine” graph (undirected, weighted)
  V = static web pages
  E = semantic distance between pages

Query-Log graph (bipartite, weighted)
  V = queries and URLs
  E = (q,u) if u is a result for q, and has been clicked by some user who issued q

Social graph (undirected, unweighted)
  V = users
  E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has a hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pr that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution

Altavista crawl (1999) and WebBase crawl (2001): the indegree follows a power-law distribution

Pr[ in-degree(u) = k ]  ∝  1 / k^α,   α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has a hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pr that a node has x links is 1/x^α, α ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic (URL) order tend to share many outgoing links

A Picture of the Web Graph

[Figure: adjacency-matrix plot (i,j) of 21 million pages and 150 million links, with URL-sorted rows; dense blocks correspond to hosts such as Berkeley and Stanford]

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Gap-encoded successor list: S(x) = { s_1 − x, s_2 − s_1 − 1, ..., s_k − s_{k−1} − 1 }

For negative entries (only the first gap can be negative): map an integer v to 2v if v ≥ 0, and to 2|v| − 1 if v < 0 (as in the worked example below)
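
A minimal sketch of this gap encoding (hypothetical successor list, not the one of the library’s example):

    def int2nat(v):                      # mapping used for the (possibly negative) first gap
        return 2 * v if v >= 0 else 2 * abs(v) - 1

    def gap_encode(x, successors):       # successors sorted increasingly
        gaps = [int2nat(successors[0] - x)]
        for prev, cur in zip(successors, successors[1:]):
            gaps.append(cur - prev - 1)  # consecutive successors give gap 0
        return gaps

    # gap_encode(15, [13, 15, 16, 17, 18, 19, 23, 24, 203])
    #   == [3, 1, 0, 0, 0, 0, 3, 0, 178]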

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit in the copy-list of y tells whether the corresponding successor of the reference x is also a successor of y;
the reference index is chosen in a window [0,W] so as to give the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity in the extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
[Figure: a sender transmits data over the network to a receiver; the receiver already holds some knowledge about the data]

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



What if the sender has never seen the data at the receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization

Delta compression   [diff, zdelta, REBL,…]
  Compress file f deploying file f’
  Compress a group of files
  Speed up web access by sending differences between the requested page and the ones available in cache

File synchronization   [rsync, zsync]
  Client updates an old file f_old with the f_new available on a server
  Mirroring, Shared Crawling, Content Distribution Networks

Set reconciliation
  Client updates a structured old file f_old with the f_new available on a server
  Update of contacts or appointments, intersect inverted lists in a P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files f_known and f_new, and the goal is to compute a file f_d of minimum size such that f_new can be derived from f_known and f_d


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77-scheme provides an efficient, optimal solution




f_known is the “previously encoded text”; compress the concatenation f_known·f_new starting from f_new

zdelta is one of the best implementations
           Emacs size    Emacs time
uncompr    27Mb          ---
gzip       8Mb           35 secs
zdelta     1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

[Figure: Client ↔ slow link (requests and delta-encoded pages) ↔ Proxy ↔ fast link ↔ Web; both Client and Proxy hold the reference page]

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f ∈ F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: an example weighted graph over a few files plus a dummy node 0; edge weights are zdelta sizes (e.g. 20, 123, 220, 620, 2000) and the min branching picks the cheapest reference for each file]

           space    time
uncompr    30Mb     ---
tgz        20%      linear
THIS       8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly: n² edge calculations (zdelta executions)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: estimate appropriate edge weights for G’_F, thus saving zdelta executions. Nonetheless, this still takes n² time
           space    time
uncompr    260Mb    ---
tgz        12%      2 mins
THIS       8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
[Figure: the Client holds f_old and sends a request; the Server holds f_new and sends back an update]

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
[Figure: the Client (holding f_old) sends block hashes; the Server (holding f_new) replies with an encoded file]

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
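
A minimal sketch of a rolling weak checksum in the spirit of rsync’s (two running sums, slid in O(1) per position; not rsync’s exact definition):

    def rolling_hashes(data, block):
        a = sum(data[:block]) % 65536
        b = sum((block - i) * data[i] for i in range(block)) % 65536
        yield (b << 16) | a
        for i in range(block, len(data)):
            out, inc = data[i - block], data[i]
            a = (a - out + inc) % 65536
            b = (b - block * out + a) % 65536
            yield (b << 16) | a

    # list(rolling_hashes(b"abcdefgh", 4)) gives one 32-bit value per window position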

Rsync: some experiments

           gcc size    emacs size
total      27288       27326
gzip       7563        8577
zdelta     227         1431
rsync      964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends the hashes (unlike the client in rsync); the client checks them
The server deploys the common f_ref to compress the new f_tar (rsync compresses f_tar on its own)

A multi-round protocol

k blocks of n/k elements each, log(n/k) levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elements each, log(n/k) levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

[Figure: P aligned at position i of T, covering a prefix of the suffix T[i,N]]

Occurrences of P in T = all suffixes of T having P as a prefix

P = si,  T = mississippi  ⇒  occurrences at positions 4, 7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Figure: the suffix tree of T# = mississippi#, with edges labelled by substrings (i, s, p, #, si, ssi, ppi#, pi#, i#, mississippi#, …) and the 12 leaves labelled with the starting positions of the corresponding suffixes]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Θ(N²) space if the suffixes are stored explicitly

T = mississippi#,  P = si

SA    SUF(T)
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#
Suffix Array
• SA: Θ(N log₂ N) bits
• Text T: N chars
⇒ In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison

SA = 12 11 8 5 2 1 10 9 7 4 6 3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison

SA = 12 11 8 5 2 1 10 9 7 4 6 3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log₂ N) binary-search steps
• Each step takes O(p) char comparisons
⇒ overall, O(p log₂ N) time
  improved to O(p + log₂ N)  [Manber-Myers, ’90]
  and to O(p + log₂ |S|)  [Cole et al, ’06]
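
A minimal sketch of the plain O(p log₂ N) search (each comparison looks at most at p chars of a suffix):

    def sa_build(T):
        return sorted(range(len(T)), key=lambda i: T[i:])    # toy construction

    def sa_search(T, SA, P):
        lo, hi = 0, len(SA)                      # left boundary: first suffix >= P
        while lo < hi:
            mid = (lo + hi) // 2
            if T[SA[mid]:SA[mid] + len(P)] < P:
                lo = mid + 1
            else:
                hi = mid
        left = lo
        lo, hi = left, len(SA)                   # right boundary: first suffix whose prefix > P
        while lo < hi:
            mid = (lo + hi) // 2
            if T[SA[mid]:SA[mid] + len(P)] <= P:
                lo = mid + 1
            else:
                hi = mid
        return SA[left:lo]                       # 0-indexed starting positions of the occurrences

    T = "mississippi#"
    # sorted(sa_search(T, sa_build(T), "si")) == [3, 6]   (positions 4 and 7 if 1-indexed)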

Locating the occurrences
[Figure: T = mississippi#; the suffixes sippi… and sissippi… occupy a contiguous block of SA, bracketed by the search keys si# and si$; occ = 2, occurrences at positions 4 and 7]

Suffix Array search
• O(p + log₂ N + occ) time

(# is smaller and $ is larger than every character of S)
Suffix Trays: O(p + log₂ |S| + occ)   [Cole et al., ’06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp = 0 0 1 4 0 0 1 0 2 1 3
SA  = 12 11 8 5 2 1 10 9 7 4 6 3
T = mississippi#

(e.g. issippi… and ississippi… share a prefix of length 4)
• How long is the common prefix between T[i,...] and T[j,...] ?
• It is the min of the subarray Lcp[h,k-1], where SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
• Search for an entry Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
• Search for a subarray Lcp[i,i+C-2] whose entries are all ≥ L


Slide 136

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with i-th bit of U(T[j]) to estabilish if both are true

An example j=1
1 2 3 4 5 6 7 8 9 10

n
m

1

12345

1

0

P=abaac

2

0

3

0

4

0

5

0

T=xabxabaaca

0
 
0
U (x)  0
 
0
 
0

2

3

4

5

6

7

1 
 
0
BitShift ( M ( 0 )) & U (T [1])   0  &
 
0
 
0

8

9

0 0
   
0 0
0  0
   
0 0
   
0 0

1
0

An example j=2
1 2 3 4 5 6 7 8 9 10

n
m

1

2

12345

1

0

1

P=abaac

2

0

0

3

0

0

4

0

0

5

0

0

T=xabxabaaca

1 
 
0
U (a )   1 
 
1 
 
0

3

4

5

6

7

1 
 
0
BitShift ( M (1)) & U (T [ 2 ])   0  &
 
0
 
0

8

9

1  1 
   
0 0
1    0 
   
1   0 
   
0 0

1
0

An example j=3
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

12345

1

0

1

0

P=abaac

2

0

0

1

3

0

0

0

4

0

0

0

5

0

0

0

T=xabxabaaca

0
 
1 
U (b )   0 
 
0
 
0

4

5

6

7

1 
 
1 
BitShift ( M ( 2 )) & U (T [ 3 ])   0  &
 
0
 
0

8

9

0 0
   
1  1 
0  0
   
0 0
   
0 0

1
0

An example j=9
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

4

5

6

7

8

9

12345

1

0

1

0

0

1

0

1

1

0

P=abaac

2

0

0

1

0

0

1

0

0

0

3

0

0

0

0

0

0

1

0

0

4

0

0

0

0

0

0

0

1

0

5

0

0

0

0

0

0

0

0

1

T=xabxabaaca

0
 
0
U (c )   0 
 
0
 
1 

1 
 
1 
BitShift ( M ( 8 )) & U (T [ 9 ])   0  &
 
0
 
1 

0 0
   
0 0
0  0
   
0 0
   
1  1 

1
0

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M ( j - 1))  U (T [ j ])
l

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M

l -1

( j - 1))

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1
1 2 3 4 5 6 7 8 910

M1=

T=xabxabaaca
P=
abaad

1

2

3

4

5

6

7

8

9

1
0

1

1

1

1

1

1

1

1

1

1

1

2

0

0

1

0

0

1

0

1

1

0

3

0

0

0

1

0

0

1

0

0

1

4

0

0

0

0

1

0

0

1

0

0

5

0

0

0

0

0

0

0

0

1

0

M0=

1

2

3

4

5

6

7

8

9

1
0

1

0

1

0

0

1

0

1

1

0

1

2

0

0

1

0

0

1

0

0

0

0

3

0

0

0

0

0

0

1

0

0

0

4

0

0

0

0

0

0

0

1

0

0

5

0

0

0

0

0

0

0

0

0

0

How much do we pay?





The running time is O(kn(1+m/w)
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Delection: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
0 000 ...........0 x in b in ary
Length-1


x > 0 and Length = log2 x +1
e.g., 9 represented as <000,1001>.



g-code for x takes 2 log2 x +1 bits

(ie. factor of 2 from optimal)



Optimal for Pr(x) = 1/2x2, and i.i.d integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8

6

3

59

7

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How much good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
1 ≥Si=1,...,x pi ≥ x * px  x ≤ 1/px

How good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):



p i  | g ( i ) |

i  1 ,..., S

This is:



i  1 ,.., S

p i  [ 2 * log

1

 1]

pi

 2 * H 0 ( X ) 1
No much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1n 2n 3n… nn  Huff = O(n2 log n), MTF = O(n log n) + n2

No much worse than Huffman
...but it may be far better

MTF: how good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
Put S in the front and consider the cost of encoding:
S

O ( S log S ) 



nx

g(p - p )

x 1 i  2

x

x

i

i -1

S

By Jensen’s:

 O ( S log S ) 

n
x 1

x

[ 2 * log

N

 1]

nx

 O ( S log S )  N * [ 2 * H 0 ( X )  1]
L a [ mtf ]  2 * H 0 ( X )  O (1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1n 2n 3n… nn 

There is a memory

Huff(X) = Q(n2 log n) > Rle(X) = Q(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.
1.0

c = .3

0.7

i -1

f (i ) 



p( j)

j 1

b = .5
0.2
0.0

f(a) = .0, f(b) = .2, f(c) = .7
a = .2

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
1.0

0.7
c = .3

0.7

c = .3
0.55

0.0

a = .2

0.3
0.2

c = .3

0.27
b = .5

b = .5
0.2

0.3

a = .2

b = .5
0.22

a = .2

0.2

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l 0  0 l i  l i -1  s i -1 * f c i 

s0  1

s i  s i -1 * p c i 

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

n

sn 

 p c 
i

i 1

The interval for a message sequence will be called the
sequence interval

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
[figure: .49 falls in b = [.2,.7); within it, .49 falls in b's sub-interval [.3,.55); within that, .49 falls in c's sub-interval [.475,.55)]

The message is bbc.

Representing a real number
Binary fractional representation:
  .75 = .11      1/3 = .010101…      11/16 = .1011

Algorithm (emit the bits of x ∈ [0,1)):
  1. x = 2 * x
  2. If x < 1 output 0
  3. else x = x − 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
code     min       max       interval
.11      .110…     .111…     [.75, 1.0)
.101     .1010…    .1011…    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

[figure: sequence interval [.61, .79); the code interval of .101, i.e. [.625, .75), is contained in it]

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + ⌈log (1/s)⌉ =
= 1 + ⌈ log Π_{i=1..n} (1/p_i) ⌉
≤ 2 + Σ_{i=1..n} log (1/p_i)
= 2 + Σ_{k=1..|S|} n p_k log (1/p_k)
= 2 + n H₀ bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
The problem is that operations on arbitrary-precision real numbers are expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R = 2^k
Use rounding to generate the integer intervals
Whenever the sequence interval falls into the top, bottom or middle half, expand the interval by a factor of 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 (top half):
  Output 1 followed by m 0s; m = 0; the message interval is expanded by 2

If u < R/2 (bottom half):
  Output 0 followed by m 1s; m = 0; the message interval is expanded by 2

If l ≥ R/4 and u < 3R/4 (middle half):
  Increment m; the message interval is expanded by 2

In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine

[figure: the Arithmetic ToolBox as a black box — given the current interval (L,s), a symbol c and its distribution (p1,…,p|S|), ATB outputs the new interval (L',s')]

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
[figure: the PPM model supplies p[ s | context ], with s = c or esc, to the Arithmetic ToolBox, which maps (L,s) to (L',s')]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
String = ACCBACCACBA B        k = 2

Context      Counts
Empty        A = 4   B = 2   C = 5   $ = 3

Context      Counts
A            C = 3   $ = 1
B            A = 2   $ = 1
C            A = 1   B = 2   C = 2   $ = 3

Context      Counts
AC           B = 1   C = 2   $ = 2
BA           C = 1   $ = 1
CA           C = 1   $ = 1
CB           A = 2   $ = 1
CC           A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/
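A tiny Python sketch (my own illustration, not the slides' code) that rebuilds the context counts of the example above; the escape count '$' is taken as the number of distinct successors seen in that context:

from collections import defaultdict

def ppm_counts(text, k):
    counts = defaultdict(lambda: defaultdict(int))
    for order in range(k + 1):
        for i in range(order, len(text)):
            ctx, nxt = text[i - order:i], text[i]
            counts[ctx][nxt] += 1
    for ctx in counts:                       # '$' = number of distinct successors of ctx
        counts[ctx]['$'] = len(counts[ctx])
    return counts

c = ppm_counts("ACCBACCACBA", 2)
# c['']   -> {'A': 4, 'C': 5, 'B': 2, '$': 3}
# c['AC'] -> {'C': 2, 'B': 1, '$': 2},  c['CB'] -> {'A': 2, '$': 1}, ...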

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm's step:
  Output ⟨d, len, c⟩ where
    d = distance of the copied string wrt the current position
    len = length of the longest match
    c = next char in the text beyond the longest match
  Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
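A compact Python sketch of windowed LZ77 encoding and of the decode loop above (window size and tie-breaking are simplifications of mine); it reproduces the windowed example of these slides:

def lz77_encode(t, w=6):
    i, out = 0, []
    while i < len(t):
        best_d, best_len = 0, 0
        for d in range(1, min(i, w) + 1):          # candidate copy distances inside the window
            l = 0
            while i + l < len(t) - 1 and t[i + l] == t[i - d + l]:
                l += 1                              # overlap with the part being compressed is allowed
            if l > best_len:
                best_d, best_len = d, l
        out.append((best_d, best_len, t[i + best_len]))
        i += best_len + 1                           # advance by len + 1
    return out

def lz77_decode(codes):
    out = []
    for d, l, c in codes:
        for _ in range(l):
            out.append(out[-d])                     # char-by-char copy: works even when l > d
        out.append(c)
    return "".join(out)

# lz77_encode("aacaacabcabaaac") -> [(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')]
# lz77_decode(lz77_encode("aacaacabcabaaac")) == "aacaacabcabaaac"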

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
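A small Python sketch of LZW over a byte alphabet (mine, not the slides'), including the special case in which the decoder receives a code it has not built yet:

def lzw_encode(text):
    dic = {chr(i): i for i in range(256)}
    w, out = "", []
    for c in text:
        if w + c in dic:
            w += c
        else:
            out.append(dic[w])
            dic[w + c] = len(dic)        # add S·c to the dictionary
            w = c
    out.append(dic[w])
    return out

def lzw_decode(codes):
    dic = {i: chr(i) for i in range(256)}
    w = dic[codes[0]]
    out = [w]
    for k in codes[1:]:
        if k in dic:
            entry = dic[k]
        else:                            # code not in the dict yet: the special case above
            entry = w + w[0]
        out.append(entry)
        dic[len(dic)] = w + entry[0]     # the decoder is one step behind the coder
        w = entry
    return "".join(out)

# lzw_decode(lzw_encode("aabaacabababacb")) == "aabaacabababacb"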

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us be given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows   (Burrows-Wheeler, 1994)

F                L
#  mississipp  i
i  #mississip  p
i  ppi#missis  s
i  ssippi#mis  s
i  ssissippi#  m
m  ississippi  #
p  i#mississi  p
p  pi#mississ  i
s  ippi#missi  s
s  issippi#mi  s
s  sippi#miss  i
s  sissippi#m  i

A famous example

Much
longer...

A useful tool: the L → F mapping

[figure: the sorted BWT matrix again, with the first column F and the last column L highlighted; the text in between is unknown to the decoder]

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
[figure: the same sorted matrix, columns F and L; the rest of each row is unknown]

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
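A Python sketch of InvertBWT, computing the LF-mapping via the stable-rank argument above. It assumes L is the BWT of a text ending with the unique sentinel '#'; the slides' pseudocode indexes rows slightly differently, here we start from the row whose last character is the sentinel (that row is T itself):

def invert_bwt(L, sentinel="#"):
    F = sorted(L)
    first = {}
    for i, c in enumerate(F):
        first.setdefault(c, i)                     # first row of F holding character c
    seen, LF = {}, []
    for c in L:
        LF.append(first[c] + seen.get(c, 0))       # equal chars keep their relative order
        seen[c] = seen.get(c, 0) + 1
    r, out = L.index(sentinel), []
    for _ in range(len(L)):
        out.append(L[r])                           # L[r] precedes F[r] in T
        r = LF[r]
    return "".join(reversed(out))

# invert_bwt("ipssm#pissii") == "mississippi#"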

How to compute the BWT ?
SA      BWT matrix      L
12      #mississipp     i
11      i#mississip     p
 8      ippi#missis     s
 5      issippi#mis     s
 2      ississippi#     m
 1      mississippi     #
10      pi#mississi     p
 9      ppi#mississ     i
 7      sippi#missi     s
 4      sissippi#mi     s
 6      ssippi#miss     i
 3      ssissippi#m     i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...
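The "elegant but inefficient" construction above, as a direct Python sketch (mine): build SA by sorting the suffixes explicitly, then read L[i] = T[SA[i]−1]:

def bwt(T):
    # T must end with a unique smallest sentinel, e.g. '#'
    n = len(T)
    SA = sorted(range(n), key=lambda i: T[i:])      # suffix array by explicit suffix sorting
    L = "".join(T[i - 1] for i in SA)               # char preceding each suffix (i=0 wraps to the sentinel)
    return L, [i + 1 for i in SA]                   # L, and SA in 1-based form as in the slides

# bwt("mississippi#") -> ("ipssm#pissii", [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3])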

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph:
  V = Routers, E = communication links

The “cosine” graph (undirected, weighted):
  V = static web pages, E = semantic distance between pages

Query-Log graph (bipartite, weighted):
  V = queries and URLs, E = (q,u) if u is a result for q and has been clicked by some user who issued q

Social graph (undirected, unweighted):
  V = users, E = (x,y) if x knows y (facebook, address book, email, ...)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution

Indegree follows a power-law distribution (Altavista crawl 1999, WebBase crawl 2001):

  Pr[ in-degree(u) = k ]  ∝  1 / k^a ,   a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/x^a, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
[figure: the adjacency matrix of 21 million pages and 150 million links, after URL-sorting (Berkeley, Stanford)]

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:
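A toy Python sketch of the gap encoding of a successor list, following the formula above. How the (possibly negative) first gap is folded to a non-negative value is elided on the slide, so the folding below is only my assumption (the usual interleaving of positives and negatives), and the example list is made up:

def fold(z):
    # map 0, -1, 1, -2, 2, ... to 0, 1, 2, 3, 4, ...
    return 2 * z if z >= 0 else 2 * (-z) - 1

def gap_encode(x, successors):
    s = sorted(successors)
    return [fold(s[0] - x)] + [s[i] - s[i - 1] - 1 for i in range(1, len(s))]

# gap_encode(15, [13, 15, 16, 17, 18, 19, 23, 24, 203]) -> [3, 1, 0, 0, 0, 0, 3, 0, 178]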

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides an efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
            Emacs size    Emacs time
uncompr     27Mb          ---
gzip        8Mb           35 secs
zdelta      1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[figure: example weighted graph G_F with a dummy node 0; edge weights are zdelta sizes, dummy edges carry the gzip sizes; the min branching picks the cheapest reference for each file]

            space    time
uncompr     30Mb     ---
tgz         20%      linear
THIS        8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
            space    time
uncompr     260Mb    ---
tgz         12%      2 mins
THIS        8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
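A much simplified Python sketch of the rsync idea (my own illustration): the client sends one weak hash per block of f_old; the server slides over f_new, emitting block references on hash hits and literal bytes otherwise. Real rsync additionally confirms hits with a strong checksum and updates the weak hash in O(1) per shift (rolling hash):

def weak_hash(block):
    return sum(block) % (1 << 16)              # stand-in for rsync's rolling checksum

def rsync_encode(f_new, old_hashes, B):
    table = {h: i for i, h in enumerate(old_hashes)}
    out, i = [], 0
    while i < len(f_new):
        h = weak_hash(f_new[i:i + B])
        if len(f_new) - i >= B and h in table:
            out.append(("copy", table[h]))     # reference to a block of f_old
            i += B
        else:
            out.append(("lit", f_new[i]))      # literal byte
            i += 1
    return out

# client side:  old_hashes = [weak_hash(f_old[j:j+B]) for j in range(0, len(f_old), B)]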

Rsync: some experiments

           gcc size    emacs size
total      27288       27326
gzip       7563        8577
zdelta     227         1431
rsync      964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), the client checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[figure: the suffix tree of T# = mississippi#, with edge labels such as i, s, si, ssi, i#, pi#, ppi#, issippi#, mississippi#; each of the 12 leaves stores the starting position of its suffix]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Storing SUF(T) explicitly would take Θ(N²) space.

SA      SUF(T)
12      #
11      i#
 8      ippi#
 5      issippi#
 2      ississippi#
 1      mississippi#
10      pi#
 9      ppi#
 7      sippi#
 4      sissippi#
 6      ssippi#
 3      ssissippi#

T = mississippi#    (each SA entry is a suffix pointer; P = si)

Suffix Array:
• SA: Θ(N log₂ N) bits
• Text T: N chars
⟹ In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix cmp

[figure: one binary-search step over SA of T = mississippi# for P = si; here P is larger than the probed suffix]
2 accesses per step

Searching a pattern
Indirect binary search on SA: O(p) time per suffix cmp

[figure: the next binary-search step; now P is smaller than the probed suffix]

Suffix Array search
• O(log₂ N) binary-search steps
• Each step takes O(p) char cmp
⟹ overall, O(p log₂ N) time
  improvable to O(p + log₂ N) [Manber-Myers, ’90]
  and to O(p + log₂ |S|) [Cole et al, ’06]
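A short Python sketch (mine) of the indirect binary search over SA, returning the text positions of the suffixes prefixed by P, i.e. the occurrences:

def sa_range(T, SA, P):
    def suf(i):                               # SA is 1-based as in the slides
        return T[SA[i] - 1:]
    lo, hi = 0, len(SA)
    while lo < hi:                            # first suffix >= P
        mid = (lo + hi) // 2
        if suf(mid) < P: lo = mid + 1
        else: hi = mid
    first = lo
    lo, hi = first, len(SA)
    while lo < hi:                            # first suffix whose p-length prefix exceeds P
        mid = (lo + hi) // 2
        if suf(mid)[:len(P)] <= P: lo = mid + 1
        else: hi = mid
    return [SA[i] for i in range(first, lo)]  # occurrence positions

SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
# sa_range("mississippi#", SA, "si") -> [7, 4]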

Locating the occurrences
[figure: SA of T = mississippi#; the occ = 2 occurrences of P = si are the contiguous suffixes sippi# and sissippi#, i.e. text positions 7 and 4, delimited by the two searches for si# and si$]

Suffix Array search:  O(p + log₂ N + occ) time
Suffix Trays:  O(p + log₂ |S| + occ)   [Cole et al., ’06]
String B-tree   [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays   [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp = 0 0 1 4 0 0 1 0 2 1 3
SA  = 12 11 8 5 2 1 10 9 7 4 6 3

T = mississippi#
E.g. lcp(issippi#, ississippi#) = 4
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
• Search for Lcp[i,i+C-2] whose entries are ≥ L


Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
[figure: memory hierarchy — CPU registers; caches L1/L2 (few Mbs, some nanosecs, few words fetched); RAM (few Gbs, tens of nanosecs, some words fetched); HD (few Tbs, few millisecs, B = 32K page); net (many Tbs, even secs, packets)]

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
        4K    8K    16K   32K    128K   256K   512K   1M
n³      22s   3m    26m   3.5h   28h    --     --     --
n²      0     0     0     1s     26s    106s   7m     28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm
  sum = 0; max = -1;
  For i = 1, ..., n do
    If (sum + A[i] ≤ 0) then sum = 0;
    else { sum += A[i]; max = MAX{max, sum}; }

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT
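The one-pass scan above as a small Python function (a sketch of mine):

def max_subarray_sum(A):
    best, run = -1, 0                 # max = -1, sum = 0 as in the slide
    for x in A:
        if run + x <= 0:
            run = 0                   # the optimum cannot start with a non-positive prefix
        else:
            run += x
            best = max(best, run)
    return best

# max_subarray_sum([2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]) == 12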

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort ⟹ Θ(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 10⁹ random I/Os = 10⁹ * 5ms ≈ 2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01  if (i < j) then
02    m = (i+j)/2;           // Divide
03    Merge-Sort(A,i,m);     // Conquer
04    Merge-Sort(A,m+1,j);
05    Merge(A,i,m,j)         // Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:
  n = 10⁹ tuples ⟹ few Gbs
  Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:
  It is an indirect sort: Θ(n log₂ n) random I/Os
  ⟹ [5ms] * n log₂ n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
[figure: merge-sort recursion tree over the example keys — log₂ N levels of pairwise merging; only the lowest levels fit in memory M]

How do we deploy the disk/mem features ?
N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.
Pass i: merge X ≤ M/B runs ⟹ log_{M/B} (N/M) passes

[figure: X input buffers and one output buffer, each of B items, sit in main memory between the input runs on disk and the merged output run on disk]

Multiway Merging
[figure: multiway merging — each of the X = M/B runs keeps a current page Bf_i with pointer p_i; repeatedly emit min(Bf₁[p₁], Bf₂[p₂], …, Bf_X[p_X]) into the output buffer Bf_o; fetch a run's next page when p_i = B, flush Bf_o when full, until EOF of the merged run]
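A compact Python sketch of the multiway merging step, using a heap over the current heads of the X runs (in the external-memory version each run and the output would of course be buffered in pages of B items):

import heapq

def multiway_merge(runs):
    heap, iters = [], [iter(r) for r in runs]      # runs: the X = M/B sorted runs
    for j, it in enumerate(iters):
        first = next(it, None)
        if first is not None:
            heapq.heappush(heap, (first, j))        # (current head, run id)
    out = []
    while heap:
        v, j = heapq.heappop(heap)                  # min of the X current heads
        out.append(v)
        nxt = next(iters[j], None)
        if nxt is not None:
            heapq.heappush(heap, (nxt, j))
    return out

# multiway_merge([[1,2,5,10], [2,7,9,13], [3,4,8,19]]) -> [1,2,2,3,4,5,7,8,9,10,13,19]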

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Can compression help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm
  Use a pair of variables X, C; set X = first item, C = 1
  For each subsequent item s of the stream:
    if (X == s) then C++
    else { C--; if (C == 0) { X = s; C = 1; } }
  Return X;

Proof
If the returned X ≠ y, then every one of y's occurrences has been cancelled by a distinct “negative” mate, hence these mates are ≥ #occ(y). As a result 2 * #occ(y) ≤ N, contradicting #occ(y) > N/2.
(The answer is unreliable if the mode occurs ≤ N/2 times.)
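The two-variable streaming algorithm above as a Python sketch (written in the equivalent "adopt when the counter hits zero" form); it assumes the mode really occurs > N/2 times, otherwise the answer is meaningless, as noted:

def majority(stream):
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1          # adopt the current item
        elif X == s:
            C += 1
        else:
            C -= 1               # pair s with one occurrence of X
    return X

# majority("bacccdcbaaaccbccc") == 'c'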

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 10⁹  ⟹  size = 6Gb
n = 10⁶ documents
TotT = 10⁹ (avg term length is 6 chars)
t = 5 * 10⁵ distinct terms

What kind of data structure should we build to support word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million documents (columns), t = 500K terms (rows)

              Antony&Cleopatra  JuliusCaesar  TheTempest  Hamlet  Othello  Macbeth
Antony        1                 1             0           0       0        1
Brutus        1                 1             0           1       0        0
Caesar        1                 1             0           1       1        1
Calpurnia     0                 1             0           0       0        0
Cleopatra     1                 0             0           0       0        0
mercy         1                 0             1           1       1        1
worser        1                 0             1           1       1        0

1 if play contains word, 0 otherwise

Space is 500Gb !

Solution 2: Inverted index
Brutus    →  2 4 8 16 32 64 128
Calpurnia →  1 2 3 5 8 13 21 34
Caesar    →  13 16

We can still do better: i.e. 30−50% of the original text

1. Typically use about 12 bytes
2. We have 10⁹ total terms ⟹ at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index, but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM into fewer bits ?
NO: they are 2^n, but we have fewer compressed messages:

  Σ_{i=1..n−1} 2^i = 2^n − 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i(s) = log₂ (1 / p(s)) = − log₂ p(s)

Lower probability ⟹ higher information

Entropy is the weighted average of i(s):

  H(S) = Σ_{s∈S} p(s) * log₂ (1 / p(s))   bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
  L_a(C) = Σ_{s∈S} p(s) * L[s]

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, there exists a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then p_i < p_j  ⟹  L[s_i] ≥ L[s_j]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

  H(S) ≤ L_a(C)
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

  L_a(C) ≤ H(S) + 1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
[figure: Huffman tree construction — merge a(.1)+b(.2) into (.3); merge (.3)+c(.2) into (.5); merge (.5)+d(.5) into (1)]

a=000, b=001, c=01, d=1
There are 2^(n−1) “equivalent” Huffman trees

What about ties (and thus, tree depth) ?
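A small heap-based Python sketch (mine) that builds a Huffman code for the running example; ties may be broken differently, yielding one of the "equivalent" trees:

import heapq
from itertools import count

def huffman(probs):
    tick = count()                                   # tie-breaker for equal probabilities
    heap = [(p, next(tick), s) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)            # the two least-probable nodes...
        p2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(tick), (left, right)))   # ...are merged
    codes = {}
    def walk(node, prefix=""):
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix
    walk(heap[0][2])
    return codes

# huffman({'a': .1, 'b': .2, 'c': .2, 'd': .5})
# -> codeword lengths 3, 3, 2, 1 (e.g. a=000, b=001, c=01, d=1)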

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
abc...  ⟹  00000101...
101001...  ⟹  dcb...

[figure: the same tree — encoding emits the root-to-leaf path of each symbol; decoding follows branches bit by bit and restarts at the root after each leaf]

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
− log(.999) ≈ .00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 ⟹ 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 ⟹ Larger model to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:
  The model takes |S|^k * (k * log |S|) + h² bits   (where h might be |S|)
  It is H₀(S^L) ≤ L * H_k(S) + O(k * log |S|), for each k ≤ L

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
[figure: tagged word-based Huffman — the 128-ary tree over the words and separators of T = “bzip or not bzip”; each codeword is a sequence of 7-bit symbols packed into bytes, and the 8th (tag) bit marks the first byte of a codeword]

CGrep and other ideas...
P= bzip = 1a 0b

[figure: the codeword of P = bzip is searched directly over C(T) with a GREP-like scan; thanks to the tag bits the scan stays synchronized on codeword boundaries, so true matches (yes) are distinguished from spurious ones (no)]
Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary = {bzip, not, or, space}          P = bzip = 1a 0b

[figure: the tagged word-based Huffman tree over the dictionary and the compressed text C(S), S = “bzip or not bzip”; the occurrences of P's codeword in C(S) are marked yes/no]
Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

  H(s) = Σ_{i=1..m} 2^{m−i} * s[i]

P = 0101
H(P) = 2³*0 + 2²*1 + 2¹*0 + 2⁰*1 = 5

s = s'  if and only if  H(s) = H(s')

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

  H(T_r) = 2 * H(T_{r−1}) − 2^m * T(r−1) + T(r+m−1)

T = 10110101
T₁ = 1011, T₂ = 0110
H(T₁) = H(1011) = 11
H(T₂) = 2*11 − 2⁴*1 + 0 = 22 − 16 = 6 = H(0110)

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally (Horner's rule, mod 7):
  1*2 (mod 7) + 0 = 2
  2*2 (mod 7) + 1 = 5
  5*2 (mod 7) + 1 = 4
  4*2 (mod 7) + 1 = 2
  2*2 (mod 7) + 1 = 5
  5 (mod 7) = 5 = Hq(P)

We can still compute Hq(T_r) from Hq(T_{r−1}):
  2^m (mod q) = 2 * (2^{m−1} (mod q)) (mod q)

Intermediate values are also small! (< 2q)
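A Python sketch (mine) of the fingerprint scan over a binary text, as in the slides; q would be a random prime, here fixed for the example, and every fingerprint hit is verified to rule out false matches:

def kr_matches(T, P, q=7):
    m = len(P)
    hP = hT = 0
    for i in range(m):                       # fingerprints of P and of T_1
        hP = (2 * hP + int(P[i])) % q
        hT = (2 * hT + int(T[i])) % q
    top = pow(2, m - 1, q)                   # 2^(m-1) mod q, used to drop the leading bit
    hits = []
    for r in range(len(T) - m + 1):
        if hT == hP and T[r:r + m] == P:     # verify the candidate match
            hits.append(r + 1)               # 1-based position, as in the slides
        if r + m < len(T):                   # roll: drop T[r], append T[r+m]
            hT = ((hT - int(T[r]) * top) * 2 + int(T[r + m])) % q
    return hits

# kr_matches("10110101", "0101") -> [5]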

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary = {bzip, not, or, space}          P = bzip = 1a 0b

[figure: the codeword of P is scanned over C(S), S = “bzip or not bzip”; both occurrences of bzip are found]
Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

[figure: the m×n matrix M for T = california and P = for; M(3,7) = 1 because “for” ends at position 7 of T, and all other entries of row 3 are 0]
How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
U(a) = (1,0,1,1,0)ᵀ     U(b) = (0,1,0,0,0)ᵀ     U(c) = (0,0,0,0,1)ᵀ

How to construct M



Initialize column 0 of M to all zeros
For j > 0, the j-th column is obtained by

  M(j) = BitShift( M(j−1) ) & U( T[j] )

For i > 1, entry M(i,j) = 1 iff
 (1) the first i−1 characters of P match the i−1 characters of T ending at character j−1  ⇔  M(i−1,j−1) = 1
 (2) P[i] = T[j]  ⇔  the i-th bit of U(T[j]) = 1
BitShift moves bit M(i−1,j−1) into the i-th position; AND-ing it with the i-th bit of U(T[j]) establishes whether both conditions hold.
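A Python sketch (mine) of Shift-And using machine integers as bit-columns, where bit i−1 of the word plays the role of row i of M:

def shift_and(T, P):
    m = len(P)
    U = {}                                   # U[c]: bit i-1 set iff P[i] = c
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M, hits = 0, []
    for j, c in enumerate(T):
        M = ((M << 1) | 1) & U.get(c, 0)     # M(j) = BitShift(M(j-1)) & U(T[j])
        if M & (1 << (m - 1)):               # row m set: P ends at position j+1 (1-based)
            hits.append(j + 1)
    return hits

# shift_and("xabxabaaca", "abaac") -> [9]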

An example j=1
[figure: T = xabxabaaca, P = abaac; M(1) = BitShift(M(0)) & U(T[1]) = (1,0,0,0,0)ᵀ & U(x) = (0,0,0,0,0)ᵀ, since x does not occur in P]

An example j=2
[figure: M(2) = BitShift(M(1)) & U(a) = (1,0,0,0,0)ᵀ & (1,0,1,1,0)ᵀ = (1,0,0,0,0)ᵀ — T[2] = a matches P[1]]

An example j=3
[figure: M(3) = BitShift(M(2)) & U(b) = (1,1,0,0,0)ᵀ & (0,1,0,0,0)ᵀ = (0,1,0,0,0)ᵀ — “ab” ends at position 3]

An example j=9
[figure: the full matrix up to column 9; M(9) = BitShift(M(8)) & U(c) has a 1 in row m = 5, so P = abaac occurs in T ending at position 9]

Shift-And method: Complexity








If m ≤ w, any column and any vector U() fit in a memory word
 ⟹ any step requires O(1) time.
If m > w, any column and any vector U() can be divided into m/w memory words
 ⟹ any step requires O(m/w) time.
Overall O(n(1 + m/w) + m) time.
Thus, it is very fast when the pattern length is close to the word size — very often in practice. Recall that w = 64 bits in modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
U(a) = (1,0,1,1,0)ᵀ     U(b) = (1,1,0,0,0)ᵀ     U(c) = (0,0,0,0,1)ᵀ

What about ‘?’, ‘[^…]’ (not).

Problem 1: Another solution
Dictionary = {bzip, not, or, space}          P = bzip = 1a 0b

[figure: the same tagged-Huffman dictionary and compressed text C(S), S = “bzip or not bzip”; the pattern's codeword is located directly in C(S)]

Speed ≈ Compression ratio

Problem 2
Dictionary = {bzip, not, or, space}

Given a pattern P find all the occurrences in S of all terms containing P as substring

P = o

[figure: the tagged-Huffman dictionary and C(S) for S = “bzip or not bzip”; both “not” and “or” contain P]

not = 1 g 0 g 0 a
or  = 1 g 0 a 0 b

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

S is the concatenation of the patterns in P
R is a bitmap of length m:
  R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of the Shift-And method searching for S:
  For any symbol c, U’(c) = U(c) AND R
   ⟹ U’(c)[i] = 1 iff S[i] = c and it is the first symbol of a pattern
  For any step j:
   compute M(j)
   then OR it with U’(T[j]). Why? ⟹ set to 1 the first bit of each pattern that starts with T[j]
   Check if there are occurrences ending in j. How?

Problem 3
Dictionary = {bzip, not, or, space}

Given a pattern P find all the occurrences in S of all terms containing P as substring, allowing at most k mismatches

P = bot, k = 2

[figure: the tagged-Huffman dictionary and C(S) for S = “bzip or not bzip”]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

M^l(i,j) = 1 iff there are no more than l mismatches between the first i characters of P and the i characters of T ending at character j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

  BitShift( M^l(j−1) ) & U( T[j] )

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

  BitShift( M^{l−1}(j−1) )

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1
[figure: the matrices M⁰ and M¹ for T = xabxabaaca and P = abaad; M¹(5,9) = 1, so P occurs ending at position 9 of T with at most one mismatch]
How much do we pay?





The running time is O(k n (1 + m/w)).
Again, the method is practically efficient for small m.
Still, only O(k) columns of M are needed at any given time. Hence, the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary = {bzip, not, or, space}

Given a pattern P find all the occurrences in S of all terms containing P as substring, allowing k mismatches

P = bot, k = 2

[figure: the tagged-Huffman dictionary and C(S) for S = “bzip or not bzip”; the term “not” matches P with ≤ 2 mismatches]
not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding
  γ(x) = 000...0 (Length−1 zeros) followed by x in binary
  x > 0 and Length = ⌊log₂ x⌋ + 1
  e.g., 9 represented as <000, 1001>

γ-code for x takes 2⌊log₂ x⌋ + 1 bits   (i.e. a factor of 2 from optimal)
Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

⟹ 8, 6, 3, 59, 7
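A Python sketch (mine) of γ-encoding and decoding, which reproduces the exercise above:

def gamma_encode(x):
    b = bin(x)[2:]                    # binary representation of x > 0
    return "0" * (len(b) - 1) + b     # Length-1 zeros, then x in binary

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        l = 0
        while bits[i] == "0":         # count the leading zeros = Length - 1
            l += 1; i += 1
        out.append(int(bits[i:i + l + 1], 2))
        i += l + 1
    return out

# "".join(gamma_encode(x) for x in [8, 6, 3, 59, 7]) == "0001000001100110000011101100111"
# gamma_decode("0001000001100110000011101100111") == [8, 6, 3, 59, 7]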

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2 * H₀(s) + 1
Key fact:  1 ≥ Σ_{i=1..x} p_i ≥ x * p_x  ⟹  x ≤ 1/p_x

How good is it ?
Encode the integers via γ-coding:  |γ(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/p_i):

  Σ_{i=1..|S|} p_i * |γ(i)|  ≤  Σ_{i=1..|S|} p_i * [ 2 * log (1/p_i) + 1 ]  =  2 * H₀(X) + 1
Not much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128
s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte,
s*c with 2 bytes, s*c² with 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 128² = 16512 words on (at most) 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on (at most) 2 bytes, hence more words on 1 byte — and thus better if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1n 2n 3n… nn  Huff = O(n2 log n), MTF = O(n log n) + n2

No much worse than Huffman
...but it may be far better

MTF: how good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
Put S in the front and consider the cost of encoding:
S

O ( S log S ) 



nx

g(p - p )

x 1 i  2

x

x

i

i -1

S

By Jensen’s:

 O ( S log S ) 

n
x 1

x

[ 2 * log

N

 1]

nx

 O ( S log S )  N * [ 2 * H 0 ( X )  1]
L a [ mtf ]  2 * H 0 ( X )  O (1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1n 2n 3n… nn 

There is a memory

Huff(X) = Q(n2 log n) > Rle(X) = Q(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.
1.0

c = .3

0.7

i -1

f (i ) 



p( j)

j 1

b = .5
0.2
0.0

f(a) = .0, f(b) = .2, f(c) = .7
a = .2

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
1.0

0.7
c = .3

0.7

c = .3
0.55

0.0

a = .2

0.3
0.2

c = .3

0.27
b = .5

b = .5
0.2

0.3

a = .2

b = .5
0.22

a = .2

0.2

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l 0  0 l i  l i -1  s i -1 * f c i 

s0  1

s i  s i -1 * p c i 

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

n

sn 

 p c 
i

i 1

The interval for a message sequence will be called the
sequence interval

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
1.0

0.7
c = .3

0.7
0.49

b = .5

c = .3

0.55
0.49

0.2

0.0

0.55

b = .5

0.3
a = .2

The message is bbc.

0.2

0.49
0.475

c = .3

b = .5
0.35

a = .2

0.3

a = .2

Representing a real number
Binary fractional representation:
. 75

 . 11

1/ 3

 . 01 01

11 / 16

 . 1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
m in

m ax

interval

.11

.110

.111

[.75 ,1.0 )

.101

.1010

.1011

[.625 , .75 )

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence interval = [.61, .79);  code interval of .101 = [.625, .75)  ⊆  [.61, .79)

One can use L + s/2 truncated to 1 + ⌈log (1/s)⌉ bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most
  1 + ⌈ log (1/s) ⌉
  = 1 + ⌈ log ∏_{i=1..n} (1/p_i) ⌉
  ≤ 2 + ∑_{i=1..n} log (1/p_i)
  = 2 + ∑_{k=1..|S|} n·p_k · log (1/p_k)
  = 2 + n·H₀   bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 (top half):
  Output 1 followed by m 0s; set m = 0; the message interval is expanded by 2

If u < R/2 (bottom half):
  Output 0 followed by m 1s; set m = 0; the message interval is expanded by 2

If l ≥ R/4 and u < 3R/4 (middle half):
  Increment m; the message interval is expanded by 2

In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine: the ATB takes the current interval (L, s) and a symbol c with
distribution (p1,....,pS), and returns the new interval (L', s') with
L' = L + s·f(c) and s' = s·p(c).
Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.
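A toy Python sketch of the context-fallback idea, using the simple "one escape count per distinct symbol" heuristic — just one of the possible PPM variants, not the one fixed by these slides; all names are illustrative.

from collections import defaultdict

K = 2
counts = defaultdict(lambda: defaultdict(int))   # context string -> symbol -> count

def update(history, c):
    # register c after every context of size 0..K (suffixes of the history)
    for j in range(min(K, len(history)) + 1):
        ctx = history[len(history) - j:]
        counts[ctx][c] += 1

def prob(history, c):
    # back off from the longest available context, emitting escapes
    for j in range(min(K, len(history)), -1, -1):
        ctx = history[len(history) - j:]
        tbl = counts[ctx]
        total = sum(tbl.values()) + len(tbl)     # one escape count per distinct symbol
        if c in tbl:
            return tbl[c] / total                # this value would drive the ATB
        # else: escape (prob len(tbl)/total) and retry with a shorter context
    return None   # never seen at all: an order-(-1) uniform model would take over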

PPM + Arithmetic ToolBox
The PPM model feeds the ATB: at each step the symbol s (either the next char c or an
escape esc) is encoded with the conditional probability p[ s | context ], and the
interval (L,s) is updated to (L',s') as before.

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
String = ACCBACCACBA B        k = 2

 Context    Counts
 (empty)    A = 4   B = 2   C = 5   $ = 3

 Context    Counts
 A          C = 3   $ = 1
 B          A = 2   $ = 1
 C          A = 1   B = 2   C = 2   $ = 3

 Context    Counts
 AC         B = 1   C = 2   $ = 2
 BA         C = 1   $ = 1
 CA         C = 1   $ = 1
 CB         A = 2   $ = 1
 CC         A = 1   B = 1   $ = 2
You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:
 • Output <d, len, c> where
    d   = distance of the copied string wrt the current position
    len = length of the longest match
    c   = next char in the text beyond the longest match
 • Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character
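A brute-force Python sketch of the encoder above over a fixed window W (real implementations speed up the match search with hashing, as the gzip slide below notes); on the example text it reproduces the triples shown above.

def lz77_encode(text, W=6):
    out, i = [], 0
    while i < len(text):
        best_d, best_len = 0, 0
        for start in range(max(0, i - W), i):        # candidate copy sources in the window
            l = 0
            while i + l < len(text) - 1 and text[start + l] == text[i + l]:
                l += 1                               # the match may run past the cursor
            if l > best_len:
                best_d, best_len = i - start, l
        nxt = text[i + best_len]                     # next char beyond the match
        out.append((best_d, best_len, nxt))
        i += best_len + 1                            # advance by len + 1
    return out

print(lz77_encode("aacaacabcabaaac"))
# [(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')]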

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
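A minimal LZW encoder sketch in Python; to mirror the example below it numbers a = 112, b = 113, c = 114 and starts new entries at 256 (a real coder would preload all 256 byte values).

def lzw_encode(text):
    dic = {"a": 112, "b": 113, "c": 114}   # toy dictionary mirroring the slides
    next_id = 256
    out, S = [], ""
    for c in text:
        if S + c in dic:
            S = S + c                      # extend the current match
        else:
            out.append(dic[S])             # emit the longest match found
            dic[S + c] = next_id           # add Sc to the dictionary
            next_id += 1
            S = c
    out.append(dic[S])                     # flush the pending match
    return out

print(lzw_encode("aabaacababacb"))
# [112, 112, 113, 256, 114, 257, 261, 114, 113]   (the last 113 is the final flush,
#                                                  which the slide's trace stops before)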

LZW: Encoding Example
Text: a a b a a c a b a b a c b

 step   output   new dict entry
  1     112      256 = aa
  2     112      257 = ab
  3     113      258 = ba
  4     256      259 = aac
  5     114      260 = ca
  6     257      261 = aba
  7     261      262 = abac
  8     114      263 = cb

LZW: Decoding Example
 Input   output   new dict entry (one step later)
 112     a
 112     a        256 = aa
 113     b        257 = ab
 256     aa       258 = ba
 114     c        259 = aac
 257     ab       260 = ca
 261     aba      261 = aba    (special case: 261 is not yet in the dictionary,
                                so it must be S + S[0] = ab + a = aba)
 114     c        262 = abac

LZ78 and LZW issues
How do we keep the dictionary small?
 • Throw the dictionary away when it reaches a certain size (used in GIF)
 • Throw the dictionary away when it is no longer effective at compressing (e.g. compress)
 • Throw the least-recently-used (LRU) entry away when it reaches a certain size
   (used in BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows (Burrows-Wheeler, 1994).  F is the first column, L is the last column:

  F              L
  #  mississipp  i
  i  #mississip  p
  i  ppi#missis  s
  i  ssippi#mis  s
  i  ssissippi#  m
  m  ississippi  #
  p  i#mississi  p
  p  pi#mississ  i
  s  ippi#missi  s
  s  issippi#mi  s
  s  sippi#miss  i
  s  sissippi#m  i

A famous example

Much
longer...

A useful tool: L → F mapping
(same sorted-rotation matrix as above, but now T is unknown: only the columns F and L are available)

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
(again, only the columns F and L of the sorted-rotation matrix are available)

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
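A compact Python sketch of the inversion: the LF mapping is obtained by stably sorting L, and T is rebuilt backward starting from the row whose L-char is the end-marker # (i.e. the row equal to T). It assumes a unique, smallest # terminator.

def bwt_inverse(L):
    # F is the stable sort of L; LF[i] = row of F holding the same occurrence as L[i]
    order = sorted(range(len(L)), key=lambda i: (L[i], i))
    LF = [0] * len(L)
    for f_row, l_row in enumerate(order):
        LF[l_row] = f_row
    # walk backward from the row ending with '#', collecting the text in reverse
    r, chars = L.index("#"), []
    for _ in range(len(L)):
        chars.append(L[r])
        r = LF[r]
    return "".join(reversed(chars))

print(bwt_inverse("ipssm#pissii"))   # mississippi#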

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]
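A sketch of computing the BWT through a (naively built) suffix array, using L[i] = T[SA[i]−1] interpreted cyclically; the costly suffix sorting is exactly the inefficiency discussed in the next slide.

def bwt(T):
    # T must end with a unique, smallest end-marker (here '#')
    n = len(T)
    SA = sorted(range(n), key=lambda i: T[i:])     # naive suffix sorting
    return "".join(T[(i - 1) % n] for i in SA)     # char preceding each sorted suffix

print(bwt("mississippi#"))   # ipssm#pissii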

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation: L is locally homogeneous  ⇒  L is highly compressible

Algorithm Bzip:
 • Move-to-Front coding of L
 • Run-Length coding
 • Statistical coder
 Bzip vs. Gzip: 20% vs. 33% compression ratio, but Bzip is slower in (de)compression!

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


• Physical network graph: V = routers, E = communication links
• The “cosine” graph (undirected, weighted): V = static web pages, E = semantic distance between pages
• Query-Log graph (bipartite, weighted): V = queries and URLs, E = (q,u) if u is a result for q and has been clicked by some user who issued q
• Social graph (undirected, unweighted): V = users, E = (x,y) if x knows y (facebook, address book, email, ...)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution
The indegree follows a power-law distribution (Altavista crawl 1999, WebBase crawl 2001):

  Pr[ in-degree(u) = k ]  ∝  1 / k^a ,    a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:
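A tiny sketch of the gap encoding of a successor list as defined above; the slides leave the rule for negative entries implicit, so the 2|x| / 2|x|−1 mapping used for the first gap below is an assumption, inferred from the "*2 / *2−1" examples a few slides later.

def v(z):
    # map a possibly negative first gap to a non-negative code
    return 2 * z if z >= 0 else 2 * abs(z) - 1

def gaps(x, successors):
    # successors sorted increasingly; first gap is relative to x (may be < 0)
    out, prev = [], None
    for i, s in enumerate(successors):
        out.append(v(s - x) if i == 0 else s - prev - 1)
        prev = s
    return out

print(gaps(15, [13, 15, 16, 17, 18, 19, 23, 24, 203]))
# [3, 1, 0, 0, 0, 0, 3, 0, 178]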

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


• Delta compression   [diff, zdelta, REBL, …]
   – Compress file f deploying file f’
   – Compress a group of files
   – Speed-up web access by sending differences between the requested page and the ones available in cache
• File synchronization   [rsync, zsync]
   – Client updates an old file f_old with the f_new available on a server
   – Mirroring, Shared Crawling, Content Distribution Networks
• Set reconciliation
   – Client updates a structured old file f_old with the f_new available on a server
   – Update of contacts or appointments, intersecting inverted lists in a P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides an efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
            Emacs size    Emacs time
 uncompr    27Mb          ---
 gzip       8Mb           35 secs
 zdelta     1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

(figure: a small example graph GF with a dummy node 0; edge weights are the zdelta
sizes, dummy edges carry the gzip sizes, and the min branching picks the cheapest
reference for each file)

            space    time
 uncompr    30Mb     ---
 tgz        20%      linear
 THIS       8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
            space    time
 uncompr    260Mb    ---
 tgz        12%      2 mins
 THIS       8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
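A much simplified sketch of the block-matching idea (no rolling hash, no MD5 check, no networking): the client hashes aligned blocks of f_old, the server scans f_new and emits block references or literal bytes. All names and the toy hash are illustrative.

B = 4   # toy block size; rsync defaults to max(700, sqrt(n)) bytes

def client_hashes(f_old):
    # hash of every aligned block of f_old, keyed for lookup on the server
    return {hash(f_old[i:i+B]): i // B
            for i in range(0, len(f_old) - B + 1, B)}

def server_encode(f_new, hashes):
    out, i = [], 0
    while i < len(f_new):
        if len(f_new) - i >= B and hash(f_new[i:i+B]) in hashes:
            out.append(("copy", hashes[hash(f_new[i:i+B])]))   # block of f_old
            i += B
        else:
            out.append(("lit", f_new[i]))                      # literal byte
            i += 1
    return out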

Rsync: some experiments

            gcc size    emacs size
 total      27288       27326
 gzip       7563        8577
 zdelta     227         1431
 rsync      964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), the client checks them
Server deploys the common f_ref to compress the new f_tar (rsync just compresses it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

Occurrences of P in T = all suffixes of T having P as a prefix

 Example: P = si, T = mississippi  ⇒  occurrences at positions 4, 7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
(figure: the suffix tree of T# = mississippi#, with edge labels such as ssi, si,
ppi#, pi#, i#, mississippi#, and leaves storing the starting positions 1..12 of the
suffixes)

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Explicitly storing SUF(T) takes Θ(N²) space; store only the suffix pointers
(starting positions):

 T = mississippi#

 SA      SUF(T)
 12      #
 11      i#
  8      ippi#
  5      issippi#
  2      ississippi#
  1      mississippi#
 10      pi#
  9      ppi#
  7      sippi#
  4      sissippi#
  6      ssippi#
  3      ssissippi#

Suffix Array:
 • SA: Θ(N log₂ N) bits
 • Text T: N chars
 ⇒ In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
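A Python sketch of the O(p log N) search (two binary searches for the boundaries of the suffixes prefixed by P); it relies on bisect's key parameter, available from Python 3.10, and on a naive SA construction just for the example.

import bisect

def sa_build(T):
    return sorted(range(len(T)), key=lambda i: T[i:])   # naive, for illustration

def sa_search(T, SA, P):
    # suffixes with prefix P form a contiguous SA range (Prop 1 above)
    lo = bisect.bisect_left(SA, P, key=lambda i: T[i:i+len(P)])
    hi = bisect.bisect_right(SA, P, key=lambda i: T[i:i+len(P)])
    return sorted(SA[lo:hi])       # starting positions of the occurrences

T = "mississippi#"
SA = sa_build(T)
print(sa_search(T, SA, "si"))      # [3, 6]  (0-based; positions 4 and 7 1-based)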

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does it exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does it exist a substring of length ≥ L occurring ≥ C times ?
• Search for Lcp[i,i+C-2] whose entries are ≥ L


Slide 138

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
  C * p * f/(1+f)
This is at least 10^4 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and the algorithm uses all of it:
  (1/B) * (p * f/(1+f) * C)  ≈  30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
(figure: CPU with registers, L1/L2 caches, RAM, disk, network — caches: few Mbs,
some nanosecs, few words fetched; RAM: few Gbs, tens of nanosecs, some words fetched;
disk: few Tbs, few millisecs, B = 32K page; net: many Tbs, even secs, packets)

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

        4K    8K    16K   32K    128K   256K   512K   1M
  n³    22s   3m    26m   3.5h   28h    --     --     --
  n²    0     0     0     1s     26s    106s   7m     28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm
  sum = 0; max = -1;
  for i = 1,...,n do
    if (sum + A[i] ≤ 0) sum = 0;
    else { sum += A[i]; max = MAX{max, sum}; }

Note:
 • Sum < 0 when OPT starts
 • Sum > 0 within OPT
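The same scan as runnable Python (it also handles zero subsums):

def max_subarray_sum(A):
    best, s = float("-inf"), 0
    for x in A:
        if s + x <= 0:
            s = 0                    # an optimal subarray cannot start with a <=0 prefix
        else:
            s += x
            best = max(best, s)
    return best

print(max_subarray_sum([2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]))   # 12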

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
  if (i < j) then
     m = (i+j)/2;              // Divide
     Merge-Sort(A,i,m);        // Conquer
     Merge-Sort(A,m+1,j);
     Merge(A,i,m,j)            // Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:
 • n = 10^9 tuples  ⇒  a few Gbs
 • Typical disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:
 • It is an indirect sort: Θ(n log₂ n) random I/Os
 • [5ms] * n log₂ n  ≈  1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
(figure: the recursion tree of binary mergesort over the sorted runs — how do we
deploy the disk/mem features?)
N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X ≤ M/B runs at a time  ⇒  log_{M/B}(N/M) passes

(figure: X input buffers and one output buffer, each of B items, kept in main memory;
the runs stream in from disk and the merged run streams back to disk)

Multiway Merging
(figure) Each run i has an input buffer Bf_i of B items with a pointer p_i to its
current item; the merge repeatedly takes min(Bf1[p1], Bf2[p2], …, Bfx[pX]) and
appends it to the output buffer Bf_o; when p_i = B a new page of run i is fetched,
and when Bf_o is full it is flushed to the merged run on disk (until EOF).
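An in-memory sketch of the X-way merge step in Python, with heapq.merge playing the role of the repeated min over the input buffers (the run contents are illustrative):

import heapq

def multiway_merge(runs):
    # runs: list of already-sorted sequences (stand-ins for the on-disk runs)
    return list(heapq.merge(*runs))

runs = [[1, 2, 5, 10], [2, 7, 8, 13, 19], [3, 4, 11, 12, 15, 17]]
print(multiway_merge(runs))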

Cost of Multi-way Merge-Sort


Number of passes = log_{M/B} #runs  ≈  log_{M/B} (N/M)
Optimal cost = Θ( (N/B) · log_{M/B} (N/M) )  I/Os

In practice:
 • M/B ≈ 1000  ⇒  #passes = log_{M/B} (N/M) ≈ 1
 • One multiway merge  ⇒  2 passes = a few mins
 • Tuning depends on disk features
 • A large fan-out (M/B) decreases #passes
 • Compression would decrease the cost of a pass!

Does compression help?
 • Goal: enlarge M and reduce N
 • #passes = O(log_{M/B} (N/M))
 • Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm (keep a pair of variables X, C, with C = 0 initially):
  for each item s of the stream:
    if (C == 0) { X = s; C = 1; }
    else if (X == s) C++;
    else C--;
  return X;

Proof: Suppose the returned X ≠ y. Then every occurrence of y was cancelled by a
“negative” mate (a distinct item), so the mates are ≥ #occ(y) and
mates + occurrences of y ≤ N. But #occ(y) > N/2 implies 2·#occ(y) > N: contradiction.
(Problems arise if the most frequent item occurs ≤ N/2 times: the returned X may be wrong.)
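The scan above in runnable Python (the Boyer–Moore majority vote); the returned candidate is guaranteed correct only when some item really occurs more than N/2 times.

def majority_candidate(stream):
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1
        elif X == s:
            C += 1
        else:
            C -= 1
    return X

print(majority_candidate("bacccdcbaaaccbccc"))   # 'c'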

Toy problem #4: Indexing


Consider the following TREC collection:
 • N = 6 * 10^9 chars, i.e. size ≈ 6Gb
 • n = 10^6 documents
 • TotT = 10^9 term occurrences (avg term length is 6 chars)
 • t = 5 * 10^5 distinct terms

What kind of data structure do we build to support word-based searches?

Solution 1: Term-Doc matrix
t = 500K terms,  n = 1 million documents.  1 if the play contains the word, 0 otherwise:

            Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
 Antony            1                1             0          0       0        1
 Brutus            1                1             0          1       0        0
 Caesar            1                1             0          1       1        1
 Calpurnia         0                1             0          0       0        0
 Cleopatra         1                0             0          0       0        0
 mercy             1                0             1          1       1        1
 worser            1                0             1          1       1        0

Space is 500Gb !

Solution 2: Inverted index
 Brutus     →  2 4 8 16 32 64 128
 Calpurnia  →  1 2 3 5 8 13 21 34
 Caesar     →  13 16

We can still do better, i.e. 30-50% of the original text:
 1. Typically each posting uses about 12 bytes
 2. We have 10^9 total terms  ⇒  at least 12Gb of space
 3. Compressing the 6Gb of documents gets 1.5Gb of data
Better index, but it is still >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

  Σ_{i=1..n−1} 2^i  =  2^n − 2   <   2^n

We need to talk about stochastic sources.

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
  i(s) = log₂ (1/p(s)) = − log₂ p(s)

Lower probability  ⇒  higher information

Entropy is the weighted average of i(s):

  H(S) = Σ_{s∈S} p(s) · log₂ (1/p(s))    bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie whose leaves are the symbols (left edge = 0, right edge = 1).

Average Length
For a code C with codeword length L[s], the
average length is defined as
  L_a(C) = Σ_{s∈S} p(s) · L[s]

We say that a prefix code C is optimal if for all prefix codes C’, L_a(C) ≤ L_a(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then p_i < p_j  ⇒  L[s_i] ≥ L[s_j]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

  H(S) ≤ L_a(C)
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

  L_a(C) ≤ H(S) + 1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
(tree: merge a(.1) and b(.2) into a node of weight .3; merge it with c(.2) into .5;
merge with d(.5) into the root of weight 1)

 a = 000,  b = 001,  c = 01,  d = 1

There are 2^(n-1) “equivalent” Huffman trees

What about ties (and thus, tree depth) ?
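A small heap-based sketch in Python that reproduces the codeword lengths of the running example (the actual 0/1 labels depend on tie-breaking, as just noted):

import heapq

def huffman_lengths(probs):
    # probs: dict symbol -> probability; returns symbol -> codeword length
    heap = [(p, [s]) for s, p in probs.items()]
    heapq.heapify(heap)
    depth = {s: 0 for s in probs}
    while len(heap) > 1:
        p1, g1 = heapq.heappop(heap)
        p2, g2 = heapq.heappop(heap)
        for s in g1 + g2:
            depth[s] += 1          # symbols in the merged subtrees get one bit more
        heapq.heappush(heap, (p1 + p2, g1 + g2))
    return depth

print(huffman_lengths({"a": .1, "b": .2, "c": .2, "d": .5}))
# {'a': 3, 'b': 3, 'c': 2, 'd': 1}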

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
  abc…  →  000 001 01 … = 00000101…            101001…  →  d c b

(decoding walks the same code tree as above)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
  − log₂(.999)  ≈  .00144

If we were to send 1000 such symbols we might hope to use 1000 * .00144 ≈ 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
(figure: a word-based Huffman tree with fan-out 128 over the words and separators of
T = “bzip or not bzip”; each codeword is a sequence of bytes — 7 bits of code per
byte plus a tag bit marking the first byte of a codeword)

CGrep and other ideas...
P = bzip,  T = “bzip or not bzip”
(figure: the pattern is encoded with the same word-based code, P = C(bzip), and
searched directly in C(T); the tag bits prevent false matches across codeword
boundaries)

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary = { bzip, not, or, space }
P = bzip,  S = “bzip or not bzip”
(figure: as in the previous slide, the codeword C(P) is searched directly in C(S))

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

  H(s) = Σ_{i=1..m} 2^{m−i} · s[i]

 P = 0101
 H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5
 s = s’  if and only if  H(s) = H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

  H(T_r) = 2·H(T_{r−1}) − 2^m·T[r−1] + T[r+m−1]

 T = 10110101
 T₁ = 1011,  T₂ = 0110
 H(T₁) = H(1011) = 11
 H(T₂) = H(0110) = 2·11 − 2⁴·1 + 0 = 22 − 16 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
H_q(P) = 47 (mod 7) = 5

H_q(P) can be computed incrementally, bit by bit:
  1·2 + 0 = 2 (mod 7)
  2·2 + 1 = 5 (mod 7)
  5·2 + 1 = 11 ≡ 4 (mod 7)
  4·2 + 1 = 9 ≡ 2 (mod 7)
  2·2 + 1 = 5 (mod 7)     ⇒   H_q(P) = 5

We can still compute H_q(T_r) from H_q(T_{r−1}), since
  2^m (mod q) = 2·(2^{m−1} (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
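A compact Karp–Rabin sketch in Python over a binary text, following the formulas above; here the prime q is fixed instead of drawn at random, and every fingerprint hit is verified (the deterministic variant).

def karp_rabin(T, P, q=101):
    n, m = len(T), len(P)
    if m > n: return []
    hP = hT = 0
    for i in range(m):                       # H_q(P) and H_q(T_1)
        hP = (2 * hP + int(P[i])) % q
        hT = (2 * hT + int(T[i])) % q
    pow_m = pow(2, m, q)                     # 2^m (mod q)
    occ = []
    for r in range(n - m + 1):
        if hP == hT and T[r:r+m] == P:       # verify to rule out false matches
            occ.append(r)
        if r + m < n:                        # roll the window one position
            hT = (2 * hT - pow_m * int(T[r]) + int(T[r + m])) % q
    return occ

print(karp_rabin("10110101", "0101"))        # [4]  (0-based; position 5 1-based)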

Problem 1: Solution
Dictionary = { bzip, not, or, space }
P = bzip,  S = “bzip or not bzip”
(figure: the codeword C(P) is searched directly in C(S))

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M (m × n):
          c  a  l  i  f  o  r  n  i  a
          1  2  3  4  5  6  7  8  9  10
   f      0  0  0  0  1  0  0  0  0  0
   fo     0  0  0  0  0  1  0  0  0  0
   for    0  0  0  0  0  0  1  0  0  0

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit bit-parallelism to compute the j-th column of M from the
(j−1)-th one. We define the m-length binary vector U(x) for each character x in
the alphabet: U(x) is set to 1 at the positions in P where character x appears.

Example: P = abaac
  U(a) = (1,0,1,1,0)ᵀ      U(b) = (0,1,0,0,0)ᵀ      U(c) = (0,0,0,0,1)ᵀ
How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, entry M(i,j) = 1 iff
 (1) the first i−1 characters of P match the i−1 characters of T ending at character j−1
     ⇔  M(i−1, j−1) = 1
 (2) P[i] = T[j]
     ⇔  the i-th bit of U(T[j]) = 1

BitShift moves bit M(i−1, j−1) into the i-th position; AND-ing with the i-th bit of
U(T[j]) establishes whether both conditions hold.
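A word-level Python sketch, where an integer plays the role of a column of M and of the U() masks (bit i−1 = row i), so BitShift becomes a left shift OR 1:

def shift_and(T, P):
    U = {}
    for k, x in enumerate(P):
        U[x] = U.get(x, 0) | (1 << k)      # U[x] has bit k set iff P[k] == x
    M, occ, top = 0, [], 1 << (len(P) - 1)
    for j, c in enumerate(T):
        M = ((M << 1) | 1) & U.get(c, 0)   # M(j) = BitShift(M(j-1)) & U(T[j])
        if M & top:                        # last row set: P ends at position j
            occ.append(j - len(P) + 1)
    return occ

print(shift_and("xabxabaaca", "abaac"))    # [4]  (0-based start of the occurrence)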

An example:  T = xabxabaaca,  P = abaac

 M(1) = BitShift(M(0)) & U(x) = (1,0,0,0,0)ᵀ & (0,0,0,0,0)ᵀ = (0,0,0,0,0)ᵀ
 M(2) = BitShift(M(1)) & U(a) = (1,0,0,0,0)ᵀ & (1,0,1,1,0)ᵀ = (1,0,0,0,0)ᵀ
 M(3) = BitShift(M(2)) & U(b) = (1,1,0,0,0)ᵀ & (0,1,0,0,0)ᵀ = (0,1,0,0,0)ᵀ
 ...
 M(9) = BitShift(M(8)) & U(c) = (1,1,0,0,1)ᵀ & (0,0,0,0,1)ᵀ = (0,0,0,0,1)ᵀ

The full matrix (columns j = 1..10):

            x  a  b  x  a  b  a  a  c  a
   a        0  1  0  0  1  0  1  1  0  1
   ab       0  0  1  0  0  1  0  0  0  0
   aba      0  0  0  0  0  0  1  0  0  0
   abaa     0  0  0  0  0  0  0  1  0  0
   abaac    0  0  0  0  0  0  0  0  1  0

The 1 in the last row at column 9 signals the occurrence of P ending there.

Shift-And method: Complexity








• If m ≤ w, any column and any vector U() fit in a memory word
  ⇒ any step requires O(1) time.
• If m > w, any column and any vector U() can be divided into ⌈m/w⌉ memory words
  ⇒ any step requires O(m/w) time.
• Overall O(n(1 + m/w) + m) time.
• Thus, it is very fast when the pattern length is close to the word size —
  very often in practice; recall that w = 64 bits in modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
  U(a) = (1,0,1,1,0)ᵀ      U(b) = (1,1,0,0,0)ᵀ      U(c) = (0,0,0,0,1)ᵀ

What about ‘?’, ‘[^…]’ (not)?

Problem 1: Another solution
Dictionary = { bzip, not, or, space }
P = bzip,  S = “bzip or not bzip”
(figure: as in the previous solutions, the codeword of P is searched directly in C(S))

Speed ≈ Compression ratio

Problem 2
Dictionary = { bzip, not, or, space }
Given a pattern P, find all the occurrences in S of all dictionary terms containing
P as a substring.
Example: P = o,  S = “bzip or not bzip”
(figure: both “not” and “or” contain P, so both their codewords —
 not = 1g 0g 0a,  or = 1g 0a 0b — must be searched in C(S))

Speed ≈ Compression ratio? No! Why?
A scan of C(S) for each term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And
 • S is the concatenation of the patterns in P
 • R is a bitmap of length m:  R[i] = 1 iff S[i] is the first symbol of a pattern
 • Use a variant of the Shift-And method searching for S:
    – For any symbol c, U’(c) = U(c) AND R, i.e. U’(c)[i] = 1 iff S[i] = c and it is
      the first symbol of a pattern
    – For any step j: compute M(j), then OR it with U’(T[j]). Why? To set to 1 the
      first bit of each pattern that starts with T[j]
    – Check if there are occurrences ending in j. How?

Problem 3
Dictionary = { bzip, not, or, space }
Given a pattern P, find all the occurrences in S of all dictionary terms containing
P as a substring, allowing at most k mismatches.
Example: P = bot, k = 2,  S = “bzip or not bzip”
(figure: the word-based code of the previous slides)

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff there are no more than l mismatches between the first i characters
of P and the i characters of T ending at character j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

  BitShift( M^l(j−1) )  &  U(T[j])

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

  BitShift( M^{l−1}(j−1) )

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1:   T = xabxabaaca,   P = abaad

 M0 =        x  a  b  x  a  b  a  a  c  a
   a         0  1  0  0  1  0  1  1  0  1
   ab        0  0  1  0  0  1  0  0  0  0
   aba       0  0  0  0  0  0  1  0  0  0
   abaa      0  0  0  0  0  0  0  1  0  0
   abaad     0  0  0  0  0  0  0  0  0  0

 M1 =        x  a  b  x  a  b  a  a  c  a
   a         1  1  1  1  1  1  1  1  1  1
   ab        0  0  1  0  0  1  0  1  1  0
   aba       0  0  0  1  0  0  1  0  0  1
   abaa      0  0  0  0  1  0  0  1  0  0
   abaad     0  0  0  0  0  0  0  0  1  0

The 1 in the last row of M1 at column 9 reports the occurrence of P with ≤ 1 mismatch ending there.

How much do we pay?





• The running time is O(k·n·(1 + m/w)).
• Again, the method is practically efficient for small m.
• Still, only O(k) columns of M are needed at any given time. Hence, the space used
  by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary = { bzip, not, or, space }
Given a pattern P, find all the occurrences in S of all dictionary terms containing
P as a substring, allowing k mismatches.
Example: P = bot, k = 2,  S = “bzip or not bzip”    (not = 1g 0g 0a)
(figure: the word-based code of the previous slides)

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is d(p,s) = the minimum number of
operations needed to transform p into s via three ops:
 • Insertion: insert a symbol in p
 • Deletion: delete a symbol from p
 • Substitution: change a symbol of p into a different one

Example: d(ananas, banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
(Length − 1) zeros, then x in binary

 • x > 0 and Length = ⌊log₂ x⌋ + 1
 • e.g., 9 is represented as <000, 1001>
 • the g-code for x takes 2·⌊log₂ x⌋ + 1 bits (i.e. a factor of 2 from optimal)
 • Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…
Given the following sequence of g-coded integers, reconstruct the original sequence:

  0001000001100110000011101100111    →    8, 6, 3, 59, 7
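A tiny g-code decoder sketch in Python (count the leading zeros, then read that many + 1 bits of binary value); it checks the exercise above.

def g_decode(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":                # count the leading zeros
            z += 1; i += 1
        out.append(int(bits[i:i+z+1], 2))    # read z+1 bits of binary value
        i += z + 1
    return out

print(g_decode("0001000001100110000011101100111"))   # [8, 6, 3, 59, 7]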

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How good is this approach wrt Huffman?
 Compression ratio ≤ 2·H₀(s) + 1

Key fact:   1 ≥ Σ_{i=1..x} p_i ≥ x·p_x    ⇒    x ≤ 1/p_x

How good is it?
Encode the integers via g-coding:  |g(i)| ≤ 2·log i + 1
The cost of the encoding is (recall i ≤ 1/p_i):

  Σ_{i=1..|S|} p_i · |g(i)|  ≤  Σ_{i=1..|S|} p_i · [ 2·log (1/p_i) + 1 ]  =  2·H₀(X) + 1

Not much worse than Huffman,
and improvable to H₀(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

Properties:
 • There is a memory: it exploits temporal locality, and it is dynamic
 • X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = O(n² log n),  MTF(X) = O(n log n) + n²

No much worse than Huffman
...but it may be far better

MTF: how good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
Put S in the front and consider the cost of encoding:
S

O ( S log S ) 



nx

g(p - p )

x 1 i  2

x

x

i

i -1

S

By Jensen’s:

 O ( S log S ) 

n
x 1

x

[ 2 * log

N

 1]

nx

 O ( S log S )  N * [ 2 * H 0 ( X )  1]
L a [ mtf ]  2 * H 0 ( X )  O (1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1n 2n 3n… nn 

There is a memory

Huff(X) = Q(n2 log n) > Rle(X) = Q(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.
1.0

c = .3

0.7

i -1

f (i ) 



p( j)

j 1

b = .5
0.2
0.0

f(a) = .0, f(b) = .2, f(c) = .7
a = .2

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
1.0

0.7
c = .3

0.7

c = .3
0.55

0.0

a = .2

0.3
0.2

c = .3

0.27
b = .5

b = .5
0.2

0.3

a = .2

b = .5
0.22

a = .2

0.2

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l 0  0 l i  l i -1  s i -1 * f c i 

s0  1

s i  s i -1 * p c i 

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

n

sn 

 p c 
i

i 1

The interval for a message sequence will be called the
sequence interval

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
1.0

0.7
c = .3

0.7
0.49

b = .5

c = .3

0.55
0.49

0.2

0.0

0.55

b = .5

0.3
a = .2

The message is bbc.

0.2

0.49
0.475

c = .3

b = .5
0.35

a = .2

0.3

a = .2

Representing a real number
Binary fractional representation:
. 75

 . 11

1/ 3

 . 01 01

11 / 16

 . 1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
m in

m ax

interval

.11

.110

.111

[.75 ,1.0 )

.101

.1010

.1011

[.625 , .75 )

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem: operations on arbitrary-precision
real numbers are expensive.
Key ideas of the integer version:
  Keep integers in the range [0..R), where R = 2^k
  Use rounding to generate the integer intervals
  Whenever the sequence interval falls into the top,
  bottom or middle half, expand the interval
  by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 (top half):
  Output 1 followed by m 0s; m = 0
  Message interval is expanded by 2

If u < R/2 (bottom half):
  Output 0 followed by m 1s; m = 0
  Message interval is expanded by 2

If l ≥ R/4 and u < 3R/4 (middle half):
  Increment m
  Message interval is expanded by 2

In all other cases, just continue...
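A C++ sketch of the renormalization loop above, assuming a 16-bit register (R = 2^16) and a half-open integer interval [l, u); the sample interval in main is illustrative only.

  #include <cstdint>
  #include <iostream>
  #include <vector>

  void renormalize(uint32_t& l, uint32_t& u, int& m, std::vector<int>& bits) {
      const uint32_t R = 1u << 16, HALF = R / 2, QUARTER = R / 4;
      auto emit = [&](int b) {                         // output b followed by m copies of !b
          bits.push_back(b);
          while (m > 0) { bits.push_back(!b); m--; }
      };
      for (;;) {
          if (u < HALF)            emit(0);                               // bottom half
          else if (l >= HALF)      { emit(1); l -= HALF; u -= HALF; }     // top half
          else if (l >= QUARTER && u < HALF + QUARTER)
                                   { m++; l -= QUARTER; u -= QUARTER; }   // middle half
          else break;                                                     // otherwise: continue
          l *= 2; u *= 2;                              // the interval is expanded by a factor 2
      }
  }

  int main() {
      uint32_t l = 45000, u = 46000; int m = 0;
      std::vector<int> bits;
      renormalize(l, u, m, bits);
      for (int b : bits) std::cout << b;
      std::cout << "  pending m=" << m << "  interval=[" << l << "," << u << ")\n";
  }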

You find this at

Arithmetic ToolBox
As a state machine

[diagram: the ATB, as a state machine, maps the current interval (L,s) and the next symbol c, drawn from the distribution (p1,...,p|S|), to the new interval (L',s')]

Therefore, even the distribution can change over time

K-th order models: PPM
Use the previous k characters as the context.
  Makes use of conditional probabilities: this is the changing distribution.

Base probabilities on counts:
  e.g. if th has been seen 12 times, followed by e 7 of those times, then
  the conditional probability p(e|th) = 7/12.

Need to keep k small so that the dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: what do we do if the current context followed by the
next character has never been seen before?
  We cannot code 0 probabilities!

The key idea of PPM is to reduce the context size if the
current context with that character has not been seen:
  if the character has not been seen before with the current context of
  size 3, send an escape-msg and then try the context of size 2,
  then again an escape-msg and the context of size 1, ...

Keep statistics for each context size < k
The escape is a special character with some probability.
  Different variants of PPM use different heuristics for
  this probability.
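A toy C++ sketch of the context counting (order 2); the text is illustrative and the escape probability itself is not modeled, only the fall-back signal.

  #include <map>
  #include <string>
  #include <iostream>

  int main() {
      std::string T = "the theme of the thesis";
      std::map<std::string, std::map<char,int>> cnt;          // context -> next-char counts
      for (size_t i = 2; i < T.size(); i++)
          cnt[T.substr(i - 2, 2)][T[i]]++;

      auto prob = [&](const std::string& ctx, char c) -> double {
          int tot = 0;
          for (auto& kv : cnt[ctx]) tot += kv.second;
          if (tot == 0 || cnt[ctx][c] == 0) return -1;         // would emit an escape and
          return (double)cnt[ctx][c] / tot;                    // retry with a shorter context
      };

      std::cout << prob("th", 'e') << "\n";   // conditional probability p(e|th) from the counts
  }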

PPM + Arithmetic ToolBox
[diagram: the ATB is driven by p[ s | context ], where s is either the next character c or the escape symbol esc; at each step it maps (L,s) to (L',s')]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts               String = ACCBACCACBA B,   k = 2

Order 0 - Context "" (empty):  A = 4,  B = 2,  C = 5,  $ = 3

Order 1 - Context A:   C = 3,  $ = 1
          Context B:   A = 2,  $ = 1
          Context C:   A = 1,  B = 2,  C = 2,  $ = 3

Order 2 - Context AC:  B = 1,  C = 2,  $ = 2
          Context BA:  C = 1,  $ = 1
          Context CA:  C = 1,  $ = 1
          Context CB:  A = 2,  $ = 1
          Context CC:  A = 1,  B = 1,  $ = 2
You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm's step:
  Output ⟨d, len, c⟩ where
    d   = distance of the copied string wrt the current position
    len = length of the longest match
    c   = next char in the text beyond the longest match
  Advance by len + 1

A buffer "window" of fixed length slides over the text

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6; each triple records the longest match within W and the next character
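A naive C++ sketch of this parser (quadratic window scan; gzip replaces it with hashing on triplets). On the example text above with W = 6 it emits exactly the five triples shown.

  #include <algorithm>
  #include <string>
  #include <tuple>
  #include <vector>
  #include <iostream>

  std::vector<std::tuple<int,int,char>> lz77(const std::string& T, int W) {
      std::vector<std::tuple<int,int,char>> out;
      int i = 0, n = T.size();
      while (i < n) {
          int best_len = 0, best_d = 0;
          for (int j = std::max(0, i - W); j < i; j++) {            // copy positions in the window
              int len = 0;
              while (i + len < n - 1 && T[j + len] == T[i + len]) len++;   // overlap allowed
              if (len > best_len) { best_len = len; best_d = i - j; }
          }
          out.emplace_back(best_d, best_len, T[i + best_len]);      // <d, len, next char>
          i += best_len + 1;                                        // advance by len + 1
      }
      return out;
  }

  int main() {
      for (auto [d, len, c] : lz77("aacaacabcabaaac", 6))
          std::cout << "(" << d << "," << len << "," << c << ") ";
      // prints (0,0,a) (1,1,c) (3,4,b) (3,3,a) (1,2,c)
  }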

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if len > d ? (overlap with the part of the text still to be decoded)
  E.g. seen = abcd, next codeword is (2,9,e)
  Simply copy starting at the cursor:
    for (i = 0; i < len; i++)
       out[cursor+i] = out[cursor-d+i];   /* later iterations re-read chars written by earlier ones */



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
  (0, position, length)  or  (1, char)

Typically the second format is used if length < 3.

Special greedy: possibly use a shorter match so
that the next match is better
Hash table to speed up the search for matches, keyed on triplets of chars
Triples are coded with Huffman's code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids
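A C++ sketch of the coding loop, with the trie stored as a map keyed by (node id, next char); on the text of the coding example below it emits (0,a) (1,b) (1,a) (0,c) (2,c) (5,b).

  #include <map>
  #include <string>
  #include <utility>
  #include <iostream>

  int main() {
      std::string T = "aabaacabcabcb";
      std::map<std::pair<int,char>, int> trie;     // (parent id, char) -> child id; 0 = empty string
      int next_id = 1;
      size_t i = 0;
      while (i < T.size()) {
          int node = 0;                             // longest match S found so far, as a trie node
          while (i < T.size() && trie.count({node, T[i]})) node = trie[{node, T[i++]}];
          char c = (i < T.size()) ? T[i++] : '\0';  // next char after the match
          std::cout << "(" << node << "," << c << ") ";
          trie[{node, c}] = next_id++;              // add the substring S·c to the dictionary
      }
  }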

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
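A C++ sketch of LZW decoding, including the special case above: when the received code is not yet in the dictionary, the decoded string is prev + prev[0]. The code values 112/113/114 follow the example below; they are illustrative, not real ASCII.

  #include <map>
  #include <string>
  #include <vector>
  #include <iostream>

  std::string lzw_decode(const std::vector<int>& codes) {
      std::map<int, std::string> dict = {{112, "a"}, {113, "b"}, {114, "c"}};
      int next_code = 256;
      std::string out, prev;
      for (int code : codes) {
          std::string cur = dict.count(code) ? dict[code]
                                             : prev + prev[0];          // special case
          out += cur;
          if (!prev.empty()) dict[next_code++] = prev + cur[0];         // decoder is one step behind
          prev = cur;
      }
      return out;
  }

  int main() {
      std::cout << lzw_decode({112,112,113,256,114,257,261,114}) << "\n";
      // prints aabaacababac, the text covered by the 8 output codes of the example below
  }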

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
We are given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F                 L
#  mississipp   i
i  #mississip   p
i  ppi#missis   s
i  ssippi#mis   s
i  ssissippi#   m
m  ississippi   #
p  i#mississi   p
p  pi#mississ   i
s  ippi#missi   s
s  issippi#mi   s
s  sippi#miss   i
s  sissippi#m   i

L = last column of the sorted rotation matrix: this is the BWT of T   (Burrows-Wheeler, 1994)

A famous example

Much
longer...

A useful tool: the L → F mapping

[the same sorted-rotation matrix as above: only L (and hence F = sorted L) is known, the middle is unknown]

How do we map L's chars onto F's chars ?
... we need to distinguish equal chars in F...

Take two equal chars of L and rotate their rows rightward by one position:
they become rows starting with that char, and they keep the same relative order !!
Hence the k-th occurrence of a char in L corresponds to the k-th occurrence of that char in F.

The BWT is invertible
[the same F / L matrix as above: only L, and hence F = sorted L, is known to the decoder]

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
  Compute LF[0,n-1];
  r = 0; i = n;            /* row 0 is the one starting with #; T[1,n] is the text without the final # */
  while (i > 0) {
    T[i] = L[r];           /* L[r] is the char preceding F[r] in T */
    r = LF[r]; i--;
  }
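A self-contained C++ sketch of the inversion: LF is computed by counting, exploiting the fact that equal chars keep their relative order between L and F.

  #include <algorithm>
  #include <string>
  #include <vector>
  #include <iostream>

  std::string invert_bwt(const std::string& L) {          // L = last column, '#' = end marker
      int n = L.size();
      std::string F = L;
      std::sort(F.begin(), F.end());                       // F = first column = sorted L
      std::vector<int> first(256, -1), seen(256, 0), LF(n);
      for (int i = 0; i < n; i++)
          if (first[(unsigned char)F[i]] < 0) first[(unsigned char)F[i]] = i;
      for (int i = 0; i < n; i++) {                        // k-th c in L -> k-th c in F
          unsigned char c = L[i];
          LF[i] = first[c] + seen[c]++;
      }
      std::string T(n, ' ');
      int r = 0;                                           // row 0 starts with the end marker '#'
      for (int i = n - 1; i >= 0; i--) { T[i] = L[r]; r = LF[r]; }
      return T.substr(1) + T[0];                           // rotate: put '#' back at the end
  }

  int main() {
      std::cout << invert_bwt("ipssm#pissii") << "\n";     // prints mississippi#
  }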

How to compute the BWT ?
SA    BWT matrix     L
12    #mississipp    i
11    i#mississip    p
 8    ippi#missis    s
 5    issippi#mis    s
 2    ississippi#    m
 1    mississippi    #
10    pi#mississi    p
 9    ppi#mississ    i
 7    sippi#missi    s
 4    sissippi#mi    s
 6    ssippi#miss    i
 3    ssissippi#m    i

We said that: L[i] precedes F[i] in T
e.g. L[3] = T[SA[3]-1] = T[7]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA    suffix
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#
Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n^2 log n) time in the worst case
• Θ(n log n) cache misses or I/O faults
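The "elegant but inefficient" construction as a C++ sketch: sort the suffix starting positions by comparing suffixes directly, then read off L. (0-based positions here, unlike the 1-based SA of the slides.)

  #include <algorithm>
  #include <numeric>
  #include <string>
  #include <vector>
  #include <iostream>

  int main() {
      std::string T = "mississippi#";
      int n = T.size();
      std::vector<int> SA(n);
      std::iota(SA.begin(), SA.end(), 0);                          // 0-based starting positions
      std::sort(SA.begin(), SA.end(), [&](int a, int b) {
          return T.compare(a, n - a, T, b, n - b) < 0;             // compare the two suffixes
      });
      std::string L(n, ' ');
      for (int i = 0; i < n; i++)                                  // L[i] = char preceding suffix SA[i]
          L[i] = T[(SA[i] + n - 1) % n];
      std::cout << L << "\n";                                      // prints ipssm#pissii
  }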

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion pages are available (Google, 7/2008)
5-40KB per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)
  Set of nodes such that from any node one can reach any other node via an
  undirected path.

Strongly connected components (SCC)
  Set of nodes such that from any node one can reach any other node via a
  directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by humans

Exploit the structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…

  Physical network graph
    V = routers
    E = communication links

  The "cosine" graph (undirected, weighted)
    V = static web pages
    E = semantic distance between pages

  Query-Log graph (bipartite, weighted)
    V = queries and URLs
    E = (q,u) if u is a result for q, and has been clicked by some
        user who issued q

  Social graph (undirected, unweighted)
    V = users
    E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Prob. that a node has x links is 1/x^α, with α ≈ 2.1

The In-degree distribution

[plots: Altavista crawl (1999) and WebBase crawl (2001); the indegree follows a power-law distribution]

  Pr[ in-degree(u) = k ]  ∝  1/k^α,    α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Prob. that a node has x links is 1/x^α, with α ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 million pages, 150 million links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph

[figure: an uncompressed adjacency list vs. the same list with compressed gaps (locality)]

Successor list S(x) = (s1, ..., sk) is encoded as the gaps {s1-x, s2-s1-1, ..., sk-s(k-1)-1}

For negative entries (only the first gap can be negative): map v ≥ 0 to 2v and v < 0 to 2|v|-1, so that a single unsigned code can be used
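A C++ sketch of the gap transformation, assuming a sorted successor list; node ids are illustrative.

  #include <cstdlib>
  #include <vector>
  #include <iostream>

  std::vector<long> gaps(long x, const std::vector<long>& succ) {   // succ must be sorted
      std::vector<long> out;
      for (size_t i = 0; i < succ.size(); i++) {
          if (i == 0) {
              long v = succ[0] - x;                     // may be negative
              out.push_back(v >= 0 ? 2 * v : 2 * std::labs(v) - 1);
          } else {
              out.push_back(succ[i] - succ[i - 1] - 1); // gaps between consecutive successors
          }
      }
      return out;                                        // each value is then γ- or ζ-coded
  }

  int main() {
      for (long g : gaps(15, {13, 15, 16, 17, 18, 19, 23, 24, 203})) std::cout << g << ' ';
  }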

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of a node's copy-list tells whether the corresponding successor of the
reference is also a successor of that node;
the reference index is chosen in [0,W] so as to give the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Interval length: decremented by Lmin = 2
Residuals: encoded as differences between consecutive residuals,
or wrt the source node

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver:
  Unstructured file: delta compression

"Partial" knowledge:
  Unstructured files: file synchronization
  Record-based data: set reconciliation

Formalization


Delta compression   [diff, zdelta, REBL, …]
  Compress file f deploying file f'
  Compress a group of files
  Speed up web access by sending the differences between the requested
  page and the ones available in cache

File synchronization   [rsync, zsync]
  Client updates its old file f_old with f_new available on a server
  Mirroring, Shared Crawling, Content Distribution Networks

Set reconciliation
  Client updates its structured old file f_old with f_new available on a server
  Update of contacts or appointments, intersect inverted lists in a P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files f_known and f_new, and the goal is
to compute a file f_d of minimum size such that f_new can
be derived from f_known and f_d

  Assume that block moves and copies are allowed
  Find an optimal covering set of f_new based on f_known
  The LZ77-scheme provides an efficient, optimal solution:
    f_known is the "previously encoded text": compress the concatenation
    f_known·f_new, starting from f_new

zdelta is one of the best implementations

            Emacs size   Emacs time
  uncompr   27Mb         ---
  gzip      8Mb          35 secs
  zdelta    1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

[diagram: Client ⇄ client-side proxy ⇄ (slow link, delta-encoded pages wrt a shared reference) ⇄ server-side proxy ⇄ (fast link) ⇄ web]

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f ∈ F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[figure: example weighted graph over the files plus the dummy node 0 (gzip costs on its edges), and its minimum branching]

            space   time
  uncompr   30Mb    ---
  tgz       20%     linear
  THIS      8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: constructing G is very costly: n^2 edge calculations (zdelta executions)
  We wish to exploit some pruning approach

  Collection analysis: cluster the files that appear similar and are thus
  good candidates for zdelta-compression; build a sparse weighted graph
  G'_F containing only the edges between those pairs of files

  Assign weights: estimate appropriate edge weights for G'_F, thus saving
  zdelta executions. Nonetheless, strictly n^2 time

            space   time
  uncompr   260Mb   ---
  tgz       12%     2 mins
  THIS      8%      16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
[diagram: the Client holds the out-dated f_old, the Server holds f_new; the Client sends an update request and the Server must send back an encoding of f_new]

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
[diagram: the Client sends the block hashes of f_old; the Server compares them against f_new and returns an encoded file made of block references and literals]

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
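A C++ sketch of a simple additive rolling checksum in the spirit of rsync's 4-byte weak hash: the checksum of the next window is obtained in O(1) from the previous one. Constants and window size are illustrative, not rsync's actual ones.

  #include <cstdint>
  #include <string>
  #include <iostream>

  int main() {
      std::string data = "the quick brown fox jumps over the lazy dog";
      const int B = 8;                                   // block (window) size
      uint32_t a = 0, b = 0;                             // a = sum of bytes, b = weighted sum
      for (int i = 0; i < B; i++) { a += (unsigned char)data[i]; b += a; }

      for (size_t i = B; i < data.size(); i++) {         // slide the window by one char
          unsigned char out = data[i - B], in = data[i];
          a = a - out + in;                              // drop the leaving byte, add the new one
          b = b - B * out + a;                           // update the weighted sum in O(1)
          uint32_t h = (b << 16) | (a & 0xffff);         // 4-byte rolling checksum of the window
          std::cout << std::hex << h << ' ';
      }
  }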

Rsync: some experiments

            gcc size   emacs size
  total     27288      27326
  gzip      7563       8577
  zdelta    227        1431
  rsync     964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends hashes (unlike the client in rsync), and the client checks them.
The server deploys the common f_ref to compress the new f_tar (rsync compresses it on its own).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

[figure: P occurring at position i of T, i.e. P is a prefix of the suffix T[i,N]]

Occurrences of P in T = all suffixes of T having P as a prefix

Example: P = si, T = mississippi → occurrences at positions 4 and 7

SUF(T) = sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
T# = mississippi#

[figure: the suffix tree of T#; edges are labeled with substrings (e.g. #, i, i#, s, si, ssi, p, pi#, ppi#, mississippi#) and its 12 leaves are labeled with the starting positions of the corresponding suffixes]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Storing SUF(T) explicitly takes Θ(N^2) space; the Suffix Array stores only the starting positions (suffix pointers):

SA    SUF(T)
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#

T = mississippi#,   P = si

Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
→ In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison

[figure: binary search on the SA of T = mississippi# for P = si; each step makes 2 accesses (one to SA, one to the text) and decides whether P is larger or smaller than the probed suffix]

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char comparisons
→ overall, O(p log2 N) time

Improvable to O(p + log2 N) [Manber-Myers, '90] and to O(p + log2 |S|) [Cole et al., '06]
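A C++ sketch of the indirect binary search (SA built naively as before; 0-based construction, positions reported 1-based).

  #include <algorithm>
  #include <numeric>
  #include <string>
  #include <vector>
  #include <iostream>

  int main() {
      std::string T = "mississippi#", P = "si";
      int n = T.size();
      std::vector<int> SA(n);
      std::iota(SA.begin(), SA.end(), 0);
      std::sort(SA.begin(), SA.end(), [&](int a, int b) {          // naive SA construction
          return T.compare(a, n - a, T, b, n - b) < 0;
      });
      // binary search: first suffix that is >= P (each comparison reads at most p chars)
      auto it = std::partition_point(SA.begin(), SA.end(),
                    [&](int s) { return T.compare(s, P.size(), P) < 0; });
      // the occurrences are the contiguous suffixes having P as a prefix
      for (; it != SA.end() && T.compare(*it, P.size(), P) == 0; ++it)
          std::cout << *it + 1 << ' ';                             // prints 7 4
  }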

Locating the occurrences

[figure: on T = mississippi#, the suffixes prefixed by P = si are contiguous in SA: SA[9] = 7 (sippi#) and SA[10] = 4 (sissippi#), hence occ = 2. The range can be found by binary searching for P# and P$, where # is smaller and $ larger than every symbol of S.]

Suffix Array search:  O(p + log2 N + occ) time
Suffix Trays:  O(p + log2 |S| + occ)   [Cole et al., '06]
String B-tree   [Ferragina-Grossi, '95]
Self-adjusting Suffix Arrays   [Ciriani et al., '02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

[figure: the SA of T = mississippi# with the corresponding Lcp array; e.g. the Lcp entry between issippi# and ississippi# is 4]

• How long is the common prefix between T[i,...] and T[j,...] ?
  • It is the minimum of the subarray Lcp[h,k-1], where SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  • Search for an entry Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  • Search for a subarray Lcp[i,i+C-2] whose entries are all ≥ L
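A C++ sketch of Kasai's linear-time construction of the Lcp array just defined (0-based arrays; SA built naively, see the construction slide above).

  #include <algorithm>
  #include <numeric>
  #include <string>
  #include <vector>
  #include <iostream>

  int main() {
      std::string T = "mississippi#";
      int n = T.size();
      std::vector<int> SA(n), rnk(n), Lcp(n, 0);
      std::iota(SA.begin(), SA.end(), 0);
      std::sort(SA.begin(), SA.end(), [&](int a, int b) {
          return T.compare(a, n - a, T, b, n - b) < 0;
      });
      for (int i = 0; i < n; i++) rnk[SA[i]] = i;
      int h = 0;
      for (int i = 0; i < n; i++) {                       // process suffixes in text order
          if (rnk[i] > 0) {
              int j = SA[rnk[i] - 1];                     // suffix preceding T[i,...] in SA
              while (i + h < n && j + h < n && T[i + h] == T[j + h]) h++;
              Lcp[rnk[i]] = h;
              if (h > 0) h--;                             // key invariant: lcp drops by at most 1
          } else h = 0;
      }
      for (int i = 1; i < n; i++) std::cout << Lcp[i] << ' ';   // 0 1 1 4 0 0 1 0 2 1 3
  }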


Slide 139

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1n 2n 3n… nn 

There is a memory

Huff(X) = Q(n2 log n) > Rle(X) = Q(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.
1.0

c = .3

0.7

i -1

f (i ) 



p( j)

j 1

b = .5
0.2
0.0

f(a) = .0, f(b) = .2, f(c) = .7
a = .2

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
1.0

0.7
c = .3

0.7

c = .3
0.55

0.0

a = .2

0.3
0.2

c = .3

0.27
b = .5

b = .5
0.2

0.3

a = .2

b = .5
0.22

a = .2

0.2

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l 0  0 l i  l i -1  s i -1 * f c i 

s0  1

s i  s i -1 * p c i 

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

n

sn 

 p c 
i

i 1

The interval for a message sequence will be called the
sequence interval

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
1.0

0.7
c = .3

0.7
0.49

b = .5

c = .3

0.55
0.49

0.2

0.0

0.55

b = .5

0.3
a = .2

The message is bbc.

0.2

0.49
0.475

c = .3

b = .5
0.35

a = .2

0.3

a = .2

Representing a real number
Binary fractional representation:
. 75

 . 11

1/ 3

 . 01 01

11 / 16

 . 1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
m in

m ax

interval

.11

.110

.111

[.75 ,1.0 )

.101

.1010

.1011

[.625 , .75 )

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
(the same ATB state machine, now driven by PPM: the symbol s — either a character c or an escape esc —
is coded with probability p[ s | context ], mapping (L,s) to (L’,s’))

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
String = ACCBACCACBA B        k = 2

Context    Counts
Empty      A = 4   B = 2   C = 5   $ = 3

A          C = 3   $ = 1
B          A = 2   $ = 1
C          A = 1   B = 2   C = 2   $ = 3

AC         B = 1   C = 2   $ = 2
BA         C = 1   $ = 1
CA         C = 1   $ = 1
CB         A = 2   $ = 1
CC         A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
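A minimal LZ77 sketch with a sliding window; on the window example above (text "aacaacabcabaaac", window 6) it reproduces the triples (0,0,a) (1,1,c) (3,4,b) (3,3,a) (1,2,c), and the decoder handles the overlapping-copy case just described:

def lz77_encode(text, window=6):
    out, cursor, n = [], 0, len(text)
    while cursor < n:
        best_d, best_len = 0, 0
        start = max(0, cursor - window)
        for d in range(1, cursor - start + 1):        # candidate copy distances
            length = 0
            while (cursor + length < n - 1 and
                   text[cursor - d + length] == text[cursor + length]):
                length += 1                            # the copy may overlap the cursor
            if length > best_len:
                best_d, best_len = d, length
        out.append((best_d, best_len, text[cursor + best_len]))   # <d, len, next char>
        cursor += best_len + 1
    return out

def lz77_decode(tokens):
    out = []
    for d, length, c in tokens:
        for _ in range(length):
            out.append(out[len(out) - d])              # copy, possibly overlapping
        out.append(c)
    return ''.join(out)

# lz77_decode(lz77_encode("aacaacabcabaaac")) == "aacaacabcabaaac"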

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
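A minimal LZW sketch (the initial dictionary here contains just the characters of the input, rather than the 256 ASCII entries); the decoder’s else-branch is exactly the special SSc case:

def lzw_encode(text):
    alphabet = sorted(set(text))
    dictionary = {c: i for i, c in enumerate(alphabet)}
    out, S = [], text[0]
    for c in text[1:]:
        if S + c in dictionary:
            S = S + c                                  # extend the current match
        else:
            out.append(dictionary[S])                  # emit the id of the longest match
            dictionary[S + c] = len(dictionary)        # add Sc to the dictionary
            S = c
    out.append(dictionary[S])
    return out, alphabet

def lzw_decode(codes, alphabet):
    dictionary = {i: c for i, c in enumerate(alphabet)}
    prev = dictionary[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        else:                                          # the SSc case: id not yet known
            entry = prev + prev[0]
        out.append(entry)
        dictionary[len(dictionary)] = prev + entry[0]  # decoder is one step behind
        prev = entry
    return ''.join(out)

# lzw_decode(*lzw_encode("aabaacababacb")) == "aabaacababacb"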

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us be given the text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

(Burrows-Wheeler, 1994)

F                 L
#  mississipp     i
i  #mississip     p
i  ppi#missis     s
i  ssippi#mis     s
i  ssissippi#     m
m  ississippi     #
p  i#mississi     p
p  pi#mississ     i
s  ippi#missi     s
s  issippi#mi     s
s  sippi#miss     i
s  sissippi#m     i

BWT(T) = L = ipssm#pissii

A famous example

Much
longer...

A useful tool: the L → F mapping
(consider the same sorted matrix as above, but now T is unknown: only L is given, and F is obtained by sorting L)

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
(again the sorted matrix, with T unknown: F = the sorted characters, L = the BWT output)

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
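A minimal Python sketch of this backward reconstruction via the LF mapping (it assumes, as above, that T ends with the unique sentinel '#'):

def invert_bwt(L):
    n = len(L)
    F = sorted(L)
    rank = [0] * n                          # rank[r] = # occurrences of L[r] in L[0..r-1]
    count = {}
    for r, c in enumerate(L):
        rank[r] = count.get(c, 0)
        count[c] = rank[r] + 1
    first = {}                              # first[c] = first row of F starting with c
    for r, c in enumerate(F):
        first.setdefault(c, r)
    LF = [first[L[r]] + rank[r] for r in range(n)]

    out, r = [], 0                          # row 0 is the rotation starting with '#'
    for _ in range(n):
        out.append(L[r])                    # L[r] precedes F[r] in T
        r = LF[r]
    T = ''.join(reversed(out))              # the walk is backwards, so reverse it
    return T[1:] + T[0]                     # the sentinel ends up in front: rotate it back

# invert_bwt("ipssm#pissii") == "mississippi#"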

How to compute the BWT ?
SA     BWT matrix        L
12     #mississipp       i
11     i#mississip       p
 8     ippi#missis       s
 5     issippi#mis       s
 2     ississippi#       m
 1     mississippi       #
10     pi#mississi       p
 9     ppi#mississ       i
 7     sippi#missi       s
 4     sissippi#mi       s
 6     ssippi#miss       i
 3     ssissippi#m       i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]
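A deliberately naive sketch of this construction: sort the suffixes to get SA, then read off L[i] = T[SA[i]-1] (1-based positions, with the wrap-around for SA[i]=1) — with exactly the inefficiencies discussed next:

def suffix_array(T):
    return sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])

def bwt(T):
    SA = suffix_array(T)
    L = ''.join(T[i - 2] if i > 1 else T[-1] for i in SA)
    return L, SA

# bwt("mississippi#") -> ("ipssm#pissii", [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3])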

How to construct SA from T ?
SA
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#
Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T   = mississippimississippimississippi
L   = ipppssssssmmmii#pppiiissssssiiiiii        (# at position 16)
MTF list = [i,m,p,s]
MTF(L)   = 020030000030030200300300000100000
(the same sequence with every value increased by one: 030040000040040300400400000200000; Bin(6)=110, Wheeler’s code)
RLE0     = 03141041403141410210

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
               ... plus γ(16), plus the original MTF-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions
 Weakly connected components (WCC)
   Set of nodes such that from any node one can reach any other node via an undirected path.
 Strongly connected components (SCC)
   Set of nodes such that from any node one can reach any other node via a directed path.

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by humankind

Exploit the structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…
 Physical network graph
   V = routers, E = communication links
 The “cosine” graph (undirected, weighted)
   V = static web pages, E = semantic distance between pages
 Query-Log graph (bipartite, weighted)
   V = queries and URLs, E = (q,u) if u is a result for q, and has been clicked by some user who issued q
 Social graph (undirected, unweighted)
   V = users, E = (x,y) if x knows y (facebook, address book, email, ...)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


 Skewed distribution: the probability that a node has x links is 1/x^α, with α ≈ 2.1

The In-degree distribution
Altavista crawl (1999) and WebBase crawl (2001): the indegree follows a power-law distribution

  Pr[ in-degree(u) = k ]  ∝  1 / k^α,    α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:
 Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1
 Locality: usually most of the hyperlinks point to other URLs on the same host (about 80%).
 Similarity: pages that are close in lexicographic order tend to have many outgoing links in common.

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:
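A small sketch of the successor-gap transform S(x) above; the successor list and node id are illustrative, and the mapping of the (possibly negative) first gap to a non-negative code is just one plausible choice (zig-zag), not necessarily the library’s actual rule:

def gaps(x, successors):
    s = sorted(successors)
    return [s[0] - x] + [s[i] - s[i - 1] - 1 for i in range(1, len(s))]

def zigzag(v):                      # map 0, -1, 1, -2, 2, ...  to  0, 1, 2, 3, 4, ...
    return 2 * v if v >= 0 else -2 * v - 1

# gaps(15, [13, 15, 16, 17, 18, 19, 23, 24, 203]) -> [-2, 1, 0, 0, 0, 0, 3, 0, 178]
# the first entry can then be passed through zigzag() before integer coding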

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides an efficient, optimal solution
  fknown is the “previously encoded text”: compress the concatenation fknown·fnew, starting from fnew

zdelta is one of the best implementations
            Emacs size    Emacs time
uncompr     27Mb          ---
gzip        8Mb           35 secs
zdelta      1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

(figure: an example graph GF over files 1, 2, 3, 5 plus the dummy node 0, with zdelta/gzip sizes
as edge weights, e.g. 20, 123, 220, 620, 2000; the min branching selects the cheapest set of references)

            space    time
uncompr     30Mb     ---
tgz         20%      linear
THIS        8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
            space     time
uncompr     260Mb     ---
tgz         12%       2 mins
THIS        8%        16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
The Client sends the block hashes of f_old; the Server matches them against f_new and
returns an encoded file (block references plus literals), which the Client applies to f_old.

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
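A toy sketch of the block-matching idea (not the real rsync protocol: block size and the use of a single weak hash are simplifying assumptions — rsync also verifies candidates with a strong hash):

import zlib

def client_hashes(f_old, B):
    # one weak hash per fixed-size block of f_old (bytes)
    return {zlib.adler32(f_old[i:i + B]): i // B
            for i in range(0, len(f_old), B)}

def server_encode(f_new, hashes, B):
    out, i = [], 0
    while i < len(f_new):
        h = zlib.adler32(f_new[i:i + B])
        if len(f_new) - i >= B and h in hashes:
            out.append(('copy', hashes[h]))      # reference to a block of f_old
            i += B
        else:
            out.append(('lit', f_new[i:i + 1]))  # literal byte, slide by one position
            i += 1
    return out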

Rsync: some experiments

           gcc size    emacs size
total      27288       27326
gzip        7563        8577
zdelta       227        1431
rsync        964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol
Split the files into k blocks of n/k elements each, and recurse over log(n/k) levels.
If the distance is k, then on each level at most k hashes fail to find a match in the other file.
The communication complexity is O(k log n log(n/k)) bits.

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
(figure: the suffix tree of T# = mississippi#, with edge labels #, i, s, p, si, ssi, i#, pi#, ppi#, mississippi#, …
and leaves storing the 12 starting positions of the corresponding suffixes)

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
SA      SUF(T)
12      #
11      i#
 8      ippi#
 5      issippi#
 2      ississippi#
 1      mississippi#
10      pi#
 9      ppi#
 7      sippi#
 4      sissippi#
 6      ssippi#
 3      ssissippi#

T = mississippi#      (SA stores suffix pointers; materializing SUF(T) would take Θ(N²) space)

Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 text accesses per step.
(Example: T = mississippi#, SA = [12,11,8,5,2,1,10,9,7,4,6,3], P = si.
 Compare P against the suffix starting at the middle SA entry:
 if P is larger, recurse on the right half; if P is smaller, recurse on the left half.)

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
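A minimal sketch of this indirect binary search (naive O(p)-time suffix comparisons, 1-based positions as in the example):

def sa_range(T, SA, P):
    suf = lambda i: T[i - 1:]               # the suffix starting at (1-based) position i
    lo, hi = 0, len(SA)
    while lo < hi:                          # leftmost suffix >= P
        mid = (lo + hi) // 2
        if suf(SA[mid]) < P: lo = mid + 1
        else: hi = mid
    left = lo
    lo, hi = left, len(SA)
    while lo < hi:                          # leftmost suffix not starting with P
        mid = (lo + hi) // 2
        if suf(SA[mid]).startswith(P): lo = mid + 1
        else: hi = mid
    return SA[left:lo]                      # the (contiguous) occurrences of P

# sa_range("mississippi#", [12,11,8,5,2,1,10,9,7,4,6,3], "si") -> [7, 4]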

Locating the occurrences
(Example: P = si in T = mississippi#. The binary searches delimit the SA range whose suffixes lie
between si# and si$: the two entries SA[9]=7 (sippi#) and SA[10]=4 (sissippi#), hence occ = 2,
at positions 4 and 7.)

Suffix Array search
• O(p + log2 N + occ) time

Suffix Trays: O(p + log2 |S| + occ)     [Cole et al., ’06]
String B-tree                           [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays            [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp = [0, 0, 1, 4, 0, 0, 1, 0, 2, 1, 3]
SA  = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]        T = mississippi#
(e.g. the suffixes issippi# and ississippi#, adjacent in SA, share a prefix of length 4)

• How long is the common prefix between T[i,...] and T[j,...] ?
  • Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  • Search for an entry Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  • Search for a window Lcp[i,i+C-2] whose entries are all ≥ L


Slide 140

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
        4K    8K    16K    32K    128K   256K   512K   1M
n³      22s   3m    26m    3.5h   28h    --     --     --
n²      0     0     0      1s     26s    106s   7m     28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT
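A runnable version of the linear scan above (the helper name is hypothetical; it also reports the window achieving the best sum):

def max_subarray(A):
    best, best_range = float('-inf'), (0, 0)
    acc, start = 0, 0
    for i, x in enumerate(A):
        if acc + x <= 0:               # the running sum can never help: restart after i
            acc, start = 0, i + 1
        else:
            acc += x
            if acc > best:
                best, best_range = acc, (start, i)
    return best, best_range

# max_subarray([2,-5,6,1,-2,4,3,-13,9,-6,7]) -> (12, (2, 6)), i.e. 6+1-2+4+3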

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01  if (i < j) then
02      m = (i+j)/2;             (Divide)
03      Merge-Sort(A,i,m);       (Conquer)
04      Merge-Sort(A,m+1,j);
05      Merge(A,i,m,j)           (Combine)

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

(figure: the Merge-Sort recursion tree, of depth log2 N, merging sorted runs level by level)

If the run-size is larger than B (i.e. after the first step!!),
fetching all of it in memory for merging does not help.

How do we deploy the disk/memory features?
With internal memory M: N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


 Pass 1: Produce (N/M) sorted runs.
 Pass i: merge X ≤ M/B runs  ⇒  logM/B N/M passes
(figure: X = M/B input buffers Bf1..BfX, one per run, each holding the current page of B items;
 repeatedly emit min(Bf1[p1], Bf2[p2], …, BfX[pX]) into the output buffer Bfo;
 fetch the next page of run i when pi = B, flush Bfo when full, until EOF of the merged run)
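A minimal sketch of the X-way merging step, with a heap playing the role of the min-selection over the input buffers (disk buffering omitted):

import heapq

def multiway_merge(runs):
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        val, i, j = heapq.heappop(heap)         # min over the heads of the X runs
        out.append(val)
        if j + 1 < len(runs[i]):                # advance the pointer in run i
            heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
    return out

# multiway_merge([[1,2,5,7,9,10], [2,7,8,13,19], [3,4,11,12,15,17]]) merges X = 3 sorted runs in one pass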

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) { X=s; C=1; } }

Return X;

Proof
(the guarantee holds only if some item y occurs > N/2 times)
If the algorithm returned X ≠ y, then every one of y’s occurrences would have a distinct
“negative” mate, hence the number of mates would be ≥ #occ(y).
As a result, N ≥ 2 * #occ(y) > N: a contradiction.
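A runnable version of the single-pass scan above (this is the majority-vote technique; the answer is meaningful only if some item really occurs > N/2 times):

def majority_candidate(stream):
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1
        elif X == s:
            C += 1
        else:
            C -= 1
    return X

# majority_candidate("bacccdcbaaaccbccc") -> 'c'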

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million documents, t = 500K terms

            Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony             1                1             0          0        0        1
Brutus             1                1             0          1        0        0
Caesar             1                1             0          1        1        1
Calpurnia          0                1             0          0        0        0
Cleopatra          1                0             0          0        0        0
mercy              1                0             1          1        1        1
worser             1                0             1          1        1        0

(1 if the play contains the word, 0 otherwise)        Space is 500Gb !

Solution 2: Inverted index
Brutus    → 2  4  8  16  32  64  128
Calpurnia → 1  2  3  5  8  13  21  34
Caesar    → 13 16

We can still do better, i.e. 30−50% of the original text:
1. Typically about 12 bytes are used per posting
2. We have 10^9 total terms  ⇒  at least 12Gb space
3. Compressing the 6Gb of documents gets 1.5Gb of data
Better index, but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

∑i=1,…,n-1 2^i  =  2^n − 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i(s) = log2 (1/p(s)) = −log2 p(s)

Lower probability  ⇒  higher information

Entropy is the weighted average of i(s):

H(S) = ∑s∈S p(s) · log2 (1/p(s))    bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La(C) = ∑s∈S p(s) · L[s]

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  ⇒  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H(S) ≤ La(C)
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La(C) ≤ H(S) + 1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
(Huffman construction: merge a(.1)+b(.2) → (.3); merge (.3)+c(.2) → (.5); merge (.5)+d(.5) → (1))

a=000, b=001, c=01, d=1
There are 2^(n-1) “equivalent” Huffman trees

What about ties (and thus, tree depth) ?
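A minimal sketch of the Huffman construction on the running example; ties are broken arbitrarily, which is exactly why several equivalent trees (with the same codeword lengths) exist:

import heapq
from itertools import count

def huffman_codes(probs):
    tiebreak = count()                                  # avoids comparing subtrees on ties
    heap = [(p, next(tiebreak), sym) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)               # two least probable subtrees
        p2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(tiebreak), (left, right)))
    codes = {}
    def walk(node, code):
        if isinstance(node, tuple):
            walk(node[0], code + "0")
            walk(node[1], code + "1")
        else:
            codes[node] = code
    walk(heap[0][2], "")
    return codes

# huffman_codes({'a': .1, 'b': .2, 'c': .2, 'd': .5}) gives an optimal code with the same
# lengths as a=000, b=001, c=01, d=1 (the exact bits depend on tie-breaking)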

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
Encoding:  abc...    →  000 001 01 ...  =  00000101...
Decoding:  101001... →  d c b
(using the tree above: a=000, b=001, c=01, d=1)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
−log(.999) ≈ .00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H(s) = ∑i=1,…,m 2^(m−i) · s[i]

P = 0101
H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5
s = s’ if and only if H(s) = H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H(Tr) = 2·H(Tr−1) − 2^m · T(r−1) + T(r+m−1)

T = 10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H(T1) = H(1011) = 11
H(T2) = H(0110) = 2·11 − 2⁴·1 + 0 = 22 − 16 + 0 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
  1·2 + 0 (mod 7) = 2
  2·2 + 1 (mod 7) = 5
  5·2 + 1 (mod 7) = 4
  4·2 + 1 (mod 7) = 2
  2·2 + 1 (mod 7) = 5  =  Hq(P)

We can still compute Hq(Tr) from Hq(Tr−1), since
  2^m (mod q) = 2·(2^(m−1) (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
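A minimal sketch of the fingerprint algorithm over the binary strings used above (here the modulus q is an illustrative fixed prime rather than a randomly chosen one, and candidate matches are verified, as in the deterministic variant):

def karp_rabin(T, P, q=2**31 - 1, base=2):
    n, m = len(T), len(P)
    if m > n: return []
    hP = hT = 0
    for i in range(m):
        hP = (hP * base + int(P[i])) % q
        hT = (hT * base + int(T[i])) % q
    top = pow(base, m - 1, q)                  # base^(m-1) mod q, used to roll the window
    occ = []
    for r in range(n - m + 1):
        if hP == hT and T[r:r + m] == P:       # verify to rule out false matches
            occ.append(r + 1)                  # 1-based positions, as in the slides
        if r + m < n:                          # roll: drop T[r], append T[r+m]
            hT = ((hT - int(T[r]) * top) * base + int(T[r + m])) % q
    return occ

# karp_rabin("10110101", "0101") -> [5]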

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

(figure: the m × n matrix M for T = california, P = for; its only 1-entries are
 M(1,5), M(2,6), M(3,7), since the prefixes f, fo, for of P match the text ending
 at positions 5, 6, 7 — i.e. an occurrence of P ends at position 7)

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M(j) = BitShift( M(j−1) )  &  U( T[j] )


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with i-th bit of U(T[j]) to estabilish if both are true

An example j=1
(T = xabxabaaca, P = abaac, U(x) = (0,0,0,0,0).
 Column 1: M(1) = BitShift(M(0)) & U(x) = (1,0,0,0,0) & (0,0,0,0,0) = (0,0,0,0,0).)

An example j=2
(T[2] = a: M(2) = BitShift(M(1)) & U(a) = (1,0,0,0,0) & (1,0,1,1,0) = (1,0,0,0,0).)

An example j=3
(T[3] = b: M(3) = BitShift(M(2)) & U(b) = (1,1,0,0,0) & (0,1,0,0,0) = (0,1,0,0,0).)

An example j=9
(T[9] = c: M(9) = BitShift(M(8)) & U(c) = (1,1,0,0,1) & (0,0,0,0,1) = (0,0,0,0,1).
 The last bit is set, so an occurrence of P = abaac ends at position 9 of T.)

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

BitShift( M^l(j−1) )  &  U( T[j] )

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

BitShift( M^(l−1)(j−1) )

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1
(matrices M0 and M1 for T = xabxabaaca, P = abaad: M0 has no 1 in its last row, i.e. no exact
 occurrence of P; M1 has a 1 in row 5, column 9, i.e. P occurs with at most 1 mismatch ending
 at position 9 — abaad vs. T[5..9] = abaac.)
How much do we pay?





The running time is O(k·n·(1+m/w))
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3
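A standard dynamic-programming sketch of the edit distance just defined (unit cost for each of the three operations):

def edit_distance(p, s):
    m, n = len(p), len(s)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1): D[i][0] = i            # delete all of p[0..i)
    for j in range(n + 1): D[0][j] = j            # insert all of s[0..j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(D[i - 1][j] + 1,                            # deletion
                          D[i][j - 1] + 1,                            # insertion
                          D[i - 1][j - 1] + (p[i - 1] != s[j - 1]))   # substitution / match
    return D[m][n]

# edit_distance("ananas", "banane") -> 3, as in the example above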

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding
γ(x), for x > 0:  (Length − 1) zeros, followed by x in binary, where Length = ⌊log2 x⌋ + 1
e.g., 9 is represented as <000, 1001>.

The γ-code for x takes 2⌊log2 x⌋ + 1 bits    (i.e. a factor of 2 from optimal)

Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111   ⇒   8, 6, 3, 59, 7
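A minimal sketch matching the γ-code definition above:

def gamma_encode(x):                       # x > 0
    b = bin(x)[2:]                         # x in binary, Length = len(b)
    return "0" * (len(b) - 1) + b

def gamma_decode_stream(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":              # count the leading zeros = Length - 1
            z += 1; i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out

# gamma_encode(9) -> "0001001"
# gamma_decode_stream("0001000001100110000011101100111") -> [8, 6, 3, 59, 7]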

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).

Recall that: |γ(i)| ≤ 2·log i + 1
How good is this approach w.r.t. Huffman?
Compression ratio ≤ 2·H0(s) + 1
Key fact:
  1 ≥ ∑i=1,…,x pi ≥ x·px   ⇒   x ≤ 1/px

How good is it ?
Encode the integers via γ-coding:  |γ(i)| ≤ 2·log i + 1
The cost of the encoding is (recall that i ≤ 1/pi):

  ∑i=1,…,|S| pi · |γ(i)|  ≤  ∑i=1,…,|S| pi · [ 2·log(1/pi) + 1 ]  =  2·H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n² log n), MTF = O(n log n) + n²

Not much worse than Huffman
...but it may be far better
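A minimal MTF sketch (positions start from 0; the starting list is a parameter of the example):

def mtf_encode(text, alphabet):
    L = list(alphabet)
    out = []
    for c in text:
        i = L.index(c)
        out.append(i)
        L.pop(i); L.insert(0, c)           # move the symbol to the front of the list
    return out

# mtf_encode("ipppssssssmmmii#pppiiissssssiiiiii", "#imps") produces the small,
# locally repetitive integer stream that bzip then RLE- and entropy-codes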

MTF: how good is it ?
Encode the integers via γ-coding:  |γ(i)| ≤ 2·log i + 1
Put S at the front and consider the cost of encoding:

  O(|S| log |S|)  +  ∑x=1,…,|S| ∑i=2,…,nx γ( p_i^x − p_(i-1)^x )

By Jensen’s inequality:

  ≤ O(|S| log |S|)  +  ∑x=1,…,|S| nx · [ 2·log(N/nx) + 1 ]
  =  O(|S| log |S|)  +  N · [ 2·H0(X) + 1 ]

Hence   La[mtf]  ≤  2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
There is a memory
X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)
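A minimal RLE sketch matching the example above:

def rle(seq):
    out, prev, run = [], None, 0
    for x in seq:
        if x == prev:
            run += 1
        else:
            if prev is not None: out.append((prev, run))
            prev, run = x, 1
    if prev is not None: out.append((prev, run))
    return out

# rle("abbbaacccca") -> [('a',1), ('b',3), ('a',2), ('c',4), ('a',1)]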

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.
1.0

c = .3

0.7

i -1

f (i ) 



p( j)

j 1

b = .5
0.2
0.0

f(a) = .0, f(b) = .2, f(c) = .7
a = .2

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
1.0

0.7
c = .3

0.7

c = .3
0.55

0.0

a = .2

0.3
0.2

c = .3

0.27
b = .5

b = .5
0.2

0.3

a = .2

b = .5
0.22

a = .2

0.2

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l 0  0 l i  l i -1  s i -1 * f c i 

s0  1

s i  s i -1 * p c i 

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

n

sn 

 p c 
i

i 1

The interval for a message sequence will be called the
sequence interval

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
1.0

0.7
c = .3

0.7
0.49

b = .5

c = .3

0.55
0.49

0.2

0.0

0.55

b = .5

0.3
a = .2

The message is bbc.

0.2

0.49
0.475

c = .3

b = .5
0.35

a = .2

0.3

a = .2

Representing a real number
Binary fractional representation:
. 75

 . 11

1/ 3

 . 01 01

11 / 16

 . 1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
m in

m ax

interval

.11

.110

.111

[.75 ,1.0 )

.101

.1010

.1011

[.625 , .75 )

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: what do we do if the current context has never been seen followed by the next character?
 We cannot code 0 probabilities!

The key idea of PPM is to reduce the context size if the previous match has not been seen:
 If the character has not been seen before with the current context of size 3, send an escape-msg and then try the context of size 2, then again an escape-msg and the context of size 1, ...
 Keep statistics for each context size < k
The escape is a special character with some probability.
 Different variants of PPM use different heuristics for this probability.
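A rough Python sketch of the escape mechanism (PPMC-like: the escape mass equals the number of distinct symbols seen in the context; no exclusion mechanism, so this only approximates a real PPM variant):

    def ppm_probability(stats, context, symbol, alphabet):
        # stats: dict context-string -> dict symbol -> count
        p = 1.0
        for k in range(len(context), -1, -1):            # longest context first
            counts = stats.get(context[len(context) - k:], {})
            if not counts:
                continue
            total = sum(counts.values()) + len(counts)   # +len(counts) = escape mass
            if symbol in counts:
                return p * counts[symbol] / total
            p *= len(counts) / total                     # pay for the escape message
        return p / len(alphabet)                         # order -1: uniform model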

PPM + Arithmetic ToolBox
The PPM model feeds the ATB with p[ s | context ], where s is either a character c or the escape symbol esc; the ATB then maps (L, s) to (L', s') as before.

Encoder and Decoder must know the protocol for selecting the same conditional probability distribution (PPM-variant).

PPM: Example Contexts
String = ACCBACCACBA B        k = 2

Context    Counts
(empty)    A = 4   B = 2   C = 5   $ = 3

Context    Counts
A          C = 3   $ = 1
B          A = 2   $ = 1
C          A = 1   B = 2   C = 2   $ = 3

Context    Counts
AC         B = 1   C = 2   $ = 2
BA         C = 1   $ = 1
CA         C = 1   $ = 1
CB         A = 2   $ = 1
CC         A = 1   B = 1   $ = 2
You find this at: compression.ru/ds/
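A small Python sketch that rebuilds these count tables from the already-seen text (the $ escape counts, equal to the number of distinct symbols per context, are not stored explicitly here):

    from collections import Counter, defaultdict

    def context_counts(text, k):
        stats = defaultdict(Counter)
        for order in range(k + 1):
            for i in range(order, len(text)):
                stats[text[i - order:i]][text[i]] += 1   # context -> next symbol
        return stats

    # context_counts("ACCBACCACBA", 2)[""]   -> Counter({'C': 5, 'A': 4, 'B': 2})
    # context_counts("ACCBACCACBA", 2)["AC"] -> Counter({'C': 2, 'B': 1})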

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) as n → ∞ !!

LZ77
Example text: a a c a a c a b c a b a b a c
The dictionary contains all substrings starting before the cursor; the step shown outputs <2,3,c>.

Algorithm's step:
 Output ⟨d, len, c⟩ where
   d = distance of the copied string wrt the current position
   len = length of the longest match
   c = next char in the text beyond the longest match
 Advance by len + 1

A buffer "window" has fixed length and moves over the text.

Example: LZ77 with window
Text: a a c a a c a b c a b a a a c        (window size = 6)
The successive steps output, in order:
(0,0,a)   (1,1,c)   (3,4,b)   (3,3,a)   (1,2,c)
where each triple is (distance of the longest match within W, its length, the next character).

LZ77 Decoding
The decoder keeps the same dictionary window as the encoder:
 it finds the substring and inserts a copy of it.
What if len > d? (the copy overlaps the text still to be written)
 E.g. seen = abcd, next codeword is (2,9,e)
 Simply copy starting at the cursor:
   for (i = 0; i < len; i++)
     out[cursor+i] = out[cursor-d+i]
 Output is correct: abcdcdcdcdcdce
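The same idea as a runnable Python sketch (it assumes the plain triple format ⟨d, len, c⟩, not gzip's LZSS variant):

    def lz77_decode(triples):
        # each triple (d, length, c): copy `length` chars starting d back, then append c
        out = []
        for d, length, c in triples:
            start = len(out) - d
            for i in range(length):          # works even when length > d (overlap):
                out.append(out[start + i])   # chars written earlier in this copy are reused
            out.append(c)
        return "".join(out)

    # lz77_decode([(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')]) -> "aacaacabcabaaac"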

LZ77 Optimizations used by gzip
LZSS: output one of the two formats
  (0, position, length)   or   (1, char)
Typically the second format is used if length < 3.
Special greedy parsing: possibly use a shorter match so that the next match is better.
A hash table speeds up the search for matching triplets.
Triples are coded with Huffman's code.

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Input: a a b a a c a b c a b c b
Output (id, char) and dictionary additions, step by step:
(0,a)   1 = a
(1,b)   2 = ab
(1,a)   3 = aa
(0,c)   4 = c
(2,c)   5 = abc
(5,b)   6 = abcb

LZ78: Decoding Example
Input codewords, the text reconstructed so far, and the dictionary:
(0,a)   a                           1 = a
(1,b)   a a b                       2 = ab
(1,a)   a a b a a                   3 = aa
(0,c)   a a b a a c                 4 = c
(2,c)   a a b a a c a b c           5 = abc
(5,b)   a a b a a c a b c a b c b   6 = abcb
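A compact Python sketch of both directions (the flush of a trailing exact match is handled by emitting an empty extra char, a simplification):

    def lz78_encode(text):
        dictionary, out, cur = {"": 0}, [], ""
        for c in text:
            if cur + c in dictionary:
                cur += c
            else:
                out.append((dictionary[cur], c))     # (id of longest match, next char)
                dictionary[cur + c] = len(dictionary)
                cur = ""
        if cur:
            out.append((dictionary[cur], ""))        # leftover exact match at end of input
        return out

    def lz78_decode(codewords):
        dictionary, out = {0: ""}, []
        for i, (idx, c) in enumerate(codewords, start=1):
            s = dictionary[idx] + c
            out.append(s)
            dictionary[i] = s
        return "".join(out)

    # lz78_encode("aabaacabcabcb") -> [(0,'a'), (1,'b'), (1,'a'), (0,'c'), (2,'c'), (5,'b')]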

LZW (Lempel-Ziv-Welch)
Don't send the extra character c, but still add Sc to the dictionary.
Dictionary:
 initialized with 256 ascii entries (in the example below, a = 112)
The decoder is one step behind the coder, since it does not know c.
 There is an issue for strings of the form SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Input: a a b a a c a b a b a c b        (a = 112, b = 113, c = 114)
Output codes and dictionary additions:
112   256 = aa
112   257 = ab
113   258 = ba
256   259 = aac
114   260 = ca
257   261 = aba
261   262 = abac
114   263 = cb

LZW: Decoding Example
Input codes: 112  112  113  256  114  257  261  114
112 → a
112 → a        add 256 = aa
113 → b        add 257 = ab
256 → aa       add 258 = ba
114 → c        add 259 = aac
257 → ab       add 260 = ca
261 → ?        261 is not yet in the dictionary (the decoder is one step behind the coder):
               it must be the previous string plus its own first char, i.e. ab + a = aba
               output aba, add 261 = aba
114 → c        add 262 = abac  (one step later)
Decoded text so far: a a b a a c a b a b a c
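A Python sketch of the decoder, including the "one step behind" special case (it assumes the dictionary is initialised with the 256 single-byte symbols with ids 0..255, rather than the slide's illustrative a = 112):

    def lzw_decode(codes):
        dictionary = {i: chr(i) for i in range(256)}
        next_id = 256
        prev = dictionary[codes[0]]
        out = [prev]
        for code in codes[1:]:
            if code in dictionary:
                cur = dictionary[code]
            else:                           # code not known yet: it must be prev + prev[0]
                cur = prev + prev[0]
            out.append(cur)
            dictionary[next_id] = prev + cur[0]
            next_id += 1
            prev = cur
        return "".join(out)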

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a certain size (used in GIF)
 Throw the dictionary away when it is no longer effective at compressing (used in Unix compress)
 Throw the least-recently-used (LRU) entry away when the dictionary reaches a certain size (used in BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform (1994)
Consider the text T = mississippi# and all of its cyclic rotations:

mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows lexicographically:

#mississippi
i#mississipp
ippi#mississ
issippi#miss
ississippi#m
mississippi#
pi#mississip
ppi#mississi
sippi#missis
sissippi#mis
ssippi#missi
ssissippi#mi

F (the first column) = # i i i i m p p s s s s
L (the last column)  = i p s s m # p i s s i i   — this L is the output of the BWT of T.

A famous example

Much
longer...

A useful tool: the L → F mapping
How do we map L's chars onto F's chars? We need to distinguish equal chars in F...
Take two equal chars of L: rotating their rows rightward by one position shows that they keep the same relative order in F!

The BWT is invertible
Two key properties:
1. The LF-array maps L's chars to F's chars
2. L[i] precedes F[i] in T

Reconstruct T backward:  T = .... i p p i #

InvertBWT(L)
  Compute LF[0,n-1];
  r = 0; i = n;
  while (i > 0) {
    T[i] = L[r];
    r = LF[r]; i--;
  }
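A Python sketch of the inversion (it assumes T ends with a unique, smallest character '#'):

    def invert_bwt(L):
        n = len(L)
        # LF[i] = position in F of the char L[i]; a stable sort keeps equal chars in order
        order = sorted(range(n), key=lambda i: (L[i], i))
        LF = [0] * n
        for f_pos, l_pos in enumerate(order):
            LF[l_pos] = f_pos
        out, r = [], 0                      # row 0 is the rotation starting with '#'
        for _ in range(n):
            out.append(L[r])
            r = LF[r]
        out.reverse()                       # out is now '#' followed by the rest of the text
        return "".join(out[1:]) + out[0]

    # invert_bwt("ipssm#pissii") -> "mississippi#"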

How to compute the BWT ?
The sorted rows of the BWT matrix correspond to the suffix array SA of T.
For T = mississippi#:  SA = 12 11 8 5 2 1 10 9 7 4 6 3, and the i-th row of the matrix is the rotation starting at position SA[i].
We said that L[i] precedes F[i] in T; e.g. L[3] = T[7].
Given SA and T, we have L[i] = T[SA[i] - 1].

How to construct SA from T ?
SA lists the starting positions of the suffixes of T = mississippi# in lexicographic order:
12  #
11  i#
8   ippi#
5   issippi#
2   ississippi#
1   mississippi#
10  pi#
9   ppi#
7   sippi#
4   sissippi#
6   ssippi#
3   ssissippi#

Elegant but inefficient. Obvious inefficiencies of sorting the suffixes explicitly:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...
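The "elegant but inefficient" construction as a Python sketch, together with the L[i] = T[SA[i]-1] rule (0-based indices here):

    def suffix_array(T):
        # explicit suffix sort: simple, but Theta(n^2 log n) in the worst case
        return sorted(range(len(T)), key=lambda i: T[i:])

    def bwt(T):
        # L[i] = T[SA[i]-1]; Python's index -1 wraps around, giving the cyclic predecessor
        return "".join(T[i - 1] for i in suffix_array(T))

    # suffix_array("mississippi#") -> [11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]
    # bwt("mississippi#")          -> "ipssm#pissii"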

Compressing L seems promising...
Key observation:
 L is locally homogeneous, hence highly compressible

Algorithm Bzip:
 Move-to-Front coding of L
 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33% compression ratio, but Bzip is slower in (de)compression!

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii      (# at position 16)
MTF-list = [i,m,p,s]

Mtf  = 020030000030030200300300000100000
Mtf' = 030040000040040300400400000200000    (values shifted by 1; zero-runs coded à la Wheeler, e.g. Bin(6) = 110)
RLE0 = 03141041403141410210                 (alphabet of size |S|+1)

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original MTF-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web's Characteristics
Size
 1 trillion pages are available (Google, 7/08)
 5-40K per page => hundreds of terabytes
 Size grows every day!!
Change
 8% new pages, 25% new links change weekly
 Life time of a page is about 10 days

The Bow Tie
Some definitions:
 Weakly connected component (WCC): a set of nodes such that from any node you can reach any other node via an undirected path.
 Strongly connected component (SCC): a set of nodes such that from any node you can reach any other node via a directed path.

Observing the Web Graph
 We do not know which percentage of it we know
 The only way to discover the graph structure of the web as hypertext is via large-scale crawls
 Warning: the picture might be distorted by
   Size limitation of the crawl
   Crawling rules
   Perturbations of the "natural" process of birth and death of nodes and links

Why is it interesting?
 It is the largest artifact ever conceived by humans
 Exploit the structure of the Web for
   Crawl strategies
   Search
   Spam detection
   Discovering communities on the web
   Classification/organization
 Predict the evolution of the Web
   Sociological understanding

Many other large graphs...
 Physical network graph: V = routers, E = communication links
 The "cosine" graph (undirected, weighted): V = static web pages, E = semantic distance between pages
 Query-Log graph (bipartite, weighted): V = queries and URLs, E = (q,u) if u is a result for q that has been clicked by some user who issued q
 Social graph (undirected, unweighted): V = users, E = (x,y) if x knows y (facebook, address book, email, ...)

Definition
Directed graph G = (V,E):
 V = URLs, E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN and no OUT links)
Key property:
 Skewed distribution: the probability that a node has x links is 1/x^a, with a ≈ 2.1

The In-degree distribution
Altavista crawl (1999) and WebBase crawl (2001): the indegree follows a power-law distribution
Pr[ in-degree(u) = k ] ∝ 1/k^a,   a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E):
 V = URLs, E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN, no OUT)
Three key properties:
 Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1
 Locality: usually most of the hyperlinks point to other URLs on the same host (about 80%).
 Similarity: pages close in lexicographic order tend to share many outgoing lists.

A Picture of the Web Graph
(21 million pages, 150 million links)

URL-sorting (e.g. Berkeley, Stanford): URL compression + Delta encoding

The library WebGraph
From the uncompressed adjacency list to an adjacency list with compressed gaps (exploiting locality):
Successor list S(x) = {s1 - x, s2 - s1 - 1, ..., sk - s(k-1) - 1}

For negative entries (only the first gap s1 - x can be negative): map a value v ≥ 0 to 2v and v < 0 to 2|v| - 1, so that every stored value is non-negative.
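A tiny Python sketch of this gap encoding (the node id and successor values in the comment are purely illustrative):

    def gaps(x, successors):
        s = sorted(successors)
        g = [s[0] - x] + [s[i] - s[i - 1] - 1 for i in range(1, len(s))]
        return [2 * v if v >= 0 else 2 * abs(v) - 1 for v in g]   # make every value non-negative

    # gaps(15, [13, 15, 16, 17, 18, 19, 23, 24]) -> [3, 2, 0, 0, 0, 0, 6, 0]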

Copy-lists
Reference chains, possibly limited in length.
From the uncompressed adjacency list to an adjacency list with copy lists (exploiting similarity):
 Each bit of y's copy list tells whether the corresponding successor of the reference list x is also a successor of y;
 The reference index is chosen in [0,W] so as to give the best compression.

Copy-blocks = RLE(Copy-list)
From the adjacency list with copy lists to an adjacency list with copy blocks (RLE on the bit sequences):
 The first copy block is 0 if the copy list starts with 0;
 The last block is omitted (we know the total length...);
 The length is decremented by one for all blocks.

This is a Java and C++ lib (≈3 bits/edge)

Extra-nodes: Compressing Intervals
The remaining (extra) nodes exploit consecutivity:
 Intervals: coded with their left extreme and their length
 Interval length: decremented by Lmin = 2
 Residuals: differences between consecutive residuals, or wrt the source

Examples:
0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
The sender holds the data; the receiver may already have some knowledge about that data.
 network links are getting faster and faster, but
 many clients are still connected by fairly slow links (mobile?)
 people wish to send more and more data
How can we make this transparent to the user?

Two standard techniques
 caching: "avoid sending the same object again"
   done on the basis of whole objects
   only works if objects are completely unchanged
   What about objects that are slightly changed?
 compression: "remove redundancy in transmitted data" (at the price of some overhead)
   avoid repeated substrings in the data
   can be extended to the history of past transmissions
   What if the sender has never seen the data at the receiver?

Types of Techniques
 Common knowledge between sender & receiver
   Unstructured file: delta compression
 "Partial" knowledge
   Unstructured files: file synchronization
   Record-based data: set reconciliation

Formalization
 Delta compression   [diff, zdelta, REBL, ...]
   Compress file f deploying file f'
   Compress a group of files
   Speed up web access by sending the difference between the requested page and the ones available in cache
 File synchronization   [rsync, zsync]
   The client updates its old file f_old with the f_new available on a server
   Mirroring, shared crawling, Content Distribution Networks
 Set reconciliation
   The client updates its structured old file f_old with the f_new available on a server
   Update of contacts or appointments, intersecting inverted lists in a P2P search engine

Z-delta compression   (one-to-one)
Problem: we have two files f_known and f_new, and the goal is to compute a file f_d of minimum size such that f_new can be derived from f_known and f_d.
 Assume that block moves and copies are allowed
 Find an optimal covering set of f_new based on f_known
 The LZ77-scheme provides an efficient, optimal solution:
   f_known plays the role of the "previously encoded text"; compress the concatenation f_known·f_new, starting from f_new
 zdelta is one of the best implementations
           Emacs size    Emacs time
uncompr    27Mb          ---
gzip       8Mb           35 secs
zdelta     1.5Mb         42 secs
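Not zdelta itself, but the same LZ77-on-two-files idea can be sketched with Python's zlib by passing f_known as a preset dictionary (only the last 32KB of it are actually used, so this is just an illustration):

    import zlib

    def delta_compress(f_known: bytes, f_new: bytes) -> bytes:
        comp = zlib.compressobj(level=9, zdict=f_known)   # copies may refer into f_known
        return comp.compress(f_new) + comp.flush()

    def delta_decompress(f_known: bytes, delta: bytes) -> bytes:
        decomp = zlib.decompressobj(zdict=f_known)
        return decomp.decompress(delta) + decomp.flush()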

Efficient Web Access
Dual proxy architecture: a pair of proxies located on each side of the slow link use a proprietary protocol to increase performance over that link:
Client ↔ client-side proxy ↔ (slow link, delta-encoding) ↔ server-side proxy ↔ (fast link) ↔ web server

Use zdelta to reduce traffic:
 The old version of a page (the reference) is available at both proxies
 Restricted to pages already visited (30% hits), URL-prefix match
 Small cache

Cluster-based delta compression
Problem: we wish to compress a group of files F
 Useful on a dynamic collection of web pages, back-ups, ...
 Apply pairwise zdelta: find for each f ∈ F a good reference
 Reduction to the Min Branching problem on DAGs:
   Build a weighted graph G_F: nodes = files, edge weights = zdelta-sizes
   Insert a dummy node connected to all files, whose edge weights are the gzip-coded sizes
   Compute the min branching = directed spanning tree of minimum total cost covering G's nodes

           space    time
uncompr    30Mb     ---
tgz        20%      linear
THIS       8%       quadratic

Improvement   (what about many-to-one compression of a group of files?)
Problem: constructing G is very costly: n² edge calculations (zdelta executions)
 We wish to exploit some pruning approach:
   Collection analysis: cluster the files that appear similar and are thus good candidates for zdelta-compression; build a sparse weighted graph G'_F containing only the edges between those pairs of files
   Assign weights: estimate appropriate edge weights for G'_F, thus saving zdelta executions. Nonetheless, still quadratic time

           space    time
uncompr    260Mb    ---
tgz        12%      2 mins
THIS       8%       16 mins

Algoritmi per IR

File Synchronization

File synch: the problem
The client sends a request, the server answers with an update, and the client ends up holding f_new.
 the client wants to update an out-dated file f_old
 the server has the new file f_new but does not know the old file
 update without sending the entire f_new (exploiting the similarity between the two)
 rsync: file synch tool, distributed with Linux

Delta compression is a sort of "local" synch, since the server has both copies of the file.

The rsync algorithm
The client sends the hashes of the blocks of its f_old; the server replies with the encoded file, built from f_new and those hashes.

The rsync algorithm (contd)
 simple, widely used, single roundtrip
 optimizations: 4-byte rolling hash + 2-byte MD5, gzip for the literals
 the choice of the block size is problematic (default: max{700, √n} bytes)
 not good in theory: the granularity of the changes may disrupt the use of blocks
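A much-simplified sketch of the two sides in Python (the weak hash here is a plain byte sum instead of rsync's 4-byte rolling checksum, literals are not gzipped, and the helper names are made up):

    import hashlib

    def block_signatures(f_old: bytes, B: int):
        # client side: one (weak, strong) signature per B-byte block of f_old
        sig = {}
        for off in range(0, len(f_old), B):
            block = f_old[off:off + B]
            sig.setdefault(sum(block) & 0xFFFF, []).append((hashlib.md5(block).digest(), off))
        return sig

    def rsync_encode(f_new: bytes, sig, B: int):
        # server side: emit ("copy", offset_in_f_old, B) or ("literal", bytes)
        out, lit, j = [], bytearray(), 0
        while j + B <= len(f_new):
            block = f_new[j:j + B]
            weak = sum(block) & 0xFFFF        # a real implementation updates this in O(1) per shift
            strong = None
            for md5, off in sig.get(weak, []):
                strong = strong or hashlib.md5(block).digest()
                if md5 == strong:             # a block of f_old occurs inside f_new
                    if lit:
                        out.append(("literal", bytes(lit))); lit = bytearray()
                    out.append(("copy", off, B))
                    j += B
                    break
            else:
                lit.append(f_new[j]); j += 1
        lit.extend(f_new[j:])
        if lit:
            out.append(("literal", bytes(lit)))
        return out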

Rsync: some experiments

          gcc size    emacs size
total     27288       27326
gzip      7563        8577
zdelta    227         1431
rsync     964         4452

Compressed size in KB (slightly outdated numbers).
There is a factor 3-5 gap between rsync and zdelta !!

A new framework: zsync
The server sends the hashes (unlike rsync, where the client does), and the client checks them.
The server deploys the common f_ref to compress the new f_tar (rsync, instead, just compresses f_tar on its own).

A multi-round protocol
Split the files into k blocks of n/k elements and recurse, over log(n/k) levels.
If the distance between the files is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits.

Next lecture

Set reconciliation
Problem: given two sets S_A and S_B of integer values located on two machines A and B, determine the difference between the two sets at one or both of the machines.

Requirements: the cost should be proportional to the size k of the difference, where k may or may not be known in advance to the parties.

Note:
 set reconciliation is "easier" than file sync [it is record-based]
 not perfectly true, but...
 a recurring-minimum trick improves the estimate (+ 2 SBF)


Algoritmi per IR

Text Indexing

What do we mean by "Indexing" ?
 Word-based indexes, where a notion of "word" must be devised!
   » Inverted files, Signature files, Bitmaps.
 Full-text indexes, with no constraint on text and queries!
   » Suffix Array, Suffix tree, String B-tree, ...

How do we solve Prefix Search?
 A trie! Or an array of string pointers!
What about Substring Search?

Basic notation and facts
Pattern P occurs at position i of T
iff P is a prefix of the i-th suffix of T (i.e. of T[i,N]).
Occurrences of P in T = all suffixes of T having P as a prefix.
Example: P = si, T = mississippi → occurrences at positions 4, 7.
SUF(T) = sorted set of suffixes of T.

Reduction: from substring search to prefix search (over SUF(T)).

The Suffix Tree
(Figure: the suffix tree of T# = mississippi#. Edges are labeled with substrings such as #, i, i#, s, si, ssi, p, pi#, ppi#, mississippi#; each of the 12 leaves stores the starting position of its suffix.)

The Suffix Array
Prop 1. All suffixes in SUF(T) sharing the prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

Storing SUF(T) explicitly takes Θ(N²) space; the suffix array stores only the starting positions:

SA   suffix
12   #
11   i#
8    ippi#
5    issippi#
2    ississippi#
1    mississippi#
10   pi#
9    ppi#
7    sippi#
4    sissippi#
6    ssippi#
3    ssissippi#

T = mississippi#
• SA: Θ(N log₂ N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step.
Example: P = si over T = mississippi#; at each step the comparison tells whether P is larger or smaller than the middle suffix, halving the SA range.

Suffix Array search
• O(log₂ N) binary-search steps
• Each step takes O(p) char comparisons
 overall, O(p log₂ N) time
Improvements: O(p + log₂ N) time [Manber-Myers, '90]; bounds depending on |Σ| [Cole et al., '06].
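A Python sketch of the indirect binary search (0-based positions; the slides' example uses 1-based positions 4 and 7):

    def sa_range(T, SA, P):
        # two binary searches: O(|P|) chars compared per step, O(|P| log N) overall
        def bound(strict):
            lo, hi = 0, len(SA)
            while lo < hi:
                mid = (lo + hi) // 2
                suf = T[SA[mid]:SA[mid] + len(P)]
                if suf < P or (strict and suf == P):
                    lo = mid + 1
                else:
                    hi = mid
            return lo
        l, r = bound(False), bound(True)
        return SA[l:r]          # starting positions of all the occurrences of P

    # T = "mississippi#"; SA = suffix_array(T); sa_range(T, SA, "si") -> [6, 3]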

Locating the occurrences
For P = si over T = mississippi#, the binary search delimits the SA range containing the suffixes sippi... and sissippi..., i.e. positions {7, 4}: occ = 2. The two boundaries can be found by searching for si# and si$, under the convention # < Σ < $.

Suffix Array search
• O(p + log₂ N + occ) time
Suffix Trays: O(p + log₂ |Σ| + occ)   [Cole et al., '06]
String B-tree   [Ferragina-Grossi, '95]
Self-adjusting Suffix Arrays   [Ciriani et al., '02]

Text mining
Lcp[1,N-1] = longest common prefix between suffixes adjacent in SA.
For T = mississippi#:  Lcp = 0 1 1 4 0 0 1 0 2 1 3  (e.g. Lcp = 4 for the adjacent suffixes issippi# and ississippi#).

• How long is the common prefix between T[i,...] and T[j,...]?
  It is the minimum of the subarray Lcp[h,k-1], where SA[h] = i and SA[k] = j.
• Does there exist a repeated substring of length ≥ L?
  Search for some Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times?
  Search for a window Lcp[i, i+C-2] whose entries are all ≥ L.
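A direct (quadratic) Python sketch of the Lcp array and of the repeated-substring query (linear-time constructions such as Kasai's exist, but are not shown here):

    def lcp_array(T, SA):
        # Lcp[i] = longest common prefix of the suffixes starting at SA[i] and SA[i+1]
        lcp = []
        for i in range(len(SA) - 1):
            a, b, l = SA[i], SA[i + 1], 0
            while a + l < len(T) and b + l < len(T) and T[a + l] == T[b + l]:
                l += 1
            lcp.append(l)
        return lcp

    def has_repeat_of_length(T, SA, L):
        # "does a substring of length >= L occur at least twice?"  <=>  some Lcp[i] >= L
        return any(l >= L for l in lcp_array(T, SA))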



Space-conscious Algorithms
Compressed data structures that support search and access operations with few I/Os.

Streaming Algorithms
Data arrive continuously (e.g. from disk or network) or we wish FEW scans.
 Streaming algorithms: use few scans, handle each element fast, use small space.

Cache-Oblivious Algorithms
(The memory hierarchy again: CPU registers; caches of a few MBs, some nanoseconds per access, few words fetched; RAM of a few GBs, tens of nanoseconds; disks of a few TBs, milliseconds, fetched in B = 32K pages; the network, many TBs, even seconds, fetched in packets.)
Unknown and/or changing devices:
 Block access is important on all levels of the memory hierarchy
 But memory hierarchies are very diverse
Cache-oblivious algorithms:
 Explicitly, they do not assume any model parameters
 Implicitly, they use blocks efficiently on all memory levels

Toy problem #1: Max Subarray
 Goal: given a stock and its daily performance over time, find the time window in which it achieved the best "market performance".
 Math problem: find the subarray of maximum sum.
A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Brute-force running times:
        4K    8K    16K    32K    128K    256K    512K    1M
n³      22s   3m    26m    3.5h   28h     --      --      --
n²      0     0     0      1s     26s     106s    7m      28m

An optimal solution
We assume every prefix subsum ≠ 0: the array splits into a prefix with sum < 0 followed by the part containing the optimum, within which the running sum stays > 0.
A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm:
 sum = 0; max = -1;
 For i = 1,...,n do
   If (sum + A[i] ≤ 0) sum = 0;
   else { sum += A[i]; max = MAX{max, sum}; }
Note:
 • Sum < 0 right before OPT starts;
 • Sum > 0 within OPT
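The same one-pass scan as a runnable Python sketch (it returns the best sum, assuming — as the slides do — that the optimum is positive):

    def max_subarray_sum(A):
        best = cur = 0
        for x in A:
            cur = max(cur + x, 0)     # restart when the running sum becomes non-positive
            best = max(best, cur)
        return best

    # max_subarray_sum([2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]) -> 12   (the window 6 1 -2 4 3)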

Toy problem #2: sorting
 How do we sort tuples (objects) on disk?
 Key observation: the array A is an "array of pointers to objects", so each object-to-object comparison A[i] vs A[j] costs 2 random accesses to memory locations.
 MergeSort: Θ(n log n) random memory accesses (I/Os??)

B-trees for sorting?
Using a well-tuned B-tree library (Berkeley DB): after n insertions, the data get distributed arbitrarily across B-tree leaves ("tuple pointers")!!!
What about listing the tuples in order? Following the leaf pointers touches the tuples in random order:
possibly 10⁹ random I/Os = 10⁹ * 5ms ≈ 2 months.

Binary Merge-Sort
Merge-Sort(A,i,j)
01  if (i < j) then
02    m = (i+j)/2;            // Divide
03    Merge-Sort(A,i,m);      // Conquer
04    Merge-Sort(A,m+1,j);
05    Merge(A,i,m,j)          // Combine

Cost of Mergesort on large data
 Take Wikipedia in Italian and compute the word frequencies:
   n = 10⁹ tuples → a few Gbs
   Typical disk (Seagate Cheetah 150Gb): seek time ~5ms
 Analysis of mergesort on disk:
   It is an indirect sort: Θ(n log₂ n) random I/Os
   [5ms] * n log₂ n ≈ 1.5 years
In practice it is faster because of caching (each merging level costs 2 passes, read/write, over the data)...

Merge-Sort Recursion Tree
(Figure: the log₂ N levels of the recursion tree over the sorted runs; e.g. the runs 1 2 5 10 and 7 9 13 19 merge into 1 2 5 7 9 10 13 19.)
If the run-size is larger than B (i.e. after the first step!!), fetching all of it in memory for merging does not help.
How do we deploy the disk/memory features?
 With M items of internal memory: N/M runs, each sorted in internal memory (no I/Os)
 — the I/O-cost for merging is ≈ 2 (N/B) log₂ (N/M)

Multi-way Merge-Sort
 The key is to balance run-size and #runs to merge
 Sort N items with main memory M and disk pages of B items:
   Pass 1: produce N/M sorted runs.
   Pass i: merge X ≤ M/B runs at a time  →  log_{M/B} (N/M) passes
 The merger keeps one input buffer per run, plus one output buffer, each of B items, in main memory.

Multiway Merging
(Figure: X = M/B runs are read through per-run buffers Bf1..BfX of B items each; at every step the minimum of the current heads min(Bf1[p1], Bf2[p2], ..., BfX[pX]) is moved to the output buffer Bfo, which is flushed when full; a run buffer is refilled from disk when its pointer pi reaches B, until EOF. The output file is the merged run.)
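In memory, the merging step is exactly what Python's heapq.merge does; an external version would simply wrap each run in a B-item reader and flush the output buffer every B items:

    import heapq

    def multiway_merge(runs):
        # runs: a list of already-sorted sequences (the sorted runs)
        return list(heapq.merge(*runs))

    # multiway_merge([[1, 2, 5, 10], [2, 7, 9, 13, 19], [3, 4, 8, 15, 17]])
    #   -> [1, 2, 2, 3, 4, 5, 7, 8, 9, 10, 13, 15, 17, 19]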

Cost of Multi-way Merge-Sort
 Number of passes = log_{M/B} #runs ≤ log_{M/B} (N/M)
 Optimal cost = Θ((N/B) log_{M/B} (N/M)) I/Os
In practice:
 M/B ≈ 1000 → #passes = log_{M/B} (N/M) ≤ 1
 One multiway merge → 2 passes = a few minutes (tuning depends on disk features)
 A large fan-out (M/B) decreases the #passes
 Compression would decrease the cost of a pass!

Does compression help?
 Goal: enlarge M and reduce N
   #passes = O(log_{M/B} N/M)
   Cost of a pass = O(N/B)

Part of Vitter's paper addresses related issues:
 Disk striping: sorting easily on D disks
 Distribution sort: top-down sorting
 Lower bounds: how far we can go

Toy problem #3: Top-freq elements
 Goal: top queries over a stream of N items (with Σ large).
 Math problem: find the item y whose frequency is > N/2, using the smallest space (i.e. assuming the mode occurs > N/2 times).
A = b a c c c d c b a a a c c b c c c

Algorithm:
 Use a pair of variables <X, C>
 For each item s of the stream:
   if (X == s) then C++
   else { C--; if (C == 0) { X = s; C = 1; } }
 Return X;
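The same one-pass scheme as a Python sketch (written in the standard Boyer-Moore majority-vote formulation; it returns the majority item whenever one occurs more than N/2 times):

    def majority_candidate(stream):
        X, C = None, 0
        for s in stream:
            if C == 0:
                X, C = s, 1
            elif s == X:
                C += 1
            else:
                C -= 1
        return X

    # majority_candidate("bacccdcbaaaccbccc") -> 'c'   ('c' occurs 9 times out of 17)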

Proof
If the algorithm ended with X ≠ y, then every one of y's occurrences would have been cancelled by a distinct "negative" mate, hence the mates would be ≥ #occ(y) and so 2 * #occ(y) ≤ N — contradicting #occ(y) > N/2.
(Problems arise only if no element occurs > N/2 times: the returned X is then arbitrary.)

Toy problem #4: Indexing
 Consider the following TREC collection:
   N = 6 * 10⁹ chars  →  size = 6Gb
   n = 10⁶ documents
   TotT = 10⁹ term occurrences (avg term length is 6 chars)
   t = 5 * 10⁵ distinct terms
What kind of data structure do we build to support word-based searches?
Solution 1: Term-Doc matrix   (t = 500K rows, n = 1 million columns; entry = 1 if the play contains the word, 0 otherwise)

             Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony             1                1              0          0       0        1
Brutus             1                1              0          1       0        0
Caesar             1                1              0          1       1        1
Calpurnia          0                1              0          0       0        0
Cleopatra          1                0              0          0       0        0
mercy              1                0              1          1       1        1
worser             1                0              1          1       1        0

Space is 500Gb !

Solution 2: Inverted index
Brutus    →  2  4  8  16  32  64  128
Calpurnia →  1  2  3  5  8  13  21  34
Caesar    →  13  16

1. Typically each posting takes about 12 bytes
2. We have 10⁹ total terms → at least 12Gb of space
3. Compressing the 6Gb of documents gets 1.5Gb of data
A better index, but it is still >10 times the (compressed) text!!!!
We can still do better, i.e. reach 30-50% of the original text.
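A toy Python sketch of building such posting lists (doc ids are kept sorted so that gaps can be encoded later; the tokenization is deliberately naive):

    from collections import defaultdict

    def build_inverted_index(docs):
        # docs: dict doc_id -> text
        index = defaultdict(list)
        for doc_id in sorted(docs):
            for term in sorted(set(docs[doc_id].lower().split())):
                index[term].append(doc_id)
        return index

    # build_inverted_index({1: "Antony and Cleopatra", 2: "Julius Caesar"})["caesar"] -> [2]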

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string is (losslessly) compressed, some other must expand.
Take all messages of length n. Is it possible to compress ALL of them into fewer bits?
NO: they are 2ⁿ, but the available shorter encodings are only
∑_{i=1..n-1} 2^i = 2ⁿ - 2.

We need to talk about stochastic sources.

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the self-information of s is:
i(s) = log₂ (1/p(s)) = -log₂ p(s)
Lower probability → higher information.
Entropy is the weighted average of i(s):
H(S) = ∑_{s∈S} p(s) log₂ (1/p(s))   bits

Statistical Coding
How do we use the probability p(s) to encode s?
 Prefix codes and their relationship to Entropy
 Huffman codes
 Arithmetic codes

Uniquely Decodable Codes
A variable-length code assigns a bit string (codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you receive the sequence 1011 ? (it parses both as "ad" and as "ca")
A uniquely decodable code can always be uniquely decomposed into its codewords.

Prefix Codes
A prefix code is a variable-length code in which no codeword is a prefix of another one
e.g. a = 0, b = 100, c = 101, d = 11
It can be viewed as a binary trie whose leaves are the symbols (left branch = 0, right branch = 1).

Average Length
For a code C with codeword lengths L[s], the average length is defined as
La(C) = ∑_{s∈S} p(s) * L[s]
We say that a prefix code C is optimal if, for all prefix codes C', La(C) ≤ La(C').

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal uniquely decodable code, there exists a prefix code with the same codeword lengths, and thus the same optimal average length. And vice versa...

Theorem (golden rule). If C is an optimal prefix code for a source with probabilities {p1, ..., pn},
then pi < pj  →  L[si] ≥ L[sj].

Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any uniquely decodable code C:
H(S) ≤ La(C)
Theorem (upper bound, Shannon). For any probability distribution, there exists a prefix code C such that
La(C) ≤ H(S) + 1
(The Shannon code, which takes ⌈log₂ 1/p(s)⌉ bits per symbol, achieves this.)

Huffman Codes
Invented by Huffman as a class assignment in the '50s.
Used in most compression algorithms: gzip, bzip, jpeg (as an option), fax compression, ...
Properties:
 Generates optimal prefix codes
 Cheap to encode and decode
 La(Huff) = H if the probabilities are powers of 2
 Otherwise, at most 1 extra bit per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5
Merge a(.1) and b(.2) into a node of weight .3; merge it with c(.2) into a node of weight .5; finally merge with d(.5) into the root (weight 1).
Resulting code: a = 000, b = 001, c = 01, d = 1.
There are 2^(n-1) "equivalent" Huffman trees.
What about ties (and thus, tree depth)?
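A compact Python sketch of the construction with a heap (ties are broken by insertion order, so it may return a different — but equally optimal — tree than the slide's):

    import heapq
    from itertools import count

    def huffman_codes(prob):
        tick = count()                        # tie-breaker: avoids comparing the dicts
        heap = [(p, next(tick), {s: ""}) for s, p in prob.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            p1, _, c1 = heapq.heappop(heap)   # the two least-probable subtrees
            p2, _, c2 = heapq.heappop(heap)
            merged = {s: "0" + w for s, w in c1.items()}
            merged.update({s: "1" + w for s, w in c2.items()})
            heapq.heappush(heap, (p1 + p2, next(tick), merged))
        return heap[0][2]

    # huffman_codes({"a": .1, "b": .2, "c": .2, "d": .5})
    #   -> {'d': '0', 'c': '10', 'a': '110', 'b': '111'}   (codeword lengths 1, 2, 3, 3)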

Encoding and Decoding
Encoding: emit the root-to-leaf path leading to the symbol to be encoded.
Decoding: start at the root and take the branch indicated by each bit received; when a leaf is reached, output its symbol and return to the root.
With the tree above:  abc... → 00000101...,  and  101001... → dcb...

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of the codeword tree, and fast in (de)coding: the Canonical Huffman tree.
We store, for every level L of the tree:
 firstcode[L], the first (all-left, 00.....0-style) codeword of that length
 Symbol[L,i], for each i in level L
This takes ≤ h² + |S| log |S| bits (h = tree height).

Canonical Huffman: Encoding walks the level tables 1..5 of the example; Decoding uses the stored values, e.g.
firstcode[1]=2, firstcode[2]=1, firstcode[3]=1, firstcode[4]=2, firstcode[5]=0
to decode a bitstream such as T = ...00010...
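A sketch of the canonical codeword assignment from the lengths alone (this is what makes the table-driven decoding possible; it follows the usual canonical rule, which need not reproduce the slide's exact firstcode values):

    def canonical_codes(lengths):
        # lengths: dict symbol -> codeword length (as produced by Huffman)
        code, prev_len, out = 0, 0, {}
        for s, l in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
            code <<= (l - prev_len)           # first code of a new, longer level
            out[s] = format(code, "0%db" % l)
            code += 1
            prev_len = l
        return out

    # canonical_codes({'a': 3, 'b': 3, 'c': 2, 'd': 1})
    #   -> {'d': '0', 'c': '10', 'a': '110', 'b': '111'}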

Problem with Huffman Coding
Consider a symbol with probability .999. Its self-information is
-log₂(.999) ≈ .00144 bits.
If we were to send 1000 such symbols we might hope to use 1000 * .00144 ≈ 1.44 bits.
Using Huffman, we take at least 1 bit per symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra bits per symbol
 but a larger model has to be transmitted
Shannon took infinite sequences, and k → ∞ !!
In practice, we have:
 The model takes |S|^k * (k * log |S|) + h² bits (where h might be |S|)
 It holds H₀(S^L) ≤ L * H_k(S) + O(k * log |S|), for each k ≤ L

Compress + Search ?   [Moura et al, 98]
Compressed text derived from a word-based Huffman:
 The symbols of the Huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged (1 tag bit + 7 Huffman bits per byte)
(Figure: the tagged, byte-aligned codewords assigned to the words of T = "bzip or not bzip", e.g. bzip → 1a 0b in the slide's two-byte notation.)

CGrep and other ideas...
Searching for P = bzip amounts to searching for its codeword 1a 0b directly in the compressed text C(T) (a GREP over bytes, made safe by the tag bits, which mark codeword beginnings).
Speed ≈ Compression ratio
You find it under my Software projects.

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary = {bzip, not, or, space};  S = "bzip or not bzip";  P = bzip = 1a 0b.
Search for P's codeword directly in the compressed text C(S), checking byte-aligned positions: each comparison answers yes/no without decompressing.
Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the pattern P[1,m] in the text T[1,n].
 Naïve solution:
   For any position i of T, check if T[i,i+m-1] = P[1,m]
   Complexity: O(nm) time
 (Classical) optimal solutions based on comparisons:
   Knuth-Morris-Pratt, Boyer-Moore
   Complexity: O(n + m) time

Semi-numerical pattern matching
 We show methods in which arithmetic and bit-operations replace comparisons
 We will survey two examples of such methods:
   The random fingerprint method, due to Karp and Rabin
   The Shift-And method, due to Baeza-Yates and Gonnet

Rabin-Karp Fingerprint
 We will use a class of functions from strings to integers in order to obtain:
   An efficient randomized algorithm that makes an error with small probability
   A randomized algorithm that never errs, whose running time is efficient with high probability
 We will consider a binary alphabet (i.e., T ∈ {0,1}ⁿ)

Arithmetic replaces Comparisons
 Strings are also numbers: H: strings → numbers.
 Let s be a string of length m:
H(s) = ∑_{i=1..m} 2^(m-i) * s[i]
Example: P = 0101 → H(P) = 2³*0 + 2²*1 + 2¹*0 + 2⁰*1 = 5
s = s' if and only if H(s) = H(s')

Definition: let T_r denote the m-length substring of T starting at position r (i.e., T_r = T[r, r+m-1]).

Arithmetic replaces Comparisons
 Exact match = scan T and compare H(T_r) with H(P):
   there is an occurrence of P starting at position r of T if and only if H(P) = H(T_r)
Example: T = 10110101, P = 0101, H(P) = 5:
 H(T₂) = H(0110) = 6 ≠ H(P);   H(T₅) = H(0101) = 5 = H(P)  →  Match!

Arithmetic replaces Comparisons
 We can compute H(T_r) from H(T_(r-1)):
H(T_r) = 2 * H(T_(r-1)) - 2^m * T[r-1] + T[r+m-1]
Example (T = 10110101, m = 4):  T₁ = 1011, T₂ = 0110;
H(T₁) = H(1011) = 11;   H(T₂) = 2*11 - 2⁴*1 + 0 = 22 - 16 + 0 = 6 = H(0110)

Arithmetic replaces Comparisons
 A simple efficient algorithm:
   Compute H(P) and H(T₁)
   Run over T, computing H(T_r) from H(T_(r-1)) in constant time, and compare H(P) with H(T_r).
 Total running time O(n+m)?  NO! Why?
   When m is large it is unreasonable to assume that each arithmetic operation takes O(1) time: the values of H() are m-bit numbers, in general too BIG to fit in a machine word.
 IDEA! Let's use modular arithmetic: for some prime q, the Karp-Rabin fingerprint of a string s is defined by H_q(s) = H(s) (mod q).

An example
P = 101111, q = 7:   H(P) = 47,   H_q(P) = 47 mod 7 = 5.
H_q(P) can be computed incrementally:
 1*2 mod 7 + 0 = 2
 2*2 mod 7 + 1 = 5
 5*2 mod 7 + 1 = 11 mod 7 = 4
 4*2 mod 7 + 1 = 2
 2*2 mod 7 + 1 = 5  =  H_q(P)
We can still compute H_q(T_r) from H_q(T_(r-1)), using 2^m (mod q) = 2 * (2^(m-1) (mod q)) (mod q).
Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint
 How about the comparisons?
 Arithmetic: there is an occurrence of P starting at position r of T if and only if H(P) = H(T_r).
 Modular arithmetic: if there is an occurrence of P starting at position r of T, then H_q(P) = H_q(T_r).
   False match! There are values of q for which the converse is not true (i.e., P ≠ T_r AND H_q(P) = H_q(T_r))!
 Our goal will be to choose a modulus q such that:
   q is small enough to keep computations efficient (i.e., the H_q() values fit in a machine word)
   q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm
 Choose a positive integer I
 Pick a random prime q ≤ I, and compute P's fingerprint H_q(P).
 For each position r in T, compute H_q(T_r) and test whether it equals H_q(P). If the numbers are equal, either
   declare a probable match (randomized algorithm), or
   check and declare a definite match (deterministic algorithm).
 Running time, excluding verification: O(n+m).
 The randomized algorithm is correct w.h.p.
 The deterministic algorithm has expected running time O(n+m).

Proof on the board
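A Python sketch over a binary alphabet (q is fixed here rather than drawn as a random prime, and every fingerprint hit is verified, so no false matches are reported):

    def karp_rabin(T, P, q=2**31 - 1):
        m = len(P)
        hp = ht = 0
        for i in range(m):
            hp = (2 * hp + int(P[i])) % q
            ht = (2 * ht + int(T[i])) % q
        top = pow(2, m - 1, q)                 # 2^(m-1) mod q, used to drop the leading bit
        hits = []
        for r in range(len(T) - m + 1):
            if hp == ht and T[r:r + m] == P:   # verification step
                hits.append(r)
            if r + m < len(T):
                ht = ((ht - int(T[r]) * top) * 2 + int(T[r + m])) % q
        return hits

    # karp_rabin("10110101", "0101") -> [4]   (0-based; position 5 in the slides' 1-based indexing)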

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method
 Define M to be a binary m-by-n matrix such that:
   M(i,j) = 1 iff the first i characters of P exactly match the i characters of T ending at position j,
   i.e., M(i,j) = 1 iff P[1...i] = T[j-i+1...j].
 Example: T = california, P = for. M has a 1 in row 1 at the column of 'f', a 1 in row 2 at the column of the following 'o', and a 1 in row 3 at the column of the following 'r': that 1 in the last row marks the occurrence of "for" ending at position 7.
 How does M solve the exact match problem? P occurs ending at position j iff M(m,j) = 1.

How to construct M
 We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one.
   Machines can perform bit and arithmetic operations between two words in constant time.
   Examples:
     And(A,B) is the bit-wise and between A and B.
     BitShift(A) is the value derived by shifting A's bits down by one position and setting the first bit to 1, i.e. BitShift((x1, x2, ..., xm)) = (1, x1, ..., x(m-1)).
 Let w be the word size (e.g., 32 or 64 bits). We'll assume m = w; NOTICE: any column of M then fits in a memory word.

How to construct M
 We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one.
 We define the m-length binary vector U(x) for each character x in the alphabet: U(x) has a 1 exactly in the positions where x appears in P.
 Example: P = abaac →
   U(a) = (1,0,1,1,0)   U(b) = (0,1,0,0,0)   U(c) = (0,0,0,0,1)

How to construct M
 Initialize column 0 of M to all zeros.
 For j > 0, the j-th column is obtained as
   M(j) = BitShift( M(j-1) ) & U( T[j] )
 For i > 1, entry M(i,j) = 1 iff
   (1) the first i-1 characters of P match the i-1 characters of T ending at position j-1  ⇔  M(i-1,j-1) = 1
   (2) P[i] = T[j]  ⇔  the i-th bit of U(T[j]) = 1
 BitShift moves bit M(i-1,j-1) into the i-th position; ANDing with the i-th bit of U(T[j]) establishes whether both conditions hold.
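The whole method in a few lines of Python (bit i of the integer mask plays the role of row i+1 of the current column; it assumes m is at most the machine word, although Python integers are unbounded anyway):

    def shift_and(T, P):
        U = {}
        for i, c in enumerate(P):
            U[c] = U.get(c, 0) | (1 << i)
        M, found, goal = 0, [], 1 << (len(P) - 1)
        for j, c in enumerate(T):
            M = ((M << 1) | 1) & U.get(c, 0)     # BitShift, then AND with U(T[j])
            if M & goal:
                found.append(j - len(P) + 1)     # an occurrence ends at j
        return found

    # shift_and("xabxabaaca", "abaac") -> [4]   (0-based start; it ends at the slides' column 9)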

An example   (T = xabxabaaca, P = abaac)
j=1: M(1) = BitShift(M(0)) & U(T[1]) = (1,0,0,0,0) & U(x) = (0,0,0,0,0).
j=2: M(2) = BitShift(M(1)) & U(T[2]) = (1,0,0,0,0) & U(a) = (1,0,0,0,0).
j=3: M(3) = BitShift(M(2)) & U(T[3]) = (1,1,0,0,0) & U(b) = (0,1,0,0,0).
...
j=9: M(9) = BitShift(M(8)) & U(T[9]) = (1,1,0,0,1) & U(c) = (0,0,0,0,1):
the 1 in the last row signals the occurrence of P ending at position 9.

Shift-And method: Complexity
 If m ≤ w, any column and any vector U() fit in a memory word: each step requires O(1) time.
 If m > w, any column and any vector U() can be divided into ⌈m/w⌉ memory words: each step requires O(m/w) time.
 Overall O(n(1 + m/w) + m) time.
 Thus it is very fast when the pattern length is close to the word size — very often in practice; recall that w = 64 bits in modern architectures.

Some simple extensions
 We can allow the pattern to contain special symbols, like the class of chars [a-f].
 Example: P = [a-b]baac: now U(a) = (1,0,1,1,0) and U(b) = (1,1,0,0,0), since both a and b match the first position; U(c) = (0,0,0,0,1).
 What about '?', '[^...]' (negation)?

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary = {bzip, not, or, space};  S = "bzip or not bzip".
Given a pattern P, find all the occurrences in S of all the dictionary terms containing P as a substring.
Example: P = o matches both  or = 1g 0a 0b  and  not = 1g 0g 0a.
Speed ≈ Compression ratio? No! Why? Because we need a scan of C(S) for each term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1, P2, ..., Pl} of total length m, we want to find all the occurrences of those patterns in the text T[1,n].
 Naïve solution:
   Use an (optimal) exact matching algorithm, searching for each pattern of P separately
   Complexity: O(nl + m) time — not good with many patterns
 Optimal solution, due to Aho and Corasick:
   Complexity: O(n + l + m) time

A simple extension of Shift-And
 S is the concatenation of the patterns in P
 R is a bitmap of length m, with R[i] = 1 iff S[i] is the first symbol of a pattern
 Use a variant of the Shift-And method searching for S:
   For any symbol c, U'(c) = U(c) AND R, so U'(c)[i] = 1 iff S[i] = c and it is the first symbol of a pattern
   For any step j: compute M(j), then take M(j) OR U'(T[j]). Why? It sets to 1 the first bit of each pattern that starts with T[j].
   Check if there are occurrences ending at j. How? By testing the bits that correspond to the last symbol of each pattern.

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors
 We extend the Shift-And method for finding inexact occurrences of a pattern in a text.
 Example: T = aatatccacaa, P = atcgaa:
   P appears in T with 2 mismatches starting at position 4; it also occurs with 4 mismatches starting at position 2.
Agrep
 Our current goal: given k, find all the occurrences of P in T with up to k mismatches.
 We define the matrix Mˡ to be an m-by-n binary matrix such that:
   Mˡ(i,j) = 1 iff there are no more than l mismatches between the first i characters of P and the i characters of T ending at position j.
 What is M⁰? It is the matrix M of the exact Shift-And method.
 How does Mᵏ solve the k-mismatch problem? P occurs with ≤ k mismatches ending at position j iff Mᵏ(m,j) = 1.

Computing Mᵏ
 We compute Mˡ for all l = 0, ..., k: for each j we compute M⁰(j), M¹(j), ..., Mᵏ(j).
 For all l, initialize Mˡ(0) to the zero vector.
 In order to compute Mˡ(j), we observe that there is a match iff one of two cases holds.

Computing Mˡ: case 1
 The first i-1 characters of P match a substring of T ending at j-1 with at most l mismatches, and the next pair of characters in P and T are equal:
   BitShift( Mˡ(j-1) ) & U( T[j] )

Computing Mˡ: case 2
 The first i-1 characters of P match a substring of T ending at j-1 with at most l-1 mismatches (and position i is charged one extra mismatch):
   BitShift( Mˡ⁻¹(j-1) )

Computing Mˡ
 Putting the two cases together:
   Mˡ(j) = [ BitShift( Mˡ(j-1) ) & U(T[j]) ]  OR  BitShift( Mˡ⁻¹(j-1) )
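A Python sketch of the recurrence (the masks M[0..k] are kept as integers; the "set first bit to 1" of BitShift is folded into the update, as in the exact version):

    def shift_and_k_mismatches(T, P, k):
        U = {}
        for i, c in enumerate(P):
            U[c] = U.get(c, 0) | (1 << i)
        goal = 1 << (len(P) - 1)
        M = [0] * (k + 1)
        found = []
        for j, c in enumerate(T):
            prev = M[:]                                            # the columns at j-1
            M[0] = ((prev[0] << 1) | 1) & U.get(c, 0)              # exact Shift-And
            for l in range(1, k + 1):
                M[l] = (((prev[l] << 1) | 1) & U.get(c, 0)) | ((prev[l - 1] << 1) | 1)
            if M[k] & goal:
                found.append(j - len(P) + 1)                       # 0-based start of a hit
        return found

    # shift_and_k_mismatches("xabxabaaca", "abaad", 1) -> [4]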

Example M¹   (T = xabxabaaca, P = abaad, k = 1)
M⁰ coincides with the exact Shift-And matrix of the previous example, except that P's last character (d) never matches, so row 5 of M⁰ is all 0s.
In M¹ the first row is all 1s (a single character always matches with at most 1 mismatch), and row 5 has a 1 in column 9: abaad occurs with at most one mismatch ending at position 9 (it is T[5,9] = abaac).
How much do we pay?
 The running time is O(k n (1 + m/w)).
 Again, the method is practically efficient for small m.
 Only O(k) columns of M are needed at any given time, hence the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations
 The Shift-And method can be extended to other ops.
 The edit distance between two strings p and s is d(p,s) = the minimum number of operations needed to transform p into s via three ops:
   Insertion: insert a symbol into p
   Deletion: delete a symbol from p
   Substitution: change a symbol of p into a different one
 Example: d(ananas, banane) = 3
 Search by regular expressions is also possible, e.g. (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations...
Canonical Huffman still needs to know the codeword lengths, and thus to build the tree...
This may be extremely time/space costly when you deal with Gbs of textual data.
A simple algorithm: sort the pᵢ in decreasing order, and encode sᵢ via a variable-length code for the integer i.
g-code for integer encoding
γ(x) = 0^(Length-1) followed by the binary representation of x, where x > 0 and Length = ⌊log₂ x⌋ + 1
e.g., 9 is represented as <000, 1001>.
 The γ-code for x takes 2⌊log₂ x⌋ + 1 bits (i.e. a factor of 2 from optimal)
 It is optimal for Pr(x) = 1/(2x²), and i.i.d. integers
It is a prefix-free encoding...
Exercise: given the following sequence of γ-coded integers, reconstruct the original sequence:
0001000001100110000011101100111
Answer: 8, 6, 3, 59, 7.
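A short Python sketch of the decoder (it reads the unary length prefix, then that many more bits):

    def gamma_decode(bits):
        out, i = [], 0
        while i < len(bits):
            l = 0
            while bits[i] == '0':            # count the leading zeros: Length-1
                l += 1; i += 1
            out.append(int(bits[i:i + l + 1], 2))
            i += l + 1
        return out

    # gamma_decode("0001000001100110000011101100111") -> [8, 6, 3, 59, 7]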

Analysis
Sort the pᵢ in decreasing order, and encode sᵢ via the variable-length code γ(i).
Recall that |γ(i)| ≤ 2 log i + 1.
How good is this approach wrt Huffman? Compression ratio ≤ 2 H₀(s) + 1.
Key fact:  1 ≥ ∑_{i=1..x} pᵢ ≥ x p_x  →  x ≤ 1/p_x.
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):



p i  | g ( i ) |

i  1 ,..., S

This is:



i  1 ,.., S

p i  [ 2 * log

1

 1]

pi

 2 * H 0 ( X ) 1
No much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding
 Byte-aligned and tagged Huffman
   128-ary Huffman tree
   The first bit of the first byte is tagged
   Configurations on 7 bits: just those of Huffman
 End-tagged dense code
   The rank r is mapped to the r-th binary sequence on 7*k bits
   The first bit of the last byte is tagged
Surprising changes:
 It is a prefix-code
 Better compression: it uses all the 7-bit configurations
(s,c)-dense codes
The distribution of words is skewed: 1/i^q, where 1 < q < 2.
A new concept: Continuers vs Stoppers — previously we used s = c = 128.
The main idea is:
 s + c = 256 (we are playing with 8 bits)
 Thus s items are encoded with 1 byte
 And s*c with 2 bytes, s*c² with 3 bytes, ...

An example
 5000 distinct words
 ETDC encodes 128 + 128² = 16512 words on at most 2 bytes
 A (230,26)-dense code encodes 230 + 230*26 = 6210 words on at most 2 bytes, hence more words on 1 byte — and thus it wins if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, assuming c = 256 - s:
 brute-force approach, or binary search (on real distributions there seems to be one unique minimum)
 K_s = max codeword length; F_s^k = cumulative probability of the symbols whose codeword length is ≤ k

Experiments: (s,c)-DC is quite interesting — search is 6% faster than byte-aligned Huffword.

Streaming compression
Still, you need to determine and sort all terms... Can we do everything in one pass?
 Move-to-Front (MTF):
   as a freq-sorting approximator
   as a caching strategy
   as a compressor
 Run-Length-Encoding (RLE):
   FAX compression

Move-to-Front Coding
Transforms a char sequence into an integer sequence, which can then be var-length coded:
 Start with the list of symbols L = [a,b,c,d,...]
 For each input symbol s:
   1) output the position of s in L
   2) move s to the front of L
There is a "memory" in the encoding.
Properties:
 It exploits temporal locality, and it is dynamic
 X = 1ⁿ2ⁿ3ⁿ...nⁿ  →  Huff = O(n² log n) bits, MTF = O(n log n) + n² bits
Not much worse than Huffman... but it may be far better.
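A few lines of Python (the initial list order is a parameter; a decoder simply replays the same moves):

    def mtf_encode(seq, alphabet):
        L = list(alphabet)
        out = []
        for s in seq:
            i = L.index(s)
            out.append(i)
            L.pop(i)
            L.insert(0, s)          # move s to the front
        return out

    # mtf_encode("banana", "abn") -> [1, 1, 2, 1, 1, 1]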

MTF: how good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2 log i + 1.
Put S at the front of the list and consider the cost of encoding:
cost ≤ O(|S| log |S|) + ∑_{x∈S} ∑_{i=2..n_x} |γ( p_x^i − p_x^(i−1) )|
where p_x^i is the position of the i-th occurrence of x. By Jensen's inequality:
cost ≤ O(|S| log |S|) + ∑_{x∈S} n_x [2 log(N/n_x) + 1] = O(|S| log |S|) + N [2 H₀(X) + 1]
Hence La[mtf] ≤ 2 H₀(X) + O(1).

MTF: higher compression
To achieve higher compression we consider words (and separators) as the symbols to be encoded.
How to keep the MTF-list efficiently:
 a search tree whose leaves contain the symbols, ordered as in the MTF-list, and whose nodes contain the size of their descending subtree;
 a hash table whose key is a symbol and whose data is a pointer to the corresponding tree leaf.
Each tree operation takes O(log |S|) time; the total is O(n log |S|), where n = #symbols to be compressed.

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In the case of binary strings → just the run lengths and one bit.
Properties:
 It exploits spatial locality, and it is a dynamic code (there is a memory)
 X = 1ⁿ2ⁿ3ⁿ...nⁿ  →  Huff(X) = Θ(n² log n) > Rle(X) = Θ(n log n)

l 0  0 l i  l i -1  s i -1 * f c i 

s0  1

s i  s i -1 * p c i 

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

n

sn 

 p c 
i

i 1

The interval for a message sequence will be called the
sequence interval

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
1.0

0.7
c = .3

0.7
0.49

b = .5

c = .3

0.55
0.49

0.2

0.0

0.55

b = .5

0.3
a = .2

The message is bbc.

0.2

0.49
0.475

c = .3

b = .5
0.35

a = .2

0.3

a = .2

Representing a real number
Binary fractional representation:
. 75

 . 11

1/ 3

 . 01 01

11 / 16

 . 1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
m in

m ax

interval

.11

.110

.111

[.75 ,1.0 )

.101

.1010

.1011

[.625 , .75 )

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
String = ACCBACCACBA B      k = 2

Context Empty:  A = 4   B = 2   C = 5   $ = 3

Context A:   C = 3   $ = 1
Context B:   A = 2   $ = 1
Context C:   A = 1   B = 2   C = 2   $ = 3

Context AC:  B = 1   C = 2   $ = 2
Context BA:  C = 1   $ = 1
Context CA:  C = 1   $ = 1
Context CB:  A = 2   $ = 1
Context CC:  A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary: all the substrings starting in the already-scanned part of the text (before the Cursor).

Algorithm's step: output the triple <d, len, c>, e.g. <2,3,c>, where
 d = distance of the copied string wrt the current position
 len = length of the longest match
 c = next char in the text beyond the longest match
Then advance by len + 1.

In practice a buffer "window" of fixed length moves over the text (a sketch of one step follows).
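One step of this scheme can be prototyped as follows (my own illustration, not the slides' code); the text is the one used in the windowed example on the next slide:

    def lz77_step(text, cursor, window):
        # Return (d, len, c) for the longest match starting in the last `window`
        # positions before the cursor; the match may overlap the part still to be coded.
        best_d, best_len = 0, 0
        for start in range(max(0, cursor - window), cursor):
            l = 0
            while cursor + l < len(text) - 1 and text[start + l] == text[cursor + l]:
                l += 1
            if l > best_len:
                best_d, best_len = cursor - start, l
        return best_d, best_len, text[cursor + best_len]

    text = "aacaacabcabaaac"
    cursor, out = 0, []
    while cursor < len(text):
        d, l, c = lz77_step(text, cursor, window=6)
        out.append((d, l, c))
        cursor += l + 1                 # advance by len + 1
    print(out)   # [(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')]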

Example: LZ77 with window
Text: a a c a a c a b c a b a a a c       Window size = 6
Output triples (longest match within W, then the next character):
(0,0,a)  (1,1,c)  (3,4,b)  (3,3,a)  (1,2,c)

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
 LZSS: output one of the following formats: (0, position, length) or (1, char). Typically uses the second format if length < 3.
 Special greedy: possibly use a shorter match so that the next match is better.
 Hash table to speed up the searches on triplets.
 Triples are coded with Huffman's code.

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids
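A compact sketch of the coding loop (mine; a plain dict of phrases stands in for the trie of the slide):

    def lz78_encode(text):
        dictionary = {}                 # phrase -> id; id 0 denotes the empty phrase
        out, i = [], 0
        while i < len(text):
            phrase = ""
            while i < len(text) and phrase + text[i] in dictionary:
                phrase += text[i]       # extend the longest match S found in the dictionary
                i += 1
            if i == len(text):          # input ended inside a known phrase
                out.append((dictionary[phrase], ""))
                break
            c = text[i]
            out.append((dictionary.get(phrase, 0), c))
            dictionary[phrase + c] = len(dictionary) + 1   # add Sc with the next free id
            i += 1
        return out

    print(lz78_encode("aabaacabcabcb"))
    # [(0,'a'), (1,'b'), (1,'a'), (0,'c'), (2,'c'), (5,'b')]  -- as in the next slide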

LZ78: Coding Example
Input: a a b a a c a b c a b c b

Output   Dict.
(0,a)    1 = a
(1,b)    2 = ab
(1,a)    3 = aa
(0,c)    4 = c
(2,c)    5 = abc
(5,b)    6 = abcb

LZ78: Decoding Example

Input    Output so far               Dict.
(0,a)    a                           1 = a
(1,b)    a a b                       2 = ab
(1,a)    a a b a a                   3 = aa
(0,c)    a a b a a c                 4 = c
(2,c)    a a b a a c a b c           5 = abc
(5,b)    a a b a a c a b c a b c b   6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
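A decoder sketch handling exactly that case (my own code; the 3-letter alphabet and the codes a=112, b=113, c=114 are a simplification of the 256 ascii entries above). The code sequence is the one produced by the encoding example on the next slide, with the final 113 for the trailing b:

    def lzw_decode(codes, alphabet="abc", base=112):
        dictionary = {base + i: ch for i, ch in enumerate(alphabet)}
        next_id = 256
        prev = dictionary[codes[0]]
        out = [prev]
        for code in codes[1:]:
            if code in dictionary:
                cur = dictionary[code]
            elif code == next_id:            # SSc case: the entry is not in the dict yet,
                cur = prev + prev[0]         # but it must be prev + its own first char
            else:
                raise ValueError("bad code")
            out.append(cur)
            dictionary[next_id] = prev + cur[0]   # what the encoder added one step earlier
            next_id += 1
            prev = cur
        return "".join(out)

    print(lzw_decode([112, 112, 113, 256, 114, 257, 261, 114, 113]))   # aabaacababacb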

LZW: Encoding Example
Input: a a b a a c a b a b a c b

Output   Dict.
112      256=aa
112      257=ab
113      258=ba
256      259=aac
114      260=ca
257      261=aba
261      262=abac
114      263=cb

LZW: Decoding Example

Input    Output so far              Dict
112      a
112      a a                        256=aa
113      a a b                      257=ab
256      a a b a a                  258=ba
114      a a b a a c                259=aac
257      a a b a a c a b            260=ca
261      a a b a a c a b ?          261 is not in the decoder's dictionary yet
114      a a b a a c a b a b a c    261=aba, resolved one step later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a certain size (used in GIF)
 Throw the dictionary away when it is no longer effective at compressing (e.g. compress)
 Throw the least-recently-used (LRU) entry away when it reaches a certain size (used in BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us be given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows                          (1994)

F                 L
#  mississipp    i
i  #mississip    p
i  ppi#missis    s
i  ssippi#mis    s
i  ssissippi#    m
m  ississippi    #
p  i#mississi    p
p  pi#mississ    i
s  ippi#missi    s
s  issippi#mi    s
s  sippi#miss    i
s  sissippi#m    i

The last column L, read top to bottom, is the transform of T.

A famous example

Much
longer...

A useful tool: the L → F mapping
[The same sorted matrix as above, with the columns F and L in evidence; the text T is unknown to the decoder.]

How do we map L's chars onto F's chars?
... we need to distinguish equal chars in F...

Take two equal chars of L and rotate their rows rightward by one position: they keep the same relative order !!

The BWT is invertible
[Again the sorted matrix above; the decoder knows only L.]

Two key properties:
1. The LF-array maps L's chars to F's chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
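A compact sketch of the whole inversion (my own code, not the slides'): LF is obtained by stably sorting the positions of L, then T is rebuilt backwards as in the pseudocode above.

    def invert_bwt(L):
        n = len(L)
        order = sorted(range(n), key=lambda i: (L[i], i))   # F column; stable on equal chars
        LF = [0] * n
        for f_pos, l_pos in enumerate(order):
            LF[l_pos] = f_pos
        r, chars = 0, []                 # row 0 is the rotation starting with the sentinel '#'
        for _ in range(n):
            chars.append(L[r])           # L[r] precedes F[r] in T: walk backwards
            r = LF[r]
        t = "".join(reversed(chars))     # this is T rotated so that '#' comes first...
        return t[1:] + t[0]              # ...so move the sentinel back to the end

    print(invert_bwt("ipssm#pissii"))    # mississippi#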

How to compute the BWT ?
SA = 12 11 8 5 2 1 10 9 7 4 6 3
The rows of the BWT matrix are the rotations of T starting at those positions; their last column is
L  =  i p s s m # p i s s i i

We said that L[i] precedes F[i] in T; for example, L[3] = T[ 7 ].
Given SA and T, we have L[i] = T[SA[i]-1]
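In code, the "elegant but inefficient" route of the next slide is a one-liner sort (my sketch; it assumes T ends with a unique smallest sentinel '#', so that sorting suffixes and sorting rotations coincide):

    def bwt_via_sa(T):
        n = len(T)
        SA = sorted(range(n), key=lambda i: T[i:])           # toy-sized inputs only
        L = "".join(T[s - 1] for s in SA)                    # s-1 == -1 wraps to the sentinel row
        return SA, L

    SA, L = bwt_via_sa("mississippi#")
    print([s + 1 for s in SA])   # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]  (1-based, as above)
    print(L)                     # ipssm#pissii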

How to construct SA from T ?
Input: T = mississippi#

SA    suffix
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#

Elegant but inefficient. Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:
 L is locally homogeneous, hence L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L (a small sketch follows)
 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !
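A tiny MTF sketch (mine, not bzip's actual code): long runs of equal characters in L become runs of zeroes, which the Run-Length stage then squeezes.

    def mtf_encode(s, alphabet):
        lst, out = list(alphabet), []
        for ch in s:
            pos = lst.index(ch)
            out.append(pos)
            lst.pop(pos)
            lst.insert(0, ch)            # recently seen symbols get small codes
        return out

    print(mtf_encode("ipppssssss", "imps"))   # [0, 2, 0, 0, 3, 0, 0, 0, 0, 0]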

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web's Characteristics
Size:
 1 trillion pages available (Google 7/08)
 5-40K per page => hundreds of terabytes
 Size grows every day!!
Change:
 8% new pages, 25% new links change weekly
 Life time of about 10 days

The Bow Tie

Some definitions
 Weakly connected components (WCC): set of nodes such that from any node you can reach any other node via an undirected path.
 Strongly connected components (SCC): set of nodes such that from any node you can reach any other node via a directed path.

Observing the Web graph
 We do not know which percentage of it we know
 The only way to discover the graph structure of the web as hypertext is via large scale crawls
 Warning: the picture might be distorted by
 size limitations of the crawl
 crawling rules
 perturbations of the "natural" process of birth and death of nodes and links

Why is it interesting?
 The largest artifact ever conceived by humans
 Exploit the structure of the Web for
 Crawl strategies
 Search
 Spam detection
 Discovering communities on the web
 Classification/organization
 Predict the evolution of the Web
 Sociological understanding

Many other large graphs…
 Physical network graph: V = routers, E = communication links
 The "cosine" graph (undirected, weighted): V = static web pages, E = semantic distance between pages
 Query-Log graph (bipartite, weighted): V = queries and URLs, E = (q,u) if u is a result for q and has been clicked by some user who issued q
 Social graph (undirected, unweighted): V = users, E = (x,y) if x knows y (facebook, address book, email, ..)

Definition
Directed graph G = (V,E):
 V = URLs, E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN & no OUT)
Three key properties; the first one:
 Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution
The in-degree follows a power-law distribution (Altavista crawl 1999, WebBase crawl 2001):
Pr[ in-degree(u) = k ]  ∝  1/k^a,   a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E):
 V = URLs, E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN, no OUT)
Three key properties:
 Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1
 Locality: usually most of the hyperlinks point to other URLs on the same host (about 80%).
 Similarity: pages close in lexicographic order tend to share many outgoing lists.

A Picture of the Web Graph
[Figure: the adjacency matrix (i,j) of a crawl with 21 million pages and 150 million links, after URL-sorting (Berkeley, Stanford).]
URL compression + Delta encoding

The library WebGraph
From the uncompressed adjacency list to an adjacency list with compressed gaps (exploiting locality):
Successor list S(x) = {s1 - x, s2 - s1 - 1, ..., sk - s(k-1) - 1}   (a small sketch follows)
For negative entries:
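A sketch of that gap transform (mine; the node id 15 and its successor list are made-up values for illustration):

    def gaps(x, successors):
        out = [successors[0] - x]                      # may be negative; locality keeps it small
        out += [successors[i] - successors[i - 1] - 1  # consecutive successors give gap 0
                for i in range(1, len(successors))]
        return out

    def ungaps(x, g):
        s = [x + g[0]]
        for d in g[1:]:
            s.append(s[-1] + d + 1)
        return s

    succ = [15, 16, 17, 22, 23, 24, 316]
    print(gaps(15, succ))                        # [0, 0, 0, 4, 0, 0, 291]
    print(ungaps(15, gaps(15, succ)) == succ)    # True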

Copy-lists
Reference chains, possibly limited.
From the uncompressed adjacency list to an adjacency list with copy lists (exploiting similarity):
 Each bit of y informs whether the corresponding successor of y is also a successor of the reference x;
 The reference index is chosen in [0,W] so as to give the best compression.

Copy-blocks = RLE(Copy-list)
From the adjacency list with copy lists to an adjacency list with copy blocks (RLE on the bit sequences):
 The first copy block is 0 if the copy list starts with 0;
 The last block is omitted (we know the length…);
 The length is decremented by one for all blocks.

This is a Java and C++ lib (≈3 bits/edge)

Extra-nodes: Compressing Intervals
From the adjacency list with copy blocks, exploit consecutivity in the extra-nodes:
 Intervals: use their left extreme and length
 Interval length: decremented by Lmin = 2
 Residuals: differences between residuals, or the source
Examples:
0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
[Figure: a sender transmits data over the network to a receiver, which may already hold some knowledge about that data.]
 network links are getting faster and faster but
 many clients are still connected by fairly slow links (mobile?)
 people wish to send more and more data
How can we make this transparent to the user?

Two standard techniques
 caching: "avoid sending the same object again"
 done on the basis of objects
 only works if objects are completely unchanged
 What about objects that are slightly changed?
 compression: "remove redundancy in transmitted data"   (overhead)
 avoid repeated substrings in the data
 can be extended to the history of past transmissions
 What if the sender has never seen the data at the receiver?

Types of Techniques
 Common knowledge between sender & receiver
 Unstructured file: delta compression
 "Partial" knowledge
 Unstructured files: file synchronization
 Record-based data: set reconciliation

Formalization
 Delta compression   [diff, zdelta, REBL,…]
 Compress file f deploying file f'
 Compress a group of files
 Speed up web access by sending the differences between the requested page and the ones available in cache
 File synchronization   [rsync, zsync]
 Client updates an old file f_old with the f_new available on a server
 Mirroring, Shared Crawling, Content Distribution Networks
 Set reconciliation
 Client updates a structured old file f_old with the f_new available on a server
 Update of contacts or appointments, intersection of inverted lists in a P2P search engine

Z-delta compression   (one-to-one)
Problem: we have two files f_known and f_new, and the goal is to compute a file f_d of minimum size such that f_new can be derived from f_known and f_d.
 Assume that block moves and copies are allowed
 Find an optimal covering set of f_new based on f_known
 The LZ77 scheme provides an efficient, optimal solution:
 f_known is the "previously encoded text"; compress the concatenation f_known f_new starting from f_new.

zdelta is one of the best implementations.

            Emacs size   Emacs time
uncompr     27Mb         ---
gzip        8Mb          35 secs
zdelta      1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: a pair of proxies located on each side of the slow link use a proprietary protocol to increase performance over this link.
[Figure: Client <-> client-side proxy <-(slow link, delta-encoded requests/pages)-> server-side proxy <-(fast link)-> web; both proxies keep the reference page.]

Use zdelta to reduce traffic:
 the old version is available at both proxies
 restricted to pages already visited (30% hits), URL-prefix match
 small cache

Cluster-based delta compression
Problem: we wish to compress a group of files F
 Useful on a dynamic collection of web pages, back-ups, …
 Apply pairwise zdelta: find for each f ∈ F a good reference
 Reduction to the Min Branching problem on DAGs:
 Build a weighted graph GF: nodes = files, weights = zdelta-sizes
 Insert a dummy node connected to all nodes, whose edge weights are the gzip-coding sizes
 Compute the min branching = directed spanning tree of min total cost, covering G's nodes.

[Figure: a small example graph with a dummy root node (0) and edge weights such as 20, 123, 220, 620, 2000.]

            space   time
uncompr     30Mb    ---
tgz         20%     linear
THIS        8%      quadratic

Improvement: what about many-to-one compression (a group of files)?

Problem: constructing G is very costly, n² edge calculations (zdelta executions).
We wish to exploit some pruning approach:
 Collection analysis: cluster the files that appear similar and are thus good candidates for zdelta compression. Build a sparse weighted graph G'F containing only the edges between those pairs of files.
 Assign weights: estimate appropriate edge weights for G'F, thus saving zdelta executions. Nonetheless, strictly n² time.

            space    time
uncompr     260Mb    ---
tgz         12%      2 mins
THIS        8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
[Figure: the client sends a request for f_new; the server sends back an update; the client holds f_old.]
 the client wants to update an out-dated file
 the server has the new file but does not know the old file
 update without sending the entire f_new (using similarity)
 rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch, since the server has both copies of the files.

The rsync algorithm
[Figure: the client sends the hashes of f_old's blocks to the server; the server sends back an encoded file made of matched blocks and literals.]
 simple, widely used, single roundtrip
 optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
 choice of block size problematic (default: max{700, √n} bytes)
 not good in theory: the granularity of changes may disrupt the use of blocks
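The rolling hash can be sketched as below (my own toy additive checksum in the spirit of rsync's weak checksum, not its exact formula); sliding the window by one byte costs O(1):

    def weak_hash(block):
        a = sum(block) % 65536
        b = sum((len(block) - i) * x for i, x in enumerate(block)) % 65536
        return (b << 16) | a

    def roll(h, old_byte, new_byte, block_len):
        a, b = h & 0xFFFF, h >> 16
        a = (a - old_byte + new_byte) % 65536
        b = (b - block_len * old_byte + a) % 65536
        return (b << 16) | a

    data, k = b"the quick brown fox jumps over the lazy dog", 8
    h = weak_hash(data[0:k])
    for i in range(1, len(data) - k + 1):
        h = roll(h, data[i - 1], data[i + k - 1], k)
        assert h == weak_hash(data[i:i + k])   # O(1) update agrees with recomputation
    print("ok")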

Rsync: some experiments

           gcc size   emacs size
total        27288      27326
gzip          7563       8577
zdelta         227       1431
rsync          964       4452

Compressed size in KB (slightly outdated numbers).
Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync
The server sends the hashes (unlike the client in rsync), and the client checks them.
The server deploys the common f_ref to compress the new f_tar (rsync compresses just it).

A multi-round protocol
 k blocks of n/k elements
 log(n/k) levels
If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits.

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff
P is a prefix of the i-th suffix of T (i.e. T[i,N])

Occurrences of P in T = all the suffixes of T having P as a prefix
Example: P = si, T = mississippi, occurrences at positions 4, 7

SUF(T) = sorted set of suffixes of T

Reduction: from substring search to prefix search (over SUF(T))

The Suffix Tree
[Figure: the suffix tree of T# = mississippi#. Edges are labeled with substrings of T# (e.g. i, s, si, ssi, ppi#, pi#, i#, mississippi#) and the 12 leaves carry the starting positions 1..12 of the corresponding suffixes.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. The starting position is the lexicographic one of P.

T = mississippi#          (storing SUF(T) explicitly would take Θ(N²) space)

SA    SUF(T)
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#

Suffix Array:
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step (one to SA, one to T).
T = mississippi#, P = si: each comparison tells whether P is smaller or larger than the middle suffix, and the search interval on SA is halved.

Suffix Array search:
• O(log2 N) binary-search steps
• Each step takes O(p) char comparisons
 Overall, O(p log2 N) time
Improvements: O(p + log2 N) [Manber-Myers, '90]; dependence on |S| [Cole et al, '06] (cf. the Suffix Trays bound on the next slide).
(a sketch of the plain binary search follows)
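The two binary searches can be prototyped as follows (my own code, the plain O(p log N) version without the refinements just cited):

    def sa_search(T, SA, P):
        # Return the range [left, right) of suffixes having P as a prefix
        lo, hi = 0, len(SA)
        while lo < hi:                                   # leftmost suffix >= P
            mid = (lo + hi) // 2
            if T[SA[mid]:SA[mid] + len(P)] < P:
                lo = mid + 1
            else:
                hi = mid
        left = lo
        lo, hi = left, len(SA)
        while lo < hi:                                   # leftmost suffix whose prefix > P
            mid = (lo + hi) // 2
            if T[SA[mid]:SA[mid] + len(P)] <= P:
                lo = mid + 1
            else:
                hi = mid
        return left, lo

    T = "mississippi#"
    SA = sorted(range(len(T)), key=lambda i: T[i:])
    left, right = sa_search(T, SA, "si")
    print([SA[i] + 1 for i in range(left, right)])   # [7, 4]: P = si occurs at positions 4 and 7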

Locating the occurrences
T = mississippi#, P = si: the binary search isolates the contiguous range of SA holding the suffixes sippi# and sissippi#, hence occ = 2 occurrences, at positions 4 and 7. The range can be delimited by searching the two patterns si# and si$, where # < S < $.

Suffix Array search: O(p + log2 N + occ) time
Suffix Trays: O(p + log2 |S| + occ)   [Cole et al., '06]
String B-tree   [Ferragina-Grossi, '95]
Self-adjusting Suffix Arrays   [Ciriani et al., '02]

Text mining
Lcp[1,N-1] = longest common prefix between suffixes adjacent in SA

T = mississippi#
SA  = 12 11  8  5  2  1 10  9  7  4  6  3
Lcp =    0  1  1  4  0  0  1  0  2  1  3
(e.g. Lcp = 4 between issippi# and ississippi#)

• How long is the common prefix between T[i,...] and T[j,...] ?
  Take the min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Is there a repeated substring of length ≥ L ?
  Search for some Lcp[i] ≥ L.
• Is there a substring of length ≥ L occurring ≥ C times ?
  Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.
(a small prototype of these checks follows)
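These queries are easy to prototype (my own sketch, with a quadratic-time LCP construction meant only for toy inputs):

    def lcp_array(T, SA):
        # Lcp[i] = longest common prefix of the adjacent suffixes SA[i] and SA[i+1]
        lcp = []
        for i in range(len(SA) - 1):
            a, b = T[SA[i]:], T[SA[i + 1]:]
            l = 0
            while l < len(a) and l < len(b) and a[l] == b[l]:
                l += 1
            lcp.append(l)
        return lcp

    def has_repeat_of_length(T, L):
        # Is there a substring of length >= L occurring at least twice?
        SA = sorted(range(len(T)), key=lambda i: T[i:])
        return any(v >= L for v in lcp_array(T, SA))

    T = "mississippi#"
    SA = sorted(range(len(T)), key=lambda i: T[i:])
    print(lcp_array(T, SA))             # [0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3]
    print(has_repeat_of_length(T, 4))   # True  ("issi")
    print(has_repeat_of_length(T, 5))   # False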



Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with i-th bit of U(T[j]) to estabilish if both are true

An example j=1
1 2 3 4 5 6 7 8 9 10

n
m

1

12345

1

0

P=abaac

2

0

3

0

4

0

5

0

T=xabxabaaca

0
 
0
U (x)  0
 
0
 
0

2

3

4

5

6

7

1 
 
0
BitShift ( M ( 0 )) & U (T [1])   0  &
 
0
 
0

8

9

0 0
   
0 0
0  0
   
0 0
   
0 0

1
0

An example j=2
1 2 3 4 5 6 7 8 9 10

n
m

1

2

12345

1

0

1

P=abaac

2

0

0

3

0

0

4

0

0

5

0

0

T=xabxabaaca

1 
 
0
U (a )   1 
 
1 
 
0

3

4

5

6

7

1 
 
0
BitShift ( M (1)) & U (T [ 2 ])   0  &
 
0
 
0

8

9

1  1 
   
0 0
1    0 
   
1   0 
   
0 0

1
0

An example j=3
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

12345

1

0

1

0

P=abaac

2

0

0

1

3

0

0

0

4

0

0

0

5

0

0

0

T=xabxabaaca

0
 
1 
U (b )   0 
 
0
 
0

4

5

6

7

1 
 
1 
BitShift ( M ( 2 )) & U (T [ 3 ])   0  &
 
0
 
0

8

9

0 0
   
1  1 
0  0
   
0 0
   
0 0

1
0

An example j=9
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

4

5

6

7

8

9

12345

1

0

1

0

0

1

0

1

1

0

P=abaac

2

0

0

1

0

0

1

0

0

0

3

0

0

0

0

0

0

1

0

0

4

0

0

0

0

0

0

0

1

0

5

0

0

0

0

0

0

0

0

1

T=xabxabaaca

0
 
0
U (c )   0 
 
0
 
1 

1 
 
1 
BitShift ( M ( 8 )) & U (T [ 9 ])   0  &
 
0
 
1 

0 0
   
0 0
0  0
   
0 0
   
1  1 

1
0

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M ( j - 1))  U (T [ j ])
l

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M

l -1

( j - 1))

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1
1 2 3 4 5 6 7 8 910

M1=

T=xabxabaaca
P=
abaad

1

2

3

4

5

6

7

8

9

1
0

1

1

1

1

1

1

1

1

1

1

1

2

0

0

1

0

0

1

0

1

1

0

3

0

0

0

1

0

0

1

0

0

1

4

0

0

0

0

1

0

0

1

0

0

5

0

0

0

0

0

0

0

0

1

0

M0=

1

2

3

4

5

6

7

8

9

1
0

1

0

1

0

0

1

0

1

1

0

1

2

0

0

1

0

0

1

0

0

0

0

3

0

0

0

0

0

0

1

0

0

0

4

0

0

0

0

0

0

0

1

0

0

5

0

0

0

0

0

0

0

0

0

0

How much do we pay?





The running time is O(kn(1+m/w)
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Delection: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
0 000 ...........0 x in b in ary
Length-1


x > 0 and Length = log2 x +1
e.g., 9 represented as <000,1001>.



g-code for x takes 2 log2 x +1 bits

(ie. factor of 2 from optimal)



Optimal for Pr(x) = 1/2x2, and i.i.d integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8

6

3

59

7

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How much good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
1 ≥Si=1,...,x pi ≥ x * px  x ≤ 1/px

How good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):



p i  | g ( i ) |

i  1 ,..., S

This is:



i  1 ,.., S

p i  [ 2 * log

1

 1]

pi

 2 * H 0 ( X ) 1
No much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1n 2n 3n… nn  Huff = O(n2 log n), MTF = O(n log n) + n2

No much worse than Huffman
...but it may be far better

MTF: how good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
Put S in the front and consider the cost of encoding:
S

O ( S log S ) 



nx

g(p - p )

x 1 i  2

x

x

i

i -1

S

By Jensen’s:

 O ( S log S ) 

n
x 1

x

[ 2 * log

N

 1]

nx

 O ( S log S )  N * [ 2 * H 0 ( X )  1]
L a [ mtf ]  2 * H 0 ( X )  O (1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1n 2n 3n… nn 

There is a memory

Huff(X) = Q(n2 log n) > Rle(X) = Q(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.
1.0

c = .3

0.7

i -1

f (i ) 



p( j)

j 1

b = .5
0.2
0.0

f(a) = .0, f(b) = .2, f(c) = .7
a = .2

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
1.0

0.7
c = .3

0.7

c = .3
0.55

0.0

a = .2

0.3
0.2

c = .3

0.27
b = .5

b = .5
0.2

0.3

a = .2

b = .5
0.22

a = .2

0.2

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l 0  0 l i  l i -1  s i -1 * f c i 

s0  1

s i  s i -1 * p c i 

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

n

sn 

 p c 
i

i 1

The interval for a message sequence will be called the
sequence interval

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
1.0

0.7
c = .3

0.7
0.49

b = .5

c = .3

0.55
0.49

0.2

0.0

0.55

b = .5

0.3
a = .2

The message is bbc.

0.2

0.49
0.475

c = .3

b = .5
0.35

a = .2

0.3

a = .2

Representing a real number
Binary fractional representation:
. 75

 . 11

1/ 3

 . 01 01

11 / 16

 . 1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
m in

m ax

interval

.11

.110

.111

[.75 ,1.0 )

.101

.1010

.1011

[.625 , .75 )

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
Context
Empty

Counts
A
B
C
$

=
=
=
=

4
2
5
3

Context
A
B
C

Counts
C
$
A
$
A
B
C
$

=
=
=
=
=
=
=
=

3
1
2
1
1
2
2
3

Context
AC

BA
CA
CB
CC

String = ACCBACCACBA B

k=2

Counts
B
C
$
C
$
C
$
A
$
A
B
$

=
=
=
=
=
=
=
=
=
=
=
=

1
2
2
1
1
1
1
2
1
1
1
2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:
 Output the triple (d, len, c), where
   d = distance of the copied string wrt the current position
   len = length of the longest match
   c = next char in the text beyond the longest match
 Advance by len + 1

A buffer “window” of fixed length slides over the text
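A minimal sketch of one such step (a naive longest-match search inside a window of size W; real coders such as gzip speed this up with hashing):

def lz77_step(text, cursor, W=6):
    best_d, best_len = 0, 0
    for start in range(max(0, cursor - W), cursor):   # candidate copy positions inside the window
        length = 0
        while (cursor + length < len(text) - 1 and
               text[start + length] == text[cursor + length]):
            length += 1                               # the copy may run past the cursor (self-overlap)
        if length > best_len:
            best_d, best_len = cursor - start, length
    c = text[cursor + best_len]                       # next char beyond the longest match
    return (best_d, best_len, c), cursor + best_len + 1

# On the window example below: lz77_step("aacaacabcabaaac", 1) == ((1, 1, 'c'), 3)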

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6. At each step: the longest match within W, then the next character.

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if len > d? (overlap with the text still to be decompressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
  (0, position, length)   or   (1, char)
Typically uses the second format if length < 3.

Special greedy: possibly use a shorter match so that the next match is better
Hash table to speed up the search for matching triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
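A minimal sketch of the decoder, including the fix-up for that special case; the numeric codes mirror the slides' illustrative a = 112, b = 113, c = 114 (a real LZW would start from the full 256-entry ASCII table):

def lzw_decode(codes, base={112: 'a', 113: 'b', 114: 'c'}, first_free=256):
    dictionary = dict(base)                 # copy, so the default table is never mutated
    nxt = first_free
    prev = dictionary[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        else:                               # special case: the code is not in the dictionary yet,
            entry = prev + prev[0]          # so it must be prev + first char of prev
        out.append(entry)
        dictionary[nxt] = prev + entry[0]   # add S·c, one step behind the encoder
        nxt += 1
        prev = entry
    return ''.join(out)

# lzw_decode([112, 112, 113, 256, 114, 257, 261, 114]) == "aabaacababac"
# (the encoding example below, up to its last emitted code)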

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a certain size (used in GIF)
 Throw the dictionary away when it is no longer effective at compressing (e.g. compress)
 Throw the least-recently-used (LRU) entry away when it reaches a certain size (used in BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
We are given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows                              (1994)

F               L
#  mississipp   i
i  #mississip   p
i  ppi#missis   s
i  ssippi#mis   s
i  ssissippi#   m
m  ississippi   #
p  i#mississi   p
p  pi#mississ   i
s  ippi#missi   s
s  issippi#mi   s
s  sippi#miss   i
s  sissippi#m   i

(Each row is a cyclic rotation of T = mississippi#; F is the first column, L = ipssm#pissii is the last column.)

A famous example

Much
longer...

A useful tool: L → F mapping

[Same matrix: F = # i i i i m p p s s s s is known (it is the sorted L), the middle of each row is unknown, L = i p s s m # p i s s i i is known.]

How do we map L’s chars onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal chars of L
Rotate their rows rightward
Same relative order !!

The BWT is invertible

[Same matrix again: F and L are known, the middle of each row is unknown.]

Two key properties:
1. The LF-array maps L’s chars to F’s chars
2. L[ i ] precedes F[ i ] in T

Reconstruct T backward:
T = .... i ppi #

InvertBWT(L)
  Compute LF[0,n-1];          // LF[r] = row of F holding the char L[r]
  r = 0; i = n;               // row 0 is the rotation starting with #
  while (i > 0) {
    T[i] = L[r];              // L[r] is the char preceding F[r] in T
    r = LF[r]; i--;           // move to the row that starts with that char
  }
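The same inversion as a minimal Python sketch: a stable sort of L yields F and, at the same time, the LF mapping.

def invert_bwt(L):
    n = len(L)
    order = sorted(range(n), key=lambda r: (L[r], r))  # stable: equal chars keep their relative order
    LF = [0] * n
    for f_row, r in enumerate(order):
        LF[r] = f_row                                  # char L[r] occupies row LF[r] of F
    out, r = [], 0                                     # row 0 is the rotation starting with #
    for _ in range(n - 1):                             # rebuild T backwards (the # itself is skipped)
        out.append(L[r])
        r = LF[r]
    return ''.join(reversed(out))

# invert_bwt("ipssm#pissii") == "mississippi"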

How to compute the BWT ?

SA  = 12 11 8 5 2 1 10 9 7 4 6 3
BWT matrix = the sorted rotations (as in the previous slides)
L   = i p s s m # p i s s i i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]
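A minimal sketch of this construction; the naive suffix sorting used here is exactly the "elegant but inefficient" approach of the next slide, and any proper suffix-array construction algorithm can be plugged in instead (the sentinel '#' must sort before the text characters, as it does for lowercase ASCII letters):

def bwt(T):                                            # T must end with the unique sentinel '#'
    sa = sorted(range(len(T)), key=lambda i: T[i:])    # suffix array, 0-based
    return ''.join(T[i - 1] for i in sa)               # L[i] = T[SA[i]-1]; i == 0 wraps to the last char

# bwt("mississippi#") == "ipssm#pissii"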

How to construct SA from T ?
Input: T = mississippi#

SA    suffix
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#

Elegant but inefficient. Obvious inefficiencies:
• Θ(n² log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:
 L is locally homogeneous
 L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L
 Run-Length coding
 Statistical coder

 Bzip vs. Gzip: 20% vs. 33% compression ratio, but Bzip is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii           (# at 16)
Mtf-list = [i,m,p,s]

Mtf  = 020030000030030200300300000100000
Mtf  = 030040000040040300400400000200000          Bin(6)=110, Wheeler’s code
RLE0 = 03141041403141410210

Alphabet of the final stream: |S|+1 symbols
Bzip2-output = Arithmetic/Huffman on the |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution
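A minimal sketch of the Move-to-Front step of this pipeline (the RLE0 stage and the final statistical coder are omitted; the initial list plays the role of the Mtf-list above):

def mtf_encode(L, mtf_list):
    lst = list(mtf_list)
    out = []
    for c in L:
        i = lst.index(c)                  # position of c in the current list
        out.append(i)
        lst.pop(i)
        lst.insert(0, c)                  # move c to the front: runs of equal chars become runs of 0s
    return out

# mtf_encode("aaabbbab", "ab") == [0, 0, 0, 1, 0, 0, 1, 1]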

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics

Size
 1 trillion pages available (Google 7/08)
 5-40K per page => hundreds of terabytes
 Size grows every day!!

Change
 8% new pages, 25% new links change weekly
 Life time of about 10 days

The Bow Tie

Some definitions

Weakly connected components (WCC)
 Set of nodes such that from any node one can reach any other node via an undirected path.

Strongly connected components (SCC)
 Set of nodes such that from any node one can reach any other node via a directed path.

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?
 The largest artifact ever conceived by humans
 Exploit the structure of the Web for
  Crawl strategies
  Search
  Spam detection
  Discovering communities on the web
  Classification/organization
 Predict the evolution of the Web
  Sociological understanding

Many other large graphs…

 Physical network graph
  V = Routers
  E = communication links

 The “cosine” graph (undirected, weighted)
  V = static web pages
  E = semantic distance between pages

 Query-Log graph (bipartite, weighted)
  V = queries and URLs
  E = (q,u) if u is a result for q, and has been clicked by some user who issued q

 Social graph (undirected, unweighted)
  V = users
  E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E):
 V = URLs, E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN & no OUT)
Three key properties:
 Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution

Altavista crawl (1999) and WebBase crawl (2001): the indegree follows a power law distribution

Pr[ in-degree(u) = k ]  ∝  1/k^α,   α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E):
 V = URLs, E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN, no OUT)

Three key properties:
 Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1
 Locality: usually most of the hyperlinks point to other URLs on the same host (about 80%).
 Similarity: pages close in lexicographic order tend to share many outgoing lists

A Picture of the Web Graph

[Adjacency-matrix plot (i vs j) after URL-sorting: 21 million pages, 150 million links; visible blocks correspond to hosts such as Berkeley and Stanford.]

URL-sorting
URL compression + Delta encoding

The library WebGraph
From the uncompressed adjacency list to an adjacency list with compressed gaps (exploiting locality):

Successor list S(x) = {s1 − x, s2 − s1 − 1, ..., sk − sk-1 − 1}

For negative entries: fold the sign, mapping d ≥ 0 to 2d and d < 0 to 2|d|−1 (this is the convention visible in the numeric examples later in this section).
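A minimal sketch of this gap transform; the sign-folding map v() is an assumption about the exact on-disk format, modeled on the examples of this section (0 = (15−15)·2, 3 = |13−15|·2−1):

def v(d):                                # fold a signed gap into a non-negative integer
    return 2 * d if d >= 0 else 2 * abs(d) - 1

def gap_encode(x, successors):           # successors of node x, sorted increasingly
    s = sorted(successors)
    gaps = [v(s[0] - x)]                  # only the first gap (relative to x) may be negative
    gaps += [s[i] - s[i - 1] - 1 for i in range(1, len(s))]
    return gaps

# gap_encode(15, [13, 15, 16, 17, 18, 19, 23, 24, 203]) == [3, 1, 0, 0, 0, 0, 3, 0, 178]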

Copy-lists    (reference chains, possibly limited in length)
From the uncompressed adjacency list to an adjacency list with copy lists (exploiting similarity):

Each bit of y’s copy-list tells whether the corresponding successor of the reference x is also a successor of y;
the reference index is chosen in [0,W] so as to give the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background

[Diagram: a sender transmits data over the network to a receiver that already holds some knowledge about that data.]

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques

 caching: “avoid sending the same object again”
  done on the basis of objects
  only works if objects are completely unchanged
  How about objects that are slightly changed?

 compression: “remove redundancy in transmitted data”
  avoid repeated substrings in data
  can be extended to the history of past transmissions (overhead)
  What if the sender has never seen the data at the receiver ?

Types of Techniques

 Common knowledge between sender & receiver
  Unstructured file: delta compression

 “Partial” knowledge
  Unstructured files: file synchronization
  Record-based data: set reconciliation

Formalization

 Delta compression    [diff, zdelta, REBL,…]
  Compress file f deploying file f’
  Compress a group of files
  Speed up web access by sending the difference between the requested page and the ones available in cache

 File synchronization    [rsync, zsync]
  Client updates its old file f_old with f_new available on a server
  Mirroring, Shared Crawling, Content Distribution Networks

 Set reconciliation
  Client updates its structured old file f_old with f_new available on a server
  Update of contacts or appointments, intersecting inverted lists in a P2P search engine

Z-delta compression    (one-to-one)

Problem: We have two files f_known and f_new, and the goal is to compute a file f_d of minimum size such that f_new can be derived from f_known and f_d
 Assume that block moves and copies are allowed
 Find an optimal covering set of f_new based on f_known
 The LZ77-scheme provides an efficient, optimal solution
  f_known is the “previously encoded text”: compress the concatenation f_known·f_new, emitting output only from f_new onwards

zdelta is one of the best implementations
            Emacs size    Emacs time
uncompr     27Mb          ---
gzip        8Mb           35 secs
zdelta      1.5Mb         42 secs
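Not zdelta itself, but the same one-to-one idea can be sketched with Python's zlib and its preset dictionary: the bytes of f_known seed the LZ77 window, so copies from the known file cost almost nothing in f_d (zdelta adds larger windows and a better copy encoding):

import zlib

def delta_compress(f_known: bytes, f_new: bytes) -> bytes:
    c = zlib.compressobj(level=9, zdict=f_known)       # f_known acts as the "previously encoded text"
    return c.compress(f_new) + c.flush()                # note: zlib only uses the last 32KB of zdict

def delta_decompress(f_known: bytes, f_d: bytes) -> bytes:
    d = zlib.decompressobj(zdict=f_known)
    return d.decompress(f_d) + d.flush()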

Efficient Web Access
Dual proxy architecture: a pair of proxies located on each side of the slow link uses a proprietary protocol to increase performance over this link.

[Diagram: Client ↔ (slow link, delta-encoding) ↔ Proxy ↔ (fast link) ↔ web; both sides keep a reference version of the requested page.]

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F
 Useful on a dynamic collection of web pages, back-ups, …

Apply pairwise zdelta: find for each f ∈ F a good reference

Reduction to the Min Branching problem on DAGs:
 Build a weighted graph G_F: nodes = files, weights = zdelta-sizes
 Insert a dummy node connected to all files, with edge weights equal to their gzip-coded sizes
 Compute the min branching = directed spanning tree of minimum total cost covering G’s nodes.

[Example graph: files 1..5 plus the dummy node 0; edge weights are zdelta sizes (e.g. 20, 123, 220, 620, 2000), and the min branching picks the cheapest reference for each file.]

            space    time
uncompr     30Mb     ---
tgz         20%      linear
THIS        8%       quadratic

Improvement

What about many-to-one compression?    (group of files)

Problem: Constructing G is very costly: n² edge calculations (zdelta executions)
 We wish to exploit some pruning approach
 Collection analysis: cluster the files that appear similar and are thus good candidates for zdelta-compression; build a sparse weighted graph G’_F containing only the edges between those pairs of files
 Assign weights: estimate appropriate edge weights for G’_F, thus saving zdelta executions. Nonetheless, strictly n² time
            space    time
uncompr     260Mb    ---
tgz         12%      2 mins
THIS        8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem

[Client holds f_old, Server holds f_new; the client sends a request and receives an update.]

 client wants to update an out-dated file
 server has the new file but does not know the old file
 update without sending the entire f_new (using similarity)
 rsync: file synch tool, distributed with Linux

Delta compression is a sort of “local” synch, since the server has both copies of the files

The rsync algorithm

[Diagram: the Client sends the block hashes of f_old to the Server; the Server replies with the encoded file, from which the Client rebuilds f_new.]

The rsync algorithm    (contd)

 simple, widely used, single roundtrip
 optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
 choice of block size is problematic (default: max{700, √n} bytes)
 not good in theory: the granularity of changes may disrupt the use of blocks
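A minimal sketch of such a rolling (Adler-like) weak checksum on blocks of size B: sliding the window by one byte costs O(1), which is what lets every alignment be tested cheaply before the stronger hash is checked (constants and the packing into 4 bytes are simplified here):

M = 1 << 16

def weak_hash(block):                      # block is a bytes object of length B
    a = sum(block) % M
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % M
    return a, b

def roll(a, b, out_byte, in_byte, B):      # slide the window: drop out_byte, append in_byte
    a = (a - out_byte + in_byte) % M
    b = (b - B * out_byte + a) % M
    return a, b

# a, b = weak_hash(b"abcd");  roll(a, b, ord('a'), ord('e'), 4) == weak_hash(b"bcde")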

Rsync: some experiments

            gcc size    emacs size
total       27288       27326
gzip         7563        8577
zdelta        227        1431
rsync         964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends the hashes (unlike the client in rsync); the client checks them.
The server deploys the common f_ref to compress the new f_tar (rsync instead compresses just f_tar on its own).

A multi-round protocol
 k blocks of n/k elements
 log(n/k) levels
 If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
 The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol
 k blocks of n/k elements
 log(n/k) levels
 If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
 The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
  iff  P is a prefix of the i-th suffix of T (i.e. T[i,N])

Occurrences of P in T = all suffixes of T having P as a prefix

Example: P = si, T = mississippi  →  occurrences at 4, 7

SUF(T) = sorted set of suffixes of T

Reduction: from substring search to prefix search

The Suffix Tree
T# = mississippi#    (positions 1..12)

[Figure: the suffix tree of T#. Edge labels include i, s, si, ssi, p, pi#, ppi#, i#, #, mississippi#; the 12 leaves store the starting positions 1..12 of the suffixes.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

Storing SUF(T) explicitly would take Θ(N²) space; store instead SA, the array of suffix pointers.

T = mississippi#        P = si

SA    SUF(T)
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#

Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step.

T = mississippi#,   P = si

[First binary-search step: P is compared with the suffix in the middle of SA; P is larger, so the search continues in the right half. In a later step P is smaller than the probed suffix, so the search continues in the left half.]

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char comparisons
 Overall, O(p log2 N) time

Improvable to O(p + log2 N) [Manber-Myers, ’90] and to O(p + log2 |S|) [Cole et al, ’06]
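A minimal sketch of the O(p log2 N) search: two plain binary searches locate the boundaries of the P-range in SA (here SA holds 0-based suffix starting positions):

def sa_search(T, SA, P):
    def prefix(i):                                 # the first |P| chars of the i-th ranked suffix
        return T[SA[i]:SA[i] + len(P)]
    lo, hi = 0, len(SA)
    while lo < hi:                                 # leftmost suffix whose prefix is >= P
        mid = (lo + hi) // 2
        if prefix(mid) < P: lo = mid + 1
        else: hi = mid
    left, hi = lo, len(SA)
    while lo < hi:                                 # leftmost suffix whose prefix is > P
        mid = (lo + hi) // 2
        if prefix(mid) <= P: lo = mid + 1
        else: hi = mid
    return SA[left:lo]                             # starting positions of all the occurrences

# sa_search("mississippi#", [11,10,7,4,1,0,9,8,6,3,5,2], "si") == [6, 3]   (i.e. 7 and 4, 1-based)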

Locating the occurrences
T = mississippi#,   P = si

Binary-search in SA for the lexicographic positions of si# and si$ (with # smaller and $ larger than every alphabet char): the SA entries in between, 7 (sippi…) and 4 (sissippi…), are the occurrences, so occ = 2

Suffix Array search
• O(p + log2 N + occ) time      (assuming # < S < $)

Suffix Trays: O(p + log2 |S| + occ)    [Cole et al., ‘06]
String B-tree                          [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays           [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

T = mississippi#
SA  = 12 11 8 5 2 1 10 9 7 4 6 3
Lcp =  0  1 1 4 0 0  1 0 2 1 3        (e.g. Lcp = 4 between issippi# and ississippi#)

• How long is the common prefix between T[i,...] and T[j,...] ?
  • Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  • Search for some Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  • Search for a subarray Lcp[i,i+C-2] whose entries are all ≥ L
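A minimal sketch of the Lcp array and of the last two queries; the Lcp construction below is the naive quadratic one (linear-time constructions, such as Kasai's, exist):

def lcp_array(T, SA):
    def lcp(a, b):
        k = 0
        while a + k < len(T) and b + k < len(T) and T[a + k] == T[b + k]:
            k += 1
        return k
    return [lcp(SA[i], SA[i + 1]) for i in range(len(SA) - 1)]

def has_repeat(Lcp, L, C=2):                # a substring of length >= L occurring >= C times?
    run = 0
    for v in Lcp:
        run = run + 1 if v >= L else 0      # need C-1 consecutive entries all >= L
        if run >= C - 1:
            return True
    return False

# lcp_array("mississippi#", [11,10,7,4,1,0,9,8,6,3,5,2]) == [0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3]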


Slide 143

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
Context
Empty

Counts
A
B
C
$

=
=
=
=

4
2
5
3

Context
A
B
C

Counts
C
$
A
$
A
B
C
$

=
=
=
=
=
=
=
=

3
1
2
1
1
2
2
3

Context
AC

BA
CA
CB
CC

String = ACCBACCACBA B

k=2

Counts
B
C
$
C
$
C
$
A
$
A
B
$

=
=
=
=
=
=
=
=
=
=
=
=

1
2
2
1
1
1
1
2
1
1
1
2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
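The same copy loop, wrapped into a tiny Python decoder as a sketch (it handles the overlapping case exactly as above, because each copied character has already been written by the time it is read):

    def lz77_decode(triples):
        out = []
        for d, length, c in triples:
            cursor = len(out)
            for i in range(length):              # works even when length > d
                out.append(out[cursor - d + i])
            out.append(c)
        return "".join(out)

    print(lz77_decode([(0, 0, 'a'), (0, 0, 'b'), (0, 0, 'c'), (0, 0, 'd'), (2, 9, 'e')]))
    # abcdcdcdcdcdce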

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash table (indexed by character triplets) to speed up the search for matches
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids
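A sketch of the coding loop in Python, keeping the trie simply as a dict from phrases to ids (phrase 0 is the empty string); on the example below it produces exactly the six pairs shown:

    def lz78_encode(t):
        dic, out, i = {"": 0}, [], 0
        while i < len(t):
            s = ""
            while i < len(t) and s + t[i] in dic:     # find the longest match S
                s += t[i]; i += 1
            if i < len(t):
                out.append((dic[s], t[i]))            # output (id of S, next char c)
                dic[s + t[i]] = len(dic)              # add Sc to the dictionary
                i += 1
            else:
                out.append((dic[s], ""))              # S ends the text: no next char
        return out

    print(lz78_encode("aabaacabcabcb"))
    # [(0, 'a'), (1, 'b'), (1, 'a'), (0, 'c'), (2, 'c'), (5, 'b')]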

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
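A sketch of the LZW coding loop in Python; the toy initial codes a = 112, b = 113, c = 114 follow the example below (a real implementation would start from the 256 ASCII codes), and the decoder-side special case is not shown:

    def lzw_encode(t, init, next_code=256):
        dic, out, s = dict(init), [], t[0]
        for c in t[1:]:
            if s + c in dic:
                s += c                         # extend the current match S
            else:
                out.append(dic[s])             # emit the code of S (no extra char is sent)
                dic[s + c] = next_code         # ...but still add Sc to the dictionary
                next_code += 1
                s = c
        out.append(dic[s])
        return out

    print(lzw_encode("aabaacababacb", {'a': 112, 'b': 113, 'c': 114}))
    # [112, 112, 113, 256, 114, 257, 261, 114, 113]  (the final 113 codes the trailing b)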

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us be given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

(1994)

F                 L
# mississipp  i
i #mississip  p
i ppi#missis  s
i ssippi#mis  s
i ssissippi#  m
m ississippi  #
p i#mississi  p
p pi#mississ  i
s ippi#missi  s
s issippi#mi  s
s sippi#miss  i
s sissippi#m  i

L = BWT(T)

A famous example

Much
longer...

A useful tool: the L → F mapping

[Figure: the same sorted-rotation matrix, but with T unknown: only the first column F and the last column L are available.]

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
[Figure: again only the columns F and L of the matrix are known; T is unknown.]

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
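A Python sketch of the inversion: LF is obtained by stably ranking the chars of L (stability is exactly the "same relative order" property above), and T is rebuilt backward starting from the row whose first char is the terminator #:

    def inverse_bwt(L):
        order = sorted(range(len(L)), key=lambda i: (L[i], i))  # stable rank of L's chars in F
        LF = [0] * len(L)
        for f_pos, l_pos in enumerate(order):
            LF[l_pos] = f_pos
        r, out = 0, []                      # row 0 is the rotation starting with '#'
        for _ in range(len(L)):
            out.append(L[r])                # L[r] precedes F[r] in T
            r = LF[r]
        t = "".join(reversed(out))          # a rotation of T beginning with '#'
        return t[1:] + t[0]                 # rotate so that '#' is back at the end

    print(inverse_bwt("ipssm#pissii"))      # mississippi#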

How to compute the BWT ?
SA   BWT matrix       L
12   #mississipp   i
11   i#mississip   p
 8   ippi#missis   s
 5   issippi#mis   s
 2   ississippi#   m
 1   mississippi   #
10   pi#mississi   p
 9   ppi#mississ   i
 7   sippi#missi   s
 4   sissippi#mi   s
 6   ssippi#miss   i
 3   ssissippi#m   i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA   suffix
12   #
11   i#
 8   ippi#
 5   issippi#
 2   ississippi#
 1   mississippi#
10   pi#
 9   ppi#
 7   sippi#
 4   sissippi#
 6   ssippi#
 3   ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults
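The "elegant but inefficient" construction, as a Python sketch: sort all suffixes, then read L off the text with L[i] = T[SA[i]-1] (cyclically):

    def bwt_via_sa(t):                    # t must end with the sentinel '#'
        sa = sorted(range(len(t)), key=lambda i: t[i:])    # naive: Θ(n² log n) worst case
        bwt = "".join(t[i - 1] for i in sa)                # L[i] = T[SA[i]-1], with wrap-around
        return sa, bwt

    sa, bwt = bwt_via_sa("mississippi#")
    print([i + 1 for i in sa])     # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]  (1-based, as above)
    print(bwt)                     # ipssm#pissii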

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210   (over an alphabet of |S|+1 symbols)

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)
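A sketch of the Move-to-Front step of this pipeline (each char of L is replaced by its current position in the list, which is then updated); on the first chars of the example L it yields the prefix of the Mtf sequence above:

    def mtf_encode(L, alphabet):
        lst, out = list(alphabet), []
        for c in L:
            r = lst.index(c)                  # current position of c
            out.append(r)
            lst.pop(r); lst.insert(0, c)      # move c to the front
        return out

    print(mtf_encode("ipppssssss", ['i', 'm', 'p', 's']))   # [0, 2, 0, 0, 3, 0, 0, 0, 0, 0]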

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size
 1 trillion pages available (Google, 7/2008)
 5–40KB per page ⇒ hundreds of terabytes
 Size grows every day!!

Change
 8% new pages and 25% new links every week
 Lifetime of a page is about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)
 Set of nodes such that from any node one can reach any other via an undirected path.

Strongly connected components (SCC)
 Set of nodes such that from any node one can reach any other via a directed path.

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


The largest artifact ever conceived by humankind



Exploit the structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


 Physical network graph
   V = routers, E = communication links

 The “cosine” graph (undirected, weighted)
   V = static web pages, E = semantic distance between pages

 Query-Log graph (bipartite, weighted)
   V = queries and URLs, E = (q,u) if u is a result for q and has been clicked by some user who issued q

 Social graph (undirected, unweighted)
   V = users, E = (x,y) if x knows y (facebook, address book, email, ...)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in-degree(u) ≥ k ]  ∝  1 / k^α,   with α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1

Locality: usually most of the hyperlinks point to other URLs on the same host (about 80%).

Similarity: pages that are close in lexicographic (URL) order tend to share many outgoing links.

A Picture of the Web Graph
[Scatter plot of the adjacency matrix (row i, column j)]
21 million pages, 150 million links

URL-sorting: pages of the same site (e.g. Berkeley, Stanford) become adjacent
URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1, s2, ..., sk} encoded by the gaps: s1 - x, s2 - s1 - 1, ..., sk - sk-1 - 1

For negative entries:
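A sketch of this gap transformation in Python (the successor values are just illustrative; the mapping of a possibly negative first gap to a natural number, alluded to above, is left out):

    def gaps(x, successors):
        s = sorted(successors)                # successors in increasing order
        return [s[0] - x] + [s[i] - s[i - 1] - 1 for i in range(1, len(s))]

    print(gaps(15, [13, 15, 16, 17, 18, 19, 23, 24, 203]))
    # [-2, 1, 0, 0, 0, 0, 3, 0, 178]   -- locality makes most gaps tiny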

Copy-lists

Reference chains
possibly limited

Uncompressed adjacency list  →  Adjacency list with copy lists (similarity)

Each bit of y's copy-list tells whether the corresponding successor of the reference x is also a successor of y;
the reference index is chosen in [0,W] so as to give the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity in the extra-nodes (runs of consecutive successor ids)

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: encoded as differences between consecutive residuals, or wrt the source node

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
[sender → receiver: the data to transmit, plus some knowledge about the data already present at the receiver]

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”
   done on the basis of whole objects
   only works if objects are completely unchanged
   How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”
   avoid repeated substrings in the data
   can be extended to the history of past transmissions (at some overhead)
   What if the sender has never seen the data at the receiver?

Types of Techniques


Common knowledge between sender & receiver
   Unstructured file: delta compression

“Partial” knowledge
   Unstructured files: file synchronization
   Record-based data: set reconciliation

Formalization


Delta compression   [diff, zdelta, REBL, …]
   Compress file f deploying file f’
   Compress a group of files
   Speed up web access by sending differences between the requested page and the ones available in cache

File synchronization   [rsync, zsync]
   Client updates an old file f_old with the f_new available on a server
   Mirroring, Shared Crawling, Content Distribution Networks

Set reconciliation
   Client updates a structured old file f_old with the f_new available on a server
   Update of contacts or appointments, intersecting inverted lists in a P2P search engine

Z-delta compression

(one-to-one)

Problem: we have two files f_known and f_new, and the goal is to compute a file f_d of minimum size such that f_new can be derived from f_known and f_d


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77-scheme provides an efficient, optimal solution:
f_known is the “previously encoded text”; compress the concatenation f_known·f_new, starting the encoding from f_new

zdelta is one of the best implementations
           Emacs size   Emacs time
uncompr    27Mb         ---
gzip       8Mb          35 secs
zdelta     1.5Mb        42 secs
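The same idea can be illustrated with zlib's preset-dictionary feature, which mimics zdelta's use of f_known as the already-encoded text (a sketch, not the zdelta tool itself; the sample data below is made up):

    import zlib

    def delta_compress(f_known: bytes, f_new: bytes) -> bytes:
        c = zlib.compressobj(level=9, zdict=f_known)       # f_known primes the LZ77 window
        return c.compress(f_new) + c.flush()

    def delta_decompress(f_known: bytes, delta: bytes) -> bytes:
        d = zlib.decompressobj(zdict=f_known)
        return d.decompress(delta)

    old = " ".join("token%d" % i for i in range(2000)).encode()
    new = old.replace(b"token1500", b"TOKEN1500")
    delta = delta_compress(old, new)
    assert delta_decompress(old, delta) == new
    print(len(zlib.compress(new, 9)), len(delta))          # the delta is far smaller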

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

[Figure: the Client proxy and the remote Proxy sit on the two sides of the slow link; requests go Client → Proxy → web over the fast link, and the Page comes back delta-encoded against a reference copy held by both proxies.]

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f ∈ F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: a small example graph of files plus a dummy node 0; edge weights are the zdelta sizes (e.g. 20, 123, 220, 620, 2000), and the min branching selects the cheapest reference for each file.]

           space   time
uncompr    30Mb    ---
tgz        20%     linear
THIS       8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: constructing G is very costly: n² edge calculations (zdelta executions)

 We wish to exploit some pruning approach

 Collection analysis: cluster the files that appear similar, and are thus good candidates for zdelta compression; build a sparse weighted graph G’_F containing only the edges between those pairs of files

 Assign weights: estimate appropriate edge weights for G’_F, thus saving zdelta executions. Nonetheless, still n² time
           space   time
uncompr    260Mb   ---
tgz        12%     2 mins
THIS       8%      16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
[Figure: the Client holds f_old, the Server holds f_new; the client sends a request and receives an update.]

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
[Figure: the Client splits f_old into blocks and sends their hashes to the Server, which holds f_new and replies with an encoded file built out of block references and literals.]

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
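A sketch of the rolling weak hash that makes this work: an Adler-style checksum that can slide over f_new one byte at a time in O(1), so every alignment can be looked up in the hash set received from the other side (the modulus and the pairing with a strong hash such as MD5 are simplified away here):

    def weak_hash(block):
        a = sum(block) % 65521
        b = sum((len(block) - i) * x for i, x in enumerate(block)) % 65521
        return (b << 16) | a

    def roll(h, out_byte, in_byte, blocksize):
        # update the hash when the window slides one byte to the right
        a = ((h & 0xFFFF) - out_byte + in_byte) % 65521
        b = ((h >> 16) - blocksize * out_byte + a) % 65521
        return (b << 16) | a

    data = b"abcdefgh"
    h = weak_hash(data[0:4])
    assert roll(h, data[0], data[4], 4) == weak_hash(data[1:5])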

Rsync: some experiments

          gcc size   emacs size
total     27288      27326
gzip      7563       8577
zdelta    227        1431
rsync     964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends the hashes (unlike rsync, where the client sends them), and the client checks them.
The server deploys the common f_ref to compress the new f_tar (rsync compresses it without using f_ref).

A multi-round protocol

 k blocks of n/k elements
 log(n/k) levels
 If the distance is k, then on each level at most k hashes do not find a match in the other file.
 The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

 k blocks of n/k elements
 log(n/k) levels
 If the distance is k, then on each level at most k hashes do not find a match in the other file.
 The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

Occurrences of P in T = all suffixes of T having P as a prefix

Example: T = mississippi, P = si  →  occurrences at positions 4 and 7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Figure: the suffix tree of T# = mississippi#. Edge labels are substrings of T# (e.g. i, s, si, ssi, p, pi#, ppi#, i#, mississippi#, #) and the 12 leaves store the starting positions 1..12 of the corresponding suffixes.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Storing SUF(T) explicitly would take Θ(N²) space; the suffix array SA keeps only the suffix pointers:

SA    SUF(T)
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#

T = mississippi#      (P = si prefixes the two adjacent suffixes sippi# and sissippi#)
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step (one to SA, one to T).

[Figure: a binary-search step for P = si over the SA of T = mississippi#; here P is larger than the probed suffix, so the search continues in the right half.]

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison.

[Figure: the next step, where P = si is smaller than the probed suffix, so the search continues in the left half.]

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

Improvements: O(p + log2 N) [Manber-Myers, ’90]; the log2 N factor can be replaced by log2 |S| [Cole et al, ’06] (Suffix Trays, below).

Locating the occurrences
[Figure: two binary searches, for P padded with # (si#) and with $ (si$), delimit the contiguous range of SA whose suffixes are sippi# and sissippi#: occ = 2 occurrences of P = si, at positions 7 and 4 of T = mississippi#.]

Suffix Array search
• O (p + log2 N + occ) time

(assuming # is smaller, and $ larger, than every character of S)
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]
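A Python sketch of the indirect binary search (two searches delimit the SA range of suffixes prefixed by P, then the occurrences are read off SA; for brevity the suffix array is built here by plain sorting):

    def sa_range(t, sa, p):
        lo, hi = 0, len(sa)                      # first suffix that is >= p
        while lo < hi:
            mid = (lo + hi) // 2
            if t[sa[mid]:sa[mid] + len(p)] < p: lo = mid + 1
            else: hi = mid
        first = lo
        lo, hi = first, len(sa)                  # first suffix that does not start with p
        while lo < hi:
            mid = (lo + hi) // 2
            if t[sa[mid]:sa[mid] + len(p)] <= p: lo = mid + 1
            else: hi = mid
        return first, lo                         # the occurrences are sa[first:lo]

    t = "mississippi#"
    sa = sorted(range(len(t)), key=lambda i: t[i:])
    l, r = sa_range(t, sa, "si")
    print(sorted(x + 1 for x in sa[l:r]))        # [4, 7]  -- the occ = 2 occurrences of P = si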

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp = 0 1 1 4 0 0 1 0 2 1 3
SA  = 12 11 8 5 2 1 10 9 7 4 6 3
T = mississippi#
(e.g., issippi# and ississippi# are adjacent in SA and share a prefix of length 4)
• How long is the common prefix between T[i,...] and T[j,...]?
• It is the minimum of the subarray Lcp[h,k-1] such that SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L?
• Search for some entry Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times?
• Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.
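A sketch that computes Lcp naively (by direct comparison of adjacent suffixes in SA) and answers the second query above:

    def lcp_array(t, sa):
        def lcp(i, j):                           # common-prefix length of two suffixes
            l = 0
            while i + l < len(t) and j + l < len(t) and t[i + l] == t[j + l]:
                l += 1
            return l
        return [lcp(sa[k], sa[k + 1]) for k in range(len(sa) - 1)]

    t = "mississippi#"
    sa = sorted(range(len(t)), key=lambda i: t[i:])
    lcp = lcp_array(t, sa)
    print(lcp)                                   # [0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3]
    L = 3
    print(any(v >= L for v in lcp))              # True: e.g. issi (length 4) repeats in T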


Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
Context
Empty

Counts
A
B
C
$

=
=
=
=

4
2
5
3

Context
A
B
C

Counts
C
$
A
$
A
B
C
$

=
=
=
=
=
=
=
=

3
1
2
1
1
2
2
3

Context
AC

BA
CA
CB
CC

String = ACCBACCACBA B

k=2

Counts
B
C
$
C
$
C
$
A
$
A
B
$

=
=
=
=
=
=
=
=
=
=
=
=

1
2
2
1
1
1
1
2
1
1
1
2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
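
A decoder sketch of this "one step behind" behaviour (initial ids and names are parameters here, so it can follow the slide's convention a = 112, b = 113, c = 114 with new entries from 256; the only non-obvious line is the SSc special case, where the received id is the one the decoder is just about to define):

def lzw_decode(codes, init_dict, next_id):
    # init_dict: id -> single-char phrase; next_id: first id assigned to new phrases.
    dictionary = dict(init_dict)
    prev = dictionary[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        else:                                   # id not defined yet: the SSc special case,
            entry = prev + prev[0]              # so the phrase must be prev + prev[0]
        out.append(entry)
        dictionary[next_id] = prev + entry[0]   # reproduce the encoder's addition, one step late
        next_id += 1
        prev = entry
    return "".join(out)

# With the codes of the next slides plus the code for the final b (113, not shown on the slide):
# lzw_decode([112,112,113,256,114,257,261,114,113], {112:"a",113:"b",114:"c"}, 256)
# returns "aabaacababacb".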

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us be given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F                L
#  mississipp   i
i  #mississip   p
i  ppi#missis   s
i  ssippi#mis   s
i  ssissippi#   m
m  ississippi   #
p  i#mississi   p
p  pi#mississ   i
s  ippi#missi   s
s  issippi#mi   s
s  sippi#miss   i
s  sissippi#m   i

(1994)

A famous example

Much
longer...

A useful tool: the L → F mapping

[Same sorted matrix as above, but now the middle of each row is unknown: only
F = # i i i i m p p s s s s  and  L = i p s s m # p i s s i i  are available.]

How do we map L’s chars onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
[Sorted matrix again: F = # i i i i m p p s s s s, middle unknown, L = i p s s m # p i s s i i]

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
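
A runnable version of the pseudo-code above (assuming, as in the example, that the text ends with a unique smallest terminator #; the LF array is obtained by stably sorting the positions of L by character, which preserves the relative order of equal chars, as stated above):

def invert_bwt(L):
    # F is obtained by sorting L; LF[r] = position in F of the character L[r]
    # (equal characters keep their relative order).
    order = sorted(range(len(L)), key=lambda r: (L[r], r))
    LF = [0] * len(L)
    for f_pos, r in enumerate(order):
        LF[r] = f_pos
    # Row 0 starts with the terminator, so L[0] is the character preceding it in T.
    r, out = 0, []
    for _ in range(len(L)):
        out.append(L[r])
        r = LF[r]
    t = "".join(reversed(out))       # T rotated so that the terminator comes first
    return t[1:] + t[0]              # put the terminator back at the end

# invert_bwt("ipssm#pissii") returns "mississippi#".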

How to compute the BWT ?
SA    BWT matrix     L
12    #mississipp    i
11    i#mississip    p
 8    ippi#missis    s
 5    issippi#mis    s
 2    ississippi#    m
 1    mississippi    #
10    pi#mississi    p
 9    ppi#mississ    i
 7    sippi#missi    s
 4    sissippi#mi    s
 6    ssippi#miss    i
 3    ssissippi#m    i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults
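
The "elegant but inefficient" construction in a few lines (a sketch only; positions are 1-based as in the slides, and L[i] = T[SA[i]-1] with a wrap-around for SA[i] = 1):

def bwt_via_suffix_array(T):
    # T must end with a unique smallest character (here '#').
    n = len(T)
    SA = sorted(range(1, n + 1), key=lambda i: T[i - 1:])    # sort the n suffixes explicitly
    L = "".join(T[i - 2] if i > 1 else T[-1] for i in SA)    # L[i] = T[SA[i]-1], cyclically
    return SA, L

# bwt_via_suffix_array("mississippi#") returns
# SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3] and L = "ipssm#pissii".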

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !
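
A small sketch of the first stage of this pipeline, the Move-to-Front step (the RLE and the statistical coder are analogous one-pass scans). The initial symbol list is a parameter, e.g. the Mtf-list [i,m,p,s] of the example on the next slide, where the # is apparently handled apart (its position, 16, is sent separately).

def mtf_encode(L, mtf_list):
    # Replace each character of L by its current position in the list,
    # then move that character to the front (this is the "memory" of MTF).
    lst, out = list(mtf_list), []
    for ch in L:
        pos = lst.index(ch)
        out.append(pos)
        lst.pop(pos)
        lst.insert(0, ch)
    return out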

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion pages available (Google, 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)

Set of nodes such that from any node one can reach any other node via an
undirected path.

Strongly connected components (SCC)

Set of nodes such that from any node one can reach any other node via a
directed path.

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by humankind



Exploit the structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…

 Physical network graph
   V = Routers
   E = communication links

 The “cosine” graph (undirected, weighted)
   V = static web pages
   E = semantic distance between pages

 Query-Log graph (bipartite, weighted)
   V = queries and URLs
   E = (q,u) if u is a result for q, and has been clicked by some user who issued q

 Social graph (undirected, unweighted)
   V = users
   E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has a hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Prob. that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution

Altavista crawl (1999) and WebBase crawl (2001): the indegree follows a power-law distribution

Pr[ in-degree(u) = k ]  ≈  1 / k^α,    α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has a hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Prob. that a node has x links is 1/x^α, α ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing links

A Picture of the Web Graph
[Picture: adjacency matrix (i,j) of a crawl with 21 million pages and 150 million links; after URL-sorting, hosts such as Berkeley and Stanford form dense blocks]

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:
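
A gap-encoding sketch for a successor list (names mine): only the first gap can be negative, and a standard way to make it a natural number is the folding v ≥ 0 → 2v, v < 0 → 2|v|−1, the same arithmetic that gives 3 = |13-15|*2-1 in the "Compressing Intervals" examples below. The real WebGraph format then adds copy lists and intervals, described on the next slides.

def encode_successors(x, succ):
    # succ: the sorted successor list of node x.
    if not succ:
        return []
    gaps = [succ[0] - x] + [succ[j] - succ[j - 1] - 1 for j in range(1, len(succ))]
    v = gaps[0]                                  # only this entry may be negative
    gaps[0] = 2 * v if v >= 0 else 2 * (-v) - 1  # fold the sign into a natural number
    return gaps

# encode_successors(15, [13, 15, 16, 17, 18, 19]) -> [3, 1, 0, 0, 0, 0]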

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit in the copy list of y tells whether the corresponding successor of the
reference x is also a successor of y;
The reference index is chosen in [0,W] so as to give the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks but the first

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
[Picture: a sender transmits data over the network to a receiver, which may already hold some knowledge about that data]

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



What if the sender has never seen the data at the receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization

Delta compression   [diff, zdelta, REBL,…]
 Compress file f deploying file f’
 Compress a group of files
 Speed-up web access by sending differences between the requested
   page and the ones available in cache

File synchronization   [rsync, zsync]
 Client updates old file f_old with f_new available on a server
 Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation
 Client updates structured old file f_old with f_new available on a server
 Update of contacts or appointments, intersect IL in P2P search engines

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77-scheme provides an efficient, optimal solution




fknown is the “previously encoded text”: compress the concatenation fknown·fnew, starting from fnew

zdelta is one of the best implementations
            Emacs size   Emacs time
uncompr     27Mb         ---
gzip        8Mb          35 secs
zdelta      1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

[Picture: Client — slow link — Proxy — fast link — web; the requested page travels delta-encoded over the slow link, using a reference page known to both proxies]

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f ∈ F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Picture: example of GF with the dummy node 0; edge weights (e.g. 20, 123, 220, 620, 2000) are the zdelta/gzip sizes]

            space   time
uncompr     30Mb    ---
tgz         20%     linear
THIS        8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n² edge calculations (zdelta executions)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F, thus saving
zdelta executions. Nonetheless, this still takes n² time
            space   time
uncompr     260Mb   ---
tgz         12%     2 mins
THIS        8%      16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
[Picture: the Client holds f_old and sends a request to the Server; the Server holds f_new and sends back an update]

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch, since the server has both copies of the files

The rsync algorithm
[Picture: the Client sends the block hashes of f_old to the Server; the Server, which holds f_new, sends back the encoded file]

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
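
A toy sketch of the block-matching idea only (this is not rsync's real protocol: the checksum below is a stand-in for the 4-byte rolling hash, the strong 2-byte MD5 check is replaced by a direct comparison, and the hash is recomputed instead of rolled):

def weak_hash(block):
    # Adler-style checksum; a real implementation updates it in O(1) as the window slides.
    a = sum(block) % 65521
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % 65521
    return (b << 16) | a

def delta_against_blocks(f_old, f_new, B):
    # Client side: hash the B-byte blocks of f_old and send the hashes.
    table = {weak_hash(f_old[i:i + B]): i // B
             for i in range(0, len(f_old) - B + 1, B)}
    # Server side: slide over f_new; on a hit emit a block reference, else a literal byte.
    out, i = [], 0
    while i + B <= len(f_new):
        h = weak_hash(f_new[i:i + B])
        if h in table and f_new[i:i + B] == f_old[table[h] * B: table[h] * B + B]:
            out.append(("copy", table[h]))
            i += B
        else:
            out.append(("lit", f_new[i]))
            i += 1
    out.extend(("lit", x) for x in f_new[i:])
    return out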

Rsync: some experiments

          gcc size   emacs size
total     27288      27326
gzip      7563       8577
zdelta    227        1431
rsync     964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends the hashes (unlike the client in rsync), and the client checks them
The server deploys the common fref to compress the new ftar (rsync just compresses it)

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

Occurrences of P in T = all suffixes of T having P as a prefix

Example: P = si, T = mississippi: occurrences at positions 4 and 7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Picture: the suffix tree of T# = mississippi#, with edges labeled by substrings (e.g. i, s, p, si, ssi, ppi#, pi#, i#, mississippi#) and leaves labeled by the starting positions 1..12 of the corresponding suffixes]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.

T = mississippi#        (storing SUF(T) explicitly takes Θ(N²) space)

SA    SUF(T)
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#

(each SA entry is a suffix pointer into T)

P = si

Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3],   T = mississippi#,   P = si

Compare P with the suffix in the middle of the current SA range: here P is larger,
so recurse on the right half (2 accesses per step: one to SA, one to the text).

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3],   T = mississippi#,   P = si

At a later step P is smaller than the middle suffix, so recurse on the left half.

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

Improvable: O(p + log2 N) [Manber-Myers, ’90];  O(p + log2 |Σ|) [Cole et al, ’06]
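
A sketch of the indirect binary search (1-based suffix positions as in the slides; each comparison looks at most |P| characters of the suffix, hence O(p log2 N) overall):

def sa_range(T, SA, P):
    # Returns [lo, hi): the SA positions whose suffixes have P as a prefix.
    def pref(i):                              # the first |P| chars of the suffix starting at i
        return T[i - 1:i - 1 + len(P)]
    lo, hi = 0, len(SA)
    while lo < hi:                            # leftmost suffix whose prefix is >= P
        mid = (lo + hi) // 2
        if pref(SA[mid]) < P: lo = mid + 1
        else: hi = mid
    left = lo
    hi = len(SA)
    while lo < hi:                            # leftmost suffix whose prefix is > P
        mid = (lo + hi) // 2
        if pref(SA[mid]) <= P: lo = mid + 1
        else: hi = mid
    return left, lo

# With T = "mississippi#", the SA above and P = "si", the returned range contains
# exactly the suffixes sippi# and sissippi#, i.e. the occurrences at positions 7 and 4.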

Locating the occurrences
SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3],   T = mississippi#

The occurrences of P = si form a contiguous range of SA (the rows of sippi# and sissippi#).
Binary-search the two boundaries of this range, i.e. search for si# and si$ (recall # < Σ < $).
Here occ = 2, and the occurrences start at positions 4 and 7.

Suffix Array search
• O (p + log2 N + occ) time

(assuming # < Σ < $)
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

T = mississippi#

SA    suffix          Lcp with the next suffix
12    #               0
11    i#              1
 8    ippi#           1
 5    issippi#        4
 2    ississippi#     0
 1    mississippi#    0
10    pi#             1
 9    ppi#            0
 7    sippi#          2
 4    sissippi#       1
 6    ssippi#         3
 3    ssissippi#      -

• How long is the common prefix between T[i,...] and T[j,...] ?
  It is the min of the subarray Lcp[h,k-1], where SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  Search for an entry Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.
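
A naive sketch of the Lcp array and of the repeated-substring query (an O(N²) computation just to fix ideas; linear-time constructions exist):

def lcp_array(T, SA):
    # Lcp[i] = length of the longest common prefix of the suffixes SA[i] and SA[i+1].
    def lcp(i, j):
        a, b = T[i - 1:], T[j - 1:]
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        return k
    return [lcp(SA[i], SA[i + 1]) for i in range(len(SA) - 1)]

def longest_repeated_substring_length(T, SA):
    # There is a repeated substring of length >= L iff some Lcp entry is >= L,
    # so the longest repeated substring has length max(Lcp).
    return max(lcp_array(T, SA), default=0)

# For T = mississippi# the maximum Lcp value is 4 (the repeated substring issi).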


Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
Context
Empty

Counts
A
B
C
$

=
=
=
=

4
2
5
3

Context
A
B
C

Counts
C
$
A
$
A
B
C
$

=
=
=
=
=
=
=
=

3
1
2
1
1
2
2
3

Context
AC

BA
CA
CB
CC

String = ACCBACCACBA B

k=2

Counts
B
C
$
C
$
C
$
A
$
A
B
$

=
=
=
=
=
=
=
=
=
=
=
=

1
2
2
1
1
1
1
2
1
1
1
2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
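A minimal C sketch of the decoder staying one step behind, including the special SSc case; the code stream below is an assumed example (plain ASCII codes are used for single characters, instead of the slide's a = 112 convention).

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAXD 4096

int main(void) {
    char *dict[MAXD];                          /* dict[i] = string for code i */
    int next = 256;
    for (int i = 0; i < 256; i++) {            /* 256 initial single-char entries */
        dict[i] = malloc(2);
        dict[i][0] = (char)i; dict[i][1] = '\0';
    }
    int input[] = { 'a', 'a', 'b', 256, 'c', 257, 261, 'c' };   /* assumed code stream */
    int m = (int)(sizeof(input)/sizeof(input[0]));

    char prev[1024], entry[1024];
    strcpy(prev, dict[input[0]]);
    fputs(prev, stdout);
    for (int k = 1; k < m; k++) {
        int c = input[k];
        if (c < next) strcpy(entry, dict[c]);
        else {                                  /* code not yet known: the SSc case */
            strcpy(entry, prev);
            strncat(entry, prev, 1);            /* entry = prev + prev[0] */
        }
        fputs(entry, stdout);
        dict[next] = malloc(strlen(prev) + 2);  /* learn the entry added by the encoder one step earlier */
        strcpy(dict[next], prev);
        strncat(dict[next], entry, 1);          /* prev + entry[0] */
        next++;
        strcpy(prev, entry);
    }
    putchar('\n');
    return 0;
}

When code 261 arrives the decoder has only built entries up to 260, so the "prev + prev[0]" branch is exercised: this is the one-step-behind situation of the decoding example above.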

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F   (sorted rotations)   L
#   mississipp           i
i   #mississip           p
i   ppi#missis           s
i   ssippi#mis           s
i   ssissippi#           m
m   ississippi           #
p   i#mississi           p
p   pi#mississ           i
s   ippi#missi           s
s   issippi#mi           s
s   sippi#miss           i
s   sissippi#m           i

The last column L is the output of the transform (Burrows-Wheeler, 1994).

A famous example

Much
longer...

A useful tool: the L → F mapping
[Same sorted-rotation matrix as above: F is its (known, sorted) first column, L its last column; the middle of each row is unknown to the decoder.]

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
[Again the sorted-rotation matrix of T: F is obtained by sorting L, the rest of the matrix stays unknown.]

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}

How to compute the BWT ?
SA    sorted rotation    L = T[SA[i]-1]
12    #mississippi       i
11    i#mississipp       p
 8    ippi#mississ       s
 5    issippi#miss       s
 2    ississippi#m       m
 1    mississippi#       #
10    pi#mississip       p
 9    ppi#mississi       i
 7    sippi#missis       s
 4    sissippi#mis       s
 6    ssippi#missi       i
 3    ssissippi#mi       i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA    suffix
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#

Elegant but inefficient.   Input: T = mississippi#

Obvious inefficiencies:
• Θ(n^2 log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults
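A minimal C sketch of exactly this "elegant but inefficient" construction: sort the suffixes with qsort and read L[i] = T[SA[i]-1], cyclically, so that the row of suffix 1 contributes '#'. Names are illustrative.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

static const char *text;                        /* the text, visible to the comparator */

static int cmp_suffix(const void *a, const void *b) {
    int i = *(const int *)a, j = *(const int *)b;
    return strcmp(text + i, text + j);          /* lexicographic order of the two suffixes */
}

int main(void) {
    const char *T = "mississippi#";             /* '#' is the smallest character in ASCII here */
    int n = (int)strlen(T), SA[64];
    text = T;
    for (int i = 0; i < n; i++) SA[i] = i;      /* 0-based suffix starts */
    qsort(SA, n, sizeof(int), cmp_suffix);

    printf("SA (1-based): ");
    for (int i = 0; i < n; i++) printf("%d ", SA[i] + 1);
    printf("\nL = ");
    for (int i = 0; i < n; i++)                 /* L[i] = T[SA[i]-1], read cyclically */
        putchar(T[(SA[i] + n - 1) % n]);
    putchar('\n');
    return 0;
}

On this input it prints SA = 12 11 8 5 2 1 10 9 7 4 6 3 and L = ipssm#pissii, matching the tables above.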

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: the probability that a node has x links is ∝ 1/x^a, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
   Pr[ in-degree(u) = k ]  ∝  1 / k^a,   with a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: the probability that a node has x links is ∝ 1/x^a, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides and efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
           Emacs size    Emacs time
uncompr    27Mb          ---
gzip       8Mb           35 secs
zdelta     1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

0

620

2

20

123

2000

20
220

3

1
5

           space    time
uncompr    30Mb     ---
tgz        20%      linear
THIS       8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
           space    time
uncompr    260Mb    ---
tgz        12%      2 mins
THIS       8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

           gcc size    emacs size
total      27288       27326
gzip       7563        8577
zdelta     227         1431
rsync      964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

[Figure: P aligned against T at position i, i.e. against the suffix T[i,N].]

Occurrences of P in T = all suffixes of T having P as a prefix.
Example: P = si, T = mississippi → occurrences at positions 4 and 7.

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Figure: the suffix tree of T# = mississippi#; the 12 leaves are labelled with the starting positions 1..12 of the suffixes, and the edges carry substrings such as #, i, si, ssi, ppi#, pi#, i#, mississippi#.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Storing SUF(T) explicitly takes Θ(N^2) space; the suffix array stores only one pointer per suffix:

SA    SUF(T)
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#

T = mississippi#   (each SA entry is a suffix pointer; e.g. P = si prefixes the suffixes starting at 7 and 4)

Suffix Array space:
• SA: Θ(N log2 N) bits
• Text T: N chars
→ In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
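A minimal C sketch of the binary search just analyzed, hard-coding the suffix array of mississippi# from the previous slides; it reports the SA range of suffixes prefixed by P and therefore all occ occurrences.

#include <stdio.h>
#include <string.h>

static const int n = 12;
static const char *T = "mississippi#";
static const int SA[12] = {12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3};   /* 1-based, as in the slide */

/* compare P, prefix-wise, with the suffix starting at (1-based) position s */
static int cmp(const char *P, int s) {
    return strncmp(P, T + s - 1, strlen(P));
}

int main(void) {
    const char *P = "si";
    int lo = 0, hi = n - 1, first = -1;
    while (lo <= hi) {                           /* leftmost suffix having P as a prefix */
        int mid = (lo + hi) / 2;
        int c = cmp(P, SA[mid]);
        if (c <= 0) { if (c == 0) first = mid; hi = mid - 1; }
        else lo = mid + 1;
    }
    if (first < 0) { printf("no occurrence\n"); return 0; }
    int last = first;
    while (last + 1 < n && cmp(P, SA[last + 1]) == 0) last++;   /* right end of the range */
    printf("occ = %d, at positions:", last - first + 1);
    for (int i = first; i <= last; i++) printf(" %d", SA[i]);
    printf("\n");
    return 0;
}

For P = si it prints occ = 2 at positions 7 and 4, as in the figure.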

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

SA  = 12  11   8   5   2   1  10   9   7   4   6   3
Lcp =  0   0   1   4   0   0   1   0   2   1   3        (as listed on the slide; e.g. Lcp = 4 between the adjacent suffixes issippi# and ississippi# of T = mississippi#)

• How long is the common prefix between T[i,...] and T[j,...]?
  It is the min of the subarray Lcp[h,k-1] such that SA[h] = i and SA[k] = j.
• Does there exist a repeated substring of length ≥ L?
  Search for some Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times?
  Search for a window Lcp[i, i+C-2] whose entries are all ≥ L.


Slide 146

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
   C · p · f/(1+f)
This is at least 10^4 · f/(1+f).
If we fetch B ≈ 4Kb in time C, and the algorithm uses all of it:
   (1/B) · (p · f/(1+f) · C)  ≈  30 · f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
Running times as the input size n grows:

n:      4K    8K    16K   32K    128K   256K   512K   1M
n^3:    22s   3m    26m   3.5h   28h    --     --     --
n^2:    0     0     0     1s     26s    106s   7m     28m

An optimal solution
We assume every subsum ≠ 0.
[Figure: the array A split into a prefix of negative sum followed by the optimum window of positive sum.]

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm:
  sum = 0; max = -1;
  for i = 1,...,n do
     if (sum + A[i] ≤ 0) then sum = 0;
     else { sum += A[i]; max = MAX{max, sum}; }

Note:
• sum < 0 just before OPT starts;
• sum > 0 within OPT.
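A minimal C sketch of this linear scan on the example array; variable names follow the pseudocode above.

#include <stdio.h>

int main(void) {
    int A[] = {2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7};
    int n = (int)(sizeof(A)/sizeof(A[0]));
    int sum = 0, max = -1;                     /* max starts below any admissible sum */
    for (int i = 0; i < n; i++) {
        if (sum + A[i] <= 0) sum = 0;          /* a running sum that drops to <= 0 never starts the optimum */
        else { sum += A[i]; if (sum > max) max = sum; }
    }
    printf("max subarray sum = %d\n", max);    /* 12, given by the window 6 1 -2 4 3 */
    return 0;
}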

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 10^9 random I/Os = 10^9 × 5ms ≈ 2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01  if (i < j) then
02     m = (i+j)/2;              Divide
03     Merge-Sort(A,i,m);        Conquer
04     Merge-Sort(A,m+1,j);
05     Merge(A,i,m,j)            Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF
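A minimal in-memory C sketch of the merging loop of this figure: the minimum of the X current heads is selected and appended to the output. The page buffers Bf1..Bfx, Bfo and the refill/flush logic on disk are omitted, the three runs are illustrative, and a heap (instead of the linear scan) would give O(log X) time per output item.

#include <stdio.h>
#include <limits.h>

#define X 3

int main(void) {
    int run0[] = {1, 5, 9, 13}, run1[] = {2, 7, 10, 19}, run2[] = {3, 4, 8, 15};
    int *run[X] = {run0, run1, run2};
    int len[X] = {4, 4, 4}, pos[X] = {0, 0, 0};

    for (;;) {
        int best = -1, min = INT_MAX;
        for (int i = 0; i < X; i++)                   /* min(Bf1[p1], ..., BfX[pX]) */
            if (pos[i] < len[i] && run[i][pos[i]] < min) { min = run[i][pos[i]]; best = i; }
        if (best < 0) break;                          /* all runs exhausted (EOF) */
        printf("%d ", min);                           /* would be appended to the output run */
        pos[best]++;
    }
    printf("\n");
    return 0;
}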

Cost of Multi-way Merge-Sort


Number of passes = log_{M/B} #runs ≈ log_{M/B} (N/M)
Optimal cost = Θ( (N/B) · log_{M/B} (N/M) ) I/Os

In practice:
 M/B ≈ 1000  →  #passes = log_{M/B} (N/M) ≈ 1
 One multiway merge suffices  →  2 passes = few mins   (tuning depends on disk features)
 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof. Problems could arise only if y occurred ≤ N/2 times. If the scan ends with X ≠ y, then every one of y's occurrences has a distinct "negative" mate, hence the mates are at least #occ(y) in number; but then N ≥ #occ(y) + #mates ≥ 2 · #occ(y) > N, a contradiction.
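A minimal C sketch of the two-variable scan above, run on the example stream; it is the classical majority-vote formulation, restructured so that a new candidate is adopted whenever the counter reaches zero.

#include <stdio.h>

int main(void) {
    char A[] = "bacccdcbaaaccbccc";          /* the example stream */
    char X = 0;
    int C = 0;
    for (int i = 0; A[i] != '\0'; i++) {
        if (C == 0) { X = A[i]; C = 1; }     /* adopt a new candidate */
        else if (X == A[i]) C++;             /* same candidate: one more vote */
        else C--;                            /* different symbol: cancel one vote */
    }
    printf("candidate = %c\n", X);           /* 'c', which indeed occurs > N/2 times here */
    return 0;
}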

Toy problem #4: Indexing


Consider the following TREC collection:
 N = 6 · 10^9 chars, i.e. size = 6Gb
 n = 10^6 documents
 TotT = 10^9 term occurrences (avg term length is 6 chars)
 t = 5 · 10^5 distinct terms

What kind of data structure do we build to support word-based searches?

Solution 1: Term-Doc matrix
n = 1 million documents (columns), t = 500K terms (rows); an entry is 1 if the play contains the word, 0 otherwise.

            Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony             1                1             0          0       0        1
Brutus             1                1             0          1       0        0
Caesar             1                1             0          1       1        1
Calpurnia          0                1             0          0       0        0
Cleopatra          1                0             0          0       0        0
mercy              1                0             1          1       1        1
worser             1                0             1          1       1        0

Space is 500Gb !

Solution 2: Inverted index
Brutus    → 2, 4, 8, 16, 32, 64, 128
Calpurnia → 1, 2, 3, 5, 8, 13, 21, 34
Caesar    → 13, 16

1. Typically we use about 12 bytes per posting
2. We have 10^9 total terms → at least 12Gb of space
3. Compressing the 6Gb of documents gets 1.5Gb of data
A better index, but it is still >10 times the text !!!!  (We can still do better, i.e. 30-50% of the original text.)

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO: the messages are 2^n, but the distinct compressed strings of length < n are only
   Σ_{i=1..n-1} 2^i  =  2^n − 2  <  2^n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the self-information of s is:
   i(s) = log2 (1/p(s)) = − log2 p(s)
Lower probability → higher information.

Entropy is the weighted average of i(s):
   H(S) = Σ_{s∈S} p(s) · log2 (1/p(s))   bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
[Figure: the binary trie of this code, with a at the leaf reached by 0 and b = 100, c = 101, d = 11 at the other leaves.]

Average Length
For a code C with codeword length L[s], the
average length is defined as
   La(C) = Σ_{s∈S} p(s) · L[s]

We say that a prefix code C is optimal if for all prefix codes C’, La(C) ≤ La(C’).

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

   H(S) ≤ La(C)

Theorem (upper bound, Shannon). For any probability distribution, there exists a prefix code C such that
   La(C) ≤ H(S) + 1

(The Shannon code assigns to each symbol ⌈log2 (1/p)⌉ bits.)

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5
Repeatedly merge the two smallest-probability nodes:
   a(.1) + b(.2) → (.3);    (.3) + c(.2) → (.5);    (.5) + d(.5) → (1)
Labelling the two branches of every merge with 0/1 gives:  a = 000, b = 001, c = 01, d = 1
There are 2^(n-1) “equivalent” Huffman trees.

What about ties (and thus, tree depth) ?
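A minimal C sketch of the construction on this running example, using a linear-scan extract-min (a heap would be the idiomatic choice). Tie-breaking here yields the equivalent optimal code d = 0, c = 10, a = 110, b = 111, with the same codeword lengths as above.

#include <stdio.h>

#define NSYM 4

int left[2*NSYM], right[2*NSYM];
double w[2*NSYM];
int used[2*NSYM];

static int extract_min(int m) {                /* smallest-weight unused node among 0..m-1 */
    int best = -1;
    for (int i = 0; i < m; i++)
        if (!used[i] && (best < 0 || w[i] < w[best])) best = i;
    used[best] = 1;
    return best;
}

static void print_codes(int v, char *buf, int depth, const char *names) {
    if (left[v] < 0) { buf[depth] = '\0'; printf("%c = %s\n", names[v], buf); return; }
    buf[depth] = '0'; print_codes(left[v],  buf, depth + 1, names);
    buf[depth] = '1'; print_codes(right[v], buf, depth + 1, names);
}

int main(void) {
    const char *names = "abcd";
    double p[NSYM] = {0.1, 0.2, 0.2, 0.5};
    int m = NSYM;
    for (int i = 0; i < NSYM; i++) { w[i] = p[i]; left[i] = right[i] = -1; }
    for (int step = 0; step < NSYM - 1; step++) {      /* n-1 merges build the tree */
        int x = extract_min(m), y = extract_min(m);
        w[m] = w[x] + w[y]; left[m] = x; right[m] = y; m++;
    }
    char buf[NSYM + 1];
    print_codes(m - 1, buf, 0, names);                 /* prints d = 0, c = 10, a = 110, b = 111 */
    return 0;
}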

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
Examples (with the code above):
   encode:  abc…  →  000 001 01 …  =  00000101…
   decode:  101001…  →  1 = d, 01 = c, 001 = b  →  dcb…
A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. Its self-information is
   − log2(.999) ≈ .00144 bits
If we were to send 1000 such symbols we might hope to use 1000 · .00144 ≈ 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
[Figure: the word-based Huffman tree for T = “bzip or not bzip” (symbols: bzip, or, not, and the space separator); each codeword is a sequence of bytes, with 7 bits of payload per byte plus a tag bit marking the first byte of a codeword; C(T) is the concatenation of the byte-aligned codewords.]
CGrep and other ideas...
P = bzip = 1a 0b
[Figure: the codeword of P is searched with GREP directly in C(T), T = “bzip or not bzip”; candidate alignments inside other codewords are rejected (no) and only the true occurrences are accepted (yes), thanks to the tag bits.]
Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary = { bzip, not, or } (plus the space separator);   P = bzip = 1a 0b;   S = “bzip or not bzip”
[Figure: P’s codeword is searched directly in C(S); candidate occurrences are marked yes/no.]
Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

   H(s) = Σ_{i=1..m} 2^(m−i) · s[i]

P = 0101  →  H(P) = 2^3·0 + 2^2·1 + 2^1·0 + 2^0·1 = 5
s = s’ if and only if H(s) = H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

   H(T_r) = 2·H(T_{r−1}) − 2^m · T[r−1] + T[r+m−1]

T = 10110101, m = 4:
   T_1 = 1011,  T_2 = 0110
   H(T_1) = H(1011) = 11
   H(T_2) = 2·11 − 2^4·1 + 0 = 22 − 16 = 6 = H(0110) ✓

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P = 101111,   q = 7
H(P) = 47;   Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally, reducing mod 7 at every step:
   1
   1·2 + 0 = 2
   2·2 + 1 = 5
   5·2 + 1 = 11 ≡ 4 (mod 7)
   4·2 + 1 = 9  ≡ 2 (mod 7)
   2·2 + 1 = 5 (mod 7) = Hq(P)

We can still compute Hq(Tr) from Hq(Tr−1), since 2^m (mod q) = 2 · (2^(m−1) (mod q)) (mod q).
Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
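A minimal C sketch of the scan, on the binary example of the previous slides; the prime q = 101 is fixed here only for illustration (the algorithm picks it at random below some bound I), and every fingerprint hit is verified, as in the deterministic variant.

#include <stdio.h>
#include <string.h>

int main(void) {
    const char *T = "10110101", *P = "0101";
    int n = (int)strlen(T), m = (int)strlen(P), q = 101;

    long hp = 0, ht = 0, p2m = 1;                 /* p2m = 2^(m-1) mod q */
    for (int i = 0; i < m - 1; i++) p2m = (p2m * 2) % q;
    for (int i = 0; i < m; i++) {                 /* fingerprints of P and of T_1 */
        hp = (2 * hp + (P[i] - '0')) % q;
        ht = (2 * ht + (T[i] - '0')) % q;
    }
    for (int r = 0; r + m <= n; r++) {
        if (hp == ht && strncmp(T + r, P, m) == 0)        /* verification rules out false matches */
            printf("occurrence at position %d\n", r + 1);
        if (r + m < n)                                     /* roll the fingerprint to the next window */
            ht = (2 * (ht - (T[r] - '0') * p2m % q + q) + (T[r + m] - '0')) % q;
    }
    return 0;
}

On T = 10110101 and P = 0101 it reports the occurrence at position 5, as in the example above.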

Problem 1: Solution
Dictionary = { bzip, not, or };   P = bzip = 1a 0b;   S = “bzip or not bzip”
[Figure: P is encoded with the dictionary and its codeword 1a 0b is searched directly in C(S) with an exact pattern-matching algorithm; candidate occurrences are marked yes/no.]
Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

The m × n matrix M for this example is:

            c  a  l  i  f  o  r  n  i  a
       j =  1  2  3  4  5  6  7  8  9  10
f (i = 1)   0  0  0  0  1  0  0  0  0  0
o (i = 2)   0  0  0  0  0  1  0  0  0  0
r (i = 3)   0  0  0  0  0  0  1  0  0  0

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit bit-parallelism to compute the j-th column of M from the (j−1)-th one.
We define the m-length binary vector U(x) for each character x of the alphabet: U(x) is set to 1 in the positions where x appears in P.
Example: P = abaac
   U(a) = (1,0,1,1,0)^T     U(b) = (0,1,0,0,0)^T     U(c) = (0,0,0,0,1)^T
How to construct M



Initialize column 0 of M to all zeros.
For j > 0, the j-th column is obtained as
   M(j) = BitShift( M(j−1) ) & U( T[j] )
Indeed, for i > 1, entry M(i,j) = 1 iff
 (1) the first i−1 characters of P match the i−1 characters of T ending at position j−1, i.e. M(i−1, j−1) = 1, and
 (2) P[i] = T[j], i.e. the i-th bit of U(T[j]) is 1.
BitShift moves bit M(i−1, j−1) into the i-th position; ANDing it with the i-th bit of U(T[j]) establishes whether both conditions hold.
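A minimal C sketch of the scan for m ≤ w, with one 64-bit word per column; bit i−1 of the word plays the role of row i of M.

#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void) {
    const char *P = "abaac", *T = "xabxabaaca";
    int m = (int)strlen(P), n = (int)strlen(T);

    uint64_t U[256] = {0};
    for (int i = 0; i < m; i++)                     /* U(x): 1 in the positions where x occurs in P */
        U[(unsigned char)P[i]] |= 1ULL << i;

    uint64_t M = 0;                                 /* the current column of the matrix */
    for (int j = 0; j < n; j++) {
        M = ((M << 1) | 1ULL) & U[(unsigned char)T[j]];   /* BitShift(M) & U(T[j]) */
        if (M & (1ULL << (m - 1)))                  /* row m set: P ends at position j+1 */
            printf("occurrence ending at %d\n", j + 1);
    }
    return 0;
}

On the example of the next slides (P = abaac, T = xabxabaaca) it prints the occurrence ending at position 9.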

An example: P = abaac, T = xabxabaaca

j = 1:  M(1) = BitShift(M(0)) & U(T[1]=x) = (1,0,0,0,0) & (0,0,0,0,0) = (0,0,0,0,0)
j = 2:  M(2) = BitShift(M(1)) & U(T[2]=a) = (1,0,0,0,0) & (1,0,1,1,0) = (1,0,0,0,0)
j = 3:  M(3) = BitShift(M(2)) & U(T[3]=b) = (1,1,0,0,0) & (0,1,0,0,0) = (0,1,0,0,0)

An example, j = 9. The columns computed so far (P = abaac, T = xabxabaaca) are:

            x  a  b  x  a  b  a  a  c
       j =  1  2  3  4  5  6  7  8  9
a (i = 1)   0  1  0  0  1  0  1  1  0
b (i = 2)   0  0  1  0  0  1  0  0  0
a (i = 3)   0  0  0  0  0  0  1  0  0
a (i = 4)   0  0  0  0  0  0  0  1  0
c (i = 5)   0  0  0  0  0  0  0  0  1

M(9) = BitShift(M(8)) & U(T[9]=c) = (1,1,0,0,1) & (0,0,0,0,1) = (0,0,0,0,1): since M(5,9) = 1, the pattern P occurs in T ending at position 9.

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
   U(a) = (1,0,1,1,0)^T     U(b) = (1,1,0,0,0)^T     U(c) = (0,0,0,0,1)^T

What about ‘?’, ‘[^…]’ (not)?

Problem 1: Another solution
Dictionary = { bzip, not, or };   P = bzip = 1a 0b;   S = “bzip or not bzip”
[Figure: P’s codeword is again searched directly in C(S), this time via the Shift-And scan just described; candidate occurrences are marked yes/no.]
Speed ≈ Compression ratio

Problem 2
Dictionary = { bzip, not, or };   S = “bzip or not bzip”
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring.
Example: P = o; the matching terms are “or” = 1g 0a 0b and “not” = 1g 0g 0a.
[Figure: the word-based Huffman tree and C(S), with the occurrences of the matching terms marked yes.]
Speed ≈ Compression ratio?  No! Why?  Because it requires a scan of C(S) for each term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary = { bzip, not, or };   S = “bzip or not bzip”
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring, allowing at most k mismatches.
Example: P = bot, k = 2.
[Figure: the word-based Huffman tree and C(S), as in the previous problems.]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff there are no more than l mismatches between the first i characters of P and the i characters of T ending at character j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1
The first i−1 characters of P match a substring of T ending at j−1 with at most l mismatches, and the next pair of characters in P and T are equal:
   BitShift( M^l(j−1) ) & U( T[j] )

Computing Ml: case 2
The first i−1 characters of P match a substring of T ending at j−1 with at most l−1 mismatches (position i is charged one more mismatch, whatever T[j] is):
   BitShift( M^(l−1)(j−1) )

Computing Ml
We compute Ml for all l = 0, …, k; for each j we compute M(j), M1(j), …, Mk(j), initializing every Ml(0) to the zero vector. Combining the two cases:

   M^l(j) = [ BitShift( M^l(j−1) ) & U(T[j]) ]  OR  BitShift( M^(l−1)(j−1) )
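A minimal C sketch of the k-mismatch scan, keeping one word per error level and applying the update above; here it is run directly on the plain text with P = bot and k = 2 (the compressed-dictionary setting of Problem 3 is left aside), and it reports, for each ending position, the smallest level l at which a match is found.

#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define K 2

int main(void) {
    const char *P = "bot", *T = "bzip or not bzip";
    int m = (int)strlen(P), n = (int)strlen(T);

    uint64_t U[256] = {0};
    for (int i = 0; i < m; i++) U[(unsigned char)P[i]] |= 1ULL << i;

    uint64_t M[K + 1] = {0}, old[K + 1];
    for (int j = 0; j < n; j++) {
        memcpy(old, M, sizeof(M));                   /* columns at position j-1, for every level */
        M[0] = ((old[0] << 1) | 1ULL) & U[(unsigned char)T[j]];
        for (int l = 1; l <= K; l++)
            M[l] = ( ((old[l] << 1) | 1ULL) & U[(unsigned char)T[j]] )   /* case 1: characters equal  */
                 | (  (old[l - 1] << 1) | 1ULL );                        /* case 2: one more mismatch */
        for (int l = 0; l <= K; l++)
            if (M[l] & (1ULL << (m - 1))) {          /* row m set at level l */
                printf("match with <= %d mismatches ending at %d\n", l, j + 1);
                break;                               /* report only the smallest such l */
            }
    }
    return 0;
}

Among its outputs is the alignment of bot against "not" (1 mismatch) ending at position 11.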

Example (k = 1): T = xabxabaaca, P = abaad

M1 =        x  a  b  x  a  b  a  a  c  a
       j =  1  2  3  4  5  6  7  8  9  10
a (i = 1)   1  1  1  1  1  1  1  1  1  1
b (i = 2)   0  0  1  0  0  1  0  1  1  0
a (i = 3)   0  0  0  1  0  0  1  0  0  1
a (i = 4)   0  0  0  0  1  0  0  1  0  0
d (i = 5)   0  0  0  0  0  0  0  0  1  0

M0 =        x  a  b  x  a  b  a  a  c  a
       j =  1  2  3  4  5  6  7  8  9  10
a (i = 1)   0  1  0  0  1  0  1  1  0  1
b (i = 2)   0  0  1  0  0  1  0  0  0  0
a (i = 3)   0  0  0  0  0  0  1  0  0  0
a (i = 4)   0  0  0  0  0  0  0  1  0  0
d (i = 5)   0  0  0  0  0  0  0  0  0  0

M1(5,9) = 1: P occurs with at most 1 mismatch ending at position 9 (abaac vs abaad).
How much do we pay?





The running time is O(k · n · (1 + m/w)).
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary = { bzip, not, or };   P = bot, k = 2;   S = “bzip or not bzip”
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring, allowing k mismatches.
Here the dictionary term “not” (codeword 1g 0g 0a) matches P with 1 mismatch, and its occurrence in S is reported (yes).
[Figure: the word-based Huffman tree and C(S), with the matching occurrences marked.]

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
   g(x) = 000…0 (Length−1 zeroes) followed by x in binary,
   where x > 0 and Length = ⌊log2 x⌋ + 1
   e.g., 9 is represented as <000, 1001>.

g-code for x takes 2⌊log2 x⌋ + 1 bits (i.e. a factor of 2 from optimal).
Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers.

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

Answer: 8, 6, 3, 59, 7
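A minimal C sketch of g-encoding and of the decoder used to solve the exercise above; buffer sizes and names are illustrative.

#include <stdio.h>

static void gamma_encode(unsigned x, char *out) {     /* assumes x > 0 */
    int len = 0;
    for (unsigned t = x; t > 1; t >>= 1) len++;        /* len = floor(log2 x) */
    int pos = 0;
    for (int i = 0; i < len; i++) out[pos++] = '0';    /* len zeroes */
    for (int i = len; i >= 0; i--)                     /* x in binary, len+1 bits */
        out[pos++] = ((x >> i) & 1) ? '1' : '0';
    out[pos] = '\0';
}

int main(void) {
    char buf[80];
    gamma_encode(9, buf);
    printf("g(9) = %s\n", buf);                        /* 0001001, i.e. <000,1001> */

    const char *s = "0001000001100110000011101100111"; /* the exercise's bit string */
    for (int i = 0; s[i] != '\0'; ) {
        int len = 0;
        while (s[i] == '0') { len++; i++; }            /* count the leading zeroes */
        unsigned x = 0;
        for (int k = 0; k <= len; k++, i++)            /* read len+1 bits of binary(x) */
            x = 2 * x + (unsigned)(s[i] - '0');
        printf("%u ", x);                              /* prints 8 6 3 59 7 */
    }
    printf("\n");
    return 0;
}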

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2·log i + 1
How good is this approach wrt Huffman?  Compression ratio ≤ 2·H0(s) + 1
Key fact:   1 ≥ Σ_{j=1,...,i} pj ≥ i · pi   ⟹   i ≤ 1/pi

How good is it ?
Encode the integers via g-coding: |g(i)| ≤ 2·log i + 1.
The cost of the encoding is (recall i ≤ 1/pi):
   Σ_{i=1,...,|S|} pi · |g(i)|  ≤  Σ_{i=1,...,|S|} pi · [ 2·log(1/pi) + 1 ]  =  2·H0(X) + 1
Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1n 2n 3n… nn  Huff = O(n2 log n), MTF = O(n log n) + n2

No much worse than Huffman
...but it may be far better
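A minimal C sketch of the two steps above, kept as a plain array; the initial list [i,m,p,s] and the input string are taken from the Bzip encoding example of the BWT lecture, so the printed digits match the beginning of that Mtf sequence.

#include <stdio.h>
#include <string.h>

int main(void) {
    char list[] = {'i', 'm', 'p', 's'};            /* the MTF list L */
    const char *S = "ipppssssssmmmii";             /* a prefix of L = ipppssssssmmmii#... */
    for (int k = 0; S[k] != '\0'; k++) {
        int pos = 0;
        while (list[pos] != S[k]) pos++;           /* 1) output the position of S[k] in L */
        printf("%d", pos);
        memmove(list + 1, list, pos);              /* 2) move S[k] to the front of L */
        list[0] = S[k];
    }
    printf("\n");                                  /* prints 020030000030030 */
    return 0;
}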

MTF: how good is it ?
Encode the integers via g-coding:  |g(i)| ≤ 2·log i + 1
Put the alphabet S in front of the sequence and consider the cost of encoding: if symbol x occurs nx times, its MTF positions are bounded by the gaps between its consecutive occurrences p1^x < p2^x < …, so the cost is at most

   O(|S| log |S|) + Σ_{x∈S} Σ_{i=2..nx} | g( pi^x − p(i−1)^x ) |

By Jensen’s inequality (for each x the gaps sum to at most N):

   ≤ O(|S| log |S|) + Σ_{x∈S} nx · [ 2·log(N/nx) + 1 ]  =  O(|S| log |S|) + N · [ 2·H0(X) + 1 ]

Hence La[mtf] ≤ 2·H0(X) + O(1).

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




 Exploits spatial locality, and it is a dynamic code (there is a memory)
 Example: X = 1^n 2^n 3^n … n^n  →  Huff(X) = Θ(n^2 log n)  >  Rle(X) = Θ(n log n)
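A minimal C sketch of run-length encoding on the example string above.

#include <stdio.h>

int main(void) {
    const char *S = "abbbaacccca";
    for (int i = 0; S[i] != '\0'; ) {
        char c = S[i];
        int run = 0;
        while (S[i] == c) { run++; i++; }      /* length of the current run */
        printf("(%c,%d)", c, run);
    }
    printf("\n");                              /* prints (a,1)(b,3)(a,2)(c,4)(a,1) */
    return 0;
}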

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.  p(a) = .2, p(b) = .5, p(c) = .3
      f(a) = .0, f(b) = .2, f(c) = .7        where  f(i) = Σ_{j<i} p(j)
so:   a → [0, .2),   b → [.2, .7),   c → [.7, 1.0)

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7)).

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
   start:    [0, 1)
   after b:  [.2, .7)
   after a:  [.2, .3)
   after c:  [.27, .3)
The final sequence interval is [.27, .3)

Arithmetic Coding
To code a sequence of symbols c1 c2 … cn with probabilities p[c], use the following:
   l0 = 0,   s0 = 1
   li = l(i−1) + s(i−1) · f[ci]
   si = s(i−1) · p[ci]
f[c] is the cumulative probability up to symbol c (not included).
The final interval size is  sn = Π_{i=1..n} p[ci].
The interval for a message sequence will be called the sequence interval.
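A minimal floating-point C sketch of this interval update, applied to the message bac of the previous example; real coders use the integer, renormalized version of the later slides.

#include <stdio.h>
#include <string.h>

int main(void) {
    const char  *alph = "abc";
    const double p[]  = {0.2, 0.5, 0.3};
    const double f[]  = {0.0, 0.2, 0.7};         /* cumulative probabilities */

    const char *msg = "bac";
    double l = 0.0, s = 1.0;                     /* l0 = 0, s0 = 1 */
    for (int i = 0; msg[i] != '\0'; i++) {
        int c = (int)(strchr(alph, msg[i]) - alph);   /* index of the symbol */
        l = l + s * f[c];                        /* li = l(i-1) + s(i-1) * f[ci] */
        s = s * p[c];                            /* si = s(i-1) * p[ci] */
        printf("after %c: [%.4f, %.4f)\n", msg[i], l, l + s);
    }
    return 0;                                    /* ends with [.2700, .3000), as above */
}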

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:
   .49 falls in [.2, .7)  → first symbol is b;  rescale: (.49 − .2)/.5 = .58
   .58 falls in [.2, .7)  → second symbol is b; rescale: (.58 − .2)/.5 = .76
   .76 falls in [.7, 1.0) → third symbol is c
The message is bbc.

Representing a real number
Binary fractional representation:
   .75 = .11      1/3 = .0101…      11/16 = .1011

Algorithm (emit the bits of x ∈ [0,1)):
   1. x = 2·x
   2. if x < 1 output 0
   3. else x = x − 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
   e.g.  [0, .33) → .01      [.33, .66) → .1      [.66, 1) → .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
   code    min        max        interval
   .11     .110…      .111…      [.75, 1.0)
   .101    .1010…     .1011…     [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most
   1 + ⌈ log2(1/s) ⌉  =  1 + ⌈ log2 Π_{i=1,n} (1/p_i) ⌉
   ≤ 2 + Σ_{i=1,n} log2 (1/p_i)
   = 2 + Σ_{k=1,|S|} n·pk · log2 (1/pk)
   = 2 + n·H0   bits

In practice it is nH0 + 0.02·n bits, because of rounding.

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
Context
Empty

Counts
A
B
C
$

=
=
=
=

4
2
5
3

Context
A
B
C

Counts
C
$
A
$
A
B
C
$

=
=
=
=
=
=
=
=

3
1
2
1
1
2
2
3

Context
AC

BA
CA
CB
CC

String = ACCBACCACBA B

k=2

Counts
B
C
$
C
$
C
$
A
$
A
B
$

=
=
=
=
=
=
=
=
=
=
=
=

1
2
2
1
1
1
1
2
1
1
1
2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Consider the text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows                                       (1994)

F                 L
#  mississipp  i
i  #mississip  p
i  ppi#missis  s
i  ssippi#mis  s
i  ssissippi#  m
m  ississippi  #
p  i#mississi  p
p  pi#mississ  i
s  ippi#missi  s
s  issippi#mi  s
s  sippi#miss  i
s  sissippi#m  i

(F = first column of the sorted matrix, L = last column; the unsorted matrix above is built from T.)

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
  Compute LF[0,n-1];
  r = 0; i = n;
  while (i > 0) {
    T[i] = L[r];
    r = LF[r]; i--;
  }
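A runnable sketch of the same backward reconstruction in Python, assuming '#' is the unique end-marker of T (invert_bwt is an illustrative name):

    def invert_bwt(L, eos="#"):
        n = len(L)
        F = sorted(L)                              # first column = sorted last column
        first, seen, LF = {}, {}, []
        for i, c in enumerate(F):
            first.setdefault(c, i)
        for c in L:
            LF.append(first[c] + seen.get(c, 0))   # equal chars keep their relative order
            seen[c] = seen.get(c, 0) + 1
        t, r = [], L.index(eos)                    # start from the row that is T itself (it ends with '#')
        for _ in range(n):
            t.append(L[r])                         # L[r] precedes F[r] in T: emit T backward
            r = LF[r]
        return "".join(reversed(t))

    # invert_bwt("ipssm#pissii") -> "mississippi#"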

How to compute the BWT ?

SA    BWT matrix (sorted rotations, last char = L)
12    #mississipp  i
11    i#mississip  p
 8    ippi#missis  s
 5    issippi#mis  s
 2    ississippi#  m
 1    mississippi  #
10    pi#mississi  p
 9    ppi#mississ  i
 7    sippi#missi  s
 4    sissippi#mi  s
 6    ssippi#miss  i
 3    ssissippi#m  i

We said that: L[i] precedes F[i] in T
e.g. L[3] = T[ SA[3] - 1 ] = T[7]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?

SA
12   #
11   i#
 8   ippi#
 5   issippi#
 2   ississippi#
 1   mississippi#
10   pi#
 9   ppi#
 7   sippi#
 4   sissippi#
 6   ssippi#
 3   ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults
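The "elegant but inefficient" construction, as a runnable Python sketch (sorting the suffix starting positions by direct string comparison; bwt_via_suffix_array is an illustrative name):

    def bwt_via_suffix_array(T):
        # 1-indexed starting positions, sorted by comparing the suffixes themselves
        SA = sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])
        # L[i] = T[SA[i]-1]; Python's negative index makes position 1 wrap to the final '#'
        L = "".join(T[i - 2] for i in SA)
        return SA, L

    # bwt_via_suffix_array("mississippi#")
    # -> ([12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3], "ipssm#pissii")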

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999 / WebBase crawl, 2001
Indegree follows a power-law distribution:

    Pr[ in-degree(u) = k ]  ∝  1 / k^a,   a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1, s2, ..., sk} is encoded by gaps: s1 - x, s2 - s1 - 1, ..., sk - s(k-1) - 1

For negative entries (only the first gap can be negative): map v ≥ 0 to 2v and v < 0 to 2|v| - 1 before coding (see the residual examples below).
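A minimal sketch of this gap transformation for one successor list (not the WebGraph API; gaps is an illustrative name), using the signed mapping above:

    def gaps(x, successors):
        # successors must be sorted increasingly; the first gap is taken w.r.t. the source node x
        out, prev = [], None
        for s in successors:
            if prev is None:
                v = s - x
                out.append(2 * v if v >= 0 else 2 * abs(v) - 1)   # only this entry can be negative
            else:
                out.append(s - prev - 1)                          # later gaps are >= 0 by sortedness
            prev = s
        return out

    # gaps(15, [13, 15, 16, 17, 18, 19]) -> [3, 1, 0, 0, 0, 0]
    # (3 = |13-15|*2 - 1, matching the "negative" residual example below)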

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides and efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
            Emacs size    Emacs time
uncompr     27 Mb         ---
gzip        8 Mb          35 secs
zdelta      1.5 Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

0

620

2

20

123

2000

20
220

3

1
5

space

time

uncompr

30Mb

---

tgz

20%

linear

THIS

8%

quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
            space     time
uncompr     260 Mb    ---
tgz         12%       2 mins
THIS        8%        16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
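A sketch of the block-matching core of an rsync-style delta (not rsync's actual wire format or checksums; weak_hash and delta are illustrative names; real rsync rolls the weak hash in O(1) per shift and confirms matches with MD5):

    def weak_hash(block, mod=1 << 16):
        # Adler/Fletcher-like checksum: a = sum of bytes, b = weighted sum
        a = sum(block) % mod
        b = sum((len(block) - i) * x for i, x in enumerate(block)) % mod
        return (b << 16) | a

    def delta(f_new, old_block_hashes, bsize):
        # old_block_hashes: weak_hash -> index of the block in f_old (sent by the client)
        out, i, lit = [], 0, bytearray()
        while i + bsize <= len(f_new):
            h = weak_hash(f_new[i:i + bsize])
            if h in old_block_hashes:
                if lit:
                    out.append(("LIT", bytes(lit))); lit = bytearray()
                out.append(("COPY", old_block_hashes[h]))   # reuse a block the client already has
                i += bsize
            else:
                lit.append(f_new[i]); i += 1                # unmatched byte goes out as a literal
        lit.extend(f_new[i:])
        if lit:
            out.append(("LIT", bytes(lit)))
        return out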

Rsync: some experiments

          gcc size    emacs size
total     27288       27326
gzip      7563        8577
zdelta    227         1431
rsync     964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
(Figure: the suffix tree of T# = mississippi#, with edge labels on the arcs and the leaves storing the starting positions 1-12 of the suffixes.)

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Explicitly storing SUF(T) takes Θ(N²) space; the suffix array SA keeps only the starting positions (suffix pointers):

SA    SUF(T)
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#

T = mississippi#        P = si

Suffix Array space:
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]
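To make the indirect binary search concrete, a runnable Python sketch over the 1-indexed SA used in these slides (sa_search is an illustrative name; each comparison costs O(p) chars, hence O(p log N) plus occ to report):

    def sa_search(T, SA, P):
        lo, hi = 0, len(SA)
        while lo < hi:                        # first suffix whose p-char prefix is >= P
            mid = (lo + hi) // 2
            if T[SA[mid] - 1:SA[mid] - 1 + len(P)] < P:
                lo = mid + 1
            else:
                hi = mid
        first = lo
        hi = len(SA)
        while lo < hi:                        # first suffix whose p-char prefix is > P
            mid = (lo + hi) // 2
            if T[SA[mid] - 1:SA[mid] - 1 + len(P)] <= P:
                lo = mid + 1
            else:
                hi = mid
        return SA[first:lo]                   # starting positions of all occurrences

    # sa_search("mississippi#", [12,11,8,5,2,1,10,9,7,4,6,3], "si") -> [7, 4]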

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp = 0 0 1 4 0 0 1 0 2 1 3
SA  = 12 11 8 5 2 1 10 9 7 4 6 3
T = mississippi#
(e.g. issippi# and ississippi# are adjacent in SA and share the prefix "issi", so their Lcp entry is 4)

• How long is the common prefix between T[i,...] and T[j,...] ?
  • Min of the subarray Lcp[h,k-1] s.t. SA[h] = i and SA[k] = j.
• Does there exist a repeated substring of length ≥ L ?
  • Search for an entry Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  • Search for a window Lcp[i, i+C-2] whose entries are all ≥ L


Slide 147

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
    C * p * f/(1+f)
This is at least 10^4 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and the algorithm uses all of them:
    (1/B) * (p * f/(1+f) * C)  ≈  30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
 n      4K    8K    16K    32K     128K    256K    512K    1M
 n^3    22s   3m    26m    3.5h    28h     --      --      --
 n^2    0     0     0      1s      26s     106s    7m      28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT
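The slide's scanning algorithm as runnable Python (it assumes, as the slide does, that every subsum is nonzero; max_subarray_sum is an illustrative name):

    def max_subarray_sum(A):
        best, running = -1, 0
        for x in A:
            if running + x <= 0:
                running = 0                 # the optimum cannot start inside a negative-sum prefix
            else:
                running += x
                best = max(best, running)
        return best

    # max_subarray_sum([2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]) -> 12   (the window 6 1 -2 4 3)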

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02   m = (i+j)/2;             // Divide
03   Merge-Sort(A,i,m);       // Conquer
04   Merge-Sort(A,m+1,j);     // Conquer
05   Merge(A,i,m,j)           // Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

(Figure: the recursion tree of mergesort over N items; it has log2 N levels of runs that are pairwise merged.)

If the run-size is larger than B (i.e. after the first step!!), fetching all of it in memory for merging does not help.

How do we deploy the disk/memory features?
Produce N/M runs of size M, each sorted in internal memory (no I/Os for sorting them).
 I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort

The key is to balance run-size and #runs to merge.
Sort N items with main-memory M and disk-pages B:
  Pass 1: produce N/M sorted runs.
  Pass i: merge X = M/B runs at a time  ⇒  log_{M/B}(N/M) merge passes

(Figure: X input buffers and one output buffer, each of B items, kept in main memory; runs are streamed from and to disk.)

Multiway Merging

Each of the X = M/B runs keeps its current page in a buffer Bf1 … BfX with a cursor p1 … pX. Repeatedly move min(Bf1[p1], Bf2[p2], …, BfX[pX]) to the output buffer Bfo; fetch the next page of run i when pi = B; flush Bfo to the merged output run when it is full; stop at EOF of all runs.
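The in-memory core of one merge pass, sketched with a min-heap (heapq); here each run is a Python list standing in for a sorted run on disk, and multiway_merge is an illustrative name:

    import heapq

    def multiway_merge(runs):
        heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
        heapq.heapify(heap)
        out = []
        while heap:
            x, i, j = heapq.heappop(heap)      # overall minimum among the X current heads
            out.append(x)                      # on disk: append to the output buffer, flush when full
            if j + 1 < len(runs[i]):
                heapq.heappush(heap, (runs[i][j + 1], i, j + 1))   # advance the cursor of run i
        return out

    # multiway_merge([[1, 5, 9], [2, 7, 13], [3, 4, 8]]) -> [1, 2, 3, 4, 5, 7, 8, 9, 13]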

Cost of Multi-way Merge-Sort

  Number of passes = log_{M/B} #runs  ≈  log_{M/B} (N/M)
  Optimal cost = Θ( (N/B) log_{M/B} (N/M) )  I/Os

In practice
  M/B ≈ 1000  ⇒  #passes = log_{M/B}(N/M) ≈ 1
  One multiway merge  ⇒  2 passes = few mins
  (Tuning depends on disk features)

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm
  Use a pair of variables <X, C>
  For each item s of the stream
     if (X == s) then C++
     else { C--;  if (C == 0) { X = s; C = 1; } }
  Return X;

Proof
(The answer can be wrong only if the mode occurs ≤ N/2 times.)
If X ≠ y at the end, then every one of y's occurrences has a distinct "negative" mate, hence the mates are ≥ #occ(y).
As a result 2 * #occ(y) > N items would be needed: a contradiction.
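The same streaming algorithm in its standard runnable form (Boyer-Moore majority vote; the counter-reset is folded into the C == 0 test; majority_candidate is an illustrative name):

    def majority_candidate(stream):
        X, C = None, 0
        for s in stream:
            if C == 0:
                X, C = s, 1          # adopt a new candidate when the counter is exhausted
            elif X == s:
                C += 1
            else:
                C -= 1
        return X                     # guaranteed correct only if some item occurs > N/2 times

    # majority_candidate("bacccdcbaaaccbccc") -> 'c'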

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix    (t = 500K terms, n = 1 million docs)

             Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony              1                1             0          0       0        1
Brutus              1                1             0          1       0        0
Caesar              1                1             0          1       1        1
Calpurnia           0                1             0          0       0        0
Cleopatra           1                0             0          0       0        0
mercy               1                0             1          1       1        1
worser              1                0             1          1       1        0

1 if the play contains the word, 0 otherwise.

Space is 500Gb !

Solution 2: Inverted index

Brutus     →  2 4 8 16 32 64 128
Calpurnia  →  1 2 3 5 8 13 21 34
Caesar     →  13 16

We can still do better: i.e. 30÷50% of the original text

1. Typically we use about 12 bytes per posting
2. We have 10^9 total terms  ⇒  at least 12Gb space
3. Compressing the 6Gb documents gets 1.5Gb data
Better index, but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO: the messages are 2^n, but the compressed strings shorter than n bits are only

    Σ_{i=1..n-1} 2^i  =  2^n - 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the self-information of s is:

    i(s) = log2 (1/p(s)) = - log2 p(s)

Lower probability  ⇒  higher information

Entropy is the weighted average of i(s):

    H(S) = Σ_{s∈S} p(s) · log2 (1/p(s))   bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword lengths L[s], the average length is defined as

    La(C) = Σ_{s∈S} p(s) · L[s]

We say that a prefix code C is optimal if for all prefix codes C’, La(C) ≤ La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any uniquely decodable code C, we have

    H(S) ≤ La(C)

Theorem (upper bound, Shannon). For any probability distribution, there exists a prefix code C such that

    La(C) ≤ H(S) + 1

(The Shannon code takes ⌈log2 1/p(s)⌉ bits per symbol.)

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

(Huffman tree: merge a(.1) + b(.2) into (.3); merge (.3) + c(.2) into (.5); merge (.5) + d(.5) into (1).)

a = 000, b = 001, c = 01, d = 1

There are 2^(n-1) "equivalent" Huffman trees.
What about ties (and thus, tree depth) ?
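A minimal Huffman construction sketch in Python with heapq (huffman_codes is an illustrative name; the integer tick only breaks ties so the heap never compares dictionaries):

    import heapq
    from itertools import count

    def huffman_codes(probs):
        tick = count()
        heap = [(p, next(tick), {s: ""}) for s, p in probs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            p1, _, c1 = heapq.heappop(heap)      # the two least-probable trees...
            p2, _, c2 = heapq.heappop(heap)
            merged = {s: "0" + w for s, w in c1.items()}
            merged.update({s: "1" + w for s, w in c2.items()})
            heapq.heappush(heap, (p1 + p2, next(tick), merged))   # ...are merged into one
        return heap[0][2]

    # huffman_codes({"a": .1, "b": .2, "c": .2, "d": .5})
    # -> codeword lengths 3, 3, 2, 1 as in the running example; the exact bits depend on tie-breaking.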

Encoding and Decoding
Encoding: emit the root-to-leaf path leading to the symbol to be encoded.
Decoding: start at the root and take the branch for each bit received. When at a leaf, output its symbol and return to the root.

    abc...     →  000 001 01 ...   (e.g. "abc" → 00000101)
    101001...  →  d c b ...

(Figure: the Huffman tree of the running example, with a(.1), b(.2), c(.2), d(.5).)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons
Strings are also numbers, H: strings → numbers.
Let s be a string of length m:

    H(s) = Σ_{i=1..m} 2^(m-i) · s[i]

P = 0101:  H(P) = 2^3·0 + 2^2·1 + 2^1·0 + 2^0·1 = 5
s = s' if and only if H(s) = H(s')

Definition:
let Tr denote the m-length substring of T starting at position r (i.e., Tr = T[r, r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons
We can compute H(Tr) from H(Tr-1):

    H(Tr) = 2·H(Tr-1) - 2^m·T(r-1) + T(r+m-1)

T = 10110101
T1 = 1011, T2 = 0110
H(T1) = H(1011) = 11
H(T2) = H(0110) = 2·11 - 2^4·1 + 0 = 22 - 16 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P = 101111, q = 7
H(P) = 47,  Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally (Horner's rule, reducing mod q at each step):
  1·2 (mod 7) + 0 = 2
  2·2 (mod 7) + 1 = 5
  5·2 (mod 7) + 1 = 4
  4·2 (mod 7) + 1 = 2
  2·2 (mod 7) + 1 = 5
  5 (mod 7) = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1):
  2^m (mod q) = 2·( 2^(m-1) (mod q) ) (mod q)
Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
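A runnable sketch of Karp-Rabin matching on a binary text with the modular rolling hash above (karp_rabin is an illustrative name; q is fixed here for the demo, while the real algorithm picks q as a random prime and that choice drives the error analysis):

    def karp_rabin(T, P, q=101):
        n, m = len(T), len(P)
        pow_m = pow(2, m - 1, q)                       # 2^(m-1) mod q
        hp = ht = 0
        for c in P:
            hp = (2 * hp + int(c)) % q
        for c in T[:m]:
            ht = (2 * ht + int(c)) % q
        occ = []
        for r in range(n - m + 1):
            if ht == hp and T[r:r + m] == P:           # verify, to rule out false matches
                occ.append(r + 1)                      # 1-indexed positions as in the slides
            if r + m < n:
                ht = ((ht - int(T[r]) * pow_m) * 2 + int(T[r + m])) % q
        return occ

    # karp_rabin("10110101", "0101") -> [5]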

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

        c a l i f o r n i a
    f   0 0 0 0 1 0 0 0 0 0
    o   0 0 0 0 0 1 0 0 0 0
    r   0 0 0 0 0 0 1 0 0 0

(M is an m×n binary matrix; a 1 in the last row at column j signals an occurrence of P ending at T[j].)

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M

We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one.
We define the m-length binary vector U(x) for each character x in the alphabet: U(x) is set to 1 for the positions in P where character x appears.

Example: P = abaac
    U(a) = 1 0 1 1 0
    U(b) = 0 1 0 0 0
    U(c) = 0 0 0 0 1

How to construct M

Initialize column 0 of M to all zeros.
For j > 0, the j-th column is obtained by

    M(j) = BitShift( M(j-1) ) & U( T[j] )

For i > 1, entry M(i,j) = 1 iff
  (1) the first i-1 characters of P match the i-1 characters of T ending at character j-1   ⇔  M(i-1, j-1) = 1
  (2) P[i] = T[j]   ⇔  the i-th bit of U(T[j]) = 1
BitShift moves bit M(i-1, j-1) into the i-th position; AND-ing with the i-th bit of U(T[j]) establishes whether both conditions hold.

An example: T = xabxabaaca, P = abaac

(Worked examples for j = 1, 2, 3 and 9: each new column M(j) is obtained as BitShift(M(j-1)) & U(T[j]); at j = 9 the last bit of the column becomes 1, signalling the occurrence of P ending at T[9], i.e. starting at position 5.)

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars.

Example: P = [a-b]baac
    U(a) = 1 0 1 1 0
    U(b) = 1 1 0 0 0
    U(c) = 0 0 0 0 1

What about ‘?’, ‘[^…]’ (not) ?

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

  S is the concatenation of the patterns in P
  R is a bitmap of length m: R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of the Shift-And method searching for S:
  For any symbol c, U’(c) = U(c) AND R, so U’(c)[i] = 1 iff S[i] = c and it is the first symbol of a pattern
  For any step j:
     compute M(j), then set M(j) = M(j) OR U’(T[j]). Why? To set to 1 the first bit of each pattern that starts with T[j]
     Check if there are occurrences ending in j. How? Look at the bits of M(j) in the positions where a pattern ends.

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1
The first i-1 characters of P match a substring of T ending at j-1, with at most l mismatches, and the next pair of characters in P and T are equal:

    BitShift( M^l(j-1) ) & U( T[j] )

Computing Ml: case 2
The first i-1 characters of P match a substring of T ending at j-1, with at most l-1 mismatches (the j-th pair is allowed to mismatch):

    BitShift( M^(l-1)(j-1) )

Computing Ml
We compute M^l for all l = 0, …, k; for each j we compute M(j), M^1(j), …, M^k(j); for all l we initialize M^l(0) to the zero vector. Combining the two cases:

    M^l(j) = [ BitShift( M^l(j-1) ) & U( T[j] ) ]  OR  BitShift( M^(l-1)(j-1) )

Example
T = xabxabaaca, P = abaad

(Figure: the matrices M^0 and M^1 for this pair; M^1 has a 1 in its last row at column 9, i.e. P occurs at position 5 with one mismatch: abaac vs abaad.)

How much do we pay?

  The running time is O( k·n·(1 + m/w) ).
  Again, the method is practically efficient for small m.
  Only O(k) columns of M are needed at any given time, hence the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Delection: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding

    g(x) = 0^(Length-1) · binary(x),   with x > 0 and Length = ⌊log2 x⌋ + 1

e.g., 9 is represented as <000, 1001>.

g-code for x takes 2⌊log2 x⌋ + 1 bits   (i.e. a factor of 2 from optimal)
Optimal for Pr(x) = 1/(2x²), and i.i.d. integers.

It is a prefix-free encoding…

Given the following sequence of g-coded integers, reconstruct the original sequence:

    0001000 00110 011 00000111011 00111   =   8, 6, 3, 59, 7
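A runnable sketch of the g-code over bit strings (gamma_encode/gamma_decode are illustrative names):

    def gamma_encode(x):
        b = bin(x)[2:]                      # binary(x), MSB first
        return "0" * (len(b) - 1) + b       # (Length-1) zeros, then the Length bits of x

    def gamma_decode(bits):
        out, i = [], 0
        while i < len(bits):
            z = 0
            while bits[i] == "0":           # count the leading zeros = Length-1
                z += 1; i += 1
            out.append(int(bits[i:i + z + 1], 2))
            i += z + 1
        return out

    # gamma_encode(9) -> "0001001"
    # gamma_decode("0001000001100110000011101100111") -> [8, 6, 3, 59, 7]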

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code g(i).
Recall that: |g(i)| ≤ 2·log i + 1
How good is this approach wrt Huffman?  Compression ratio ≤ 2·H0(S) + 1
Key fact:  1 ≥ Σ_{j=1..i} pj ≥ i·pi   ⇒   i ≤ 1/pi

How good is it ?
Encode the integers via g-coding: |g(i)| ≤ 2·log i + 1
The cost of the encoding is (recall i ≤ 1/pi):

    Σ_{i=1..|S|} pi · |g(i)|  ≤  Σ_{i=1..|S|} pi · [ 2·log(1/pi) + 1 ]  =  2·H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory.
Properties: it exploits temporal locality, and it is dynamic:

    X = 1^n 2^n 3^n … n^n   ⇒   Huff = O(n² log n) bits,  MTF = O(n log n) + n² bits

Not much worse than Huffman ...but it may be far better.
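A minimal MTF sketch over an explicit symbol list (mtf_encode/mtf_decode are illustrative names; L.index is O(|S|) per symbol, and the faster tree/hash structures discussed below remove this cost):

    def mtf_encode(text, alphabet):
        L = list(alphabet)                 # the MTF-list
        out = []
        for s in text:
            pos = L.index(s)               # 1) output the position of s in L
            out.append(pos)
            L.insert(0, L.pop(pos))        # 2) move s to the front of L
        return out

    def mtf_decode(codes, alphabet):
        L, out = list(alphabet), []
        for pos in codes:
            s = L[pos]
            out.append(s)
            L.insert(0, L.pop(pos))
        return "".join(out)

    # mtf_encode("mississippi", "imps") -> [1, 1, 3, 0, 1, 1, 0, 1, 3, 0, 1]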

MTF: how good is it ?
Encode the integers via g-coding: |g(i)| ≤ 2·log i + 1
Put S in front and consider the cost of encoding (p_i^x = position of the i-th occurrence of symbol x, n_x = #occurrences of x):

    O(|S| log |S|)  +  Σ_{x=1..|S|}  Σ_{i=2..n_x}  |g( p_i^x - p_(i-1)^x )|

By Jensen’s inequality:

    ≤  O(|S| log |S|)  +  Σ_{x=1..|S|}  n_x · [ 2·log(N/n_x) + 1 ]
    ≤  O(|S| log |S|)  +  N · [ 2·H0(X) + 1 ]

    La[mtf]  ≤  2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties: it exploits spatial locality, and it is a dynamic code (there is a memory):

    X = 1^n 2^n 3^n … n^n   ⇒   Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)
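A one-screen RLE sketch (rle_encode is an illustrative name):

    def rle_encode(s):
        out, i = [], 0
        while i < len(s):
            j = i
            while j < len(s) and s[j] == s[i]:
                j += 1
            out.append((s[i], j - i))       # (symbol, run length)
            i = j
        return out

    # rle_encode("abbbaacccca") -> [('a',1), ('b',3), ('a',2), ('c',4), ('a',1)]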

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive):

    f(i) = Σ_{j=1..i-1} p(j)

e.g.  a = .2, b = .5, c = .3   ⇒   f(a) = .0, f(b) = .2, f(c) = .7
(a gets [0, .2), b gets [.2, .7), c gets [.7, 1.0))

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2, .7)).

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
  after b:  [.2, .7)
  after ba: [.2, .3)
  after bac: [.27, .3)
The final sequence interval is [.27, .3)

Arithmetic Coding
To code a sequence of symbols c1 c2 … cn with probabilities p[c], use the following:

    l0 = 0,   li = l(i-1) + s(i-1) · f[ci]
    s0 = 1,   si = s(i-1) · p[ci]

f[c] is the cumulative prob. up to symbol c (not included).
The final interval size is

    sn = Π_{i=1..n} p[ci]

The interval for a message sequence will be called the sequence interval.
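The interval computation as a tiny runnable sketch (floating point only for illustration; the robust integer/scaling version is discussed further below; sequence_interval is an illustrative name):

    def sequence_interval(msg, p, f):
        l, s = 0.0, 1.0
        for c in msg:
            l = l + s * f[c]       # l_i = l_{i-1} + s_{i-1} * f[c_i]
            s = s * p[c]           # s_i = s_{i-1} * p[c_i]
        return l, s                # the sequence interval is [l, l+s)

    # p = {"a": .2, "b": .5, "c": .3};  f = {"a": .0, "b": .2, "c": .7}
    # sequence_interval("bac", p, f) -> approximately (0.27, 0.03), i.e. [.27, .30)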

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:
  .49 ∈ [.2, .7)   → b ;  rescale: (.49 - .2)/.5 = .58
  .58 ∈ [.2, .7)   → b ;  rescale: (.58 - .2)/.5 = .76
  .76 ∈ [.7, 1.0)  → c
The message is bbc.
Representing a real number
Binary fractional representation:
    .75 = .11      1/3 = .010101…      11/16 = .1011

Algorithm
  1. x = 2*x
  2. If x < 1 output 0
  3. else x = x - 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
e.g.  [0, .33) → .01      [.33, .66) → .1      [.66, 1) → .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
m in

m ax

interval

.11

.110

.111

[.75 ,1.0 )

.101

.1010

.1011

[.625 , .75 )

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

    1 + ⌈log2 (1/s)⌉  =  1 + ⌈log2 Π_i (1/pi)⌉
                      ≤  2 + Σ_{i=1..n} log2 (1/pi)
                      =  2 + Σ_{k=1..|S|} n·pk · log2 (1/pk)
                      =  2 + n·H0   bits

In practice ≈ n·H0 + 0.02·n bits, because of rounding.

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine: the ATB maps the current interval (L, s) plus the pair <next symbol c, distribution (p1, …, p|S|)> into the new interval (L', s').
Therefore, even the distribution can change over time.

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
Feed the ATB with s = c or esc and the conditional distribution p[ s | context ]; the ATB maps (L, s) into (L', s').
Encoder and Decoder must know the protocol for selecting the same conditional probability distribution (PPM-variant).

PPM: Example Contexts
Context
Empty

Counts
A
B
C
$

=
=
=
=

4
2
5
3

Context
A
B
C

Counts
C
$
A
$
A
B
C
$

=
=
=
=
=
=
=
=

3
1
2
1
1
2
2
3

Context
AC

BA
CA
CB
CC

String = ACCBACCACBA B

k=2

Counts
B
C
$
C
$
C
$
A
$
A
B
$

=
=
=
=
=
=
=
=
=
=
=
=

1
2
2
1
1
1
1
2
1
1
1
2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

Burrows and Wheeler (1994)

F                 L
# mississipp      i
i #mississip      p
i ppi#missis      s
i ssippi#mis      s
i ssissippi#      m
m ississippi      #
p i#mississi      p
p pi#mississ      i
s ippi#missi      s
s issippi#mi      s
s sippi#miss      i
s sissippi#m      i

(L is the output of the BWT of T)

A famous example (a much longer text) [figure omitted]

A useful tool: the L → F mapping

[same sorted matrix as above: F = # i i i i m p p s s s s,
 L = i p s s m # p i s s i i; the text T is unknown to the decoder]

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
[same sorted matrix: F = # i i i i m p p s s s s, L = i p s s m # p i s s i i;
 the text T is unknown to the decoder]

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
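The pseudocode takes the LF-array for granted; a sketch in C of how LF can be computed from L alone, by counting (hypothetical array names): LF[i] is the number of characters in L smaller than L[i], plus the number of occurrences of L[i] in L[0,i-1].

    /* Compute the LF mapping from the BWT column L[0,n-1]:
       LF[i] = (# chars in L smaller than L[i])
             + (# occurrences of L[i] in L[0,i-1]).
       This is the position in F of the character L[i]. */
    void compute_LF(const unsigned char *L, int n, int *LF) {
        int count[256] = {0}, C[256];
        for (int i = 0; i < n; i++) count[L[i]]++;
        C[0] = 0;
        for (int c = 1; c < 256; c++) C[c] = C[c - 1] + count[c - 1];
        int seen[256] = {0};
        for (int i = 0; i < n; i++) {
            LF[i] = C[L[i]] + seen[L[i]];
            seen[L[i]]++;
        }
    }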

How to compute the BWT ?
SA    BWT matrix (sorted rotations)    L
12    #mississipp                      i
11    i#mississip                      p
 8    ippi#missis                      s
 5    issippi#mis                      s
 2    ississippi#                      m
 1    mississippi                      #
10    pi#mississi                      p
 9    ppi#mississ                      i
 7    sippi#missi                      s
 4    sissippi#mi                      s
 6    ssippi#miss                      i
 3    ssissippi#m                      i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]
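In 0-based C this is a one-liner (sketch; the wrap-around case SA[i] = 0 corresponds to the rotation starting with the whole text, whose preceding character is the final '#'):

    /* Given the text T[0,n-1] (ending with '#') and its suffix array SA[0,n-1],
       the BWT is the character preceding each suffix:  L[i] = T[SA[i]-1],
       wrapping around when SA[i] = 0. */
    void bwt_from_sa(const char *T, const int *SA, int n, char *L) {
        for (int i = 0; i < n; i++)
            L[i] = (SA[i] > 0) ? T[SA[i] - 1] : T[n - 1];
    }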

How to construct SA from T ?
SA    suffix
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n2 log n) time in the worst case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)
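A toy sketch of the Move-to-Front step alone (C, tiny hard-coded alphabet; the real pipeline works on bytes and then applies RLE0 and the statistical coder):

    #include <stdio.h>
    #include <string.h>

    /* Move-to-Front: for each char of L output its current position in the
       list, then move it to the front.  The bzip pipeline then run-length
       and entropy codes these integers. */
    void mtf_encode(const char *L, char *alphabet) {
        for (const char *p = L; *p; p++) {
            int pos = 0;
            while (alphabet[pos] != *p) pos++;      /* rank of *p in the list */
            printf("%d", pos);
            for (int j = pos; j > 0; j--)           /* move *p to the front */
                alphabet[j] = alphabet[j - 1];
            alphabet[0] = *p;
        }
        printf("\n");
    }

    int main(void) {
        char list[] = "ab";
        mtf_encode("aaabbbab", list);               /* prints 00010011 */
        return 0;
    }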

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion pages available (Google, 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)

  Set of nodes such that from any node one can reach any other node via an undirected path.

Strongly connected components (SCC)

  Set of nodes such that from any node one can reach any other node via a directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by humans

Exploit the structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…

  Physical network graph
    V = routers, E = communication links

  The “cosine” graph (undirected, weighted)
    V = static web pages, E = semantic distance between pages

  Query-Log graph (bipartite, weighted)
    V = queries and URLs, E = (q,u) if u is a result for q and has been
    clicked by some user who issued q

  Social graph (undirected, unweighted)
    V = users, E = (x,y) if x knows y (facebook, address book, email, ..)

Definition
Directed graph G = (V,E)

  V = URLs, E = (u,v) if u has a hyperlink to v

  Isolated URLs are ignored (no IN & no OUT)

Three key properties:

  Skewed distribution: the probability that a node has x links is ∝ 1/xα, with α ≈ 2.1

The In-degree distribution

Altavista crawl (1999) and WebBase crawl (2001): the indegree follows a power-law distribution

  Pr[ in-degree(u) = k ]  ∝  1 / kα ,   with α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)

  V = URLs, E = (u,v) if u has a hyperlink to v

  Isolated URLs are ignored (no IN, no OUT)

Three key properties:

  Skewed distribution: the probability that a node has x links is ∝ 1/xα, with α ≈ 2.1

  Locality: usually most of the hyperlinks point to other URLs on the same host (about 80%).

  Similarity: pages close in lexicographic order tend to share many outgoing lists

A Picture of the Web Graph
[figure: adjacency matrix of a crawl with 21 million pages and 150 million links
 (Berkeley and Stanford), with nodes in URL-sorted order]

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1 - x, s2 - s1 - 1, ..., sk - sk-1 - 1}

For negative entries (only the first gap s1 - x can be negative): map v ≥ 0 to 2v, and v < 0 to 2|v| - 1
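A sketch of this gap transformation (not the WebGraph API; the successor list used in main() is just an illustrative example): only the first gap s1 - x can be negative, so it is remapped to a non-negative integer, while the later gaps are already ≥ 0 because the list is sorted and strictly increasing.

    #include <stdio.h>

    /* Map a possibly-negative value v to a non-negative code:
       v >= 0  ->  2v       (even)
       v <  0  ->  2|v|-1   (odd)                                      */
    static unsigned nat(int v) {
        return v >= 0 ? 2u * (unsigned)v : 2u * (unsigned)(-v) - 1u;
    }

    /* Gap-encode the sorted successor list succ[0,k-1] of node x:
       first entry  : nat(succ[0] - x)
       later entries: succ[i] - succ[i-1] - 1   (always >= 0)           */
    void encode_successors(int x, const int *succ, int k, unsigned *out) {
        if (k == 0) return;
        out[0] = nat(succ[0] - x);
        for (int i = 1; i < k; i++)
            out[i] = (unsigned)(succ[i] - succ[i - 1] - 1);
    }

    int main(void) {
        int succ[] = {13, 15, 16, 17, 18, 19, 23, 24, 203}; /* successors of node 15 */
        unsigned out[9];
        encode_successors(15, succ, 9, out);
        for (int i = 0; i < 9; i++) printf("%u ", out[i]);  /* 3 1 0 0 0 0 3 0 178 */
        printf("\n");
        return 0;
    }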

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y’s copy-list tells whether the corresponding successor of the
reference x is also a successor of y;
the reference index is chosen in [0,W] so as to give the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1
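A rough sketch of the interval-extraction step on the extra nodes (illustrative; Lmin = 2 as in the slide, and the list in main() is only an example): maximal runs of consecutive values become (left extreme, length - Lmin) pairs, everything else is left as a residual.

    #include <stdio.h>

    #define LMIN 2   /* minimum interval length, as in the slide */

    /* Split the sorted extra-node list a[0,k-1] into maximal runs of
       consecutive integers: runs of length >= LMIN become intervals
       (left extreme, length - LMIN); the remaining values are residuals. */
    void split_intervals(const int *a, int k) {
        int i = 0;
        while (i < k) {
            int j = i;
            while (j + 1 < k && a[j + 1] == a[j] + 1) j++;   /* extend the run */
            int len = j - i + 1;
            if (len >= LMIN)
                printf("interval: left=%d, coded length=%d\n", a[i], len - LMIN);
            else
                printf("residual: %d\n", a[i]);
            i = j + 1;
        }
    }

    int main(void) {
        int extra[] = {13, 15, 16, 17, 18, 19, 23, 24, 203};
        split_intervals(extra, 9);
        return 0;
    }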

Algoritmi per IR

Compression of file collections

Background

[setting: a sender transmits data to a receiver; the receiver already has some
 knowledge about the data]

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



What if the sender has never seen the data at the receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77-scheme provides an efficient, optimal solution




fknown plays the role of the “previously encoded text”: compress the concatenation fknown·fnew, emitting codewords only from fnew onwards

zdelta is one of the best implementations
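A naive sketch of the underlying idea, not of the real zdelta code: greedily cover fnew with the longest copies found in fknown, emitting copy or literal instructions (a real implementation also copies from the part of fnew already emitted, and uses hashing instead of this quadratic search).

    #include <stdio.h>
    #include <string.h>

    /* Greedy delta encoder: cover fnew with the longest copies found in
       fknown (naive O(n*m) search), emitting COPY(off,len) or LIT(c).   */
    void delta_encode(const char *fknown, const char *fnew) {
        int n = (int)strlen(fknown), m = (int)strlen(fnew);
        int i = 0;
        while (i < m) {
            int bestoff = -1, bestlen = 0;
            for (int j = 0; j < n; j++) {              /* try every start in fknown */
                int l = 0;
                while (j + l < n && i + l < m && fknown[j + l] == fnew[i + l]) l++;
                if (l > bestlen) { bestlen = l; bestoff = j; }
            }
            if (bestlen >= 3) {                        /* copy only if long enough */
                printf("COPY(off=%d,len=%d) ", bestoff, bestlen);
                i += bestlen;
            } else {
                printf("LIT(%c) ", fnew[i]);
                i++;
            }
        }
        printf("\n");
    }

    int main(void) {
        delta_encode("the quick brown fox", "the quick red fox");
        return 0;
    }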
             Emacs size   Emacs time
uncompr      27Mb         ---
gzip          8Mb         35 secs
zdelta       1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

[diagram: Client (keeps a reference page) — request over the slow link, with
 delta-encoding — Proxy (keeps the same reference) — request over the fast
 link — web, which returns the Page]

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f ∈ F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[figure: example graph GF with a dummy node 0; edge weights are delta/gzip
 sizes such as 20, 123, 220, 620, 2000]

           space   time
uncompr    30Mb    ---
tgz        20%     linear
THIS       8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
           space    time
uncompr    260Mb    ---
tgz        12%      2 mins
THIS       8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
[diagram: the Client (holding f_old) sends a request to the Server (holding
 f_new), which sends back an update]

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch, since the server has both copies of the files

The rsync algorithm
[diagram: the Client (holding f_old) sends block hashes to the Server (holding
 f_new); the Server replies with the encoded file]

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
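A sketch of the rolling-hash idea behind the weak checksum (Adler-style; the constants and formula here are illustrative, not rsync's exact code): sliding the block by one position updates the hash in O(1).

    #include <stdio.h>

    /* Adler-style rolling checksum of a window of length B:
       a = sum of the bytes, b = sum of the partial sums.
       Sliding the window by one position updates (a,b) in O(1). */
    typedef struct { unsigned a, b; } Roll;

    Roll roll_init(const unsigned char *s, int B) {
        Roll r = {0, 0};
        for (int i = 0; i < B; i++) { r.a += s[i]; r.b += r.a; }
        return r;
    }

    /* out = byte leaving the window, in = byte entering it */
    Roll roll_slide(Roll r, unsigned char out, unsigned char in, int B) {
        r.a = r.a - out + in;
        r.b = r.b - (unsigned)B * out + r.a;
        return r;
    }

    int main(void) {
        const unsigned char *s = (const unsigned char *)"abcdefgh";
        int B = 4;
        Roll r = roll_init(s, B);                     /* hash of "abcd" */
        for (int i = 1; i + B <= 8; i++) {
            r = roll_slide(r, s[i - 1], s[i + B - 1], B);
            Roll check = roll_init(s + i, B);         /* recompute from scratch */
            printf("window %d: %s\n", i,
                   (r.a == check.a && r.b == check.b) ? "ok" : "MISMATCH");
        }
        return 0;
    }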

Rsync: some experiments

          gcc size   emacs size
total     27288      27326
gzip       7563       8577
zdelta      227       1431
rsync       964       4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends the hashes (unlike rsync, where the client does) and the client checks them.
The server deploys the common fref to compress the new ftar (rsync just compresses it).

A multi-round protocol

k blocks of n/k elements each, log(n/k) levels of recursion

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elements each, log(n/k) levels of recursion

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

[figure: P aligned at position i of T, i.e. P is a prefix of the suffix T[i,N]]

Occurrences of P in T = all suffixes of T having P as a prefix

Example: P = si, T = mississippi  →  occurrences at positions 4 and 7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[figure: the suffix tree of T# = mississippi#; edges are labeled with substrings
 (e.g. "ssi", "ppi#", "pi#", "i#", "mississippi#"), and each leaf stores the
 starting position (1..12) of its suffix]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Storing SUF(T) explicitly takes Θ(N2) space. The Suffix Array stores only the suffix pointers, in lexicographic order:

SA    SUF(T)
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#

T = mississippi#        (example query: P = si)

Suffix Array space:
• SA: Θ(N log2 N) bits
• Text T: N chars
⇒ in practice, a total of ≈ 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison

[figure: binary search over the SA of T = mississippi# for P = si;
 at this step P is larger than the probed suffix; 2 accesses per step]

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison

[figure: the search continues; at this step P is smaller than the probed suffix]

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
⇒ overall, O(p log2 N) time

Improvable to O(p + log2 N) [Manber-Myers, ’90] and to O(p + log2 |S|) [Cole et al., ’06]
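A sketch of the indirect binary search in 0-based C (array names are made up): each probe compares P against at most p characters of the suffix T[SA[mid], ...].

    #include <stdio.h>
    #include <string.h>

    /* Return the index of the first suffix (in SA order) that is >= P,
       comparing at most p characters of each probed suffix. */
    int sa_lower_bound(const char *T, int n, const int *SA, const char *P) {
        int p = (int)strlen(P);
        int lo = 0, hi = n;                       /* search in SA[lo,hi) */
        while (lo < hi) {
            int mid = (lo + hi) / 2;
            if (strncmp(T + SA[mid], P, p) < 0) lo = mid + 1;
            else hi = mid;
        }
        return lo;
    }

    int main(void) {
        const char *T = "mississippi#";
        int SA[] = {11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2};  /* 0-based suffix array */
        int n = 12;
        int first = sa_lower_bound(T, n, SA, "si");
        /* occurrences: suffixes from SA[first] on that have P as a prefix */
        int occ = 0;
        while (first + occ < n && strncmp(T + SA[first + occ], "si", 2) == 0) occ++;
        printf("occ(si) = %d, positions:", occ);
        for (int i = 0; i < occ; i++) printf(" %d", SA[first + i] + 1); /* 1-based: 7 4 */
        printf("\n");
        return 0;
    }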

Locating the occurrences
[figure: binary search the SA of T = mississippi# for the range of suffixes
 prefixed by P = si (conceptually, between si# and si$): occ = 2, at text
 positions 4 and 7]

Suffix Array search
• O(p + log2 N + occ) time      (using the sentinel ordering # < S < $)

Suffix Trays: O(p + log2 |S| + occ)   [Cole et al., ’06]
String B-tree                         [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays          [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

SA    suffix          Lcp (with the next suffix in SA order)
12    #               0
11    i#              1
 8    ippi#           1
 5    issippi#        4
 2    ississippi#     0
 1    mississippi#    0
10    pi#             1
 9    ppi#            0
 7    sippi#          2
 4    sissippi#       1
 6    ssippi#         3
 3    ssissippi#      -

T = mississippi#

[example: the adjacent suffixes issippi# and ississippi# share the prefix issi,
 hence the corresponding Lcp entry is 4]
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
• Search for some Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
• Search for a window Lcp[i, i+C-2] whose entries are all ≥ L
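The Lcp array itself can be built in O(N) time from T and SA with Kasai's algorithm; a compact 0-based sketch:

    #include <stdlib.h>

    /* Kasai et al.: Lcp[r] = length of the longest common prefix between the
       suffixes SA[r-1] and SA[r] (r = 1..n-1).  rank[] is the inverse of SA;
       h decreases by at most 1 per iteration, so the total work is O(n). */
    void kasai_lcp(const char *T, const int *SA, int n, int *Lcp) {
        int *rank = malloc(n * sizeof(int));
        for (int i = 0; i < n; i++) rank[SA[i]] = i;
        int h = 0;
        for (int i = 0; i < n; i++) {
            if (rank[i] > 0) {
                int j = SA[rank[i] - 1];           /* previous suffix in SA order */
                while (T[i + h] && T[i + h] == T[j + h]) h++;
                Lcp[rank[i]] = h;
                if (h > 0) h--;
            } else {
                h = 0;
            }
        }
        free(rank);
    }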


C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with i-th bit of U(T[j]) to estabilish if both are true

An example j=1
1 2 3 4 5 6 7 8 9 10

n
m

1

12345

1

0

P=abaac

2

0

3

0

4

0

5

0

T=xabxabaaca

0
 
0
U (x)  0
 
0
 
0

2

3

4

5

6

7

1 
 
0
BitShift ( M ( 0 )) & U (T [1])   0  &
 
0
 
0

8

9

0 0
   
0 0
0  0
   
0 0
   
0 0

1
0

An example j=2
1 2 3 4 5 6 7 8 9 10

n
m

1

2

12345

1

0

1

P=abaac

2

0

0

3

0

0

4

0

0

5

0

0

T=xabxabaaca

1 
 
0
U (a )   1 
 
1 
 
0

3

4

5

6

7

1 
 
0
BitShift ( M (1)) & U (T [ 2 ])   0  &
 
0
 
0

8

9

1  1 
   
0 0
1    0 
   
1   0 
   
0 0

1
0

An example j=3
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

12345

1

0

1

0

P=abaac

2

0

0

1

3

0

0

0

4

0

0

0

5

0

0

0

T=xabxabaaca

0
 
1 
U (b )   0 
 
0
 
0

4

5

6

7

1 
 
1 
BitShift ( M ( 2 )) & U (T [ 3 ])   0  &
 
0
 
0

8

9

0 0
   
1  1 
0  0
   
0 0
   
0 0

1
0

An example j=9
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

4

5

6

7

8

9

12345

1

0

1

0

0

1

0

1

1

0

P=abaac

2

0

0

1

0

0

1

0

0

0

3

0

0

0

0

0

0

1

0

0

4

0

0

0

0

0

0

0

1

0

5

0

0

0

0

0

0

0

0

1

T=xabxabaaca

0
 
0
U (c )   0 
 
0
 
1 

1 
 
1 
BitShift ( M ( 8 )) & U (T [ 9 ])   0  &
 
0
 
1 

0 0
   
0 0
0  0
   
0 0
   
1  1 

1
0

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M ( j - 1))  U (T [ j ])
l

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M

l -1

( j - 1))

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1
1 2 3 4 5 6 7 8 910

M1=

T=xabxabaaca
P=
abaad

1

2

3

4

5

6

7

8

9

1
0

1

1

1

1

1

1

1

1

1

1

1

2

0

0

1

0

0

1

0

1

1

0

3

0

0

0

1

0

0

1

0

0

1

4

0

0

0

0

1

0

0

1

0

0

5

0

0

0

0

0

0

0

0

1

0

M0=

1

2

3

4

5

6

7

8

9

1
0

1

0

1

0

0

1

0

1

1

0

1

2

0

0

1

0

0

1

0

0

0

0

3

0

0

0

0

0

0

1

0

0

0

4

0

0

0

0

0

0

0

1

0

0

5

0

0

0

0

0

0

0

0

0

0

How much do we pay?





The running time is O(kn(1+m/w)
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Delection: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
0 000 ...........0 x in b in ary
Length-1


x > 0 and Length = log2 x +1
e.g., 9 represented as <000,1001>.



g-code for x takes 2 log2 x +1 bits

(ie. factor of 2 from optimal)



Optimal for Pr(x) = 1/2x2, and i.i.d integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8

6

3

59

7

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How much good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
1 ≥Si=1,...,x pi ≥ x * px  x ≤ 1/px

How good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):



p i  | g ( i ) |

i  1 ,..., S

This is:



i  1 ,.., S

p i  [ 2 * log

1

 1]

pi

 2 * H 0 ( X ) 1
No much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1n 2n 3n… nn  Huff = O(n2 log n), MTF = O(n log n) + n2

No much worse than Huffman
...but it may be far better

MTF: how good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
Put S in the front and consider the cost of encoding:
S

O ( S log S ) 



nx

g(p - p )

x 1 i  2

x

x

i

i -1

S

By Jensen’s:

 O ( S log S ) 

n
x 1

x

[ 2 * log

N

 1]

nx

 O ( S log S )  N * [ 2 * H 0 ( X )  1]
L a [ mtf ]  2 * H 0 ( X )  O (1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1n 2n 3n… nn 

There is a memory

Huff(X) = Q(n2 log n) > Rle(X) = Q(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.
1.0

c = .3

0.7

i -1

f (i ) 



p( j)

j 1

b = .5
0.2
0.0

f(a) = .0, f(b) = .2, f(c) = .7
a = .2

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
1.0

0.7
c = .3

0.7

c = .3
0.55

0.0

a = .2

0.3
0.2

c = .3

0.27
b = .5

b = .5
0.2

0.3

a = .2

b = .5
0.22

a = .2

0.2

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l 0  0 l i  l i -1  s i -1 * f c i 

s0  1

s i  s i -1 * p c i 

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

n

sn 

 p c 
i

i 1

The interval for a message sequence will be called the
sequence interval

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
1.0

0.7
c = .3

0.7
0.49

b = .5

c = .3

0.55
0.49

0.2

0.0

0.55

b = .5

0.3
a = .2

The message is bbc.

0.2

0.49
0.475

c = .3

b = .5
0.35

a = .2

0.3

a = .2

Representing a real number
Binary fractional representation:
. 75

 . 11

1/ 3

 . 01 01

11 / 16

 . 1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
m in

m ax

interval

.11

.110

.111

[.75 ,1.0 )

.101

.1010

.1011

[.625 , .75 )

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
Context
Empty

Counts
A
B
C
$

=
=
=
=

4
2
5
3

Context
A
B
C

Counts
C
$
A
$
A
B
C
$

=
=
=
=
=
=
=
=

3
1
2
1
1
2
2
3

Context
AC

BA
CA
CB
CC

String = ACCBACCACBA B

k=2

Counts
B
C
$
C
$
C
$
A
$
A
B
$

=
=
=
=
=
=
=
=
=
=
=
=

1
2
2
1
1
1
1
2
1
1
1
2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
(Again the sorted matrix: F = first column, L = last column; T is unknown.)

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
  Compute LF[0,n-1];
  T[n] = #;            // the end-marker is known
  r = 0; i = n-1;      // row 0 of the sorted matrix starts with #
  while (i > 0) {
    T[i] = L[r];       // L[r] precedes F[r] in T
    r = LF[r]; i--;
  }
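A possible C rendering of the inversion (a sketch; the buffer is 1-indexed to match the slides), which also shows how LF can be computed from L alone by counting characters:

  #include <stdlib.h>

  /* BWT inversion for a text of length n whose last symbol is the unique,
     smallest end-marker '#'.  L is the BWT (last column); T[1..n] is filled. */
  static void invert_bwt(const char *L, int n, char *T)
  {
      int count[256] = {0}, first[256], seen[256] = {0};
      int *LF = malloc(n * sizeof *LF);

      for (int i = 0; i < n; i++) count[(unsigned char)L[i]]++;

      /* first[c] = row of F where character c starts */
      for (int c = 0, sum = 0; c < 256; c++) { first[c] = sum; sum += count[c]; }

      /* LF[i] = row of F holding the same character occurrence as L[i] */
      for (int i = 0; i < n; i++) {
          unsigned char c = (unsigned char)L[i];
          LF[i] = first[c] + seen[c]++;
      }

      /* walk backward through T: row 0 is the rotation starting with '#' */
      T[n] = '#';
      for (int i = n - 1, r = 0; i > 0; i--) { T[i] = L[r]; r = LF[r]; }

      free(LF);
  }

Called on L = ipssm#pissii with n = 12, it rebuilds T[1..12] = mississippi#.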

How to compute the BWT ?

   SA   BWT matrix     L
   12   #mississipp    i
   11   i#mississip    p
    8   ippi#missis    s
    5   issippi#mis    s
    2   ississippi#    m
    1   mississippi    #
   10   pi#mississi    p
    9   ppi#mississ    i
    7   sippi#missi    s
    4   sissippi#mi    s
    6   ssippi#miss    i
    3   ssissippi#m    i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?

   SA   suffix
   12   #
   11   i#
    8   ippi#
    5   issippi#
    2   ississippi#
    1   mississippi#
   10   pi#
    9   ppi#
    7   sippi#
    4   sissippi#
    6   ssippi#
    3   ssissippi#
Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults
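A hedged sketch of this "elegant but inefficient" construction in C: suffix pointers are sorted with strcmp (hence the Θ(n² log n) worst case), and L is read off as the character preceding each suffix.

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  static const char *text;   /* shared by the comparator (sketch only) */

  static int cmp_suffix(const void *a, const void *b)
  {
      return strcmp(text + *(const int *)a, text + *(const int *)b);
  }

  /* Naive BWT: sort the n suffixes of T (which ends with the unique,
     smallest symbol '#'), then L[i] is the char preceding suffix SA[i]. */
  static void bwt_naive(const char *T, char *L)
  {
      int n = (int)strlen(T);
      int *sa = malloc(n * sizeof *sa);
      for (int i = 0; i < n; i++) sa[i] = i;
      text = T;
      qsort(sa, n, sizeof *sa, cmp_suffix);      /* O(n^2 log n) worst case  */
      for (int i = 0; i < n; i++)
          L[i] = T[(sa[i] + n - 1) % n];         /* char preceding the suffix */
      L[n] = '\0';
      free(sa);
  }

  int main(void)
  {
      char L[32];
      bwt_naive("mississippi#", L);
      printf("%s\n", L);        /* prints ipssm#pissii */
      return 0;
  }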

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet of size |Σ|+1

Bzip2-output = Arithmetic/Huffman on |Σ|+1 symbols...
... plus γ(16), plus the original Mtf-list (i,m,p,s)
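The Mtf stream above comes from exactly this transformation. A minimal Move-to-Front encoder in C (a sketch over plain bytes; bzip2 additionally applies run-length and a statistical coder on top):

  #include <string.h>

  /* Move-to-Front: replace each symbol by its current position in the list
     and move it to the front.  Recently seen symbols get small numbers,
     which is what makes the locally homogeneous L compress well afterwards. */
  static void mtf_encode(const unsigned char *in, int n, int *out)
  {
      unsigned char list[256];
      for (int i = 0; i < 256; i++) list[i] = (unsigned char)i;

      for (int i = 0; i < n; i++) {
          int j = 0;
          while (list[j] != in[i]) j++;        /* position of the symbol  */
          out[i] = j;
          memmove(list + 1, list, j);          /* shift the prefix down   */
          list[0] = in[i];                     /* ...and move to front    */
      }
  }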

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion pages available (Google, 7/2008)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node one can reach any other node via an undirected path.

Strongly connected components (SCC)

Set of nodes such that from any node one can reach any other node via a directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by humankind

Exploit the structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has a hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in-degree(u) = k ]  ∝  1 / k^α,   with α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has a hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 million pages, 150 million links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:
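A minimal sketch (in C, illustrative only) of this gap transformation and its inverse. Note that the first gap s1 − x may be negative, which is what the special handling mentioned above is for; here it is simply kept as a signed integer.

  /* Gap-encode the sorted successor list s[0..k-1] of node x, as on the
     slide: first entry is s[0]-x (possibly negative), the others are the
     differences between consecutive successors minus one.               */
  static void to_gaps(int x, const int *s, int k, int *gap)
  {
      if (k == 0) return;
      gap[0] = s[0] - x;
      for (int i = 1; i < k; i++) gap[i] = s[i] - s[i - 1] - 1;
  }

  static void from_gaps(int x, const int *gap, int k, int *s)
  {
      if (k == 0) return;
      s[0] = x + gap[0];
      for (int i = 1; i < k; i++) s[i] = s[i - 1] + gap[i] + 1;
  }

Because of locality most gaps are small, so a variable-length integer code over them is what actually saves space.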

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77-scheme provides an efficient, optimal solution




fknown plays the role of the “previously encoded text”: compress the concatenation fknown·fnew, starting the encoding from fnew

zdelta is one of the best implementations
            Emacs size   Emacs time
  uncompr   27Mb         ---
  gzip      8Mb          35 secs
  zdelta    1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f ∈ F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

(Figure: a small example of the weighted graph GF with the dummy node; edge weights are the zdelta/gzip sizes, and the min branching picks the cheapest reference for each file.)
            space   time
  uncompr   30Mb    ---
  tgz       20%     linear
  THIS      8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly: n² edge calculations (zdelta executions)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: estimate appropriate edge weights for G'F, thus saving zdelta executions. Nonetheless, this still takes n² time.
            space   time
  uncompr   260Mb   ---
  tgz       12%     2 mins
  THIS      8%      16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
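The key ingredient is a hash that can be slid by one byte in O(1) time, so the client can test every alignment of a block-sized window over f_old. The sketch below (in C) uses a generic polynomial rolling hash just to illustrate the idea; rsync's actual 4-byte checksum is a different, Adler-style function.

  #include <stddef.h>
  #include <stdint.h>

  #define BASE 256u
  #define MOD  1000000007u

  /* Rolling hash of a window of 'len' bytes: removing the leftmost byte
     and appending one new byte takes O(1) time.                          */
  typedef struct { uint64_t h, pow; size_t len; } rollhash_t;

  static void rh_init(rollhash_t *r, const unsigned char *p, size_t len)
  {
      r->h = 0; r->pow = 1; r->len = len;
      for (size_t i = 0; i < len; i++) {
          r->h = (r->h * BASE + p[i]) % MOD;
          if (i + 1 < len) r->pow = (r->pow * BASE) % MOD;   /* BASE^(len-1) */
      }
  }

  static void rh_slide(rollhash_t *r, unsigned char out, unsigned char in)
  {
      r->h = (r->h + MOD - (out * r->pow) % MOD) % MOD;   /* drop leftmost   */
      r->h = (r->h * BASE + in) % MOD;                    /* append new byte */
  }

Only the windows whose rolling hash matches a block hash of f_new are then checked with the stronger (MD5-style) hash.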

Rsync: some experiments

            gcc size   emacs size
  total     27288      27326
  gzip      7563       8577
  zdelta    227        1431
  rsync     964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), and the client checks them.
Server deploys the common fref to compress the new ftar (rsync just compresses it without a reference).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...



Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

(Figure: P aligned at position i of T, as a prefix of the suffix T[i,N].)

Occurrences of P in T = all suffixes of T having P as a prefix

Example: P = si, T = mississippi → occurrences at positions 4 and 7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
(Figure: the suffix tree of T# = mississippi#, with edges labeled by substrings such as i, s, si, ssi, p, pi#, ppi#, i#, #, mississippi#, and the 12 leaves storing the starting positions of the corresponding suffixes.)

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Storing SUF(T) explicitly takes Θ(N²) space; the suffix array keeps only the N suffix pointers.

   SA   SUF(T)
   12   #
   11   i#
    8   ippi#
    5   issippi#
    2   ississippi#
    1   mississippi#
   10   pi#
    9   ppi#
    7   sippi#
    4   sissippi#
    6   ssippi#
    3   ssissippi#

T = mississippi#    (the suffixes prefixed by P = si are contiguous: sippi#, sissippi#)

Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
⇒ In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step.

(Figure: binary search for P = si over SA = 12 11 8 5 2 1 10 9 7 4 6 3 of T = mississippi#; at each step P is compared with the suffix pointed to by the middle SA entry, deciding whether P is smaller or larger than it.)

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

Improvements: O(p + log2 N) time [Manber-Myers, ’90]; O(p + log2 |Σ|) search with suffix trays [Cole et al., ’06]
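A sketch in C of one of the two binary searches: it returns the left end of the SA interval of suffixes prefixed by P; a symmetric search gives the right end, and every SA entry in between is an occurrence.

  #include <string.h>

  /* Indirect binary search on the suffix array sa[0..n-1] of T:
     returns the first position i such that the suffix T[sa[i]..] is >= P
     on its first p characters.                                          */
  static int sa_lower_bound(const char *T, const int *sa, int n,
                            const char *P, int p)
  {
      int lo = 0, hi = n;                 /* answer lies in [lo, hi] */
      while (lo < hi) {
          int mid = lo + (hi - lo) / 2;
          if (strncmp(T + sa[mid], P, (size_t)p) < 0)
              lo = mid + 1;               /* suffix too small: go right */
          else
              hi = mid;
      }
      return lo;
  }

Each comparison costs O(p) character accesses, giving the O(p log2 N) bound stated above.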

Locating the occurrences

(Figure: two binary searches over SA delimit the interval of suffixes prefixed by P = si — conceptually, searching for si# and si$ with # < Σ < $ — here the interval contains sippi# and sissippi#, so occ = 2, at text positions 7 and 4.)

Suffix Array search
• O(p + log2 N + occ) time

Suffix Trays: O(p + log2 |Σ| + occ)   [Cole et al., ’06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp = 0 0 1 4 0 0 1 0 2 1 3
SA  = 12 11 8 5 2 1 10 9 7 4 6 3
T = mississippi#
(e.g. the adjacent suffixes issippi# and ississippi# share the prefix issi, giving an Lcp value of 4)
• How long is the common prefix between T[i,...] and T[j,...] ?
  • Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j (see the sketch below).
• Does there exist a repeated substring of length ≥ L ?
  • Search for an Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  • Search for a window Lcp[i,i+C-2] whose entries are all ≥ L
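A hedged sketch in C of the first query, using the inverse suffix array ISA (ISA[SA[r]] = r, an assumption of this sketch): the answer is the minimum Lcp entry between the ranks of the two suffixes. A range-minimum structure would answer it in O(1) instead of the linear scan used here; the arrays are 0-indexed over ranks, and i ≠ j is assumed.

  /* lcp of the suffixes starting at text positions i and j, given ISA and
     the Lcp array where Lcp[r] = lcp(suffix SA[r], suffix SA[r+1]).      */
  static int lcp_of_positions(const int *ISA, const int *Lcp, int i, int j)
  {
      int h = ISA[i], k = ISA[j];
      if (h > k) { int t = h; h = k; k = t; }
      int m = Lcp[h];
      for (int r = h + 1; r < k; r++)     /* min over Lcp[h .. k-1]        */
          if (Lcp[r] < m) m = Lcp[r];
      return m;
  }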


Slide 149

Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}

How to compute the BWT ?
  SA    BWT matrix (sorted rotations)    L
  12    #mississipp                      i
  11    i#mississip                      p
   8    ippi#missis                      s
   5    issippi#mis                      s
   2    ississippi#                      m
   1    mississippi                      #
  10    pi#mississi                      p
   9    ppi#mississ                      i
   7    sippi#missi                      s
   4    sissippi#mi                      s
   6    ssippi#miss                      i
   3    ssissippi#m                      i
We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]
(with the convention that T[0] denotes the last character of T, i.e. the rotation wraps around)

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…

 Physical network graph
   V = routers
   E = communication links

 The “cosine” graph (undirected, weighted)
   V = static web pages
   E = semantic distance between pages

 Query-Log graph (bipartite, weighted)
   V = queries and URLs
   E = (q,u) if u is a result for q, and has been clicked by some user who issued q

 Social graph (undirected, unweighted)
   V = users
   E = (x,y) if x knows y (facebook, address book, email, ..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution

Altavista crawl (1999) and WebBase crawl (2001): the indegree follows a power-law distribution

   Pr[ in-degree(u) = k ]  ∝  1 / k^a,    a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:
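
A sketch of this gap encoding in Python. Only the first gap s1−x can be negative; the mapping v ≥ 0 → 2v, v < 0 → 2|v|−1 (the same one visible in the worked residuals of the interval slide below, e.g. 3 = |13−15|·2−1) turns it into a non-negative integer. The successor list used here is only illustrative:

  def encode_successors(x, succ):
      # succ = sorted successor list of node x; locality makes the gaps small
      gaps = [succ[0] - x] + [succ[i] - succ[i-1] - 1 for i in range(1, len(succ))]
      v = gaps[0]
      gaps[0] = 2 * v if v >= 0 else 2 * (-v) - 1    # map the only possibly-negative entry
      return gaps

  encode_successors(15, [13, 15, 16, 17, 18, 19, 23, 24, 203])
  # -> [3, 1, 0, 0, 0, 0, 3, 0, 178]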

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77-scheme provides an efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
           Emacs size   Emacs time
uncompr    27Mb         ---
gzip       8Mb          35 secs
zdelta     1.5Mb        42 secs
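
The “fknown is previously encoded text” idea can be tried out with zlib’s preset-dictionary support. This is only a sketch in the spirit of zdelta (it is not zdelta itself), and the file names are hypothetical:

  import zlib

  def delta_compress(f_known: bytes, f_new: bytes) -> bytes:
      # f_known plays the role of the "previously encoded text" for LZ77 copies
      co = zlib.compressobj(level=9, zdict=f_known)
      return co.compress(f_new) + co.flush()

  def delta_decompress(f_known: bytes, f_delta: bytes) -> bytes:
      do = zlib.decompressobj(zdict=f_known)
      return do.decompress(f_delta)

  old = open("page_old.html", "rb").read()     # hypothetical reference file
  new = open("page_new.html", "rb").read()     # hypothetical new version
  d = delta_compress(old, new)
  assert delta_decompress(old, d) == new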

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Example: weighted graph over the files plus a dummy node; zdelta sizes on the edges, gzip sizes on the edges leaving the dummy node; the min branching selects the cheapest reference for each file.]

           space   time
uncompr    30Mb    ---
tgz        20%     linear
THIS       8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
           space   time
uncompr    260Mb   ---
tgz        12%     2 mins
THIS       8%      16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
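
A Python sketch of the rolling (weak) checksum idea — two 16-bit sums that can be slid one byte at a time; this conveys the flavour of rsync’s first-level hash, not its exact implementation:

  def weak_sum(block):
      a = sum(block) & 0xFFFF
      b = sum((len(block) - i) * x for i, x in enumerate(block)) & 0xFFFF
      return a, b

  def roll(a, b, out_byte, in_byte, block_len):
      # slide the window one byte: drop out_byte on the left, append in_byte on the right
      a = (a - out_byte + in_byte) & 0xFFFF
      b = (b - block_len * out_byte + a) & 0xFFFF
      return a, b

  data = b"the quick brown fox jumps over the lazy dog"
  B = 8
  a, b = weak_sum(data[0:B])
  a, b = roll(a, b, data[0], data[B], B)
  assert (a, b) == weak_sum(data[1:1+B])    # rolling update matches a fresh computation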

Rsync: some experiments

          gcc size   emacs size
total     27288      27326
gzip      7563       8577
zdelta    227        1431
rsync     964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

 k blocks of n/k elements, log(n/k) levels
 If the distance is k, then on each level at most k hashes do not find a match in the other file.
 The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Figure: suffix tree of T# = mississippi#; edge labels include #, i, s, p, si, ssi, i#, pi#, ppi#, mississippi#, and the 12 leaves are labeled with the starting positions of the corresponding suffixes.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
SUF(T), stored explicitly: Θ(N²) space

  SA    SUF(T)
  12    #
  11    i#
   8    ippi#
   5    issippi#
   2    ississippi#
   1    mississippi#
  10    pi#
   9    ppi#
   7    sippi#
   4    sissippi#
   6    ssippi#
   3    ssissippi#

T = mississippi#   (SA stores one suffix pointer per text position)

P=si
Suffix Array
• SA: Θ(N log₂ N) bits
• Text T: N chars
⟹ In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
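
A compact Python sketch of the whole pipeline on the running example: naive SA construction by sorting suffixes, then the indirect binary search (illustrative only, not the efficient constructions cited above):

  def suffix_array(T):
      return sorted(range(len(T)), key=lambda i: T[i:])          # naive construction

  def sa_search(T, SA, P):
      # return the range [l, r) of SA rows whose suffixes have P as a prefix
      lo, hi = 0, len(SA)
      while lo < hi:                                             # leftmost suffix >= P
          mid = (lo + hi) // 2
          if T[SA[mid]:SA[mid] + len(P)] < P: lo = mid + 1
          else: hi = mid
      l, hi = lo, len(SA)
      while lo < hi:                                             # first suffix whose prefix > P
          mid = (lo + hi) // 2
          if T[SA[mid]:SA[mid] + len(P)] <= P: lo = mid + 1
          else: hi = mid
      return l, lo

  T = "mississippi#"
  SA = suffix_array(T)
  l, r = sa_search(T, SA, "si")
  [SA[i] + 1 for i in range(l, r)]   # -> [7, 4]: "si" occurs at positions 7 and 4, as above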

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
  • It is the min of the subarray Lcp[h,k-1], where SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  • Search for an entry Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  • Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.
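
A small Python sketch of these text-mining queries; the Lcp computation here is the naive quadratic one, good enough to illustrate, and it reuses the suffix-array sketch above:

  def lcp_array(T, SA):
      def lcp(i, j):
          k = 0
          while i + k < len(T) and j + k < len(T) and T[i + k] == T[j + k]:
              k += 1
          return k
      return [lcp(SA[h], SA[h + 1]) for h in range(len(SA) - 1)]

  def has_repeat(Lcp, L, C):
      # is there a substring of length >= L occurring >= C times?
      # i.e. a window Lcp[i, i+C-2] whose entries are all >= L
      run = 0
      for v in Lcp:
          run = run + 1 if v >= L else 0
          if run >= C - 1:
              return True
      return False

  T = "mississippi#"
  SA = sorted(range(len(T)), key=lambda i: T[i:])
  has_repeat(lcp_array(T, SA), 4, 2)     # True: "issi" (length 4) occurs twice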


Slide 150

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
Running time as a function of n:

  n      4K    8K    16K   32K    128K   256K   512K   1M
  n³     22s   3m    26m   3.5h   28h    --     --     --
  n²     0     0     0     1s     26s    106s   7m     28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT
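
The same scan in runnable Python, on the array of the slide (it keeps the slide’s assumption that the optimum is positive and every subsum is non-zero):

  def max_subarray_sum(A):
      best, s = -1, 0
      for x in A:
          if s + x <= 0:
              s = 0                  # a non-positive running sum never starts the optimum
          else:
              s += x
              best = max(best, s)
      return best

  max_subarray_sum([2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7])   # -> 12 (the window 6 1 -2 4 3)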

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort

Merge-Sort(A,i,j)
  if (i < j) then
     m = (i+j)/2;                // Divide
     Merge-Sort(A,i,m);          // Conquer
     Merge-Sort(A,m+1,j);
     Merge(A,i,m,j)              // Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
[Figure: recursion tree of Merge-Sort on a small example array, showing the sorted runs produced at each of the log₂N levels. Question: how do we deploy the disk/memory features?]
N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = log_{M/B}(#runs) ≈ log_{M/B}(N/M)

⟹ Optimal cost = Θ((N/B) · log_{M/B}(N/M)) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Can compression help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;
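
In Python, using the usual one-candidate/one-counter formulation (equivalent in spirit to the slide’s pseudocode); the returned candidate is the mode only if the mode really occurs > N/2 times, otherwise it must be verified with a second pass:

  def majority_candidate(stream):
      X, C = None, 0
      for s in stream:
          if C == 0:
              X, C = s, 1
          elif s == X:
              C += 1
          else:
              C -= 1
      return X

  majority_candidate("bacccdcbaaaccbccc")   # -> 'c' (c occurs 9 times out of 17)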

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 · 10⁹ chars, size = 6Gb
n = 10⁶ documents
TotT = 10⁹ term occurrences (avg term length is 6 chars)
t = 5 · 10⁵ distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix      (t = 500K terms,  n = 1 million docs)

             Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
  Antony            1                1             0          0       0        1
  Brutus            1                1             0          1       0        0
  Caesar            1                1             0          1       1        1
  Calpurnia         0                1             0          0       0        0
  Cleopatra         1                0             0          0       0        0
  mercy             1                0             1          1       1        1
  worser            1                0             1          1       1        0

Entry is 1 if the play contains the word, 0 otherwise.   Space is 500Gb !

Solution 2: Inverted index

  Brutus     →  2  4  8  16  32  64  128
  Calpurnia  →  1  2  3  5  8  13  21  34
  Caesar     →  13  16

1. Typically about 12 bytes per posting
2. We have 10⁹ total terms ⟹ at least 12Gb space
3. Compressing the 6Gb of documents gets 1.5Gb of data
Better index, but yet it is >10 times the text !!!!
We can still do better, i.e. 30÷50% of the original text

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

   ∑_{i=1}^{n-1} 2^i  =  2^n − 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
   i(s) = log₂ (1/p(s)) = − log₂ p(s)

Lower probability ⟹ higher information

Entropy is the weighted average of i(s):

   H(S) = ∑_{s∈S} p(s) · log₂ (1/p(s))    bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
   La(C) = ∑_{s∈S} p(s) · L[s]

We say that a prefix code C is optimal if for all prefix codes C’, La(C) ≤ La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!
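
A short Python sketch that builds such a code with a heap; ties can be resolved differently, so the codewords may differ from the running example below while having the same optimal average length:

  import heapq

  def huffman_codes(p):
      # p: {symbol: probability}; returns {symbol: codeword}
      heap = [[w, i, {s: ""}] for i, (s, w) in enumerate(sorted(p.items()))]
      heapq.heapify(heap)
      tie = len(heap)
      while len(heap) > 1:
          w1, _, c1 = heapq.heappop(heap)          # two least probable subtrees
          w2, _, c2 = heapq.heappop(heap)
          merged = {s: "0" + cw for s, cw in c1.items()}
          merged.update({s: "1" + cw for s, cw in c2.items()})
          heapq.heappush(heap, [w1 + w2, tie, merged]); tie += 1
      return heap[0][2]

  huffman_codes({"a": .1, "b": .2, "c": .2, "d": .5})
  # with this tie-breaking: d -> 0, c -> 10, a -> 110, b -> 111
  # average length 1.8 bits, the same as the example below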

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

   H(s) = ∑_{i=1}^{m} 2^{m−i} · s[i]

P = 0101
H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5
s = s’ if and only if H(s) = H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

   H(T_r) = 2·H(T_{r−1}) − 2^m · T[r−1] + T[r+m−1]

T = 10110101
T1 = 1011,  T2 = 0110
H(T1) = H(1011) = 11
H(T2) = H(0110) = 2·11 − 2⁴·1 + 0 = 22 − 16 + 0 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
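
A Python sketch of the fingerprint scan on the binary example of the previous slides. Here q is fixed for brevity (the algorithm picks a random prime ≤ I), and every fingerprint match is verified, so no false match is reported:

  def karp_rabin(T, P, q=2147483647):          # q: a prime modulus
      n, m = len(T), len(P)
      hp = ht = 0
      for i in range(m):                       # Horner's rule, mod q
          hp = (2 * hp + P[i]) % q
          ht = (2 * ht + T[i]) % q
      top = pow(2, m - 1, q)                   # 2^(m-1) mod q
      occ = []
      for r in range(n - m + 1):
          if hp == ht and T[r:r + m] == P:     # verify, to rule out false matches
              occ.append(r + 1)                # 1-based position, as in the slides
          if r + m < n:                        # roll: drop T[r], append T[r+m]
              ht = ((ht - T[r] * top) * 2 + T[r + m]) % q
      return occ

  karp_rabin([1,0,1,1,0,1,0,1], [0,1,0,1])     # -> [5]: T=10110101, P=0101 matches at r=5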

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with i-th bit of U(T[j]) to estabilish if both are true

An example j=1
1 2 3 4 5 6 7 8 9 10

n
m

1

12345

1

0

P=abaac

2

0

3

0

4

0

5

0

T=xabxabaaca

0
 
0
U (x)  0
 
0
 
0

2

3

4

5

6

7

1 
 
0
BitShift ( M ( 0 )) & U (T [1])   0  &
 
0
 
0

8

9

0 0
   
0 0
0  0
   
0 0
   
0 0

1
0

An example j=2
1 2 3 4 5 6 7 8 9 10

n
m

1

2

12345

1

0

1

P=abaac

2

0

0

3

0

0

4

0

0

5

0

0

T=xabxabaaca

1 
 
0
U (a )   1 
 
1 
 
0

3

4

5

6

7

1 
 
0
BitShift ( M (1)) & U (T [ 2 ])   0  &
 
0
 
0

8

9

1  1 
   
0 0
1    0 
   
1   0 
   
0 0

1
0

An example j=3
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

12345

1

0

1

0

P=abaac

2

0

0

1

3

0

0

0

4

0

0

0

5

0

0

0

T=xabxabaaca

0
 
1 
U (b )   0 
 
0
 
0

4

5

6

7

1 
 
1 
BitShift ( M ( 2 )) & U (T [ 3 ])   0  &
 
0
 
0

8

9

0 0
   
1  1 
0  0
   
0 0
   
0 0

1
0

An example j=9
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

4

5

6

7

8

9

12345

1

0

1

0

0

1

0

1

1

0

P=abaac

2

0

0

1

0

0

1

0

0

0

3

0

0

0

0

0

0

1

0

0

4

0

0

0

0

0

0

0

1

0

5

0

0

0

0

0

0

0

0

1

T=xabxabaaca

0
 
0
U (c )   0 
 
0
 
1 

1 
 
1 
BitShift ( M ( 8 )) & U (T [ 9 ])   0  &
 
0
 
1 

0 0
   
0 0
0  0
   
0 0
   
1  1 

1
0

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.
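
A Python sketch of the method, using an integer as the bit column (Python ints are unbounded, so the m ≤ w distinction disappears); the example is the california/for matrix shown earlier:

  def shift_and(T, P):
      m = len(P)
      U = {}                                    # U[c]: bit i set iff P[i] == c
      for i, c in enumerate(P):
          U[c] = U.get(c, 0) | (1 << i)
      col, occ = 0, []
      for j, c in enumerate(T):
          col = ((col << 1) | 1) & U.get(c, 0)  # BitShift(previous column) AND U(T[j])
          if col & (1 << (m - 1)):              # bit m-1 set: a full match ends at j
              occ.append(j - m + 2)             # 1-based starting position
      return occ

  shift_and("california", "for")                # -> [5]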

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M ( j - 1))  U (T [ j ])
l

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M

l -1

( j - 1))

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1
1 2 3 4 5 6 7 8 910

M1=

T=xabxabaaca
P=
abaad

1

2

3

4

5

6

7

8

9

1
0

1

1

1

1

1

1

1

1

1

1

1

2

0

0

1

0

0

1

0

1

1

0

3

0

0

0

1

0

0

1

0

0

1

4

0

0

0

0

1

0

0

1

0

0

5

0

0

0

0

0

0

0

0

1

0

M0=

1

2

3

4

5

6

7

8

9

1
0

1

0

1

0

0

1

0

1

1

0

1

2

0

0

1

0

0

1

0

0

0

0

3

0

0

0

0

0

0

1

0

0

0

4

0

0

0

0

0

0

0

1

0

0

5

0

0

0

0

0

0

0

0

0

0

How much do we pay?





The running time is O(kn(1+m/w)
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Delection: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding

   g(x) = 0^{Length−1} · binary(x),   with x > 0 and Length = ⌊log₂ x⌋ + 1

e.g., 9 is represented as <000, 1001>.

g-code for x takes 2⌊log₂ x⌋ + 1 bits (i.e. a factor of 2 from optimal)

Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of g-coded integers, reconstruct the original sequence:

   0001000001100110000011101100111   →   8, 6, 3, 59, 7
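
A Python sketch of g-encoding/decoding that checks the exercise above:

  def gamma_encode(x):                 # x > 0
      b = bin(x)[2:]                   # binary representation, no leading zeros
      return "0" * (len(b) - 1) + b    # Length-1 zeros, then x in binary

  def gamma_decode(bits):
      out, i = [], 0
      while i < len(bits):
          z = 0
          while bits[i] == "0":        # count the unary part: Length-1 zeros
              z += 1; i += 1
          out.append(int(bits[i:i + z + 1], 2))
          i += z + 1
      return out

  code = "".join(gamma_encode(x) for x in [8, 6, 3, 59, 7])
  code                                 # '0001000001100110000011101100111'
  gamma_decode(code)                   # [8, 6, 3, 59, 7]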

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How much good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
1 ≥Si=1,...,x pi ≥ x * px  x ≤ 1/px

How good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):



p i  | g ( i ) |

i  1 ,..., S

This is:



i  1 ,.., S

p i  [ 2 * log

1

 1]

pi

 2 * H 0 ( X ) 1
No much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1n 2n 3n… nn  Huff = O(n2 log n), MTF = O(n log n) + n2

No much worse than Huffman
...but it may be far better
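
A Python sketch of the transform; the list is kept as a plain Python list, so each step costs O(|S|) — the search-tree/hash-table solutions of a later slide bring this down to O(log |S|):

  def mtf_encode(s, alphabet):
      L = list(alphabet)
      out = []
      for c in s:
          i = L.index(c)               # 1) output the position of c in L
          out.append(i)
          L.pop(i); L.insert(0, c)     # 2) move c to the front of L
      return out

  def mtf_decode(codes, alphabet):
      L = list(alphabet)
      out = []
      for i in codes:
          c = L[i]
          out.append(c)
          L.pop(i); L.insert(0, c)
      return "".join(out)

  mtf_encode("aaabbbccc", "abc")                    # -> [0, 0, 0, 1, 0, 0, 2, 0, 0]
  mtf_decode([0, 0, 0, 1, 0, 0, 2, 0, 0], "abc")    # -> 'aaabbbccc'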

MTF: how good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
Put S in the front and consider the cost of encoding:

   O(|S| log |S|)  +  ∑_{x∈S} ∑_{i=2}^{n_x} |g(p_x^i − p_x^{i−1})|

where p_x^1 < p_x^2 < ... < p_x^{n_x} are the positions of symbol x. By Jensen’s inequality this is

   ≤ O(|S| log |S|) + ∑_{x∈S} n_x · [ 2·log(N/n_x) + 1 ]
   = O(|S| log |S|) + N · [ 2·H₀(X) + 1 ]

Hence   La[mtf] ≤ 2·H₀(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1n 2n 3n… nn 

There is a memory

Huff(X) = Q(n2 log n) > Rle(X) = Q(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.
1.0

c = .3

0.7

i -1

f (i ) 



p( j)

j 1

b = .5
0.2
0.0

f(a) = .0, f(b) = .2, f(c) = .7
a = .2

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
1.0

0.7
c = .3

0.7

c = .3
0.55

0.0

a = .2

0.3
0.2

c = .3

0.27
b = .5

b = .5
0.2

0.3

a = .2

b = .5
0.22

a = .2

0.2

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities p[c] use the following:

   l_0 = 0        l_i = l_{i−1} + s_{i−1} · f[c_i]
   s_0 = 1        s_i = s_{i−1} · p[c_i]

f[c] is the cumulative prob. up to symbol c (not included).
The final interval size is

   s_n = ∏_{i=1}^{n} p[c_i]

The interval for a message sequence will be called the sequence interval.

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
1.0

0.7
c = .3

0.7
0.49

b = .5

c = .3

0.55
0.49

0.2

0.0

0.55

b = .5

0.3
a = .2

The message is bbc.

0.2

0.49
0.475

c = .3

b = .5
0.35

a = .2

0.3

a = .2

Representing a real number
Binary fractional representation:
. 75

 . 11

1/ 3

 . 01 01

11 / 16

 . 1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
m in

m ax

interval

.11

.110

.111

[.75 ,1.0 )

.101

.1010

.1011

[.625 , .75 )

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

   1 + ⌈log₂(1/s)⌉ = 1 + ⌈log₂ ∏_{i=1,n} (1/p_i)⌉
   ≤ 2 + ∑_{i=1,n} log₂(1/p_i)
   = 2 + ∑_{k=1,|S|} n·p_k · log₂(1/p_k)
   = 2 + n·H₀   bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
Context
Empty

Counts
A
B
C
$

=
=
=
=

4
2
5
3

Context
A
B
C

Counts
C
$
A
$
A
B
C
$

=
=
=
=
=
=
=
=

3
1
2
1
1
2
2
3

Context
AC

BA
CA
CB
CC

String = ACCBACCACBA B

k=2

Counts
B
C
$
C
$
C
$
A
$
A
B
$

=
=
=
=
=
=
=
=
=
=
=
=

1
2
2
1
1
1
1
2
1
1
1
2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Q(n2 log n) time in the worst-case
• Q(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?

Largest artifact ever conceived by humans
Exploit the structure of the Web for
  Crawl strategies
  Search
  Spam detection
  Discovering communities on the web
  Classification/organization
Predict the evolution of the Web
  Sociological understanding

Many other large graphs…

Physical network graph
  V = Routers
  E = communication links

The “cosine” graph (undirected, weighted)
  V = static web pages
  E = semantic distance between pages

Query-Log graph (bipartite, weighted)
  V = queries and URLs
  E = (q,u) if u is a result for q, and has been clicked by some user who issued q

Social graph (undirected, unweighted)
  V = users
  E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)
  V = URLs, E = (u,v) if u has a hyperlink to v
  Isolated URLs are ignored (no IN & no OUT)
Three key properties:
  Skewed distribution: Pr that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution

Altavista crawl 1999, WebBase crawl 2001:
Indegree follows a power law distribution
  Pr[in-degree(u) = k]  ∝  1/k^a,   a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)
  V = URLs, E = (u,v) if u has a hyperlink to v
  Isolated URLs are ignored (no IN, no OUT)

Three key properties:
  Skewed distribution: Pr that a node has x links is 1/x^a, a ≈ 2.1
  Locality: usually most of the hyperlinks point to other URLs on the same
  host (about 80%).
  Similarity: pages close in lexicographic order tend to share many outgoing
  lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:
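A small Python sketch of this gap encoding (our own illustration, not the WebGraph code): only the first gap s1-x may be negative, and this sketch maps it to a non-negative integer with an even/odd trick, which is an assumption of ours.

def encode_successors(x, succ):
    gaps = []
    for i, s in enumerate(succ):
        if i == 0:
            g = s - x                                   # may be negative
            gaps.append(2 * g if g >= 0 else 2 * (-g) - 1)
        else:
            gaps.append(s - succ[i - 1] - 1)            # gaps minus one
    return gaps

def decode_successors(x, gaps):
    succ = []
    for i, g in enumerate(gaps):
        if i == 0:
            succ.append(x + (g // 2 if g % 2 == 0 else -(g // 2) - 1))
        else:
            succ.append(succ[-1] + g + 1)
    return succ

succ = [13, 15, 16, 17, 18, 19, 23, 24, 203]
assert decode_successors(15, encode_successors(15, succ)) == succ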

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression   (one-to-one)

Problem: We have two files f_known and f_new and the goal is to compute a file
f_d of minimum size such that f_new can be derived from f_known and f_d

  Assume that block moves and copies are allowed
  Find an optimal covering set of f_new based on f_known
  The LZ77-scheme provides an efficient, optimal solution:
    f_known is the “previously encoded text”; compress the concatenation
    f_known·f_new starting from f_new
  zdelta is one of the best implementations
           Emacs size   Emacs time
uncompr    27Mb         ---
gzip       8Mb          35 secs
zdelta     1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

(figure: an example weighted graph G_F on files 1, 2, 3, 5 plus the dummy node
0; edge weights such as 620, 123, 2000, 220, 20 are zdelta/gzip sizes, and the
min branching picks the cheapest incoming edge for each file)

           space   time
uncompr    30Mb    ---
tgz        20%     linear
THIS       8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
           space    time
uncompr    260Mb    ---
tgz        12%      2 mins
THIS       8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
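To illustrate the rolling-hash idea, here is a Python sketch of an Adler-style weak checksum over a window of B bytes that slides by one byte in O(1); it is not rsync’s actual checksum (constants, block matching and the strong MD5 check are omitted).

M = 1 << 16

def weak_sum(block):
    a = sum(block) % M
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % M
    return a, b

def roll(a, b, out_byte, in_byte, B):
    # slide the window one byte to the right in O(1)
    a = (a - out_byte + in_byte) % M
    b = (b - B * out_byte + a) % M
    return a, b

data = b"the quick brown fox jumps over the lazy dog"
B = 8
a, b = weak_sum(data[0:B])
for i in range(1, len(data) - B + 1):
    a, b = roll(a, b, data[i - 1], data[i + B - 1], B)
    assert (a, b) == weak_sum(data[i:i + B])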

Rsync: some experiments

          gcc size   emacs size
total     27288      27326
gzip      7563       8577
zdelta    227        1431
rsync     964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients check them.
Server deploys the common f_ref to compress the new f_tar (rsync compresses
just f_tar).

A multi-round protocol

k blocks of n/k elems, log(n/k) levels
If the distance is k, then on each level ≤ k hashes do not find a match in the
other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
(figure: the suffix tree of T# = mississippi#, with edge labels such as i, s,
si, ssi, p, pi#, ppi#, i#, # and mississippi#; its 12 leaves are labeled with
the starting positions 1..12 of the corresponding suffixes)

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

Storing SUF(T) explicitly takes Θ(N²) space; we keep only the suffix pointers.

T = mississippi#        P = si
SA = 12 11 8 5 2 1 10 9 7 4 6 3
SUF(T) = #, i#, ippi#, issippi#, ississippi#, mississippi#, pi#, ppi#, sippi#,
sissippi#, ssippi#, ssissippi#

Suffix Array
• SA: Θ(N log₂ N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per
step.
T = mississippi#, P = si: at every step P is compared with the suffix pointed
to by the middle SA entry, moving right when P is larger and left when P is
smaller.

Suffix Array search
• O(log₂ N) binary-search steps
• Each step takes O(p) char comparisons
 overall, O(p log₂ N) time
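A Python sketch of this indirect binary search (our own code): each probe compares P against the first |P| characters of the probed suffix, giving the O(p log₂ N) bound.

def sa_search(T, SA, P):
    # lower bound: leftmost suffix whose |P|-char prefix is >= P
    lo, hi = 0, len(SA)
    while lo < hi:
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + len(P)] < P: lo = mid + 1
        else: hi = mid
    first = lo
    # upper bound: leftmost suffix whose |P|-char prefix is > P
    lo, hi = first, len(SA)
    while lo < hi:
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + len(P)] <= P: lo = mid + 1
        else: hi = mid
    return first, lo                      # occurrences are SA[first:lo]

T = "mississippi#"
SA = sorted(range(len(T)), key=lambda i: T[i:])
lo, hi = sa_search(T, SA, "si")
print(hi - lo, sorted(SA[lo:hi]))        # 2 occurrences, at 0-based positions 3 and 6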

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]

Locating the occurrences
T = mississippi#, P = si: binary-search for the SA range of suffixes lying
between si# and si$ (here sippi... and sissippi..., i.e. SA entries 7 and 4),
hence occ = 2.

Suffix Array search
• O(p + log₂ N + occ) time     (assuming # < Σ < $)
Suffix Trays: O(p + log₂ |Σ| + occ)      [Cole et al., ’06]
String B-tree                            [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays             [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

T = mississippi#
SA  = 12 11 8 5 2 1 10 9 7 4 6 3
Lcp =  0  0 1 4 0 0  1 0 2 1 3
(e.g. the adjacent suffixes issippi# and ississippi# share a prefix of length 4)

• How long is the common prefix between T[i,...] and T[j,...] ?
  • Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  • Search for Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  • Search for a window Lcp[i,i+C-2] whose entries are all ≥ L


Slide 151

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
  C * p * f/(1+f)
This is at least 10⁴ * f/(1+f)
If we fetch B ≈ 4Kb in time C, and the algorithm uses all of them:
  (1/B) * (p * f/(1+f) * C)  ≈  30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Running times of the cubic vs. quadratic algorithms:
 n     4K    8K    16K   32K    128K   256K   512K   1M
 n³    22s   3m    26m   3.5h   28h    --     --     --
 n²    0     0     0     1s     26s    106s   7m     28m

An optimal solution
We assume every sub-sum ≠ 0

(figure: the running sum is < 0 just before the optimum window and stays > 0
within it)

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm
  sum = 0; max = -1;
  For i = 1,...,n do
    If (sum + A[i] ≤ 0) sum = 0;
    else { sum += A[i]; max = MAX{max, sum}; }

Note:
• sum < 0 when OPT starts;
• sum > 0 within OPT
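The scan above in runnable form, as a Python sketch; like the slide it assumes the optimum is positive (max starts at -1).

def max_subarray_sum(A):
    best, s = -1, 0
    for x in A:
        if s + x <= 0:
            s = 0                 # restart: the current prefix cannot help
        else:
            s += x
            best = max(best, s)
    return best

A = [2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]
print(max_subarray_sum(A))        # 12, achieved by the window 6 1 -2 4 3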

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort ⇒ Θ(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 10⁹ random I/Os = 10⁹ * 5ms ≈ 2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02   m = (i+j)/2;             // Divide
03   Merge-Sort(A,i,m);       // Conquer
04   Merge-Sort(A,m+1,j);
05   Merge(A,i,m,j)           // Combine

Cost of Mergesort on large data

Take Wikipedia in Italian, compute word freq:
  n = 10⁹ tuples ⇒ few Gbs
  Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:
  It is an indirect sort: Θ(n log₂ n) random I/Os
  [5ms] * n log₂ n ≈ 1.5 years

In practice, it is faster because of caching... (2 passes (R/W) per level)

Merge-Sort Recursion Tree

(figure: the recursion tree of mergesort on a sample input, log₂ N levels;
each level pairwise merges runs of doubled size, e.g. runs like 1 2 5 10 and
2 7 9 13 19, until the whole sequence is sorted. How do we deploy the
disk/memory features ?)

If the run-size is larger than B (i.e. after the first step!!), fetching all
of it in memory for merging does not help.

With internal memory M:
  N/M runs, each sorted in internal memory (no I/Os)
  — I/O-cost for merging is ≈ 2 (N/B) log₂ (N/M)

Multi-way Merge-Sort

The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:
  Pass 1: Produce (N/M) sorted runs.
  Pass i: merge X ≤ M/B runs  ⇒  log_{M/B}(N/M) passes

(figure: X input buffers and 1 output buffer, each of B items, in main memory;
runs stream in from disk and the merged output streams back to disk)

Multiway Merging
Each of the X = M/B runs keeps its current page Bf_i and a pointer p_i into it.
Repeatedly move min(Bf1[p1], Bf2[p2], …, Bfx[pX]) to the output buffer Bfo:
fetch the next page of run i when p_i = B, and flush Bfo to the merged output
run when it is full, until EOF.
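A Python sketch of the merging logic above (our own code, in memory): a heap plays the role of the min-selection among the X current pages; a real external sort would stream B-sized pages from and to disk.

import heapq

def multiway_merge(runs):
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        val, i, j = heapq.heappop(heap)   # the minimum among the run fronts
        out.append(val)
        if j + 1 < len(runs[i]):
            heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
    return out

runs = [[1, 2, 5, 10], [2, 7, 9, 13, 19], [3, 4, 8, 11, 12, 15, 17]]
print(multiway_merge(runs))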

Cost of Multi-way Merge-Sort

Number of passes = log_{M/B}(#runs) ≤ log_{M/B}(N/M)
Optimal cost = Θ((N/B) log_{M/B}(N/M)) I/Os

In practice
  M/B ≈ 1000  ⇒  #passes = log_{M/B}(N/M) ≈ 1
  One multiway merge ⇒ 2 passes = few mins
  (tuning depends on disk features)

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Can compression help?

Goal: enlarge M and reduce N
  #passes = O(log_{M/B} N/M)
  Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements

Goal: Top queries over a stream of N items (Σ large).
Math Problem: Find the item y whose frequency is > N/2, using the smallest
space (i.e. assuming the mode occurs > N/2 times).

A = b a c c c d c b a a a c c b c c c

Algorithm
  Use a pair of variables <X,C>, initially X = first item, C = 1
  For each subsequent item s of the stream:
    if (X==s) then C++
    else { C--; if (C==0) { X=s; C=1; } }
  Return X;

Proof (sketch)
Problems could only arise if the returned X occurs ≤ N/2 times.
If X≠y, then every one of y’s occurrences has a “negative” mate that cancelled
it; these mates would be ≥ #occ(y). As a result N ≥ 2 * #occ(y) > N, a
contradiction.
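The same idea in the standard Boyer–Moore majority-vote formulation, as a Python sketch; it is equivalent to the two-variable scheme above and returns the majority item whenever one exists.

def majority_candidate(stream):
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1
        elif X == s:
            C += 1
        else:
            C -= 1
    return X     # guaranteed correct only if some item occurs > N/2 times

A = "b a c c c d c b a a a c c b c c c".split()
print(majority_candidate(A))   # c  (c occurs 9 times out of 17)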

Toy problem #4: Indexing

Consider the following TREC collection:
  N = 6 * 10⁹, size = 6Gb
  n = 10⁶ documents
  TotT = 10⁹ (avg term length is 6 chars)
  t = 5 * 10⁵ distinct terms

What kind of data structure do we build to support word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million documents, t = 500K terms
1 if the play contains the word, 0 otherwise:

            Antony&Cleopatra  JuliusCaesar  TheTempest  Hamlet  Othello  Macbeth
Antony            1               1             0         0        0        1
Brutus            1               1             0         1        0        0
Caesar            1               1             0         1        1        1
Calpurnia         0               1             0         0        0        0
Cleopatra         1               0             0         0        0        0
mercy             1               0             1         1        1        1
worser            1               0             1         1        1        0

Space is 500Gb !

Solution 2: Inverted index

Brutus    → 2 4 8 16 32 64 128
Calpurnia → 1 2 3 5 8 13 21 34
Caesar    → 13 16

We can still do better: i.e. 30-50% of the original text

1. Typically one uses about 12 bytes per posting
2. We have 10⁹ total terms ⇒ at least 12Gb space
3. Compressing the 6Gb of documents gets 1.5Gb of data
Better index, but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string is (lossless)
compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in fewer bits ?
NO: they are 2ⁿ, but the shorter compressed messages are fewer:
  Σ_{i=1}^{n-1} 2^i = 2^n - 2

We need to talk about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the self-information of s is:
  i(s) = log₂ (1/p(s)) = -log₂ p(s)
Lower probability ⇒ higher information

Entropy is the weighted average of i(s):
  H(S) = Σ_{s∈S} p(s) · log₂ (1/p(s))   bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword lengths L[s], the average length is defined as
  La(C) = Σ_{s∈S} p(s) · L[s]
We say that a prefix code C is optimal if for all prefix codes C’,
La(C) ≤ La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal uniquely decodable code, there
exists a prefix code with the same codeword lengths and thus the same optimal
average length. And vice versa…

Theorem (golden rule). If C is an optimal prefix code for a source with
probabilities {p1, …, pn} then pi < pj ⇒ L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any
uniquely decodable code C, we have
  H(S) ≤ La(C)
Theorem (upper bound, Shannon). For any probability distribution, there exists
a prefix code C such that
  La(C) ≤ H(S) + 1
(the Shannon code: symbol s takes ⌈log₂ 1/p(s)⌉ bits)

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5
Merge a(.1)+b(.2) into (.3); merge (.3)+c(.2) into (.5); merge (.5)+d(.5)
into (1).
Resulting codewords: a=000, b=001, c=01, d=1
There are 2^(n-1) “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to the symbol to be encoded.
Decoding: Start at the root and take the branch for each bit received. When at
a leaf, output its symbol and return to the root.

abc... → 00000101            101001... → dcb
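A Python sketch of the construction on the running example (our own code, using a heap); ties are broken arbitrarily, so the printed 0/1 labels may differ from the slide’s, but the codeword lengths (3, 3, 2, 1 bits) are the same.

import heapq

def huffman_codes(probs):
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    i = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)          # two least probable trees
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, i, merged)); i += 1
    return heap[0][2]

print(huffman_codes({"a": .1, "b": .2, "c": .2, "d": .5}))
# codeword lengths: a,b -> 3 bits, c -> 2 bits, d -> 1 bit, as on the slide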

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. Its self-information is
  -log₂(.999) ≈ .00144 bits
If we were to send 1000 such symbols we might hope to use about
1000 * .00144 ≈ 1.44 bits.
Using Huffman, we take at least 1 bit per symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
(figure: the word-based Huffman tree with fan-out 128 over the words of
T = “bzip or not bzip”; each codeword is a sequence of bytes, 7 bits of
Huffman code plus a tag bit per byte, so that for instance the codeword of
“bzip” is written as 1a 0b)

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons

Strings are also numbers, H: strings → numbers.
Let s be a string of length m:
  H(s) = Σ_{i=1..m} 2^{m-i} · s[i]

P = 0101
H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5
s = s’ if and only if H(s) = H(s’)

Definition: let Tr denote the m-length substring of T starting at position r
(i.e., Tr = T[r, r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons

We can compute H(Tr) from H(Tr-1):
  H(Tr) = 2·H(Tr-1) - 2^m·T(r-1) + T(r+m-1)

T = 10110101
T1 = 1 0 1 1,  T2 = 0 1 1 0
H(T1) = H(1011) = 11
H(T2) = H(0110) = 2·11 - 2⁴·1 + 0 = 22 - 16 + 0 = 6
Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P = 101111, q = 7
H(P) = 47;   Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally (reducing mod 7 at each step):
  1·2 (mod 7) + 0 = 2
  2·2 (mod 7) + 1 = 5
  5·2 (mod 7) + 1 = 4
  4·2 (mod 7) + 1 = 2
  2·2 (mod 7) + 1 = 5  =  Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1):
  2^m (mod q) = 2·(2^{m-1} (mod q)) (mod q)
Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
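A Python sketch of the Karp-Rabin scan over a binary text, with a fixed prime q instead of the random choice described above, and with explicit verification of probable matches (so it never errs).

def karp_rabin(T, P, q=2_147_483_647):
    m, n = len(P), len(T)
    if m > n:
        return []
    hp = ht = 0
    top = pow(2, m - 1, q)                 # 2^(m-1) mod q, to drop the leading bit
    for i in range(m):
        hp = (2 * hp + P[i]) % q
        ht = (2 * ht + T[i]) % q
    out = []
    for r in range(n - m + 1):
        if ht == hp and T[r:r + m] == P:   # verify to rule out false matches
            out.append(r)
        if r + m < n:
            ht = ((ht - T[r] * top) * 2 + T[r + m]) % q
    return out

print(karp_rabin([1, 0, 1, 1, 0, 1, 0, 1], [0, 1, 0, 1]))   # [4]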

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

(figure: the m×n matrix M for T = california and P = for; the only 1-entries
are M(1,5), M(2,6), M(3,7), i.e. column 7 reveals the occurrence of “for”
ending at position 7)
How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M

We want to exploit bit-parallelism to compute the j-th column of M from the
(j-1)-th one.
We define the m-length binary vector U(x) for each character x in the
alphabet: U(x) is set to 1 at the positions in P where character x appears.
Example: P = abaac
  U(a) = (1,0,1,1,0)    U(b) = (0,1,0,0,0)    U(c) = (0,0,0,0,1)

How to construct M

Initialize column 0 of M to all zeros.
For j > 0, the j-th column is obtained by
  M(j) = BitShift(M(j-1)) & U(T[j])

For i > 1, entry M(i,j) = 1 iff
  (1) the first i-1 characters of P match the i-1 characters of T ending at
      character j-1   ⇔ M(i-1,j-1) = 1
  (2) P[i] = T[j]     ⇔ the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) to the i-th position;
AND this with the i-th bit of U(T[j]) to establish whether both hold.
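A Python sketch of the method, using machine integers as bit vectors (bit i-1 of the integer stands for row i, so BitShift becomes a left shift OR 1); names are ours.

def shift_and(T, P):
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)            # U(x): positions of x in P
    col, out = 0, []
    for j, c in enumerate(T):
        col = ((col << 1) | 1) & U.get(c, 0)     # M(j) = BitShift(M(j-1)) & U(T[j])
        if col & (1 << (m - 1)):                 # last bit set: P ends at j
            out.append(j - m + 1)
    return out

print(shift_and("xabxabaaca", "abaac"))          # [4]  (0-based starting position)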

An example (P = abaac, T = xabxabaaca)
(figure: the columns M(1), M(2), M(3), …, M(9), each computed as
M(j) = BitShift(M(j-1)) & U(T[j]); at j = 9 the 5th bit of the column becomes
1, revealing the occurrence of abaac ending at position 9 of T)

Shift-And method: Complexity

If m ≤ w, any column and any vector U() fit in a memory word
  ⇒ any step requires O(1) time.
If m > w, any column and any vector U() can be divided into ⌈m/w⌉ memory words
  ⇒ any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus it is very fast when the pattern length is close to the word size, which
is very often the case in practice. Recall that w = 64 bits in modern
architectures.

Some simple extensions

We want to allow the pattern to contain special symbols, like the class of
chars [a-f].
P = [a-b]baac:
  U(a) = (1,0,1,1,0)    U(b) = (1,1,0,0,0)    U(c) = (0,0,0,0,1)

What about ‘?’, ‘[^…]’ (not)?

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

S is the concatenation of the patterns in P
R is a bitmap of length m
  R[i] = 1 iff S[i] is the first symbol of a pattern
Use a variant of the Shift-And method searching for S:
  For any symbol c, U’(c) = U(c) AND R
    U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
  For any step j,
    compute M(j), then M(j) OR U’(T[j]). Why?
      Set to 1 the first bit of each pattern that starts with T[j]
    Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep

Our current goal: given k, find all the occurrences of P in T with up to k
mismatches.
We define the matrix Ml to be an m by n binary matrix, such that:
  Ml(i,j) = 1 iff there are no more than l mismatches between the first i
  characters of P and the i characters of T ending at character j.

What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1

The first i-1 characters of P match a substring of T ending at j-1, with at
most l mismatches, and the next pair of characters in P and T are equal:
  BitShift(M^l(j-1)) & U(T[j])

Computing Ml: case 2

The first i-1 characters of P match a substring of T ending at j-1, with at
most l-1 mismatches:
  BitShift(M^{l-1}(j-1))

Computing Ml

We compute M^l for all l = 0, …, k.
For each j compute M(j), M¹(j), …, M^k(j).
For all l initialize M^l(0) to the zero vector.
In order to compute M^l(j), we observe that there is a match iff
  M^l(j) = [BitShift(M^l(j-1)) & U(T(j))]  OR  BitShift(M^{l-1}(j-1))

Example
T = xabxabaaca, P = abaad, k = 1
(figure: the 5×10 matrices M⁰ and M¹; M¹(5,9) = 1, i.e. P occurs ending at
position 9 of T with at most 1 mismatch)
How much do we pay?

The running time is O(kn(1+m/w)).
Again, the method is practically efficient for small m.
Only O(k) columns of M are needed at any given time; hence the space used by
the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations

The Shift-And method can solve other ops.

The edit distance between two strings p and s is d(p,s) = minimum number of
operations needed to transform p into s via three ops:
  Insertion: insert a symbol in p
  Deletion: delete a symbol from p
  Substitution: change a symbol in p with a different one
Example: d(ananas,banane) = 3

Search by regular expressions
  Example: (a|b)?(abc|a)
Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding

γ(x) = (Length-1) zeroes, followed by x in binary,
where x > 0 and Length = ⌊log₂ x⌋ + 1
e.g., 9 is represented as <000,1001>.

γ-code for x takes 2⌊log₂ x⌋ + 1 bits   (i.e. a factor of 2 from optimal)

Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…

Given the following sequence of γ-coded integers, reconstruct the original
sequence:
  0001000001100110000011101100111
Answer: 8, 6, 3, 59, 7
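A Python sketch of γ-encoding/decoding over bit strings (helper names are ours); it reproduces the <000,1001> example and the exercise above, assuming a well-formed input sequence.

def gamma_encode(x):
    b = bin(x)[2:]                        # x > 0
    return "0" * (len(b) - 1) + b         # Length-1 zeroes, then x in binary

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":             # count the leading zeroes
            z += 1; i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out

print(gamma_encode(9))                                   # 0001001
print(gamma_decode("0001000001100110000011101100111"))   # [8, 6, 3, 59, 7]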

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).

Recall that: |γ(i)| ≤ 2 * log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:  1 ≥ Σ_{j=1,...,i} pj ≥ i * pi   ⇒   i ≤ 1/pi

How good is it ?
The cost of the encoding is (recall i ≤ 1/pi):
  Σ_{i=1,...,|S|} pi · |γ(i)| ≤ Σ_{i=1,...,|S|} pi · [2 * log(1/pi) + 1]
                              = 2 * H0(X) + 1
Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2

A new concept: Continuers vs Stoppers
  Previously we used: s = c = 128
The main idea is:
  s + c = 256 (we are playing with 8 bits)
  Thus s items are encoded with 1 byte
  And s*c with 2 bytes, s*c² with 3 bytes, ...

An example
  5000 distinct words
  ETDC encodes 128 + 128² = 16512 words on at most 2 bytes
  A (230,26)-dense code encodes 230 + 230*26 = 6210 words on at most 2 bytes,
  hence more of them on 1 byte, which pays off if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 256-s.
  Brute-force approach
  Binary search: on real distributions there seems to be one unique minimum
Ks = max codeword length
Fsk = cumulative probability of the symbols whose |cw| ≤ k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1ⁿ 2ⁿ 3ⁿ … nⁿ  ⇒  Huff = O(n² log n), MTF = O(n log n) + n²

No much worse than Huffman
...but it may be far better
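A Python sketch of plain MTF over the whole alphabet (the Bzip pipeline above additionally keeps # aside and transmits the initial list, so its digit string differs slightly from what this prints).

def mtf_encode(s, alphabet):
    lst = list(alphabet)
    out = []
    for c in s:
        i = lst.index(c)
        out.append(i)                  # emit the current position of c
        lst.insert(0, lst.pop(i))      # move c to the front
    return out

L = "ipppssssssmmmii#pppiiissssssiiiiii"
print(mtf_encode(L, ['#', 'i', 'm', 'p', 's']))
# mostly small numbers, with long runs of 0s on the homogeneous stretches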

MTF: how good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2 * log i + 1
Put Σ in front and consider the cost of encoding:

  O(|Σ| log |Σ|) + Σ_{x=1..|Σ|} Σ_{i=2..n_x} |γ(p_i^x - p_{i-1}^x)|

(p_i^x is the position of the i-th occurrence of symbol x, n_x its frequency)

By Jensen’s inequality:
  ≤ O(|Σ| log |Σ|) + Σ_{x=1..|Σ|} n_x · [2 * log(N/n_x) + 1]
  = O(|Σ| log |Σ|) + N * [2 * H0(X) + 1]

Hence La[mtf] ≤ 2 * H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
  abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings ⇒ just the run lengths and one bit
Properties:
  Exploits spatial locality, and it is a dynamic code (there is a memory)
  X = 1ⁿ 2ⁿ 3ⁿ … nⁿ  ⇒  Huff(X) = Θ(n² log n) > Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive):

  f(i) = Σ_{j=1..i-1} p(j)

e.g. p(a) = .2, p(b) = .5, p(c) = .3:
  a = [0.0, 0.2),  b = [0.2, 0.7),  c = [0.7, 1.0)
  f(a) = .0, f(b) = .2, f(c) = .7

The interval for a particular symbol will be called the symbol interval
(e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
  start with [0,1); after b: [0.2, 0.7); after a: [0.2, 0.3); after c: [0.27, 0.3)
The final sequence interval is [.27, .3)

Arithmetic Coding
To code a sequence of symbols c with probabilities p[c] use the following:
  l₀ = 0,  lᵢ = lᵢ₋₁ + sᵢ₋₁ * f[cᵢ]
  s₀ = 1,  sᵢ = sᵢ₋₁ * p[cᵢ]
f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is
  sₙ = Π_{i=1..n} p[cᵢ]
The interval for a message sequence will be called the sequence interval
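A Python sketch of the l/s recurrences on the example model p(a)=.2, p(b)=.5, p(c)=.3 (floating point, so the printed bounds are only approximately .27 and .3; a real coder uses the integer version described later).

p = {"a": .2, "b": .5, "c": .3}
f = {"a": .0, "b": .2, "c": .7}      # cumulative probability up to the symbol

def sequence_interval(msg):
    l, s = 0.0, 1.0
    for c in msg:
        l, s = l + s * f[c], s * p[c]
    return l, s

l, s = sequence_interval("bac")
print(l, l + s)                      # ~0.27 and ~0.3: the interval [.27, .3)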

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:
  .49 ∈ [.2, .7)    → b   (new interval [.2, .7))
  .49 ∈ [.3, .55)   → b   (new interval [.3, .55))
  .49 ∈ [.475, .55) → c
The message is bbc.

Representing a real number
Binary fractional representation:
  .75 = .11      1/3 = .010101…      11/16 = .1011

Algorithm
  1. x = 2 * x
  2. If x < 1 output 0
  3. else x = x - 1; output 1

So how about just using the shortest binary fractional representation in the
sequence interval?
e.g. [0,.33) → .01     [.33,.66) → .1     [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions:

  code    min     max      interval
  .11     .110    .111…    [.75, 1.0)
  .101    .1010   .1011…   [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most
  1 + ⌈log (1/s)⌉ = 1 + ⌈log Π_{i=1..n} (1/pᵢ)⌉
  ≤ 2 + Σ_{i=1..n} log (1/pᵢ)
  = 2 + Σ_{k=1..|Σ|} n·p_k log (1/p_k)
  = 2 + n·H0 bits

(nH0 + 0.02 n bits in practice, because of rounding)

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts   (String = ACCBACCACBA B, k=2)

Order 0 (empty context):   A = 4   B = 2   C = 5   $ = 3

Order 1:
  Context A:  C = 3   $ = 1
  Context B:  A = 2   $ = 1
  Context C:  A = 1   B = 2   C = 2   $ = 3

Order 2:
  Context AC: B = 1   C = 2   $ = 2
  Context BA: C = 1   $ = 1
  Context CA: C = 1   $ = 1
  Context CB: A = 2   $ = 1
  Context CC: A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves
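A Python sketch of this parsing loop with a sliding window (our own code, with no hashing or other gzip-style optimizations); on the string of the next slide it produces exactly the five triples shown there.

def lz77_parse(T, W=6):
    i, out = 0, []
    while i < len(T):
        d = l = 0
        for dist in range(1, min(i, W) + 1):        # candidate copy sources in the window
            length = 0
            while i + length < len(T) - 1 and T[i + length] == T[i - dist + length]:
                length += 1                          # copies may overlap the cursor
            if length > l:
                d, l = dist, length
        out.append((d, l, T[i + l]))                 # (distance, length, next char)
        i += l + 1                                   # advance by len + 1
    return out

print(lz77_parse("aacaacabcabaaac"))
# [(0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), (3, 3, 'a'), (1, 2, 'c')]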

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Q(n2 log n) time in the worst-case
• Q(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics

 Size
   1 trillion pages available (Google, 7/2008)
   5-40K per page => hundreds of terabytes
   Size grows every day!!

 Change
   8% new pages, 25% new links change weekly
   Life time of a page is about 10 days

The Bow Tie

Some definitions

 Weakly connected components (WCC)
   Set of nodes such that from any node one can reach any other node via an undirected path.

 Strongly connected components (SCC)
   Set of nodes such that from any node one can reach any other node via a directed path.

Observing the Web Graph

 We do not know which percentage of it we know
 The only way to discover the graph structure of the web as hypertext is via large-scale crawls
 Warning: the picture might be distorted by
   Size limitation of the crawl
   Crawling rules
   Perturbations of the "natural" process of birth and death of nodes and links

Why is it interesting?

 Largest artifact ever conceived by humans
 Exploit the structure of the Web for
   Crawl strategies
   Search
   Spam detection
   Discovering communities on the web
   Classification/organization
 Predict the evolution of the Web
   Sociological understanding

Many other large graphs…

 Physical network graph
   V = Routers
   E = communication links

 The “cosine” graph (undirected, weighted)
   V = static web pages
   E = semantic distance between pages

 Query-Log graph (bipartite, weighted)
   V = queries and URLs
   E = (q,u) if u is a result for q and has been clicked by some user who issued q

 Social graph (undirected, unweighted)
   V = users
   E = (x,y) if x knows y (facebook, address book, email, ..)

Definition
Directed graph G = (V,E)
 V = URLs, E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN & no OUT)

Three key properties:
 Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution

Altavista crawl (1999) and WebBase crawl (2001): the in-degree follows a power-law distribution

   Pr[ in-degree(u) = k ]  ∝  1 / k^a,   a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)
 V = URLs, E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN, no OUT)

Three key properties:
 Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1
 Locality: usually most of the hyperlinks point to other URLs on the same host (about 80%).
 Similarity: pages close in lexicographic (URL) order tend to share many outgoing links

A Picture of the Web Graph

[Adjacency-matrix picture: 21 million pages, 150 million links; URL-sorting clusters hosts such as Berkeley and Stanford along the diagonal.]

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Gap-encoded successor list: S(x) = {s_1 - x, s_2 - s_1 - 1, ..., s_k - s_(k-1) - 1}

For negative entries:
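A small sketch of the gap encoding of a successor list; the node id and successors below are made-up, and the (possibly negative) first entry is handled explicitly.

# Gap (delta) encoding of an adjacency list: the first gap is taken w.r.t. the
# source node x, the others w.r.t. the previous successor, minus 1.
def encode_gaps(x, succ):
    gaps, prev = [], None
    for s in sorted(succ):
        gaps.append(s - x if prev is None else s - prev - 1)
        prev = s
    return gaps

def decode_gaps(x, gaps):
    succ, prev = [], None
    for g in gaps:
        s = x + g if prev is None else prev + 1 + g
        succ.append(s)
        prev = s
    return succ

g = encode_gaps(15, [13, 15, 16, 17, 18, 19, 23, 24, 203])
print(g)                       # [-2, 1, 0, 0, 0, 0, 3, 0, 178]  (note the negative first entry)
print(decode_gaps(15, g))      # back to the successor list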

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y's copy-list tells whether the corresponding successor of the reference x is also a successor of y;
the reference is chosen within a window [0,W] as the one that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew, and the goal is to compute a file fd of minimum size such that fnew can be derived from fknown and fd

 Assume that block moves and copies are allowed
 Find an optimal covering set of fnew based on fknown
 The LZ77 scheme provides an efficient, optimal solution:
   fknown is the “previously encoded text”; compress the concatenation fknown·fnew, emitting output only from fnew onwards

zdelta is one of the best implementations
                Emacs size    Emacs time
uncompressed    27Mb          ---
gzip             8Mb          35 secs
zdelta          1.5Mb         42 secs
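The zdelta idea can be approximated with zlib's preset-dictionary support; this is only a rough stand-in for zdelta (and the dictionary is limited to zlib's 32KB window), not the tool itself.

# Delta compression sketch: hand f_known to the compressor as a dictionary, so
# matches against it cost only (distance, length) pairs.
import zlib

def delta_compress(f_known: bytes, f_new: bytes) -> bytes:
    co = zlib.compressobj(level=9, zdict=f_known)
    return co.compress(f_new) + co.flush()

def delta_decompress(f_known: bytes, f_delta: bytes) -> bytes:
    do = zlib.decompressobj(zdict=f_known)
    return do.decompress(f_delta) + do.flush()

old = b"the quick brown fox jumps over the lazy dog" * 100
new = old.replace(b"lazy", b"sleepy")
fd = delta_compress(old, new)
print(len(new), len(zlib.compress(new, 9)), len(fd))   # the delta is much smaller than plain compression
assert delta_decompress(old, fd) == new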

Efficient Web Access
Dual-proxy architecture: a pair of proxies located on each side of the slow link use a proprietary protocol to increase performance over this link

[Client <-> client-side proxy <-> (slow link, delta-encoding) <-> server-side proxy <-> (fast link) <-> web: both proxies keep a reference (old) version of the page, and only the delta travels over the slow link.]

Use zdelta to reduce traffic:
 Old version available at both proxies
 Restricted to pages already visited (30% hits), URL-prefix match
 Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F
 Useful on a dynamic collection of web pages, back-ups, …
 Apply pairwise zdelta: find for each f ∈ F a good reference
 Reduction to the Min Branching problem on DAGs:
   Build a weighted graph GF: nodes = files, edge weights = zdelta sizes
   Insert a dummy node connected to all files, whose edge weights are the gzip sizes
   Compute the min branching = directed spanning tree of minimum total cost covering G's nodes.
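A sketch of this reduction, assuming the networkx library is available for the min branching (Edmonds' minimum spanning arborescence), and using a zlib-with-dictionary proxy instead of real zdelta sizes.

# Build the weighted graph with a dummy root and compute a min-cost branching;
# each returned edge (u, v) means "compress v using u as reference".
import zlib
import networkx as nx

def delta_size(ref: bytes, tgt: bytes) -> int:
    co = zlib.compressobj(level=9, zdict=ref)        # proxy for the zdelta size
    return len(co.compress(tgt) + co.flush())

def choose_references(files):                          # files: dict name -> bytes
    G = nx.DiGraph()
    G.add_node("DUMMY")
    for name, data in files.items():                   # dummy edge = plain (gzip-like) coding
        G.add_edge("DUMMY", name, weight=len(zlib.compress(data, 9)))
    for a, da in files.items():                        # the n^2 pairwise delta edges
        for b, db in files.items():
            if a != b:
                G.add_edge(a, b, weight=delta_size(da, db))
    T = nx.minimum_spanning_arborescence(G)            # min branching covering all files
    return [(u, v, G[u][v]["weight"]) for u, v in T.edges()]

files = {"v1": b"abcabcabc" * 200, "v2": b"abcabcabc" * 200 + b"xyz", "v3": b"zzz" * 50}
print(choose_references(files))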

[Example: a small weighted graph over a handful of files plus the dummy node 0, with zdelta/gzip edge weights such as 20, 123, 220, 620, 2000.]

                space    time
uncompressed    30Mb     ---
tgz             20%      linear
THIS             8%      quadratic

Improvement                      (many-to-one compression of a group of files)

Problem: constructing G is very costly: n^2 edge calculations (i.e. zdelta executions)
 We wish to exploit some pruning approach:
   Collection analysis: cluster the files that appear similar and are thus good candidates for zdelta compression; build a sparse weighted graph G'F containing only edges between those pairs of files
   Assign weights: estimate appropriate edge weights for G'F, thus saving zdelta executions. Nonetheless, still Θ(n^2) time
                space    time
uncompressed    260Mb    ---
tgz             12%      2 mins
THIS             8%      16 mins

Algoritmi per IR

File Synchronization

File synch: The problem

[Client holds f_old; Server holds f_new; the client sends a request and receives an update.]

 The client wants to update an out-dated file
 The server has the new file but does not know the old file
 Update without sending the entire f_new (exploit the similarity)
 rsync: file-synch tool, distributed with Linux

Delta compression is a sort of “local” synch, since the server has both copies of the file.

The rsync algorithm

[The client splits f_old into blocks and sends their hashes; the server scans f_new, detects the blocks that hash-match, and sends back the encoded file: block ids for the matches plus literal bytes.]

 Simple, widely used, single roundtrip
 Optimizations: 4-byte rolling hash + 2-byte MD5, gzip for the literals
 Choice of the block size is problematic (default: max{700, √n} bytes)
 Not good in theory: the granularity of the changes may disrupt the use of blocks
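The 4-byte rolling hash can be sketched as an Adler-like weak checksum; the constants and update rule below are a plausible reconstruction for illustration, not rsync's exact code.

# Weak rolling checksum: sliding the window by one position costs O(1);
# a strong hash (e.g. MD5) is recomputed only when the weak one matches.
def weak_checksum(block):
    a = sum(block) % 65536
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % 65536
    return a, b

def roll(a, b, out_byte, in_byte, blocksize):
    a = (a - out_byte + in_byte) % 65536
    b = (b - blocksize * out_byte + a) % 65536
    return a, b

data = b"abcdefghijklmnop"
B = 4
a, b = weak_checksum(data[0:B])
for i in range(1, len(data) - B + 1):
    a, b = roll(a, b, data[i - 1], data[i - 1 + B], B)
    assert (a, b) == weak_checksum(data[i:i + B])   # rolled value matches recomputation
print("rolling checksum consistent")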

Rsync: some experiments

          gcc size    emacs size
total     27288       27326
gzip       7563        8577
zdelta      227        1431
rsync       964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends the hashes (unlike rsync, where the client sends them), and the client checks them.
The server then deploys the common f_ref to compress the new f_tar (rsync compresses just the literals).

A multi-round protocol
 k blocks of n/k elements, log(n/k) levels
 If the distance is k, then on each level at most k hashes fail to find a match in the other file.
 The communication complexity is O(k lg n lg(n/k)) bits

Next lecture: Set reconciliation

Problem: Given two sets SA and SB of integer values located on two machines A and B, determine the difference between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size k of the difference, where k may or may not be known in advance to the parties.

Note:
 set reconciliation is “easier” than file synch [it is record-based]
 not perfectly true, but...

Recurring minimum for improving the estimate + 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, where a notion of “word” must be devised !
   » Inverted files, Signature files, Bitmaps.
 Full-text indexes, no constraint on text and queries !
   » Suffix Array, Suffix tree, String B-tree, ...

How do we solve Prefix Search?  A trie, or an array of string pointers !!
What about Substring Search ?

Basic notation and facts

Pattern P occurs at position i of T
   iff   P is a prefix of the i-th suffix of T (i.e. T[i,N])

Occurrences of P in T = all suffixes of T having P as a prefix

Example: P = si, T = mississippi   →   occurrences at positions 4, 7

SUF(T) = sorted set of suffixes of T

Reduction: from substring search to prefix search (over the suffixes of T)

The Suffix Tree

[Figure: the suffix tree of T# = mississippi#, a compacted trie of all suffixes; each leaf stores the starting position (1..12) of its suffix, and edges are labeled with substrings such as “ssi”, “si”, “ppi#”, “i#”.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

Storing SUF(T) explicitly takes Θ(N^2) space.
SA    SUF(T)
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#

T = mississippi#    (each SA entry is a suffix pointer)

Suffix Array space:
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern P = si
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step.
 If the middle suffix is smaller than P, go right; if it is larger, go left.

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char comparisons
 overall, O(p log2 N) time

 Improvable to O(p + log2 N)   [Manber-Myers, ’90]
 or to O(p + log2 |S|)          [Cole et al., ’06]
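A sketch of the indirect binary search over SA (naive SA construction, 0-based positions):

# Two binary searches delimit the SA range of suffixes having P as a prefix;
# each comparison looks at (at most) p characters of a suffix.
def sa_search(T, SA, P):
    n, p = len(SA), len(P)
    lo, hi = 0, n
    while lo < hi:                                   # leftmost suffix whose p-prefix >= P
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + p] < P:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    lo, hi = first, n
    while lo < hi:                                   # leftmost suffix whose p-prefix > P
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + p] <= P:
            lo = mid + 1
        else:
            hi = mid
    return sorted(SA[first:lo])                      # starting positions of the occurrences

T = "mississippi#"
SA = sorted(range(len(T)), key=lambda i: T[i:])      # naive construction
print(sa_search(T, SA, "si"))                        # [3, 6]  (1-based: 4 and 7, as in the slides)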

Locating the occurrences of P = si
Two binary searches delimit the interval of SA whose suffixes start with P: search for si# (P followed by the smallest char) and for si$ (P followed by the largest char), since # < every char of S < $.
Here the interval contains the suffixes sippi... and sissippi..., i.e. occ = 2 occurrences, at positions 4 and 7.

Suffix Array search: O(p + log2 N + occ) time

 Suffix Trays: O(p + log2 |S| + occ)   [Cole et al., ‘06]
 String B-tree                          [Ferragina-Grossi, ’95]
 Self-adjusting Suffix Arrays           [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest common prefix between suffixes adjacent in SA

SA    Lcp   SUF(T)
12     -    #
11     0    i#
 8     1    ippi#
 5     1    issippi#
 2     4    ississippi#
 1     0    mississippi#
10     0    pi#
 9     1    ppi#
 7     0    sippi#
 4     2    sissippi#
 6     1    ssippi#
 3     3    ssissippi#

T = mississippi#    (e.g. issippi# and ississippi# share a prefix of length 4)

• How long is the common prefix between T[i,...] and T[j,...] ?
  • It is the minimum of the subarray Lcp[h,k-1], where SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  • Search for an entry Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  • Search for a window Lcp[i,i+C-2] whose entries are all ≥ L
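The Lcp array can be computed in O(n) with Kasai's algorithm; a sketch (not covered explicitly in the slides):

# Kasai's algorithm: walk the suffixes in text order, reusing the previous LCP
# value minus one; lcp[i] = LCP(T[SA[i-1]:], T[SA[i]:]).
def kasai_lcp(T, SA):
    n = len(T)
    rank = [0] * n
    for i, s in enumerate(SA):
        rank[s] = i
    lcp, h = [0] * n, 0
    for s in range(n):
        if rank[s] > 0:
            j = SA[rank[s] - 1]
            while s + h < n and j + h < n and T[s + h] == T[j + h]:
                h += 1
            lcp[rank[s]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    return lcp

T = "mississippi#"
SA = sorted(range(len(T)), key=lambda i: T[i:])
lcp = kasai_lcp(T, SA)
print(lcp[1:])    # [0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3]
print(max(lcp))   # 4: the longest repeated substring ("issi") has length 4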


Slide 152

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
   C * p * f/(1+f),   which is at least 10^4 * f/(1+f)

If we fetch a page of B ≈ 4Kb in time C, and the algorithm uses all of it:
   (1/B) * (p * f/(1+f) * C)  ≈  30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms

[Memory hierarchy: CPU registers → caches L1/L2 (few Mbs, some nanosecs, few words fetched) → RAM (few Gbs, tens of nanosecs) → disk (few Tbs, few millisecs, B = 32K pages) → network (many Tbs, even secs, packets)]

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray

 Goal: Given a stock and its performance over time, find the time window in which it achieved the best “market performance”.
 Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Running times for input size n:
        4K    8K    16K   32K    128K   256K   512K   1M
 n^3    22s   3m    26m   3.5h   28h    --     --     --
 n^2    0     0     0     1s     26s    106s   7m     28m

An optimal solution
We assume every subsum ≠ 0

[A = a prefix of sum < 0, followed by the optimum window of sum > 0, ...]

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm
 sum = 0; max = -1;
 For i = 1,...,n do
   if (sum + A[i] ≤ 0) sum = 0;
   else { sum += A[i]; max = MAX{max, sum}; }

Note:
 • sum < 0 right before OPT starts;
 • sum > 0 within OPT
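A runnable version of the scan above (it assumes, like the slide, that the optimum has positive sum):

# One-pass max-subarray scan: reset the running sum when it would drop to <= 0,
# and keep the best sum seen so far. O(n) time, O(1) space.
def max_subarray_sum(A):
    best, s = float("-inf"), 0     # best stays -inf if no positive-sum window exists
    for x in A:
        if s + x <= 0:
            s = 0                  # the optimum never starts inside a negative prefix
        else:
            s += x
            best = max(best, s)
    return best

print(max_subarray_sum([2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]))   # 12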

Toy problem #2 : sorting
 How do we sort tuples (objects) on disk?

 Key observation: the array A is an “array of pointers to objects”; each object-to-object comparison A[i] vs A[j] costs 2 random accesses to the memory locations pointed by A[i] and A[j]
 MergeSort therefore performs Θ(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library (Berkeley DB):
 n insertions  data get distributed arbitrarily across the leaves !!!
 What about listing the tuples in order ?
 Possibly 10^9 random I/Os = 10^9 * 5ms  ≈ 2 months

Binary Merge-Sort

Merge-Sort(A, i, j)
01  if (i < j) then
02     m = (i+j)/2;              // Divide
03     Merge-Sort(A, i, m);      // Conquer
04     Merge-Sort(A, m+1, j);
05     Merge(A, i, m, j)         // Combine

Cost of Mergesort on large data

 Take Wikipedia in Italian and compute the word frequencies:
   n = 10^9 tuples  a few Gbs
   Typical disk (Seagate Cheetah 150Gb): seek time ~5ms

 Analysis of mergesort on disk:
   It is an indirect sort: Θ(n log2 n) random I/Os
   [5ms] * n log2 n  ≈ 1.5 years

In practice it is faster because of caching (each level is 2 passes, read/write)...

Merge-Sort Recursion Tree

[Figure: the log2 N levels of the mergesort recursion over small runs such as (2,10), (5,1), (13,19), (7,9), (4,15), (8,3), (12,17), (6,11), progressively merged into longer and longer sorted runs.]

If the run size is larger than B (i.e. already after the first level!!), fetching all of it in memory for merging does not help.

How do we deploy the disk/memory features ?
 With internal memory M: N/M runs, each sorted in internal memory (no extra I/Os)
 The I/O-cost for merging them pairwise is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort

 The key is to balance run size and the number of runs to merge
 Sort N items with main memory M and disk pages of B items:
   Pass 1: produce (N/M) sorted runs.
   Pass i: merge X ≤ M/B runs at a time   logM/B N/M passes

Multiway Merging

[Figure: X = M/B input buffers of B items (one per run) and one output buffer; repeatedly move min(Bf1[p1], Bf2[p2], …, BfX[pX]) to the output buffer Bfo, refill an input buffer from disk when its pointer pi reaches B, and flush Bfo to the merged run on disk when it is full.]
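The merging step can be sketched with a heap of the current buffer heads; this is an in-memory toy version, with a Python list per run standing in for a disk buffer.

# X-way merge: keep the head of each sorted run in a heap, pop the minimum and
# refill from the run it came from (mirrors the figure's buffer mechanics).
import heapq

def multiway_merge(runs):
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        val, i, j = heapq.heappop(heap)
        out.append(val)
        if j + 1 < len(runs[i]):
            heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
    return out

runs = [[1, 2, 5, 7, 9, 10], [2, 7, 8, 9, 13, 19], [3, 4, 11, 12, 13, 15]]
print(multiway_merge(runs))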

Cost of Multi-way Merge-Sort

 Number of passes = logM/B #runs ≤ logM/B N/M
 Optimal cost = Θ((N/B) logM/B N/M) I/Os

In practice:
 M/B ≈ 1000  #passes = logM/B N/M ≈ 1
 One multiway merge  2 passes = a few minutes (tuning depends on disk features)
 A large fan-out (M/B) decreases the number of passes
 Compression would decrease the cost of a pass!

Can compression help?
 Goal: enlarge M and reduce N
   #passes = O(logM/B N/M)
   Cost of a pass = O(N/B)

Part of Vitter’s paper addresses related issues:
 Disk striping: sorting easily on D disks
 Distribution sort: top-down sorting
 Lower bounds: how far down we can go

Toy problem #3: Top-freq elements

 Goal: top queries over a stream of N items (S large).
 Math Problem: find the item y whose frequency is > N/2, using the smallest space (i.e. assuming the mode occurs > N/2 times).

A = b a c c c d c b a a a c c b c c c

Algorithm
 Use a pair of variables <X, C>
 For each item s of the stream:
   if (X == s) then C++
   else { C--; if (C == 0) { X = s; C = 1; } }
 Return X;

Proof (sketch). If X ≠ y at the end, then every one of y's occurrences has a “negative” mate (an occurrence that cancelled it by decrementing the counter). Hence the mates are at least #occ(y), so 2 * #occ(y) ≤ N, contradicting #occ(y) > N/2. (If the mode occurs ≤ N/2 times, the returned X may be wrong.)
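A runnable version of the majority-vote scan (the final verification pass is left out, as in the slide):

# Streaming majority vote: one candidate X and one counter C; correct whenever
# some item really occurs more than N/2 times.
def majority_candidate(stream):
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1
        elif X == s:
            C += 1
        else:
            C -= 1
    return X          # should be verified with a second pass if the promise may fail

A = "bacccdcbaaaccbccc"
print(majority_candidate(A))   # c  (occurs 9 times out of 17)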

Toy problem #4: Indexing

 Consider the following TREC collection:
   N = 6 * 10^9 chars   size = 6Gb
   n = 10^6 documents
   TotT = 10^9 term occurrences (avg term length is 6 chars)
   t = 5 * 10^5 distinct terms

What kind of data structure do we build to support word-based searches ?

Solution 1: Term-Doc matrix      (n = 1 million docs, t = 500K terms)

            Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony            1                1              0          0       0        1
Brutus            1                1              0          1       0        0
Caesar            1                1              0          1       1        1
Calpurnia         0                1              0          0       0        0
Cleopatra         1                0              0          0       0        0
mercy             1                0              1          1       1        1
worser            1                0              1          1       1        0

1 if the play contains the word, 0 otherwise.    Space is 500Gb !

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string is (losslessly) compressed, some other must expand.
Take all messages of length n: is it possible to compress ALL of them into fewer bits ?
NO: they are 2^n, but the shorter compressed messages are fewer:

   Σ_{i=1}^{n-1} 2^i  =  2^n - 2

We need to talk about stochastic sources.

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the self-information of s is:

   i(s) = log2 (1 / p(s)) = - log2 p(s)

Lower probability  higher information.

Entropy is the weighted average of i(s):

   H(S) = Σ_{s∈S} p(s) * log2 (1 / p(s))   bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable-length code assigns a bit string (codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be uniquely decomposed into its codewords.

Prefix Codes
A prefix code is a variable-length code in which no codeword is a prefix of another one
e.g. a = 0, b = 100, c = 101, d = 11
It can be viewed as a binary trie: a at the leaf reached by 0, b by 100, c by 101, d by 11.

Average Length
For a code C with codeword lengths L[s], the average length is defined as

   La(C) = Σ_{s∈S} p(s) * L[s]

We say that a prefix code C is optimal if for all prefix codes C’, La(C) ≤ La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}
then pi < pj  ⟹  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any uniquely decodable code C, we have

   H(S) ≤ La(C)

Theorem (upper bound, Shannon). For any probability distribution, there exists a prefix code C such that

   La(C) ≤ H(S) + 1

(Shannon code: symbol s takes ⌈log2 1/p(s)⌉ bits)

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

Merge a(.1) and b(.2) into a node of weight .3; merge it with c(.2) into a node of weight .5; merge that with d(.5) into the root (1).
Resulting codewords: a = 000, b = 001, c = 01, d = 1

There are 2^(n-1) “equivalent” Huffman trees.
What about ties (and thus, tree depth) ?
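A heap-based sketch of the construction; ties are broken arbitrarily, so the codewords may differ from the slide's while having the same lengths.

# Huffman construction: repeatedly merge the two least-probable subtrees,
# prefixing their codewords with 0 and 1 respectively.
import heapq

def huffman_codes(probs):
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(sorted(probs.items()))]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)
        p1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c0.items()}
        merged.update({s: "1" + w for s, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

print(huffman_codes({"a": .1, "b": .2, "c": .2, "d": .5}))
# {'d': '0', 'c': '10', 'a': '110', 'b': '111'}: same lengths as a=000, b=001, c=01, d=1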

Encoding and Decoding
Encoding: emit the root-to-leaf path leading to the symbol to be encoded.
Decoding: start at the root and take the branch indicated by each bit received; when at a leaf, output its symbol and return to the root.

   abc...  00000101          101001...  dcb

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. Its self-information is

   -log2(.999) ≈ .00144 bits

If we were to send 1000 such symbols we might hope to use 1000 * .00144 ≈ 1.44 bits.
Using Huffman, we take at least 1 bit per symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra bits per symbol
 But a larger model has to be transmitted

Shannon took infinite sequences, and k  ∞ !!

In practice, we have:
 The model takes |S|^k * (k * log |S|) + h^2 bits (where h might be |S|)
 It is H0(S^L) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

Compress + Search ?                                   [Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the Huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged: 1 tag bit + 7 bits of Huffman code per byte

[Figure: T = “bzip or not bzip”; each word (bzip, or, not, space) gets a byte-aligned codeword in C(T), the first bit of each first byte being the tag.]

CGrep and other ideas...

Searching for P = bzip amounts to searching for its codeword (here 1a 0b) directly in the compressed text C(T), checking the tag bits so that matches are aligned to codeword boundaries.

[Figure: GREP over C(T) for T = “bzip or not bzip”: the two occurrences of bzip's codeword are accepted, the other candidates rejected.]

Search speed ≈ Compression ratio

You find this among my Software projects.

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary = { bzip, not, or, space };  S = “bzip or not bzip”

Given a pattern P (a dictionary term), find all of its occurrences in S by scanning the compressed text C(S) for P's codeword (here P = bzip = 1a 0b), using the tag bits to stay aligned to codeword boundaries.

Search speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the pattern P[1,m] in the text T[1,n].

 Naïve solution
   For any position i of T, check if T[i,i+m-1] = P[1,m]
   Complexity: O(nm) time

 (Classical) optimal solutions based on comparisons
   Knuth-Morris-Pratt
   Boyer-Moore
   Complexity: O(n + m) time

Semi-numerical pattern matching

 We show methods in which arithmetic and bit operations replace comparisons
 We will survey two examples of such methods:
   The Random Fingerprint method, due to Karp and Rabin
   The Shift-And method, due to Baeza-Yates and Gonnet

Rabin-Karp Fingerprint

 We will use a class of functions from strings to integers in order to obtain:
   An efficient randomized algorithm that makes an error with small probability.
   A randomized algorithm that never errs, whose running time is efficient with high probability.
 We will consider a binary alphabet (i.e., T ∈ {0,1}^n).

Arithmetic replaces Comparisons

 Strings are also numbers: H: strings → numbers.
 Let s be a string of length m:

   H(s) = Σ_{i=1}^{m} 2^(m-i) * s[i]

 P = 0101:  H(P) = 2^3*0 + 2^2*1 + 2^1*0 + 2^0*1 = 5
 s = s’ if and only if H(s) = H(s’)

Definition: let Tr denote the m-length substring of T starting at position r (i.e., Tr = T[r, r+m-1]).

Arithmetic replaces Comparisons

 Exact match = scan T and compare H(Tr) with H(P): there is an occurrence of P starting at position r of T if and only if H(P) = H(Tr)

T = 10110101, P = 0101, H(P) = 5:
   H(T2) = H(0110) = 6 ≠ H(P);    H(T5) = H(0101) = 5 = H(P)   →   Match!

 We can compute H(Tr) from H(Tr-1):

   H(Tr) = 2 * H(Tr-1) - 2^m * T(r-1) + T(r+m-1)

T = 10110101:  T1 = 1011, T2 = 0110
   H(T1) = H(1011) = 11
   H(T2) = 2*11 - 2^4*1 + 0 = 22 - 16 = 6 = H(0110)

Arithmetic replaces Comparisons

 A simple efficient algorithm: compute H(P) and H(T1); run over T computing H(Tr) from H(Tr-1) in constant time, and compare H(P) with H(Tr).
 Total running time O(n+m)?  NO! Why?
   When m is large it is unreasonable to assume that each arithmetic operation takes O(1) time: the values of H() are m-bit numbers, in general too BIG to fit in a machine word.

 IDEA! Use modular arithmetic: for some prime q, the Karp-Rabin fingerprint of a string s is defined by Hq(s) = H(s) (mod q)

An example: P = 101111, q = 7.   H(P) = 47, Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally, one bit at a time:
   1
   1*2 + 0 = 2 (mod 7)
   2*2 + 1 = 5 (mod 7)
   5*2 + 1 = 11 = 4 (mod 7)
   4*2 + 1 = 9 = 2 (mod 7)
   2*2 + 1 = 5 (mod 7) = Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1), using 2^m (mod q) = 2 * (2^(m-1) (mod q)) (mod q).
Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint

 How about the comparisons?
   Arithmetic: there is an occurrence of P starting at position r of T if and only if H(P) = H(Tr)
   Modular arithmetic: if there is an occurrence of P starting at position r of T then Hq(P) = Hq(Tr).
   False match! There are values of q for which the converse is not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!

 Our goal will be to choose a modulus q such that
   q is small enough to keep computations efficient (i.e., Hq() values fit in a machine word)
   q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm
 Choose a positive integer I
 Pick a random prime q ≤ I, and compute P’s fingerprint Hq(P).
 For each position r in T, compute Hq(Tr) and test whether it equals Hq(P). If the numbers are equal, either
   declare a probable match (randomized algorithm), or
   check and declare a definite match (deterministic algorithm)

 Running time: excluding verification, O(n+m).
 The randomized algorithm is correct w.h.p.
 The deterministic algorithm has expected running time O(n+m)

Proof on the board
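A sketch of the deterministic variant for a binary text; the prime q below is an arbitrary choice for illustration, not the one prescribed by the analysis.

# Karp-Rabin scan: fingerprints mod a prime q, rolled in O(1) per shift;
# equal fingerprints are verified to rule out false matches.
def karp_rabin(T, P, q=2_147_483_647):
    n, m = len(T), len(P)
    if m > n:
        return []
    hp = ht = 0
    for i in range(m):
        hp = (2 * hp + int(P[i])) % q
        ht = (2 * ht + int(T[i])) % q
    top = pow(2, m - 1, q)                      # 2^(m-1) mod q
    occ = []
    for r in range(n - m + 1):
        if hp == ht and T[r:r + m] == P:        # verify to avoid false matches
            occ.append(r)
        if r + m < n:                           # roll the window one position
            ht = ((ht - int(T[r]) * top) * 2 + int(T[r + m])) % q
    return occ

print(karp_rabin("10110101", "0101"))           # [4]  (0-based; position 5 in the slides)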

Problem 1: Solution

[Same compressed-matching picture as before: P = bzip = 1a 0b is searched directly in C(S), S = “bzip or not bzip”; the tag bits keep the scan codeword-aligned.]

Search speed ≈ Compression ratio

The Shift-And method

 Define M to be a binary m x n matrix such that:
   M(i,j) = 1 iff the first i characters of P exactly match the i characters of T ending at position j, i.e. P[1…i] = T[j-i+1…j]

 Example: T = california, P = for.  M has a 1 in cells (1,5) (f matches), (2,6) (fo) and (3,7) (for); all the other cells are 0.

How does M solve the exact match problem?  P occurs at position j-m+1 of T iff M(m,j) = 1.

How to construct M

 We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one:
   machines can perform bit and arithmetic operations between two words in constant time.
 Examples:
   And(A,B) is the bit-wise and between A and B.
   BitShift(A) is the value obtained by shifting A’s bits down by one position and setting the first bit to 1 (e.g. BitShift((0,1,1,0,1,1)) = (1,0,1,1,0,1)).

 Let w be the word size (e.g., 32 or 64 bits). We’ll assume m = w; NOTICE: any column of M fits in a memory word.

How to construct M

 We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one.
 We define the m-length binary vector U(x) for each character x of the alphabet: U(x) is set to 1 in the positions where x appears in P.

 Example, P = abaac:
   U(a) = (1,0,1,1,0)
   U(b) = (0,1,0,0,0)
   U(c) = (0,0,0,0,1)

How to construct M

 Initialize column 0 of M to all zeros
 For j > 0, the j-th column is obtained as

   M(j) = BitShift( M(j-1) )  &  U( T[j] )

 For i > 1, entry M(i,j) = 1 iff
   (1) the first i-1 characters of P match the i-1 characters of T ending at position j-1   ⇔  M(i-1,j-1) = 1
   (2) P[i] = T[j]   ⇔  the i-th bit of U(T[j]) = 1
 BitShift moves bit M(i-1,j-1) into the i-th position; AND-ing with the i-th bit of U(T[j]) establishes whether both conditions hold.
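A bit-parallel sketch where each column M(j) is kept as an integer (it assumes m ≤ w, as above):

# Shift-And: bit i-1 of the integer M is M(i, j); a set top bit means a match.
def shift_and(T, P):
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)           # bit i corresponds to position i+1 of P
    M, top = 0, 1 << (len(P) - 1)
    occ = []
    for j, c in enumerate(T):
        M = ((M << 1) | 1) & U.get(c, 0)        # BitShift(M) AND U(T[j])
        if M & top:
            occ.append(j - len(P) + 1)          # occurrence ending at position j
    return occ

print(shift_and("xabxabaaca", "abaac"))          # [4]: P occurs at T[4..8] (0-based)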

An example: T = xabxabaaca, P = abaac

 j=1: M(1) = BitShift(M(0)) & U(T[1]) = (1,0,0,0,0) & U(x) = (0,0,0,0,0)
 j=2: M(2) = BitShift(M(1)) & U(T[2]) = (1,0,0,0,0) & U(a) = (1,0,0,0,0)
 j=3: M(3) = BitShift(M(2)) & U(T[3]) = (1,1,0,0,0) & U(b) = (0,1,0,0,0)
 ...
 j=9: M(9) = BitShift(M(8)) & U(T[9]) = (1,1,0,0,1) & U(c) = (0,0,0,0,1)

The full matrix M (rows i = 1..5, columns j = 1..10):

      j: 1 2 3 4 5 6 7 8 9 10
 i=1:    0 1 0 0 1 0 1 1 0 1
 i=2:    0 0 1 0 0 1 0 0 0 0
 i=3:    0 0 0 0 0 0 1 0 0 0
 i=4:    0 0 0 0 0 0 0 1 0 0
 i=5:    0 0 0 0 0 0 0 0 1 0

M(5,9) = 1: P occurs in T ending at position 9.

Shift-And method: Complexity

 If m ≤ w, any column and any vector U() fit in one memory word  any step requires O(1) time.
 If m > w, any column and any vector U() can be divided into ⌈m/w⌉ memory words  any step requires O(m/w) time.
 Overall O(n(1 + m/w) + m) time.
 Thus it is very fast when the pattern length is close to the word size: very often true in practice, since w = 64 bits in modern architectures.

Some simple extensions

 We want to allow the pattern to contain special symbols, like the character class [a-f].
 Example, P = [a-b]baac:
   U(a) = (1,0,1,1,0)
   U(b) = (1,1,0,0,0)
   U(c) = (0,0,0,0,1)
 What about ‘?’ and ‘[^…]’ (negation)?

Problem 1: Another solution

[Again the compressed text C(S) of S = “bzip or not bzip”, with dictionary {bzip, not, or, space}: the codeword of P = bzip = 1a 0b can also be searched with the Shift-And machinery.]

Search speed ≈ Compression ratio

Problem 2
Dictionary = { bzip, not, or, space };  S = “bzip or not bzip”;  P = o

Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring (here: not = 1g 0g 0a and or = 1g 0a 0b).

Search speed ≈ Compression ratio?  No! Why?
A scan of C(S) is needed for each term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1, P2, …, Pl} of total length m, we want to find all the occurrences of those patterns in the text T[1,n].

 Naïve solution
   Use an (optimal) exact-matching algorithm, searching each pattern of P separately
   Complexity: O(nl + m) time, not good with many patterns

 Optimal solution, due to Aho and Corasick
   Complexity: O(n + l + m) time

A simple extension of Shift-And

 S is the concatenation of the patterns in P
 R is a bitmap of length m: R[i] = 1 iff S[i] is the first symbol of a pattern
 Use a variant of the Shift-And method searching for S:
   For any symbol c, U’(c) = U(c) AND R, so U’(c)[i] = 1 iff S[i] = c and it is the first symbol of a pattern
   For any step j: compute M(j), then OR it with U’(T[j]). Why? To set to 1 the first bit of each pattern that starts with T[j].
   Check whether there are occurrences ending in j. How? Test the bits of M(j) at the last position of each pattern.

Problem 3
Dictionary = { bzip, not, or, space };  S = “bzip or not bzip”;  P = bot, k = 2

Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring while allowing at most k mismatches.

Agrep: Shift-And method with errors

 We extend the Shift-And method for finding inexact occurrences of a pattern in a text.
 Example: T = aatatccacaa, P = atcgaa.
   P appears in T with 2 mismatches starting at position 4; it also occurs with 4 mismatches starting at position 2.

Agrep

 Our current goal: given k, find all the occurrences of P in T with up to k mismatches.
 We define the matrix M^l to be an m x n binary matrix such that:
   M^l(i,j) = 1 iff there are no more than l mismatches between the first i characters of P and the i characters of T ending at position j.
 What is M^0?  It is the matrix M of the exact Shift-And method.
 How does M^k solve the k-mismatch problem?  P occurs at position j-m+1 with at most k mismatches iff M^k(m,j) = 1.

Computing M^k

 We compute M^l for all l = 0, … , k; for each j we compute M^0(j), M^1(j), … , M^k(j); for all l we initialize M^l(0) to the zero vector.
 In order to compute M^l(j), we observe that there is a match iff one of the two cases below holds.

Case 1: the first i-1 characters of P match a substring of T ending at j-1 with at most l mismatches, and the next pair of characters of P and T are equal:

   BitShift( M^l(j-1) ) & U( T[j] )

Case 2: the first i-1 characters of P match a substring of T ending at j-1 with at most l-1 mismatches (one mismatch is spent at position i):

   BitShift( M^(l-1)(j-1) )

Putting it together:

   M^l(j) = [ BitShift( M^l(j-1) ) & U( T[j] ) ]  OR  BitShift( M^(l-1)(j-1) )

Example
T = xabxabaaca, P = abaad, k = 1

[Tables of M^0 and M^1 (rows i = 1..5, columns j = 1..10): M^0 is the exact Shift-And matrix, while M^1 additionally has 1s wherever a single mismatch suffices; M^1(5,9) = 1 signals a 1-mismatch occurrence of abaad ending at position 9.]

How much do we pay?

 The running time is O(kn(1 + m/w))
 Again, the method is practically efficient for small m.
 Only O(k) columns of M are needed at any given time, hence the space used by the algorithm is O(k) memory words.
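A sketch of the k-mismatch extension, directly implementing the recurrence above:

# Bit-parallel k-mismatch Shift-And: M[l] is the column allowing up to l mismatches.
def shift_and_k_mismatch(T, P, k):
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    top = 1 << (len(P) - 1)
    M = [0] * (k + 1)
    occ = []
    for j, c in enumerate(T):
        prev = M[:]                                      # the columns at position j-1
        M[0] = ((prev[0] << 1) | 1) & U.get(c, 0)
        for l in range(1, k + 1):
            exact = ((prev[l] << 1) | 1) & U.get(c, 0)   # case 1: characters match
            mism = (prev[l - 1] << 1) | 1                # case 2: spend one mismatch here
            M[l] = exact | mism
        if M[k] & top:
            occ.append(j - len(P) + 1)
    return occ

print(shift_and_k_mismatch("aatatccacaa", "atcgaa", 2))  # [3]: the 2-mismatch occurrence at position 4 (1-based)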

Problem 3: Solution

[As before, scan the compressed text C(S) of S = “bzip or not bzip” with the k-mismatch Shift-And automaton: for P = bot and k = 2, the codeword of not = 1g 0g 0a is reported.]

Agrep: more sophisticated operations

 The Shift-And method can solve other operations:
   The edit distance between two strings p and s is d(p,s) = the minimum number of operations needed to transform p into s via three ops:
     Insertion: insert a symbol in p
     Deletion: delete a symbol from p
     Substitution: change a symbol of p into a different one
   Example: d(ananas, banane) = 3
 Search by regular expressions
   Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts on some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword lengths, and thus to build the tree…
This may be extremely time/space costly when you deal with Gbs of textual data.
A simple algorithm: sort the pi in decreasing order, and encode si via the variable-length code for the integer i.

γ-code (gamma code) for integer encoding
 γ(x) = (Length - 1) zeros followed by x in binary, where x > 0 and Length = ⌊log2 x⌋ + 1
 e.g., 9 is represented as <000,1001>.
 The γ-code for x takes 2⌊log2 x⌋ + 1 bits (i.e. a factor of 2 from optimal)
 It is optimal for Pr(x) = 1/(2x^2) and i.i.d. integers
 It is a prefix-free encoding…

Exercise: given the following sequence of γ-coded integers, reconstruct the original sequence:

   0001000001100110000011101100111    →    8  6  3  59  7
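A γ-code encode/decode sketch that reproduces the exercise above:

# Gamma coding: floor(log2 x) zeros, then x in binary (its leading 1 acts as the
# terminator), for a total of 2*floor(log2 x) + 1 bits.
def gamma_encode(x):
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":            # count the unary length prefix
            z += 1
            i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out

code = "".join(gamma_encode(x) for x in [8, 6, 3, 59, 7])
print(code)                              # 0001000001100110000011101100111
print(gamma_decode(code))                # [8, 6, 3, 59, 7]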

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).

Recall that |γ(i)| ≤ 2 * log i + 1.  How good is this approach w.r.t. Huffman?
Key fact:  1 ≥ Σ_{j=1,...,i} pj ≥ i * pi   ⟹   i ≤ 1/pi

The cost of the encoding is therefore

   Σ_{i=1,...,|S|} pi * |γ(i)|   ≤   Σ_{i=1,...,|S|} pi * [ 2 * log(1/pi) + 1 ]   =   2 * H0(X) + 1

i.e. compression ratio ≤ 2 * H0(s) + 1: not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding

 Byte-aligned and tagged Huffman
   128-ary Huffman tree
   The first bit of the first byte is tagged
   Configurations on 7 bits: just those of Huffman

 End-tagged dense code
   The rank r is mapped to the r-th binary sequence on 7*k bits
   The first bit of the last byte is tagged

Surprising changes:
 It is a prefix code
 Better compression: it uses all the 7-bit configurations

(s,c)-dense codes
The distribution of words is skewed: 1/i^q, where 1 < q < 2
 A new concept: Continuers vs Stoppers; previously we used s = c = 128
 The main idea is:
   s + c = 256 (we are playing with 8 bits)
   Thus s items are encoded with 1 byte, s*c with 2 bytes, s*c^2 with 3 bytes, ...

An example
 5000 distinct words
 ETDC encodes 128 + 128^2 = 16512 words on at most 2 bytes
 A (230,26)-dense code encodes 230 + 230*26 = 6210 words on at most 2 bytes, hence more words on 1 byte: better if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms…. Can we do everything in one pass ?

 Move-to-Front (MTF):
   as a freq-sorting approximator
   as a caching strategy
   as a compressor
 Run-Length Encoding (RLE):
   FAX compression

Move-to-Front Coding
Transforms a char sequence into an integer sequence, that can then be var-length coded:
 Start with the list of symbols L = [a,b,c,d,…]
 For each input symbol s: (1) output the position of s in L; (2) move s to the front of L.

There is a memory. Properties:
 It exploits temporal locality, and it is dynamic
 X = 1^n 2^n 3^n … n^n   ⟹   Huff = Θ(n^2 log n) bits, MTF = O(n log n) + n^2 bits
 Not much worse than Huffman ... but it may be far better

MTF: how good is it ?
Encode the output integers via γ-coding: |γ(i)| ≤ 2 * log i + 1.
Put S at the front and consider the cost of encoding (p_i^x = position of the i-th occurrence of symbol x):

   O(|S| log |S|)  +  Σ_{x=1}^{|S|}  Σ_{i=2}^{n_x}  |γ( p_i^x - p_{i-1}^x )|

By Jensen’s inequality:

   ≤ O(|S| log |S|)  +  Σ_{x=1}^{|S|} n_x * [ 2 * log(N/n_x) + 1 ]
   = O(|S| log |S|)  +  N * [ 2 * H0(X) + 1 ]

   ⟹   La[mtf] ≤ 2 * H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and separators) as the symbols to be encoded.
How to keep the MTF-list efficiently:
 Search tree
   Leaves contain the symbols, ordered as in the MTF-list
   Nodes contain the size of their descending subtree
 Hash table
   key is a symbol
   data is a pointer to the corresponding tree leaf
 Each tree operation takes O(log |S|) time
 Total is O(n log |S|), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
   abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just the run lengths and one bit.
There is a memory. Properties:
 It exploits spatial locality, and it is a dynamic code
 X = 1^n 2^n 3^n … n^n   ⟹   Huff(X) = Θ(n^2 log n)  >  Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
It allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as an option), Bzip.
More time-costly than Huffman, but the integer implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol an interval inside [0, 1): its width is p(s) and it starts at the cumulative probability

   f(i) = Σ_{j=1}^{i-1} p(j)

e.g. with p(a) = .2, p(b) = .5, p(c) = .3:  f(a) = 0, f(b) = .2, f(c) = .7, so a = [0,.2), b = [.2,.7), c = [.7,1).
The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7)).

Arithmetic Coding: Encoding Example
Coding the message sequence bac:
   start with [0, 1);  after b: [0.2, 0.7);  after a: [0.2, 0.3);  after c: [0.27, 0.3).
The final sequence interval is [.27, .3).

Arithmetic Coding
To code a sequence of symbols c_1 c_2 … c_n with probabilities p[c], use the following recurrences:

   l_0 = 0,   l_i = l_(i-1) + s_(i-1) * f[c_i]
   s_0 = 1,   s_i = s_(i-1) * p[c_i]

where f[c] is the cumulative probability up to symbol c (excluded). The final interval size is

   s_n = Π_{i=1}^{n} p[c_i]

The interval for a message sequence will be called the sequence interval.
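A floating-point sketch of the sequence-interval recurrences (for illustration only; a real coder uses the integer/scaling version described below):

# Compute the sequence interval [l, l+s) of a message, given symbol probabilities.
def sequence_interval(msg, p):
    symbols = sorted(p)
    f, acc = {}, 0.0
    for s in symbols:
        f[s] = acc                      # cumulative probability up to s (excluded)
        acc += p[s]
    l, size = 0.0, 1.0
    for c in msg:
        l = l + size * f[c]
        size = size * p[c]
    return l, l + size

print(sequence_interval("bac", {"a": .2, "b": .5, "c": .3}))   # ~ (0.27, 0.3), as in the example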

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing that the message has length 3:
   .49 ∈ [.2,.7) = b's interval  →  output b, rescale to (.49-.2)/.5 = .58
   .58 ∈ [.2,.7)                 →  output b, rescale to (.58-.2)/.5 = .76
   .76 ∈ [.7,1)                  →  output c
The message is bbc.

Representing a real number
Binary fractional representation:
   .75 = .11      1/3 = .010101…      11/16 = .1011

Algorithm (to emit the bits of x ∈ [0,1)):
 1. x = 2 * x
 2. if x < 1 output 0
 3. else x = x - 1; output 1

So how about just using the shortest binary fractional representation inside the sequence interval?
e.g. [0,.33) → .01     [.33,.66) → .1     [.66,1) → .11

Representing a code interval
Binary fractional numbers can be viewed as intervals by considering all their completions:

   code    min      max      interval
   .11     .110     .111     [.75, 1.0)
   .101    .1010    .1011    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval is contained in the sequence interval (a dyadic number).
Example: sequence interval [.61, .79), code interval (.101) = [.625, .75).
One can use L + s/2 truncated to 1 + ⌈log2(1/s)⌉ bits.

Bound on Arithmetic length
Note that -log s + 1 = log(2/s).

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

   1 + ⌈log(1/s)⌉ = 1 + ⌈log Π_i (1/p_i)⌉
                  ≤ 2 + Σ_{j=1,n} log(1/p_j)
                  = 2 + Σ_{k=1,|S|} n*p_k * log(1/p_k)
                  = 2 + n H0   bits

In practice it takes nH0 + 0.02 n bits, because of rounding.

Integer Arithmetic Coding
The problem is that operations on arbitrary-precision real numbers are expensive.
Key ideas of the integer version:
 Keep the integers in the range [0..R) where R = 2^k
 Use rounding to generate the integer intervals
 Whenever the sequence interval falls into the top, bottom or middle half, expand the interval by a factor of 2

Integer Arithmetic is an approximation.

Integer Arithmetic (scaling):
 If l ≥ R/2 (top half): output 1 followed by m 0s; m = 0; the interval is expanded by 2
 If u < R/2 (bottom half): output 0 followed by m 1s; m = 0; the interval is expanded by 2
 If l ≥ R/4 and u < 3R/4 (middle half): increment m; the interval is expanded by 2
 In all the other cases, just continue...
You find this at

Arithmetic ToolBox
As a state machine: the ATB maps a state (L,s) and a symbol c with distribution (p1,....,pS) into the new state (L’,s’), namely the sub-interval of [L, L+s) assigned to c.

Therefore, even the distribution can change over time.

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
The symbol s (either a char c or esc) is fed to the ATB with probability p[s | context], moving the state from (L,s) to (L’,s’).
Encoder and Decoder must know the protocol for selecting the same conditional probability distribution (PPM variant).

PPM: Example Contexts   (k = 2)
String = ACCBACCACBAB

Context ∅:      A = 4   B = 2   C = 5   $ = 3

Context A:      C = 3   $ = 1
Context B:      A = 2   $ = 1
Context C:      A = 1   B = 2   C = 2   $ = 3

Context AC:     B = 1   C = 2   $ = 2
Context BA:     C = 1   $ = 1
Context CA:     C = 1   $ = 1
Context CB:     A = 2   $ = 1
Context CC:     A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
The dictionary is the text seen so far (all substrings starting before the cursor); e.g. a possible output triple is <2,3,c> (copy 3 chars from 2 positions back, then emit c).

Algorithm’s step:
 Output <d, len, c> where
   d = distance of the copied string w.r.t. the current position
   len = length of the longest match
   c = next char in the text beyond the longest match
 Advance by len + 1

A buffer “window” has fixed length and moves over the text.
Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6. At each step the triple reports the longest match within the window plus the next character.
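A tiny LZ77 sketch that reproduces the triples of the example above (illustrative only, not gzip's actual encoder):

# LZ77 with a sliding window: emit (distance, length, next-char) triples;
# the copy source may overlap the cursor, as in the decoding discussion below.
def lz77_encode(s, window=6):
    i, out = 0, []
    while i < len(s):
        best_d, best_len = 0, 0
        start = max(0, i - window)
        for j in range(start, i):                        # candidate copy sources
            l = 0
            while i + l < len(s) - 1 and s[j + l] == s[i + l]:
                l += 1
            if l > best_len:
                best_d, best_len = i - j, l
        nxt = s[i + best_len]
        out.append((best_d, best_len, nxt))
        i += best_len + 1
    return out

print(lz77_encode("aacaacabcabaaac"))
# [(0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), (3, 3, 'a'), (1, 2, 'c')]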

LZ77 Decoding
The decoder keeps the same dictionary window as the encoder:
 it finds the substring and inserts a copy of it.
What if len > d? (i.e. the copy overlaps the text still to be written)
 E.g. seen = abcd, next codeword is (2,9,e):
 simply copy starting at the cursor:
   for (i = 0; i < len; i++) out[cursor+i] = out[cursor-d+i]
 The output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: output one of the following formats: (0, position, length) or (1, char)
 It typically uses the second format if length < 3.
 Special greedy: possibly use a shorter match so that the next match is better
 Hash table to speed up the searches on triplets
 Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Q(n2 log n) time in the worst-case
• Q(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…

 Physical network graph
   V = Routers
   E = communication links

 The “cosine” graph (undirected, weighted)
   V = static web pages
   E = semantic distance between pages

 Query-Log graph (bipartite, weighted)
   V = queries and URLs
   E = (q,u) if u is a result for q that has been clicked by some user who issued q

 Social graph (undirected, unweighted)
   V = users
   E = (x,y) if x knows y (facebook, address book, email, ..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in-degree(u) = k ]  ∝  1 / k^a ,    a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph

[figure: adjacency-matrix plot (axes i, j) over 21 million pages and 150 million links; after URL-sorting, pages of the same host (e.g. Berkeley, Stanford) lie close to each other]

URL-sorting  URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s_1 - x, s_2 - s_1 - 1, ..., s_k - s_{k-1} - 1}

For negative entries (only the first gap s_1 - x may be negative): map v ≥ 0 to 2v and v < 0 to 2|v| - 1, as in the residual examples below.
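A minimal Python sketch of this gap transformation (the successor list in the comment is illustrative, not from the slides; the mapping of a possibly negative first gap follows the positive/negative rule above):

def to_nonneg(v):
    # v >= 0 -> 2v ;  v < 0 -> 2|v| - 1
    return 2 * v if v >= 0 else 2 * (-v) - 1

def gap_encode(x, successors):
    # successors must be sorted; the first gap is taken w.r.t. the source node x
    gaps = [to_nonneg(successors[0] - x)]
    gaps += [s - p - 1 for p, s in zip(successors, successors[1:])]
    return gaps

# gap_encode(15, [13, 15, 16, 17, 18, 19, 23, 24, 203])
#   == [3, 1, 0, 0, 0, 0, 3, 0, 178]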

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



What if the sender has never seen the data at the receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization

 Delta compression   [diff, zdelta, REBL, …]
   Compress file f deploying file f’
   Compress a group of files
   Speed up web access by sending differences between the requested page and the ones available in cache

 File synchronization   [rsync, zsync]
   Client updates an old file f_old with the f_new available on a server
   Mirroring, Shared Crawling, Content Distribution Networks

 Set reconciliation
   Client updates a structured old file f_old with the f_new available on a server
   Update of contacts or appointments, intersect inverted lists in a P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides an efficient, optimal solution:
   f_known is the “previously encoded text”; compress the concatenation f_known·f_new, starting from f_new

zdelta is one of the best implementations
            Emacs size   Emacs time
 uncompr    27Mb         ---
 gzip       8Mb          35 secs
 zdelta     1.5Mb        42 secs
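A toy Python sketch of the idea (not zdelta itself): copy operations refer to f_known, literals cover what cannot be copied; a real implementation would index f_known with a hash table of substrings instead of this quadratic scan:

def delta_encode(f_known, f_new, min_copy=4):
    # encode f_new as ('C', offset, length) copies from f_known plus ('L', char) literals
    ops, i = [], 0
    while i < len(f_new):
        off, ln = 0, 0
        for o in range(len(f_known)):              # quadratic scan: illustration only
            l = 0
            while o + l < len(f_known) and i + l < len(f_new) and f_known[o + l] == f_new[i + l]:
                l += 1
            if l > ln:
                off, ln = o, l
        if ln >= min_copy:
            ops.append(('C', off, ln)); i += ln
        else:
            ops.append(('L', f_new[i])); i += 1
    return ops
# decoding just concatenates f_known[off:off+ln] for copies and the literal chars, in order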

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

[figure: Client  client-side proxy  (delta-encoded pages over the slow link)  server-side Proxy  (fast link)  web; both proxies keep the reference (old) version of the requested page]

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[figure: example weighted graph G_F with a dummy node 0 connected to all files; edge weights are zdelta sizes (e.g. 20, 123, 220, 620, 2000) and gzip sizes on the edges leaving the dummy node; the min branching is highlighted]

           space   time
 uncompr   30Mb    ---
 tgz       20%     linear
 THIS      8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly: n² edge calculations (zdelta executions)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
           space   time
 uncompr   260Mb   ---
 tgz       12%     2 mins
 THIS      8%      16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
[figure: the Client, holding f_old, sends block hashes to the Server; the Server, holding f_new, replies with an encoded file made of block references and literals]

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
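A minimal Python sketch of the block-matching idea (an illustration of the scheme, not the rsync implementation): a weak additive checksum stands in for the 4-byte rolling hash and is recomputed at every position instead of being rolled, and hashlib.md5 plays the strong hash; f_old and f_new are bytes.

import hashlib

def signatures(f_old, B):
    # client side: a weak and a strong hash per block of f_old
    sigs = {}
    for off in range(0, len(f_old), B):
        blk = f_old[off:off + B]
        sigs.setdefault(sum(blk) % 65536, []).append((hashlib.md5(blk).digest(), off))
    return sigs

def encode(f_new, sigs, B):
    # server side: emit ('B', offset-in-f_old) for matching blocks, else literal bytes
    out, i = [], 0
    while i < len(f_new):
        blk = f_new[i:i + B]
        cand = sigs.get(sum(blk) % 65536, [])
        hit = next((off for md5, off in cand if md5 == hashlib.md5(blk).digest()), None)
        if hit is not None and len(blk) == B:
            out.append(('B', hit)); i += B
        else:
            out.append(('L', f_new[i])); i += 1
    return out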

Rsync: some experiments

           gcc size   emacs size
 total     27288      27326
 gzip      7563       8577
 zdelta    227        1431
 rsync     964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends hashes (unlike the client in rsync), and the client checks them.
The server deploys the common f_ref to compress the new f_tar (rsync just compresses it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

[figure: P aligned at position i of T, i.e. against the suffix T[i,N]]

Occurrences of P in T = All suffixes of T having P as a prefix

Example: P = si, T = mississippi   occurrences at positions 4, 7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree

[figure: the suffix tree of T# = mississippi#, with edges labeled by substrings (#, i, s, si, ssi, ppi#, pi#, mississippi#, ...) and its 12 leaves storing the starting positions 1..12 of the corresponding suffixes]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Storing SUF(T) explicitly takes Θ(N²) space.

 SA    SUF(T)
 12    #
 11    i#
  8    ippi#
  5    issippi#
  2    ississippi#
  1    mississippi#
 10    pi#
  9    ppi#
  7    sippi#
  4    sissippi#
  6    ssippi#
  3    ssissippi#

T = mississippi#   (SA stores one suffix pointer per suffix)

Suffix Array
• SA: Θ(N log₂ N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern

Indirect binary search on SA: O(p) time per suffix comparison, 2 random accesses per step.

[figure: binary search for P = si over the SA of T = mississippi#; at each step P is compared with the suffix pointed to by the middle SA entry, moving right if P is larger and left if P is smaller]

Suffix Array search
• O(log₂ N) binary-search steps
• Each step takes O(p) char comparisons
 overall, O(p log₂ N) time,
   improvable to O(p + log₂ N) [Manber-Myers, ’90] and to O(p + log₂ |Σ|) [Cole et al., ’06]

Locating the occurrences

[figure: the SA range of all suffixes prefixed by P = si is delimited by binary-searching for si# and si$; for T = mississippi# it contains the entries 7 (sippi...) and 4 (sissippi...), hence occ = 2]

Suffix Array search
• O(p + log₂ N + occ) time   (with # < Σ < $)

Suffix Trays: O(p + log₂ |Σ| + occ)   [Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp:      0  0  1  4  0  0  1  0  2  1  3
SA:   12 11  8  5  2  1 10  9  7  4  6  3

T = mississippi#
[e.g. the adjacent suffixes issippi# (pos 5) and ississippi# (pos 2) share the prefix “issi”, of length 4]
• How long is the common prefix between T[i,...] and T[j,...] ?
  • Min of the subarray Lcp[h,k-1], where SA[h]=i and SA[k]=j.
• Is there a repeated substring of length ≥ L ?
  • Search for some Lcp[i] ≥ L.
• Is there a substring of length ≥ L occurring ≥ C times ?
  • Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.
(a small sketch of the last two queries follows)
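A small Python sketch of those two queries, assuming 0-based SA and Lcp arrays with Lcp[i] = lcp of the suffixes in SA[i] and SA[i+1] (names are mine):

def longest_repeated_substring(T, SA, Lcp):
    i = max(range(len(Lcp)), key=lambda i: Lcp[i])
    return T[SA[i]:SA[i] + Lcp[i]]               # repeated iff Lcp[i] > 0

def has_substring_repeated(Lcp, L, C):
    # is there a substring of length >= L occurring >= C times?  (assumes C >= 2)
    w = C - 1
    return any(min(Lcp[i:i + w]) >= L for i in range(len(Lcp) - w + 1))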


Slide 153


The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)

This is at least 10^4 * f/(1+f)

If we fetch B ≈ 4Kb in time C, and the algorithm uses all of it:
   (1/B) * (p * f/(1+f) * C)  ≈  30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
[memory hierarchy: CPU registers; L1/L2 cache (few MBs, some nanosecs, few words fetched); RAM (few GBs, tens of nanosecs, some words fetched); HD (few TBs, few millisecs, B = 32K page); net (many TBs, even secs, packets)]

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock and its Δ-performance over time, find the time window in which it achieved the best “market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Running times of the Θ(n³) and Θ(n²) algorithms on growing inputs:

        4K    8K    16K    32K    128K   256K   512K   1M
 n³     22s   3m    26m    3.5h   28h    --     --     --
 n²     0     0     0      1s     26s    106s   7m     28m

An optimal solution
We assume every subsum≠0

[figure: A is split into a prefix of negative sum (< 0) followed by the optimum window (> 0)]
A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT
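A runnable Python version of this scan, without the non-zero-subsum assumption (a Kadane-style variant, not literally the slide's pseudocode): it keeps the best sum seen so far and restarts the running sum whenever extending it is worse than starting afresh.

def max_subarray(A):
    best, run = float('-inf'), 0
    for x in A:
        run = max(run + x, x)     # extend the current window or restart at x
        best = max(best, run)
    return best

# max_subarray([2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]) == 12   (the window 6 1 -2 4 3)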

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort  Θ(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 10^9 random I/Os = 10^9 * 5ms  ≈ 2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01  if (i < j) then
02     m = (i+j)/2;           // Divide
03     Merge-Sort(A,i,m);     // Conquer
04     Merge-Sort(A,m+1,j);
05     Merge(A,i,m,j)         // Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:
   n = 10^9 tuples   few Gbs
   Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:
   It is an indirect sort: Θ(n log₂ n) random I/Os
   [5ms] * n log₂ n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

[figure: the recursion tree of mergesort, with log₂ N levels of runs merged pairwise; each level reads and writes the whole dataset]

If the run-size is larger than B (i.e. after the first step!!), fetching all of it in memory for merging does not help.

How do we deploy the disk/memory features?
 Produce N/M runs, each sorted in internal memory (no I/Os)
 I/O-cost for merging is then ≈ 2 (N/B) log₂ (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge.
Sort N items with main memory M and disk pages of B items:
 Pass 1: produce N/M sorted runs
 Pass i: merge X ≤ M/B runs at a time   log_{M/B}(N/M) merge passes

[figure: X input buffers and one output buffer of B items each live in main memory; runs are streamed from disk through the input buffers, merged, and written back to disk through the output buffer]

Multiway Merging

[figure: buffers Bf1..Bfx hold the current page of runs 1..X = M/B, with cursors p1..pX; at each step min(Bf1[p1], ..., Bfx[pX]) is moved to the output buffer Bfo; a buffer Bfi is refilled from its run when pi = B, and Bfo is flushed to the merged output run when full, until EOF]
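A minimal Python sketch of the merging step just described (heapq.merge plays the role of repeatedly extracting the minimum head element; a real external sort would also manage the B-sized buffers and the disk I/O explicitly; file format is an assumption):

import heapq

def multiway_merge(run_paths, out_path):
    # each input file holds one sorted integer per line; heapq.merge does the k-way min
    files = [open(p) for p in run_paths]
    with open(out_path, "w") as out:
        for x in heapq.merge(*[(int(line) for line in f) for f in files]):
            out.write(f"{x}\n")
    for f in files:
        f.close()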

Cost of Multi-way Merge-Sort


Number of passes = log_{M/B} #runs ≈ log_{M/B} (N/M)

Optimal cost = Θ( (N/B) log_{M/B} (N/M) ) I/Os

In practice
 M/B ≈ 1000   #passes = log_{M/B} (N/M) ≈ 1
 One multiway merge   2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Can compression help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables <X, C> (candidate and counter)
For each item s of the stream:
   if (X == s) then C++
   else { C--; if (C == 0) { X = s; C = 1; } }
Return X;
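A runnable Python version of this one-pass scan, with the candidate/counter initialization made explicit (a second verification pass would be needed when no majority is guaranteed):

def majority_candidate(stream):
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1
        elif X == s:
            C += 1
        else:
            C -= 1
    return X

# majority_candidate("bacccdcbaaaccbccc") == 'c'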

Proof
(the algorithm may fail only if no item occurs > N/2 times)

If X ≠ y were returned, then every one of y’s occurrences would be cancelled by a distinct “negative” mate, hence the mates would be ≥ #occ(y).
As a result N ≥ 2 * #occ(y) > N: a contradiction.

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 10^9   size = 6Gb
n = 10^6 documents
TotT = 10^9 (avg term length is 6 chars)
t = 5 * 10^5 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million documents (plays),  t = 500K terms

              Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
 Antony              1                1             0          0        0        1
 Brutus              1                1             0          1        0        0
 Caesar              1                1             0          1        1        1
 Calpurnia           0                1             0          0        0        0
 Cleopatra           1                0             0          0        0        0
 mercy               1                0             1          1        1        1
 worser              1                0             1          1        1        0

(1 if the play contains the word, 0 otherwise)

Space is 500Gb !

Solution 2: Inverted index

 Brutus      2  4  8  16  32  64  128
 Calpurnia   1  2  3  5  8  13  21  34
 Caesar      13  16

We can do still better: i.e. 30-50% of the original text

1. Typically use about 12 bytes per posting
2. We have 10^9 total terms   at least 12Gb space
3. Compressing the 6Gb documents gets 1.5Gb of data
Better index, but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

    Σ_{i=1,...,n-1} 2^i  =  2^n - 2   <   2^n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
[figure: the corresponding binary trie, with a at the leaf reached by 0 and b, c, d at the leaves reached by 100, 101, 11]

Average Length
For a code C with codeword length L[s], the
average length is defined as
    La(C) = Σ_{s∈S} p(s) · L[s]

We say that a prefix code C is optimal if for all prefix codes C’, La(C) ≤ La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any uniquely decodable code C, we have

    H(S) ≤ La(C)

Theorem (upper bound, Shannon). For any probability distribution, there exists a prefix code C such that

    La(C) ≤ H(S) + 1

(Shannon code: symbol s takes ⌈log₂ 1/p(s)⌉ bits)

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

[figure: Huffman tree built bottom-up: a(.1) + b(.2)  (.3);  (.3) + c(.2)  (.5);  (.5) + d(.5)  (1)]

a=000, b=001, c=01, d=1
There are 2^(n-1) “equivalent” Huffman trees

What about ties (and thus, tree depth) ?
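A compact Python sketch of the construction using a heap (names are mine; ties are broken arbitrarily, which is exactly the source of the “equivalent trees” question above):

import heapq
from itertools import count

def huffman_codes(probs):
    # probs: dict symbol -> probability; returns dict symbol -> bitstring
    tie = count()                                   # makes heap entries comparable
    heap = [(p, next(tie), {s: ""}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)             # two least probable trees
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, next(tie), merged))
    return heap[0][2]

# huffman_codes({'a': .1, 'b': .2, 'c': .2, 'd': .5}) yields codeword lengths 3,3,2,1,
# the same lengths as the slides' code a=000, b=001, c=01, d=1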

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
abc...  00000101          101001...   dcb

[figure: the Huffman tree of the running example, traversed root-to-leaf for encoding and bit-by-bit for decoding]
A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
[figure: a word-based Huffman tree with fan-out 128 over the words of T = “bzip or not bzip” (bzip, or, not, space); each codeword is a sequence of bytes carrying 7 bits of code, and the first bit of the first byte is the tag marking the start of a codeword]

CGrep and other ideas...
P = bzip = 1a 0b

[figure: GREP on the compressed text C(T), T = “bzip or not bzip”: the tagged codeword of P is compared against the byte-aligned codewords of C(T), answering yes/no at each codeword boundary]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary = { bzip, not, or, space };   P = bzip = 1a 0b

[figure: the codeword of P is searched directly in the compressed text C(S), S = “bzip or not bzip”, answering yes/no at each codeword]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
[figure: a pattern P (e.g. AB) slid along a text T (e.g. ABCABDAB), checking one alignment per position]
 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m:

    H(s) = Σ_{i=1,...,m} 2^(m-i) · s[i]

P = 0101
H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5

s = s’  if and only if  H(s) = H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(T_r) from H(T_{r-1}):

    H(T_r) = 2 · H(T_{r-1}) - 2^m · T[r-1] + T[r+m-1]

T = 10110101
T1 = 1011,  T2 = 0110
H(T1) = H(1011) = 11
H(T2) = H(0110) = 2·11 - 2⁴·1 + 0 = 22 - 16 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P = 101111,  q = 7
H(P) = 47,   Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally, one bit at a time:
    1·2 (mod 7) + 0 = 2
    2·2 (mod 7) + 1 = 5
    5·2 (mod 7) + 1 = 4
    4·2 (mod 7) + 1 = 2
    2·2 (mod 7) + 1 = 5
    5 (mod 7) = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1), using
    2^m (mod q) = 2·(2^(m-1) (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
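A minimal Python sketch over a binary text, with a fixed (large, prime) modulus rather than a randomly drawn one, and with explicit verification of every fingerprint hit, i.e. the deterministic variant:

def karp_rabin(T, P, q=2_147_483_647):
    n, m = len(T), len(P)
    if m > n:
        return []
    pow_m = pow(2, m, q)                         # 2^m mod q, for the rolling update
    hP = hT = 0
    for i in range(m):
        hP = (2 * hP + int(P[i])) % q
        hT = (2 * hT + int(T[i])) % q
    hits = []
    for r in range(n - m + 1):
        if hP == hT and T[r:r + m] == P:         # verify to rule out false matches
            hits.append(r)
        if r + m < n:                            # slide the window: drop T[r], add T[r+m]
            hT = (2 * hT - pow_m * int(T[r]) + int(T[r + m])) % q
    return hits

# karp_rabin("10110101", "0101") == [4]          (0-based; position 5 in the slides' count)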

Problem 1: Solution
Dictionary = { bzip, not, or, space };   P = bzip = 1a 0b

[figure: as before, the codeword of P is searched directly in C(S), S = “bzip or not bzip”]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for  (M is an m×n matrix; a column fits in a memory word)

        c  a  l  i  f  o  r  n  i  a
    f   0  0  0  0  1  0  0  0  0  0
    o   0  0  0  0  0  1  0  0  0  0
    r   0  0  0  0  0  0  1  0  0  0

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one.
We define the m-length binary vector U(x) for each character x of the alphabet: U(x) is set to 1 for the positions in P where character x appears.

Example:  P = abaac
    U(a) = (1,0,1,1,0)ᵀ     U(b) = (0,1,0,0,0)ᵀ     U(c) = (0,0,0,0,1)ᵀ

How to construct M



 Initialize column 0 of M to all zeros
 For j > 0, the j-th column is obtained as

    M(j) = BitShift( M(j-1) ) & U( T[j] )

 For i > 1, entry M(i,j) = 1 iff
   (1) the first i-1 characters of P match the i-1 characters of T ending at character j-1  ⇔  M(i-1,j-1) = 1
   (2) P[i] = T[j]  ⇔  the i-th bit of U(T[j]) = 1

 BitShift moves bit M(i-1,j-1) into the i-th position;
 AND-ing this with the i-th bit of U(T[j]) establishes whether both conditions hold.
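A minimal Python sketch (names are mine): an integer is used as the m-bit column, with bit i-1 of the word standing for row i of M, so BitShift becomes a left shift plus setting the lowest bit.

def shift_and(T, P):
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)       # bit i set iff P[i+1] = c
    M, occ = 0, []
    for j, c in enumerate(T):
        M = ((M << 1) | 1) & U.get(c, 0)    # M(j) = BitShift(M(j-1)) & U(T[j])
        if M & (1 << (m - 1)):              # last row set: P ends at position j
            occ.append(j - m + 2)           # 1-based starting position, as in the slides
    return occ

# shift_and("xabxabaaca", "abaac") == [5]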

Worked example: P = abaac, T = xabxabaaca

[figures: the columns M(1), M(2), M(3), ..., M(9) computed one at a time via M(j) = BitShift(M(j-1)) & U(T[j]); the first row gets a 1 whenever T[j] = a, the second whenever "ab" ends at j, and so on; at j = 9 the last row gets a 1, i.e. M(5,9) = 1, signalling an occurrence of P ending at position 9]
Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
    U(a) = (1,0,1,1,0)ᵀ     U(b) = (1,1,0,0,0)ᵀ     U(c) = (0,0,0,0,1)ᵀ

What about ‘?’, ‘[^…]’ (not) ?

Problem 1: Another solution

Dictionary = { bzip, not, or, space };   P = bzip = 1a 0b

[figure: as before, the codeword of P is searched directly in C(S), S = “bzip or not bzip”]

Speed ≈ Compression ratio

Problem 2
Dictionary = { bzip, not, or, space }
Given a pattern P, find all the occurrences in S of all terms containing P as a substring.   P = o

[figure: both “not” (codeword 1g 0g 0a) and “or” (codeword 1g 0a 0b) contain P, so both codewords are searched in C(S), S = “bzip or not bzip”]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[figure: the patterns of P = {P1, P2, ...} are matched simultaneously against the text T]
 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And



 S is the concatenation of the patterns in P
 R is a bitmap of length m:
    R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of the Shift-And method searching for S:
 For any symbol c, U’(c) = U(c) AND R
    U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
 For any step j,
    compute M(j), then M(j) = M(j) OR U’(T[j]).  Why?
     it sets to 1 the first bit of each pattern that starts with T[j]
    Check if there are occurrences ending in j.  How?
     look at the bits of M(j) corresponding to the last symbol of each pattern

Problem 3
Dictionary = { bzip, not, or, space }
Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing at most k mismatches.   P = bot, k = 2

[figure: the dictionary terms are checked against P within k mismatches, and the matching codewords are searched in C(S), S = “bzip or not bzip”]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

    BitShift( M^l(j-1) ) & U(T[j])

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

    BitShift( M^(l-1)(j-1) )

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute M^l(j), we combine the two cases:

    M^l(j) = [ BitShift( M^l(j-1) ) & U(T[j]) ]  OR  BitShift( M^(l-1)(j-1) )

Example  (T = xabxabaaca, P = abaad, k = 1)

[figures: the matrices M⁰ and M¹, column by column; M¹(5,9) = 1, i.e. P occurs at positions 5..9 of T with at most one mismatch]
How much do we pay?





The running time is O(k · n · (1 + m/w)).
Again, the method is practically efficient for small m.
Still, only O(k) columns of M are needed at any given time; hence the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary = { bzip, not, or, space }
Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing k mismatches.   P = bot, k = 2

[figure: “not” (codeword 1g 0g 0a) is within 2 mismatches of P, so its codeword is searched in C(S), S = “bzip or not bzip”]

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding

    γ(x) = 0 0 ... 0  followed by  x in binary
           (Length-1 zeros)

 x > 0 and Length = ⌊log₂ x⌋ + 1
 e.g., 9 is represented as <000, 1001>.

 γ-code for x takes 2⌊log₂ x⌋ + 1 bits (i.e. a factor of 2 from optimal)
 Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…

Given the following sequence of γ-coded integers, reconstruct the original sequence:

    0001000001100110000011101100111

 Answer: 8, 6, 3, 59, 7
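A small Python sketch of γ-encoding and decoding (bits as a string of '0'/'1'; names are mine):

def gamma_encode(x):                 # x > 0
    b = bin(x)[2:]                   # binary representation, floor(log2 x)+1 bits
    return "0" * (len(b) - 1) + b    # Length-1 zeros, then x in binary

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":        # count the unary prefix
            z += 1; i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out

# gamma_encode(9) == "0001001"
# gamma_decode("0001000001100110000011101100111") == [8, 6, 3, 59, 7]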

Analysis
Sort the p_i in decreasing order, and encode s_i via the variable-length code γ(i).

Recall that: |γ(i)| ≤ 2·log₂ i + 1
How good is this approach wrt Huffman?   Compression ratio ≤ 2·H₀(S) + 1

Key fact:   1 ≥ Σ_{j=1,...,i} p_j ≥ i · p_i     i ≤ 1/p_i

How good is it ?
The cost of the encoding is (recall i ≤ 1/p_i):

    Σ_{i=1,...,|S|} p_i · |γ(i)|  ≤  Σ_{i=1,...,|S|} p_i · [ 2·log₂(1/p_i) + 1 ]  =  2·H₀(S) + 1

Not much worse than Huffman, and improvable to H₀(S) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 128² = 16512 words on at most 2 bytes
A (230,26)-dense code encodes 230 + 230·26 = 6210 words on at most 2 bytes, hence more words on 1 byte, and thus it wins if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1ⁿ 2ⁿ 3ⁿ … nⁿ    Huff = O(n² log n) bits,  MTF = O(n log n) + n² bits

Not much worse than Huffman
...but it may be far better

MTF: how good is it ?
Encode the integers via γ-coding:  |γ(i)| ≤ 2·log₂ i + 1
Put S in front and consider the cost of encoding:

    O(|S| log |S|) + Σ_{x=1,...,|S|} Σ_{i=2,...,n_x} | γ( p_i^x - p_{i-1}^x ) |

where p_1^x < p_2^x < ... are the positions of symbol x. By Jensen’s inequality:

    ≤ O(|S| log |S|) + Σ_{x=1,...,|S|} n_x · [ 2·log₂(N/n_x) + 1 ]
    = O(|S| log |S|) + N · [ 2·H₀(X) + 1 ]

    La[mtf] ≤ 2·H₀(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality; it is a dynamic code (there is a memory)
X = 1ⁿ 2ⁿ 3ⁿ … nⁿ     Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.  p(a) = .2, p(b) = .5, p(c) = .3

    f(i) = Σ_{j=1,...,i-1} p(j)

    f(a) = .0,  f(b) = .2,  f(c) = .7

[figure: the unit interval [0,1) split into a = [0,.2), b = [.2,.7), c = [.7,1)]
The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
[figure: start from [0,1); after b the interval is [.2,.7); after a it is [.2,.3); after c it is [.27,.3)]

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c₁ c₂ ... cₙ with probabilities p[c], use:

    l₀ = 0        lᵢ = lᵢ₋₁ + sᵢ₋₁ · f[cᵢ]
    s₀ = 1        sᵢ = sᵢ₋₁ · p[cᵢ]

f[c] is the cumulative prob. up to symbol c (not included)
The final interval size is

    sₙ = Π_{i=1,...,n} p[cᵢ]

The interval for a message sequence will be called the
sequence interval
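A tiny Python sketch of these two recurrences (exact fractions avoid rounding), reproducing the bac example above:

from fractions import Fraction as F

def sequence_interval(msg, p, f):
    l, s = F(0), F(1)
    for c in msg:
        l = l + s * f[c]          # l_i = l_{i-1} + s_{i-1} * f[c_i]
        s = s * p[c]              # s_i = s_{i-1} * p[c_i]
    return l, s

# p = {'a': F(2,10), 'b': F(5,10), 'c': F(3,10)}
# f = {'a': F(0),    'b': F(2,10), 'c': F(7,10)}
# sequence_interval("bac", p, f) == (F(27,100), F(3,100))    i.e. the interval [.27, .30)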

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
[figure: .49 ∈ [.2,.7)   b;   then .49 ∈ [.3,.55)   b;   then .49 ∈ [.475,.55)   c]

The message is bbc.

Representing a real number
Binary fractional representation:
    .75 = .11        1/3 = .0101...        11/16 = .1011

Algorithm (to emit the bits of x ∈ [0,1)):
 1.  x = 2*x
 2.  if x < 1  output 0
 3.  else  x = x - 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
    e.g.  [0,.33)  .01      [.33,.66)  .1      [.66,1)  .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
 code    min        max        interval
 .11     .110...    .111...    [.75, 1.0)
 .101    .1010...   .1011...   [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

[figure: the sequence interval [.61, .79) contains the code interval of .101, i.e. [.625, .75)]

Can use L + s/2, truncated to its first 1 + ⌈log₂(1/s)⌉ bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

    1 + ⌈ log₂(1/s) ⌉ = 1 + ⌈ log₂ Π_{i=1,...,n} (1/p_i) ⌉
                      ≤ 2 + Σ_{i=1,...,n} log₂ (1/p_i)
                      = 2 + Σ_{k=1,...,|S|} n·p_k · log₂ (1/p_k)
                      = 2 + n·H₀    bits

In practice ≈ n·H₀ + 0.02·n bits, because of rounding.

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine:

[figure: the ATB maps the current interval (L,s) and the next symbol c, with distribution (p1,...,p_|S|), to the new interval (L’,s’)]
Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
[figure: the ATB is driven by the conditional probability p[ s | context ], where s is either a character c or the escape symbol esc]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts        String = ACCBACCACBA B,   k = 2

 Context    Counts
 (empty)    A = 4   B = 2   C = 5   $ = 3

 A          C = 3   $ = 1
 B          A = 2   $ = 1
 C          A = 1   B = 2   C = 2   $ = 3

 AC         B = 1   C = 2   $ = 2
 BA         C = 1   $ = 1
 CA         C = 1   $ = 1
 CB         A = 2   $ = 1
 CC         A = 1   B = 1   $ = 2
You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
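A minimal Python sketch of the encoder (names are mine); the dictionary is seeded with the single characters of the input rather than all 256 ASCII codes, and the special decoder case for SSc-type strings is left out:

def lzw_encode(text):
    dic = {c: i for i, c in enumerate(sorted(set(text)))}   # seed: single characters
    out, S = [], ""
    for c in text:
        if S + c in dic:
            S += c                      # extend the current match
        else:
            out.append(dic[S])          # emit the id of the longest match S
            dic[S + c] = len(dic)       # add Sc to the dictionary
            S = c
    if S:
        out.append(dic[S])
    return out, dic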

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Q(n2 log n) time in the worst-case
• Q(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph
  V = Routers
  E = communication links

The “cosine” graph (undirected, weighted)
  V = static web pages
  E = semantic distance between pages

Query-Log graph (bipartite, weighted)
  V = queries and URLs
  E = (q,u) if u is a result for q, and has been clicked by some user who issued q

Social graph (undirected, unweighted)
  V = users
  E = (x,y) if x knows y (facebook, address book, email, ...)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: the probability that a node has x links is 1/x^α, with α ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in-degree(u) = k ]  ∝  1 / k^α ,   with α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: the probability that a node has x links is 1/x^α, with α ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
[figure: adjacency-matrix view of the graph, entry (i,j) set iff page i links to page j]

21 million pages, 150 million links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1 - x,  s2 - s1 - 1,  ...,  sk - s(k-1) - 1}

For negative entries (only the first gap s1 - x can be negative): map v ≥ 0 to 2v and v < 0 to 2|v| - 1, as in the residual examples later on. A small sketch of this gap encoding follows.
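A small sketch of the gap encoding (Python, assumed names and toy ids): the first successor is stored as s1 - x, the others as differences decremented by one, and a possibly negative first gap is folded to a non-negative code as just described.

def fold(v):
    """Map a possibly negative integer to a non-negative code: v >= 0 -> 2v, v < 0 -> 2|v| - 1."""
    return 2 * v if v >= 0 else 2 * (-v) - 1

def encode_successors(x, succ):
    """Gap-encode the increasing successor list of node x: {s1 - x, s2 - s1 - 1, ..., sk - s(k-1) - 1}."""
    gaps = []
    prev = None
    for pos, s in enumerate(succ):
        if pos == 0:
            gaps.append(fold(s - x))     # only the first entry can be negative
        else:
            gaps.append(s - prev - 1)    # locality makes these gaps small
        prev = s
    return gaps

print(encode_successors(15, [13, 15, 16, 17, 316]))   # toy node ids -> [3, 1, 0, 0, 298]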

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit in the copy-list of y tells whether the corresponding successor of the reference x is also a
successor of y;
the reference index is chosen in [0,W] so as to give the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity in extra-nodes

Intervals: represented by their left extreme and their length
Interval length: decremented by Lmin = 2
Residuals: coded as differences between consecutive residuals, or with respect to the source node
(a small sketch of this splitting follows the examples below)

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1
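A sketch of the interval extraction (Python, assumed names and toy ids; simplified with respect to the actual WebGraph format): maximal runs of consecutive ids of length at least Lmin become (left extreme, length - Lmin) pairs, everything else is left as a residual.

LMIN = 2   # minimum run length worth encoding as an interval, as in the slide

def split_intervals(extra_nodes, lmin=LMIN):
    """Split a sorted successor list into (left extreme, length - lmin) intervals and leftover residuals."""
    intervals, residuals = [], []
    i, n = 0, len(extra_nodes)
    while i < n:
        j = i
        while j + 1 < n and extra_nodes[j + 1] == extra_nodes[j] + 1:
            j += 1                                            # extend the run of consecutive ids
        if j - i + 1 >= lmin:
            intervals.append((extra_nodes[i], j - i + 1 - lmin))
        else:
            residuals.extend(extra_nodes[i:j + 1])
        i = j + 1
    return intervals, residuals

print(split_intervals([13, 15, 16, 17, 18, 19, 22, 316]))     # ([(15, 3)], [13, 22, 316])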

Algoritmi per IR

Compression of file collections

Background
[figure: a sender transmits data to a receiver; the receiver may already hold some knowledge about that data]

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



What if the sender has never seen the data at the receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression   [diff, zdelta, REBL, …]
  - Compress file f deploying file f'
  - Compress a group of files
  - Speed up web access by sending differences between the requested page and the ones available in cache

File synchronization   [rsync, zsync]
  - Client updates an old file f_old with the f_new available on a server
  - Mirroring, Shared Crawling, Content Distribution Networks

Set reconciliation
  - Client updates a structured old file f_old with the f_new available on a server
  - Update of contacts or appointments, intersecting inverted lists in a P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files f_known and f_new, and the goal is to compute a file f_d of minimum size such that f_new can be derived from f_known and f_d

  - Assume that block moves and copies are allowed
  - Find an optimal covering set of f_new based on f_known
  - The LZ77-scheme provides an efficient, optimal solution:
    f_known is the “previously encoded text”; compress the concatenation f_known·f_new starting from f_new

zdelta is one of the best implementations
            Emacs size    Emacs time
uncompr     27Mb          ---
gzip        8Mb           35 secs
zdelta      1.5Mb         42 secs
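The same game can be played with off-the-shelf tools: Python's zlib accepts the old file as a preset dictionary, so matches against f_known cost only back-references. This is only a rough stand-in for zdelta (the dictionary is limited to zlib's 32KB window), and the names below are assumptions.

import zlib

def delta_compress(f_known: bytes, f_new: bytes) -> bytes:
    """Compress f_new while letting the compressor copy substrings from f_known (preset dictionary)."""
    comp = zlib.compressobj(9, zdict=f_known)
    return comp.compress(f_new) + comp.flush()

def delta_decompress(f_known: bytes, f_delta: bytes) -> bytes:
    decomp = zlib.decompressobj(zdict=f_known)
    return decomp.decompress(f_delta) + decomp.flush()

old = b"the quick brown fox jumps over the lazy dog\n" * 100
new = old.replace(b"lazy", b"sleepy")
fd = delta_compress(old, new)
assert delta_decompress(old, fd) == new
print(len(fd), "bytes, against", len(zlib.compress(new, 9)), "bytes without the reference")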

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

[figure: Client <-> slow link (delta-encoded requests/pages) <-> Proxy <-> fast link <-> web; both proxies keep a reference version of the page]

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f ∈ F a good reference



Reduction to the Min Branching problem on weighted directed graphs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[figure: an example weighted graph G_F over a few files plus the dummy node; edge weights are zdelta sizes, dummy edges are gzip sizes]
            space    time
uncompr     30Mb     ---
tgz         20%      linear
THIS        8%       quadratic
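A sketch of the min-branching reduction above (Python). It uses zlib with a preset dictionary as a cheap stand-in for zdelta, and networkx's Edmonds-based minimum_spanning_arborescence as the min-branching solver; both are assumptions made for illustration, not what the experiments above used.

import zlib
import networkx as nx   # assumed available: provides Edmonds' minimum spanning arborescence

def delta_size(ref: bytes, target: bytes) -> int:
    """Size of target compressed while allowing copies from ref (a rough proxy for zdelta)."""
    c = zlib.compressobj(9, zdict=ref)
    return len(c.compress(target) + c.flush())

def min_branching_plan(files):
    """Nodes = files (1..n) plus a dummy node 0; edge u->v weighs the cost of coding v given u."""
    G = nx.DiGraph()
    for v, data in enumerate(files, start=1):
        G.add_edge(0, v, weight=len(zlib.compress(data, 9)))       # dummy edge: code v on its own
        for u, ref in enumerate(files, start=1):
            if u != v:
                G.add_edge(u, v, weight=delta_size(ref, data))     # delta-code v w.r.t. u
    # The min branching rooted at the dummy node tells which reference to use for each file.
    plan = nx.minimum_spanning_arborescence(G)
    return list(plan.edges(data="weight"))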

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n² edge calculations (zdelta executions)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G'_F, thus saving
zdelta executions. Nonetheless, still strictly n² time
            space    time
uncompr     260Mb    ---
tgz         12%      2 mins
THIS        8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
[figure: the Client holds f_old and sends a request; the Server holds f_new and sends back an update]

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
[figure: the Client (holding f_old) sends its block hashes; the Server (holding f_new) replies with the encoded file built from copy/literal instructions]

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
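A toy sketch of the block-matching idea behind rsync (Python, assumed names; not the real wire protocol): blocks of f_old are summarized by a weak rolling checksum plus a strong hash, and the new file is scanned with O(1) checksum updates, emitting copy instructions for known blocks and literals for the rest.

import hashlib

B = 8              # toy block size; rsync's default is max{700, sqrt(n)} bytes
MOD = 1 << 16

def weak(block):
    """Cheap additive checksum (a, b), in the spirit of rsync's rolling checksum."""
    a = sum(block) % MOD
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % MOD
    return a, b

def roll(a, b, out_byte, in_byte, blen):
    """Slide the checksum window one byte to the right in O(1)."""
    a = (a - out_byte + in_byte) % MOD
    b = (b - blen * out_byte + a) % MOD
    return a, b

def signatures(f_old):
    """Per-block signatures of the old file: weak checksum -> (strong hash, block offset)."""
    sig = {}
    for off in range(0, len(f_old) - B + 1, B):
        blk = f_old[off:off + B]
        sig[weak(blk)] = (hashlib.md5(blk).hexdigest(), off)
    return sig

def encode(f_new, sig):
    """Emit ('copy', offset) for blocks the receiver already has, ('lit', bytes) for the rest."""
    out, lit, i = [], bytearray(), 0
    a = b = None
    while i + B <= len(f_new):
        if a is None:
            a, b = weak(f_new[i:i + B])
        hit = sig.get((a, b))
        if hit and hit[0] == hashlib.md5(f_new[i:i + B]).hexdigest():   # strong hash confirms the match
            if lit:
                out.append(("lit", bytes(lit)))
                lit = bytearray()
            out.append(("copy", hit[1]))
            i += B
            a = None                                                    # restart the checksum after the copy
        else:
            lit.append(f_new[i])
            if i + B < len(f_new):
                a, b = roll(a, b, f_new[i], f_new[i + B], B)            # O(1) slide of the weak checksum
            i += 1
    lit.extend(f_new[i:])
    if lit:
        out.append(("lit", bytes(lit)))
    return out

old = b"abcdefgh" * 6 + b"XYZ" + b"abcdefgh" * 6
new = b"abcdefgh" * 6 + b"12345" + b"abcdefgh" * 6
print(encode(new, signatures(old)))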

Rsync: some experiments

gcc size
total
27288
gzip
7563
zdelta
227
rsync
964

emacs size
27326
8577
1431
4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends the hashes (unlike rsync, where the client sends them) and the client checks them.
The server deploys the common f_ref to compress the new f_tar (rsync compresses it on its own, without exploiting f_ref).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits
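A toy sketch of the multi-round idea (Python, assumed names): both parties hash blocks, recurse only on the blocks whose hashes disagree, and stop at single positions. This positional version only copes with in-place changes; real protocols must also handle insertions and deletions.

import hashlib

def block_hash(s: bytes) -> str:
    return hashlib.md5(s).hexdigest()[:8]      # a short hash is enough for a toy example

def differing_ranges(old: bytes, new: bytes, lo: int, hi: int):
    """Compare hashes of the two halves recursively; return the ranges where the files differ."""
    if block_hash(old[lo:hi]) == block_hash(new[lo:hi]):
        return []                              # one matching hash settles the whole block
    if hi - lo <= 1:
        return [(lo, hi)]                      # mismatching leaf: this piece must be shipped
    mid = (lo + hi) // 2
    return (differing_ranges(old, new, lo, mid) +
            differing_ranges(old, new, mid, hi))

old = b"aaaaaaaabbbbbbbbccccccccdddddddd"
new = b"aaaaaaaabbbbbbXbccccccccddddddYd"
print(differing_ranges(old, new, 0, len(new)))   # [(14, 15), (30, 31)]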

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

[figure: P aligned under T at position i; the suffix T[i,N] starts there]
Occurrences of P in T = All suffixes of T having P as a prefix

Example: P = si, T = mississippi  →  occurrences at positions 4 and 7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[figure: the suffix tree of T# = mississippi#; internal edges are labeled with substrings (e.g. #, i, s, si, ssi, i#, pi#, ppi#, mississippi#) and the 12 leaves carry the starting positions 1..12 of the corresponding suffixes]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Storing SUF(T) explicitly would take Θ(N²) space; the Suffix Array stores only the starting positions (suffix pointers) of the sorted suffixes:

T = mississippi#

SA    SUF(T)
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#

(e.g. P = si selects a contiguous range of SUF(T))

Suffix Array
• SA: Θ(N log₂ N) bits
• Text T: N chars
→ In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step

T = mississippi#      SA = 12 11 8 5 2 1 10 9 7 4 6 3      P = si

At each step, compare P with the suffix pointed to by the middle SA entry and move right (P is larger) or left (P is smaller).

Suffix Array search
• O(log₂ N) binary-search steps
• Each step takes O(p) char comparisons
→ overall, O(p log₂ N) time

Improvable to O(p + log₂ N) [Manber-Myers, ’90] and to O(p + log₂ |Σ|) [Cole et al., ’06]
(a sketch of the plain binary search follows)
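A minimal sketch of this indirect binary search (Python, assumed names), exploiting Prop 1: the suffixes prefixed by P form a contiguous range of SA, found with two binary searches whose comparisons cost O(p) each.

def sa_range(T, SA, P):
    """Return [left, right) such that SA[left:right] holds the starting positions of the suffixes prefixed by P."""
    def suffix(i):                       # 1-based starting positions, as in the slides
        return T[i - 1:]
    lo, hi = 0, len(SA)
    while lo < hi:                       # leftmost suffix >= P
        mid = (lo + hi) // 2
        if suffix(SA[mid]) < P:
            lo = mid + 1
        else:
            hi = mid
    left = lo
    lo, hi = left, len(SA)
    while lo < hi:                       # leftmost suffix, at or after `left`, not starting with P
        mid = (lo + hi) // 2
        if suffix(SA[mid]).startswith(P):
            lo = mid + 1
        else:
            hi = mid
    return left, lo

T = "mississippi#"
SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
left, right = sa_range(T, SA, "si")
print(sorted(SA[left:right]))            # [4, 7], the two occurrences of "si"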

Locating the occurrences
T = mississippi#,  P = si
The SA range for P contains the entries 7 (sippi...) and 4 (sissippi...), so occ = 2: P occurs at positions 4 and 7.
The range can be delimited by binary-searching the two keys si# and si$, assuming # < Σ < $.

Suffix Array search
• O(p + log₂ N + occ) time

Suffix Trays: O(p + log₂ |Σ| + occ)   [Cole et al., ’06]
String B-tree   [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays   [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

T = mississippi#
SA  = 12 11 8 5 2 1 10 9 7 4 6 3
Lcp =  0  0 1 4 0 0  1 0 2 1 3

(e.g. the SA-adjacent suffixes issippi# and ississippi# share the prefix "issi", of length 4)

• How long is the common prefix between T[i,...] and T[j,...] ?
  • It is the min of the subarray Lcp[h,k-1], where SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  • Search for an entry Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  • Search for a window Lcp[i,i+C-2] whose entries are all ≥ L
(a sketch of these queries follows)
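A direct sketch of these queries (Python, assumed names). The Lcp array is computed naively here, by comparing SA-adjacent suffixes, which can cost Θ(N²) in the worst case; linear-time constructions exist but are not the point of this sketch.

def lcp_array(T, SA):
    """Lcp[i] = length of the longest common prefix of the suffixes starting at SA[i] and SA[i+1] (1-based)."""
    def lcp(i, j):
        a, b = T[i - 1:], T[j - 1:]
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        return k
    return [lcp(SA[i], SA[i + 1]) for i in range(len(SA) - 1)]

def has_repeat(T, SA, L):
    """Is there a substring of length >= L occurring at least twice? Yes iff some Lcp entry is >= L."""
    return any(v >= L for v in lcp_array(T, SA))

def occurs_at_least(T, SA, L, C):
    """Is there a substring of length >= L occurring >= C times? Look for C-1 consecutive Lcp entries >= L."""
    run = 0
    for v in lcp_array(T, SA):
        run = run + 1 if v >= L else 0
        if run >= C - 1:
            return True
    return False

T = "mississippi#"
SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(has_repeat(T, SA, 4))            # True: "issi" occurs twice
print(occurs_at_least(T, SA, 1, 4))    # True: 'i' (and 's') occurs four times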


Slide 154

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with i-th bit of U(T[j]) to estabilish if both are true

An example j=1
1 2 3 4 5 6 7 8 9 10

n
m

1

12345

1

0

P=abaac

2

0

3

0

4

0

5

0

T=xabxabaaca

0
 
0
U (x)  0
 
0
 
0

2

3

4

5

6

7

1 
 
0
BitShift ( M ( 0 )) & U (T [1])   0  &
 
0
 
0

8

9

0 0
   
0 0
0  0
   
0 0
   
0 0

1
0

An example j=2
1 2 3 4 5 6 7 8 9 10

n
m

1

2

12345

1

0

1

P=abaac

2

0

0

3

0

0

4

0

0

5

0

0

T=xabxabaaca

1 
 
0
U (a )   1 
 
1 
 
0

3

4

5

6

7

1 
 
0
BitShift ( M (1)) & U (T [ 2 ])   0  &
 
0
 
0

8

9

1  1 
   
0 0
1    0 
   
1   0 
   
0 0

1
0

An example j=3
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

12345

1

0

1

0

P=abaac

2

0

0

1

3

0

0

0

4

0

0

0

5

0

0

0

T=xabxabaaca

0
 
1 
U (b )   0 
 
0
 
0

4

5

6

7

1 
 
1 
BitShift ( M ( 2 )) & U (T [ 3 ])   0  &
 
0
 
0

8

9

0 0
   
1  1 
0  0
   
0 0
   
0 0

1
0

An example j=9
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

4

5

6

7

8

9

12345

1

0

1

0

0

1

0

1

1

0

P=abaac

2

0

0

1

0

0

1

0

0

0

3

0

0

0

0

0

0

1

0

0

4

0

0

0

0

0

0

0

1

0

5

0

0

0

0

0

0

0

0

1

T=xabxabaaca

0
 
0
U (c )   0 
 
0
 
1 

1 
 
1 
BitShift ( M ( 8 )) & U (T [ 9 ])   0  &
 
0
 
1 

0 0
   
0 0
0  0
   
0 0
   
1  1 

1
0

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M ( j - 1))  U (T [ j ])
l

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M

l -1

( j - 1))

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1
1 2 3 4 5 6 7 8 910

M1=

T=xabxabaaca
P=
abaad

1

2

3

4

5

6

7

8

9

1
0

1

1

1

1

1

1

1

1

1

1

1

2

0

0

1

0

0

1

0

1

1

0

3

0

0

0

1

0

0

1

0

0

1

4

0

0

0

0

1

0

0

1

0

0

5

0

0

0

0

0

0

0

0

1

0

M0=

1

2

3

4

5

6

7

8

9

1
0

1

0

1

0

0

1

0

1

1

0

1

2

0

0

1

0

0

1

0

0

0

0

3

0

0

0

0

0

0

1

0

0

0

4

0

0

0

0

0

0

0

1

0

0

5

0

0

0

0

0

0

0

0

0

0

How much do we pay?





The running time is O(kn(1+m/w)
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Delection: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
0 000 ...........0 x in b in ary
Length-1


x > 0 and Length = log2 x +1
e.g., 9 represented as <000,1001>.



g-code for x takes 2 log2 x +1 bits

(ie. factor of 2 from optimal)



Optimal for Pr(x) = 1/2x2, and i.i.d integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8

6

3

59

7

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How much good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
1 ≥Si=1,...,x pi ≥ x * px  x ≤ 1/px

How good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):



p i  | g ( i ) |

i  1 ,..., S

This is:



i  1 ,.., S

p i  [ 2 * log

1

 1]

pi

 2 * H 0 ( X ) 1
No much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1n 2n 3n… nn  Huff = O(n2 log n), MTF = O(n log n) + n2

No much worse than Huffman
...but it may be far better

MTF: how good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
Put S in the front and consider the cost of encoding:
S

O ( S log S ) 



nx

g(p - p )

x 1 i  2

x

x

i

i -1

S

By Jensen’s:

 O ( S log S ) 

n
x 1

x

[ 2 * log

N

 1]

nx

 O ( S log S )  N * [ 2 * H 0 ( X )  1]
L a [ mtf ]  2 * H 0 ( X )  O (1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1n 2n 3n… nn 

There is a memory

Huff(X) = Q(n2 log n) > Rle(X) = Q(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.
1.0

c = .3

0.7

i -1

f (i ) 



p( j)

j 1

b = .5
0.2
0.0

f(a) = .0, f(b) = .2, f(c) = .7
a = .2

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
1.0

0.7
c = .3

0.7

c = .3
0.55

0.0

a = .2

0.3
0.2

c = .3

0.27
b = .5

b = .5
0.2

0.3

a = .2

b = .5
0.22

a = .2

0.2

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l 0  0 l i  l i -1  s i -1 * f c i 

s0  1

s i  s i -1 * p c i 

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

n

sn 

 p c 
i

i 1

The interval for a message sequence will be called the
sequence interval

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
1.0

0.7
c = .3

0.7
0.49

b = .5

c = .3

0.55
0.49

0.2

0.0

0.55

b = .5

0.3
a = .2

The message is bbc.

0.2

0.49
0.475

c = .3

b = .5
0.35

a = .2

0.3

a = .2

Representing a real number
Binary fractional representation:
. 75

 . 11

1/ 3

 . 01 01

11 / 16

 . 1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
m in

m ax

interval

.11

.110

.111

[.75 ,1.0 )

.101

.1010

.1011

[.625 , .75 )

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
Context
Empty

Counts
A
B
C
$

=
=
=
=

4
2
5
3

Context
A
B
C

Counts
C
$
A
$
A
B
C
$

=
=
=
=
=
=
=
=

3
1
2
1
1
2
2
3

Context
AC

BA
CA
CB
CC

String = ACCBACCACBA B

k=2

Counts
B
C
$
C
$
C
$
A
$
A
B
$

=
=
=
=
=
=
=
=
=
=
=
=

1
2
2
1
1
1
1
2
1
1
1
2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Q(n2 log n) time in the worst-case
• Q(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in - degree ( u )  k ] 

a  2.1

1

k

a

A Picture of the Web Graph

Definition
Directed graph G = (V,E)
• V = URLs, E = (u,v) if u has a hyperlink to v
• Isolated URLs are ignored (no IN, no OUT)

Three key properties:
• Skewed distribution: Pr that a node has x links is 1/x^α, α ≈ 2.1
• Locality: usually most of the hyperlinks point to other URLs on the same host (about 80%).
• Similarity: pages close in lexicographic order tend to share many outgoing links

A Picture of the Web Graph

[Adjacency-matrix plot (axes i, j) of a crawl with 21 million pages and 150 million links, after URL-sorting; host blocks such as Berkeley and Stanford are visible]

URL compression + Delta encoding

The library WebGraph
• Uncompressed adjacency list
• Adjacency list with compressed gaps (exploits locality)

Successor list S(x) = {s1 - x, s2 - s1 - 1, ..., sk - sk-1 - 1}
For negative entries (only the first gap can be negative): map v ≥ 0 to 2v and v < 0 to 2|v| - 1, as in the residual examples later.
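A hedged sketch (not the WebGraph library itself) of this gap encoding, with the negative-to-nonnegative mapping just described; the node id 15 and its successor list are made-up numbers:

def to_nonneg(v):
    # v >= 0 -> 2v, v < 0 -> 2|v| - 1
    return 2 * v if v >= 0 else 2 * abs(v) - 1

def encode_gaps(x, successors):
    succ = sorted(successors)
    gaps = [to_nonneg(succ[0] - x)]                    # only this gap can be negative
    gaps += [b - a - 1 for a, b in zip(succ, succ[1:])]
    return gaps

print(encode_gaps(15, [13, 15, 16, 17, 18, 19, 23, 24, 203]))
# [3, 1, 0, 0, 0, 0, 3, 0, 178]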

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y's copy list tells whether the corresponding successor of the reference x is also a successor of y;
the reference index is chosen in [0,W] as the one that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression (one-to-one)

Problem: We have two files fknown and fnew and the goal is to compute a file fd of minimum size such that fnew can be derived from fknown and fd
• Assume that block moves and copies are allowed
• Find an optimal covering set of fnew based on fknown
• The LZ77-scheme provides an efficient, optimal solution
  - fknown is the "previously encoded text": compress the concatenation fknown·fnew, starting the encoding from fnew

zdelta is one of the best implementations
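As a rough, hedged stand-in for zdelta (zdelta itself is a dedicated tool), Python's zlib supports a preset dictionary, which gives the same flavor of compressing fnew "starting from" fknown:

import zlib

def delta_compress(f_known: bytes, f_new: bytes) -> bytes:
    # f_known primes the compressor's window, so material repeated in f_new costs little
    c = zlib.compressobj(level=9, zdict=f_known)
    return c.compress(f_new) + c.flush()

def delta_decompress(f_known: bytes, f_delta: bytes) -> bytes:
    d = zlib.decompressobj(zdict=f_known)
    return d.decompress(f_delta) + d.flush()

old = b"the quick brown fox jumps over the lazy dog " * 100
new = old.replace(b"lazy", b"sleepy")
delta = delta_compress(old, new)
assert delta_decompress(old, delta) == new
print(len(new), "->", len(delta), "bytes")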
           Emacs size   Emacs time
uncompr    27Mb         ---
gzip       8Mb          35 secs
zdelta     1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: a pair of proxies located on each side of the slow link use a proprietary protocol to increase performance over this link

[Diagram: Client <-> client-side proxy (reference + request), slow link with delta-encoding, server-side proxy (reference + request), fast link to the web page]

Use zdelta to reduce traffic:
• Old version available at both proxies
• Restricted to pages already visited (30% hits), URL-prefix match
• Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: example weighted graph GF with a dummy node; edge weights are the zdelta/gzip sizes and the min branching picks one reference per file]

          space   time
uncompr   30Mb    ---
tgz       20%     linear
THIS      8%      quadratic

Improvement: what about many-to-one compression (a group of files)?

Problem: Constructing G is very costly: n² edge calculations (zdelta executions)
• We wish to exploit some pruning approach
  - Collection analysis: cluster the files that appear similar and are thus good candidates for zdelta-compression. Build a sparse weighted graph G'F containing only edges between those pairs of files
  - Assign weights: estimate appropriate edge weights for G'F, thus saving zdelta executions. Nonetheless, still n² time
          space   time
uncompr   260Mb   ---
tgz       12%     2 mins
THIS      8%      16 mins

Algoritmi per IR

File Synchronization

File synch: The problem

[Diagram: the Client holds f_old, the Server holds f_new; the client sends a request and receives an update]

• client wants to update an out-dated file
• server has the new file but does not know the old file
• update without sending the entire f_new (using similarity)
• rsync: file synch tool, distributed with Linux

Delta compression is a sort of "local" synch, since the server has both copies of the files

The rsync algorithm

[Diagram: the Client sends block hashes of f_old; the Server, which holds f_new, returns an encoded file built from matched blocks and literal data]

The rsync algorithm (contd)

• simple, widely used, single roundtrip
• optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
• choice of block size problematic (default: max{700, √n} bytes)
• not good in theory: granularity of changes may disrupt use of blocks
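A hedged sketch of an rsync-style weak rolling checksum (Adler-like); the exact constants and the strong MD5 check of real rsync are omitted:

MOD = 1 << 16

def weak_hash(block: bytes):
    # a = sum of bytes, b = position-weighted sum, both mod 2^16
    a = sum(block) % MOD
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % MOD
    return a, b

def roll(a, b, out_byte, in_byte, block_len):
    # slide the window one byte to the right in O(1)
    a = (a - out_byte + in_byte) % MOD
    b = (b - block_len * out_byte + a) % MOD
    return a, b

data = b"abcdefgh"
k = 4
a, b = weak_hash(data[0:k])
for i in range(1, len(data) - k + 1):
    a, b = roll(a, b, data[i - 1], data[i + k - 1], k)
    assert (a, b) == weak_hash(data[i:i + k])     # rolling matches recomputation
print("rolling checksum OK")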

Rsync: some experiments

         gcc size   emacs size
total    27288      27326
gzip     7563       8577
zdelta   227        1431
rsync    964        4452

Compressed size in KB (slightly outdated numbers)
Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends the hashes (unlike rsync, where the client does), and the client checks them.
The server deploys the common fref to compress the new ftar (rsync just compresses it).

A multi-round protocol
• k blocks of n/k elements
• log(n/k) levels
• If the distance is k, then on each level at most k hashes do not find a match in the other file.
• The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff
P is a prefix of the i-th suffix of T (i.e. T[i,N])

[Figure: P aligned at position i of T, covering a prefix of the suffix T[i,N]]

Occurrences of P in T = all suffixes of T having P as a prefix

P = si,  T = mississippi  =>  occurrences at 4, 7

SUF(T) = sorted set of the suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree

[Figure: the suffix tree of T# = mississippi# (positions 1..12); edges carry substring labels such as i, s, si, ssi, p, pi#, ppi#, i#, mississippi#, and each of the 12 leaves stores the starting position of its suffix]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.

T = mississippi#       (storing SUF(T) explicitly takes Θ(N²) space)

SA    SUF(T)
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#

Each SA entry is a suffix pointer into T; for P = si the matching suffixes form a contiguous range.

Suffix Array space:
• SA: Θ(N log₂ N) bits
• Text T: N chars
• In practice, a total of 5N bytes
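A minimal Python sketch (not from the slides): build SA by sorting the suffixes and use binary search over the sorted suffixes to find the SA range prefixed by P.

from bisect import bisect_left, bisect_right

def build_sa(T):
    # naive construction: Θ(N² log N) time in the worst case (see "Elegant but inefficient")
    return sorted(range(len(T)), key=lambda i: T[i:])

def search(T, SA, P):
    # for clarity we materialize the length-|P| prefixes of all suffixes (O(N*p) space)
    keys = [T[i:i + len(P)] for i in SA]
    lo, hi = bisect_left(keys, P), bisect_right(keys, P)
    return SA[lo:hi]                    # 0-based starting positions

T = "mississippi#"
SA = build_sa(T)
print([s + 1 for s in SA])                         # 12 11 8 5 2 1 10 9 7 4 6 3 (1-based)
print(sorted(s + 1 for s in search(T, SA, "si")))  # [4, 7]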

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step

T = mississippi#, P = si
[Figure: binary search over SA = 12 11 8 5 2 1 10 9 7 4 6 3; at each step the middle suffix is compared with P, and the search moves right if P is larger, left if P is smaller]

Suffix Array search
• O(log₂ N) binary-search steps
• Each step takes O(p) char comparisons
=> overall, O(p log₂ N) time
(reducible to O(p + log₂ N) [Manber-Myers, '90], and to a log |Σ| dependence [Cole et al, '06])

Locating the occurrences

T = mississippi#, P = si
[Figure: searching for si# and si$ (where # is smaller and $ larger than every alphabet character) delimits the SA range of the occ=2 occurrences, i.e. the suffixes sippi... and sissippi... at text positions 7 and 4]

Suffix Array search
• O(p + log₂ N + occ) time

Suffix Trays: O(p + log₂ |Σ| + occ)    [Cole et al., '06]
String B-tree                           [Ferragina-Grossi, '95]
Self-adjusting Suffix Arrays            [Ciriani et al., '02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

T = mississippi#

SA    suffix          Lcp with the next suffix
12    #               0
11    i#              1
 8    ippi#           1
 5    issippi#        4
 2    ississippi#     0
 1    mississippi#    0
10    pi#             1
 9    ppi#            0
 7    sippi#          2
 4    sissippi#       1
 6    ssippi#         3
 3    ssissippi#      -

(e.g. Lcp = 4 between issippi# and ississippi#)

• How long is the common prefix between T[i,...] and T[j,...]?
  Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Is there a repeated substring of length ≥ L?
  Search for some Lcp[i] ≥ L.
• Is there a substring of length ≥ L occurring ≥ C times?
  Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.
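A minimal Python sketch (not from the slides) of the Lcp array between SA-adjacent suffixes and of the "repeated substring of length ≥ L" test above:

def lcp_adjacent(T, SA):
    def lcp(i, j):
        k = 0
        while i + k < len(T) and j + k < len(T) and T[i + k] == T[j + k]:
            k += 1
        return k
    return [lcp(SA[r], SA[r + 1]) for r in range(len(SA) - 1)]

def has_repeat_of_length(Lcp, L):
    return any(v >= L for v in Lcp)

T = "mississippi#"
SA = sorted(range(len(T)), key=lambda i: T[i:])
Lcp = lcp_adjacent(T, SA)
print(Lcp)                            # [0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3]
print(has_repeat_of_length(Lcp, 4))   # True: "issi" occurs twice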



Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its daily performance over time, find the time window in which it achieved the best "market performance".
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
       4K    8K   16K   32K    128K   256K   512K   1M
n³     22s   3m   26m   3.5h   28h    --     --     --
n²     0     0    0     1s     26s    106s   7m     28m

An optimal solution
We assume every subsum ≠ 0

[Figure: A is split into a prefix of sum < 0 followed by the optimum window of sum > 0]

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm
  sum = 0; max = -1;
  For i = 1,...,n do
    if (sum + A[i] ≤ 0) sum = 0;
    else { sum += A[i]; max = MAX{max, sum}; }

Note:
• Sum < 0 when OPT starts;
• Sum > 0 within OPT
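A hedged Python sketch of this linear-time scan (a Kadane-style variant that also reports the window; not the slides' exact pseudocode):

def max_subarray(A):
    best, cur = float("-inf"), 0
    start, best_range = 0, None
    for i, x in enumerate(A):
        if cur <= 0:             # a non-positive running sum never helps: restart here
            cur, start = x, i
        else:
            cur += x
        if cur > best:
            best, best_range = cur, (start, i)
    return best, best_range

A = [2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]
print(max_subarray(A))   # (12, (2, 6)): the window 6 1 -2 4 3 has sum 12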

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort => Θ(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01  if (i < j) then
02    m = (i+j)/2;             // Divide
03    Merge-Sort(A,i,m);       // Conquer
04    Merge-Sort(A,m+1,j);     // Conquer
05    Merge(A,i,m,j)           // Combine

Cost of Mergesort on large data
• Take Wikipedia in Italian, compute word freq:
  - n = 10⁹ tuples => a few GBs
  - Typical disk (Seagate Cheetah 150GB): seek time ~5ms
• Analysis of mergesort on disk:
  - It is an indirect sort: Θ(n log₂ n) random I/Os
  - [5ms] * n log₂ n ≈ 1.5 years

In practice, it is faster because of caching...
2 passes (R/W)

Merge-Sort Recursion Tree

[Figure: binary merge-sort recursion tree, log₂ N levels, showing sorted runs being pairwise merged]

If the run-size is larger than B (i.e. after the first step!!), fetching all of it in memory for merging does not help.

How do we deploy the disk/mem features?
• Produce N/M runs, each sorted in internal memory (no I/Os)
• The I/O-cost for merging them pairwise is ≈ 2 (N/B) log₂ (N/M)

Multi-way Merge-Sort
• The key is to balance run-size and #runs to merge
• Sort N items with main memory M and disk-pages of B items:
  - Pass 1: produce (N/M) sorted runs.
  - Pass i: merge X ≤ M/B runs  =>  log_{M/B}(N/M) passes

[Figure: X = M/B input buffers (one per run) and one output buffer, each of B items, stream data between the disks and main memory]

Multiway Merging

[Figure: X = M/B sorted runs; the current page of each run is buffered in Bf1..Bfx with cursors p1..pX; repeatedly move min(Bf1[p1], Bf2[p2], ..., Bfx[pX]) to the output buffer Bfo (cursor po); fetch a new page when some pi = B, flush Bfo to the merged output run when it is full, until EOF]

Cost of Multi-way Merge-Sort
• Number of passes = log_{M/B}(#runs) ≤ log_{M/B}(N/M)
• Optimal cost = Θ((N/B) log_{M/B}(N/M)) I/Os

In practice
• M/B ≈ 1000  =>  #passes = log_{M/B}(N/M) ≈ 1
• One multiway merge  =>  2 passes = few mins (tuning depends on disk features)
• Large fan-out (M/B) decreases #passes
• Compression would decrease the cost of a pass!

Can compression help?
• Goal: enlarge M and reduce N
  - #passes = O(log_{M/B}(N/M))
  - Cost of a pass = O(N/B)
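A hedged Python sketch of one multiway merging pass over already-sorted runs (the in-memory analogue of Pass i above): heapq.merge keeps only one "current item" per run, while a real external sort would stream pages of B items from and to disk.

import heapq

def merge_runs(runs):
    # runs: a list of individually sorted sequences
    return list(heapq.merge(*runs))

runs = [[1, 2, 5, 7, 9, 10], [2, 7, 8, 9, 13, 19], [3, 4, 11, 12, 15, 17]]
print(merge_runs(runs))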

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm
• Use a pair of variables <X,C>, initialized with X = first item, C = 1
• For each subsequent item s of the stream,
    if (X == s) then C++
    else { C--; if (C == 0) { X = s; C = 1; } }
• Return X;

Proof
If X ≠ y at the end, then every one of y's occurrences has a distinct "negative" mate (an occurrence that cancelled it), so the stream would contain ≥ 2 * #occ(y) > N items: a contradiction.
(Problems arise only if no item occurs > N/2 times: then the returned candidate may be wrong.)
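A minimal Python sketch of this one-pass majority scan, in its standard counter-based form (a mild restructuring of the pseudocode above):

def majority_candidate(stream):
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1
        elif X == s:
            C += 1
        else:
            C -= 1
    return X     # guaranteed correct only if some item occurs > N/2 times

print(majority_candidate("bacccdcbaaaccbccc"))   # 'c'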

Toy problem #4: Indexing

Consider the following TREC collection:
• N = 6 * 10⁹  =>  size = 6GB
• n = 10⁶ documents
• TotT = 10⁹ term occurrences (avg term length is 6 chars)
• t = 5 * 10⁵ distinct terms

What kind of data structure should we build to support word-based searches?

Solution 1: Term-Doc matrix   (t = 500K terms, n = 1 million docs)

            Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony             1                1             0          0       0        1
Brutus             1                1             0          1       0        0
Caesar             1                1             0          1       1        1
Calpurnia          0                1             0          0       0        0
Cleopatra          1                0             0          0       0        0
mercy              1                0             1          1       1        1
worser             1                0             1          1       1        0

1 if the play contains the word, 0 otherwise.   Space is 500Gb !

Solution 2: Inverted index

Brutus     ->  2  4  8  16  32  64  128
Calpurnia  ->  1  2  3  5  8  13  21  34
Caesar     ->  13  16

1. Typically use about 12 bytes per posting
2. We have 10⁹ total terms  =>  at least 12GB space
3. Compressing the 6GB documents gets 1.5GB of data
Better index, but yet it is >10 times the text!!!!
(We can still do better: i.e. 30÷50% of the original text)

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string is (losslessly) compressed, some other must expand.
Take all messages of length n. Is it possible to compress ALL OF THEM into fewer bits?
NO: they are 2ⁿ, but the shorter compressed messages are fewer:
   Σ_{i=1}^{n-1} 2^i = 2ⁿ - 2

We need to talk about stochastic sources.

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the self-information of s is:
   i(s) = log₂ (1/p(s)) = - log₂ p(s)
Lower probability  =>  higher information

Entropy is the weighted average of i(s):
   H(S) = Σ_{s∈S} p(s) * log₂ (1/p(s))   bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword lengths L[s], the average length is defined as
   La(C) = Σ_{s∈S} p(s) * L[s]
We say that a prefix code C is optimal if for all prefix codes C', La(C) ≤ La(C')

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal uniquely decodable code, there exists a prefix code with the same codeword lengths, and thus the same optimal average length. And vice versa…

Theorem (golden rule). If C is an optimal prefix code for a source with probabilities {p1, …, pn}, then pi < pj  =>  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any uniquely decodable code C, we have
   H(S) ≤ La(C)
Theorem (upper bound, Shannon). For any probability distribution, there exists a prefix code C such that
   La(C) ≤ H(S) + 1
(the Shannon code takes ⌈log₂ (1/p(s))⌉ bits per symbol s)

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

[Figure: Huffman tree built by repeatedly merging the two least-probable nodes: a(.1)+b(.2) -> (.3); (.3)+c(.2) -> (.5); (.5)+d(.5) -> (1)]

a=000, b=001, c=01, d=1
There are 2^(n-1) "equivalent" Huffman trees

What about ties (and thus, tree depth)?
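A hedged Python sketch of Huffman-tree construction with a heap; ties are broken by insertion order, so the codeword lengths match the optimum but the exact bit patterns may differ from the slide's.

import heapq
from itertools import count

def huffman(probs):
    tiebreak = count()                      # avoids comparing symbols/subtrees on ties
    heap = [(p, next(tiebreak), sym) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)
        p2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(tiebreak), (left, right)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes

print(huffman({"a": .1, "b": .2, "c": .2, "d": .5}))
# codeword lengths: a -> 3 bits, b -> 3 bits, c -> 2 bits, d -> 1 bit (as on the slide)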

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. Its self-information is
   - log₂(.999) ≈ .00144
If we were to send 1000 such symbols we might hope to use 1000 * .00144 ≈ 1.44 bits.
Using Huffman, we take at least 1 bit per symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
• 1 extra bit per macro-symbol = 1/k extra bits per symbol
• Larger model to be transmitted

Shannon took infinite sequences, and k -> ∞ !!

In practice, we have:
• Model takes |Σ|ᵏ (k * log |Σ|) + h²  (where h might be |Σ|)
• It is H₀(Sᴸ) ≤ L * H_k(S) + O(k * log |Σ|), for each k ≤ L

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m:
   H(s) = Σ_{i=1}^{m} 2^(m-i) * s[i]

P = 0101
H(P) = 2³*0 + 2²*1 + 2¹*0 + 2⁰*1 = 5

s = s' if and only if H(s) = H(s')

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons
We can compute H(Tr) from H(Tr-1):
   H(T_r) = 2 * H(T_{r-1}) - 2^m * T[r-1] + T[r+m-1]

T = 10110101
T1 = 1011, T2 = 0110
H(T1) = H(1011) = 11
H(T2) = H(0110) = 2*11 - 2⁴*1 + 0 = 22 - 16 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P = 101111, q = 7
H(P) = 47,  Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally (Horner-style, reducing mod 7 at every step):
   1*2 + 0 = 2 (mod 7)
   2*2 + 1 = 5 (mod 7)
   5*2 + 1 = 4 (mod 7)
   4*2 + 1 = 2 (mod 7)
   2*2 + 1 = 5 (mod 7)   =>  Hq(P) = 5

We can still compute Hq(Tr) from Hq(Tr-1), since
   2^m (mod q) = 2 * (2^(m-1) (mod q)) (mod q)
Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
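A hedged Python sketch of the Karp-Rabin search over a binary text, in the "verify every fingerprint hit" (deterministic) variant; the prime q here is fixed, whereas the algorithm above picks it at random:

def karp_rabin(T, P, q=101):
    n, m = len(T), len(P)
    if m > n:
        return []
    hp = ht = 0
    for i in range(m):                    # Hq(P) and Hq(T1), Horner-style
        hp = (2 * hp + int(P[i])) % q
        ht = (2 * ht + int(T[i])) % q
    top = pow(2, m - 1, q)                # 2^(m-1) mod q, to remove the leaving bit
    occ = []
    for r in range(n - m + 1):
        if hp == ht and T[r:r + m] == P:  # verification rules out false matches
            occ.append(r + 1)             # 1-based positions, as in the slides
        if r + m < n:
            ht = ((ht - int(T[r]) * top) * 2 + int(T[r + m])) % q
    return occ

print(karp_rabin("10110101", "0101"))     # [5]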

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between two words in constant time.
Examples:
• And(A,B) is the bit-wise AND between A and B.
• BitShift(A) is the value obtained by shifting A's bits down by one position and setting the first bit to 1:
   BitShift( (x₁, x₂, ..., x_m)ᵀ ) = (1, x₁, ..., x_{m-1})ᵀ

Let w be the word size (e.g., 32 or 64 bits). We'll assume m = w. NOTICE: any column of M fits in a memory word.

How to construct M
• We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one.
• We define the m-length binary vector U(x) for each character x of the alphabet: U(x) is 1 exactly at the positions where x appears in P.

Example: P = abaac
   U(a) = (1,0,1,1,0)ᵀ    U(b) = (0,1,0,0,0)ᵀ    U(c) = (0,0,0,0,1)ᵀ

How to construct M
• Initialize column 0 of M to all zeros
• For j > 0, the j-th column is obtained by
     M(j) = BitShift( M(j-1) ) & U( T[j] )
• For i > 1, entry M(i,j) = 1 iff
  (1) the first i-1 characters of P match the i-1 characters of T ending at character j-1, i.e. M(i-1,j-1) = 1
  (2) P[i] = T[j], i.e. the i-th bit of U(T[j]) = 1
• BitShift moves bit M(i-1,j-1) to the i-th position; AND-ing it with the i-th bit of U(T[j]) establishes whether both conditions hold.

An example j=1
1 2 3 4 5 6 7 8 9 10

n
m

1

12345

1

0

P=abaac

2

0

3

0

4

0

5

0

T=xabxabaaca

0
 
0
U (x)  0
 
0
 
0

2

3

4

5

6

7

1 
 
0
BitShift ( M ( 0 )) & U (T [1])   0  &
 
0
 
0

8

9

0 0
   
0 0
0  0
   
0 0
   
0 0

1
0

An example j=2
1 2 3 4 5 6 7 8 9 10

n
m

1

2

12345

1

0

1

P=abaac

2

0

0

3

0

0

4

0

0

5

0

0

T=xabxabaaca

1 
 
0
U (a )   1 
 
1 
 
0

3

4

5

6

7

1 
 
0
BitShift ( M (1)) & U (T [ 2 ])   0  &
 
0
 
0

8

9

1  1 
   
0 0
1    0 
   
1   0 
   
0 0

1
0

An example j=3
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

12345

1

0

1

0

P=abaac

2

0

0

1

3

0

0

0

4

0

0

0

5

0

0

0

T=xabxabaaca

0
 
1 
U (b )   0 
 
0
 
0

4

5

6

7

1 
 
1 
BitShift ( M ( 2 )) & U (T [ 3 ])   0  &
 
0
 
0

8

9

0 0
   
1  1 
0  0
   
0 0
   
0 0

1
0

An example j=9
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

4

5

6

7

8

9

12345

1

0

1

0

0

1

0

1

1

0

P=abaac

2

0

0

1

0

0

1

0

0

0

3

0

0

0

0

0

0

1

0

0

4

0

0

0

0

0

0

0

1

0

5

0

0

0

0

0

0

0

0

1

T=xabxabaaca

0
 
0
U (c )   0 
 
0
 
1 

1 
 
1 
BitShift ( M ( 8 )) & U (T [ 9 ])   0  &
 
0
 
1 

0 0
   
0 0
0  0
   
0 0
   
1  1 

1
0

Shift-And method: Complexity
• If m ≤ w, any column and any vector U() fit in a memory word  =>  each step requires O(1) time.
• If m > w, any column and vector U() can be split into ⌈m/w⌉ memory words  =>  each step requires O(m/w) time.
• Overall O(n(1 + m/w) + m) time.
• Thus, it is very fast when the pattern length is close to the word size, which is very often the case in practice. Recall that w = 64 bits in modern architectures.
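A hedged Python sketch of Shift-And for a single word-sized pattern, using Python ints as bit vectors (bit i-1 of the word plays the role of row i of column M(j)):

def shift_and(T, P):
    m = len(P)
    U = {}                                # U[c] has bit i set iff P[i] == c (0-based i)
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    accept = 1 << (m - 1)
    M, occ = 0, []
    for j, c in enumerate(T):
        M = ((M << 1) | 1) & U.get(c, 0)  # M(j) = BitShift(M(j-1)) & U(T[j])
        if M & accept:
            occ.append(j - m + 2)         # 1-based starting position of the match
    return occ

print(shift_and("xabxabaaca", "abaac"))   # [5], as in the j=9 column of the example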

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary matrix such that:

Ml(i,j) = 1 iff there are no more than l mismatches between the first i characters of P and the i characters of T ending at character j.

• What is M0?
• How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M ( j - 1))  U (T [ j ])
l

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

BitShift( Ml-1(j-1) )

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a match iff

   Ml(j) = [ BitShift( Ml(j-1) ) & U( T[j] ) ]  OR  BitShift( Ml-1(j-1) )

Example M1
1 2 3 4 5 6 7 8 910

M1=

T=xabxabaaca
P=
abaad

1

2

3

4

5

6

7

8

9

1
0

1

1

1

1

1

1

1

1

1

1

1

2

0

0

1

0

0

1

0

1

1

0

3

0

0

0

1

0

0

1

0

0

1

4

0

0

0

0

1

0

0

1

0

0

5

0

0

0

0

0

0

0

0

1

0

M0=

1

2

3

4

5

6

7

8

9

1
0

1

0

1

0

0

1

0

1

1

0

1

2

0

0

1

0

0

1

0

0

0

0

3

0

0

0

0

0

0

1

0

0

0

4

0

0

0

0

0

0

0

1

0

0

5

0

0

0

0

0

0

0

0

0

0

How much do we pay?
• The running time is O(k n (1 + m/w)).
• Again, the method is practically efficient for small m.
• Only O(k) columns of M are needed at any given time. Hence, the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations
• The Shift-And method can solve other ops
• The edit distance between two strings p and s is d(p,s) = the minimum number of operations needed to transform p into s via three ops:
  - Insertion: insert a symbol in p
  - Deletion: delete a symbol from p
  - Substitution: change a symbol of p into a different one
• Example: d(ananas,banane) = 3
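A minimal Python sketch (not tied to the bit-parallel machinery above) of the classic dynamic program for edit distance:

def edit_distance(p, s):
    m, n = len(p), len(s)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i                      # i deletions
    for j in range(n + 1):
        D[0][j] = j                      # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(D[i - 1][j] + 1,                           # deletion
                          D[i][j - 1] + 1,                           # insertion
                          D[i - 1][j - 1] + (p[i - 1] != s[j - 1]))  # substitution / match
    return D[m][n]

print(edit_distance("ananas", "banane"))   # 3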

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding
   γ(x) = 0^(Length-1) followed by x in binary,  for x > 0 and Length = ⌊log₂ x⌋ + 1
   e.g., 9 is represented as <000,1001>.

• The γ-code for x takes 2⌊log₂ x⌋ + 1 bits (i.e. a factor of 2 from optimal)
• Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…
• Given the following sequence of γ-coded integers, reconstruct the original sequence:
  0001000001100110000011101100111
  =>  8, 6, 3, 59, 7
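A minimal Python sketch of γ-encoding and decoding, which reproduces the exercise above:

def gamma_encode(x):
    assert x > 0
    b = bin(x)[2:]                    # binary representation of x, starts with '1'
    return "0" * (len(b) - 1) + b     # (Length-1) zeros, then x in binary

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":
            zeros += 1
            i += 1
        out.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return out

code = "".join(gamma_encode(x) for x in [8, 6, 3, 59, 7])
print(code)                 # 0001000001100110000011101100111
print(gamma_decode(code))   # [8, 6, 3, 59, 7]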

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).

Recall that: |γ(i)| ≤ 2 * log i + 1
How good is this approach wrt Huffman?  Compression ratio ≤ 2 * H₀(s) + 1
Key fact:  1 ≥ Σ_{i=1,...,x} pi ≥ x * px   =>   x ≤ 1/px

How good is it?
Encode the integers via γ-coding: |γ(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):
   Σ_{i=1,...,|Σ|} pi * |γ(i)|  ≤  Σ_{i=1,...,|Σ|} pi * [2 * log(1/pi) + 1]  ≤  2 * H₀(X) + 1
Not much worse than Huffman, and improvable to H₀(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer sequence, which can then be var-length coded.
• Start with the list of symbols L = [a,b,c,d,…]
• For each input symbol s
  1) output the position of s in L
  2) move s to the front of L

There is a memory: it exploits temporal locality, and it is dynamic.
• X = 1ⁿ2ⁿ3ⁿ…nⁿ  =>  Huff = O(n² log n), MTF = O(n log n) + n²
Not much worse than Huffman... but it may be far better.
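A minimal Python sketch of the MTF transform (0-based positions); on a prefix of the BWT column used in the bzip encoding example of these notes it reproduces the first Mtf digits shown there:

def mtf_encode(text, alphabet):
    L = list(alphabet)
    out = []
    for s in text:
        i = L.index(s)        # position of s in the current list
        out.append(i)
        L.pop(i)
        L.insert(0, s)        # move s to the front
    return out

print(mtf_encode("ipppssssssmmmii", "imps"))
# [0, 2, 0, 0, 3, 0, 0, 0, 0, 0, 3, 0, 0, 3, 0]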

MTF: how good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2 * log i + 1
Put Σ in front and consider the cost of encoding:
   O(|Σ| log |Σ|) + Σ_{x=1}^{|Σ|} Σ_{i=2}^{n_x} |γ(pᵢˣ - pᵢ₋₁ˣ)|
By Jensen's inequality:
   ≤ O(|Σ| log |Σ|) + Σ_{x=1}^{|Σ|} n_x * [2 * log(N/n_x) + 1]
   ≤ O(|Σ| log |Σ|) + N * [2 * H₀(X) + 1]
   =>  La[mtf] ≤ 2 * H₀(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
   abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  =>  just the run lengths and one bit

Properties: it exploits spatial locality, it is a dynamic code, and there is a memory.
• X = 1ⁿ2ⁿ3ⁿ…nⁿ  =>  Huff(X) = Θ(n² log n) > Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval in the range from 0 (inclusive) to 1 (exclusive), with
   f(i) = Σ_{j=1}^{i-1} p(j)

e.g.  a = .2  ->  [0, .2)     b = .5  ->  [.2, .7)     c = .3  ->  [.7, 1.0)
      f(a) = .0,  f(b) = .2,  f(c) = .7

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
   start: [0, 1.0)   ->  after b: [.2, .7)   ->  after a: [.2, .3)   ->  after c: [.27, .3)
The final sequence interval is [.27, .3)

Arithmetic Coding
To code a sequence of symbols c₁…cₙ with probabilities p[c], use the following:
   l₀ = 0        lᵢ = lᵢ₋₁ + sᵢ₋₁ * f[cᵢ]
   s₀ = 1        sᵢ = sᵢ₋₁ * p[cᵢ]
f[c] is the cumulative probability up to symbol c (not included).
The final interval size is
   sₙ = Π_{i=1}^{n} p[cᵢ]
The interval for a message sequence will be called the sequence interval.

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval
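A hedged Python sketch of these interval recurrences with plain floating-point numbers (no bit output and no rescaling, so it is only illustrative for short messages):

PROB = {"a": 0.2, "b": 0.5, "c": 0.3}
CUM  = {"a": 0.0, "b": 0.2, "c": 0.7}       # f[c] = cumulative probability below c

def sequence_interval(msg):
    l, s = 0.0, 1.0
    for c in msg:
        l = l + s * CUM[c]
        s = s * PROB[c]
    return l, l + s

def decode(x, length):
    out, l, s = [], 0.0, 1.0
    for _ in range(length):
        for c in PROB:                       # find the symbol interval containing x
            lo, hi = l + s * CUM[c], l + s * (CUM[c] + PROB[c])
            if lo <= x < hi:
                out.append(c)
                l, s = lo, s * PROB[c]
                break
    return "".join(out)

print(sequence_interval("bac"))   # ~ (0.27, 0.30), the interval [.27, .3)
print(decode(0.49, 3))            # bbc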

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:
   .49 ∈ [.2, .7)     ->  first symbol b     (a = [0,.2), b = [.2,.7), c = [.7,1))
   .49 ∈ [.3, .55)    ->  second symbol b    (within [.2,.7): a = [.2,.3), b = [.3,.55), c = [.55,.7))
   .49 ∈ [.475, .55)  ->  third symbol c     (within [.3,.55): a = [.3,.35), b = [.35,.475), c = [.475,.55))
The message is bbc.

Representing a real number
Binary fractional representation:
   .75 = .11      1/3 = .010101…      11/16 = .1011

Algorithm
1.  x = 2 * x
2.  If x < 1 output 0
3.  else x = x - 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
   e.g. [0,.33) -> .01      [.33,.66) -> .1      [.66,1) -> .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions:

   code    min      max      interval
   .11     .110…    .111…    [.75, 1.0)
   .101    .1010…   .1011…   [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length
Note that -log s + 1 = log(2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most
   1 + ⌈log (1/s)⌉ = 1 + ⌈log Π_i (1/pᵢ)⌉
   ≤ 2 + Σ_{i=1..n} log (1/pᵢ)
   = 2 + Σ_{k=1..|Σ|} n p_k log (1/p_k)
   = 2 + n H₀   bits

(≈ nH₀ + 0.02 n bits in practice, because of rounding)

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts     (String = ACCBACCACBA, next char B, k = 2)

Context    Counts
Empty      A = 4   B = 2   C = 5   $ = 3

A          C = 3   $ = 1
B          A = 2   $ = 1
C          A = 1   B = 2   C = 2   $ = 3

AC         B = 1   C = 2   $ = 2
BA         C = 1   $ = 1
CA         C = 1   $ = 1
CB         A = 2   $ = 1
CC         A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
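A minimal Python sketch of LZ77 decoding that copies byte-by-byte, so the overlapping case (len > d) works exactly as described above:

def lz77_decode(triples):
    out = []
    for d, length, c in triples:
        start = len(out) - d
        for i in range(length):         # byte-by-byte copy handles overlaps
            out.append(out[start + i])
        out.append(c)
    return "".join(out)

# the parse produced in the windowed example above:
print(lz77_decode([(0, 0, "a"), (1, 1, "c"), (3, 4, "b"), (3, 3, "a"), (1, 2, "c")]))
# aacaacabcabaaac
print(lz77_decode([(0, 0, "a"), (0, 0, "b"), (0, 0, "c"), (0, 0, "d"), (2, 9, "e")]))
# abcdcdcdcdcdce  (the overlap example: seen = abcd, next codeword (2,9,e))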

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
We are given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
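A hedged, runnable Python version of the InvertBWT idea above (LF is realized by a stable sort of L's positions, which preserves the relative order of equal characters, as required by key property 1):

def inverse_bwt(L):
    n = len(L)
    order = sorted(range(n), key=lambda i: L[i])     # stable: ties keep their order
    LF = [0] * n
    for f_pos, l_pos in enumerate(order):
        LF[l_pos] = f_pos
    out, r = [], 0              # row 0 is the rotation starting with the sentinel '#'
    for _ in range(n):          # walk T backwards: L[r] precedes F[r] in T
        out.append(L[r])
        r = LF[r]
    rot = "".join(reversed(out))    # this is '#' followed by T without its sentinel
    return rot[1:] + rot[0]

print(inverse_bwt("ipssm#pissii"))   # mississippi#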

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Q(n2 log n) time in the worst-case
• Q(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in - degree ( u )  k ] 

a  2.1

1

k

a

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of the copy-list of y tells whether the corresponding successor of the reference x is also a
successor of y;
The reference index is chosen in [0,W] as the one that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



What if the sender has never seen the data at the receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77-scheme provides an efficient, optimal solution




fknown is the "previously encoded text": compress the concatenation fknown·fnew, starting the parsing from fnew

zdelta is one of the best implementations
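
A small sketch of the delta-compression idea using zlib with a preset dictionary (an assumption standing in for zdelta: LZ77 copies in fnew may reach back into fknown). It is not zdelta itself, only an approximation of the same principle.

import zlib

def delta_compress(f_known: bytes, f_new: bytes) -> bytes:
    # f_known acts as "previously encoded text": copies may refer back into it
    c = zlib.compressobj(level=9, zdict=f_known)
    return c.compress(f_new) + c.flush()

def delta_decompress(f_known: bytes, f_delta: bytes) -> bytes:
    d = zlib.decompressobj(zdict=f_known)
    return d.decompress(f_delta) + d.flush()

old = b"the quick brown fox jumps over the lazy dog " * 100
new = old.replace(b"lazy", b"sleepy")
delta = delta_compress(old, new)
assert delta_decompress(old, delta) == new
print(len(new), "->", len(delta))   # the delta is far smaller than compressing new alone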
           Emacs size   Emacs time
uncompr      27Mb         ---
gzip          8Mb         35 secs
zdelta      1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f ∈ F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all files, whose edge weights are the gzip-coding sizes



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.
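
A sketch of the reduction, assuming the networkx library for Edmonds' minimum arborescence and using zlib (plain and with a preset dictionary) as a stand-in for the gzip/zdelta sizes; everything here is illustrative, not the actual implementation.

import itertools, zlib
import networkx as nx

def cluster_compress_plan(files):
    # files: dict name -> bytes
    G = nx.DiGraph()
    for name, data in files.items():
        # dummy node 0: compressing a file "from scratch" costs its gzip-like size
        G.add_edge(0, name, weight=len(zlib.compress(data, 9)))
    for ref, tgt in itertools.permutations(files, 2):
        c = zlib.compressobj(9, zdict=files[ref])          # pairwise delta size
        G.add_edge(ref, tgt, weight=len(c.compress(files[tgt]) + c.flush()))
    # min branching rooted at the dummy node = cheapest choice of a reference for each file
    return list(nx.minimum_spanning_arborescence(G).edges())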

[Figure: an example of the weighted graph GF with the dummy node 0 and its min branching; edge weights are the delta/gzip compressed sizes]
           space    time
uncompr    30Mb     ---
tgz        20%      linear
THIS       8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly: n² edge calculations (zdelta executions)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: estimate appropriate edge weights for G’F, thus saving
zdelta executions. Nonetheless, still Θ(n²) time
           space     time
uncompr    260Mb     ---
tgz        12%       2 mins
THIS       8%        16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
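
A minimal sketch of the block-matching idea behind rsync (not the real protocol): the client sends one weak hash per block of f_old; the server slides over f_new and emits block references or literal bytes. The simple checksum and the tiny block size are illustrative assumptions.

def weak_hash(block: bytes) -> int:
    a = sum(block) % 65521
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % 65521
    return (b << 16) | a

def rsync_like_encode(f_old: bytes, f_new: bytes, B: int = 8):
    blocks = {weak_hash(f_old[i:i+B]): i for i in range(0, len(f_old) - B + 1, B)}
    out, j = [], 0
    while j + B <= len(f_new):
        h = weak_hash(f_new[j:j+B])                 # a real implementation rolls this in O(1)
        if h in blocks and f_old[blocks[h]:blocks[h]+B] == f_new[j:j+B]:
            out.append(("copy", blocks[h], B)); j += B
        else:
            out.append(("lit", f_new[j])); j += 1
    out.extend(("lit", x) for x in f_new[j:])
    return out

def rsync_like_decode(f_old: bytes, ops) -> bytes:
    res = bytearray()
    for op in ops:
        res += f_old[op[1]:op[1]+op[2]] if op[0] == "copy" else bytes([op[1]])
    return bytes(res)

old = b"the quick brown fox jumps over the lazy dog"
new = b"the quick brown cat jumps over the lazy dog!"
assert rsync_like_decode(old, rsync_like_encode(old, new)) == new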

Rsync: some experiments

          gcc size   emacs size
total      27288       27326
gzip        7563        8577
zdelta       227        1431
rsync        964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends the hashes (unlike rsync, where the client does), and the client checks them.
The server deploys the common fref to compress the new ftar (rsync just compresses it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff
P is a prefix of the i-th suffix of T (i.e. T[i,N])

Occurrences of P in T = all suffixes of T having P as a prefix

e.g. P = si, T = mississippi  =>  P occurs at positions 4 and 7
SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Figure: the suffix tree of T# = mississippi#; edges are labeled with substrings (e.g. ssi, ppi#, mississippi#) and each leaf stores the starting position (1..12) of its suffix]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
SUF(T), stored explicitly, takes Θ(N²) space; the suffix array SA stores only the N suffix pointers.

T = mississippi#

 SA    SUF(T)
 12    #
 11    i#
  8    ippi#
  5    issippi#
  2    ississippi#
  1    mississippi#
 10    pi#
  9    ppi#
  7    sippi#
  4    sissippi#
  6    ssippi#
  3    ssissippi#

Suffix Array space:
• SA: Θ(N log2 N) bits
• Text T: N chars
⇒ In practice, a total of about 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 memory accesses per step.

T = mississippi#,  P = si
At each step, compare P against the suffix pointed to by the middle SA entry: if P is larger, recurse on the right half; if P is smaller, recurse on the left half.

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char comparisons
⇒ overall, O(p log2 N) time
Improvable to O(p + log2 N) [Manber-Myers, ’90], and to O(p + log2 |S|) with Suffix Trays [Cole et al, ’06] (see below).

Locating the occurrences
Two binary searches over SA, one for the first suffix ≥ P (think of P followed by #) and one for the first suffix > P (think of P followed by a symbol larger than any in S), delimit the contiguous range of suffixes prefixed by P.
T = mississippi#, P = si  =>  the range contains the suffixes starting at 7 and 4, so occ = 2.

Suffix Array search
• O(p + log2 N + occ) time

Suffix Trays: O(p + log2 |S| + occ)   [Cole et al., ‘06]
String B-tree   [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays   [Ciriani et al., ’02]
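
A toy sketch of the indirect binary search over SA (0-based positions; it relies on the key parameter of Python's bisect, available from Python 3.10). The quadratic suffix-array construction is only for small examples.

import bisect

def suffix_array(T: str):
    return sorted(range(len(T)), key=lambda i: T[i:])

def occurrences(T: str, SA, P: str):
    # first suffix >= P, and first suffix >= P followed by a symbol larger than any in T
    lo = bisect.bisect_left(SA, P, key=lambda i: T[i:])
    hi = bisect.bisect_left(SA, P + "\uffff", key=lambda i: T[i:])
    return SA[lo:hi]                                   # starting positions of P in T

T = "mississippi#"
SA = suffix_array(T)
print(sorted(p + 1 for p in occurrences(T, SA, "si")))   # [4, 7] in 1-based positions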

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

T = mississippi#
SA  = 12 11 8 5 2 1 10 9 7 4 6 3
Lcp = 0 0 1 4 0 0 1 0 2 1 3

e.g. the suffixes issippi# and ississippi# are adjacent in SA and share the prefix issi, of length 4.

• How long is the common prefix between T[i,...] and T[j,...] ?
  • Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  • Search for some entry Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  • Search for a window Lcp[i,i+C-2] whose entries are all ≥ L
(a sketch follows below)
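
A small sketch of the last two Lcp-based queries (naive constructions, illustrative names only).

def sa_and_lcp(T: str):
    SA = sorted(range(len(T)), key=lambda i: T[i:])
    def lcp(a, b):
        k = 0
        while a + k < len(T) and b + k < len(T) and T[a + k] == T[b + k]:
            k += 1
        return k
    return SA, [lcp(SA[i], SA[i + 1]) for i in range(len(T) - 1)]

def has_repeat_of_length(Lcp, L):            # repeated substring of length >= L ?
    return any(v >= L for v in Lcp)

def has_substring_occurring(Lcp, L, C):      # substring of length >= L occurring >= C times ?
    w = C - 1                                # window Lcp[i .. i+C-2]
    return any(all(v >= L for v in Lcp[i:i + w]) for i in range(len(Lcp) - w + 1))

SA, Lcp = sa_and_lcp("mississippi#")
print(has_repeat_of_length(Lcp, 4), has_substring_occurring(Lcp, 2, 3))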



Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C · p · f/(1+f)
This is at least 10^4 · f/(1+f)
If we fetch B ≈ 4KB in time C, and the algorithm uses all of them:
(1/B) · (p · f/(1+f) · C)  ≈  30 · f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
Running time of the naive solutions for growing n:

 n       4K    8K    16K   32K    128K   256K   512K   1M
 n^3     22s   3m    26m   3.5h   28h    --     --     --
 n^2     0     0     0     1s     26s    106s   7m     28m

An optimal solution
We assume every subsum ≠ 0.

[Figure: A is split into maximal stretches with negative / positive running sum; the optimum subarray starts right after a stretch where the running sum dropped below 0]

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm
  sum = 0; max = -1;
  For i = 1, ..., n do
      if (sum + A[i] ≤ 0) sum = 0;
      else { sum += A[i]; max = MAX{max, sum}; }

Note:
• Sum < 0 right before OPT starts;
• Sum > 0 within OPT.
(a runnable sketch follows below)
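
A runnable sketch of the slide's one-pass algorithm (initialising max to minus infinity instead of -1, so that all-negative inputs are also handled).

def max_subarray_sum(A):
    best, run = float("-inf"), 0
    for x in A:
        if run + x <= 0:
            run = 0                  # restart: the optimum never starts inside a non-positive stretch
        else:
            run += x
            best = max(best, run)
    return best

A = [2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]
print(max_subarray_sum(A))           # 12, achieved by the subarray 6 1 -2 4 3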

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 10^9 random I/Os = 10^9 × 5ms ≈ 2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01  if (i < j) then
02      m = (i+j)/2;                 // Divide
03      Merge-Sort(A,i,m);           // Conquer
04      Merge-Sort(A,m+1,j);
05      Merge(A,i,m,j)               // Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:
  n = 10^9 tuples  ⇒  a few GBs

Typical Disk (Seagate Cheetah 150GB): seek time ~5ms

Analysis of mergesort on disk:
  It is an indirect sort: Θ(n log2 n) random I/Os
  [5ms] × n log2 n  ≈  1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
[Figure: the merge-sort recursion tree on an example key sequence; the lowest levels, producing runs of size M, fit in main memory]

How do we deploy the disk/mem features ?
N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:

  Pass 1: Produce (N/M) sorted runs.
  Pass i: merge X ≤ M/B runs  ⇒  log_{M/B} (N/M) passes
[Figure: multiway merging. X = M/B input buffers (one per run) and one output buffer Bfo, each of B items, sit in main memory; when an input buffer is exhausted (pi = B) fetch the next page of that run, repeatedly move min(Bf1[p1], Bf2[p2], …, BfX[pX]) to Bfo, and flush Bfo to the merged run on disk when full]

Cost of Multi-way Merge-Sort


Number of passes = log_{M/B} #runs  ≤  log_{M/B} (N/M)

Optimal cost = Θ( (N/B) · log_{M/B} (N/M) ) I/Os

In practice
  M/B ≈ 1000  ⇒  #passes = log_{M/B} (N/M) ≤ 1
  One multiway merge ⇒ 2 passes = few mins
  (tuning depends on disk features)

• Large fan-out (M/B) decreases #passes
• Compression would decrease the cost of a pass!
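
A toy in-memory sketch of the two passes (runs of size M, then one multiway merge kept balanced by a heap); files, page buffers of B items and the run size are all simplified assumptions.

import heapq

def external_sort(items, M=4):
    # Pass 1: produce ceil(N/M) sorted runs (each would be written to disk)
    runs = [sorted(items[i:i+M]) for i in range(0, len(items), M)]
    # Pass 2: one multiway merge of all runs; heapq.merge keeps one cursor per run
    return list(heapq.merge(*runs))

data = [5, 1, 13, 19, 7, 9, 3, 4, 8, 15, 6, 11, 12, 17, 2, 10]
print(external_sort(data))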

Can compression help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how low we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm
  Use a pair of variables <X,C>
  For each item s of the stream:
      if (X == s) then C++
      else { C--; if (C == 0) { X = s; C = 1; } }
  Return X;
(a runnable sketch of the majority-vote idea follows below)
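
A runnable sketch of the same idea in the standard majority-vote formulation (initialisation made explicit); it returns the majority item whenever one occurs more than N/2 times, and an arbitrary item otherwise.

def majority_candidate(stream):
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1          # adopt s as the new candidate
        elif s == X:
            C += 1               # one more supporter
        else:
            C -= 1               # pair s off against one occurrence of X
    return X

A = list("bacccdcbaaaccbccc")
print(majority_candidate(A))     # 'c' (9 occurrences out of 17, i.e. > N/2)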

Proof
A problem can arise only if #occ(y) ≤ N/2.
If X ≠ y at the end, then every one of y’s occurrences has a “negative” mate (a decrement caused by a different item), hence those mates are ≥ #occ(y).
As a result we would get 2 · #occ(y) ≤ N, contradicting #occ(y) > N/2.

Toy problem #4: Indexing


Consider the following TREC collection:
  N = 6 · 10^9 chars, size = 6GB
  n = 10^6 documents
  TotT = 10^9 term occurrences (avg term length is 6 chars)
  t = 5 · 10^5 distinct terms

What kind of data structure should we build to support word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million documents, t = 500K terms
1 if the play contains the word, 0 otherwise:

            Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony             1                1             0          0       0        1
Brutus             1                1             0          1       0        0
Caesar             1                1             0          1       1        1
Calpurnia          0                1             0          0       0        0
Cleopatra          1                0             0          0       0        0
mercy              1                0             1          1       1        1
worser             1                0             1          1       1        0

Space is 500Gb !

Solution 2: Inverted index
Brutus    → 2 4 8 16 32 64 128
Calpurnia → 1 2 3 5 8 13 21 34
Caesar    → 13 16

1. Typically each posting uses about 12 bytes
2. We have 10^9 total terms ⇒ at least 12GB of space
3. Compressing the 6GB of documents gets 1.5GB of data
Better index, but it is still >10 times the (compressed) text !!!!

We can still do better, i.e. 30-50% of the original text.

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO: they are 2^n, but the shorter strings available as compressed messages are only

Σ_{i=1}^{n-1} 2^i  =  2^n − 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self-information of s is:

i(s) = log2 (1/p(s)) = −log2 p(s)

Lower probability ⇒ higher information

Entropy is the weighted average of i(s):

H(S) = Σ_{s∈S} p(s) · log2 (1/p(s))    bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword lengths L[s], the
average length is defined as

La(C) = Σ_{s∈S} p(s) · L[s]

We say that a prefix code C is optimal if for
all prefix codes C’, La(C) ≤ La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  ⇒  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H(S) ≤ La(C)
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La(C) ≤ H(S) + 1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
[Figure: the Huffman tree built by repeatedly merging the two least-probable nodes: a(.1)+b(.2) → (.3); (.3)+c(.2) → (.5); (.5)+d(.5) → (1)]

a=000, b=001, c=01, d=1
There are 2^(n-1) “equivalent” Huffman trees

What about ties (and thus, tree depth) ?
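
A small sketch of Huffman's construction with a heap; the tie-breaking counter only makes comparisons deterministic, it does not answer the depth question above.

import heapq

def huffman_codes(probs):
    # probs: dict symbol -> probability; returns dict symbol -> codeword (string of bits)
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)          # the two least-probable trees
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

print(huffman_codes({"a": .1, "b": .2, "c": .2, "d": .5}))
# codeword lengths 3, 3, 2, 1 for a, b, c, d, as in the tree above (the actual bits may differ)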

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
e.g.  abc…  encodes to 000·001·01… = 00000101…
e.g.  101001…  decodes to d c b …

[Figure: the Huffman tree of the running example, traversed root-to-leaf for encoding and bit-by-bit for decoding]

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:
  firstcode[L]
  Symbol[L,i], for each i in level L

This is ≤ h² + |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
−log2(0.999) ≈ 0.00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 ⇒ ≤ 1 extra bit per macro-symbol = 1/k extra bits per symbol
 ⇒ but a larger model has to be transmitted

Shannon took infinite sequences, i.e. k → ∞ !!

In practice, we have:
  The model takes |S|^k · (k · log |S|) + h² bits   (where h might be |S|)
  It is H0(S^L) ≤ L · Hk(S) + O(k · log |S|), for each k ≤ L

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
[Figure: a 128-ary Huffman tree whose leaves are the words of T = “bzip or not bzip” (plus the space separator); each codeword is a sequence of 7-bit symbols packed into bytes, and the tag bit marking the first byte of a codeword keeps the code byte-aligned and searchable, e.g. C(bzip) = 1a 0b]
CGrep and other ideas...
P= bzip = 1a 0b

[Figure: the compressed pattern P = bzip = 1a 0b is searched directly in the byte-aligned compressed text C(T) of T = “bzip or not bzip”; the tag bits avoid false matches, and the two occurrences of bzip are reported]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary = { bzip, not, or, space }
P = bzip = 1a 0b
S = “bzip or not bzip”

[Figure: the compressed pattern is searched directly in the compressed text C(S), matching byte-aligned, tagged codewords; the occurrences of bzip are reported]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
[Figure: a text T (e.g. … A B C A B D A B …) and a pattern P (e.g. A B)]
 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which arithmetic and bit-operations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errs, whose running
time is efficient with high probability.

We will consider a binary alphabet (i.e., T ∈ {0,1}^n).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H(s) = Σ_{i=1}^{m} 2^{m−i} · s[i]

e.g. P = 0101:  H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5

s = s’  if and only if  H(s) = H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H(T_r) = 2·H(T_{r−1}) − 2^m · T[r−1] + T[r+m−1]

T = 10110101,  m = 4
T1 = 1011,  T2 = 0110

H(T1) = H(1011) = 11
H(T2) = H(0110) = 2·11 − 2⁴·1 + 0 = 22 − 16 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
(1·2 + 0) mod 7 = 2
(2·2 + 1) mod 7 = 5
(5·2 + 1) mod 7 = 4
(4·2 + 1) mod 7 = 2
(2·2 + 1) mod 7 = 5  =  Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1):
2^m (mod q) = 2·(2^{m−1} (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
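
A minimal sketch of the fingerprint scan in its deterministic variant (candidate matches are verified); the fixed prime is an illustrative choice, the algorithm above picks a random prime ≤ I.

def karp_rabin(T: str, P: str, q: int = 2_147_483_647):
    n, m, base = len(T), len(P), 256
    if m > n:
        return []
    hP = hT = 0
    for i in range(m):                          # fingerprints of P and of T[0:m]
        hP = (hP * base + ord(P[i])) % q
        hT = (hT * base + ord(T[i])) % q
    bm = pow(base, m - 1, q)                    # base^(m-1) mod q, to drop the leftmost char
    occ = []
    for r in range(n - m + 1):
        if hP == hT and T[r:r+m] == P:          # verify, so no false match is ever reported
            occ.append(r)
        if r + m < n:                           # roll the window: drop T[r], add T[r+m]
            hT = ((hT - ord(T[r]) * bm) * base + ord(T[r + m])) % q
    return occ

print(karp_rabin("10110101", "0101"))           # [4]: the occurrence found in the slides (0-based)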

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

[Figure: the m×n matrix M for T = california and P = for; M(i,j) = 1 exactly where P[1…i] = T[j-i+1…j], and the 1 in the last row (at column j = 7) marks the occurrence of P]
How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit bit-parallelism to
compute the j-th column of M from the (j-1)-th one.
We define the m-length binary vector U(x) for each
character x in the alphabet: U(x) has a 1 exactly at the
positions in P where character x appears.
Example:
P = abaac
U(a) = (1,0,1,1,0)ᵀ     U(b) = (0,1,0,0,0)ᵀ     U(c) = (0,0,0,0,1)ᵀ

How to construct M



Initialize column 0 of M to all zeros.
For j > 0, the j-th column is obtained by

M(j) = BitShift( M(j-1) ) & U( T[j] )

For i > 1, entry M(i,j) = 1 iff
(1) the first i-1 characters of P match the i-1 characters of T ending at character j-1   ⇔   M(i-1,j-1) = 1
(2) P[i] = T[j]   ⇔   the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) into the i-th position; ANDing with the i-th bit of U(T[j]) establishes whether both conditions hold.
(a sketch follows below)
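
A minimal sketch of the Shift-And scan using Python integers as bit vectors (bit i-1 of the word stands for row i); illustrative, not word-size tuned.

def shift_and(T: str, P: str):
    m = len(P)
    U = {}
    for i, c in enumerate(P):                   # U[c] has bit i set iff P[i+1] == c
        U[c] = U.get(c, 0) | (1 << i)
    occ, M = [], 0
    for j, c in enumerate(T):
        M = ((M << 1) | 1) & U.get(c, 0)        # BitShift(M(j-1)) AND U(T[j])
        if M & (1 << (m - 1)):                  # last row set: an occurrence ends at position j
            occ.append(j - m + 1)
    return occ

print(shift_and("california", "for"))           # [4]: "for" starts at 0-based position 4
print(shift_and("xabxabaaca", "abaac"))         # [4]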

[Worked example: the columns M(1), M(2), M(3), …, M(9), computed one per text position, for P = abaac and T = xabxabaaca; at j = 9 the last row of M becomes 1, so an occurrence of P ends at position 9]
An example j=9
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

4

5

6

7

8

9

12345

1

0

1

0

0

1

0

1

1

0

P=abaac

2

0

0

1

0

0

1

0

0

0

3

0

0

0

0

0

0

1

0

0

4

0

0

0

0

0

0

0

1

0

5

0

0

0

0

0

0

0

0

1

T=xabxabaaca

0
 
0
U (c )   0 
 
0
 
1 

1 
 
1 
BitShift ( M ( 8 )) & U (T [ 9 ])   0  &
 
0
 
1 

0 0
   
0 0
0  0
   
0 0
   
1  1 

1
0

Shift-And method: Complexity








If m ≤ w, any column and any vector U() fit in a memory word
  ⇒ any step requires O(1) time.
If m > w, any column and any vector U() span m/w memory words
  ⇒ any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when the pattern length is close to the word size:
very often true in practice; recall that w = 64 bits in modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: Another solution
Dictionary = { bzip, not, or, space }
P = bzip = 1a 0b
S = “bzip or not bzip”

[Figure: the same dictionary and compressed text as before; this time C(P) can be searched in C(S) with the bit-parallel machinery just described]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

P = o
Dictionary = { bzip, not, or, space },   S = “bzip or not bzip”
not = 1g 0g 0a,   or = 1g 0a 0b

[Figure: both dictionary terms containing the letter o (not and or) are located, and then searched for in C(S)]

Speed ≈ Compression ratio?  No! Why?
A scan of C(S) is needed for each term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: a text T and the set of patterns P1, P2 to be searched simultaneously]
 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And



S is the concatenation of the patterns in P.
R is a bitmap of length m: R[i] = 1 iff S[i] is the first symbol of a pattern.

Use a variant of the Shift-And method searching for S:
  For any symbol c, U’(c) = U(c) AND R
   ⇒ U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
  For any step j:
   ⇒ compute M(j)
   ⇒ then M(j) OR U’(T[j]). Why? To set to 1 the first bit of each pattern that starts with T[j]
   ⇒ check if there are occurrences ending in j. How? Look at the bits of M(j) at the last position of each pattern.

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

Dictionary = { bzip, not, or, space },   S = “bzip or not bzip”
P = bot,  k = 2

[Figure: the compressed text C(S); the dictionary terms within k = 2 mismatches of bot have to be found]
Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
there are no more than l mismatches between the
first i characters of P and the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

[Figure: P[1..i-1] aligned against T[..j-1] with ≤ l mismatches, extended by the equal pair P[i] = T[j]]

This contribution is:   BitShift( M^l(j-1) ) & U( T[j] )

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

[Figure: P[1..i-1] aligned against T[..j-1] with ≤ l-1 mismatches; position j may then contribute the l-th mismatch]

This contribution is:   BitShift( M^{l-1}(j-1) )

Computing Ml


We compute M^l for all l = 0, …, k.
For each j compute M^0(j), M^1(j), …, M^k(j).
For all l, initialize M^l(0) to the zero vector.
In order to compute M^l(j), we observe that there is a match iff

M^l(j) = [ BitShift( M^l(j-1) ) & U( T[j] ) ]  OR  BitShift( M^{l-1}(j-1) )
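
A small sketch of this k-mismatch recurrence, again with Python integers as bit vectors; illustrative only.

def shift_and_k_mismatches(T: str, P: str, k: int):
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M = [0] * (k + 1)                            # M[l] = current column of M^l
    occ = []
    for j, c in enumerate(T):
        prev = M[:]                              # the columns at position j-1
        for l in range(k + 1):
            col = ((prev[l] << 1) | 1) & U.get(c, 0)        # match case
            if l > 0:
                col |= (prev[l - 1] << 1) | 1               # one extra mismatch at position j
            M[l] = col
        if M[k] & (1 << (m - 1)):
            occ.append(j - m + 1)                # an occurrence with <= k mismatches ends at j
    return occ

print(shift_and_k_mismatches("aatatccacaa", "atcgaa", 2))   # [3]: the slide's match, 0-based start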

Example M1

[Worked example: the matrices M^0 and M^1 for P = abaad and T = xabxabaaca; M^1(5,9) = 1, i.e. P occurs with at most 1 mismatch ending at position 9]
How much do we pay?





The running time is O(kn(1+m/w)).
Again, the method is practically efficient for small m.
Only O(k) columns of M are needed at any given time; hence the space used by the algorithm is O(k) memory words (when m ≤ w).

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

Dictionary = { bzip, not, or, space },   S = “bzip or not bzip”
P = bot,  k = 2
not = 1g 0g 0a

[Figure: searching C(S) with the k-mismatch machinery reports the codeword of not, which is within 2 mismatches of bot]

Agrep: more sophisticated operations

The Shift-And method can solve other ops.

The edit distance between two strings p and s is d(p,s) = the minimum number of operations needed to transform p into s via three ops:
  Insertion: insert a symbol in p
  Deletion: delete a symbol from p
  Substitution: change a symbol in p with a different one

Example: d(ananas, banane) = 3

Search by regular expressions
  Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding

γ(x) = 0^(Length−1) followed by the binary representation of x,
where x > 0 and Length = ⌊log2 x⌋ + 1
e.g., 9 is represented as <000, 1001>.

The γ-code for x takes 2⌊log2 x⌋ + 1 bits   (i.e. a factor of 2 from optimal)

Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…

Exercise: given the following sequence of γ-coded integers, reconstruct the original sequence:
0001000001100110000011101100111
→  8, 6, 3, 59, 7
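
A small sketch of γ-encoding and decoding over strings of '0'/'1' (chosen for readability); it reproduces the exercise above.

def gamma_encode(x: int) -> str:
    b = bin(x)[2:]                     # binary representation, Length = floor(log2 x) + 1
    return "0" * (len(b) - 1) + b

def gamma_decode(bits: str):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":          # count the leading zeroes = Length - 1
            z += 1; i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out

seq = [8, 6, 3, 59, 7]
code = "".join(gamma_encode(x) for x in seq)
print(code)                            # 0001000001100110000011101100111
print(gamma_decode(code))              # [8, 6, 3, 59, 7]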

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).

Recall that |γ(i)| ≤ 2·log2 i + 1
How good is this approach wrt Huffman?  Compression ratio ≤ 2·H0(s) + 1
Key fact:   1 ≥ Σ_{i=1,...,x} pi ≥ x · px   ⇒   x ≤ 1/px

How good is it ?
The cost of the encoding is (recall i ≤ 1/pi):

Σ_{i=1,...,|S|} pi · |γ(i)|   ≤   Σ_{i=1,...,|S|} pi · [ 2·log2(1/pi) + 1 ]   =   2·H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ...

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 128² = 16512 words on ≤ 2 bytes
(230,26)-dense code encodes 230 + 230·26 = 6210 words on ≤ 2
bytes, hence more words on 1 byte, and thus it wins if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory: the list itself.
Properties:
  It exploits temporal locality, and it is dynamic
  X = 1^n 2^n 3^n … n^n   ⇒   Huff = O(n² log n), MTF = O(n log n) + n²

Not much worse than Huffman
...but it may be far better.  (a sketch follows below)
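
A minimal sketch of MTF coding and decoding with an explicit list (linear-time list updates; the search-tree and hash-table speed-ups discussed later are omitted).

def mtf_encode(text, alphabet):
    L = list(alphabet)                 # initial symbol list, shared with the decoder
    out = []
    for s in text:
        i = L.index(s)                 # position to be emitted (then gamma/RLE-coded)
        out.append(i)
        L.insert(0, L.pop(i))          # move s to the front
    return out

def mtf_decode(codes, alphabet):
    L = list(alphabet)
    out = []
    for i in codes:
        out.append(L[i])
        L.insert(0, L.pop(i))
    return "".join(out)

L0 = ["i", "m", "p", "s"]
enc = mtf_encode("ipppssssssmmmii", L0)
print(enc)                             # [0, 2, 0, 0, 3, 0, 0, 0, 0, 0, 3, 0, 0, 3, 0]: runs become zeroes
assert mtf_decode(enc, L0) == "ipppssssssmmmii"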

MTF: how good is it ?
Encode the integers via γ-coding:  |γ(i)| ≤ 2·log i + 1
Put the alphabet S at the front and consider the cost of encoding: each symbol x is charged the γ-code of the gap between its consecutive occurrences, so the cost is

O(|S| log |S|)  +  Σ_{x=1}^{|S|} Σ_{i=2}^{n_x} |γ( p_{x,i} − p_{x,i−1} )|

By Jensen’s inequality (the n_x occurrences of x split a window of length N):

≤  O(|S| log |S|)  +  Σ_{x=1}^{|S|} n_x · [ 2·log(N/n_x) + 1 ]   =   O(|S| log |S|)  +  N · [ 2·H0(X) + 1 ]

Hence  La[mtf]  ≤  2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:
  There is a memory ⇒ it exploits spatial locality, and it is a dynamic code
  X = 1^n 2^n 3^n … n^n   ⇒   Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g. with p(a) = .2, p(b) = .5, p(c) = .3:

f(i) = Σ_{j<i} p(j)    (cumulative probability of the symbols preceding i)

f(a) = .0, f(b) = .2, f(c) = .7
a → [0.0, 0.2),   b → [0.2, 0.7),   c → [0.7, 1.0)

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
Start from [0,1):
  after b the interval becomes [0.2, 0.7)
  after a it becomes [0.2, 0.3)
  after c it becomes [0.27, 0.3)

The final sequence interval is [.27, .3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l_0 = 0,   s_0 = 1
l_i = l_{i-1} + s_{i-1} · f[c_i]
s_i = s_{i-1} · p[c_i]

f[c] is the cumulative prob. up to symbol c (not included)
The final interval size is   s_n = Π_{i=1..n} p[c_i]

The interval for a message sequence will be called the sequence interval
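
A tiny sketch of the interval computation above with exact fractions (no bit emission and no integer scaling); the probabilities are those of the running example.

from fractions import Fraction as F

def sequence_interval(msg, p):
    syms = list(p)
    f = {c: sum(p[s] for s in syms[:syms.index(c)]) for c in syms}   # cumulative f[c]
    l, s = F(0), F(1)
    for c in msg:
        l = l + s * f[c]               # l_i = l_{i-1} + s_{i-1} * f[c_i]
        s = s * p[c]                   # s_i = s_{i-1} * p[c_i]
    return l, l + s                    # the sequence interval [l, l+s)

p = {"a": F(2, 10), "b": F(5, 10), "c": F(3, 10)}
print(sequence_interval("bac", p))     # (27/100, 3/10), i.e. [.27, .3) as computed above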

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
Start from [0,1):
  .49 ∈ [.2, .7)  → b;  now work inside [.2, .7)
  .49 ∈ [.3, .55) → b;  now work inside [.3, .55)
  .49 ∈ [.475, .55) → c

The message is bbc.

Representing a real number
Binary fractional representation:
  .75 = .11      1/3 = .01010101…      11/16 = .1011

Algorithm
1.  x = 2·x
2.  If x < 1 output 0
3.  else x = x − 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
e.g.  [0,.33) → .01      [.33,.66) → .1      [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
            min       max       interval
.11         .110      .111      [.75, 1.0)
.101        .1010     .1011     [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

e.g. the sequence interval [.61, .79) contains the code interval [.625, .75) of the dyadic number .101

Can use L + s/2 truncated to 1 + ⌈log (1/s)⌉ bits

Bound on Arithmetic length
Note that −⌈log s⌉ + 1 = ⌈log (2/s)⌉

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
  Output 1 followed by m 0s;  m = 0;  the interval is expanded by 2

If u < R/2 then (bottom half)
  Output 0 followed by m 1s;  m = 0;  the interval is expanded by 2

If l ≥ R/4 and u < 3R/4 then (middle half)
  Increment m;  the interval is expanded by 2

All other cases: just continue...

You find this at

Arithmetic ToolBox
As a state machine

[Figure: the Arithmetic ToolBox as a state machine: given the current interval (L,s), the distribution (p1,....,pS) and the next symbol c, it outputs the refined interval (L’,s’)]
Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
[Figure: PPM feeds the ATB, at each step, with the conditional distribution p[ s | context ], where s is either a character c or the escape symbol esc; the interval (L,s) is refined to (L’,s’)]
Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts (k = 2),  String = ACCBACCACBA B

Order-0 context (empty):   A = 4,  B = 2,  C = 5,  $ = 3

Order-1 contexts:
  A:  C = 3,  $ = 1
  B:  A = 2,  $ = 1
  C:  A = 1,  B = 2,  C = 2,  $ = 3

Order-2 contexts:
  AC:  B = 1,  C = 2,  $ = 2
  BA:  C = 1,  $ = 1
  CA:  C = 1,  $ = 1
  CB:  A = 2,  $ = 1
  CC:  A = 1,  B = 1,  $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
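
A minimal sketch of LZW encoding (ASCII-initialised dictionary; codes are dictionary ids, and Sc is inserted after each emission). The decoder, with its one-step-behind handling of the SSc case, is omitted.

def lzw_encode(text: str):
    dictionary = {chr(i): i for i in range(256)}      # initial 256 ascii entries
    S, out = "", []
    for c in text:
        if S + c in dictionary:
            S += c                                    # extend the current match
        else:
            out.append(dictionary[S])                 # emit the id of the longest match S
            dictionary[S + c] = len(dictionary)       # add Sc to the dictionary
            S = c
    if S:
        out.append(dictionary[S])
    return out

print(lzw_encode("aabaacababacb"))
# [97, 97, 98, 256, 99, 257, 261, 99, 98]: the parse a|a|b|aa|c|ab|aba|c|b of the slide's example
# (here a=97, the slide assumes a=112, so the absolute codes differ)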

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
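
A small sketch of the BWT (via sorted rotations) and of the backward reconstruction through the LF-array; quadratic-time toy code, not the suffix-array-based construction of the next slide.

def bwt(T: str) -> str:
    assert T.endswith("#")                          # unique, lexicographically smallest end-marker
    rots = sorted(T[i:] + T[:i] for i in range(len(T)))
    return "".join(r[-1] for r in rots)             # L = last column of the sorted rotations

def inverse_bwt(L: str) -> str:
    # LF[i] = position in F (the sorted column) of the character occurrence L[i]
    order = sorted(range(len(L)), key=lambda i: L[i])    # stable sort keeps equal chars in order
    LF = [0] * len(L)
    for f_pos, l_pos in enumerate(order):
        LF[l_pos] = f_pos
    out, r = [], 0                                  # row 0 is the rotation starting with '#'
    for _ in range(len(L)):
        out.append(L[r])                            # L[r] precedes F[r] in T
        r = LF[r]
    t = "".join(reversed(out))                      # T read cyclically, starting at '#'
    return t[1:] + t[0]                             # rotate '#' back to the end

L = bwt("mississippi#")
print(L)                                            # ipssm#pissii, as in the slides
assert inverse_bwt(L) == "mississippi#"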

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in - degree ( u )  k ] 

a  2.1

1

k

a

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



What if the sender has never seen the data at the receiver?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77 scheme provides an efficient, optimal solution




fknown is the “previously encoded text”: compress the concatenation fknown·fnew, starting the parsing from fnew

zdelta is one of the best implementations
           Emacs size   Emacs time
uncompr    27Mb         ---
gzip       8Mb          35 secs
zdelta     1.5Mb        42 secs
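Not zdelta itself, but the same one-to-one idea can be sketched with zlib's preset-dictionary feature: fknown is handed to the compressor as "previously seen text", so LZ77 copies can reach into it. A toy sketch, under the assumption that fknown fits the 32KB DEFLATE window:

import zlib

def delta_compress(f_new: bytes, f_known: bytes) -> bytes:
    # f_known acts as already-encoded text: matches of f_new against it
    # become LZ77 back-references instead of literals.
    c = zlib.compressobj(level=9, zdict=f_known)
    return c.compress(f_new) + c.flush()

def delta_decompress(f_delta: bytes, f_known: bytes) -> bytes:
    d = zlib.decompressobj(zdict=f_known)
    return d.decompress(f_delta)

f_known = b"the quick brown fox jumps over the lazy dog. " * 50
f_new = f_known.replace(b"lazy", b"sleepy")
f_delta = delta_compress(f_new, f_known)
assert delta_decompress(f_delta, f_known) == f_new
print(len(f_new), "bytes ->", len(f_delta), "bytes of delta")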

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f ∈ F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: example of the weighted graph GF with the dummy node; the min branching picks, for each file, its cheapest compression reference]

           space   time
uncompr    30Mb    ---
tgz        20%     linear
THIS       8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly: n² edge calculations (zdelta executions)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
           space   time
uncompr    260Mb   ---
tgz        12%     2 mins
THIS       8%      16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
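A toy sketch of the block-matching idea (not the real rsync wire protocol): the client hashes fixed-size blocks of f_old, the server slides over f_new and emits either a block reference or a literal. The weak checksum below is a stand-in for rsync's 4-byte rolling hash, and the strong-hash confirmation (MD5 in rsync) is omitted.

B = 4  # toy block size; rsync uses max{700, sqrt(n)} bytes by default

def weak_hash(block: bytes) -> int:
    return sum(block) % 65521          # stand-in for the rolling checksum

def client_hashes(f_old: bytes) -> dict:
    return {weak_hash(f_old[j:j+B]): j for j in range(0, len(f_old) - B + 1, B)}

def server_encode(f_new: bytes, hashes: dict) -> list:
    out, i = [], 0
    while i + B <= len(f_new):
        h = weak_hash(f_new[i:i+B])
        if h in hashes:                 # the real algorithm also confirms with a strong hash
            out.append(("copy", hashes[h])); i += B
        else:
            out.append(("lit", f_new[i:i+1])); i += 1
    if i < len(f_new):
        out.append(("lit", f_new[i:]))
    return out

f_old = b"abcdefghijklmnop"
f_new = b"abcdXYfghijklmnop"
print(server_encode(f_new, client_hashes(f_old)))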

Rsync: some experiments

          gcc size   emacs size
total     27288      27326
gzip      7563       8577
zdelta    227        1431
rsync     964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends the hashes (in rsync it is the client that sends them), and the client checks them.
The server exploits the common fref to compress the new ftar (rsync just compresses it on its own).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then at each level at most k hashes fail to find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Figure: the suffix tree of T# = mississippi#; edge labels are substrings of T# (e.g. i, s, si, ssi, p, ppi#, pi#, mississippi#) and each leaf stores the starting position (1..12) of its suffix]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
SUF(T) would take Θ(N²) space; the suffix array SA keeps only the starting positions (suffix pointers):

T = mississippi#

SA    SUF(T)
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#

P=si
Suffix Array
• SA: Θ(N log₂ N) bits
• Text T: N chars
⇒ In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison

[Figure: a binary-search step on SA (the same array as above) for P = si over T = mississippi#; here P is larger than the probed suffix, and each step costs 2 accesses (one to SA, one to the text)]

P = si

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison

[Figure: a later binary-search step, where P = si is smaller than the probed suffix]

P = si

Suffix Array search
• O(log₂ N) binary-search steps
• Each step takes O(p) char comparisons
⇒ overall, O(p · log₂ N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
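A small sketch of the indirect binary search just described, over a suffix array built the naive way (Θ(N² log N), only for illustration); the function names are mine.

def suffix_array(T: str) -> list:
    return sorted(range(len(T)), key=lambda i: T[i:])   # naive construction

def sa_search(T: str, SA: list, P: str) -> list:
    # Occurrences of P are a contiguous range of SA (Prop 1): find its two
    # borders with binary searches; each probe costs O(|P|) char comparisons.
    def border(strict):
        lo, hi = 0, len(SA)
        while lo < hi:
            mid = (lo + hi) // 2
            probe = T[SA[mid]:SA[mid] + len(P)]
            if probe < P or (strict and probe == P):
                lo = mid + 1
            else:
                hi = mid
        return lo
    left, right = border(False), border(True)
    return sorted(SA[left:right])

T = "mississippi#"
SA = suffix_array(T)
print([i + 1 for i in sa_search(T, SA, "si")])   # [4, 7], as in the example above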

Locating the occurrences
occ = 2      T = mississippi#

[Figure: the contiguous range of SA containing the suffixes prefixed by si (sippi…, sissippi…); its borders are located by searching si# and si$ (si followed by the smallest/largest symbol), and the SA entries 4 and 7 are the occurrences]

Suffix Array search
• O(p + log₂ N + occ) time
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

[Figure: the Lcp array aligned with SA for T = mississippi#; e.g. the adjacent suffixes issippi# and ississippi# share a prefix of length 4]
• How long is the common prefix between T[i,...] and T[j,...]?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L?
• Search for some Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times?
• Search for a run Lcp[i,i+C-2] whose entries are all ≥ L.


Slide 157

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than ever before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C · p · f/(1+f)
This is at least 10⁴ · f/(1+f)
If we fetch B ≈ 4KB in time C, and the algorithm uses all of it:
(1/B) · (p · f/(1+f) · C) ≈ 30 · f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache: few MBs, some nanosecs, few words fetched
RAM: few GBs, tens of nanosecs, some words fetched
HD: few TBs, few millisecs, B = 32K page
net: many TBs, even secs, packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
       4K    8K    16K    32K    128K   256K   512K   1M
n³     22s   3m    26m    3.5h   28h    --     --     --
n²     0     0     0      1s     26s    106s   7m     28m

An optimal solution
We assume every subsum≠0

[Figure: the array A partitioned around the optimum window: the run immediately before the optimum sums to < 0, the optimum itself sums to > 0]

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm

sum = 0; max = -1;
For i = 1, ..., n do
    if (sum + A[i] ≤ 0) then sum = 0;
    else { sum += A[i]; max = MAX{max, sum}; }

Note:
• Sum < 0 when OPT starts;
• Sum > 0 within OPT
(a runnable sketch follows)
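A runnable version of the scan above (Kadane-style); it assumes, as the slide does, that some subarray has positive sum.

def max_subarray(A):
    # One pass: reset the running sum when it drops to <= 0, since the optimum
    # cannot start inside a prefix whose sum is <= 0.
    best, best_lo, best_hi = float("-inf"), 0, 0
    s, start = 0, 0
    for i, x in enumerate(A):
        if s + x <= 0:
            s, start = 0, i + 1
        else:
            s += x
            if s > best:
                best, best_lo, best_hi = s, start, i
    return best, A[best_lo:best_hi + 1]

A = [2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]
print(max_subarray(A))      # (12, [6, 1, -2, 4, 3])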

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort ⇒ Θ(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 10⁹ random I/Os = 10⁹ × 5ms ≈ 2 months

Binary Merge-Sort
Merge-Sort(A, i, j)
01  if (i < j) then
02      m = (i + j) / 2;          // Divide
03      Merge-Sort(A, i, m);      // Conquer
04      Merge-Sort(A, m+1, j);    // Conquer
05      Merge(A, i, m, j)         // Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n = 10⁹ tuples ⇒ a few GBs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Θ(n log₂ n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
[Figure: the merge-sort recursion tree on an example array of numbers; each level merges pairs of sorted runs, and runs of size M fit in memory]

How do we deploy the disk/memory features?

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X ≤ M/B runs ⇒ log_{M/B}(N/M) passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
[Figure: X = M/B input buffers Bf1 … BfX, one per run, each holding the run's current page with a pointer pi, plus one output buffer Bfo. Repeatedly select min(Bf1[p1], Bf2[p2], …, BfX[pX]); fetch a new page of a run when its pi reaches B; flush Bfo to the merged output run when it is full, until EOF]
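A compact in-memory sketch of the X-way merge: Python lists stand in for the B-sized pages, and a heap plays the role of the min-selection among the current heads.

import heapq

def multiway_merge(runs):
    # One current item per run in a min-heap; pop the minimum, refill from its run.
    heap = [(run[0], r, 0) for r, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        val, r, j = heapq.heappop(heap)
        out.append(val)
        if j + 1 < len(runs[r]):
            heapq.heappush(heap, (runs[r][j + 1], r, j + 1))
    return out

print(multiway_merge([[1, 2, 5, 10], [2, 7, 9, 13], [3, 4, 8, 19]]))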

Cost of Multi-way Merge-Sort


Number of passes = log_{M/B} #runs ≈ log_{M/B} (N/M)

Optimal cost = Θ((N/B) · log_{M/B} (N/M)) I/Os

In practice
M/B ≈ 1000 ⇒ #passes = log_{M/B} (N/M) ≈ 1
One multiway merge ⇒ 2 passes = a few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Can compression help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm

Use a pair of variables <X, C> (initially X is the first item and C = 1)
For each subsequent item s of the stream:
    if (X == s) then C++
    else { C--; if (C == 0) { X = s; C = 1; } }
Return X;

(a runnable sketch follows the proof below)

Proof

A problem can arise only if #occ(y) ≤ N/2.
If X ≠ y at the end, then every one of y's occurrences has a distinct "negative" mate (an item that cancelled it), hence the mates are at least #occ(y).
As a result N ≥ 2 · #occ(y), contradicting #occ(y) > N/2.
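A runnable version in the classical majority-vote formulation (the candidate is replaced when the counter is zero): same two variables, same single pass; the answer is guaranteed only when some item really occurs more than N/2 times.

def majority_candidate(stream):
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1          # adopt a new candidate
        elif X == s:
            C += 1
        else:
            C -= 1               # pair one occurrence of X with the non-X item s
    return X

A = "bacccdcbaaaccbccc"          # the stream of the slide; 'c' occurs 9 > 17/2 times
print(majority_candidate(A))     # c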

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 · 10⁹ ⇒ size ≈ 6GB
n = 10⁶ documents
TotT = 10⁹ total term occurrences (avg term length is 6 chars)
t = 5 · 10⁵ distinct terms

What kind of data structure should we build to support word-based searches?

Solution 1: Term-Doc matrix                 n = 1 million, t = 500K

            Antony &   Julius   The       Hamlet   Othello   Macbeth
            Cleopatra  Caesar   Tempest
Antony          1         1        0         0        0         1
Brutus          1         1        0         1        0         0
Caesar          1         1        0         1        1         1
Calpurnia       0         1        0         0        0         0
Cleopatra       1         0        0         0        0         0
mercy           1         0        1         1        1         1
worser          1         0        1         1        1         0

1 if the play contains the word, 0 otherwise.   Space is 500Gb !

Solution 2: Inverted index

Brutus    → 2 4 8 16 32 64 128
Calpurnia → 1 2 3 5 8 13 21 34
Caesar    → 13 16

We can still do better: i.e., 30÷50% of the original text

1. Typically about 12 bytes are used per posting
2. We have 10⁹ total terms ⇒ at least 12GB of space
3. Compressing the 6GB of documents gets 1.5GB of data
A better index, but it is still >10 times the text !!!!
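A tiny sketch of the idea: an inverted index maps each term to its sorted postings list, and an AND query is a linear merge of two lists. The function names and the toy documents are mine.

from collections import defaultdict

def build_index(docs):
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {t: sorted(ids) for t, ids in postings.items()}

def intersect(p, q):
    # AND query: merge two sorted postings lists in O(|p| + |q|) time.
    i = j = 0; out = []
    while i < len(p) and j < len(q):
        if p[i] == q[j]: out.append(p[i]); i += 1; j += 1
        elif p[i] < q[j]: i += 1
        else: j += 1
    return out

idx = build_index(["brutus killed caesar", "caesar praised calpurnia", "brutus met calpurnia"])
print(idx["brutus"], idx["calpurnia"], intersect(idx["brutus"], idx["calpurnia"]))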

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string is (losslessly) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM into fewer bits?
NO: they are 2ⁿ, but the shorter compressed messages are at most

Σ_{i=1}^{n−1} 2^i = 2ⁿ − 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probabilities p(s), the self-information of s is:

i(s) = log₂ (1/p(s)) = −log₂ p(s)

Lower probability ⇒ higher information

Entropy is the weighted average of i(s):

H(S) = Σ_{s∈S} p(s) · log₂ (1/p(s))   bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword lengths L[s], the average length is defined as

La(C) = Σ_{s∈S} p(s) · L[s]

We say that a prefix code C is optimal if, for all prefix codes C’, La(C) ≤ La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal uniquely decodable code, there exists a prefix code with the same codeword lengths, and thus the same optimal average length. And vice versa…

Theorem (golden rule). If C is an optimal prefix code for a source with probabilities {p1, …, pn},
then pi < pj ⇒ L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any uniquely decodable code C, we have

H(S) ≤ La(C)

Theorem (upper bound, Shannon). For any probability distribution, there exists a prefix code C such that

La(C) ≤ H(S) + 1

(the Shannon code assigns ⌈log₂ 1/p(s)⌉ bits to symbol s)

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)   b(.2)   c(.2)   d(.5)

[Figure: the Huffman tree built bottom-up: merge a(.1) + b(.2) into (.3), then (.3) + c(.2) into (.5), then (.5) + d(.5) into (1)]

a=000, b=001, c=01, d=1
There are 2^(n−1) “equivalent” Huffman trees

What about ties (and thus, tree depth) ?
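A minimal Huffman construction with a heap — a sketch, not the canonical/succinct variant discussed later. The counter only breaks ties, so the printed 0/1 labels may differ from the slide, but the codeword lengths 3, 3, 2, 1 are the same.

import heapq
from itertools import count

def huffman_code(probs):
    tick = count()                                   # tie-breaker for equal probabilities
    heap = [(p, next(tick), {s: ""}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)              # the two least-probable subtrees
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, next(tick), merged))
    return heap[0][2]

print(huffman_code({"a": .1, "b": .2, "c": .2, "d": .5}))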

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. Its self-information is

−log₂(.999) ≈ .00144 bits

If we were to send 1000 such symbols we might hope to use 1000 · .00144 ≈ 1.44 bits.

Using Huffman, we take at least 1 bit per symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 ≤ 1 extra bit per macro-symbol = 1/k extra bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
[Figure: the words of T = “bzip or not bzip” (bzip, or, not, space) are the symbols of a 128-ary Huffman tree; each codeword is a sequence of bytes carrying 7 bits of Huffman code each, with the remaining bit per byte used as tag, so codewords are byte-aligned]

CGrep and other ideas...
P= bzip = 1a 0b

[Figure: the codeword of P is searched GREP-like directly in C(T), T = “bzip or not bzip”, answering yes/no at each byte-aligned codeword]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary: bzip, not, or, space
P = bzip = 1a 0b
S = “bzip or not bzip”

[Figure: the codeword of P is compared against the byte-aligned codewords of C(S); the two occurrences of “bzip” answer yes, the other codewords answer no]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H(s) = Σ_{i=1}^{m} 2^{m−i} · s[i]

P = 0101
H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5
s = s’ if and only if H(s) = H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons

We can compute H(Tr) from H(Tr−1):

H(Tr) = 2·H(Tr−1) − 2^m·T[r−1] + T[r+m−1]

T = 10110101
T1 = 1011, T2 = 0110
H(T1) = H(1011) = 11
H(T2) = H(0110) = 2·11 − 2⁴·1 + 0 = 22 − 16 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!

1·2 + 0 (mod 7) = 2
2·2 + 1 (mod 7) = 5
5·2 + 1 (mod 7) = 4
4·2 + 1 (mod 7) = 2
2·2 + 1 (mod 7) = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(Tr−1), since
2^m (mod q) = 2·(2^(m−1) (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
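A sketch of the fingerprint scan (binary alphabet, 1-based positions as in the example above). For simplicity q is a fixed Mersenne prime rather than a random one, and every fingerprint hit is verified, which gives the deterministic variant.

def karp_rabin(T: str, P: str, q: int = 2**31 - 1) -> list:
    m, n = len(P), len(T)
    hp = ht = 0
    for i in range(m):                        # Hq(P) and Hq(T1)
        hp = (2 * hp + int(P[i])) % q
        ht = (2 * ht + int(T[i])) % q
    top = pow(2, m - 1, q)                    # 2^(m-1) mod q, used to drop the leading bit
    occ = []
    for r in range(n - m + 1):
        if hp == ht and T[r:r + m] == P:      # verify: no false matches
            occ.append(r + 1)
        if r + m < n:                         # roll the window by one position
            ht = (2 * (ht - int(T[r]) * top) + int(T[r + m])) % q
    return occ

print(karp_rabin("10110101", "0101"))         # [5], as in the example above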

Problem 1: Solution
Dictionary: bzip, not, or, space
P = bzip = 1a 0b
S = “bzip or not bzip”

[Figure: as before, the codeword of P is matched against C(S), codeword by codeword]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each character x of the alphabet: U(x) has a 1 exactly at the positions of P where character x occurs.
Example:
P = abaac
U(a) = [1,0,1,1,0]ᵀ    U(b) = [0,1,0,0,0]ᵀ    U(c) = [0,0,0,0,1]ᵀ

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with the i-th bit of U(T[j]) to establish if both are true

An example, j=1
T = xabxabaaca, P = abaac
M(1) = BitShift(M(0)) & U(T[1]) = [1,0,0,0,0]ᵀ & U(x) = [0,0,0,0,0]ᵀ

An example, j=2
M(2) = BitShift(M(1)) & U(T[2]) = [1,0,0,0,0]ᵀ & U(a) = [1,0,0,0,0]ᵀ

An example, j=3
M(3) = BitShift(M(2)) & U(T[3]) = [1,1,0,0,0]ᵀ & U(b) = [0,1,0,0,0]ᵀ

An example, j=9
M(9) = BitShift(M(8)) & U(T[9]) = [1,1,0,0,1]ᵀ & U(c) = [0,0,0,0,1]ᵀ
The 5th bit of M(9) is 1, so an occurrence of P ends at position 9 (it starts at position 5).

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.
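A word-level sketch of the method: a Python integer stands in for the machine word, and its i-th bit (counting from the least significant, which plays the role of the "first" bit) says that P[1..i+1] matches the text ending at the current position.

def shift_and(T: str, P: str) -> list:
    m = len(P)
    U = {}
    for i, c in enumerate(P):                 # U(c): 1-bits at the positions of c in P
        U[c] = U.get(c, 0) | (1 << i)
    M, occ = 0, []
    for j, c in enumerate(T):
        M = ((M << 1) | 1) & U.get(c, 0)      # BitShift(M) AND U(T[j])
        if M & (1 << (m - 1)):                # last bit set: an occurrence ends at j
            occ.append(j - m + 2)             # 1-based starting position
    return occ

print(shift_and("xabxabaaca", "abaac"))       # [5]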

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
U(a) = [1,0,1,1,0]ᵀ    U(b) = [1,1,0,0,0]ᵀ    U(c) = [0,0,0,0,1]ᵀ

What about ‘?’, ‘[^…]’ (not)?

Problem 1: Another solution
Dictionary: bzip, not, or, space
P = bzip = 1a 0b
S = “bzip or not bzip”

[Figure: the compressed text C(S) is scanned for the codeword of P]

Speed ≈ Compression ratio

Problem 2
Dictionary: bzip, not, or, space

Given a pattern P, find all the occurrences in S of all the terms containing P as a substring

P = o

[Figure: the dictionary terms containing “o” are not and or, with codewords not = 1g 0g 0a and or = 1g 0a 0b; these codewords are then searched in C(S), S = “bzip or not bzip”]

Speed ≈ Compression ratio? No! Why?
A scan of C(S) is needed for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

S is the concatenation of the patterns in P
R is a bitmap of length m:
  R[i] = 1 iff S[i] is the first symbol of some pattern

Use a variant of the Shift-And method searching for S:
  For any symbol c, U’(c) = U(c) AND R
    ⇒ U’(c)[i] = 1 iff S[i] = c and it is the first symbol of a pattern
  For any step j,
    compute M(j)
    then M(j) OR U’(T[j]). Why?
    ⇒ it sets to 1 the first bit of each pattern that starts with T[j]
  Check if there are occurrences ending in j. How?

Problem 3
Dictionary: bzip, not, or, space

Given a pattern P, find all the occurrences in S of all the terms containing P as a substring, allowing at most k mismatches

P = bot, k = 2

[Figure: the codewords of the dictionary terms matching P with ≤ k mismatches are searched in C(S), S = “bzip or not bzip”]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
there are no more than l mismatches between the first i characters of P and the i characters of T ending at position j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

BitShift( Ml(j−1) ) ∧ U(T[j])

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

BitShift( Ml−1(j−1) )

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

Ml(j) = [ BitShift( Ml(j−1) ) ∧ U(T[j]) ]  ∨  BitShift( Ml−1(j−1) )

Example
T = xabxabaaca, P = abaad

[Figure: the matrices M⁰ and M¹ for j = 1..10; in particular M¹(5,9) = 1, i.e. P occurs ending at position 9 (starting at 5) with at most 1 mismatch]

How much do we pay?

The running time is O(k·n·(1 + m/w))
Again, the method is practically efficient for small m.
Only O(k) columns of M are needed at any given time; hence the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary: bzip, not, or, space

Given a pattern P, find all the occurrences in S of all the terms containing P as a substring, allowing k mismatches

P = bot, k = 2

[Figure: “not” matches P = bot within 2 mismatches; its codeword not = 1g 0g 0a is then searched in C(S), S = “bzip or not bzip”]

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
0 000 ...........0 x in b in ary
Length-1


x > 0 and Length = log2 x +1
e.g., 9 represented as <000,1001>.



g-code for x takes 2 log2 x +1 bits

(ie. factor of 2 from optimal)



Optimal for Pr(x) = 1/2x2, and i.i.d integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8

6

3

59

7

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How much good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
1 ≥Si=1,...,x pi ≥ x * px  x ≤ 1/px

How good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):



p i  | g ( i ) |

i  1 ,..., S

This is:



i  1 ,.., S

p i  [ 2 * log

1

 1]

pi

 2 * H 0 ( X ) 1
No much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1ⁿ 2ⁿ 3ⁿ … nⁿ  ⇒  Huff = O(n² log n), MTF = O(n log n) + n²

Not much worse than Huffman
...but it may be far better
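A direct sketch of MTF encoding and decoding with an explicit list; the O(log |S|) tree/hash machinery of the next slide is omitted.

def mtf_encode(s, alphabet):
    L, out = list(alphabet), []
    for c in s:
        i = L.index(c)
        out.append(i)
        L.insert(0, L.pop(i))       # move the symbol to the front
    return out

def mtf_decode(codes, alphabet):
    L, out = list(alphabet), []
    for i in codes:
        c = L.pop(i)
        out.append(c)
        L.insert(0, c)
    return "".join(out)

codes = mtf_encode("aaabbbbccc", "abcd")
print(codes)                        # [0, 0, 0, 1, 0, 0, 0, 2, 0, 0]: locality -> small integers
print(mtf_decode(codes, "abcd"))    # aaabbbbccc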

MTF: how good is it ?
Encode the integers via γ-coding:  |γ(i)| ≤ 2·log₂ i + 1
Put S at the front of the list and consider the cost of encoding:

O(|S| log |S|) + Σ_{x=1..|S|} Σ_{i=2..n_x} |γ( p_i^x − p_{i−1}^x )|

By Jensen’s inequality:

≤ O(|S| log |S|) + Σ_{x=1..|S|} n_x · [ 2·log₂(N/n_x) + 1 ]
≤ O(|S| log |S|) + N · [ 2·H₀(X) + 1 ]

⇒  La[mtf] ≤ 2·H₀(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In the case of binary strings ⇒ just the run lengths and one initial bit
Properties:
Exploits spatial locality, and it is a dynamic code (there is a memory)
X = 1ⁿ 2ⁿ 3ⁿ … nⁿ  ⇒  Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.   p(a) = .2, p(b) = .5, p(c) = .3, so the symbol ranges are a → [0, .2), b → [.2, .7), c → [.7, 1.0)

f(i) = Σ_{j=1}^{i−1} p(j)        f(a) = .0, f(b) = .2, f(c) = .7

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
[Figure: starting from [0,1), symbol b narrows the interval to [.2,.7), then a narrows it to [.2,.3), then c to [.27,.3)]

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l₀ = 0,  s₀ = 1
lᵢ = lᵢ₋₁ + sᵢ₋₁ · f[cᵢ]
sᵢ = sᵢ₋₁ · p[cᵢ]

f[c] is the cumulative probability up to symbol c (not included)

The final interval size is  sₙ = Π_{i=1..n} p[cᵢ]

The interval for a message sequence will be called the sequence interval

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval
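A float-based sketch (fine for short messages; real coders use the integer renormalization discussed later): encoding shrinks [l, l+s) symbol by symbol, decoding repeatedly locates x inside a symbol interval and rescales.

def encode_interval(msg, probs):
    cum, acc = {}, 0.0
    for c, p in probs.items():
        cum[c] = acc; acc += p               # f[c] = cumulative probability before c
    l, s = 0.0, 1.0
    for c in msg:
        l, s = l + s * cum[c], s * probs[c]
    return l, s                              # the sequence interval [l, l+s)

def decode_number(x, n, probs):
    out = []
    for _ in range(n):
        acc = 0.0
        for c, p in probs.items():
            if acc <= x < acc + p:
                out.append(c)
                x = (x - acc) / p            # rescale x inside the chosen interval
                break
            acc += p
    return "".join(out)

probs = {"a": 0.2, "b": 0.5, "c": 0.3}
l, s = encode_interval("bac", probs)
print(round(l, 4), round(l + s, 4))          # 0.27 0.3, as in the encoding example
print(decode_number(0.49, 3, probs))         # bbc, as in the decoding example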

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
[Figure: .49 falls in b’s interval [.2,.7), rescaling gives .58, again in b’s interval, rescaling gives .76, which falls in c’s interval [.7,1)]

The message is bbc.

Representing a real number
Binary fractional representation:
.75 = .11      1/3 = .0101…      11/16 = .1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
code    min     max     interval
.11     .110    .111    [.75, 1.0)
.101    .1010   .1011   [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most
1 + ⌈log₂(1/s)⌉ = 1 + ⌈log₂ Π_{i=1..n} (1/pᵢ)⌉
             ≤ 2 + Σ_{i=1..n} log₂ (1/pᵢ)
             = 2 + Σ_{k=1..|S|} n·p_k · log₂ (1/p_k)
             = 2 + n·H₀  bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)

If l ≥ R/2 then (top half)
  Output 1 followed by m 0s; m = 0
  Message interval is expanded by 2

If u < R/2 then (bottom half)
  Output 0 followed by m 1s; m = 0
  Message interval is expanded by 2

If l ≥ R/4 and u < 3R/4 then (middle half)
  Increment m
  Message interval is expanded by 2

All other cases: just continue…

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
String = ACCBACCACBA B        k = 2

Context: Empty      A = 4   B = 2   C = 5   $ = 3

Context: A          C = 3   $ = 1
Context: B          A = 2   $ = 1
Context: C          A = 1   B = 2   C = 2   $ = 3

Context: AC         B = 1   C = 2   $ = 2
Context: BA         C = 1   $ = 1
Context: CA         C = 1   $ = 1
Context: CB         A = 2   $ = 1
Context: CC         A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
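A toy LZ77 coder/decoder (not gzip): each step emits (d, len, next-char) within a window of W positions. With W = 6 it reproduces the windowed example above, and the char-by-char copy in the decoder handles the overlapping case len > d.

def lz77_encode(s: str, W: int = 6):
    i, out = 0, []
    while i < len(s):
        best_d = best_len = 0
        for j in range(max(0, i - W), i):            # candidate copy sources in the window
            l = 0
            while i + l < len(s) - 1 and s[j + l] == s[i + l]:
                l += 1                               # may run past i: overlap is allowed
            if l > best_len:
                best_d, best_len = i - j, l
        out.append((best_d, best_len, s[i + best_len]))
        i += best_len + 1
    return out

def lz77_decode(codes):
    out = []
    for d, l, c in codes:
        for _ in range(l):
            out.append(out[-d])                      # char-by-char copy: works if l > d
        out.append(c)
    return "".join(out)

codes = lz77_encode("aacaacabcabaaac")
print(codes)        # [(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')]
assert lz77_decode(codes) == "aacaacabcabaaac"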

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
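A small LZW coder/decoder (ASCII ids instead of the slide's a = 112); the decoder stays one step behind, and the SSc case is the branch where the received id is exactly the entry about to be created.

def lzw_encode(s: str):
    dic = {chr(i): i for i in range(256)}
    out, S = [], ""
    for c in s:
        if S + c in dic:
            S += c                            # extend the longest match
        else:
            out.append(dic[S])
            dic[S + c] = len(dic)             # add Sc, but do not transmit c
            S = c
    out.append(dic[S])
    return out

def lzw_decode(codes):
    dic = {i: chr(i) for i in range(256)}
    prev = dic[codes[0]]
    out = [prev]
    for k in codes[1:]:
        cur = dic[k] if k in dic else prev + prev[0]   # the special SSc case
        out.append(cur)
        dic[len(dic)] = prev + cur[0]
        prev = cur
    return "".join(out)

codes = lzw_encode("aabaacababacb")
print(codes)                                  # [97, 97, 98, 256, 99, 257, 261, 99, 98]
assert lzw_decode(codes) == "aabaacababacb"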

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]
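A compact sketch of both directions: the forward transform sorts the rotations explicitly (the same Θ(n² log n) inefficiency noted below), while the inverse uses the two key properties above, with the LF-mapping obtained from a stable sort of L.

def bwt(T: str) -> str:
    # T must end with a unique, smallest sentinel such as '#'
    SA = sorted(range(len(T)), key=lambda i: T[i:] + T[:i])
    return "".join(T[i - 1] for i in SA)       # L[i] = T[SA[i]-1]; i = 0 wraps to '#'

def ibwt(L: str) -> str:
    n = len(L)
    order = sorted(range(n), key=lambda i: L[i])   # stable: k-th c of L <-> k-th c of F
    LF = [0] * n
    for f_pos, l_pos in enumerate(order):
        LF[l_pos] = f_pos
    r, out = 0, []
    for _ in range(n):
        out.append(L[r])                       # L[r] precedes F[r] in T: walk backwards
        r = LF[r]
    s = "".join(reversed(out))                 # this is '#' followed by T without its '#'
    return s[1:] + s[0]

L = bwt("mississippi#")
print(L)              # ipssm#pissii
print(ibwt(L))        # mississippi#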

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution

Altavista crawl (1999) and WebBase crawl (2001): the indegree follows a power-law distribution

Pr[ in-degree(u) = k ] ∝ 1/k^α,   α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides an efficient, optimal solution




fknown is the “previously encoded text”: compress the concatenation fknown·fnew starting from fnew

zdelta is one of the best implementations
           Emacs size   Emacs time
uncompr    27Mb         ---
gzip       8Mb          35 secs
zdelta     1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: weighted graph GF over the files, plus the dummy node; edge weights (e.g. 20, 123, 220, 620, 2000) are zdelta/gzip sizes, and the min branching picks the cheapest reference for each file.]

           space   time
uncompr    30Mb    ---
tgz        20%     linear
THIS       8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n² edge calculations (zdelta executions)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F, thus saving zdelta executions. Nonetheless, still n² time
           space   time
uncompr    260Mb   ---
tgz        12%     2 mins
THIS       8%      16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
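A toy sketch of the block-matching idea in Python (this is not the real rsync protocol: a single strong hash stands in for the rolling-hash + MD5 pair, and the block size is tiny, just to show the round-trip):

import hashlib

BLOCK = 4  # toy block size; rsync's default is max{700, sqrt(n)} bytes

def block_hashes(f_old):
    # client side: hash every block of the outdated copy
    return {hashlib.md5(f_old[i:i+BLOCK]).hexdigest(): i // BLOCK
            for i in range(0, len(f_old), BLOCK)}

def encode(f_new, hashes):
    # server side: emit block references where possible, literal bytes otherwise
    out, i = [], 0
    while i < len(f_new):
        h = hashlib.md5(f_new[i:i+BLOCK]).hexdigest()
        if h in hashes:
            out.append(('copy', hashes[h])); i += BLOCK
        else:
            out.append(('lit', f_new[i:i+1])); i += 1
    return out

def decode(encoded, f_old):
    # client side: rebuild f_new from its old copy plus the encoding
    res = b''
    for kind, v in encoded:
        res += f_old[v*BLOCK:(v+1)*BLOCK] if kind == 'copy' else v
    return res

f_old, f_new = b'the cat sat on the mat', b'the cat sat on a red mat'
enc = encode(f_new, block_hashes(f_old))
assert decode(enc, f_old) == f_new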

Rsync: some experiments

         gcc size   emacs size
total    27288      27326
gzip     7563       8577
zdelta   227        1431
rsync    964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends the hashes (unlike the client in rsync), and the client checks them.
The server deploys the common fref to compress the new ftar (rsync compresses just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Figure: the suffix tree of T# = mississippi#, with edge labels such as i, s, si, ssi, ppi#, pi#, i#, mississippi#, and with the 12 leaves storing the starting positions of the corresponding suffixes.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Θ(N²) space

SA   SUF(T)
12   #
11   i#
 8   ippi#
 5   issippi#
 2   ississippi#
 1   mississippi#
10   pi#
 9   ppi#
 7   sippi#
 4   sissippi#
 6   ssippi#
 3   ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Θ(N log₂ N) bits
• Text T: N chars
⇒ In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirect binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
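A minimal Python sketch of the indirect binary search over SA (naive O(N² log N) construction, 1-based positions as in the slides; the function names are ours):

from bisect import bisect_left, bisect_right

def suffix_array(T):
    return sorted(range(len(T)), key=lambda i: T[i:])    # naive construction

def sa_search(T, SA, P):
    # binary search for the contiguous SA range of suffixes having P as a prefix
    prefixes = [T[i:i + len(P)] for i in SA]
    lo, hi = bisect_left(prefixes, P), bisect_right(prefixes, P)
    return sorted(SA[k] + 1 for k in range(lo, hi))       # 1-based occurrences

T = "mississippi#"
SA = suffix_array(T)
print([i + 1 for i in SA])      # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(sa_search(T, SA, "si"))   # [4, 7]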

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...]?
• It is the min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L?
• Search for some Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times?
• Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.


The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)

This is at least 10^4 * f/(1+f)

If we fetch B ≈ 4Kb in time C, and the algorithm uses all of them:
(1/B) * (p * f/(1+f) * C) ≈ 30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
size   4K    8K    16K   32K    128K   256K   512K   1M
n^3    22s   3m    26m   3.5h   28h    --     --     --
n^2    0     0     0     1s     26s    106s   7m     28m

An optimal solution
We assume every subsum≠0

[Figure: the array A, with the portion preceding the Optimum window having running sum < 0 and the Optimum window itself having sum > 0.]

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum = 0; max = -1;
For i = 1,...,n do
    if (sum + A[i] ≤ 0) then sum = 0;
    else sum += A[i]; max = MAX{max, sum};
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT
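The scan above translates directly into code; a minimal Python version (it returns 0 on an all-negative array, consistent with the slide's assumption that the optimum is positive):

def max_subarray_sum(A):
    best = cur = 0
    for x in A:
        cur = cur + x if cur + x > 0 else 0   # reset when the running sum drops <= 0
        best = max(best, cur)
    return best

A = [2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]
print(max_subarray_sum(A))   # 12  (the window 6 + 1 - 2 + 4 + 3)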

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort ⇒ Θ(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 10^9 random I/Os = 10^9 × 5ms ≈ 2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
  if (i < j) then
      m = (i+j)/2;             // Divide
      Merge-Sort(A,i,m);       // Conquer
      Merge-Sort(A,m+1,j);     // Conquer
      Merge(A,i,m,j)           // Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n = 10^9 tuples ⇒ a few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Θ(n log₂ n) random I/Os

[5ms] * n log₂ n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
[Figure: the recursion tree of binary Merge-Sort — sorted runs of growing size are repeatedly merged in pairs.]

How do we deploy the disk/mem features (M, B)?
N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF
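The merging loop sketched above is exactly what Python's heapq.merge does in internal memory (one cursor per run, repeatedly extract the minimum); in the external version each run is read through a buffer of B items and the output buffer is flushed to disk when full. A toy illustration:

import heapq

runs = [[1, 2, 5, 7, 9, 10], [2, 7, 8, 13, 19], [3, 4, 11, 12, 15, 17]]  # X sorted runs
print(list(heapq.merge(*runs)))
# [1, 2, 2, 3, 4, 5, 7, 7, 8, 9, 10, 11, 12, 13, 15, 17, 19]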

Cost of Multi-way Merge-Sort


Number of passes = log_{M/B} #runs ≈ log_{M/B} (N/M)

Optimal cost = Θ( (N/B) · log_{M/B} (N/M) ) I/Os

In practice



M/B ≈ 1000 ⇒ #passes = log_{M/B} (N/M) ≈ 1
One multiway merge ⇒ 2 passes (R/W) = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

May compression help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;
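In code, the scan above is the classic majority-vote loop (Boyer–Moore); a minimal sketch of the same idea, written with the usual initialization C = 0:

def majority_candidate(stream):
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1
        elif X == s:
            C += 1
        else:
            C -= 1
    return X        # correct whenever some item occurs > N/2 times

print(majority_candidate("bacccdcbaaaccbccc"))   # 'c'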

Proof
If the algorithm ends with X ≠ y, then every one of y's occurrences has a distinct “negative” mate, hence the number of mates is ≥ #occ(y). As a result 2 * #occ(y) > N items would be needed: impossible.
(The guarantee breaks if the most frequent item occurs ≤ N/2 times.)

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 10^9 ⇒ size = 6Gb
n = 10^6 documents
TotT = 10^9 (avg term length is 6 chars)
t = 5 * 10^5 distinct terms

What kind of data structure should we build to support word-based searches?

Solution 1: Term-Doc matrix
n = 1 million documents, t = 500K terms (1 if the play contains the word, 0 otherwise):

             Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony             1                1              0          0       0        1
Brutus             1                1              0          1       0        0
Caesar             1                1              0          1       1        1
Calpurnia          0                1              0          0       0        0
Cleopatra          1                0              0          0       0        0
mercy              1                0              1          1       1        1
worser             1                0              1          1       1        0

Space is 500Gb !

Solution 2: Inverted index
Brutus    → 2 4 8 16 32 64 128
Calpurnia → 13 16
Caesar    → 1 2 3 5 8 13 21 34

We can still do better: ≈30–50% of the original text.

1. Typically about 12 bytes per posting are used
2. We have 10^9 total terms ⇒ at least 12Gb space
3. Compressing the 6Gb of documents gets 1.5Gb of data
Better index, but it is still >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string is (losslessly) compressed, some other must expand.
Take all messages of length n. Is it possible to compress ALL OF THEM into fewer bits?
NO: they are 2^n, but there are fewer compressed messages:

Σ_{i=1}^{n-1} 2^i = 2^n − 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the self-information of s is:

i(s) = log₂ (1/p(s)) = −log₂ p(s)

Lower probability  higher information

Entropy is the weighted average of i(s):

H(S) = Σ_{s∈S} p(s) · log₂ (1/p(s))   bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be uniquely decomposed into its codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword lengths L[s], the average length is defined as

La(C) = Σ_{s∈S} p(s) · L[s]

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft–McMillan). For any optimal uniquely decodable code, there exists a prefix code with the same codeword lengths and thus the same optimal average length. And vice versa…

Theorem (golden rule). If C is an optimal prefix code for a source with probabilities {p1, …, pn}, then pi < pj ⇒ L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any uniquely decodable code C, we have H(S) ≤ La(C).

Theorem (upper bound, Shannon). For any probability distribution, there exists a prefix code C such that La(C) ≤ H(S) + 1.

(Shannon code: symbol s takes ⌈log 1/p(s)⌉ bits)

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
[Figure: Huffman tree — merge a(.1)+b(.2) into (.3), then (.3)+c(.2) into (.5), then (.5)+d(.5) into (1).]

a=000, b=001, c=01, d=1
There are 2^{n-1} “equivalent” Huffman trees

What about ties (and thus, tree depth) ?
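A short Python sketch that builds an optimal prefix code for the running example with a heap; depending on how ties are broken it returns one of the equivalent trees, but always with the same codeword lengths:

import heapq
from itertools import count

def huffman_codes(probs):
    tie = count()                              # break ties between equal probabilities
    heap = [(p, next(tie), sym) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, a = heapq.heappop(heap)
        p2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(tie), (a, b)))   # merge the two rarest
    codes = {}
    def walk(node, prefix=""):
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"
    walk(heap[0][2])
    return codes

codes = huffman_codes({'a': .1, 'b': .2, 'c': .2, 'd': .5})
print({s: len(c) for s, c in sorted(codes.items())})
# {'a': 3, 'b': 3, 'c': 2, 'd': 1} -- the same lengths as a=000, b=001, c=01, d=1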

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h² + |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
−log₂(.999) ≈ .00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:


Model takes |S|^k · (k * log |S|) + h² bits (where h might be |S|)

It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet (i.e., T ∈ {0,1}^n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m:

H(s) = Σ_{i=1}^{m} 2^{m−i} · s[i]

P = 0101
H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1·2 + 0 (mod 7) = 2
2·2 + 1 (mod 7) = 5
5·2 + 1 = 11 (mod 7) = 4
4·2 + 1 = 9 (mod 7) = 2
2·2 + 1 (mod 7) = 5  ⇒  Hq(P) = 5

We can still compute Hq(Tr) from Hq(Tr-1), since
2^m (mod q) = 2·(2^{m−1} (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)
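A compact Python sketch of the fingerprint scan over a binary text (here q is fixed instead of being drawn at random, and every fingerprint hit is verified, so the output is always correct):

def rabin_karp(T, P, q=2**31 - 1):
    n, m = len(T), len(P)
    if m > n:
        return []
    hP = hT = 0
    for i in range(m):                       # fingerprints of P and of T_1
        hP = (2 * hP + int(P[i])) % q
        hT = (2 * hT + int(T[i])) % q
    top = pow(2, m - 1, q)                   # 2^(m-1) mod q, for the sliding update
    occ = []
    for r in range(n - m + 1):
        if hP == hT and T[r:r+m] == P:       # verify to rule out false matches
            occ.append(r + 1)                # 1-based positions, as in the slides
        if r + m < n:                        # Hq(T_{r+1}) from Hq(T_r)
            hT = ((hT - int(T[r]) * top) * 2 + int(T[r+m])) % q
    return occ

print(rabin_karp("10110101", "0101"))   # [5]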

Proof on the board

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

[Figure: the m×n matrix M for P = for and T = california; the only 1 in the last row is M(3,7), i.e. P occurs ending at position 7 of T.]

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to compute the j-th column of M from the (j−1)-th one.
We define the m-length binary vector U(x) for each character x in the alphabet: U(x) is set to 1 for the positions in P where character x appears.
Example: P = abaac
U(a) = (1,0,1,1,0)ᵀ   U(b) = (0,1,0,0,0)ᵀ   U(c) = (0,0,0,0,1)ᵀ

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, entry M(i,j) = 1 iff
(1) the first i−1 characters of P match the i−1 characters of T ending at character j−1  ⇔ M(i−1,j−1) = 1
(2) P[i] = T[j]  ⇔ the i-th bit of U(T[j]) = 1

BitShift moves bit M(i−1,j−1) into the i-th position;
AND this with the i-th bit of U(T[j]) to establish whether both conditions hold.
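A minimal Python sketch of the Shift-And scan, using an integer as the bit column (bit i−1 of the word plays the role of row i):

def shift_and(T, P):
    m = len(P)
    U = {}                                 # U[c]: bit i set iff P[i+1] == c
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M, occ = 0, []
    for j, c in enumerate(T):
        M = ((M << 1) | 1) & U.get(c, 0)   # BitShift, then AND with U(T[j])
        if M & (1 << (m - 1)):             # last row set: an occurrence ends here
            occ.append(j + 1)              # 1-based end positions
    return occ

print(shift_and("xabxabaaca", "abaac"))   # [9]
print(shift_and("california", "for"))     # [7]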

An example: P = abaac, T = xabxabaaca

M(1) = BitShift(M(0)) & U(T[1]=x) = (1,0,0,0,0)ᵀ & (0,0,0,0,0)ᵀ = (0,0,0,0,0)ᵀ
M(2) = BitShift(M(1)) & U(T[2]=a) = (1,0,0,0,0)ᵀ & (1,0,1,1,0)ᵀ = (1,0,0,0,0)ᵀ
M(3) = BitShift(M(2)) & U(T[3]=b) = (1,1,0,0,0)ᵀ & (0,1,0,0,0)ᵀ = (0,1,0,0,0)ᵀ
...
M(9) = BitShift(M(8)) & U(T[9]=c) = (1,1,0,0,1)ᵀ & (0,0,0,0,1)ᵀ = (0,0,0,0,1)ᵀ

The 5th (last) bit of M(9) is 1, so an occurrence of P ends at position 9 of T.

Shift-And method: Complexity








If m ≤ w, any column and any vector U() fit in a memory word ⇒ any step requires O(1) time.
If m > w, any column and any vector U() can be divided into m/w memory words ⇒ any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus it is very fast when the pattern length is close to the word size — very often the case in practice; recall that w = 64 bits in modern architectures.

Some simple extensions


We want to allow the pattern to contain special symbols, like the character class [a-f].

P = [a-b]baac
U(a) = (1,0,1,1,0)ᵀ   U(b) = (1,1,0,0,0)ᵀ   U(c) = (0,0,0,0,1)ᵀ

What about ‘?’ and ‘[^…]’ (negated classes)?

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff there are no more than l mismatches between the first i characters of P and the i characters of T ending at character j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

BitShift( M^l(j−1) ) & U(T[j])

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

BitShift( M^{l−1}(j−1) )

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M^l(j) = [ BitShift( M^l(j−1) ) & U(T[j]) ]  OR  BitShift( M^{l−1}(j−1) )
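A sketch of the k-mismatch extension in Python, keeping the k+1 columns M^0 … M^k as integers and applying the recurrence above at each text position:

def shift_and_k_mismatches(T, P, k):
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M = [0] * (k + 1)                      # M[l] = column of M^l at the current position
    occ = []
    for j, c in enumerate(T):
        prev = M[:]
        for l in range(k + 1):
            match = ((prev[l] << 1) | 1) & U.get(c, 0)        # equal characters
            mism = ((prev[l-1] << 1) | 1) if l > 0 else 0     # spend one mismatch here
            M[l] = match | mism
        if M[k] & (1 << (m - 1)):
            occ.append(j + 1)              # an occurrence with <= k mismatches ends at j
    return occ

print(shift_and_k_mismatches("aatatccacaa", "atcgaa", 2))   # [9]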

Example: P = abaad, T = xabxabaaca
[Figure: the matrices M0 (exact matches) and M1 (≤1 mismatch). E.g. M1(5,9) = 1 because P = abaad matches T[5,9] = abaac with one mismatch, while M0 has no 1 in its last row.]
How much do we pay?





The running time is O(k·n·(1+m/w)).
Again, the method is practically efficient for small m.
Moreover, only O(k) columns of M are needed at any given time; hence the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Delection: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding
x > 0 is coded as (Length − 1) zeros followed by x in binary, where Length = ⌊log₂ x⌋ + 1.
e.g., 9 is represented as <000,1001>.

The γ-code for x takes 2⌊log₂ x⌋ + 1 bits (i.e. a factor of 2 from optimal).

Optimal for Pr(x) ≈ 1/(2x²), and i.i.d. integers.

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8

6

3

59

7
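A tiny Python sketch of γ-encoding/decoding that reproduces the exercise above:

def gamma_encode(x):                     # x > 0
    b = bin(x)[2:]                       # binary representation of x
    return "0" * (len(b) - 1) + b        # (Length-1) zeros, then x in binary

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":
            zeros += 1; i += 1
        out.append(int(bits[i:i+zeros+1], 2))
        i += zeros + 1
    return out

code = "".join(gamma_encode(x) for x in [8, 6, 3, 59, 7])
print(code)                  # 0001000001100110000011101100111
print(gamma_decode(code))    # [8, 6, 3, 59, 7]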

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).
Recall that |γ(i)| ≤ 2·log i + 1.
How good is this approach w.r.t. Huffman? Compression ratio ≤ 2·H0(S) + 1.
Key fact:  1 ≥ Σ_{i=1,...,x} pi ≥ x·px  ⇒  x ≤ 1/px

How good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1.
The cost of the encoding is (recall i ≤ 1/pi):

Σ_{i=1,...,|S|} pi · |γ(i)|  ≤  Σ_{i=1,...,|S|} pi · [2·log(1/pi) + 1]  =  2·H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 128² = 16512 words on 2 bytes
A (230,26)-dense code encodes 230 + 230·26 = 6210 words on 2 bytes, hence more words on 1 byte — and thus it wins if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L
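A direct Python sketch of the two steps above (positions are 0-based here):

def mtf_encode(text, alphabet):
    L = list(alphabet)
    out = []
    for s in text:
        i = L.index(s)              # 1) output the position of s in L
        out.append(i)
        L.pop(i); L.insert(0, s)    # 2) move s to the front of L
    return out

print(mtf_encode("aaabbbccc", "abc"))   # [0, 0, 0, 1, 0, 0, 2, 0, 0]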

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1ⁿ2ⁿ3ⁿ…nⁿ  ⇒  Huff = O(n² log n), MTF = O(n log n) + n²

Not much worse than Huffman
...but it may be far better

MTF: how good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1.
Put the alphabet S in front and consider the cost of encoding:

O(|S| log |S|) + Σ_{x=1}^{|S|} Σ_{i=2}^{n_x} |γ( p_i^x − p_{i−1}^x )|

where p_1^x < p_2^x < ... are the positions of symbol x. By Jensen's inequality this is

≤ O(|S| log |S|) + Σ_{x=1}^{|S|} n_x · [ 2·log(N/n_x) + 1 ]  =  O(|S| log |S|) + N·[ 2·H0(X) + 1 ]

⇒  La[mtf] ≤ 2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
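In Python the transform is one line with itertools.groupby; it reproduces the example above:

from itertools import groupby

def rle(s):
    return [(c, len(list(g))) for c, g in groupby(s)]

print(rle("abbbaacccca"))   # [('a', 1), ('b', 3), ('a', 2), ('c', 4), ('a', 1)]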
Properties:




Exploit spatial locality; it is a dynamic code (there is a memory).
X = 1ⁿ2ⁿ3ⁿ…nⁿ  ⇒  Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.  p(a) = .2, p(b) = .5, p(c) = .3  ⇒  f(a) = .0, f(b) = .2, f(c) = .7,
i.e. a = [0,.2), b = [.2,.7), c = [.7,1.0)

f(i) = Σ_{j=1}^{i−1} p(j)

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
[Figure: coding bac — start with [0,1); after b the interval is [.2,.7); after a it is [.2,.3); after c it is [.27,.3).]
The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c1…cn with probabilities p[c], use:

l_0 = 0,  l_i = l_{i−1} + s_{i−1} · f[c_i]
s_0 = 1,  s_i = s_{i−1} · p[c_i]

where f[c] is the cumulative probability up to symbol c (not included).
The final interval size is  s_n = ∏_{i=1}^{n} p[c_i].

The interval for a message sequence will be called the sequence interval.
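A minimal Python sketch of these recurrences; on the message bac with the probabilities of the previous example it returns the interval [.27,.3), up to floating-point rounding:

def sequence_interval(msg, p, f):
    l, s = 0.0, 1.0
    for c in msg:
        l, s = l + s * f[c], s * p[c]
    return l, l + s            # the sequence interval [l, l+s)

p = {'a': .2, 'b': .5, 'c': .3}
f = {'a': .0, 'b': .2, 'c': .7}
print(sequence_interval("bac", p, f))   # ~(0.27, 0.3)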

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
[Figure: 0.49 falls in b's interval [.2,.7); within it, 0.49 falls in b's sub-interval [.3,.55); within that, in c's sub-interval [.475,.55).]

The message is bbc.

Representing a real number
Binary fractional representation:
.75 = .11      1/3 = .010101…      11/16 = .1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
number   min      max      interval
.11      .110     .111     [.75, 1.0)
.101     .1010    .1011    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most
1 + ⌈log (1/s)⌉ = 1 + ⌈log ∏_i (1/p_i)⌉ ≤ 2 + Σ_{i=1,...,n} log (1/p_i) = 2 + Σ_{k=1,...,|S|} n·p_k·log (1/p_k) = 2 + n·H0  bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
   Output 1 followed by m 0s; set m = 0; the interval is expanded by 2

If u < R/2 then (bottom half)
   Output 0 followed by m 1s; set m = 0; the interval is expanded by 2

If l ≥ R/4 and u < 3R/4 then (middle half)
   Increment m; the interval is expanded by 2

In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
Context
Empty

Counts
A
B
C
$

=
=
=
=

4
2
5
3

Context
A
B
C

Counts
C
$
A
$
A
B
C
$

=
=
=
=
=
=
=
=

3
1
2
1
1
2
2
3

Context
AC

BA
CA
CB
CC

String = ACCBACCACBA B

k=2

Counts
B
C
$
C
$
C
$
A
$
A
B
$

=
=
=
=
=
=
=
=
=
=
=
=

1
2
2
1
1
1
1
2
1
1
1
2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
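The same copying rule, as a small self-contained Python decoder run on the windowed example above:

def lz77_decode(triples):
    out = []
    for d, length, c in triples:
        for _ in range(length):         # works also when length > d (overlap)
            out.append(out[-d])
        out.append(c)
    return "".join(out)

print(lz77_decode([(0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), (3, 3, 'a'), (1, 2, 'c')]))
# aacaacabcabaaac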

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
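A minimal Python sketch of the LZW encoder; for readability the dictionary starts with just a, b, c mapped to 112–114 as in the encoding example that follows (a real implementation initializes all 256 byte values), and the special SSc decoding case is not shown:

def lzw_encode(text, alphabet="abc"):
    D = {c: i + 112 for i, c in enumerate(alphabet)}   # toy dict: a=112, b=113, c=114
    nxt, out, S = 256, [], ""
    for c in text:
        if S + c in D:
            S += c
        else:
            out.append(D[S])
            D[S + c] = nxt; nxt += 1     # add Sc to the dictionary, do NOT emit c
            S = c
    out.append(D[S])
    return out

print(lzw_encode("aabaacababacb"))
# [112, 112, 113, 256, 114, 257, 261, 114, 113]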

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
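A small Python sketch of both directions: the forward transform by sorting rotations (fine for toy inputs), and the backward reconstruction via the LF mapping, exactly as in InvertBWT above. It assumes T ends with a unique smallest sentinel such as '#':

def bwt(T):
    rot = sorted(T[i:] + T[:i] for i in range(len(T)))
    return "".join(r[-1] for r in rot)

def inverse_bwt(L, sentinel='#'):
    n = len(L)
    F = sorted(L)
    # LF[i] = row starting with character L[i], preserving the order of equal chars
    first, counts, LF = {}, {}, [0] * n
    for i, c in enumerate(F):
        first.setdefault(c, i)
    for i, c in enumerate(L):
        LF[i] = first[c] + counts.get(c, 0)
        counts[c] = counts.get(c, 0) + 1
    # row 0 is the rotation starting with the sentinel,
    # so L[0] is the character of T that precedes the sentinel
    out, r = [sentinel], 0
    for _ in range(n - 1):
        out.append(L[r])
        r = LF[r]
    return "".join(reversed(out))

L = bwt("mississippi#")
print(L)                 # ipssm#pissii
print(inverse_bwt(L))    # mississippi#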

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Q(n2 log n) time in the worst-case
• Q(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in - degree ( u )  k ] 

a  2.1

1

k

a

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates a structured old file f_old with f_new available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files f_known and f_new, and the goal is
to compute a file f_d of minimum size such that f_new can
be derived from f_known and f_d


Assume that block moves and copies are allowed


Find an optimal covering set of f_new based on f_known


The LZ77 scheme provides an efficient, optimal solution


f_known is the "previously encoded text": compress the concatenation f_known·f_new, emitting the output only from f_new onwards

zdelta is one of the best implementations
          Emacs size    Emacs time
uncompr   27Mb          ---
gzip      8Mb           35 secs
zdelta    1.5Mb         42 secs
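zdelta itself is not shown in these slides; as an illustration of the same LZ77-style idea ("f_known is the previously encoded text"), the following hedged sketch uses zlib's preset-dictionary support in Python. zlib's 32KB window makes it only a toy stand-in, not a replacement for zdelta on large files.

```python
import zlib

def delta_compress(f_known: bytes, f_new: bytes) -> bytes:
    # f_known plays the role of "previously encoded text" via the preset dictionary
    comp = zlib.compressobj(level=9, zdict=f_known)
    return comp.compress(f_new) + comp.flush()

def delta_decompress(f_known: bytes, f_delta: bytes) -> bytes:
    decomp = zlib.decompressobj(zdict=f_known)
    return decomp.decompress(f_delta) + decomp.flush()

old = b"the quick brown fox jumps over the lazy dog " * 100
new = old.replace(b"lazy", b"sleepy")
fd = delta_compress(old, new)
assert delta_decompress(old, fd) == new
print(len(new), len(zlib.compress(new, 9)), len(fd))   # raw size vs plain zlib vs delta
```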

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

[Diagram: the Client sends a request to a client-side Proxy; over the Slow-link the Page travels delta-encoded against a reference held by both proxies; the server-side Proxy fetches the Page from the web over the Fast-link]

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f ∈ F a good reference


Reduction to the Min Branching problem on DAGs


Build a weighted graph G_F: nodes = files, edge weights = zdelta-size


Insert a dummy node connected to all files, with edge weights = gzip-compressed size



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.
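A sketch of this reduction, assuming the networkx library is available; gzip_size and zdelta_size are placeholders stubbed with zlib (stand-ins for the real compressors) just so the example runs.

```python
import zlib
import networkx as nx

def gzip_size(data: bytes) -> int:
    return len(zlib.compress(data, 9))

def zdelta_size(ref: bytes, data: bytes) -> int:
    comp = zlib.compressobj(level=9, zdict=ref)      # stand-in for zdelta
    return len(comp.compress(data) + comp.flush())

def min_branching(files: dict):
    G = nx.DiGraph()
    G.add_node("dummy")                              # dummy node = "compress from scratch"
    for name, data in files.items():
        G.add_edge("dummy", name, weight=gzip_size(data))
        for other, odata in files.items():
            if other != name:
                G.add_edge(name, other, weight=zdelta_size(data, odata))
    # Directed spanning tree of minimum total cost covering all nodes:
    return nx.minimum_spanning_arborescence(G)

files = {"a": b"x" * 1000, "b": b"x" * 990 + b"y" * 10, "c": b"z" * 1000}
T = min_branching(files)
print(sorted(T.edges()))    # each file gets either the dummy or a similar file as reference
```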

[Figure: example G_F with the dummy node 0 and edge weights such as 20, 123, 220, 620, 2000; the min branching picks, for every file, the cheapest incoming edge structure]

          space    time
uncompr   30Mb     ---
tgz       20%      linear
THIS      8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly: n^2 edge calculations (zdelta executions)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and are thus
good candidates for zdelta-compression. Build a sparse weighted graph
G'_F containing only edges between those pairs of files


Assign weights: Estimate appropriate edge weights for G'_F, thus saving
zdelta executions. Nonetheless, still n^2 time
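The slide does not fix how similarity is estimated; one common choice (an assumption here, not stated above) is the Jaccard resemblance of q-gram shingle sets, used below to keep only promising edges in G'_F. The all-pairs loop is only for illustration; a real pruning step would avoid it (e.g. via sketching).

```python
def shingles(data: bytes, q: int = 8):
    return {data[i:i + q] for i in range(len(data) - q + 1)}

def resemblance(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a or b else 1.0

def sparse_similarity_graph(files: dict, threshold: float = 0.5):
    sh = {name: shingles(data) for name, data in files.items()}
    names = list(files)
    edges = []
    for i, u in enumerate(names):
        for v in names[i + 1:]:
            r = resemblance(sh[u], sh[v])
            if r >= threshold:
                edges.append((u, v, r))   # candidate pair for a zdelta edge in G'_F
    return edges

files = {"a": b"abcdefgh" * 50, "b": b"abcdefgh" * 49 + b"XYZWQRST", "c": b"qwertyui" * 50}
print(sparse_similarity_graph(files))     # only the (a, b) pair survives
```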
          space    time
uncompr   260Mb    ---
tgz       12%      2 mins
THIS      8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
[Diagram: the Client, holding f_old, sends a request; the Server, holding f_new, sends back an update]

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
[Diagram: the Client sends the per-block hashes of f_old; the Server, holding f_new, replies with the encoded file]

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
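A toy sketch of the block-matching core of the rsync idea (not the real protocol or its exact checksums): the client ships per-block (weak, strong) hashes of f_old, and the server slides a window over f_new, using a cheap rolling checksum plus MD5 to decide between block references and literal bytes.

```python
import hashlib

B = 8                                    # block size (tiny, just for the example)
M = 1 << 16

def weak(block: bytes):
    a = sum(block) % M
    b = sum((len(block) - i) * c for i, c in enumerate(block)) % M
    return a, b

def roll(a, b, out_byte, in_byte, blen):
    a = (a - out_byte + in_byte) % M     # slide the window one byte to the right
    b = (b - blen * out_byte + a) % M
    return a, b

def client_hashes(f_old: bytes):
    table = {}
    for idx in range(len(f_old) // B):
        block = f_old[idx * B:(idx + 1) * B]
        table.setdefault(weak(block), []).append((idx, hashlib.md5(block).digest()))
    return table

def server_encode(f_new: bytes, table):
    out, i = [], 0
    a, b = weak(f_new[:B]) if len(f_new) >= B else (0, 0)
    while i + B <= len(f_new):
        window = f_new[i:i + B]
        hit = next((idx for idx, strong in table.get((a, b), [])
                    if strong == hashlib.md5(window).digest()), None)
        if hit is not None:
            out.append(("copy", hit))                # reference to a block of f_old
            i += B
            if i + B <= len(f_new):
                a, b = weak(f_new[i:i + B])
        else:
            out.append(("lit", f_new[i:i + 1]))      # literal byte
            if i + B < len(f_new):
                a, b = roll(a, b, f_new[i], f_new[i + B], B)
            i += 1
    out.extend(("lit", f_new[j:j + 1]) for j in range(i, len(f_new)))
    return out

old = b"the quick brown fox jumps over the lazy dog!"[:40]
new = b"XX" + old[:16] + b"Y" + old[16:]
print(server_encode(new, client_hashes(old)))        # mostly 'copy' ops plus a few literals
```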

Rsync: some experiments

          gcc size    emacs size
total     27288       27326
gzip      7563        8577
zdelta    227         1431
rsync     964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends the hashes (unlike rsync, where the client does), and the client checks them.
The server deploys the common f_ref to compress the new f_tar (rsync just compresses it, without the reference).

A multi-round protocol

k blocks of n/k elements

log n/k levels

If the distance is k, then on each level at most k hashes fail to find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits
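To illustrate the multi-round idea (this is not zsync): hash aligned pieces, and recurse only where the hashes differ, so each level exchanges O(k) hashes when the files differ in k places. The sketch computes both sides locally just to show which pieces would be requested; in the real protocol only the hashes travel.

```python
import hashlib

def h(piece: bytes) -> bytes:
    return hashlib.sha1(piece).digest()[:8]

def differing_ranges(a: bytes, b: bytes, lo: int, hi: int, stop: int, out):
    """Collect the ranges of length <= stop where a and b differ."""
    if h(a[lo:hi]) == h(b[lo:hi]):
        return                                   # identical piece: stop recursing
    if hi - lo <= stop:
        out.append((lo, hi))                     # small enough: this piece would be shipped
        return
    mid = (lo + hi) // 2
    differing_ranges(a, b, lo, mid, stop, out)
    differing_ranges(a, b, mid, hi, stop, out)

old = bytearray(b"a" * 1024)
new = bytearray(old); new[100] = ord("X"); new[900] = ord("Y")
ranges = []
differing_ranges(bytes(old), bytes(new), 0, len(old), 16, ranges)
print(ranges)        # only the two small blocks containing the changes
```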

Next lecture

Set reconciliation
Problem: Given two sets S_A and S_B of integer values located
on two machines A and B, determine the difference
between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elements

log n/k levels

If the distance is k, then on each level at most k hashes fail to find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
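A sketch of the "array of string pointers" solution: keep the dictionary sorted and binary-search for the range of strings starting with P.

```python
from bisect import bisect_left

def prefix_range(sorted_strings, P):
    lo = bisect_left(sorted_strings, P)      # O(p log n) character comparisons
    hi = lo
    while hi < len(sorted_strings) and sorted_strings[hi].startswith(P):
        hi += 1                              # a second binary search would find hi in O(p log n) too
    return lo, hi

D = sorted(["mirror", "miss", "mississippi", "missouri", "mist", "moss"])
lo, hi = prefix_range(D, "miss")
print(D[lo:hi])      # ['miss', 'mississippi', 'missouri']
```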
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

[Diagram: P aligned at position i of T, matching a prefix of the suffix T[i,N]]

Occurrences of P in T = all suffixes of T having P as a prefix

Example: T = mississippi, P = si  →  occurrences at positions 4, 7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Figure: the suffix tree of T# = mississippi#, built over positions 1..12; edges are labeled with substrings such as i, s, si, ssi, p, i#, pi#, ppi#, #, mississippi#, and each leaf stores the starting position of its suffix]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. The starting position of that range is the lexicographic position of P.
SUF(T) stored explicitly would take Θ(N^2) space; SA stores only the suffix pointers:

  SA    SUF(T)
  12    #
  11    i#
   8    ippi#
   5    issippi#
   2    ississippi#
   1    mississippi#
  10    pi#
   9    ppi#
   7    sippi#
   4    sissippi#
   6    ssippi#
   3    ssissippi#

T = mississippi#

P = si
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes
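A tiny sketch reproducing the suffix array above: sort the (1-based) starting positions of the suffixes of T = mississippi#. This naive construction costs Θ(N^2 log N) in the worst case; it is only meant to reproduce the table.

```python
T = "mississippi#"
SA = sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])   # naive construction
print(SA)                          # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print([T[i - 1:] for i in SA])     # the suffixes in lexicographic order
```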

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step

[Figure: binary search over the SA of T = mississippi# for P = si; at each step P is compared with the suffix pointed to by the middle SA entry, moving right when P is larger and left when P is smaller]

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

The log2 N term can be improved, giving O(p + log2 N) time [Manber-Myers, ’90],
and even O(p + log2 |S|) [Cole et al., ’06].

Locating the occurrences
[Figure: two binary searches on the SA of T = mississippi# delimit the contiguous range of suffixes prefixed by si — here the SA entries 7 (sippi...) and 4 (sissippi...) — so occ = 2 and the occurrences are at positions 4 and 7. The range can be found by searching for the two extremes si# and si$, where # is smaller and $ larger than every symbol.]

Suffix Array search
• O (p + log2 N + occ) time
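A sketch of the search just described: two binary searches delimit the SA range of suffixes prefixed by P, then the occurrences are read off. It relies on the key= parameter of Python's bisect, available from Python 3.10.

```python
from bisect import bisect_left, bisect_right

T = "mississippi#"
SA = sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])   # as in the sketch above

def locate(P):
    """SA range of suffixes prefixed by P, then the occurrences: O(p log N) compares + occ."""
    key = lambda i: T[i - 1:i - 1 + len(P)]     # compare at most p characters per step
    lo = bisect_left(SA, P, key=key)            # key= needs Python >= 3.10
    hi = bisect_right(SA, P, key=key)
    return sorted(SA[lo:hi])

print(locate("si"))      # [4, 7]  ->  occ = 2, as in the picture above
```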

Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

  Lcp   SA    suffix
        12    #
   0    11    i#
   1     8    ippi#
   1     5    issippi#
   4     2    ississippi#
   0     1    mississippi#
   0    10    pi#
   1     9    ppi#
   0     7    sippi#
   2     4    sissippi#
   1     6    ssippi#
   3     3    ssissippi#

T = mississippi#
Example: issippi# (= T[5,...]) and ississippi# (= T[2,...]) are adjacent in SA and share a prefix of length 4.
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
• Search for some Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
• Search for a subarray Lcp[i,i+C-2] whose entries are all ≥ L
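A sketch of the Lcp array and of the first query above; lcp_of_suffixes uses SA.index (a linear scan) only to keep the example short.

```python
T = "mississippi#"
SA = sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])

def lcp_len(a, b):
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

# Lcp[r] = lcp between the suffixes of rank r and r+1 in SA
Lcp = [lcp_len(T[SA[r] - 1:], T[SA[r + 1] - 1:]) for r in range(len(SA) - 1)]
print(Lcp)                       # [0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3]

def lcp_of_suffixes(i, j):
    """Longest common prefix of T[i,...] and T[j,...], for i != j."""
    h, k = sorted((SA.index(i), SA.index(j)))
    return min(Lcp[h:k])

print(lcp_of_suffixes(5, 2))     # issippi# vs ississippi#  ->  4
```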


E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in - degree ( u )  k ] 

a  2.1

1

k

a

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides and efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
          Emacs size   Emacs time
uncompr   27Mb         ---
gzip      8Mb          35 secs
zdelta    1.5Mb        42 secs
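zdelta itself is not shown in these notes; the sketch below conveys the same idea with Python's zlib and a preset dictionary, where fknown plays the role of the previously encoded text. It only approximates zdelta, it is not its interface.

import zlib

def delta_compress(f_known: bytes, f_new: bytes) -> bytes:
    c = zlib.compressobj(9, zdict=f_known)   # f_known acts as the reference for copies
    return c.compress(f_new) + c.flush()

def delta_decompress(f_known: bytes, f_d: bytes) -> bytes:
    d = zlib.decompressobj(zdict=f_known)
    return d.decompress(f_d) + d.flush()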

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f ∈ F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: example weighted graph G_F with a dummy node; edge weights are the zdelta/gzip-coded sizes, and the min branching picks the cheapest reference for each file]
          space   time
uncompr   30Mb    ---
tgz       20%     linear
THIS      8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
          space   time
uncompr   260Mb   ---
tgz       12%     2 mins
THIS      8%      16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
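A minimal sketch of the two rsync phases, with zlib.adler32 standing in for the 4-byte rolling hash and MD5 as the strong hash; the real protocol rolls the weak hash one byte at a time instead of recomputing it, and weak-hash collisions are glossed over here.

import hashlib, zlib

def block_signatures(f_old: bytes, b: int):
    # client side: weak + strong hash for every block of f_old
    return {zlib.adler32(f_old[i:i+b]): (i, hashlib.md5(f_old[i:i+b]).digest())
            for i in range(0, len(f_old), b)}

def rsync_encode(f_new: bytes, sigs, b: int):
    out, i = [], 0                      # emits ('copy', old offset) or ('lit', one byte)
    while i < len(f_new):
        w = f_new[i:i+b]
        h = zlib.adler32(w)             # in rsync this hash is updated in O(1) per shift
        if h in sigs and hashlib.md5(w).digest() == sigs[h][1]:
            out.append(('copy', sigs[h][0])); i += b
        else:
            out.append(('lit', f_new[i:i+1])); i += 1
    return out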

Rsync: some experiments

         gcc size   emacs size
total    27288      27326
gzip     7563       8577
zdelta   227        1431
rsync    964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Figure: suffix tree of T# = mississippi#; edges are labeled with substrings (#, i, s, p, si, ssi, ppi#, pi#, i#, mississippi#, ...) and each leaf stores the starting position of its suffix in T#]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
5

Θ(N²) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Θ(N log₂ N) bits
• Text T: N chars
⇒ In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
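A small Python sketch of the indirected binary search just described; the suffix array is built naively by sorting the suffixes (Θ(N² log N) in the worst case), while the search itself is the O(p log₂ N)-time procedure.

def suffix_array(T):
    return sorted(range(len(T)), key=lambda i: T[i:])     # naive construction

def locate(T, SA, P):
    lo, hi = 0, len(SA)
    while lo < hi:                        # leftmost suffix having P as a prefix
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + len(P)] < P: lo = mid + 1
        else: hi = mid
    occ = []
    while lo < len(SA) and T[SA[lo]:SA[lo] + len(P)] == P:
        occ.append(SA[lo]); lo += 1       # collect the contiguous range of matches
    return occ

# locate("mississippi#", suffix_array("mississippi#"), "si") -> [6, 3]   (0-based: the slide's 7 and 4)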

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
• It is the min of the subarray Lcp[h,k-1], where SA[h]=i and SA[k]=j (assume h<k).
• Is there a repeated substring of length ≥ L ?
• Search for some Lcp[i] ≥ L
• Is there a substring of length ≥ L occurring ≥ C times ?
• Search for a window Lcp[i,i+C-2] whose entries are all ≥ L
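A sketch of how the Lcp array can be computed in linear time with Kasai's algorithm (not covered in the slides, added here as an assumption), together with the "repeated substring of length ≥ L" test.

def lcp_array(T, SA):
    n = len(T); rank = [0] * n
    for r, s in enumerate(SA): rank[s] = r
    lcp = [0] * (n - 1); h = 0
    for i in range(n):                    # scan suffixes in text order
        if rank[i] > 0:
            j = SA[rank[i] - 1]           # suffix preceding suffix i in SA
            while i + h < n and j + h < n and T[i + h] == T[j + h]: h += 1
            lcp[rank[i] - 1] = h          # lcp[k] = LCP(SA[k], SA[k+1])
            if h: h -= 1
        else:
            h = 0
    return lcp

def has_repeat_of_length(lcp, L):
    return any(v >= L for v in lcp)

# has_repeat_of_length(lcp_array("mississippi#", SA), 4) -> True  (the substring "issi")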


Slide 160

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C · p · f/(1+f)
This is at least 10⁴ · f/(1+f)
If we fetch B ≈ 4Kb in time C, and the algorithm uses all of it:
(1/B) · (p · f/(1+f) · C)  ≈  30 · f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
      4K    8K   16K   32K    128K   256K   512K   1M
n^3   22s   3m   26m   3.5h   28h    --     --     --
n^2   0     0    0     1s     26s    106s   7m     28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm

sum = 0; max = -1;
For i = 1,...,n do
   if (sum + A[i] ≤ 0) then sum = 0;
   else { sum += A[i]; max = MAX{max, sum}; }

Note:
• Sum < 0 right before OPT starts, so it is reset to 0 there;
• Sum > 0 within OPT
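The same one-pass scan in Python, a sketch that returns only the value of the best window (under the slide's assumption that every subsum is non-zero and some positive element exists):

def max_subarray_sum(A):
    best, s = float('-inf'), 0
    for x in A:
        if s + x <= 0:
            s = 0                    # an optimal window cannot start before this point
        else:
            s += x
            best = max(best, s)
    return best

# max_subarray_sum([2,-5,6,1,-2,4,3,-13,9,-6,7]) -> 12   (the window 6 1 -2 4 3)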

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 10⁹ random I/Os = 10⁹ × 5ms ≈ 2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02    m = (i+j)/2;            (Divide)
03    Merge-Sort(A,i,m);      (Conquer)
04    Merge-Sort(A,m+1,j);
05    Merge(A,i,m,j)          (Combine)

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
[Figure: recursion tree of binary Merge-Sort on a sample input — log₂ N levels of runs that are repeatedly merged pairwise]

How do we deploy the disk/mem features ?

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.

Pass i: merge X ≤ M/B runs  ⇒  log_{M/B}(N/M) passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = log_{M/B} #runs ≤ log_{M/B}(N/M)

Optimal cost = Θ( (N/B) · log_{M/B}(N/M) ) I/Os

In practice:
  M/B ≈ 1000  ⇒  #passes = log_{M/B}(N/M) ≈ 1
  One multiway merge  ⇒  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!
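A compact sketch of the two passes, with Python lists standing in for disk pages; heapq.merge plays the role of the X-way merger that keeps one input buffer per run.

import heapq

def external_sort(items, M):
    # Pass 1: produce ceil(N/M) runs, each sorted in "internal memory"
    runs = [sorted(items[i:i + M]) for i in range(0, len(items), M)]
    # Pass 2: one multiway merge of all the runs (here X = #runs, assumed <= M/B)
    return list(heapq.merge(*runs))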

Can compression help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm

Use a pair of variables X (candidate) and C (counter); initialize X with the first item and C = 1.
For each subsequent item s of the stream:

   if (X == s) then C++
   else { C--; if (C == 0) { X = s; C = 1; } }

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6·10⁹ characters, size = 6Gb
n = 10⁶ documents
TotT = 10⁹ total term occurrences (avg term length is 6 chars)
t = 5·10⁵ distinct terms

What kind of data structure should we build to support word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million documents, t = 500K terms

            Antony&Cleopatra  JuliusCaesar  Tempest  Hamlet  Othello  Macbeth
Antony      1                 1             0        0       0        1
Brutus      1                 1             0        1       0        0
Caesar      1                 1             0        1       1        1
Calpurnia   0                 1             0        0       0        0
Cleopatra   1                 0             0        0       0        0
mercy       1                 0             1        1       1        1
worser      1                 0             1        1       1        0

1 if the play contains the word, 0 otherwise.   Space is 500Gb !

Solution 2: Inverted index
[Figure: inverted index — each dictionary term (Brutus, Calpurnia, Caesar) points to its sorted posting list of doc-ids.  We can still do better: i.e. 30-50% of the original text]

1. Typically about 12 bytes are used per posting
2. We have 10⁹ total terms ⇒ at least 12Gb of space
3. Compressing the 6Gb of documents gets 1.5Gb of data
A better index, but it is still >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

∑_{i=1}^{n−1} 2^i  =  2^n − 2   <   2^n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i(s) = log₂ (1/p(s)) = −log₂ p(s)

Lower probability ⇒ higher information

Entropy is the weighted average of i(s):

H(S) = ∑_{s∈S} p(s) · log₂ (1/p(s))   bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
L_a(C) = ∑_{s∈S} p(s) · L[s]

We say that a prefix code C is optimal if for
all prefix codes C’, L_a(C) ≤ L_a(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H(S) ≤ L_a(C)

Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

L_a(C) ≤ H(S) + 1

The Shannon code takes ⌈log₂ (1/p)⌉ bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?
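A sketch of the construction on the running example, using a binary heap; how ties are broken (and thus the resulting depths) is left to the heap order, which is exactly the slide's question.

import heapq
from itertools import count

def huffman_lengths(probs):                     # probs: {symbol: probability}
    tick = count()                              # tie-breaker so the heap never compares lists
    heap = [(p, next(tick), [s]) for s, p in probs.items()]
    heapq.heapify(heap)
    depth = {s: 0 for s in probs}
    while len(heap) > 1:
        p1, _, a = heapq.heappop(heap)
        p2, _, b = heapq.heappop(heap)
        for s in a + b: depth[s] += 1           # symbols under the merged node get one more bit
        heapq.heappush(heap, (p1 + p2, next(tick), a + b))
    return depth

# huffman_lengths({'a': .1, 'b': .2, 'c': .2, 'd': .5}) -> {'a': 3, 'b': 3, 'c': 2, 'd': 1}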

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H(s) = ∑_{i=1..m} 2^{m−i} · s[i]

P = 0101
H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5

s = s’ if and only if H(s) = H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H(T_r) = 2·H(T_{r−1}) − 2^m·T(r−1) + T(r+m−1)

T = 10110101
T₁ = 1011,  T₂ = 0110

H(T₁) = H(1011) = 11
H(T₂) = H(0110) = 2·11 − 2⁴·1 + 0 = 22 − 16 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)
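A sketch of the fingerprint scan for a binary text, with the modulus q fixed rather than drawn at random (choosing a random prime ≤ I, and the verification step of the deterministic variant, are omitted here).

def karp_rabin(T: str, P: str, q: int = 2**31 - 1):
    m = len(P)
    hp = ht = 0
    for i in range(m):                       # H_q(P) and H_q(T_1), computed Horner-style
        hp = (2 * hp + int(P[i])) % q
        ht = (2 * ht + int(T[i])) % q
    top = pow(2, m - 1, q)                   # 2^(m-1) mod q, used to drop the leftmost bit
    occ = []
    for r in range(len(T) - m + 1):
        if ht == hp:
            occ.append(r)                    # probable match (verify it for a definite match)
        if r + m < len(T):                   # roll the fingerprint by one position
            ht = ((ht - int(T[r]) * top) * 2 + int(T[r + m])) % q
    return occ

# karp_rabin("10110101", "0101") -> [4]   (0-based: the slide's T_5)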

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with the i-th bit of U(T[j]) to establish whether both conditions hold
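The column update in code — a sketch that uses a Python integer as the bit vector, with bit i−1 of the word standing for row i of M; it is the O(1)-per-step method for m ≤ w, and still correct (just slower, since Python integers grow) for longer patterns.

def shift_and(T: str, P: str):
    m = len(P)
    U = {}                                   # U[c] has bit i set iff P[i] == c
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M, occ = 0, []
    for j, c in enumerate(T):
        M = ((M << 1) | 1) & U.get(c, 0)     # M(j) = BitShift(M(j-1)) & U(T[j])
        if M & (1 << (m - 1)):               # row m set => an occurrence ends at position j
            occ.append(j - m + 1)
    return occ

# shift_and("xabxabaaca", "abaac") -> [4]   (0-based starting position)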

An example j=1
1 2 3 4 5 6 7 8 9 10

n
m

1

12345

1

0

P=abaac

2

0

3

0

4

0

5

0

T=xabxabaaca

0
 
0
U (x)  0
 
0
 
0

2

3

4

5

6

7

1 
 
0
BitShift ( M ( 0 )) & U (T [1])   0  &
 
0
 
0

8

9

0 0
   
0 0
0  0
   
0 0
   
0 0

1
0

An example j=2
1 2 3 4 5 6 7 8 9 10

n
m

1

2

12345

1

0

1

P=abaac

2

0

0

3

0

0

4

0

0

5

0

0

T=xabxabaaca

1 
 
0
U (a )   1 
 
1 
 
0

3

4

5

6

7

1 
 
0
BitShift ( M (1)) & U (T [ 2 ])   0  &
 
0
 
0

8

9

1  1 
   
0 0
1    0 
   
1   0 
   
0 0

1
0

An example j=3
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

12345

1

0

1

0

P=abaac

2

0

0

1

3

0

0

0

4

0

0

0

5

0

0

0

T=xabxabaaca

0
 
1 
U (b )   0 
 
0
 
0

4

5

6

7

1 
 
1 
BitShift ( M ( 2 )) & U (T [ 3 ])   0  &
 
0
 
0

8

9

0 0
   
1  1 
0  0
   
0 0
   
0 0

1
0

An example j=9
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

4

5

6

7

8

9

12345

1

0

1

0

0

1

0

1

1

0

P=abaac

2

0

0

1

0

0

1

0

0

0

3

0

0

0

0

0

0

1

0

0

4

0

0

0

0

0

0

0

1

0

5

0

0

0

0

0

0

0

0

1

T=xabxabaaca

0
 
0
U (c )   0 
 
0
 
1 

1 
 
1 
BitShift ( M ( 8 )) & U (T [ 9 ])   0  &
 
0
 
1 

0 0
   
0 0
0  0
   
0 0
   
1  1 

1
0

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M ( j - 1))  U (T [ j ])
l

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M

l -1

( j - 1))

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1
1 2 3 4 5 6 7 8 910

M1=

T=xabxabaaca
P=
abaad

1

2

3

4

5

6

7

8

9

1
0

1

1

1

1

1

1

1

1

1

1

1

2

0

0

1

0

0

1

0

1

1

0

3

0

0

0

1

0

0

1

0

0

1

4

0

0

0

0

1

0

0

1

0

0

5

0

0

0

0

0

0

0

0

1

0

M0=

1

2

3

4

5

6

7

8

9

1
0

1

0

1

0

0

1

0

1

1

0

1

2

0

0

1

0

0

1

0

0

0

0

3

0

0

0

0

0

0

1

0

0

0

4

0

0

0

0

0

0

0

1

0

0

5

0

0

0

0

0

0

0

0

0

0

How much do we pay?





The running time is O(kn(1+m/w)
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
γ-code of x = ⟨0^(Length−1), binary(x)⟩, where x > 0 and Length = ⌊log₂ x⌋ + 1
e.g., 9 is represented as <000,1001>.

The γ-code for x takes 2⌊log₂ x⌋ + 1 bits (i.e. a factor of 2 from optimal)

Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8

6

3

59

7
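A sketch of γ-encoding and decoding, handling the bit strings as Python strings for clarity:

def gamma_encode(x: int) -> str:              # x > 0
    b = bin(x)[2:]                            # binary representation, MSB first
    return '0' * (len(b) - 1) + b             # Length-1 zeros, then the binary digits

def gamma_decode(bits: str):
    out, i = [], 0
    while i < len(bits):
        l = 0
        while bits[i + l] == '0': l += 1      # count the leading zeros
        out.append(int(bits[i + l: i + 2 * l + 1], 2))
        i += 2 * l + 1
    return out

# gamma_decode("0001000001100110000011101100111") -> [8, 6, 3, 59, 7]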

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How much good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
1 ≥Si=1,...,x pi ≥ x * px  x ≤ 1/px

How good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):



p i  | g ( i ) |

i  1 ,..., S

This is:



i  1 ,.., S

p i  [ 2 * log

1

 1]

pi

 2 * H 0 ( X ) 1
No much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n² log n), MTF = O(n log n) + n²

No much worse than Huffman
...but it may be far better
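A sketch of the MTF transform over an explicit symbol list (the O(log |S|)-time tree/hash organization discussed later is not used here):

def mtf_encode(text, alphabet):
    L = list(alphabet)                 # e.g. ['a','b','c','d',...]
    out = []
    for s in text:
        i = L.index(s)                 # position of s in the current list
        out.append(i)
        L.insert(0, L.pop(i))          # move s to the front
    return out

# mtf_encode("aaabbb", "ab") -> [0, 0, 0, 1, 0, 0]   (temporal locality -> small integers)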

MTF: how good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
Put S in the front and consider the cost of encoding:
S

O ( S log S ) 



nx

g(p - p )

x 1 i  2

x

x

i

i -1

S

By Jensen’s:

 O ( S log S ) 

n
x 1

x

[ 2 * log

N

 1]

nx

 O ( S log S )  N * [ 2 * H 0 ( X )  1]
L a [ mtf ]  2 * H 0 ( X )  O (1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1n 2n 3n… nn 

There is a memory

Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.
1.0

c = .3

0.7

f(i) = ∑_{j=1..i−1} p(j)

b = .5
0.2
0.0

f(a) = .0, f(b) = .2, f(c) = .7
a = .2

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
1.0

0.7
c = .3

0.7

c = .3
0.55

0.0

a = .2

0.3
0.2

c = .3

0.27
b = .5

b = .5
0.2

0.3

a = .2

b = .5
0.22

a = .2

0.2

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l₀ = 0,   lᵢ = lᵢ₋₁ + sᵢ₋₁ · f[cᵢ]
s₀ = 1,   sᵢ = sᵢ₋₁ · p[cᵢ]

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is   sₙ = ∏_{i=1..n} p[cᵢ]

The interval for a message sequence will be called the
sequence interval
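A sketch of the sequence-interval computation with Python floats (the integer/scaling version discussed later is what one would use in practice):

def sequence_interval(msg, p, f):
    l, s = 0.0, 1.0
    for c in msg:
        l = l + s * f[c]               # l_i = l_{i-1} + s_{i-1} * f[c_i]
        s = s * p[c]                   # s_i = s_{i-1} * p[c_i]
    return l, s                        # the sequence interval is [l, l+s)

# p = {'a': .2, 'b': .5, 'c': .3};  f = {'a': 0.0, 'b': .2, 'c': .7}
# sequence_interval("bac", p, f) -> approximately (0.27, 0.03), i.e. the interval [.27, .3)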

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
1.0

0.7
c = .3

0.7
0.49

b = .5

c = .3

0.55
0.49

0.2

0.0

0.55

b = .5

0.3
a = .2

The message is bbc.

0.2

0.49
0.475

c = .3

b = .5
0.35

a = .2

0.3

a = .2

Representing a real number
Binary fractional representation:
. 75

 . 11

1/ 3

 . 01 01

11 / 16

 . 1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
m in

m ax

interval

.11

.110

.111

[.75 ,1.0 )

.101

.1010

.1011

[.625 , .75 )

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
Context
Empty

Counts
A
B
C
$

=
=
=
=

4
2
5
3

Context
A
B
C

Counts
C
$
A
$
A
B
C
$

=
=
=
=
=
=
=
=

3
1
2
1
1
2
2
3

Context
AC

BA
CA
CB
CC

String = ACCBACCACBA B

k=2

Counts
B
C
$
C
$
C
$
A
$
A
B
$

=
=
=
=
=
=
=
=
=
=
=
=

1
2
2
1
1
1
1
2
1
1
1
2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
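The copy loop in code — a sketch that handles the overlapping case exactly as described above, by copying byte by byte from the already-produced output:

def lz77_decode(triples):
    out = []
    for d, length, c in triples:
        start = len(out) - d
        for i in range(length):            # works even when length > d (overlap)
            out.append(out[start + i])
        out.append(c)
    return ''.join(out)

# lz77_decode([(0,0,'a'),(1,1,'c'),(3,4,'b'),(3,3,'a'),(1,2,'c')]) -> "aacaacabcabaaac"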

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
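A sketch of both directions in Python, assuming T ends with a unique smallest sentinel '#': bwt() sorts all rotations naively (quadratic, as noted right below), and ibwt() is the LF-mapping loop of InvertBWT, with LF obtained from a stable sort of L.

def bwt(T: str) -> str:
    rot = sorted(T[i:] + T[:i] for i in range(len(T)))
    return ''.join(r[-1] for r in rot)

def ibwt(L: str) -> str:
    n = len(L)
    order = sorted(range(n), key=lambda i: L[i])   # stable: row k of F comes from L[order[k]]
    LF = [0] * n
    for k, i in enumerate(order):
        LF[i] = k                                  # LF maps each L-char to its row in F
    out, r = [], 0                                 # row 0 is the rotation starting with '#'
    for _ in range(n):
        out.append(L[r]); r = LF[r]
    t = ''.join(reversed(out))                     # '#' followed by T without its final '#'
    return t[1:] + t[0]

# ibwt(bwt("mississippi#")) -> "mississippi#"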

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Q(n2 log n) time in the worst-case
• Q(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in-degree(u) = k ]  ∝  1/k^a,   a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides and efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
Emacs size

Emacs time

uncompr

27Mb

---

gzip

8Mb

35 secs

zdelta

1.5Mb

42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

0

620

2

20

123

2000

20
220

3

1
5

            space    time
uncompr     30Mb     ---
tgz         20%      linear
THIS        8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F, thus saving
zdelta executions. Nonetheless, this still takes n² time
            space    time
uncompr     260Mb    ---
tgz         12%      2 mins
THIS        8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
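A toy rendering of the two roles (the weak hash here is Adler-like and recomputed from scratch, whereas rsync updates it in O(1) while sliding; the strong hash here is full MD5 instead of rsync's truncated one):

    import hashlib

    B = 700   # block size

    def weak(blk: bytes) -> int:
        # Adler-style weak hash of a block (rsync can roll it by one byte in O(1))
        a = sum(blk) % 65521
        b = sum((len(blk) - i) * c for i, c in enumerate(blk)) % 65521
        return (b << 16) | a

    def signatures(f_old: bytes):
        # client side: one (weak, strong) pair per non-overlapping block of f_old
        sigs = {}
        for i in range(0, len(f_old), B):
            blk = f_old[i:i + B]
            sigs.setdefault(weak(blk), {})[hashlib.md5(blk).digest()] = i
        return sigs

    def encode(f_new: bytes, sigs):
        # server side: slide over f_new, emit block references on a match, literals otherwise
        out, i = [], 0
        while i < len(f_new):
            blk = f_new[i:i + B]
            cand = sigs.get(weak(blk), {})
            if len(blk) == B and hashlib.md5(blk).digest() in cand:
                out.append(("copy", cand[hashlib.md5(blk).digest()]))
                i += B
            else:
                out.append(("lit", f_new[i:i + 1]))
                i += 1
        return out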

Rsync: some experiments

         gcc size   emacs size
total    27288      27326
gzip     7563       8577
zdelta   227        1431
rsync    964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends the hashes (unlike the client in rsync), and the client checks them.
The server deploys the common fref to compress the new ftar (rsync just compresses it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
#

0

s

i

1
ssi

ppi#

4

#

1

p

i

3

2

1

ppi#

6

pi#

7

8
5

T# = mississippi#
2 4 6 8 10

si

ppi#

i#

ppi#

11

mississippi#

12

2

1 10

9

4

3

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
5

Θ(N²) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Θ(N log₂ N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log₂ N) time

Improvable to O(p + log₂ N) [Manber-Myers, ’90]
and to O(p + log₂ |S|) [Cole et al, ’06]
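A small sketch of the indirect binary search (naive O(N² log N) suffix-array construction just for illustration; the key= parameter of bisect needs Python ≥ 3.10, and each comparison looks at ≤ p characters):

    from bisect import bisect_left, bisect_right

    def suffix_array(T):
        return sorted(range(len(T)), key=lambda i: T[i:])      # naive construction

    def occurrences(T, SA, P):
        lo = bisect_left(SA, P, key=lambda i: T[i:i + len(P)])
        hi = bisect_right(SA, P, key=lambda i: T[i:i + len(P)])
        return sorted(SA[lo:hi])

    T = "mississippi#"
    SA = suffix_array(T)
    print(occurrences(T, SA, "si"))    # [3, 6]  (0-based: the slides' positions 4 and 7)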

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
• Search for Lcp[i,i+C-2] whose entries are ≥ L
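The two tests above become one-liners once the Lcp array is available; a naive sketch (quadratic Lcp construction, only for illustration, and C ≥ 2 assumed):

    def lcp_array(T, SA):
        def lcp(a, b):
            k = 0
            while a + k < len(T) and b + k < len(T) and T[a + k] == T[b + k]:
                k += 1
            return k
        return [lcp(SA[i], SA[i + 1]) for i in range(len(SA) - 1)]

    def has_repeat(Lcp, L):
        # a substring of length >= L repeats iff two SA-adjacent suffixes share >= L chars
        return any(v >= L for v in Lcp)

    def occurs_C_times(Lcp, L, C):
        # >= C occurrences iff C-1 consecutive Lcp entries are all >= L
        run = 0
        for v in Lcp:
            run = run + 1 if v >= L else 0
            if run >= C - 1:
                return True
        return False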


Slide 161

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 10^4 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and the algorithm uses all of it:
(1/B) * (p * f/(1+f) * C) ≈ 30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
      4K    8K    16K   32K    128K   256K   512K   1M
n³    22s   3m    26m   3.5h   28h    --     --     --
n²    0     0     0     1s     26s    106s   7m     28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT
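The same scan in runnable form (a minimal sketch; as in the slide, every subsum is assumed ≠ 0, and the empty window counts as 0):

    def max_subarray_sum(A):
        best = cur = 0
        for x in A:
            cur = max(cur + x, 0)     # reset the running sum when it would drop below 0
            best = max(best, cur)
        return best

    print(max_subarray_sum([2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]))   # 12 = 6+1-2+4+3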

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort  Θ(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 10^9 random I/Os = 10^9 * 5ms ≈ 2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
  if (i < j) then
    m = (i+j)/2            // Divide
    Merge-Sort(A,i,m)      // Conquer
    Merge-Sort(A,m+1,j)
    Merge(A,i,m,j)         // Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n = 10^9 tuples  a few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Θ(n log₂ n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
[Figure: the Merge-Sort recursion tree over the sorted runs — how do we deploy the disk/memory features?]

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X ≤ M/B runs  log_{M/B}(N/M) passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
[Figure: multiway merging — X = M/B sorted runs, each with its current page Bf_i and pointer p_i kept in memory; repeatedly output min(Bf1[p1], Bf2[p2], …, BfX[pX]) into the output buffer Bfo, fetching a new page when p_i = B and flushing Bfo to the merged run on the output file when full]

Cost of Multi-way Merge-Sort


Number of passes = log_{M/B} #runs ≈ log_{M/B} (N/M)

Optimal cost = Θ((N/B) log_{M/B} (N/M)) I/Os

In practice
  M/B ≈ 1000  #passes = log_{M/B}(N/M) ≈ 1
  One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!
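The merging pattern of the figure is exactly what heapq.merge gives for free, keeping only the head of each run in memory (a sketch with in-memory lists standing in for the X = M/B runs on disk):

    import heapq

    runs = [[1, 2, 5, 10], [2, 7, 10, 13, 19], [3, 4, 8, 15], [6, 11, 12, 17]]
    merged = heapq.merge(*runs)        # repeatedly outputs the minimum of the current heads
    print(list(merged))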

Can compression help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables: X (current candidate) and C (counter), with C initially 0
For each item s of the stream:
  if (C == 0) { X = s; C = 1; }
  else if (X == s) C++;
  else C--;

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...
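The same one-pass majority vote, runnable (a sketch; remember that the returned candidate is the mode only if the mode really occurs > N/2 times):

    def majority_candidate(stream):
        X, C = None, 0
        for s in stream:
            if C == 0:
                X, C = s, 1
            elif X == s:
                C += 1
            else:
                C -= 1
        return X

    print(majority_candidate("bacccdcbaaaccbccc"))   # 'c'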

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 10^9 chars, size = 6Gb
n = 10^6 documents
TotT = 10^9 word occurrences (avg term length is 6 chars)
t = 5 * 10^5 distinct terms

What kind of data structure should we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million documents (columns), t = 500K terms (rows)
Entry is 1 if the play contains the word, 0 otherwise

            Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony             1                1             0          0       0        1
Brutus             1                1             0          1       0        0
Caesar             1                1             0          1       1        1
Calpurnia          0                1             0          0       0        0
Cleopatra          1                0             0          0       0        0
mercy              1                0             1          1       1        1
worser             1                0             1          1       1        0

Space is 500Gb !

Solution 2: Inverted index
[Figure: each dictionary term (Brutus, Calpurnia, Caesar) points to the sorted list of doc-ids containing it, e.g. 2 4 8 16 32 64 128 — 1 2 3 5 8 13 21 34 — 13 16]

We can still do better, i.e. 30-50% of the original text:
1. Typically we use about 12 bytes per posting
2. We have 10^9 total terms  at least 12Gb of space
3. Compressing the 6Gb of documents gets 1.5Gb of data
Better index, but it is still >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in fewer bits?
NO: they are 2^n, but we have fewer compressed messages:

  Σ_{i=1}^{n-1} 2^i = 2^n - 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self-information of s is:

  i(s) = log₂ (1/p(s)) = -log₂ p(s)

Lower probability  higher information

Entropy is the weighted average of i(s):

  H(S) = Σ_{s∈S} p(s) · log₂ (1/p(s))   bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
  L_a(C) = Σ_{s∈S} p(s) · L[s]

We say that a prefix code C is optimal if for
all prefix codes C’, L_a(C) ≤ L_a(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then p_i < p_j  ⇒  L[s_i] ≥ L[s_j]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

  H(S) ≤ L_a(C)

Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

  L_a(C) ≤ H(S) + 1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
[Figure: build the Huffman tree bottom-up — merge a(.1)+b(.2) → (.3), then (.3)+c(.2) → (.5), then (.5)+d(.5) → (1)]

a=000, b=001, c=01, d=1
There are 2^(n-1) “equivalent” Huffman trees

What about ties (and thus, tree depth) ?
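A compact construction with a priority queue (a sketch: the codeword lengths come out as in the example — 3, 3, 2, 1 — while the actual bits depend on how ties are broken):

    import heapq
    from itertools import count

    def huffman_codes(probs):
        tie = count()                              # avoids comparing dicts on equal probabilities
        heap = [(p, next(tie), {s: ""}) for s, p in probs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            p0, _, c0 = heapq.heappop(heap)        # the two least probable subtrees...
            p1, _, c1 = heapq.heappop(heap)
            merged = {s: "0" + w for s, w in c0.items()}
            merged.update({s: "1" + w for s, w in c1.items()})
            heapq.heappush(heap, (p0 + p1, next(tie), merged))   # ...are merged into one
        return heap[0][2]

    print(huffman_codes({"a": .1, "b": .2, "c": .2, "d": .5}))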

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
abc...  →  00000101        101001...  →  dcb
[Figure: the Huffman tree of the previous slide, used both to emit root-to-leaf paths and to follow bits down to a leaf]

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h² + |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
  -log₂(.999) ≈ .00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
  1 extra bit per macro-symbol = 1/k extra bits per symbol
  Larger model to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:
  The model takes |S|^k (k * log |S|) + h² bits   (where h might be |S|)
  It is H_0(S^L) ≤ L * H_k(S) + O(k * log |S|), for each k ≤ L

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
[Figure: word-based tagged Huffman over T = “bzip or not bzip” — the Huffman tree has the words of T as symbols and fan-out 128; each codeword is a sequence of 7-bit digits packed into bytes, and the first bit of the first byte of a codeword is tagged]

CGrep and other ideas...
P= bzip = 1a 0b

[Figure: CGrep — searching P = bzip, i.e. its codeword “1a 0b”, directly in the compressed text C(T) of T = “bzip or not bzip”: scan C(T) byte-aligned, use the tag bits to spot codeword beginnings, and compare against P’s codeword (yes/no at each candidate)]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

  H(s) = Σ_{i=1}^{m} 2^{m-i} · s[i]

P = 0101
H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

  H(T_r) = 2·H(T_{r-1}) - 2^m·T[r-1] + T[r+m-1]

T = 10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H(T1) = H(1011) = 11
H(T2) = H(0110) = 2·11 - 2⁴·1 + 0 = 22 - 16 = 6 = H(0110)

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
  1·2 + 0 (mod 7) = 2
  2·2 + 1 (mod 7) = 5
  5·2 + 1 (mod 7) = 4
  4·2 + 1 (mod 7) = 2
  2·2 + 1 (mod 7) = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1):
  2^m (mod q) = 2·(2^{m-1} (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
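A runnable sketch of the whole scheme (binary text as in the slides; here q is a fixed large prime rather than a randomly chosen one, and every fingerprint hit is verified, so the answer is always correct):

    def karp_rabin(T: str, P: str, q: int = 10**9 + 7):
        m, n = len(P), len(T)
        hp = ht = 0
        for i in range(m):                       # Hq(P) and Hq(T1), computed incrementally
            hp = (2 * hp + int(P[i])) % q
            ht = (2 * ht + int(T[i])) % q
        top = pow(2, m - 1, q)                   # 2^(m-1) mod q
        occ = []
        for r in range(n - m + 1):
            if hp == ht and T[r:r + m] == P:     # verify, to rule out false matches
                occ.append(r)
            if r + m < n:                        # Hq(T_{r+1}) from Hq(T_r)
                ht = (2 * (ht - int(T[r]) * top) + int(T[r + m])) % q
        return occ

    print(karp_rabin("10110101", "0101"))        # [4]  (the slide's position 5, counting from 1)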

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

[Figure: the 3×10 matrix M for T = california and P = for — the only 1-entries are M(1,5), M(2,6), M(3,7), i.e. P occurs ending at position 7 of T]
How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit bit-parallelism to
compute the j-th column of M from the (j-1)-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

  M(j) = BitShift(M(j-1)) & U(T[j])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1 and the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) into the i-th position;
AND this with the i-th bit of U(T[j]) to establish whether both conditions hold
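Since a column of M fits in a machine word, the whole scan is a couple of operations per text character; a Python sketch using an integer as the bit-column (bit i-1 corresponds to row i):

    def shift_and(T: str, P: str):
        m = len(P)
        U = {}
        for i, c in enumerate(P):
            U[c] = U.get(c, 0) | (1 << i)        # U[c] has bit i set iff P[i+1] = c
        M, occ = 0, []
        for j, c in enumerate(T):
            M = ((M << 1) | 1) & U.get(c, 0)     # M(j) = BitShift(M(j-1)) & U(T[j])
            if M & (1 << (m - 1)):               # last row set -> an occurrence ends at j
                occ.append(j - m + 1)
        return occ

    print(shift_and("xabxabaaca", "abaac"))      # [4]  (0-based start; it ends at position 9)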

An example (j = 1, 2, 3, …, 9)

Worked example for T = xabxabaaca and P = abaac, so U(a)=(1,0,1,1,0), U(b)=(0,1,0,0,0), U(c)=(0,0,0,0,1), U(x)=(0,0,0,0,0):
  j=1:  M(1) = BitShift(M(0)) & U(x) = (1,0,0,0,0) & (0,0,0,0,0) = (0,0,0,0,0)
  j=2:  M(2) = BitShift(M(1)) & U(a) = (1,0,0,0,0) & (1,0,1,1,0) = (1,0,0,0,0)
  j=3:  M(3) = BitShift(M(2)) & U(b) = (1,1,0,0,0) & (0,1,0,0,0) = (0,1,0,0,0)
  ...
  j=9:  M(9) = BitShift(M(8)) & U(c) = (1,1,0,0,1) & (0,0,0,0,1) = (0,0,0,0,1)
        the 5th bit is set  P occurs in T ending at position 9

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac

U(a) = (1,0,1,1,0)    U(b) = (1,1,0,0,0)    U(c) = (0,0,0,0,1)

What about ‘?’, ‘[^…]’ (not).

Problem 1: Another solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of length m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) AND R
 U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

M^l(i,j) = 1 iff
there are no more than l mismatches between the
first i characters of P and the i characters of T
ending at position j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

  BitShift(M^l(j-1)) & U(T[j])

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

  BitShift(M^{l-1}(j-1))

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))
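The k-mismatch search then keeps k+1 such columns; a sketch extending the Shift-And function above, with one Python int per M^l column:

    def shift_and_k_mismatch(T: str, P: str, k: int):
        m = len(P)
        U = {}
        for i, c in enumerate(P):
            U[c] = U.get(c, 0) | (1 << i)
        M = [0] * (k + 1)                        # M[l] = current column of M^l
        occ = []
        for j, c in enumerate(T):
            prev = M[:]
            M[0] = ((prev[0] << 1) | 1) & U.get(c, 0)
            for l in range(1, k + 1):
                # case 1: extend a prefix with <= l mismatches by a matching character
                # case 2: extend a prefix with <= l-1 mismatches, paying one mismatch here
                M[l] = (((prev[l] << 1) | 1) & U.get(c, 0)) | ((prev[l - 1] << 1) | 1)
            if M[k] & (1 << (m - 1)):
                occ.append(j - m + 1)
        return occ

    print(shift_and_k_mismatch("xabxabaaca", "abaad", 1))   # [4]: abaac vs abaad, one mismatch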

Example M1
[Figure: the matrices M0 and M1 for T = xabxabaaca and P = abaad — row 5 of M0 is all zeros (no exact occurrence), while M1(5,9) = 1, so P occurs ending at position 9 with one mismatch]
How much do we pay?





The running time is O(kn(1+m/w)).
Again, the method is practically efficient for small m.
Only O(k) columns of M are needed at any given time;
hence the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Delection: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding
  γ(x) = 0^(Length-1) · (x in binary), i.e. Length-1 zeros followed by the binary representation of x
  x > 0 and Length = ⌊log₂ x⌋ + 1
  e.g., 9 is represented as <000,1001>

  The γ-code for x takes 2⌊log₂ x⌋ + 1 bits
  (i.e. a factor of 2 from optimal)

  Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

Answer: 8, 6, 3, 59, 7
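A tiny decoder makes the exercise mechanical (a sketch working on a string of '0'/'1' characters):

    def gamma_decode(bits: str):
        out, i = [], 0
        while i < len(bits):
            length = 1
            while bits[i] == "0":                   # unary part: Length-1 zeros
                length += 1
                i += 1
            out.append(int(bits[i:i + length], 2))  # binary part: Length bits, starting with 1
            i += length
        return out

    print(gamma_decode("0001000001100110000011101100111"))   # [8, 6, 3, 59, 7]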

Analysis
Sort the p_i in decreasing order, and encode s_i via
the variable-length code γ(i).

Recall that: |γ(i)| ≤ 2·log i + 1
How good is this approach w.r.t. Huffman?
Compression ratio ≤ 2·H_0(s) + 1
Key fact:
  1 ≥ Σ_{i=1..x} p_i ≥ x·p_x  ⇒  x ≤ 1/p_x

How good is it ?
Encode the integers via γ-coding:
  |γ(i)| ≤ 2·log i + 1
The cost of the encoding is (recall i ≤ 1/p_i):

  Σ_{i=1..|S|} p_i · |γ(i)|  ≤  Σ_{i=1..|S|} p_i · [2·log(1/p_i) + 1]  =  2·H_0(X) + 1

Not much worse than Huffman,
and improvable to H_0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n² log n), MTF = O(n log n) + n²

Not much worse than Huffman
...but it may be far better
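The transform itself is a few lines (a quadratic-time sketch with an explicit list; the search-tree/hash-table organisation described below is what makes it O(log |S|) per symbol):

    def mtf_encode(text, alphabet):
        L = list(alphabet)
        out = []
        for s in text:
            pos = L.index(s)             # 1) output the position of s in L (0-based here)
            out.append(pos)
            L.insert(0, L.pop(pos))      # 2) move s to the front of L
        return out

    def mtf_decode(codes, alphabet):
        L = list(alphabet)
        out = []
        for pos in codes:
            s = L[pos]
            out.append(s)
            L.insert(0, L.pop(pos))
        return "".join(out)

    abc = "abcdefghijklmnopqrstuvwxyz"
    codes = mtf_encode("mississippi", abc)
    print(codes, mtf_decode(codes, abc) == "mississippi")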

MTF: how good is it ?
Encode the integers via γ-coding:
  |γ(i)| ≤ 2·log i + 1
Put S at the front and consider the cost of encoding:

  O(|S| log |S|) + Σ_{x=1..|S|} Σ_{i=2..n_x} |γ(p_i^x - p_{i-1}^x)|

By Jensen's inequality:
  ≤ O(|S| log |S|) + Σ_{x=1..|S|} n_x · [2·log(N/n_x) + 1]
  = O(|S| log |S|) + N·[2·H_0(X) + 1]

  L_a[mtf] ≤ 2·H_0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1n 2n 3n… nn 

There is a memory

Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.
with p(a) = .2, p(b) = .5, p(c) = .3:

  f(i) = Σ_{j<i} p(j),  so  f(a) = .0, f(b) = .2, f(c) = .7

  a ↦ [0.0, 0.2)    b ↦ [0.2, 0.7)    c ↦ [0.7, 1.0)

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
Start with [0,1):  b ↦ [.2,.7)  →  then a ↦ [.2,.3)  →  then c ↦ [.27,.3)

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c_1 … c_n with probabilities p[c], use:

  l_0 = 0      l_i = l_{i-1} + s_{i-1} · f[c_i]
  s_0 = 1      s_i = s_{i-1} · p[c_i]

f[c] is the cumulative probability up to symbol c (not included)
The final interval size is

  s_n = Π_{i=1..n} p[c_i]

The interval for a message sequence will be called the
sequence interval

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval
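A floating-point sketch of both directions (fine for short messages; real coders use the integer renormalisation described later):

    def sequence_interval(msg, p, f):
        l, s = 0.0, 1.0
        for c in msg:                       # l_i = l_{i-1} + s_{i-1}*f[c_i], s_i = s_{i-1}*p[c_i]
            l, s = l + s * f[c], s * p[c]
        return l, s

    def decode(value, n, p, f):
        out, l, s = [], 0.0, 1.0
        for _ in range(n):                  # locate value inside the current symbol intervals
            for c in p:
                lo, hi = l + s * f[c], l + s * (f[c] + p[c])
                if lo <= value < hi:
                    out.append(c)
                    l, s = lo, s * p[c]
                    break
        return "".join(out)

    p = {"a": .2, "b": .5, "c": .3}
    f = {"a": .0, "b": .2, "c": .7}
    print(sequence_interval("bac", p, f))   # ~(0.27, 0.03), i.e. the interval [.27, .3)
    print(decode(0.49, 3, p, f))            # 'bbc'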

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:

  .49 ∈ [.2, .7)     ⇒ b
  .49 ∈ [.3, .55)    ⇒ b
  .49 ∈ [.475, .55)  ⇒ c

The message is bbc.

Representing a real number
Binary fractional representation:
. 75

 . 11

1/ 3

 . 01 01

11 / 16

 . 1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
          min       max       interval
  .11     .110…     .111…     [.75, 1.0)
  .101    .1010…    .1011…    [.625, .75)
We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

[Example: sequence interval [.61, .79); the code interval of .101 is [.625, .75), which is contained in it]

Can use L + s/2 truncated to 1 + ⌈log₂(1/s)⌉ bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
  1 + ⌈log₂(1/s)⌉ = 1 + ⌈log₂ Π_i (1/p_i)⌉
  ≤ 2 + Σ_{i=1..n} log₂(1/p_i)
  = 2 + Σ_{k=1..|S|} n·p_k·log₂(1/p_k)
  = 2 + n·H_0   bits

nH_0 + 0.02·n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
  Output 1 followed by m 0s
  m = 0
  Message interval is expanded by 2

If u < R/2 then (bottom half)
  Output 0 followed by m 1s
  m = 0
  Message interval is expanded by 2

If l ≥ R/4 and u < 3R/4 then (middle half)
  Increment m
  Message interval is expanded by 2

All other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
Context
Empty

Counts
A
B
C
$

=
=
=
=

4
2
5
3

Context
A
B
C

Counts
C
$
A
$
A
B
C
$

=
=
=
=
=
=
=
=

3
1
2
1
1
2
2
3

Context
AC

BA
CA
CB
CC

String = ACCBACCACBA B

k=2

Counts
B
C
$
C
$
C
$
A
$
A
B
$

=
=
=
=
=
=
=
=
=
=
=
=

1
2
2
1
1
1
1
2
1
1
1
2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids
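A sketch of the coding loop, with the trie replaced by a plain dictionary of strings (same ids as in the example below):

    def lz78_encode(s):
        D = {"": 0}                                  # id 0 = the empty phrase
        out, cur = [], 0
        while cur < len(s):
            phrase = ""
            while cur < len(s) and phrase + s[cur] in D:
                phrase += s[cur]                     # longest match S already in the dictionary
                cur += 1
            c = s[cur] if cur < len(s) else ""
            out.append((D[phrase], c))               # output (id of S, next char c)
            D[phrase + c] = len(D)                   # add Sc with the next free id
            cur += 1
        return out

    print(lz78_encode("aabaacabcabcb"))
    # [(0,'a'), (1,'b'), (1,'a'), (0,'c'), (2,'c'), (5,'b')]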

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
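The same inversion, with the LF-array built by stably sorting the positions of L (a sketch; the terminator # is assumed to be the smallest character, so row 0 of the sorted matrix is the one starting with #):

    def lf_mapping(L):
        # LF[i] = rank of L[i] among all characters, equal characters kept in relative order
        order = sorted(range(len(L)), key=lambda i: (L[i], i))
        LF = [0] * len(L)
        for rank, i in enumerate(order):
            LF[i] = rank
        return LF

    def invert_bwt(L, terminator="#"):
        LF = lf_mapping(L)
        out, r = [], 0                     # start from row 0, whose F-character is '#'
        for _ in range(len(L) - 1):        # recover T backward (terminator excluded)
            out.append(L[r])
            r = LF[r]
        return "".join(reversed(out)) + terminator

    print(invert_bwt("ipssm#pissii"))      # mississippi#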

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Q(n2 log n) time in the worst-case
• Q(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pr that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
  Pr[in-degree(u) = k]  ∝  1/k^α,   α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pr that a node has x links is 1/x^α, α ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides and efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
Emacs size

Emacs time

uncompr

27Mb

---

gzip

8Mb

35 secs

zdelta

1.5Mb

42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

0

620

2

20

123

2000

20
220

3

1
5

space

time

uncompr

30Mb

---

tgz

20%

linear

THIS

8%

quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F, thus saving
zdelta executions. Nonetheless, strictly n^2 time

          space   time
uncompr   260Mb   ---
tgz       12%     2 mins
THIS      8%      16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
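A toy Python sketch of this block-matching idea (not the real rsync: fixed-size blocks, one MD5 per block, no 4-byte rolling hash, literals emitted byte by byte, so the scan is O(n*B) instead of O(n)); the names block_hashes/encode/decode are illustrative.

import hashlib

def block_hashes(f_old: bytes, B: int):
    # client side: one strong hash per non-overlapping block of f_old
    return {hashlib.md5(f_old[i:i + B]).digest(): i // B
            for i in range(0, len(f_old), B)}

def encode(f_new: bytes, hashes: dict, B: int):
    # server side: emit ("copy", block index) when a full block of f_new
    # matches a block of f_old, otherwise emit a literal byte
    out, j = [], 0
    while j < len(f_new):
        h = hashlib.md5(f_new[j:j + B]).digest()
        if len(f_new) - j >= B and h in hashes:
            out.append(("copy", hashes[h])); j += B
        else:
            out.append(("lit", f_new[j:j + 1])); j += 1
    return out

def decode(encoded, f_old: bytes, B: int):
    # client side: rebuild f_new from f_old plus the encoded stream
    return b"".join(f_old[i * B:(i + 1) * B] if op == "copy" else x
                    for op, x in encoded)

f_old = b"the quick brown fox jumps over the lazy dog"
f_new = b"the quick brown cat jumps over the lazy dog!"
enc = encode(f_new, block_hashes(f_old, 8), 8)
assert decode(enc, f_old, 8) == f_new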

Rsync: some experiments

         gcc size   emacs size
total    27288      27326
gzip     7563       8577
zdelta   227        1431
rsync    964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients check them
Server deploys the common fref to compress the new ftar (rsync just compresses it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
#

0

s

i

1
ssi

ppi#

4

#

1

p

i

3

2

1

ppi#

6

pi#

7

8
5

T# = mississippi#
2 4 6 8 10

si

ppi#

i#

ppi#

11

mississippi#

12

2

1 10

9

4

3

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
5

Q(N2) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Q(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
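A runnable sketch of the indirect binary search (naive O(N^2 log N) suffix-array construction, just for illustration; suffix_array/sa_range are my names, not from the slides):

def suffix_array(T):
    return sorted(range(len(T)), key=lambda i: T[i:])

def sa_range(T, SA, P):
    # each comparison costs O(p) chars; two binary searches give the
    # contiguous range of suffixes having P as a prefix
    lo, hi = 0, len(SA)
    while lo < hi:
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + len(P)] < P: lo = mid + 1
        else: hi = mid
    first = lo
    lo, hi = first, len(SA)
    while lo < hi:
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + len(P)] <= P: lo = mid + 1
        else: hi = mid
    return first, lo

T = "mississippi#"
SA = suffix_array(T)
first, last = sa_range(T, SA, "si")
print(sorted(SA[k] + 1 for k in range(first, last)))   # [4, 7] (1-based)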

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
• Search for an Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
• Search for a subarray Lcp[i,i+C-2] whose entries are all ≥ L
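A quick check of these Lcp-based queries in Python (naive constructions, illustrative names):

def suffix_array(T):
    return sorted(range(len(T)), key=lambda i: T[i:])

def lcp_array(T, SA):
    def lcp(a, b):
        k = 0
        while a + k < len(T) and b + k < len(T) and T[a + k] == T[b + k]:
            k += 1
        return k
    return [lcp(SA[i], SA[i + 1]) for i in range(len(SA) - 1)]

T = "mississippi#"
Lcp = lcp_array(T, suffix_array(T))
print(max(Lcp))   # 4  => there is a repeated substring of length >= 4 ("issi" occurs twice)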



The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 10^4 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and the algorithm uses all of it:
(1/B) * (p * f/(1+f) * C) ≈ 30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
[Figure: memory hierarchy — CPU/registers, L1/L2 cache (few Mbs, some nanosecs, few words fetched), RAM (few Gbs, tens of nanosecs, some words fetched), HD (few Tbs, few millisecs, B = 32K page), net (many Tbs, even secs, packets)]

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
       4K    8K   16K   32K    128K   256K   512K   1M
n^3    22s   3m   26m   3.5h   28h    --     --     --
n^2    0     0    0     1s     26s    106s   7m     28m

An optimal solution
We assume every prefix-subsum ≠ 0.

[Figure: A is split into a prefix of negative sum followed by the optimum window of positive sum]

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm
  sum = 0; max = -1;
  for i = 1,...,n do
    if (sum + A[i] ≤ 0) then sum = 0;
    else { sum += A[i]; max = MAX{max, sum}; }

Note:
• Sum < 0 right before OPT starts;
• Sum > 0 within OPT
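A runnable version of this single scan (essentially Kadane's algorithm); max_subarray_sum is an illustrative name and, like the slide, it assumes the best window has positive sum.

def max_subarray_sum(A):
    best, running = float("-inf"), 0
    for x in A:
        if running + x <= 0:
            running = 0                     # the optimum cannot start earlier
        else:
            running += x
            best = max(best, running)
    return best

A = [2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]
print(max_subarray_sum(A))                  # 12, i.e. the window 6 1 -2 4 3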

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB
  n insertions ⇒ data get distributed arbitrarily !!!

[Figure: B-tree internal nodes and leaves ("tuple pointers") pointing to the tuples on disk — what about listing the tuples in order?]

Possibly 10^9 random I/Os = 10^9 * 5ms ≈ 2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
  if (i < j) then
    m = (i+j)/2;             // Divide
    Merge-Sort(A,i,m);       // Conquer
    Merge-Sort(A,m+1,j);     // Conquer
    Merge(A,i,m,j)           // Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n = 10^9 tuples ⇒ a few Gbs

Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:

It is an indirect sort: Θ(n log2 n) random I/Os

[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

[Figure: the recursion tree of binary Merge-Sort, of depth log2 N, merging progressively longer sorted runs]

If the run-size is larger than B (i.e. after the first step!!),
fetching all of it in memory for merging does not help.

How do we deploy the disk/mem features ?

N/M runs, each sorted in internal memory (no I/Os)
— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort

The key is to balance run-size and #runs to merge.
Sort N items with main-memory M and disk-pages B:
  Pass 1: produce (N/M) sorted runs.
  Pass i: merge X ≤ M/B runs  ⇒  log_{M/B}(N/M) passes

[Figure: X input buffers (one per run) and one output buffer, each of B items, kept in main memory between the input disk and the output disk]

Multiway Merging

[Figure: input buffers Bf1..BfX with cursors p1..pX (one per run, X = M/B) and an output buffer Bfo: repeatedly move min(Bf1[p1], Bf2[p2], …, BfX[pX]) to Bfo; fetch the next page of run i when pi = B, flush Bfo to the merged output run when it is full, until EOF]
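A minimal in-memory sketch of one merging pass, with a heap playing the role of the "min over the X buffer heads" step (a real external-memory pass would move pages of B items between disk and the buffers):

import heapq

def multiway_merge(runs):
    # runs: list of already-sorted lists; returns the merged sorted run
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        val, i, j = heapq.heappop(heap)     # min among the current run heads
        out.append(val)
        if j + 1 < len(runs[i]):
            heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
    return out

print(multiway_merge([[1, 2, 5, 10], [2, 7, 13, 19], [3, 4, 8, 15]]))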

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;
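The same scan as runnable code (this is the classic Boyer-Moore majority-vote scheme, restructured slightly from the slide's pseudocode):

def majority_candidate(stream):
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1
        elif X == s:
            C += 1
        else:
            C -= 1
    return X        # it is the true majority only if some item occurs > N/2 times

A = "b a c c c d c b a a a c c b c c c".split()
print(majority_candidate(A))    # 'c' (it occurs 9 times out of 17)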

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 10^9, size = 6Gb
n = 10^6 documents
TotT = 10^9 total terms (avg term length is 6 chars)
t = 5 * 10^5 distinct terms

What kind of data structure should we build to support
word-based searches ?

Solution 1: Term-Doc matrix          (n = 1 million documents, t = 500K terms)

             Antony and   Julius    The       Hamlet   Othello   Macbeth
             Cleopatra    Caesar    Tempest
Antony           1           1         0         0        0         1
Brutus           1           1         0         1        0         0
Caesar           1           1         0         1        1         1
Calpurnia        0           1         0         0        0         0
Cleopatra        1           0         0         0        0         0
mercy            1           0         1         1        1         1
worser           1           0         1         1        1         0

1 if the play contains the word, 0 otherwise.
Space is 500Gb !

Solution 2: Inverted index

Brutus     →  2  4  8  16  32  64  128
Calpurnia  →  1  2  3  5  8  13  21  34
Caesar     →  13  16

We can do still better: 30÷50% of the original text.

1. Typically use about 12 bytes
2. We have 10^9 total terms ⇒ at least 12Gb space
3. Compressing the 6Gb documents gets 1.5Gb data
Better index, but yet it is >10 times the text !!!!
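A toy word-based inverted index, just to make the structure concrete (term → sorted list of docIDs, queried by merging postings); build_index/intersect are illustrative names:

def build_index(docs):
    index = {}
    for doc_id, text in enumerate(docs, start=1):
        for term in set(text.lower().split()):
            index.setdefault(term, []).append(doc_id)
    return index                      # postings stay sorted: docIDs only grow

def intersect(p1, p2):
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

docs = ["brutus killed caesar", "caesar was ambitious", "brutus was honourable"]
idx = build_index(docs)
print(intersect(idx["brutus"], idx["caesar"]))   # [1]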

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i(s) = log2 (1 / p(s)) = - log2 p(s)

Lower probability ⇒ higher information

Entropy is the weighted average of i(s):

H(S) = Σ_{s ∈ S} p(s) * log2 (1 / p(s))    bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La(C) = Σ_{s ∈ S} p(s) * L[s]

We say that a prefix code C is optimal if for
all prefix codes C', La(C) ≤ La(C')

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H(S) ≤ La(C)

Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La(C) ≤ H(S) + 1

(the Shannon code takes ⌈log2 1/p(s)⌉ bits per symbol)

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

[Figure: Huffman tree — a(.1) and b(.2) merge into (.3); (.3) and c(.2) merge into (.5); (.5) and d(.5) merge into the root (1)]

a=000, b=001, c=01, d=1

There are 2^(n-1) "equivalent" Huffman trees

What about ties (and thus, tree depth) ?
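A compact Huffman construction with a heap, run on this example; heapq and the symbol-grouping trick are my implementation choices, and the exact 0/1 labels depend on how ties are broken (hence the 2^(n-1) "equivalent" trees).

import heapq
from itertools import count

def huffman_codes(probs):
    tick = count()                               # tie-breaker for equal weights
    heap = [(p, next(tick), (sym,)) for sym, p in probs.items()]
    heapq.heapify(heap)
    code = {sym: "" for sym in probs}
    while len(heap) > 1:
        p1, _, group1 = heapq.heappop(heap)      # two smallest weights
        p2, _, group2 = heapq.heappop(heap)
        for sym in group1:
            code[sym] = "0" + code[sym]          # one subtree gets bit 0
        for sym in group2:
            code[sym] = "1" + code[sym]          # the other gets bit 1
        heapq.heappush(heap, (p1 + p2, next(tick), group1 + group2))
    return code

print(huffman_codes({"a": .1, "b": .2, "c": .2, "d": .5}))
# {'a': '110', 'b': '111', 'c': '10', 'd': '0'}: lengths 3,3,2,1,
# i.e. a=000, b=001, c=01, d=1 up to which child takes 0 and which takes 1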

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
abc...   →  000 001 01...  =  00000101...
101001...  →  dcb...

[Figure: the Huffman tree of the running example, traversed root-to-leaf for encoding and bit-by-bit for decoding]

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
-log2(.999) ≈ .00144

If we were to send 1000 such symbols we
might hope to use 1000 * .00144 ≈ 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
[Figure: a word-based Huffman tree with fan-out 128 over the words of T = "bzip or not bzip" (bzip, or, not, and the space separator); each codeword is a sequence of bytes carrying 7 bits of Huffman code, and the first byte of every codeword is tagged so that codewords are byte-aligned in C(T)]
CGrep and other ideas...
P = bzip = 1a 0b

[Figure: compressed GREP searches the byte-aligned codeword of P directly inside C(T), for T = "bzip or not bzip", checking the tag bits so that matches are aligned on codeword boundaries]

Speed ≈ Compression ratio

You find this under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary: bzip, not, or, space

P = bzip = 1a 0b

S = "bzip or not bzip"

[Figure: the dictionary codewords and the compressed text C(S); the codeword of P is searched directly inside C(S)]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H(s) = Σ_{i=1..m} 2^(m-i) * s[i]

P = 0101
H(P) = 2^3*0 + 2^2*1 + 2^1*0 + 2^0*1 = 5

s = s'  if and only if  H(s) = H(s')

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H(Tr) = 2 * H(Tr-1) - 2^m * T[r-1] + T[r+m-1]

T = 10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H(T1) = H(1011) = 11
H(T2) = H(0110) = 2*11 - 2^4*1 + 0 = 22 - 16 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
(1*2 (mod 7)) + 0 = 2
(2*2 (mod 7)) + 1 = 5
(5*2 (mod 7)) + 1 = 4
(4*2 (mod 7)) + 1 = 2
(2*2 (mod 7)) + 1 = 5
5 (mod 7) = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1):
2^m (mod q) = 2*(2^(m-1) (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
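A sketch of the fingerprint scan in Python (binary strings, a fixed Mersenne prime q instead of a randomly chosen one, and explicit verification of every hash hit, so it never errs); karp_rabin is an illustrative name:

def karp_rabin(T, P, q=2_147_483_647):
    n, m = len(T), len(P)
    if m > n:
        return []
    two_m = pow(2, m, q)                    # 2^m (mod q), computed once
    hp = ht = 0
    for i in range(m):                      # fingerprints of P and of T_1
        hp = (2 * hp + int(P[i])) % q
        ht = (2 * ht + int(T[i])) % q
    hits = []
    for r in range(n - m + 1):
        if ht == hp and T[r:r + m] == P:    # verify, so no false matches
            hits.append(r + 1)              # 1-based positions, as in the slides
        if r + m < n:                       # roll: drop T[r], append T[r+m]
            ht = (2 * ht - two_m * int(T[r]) + int(T[r + m])) % q
    return hits

print(karp_rabin("10110101", "0101"))       # [5]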

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

[Figure: the 3 x 10 matrix M for T = california and P = for; the only 1-entries are those matching f, fo, for — in particular M(3,7) = 1 because P = for occurs ending at position 7 of T]

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit bit-parallelism to
compute the j-th column of M from the (j-1)-th one.
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1 for the
positions in P where character x appears.
Example:
P = abaac
U(a) = (1,0,1,1,0)    U(b) = (0,1,0,0,0)    U(c) = (0,0,0,0,1)

How to construct M

Initialize column 0 of M to all zeros.
For j > 0, the j-th column is obtained by

  M(j) = BitShift(M(j-1)) & U(T[j])

For i > 1, entry M(i,j) = 1 iff
(1) the first i-1 characters of P match the i-1 characters of T
    ending at character j-1   ⇔  M(i-1,j-1) = 1
(2) P[i] = T[j]               ⇔  the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) into the i-th position;
AND this with the i-th bit of U(T[j]) to establish if both are true.
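The construction above, written with Python integers as bit-columns (bit i-1 of the mask = row i of M); shift_and is an illustrative name and, unlike a C implementation, Python integers simply grow beyond the word size w.

def shift_and(T, P):
    m = len(P)
    U = {}                                   # U[x]: bit i set iff P[i+1] == x
    for i, x in enumerate(P):
        U[x] = U.get(x, 0) | (1 << i)
    M, occ = 0, []
    for j, c in enumerate(T, start=1):
        M = ((M << 1) | 1) & U.get(c, 0)     # BitShift(M(j-1)) & U(T[j])
        if M & (1 << (m - 1)):               # last row set: P ends at position j
            occ.append(j)
    return occ

print(shift_and("xabxabaaca", "abaac"))      # [9]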

Examples (j = 1, 2, 3, 9) for T = xabxabaaca and P = abaac

[Figure: the columns M(1), M(2), M(3), …, M(9) computed as M(j) = BitShift(M(j-1)) & U(T[j]); at j = 9 the last bit of the column is set, i.e. P = abaac occurs ending at position 9 of T]

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac

U(a) = (1,0,1,1,0)    U(b) = (1,1,0,0,0)    U(c) = (0,0,0,0,1)

What about '?', '[^…]' (not) ?

Problem 1: Another solution
Dictionary: bzip, not, or, space

P = bzip = 1a 0b

S = "bzip or not bzip"

[Figure: the dictionary and the compressed text C(S), with the codeword of P searched inside C(S)]

Speed ≈ Compression ratio

Problem 2
Dictionary: bzip, not, or, space

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

P = o

S = "bzip or not bzip"

not = 1g 0g 0a
or  = 1g 0a 0b

[Figure: the dictionary and the compressed text C(S)]

Speed ≈ Compression ratio? No! Why?
A scan of C(S) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of length m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U'(c) = U(c) AND R
  ⇒ U'(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary: bzip, not, or, space

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

P = bot,  k = 2

S = "bzip or not bzip"

[Figure: the dictionary and the compressed text C(S)]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing M^l: case 1

The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal:

  BitShift(M^l(j-1)) & U(T[j])

Computing M^l: case 2

The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches:

  BitShift(M^(l-1)(j-1))

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example
T = xabxabaaca,  P = abaad

[Figure: the matrices M^0 and M^1 for T and P; row 5 of M^1 has a 1 in column 9, i.e. P occurs ending at position 9 of T with at most one mismatch]

How much do we pay?





The running time is O(kn(1+m/w)).
Again, the method is practically efficient for
small m.
Still, only O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Delection: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding

  g(x) = 00...0 (Length-1 zeros) followed by x in binary

x > 0 and Length = floor(log2 x) + 1
e.g., 9 is represented as <000,1001>.

g-code for x takes 2*floor(log2 x) + 1 bits
(i.e. a factor of 2 from the optimal)

Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

→ 8, 6, 3, 59, 7
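The g-code made concrete; decoding the bit string of the exercise gives back exactly 8, 6, 3, 59, 7 (gamma_encode/gamma_decode are illustrative names):

def gamma_encode(x):                   # x > 0
    b = bin(x)[2:]                     # x in binary, Length = len(b)
    return "0" * (len(b) - 1) + b      # (Length-1) zeros, then x in binary

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":          # unary prefix: Length-1 zeros
            zeros += 1
            i += 1
        out.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return out

print("".join(gamma_encode(x) for x in [8, 6, 3, 59, 7]))
print(gamma_decode("0001000001100110000011101100111"))   # [8, 6, 3, 59, 7]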

Analysis
Sort the pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(S) + 1
Key fact:
1 ≥ Σ_{i=1,...,x} pi ≥ x * px   ⇒   x ≤ 1/px

How good is it ?
Encode the integers via g-coding:
|g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):

  Σ_{i=1,..,|S|} pi * |g(i)|  ≤  Σ_{i=1,..,|S|} pi * [2 * log(1/pi) + 1]  =  2 * H0(X) + 1

Not much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n^2 log n), MTF = O(n log n) + n^2

Not much worse than Huffman
...but it may be far better
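Move-to-Front as runnable code (this version outputs 1-based positions; the Bzip example later on the slides uses 0-based ranks, which is just a shift):

def mtf_encode(s, alphabet):
    L = list(alphabet)
    out = []
    for ch in s:
        pos = L.index(ch)              # position of ch in the current list
        out.append(pos + 1)
        L.insert(0, L.pop(pos))        # move ch to the front of L
    return out

print(mtf_encode("aaabbbc", "abc"))    # [1, 1, 1, 2, 1, 1, 3]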

MTF: how good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
Put S in the front and consider the cost of encoding:
S

O ( S log S ) 



nx

g(p - p )

x 1 i  2

x

x

i

i -1

S

By Jensen’s:

 O ( S log S ) 

n
x 1

x

[ 2 * log

N

 1]

nx

 O ( S log S )  N * [ 2 * H 0 ( X )  1]
L a [ mtf ]  2 * H 0 ( X )  O (1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1n 2n 3n… nn 

There is a memory

Huff(X) = Q(n2 log n) > Rle(X) = Q(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.  p(a) = .2, p(b) = .5, p(c) = .3

  f(i) = Σ_{j < i} p(j)   ⇒   f(a) = .0, f(b) = .2, f(c) = .7

  a = [0.0, 0.2),   b = [0.2, 0.7),   c = [0.7, 1.0)

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
1.0

0.7
c = .3

0.7

c = .3
0.55

0.0

a = .2

0.3
0.2

c = .3

0.27
b = .5

b = .5
0.2

0.3

a = .2

b = .5
0.22

a = .2

0.2

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
  l_0 = 0,   s_0 = 1
  l_i = l_{i-1} + s_{i-1} * f[c_i]
  s_i = s_{i-1} * p[c_i]

f[c] is the cumulative prob. up to symbol c (not included)

Final interval size is   s_n = Π_{i=1..n} p[c_i]

The interval for a message sequence will be called the
sequence interval
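The recurrence run on the "bac" example (sequence_interval is an illustrative name; floating point is used here only for readability, the integer version discussed later avoids it):

def sequence_interval(msg, p, f):
    l, s = 0.0, 1.0
    for c in msg:
        l = l + s * f[c]               # l_i = l_{i-1} + s_{i-1} * f[c_i]
        s = s * p[c]                   # s_i = s_{i-1} * p[c_i]
    return l, l + s

p = {"a": .2, "b": .5, "c": .3}
f = {"a": .0, "b": .2, "c": .7}
print(sequence_interval("bac", p, f))  # ≈ (0.27, 0.30), as in the example below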

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
1.0

0.7
c = .3

0.7
0.49

b = .5

c = .3

0.55
0.49

0.2

0.0

0.55

b = .5

0.3
a = .2

The message is bbc.

0.2

0.49
0.475

c = .3

b = .5
0.35

a = .2

0.3

a = .2

Representing a real number
Binary fractional representation:
. 75

 . 11

1/ 3

 . 01 01

11 / 16

 . 1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
m in

m ax

interval

.11

.110

.111

[.75 ,1.0 )

.101

.1010

.1011

[.625 , .75 )

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
  Output 1 followed by m 0s
  m = 0
  Message interval is expanded by 2

If u < R/2 then (bottom half)
  Output 0 followed by m 1s
  m = 0
  Message interval is expanded by 2

If l ≥ R/4 and u < 3R/4 then (middle half)
  Increment m
  Message interval is expanded by 2

All other cases: just continue...

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts                    String = ACCBACCACBA B      k = 2

Order-0 context (empty):   A = 4   B = 2   C = 5   $ = 3

Order-1 contexts:
  A:   C = 3   $ = 1
  B:   A = 2   $ = 1
  C:   A = 1   B = 2   C = 2   $ = 3

Order-2 contexts:
  AC:  B = 1   C = 2   $ = 2
  BA:  C = 1   $ = 1
  CA:  C = 1   $ = 1
  CB:  A = 2   $ = 1
  CC:  A = 1   B = 1   $ = 2
You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb
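An LZW encoder following the slides' convention that single characters are pre-loaded in the dictionary (here only a=112, b=113, c=114 instead of all 256 ASCII codes, as a simplification); on the example string it emits the codes shown above plus a final 113 for the pending phrase:

def lzw_encode(text, base_codes):
    dic = dict(base_codes)             # phrase -> code
    next_code = 256
    S, out = "", []
    for c in text:
        if S + c in dic:
            S += c                     # keep extending the current phrase
        else:
            out.append(dic[S])         # emit S, add Sc, but do not emit c
            dic[S + c] = next_code
            next_code += 1
            S = c
    if S:
        out.append(dic[S])
    return out

print(lzw_encode("aabaacababacb", {"a": 112, "b": 113, "c": 114}))
# [112, 112, 113, 256, 114, 257, 261, 114, 113]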

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
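A naive (quadratic) BWT and its inversion via the LF mapping, matching the example above (bwt/inverse_bwt are illustrative names; T must end with the unique sentinel '#'):

def bwt(T):
    rot = sorted(T[i:] + T[:i] for i in range(len(T)))   # sorted rotations
    return "".join(r[-1] for r in rot)                    # column L

def inverse_bwt(L):
    n = len(L)
    # stable sort of positions by character: the k-th 'c' in L maps to the
    # k-th 'c' in F, which is exactly the LF mapping
    order = sorted(range(n), key=lambda i: L[i])
    lf = [0] * n
    for f_row, l_pos in enumerate(order):
        lf[l_pos] = f_row
    out, r = [], 0                     # row 0 is the rotation starting with '#'
    for _ in range(n):
        out.append(L[r])               # L[r] precedes F[r] in T
        r = lf[r]
    s = "".join(reversed(out))         # equals '#' + T with its final '#' removed
    return s[1:] + s[0]

L = bwt("mississippi#")
print(L)                               # ipssm#pissii
print(inverse_bwt(L))                  # mississippi#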

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Q(n2 log n) time in the worst-case
• Q(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in - degree ( u )  k ] 

a  2.1

1

k

a

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides and efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
Emacs size

Emacs time

uncompr

27Mb

---

gzip

8Mb

35 secs

zdelta

1.5Mb

42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

0

620

2

20

123

2000

20
220

3

1
5

space

time

uncompr

30Mb

---

tgz

20%

linear

THIS

8%

quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
space

time

uncompr

260Mb

---

tgz

12%

2 mins

THIS

8%

16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
[Diagram: the Client (holding f_old) sends block hashes to the Server (holding f_new); the Server replies with an encoded file made of block references and literal bytes]

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
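A minimal Python sketch of the block-matching idea behind rsync (not the real protocol or its wire format): the client hashes fixed-size blocks of f_old, the server scans f_new and emits copy/literal instructions. A toy checksum plus MD5 stands in for the 4-byte rolling hash + 2-byte MD5 pair, and the weak hash is recomputed at every position instead of being rolled.

import hashlib

def weak_hash(block: bytes) -> int:
    # Toy additive checksum; the real rsync rolls an Adler-like hash in O(1) per shift.
    return sum(block) & 0xFFFF

def client_signatures(f_old: bytes, B: int) -> dict:
    # The client sends (weak, strong) hashes of its non-overlapping B-byte blocks.
    sig = {}
    for off in range(0, len(f_old), B):
        block = f_old[off:off + B]
        sig.setdefault(weak_hash(block), {})[hashlib.md5(block).hexdigest()] = off
    return sig

def server_encode(f_new: bytes, sig: dict, B: int) -> list:
    # Scan f_new: emit ('copy', offset) when the client already has the block,
    # ('lit', byte) otherwise (rsync would gzip these literals).
    out, i = [], 0
    while i < len(f_new):
        block = f_new[i:i + B]
        off = sig.get(weak_hash(block), {}).get(hashlib.md5(block).hexdigest())
        if off is not None and len(block) == B:
            out.append(('copy', off)); i += B
        else:
            out.append(('lit', f_new[i:i + 1])); i += 1
    return out

def client_decode(f_old: bytes, encoded: list, B: int) -> bytes:
    return b"".join(f_old[arg:arg + B] if op == 'copy' else arg for op, arg in encoded)

f_old = b"abcdefgh" * 64
f_new = f_old[:100] + b"XYZ" + f_old[100:]
enc = server_encode(f_new, client_signatures(f_old, 16), 16)
assert client_decode(f_old, enc, 16) == f_new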

Rsync: some experiments

         gcc size    emacs size
total    27288       27326
gzip     7563        8577
zdelta   227         1431
rsync    964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends the hashes (unlike rsync, where the client sends them), and the client checks them
The server deploys the common f_ref to compress the new f_tar (rsync, instead, just compresses f_tar on its own).

A multi-round protocol

k blocks of n/k elems

log(n/k) levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits.
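A sketch of the recursion in Python, with both files in one process purely to illustrate the protocol; in the real setting only the block hashes would cross the link. It assumes the two files have the same length and that k ≥ 2.

import hashlib

def h(x: bytes) -> bytes:
    return hashlib.sha1(x).digest()

def multi_round_diff(a: bytes, b: bytes, k: int, lo: int = 0, hi: int = None) -> list:
    # Split [lo, hi) into (at most) k blocks, exchange one hash per block,
    # and recurse only into the blocks whose hashes disagree.
    assert k >= 2
    if hi is None:
        hi = len(a)
    step = max(1, -(-(hi - lo) // k))          # ceil((hi - lo) / k)
    diffs = []
    for i in range(lo, hi, step):
        j = min(i + step, hi)
        if h(a[i:j]) != h(b[i:j]):             # only these hashes would cross the link
            diffs.extend([i] if j - i == 1 else multi_round_diff(a, b, k, i, j))
    return diffs

x = bytearray(b"A" * 1024); y = bytearray(x)
y[100] = y[700] = ord("B")
print(multi_round_diff(bytes(x), bytes(y), k=4))   # [100, 700]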

Next lecture

Set reconciliation
Problem: Given two sets S_A and S_B of integer values located on two machines A and B, determine the difference between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

log(n/k) levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits.

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?
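Before moving to substring search, here is a minimal Python sketch of the “array of string pointers” answer to prefix search: keep the strings sorted and delimit, with two binary searches, the contiguous range of strings having P as a prefix. The sentinel character used for the upper bound is an assumption (it must compare larger than any character actually in use).

import bisect

def prefix_range(A: list, P: str):
    # A is a sorted array of strings; all strings with prefix P are contiguous in it.
    lo = bisect.bisect_left(A, P)
    hi = bisect.bisect_right(A, P + "\U0010FFFF")   # sentinel > any char in use (assumption)
    return lo, hi

A = sorted(["mirror", "miss", "mississippi", "mist", "missive"])
lo, hi = prefix_range(A, "miss")
print(A[lo:hi])   # ['miss', 'mississippi', 'missive']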

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

[Figure: P aligned inside T at position i, as a prefix of the suffix T[i,N]]
Occurrences of P in T = All suffixes of T having P as a prefix

Example: P = si, T = mississippi → P occurs at positions 4 and 7.

SUF(T) = Sorted set of suffixes of T
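A tiny Python illustration of this fact, using 0-based indices rather than the slides’ 1-based ones:

def occurrences(T: str, P: str) -> list:
    # P occurs at position i iff P is a prefix of the suffix T[i:].
    return [i for i in range(len(T)) if T[i:i + len(P)] == P]

print(occurrences("mississippi", "si"))   # [3, 6], i.e. positions 4 and 7 counting from 1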

Reduction

From substring search
To prefix search

The Suffix Tree
[Figure: the suffix tree of T# = mississippi#. Edges are labeled with substrings (e.g. “i”, “ssi”, “ppi#”, “pi#”, “mississippi#”) and the 12 leaves store the starting positions of the corresponding suffixes]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. The starting position of that range is the lexicographic position of P among the suffixes.
Storing SUF(T) explicitly takes Θ(N²) space; the suffix array SA keeps only the suffix pointers:

SA    SUF(T)
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#

T = mississippi#
(P = si selects the contiguous range sippi#, sissippi#, i.e. positions 7 and 4)
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes
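A direct Python sketch of the definition (sorting the suffixes themselves; simple but far from the optimal constructions):

def suffix_array(T: str) -> list:
    # Sort the suffix starting positions by the suffixes they point to.
    # Each comparison may cost Theta(N), so this is only for illustration.
    return sorted(range(len(T)), key=lambda i: T[i:])

T = "mississippi#"
print(suffix_array(T))   # [11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]  (add 1 to get the slides' SA)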

Searching a pattern
Indirect binary search on SA: O(p) time per suffix cmp

[Figure: one step of the binary search on SA for P = si over T = mississippi#; at this step P is larger than the probed suffix (2 accesses per step)]

Searching a pattern
Indirect binary search on SA: O(p) time per suffix cmp

[Figure: a later step of the same binary search over T = mississippi#, where P = si is smaller than the probed suffix]

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

The bound improves to O(p + log2 N) [Manber-Myers, ’90] and to O(p + log2 |S|) [Cole et al, ’06]
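A Python sketch of the indirect binary search: two searches delimit the SA range [left, right) of suffixes having P as a prefix, each comparison costing O(p). Indices are 0-based.

def sa_search(T: str, SA: list, P: str):
    # Return [left, right): the SA range of suffixes whose first |P| chars equal P.
    n, p = len(SA), len(P)
    lo, hi = 0, n
    while lo < hi:                              # first suffix whose p-prefix is >= P
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + p] < P: lo = mid + 1
        else: hi = mid
    left, hi = lo, n
    while lo < hi:                              # first suffix whose p-prefix is > P
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + p] <= P: lo = mid + 1
        else: hi = mid
    return left, lo                             # occurrences are SA[left:lo]

T = "mississippi#"
SA = sorted(range(len(T)), key=lambda i: T[i:])
l, r = sa_search(T, SA, "si")
print(r - l, sorted(SA[l:r]))   # 2 [3, 6]   (occ = 2, at positions 4 and 7 counting from 1)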

Locating the occurrences
[Figure: locating the occurrences of P = si in T = mississippi#; the SA range found by the binary search contains the entries 7 and 4, hence occ = 2 (the range can be delimited by searching for si# and si$)]

Suffix Array search
• O (p + log2 N + occ) time

(sentinel ordering: # precedes and $ follows every symbol of S)
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

[Figure: the Lcp array drawn alongside SA for T = mississippi#; for instance, the adjacent suffixes issippi# and ississippi# share the longest common prefix “issi” of length 4]
• How long is the common prefix between T[i,...] and T[j,...]?
• It is the minimum of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L?
• Search for an entry Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times?
• Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.
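A Python sketch of these checks, computing Lcp naively from its definition (Kasai’s algorithm would do it in O(N) time); indices are 0-based.

def lcp_array(T: str, SA: list) -> list:
    # Lcp[i] = length of the longest common prefix of the adjacent suffixes SA[i], SA[i+1].
    def lcp(i, j):
        k = 0
        while i + k < len(T) and j + k < len(T) and T[i + k] == T[j + k]:
            k += 1
        return k
    return [lcp(SA[i], SA[i + 1]) for i in range(len(SA) - 1)]

def has_repeat(T: str, SA: list, L: int) -> bool:
    # A substring of length >= L occurring at least twice exists iff some Lcp entry is >= L.
    return any(v >= L for v in lcp_array(T, SA))

T = "mississippi#"
SA = sorted(range(len(T)), key=lambda i: T[i:])
print(lcp_array(T, SA))        # [0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3]
print(has_repeat(T, SA, 4))    # True: "issi" repeats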


Slide 163

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with i-th bit of U(T[j]) to estabilish if both are true

An example j=1
1 2 3 4 5 6 7 8 9 10

n
m

1

12345

1

0

P=abaac

2

0

3

0

4

0

5

0

T=xabxabaaca

0
 
0
U (x)  0
 
0
 
0

2

3

4

5

6

7

1 
 
0
BitShift ( M ( 0 )) & U (T [1])   0  &
 
0
 
0

8

9

0 0
   
0 0
0  0
   
0 0
   
0 0

1
0

An example j=2
1 2 3 4 5 6 7 8 9 10

n
m

1

2

12345

1

0

1

P=abaac

2

0

0

3

0

0

4

0

0

5

0

0

T=xabxabaaca

1 
 
0
U (a )   1 
 
1 
 
0

3

4

5

6

7

1 
 
0
BitShift ( M (1)) & U (T [ 2 ])   0  &
 
0
 
0

8

9

1  1 
   
0 0
1    0 
   
1   0 
   
0 0

1
0

An example j=3
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

12345

1

0

1

0

P=abaac

2

0

0

1

3

0

0

0

4

0

0

0

5

0

0

0

T=xabxabaaca

0
 
1 
U (b )   0 
 
0
 
0

4

5

6

7

1 
 
1 
BitShift ( M ( 2 )) & U (T [ 3 ])   0  &
 
0
 
0

8

9

0 0
   
1  1 
0  0
   
0 0
   
0 0

1
0

An example j=9
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

4

5

6

7

8

9

12345

1

0

1

0

0

1

0

1

1

0

P=abaac

2

0

0

1

0

0

1

0

0

0

3

0

0

0

0

0

0

1

0

0

4

0

0

0

0

0

0

0

1

0

5

0

0

0

0

0

0

0

0

1

T=xabxabaaca

0
 
0
U (c )   0 
 
0
 
1 

1 
 
1 
BitShift ( M ( 8 )) & U (T [ 9 ])   0  &
 
0
 
1 

0 0
   
0 0
0  0
   
0 0
   
1  1 

1
0

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M ( j - 1))  U (T [ j ])
l

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M

l -1

( j - 1))

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1
1 2 3 4 5 6 7 8 910

M1=

T=xabxabaaca
P=
abaad

1

2

3

4

5

6

7

8

9

1
0

1

1

1

1

1

1

1

1

1

1

1

2

0

0

1

0

0

1

0

1

1

0

3

0

0

0

1

0

0

1

0

0

1

4

0

0

0

0

1

0

0

1

0

0

5

0

0

0

0

0

0

0

0

1

0

M0=

1

2

3

4

5

6

7

8

9

1
0

1

0

1

0

0

1

0

1

1

0

1

2

0

0

1

0

0

1

0

0

0

0

3

0

0

0

0

0

0

1

0

0

0

4

0

0

0

0

0

0

0

1

0

0

5

0

0

0

0

0

0

0

0

0

0

How much do we pay?





The running time is O(kn(1+m/w)
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Delection: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
0 000 ...........0 x in b in ary
Length-1


x > 0 and Length = log2 x +1
e.g., 9 represented as <000,1001>.



g-code for x takes 2 log2 x +1 bits

(ie. factor of 2 from optimal)



Optimal for Pr(x) = 1/2x2, and i.i.d integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8

6

3

59

7

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How much good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
1 ≥Si=1,...,x pi ≥ x * px  x ≤ 1/px

How good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):



p i  | g ( i ) |

i  1 ,..., S

This is:



i  1 ,.., S

p i  [ 2 * log

1

 1]

pi

 2 * H 0 ( X ) 1
No much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1n 2n 3n… nn  Huff = O(n2 log n), MTF = O(n log n) + n2

No much worse than Huffman
...but it may be far better

MTF: how good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
Put S in the front and consider the cost of encoding:
S

O ( S log S ) 



nx

g(p - p )

x 1 i  2

x

x

i

i -1

S

By Jensen’s:

 O ( S log S ) 

n
x 1

x

[ 2 * log

N

 1]

nx

 O ( S log S )  N * [ 2 * H 0 ( X )  1]
L a [ mtf ]  2 * H 0 ( X )  O (1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1n 2n 3n… nn 

There is a memory

Huff(X) = Q(n2 log n) > Rle(X) = Q(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.
1.0

c = .3

0.7

i -1

f (i ) 



p( j)

j 1

b = .5
0.2
0.0

f(a) = .0, f(b) = .2, f(c) = .7
a = .2

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
1.0

0.7
c = .3

0.7

c = .3
0.55

0.0

a = .2

0.3
0.2

c = .3

0.27
b = .5

b = .5
0.2

0.3

a = .2

b = .5
0.22

a = .2

0.2

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l 0  0 l i  l i -1  s i -1 * f c i 

s0  1

s i  s i -1 * p c i 

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

n

sn 

 p c 
i

i 1

The interval for a message sequence will be called the
sequence interval

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
1.0

0.7
c = .3

0.7
0.49

b = .5

c = .3

0.55
0.49

0.2

0.0

0.55

b = .5

0.3
a = .2

The message is bbc.

0.2

0.49
0.475

c = .3

b = .5
0.35

a = .2

0.3

a = .2

Representing a real number
Binary fractional representation:
. 75

 . 11

1/ 3

 . 01 01

11 / 16

 . 1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
m in

m ax

interval

.11

.110

.111

[.75 ,1.0 )

.101

.1010

.1011

[.625 , .75 )

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
Context
Empty

Counts
A
B
C
$

=
=
=
=

4
2
5
3

Context
A
B
C

Counts
C
$
A
$
A
B
C
$

=
=
=
=
=
=
=
=

3
1
2
1
1
2
2
3

Context
AC

BA
CA
CB
CC

String = ACCBACCACBA B

k=2

Counts
B
C
$
C
$
C
$
A
$
A
B
$

=
=
=
=
=
=
=
=
=
=
=
=

1
2
2
1
1
1
1
2
1
1
1
2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Q(n2 log n) time in the worst-case
• Q(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in - degree ( u )  k ] 

a  2.1

1

k

a

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

Slide 164

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f), which is at least 10^4 * f/(1+f).
If we fetch B ≈ 4Kb in time C, and the algorithm uses all of it:
(1/B) * (p * f/(1+f) * C) ≈ 30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
[figure: magnetic disk — platter surface, track, read/write head and arm]

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
[figure: memory hierarchy — CPU registers, L1/L2 cache (few MBs, some nanosecs, few words fetched), RAM (few GBs, tens of nanosecs, some words fetched), disk (few TBs, few millisecs, B = 32K page), network (many TBs, even secs, packets)]

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
Running time as a function of the input size n:

n:      4K    8K    16K   32K    128K   256K   512K   1M
n^3:    22s   3m    26m   3.5h   28h    --     --     --
n^2:    0     0     0     1s     26s    106s   7m     28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm
  sum = 0; max = -1;
  For i = 1,...,n do
      if (sum + A[i] ≤ 0) then sum = 0;
      else { sum += A[i]; max = MAX{max, sum}; }

Note:
• sum < 0 just before OPT starts;
• sum > 0 within OPT
(A runnable sketch of this scan follows.)
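A minimal runnable sketch of this scan, in Python (function and variable names are mine):

def max_subarray(A):
    """Single scan, as in the slide: reset the running sum when it would
    drop to <= 0, and keep the best sum seen so far."""
    best = float("-inf")   # value of the optimal subarray
    curr = 0               # sum of the current candidate window
    for x in A:
        if curr + x <= 0:
            curr = 0       # the optimum cannot start inside a <= 0 prefix
        else:
            curr += x
            best = max(best, curr)
    return best

A = [2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]
print(max_subarray(A))     # 12, achieved by the subarray 6 1 -2 4 3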

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 10^9 random I/Os = 10^9 * 5ms ≈ 2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01  if (i < j) then
02      m = (i+j)/2;              Divide
03      Merge-Sort(A,i,m);        Conquer
04      Merge-Sort(A,m+1,j);
05      Merge(A,i,m,j)            Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word frequencies:
  n = 10^9 tuples ⇒ a few GBs
  Typical disk (Seagate Cheetah 150GB): seek time ~5ms

Analysis of mergesort on disk:
  It is an indirect sort: Θ(n log_2 n) random I/Os
  [5ms] * n log_2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
[figure: merge-sort recursion tree over the input keys — figure residue]
How do we deploy the disk/memory features?
N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.
Pass i: merge X = M/B runs at a time ⇒ log_{M/B}(N/M) passes

[figure: X = M/B input buffers and one output buffer, each holding B items, in main memory; runs are streamed from disk and the merged run is written back to disk]

Multiway Merging
[figure: X = M/B input buffers Bf_1..Bf_X with pointers p_1..p_X (one per run), plus an output buffer Bf_o with pointer p_o; repeatedly move min(Bf_1[p_1], Bf_2[p_2], …, Bf_X[p_X]) to Bf_o; fetch a new page of run i when p_i = B, flush Bf_o to the merged output file when it is full, until EOF]

Cost of Multi-way Merge-Sort
  Number of passes = log_{M/B} #runs ≈ log_{M/B} (N/M)
  Optimal cost = Θ((N/B) log_{M/B} (N/M)) I/Os

In practice
  M/B ≈ 1000 ⇒ #passes = log_{M/B} (N/M) ≈ 1
  One multiway merge ⇒ 2 passes = few mins
  (tuning depends on disk features)

  Large fan-out (M/B) decreases #passes
  Compression would decrease the cost of a pass!
(A toy external-memory sketch follows.)

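A toy Python sketch of multi-way merge-sort on disk, only to make the two passes concrete: runs of run_size items play the role of memory-sized runs, and fan_in plays the role of M/B. File handling and names are illustrative assumptions, not a library used in the course.

import heapq, itertools, tempfile, os

def external_sort(items, run_size, fan_in):
    """Pass 1: write N/M sorted runs to disk. Pass i: merge fan_in runs at a
    time until a single run is left (log_{fan_in}(#runs) passes)."""
    def write_run(seq):
        f = tempfile.NamedTemporaryFile("w", delete=False, suffix=".run")
        for x in seq:
            f.write(f"{x}\n")
        f.close()
        return f.name

    def read_run(path):
        with open(path) as f:
            for line in f:
                yield int(line)

    # Pass 1: produce sorted runs, each sorted in internal memory
    it, runs = iter(items), []
    while True:
        chunk = list(itertools.islice(it, run_size))
        if not chunk:
            break
        runs.append(write_run(sorted(chunk)))

    # Pass i: multiway merge of fan_in runs at a time
    while len(runs) > 1:
        next_runs = []
        for i in range(0, len(runs), fan_in):
            group = runs[i:i + fan_in]
            merged = heapq.merge(*(read_run(p) for p in group))
            next_runs.append(write_run(merged))
            for p in group:
                os.remove(p)
        runs = next_runs
    return list(read_run(runs[0]))

print(external_sort([5, 1, 9, 3, 7, 2, 8, 6, 4, 0], run_size=3, fan_in=2))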
Can compression help?
  Goal: enlarge M and reduce N
  #passes = O(log_{M/B} (N/M))
  Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (the symbol set S is large).
Math Problem: Find the item y whose frequency is > N/2, using the smallest space (i.e., assuming the mode occurs > N/2 times).

A=b a c c c d c b a a a c c b c c c
.

Algorithm
  Use a pair of variables ⟨X, C⟩, initialized on the first item (X = A[1], C = 1).
  For each subsequent item s of the stream:
      if (X == s) then C++
      else { C--; if (C == 0) { X = s; C = 1; } }
  Return X;
(A runnable sketch follows the proof below.)

Proof (sketch)
If the algorithm ended with X ≠ y, then every one of y’s occurrences would have a “negative” mate (a distinct item that cancelled it on the counter). Hence the mates would number ≥ #occ(y), so N ≥ 2 · #occ(y) — contradicting #occ(y) > N/2.
(The method gives no guarantee if no item occurs more than N/2 times.)
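A runnable sketch of the majority scan (this is the classic formulation that adopts a new candidate when the counter reaches zero; the answer is meaningful only when the mode really occurs > N/2 times):

def majority_candidate(stream):
    """One pass, one <X, C> pair: returns the only possible item with
    frequency > N/2."""
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1
        elif X == s:
            C += 1
        else:
            C -= 1
    return X

A = "b a c c c d c b a a a c c b c c c".split()
print(majority_candidate(A))   # 'c' (9 occurrences out of 17)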

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 10^9 chars, size = 6Gb
n = 10^6 documents
TotT = 10^9 term occurrences (avg term length is 6 chars)
t = 5 * 10^5 distinct terms

What kind of data structure should we build to support word-based searches?

Solution 1: Term-Doc matrix
n = 1 million documents (columns), t = 500K terms (rows); entry = 1 if the play contains the word, 0 otherwise:

             Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony              1                1             0          0        0        1
Brutus              1                1             0          1        0        0
Caesar              1                1             0          1        1        1
Calpurnia           0                1             0          0        0        0
Cleopatra           1                0             0          0        0        0
mercy               1                0             1          1        1        1
worser              1                0             1          1        1        0

Space is 500Gb !

Solution 2: Inverted index
Brutus    → 2 4 8 16 32 64 128
Calpurnia → 2 13 16
Caesar    → 1 2 3 5 8 13 21 34

We can still do better: i.e. 30÷50% of the original text

1. Typically use about 12 bytes per posting
2. We have 10^9 total terms ⇒ at least 12Gb space
3. Compressing the 6Gb documents gets 1.5Gb data
Better index, but it is still >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string is (losslessly) compressed, some other must expand.
Take all messages of length n. Is it possible to compress ALL of them into fewer bits?
NO: they are 2^n, but the compressed messages shorter than n bits are fewer:

∑_{i=1}^{n-1} 2^i = 2^n − 2

We need to talk about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i(s) = log_2 (1/p(s)) = − log_2 p(s)

Lower probability ⇒ higher information

Entropy is the weighted average of i(s):

H(S) = ∑_{s∈S} p(s) · log_2 (1/p(s))   bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
L_a(C) = ∑_{s∈S} p(s) · L[s]

We say that a prefix code C is optimal if for all prefix codes C’, L_a(C) ≤ L_a(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal uniquely decodable code, there exists a prefix code with the same codeword lengths, and thus the same optimal average length. And vice versa…

Theorem (golden rule). If C is an optimal prefix code for a source with probabilities {p1, …, pn}, then p_i < p_j ⇒ L[s_i] ≥ L[s_j]

Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any uniquely decodable code C, we have
    H(S) ≤ L_a(C)
Theorem (upper bound, Shannon). For any probability distribution, there exists a prefix code C such that
    L_a(C) ≤ H(S) + 1
(the Shannon code, which assigns ⌈log_2 1/p(s)⌉ bits to symbol s)

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

[figure: Huffman tree — merge a(.1) and b(.2) into (.3); merge (.3) and c(.2) into (.5); merge (.5) and d(.5) into (1)]

a = 000, b = 001, c = 01, d = 1
There are 2^(n−1) “equivalent” Huffman trees

What about ties (and thus, tree depth)?
(A heap-based construction sketch follows.)
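A small heap-based construction of the codewords, as a sketch (names are mine; only the codeword lengths are guaranteed, the actual bits depend on tie-breaking):

import heapq
from itertools import count

def huffman_codes(probs):
    """Repeatedly merge the two least-probable trees (heapq gives the minima)."""
    tick = count()                       # tie-breaker, avoids comparing trees
    heap = [(p, next(tick), sym) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, left  = heapq.heappop(heap)
        p2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(tick), (left, right)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):      # internal node
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                            # leaf: a symbol
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes

print(huffman_codes({"a": .1, "b": .2, "c": .2, "d": .5}))
# codeword lengths match the slide: |d| = 1, |c| = 2, |a| = |b| = 3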

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
Encoding:  abc… → 000 001 01 … = 00000101…
Decoding:  101001… → d c b …
[figure: the Huffman tree of the running example]

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree
We store, for every level L:
  firstcode[L]   (the smallest codeword on level L; = 00…0 on the deepest level)
  Symbol[L,i], for each i in level L
This takes ≤ h^2 + |S| log |S| bits

Canonical Huffman — Encoding
[figure: canonical codeword tree with levels 1–5]

Canonical Huffman — Decoding
firstcode[1]=2, firstcode[2]=1, firstcode[3]=1, firstcode[4]=2, firstcode[5]=0

T = ...00010...
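A sketch of the standard canonical decoding loop that such firstcode[] values support (the per-level table symbol[L][...] is assumed; this is one common scheme, not necessarily the exact one on the slide):

def canonical_decode(bits, firstcode, symbol):
    """Grow the value v one bit at a time; stop at the first level L
    where v >= firstcode[L] and index the per-level symbol table."""
    v, L = next(bits), 1
    while v < firstcode[L]:
        v = 2 * v + next(bits)
        L += 1
    return symbol[L][v - firstcode[L]]

# with the table above, firstcode = {1: 2, 2: 1, 3: 1, 4: 2, 5: 0},
# feeding bits iter([0, 0, 0, 1, 0]) descends to level 5 and returns symbol[5][2]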

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is −log_2(0.999) ≈ 0.00144 bits.

If we were to send 1000 such symbols we might hope to use 1000 · 0.00144 ≈ 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
[figure: “huffman” vs “tagging” — codewords are sequences of 7-bit configurations and the first bit of every byte is used as a tag; the word-based Huffman tree for T = “bzip or not bzip” (symbols: bzip, or, not, space) and the byte-aligned, tagged compressed text C(T)]

CGrep and other ideas...
P = bzip = 1a 0b
[figure: compressed pattern matching (GREP directly on C(T)): the tagged codeword of “bzip” is searched in C(T), T = “bzip or not bzip”, and each candidate byte-aligned position is marked yes/no]
Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary = {bzip, not, or, space};  P = bzip = 1a 0b
[figure: the compressed text C(S) of S = “bzip or not bzip”, and the scan of C(S) for P’s tagged codeword, marking candidate positions yes/no]
Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which arithmetic and bit operations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet (i.e., T ∈ {0,1}^n).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H(s) = ∑_{i=1}^{m} 2^{m−i} · s[i]

P = 0101
H(P) = 2^3·0 + 2^2·1 + 2^1·0 + 2^0·1 = 5

s = s’ if and only if H(s) = H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H(T_r) = 2·H(T_{r−1}) − 2^m · T(r−1) + T(r+m−1)

T = 10110101
T_1 = 1011,  T_2 = 0110
H(T_1) = H(1011) = 11
H(T_2) = H(0110) = 2·11 − 2^4·1 + 0 = 22 − 16 + 0 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally (reduce mod 7 after each step):
1·2 (mod 7) + 0 = 2
2·2 (mod 7) + 1 = 5
5·2 (mod 7) + 1 = 4
4·2 (mod 7) + 1 = 2
2·2 (mod 7) + 1 = 5
5 (mod 7) = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(Tr−1), since
2^m (mod q) = 2·(2^(m−1) (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
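A Python sketch of the Karp-Rabin scan (the prime is just sampled from a small hard-wired list here, instead of being drawn at random below a threshold I as in the algorithm above; candidate positions are verified, so no false match is reported):

import random

def karp_rabin_matches(T, P, q=None):
    """Compare Hq(P) against the rolling Hq(Tr) and verify the hits."""
    n, m = len(T), len(P)
    if q is None:
        q = random.choice([10**9 + 7, 10**9 + 9, 998244353])   # sample primes
    radix = 2                        # binary alphabet, as in the slides
    hP = ht = 0
    for i in range(m):               # Horner, reducing mod q at every step
        hP = (hP * radix + int(P[i])) % q
        ht = (ht * radix + int(T[i])) % q
    top = pow(radix, m - 1, q)       # 2^(m-1) mod q, used to drop T[r]
    out = []
    for r in range(n - m + 1):
        if hP == ht and T[r:r + m] == P:     # verify: avoids false matches
            out.append(r)
        if r + m < n:                # slide the window by one position
            ht = ((ht - int(T[r]) * top) * radix + int(T[r + m])) % q
    return out

print(karp_rabin_matches("10110101", "0101"))   # [4] (position 5 in 1-based numbering)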

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

[figure: the 3 × 10 matrix M for P = for and T = california; the only 1-entries are M(1,5), M(2,6), M(3,7), i.e. f, fo, for ending at positions 5, 6, 7]

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size (e.g., 32 or 64 bits). We’ll assume m ≤ w: NOTICE, any column of M then fits in a memory word.

How to construct M






We want to exploit bit-parallelism to compute the j-th column of M from the (j−1)-th one.
We define the m-length binary vector U(x) for each character x in the alphabet: U(x) is set to 1 for the positions in P where character x appears.
Example: P = abaac
U(a) = (1,0,1,1,0)ᵀ    U(b) = (0,1,0,0,0)ᵀ    U(c) = (0,0,0,0,1)ᵀ

How to construct M



Initialize column 0 of M to all zeros.
For j > 0, the j-th column is obtained as

M(j) = BitShift( M(j−1) ) & U( T[j] )

For i > 1, entry M(i,j) = 1 iff
(1) the first i−1 characters of P match the i−1 characters of T ending at position j−1  ⇔  M(i−1, j−1) = 1
(2) P[i] = T[j]  ⇔  the i-th bit of U(T[j]) = 1

BitShift moves bit M(i−1, j−1) into the i-th position; ANDing with the i-th bit of U(T[j]) establishes whether both conditions hold.
(A runnable sketch appears after the worked examples below.)

An example (P = abaac, T = xabxabaaca)
[worked examples for j = 1, 2, 3 and j = 9 — matrix residue: at each step the j-th column of M is obtained as BitShift(M(j−1)) & U(T[j]); at j = 9 the 5-th bit of the column becomes 1, i.e. P occurs ending at position 9]
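A bit-parallel sketch of the whole scan, keeping each column of M as a Python integer (names are mine):

def shift_and(T, P):
    """M(j) = BitShift(M(j-1)) & U(T[j]); bit i-1 set iff P[1..i] matches
    T ending at position j."""
    m = len(P)
    U = {}                               # U[c]: positions of c in P, as bits
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    col, hit, occ = 0, 1 << (m - 1), []
    for j, c in enumerate(T):
        # BitShift: shift down by one and set the first bit to 1
        col = ((col << 1) | 1) & U.get(c, 0)
        if col & hit:
            occ.append(j - m + 1)        # 0-based starting position
    return occ

print(shift_and("xabxabaaca", "abaac"))  # [4], i.e. positions 5..9 in 1-based terms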

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
U(a) = (1,0,1,1,0)ᵀ    U(b) = (1,1,0,0,0)ᵀ    U(c) = (0,0,0,0,1)ᵀ

What about ‘?’ and ‘[^…]’ (negated classes)?

Problem 1: Another solution
Dictionary = {bzip, not, or, space};  P = bzip = 1a 0b
[figure: as before — P’s tagged codeword is searched directly in C(S), S = “bzip or not bzip”]
Speed ≈ Compression ratio

Problem 2
Dictionary = {bzip, not, or, space}
Given a pattern P, find all the occurrences in S of all terms containing P as a substring.
P = o
[figure: S = “bzip or not bzip” and C(S); the dictionary terms containing P are “or” = 1g 0a 0b and “not” = 1g 0g 0a, and C(S) is scanned for each of them]

Speed ≈ Compression ratio? No! Why? A scan of C(S) for each term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

S is the concatenation of the patterns in P.
R is a bitmap of length m: R[i] = 1 iff S[i] is the first symbol of a pattern.

Use a variant of the Shift-And method searching for S:
  For any symbol c, U’(c) = U(c) AND R
     U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
  For any step j,
     compute M(j)
     then OR it with U’(T[j]). Why?
        It sets to 1 the first bit of each pattern that starts with T[j]
     Check if there are occurrences ending in j. How?

Problem 3
Dictionary = {bzip, not, or, space}
Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing at most k mismatches.
P = bot, k = 2
[figure: S = “bzip or not bzip” and its compressed version C(S), as before]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

M^l(i,j) = 1 iff
there are no more than l mismatches between the first i characters of P and the i characters of T ending at character j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

[figure: P[1..i−1] aligned against T ending at j−1 with ≤ l mismatches, followed by P[i] = T[j]]

BitShift( M^l(j−1) ) & U( T[j] )

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

[figure: P[1..i−1] aligned against T ending at j−1 with ≤ l−1 mismatches; position i may then mismatch T[j]]

BitShift( M^{l−1}(j−1) )

Computing M^l

We compute M^l for all l = 0, …, k; for each j we compute M^0(j), M^1(j), …, M^k(j).
For all l, initialize M^l(0) to the zero vector.
In order to compute M^l(j), we observe that there is a match iff case 1 or case 2 above holds, i.e.

M^l(j) = [ BitShift( M^l(j−1) ) & U( T[j] ) ]  OR  BitShift( M^{l−1}(j−1) )

(A runnable sketch follows.)
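A sketch of the k-mismatch variant, keeping one integer column per matrix M^0,…,M^k and applying the recurrence above (names are mine):

def shift_and_mismatches(T, P, k):
    """M[l] holds the current column of M^l as an integer."""
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    hit = 1 << (m - 1)
    M = [0] * (k + 1)
    occ = []
    for j, c in enumerate(T):
        prev = 0                         # BitShift(M^(l-1)(j-1)); 0 for l = 0
        for l in range(k + 1):
            shifted = (M[l] << 1) | 1    # BitShift of the old column M^l(j-1)
            M[l] = (shifted & U.get(c, 0)) | prev
            prev = shifted               # reused as case 2 for level l+1
        if M[k] & hit:
            occ.append(j - m + 1)
    return occ

print(shift_and_mismatches("aatatccacaa", "atcgaa", 2))
# [3] (0-based) = position 4 in the slide's 1-based numbering, with 2 mismatches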

Example (P = abaad, T = xabxabaaca, k = 1)
[figure: the matrices M^0 and M^1 — matrix residue; in M^1, row 5 has a 1 in column 9: P = abaad occurs at positions 5..9 of T with one mismatch]

How much do we pay?
The running time is O(k·n·(1 + m/w)).
Again, the method is practically efficient for small m.
Only O(k) columns (one per matrix M^0,…,M^k) are needed at any given time; hence the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary = {bzip, not, or, space}
Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing k mismatches.
P = bot, k = 2
[figure: S = “bzip or not bzip” and C(S); the only dictionary term within 2 mismatches of “bot” is “not” = 1g 0g 0a, whose tagged codeword is then searched in C(S)]

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is d(p,s) = the minimum number of operations needed to transform p into s via three ops:
  Insertion: insert a symbol in p
  Deletion: delete a symbol from p
  Substitution: change a symbol in p with a different one

Example: d(ananas, banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding
γ(x) = (Length−1) zeros, followed by x in binary
with x > 0 and Length = ⌊log_2 x⌋ + 1
e.g., 9 is represented as <000, 1001>.

The γ-code for x takes 2⌊log_2 x⌋ + 1 bits (i.e., a factor of 2 from optimal)

Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers

It is a prefix-free encoding…
Given the following sequence of γ-coded integers, reconstruct the original sequence:
0001000001100110000011101100111
→ 8, 6, 3, 59, 7
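A small encoder/decoder sketch for the γ-code, which also reproduces the exercise above:

def gamma_encode(x):
    """γ(x): (len-1) zeros followed by x in binary, x > 0."""
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode_all(bits):
    """Decode a concatenation of γ-codes."""
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":          # count leading zeros = len - 1
            z += 1
            i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out

print("".join(gamma_encode(x) for x in (8, 6, 3, 59, 7)))
# 0001000001100110000011101100111
print(gamma_decode_all("0001000001100110000011101100111"))   # [8, 6, 3, 59, 7]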

Analysis
Sort the p_i in decreasing order, and encode s_i via the variable-length code γ(i).

Recall that |γ(i)| ≤ 2·log i + 1.
How good is this approach w.r.t. Huffman? Compression ratio ≤ 2·H_0(S) + 1.
Key fact:  1 ≥ ∑_{j=1,...,i} p_j ≥ i · p_i   ⇒   i ≤ 1/p_i

How good is it?
The cost of the encoding is (recall i ≤ 1/p_i):
∑_{i=1,...,|S|} p_i · |γ(i)|  ≤  ∑_{i=1,...,|S|} p_i · [ 2·log(1/p_i) + 1 ]  =  2·H_0(X) + 1

Not much worse than Huffman, and improvable to H_0(X) + 2 + …

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory.
Properties:
  Exploits temporal locality, and it is dynamic
  X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n^2 log n), MTF = O(n log n) + n^2

Not much worse than Huffman ...but it may be far better.
(A small sketch follows.)
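A list-based MTF sketch (positions are emitted 1-based here; the linear list makes each step O(|alphabet|), whereas the tree/hash organization discussed later gives O(log |S|)):

def mtf_encode(text, alphabet):
    """Output the position of each symbol, then move it to the front."""
    L, out = list(alphabet), []
    for s in text:
        p = L.index(s)
        out.append(p + 1)
        L.pop(p)
        L.insert(0, s)
    return out

def mtf_decode(codes, alphabet):
    L, out = list(alphabet), []
    for p in codes:
        s = L.pop(p - 1)
        out.append(s)
        L.insert(0, s)
    return "".join(out)

codes = mtf_encode("aaabbbbccc", "abc")
print(codes)                        # [1, 1, 1, 2, 1, 1, 1, 3, 1, 1]
print(mtf_decode(codes, "abc"))     # aaabbbbccc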

MTF: how good is it?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1.
Put S in front (as the initial list) and consider the cost of encoding:

O(|S| log |S|) + ∑_{x=1}^{|S|} ∑_{i=2}^{n_x} |γ( p_i^x − p_{i−1}^x )|

where p_1^x < p_2^x < … are the positions of the n_x occurrences of symbol x. By Jensen’s inequality this is

≤ O(|S| log |S|) + ∑_{x=1}^{|S|} n_x · [ 2·log(N/n_x) + 1 ]  =  O(|S| log |S|) + N·[ 2·H_0(X) + 1 ]

Hence L_a[mtf] ≤ 2·H_0(X) + O(1).

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploits spatial locality, and it is a dynamic code (there is a memory).
X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = Θ(n^2 log n)  >  Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g. p(a) = .2, p(b) = .5, p(c) = .3

f(i) = ∑_{j=1}^{i−1} p(j)

f(a) = .0, f(b) = .2, f(c) = .7

[figure: the unit interval split as a = [0,.2), b = [.2,.7), c = [.7,1.0)]

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
[figure: start from [0,1); after b the interval is [.2,.7); after a it is [.2,.3); after c it is [.27,.3)]

The final sequence interval is [.27, .3)

Arithmetic Coding
To code a sequence of symbols c_1 c_2 … c_n with probabilities p[c], use the following:

l_0 = 0      l_i = l_{i−1} + s_{i−1} · f[c_i]
s_0 = 1      s_i = s_{i−1} · p[c_i]

f[c] is the cumulative probability up to symbol c (not included).
The final interval size is

s_n = ∏_{i=1}^{n} p[c_i]

The interval for a message sequence will be called the sequence interval.
(A sketch of this computation follows.)
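A sketch that just computes the sequence interval with the recurrences above (symbol order, and hence f[c], is fixed alphabetically here; no bit output is produced):

def sequence_interval(msg, p):
    """Return [l, l+s) for the message, using l_i = l_{i-1} + s_{i-1}*f[c_i],
    s_i = s_{i-1}*p[c_i]."""
    syms = sorted(p)
    f, acc = {}, 0.0
    for c in syms:                 # cumulative probability up to c (excluded)
        f[c] = acc
        acc += p[c]
    l, s = 0.0, 1.0
    for c in msg:
        l, s = l + s * f[c], s * p[c]
    return l, l + s

p = {"a": 0.2, "b": 0.5, "c": 0.3}
lo, hi = sequence_interval("bac", p)
print(round(lo, 4), round(hi, 4))  # 0.27 0.3, as in the encoding example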

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:

[figure: .49 ∈ [.2,.7) ⇒ b; within it, .49 ∈ [.3,.55) ⇒ b; within that, .49 ∈ [.475,.55) ⇒ c]

The message is bbc.

Representing a real number
Binary fractional representation:
.75 = .11
1/3 = .0101…
11/16 = .1011

Algorithm
1.  x = 2·x
2.  if x < 1 output 0
3.  else x = x − 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
e.g. [0,.33) = .01    [.33,.66) = .1    [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
         min      max      interval
.11      .110…    .111…    [.75, 1.0)
.101     .1010…   .1011…   [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most
1 + ⌈log (1/s)⌉ = 1 + ⌈log ∏_{i=1,n} (1/p_i)⌉
          ≤ 2 + ∑_{i=1,n} log (1/p_i)
          = 2 + ∑_{k=1,|S|} n·p_k · log (1/p_k)
          = 2 + n·H_0   bits

In practice it takes about nH_0 + 0.02·n bits, because of rounding.

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in the range [0..R), where R = 2^k
Use rounding to generate the integer interval
Whenever the sequence interval falls into the top, bottom or middle half, expand the interval by a factor of 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 (top half): output 1 followed by m 0s; m = 0; the message interval is expanded by 2
If u < R/2 (bottom half): output 0 followed by m 1s; m = 0; the message interval is expanded by 2
If l ≥ R/4 and u < 3R/4 (middle half): increment m; the message interval is expanded by 2
In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine:
[figure: the ATB maps state (L,s) and the next symbol c, with distribution (p_1,…,p_|S|), to the new state (L’,s’) = (L + s·f(c), s·p(c))]

Therefore, even the distribution can change over time.

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts    (String = ACCBACCACBA B,  k = 2)

Context Empty:   A = 4   B = 2   C = 5   $ = 3

Context A:   C = 3   $ = 1
Context B:   A = 2   $ = 1
Context C:   A = 1   B = 2   C = 2   $ = 3

Context AC:  B = 1   C = 2   $ = 2
Context BA:  C = 1   $ = 1
Context CA:  C = 1   $ = 1
Context CB:  A = 2   $ = 1
Context CC:  A = 1   B = 1   $ = 2
You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
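A decoder sketch for (d, len, c) triples, including the overlapping-copy case (byte-by-byte copy from the cursor, as in the pseudocode above):

def lz77_decode(triples):
    """Decode (distance, length, next-char) triples; copying works even
    when length > distance (the copy overlaps the text being produced)."""
    out = []
    for d, length, c in triples:
        start = len(out) - d
        for i in range(length):
            out.append(out[start + i])
        out.append(c)
    return "".join(out)

# the windowed example above:
print(lz77_decode([(0, 0, "a"), (1, 1, "c"), (3, 4, "b"), (3, 3, "a"), (1, 2, "c")]))
# aacaacabcabaaac
# the overlapping case: seen = abcd, next codeword (2, 9, e)
print(lz77_decode([(0, 0, "a"), (0, 0, "b"), (0, 0, "c"), (0, 0, "d"), (2, 9, "e")]))
# abcdcdcdcdcdce... wait, the 4 leading literals here are only a stand-in for "abcd"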

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later
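An LZW sketch; unlike the slides, the dictionary is initialized with just the characters occurring in the input rather than the 256 ASCII codes, so the ids differ from 112/113/… but the behaviour (including the “one step behind” special case) is the same:

def lzw_encode(s):
    dictionary = {c: i for i, c in enumerate(sorted(set(s)))}
    nxt, out, w = len(dictionary), [], ""
    for c in s:
        if w + c in dictionary:
            w += c
        else:
            out.append(dictionary[w])
            dictionary[w + c] = nxt        # add Sc to the dictionary
            nxt += 1
            w = c
    out.append(dictionary[w])
    return out, {i: c for c, i in dictionary.items() if len(c) == 1}

def lzw_decode(codes, base):
    """One step behind the coder: an unknown id means a string of the
    form ScS..., handled specially."""
    inv, nxt = dict(base), len(base)
    w = inv[codes[0]]
    out = [w]
    for k in codes[1:]:
        entry = inv[k] if k in inv else w + w[0]   # the special case
        out.append(entry)
        inv[nxt] = w + entry[0]
        nxt += 1
        w = entry
    return "".join(out)

codes, base = lzw_encode("aabaacababacb")
print(codes)                     # [0, 0, 1, 3, 2, 4, 8, 2, 1] -- cf. 112,112,113,256,...
print(lzw_decode(codes, base))   # aabaacababacb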

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
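A toy construction and inversion of the BWT (rotation sorting is only meant for small examples; real tools derive L from the Suffix Array as L[i] = T[SA[i]−1], as shown in the next slide):

def bwt(T):
    """Naive BWT: sort all rotations and take the last column."""
    rot = sorted(T[i:] + T[:i] for i in range(len(T)))
    return "".join(r[-1] for r in rot)

def ibwt(L):
    """Invert via the LF mapping: the k-th occurrence of a char in L is the
    k-th occurrence of that char in F (= sorted L)."""
    n = len(L)
    order = sorted(range(n), key=lambda i: (L[i], i))   # F position -> L position
    LF = [0] * n
    for f_pos, l_pos in enumerate(order):
        LF[l_pos] = f_pos
    out, r = [], 0                    # row 0 of the sorted matrix starts with '#'
    for _ in range(n):
        out.append(L[r])              # L[r] precedes F[r] in T
        r = LF[r]
    t = "".join(reversed(out))        # T up to a rotation: the sentinel comes first
    i = t.index("#")
    return t[i + 1:] + t[:i + 1]      # rotate the sentinel back to the end

L = bwt("mississippi#")
print(L)            # ipssm#pissii
print(ibwt(L))      # mississippi#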

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n^2 log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pr that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in-degree(u) = k ]  ∝  1 / k^α ,   α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pr that a node has x links is 1/x^α, α ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:
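A sketch of the gap encoding of a successor list (the node id and neighbour list below are made-up; negative first gaps are kept as signed integers here, while WebGraph would further map them to naturals before coding):

def encode_successors(x, succ):
    """First gap is s1 - x (may be negative), then s_i - s_{i-1} - 1."""
    gaps, prev = [], None
    for i, s in enumerate(sorted(succ)):
        gaps.append(s - x if i == 0 else s - prev - 1)
        prev = s
    return gaps

def decode_successors(x, gaps):
    succ, prev = [], None
    for i, g in enumerate(gaps):
        s = x + g if i == 0 else prev + g + 1
        succ.append(s)
        prev = s
    return succ

# a node and its out-neighbours (locality: most successors are near x)
print(encode_successors(15, [13, 15, 16, 17, 18, 19, 23, 24, 203]))
# [-2, 1, 0, 0, 0, 0, 3, 0, 178]
print(decode_successors(15, [-2, 1, 0, 0, 0, 0, 3, 0, 178]))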

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77-scheme provides an efficient, optimal solution:

f_known is the “previously encoded text”; compress the concatenation f_known·f_new, starting from f_new.

zdelta is one of the best implementations
           Emacs size   Emacs time
uncompr    27Mb         ---
gzip       8Mb          35 secs
zdelta     1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

[figure: Client — slow link — Proxy — fast link — Web; both proxies hold the reference page, so only the delta-encoding of the requested page crosses the slow link]

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f ∈ F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[figure: the weighted graph G_F on files {1, 2, 3, 5} plus the dummy node 0; edge weights (e.g. 20, 123, 220, 620, 2000) are zdelta/gzip sizes, and the min branching selects the cheapest reference for each file]

           space   time
uncompr    30Mb    ---
tgz        20%     linear
THIS       8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly — n^2 edge calculations (zdelta executions).
We wish to exploit some pruning approach:

  Collection analysis: cluster the files that appear similar and are thus good candidates for zdelta-compression. Build a sparse weighted graph G’_F containing only the edges between those pairs of files.

  Assign weights: estimate appropriate edge weights for G’_F, thus saving zdelta executions. Nonetheless, strictly n^2 time.
           space   time
uncompr    260Mb   ---
tgz        12%     2 mins
THIS       8%      16 mins

Algoritmi per IR

File Synchronization

File synch: the problem
[figure: the Client (holding f_old) sends a request; the Server (holding f_new) sends back an update]

  the client wants to update an out-dated file
  the server has the new file but does not know the old file
  update without sending the entire f_new (exploiting similarity)
  rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch, since the server has both copies of the files.

The rsync algorithm
[figure: the Client sends the block hashes of f_old; the Server, holding f_new, sends back an encoded file (block copies + literals)]

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
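A toy sketch of the rsync idea — weak + strong hashes per block of f_old, then a scan of f_new that emits block copies or literals. It is not the real rsync wire format, and the weak checksum here is recomputed at every position instead of being rolled in O(1); block size, hash choices and names are assumptions:

import hashlib

B = 8                              # toy block size (rsync default: max{700, sqrt(n)})

def weak(block):
    """A simple additive checksum standing in for rsync's 4-byte rolling hash."""
    return sum(block) % (1 << 16)

def client_hashes(f_old):
    return {(weak(f_old[i:i+B]), hashlib.md5(f_old[i:i+B]).hexdigest()): i
            for i in range(0, len(f_old) - B + 1, B)}

def server_encode(f_new, hashes):
    """Scan f_new; on a (weak, strong) hit emit a copy of an f_old block,
    otherwise emit a literal byte and slide by one."""
    out, i = [], 0
    while i + B <= len(f_new):
        key = (weak(f_new[i:i+B]), hashlib.md5(f_new[i:i+B]).hexdigest())
        if key in hashes:
            out.append(("copy", hashes[key])); i += B
        else:
            out.append(("lit", f_new[i:i+1])); i += 1
    out.append(("lit", f_new[i:]))
    return out

f_old = b"the quick brown fox jumps over the lazy dog"
f_new = b"the quick brown cat jumps over the lazy dog!"
ops = server_encode(f_new, client_hashes(f_old))
print(sum(1 for op, _ in ops if op == "copy"), "blocks copied")
# 4 blocks copied: only the changed region and the tail travel as literals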

Rsync: some experiments

          gcc size   emacs size
total     27288      27326
gzip       7563       8577
zdelta      227       1431
rsync       964       4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol
  k blocks of n/k elems
  log(n/k) levels
If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k · lg n · lg(n/k)) bits.

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
#

0

s

i

1
ssi

ppi#

4

#

1

p

i

3

2

1

ppi#

6

pi#

7

8
5

T# = mississippi#
2 4 6 8 10

si

ppi#

i#

ppi#

11

mississippi#

12

2

1 10

9

4

3

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
5

Q(N2) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Q(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does it exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does it exist a substring of length ≥ L occurring ≥ C times ?
• Search for Lcp[i,i+C-2] whose entries are ≥ L


Slide 165

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)

This is at least 10^4 * f/(1+f).

If we fetch B ≈ 4Kb in time C, and the algorithm uses all of it:
(1/B) * (p * f/(1+f) * C) ≈ 30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms

[Figure: the memory hierarchy again - CPU registers, L1/L2 caches (few Mbs, some nanosecs, few words fetched), RAM (few Gbs, tens of nanosecs, some words fetched), disk (few Tbs, few millisecs, B = 32K pages), network (many Tbs, even secs, packets).]

Unknown and/or changing devices
 Block access is important on all levels of the memory hierarchy
 But memory hierarchies are very diverse

Cache-oblivious algorithms:
 Explicitly, algorithms do not assume any model parameters
 Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
Brute-force running times (input size vs. algorithm):

  size   4K    8K    16K   32K    128K   256K   512K   1M
  n^3    22s   3m    26m   3.5h   28h    --     --     --
  n^2    0     0     0     1s     26s    106s   7m     28m

An optimal solution
We assume every subsum ≠ 0.

A = [ prefix with sum < 0 | Optimum (sum > 0) | rest ]

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm (one left-to-right scan):

  sum = 0; max = -1;
  For i = 1,...,n do
    If (sum + A[i] ≤ 0) then sum = 0;
    else { sum += A[i]; max = MAX(max, sum); }

Note:
• sum < 0 right before OPT starts;
• sum > 0 within OPT
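As a sanity check, here is a minimal Python sketch of the one-pass scan above (variable names are mine, not from the slides):

  def max_subarray_sum(A):
      """One-pass scan: reset the running sum when it can no longer
      help a future window (as in the slide's algorithm)."""
      best = float("-inf")   # best sum seen so far
      run = 0                # sum of the current candidate window
      for x in A:
          if run + x <= 0:
              run = 0        # the window ending here is useless, restart
          else:
              run += x
              best = max(best, run)
      return best

  print(max_subarray_sum([2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]))  # 12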

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort  ⇒  Θ(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 10^9 random I/Os = 10^9 * 5ms ≈ 2 months

Binary Merge-Sort

Merge-Sort(A,i,j)
  if (i < j) then
    m = (i+j)/2;            // Divide
    Merge-Sort(A,i,m);      // Conquer
    Merge-Sort(A,m+1,j);    // Conquer
    Merge(A,i,m,j)          // Combine

Cost of Mergesort on large data

 Take Wikipedia in Italian, compute word freq:
   n = 10^9 tuples ≈ few Gbs
 Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:
 It is an indirect sort: Θ(n log2 n) random I/Os
   [5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching... 2 passes (R/W)

Merge-Sort Recursion Tree

[Figure: the recursion tree of binary mergesort over log2 N levels, showing sorted runs being pairwise merged.]

If the run-size is larger than B (i.e. after the first step!!),
fetching all of it in memory for merging does not help.

How do we deploy the disk/memory features?

With internal memory M: N/M runs, each sorted in internal memory (no I/Os)
 I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort

 The key is to balance run-size and #runs to merge
 Sort N items with main memory M and disk pages of B items:
   Pass 1: produce N/M sorted runs.
   Pass i: merge X ≤ M/B runs at a time  ⇒  log_{M/B}(N/M) merge passes

[Figure: X input buffers (one per run) and one output buffer, each of B items, kept in main memory; runs are read from disk and the merged run is written back to disk.]

Multiway Merging
[Figure: at each step output min(Bf1[p1], Bf2[p2], …, BfX[pX]); fetch a new page of run i when pi = B; flush the output buffer Bfo when it is full; stop at EOF.]

Cost of Multi-way Merge-Sort

 Number of passes = log_{M/B} #runs ≈ log_{M/B} (N/M)
 Optimal cost = Θ((N/B) log_{M/B} (N/M)) I/Os

In practice
 M/B ≈ 1000  ⇒  #passes = log_{M/B} (N/M) ≈ 1
 One multiway merge  ⇒  2 passes = few mins
 (Tuning depends on disk features)

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!
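A minimal Python sketch of the merge step (pass i) using a heap over the X run heads; file names and the run format (one integer per line) are made up for illustration:

  import heapq

  def multiway_merge(run_paths, out_path):
      """Merge X sorted runs (one integer per line) into one sorted run.
      Buffered file I/O plays the role of the B-sized memory buffers."""
      files = [open(p) for p in run_paths]
      heads = []
      for idx, f in enumerate(files):
          line = f.readline()
          if line:
              heapq.heappush(heads, (int(line), idx))
      with open(out_path, "w") as out:
          while heads:
              val, idx = heapq.heappop(heads)   # min of the X current heads
              out.write(f"{val}\n")
              line = files[idx].readline()      # refill from the same run
              if line:
                  heapq.heappush(heads, (int(line), idx))
      for f in files:
          f.close()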

Can compression help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements

 Goal: Top queries over a stream of N items (S large).
 Math Problem: Find the item y whose frequency is > N/2,
 using the smallest space (i.e. if the mode occurs > N/2 times).

A = b a c c c d c b a a a c c b c c c

Algorithm

 Use a pair of variables <X,C>: initialize X to the first item and C = 1
 For each subsequent item s of the stream:
   if (X == s) then C++
   else { C--; if (C == 0) { X = s; C = 1; } }
 Return X;
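A runnable Python sketch of the same scan (if the majority is not guaranteed, the returned candidate must still be verified with a second pass):

  def majority_candidate(stream):
      """One-pass, O(1)-space candidate for the > N/2 frequent item."""
      it = iter(stream)
      X = next(it)          # current candidate
      C = 1                 # its credit
      for s in it:
          if s == X:
              C += 1
          else:
              C -= 1
              if C == 0:    # candidate exhausted: adopt the new item
                  X, C = s, 1
      return X

  print(majority_candidate("bacccdcbaaaccbccc"))   # 'c' (9 of 17 items)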

Proof sketch
(The guarantee fails if the mode occurs ≤ N/2 times.)

If the algorithm ended with X ≠ y, then every one of y’s
occurrences would have been cancelled by a distinct “negative” mate,
hence the mates would be ≥ #occ(y) in number.
As a result the stream would contain ≥ 2 * #occ(y) > N items: a contradiction.

Toy problem #4: Indexing

 Consider the following TREC collection:
   N = 6 * 10^9 characters  ⇒  size = 6Gb
   n = 10^6 documents
   TotT = 10^9 total terms (avg term length is 6 chars)
   t = 5 * 10^5 distinct terms

What kind of data structure should we build to support
word-based searches ?

Solution 1: Term-Doc matrix   (t = 500K rows, n = 1 million columns)

             Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
  Antony            1               1              0          0        0        1
  Brutus            1               1              0          1        0        0
  Caesar            1               1              0          1        1        1
  Calpurnia         0               1              0          0        0        0
  Cleopatra         1               0              0          0        0        0
  mercy             1               0              1          1        1        1
  worser            1               0              1          1        1        0

  (1 if the play contains the word, 0 otherwise)

Space is 500Gb !

Solution 2: Inverted index

  Brutus     →  2  4  8  16  32  64  128
  Calpurnia  →  1  2  3  5  8  13  21  34
  Caesar     →  13  16

We can still do better, i.e. 30÷50% of the original text:

1. Typically a posting uses about 12 bytes
2. We have 10^9 total terms  ⇒  at least 12Gb of space
3. Compressing the 6Gb of documents gets 1.5Gb of data
A better index, but it is still >10 times the compressed text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in fewer bits ?
NO: they are 2^n, but the shorter codewords are fewer…

   Σ_{i=1..n-1} 2^i  =  2^n - 2  <  2^n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:

   i(s) = log2 (1/p(s)) = - log2 p(s)

Lower probability  ⇒  higher information

Entropy is the weighted average of i(s):

   H(S) = Σ_{s∈S} p(s) · log2 (1/p(s))   bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g. a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie:

[Figure: binary trie with a at the leaf reached by 0, b by 100, c by 101, d by 11.]

Average Length
For a code C with codeword lengths L[s], the
average length is defined as

   La(C) = Σ_{s∈S} p(s) · L[s]

We say that a prefix code C is optimal if for
all prefix codes C’, La(C) ≤ La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  ⇒  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

   H(S) ≤ La(C)

Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

   La(C) ≤ H(S) + 1

(The Shannon code takes ⌈log2 1/p(s)⌉ bits per symbol.)

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

[Figure: Huffman tree. a(.1) and b(.2) merge into a node of weight (.3); that node and c(.2) merge into (.5); the (.5) node and d(.5) merge into the root (1).]

a=000, b=001, c=01, d=1
There are 2^(n-1) “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at the root and take a branch for
each bit received. When at a leaf, output its
symbol and return to the root.

  abc...      →  000 001 01 ...  (i.e. 00000101...)
  101001...   →  d c b ...

[Figure: the Huffman tree of the running example, used in both directions.]
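A compact Python sketch of Huffman construction (heap-based; ties are broken arbitrarily, which is exactly the "ties" issue raised above):

  import heapq
  from itertools import count

  def huffman_codes(probs):
      """probs: dict symbol -> probability. Returns dict symbol -> bitstring."""
      tiebreak = count()                     # keeps heap entries comparable
      heap = [(p, next(tiebreak), s) for s, p in probs.items()]
      heapq.heapify(heap)
      while len(heap) > 1:                   # repeatedly merge the two lightest trees
          p1, _, t1 = heapq.heappop(heap)
          p2, _, t2 = heapq.heappop(heap)
          heapq.heappush(heap, (p1 + p2, next(tiebreak), (t1, t2)))
      codes = {}
      def walk(node, prefix):
          if isinstance(node, tuple):
              walk(node[0], prefix + "0")
              walk(node[1], prefix + "1")
          else:
              codes[node] = prefix or "0"
      walk(heap[0][2], "")
      return codes

  print(huffman_codes({"a": .1, "b": .2, "c": .2, "d": .5}))
  # codeword lengths match the slide: a,b -> 3 bits, c -> 2 bits, d -> 1 bit (bit labels may differ)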

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

The model size may be large.
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding:
the Canonical Huffman tree.

We store, for every level L:
 firstcode[L]   (at the deepest level it is 00.....0)
 Symbol[L,i], for each i in level L

This is ≤ h^2 + |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...
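A minimal sketch of canonical-Huffman decoding driven by the firstcode[] table (the classical loop from Managing Gigabytes; the firstcode values come from the slide's example, while the symbol table below is hypothetical, only there to make the call runnable):

  def canonical_decode(bits, firstcode, symbol):
      """bits: iterable of 0/1; firstcode[L]: value of the first codeword of length L;
      symbol[L][i]: i-th symbol whose codeword has length L."""
      out = []
      it = iter(bits)
      try:
          while True:
              v = next(it)
              L = 1
              while v < firstcode[L]:      # not yet a valid codeword of length L
                  v = 2 * v + next(it)     # extend by one more bit
                  L += 1
              out.append(symbol[L][v - firstcode[L]])
      except StopIteration:                # input exhausted (any partial codeword is dropped)
          return out

  firstcode = [None, 2, 1, 1, 2, 0]            # 1-based, from the slide
  symbol = {5: ["w", "x", "y", "z"]}           # hypothetical symbols on level 5
  print(canonical_decode([0, 0, 0, 1, 0], firstcode, symbol))   # ['y']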

Problem with Huffman Coding
Consider a symbol with probability .999. Its
self information is

   -log2(.999) ≈ .00144 bits

If we were to send 1000 such symbols we
might hope to use 1000 * .00144 ≈ 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:
 The model takes |S|^k * (k * log |S|) + h^2 bits  (where h might be |S|)
 It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

Compress + Search ?   [Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the Huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged: the first bit of each byte marks whether it starts a codeword, the remaining 7 bits carry the Huffman code

[Figure: the 128-ary Huffman tree over the words of T = “bzip or not bzip”, and the byte-aligned, tagged codewords of C(T) assigned to “bzip”, “or”, “not” and the space.]

CGrep and other ideas...

[Figure: compressed pattern matching. The pattern P = “bzip” is encoded with the same word-based code used for T (its codeword is “1a 0b”); a GREP-like scan over C(T), for T = “bzip or not bzip”, compares byte-aligned codewords directly and answers yes/no at each tagged position.]

Speed ≈ Compression ratio

You find this at
You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Given the dictionary {bzip, not, or, space} and the word-based compressed text C(S) of
S = “bzip or not bzip”, search for the pattern P = bzip, i.e. for its codeword (here “1a 0b”).

[Figure: the tagged codewords of C(S); the scan answers yes at the two positions where the codeword of “bzip” starts, and no elsewhere.]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].

[Figure: text T = ...ABCABDAB... with the pattern P = AB sliding over it.]

 Naïve solution
   For any position i of T, check if T[i,i+m-1] = P[1,m]
   Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
   Knuth-Morris-Pratt
   Boyer-Moore
   Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons

 Strings are also numbers, H: strings → numbers.
 Let s be a string of length m:

   H(s) = Σ_{i=1..m} 2^(m-i) · s[i]

   P = 0101
   H(P) = 2^3·0 + 2^2·1 + 2^1·0 + 2^0·1 = 5

 s = s’ if and only if H(s) = H(s’)

Definition:
let Tr denote the m-length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons

 We can compute H(Tr) from H(Tr-1):

   H(Tr) = 2·H(Tr-1) - 2^m·T[r-1] + T[r+m-1]

T = 10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

   H(T1) = H(1011) = 11
   H(T2) = 2·11 - 2^4·1 + 0 = 22 - 16 + 0 = 6 = H(0110)

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P = 101111,  q = 7

H(P) = 47
Hq(P) = 47 mod 7 = 5

Hq(P) can be computed incrementally (Horner-style, reducing mod 7 at each step):
   (1·2 + 0) mod 7 = 2
   (2·2 + 1) mod 7 = 5
   (5·2 + 1) mod 7 = 4
   (4·2 + 1) mod 7 = 2
   (2·2 + 1) mod 7 = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1), since
   2^m mod q = 2·(2^(m-1) mod q) mod q

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
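A short Python sketch of the matcher just described (binary alphabet, with a fixed modulus q chosen up front; a production version would pick q as a random prime ≤ I, as in the slide, and could skip the verification step to get the purely randomized variant):

  def rabin_karp(T, P, q=2_147_483_647):
      """Report all r such that T[r:r+m] == P, verifying fingerprint hits."""
      n, m = len(T), len(P)
      if m > n:
          return []
      top = pow(2, m - 1, q)                    # 2^(m-1) mod q, used to drop the leading bit
      hp = ht = 0
      for i in range(m):                        # fingerprints of P and of T[0:m]
          hp = (2 * hp + int(P[i])) % q
          ht = (2 * ht + int(T[i])) % q
      hits = []
      for r in range(n - m + 1):
          if hp == ht and T[r:r+m] == P:        # verify to rule out false matches
              hits.append(r)
          if r + m < n:                         # roll the window one position right
              ht = (2 * (ht - int(T[r]) * top) + int(T[r + m])) % q
      return hits

  print(rabin_karp("10110101", "0101"))  # [4]  (0-based; position 5 in the slide's 1-based indexing)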

Problem 1: Solution

[Figure: the same dictionary {bzip, not, or, space} and compressed text C(S) of S = “bzip or not bzip”; scanning C(S) for the codeword of P = bzip (“1a 0b”) answers yes exactly at its two occurrences.]

Speed ≈ Compression ratio

The Shift-And method

 Define M to be a binary m by n matrix such that:
   M(i,j) = 1 iff the first i characters of P exactly match the i
   characters of T ending at character j,
   i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1 ... j]

 Example: T = california and P = for

[Figure: the 3 x 10 matrix M for P = for over T = california: M(1,5)=1 (“f”), M(2,6)=1 (“fo”), M(3,7)=1 (“for”), all other entries 0. Column 7 has its last row set, i.e. an occurrence of P ends at position 7.]

How does M solve the exact match problem?

How to construct M

 We want to exploit bit-parallelism to compute the j-th
 column of M from the (j-1)-th one.
 Machines can perform bit and arithmetic operations between
 two words in constant time. Examples:
   And(A,B) is the bit-wise and between A and B.
   BitShift(A) is the value derived by shifting A’s bits down by
   one and setting the first bit to 1,
   e.g. BitShift( (0,1,1,0,1)ᵀ ) = (1,0,1,1,0)ᵀ.

 Let w be the word size (e.g., 32 or 64 bits). We’ll assume
 m ≤ w. NOTICE: any column of M fits in a memory word.

How to construct M

 We want to exploit bit-parallelism to compute the j-th
 column of M from the (j-1)-th one.
 We define the m-length binary vector U(x) for each
 character x in the alphabet: U(x) is set to 1 for the
 positions in P where character x appears.

 Example: P = abaac
   U(a) = (1,0,1,1,0)ᵀ    U(b) = (0,1,0,0,0)ᵀ    U(c) = (0,0,0,0,1)ᵀ

How to construct M

 Initialize column 0 of M to all zeros
 For j > 0, the j-th column is obtained by

   M(j) = BitShift( M(j-1) ) & U( T[j] )

 For i > 1, entry M(i,j) = 1 iff
   (1) the first i-1 characters of P match the i-1 characters of T
       ending at character j-1   ⇔ M(i-1,j-1) = 1
   (2) P[i] = T[j]               ⇔ the i-th bit of U(T[j]) = 1

 BitShift moves bit M(i-1,j-1) into the i-th position;
 AND-ing it with the i-th bit of U(T[j]) establishes whether both hold.

A worked example

T = xabxabaaca,  P = abaac   (m = 5, n = 10)

[Figure: columns M(1), M(2), M(3), …, M(9) computed one at a time via
M(j) = BitShift(M(j-1)) & U(T[j]). For instance M(2) has a 1 only in row 1
(“a” matches P[1]), M(3) has a 1 only in row 2 (“ab” matches P[1..2]), and
M(9) has a 1 in row 5: the whole pattern P = abaac ends at position 9 of T.]

Shift-And method: Complexity

 If m ≤ w, any column and any vector U() fit in a
 memory word  ⇒  any step requires O(1) time.
 If m > w, any column and any vector U() can be
 divided into ⌈m/w⌉ memory words  ⇒  any step requires O(m/w) time.
 Overall O(n(1+m/w)+m) time.
 Thus, it is very fast when the pattern length is close
 to the word size: very often in practice, since w = 64 bits in
 modern architectures.
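A small Python sketch of the exact-match Shift-And scan, packing each column into one integer (bit i-1 of the integer stands for row i):

  def shift_and(T, P):
      """Return the end positions (1-based) of all occurrences of P in T."""
      m = len(P)
      U = {}                                   # U[c]: bitmask of P's positions holding c
      for i, c in enumerate(P):
          U[c] = U.get(c, 0) | (1 << i)
      accept = 1 << (m - 1)                    # bit of row m
      M = 0                                    # current column, initially all zeros
      ends = []
      for j, c in enumerate(T, start=1):
          M = ((M << 1) | 1) & U.get(c, 0)     # BitShift then AND with U(T[j])
          if M & accept:
              ends.append(j)
      return ends

  print(shift_and("xabxabaaca", "abaac"))  # [9]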

Some simple extensions

 We want to allow the pattern to contain special
 symbols, like the class of chars [a-f].

 P = [a-b]baac
   U(a) = (1,0,1,1,0)ᵀ    U(b) = (1,1,0,0,0)ᵀ    U(c) = (0,0,0,0,1)ᵀ

 What about ‘?’, ‘[^…]’ (not) ?

Problem 1: An other solution

[Figure: the same dictionary and compressed text C(S) of S = “bzip or not bzip” as in Problem 1, with the occurrences of P = bzip (codeword “1a 0b”) marked.]

Speed ≈ Compression ratio

Problem 2
Given a pattern P, find all the occurrences in S
of all dictionary terms containing P as a substring.

P = o    (dictionary: {bzip, not, or, space}; both “not” and “or” contain “o”)

[Figure: the compressed text C(S) of S = “bzip or not bzip”, with the codewords of the matching terms marked:
   not = 1g 0g 0a
   or  = 1g 0a 0b ]

Speed ≈ Compression ratio? No! Why?
A scan of C(S) for each term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].

[Figure: text T with occurrences of P1 and P2 marked.]

 Naïve solution
   Use an (optimal) exact-matching algorithm to search for each pattern of P
   Complexity: O(nl + m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
   Complexity: O(n + l + m) time

A simple extension of Shift-And

 S is the concatenation of the patterns in P
 R is a bitmap of length m:
   R[i] = 1 iff S[i] is the first symbol of a pattern

 Use a variant of the Shift-And method searching for S:
   For any symbol c, U’(c) = U(c) AND R,
   i.e. U’(c)[i] = 1 iff S[i] = c and it is the first symbol of a pattern
   For any step j:
     compute M(j)
     then set M(j) = M(j) OR U’(T[j]). Why?
     It sets to 1 the first bit of each pattern that starts with T[j]
     Check if there are occurrences ending in j. How?
Problem 3
Given a pattern P, find all the occurrences in S
of all dictionary terms containing P as a substring,
allowing at most k mismatches.

P = bot,  k = 2    (dictionary: {bzip, not, or, space})

[Figure: the compressed text C(S) of S = “bzip or not bzip”.]

Agrep: Shift-And method with errors

 We extend the Shift-And method for finding inexact
 occurrences of a pattern in a text.
 Example:
   T = aatatccacaa,  P = atcgaa
   P appears in T with 2 mismatches starting at position 4;
   it also occurs with 4 mismatches starting at position 2.

     aatatccacaa        aatatccacaa
        atcgaa           atcgaa

Agrep

 Our current goal: given k, find all the occurrences of P
 in T with up to k mismatches.
 We define the matrix M^l to be an m by n binary
 matrix such that:

   M^l(i,j) = 1 iff there are at most l mismatches between the
   first i characters of P and the i characters of T ending at character j.

 What is M^0?
 How does M^k solve the k-mismatch problem?

Computing M^k

 We compute M^l for all l = 0, …, k.
 For each j compute M^0(j), M^1(j), …, M^k(j).
 For all l initialize M^l(0) to the zero vector.
 In order to compute M^l(j), we observe that M^l(i,j) = 1 iff one of two cases holds.

Computing M^l: case 1
 The first i-1 characters of P match a substring of T ending
 at j-1 with at most l mismatches, and the next pair of
 characters in P and T are equal:

   BitShift( M^l(j-1) ) & U( T[j] )

Computing M^l: case 2
 The first i-1 characters of P match a substring of T ending
 at j-1 with at most l-1 mismatches (position i is charged one mismatch):

   BitShift( M^(l-1)(j-1) )

Putting the two cases together:

   M^l(j) = [ BitShift( M^l(j-1) ) & U(T[j]) ]  OR  BitShift( M^(l-1)(j-1) )

Example

T = xabxabaaca,  P = abaad

[Figure: the matrices M^0 and M^1 for this pair, column by column. M^0 marks the exact prefix matches; M^1 additionally tolerates one mismatch, e.g. M^1(5,9) = 1 because P = abaad matches T[5..9] = abaac with a single mismatch in the last position.]

How much do we pay?

 The running time is O( k n (1 + m/w) ).
 Again, the method is practically efficient for small m.
 Still, only O(k) columns of M are needed at any
 given time. Hence the space used by the
 algorithm is O(k) memory words (for m ≤ w).
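For concreteness, a Python sketch of the k-mismatch recurrence above, built on the same bit-packing as the exact Shift-And sketch:

  def shift_and_k_mismatch(T, P, k):
      """End positions (1-based) of substrings of T matching P with <= k mismatches."""
      m = len(P)
      U = {}
      for i, c in enumerate(P):
          U[c] = U.get(c, 0) | (1 << i)
      accept = 1 << (m - 1)
      M = [0] * (k + 1)                 # M[l] = current column of M^l
      ends = []
      for j, c in enumerate(T, start=1):
          prev = M[:]                   # columns at position j-1
          u = U.get(c, 0)
          M[0] = ((prev[0] << 1) | 1) & u
          for l in range(1, k + 1):     # case 1 OR case 2 of the recurrence
              M[l] = (((prev[l] << 1) | 1) & u) | ((prev[l - 1] << 1) | 1)
          if M[k] & accept:
              ends.append(j)
      return ends

  print(shift_and_k_mismatch("aatatccacaa", "atcgaa", 2))  # [9]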

Problem 3: Solution

[Figure: the compressed text C(S) of S = “bzip or not bzip”; for P = bot with k = 2 the scan reports the term “not” (codeword 1g 0g 0a), which matches P within k mismatches.]

Agrep: more sophisticated operations

 The Shift-And method can solve other ops.
 The edit distance between two strings p and s is
 d(p,s) = the minimum number of operations
 needed to transform p into s via three ops:
   Insertion: insert a symbol in p
   Deletion: delete a symbol from p
   Substitution: change a symbol of p into a different one
 Example: d(ananas,banane) = 3

 Search by regular expressions
   Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding

   γ(x) = 000...........0 followed by x in binary
          (Length-1 zeroes)

 x > 0 and Length = ⌊log2 x⌋ + 1
 e.g., 9 is represented as <000,1001>.
 The γ-code for x takes 2⌊log2 x⌋ + 1 bits
 (i.e. a factor of 2 from optimal)
 Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers

It is a prefix-free encoding…

 Given the following sequence of γ-coded
 integers, reconstruct the original sequence:

   0001000001100110000011101100111

   Answer: 8, 6, 3, 59, 7
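A tiny Python sketch of γ-encoding and decoding (strings of '0'/'1' for readability):

  def gamma_encode(x):
      """x > 0: (Length-1) zeroes followed by x in binary."""
      b = bin(x)[2:]
      return "0" * (len(b) - 1) + b

  def gamma_decode(bits):
      out, i = [], 0
      while i < len(bits):
          z = 0
          while bits[i] == "0":                    # count the leading zeroes
              z += 1
              i += 1
          out.append(int(bits[i:i + z + 1], 2))    # read z+1 bits of binary value
          i += z + 1
      return out

  print(gamma_encode(9))                                   # 0001001
  print(gamma_decode("0001000001100110000011101100111"))   # [8, 6, 3, 59, 7]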

Analysis
Sort the pi in decreasing order, and encode si via
the variable-length code γ(i).

Recall that |γ(i)| ≤ 2 * log2 i + 1.
How good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(S) + 1
Key fact:
   1 ≥ Σ_{i=1,...,x} pi ≥ x * px    ⇒    x ≤ 1/px

How good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2 * log2 i + 1.
The cost of the encoding is (recall i ≤ 1/pi):

   Σ_{i=1,...,|S|} pi · |γ(i)|  ≤  Σ_{i=1,...,|S|} pi · [ 2 * log2 (1/pi) + 1 ]  =  2 * H0(X) + 1

Not much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding

 Byte-aligned and tagged Huffman
   128-ary Huffman tree
   First bit of the first byte is tagged
   Configurations on 7 bits: just those of Huffman

 End-tagged dense code
   The rank r is mapped to the r-th binary sequence on 7*k bits
   First bit of the last byte is tagged

A better encoding
Surprising changes:
 It is a prefix code
 Better compression: it uses all 7-bit configurations

(s,c)-dense codes
The distribution of words is skewed: 1/i^q, where 1 < q < 2

 A new concept: Continuers vs Stoppers
   Previously we used s = c = 128
 The main idea is:
   s + c = 256 (we are playing with 8 bits)
   Thus s items are encoded with 1 byte,
   s*c with 2 bytes, s*c^2 with 3 bytes, ...

An example
 5000 distinct words
 ETDC encodes 128 + 128^2 = 16512 words within 2 bytes
 A (230,26)-dense code encodes 230 + 230*26 = 6210 words within 2 bytes,
 hence more on 1 byte, and thus it wins if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, assuming c = 256 - s.

 Brute-force approach
 Binary search:
   On real distributions there seems to be a unique minimum
   Ks = max codeword length
   Fsk = cumulative probability of the symbols whose |cw| ≤ k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?

 Move-to-Front (MTF):
   As a freq-sorting approximator
   As a caching strategy
   As a compressor

 Run-Length Encoding (RLE):
   FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, which can then be var-length coded:

 Start with the list of symbols L = [a,b,c,d,…]
 For each input symbol s
   1) output the position of s in L
   2) move s to the front of L

There is a memory.
Properties:
 It exploits temporal locality, and it is dynamic
 X = 1^n 2^n 3^n … n^n  ⇒  Huff = Θ(n^2 log n), MTF = O(n log n) + n^2

Not much worse than Huffman
...but it may be far better
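A minimal Python sketch of MTF (list-based, with 0-based positions; the balanced-tree/hash variant discussed below is what you would use for large alphabets):

  def mtf_encode(text, alphabet):
      L = list(alphabet)            # current symbol list, front = most recently seen
      out = []
      for s in text:
          i = L.index(s)            # position of s (0-based here)
          out.append(i)
          L.insert(0, L.pop(i))     # move s to the front
      return out

  def mtf_decode(codes, alphabet):
      L = list(alphabet)
      out = []
      for i in codes:
          s = L[i]
          out.append(s)
          L.insert(0, L.pop(i))
      return "".join(out)

  c = mtf_encode("aaabbbbccc", "abc")
  print(c)                          # [0, 0, 0, 1, 0, 0, 0, 2, 0, 0]
  print(mtf_decode(c, "abc"))       # aaabbbbccc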

MTF: how good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2 * log2 i + 1.
Put S in front and consider the cost of encoding:

   O(|S| log |S|)  +  Σ_{x=1..|S|}  Σ_{i=2..n_x}  |γ( p_i^x - p_{i-1}^x )|

By Jensen’s inequality:

   ≤ O(|S| log |S|)  +  Σ_{x=1..|S|}  n_x * [ 2 * log2 (N / n_x) + 1 ]
   ≤ O(|S| log |S|)  +  N * [ 2 * H0(X) + 1 ]

   Hence  La[mtf]  ≤  2 * H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
   abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  ⇒  just the run lengths and one bit.
Properties:
 It exploits spatial locality, and it is a dynamic code (there is a memory)
 X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = Θ(n^2 log n)  >  Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive), with

   f(i) = Σ_{j=1..i-1} p(j)

e.g.  a = .2 → [0, .2),   b = .5 → [.2, .7),   c = .3 → [.7, 1.0)
      f(a) = .0, f(b) = .2, f(c) = .7

The interval for a particular symbol will be called
the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac

[Figure: start with [0,1); after b the interval is [.2,.7); after a it is [.2,.3); after c it is [.27,.3).]

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c1 c2 … cn with probabilities
p[c], use the following:

   l_0 = 0      l_i = l_{i-1} + s_{i-1} * f[c_i]
   s_0 = 1      s_i = s_{i-1} * p[c_i]

f[c] is the cumulative probability up to symbol c (not included).
The final interval size is

   s_n = Π_{i=1..n} p[c_i]

The interval for a message sequence will be called the
sequence interval.

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval
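A toy floating-point Python sketch of these recurrences (real coders use the integer renormalization described later; this version only mirrors the formulas, with the example distribution of the slides):

  P = {"a": .2, "b": .5, "c": .3}
  F = {"a": .0, "b": .2, "c": .7}          # cumulative prob., symbol excluded

  def encode_interval(msg):
      l, s = 0.0, 1.0
      for c in msg:                        # l_i = l_{i-1} + s_{i-1}*f[c];  s_i = s_{i-1}*p[c]
          l, s = l + s * F[c], s * P[c]
      return l, s                          # any number in [l, l+s) identifies msg

  def decode(x, n):
      out = []
      for _ in range(n):                   # find the symbol interval containing x, then rescale
          for c in "abc":
              lo, hi = F[c], F[c] + P[c]
              if lo <= x < hi:
                  out.append(c)
                  x = (x - lo) / P[c]
                  break
      return "".join(out)

  print(encode_interval("bac"))            # ≈ (0.27, 0.03): the interval [.27, .30)
  print(decode(0.49, 3))                   # bbc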

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:

[Figure: .49 falls in b’s interval [.2,.7); rescaling, it falls again in b’s interval; rescaling once more, it falls in c’s interval.]

The message is bbc.

Representing a real number
Binary fractional representation:

   .75    = .11
   1/3    = .010101...
   11/16  = .1011

Algorithm (emit the binary expansion of x ∈ [0,1)):
  1. x = 2 * x
  2. if x < 1 output 0
  3. else x = x - 1; output 1

So how about just using the shortest binary
fractional representation inside the sequence interval?
e.g. [0,.33) → .01    [.33,.66) → .1    [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.

   code    min      max        interval
   .11     .110     .111...    [.75, 1.0)
   .101    .1010    .1011...   [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (a dyadic number).

[Figure: the sequence interval [.61,.79) containing the code interval [.625,.75) of the codeword .101.]

Can use l + s/2 truncated to 1 + ⌈log2 (1/s)⌉ bits.

Bound on Arithmetic length
Note that -log2 s + 1 = log2 (2/s).

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most

   1 + ⌈log2 (1/s)⌉
   = 1 + ⌈log2 ∏_{i=1..n} (1/p_i)⌉
   ≤ 2 + ∑_{i=1..n} log2 (1/p_i)
   = 2 + ∑_{k=1..|S|} n·p_k·log2 (1/p_k)
   = 2 + n·H0   bits

(nH0 + 0.02 n bits in practice, because of rounding)

Integer Arithmetic Coding
The problem is that operations on arbitrary-precision
real numbers are expensive.
Key ideas of the integer version:
 Keep integers in range [0..R) where R = 2^k
 Use rounding to generate the integer interval
 Whenever the sequence interval falls into the top,
 bottom or middle half, expand the interval by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
   Output 1 followed by m 0s; m = 0
   The message interval is expanded by 2
If u < R/2 then (bottom half)
   Output 0 followed by m 1s; m = 0
   The message interval is expanded by 2
If l ≥ R/4 and u < 3R/4 then (middle half)
   Increment m
   The message interval is expanded by 2
All other cases: just continue...

You find this at

Arithmetic ToolBox
As a state machine: the coder keeps the current interval (L,s); given the next
symbol c and the distribution (p1,....,pS), the ATB maps (L,s) to the sub-interval (L’,s’).

[Figure: the ATB as a black box: (L,s) + c → (L’,s’).]

Therefore, even the distribution can change over time.

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox

[Figure: at each step the symbol s (either a char c or esc) is fed to the ATB with the conditional distribution p[ s | context ], mapping (L,s) to (L’,s’).]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant).

PPM: Example Contexts      String = ACCBACCACBA B,   k = 2

  Order 0 (empty context):   A = 4   B = 2   C = 5   $ = 3

  Order 1:
    A:   C = 3,  $ = 1
    B:   A = 2,  $ = 1
    C:   A = 1,  B = 2,  C = 2,  $ = 3

  Order 2:
    AC:  B = 1,  C = 2,  $ = 2
    BA:  C = 1,  $ = 1
    CA:  C = 1,  $ = 1
    CB:  A = 2,  $ = 1
    CC:  A = 1,  B = 1,  $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
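A simplified Python sketch of LZ77 with a sliding window (linear scan for the longest match; real implementations like gzip use hashing, as noted below):

  def lz77_encode(T, W=6):
      """Emit (d, len, next_char) triples; d is the backward distance (0 if no match)."""
      out, i, n = [], 0, len(T)
      while i < n:
          best_d = best_len = 0
          for d in range(1, min(i, W) + 1):          # candidate copy source: i - d
              L = 0
              while i + L < n - 1 and T[i - d + L] == T[i + L]:
                  L += 1                              # copies may overlap the cursor
              if L > best_len:
                  best_d, best_len = d, L
          out.append((best_d, best_len, T[i + best_len]))
          i += best_len + 1
      return out

  def lz77_decode(triples):
      s = []
      for d, L, c in triples:
          for _ in range(L):
              s.append(s[-d])                         # works even when L > d (overlap)
          s.append(c)
      return "".join(s)

  code = lz77_encode("aacaacabcabaaac")
  print(code)          # [(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')], as in the slide
  print(lz77_decode(code) == "aacaacabcabaaac")       # True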

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later
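A Python sketch of LZW, including the special SSc case where the decoder sees an id it has not created yet (this is the "one step later" issue in the example above; the dictionary here is seeded with the 256 ASCII codes rather than the slide's toy numbering):

  def lzw_encode(T):
      dict_ = {chr(i): i for i in range(256)}
      S, out = "", []
      for c in T:
          if S + c in dict_:
              S += c
          else:
              out.append(dict_[S])
              dict_[S + c] = len(dict_)    # add Sc, but do not emit c
              S = c
      out.append(dict_[S])
      return out

  def lzw_decode(codes):
      dict_ = {i: chr(i) for i in range(256)}
      prev = dict_[codes[0]]
      out = [prev]
      for code in codes[1:]:
          if code in dict_:
              cur = dict_[code]
          else:                            # SSc case: the id was just created by the encoder
              cur = prev + prev[0]
          out.append(cur)
          dict_[len(dict_)] = prev + cur[0]
          prev = cur
      return "".join(out)

  msg = "aabaacababacb"
  print(lzw_decode(lzw_encode(msg)) == msg)   # True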

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
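A direct Python sketch of the transform (by sorting the rotations, as in the Θ(n² log n) construction criticized below) and of InvertBWT via the LF-array:

  def bwt(T):
      """T must end with a unique smallest char, e.g. '#'."""
      rots = sorted(T[i:] + T[:i] for i in range(len(T)))
      return "".join(r[-1] for r in rots)

  def inverse_bwt(L, marker="#"):
      n = len(L)
      # LF[r] = row of F holding the char L[r]: stable rank of L[r] among equal chars
      order = sorted(range(n), key=lambda r: (L[r], r))
      LF = [0] * n
      for f_row, r in enumerate(order):
          LF[r] = f_row
      out = []
      r = L.index(marker)          # the row ending with '#' is T itself
      for _ in range(n):
          out.append(L[r])         # L[r] precedes F[r] in T: fill T backwards
          r = LF[r]
      return "".join(reversed(out))

  L = bwt("mississippi#")
  print(L)                         # ipssm#pissii
  print(inverse_bwt(L))            # mississippi#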

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n^2 log n) time in the worst case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution

[Plots: Altavista crawl 1999 and WebBase crawl 2001 - the indegree follows a power-law distribution]

   Pr[ in-degree(u) = k ]  ∝  1 / k^a,     a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)
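To make the "compressed gaps" idea concrete, a small Python sketch that gap-encodes one adjacency list and γ-codes the gaps; the sign-folding of the first (possibly negative) gap and the +1 shift are illustrative choices, not the exact WebGraph format:

  def gamma(x):                       # γ-code of x >= 1
      b = bin(x)[2:]
      return "0" * (len(b) - 1) + b

  def encode_adjacency(x, succs):
      """succs: sorted successor list of node x."""
      gaps = [succs[0] - x] + [succs[i] - succs[i-1] - 1 for i in range(1, len(succs))]
      # the first gap may be negative: fold the sign (2g if g>=0, 2|g|-1 if g<0)
      first = 2 * gaps[0] if gaps[0] >= 0 else 2 * abs(gaps[0]) - 1
      vals = [first + 1] + [g + 1 for g in gaps[1:]]   # +1 so gamma() gets values >= 1
      return "".join(gamma(v) for v in vals)

  print(encode_adjacency(15, [13, 15, 16, 17, 18, 19, 23, 24, 203]))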

Extra-nodes: Compressing Intervals
(applied to the adjacency list with copy blocks)

Exploit runs of consecutive successors (consecutivity) in the extra-nodes:
 Intervals: use their left extreme and length
 Interval length: decremented by Lmin = 2
 Residuals: differences between residuals, or wrt the source node

Examples of encoded values:
   0 = (15-15)*2 (positive)
   2 = (23-19)-2 (jump >= 2)
   600 = (316-16)*2
   3 = |13-15|*2-1 (negative)
   3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression   (one-to-one)

Problem: We have two files f_known and f_new and the goal is
to compute a file f_d of minimum size such that f_new can
be derived from f_known and f_d.

 Assume that block moves and copies are allowed
 Find an optimal covering set of f_new based on f_known
 The LZ77-scheme provides an efficient, optimal solution:
   f_known is the “previously encoded text”; compress the concatenation f_known·f_new, emitting output only from f_new onwards

zdelta is one of the best implementations
            Emacs size   Emacs time
  uncompr      27Mb         ---
  gzip          8Mb        35 secs
  zdelta      1.5Mb        42 secs

Efficient Web Access
Dual-proxy architecture: a pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link.

[Figure: Client ↔ (slow link: requests, references, delta-encoded pages) ↔ Proxy ↔ (fast link: requests) ↔ web page.]

Use zdelta to reduce traffic:
 The old version is available at both proxies
 Restricted to pages already visited (30% hits), URL-prefix match
 Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F
 Useful on a dynamic collection of web pages, back-ups, …

 Apply pairwise zdelta: find for each f ∈ F a good reference

 Reduction to the Min Branching problem on DAGs
   Build a weighted graph GF: nodes = files, weights = zdelta-sizes
   Insert a dummy node connected to all, whose edge weights are the gzip-coding sizes
   Compute the min branching = directed spanning tree of min total cost covering
   G’s nodes.

[Figure: a small example graph with a dummy node 0 and files 1, 2, 3, 5; edge weights (620, 123, 2000, 220, 20, …) are the zdelta/gzip sizes; the min branching picks the cheapest reference for each file.]

            space    time
  uncompr    30Mb     ---
  tgz        20%     linear
  THIS        8%     quadratic

Improvement: what about many-to-one compression (a group of files as reference)?

Problem: Constructing G is very costly: n^2 edge calculations (zdelta executions)

 We wish to exploit some pruning approach:
   Collection analysis: cluster the files that appear similar and are thus
   good candidates for zdelta-compression; build a sparse weighted graph
   G’F containing only edges between those pairs of files
   Assign weights: estimate appropriate edge weights for G’F, thus saving
   zdelta executions. Nonetheless, still Θ(n^2) time

            space    time
  uncompr   260Mb     ---
  tgz        12%     2 mins
  THIS        8%     16 mins

Algoritmi per IR

File Synchronization

File synch: The problem

[Figure: the Client holds f_old, the Server holds f_new; the client sends a request and receives an update.]

 The client wants to update an out-dated file
 The server has the new file but does not know the old file
 Update without sending the entire f_new (using similarity)
 rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch,
since the server has both copies of the files.

The rsync algorithm

[Figure: the Client splits f_old into blocks and sends their hashes; the Server scans f_new, matches blocks against those hashes, and returns the encoded file (copy instructions + literals).]

The rsync algorithm (contd)
 simple, widely used, single roundtrip
 optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
 choice of block size problematic (default: max{700, √n} bytes)
 not good in theory: granularity of changes may disrupt the use of blocks
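A toy Python sketch of the server-side scan: block hashes from the client, an additive rolling checksum standing in for rsync's 4-byte hash, and literal bytes for unmatched positions. Constants, names and the output format are illustrative, not rsync's actual protocol:

  def block_hashes(data, B):
      return {sum(data[i:i+B]) % (1 << 16): i // B          # weak hash -> block index
              for i in range(0, len(data) - B + 1, B)}

  def delta(f_new, f_old, B=4):
      """Encode f_new as a list of ('copy', block_idx) / ('lit', byte) items."""
      table = block_hashes(f_old, B)
      out, i, h = [], 0, None
      while i < len(f_new):
          if i + B <= len(f_new):
              if h is None:
                  h = sum(f_new[i:i+B]) % (1 << 16)          # (re)start the rolling sum
              idx = table.get(h)
              if idx is not None and f_old[idx*B:idx*B+B] == f_new[i:i+B]:
                  out.append(("copy", idx))                  # verified block match
                  i += B
                  h = None
                  continue
              if i + B < len(f_new):                         # roll the window by one byte
                  h = (h - f_new[i] + f_new[i + B]) % (1 << 16)
              else:
                  h = None
          out.append(("lit", f_new[i]))
          i += 1
      return out

  old = b"the quick brown fox jumps"
  new = b"the quick brown cat jumps"
  print(delta(new, old, B=5))    # mostly 'copy' items, plus a few literals around "cat"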

Rsync: some experiments

           gcc size   emacs size
  total      27288      27326
  gzip        7563       8577
  zdelta       227       1431
  rsync        964       4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends the hashes (unlike rsync, where the client does), and the client checks them.
The server deploys the common f_ref to compress the new f_tar (rsync just compresses it).

A multi-round protocol

 k blocks of n/k elements each
 log(n/k) levels
 If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
 The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

 k blocks of n/k elements each
 log(n/k) levels
 If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
 The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts

Pattern P occurs at position i of T
iff
P is a prefix of the i-th suffix of T (ie. T[i,N])

[Figure: T with the suffix T[i,N] starting at position i, and P aligned to its first |P| characters]

Occurrences of P in T = All suffixes of T having P as a prefix

Example: P = si, T = mississippi  P occurs at positions 4 and 7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree

T# = mississippi#

[Figure: the suffix tree of T#, a compacted trie of its 12 suffixes; internal edges carry labels such as i, s, si, ssi, p, #, i#, pi#, ppi#, mississippi#, and each leaf stores the starting position (1–12) of its suffix]

The Suffix Array

Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.

T = mississippi#     (storing SUF(T) explicitly takes Θ(N²) space)

SA    SUF(T)
12    #
11    i#
8     ippi#
5     issippi#
2     ississippi#
1     mississippi#
10    pi#
9     ppi#
7     sippi#
4     sissippi#
6     ssippi#
3     ssissippi#

Each SA entry is a suffix pointer into T (for P = si, the matching suffixes lie in one contiguous block).

Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern

Indirect binary search on SA: O(p) time per suffix comparison

[Figure: binary search over SA = 12 11 8 5 2 1 10 9 7 4 6 3 for P = si on T = mississippi#; at this step P is larger than the probed suffix; 2 accesses per step]

Searching a pattern

Indirect binary search on SA: O(p) time per suffix comparison

[Figure: the binary search continues on SA for P = si, T = mississippi#; at this step P is smaller than the probed suffix]

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

Improvements: O(p + log2 N) time [Manber-Myers, ’90]; O(p + log2 |S|) time with Suffix Trays [Cole et al., ’06]
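A self-contained sketch reproducing the slides' example: the suffix array is built by plain sorting (quadratic work, purely for illustration; real constructions are much faster) and a pattern is searched with two indirect binary searches over SA.

# Build SA by sorting the suffixes, then search a pattern with two indirect
# binary searches: O(p log2 N) char comparisons + occ to report.
from bisect import bisect_left, bisect_right

class SuffixView:
    """Read-only sequence: item i is the suffix starting at SA[i], cut to `width` chars."""
    def __init__(self, t, sa, width):
        self.t, self.sa, self.width = t, sa, width
    def __len__(self):
        return len(self.sa)
    def __getitem__(self, i):
        start = self.sa[i] - 1                    # positions are 1-based as in the slides
        return self.t[start:start + self.width]

def build_sa(t: str):
    return sorted(range(1, len(t) + 1), key=lambda i: t[i - 1:])

def search(t: str, sa, p: str):
    view = SuffixView(t, sa, len(p))              # compare P against |P|-char prefixes
    lo, hi = bisect_left(view, p), bisect_right(view, p)
    return sorted(sa[lo:hi])                      # starting positions of the occurrences

t = "mississippi#"
sa = build_sa(t)
print(sa)                    # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(search(t, sa, "si"))   # [4, 7]

The two bisects delimit the contiguous SA range of Prop. 1, so reporting adds only O(occ) extra time; the O(p + log2 N) refinement above needs the extra Lcp information.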

Locating the occurrences

[Figure: the contiguous SA range for P = si on T = mississippi# is delimited by searching for si# and si$; it contains the suffixes sippi… and sissippi…, so occ = 2 and the occurrences start at positions 4 and 7]

Suffix Array search
• O(p + log2 N + occ) time   (the two delimiting searches use # < S < $)

Suffix Trays: O(p + log2 |S| + occ)   [Cole et al., ‘06]
String B-tree   [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays   [Ciriani et al., ’02]

Text mining

Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

T = mississippi#

SA  = 12  11  8  5  2  1  10  9  7  4  6  3
Lcp =   0   1  1  4  0  0   1  0  2  1  3        (Lcp[i] = lcp of the suffixes SA[i], SA[i+1])

[Example: the SA-adjacent suffixes issippi# and ississippi# share the prefix "issi", so their Lcp entry is 4]

• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
• Search for an entry Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
• Search for a window Lcp[i,i+C-2] whose entries are all ≥ L
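As a small, self-contained check of these queries (naïve quadratic Lcp construction, only for illustration; Kasai's algorithm computes it in linear time):

# Lcp[i] = longest common prefix of the suffixes starting at SA[i] and SA[i+1].
def lcp_array(t: str, sa):
    def lcp(i, j):
        a, b, k = t[i - 1:], t[j - 1:], 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        return k
    return [lcp(sa[i], sa[i + 1]) for i in range(len(sa) - 1)]

def has_repeat(lcp, L, C=2):
    """Substring of length >= L occurring >= C times  <=>  C-1 consecutive Lcp entries >= L."""
    run = 0
    for v in lcp:
        run = run + 1 if v >= L else 0
        if run >= C - 1:
            return True
    return False

t  = "mississippi#"
sa = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
lcp = lcp_array(t, sa)
print(lcp)                   # [0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3]
print(has_repeat(lcp, 4))    # True:  "issi" occurs twice
print(has_repeat(lcp, 3, 3)) # False: no length-3 substring occurs three times

A run of C-1 consecutive entries ≥ L corresponds exactly to C suffixes sharing a length-L prefix, i.e. C occurrences of the same substring.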


Slide 166

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with i-th bit of U(T[j]) to estabilish if both are true

An example j=1
1 2 3 4 5 6 7 8 9 10

n
m

1

12345

1

0

P=abaac

2

0

3

0

4

0

5

0

T=xabxabaaca

0
 
0
U (x)  0
 
0
 
0

2

3

4

5

6

7

1 
 
0
BitShift ( M ( 0 )) & U (T [1])   0  &
 
0
 
0

8

9

0 0
   
0 0
0  0
   
0 0
   
0 0

1
0

An example j=2
1 2 3 4 5 6 7 8 9 10

n
m

1

2

12345

1

0

1

P=abaac

2

0

0

3

0

0

4

0

0

5

0

0

T=xabxabaaca

1 
 
0
U (a )   1 
 
1 
 
0

3

4

5

6

7

1 
 
0
BitShift ( M (1)) & U (T [ 2 ])   0  &
 
0
 
0

8

9

1  1 
   
0 0
1    0 
   
1   0 
   
0 0

1
0

An example j=3
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

12345

1

0

1

0

P=abaac

2

0

0

1

3

0

0

0

4

0

0

0

5

0

0

0

T=xabxabaaca

0
 
1 
U (b )   0 
 
0
 
0

4

5

6

7

1 
 
1 
BitShift ( M ( 2 )) & U (T [ 3 ])   0  &
 
0
 
0

8

9

0 0
   
1  1 
0  0
   
0 0
   
0 0

1
0

An example j=9
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

4

5

6

7

8

9

12345

1

0

1

0

0

1

0

1

1

0

P=abaac

2

0

0

1

0

0

1

0

0

0

3

0

0

0

0

0

0

1

0

0

4

0

0

0

0

0

0

0

1

0

5

0

0

0

0

0

0

0

0

1

T=xabxabaaca

0
 
0
U (c )   0 
 
0
 
1 

1 
 
1 
BitShift ( M ( 8 )) & U (T [ 9 ])   0  &
 
0
 
1 

0 0
   
0 0
0  0
   
0 0
   
1  1 

1
0

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M ( j - 1))  U (T [ j ])
l

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M

l -1

( j - 1))

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1
1 2 3 4 5 6 7 8 910

M1=

T=xabxabaaca
P=
abaad

1

2

3

4

5

6

7

8

9

1
0

1

1

1

1

1

1

1

1

1

1

1

2

0

0

1

0

0

1

0

1

1

0

3

0

0

0

1

0

0

1

0

0

1

4

0

0

0

0

1

0

0

1

0

0

5

0

0

0

0

0

0

0

0

1

0

M0=

1

2

3

4

5

6

7

8

9

1
0

1

0

1

0

0

1

0

1

1

0

1

2

0

0

1

0

0

1

0

0

0

0

3

0

0

0

0

0

0

1

0

0

0

4

0

0

0

0

0

0

0

1

0

0

5

0

0

0

0

0

0

0

0

0

0

How much do we pay?





The running time is O(kn(1+m/w)
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Delection: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
0 000 ...........0 x in b in ary
Length-1


x > 0 and Length = log2 x +1
e.g., 9 represented as <000,1001>.



g-code for x takes 2 log2 x +1 bits

(ie. factor of 2 from optimal)



Optimal for Pr(x) = 1/2x2, and i.i.d integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8

6

3

59

7

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How much good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
1 ≥Si=1,...,x pi ≥ x * px  x ≤ 1/px

How good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):



p i  | g ( i ) |

i  1 ,..., S

This is:



i  1 ,.., S

p i  [ 2 * log

1

 1]

pi

 2 * H 0 ( X ) 1
No much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1n 2n 3n… nn  Huff = O(n2 log n), MTF = O(n log n) + n2

No much worse than Huffman
...but it may be far better

MTF: how good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
Put S in the front and consider the cost of encoding:
S

O ( S log S ) 



nx

g(p - p )

x 1 i  2

x

x

i

i -1

S

By Jensen’s:

 O ( S log S ) 

n
x 1

x

[ 2 * log

N

 1]

nx

 O ( S log S )  N * [ 2 * H 0 ( X )  1]
L a [ mtf ]  2 * H 0 ( X )  O (1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1n 2n 3n… nn 

There is a memory

Huff(X) = Q(n2 log n) > Rle(X) = Q(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.
1.0

c = .3

0.7

i -1

f (i ) 



p( j)

j 1

b = .5
0.2
0.0

f(a) = .0, f(b) = .2, f(c) = .7
a = .2

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
1.0

0.7
c = .3

0.7

c = .3
0.55

0.0

a = .2

0.3
0.2

c = .3

0.27
b = .5

b = .5
0.2

0.3

a = .2

b = .5
0.22

a = .2

0.2

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l 0  0 l i  l i -1  s i -1 * f c i 

s0  1

s i  s i -1 * p c i 

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

n

sn 

 p c 
i

i 1

The interval for a message sequence will be called the
sequence interval

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
1.0

0.7
c = .3

0.7
0.49

b = .5

c = .3

0.55
0.49

0.2

0.0

0.55

b = .5

0.3
a = .2

The message is bbc.

0.2

0.49
0.475

c = .3

b = .5
0.35

a = .2

0.3

a = .2

Representing a real number
Binary fractional representation:
. 75

 . 11

1/ 3

 . 01 01

11 / 16

 . 1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
m in

m ax

interval

.11

.110

.111

[.75 ,1.0 )

.101

.1010

.1011

[.625 , .75 )

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
Context
Empty

Counts
A
B
C
$

=
=
=
=

4
2
5
3

Context
A
B
C

Counts
C
$
A
$
A
B
C
$

=
=
=
=
=
=
=
=

3
1
2
1
1
2
2
3

Context
AC

BA
CA
CB
CC

String = ACCBACCACBA B

k=2

Counts
B
C
$
C
$
C
$
A
$
A
B
$

=
=
=
=
=
=
=
=
=
=
=
=

1
2
2
1
1
1
1
2
1
1
1
2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Q(n2 log n) time in the worst-case
• Q(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in - degree ( u )  k ] 

a  2.1

1

k

a

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides and efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
Emacs size

Emacs time

uncompr

27Mb

---

gzip

8Mb

35 secs

zdelta

1.5Mb

42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

0

620

2

20

123

2000

20
220

3

1
5

space

time

uncompr

30Mb

---

tgz

20%

linear

THIS

8%

quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
space

time

uncompr

260Mb

---

tgz

12%

2 mins

THIS

8%

16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

gcc size
total
27288
gzip
7563
zdelta
227
rsync
964

emacs size
27326
8577
1431
4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
#

0

s

i

1
ssi

ppi#

4

#

1

p

i

3

2

1

ppi#

6

pi#

7

8
5

T# = mississippi#
2 4 6 8 10

si

ppi#

i#

ppi#

11

mississippi#

12

2

1 10

9

4

3

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. The starting position of that range is the lexicographic position of P.

T = mississippi#        (storing SUF(T) explicitly would take Θ(N2) space)

SA      SUF(T)
12      #
11      i#
 8      ippi#
 5      issippi#
 2      ississippi#
 1      mississippi#
10      pi#
 9      ppi#
 7      sippi#
 4      sissippi#
 6      ssippi#
 3      ssissippi#

Each SA entry is a suffix pointer into T; e.g. P = si selects the contiguous range of suffixes starting at 7 and 4.

Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes
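A minimal Python sketch of the structure just described (naive construction that sorts all suffixes explicitly, so simple but not the efficient construction used in practice); positions are 1-based as in the slides, where T[i,N] is the i-th suffix:

def suffix_array(t):
    # sort the suffix starting positions by comparing the suffixes themselves
    return sorted(range(1, len(t) + 1), key=lambda i: t[i - 1:])

T = "mississippi#"
SA = suffix_array(T)
print(SA)   # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]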

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison

[Figure: two steps of the binary search for P = si over the SA of T = mississippi#; each step dereferences the probed SA entry into T (2 accesses per step) and the character comparison tells whether P is larger or smaller than that suffix]

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

Improvements: O(p + log2 N) time [Manber-Myers, ’90];
O(p + log2 |S|) time (Suffix Trays) [Cole et al, ’06]
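A minimal Python sketch of the plain O(p log2 N) search (no LCP speed-up); it reuses the suffix_array() sketch above, and positions are again 1-based:

def sa_search(T, SA, P):
    # Binary search over SA: O(p) characters compared per step, O(log N) steps.
    def pref(i):                      # length-|P| prefix of the i-th suffix
        return T[i - 1:i - 1 + len(P)]
    lo, hi = 0, len(SA)
    while lo < hi:                    # first suffix whose prefix is >= P
        mid = (lo + hi) // 2
        if pref(SA[mid]) < P:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    lo, hi = start, len(SA)
    while lo < hi:                    # first suffix whose prefix is > P
        mid = (lo + hi) // 2
        if pref(SA[mid]) <= P:
            lo = mid + 1
        else:
            hi = mid
    return SA[start:lo]               # starting positions of all occurrences of P

print(sa_search("mississippi#", suffix_array("mississippi#"), "si"))   # [7, 4]

The two binary searches delimit the contiguous range of SA containing all suffixes prefixed by P; listing its content then accounts for the +occ term in the bounds that follow.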

Locating the occurrences

[Figure: the binary search for P = si in the SA of T = mississippi# delimits the range containing sippi... (position 7) and sissippi... (position 4); hence occ = 2 and the occurrences are at positions 4 and 7. The range can be delimited by searching si# and si$, assuming # < S < $]

Suffix Array search
• O(p + log2 N + occ) time

• Suffix Trays: O(p + log2 |S| + occ)   [Cole et al., ‘06]
• String B-tree   [Ferragina-Grossi, ’95]
• Self-adjusting Suffix Arrays   [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

T = mississippi#

SA      SUF(T)          Lcp (with the next suffix)
12      #               0
11      i#              1
 8      ippi#           1
 5      issippi#        4
 2      ississippi#     0
 1      mississippi#    0
10      pi#             1
 9      ppi#            0
 7      sippi#          2
 4      sissippi#       1
 6      ssippi#         3
 3      ssissippi#      -

(e.g. the suffixes issippi# and ississippi#, starting at positions 5 and 2, share the prefix issi of length 4)

• How long is the common prefix between T[i,...] and T[j,...] ?
• It is the min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Is there a repeated substring of length ≥ L ?
• Search for an entry Lcp[i] ≥ L.
• Is there a substring of length ≥ L occurring ≥ C times ?
• Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.
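A minimal Python sketch of these text-mining queries (naive quadratic LCP construction, just to make the definitions concrete; a linear-time construction such as Kasai's algorithm would be used in practice). It reuses the suffix_array() sketch above:

def lcp_array(T, SA):
    # Lcp[i] = longest common prefix length of the suffixes starting at SA[i] and SA[i+1]
    def lcp(a, b):
        s, t = T[a - 1:], T[b - 1:]
        k = 0
        while k < min(len(s), len(t)) and s[k] == t[k]:
            k += 1
        return k
    return [lcp(SA[i], SA[i + 1]) for i in range(len(SA) - 1)]

def has_repeat_of_length(Lcp, L):
    # Is there a repeated substring of length >= L ?
    return any(v >= L for v in Lcp)

def has_substring_occurring(Lcp, L, C):
    # Is there a substring of length >= L occurring >= C times ?
    # Look for a run of C-1 consecutive Lcp entries, all >= L.
    run = 0
    for v in Lcp:
        run = run + 1 if v >= L else 0
        if run >= C - 1:
            return True
    return C <= 1

T = "mississippi#"
SA = suffix_array(T)
Lcp = lcp_array(T, SA)            # [0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3]
print(has_repeat_of_length(Lcp, 4), has_substring_occurring(Lcp, 3, 3))   # True False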


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
space

time

uncompr

260Mb

---

tgz

12%

2 mins

THIS

8%

16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

gcc size
total
27288
gzip
7563
zdelta
227
rsync
964

emacs size
27326
8577
1431
4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
#

0

s

i

1
ssi

ppi#

4

#

1

p

i

3

2

1

ppi#

6

pi#

7

8
5

T# = mississippi#
2 4 6 8 10

si

ppi#

i#

ppi#

11

mississippi#

12

2

1 10

9

4

3

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
5

Q(N2) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Q(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

T   = mississippi#
SA  = 12 11 8 5 2 1 10 9 7 4 6 3
Lcp =  0  0 1 4 0 0  1 0 2 1 3
(e.g. the entry 4 is the length of the common prefix "issi" of the adjacent suffixes issippi# and ississippi#)

• How long is the common prefix between T[i,...] and T[j,...] ?
  Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  Search for Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  Search for a run Lcp[i,i+C-2] whose entries are all ≥ L.


Slide 168

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)

This is at least 10⁴ * f/(1+f)

If we fetch B ≈ 4Kb in time C, and the algorithm uses all of it:
(1/B) * (p * f/(1+f) * C)  ≈  30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
[Figure: the memory hierarchy. CPU registers; L1/L2 caches (few MBs, some nanosecs, few words fetched per access); RAM (few GBs, tens of nanosecs, some words fetched); disk (few TBs, few millisecs, B = 32K pages); network (many TBs, even secs, packets).]

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
n (input size)   4K    8K    16K    32K    128K   256K   512K   1M
n³ algorithm     22s   3m    26m    3.5h   28h    --     --     --
n² algorithm     0     0     0      1s     26s    106s   7m     28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm

  sum = 0; max = -1;
  For i = 1,...,n do
      If (sum + A[i] ≤ 0) then sum = 0;
      else { sum += A[i];  max = MAX{max, sum}; }

Note:
• Sum < 0 when OPT starts;
• Sum > 0 within OPT
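For concreteness, a short Python sketch of the linear scan above (variable names follow the slide; it also reports the endpoints of the best window, which is what the stock-performance question asks for). It assumes, as the slide does, that the optimum has positive sum.

def max_subarray(A):
    # one left-to-right scan: O(n) time, O(1) space
    best, best_range = -1, None
    s, start = 0, 0
    for i, x in enumerate(A):
        if s + x <= 0:          # the running sum cannot help any optimum starting here
            s, start = 0, i + 1
        else:
            s += x
            if s > best:
                best, best_range = s, (start, i)
    return best, best_range

A = [2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]
print(max_subarray(A))          # (12, (2, 6)): the subarray 6 1 -2 4 3 sums to 12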

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort ⇒ Θ(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 10⁹ random I/Os = 10⁹ * 5ms  ⇒  ≈ 2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01  if (i < j) then
02      m = (i+j)/2;            // Divide
03      Merge-Sort(A,i,m);      // Conquer
04      Merge-Sort(A,m+1,j);
05      Merge(A,i,m,j)          // Combine

Cost of Mergesort on large data
• Take Wikipedia in Italian, compute word freq: n = 10⁹ tuples ⇒ few Gbs
• Typical disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:
• It is an indirect sort: Θ(n log2 n) random I/Os
• [5ms] * n log2 n  ≈  1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree
[Figure: the recursion tree of binary Merge-Sort on a small example array; there are log2 N levels, and run size doubles at every level.]

If the run-size is larger than B (i.e. after the first step!!), fetching all of it in memory for merging does not help.
How do we deploy the disk/memory features?
• Produce N/M runs, each sorted in internal memory (no I/Os)
• I/O-cost for merging them pairwise is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort
The key is to balance run-size and #runs to merge.
Sort N items with main memory M and disk-pages of B items:
• Pass 1: produce N/M sorted runs.
• Pass i: merge X = M/B runs at a time  ⇒  log_{M/B} (N/M) merge passes.
[Figure: X input buffers (one per run) and one output buffer, each of B items, kept in main memory; the runs and the merged output reside on disk.]

Multiway Merging
[Figure: buffers Bf1, ..., Bfx, one per run (X = M/B), each with a pointer p1, ..., pX to its current item, plus an output buffer Bfo. At every step the minimum min(Bf1[p1], Bf2[p2], ..., Bfx[pX]) is moved to Bfo; when pi reaches B the next page of run i is fetched; when Bfo is full it is flushed to the merged output run; the merge stops at the EOF of all runs.]
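For concreteness, a small Python sketch of the X-way merge step above, using a heap of current items, one per run. In a real external-memory setting each run would be read page by page (B items at a time) and the output buffer flushed when full; the buffering is omitted here.

import heapq

def multiway_merge(runs):
    # merge X sorted runs into one sorted output run
    iters = [iter(r) for r in runs]
    heap = []
    for idx, it in enumerate(iters):
        first = next(it, None)
        if first is not None:
            heap.append((first, idx))
    heapq.heapify(heap)                   # holds min(Bf1[p1], ..., Bfx[pX])
    out = []
    while heap:
        val, idx = heapq.heappop(heap)    # emit the overall minimum
        out.append(val)
        nxt = next(iters[idx], None)      # advance that run's pointer
        if nxt is not None:
            heapq.heappush(heap, (nxt, idx))
    return out

print(multiway_merge([[1, 2, 5, 10], [2, 7, 9, 13], [3, 4, 8, 19]]))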

Cost of Multi-way Merge-Sort
• Number of passes = log_{M/B} #runs  ≈  log_{M/B} (N/M)
• Optimal cost = Θ( (N/B) log_{M/B} (N/M) ) I/Os

In practice
• M/B ≈ 1000  ⇒  #passes = log_{M/B} (N/M) ≈ 1
• One multiway merge ⇒ 2 passes = few mins (tuning depends on disk features)
• Large fan-out (M/B) decreases #passes
• Compression would decrease the cost of a pass!

Can compression help?
Goal: enlarge M and reduce N
• #passes = O( log_{M/B} (N/M) )
• Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm
• Use a pair of variables <X,C>
• For each item s of the stream:
      if (X == s) then C++
      else { C--; if (C == 0) { X = s; C = 1; } }
• Return X;

Proof
Problems arise if the mode occurs ≤ N/2 times.
If X ≠ y at the end, then every one of y's occurrences has a "negative" mate, hence the mates are at least #occ(y) many; so N ≥ 2 * #occ(y) > N, a contradiction.
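A minimal Python sketch of the two-variable scan above (written in the classic Boyer-Moore majority-vote form, which gives the same guarantee): the returned X is the mode only if the mode really occurs more than N/2 times.

def majority_candidate(stream):
    # one pass, two variables; correct whenever some item occurs > N/2 times
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1
        elif X == s:
            C += 1
        else:
            C -= 1
    return X

A = list("bacccdcbaaaccbccc")        # the stream of the slide: 'c' occurs 9 times out of 17
print(majority_candidate(A))         # c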

Toy problem #4: Indexing
Consider the following TREC collection:
• N = 6 * 10⁹ chars  ⇒  size = 6Gb
• n = 10⁶ documents
• TotT = 10⁹ total terms (avg term length is 6 chars)
• t = 5 * 10⁵ distinct terms

What kind of data structure should we build to support word-based searches ?

Solution 1: Term-Doc matrix   (t = 500K terms,  n = 1 million documents)

            Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony             1                1             0          0       0        1
Brutus             1                1             0          1       0        0
Caesar             1                1             0          1       1        1
Calpurnia          0                1             0          0       0        0
Cleopatra          1                0             0          0       0        0
mercy              1                0             1          1       1        1
worser             1                0             1          1       1        0

1 if the play contains the word, 0 otherwise.
Space is 500Gb !

Solution 2: Inverted index
Brutus     →  2 4 8 16 32 64 128
Calpurnia  →  1 2 3 5 8 13 21 34
Caesar     →  13 16

We can still do better, i.e. down to 30-50% of the original text:
1. Typically about 12 bytes are used per posting
2. We have 10⁹ total terms  ⇒  at least 12Gb of space
3. Compressing the 6Gb of documents gets 1.5Gb of data
Better index, but yet it is >10 times the text !!!!

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM into fewer bits ?
NO, they are 2^n, but the shorter bit-strings available are fewer:

    ∑_{i=1}^{n-1} 2^i  =  2^n - 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the self-information of s is:

    i(s) = log2 (1/p(s)) = -log2 p(s)

Lower probability  ⇒  higher information

Entropy is the weighted average of i(s):

    H(S) = ∑_{s ∈ S} p(s) · log2 (1/p(s))    bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
    La(C) = ∑_{s ∈ S} p(s) · L[s]

We say that a prefix code C is optimal if for all prefix codes C’, La(C) ≤ La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal uniquely decodable code, there exists a prefix code with the same symbol lengths, and thus the same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix code for a source with probabilities {p1, …, pn},
then pi < pj  ⇒  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any uniquely decodable code C, we have

    H(S) ≤ La(C)

Theorem (upper bound, Shannon). For any probability distribution, there exists a prefix code C such that

    La(C) ≤ H(S) + 1

(The Shannon code takes ⌈log2 1/p(s)⌉ bits for symbol s.)

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5
[Figure: the Huffman tree — a(.1) and b(.2) merge into (.3); (.3) and c(.2) merge into (.5); (.5) and d(.5) merge into (1).]
a = 000, b = 001, c = 01, d = 1
There are 2^(n-1) “equivalent” Huffman trees
What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: emit the root-to-leaf path leading to the symbol to be encoded.
Decoding: start at the root and take a branch for each bit received; when at a leaf, output its symbol and return to the root.

abc...  →  00000101...          101001...  →  dcb...
[Figure: the Huffman tree of the running example (a=000, b=001, c=01, d=1), used in both directions.]
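A compact Python sketch of Huffman construction for the running example (an illustration, not the canonical-Huffman variant discussed a few slides later); ties are broken arbitrarily, so it returns one of the equivalent optimal codes.

import heapq

def huffman_codes(probs):
    # returns {symbol: bitstring}; the two least-probable trees are merged repeatedly
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(sorted(probs.items()))]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, tie, merged))
        tie += 1
    return heap[0][2]

code = huffman_codes({'a': .1, 'b': .2, 'c': .2, 'd': .5})
print(code)   # codeword lengths 3,3,2,1: equivalent to a=000, b=001, c=01, d=1
print("".join(code[s] for s in "abc"))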

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. Its self-information is

    -log2(.999) ≈ .00144 bits

If we were to send 1000 such symbols we might hope to use about 1000 * .00144 ≈ 1.44 bits.
Using Huffman, we take at least 1 bit per symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
[Figure: the word-based tagged Huffman tree over T = “bzip or not bzip” and the resulting byte-aligned compressed text C(T); codewords are sequences of bytes, each byte carries 7 bits of the Huffman code plus 1 tag bit marking whether it starts a codeword.]

CGrep and other ideas...
P = bzip = 1a 0b
[Figure: a GREP for the byte-aligned codeword of “bzip” run directly over C(T) finds its occurrences (yes/no at each codeword boundary).]
Speed ≈ Compression ratio

You find it under my Software projects.

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary = {bzip, not, or, space};  S = “bzip or not bzip”;  P = bzip = 1a 0b
[Figure: the tagged word-based Huffman tree of the dictionary and the compressed text C(S); the query asks whether and where the codeword of P occurs in C(S).]
Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
[Figure: a text T = … A B C A B D A B … with the pattern P = A B aligned below it.]
 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons
• Strings are also numbers, H: strings → numbers.
• Let s be a string of length m:

      H(s) = ∑_{i=1}^{m} 2^(m-i) · s[i]

P = 0101
H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5
s = s’ if and only if H(s) = H(s’)

Definition: let Tr denote the m-length substring of T starting at position r (i.e., Tr = T[r, r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons
• We can compute H(Tr) from H(Tr-1):

      H(Tr) = 2·H(Tr-1) - 2^m·T[r-1] + T[r+m-1]

T = 10110101
T1 = 1 0 1 1,  T2 = 0 1 1 0
H(T1) = H(1011) = 11
H(T2) = H(0110) = 2·11 - 2⁴·1 + 0 = 22 - 16 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P = 101111, q = 7
H(P) = 47,  Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally (reducing mod q at every step):
    1·2 + 0 = 2 (mod 7)
    2·2 + 1 = 5 (mod 7)
    5·2 + 1 = 4 (mod 7)
    4·2 + 1 = 2 (mod 7)
    2·2 + 1 = 5 (mod 7)  =  Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1), since 2^m (mod q) = 2·(2^(m-1) (mod q)) (mod q).
Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
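A minimal Python sketch of the fingerprint scan, over the binary alphabet used in the slides; here q is a fixed prime rather than a random one, and every fingerprint match is verified explicitly (the deterministic flavour of the algorithm).

def karp_rabin(T, P, q=2**31 - 1):
    m = len(P)
    Hp = 0
    for c in P:                        # Hq(P) by Horner's rule, mod q
        Hp = (2 * Hp + int(c)) % q
    Ht, pow_m = 0, pow(2, m, q)        # rolling fingerprint and 2^m mod q
    occ = []
    for r, c in enumerate(T):
        Ht = (2 * Ht + int(c)) % q
        if r >= m:                     # drop the bit that leaves the window
            Ht = (Ht - pow_m * int(T[r - m])) % q
        if r >= m - 1 and Ht == Hp and T[r - m + 1:r + 1] == P:
            occ.append(r - m + 2)      # verified match, 1-based starting position
    return occ

print(karp_rabin("10110101", "0101"))  # [5], as in the example above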

Problem 1: Solution
Dictionary = {bzip, not, or, space};  S = “bzip or not bzip”;  P = bzip = 1a 0b
[Figure: the compressed text C(S); scanning it for the byte-aligned, tagged codeword of “bzip” reports the two occurrences (yes/no at each codeword boundary).]
Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

[Figure: the m-by-n matrix M for T = california and P = for; the only 1 in row 3 appears at column j = 7, where an occurrence of “for” ends.]

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M
• We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one.
• We define the m-length binary vector U(x) for each character x in the alphabet: U(x) is set to 1 at the positions in P where character x appears.

Example: P = abaac
    U(a) = [1,0,1,1,0]      U(b) = [0,1,0,0,0]      U(c) = [0,0,0,0,1]
How to construct M
• Initialize column 0 of M to all zeros
• For j > 0, the j-th column is obtained as

      M(j) = BitShift( M(j-1) ) & U( T[j] )

For i > 1, entry M(i,j) = 1 iff
(1) the first i-1 characters of P match the i-1 characters of T ending at character j-1   ⇔  M(i-1, j-1) = 1
(2) P[i] = T[j]   ⇔  the i-th bit of U(T[j]) is 1

BitShift moves bit M(i-1, j-1) into the i-th position; ANDing it with the i-th bit of U(T[j]) establishes whether both conditions hold.

An example (j = 1, 2, 3, ..., 9)
T = xabxabaaca,  P = abaac
[Figure: the columns M(1), ..., M(9), each obtained as M(j) = BitShift(M(j-1)) & U(T[j]); at j = 9 the 5th (last) bit of the column is 1, so an occurrence of P ends at position 9 of T.]

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.
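A short Python sketch of the Shift-And scan just analyzed, keeping each column of M as an integer so that bit i of the word plays the role of row i+1; the function names are mine, the method is the one described above.

def shift_and(T, P):
    m = len(P)
    U = {}                               # U[x]: bit i set iff P[i] == x
    for i, x in enumerate(P):
        U[x] = U.get(x, 0) | (1 << i)
    M, occ = 0, []
    for j, c in enumerate(T):
        M = ((M << 1) | 1) & U.get(c, 0) # M(j) = BitShift(M(j-1)) & U(T[j])
        if M & (1 << (m - 1)):           # last row set: an occurrence ends at j
            occ.append(j - m + 2)        # 1-based starting position
    return occ

print(shift_and("xabxabaaca", "abaac"))  # [5]: P ends at position 9 of T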

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: Another solution
Dictionary = {bzip, not, or, space};  S = “bzip or not bzip”;  P = bzip = 1a 0b
[Figure: the dictionary tree and C(S); the pattern codeword is searched directly in the compressed byte stream.]
Speed ≈ Compression ratio

Problem 2
Given a pattern P, find all the occurrences in S of all terms containing P as a substring.
Dictionary = {bzip, not, or, space};  S = “bzip or not bzip”;  P = o
[Figure: the dictionary tree and C(S); the terms containing “o” are not (= 1g 0g 0a) and or (= 1g 0a 0b), and their codewords are then searched in C(S).]
Speed ≈ Compression ratio?  No! Why?
A scan of C(S) is needed for each term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1, P2, ..., Pl} of total length m, we want to find all the occurrences of those patterns in the text T[1,n].
[Figure: a text T with occurrences of P1 and P2 marked.]

• Naïve solution: run an (optimal) exact-matching algorithm for each pattern in P.
  Complexity: O(nl + m) time; not good with many patterns.
• Optimal solution due to Aho and Corasick.
  Complexity: O(n + l + m) time.

A simple extension of Shift-And
• S is the concatenation of the patterns in P
• R is a bitmap of length m: R[i] = 1 iff S[i] is the first symbol of a pattern
• Use a variant of the Shift-And method searching for S:
  - For any symbol c, U’(c) = U(c) AND R, so U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
  - At any step j: compute M(j), then M(j) OR U’(T[j]). Why? This sets to 1 the first bit of each pattern that starts with T[j].
  - Check if there are occurrences ending in j. How?

Problem 3
Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing at most k mismatches.
Dictionary = {bzip, not, or, space};  S = “bzip or not bzip”;  P = bot, k = 2
[Figure: the dictionary tree and the compressed text C(S).]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff there are no more than l mismatches between the first i characters of P and the i characters of T ending at character j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1
The first i-1 characters of P match a substring of T ending at j-1 with at most l mismatches, and the next pair of characters in P and T are equal:

      BitShift( M^l(j-1) ) & U( T[j] )

Computing Ml: case 2
The first i-1 characters of P match a substring of T ending at j-1 with at most l-1 mismatches (the i-th position is allowed to mismatch):

      BitShift( M^(l-1)(j-1) )

Computing Ml
• We compute M^l for all l = 0, ..., k.
• For each j compute M(j), M¹(j), ..., M^k(j)
• For all l, initialize M^l(0) to the zero vector.
• In order to compute M^l(j), we observe that there is a match iff

      M^l(j) = [ BitShift( M^l(j-1) ) & U( T[j] ) ]  OR  BitShift( M^(l-1)(j-1) )

Example M1
T = xabxabaaca,  P = abaad
[Figure: the 5-by-10 matrices M⁰ and M¹; in particular M¹(5,9) = 1, so P occurs ending at position 9 of T with at most one mismatch.]

How much do we pay?
• The running time is O( k·n·(1 + m/w) )
• Again, the method is practically efficient for small m.
• Only O(k) columns of M are needed at any given time, hence the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing k mismatches.
Dictionary = {bzip, not, or, space};  S = “bzip or not bzip”;  P = bot, k = 2
[Figure: the dictionary tree and C(S); the only term within 2 mismatches of “bot” is not (= 1g 0g 0a), whose codeword is then searched in C(S).]

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Delection: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding
γ(x) = (Length - 1) zeros, followed by x in binary

• x > 0 and Length = ⌊log2 x⌋ + 1
  e.g., 9 is represented as <000, 1001>.
• The γ-code for x takes 2⌊log2 x⌋ + 1 bits  (i.e. a factor of 2 from optimal)
• Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8

6

3

59

7
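A tiny Python sketch of γ-encoding and decoding, matching the definition above; it also reproduces the decoding exercise.

def gamma_encode(x):
    b = bin(x)[2:]                       # binary representation of x > 0
    return "0" * (len(b) - 1) + b        # (Length-1) zeros, then x in binary

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":            # count the leading zeros = Length-1
            z, i = z + 1, i + 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out

print(gamma_encode(9))                                    # 0001001
print(gamma_decode("0001000001100110000011101100111"))    # [8, 6, 3, 59, 7]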

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).
Recall that: |γ(i)| ≤ 2·log i + 1
How good is this approach wrt Huffman?  Compression ratio ≤ 2·H0(s) + 1
Key fact:  1 ≥ ∑_{i=1,...,x} pi ≥ x·px   ⇒   x ≤ 1/px

How good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1
The cost of the encoding is (recall i ≤ 1/pi):

    ∑_{i=1,...,|S|} pi · |γ(i)|   ≤   ∑_{i=1,...,|S|} pi · [ 2·log(1/pi) + 1 ]   =   2·H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
The distribution of words is skewed: 1/i^q, where 1 < q < 2

A new concept: Continuers vs Stoppers
• Previously we used: s = c = 128
• The main idea is: s + c = 256 (we are playing with 8 bits)
  Thus s items are encoded with 1 byte, s·c with 2 bytes, s·c² with 3 bytes, ...

An example
• 5000 distinct words
• ETDC encodes 128 + 128² = 16512 words on at most 2 bytes
• A (230,26)-dense code encodes 230 + 230·26 = 6210 words on at most 2 bytes, hence more words on 1 byte, and thus it wins if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n² log n), MTF = O(n log n) + n²

No much worse than Huffman
...but it may be far better
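A minimal Move-to-Front coder in Python, following the description above; here the initial list is simply the sorted alphabet of the input.

def mtf_encode(s):
    L = sorted(set(s))                 # start with the list of symbols
    out = []
    for c in s:
        i = L.index(c)                 # 1) output the position of c in L
        out.append(i)
        L.insert(0, L.pop(i))          # 2) move c to the front of L
    return out

def mtf_decode(codes, alphabet):
    L, out = sorted(alphabet), []
    for i in codes:
        c = L[i]
        out.append(c)
        L.insert(0, L.pop(i))
    return "".join(out)

s = "ipppssssssmmmii"                  # a locally homogeneous string
codes = mtf_encode(s)
print(codes)                           # runs of equal symbols become runs of 0s
assert mtf_decode(codes, set(s)) == s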

MTF: how good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1
Put the alphabet S at the front of the list and consider the cost of encoding:

    O(|S| log |S|)  +  ∑_{x=1}^{|S|} ∑_{i=2}^{nx} γ( p_i^x - p_{i-1}^x )

where p_1^x < p_2^x < ... are the positions of symbol x and nx its number of occurrences.
By Jensen's inequality:

    ≤  O(|S| log |S|)  +  ∑_{x=1}^{|S|} nx · [ 2·log(N/nx) + 1 ]
    =  O(|S| log |S|)  +  N · [ 2·H0(X) + 1 ]

Hence La[mtf] ≤ 2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploits spatial locality, and it is a dynamic code (there is a memory)
X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.  p(a) = .2, p(b) = .5, p(c) = .3

    f(i) = ∑_{j=1}^{i-1} p(j)

f(a) = .0, f(b) = .2, f(c) = .7
so a gets [0, .2), b gets [.2, .7), c gets [.7, 1.0)

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
[Figure: starting from [0,1), symbol b narrows the interval to [.2,.7), then a narrows it to [.2,.3), then c to [.27,.3).]
The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities p[c] use the following:

    l0 = 0        li = li-1 + si-1 · f[ci]
    s0 = 1        si = si-1 · p[ci]

f[c] is the cumulative prob. up to symbol c (not included)
The final interval size is

    sn = ∏_{i=1}^{n} p[ci]

The interval for a message sequence will be called the sequence interval
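The interval computation above in a few lines of Python, using exact fractions (so this is the ideal coder, not the integer version discussed later); the example reproduces the [.27, .3) interval obtained for the message bac.

from fractions import Fraction as F

p = {'a': F(2, 10), 'b': F(5, 10), 'c': F(3, 10)}
f = {'a': F(0), 'b': F(2, 10), 'c': F(7, 10)}   # cumulative prob., symbol excluded

def sequence_interval(msg):
    l, s = F(0), F(1)                  # l0 = 0, s0 = 1
    for c in msg:
        l = l + s * f[c]               # l_i = l_{i-1} + s_{i-1} * f[c_i]
        s = s * p[c]                   # s_i = s_{i-1} * p[c_i]
    return l, s

l, s = sequence_interval("bac")
print(float(l), float(l + s))          # 0.27 0.3, i.e. the sequence interval [.27, .3)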

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
[Figure: .49 falls in b's interval [.2,.7); within it, in b's sub-interval [.3,.55); within that, in c's sub-interval [.475,.55).]
The message is bbc.

Representing a real number
Binary fractional representation:
    .75 = .11        1/3 = .010101…        11/16 = .1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
    code    min         max         interval
    .11     .11000…     .11111…     [.75, 1.0)
    .101    .10100…     .10111…     [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

    1 + ⌈log (1/s)⌉  =  1 + ⌈log ∏_i (1/pi)⌉
                     ≤  2 + ∑_{i=1,...,n} log (1/pi)
                     =  2 + ∑_{k=1,...,|S|} n·pk · log (1/pk)
                     =  2 + n·H0    bits

In practice it is nH0 + 0.02·n bits, because of rounding.

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R = 2^k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l ≥ R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts    (String = ACCBACCACBA, next symbol B, k = 2)

Context Empty:   A = 4   B = 2   C = 5   $ = 3

Context A:   C = 3   $ = 1
Context B:   A = 2   $ = 1
Context C:   A = 1   B = 2   C = 2   $ = 3

Context AC:  B = 1   C = 2   $ = 2
Context BA:  C = 1   $ = 1
Context CA:  C = 1   $ = 1
Context CB:  A = 2   $ = 1
Context CC:  A = 1   B = 1   $ = 2
You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
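A compact Python LZW encoder following the description above: the dictionary is extended with S·c after each output, and the extra character is not transmitted. For simplicity the dictionary starts from the distinct characters of the input rather than the full ASCII table, and the decoder (with its one-step-behind special case) is omitted.

def lzw_encode(text):
    dic = {c: i for i, c in enumerate(sorted(set(text)))}  # initial single-char entries
    out, S = [], ""
    for c in text:
        if S + c in dic:
            S = S + c                  # extend the current match
        else:
            out.append(dic[S])         # emit the id of the longest match S
            dic[S + c] = len(dic)      # add Sc to the dictionary
            S = c
    out.append(dic[S])
    return out

print(lzw_encode("aabaacababacb"))     # ids of the longest matches found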

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
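A tiny Python sketch of both directions, close to the slides: the forward transform sorts all rotations (the naive way), the inverse uses the LF-mapping and rebuilds T backward.

def bwt(t):
    rot = sorted(t[i:] + t[:i] for i in range(len(t)))   # sort the rotations
    return "".join(r[-1] for r in rot)                    # last column L

def ibwt(L):
    n = len(L)
    # LF[i]: position in F of the occurrence of character L[i]
    # (equal characters keep their relative order)
    F = sorted(range(n), key=lambda i: (L[i], i))
    LF = [0] * n
    for f_pos, l_pos in enumerate(F):
        LF[l_pos] = f_pos
    out, r = [], 0                     # row 0 is the rotation starting with '#'
    for _ in range(n):
        out.append(L[r])               # L[r] precedes F[r] in T
        r = LF[r]
    out.reverse()                      # T was built backward, '#' comes out first
    return "".join(out[1:]) + out[0]   # rotate the sentinel back to the end

L = bwt("mississippi#")
print(L)                               # ipssm#pissii
print(ibwt(L))                         # mississippi#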

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pr. that a node has x links is ∝ 1/x^α, α ≈ 2.1

The In-degree distribution

Altavista crawl, 1999, and WebBase crawl, 2001:
indegree follows a power-law distribution

    Pr[ in-degree(u) = k ]  ∝  1/k^α,    α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pr. that a node has x links is ∝ 1/x^α, α ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)
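To make the gap step concrete, here is a small Python rendition of the successor-gap transformation sketched above (my own illustration, not the WebGraph library code): the first gap s1-x may be negative, so it is folded into a non-negative integer with one common zig-zag convention; the remaining gaps are consecutive differences minus one. The adjacency list used below is hypothetical.

def to_nat(z):
    # fold integers into naturals: 0,-1,1,-2,2,... -> 0,1,2,3,4,...
    return 2 * z if z >= 0 else 2 * abs(z) - 1

def from_nat(n):
    return n // 2 if n % 2 == 0 else -(n + 1) // 2

def encode_successors(x, succ):
    # S(x) = { v(s1-x), s2-s1-1, ..., sk-s(k-1)-1 } for a sorted successor list
    gaps = [to_nat(succ[0] - x)]
    gaps += [succ[i] - succ[i - 1] - 1 for i in range(1, len(succ))]
    return gaps

def decode_successors(x, gaps):
    succ = [x + from_nat(gaps[0])]
    for g in gaps[1:]:
        succ.append(succ[-1] + g + 1)
    return succ

succ = [13, 15, 16, 17, 18, 19, 23, 24, 203]      # hypothetical successors of node 15
print(encode_successors(15, succ))                 # small numbers, thanks to locality
assert decode_successors(15, encode_successors(15, succ)) == succ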

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides and efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
Emacs size

Emacs time

uncompr

27Mb

---

gzip

8Mb

35 secs

zdelta

1.5Mb

42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

0

620

2

20

123

2000

20
220

3

1
5

space

time

uncompr

30Mb

---

tgz

20%

linear

THIS

8%

quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
space

time

uncompr

260Mb

---

tgz

12%

2 mins

THIS

8%

16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

gcc size
total
27288
gzip
7563
zdelta
227
rsync
964

emacs size
27326
8577
1431
4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
#

0

s

i

1
ssi

ppi#

4

#

1

p

i

3

2

1

ppi#

6

pi#

7

8
5

T# = mississippi#
2 4 6 8 10

si

ppi#

i#

ppi#

11

mississippi#

12

2

1 10

9

4

3

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
5

Q(N2) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Q(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest common prefix between suffixes that are adjacent in SA

T = mississippi#

  SA | SUF(T)        | Lcp with next
  ---+---------------+--------------
  12 | #             | 0
  11 | i#            | 1
   8 | ippi#         | 1
   5 | issippi#      | 4
   2 | ississippi#   | 0
   1 | mississippi#  | 0
  10 | pi#           | 1
   9 | ppi#          | 0
   7 | sippi#        | 2
   4 | sissippi#     | 1
   6 | ssippi#       | 3
   3 | ssissippi#    | -

(e.g. Lcp = 4 between the adjacent suffixes issippi# and ississippi#)

• How long is the common prefix between T[i,...] and T[j,...]?
  Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L?
  Search for an entry Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times?
  Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.

A sketch of these queries follows below.
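
A minimal sketch of the Lcp array and of the three queries above (the Θ(N²)-time Lcp construction and the linear sa.index scans are for clarity only; Kasai's algorithm and a range-minimum structure would make this efficient).

def lcp_array(t, sa):
    """Lcp[i] = length of the common prefix of the suffixes starting at sa[i] and sa[i+1]."""
    def lcp(i, j):
        a, b = t[i - 1:], t[j - 1:]
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        return k
    return [lcp(sa[i], sa[i + 1]) for i in range(len(sa) - 1)]

def common_prefix_len(sa, lcp, i, j):
    """Length of the common prefix of T[i,...] and T[j,...]: min of Lcp on the SA range between them."""
    h, k = sorted((sa.index(i), sa.index(j)))
    return min(lcp[h:k])

def has_repeat(lcp, L):
    """Is there a substring of length >= L occurring at least twice?"""
    return any(v >= L for v in lcp)

def has_frequent_repeat(lcp, L, C):
    """Is there a substring of length >= L occurring at least C times?
    True iff some window of C-1 consecutive Lcp entries are all >= L."""
    w = C - 1
    return any(all(v >= L for v in lcp[i:i + w]) for i in range(len(lcp) - w + 1))

T = "mississippi#"
SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
LCP = lcp_array(T, SA)
print(LCP)                                 # [0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3]
print(common_prefix_len(SA, LCP, 5, 2))    # 4   (issippi... vs ississippi...)
print(has_repeat(LCP, 4))                  # True  ("issi" repeats)
print(has_frequent_repeat(LCP, 1, 4))      # True  ("i" and "s" occur at least 4 times)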


Slide 170

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)

This is at least 10^4 * f/(1+f).

If we fetch B ≈ 4Kb in time C, and the algorithm uses all of it:
  (1/B) * (p * f/(1+f) * C) ≈ 30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU (registers) ↔ L1/L2 cache ↔ RAM ↔ HD ↔ net

  Cache:  few Mbs,  some nanosecs,     few words fetched
  RAM:    few Gbs,  tens of nanosecs,  some words fetched
  HD:     few Tbs,  few millisecs,     B = 32K page
  net:    many Tbs, even secs,         packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray

Goal: Given a stock and its D-performance over time, find the time window in which it achieved the best "market performance".
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Running times of the brute-force algorithms on inputs of increasing size:

  n    |  4K | 8K | 16K | 32K  | 128K | 256K | 512K | 1M
  n^3  | 22s | 3m | 26m | 3.5h | 28h  |  --  |  --  | --
  n^2  |  0  |  0 |  0  |  1s  | 26s  | 106s |  7m  | 28m

An optimal solution
We assume every subsum ≠ 0.

[Figure: A is split into a prefix whose running sum is < 0, followed by the Optimum subarray, whose running sums are > 0.]

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm
  sum = 0; max = -1;
  For i = 1,...,n do
    if (sum + A[i] ≤ 0) then sum = 0;
    else { sum += A[i]; max = MAX{max, sum}; }

Note:
• Sum < 0 when OPT starts;
• Sum > 0 within OPT
(a runnable sketch follows below)
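A runnable version of the scanning algorithm above (a sketch, assuming — as the slide does — that the array contains at least one positive entry).

  def max_subarray_sum(A):
      # one-pass scan: reset the running sum when it would drop to <= 0,
      # and keep the best sum seen so far
      best, run = A[0], 0
      for x in A:
          if run + x <= 0:
              run = 0                  # the optimum cannot start inside a <=0 prefix
          else:
              run += x
              best = max(best, run)
      return best

  A = [2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]
  print(max_subarray_sum(A))           # 12, i.e. the subarray 6 1 -2 4 3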

Toy problem #2: sorting

How to sort tuples (objects) on disk?
[Figure: array A of pointers into the memory area containing the tuples.]

Key observation:
• Array A is an "array of pointers to objects"
• For each object-to-object comparison A[i] vs A[j]: 2 random accesses to the memory locations pointed by A[i] and A[j]
⇒ MergeSort performs Θ(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB
• n insertions ⇒ data get distributed arbitrarily over the B-tree leaves ("tuple pointers") !!!
• What about listing the tuples in order ?
⇒ Possibly 10^9 random I/Os = 10^9 * 5ms ≈ 2 months

Binary Merge-Sort

  Merge-Sort(A,i,j)
  01  if (i < j) then
  02    m = (i+j)/2;           // Divide
  03    Merge-Sort(A,i,m);     // Conquer
  04    Merge-Sort(A,m+1,j);   // Conquer
  05    Merge(A,i,m,j)         // Combine

Cost of Mergesort on large data

Take Wikipedia in Italian, compute word frequencies:
• n = 10^9 tuples ⇒ few Gbs
• Typical disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:
• It is an indirect sort: Θ(n log2 n) random I/Os
• [5ms] * n log2 n ≈ 1.5 years

In practice it is faster because of caching, and because each level of recursion can be done with 2 sequential passes (R/W) over the data.

Merge-Sort Recursion Tree

[Figure: recursion tree of binary mergesort over N keys — log2 N levels, each level pairwise-merges sorted runs of doubling length.]

If the run-size is larger than B (i.e. after the first step!!), fetching all of it in memory for merging does not help.

How do we deploy the disk/memory features?

• With internal memory of size M ⇒ N/M runs, each sorted in internal memory (no I/Os)
• I/O-cost for merging the runs pairwise is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort

• The key is to balance run-size and #runs to merge
• Sort N items with main memory M and disk-pages of B items:
  • Pass 1: produce N/M sorted runs.
  • Pass i: merge X = M/B runs ⇒ log_{M/B}(N/M) merging passes

[Figure: X = M/B input buffers (one per run) and one output buffer, each of B items, kept in main memory; the runs are read from disk and the merged run is written back to disk.]

Multiway Merging
Each run i is read through an input buffer Bf_i with a pointer p_i; the output buffer is Bf_o with pointer p_o.
• Repeatedly move min(Bf_1[p_1], Bf_2[p_2], …, Bf_X[p_X]) to Bf_o and advance the corresponding pointer.
• Fetch the next page of run i when p_i reaches B ("Fetch, if p_i = B").
• Flush Bf_o to the output file (the merged run) when it is full ("Flush, if Bf_o full").
• Stop when all X = M/B runs reach EOF.
(A sketch of this loop follows below.)
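An in-memory Python sketch of the multiway merging loop above, with a min-heap playing the role of the min(Bf_1[p_1], …, Bf_X[p_X]) selection; real external-memory code would additionally manage the B-item buffers and the disk fetches/flushes.

  import heapq

  def multiway_merge(runs):
      # merge X sorted runs; each heap entry carries (front value, run index, position)
      heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
      heapq.heapify(heap)
      out = []
      while heap:
          val, i, j = heapq.heappop(heap)     # min over the current fronts
          out.append(val)                     # goes to the output buffer
          if j + 1 < len(runs[i]):            # advance p_i within run i
              heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
      return out

  runs = [[1, 2, 5, 10], [2, 7, 9, 19], [3, 4, 8, 13]]
  print(multiway_merge(runs))   # [1, 2, 2, 3, 4, 5, 7, 8, 9, 10, 13, 19]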

Cost of Multi-way Merge-Sort

• Number of passes = log_{M/B} #runs ≈ log_{M/B}(N/M)
• Optimal cost = Θ((N/B) log_{M/B}(N/M)) I/Os

In practice
• M/B ≈ 1000 ⇒ #passes = log_{M/B}(N/M) ≈ 1
• One multiway merge ⇒ 2 passes = few mins (tuning depends on disk features)

⇒ A large fan-out (M/B) decreases #passes
⇒ Compression would decrease the cost of a pass!

May compression help?
• Goal: enlarge M and reduce N
• #passes = O(log_{M/B}(N/M))
• Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements

• Goal: Top queries over a stream of N items (S large).
• Math Problem: Find the item y whose frequency is > N/2, using the smallest space (i.e. assuming the mode occurs > N/2 times).

A = b a c c c d c b a a a c c b c c c

Algorithm
• Use a pair of variables <X, C>
• For each item s of the stream:
    if (X == s) then C++;
    else { C--; if (C == 0) { X = s; C = 1; } }
• Return X;
(a runnable sketch follows below)
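A runnable sketch of the two-variable scan above; the initialization of X and C (not spelled out in the slide) is handled by treating C = 0 as "adopt the current item as the candidate".

  def majority_candidate(stream):
      # one candidate X and one counter C; if some item occurs > N/2 times,
      # it is guaranteed to be the returned candidate
      X, C = None, 0
      for s in stream:
          if C == 0:
              X, C = s, 1
          elif X == s:
              C += 1
          else:
              C -= 1
      return X

  A = list("bacccdcbaaaccbccc")
  print(majority_candidate(A))   # 'c' (9 occurrences out of 17, i.e. > N/2)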

Proof
(The guarantee fails if the mode occurs ≤ N/2 times.)
Suppose the returned X ≠ y. Then every one of y's occurrences has a "negative" mate, i.e. a distinct stream item that cancelled it, so the non-y items are ≥ #occ(y). Hence N ≥ 2 * #occ(y), contradicting #occ(y) > N/2.

Toy problem #4: Indexing

Consider the following TREC collection:
• N = 6 * 10^9 characters ⇒ size = 6Gb
• n = 10^6 documents
• TotT = 10^9 term occurrences (avg term length is 6 chars)
• t = 5 * 10^5 distinct terms

What kind of data structure should we build to support word-based searches ?

Solution 1: Term-Doc matrix   (n = 1 million docs, t = 500K terms)

             Antony&Cleopatra  JuliusCaesar  TheTempest  Hamlet  Othello  Macbeth
  Antony            1               1            0         0        0        1
  Brutus            1               1            0         1        0        0
  Caesar            1               1            0         1        1        1
  Calpurnia         0               1            0         0        0        0
  Cleopatra         1               0            0         0        0        0
  mercy             1               0            1         1        1        1
  worser            1               0            1         1        1        0

(entry is 1 if the play contains the word, 0 otherwise)

Space is 500Gb !

Solution 2: Inverted index

  Brutus    → 2 4 8 16 32 64 128
  Calpurnia → 1 2 3 5 8 13 21 34
  Caesar    → 13 16

We can still do better, i.e. reach 30-50% of the original text:
1. Typically each posting uses about 12 bytes
2. We have 10^9 total term occurrences ⇒ at least 12Gb of space
3. Compressing the 6Gb of documents gets 1.5Gb of data
⇒ A better index, but yet it is >10 times the (compressed) text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string is (losslessly) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL of them into fewer bits ?
NO: they are 2^n, but the shorter compressed messages are fewer:

  ∑_{i=1}^{n-1} 2^i = 2^n − 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the self-information of s is:

  i(s) = log2 (1 / p(s)) = − log2 p(s)

Lower probability ⇒ higher information.

Entropy is the weighted average of i(s):

  H(S) = ∑_{s ∈ S} p(s) · log2 (1 / p(s))   bits
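A tiny Python helper (not in the slides) that computes the empirical H(S) of a string, just to make the formula above concrete.

  from collections import Counter
  from math import log2

  def H0(s):
      # empirical (zeroth-order) entropy: sum over symbols of p(s) * log2(1/p(s))
      n = len(s)
      return sum((c / n) * log2(n / c) for c in Counter(s).values())

  print(H0("mississippi"))   # about 1.823 bits per symbol
  print(H0("aaaaaaaab"))     # a skewed string has much lower entropy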

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable-length code in which no codeword is a prefix of another one
e.g. a = 0, b = 100, c = 101, d = 11
It can be viewed as a binary trie whose leaves are the symbols (left branch = 0, right branch = 1).

Average Length
For a code C with codeword lengths L[s], the average length is defined as

  La(C) = ∑_{s ∈ S} p(s) · L[s]

We say that a prefix code C is optimal if for all prefix codes C', La(C) ≤ La(C').

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal uniquely decodable code, there exists a prefix code with the same codeword lengths and thus the same (optimal) average length. And vice versa…

Theorem (golden rule). If C is an optimal prefix code for a source with probabilities {p1, …, pn}, then pi < pj ⇒ L[si] ≥ L[sj].

Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any uniquely decodable code C, we have
  H(S) ≤ La(C)
Theorem (upper bound, Shannon). For any probability distribution, there exists a prefix code C such that
  La(C) ≤ H(S) + 1
(the Shannon code assigns to symbol s a codeword of ⌈log2 1/p(s)⌉ bits)

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

[Figure: Huffman tree built bottom-up — merge a(.1)+b(.2) into (.3), then (.3)+c(.2) into (.5), then (.5)+d(.5) into (1).]

a = 000, b = 001, c = 01, d = 1

There are 2^(n-1) "equivalent" Huffman trees (flip the 0/1 labels at any internal node).

What about ties (and thus, tree depth) ?
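A compact Python sketch (not in the slides) that builds a Huffman code with a min-heap; ties may be broken differently from the slide's figure, so the actual bits can differ, but the codeword lengths (and hence La) are the same.

  import heapq
  from itertools import count

  def huffman_code(probs):
      tiebreak = count()                       # keeps heap comparisons well defined
      heap = [(p, next(tiebreak), sym) for sym, p in probs.items()]
      heapq.heapify(heap)
      while len(heap) > 1:
          p1, _, left = heapq.heappop(heap)    # merge the two least-probable nodes
          p2, _, right = heapq.heappop(heap)
          heapq.heappush(heap, (p1 + p2, next(tiebreak), (left, right)))
      code = {}
      def assign(node, prefix):
          if isinstance(node, tuple):          # internal node: recurse on both children
              assign(node[0], prefix + "0")
              assign(node[1], prefix + "1")
          else:
              code[node] = prefix or "0"
      assign(heap[0][2], "")
      return code

  print(huffman_code({"a": .1, "b": .2, "c": .2, "d": .5}))
  # codeword lengths: a,b -> 3 bits, c -> 2 bits, d -> 1 bit, as in the slide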

Encoding and Decoding
Encoding: emit the root-to-leaf path leading to the symbol to be encoded.
Decoding: start at the root and take the branch for each bit received; when at a leaf, output its symbol and return to the root.

  abc…  → 000 001 01 = 00000101…
  101001… → d c b …

[Figure: the Huffman tree of the running example, with a=000, b=001, c=01, d=1.]

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman
Model size may be large.
Huffman codes can be made succinct in the representation of the codeword tree, and fast in (de)coding: the Canonical Huffman tree.

We store, for any level L of the tree:
• firstcode[L]   (e.g. = 00.....0 at the deepest level)
• Symbol[L,i], for each i in level L

This takes ≤ h² + |S| log |S| bits (h = tree height).

Canonical Huffman: Encoding
[Figure: codeword assignment over levels 1…5 of the canonical tree.]

Canonical Huffman: Decoding
  firstcode[1] = 2
  firstcode[2] = 1
  firstcode[3] = 1
  firstcode[4] = 2
  firstcode[5] = 0

  T = ...00010...
(read bits and descend level by level, comparing the value read so far against firstcode[level])

Problem with Huffman Coding
Consider a symbol with probability .999. Its self-information is
  −log2(.999) ≈ .00144 bits
If we were to send 1000 such symbols we might hope to use 1000 · .00144 ≈ 1.44 bits.
Using Huffman, we take at least 1 bit per symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
• ≤ 1 extra bit per macro-symbol ⇒ ≤ 1/k extra bits per symbol
• Larger model to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:
• The model takes |S|^k · (k · log |S|) + h² bits (where h might be |S|)
• It is H0(S^L) ≤ L · Hk(S) + O(k · log |S|), for each k ≤ L

Compress + Search ?   [Moura et al, 98]

Compressed text derived from a word-based Huffman:
• Symbols of the Huffman tree are the words of T
• The Huffman tree has fan-out 128
• Codewords are byte-aligned and tagged: 7 bits of Huffman code per byte, plus 1 tag bit marking the first byte of each codeword

[Figure: the 128-ary tagged-Huffman tree over the words of T = "bzip or not bzip" (bzip, or, not, space), and the resulting byte-aligned codeword stream C(T).]
CGrep and other ideas...
Search for P = bzip (codeword 1a 0b) directly in the compressed text C(T), T = "bzip or not bzip", with a GREP-like scan: the pattern's codeword is compared against the byte-aligned, tagged codewords of C(T), answering yes/no at each codeword boundary.

[Figure: the tagged-Huffman tree and the scan of C(T) matching the codeword of "bzip".]

Speed ≈ Compression ratio

You find this at — see under my Software projects.

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Given the dictionary {bzip, not, or, space} and the compressed text C(S), S = "bzip or not bzip", search for the codeword of P = bzip (= 1a 0b) directly in C(S).

[Figure: the tagged-Huffman tree of the dictionary and the scan of C(S), answering yes/no at each codeword.]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the pattern P[1,m] in the text T[1,n].

[Figure: pattern P = AB aligned against text T = ABCABDAB….]

• Naïve solution
  • For any position i of T, check if T[i,i+m-1] = P[1,m]
  • Complexity: O(nm) time

• (Classical) optimal solutions based on comparisons
  • Knuth-Morris-Pratt
  • Boyer-Moore
  • Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons
Strings are also numbers, H: strings → numbers.
Let s be a string of length m:

  H(s) = ∑_{i=1}^{m} 2^{m−i} · s[i]

Example: P = 0101 ⇒ H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5
s = s' if and only if H(s) = H(s').

Definition: let Tr denote the m-length substring of T starting at position r (i.e., Tr = T[r, r+m−1]).

Arithmetic replaces Comparisons
Strings are also numbers, H: strings → numbers.
• Exact match = scan T and compare H(Tr) with H(P)
• There is an occurrence of P starting at position r of T if and only if H(P) = H(Tr)

T = 10110101, P = 0101, H(P) = 5
  T2 = 0110 ⇒ H(T2) = 6 ≠ H(P)
  T5 = 0101 ⇒ H(T5) = 5 = H(P)  ⇒ Match!

Arithmetic replaces Comparisons
We can compute H(Tr) from H(Tr−1):

  H(Tr) = 2·H(Tr−1) − 2^m·T(r−1) + T(r+m−1)

T = 10110101:
  T1 = 1011 ⇒ H(T1) = H(1011) = 11
  T2 = 0110 ⇒ H(T2) = H(0110) = 2·11 − 2⁴·1 + 0 = 22 − 16 + 0 = 6

Arithmetic replaces Comparisons
A simple efficient algorithm:
• Compute H(P) and H(T1)
• Run over T: compute H(Tr) from H(Tr−1) in constant time, and make the comparison H(P) = H(Tr).

Total running time O(n+m)?
NO! Why? When m is large, it is unreasonable to assume that each arithmetic operation can be done in O(1) time: values of H() are m-bit numbers, in general too BIG to fit in a machine word.

IDEA! Let's use modular arithmetic:
for some prime q, the Karp-Rabin fingerprint of a string s is defined by Hq(s) = H(s) (mod q).

An example
P = 101111, q = 7
H(P) = 47, Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally, reducing mod 7 at every step:
  (1·2 + 0) mod 7 = 2
  (2·2 + 1) mod 7 = 5
  (5·2 + 1) mod 7 = 4
  (4·2 + 1) mod 7 = 2
  (2·2 + 1) mod 7 = 5  ⇒  Hq(P) = 5

We can still compute Hq(Tr) from Hq(Tr−1), using
  2^m (mod q) = 2·(2^{m−1} (mod q)) (mod q)
Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm
• Choose a positive integer I
• Pick a random prime q ≤ I, and compute P's fingerprint Hq(P).
• For each position r in T, compute Hq(Tr) and test whether it equals Hq(P). If the numbers are equal, either
  • declare a probable match (randomized algorithm), or
  • check and declare a definite match (deterministic algorithm)

• Running time: excluding verification, O(n+m).
• The randomized algorithm is correct w.h.p.
• The deterministic algorithm has expected running time O(n+m).

Proof on the board
(a sketch of the randomized scan follows below)
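A Python sketch of the Karp-Rabin scan; instead of drawing a random prime ≤ I as in the slide, it just picks one of a few fixed large primes (an assumption), and it verifies every fingerprint hit, so the reported occurrences are always correct.

  import random

  def karp_rabin(T, P, q=None):
      n, m = len(T), len(P)
      if m > n:
          return []
      if q is None:
          q = random.choice([10**9 + 7, 10**9 + 9, 998244353])   # stand-in for "random prime"
      val = lambda c: ord(c)              # map characters to digits (works for '0'/'1' too)
      hp = ht = 0
      for i in range(m):                  # fingerprints of P and of the first window T_1
          hp = (2 * hp + val(P[i])) % q
          ht = (2 * ht + val(T[i])) % q
      top = pow(2, m - 1, q)              # 2^(m-1) mod q, used by the rolling update
      occ = []
      for r in range(n - m + 1):
          if hp == ht and T[r:r + m] == P:        # verify to rule out false matches
              occ.append(r)
          if r + m < n:                           # H(T_{r+1}) from H(T_r), mod q
              ht = ((ht - val(T[r]) * top) * 2 + val(T[r + m])) % q
      return occ

  print(karp_rabin("10110101", "0101"))   # [4]  (0-based; the slide's position 5)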

Problem 1: Solution
Search for the codeword of P = bzip (= 1a 0b) in C(S), S = "bzip or not bzip", with the pattern-matching machinery just described, answering yes/no at each byte-aligned codeword of the dictionary {bzip, not, or, space}.

[Figure: the tagged-Huffman tree and the scan of C(S).]

Speed ≈ Compression ratio

The Shift-And method
Define M to be a binary m-by-n matrix such that:
  M(i,j) = 1 iff the first i characters of P exactly match the i characters of T ending at character j,
  i.e., M(i,j) = 1 iff P[1 … i] = T[j−i+1 … j]

Example: T = california and P = for. The only 1-entries of M are M(1,5), M(2,6), M(3,7): they witness the occurrence of "for" ending at position 7 of T.

How does M solve the exact match problem?

How to construct M
We want to exploit bit-parallelism to compute the j-th column of M from the (j−1)-th one.
Machines can perform bit and arithmetic operations between two words in constant time. Examples:
• And(A,B) is the bit-wise and between A and B.
• BitShift(A) is the value obtained by shifting A's bits down by one and setting the first bit to 1, e.g. BitShift((0,1,1,0,1)ᵀ) = (1,0,1,1,0)ᵀ.

Let w be the word size (e.g., 32 or 64 bits). We'll assume m = w. NOTICE: any column of M fits in a memory word.

How to construct M
We define the m-length binary vector U(x) for each character x in the alphabet: U(x) is set to 1 in the positions where x appears in P.

Example: P = abaac
  U(a) = (1,0,1,1,0)ᵀ    U(b) = (0,1,0,0,0)ᵀ    U(c) = (0,0,0,0,1)ᵀ

How to construct M
• Initialize column 0 of M to all zeros
• For j > 0, the j-th column is obtained by

  M(j) = BitShift(M(j−1)) & U(T[j])

For i > 1, entry M(i,j) = 1 iff
 (1) the first i−1 characters of P match the i−1 characters of T ending at character j−1  ⇔ M(i−1, j−1) = 1
 (2) P[i] = T[j]  ⇔ the i-th bit of U(T[j]) = 1
BitShift moves bit M(i−1,j−1) into the i-th position; AND-ing it with the i-th bit of U(T[j]) establishes whether both conditions hold.

Worked example (T = xabxabaaca, P = abaac)
The columns M(1), M(2), M(3), …, M(9) are obtained with the update rule above. For instance M(2) = BitShift(M(1)) & U(a) = (1,0,0,0,0)ᵀ and M(3) = BitShift(M(2)) & U(b) = (0,1,0,0,0)ᵀ; at j = 9 the last bit of M(9) is 1, so an occurrence of P = abaac ends at position 9 of T.

Shift-And method: Complexity
• If m ≤ w, any column and any vector U() fit in a memory word ⇒ any step requires O(1) time.
• If m > w, any column and any vector U() can be divided into ⌈m/w⌉ memory words ⇒ any step requires O(m/w) time.
• Overall O(n(1 + m/w) + m) time.
• Thus it is very fast when the pattern length is close to the word size — very often in practice. Recall that w = 64 bits in modern architectures.
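A Python sketch of the Shift-And scan; Python's unbounded integers stand in for the machine word, so the m ≤ w restriction disappears (at the price of the O(m/w) factor).

  def shift_and(T, P):
      m = len(P)
      U = {}                                   # U[c] has bit i set iff P[i+1] == c
      for i, c in enumerate(P):
          U[c] = U.get(c, 0) | (1 << i)
      last = 1 << (m - 1)
      M, occ = 0, []
      for j, c in enumerate(T):
          M = ((M << 1) | 1) & U.get(c, 0)     # M(j) = BitShift(M(j-1)) & U(T[j])
          if M & last:                         # P[1..m] matches ending at position j
              occ.append(j - m + 1)            # 0-based starting position
      return occ

  print(shift_and("xabxabaaca", "abaac"))      # [4]: the occurrence ending at position 9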

Some simple extensions
We want to allow the pattern to contain special symbols, like the character class [a-f].

Example: P = [a-b]baac
  U(a) = (1,0,1,1,0)ᵀ    U(b) = (1,1,0,0,0)ᵀ    U(c) = (0,0,0,0,1)ᵀ

What about ‘?’, ‘[^…]’ (not) ?

Problem 1: Another solution
Search for the codeword of P = bzip (= 1a 0b) in C(S), S = "bzip or not bzip", using the Shift-And method over the byte-aligned codewords of the dictionary {bzip, not, or, space}.

[Figure: the tagged-Huffman tree and the scan of C(S) answering yes/no at each codeword.]

Speed ≈ Compression ratio

Problem 2
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring.

Example: dictionary {bzip, not, or, space}, P = o; the matching terms are not = 1g 0g 0a and or = 1g 0a 0b, whose occurrences must be reported in S = "bzip or not bzip".

[Figure: the tagged-Huffman tree and C(S).]

Speed ≈ Compression ratio? No! Why? A scan of C(S) is needed for each term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1, P2, …, Pl} of total length m, we want to find all the occurrences of those patterns in the text T[1,n].

[Figure: text T with occurrences of P1 and P2 marked.]

• Naïve solution
  • Use an (optimal) exact-matching algorithm, searching for each pattern of P
  • Complexity: O(nl + m) time, not good with many patterns

• Optimal solution due to Aho and Corasick
  • Complexity: O(n + l + m) time

A simple extension of Shift-And
• S is the concatenation of the patterns in P
• R is a bitmap of length m: R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of the Shift-And method searching for S:
• For any symbol c, U'(c) = U(c) AND R
  ⇒ U'(c)[i] = 1 iff S[i] = c and it is the first symbol of a pattern
• For any step j:
  • compute M(j)
  • then set M(j) = M(j) OR U'(T[j]). Why? It sets to 1 the first bit of each pattern that starts with T[j]
  • check if there are occurrences ending in j. How? (look at the bits corresponding to the last symbol of each pattern)

Problem 3
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring, allowing at most k mismatches.

Example: dictionary {bzip, not, or, space}, S = "bzip or not bzip", P = bot, k = 2.

[Figure: the tagged-Huffman tree of the dictionary and C(S).]

Agrep: Shift-And method with errors
We extend the Shift-And method for finding inexact occurrences of a pattern in a text.
Example: T = aatatccacaa, P = atcgaa
P appears in T with 2 mismatches starting at position 4; it also occurs with 4 mismatches starting at position 2.

  aatatccacaa        aatatccacaa
     atcgaa           atcgaa

Agrep
Our current goal: given k, find all the occurrences of P in T with up to k mismatches.
We define the matrix M^l to be an m-by-n binary matrix such that:
  M^l(i,j) = 1 iff there are no more than l mismatches between the first i characters of P and the i characters of T ending at character j.

• What is M^0?  (It is the matrix M of the exact Shift-And method.)
• How does M^k solve the k-mismatch problem?

Computing M^k
• We compute M^l for all l = 0, …, k.
• For each j compute M^0(j), M^1(j), …, M^k(j)
• For all l, initialize M^l(0) to the zero vector.
• In order to compute M^l(j), we observe that there is a match iff one of the two cases below holds.

Computing M^l: case 1
The first i−1 characters of P match a substring of T ending at j−1 with at most l mismatches, and the next pair of characters in P and T are equal:

  BitShift(M^l(j−1)) & U(T[j])

Computing M^l: case 2
The first i−1 characters of P match a substring of T ending at j−1 with at most l−1 mismatches (the extra mismatch is spent on P[i] vs T[j]):

  BitShift(M^{l−1}(j−1))

Computing M^l
Combining the two cases, for l ≥ 1:

  M^l(j) = [ BitShift(M^l(j−1)) & U(T[j]) ]  OR  BitShift(M^{l−1}(j−1))

Example
T = xabxabaaca, P = abaad, k = 1.
[Figure: the matrices M^0 and M^1, computed column by column with the rule above.]
For instance M^1(5,9) = 1: T[5..9] = abaac matches P = abaad with a single mismatch, so an occurrence with ≤ 1 mismatch ends at position 9.

How much do we pay?
• The running time is O(k·n·(1 + m/w))
• Again, the method is practically efficient for small m.
• Only O(k) columns of M are needed at any given time, hence the space used by the algorithm is O(k) memory words.
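A sketch extending the previous Shift-And code to k mismatches with the M^l recurrence above.

  def agrep_mismatch(T, P, k):
      m = len(P)
      U = {}
      for i, c in enumerate(P):
          U[c] = U.get(c, 0) | (1 << i)
      last = 1 << (m - 1)
      M, occ = [0] * (k + 1), []
      for j, c in enumerate(T):
          prev = M[:]                                    # the columns at step j-1
          M[0] = ((prev[0] << 1) | 1) & U.get(c, 0)      # exact matching, as before
          for l in range(1, k + 1):
              # M^l(j) = (BitShift(M^l(j-1)) & U(T[j])) OR BitShift(M^(l-1)(j-1))
              M[l] = (((prev[l] << 1) | 1) & U.get(c, 0)) | ((prev[l - 1] << 1) | 1)
          if M[k] & last:
              occ.append(j - m + 1)
      return occ

  print(agrep_mismatch("aatatccacaa", "atcgaa", 2))      # [3]  (0-based; the slide's position 4)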

Problem 3: Solution
Run the k-mismatch Shift-And (Agrep) with P = bot and k = 2 over the dictionary terms: e.g. not (codeword 1g 0g 0a) matches within 2 mismatches, and the codewords of the matching terms are then searched in C(S), S = "bzip or not bzip".

[Figure: the tagged-Huffman tree of {bzip, not, or, space} and the scan of C(S).]

Agrep: more sophisticated operations
The Shift-And method can solve other ops:

• The edit distance between two strings p and s is d(p,s) = the minimum number of operations needed to transform p into s via three ops:
  • Insertion: insert a symbol in p
  • Deletion: delete a symbol from p
  • Substitution: change a symbol of p into a different one
  Example: d(ananas, banane) = 3

• Search by regular expressions
  Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword lengths, and thus build the tree…
This may be extremely time/space costly when you deal with Gbs of textual data.

A simple algorithm: sort the pi in decreasing order, and encode si via the variable-length code for the integer i.

γ-code for integer encoding

  γ(x) = 000…0 (Length−1 zeros) followed by x in binary

• x > 0 and Length = ⌊log2 x⌋ + 1
  e.g., 9 is represented as <000, 1001>.
• The γ-code for x takes 2⌊log2 x⌋ + 1 bits (i.e. a factor of 2 from optimal)
• Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…
Exercise: given the following sequence of γ-coded integers, reconstruct the original sequence:

  0001000001100110000011101100111

(Answer: 8, 6, 3, 59, 7 — a decoder sketch follows below.)
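A small encoder/decoder for the γ-code, matching the exercise above.

  def gamma_encode(x):
      # gamma(x) = (Length-1) zeros followed by x in binary, Length = floor(log2 x)+1
      b = bin(x)[2:]
      return "0" * (len(b) - 1) + b

  def gamma_decode(bits):
      # decode a concatenation of gamma-codes
      out, i = [], 0
      while i < len(bits):
          zeros = 0
          while bits[i] == "0":
              zeros += 1
              i += 1
          out.append(int(bits[i:i + zeros + 1], 2))
          i += zeros + 1
      return out

  print(gamma_encode(9))                                     # 0001001
  print(gamma_decode("0001000001100110000011101100111"))     # [8, 6, 3, 59, 7]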

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).
Recall that |γ(i)| ≤ 2·log i + 1.
How good is this approach wrt Huffman? Compression ratio ≤ 2·H0(s) + 1.

Key fact:  1 ≥ ∑_{j=1,...,i} pj ≥ i · pi   ⇒   i ≤ 1/pi

How good is it ?
The cost of the encoding is (recall i ≤ 1/pi):

  ∑_{i=1,…,|S|} pi · |γ(i)|  ≤  ∑_{i=1,…,|S|} pi · [2·log(1/pi) + 1]  =  2·H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding
• Byte-aligned and tagged Huffman
  • 128-ary Huffman tree
  • First bit of the first byte is tagged
  • Configurations on 7 bits: just those of Huffman
• End-tagged dense code (ETDC)
  • The rank r is mapped to the r-th binary sequence on 7·k bits
  • First bit of the last byte is tagged

Surprising changes:
• It is a prefix code
• Better compression: it uses all the 7-bit configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2

A new concept: Continuers vs Stoppers
• Previously we used: s = c = 128
The main idea is:
• s + c = 256 (we are playing with 8 bits)
• Thus s items are encoded with 1 byte
• And s·c with 2 bytes, s·c² with 3 bytes, ...

An example
• 5000 distinct words
• ETDC encodes 128 + 128² = 16512 words on ≤ 2 bytes
• A (230,26)-dense code encodes 230 + 230·26 = 6210 words on ≤ 2 bytes, hence more on 1 byte, and thus it wins if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, assuming c = 256 − s.
• Brute-force approach
• Binary search: on real distributions there seems to be one unique minimum
  (K_s = max codeword length; F_s^k = cumulative probability of the symbols whose |cw| ≤ k)

Experiments: (s,c)-DC is much more interesting…
Search is 6% faster than byte-aligned Huffword.

Streaming compression
Still you need to determine and sort all terms…. Can we do everything in one pass ?

• Move-to-Front (MTF):
  • as a freq-sorting approximator
  • as a caching strategy
  • as a compressor
• Run-Length Encoding (RLE):
  • FAX compression

Move-to-Front Coding
Transforms a char sequence into an integer sequence, that can then be var-length coded.
• Start with the list of symbols L = [a,b,c,d,…]
• For each input symbol s:
  1) output the position of s in L
  2) move s to the front of L

There is a memory: it exploits temporal locality, and it is dynamic.
• X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n² log n) bits, MTF = O(n log n) + n² bits
⇒ Not much worse than Huffman ... but it may be far better. (A sketch follows below.)
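A minimal MTF encoder/decoder (not in the slides); the list is kept as a plain Python list, i.e. O(|S|) per symbol rather than the O(log |S|) tree mentioned later.

  def mtf_encode(s, alphabet):
      # output the current 0-based position of each symbol, then move it to the front
      L, out = list(alphabet), []
      for c in s:
          i = L.index(c)
          out.append(i)
          L.pop(i)
          L.insert(0, c)
      return out

  def mtf_decode(codes, alphabet):
      L, s = list(alphabet), []
      for i in codes:
          c = L.pop(i)
          s.append(c)
          L.insert(0, c)
      return "".join(s)

  codes = mtf_encode("abcddddcccbbba", "abcd")
  print(codes)                                           # [0, 1, 2, 3, 0, 0, 0, 1, 0, 0, 2, 0, 0, 3]
  print(mtf_decode(codes, "abcd") == "abcddddcccbbba")   # True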

MTF: how good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1.
Put S (the alphabet) at the front and consider the cost of encoding: for each symbol x, every output after the first is bounded by the gap between consecutive occurrences of x, so

  cost ≤ O(|S| log |S|) + ∑_{x=1,…,|S|} ∑_{i=2,…,n_x} |γ(p_i^x − p_{i−1}^x)|

By Jensen's inequality:

  ≤ O(|S| log |S|) + ∑_{x=1,…,|S|} n_x · [2·log(N/n_x) + 1]
  = O(|S| log |S|) + N · [2·H0(X) + 1]

  ⇒ La[mtf] ≤ 2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and separators) as the symbols to be encoded.
How to keep the MTF-list efficiently:
• Search tree
  • leaves contain the symbols, ordered as in the MTF-list
  • nodes contain the size of their descending subtree
• Hash table
  • key is a symbol
  • data is a pointer to the corresponding tree leaf

Each tree operation takes O(log |S|) time; the total is O(n log |S|), where n = #symbols to be compressed.

Run-Length Encoding (RLE)
If spatial locality is very high, then
  abbbaacccca ⇒ (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings ⇒ just the run lengths and one bit.

Properties: there is a memory; it exploits spatial locality, and it is a dynamic code.
  X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = Θ(n² log n) > Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using "fractional" parts of bits!!
Used in PPM, JPEG/MPEG (as an option), Bzip.
More time-costly than Huffman, but an integer implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive):

  f(i) = ∑_{j=1}^{i−1} p(j)

e.g. p(a) = .2, p(b) = .5, p(c) = .3 ⇒ f(a) = .0, f(b) = .2, f(c) = .7,
so a = [0, .2), b = [.2, .7), c = [.7, 1.0).

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7)).

Arithmetic Coding: Encoding Example
Coding the message sequence: bac

  start:    [0, 1)
  after b:  [0.2, 0.7)    (b's sub-interval of [0, 1))
  after a:  [0.2, 0.3)    (a's sub-interval of [0.2, 0.7))
  after c:  [0.27, 0.3)   (c's sub-interval of [0.2, 0.3))

The final sequence interval is [.27, .3).

Arithmetic Coding
To code a sequence of symbols c1 c2 … cn with probabilities p[c], use the following:

  l0 = 0,  s0 = 1
  li = l(i−1) + s(i−1) · f[ci]
  si = s(i−1) · p[ci]

f[c] is the cumulative probability up to symbol c (not included).
The final interval size is

  sn = ∏_{i=1}^{n} p[ci]

The interval for a message sequence will be called the sequence interval.
(a small computation of this interval follows below)
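A few lines (not in the slides) that compute the sequence interval with exactly the recurrences above, checked on the running example; plain floating point is used, so this is illustrative rather than a real arithmetic coder.

  def sequence_interval(msg, p):
      symbols = sorted(p)                  # fix an order to define f[]
      f, acc = {}, 0.0
      for c in symbols:                    # f[c] = cumulative prob. up to c (excluded)
          f[c] = acc
          acc += p[c]
      l, s = 0.0, 1.0
      for c in msg:                        # l_i = l_(i-1) + s_(i-1)*f[c_i],  s_i = s_(i-1)*p[c_i]
          l, s = l + s * f[c], s * p[c]
      return l, l + s

  p = {"a": .2, "b": .5, "c": .3}
  print(sequence_interval("bac", p))       # ≈ (0.27, 0.3), i.e. the interval [.27, .3)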

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:

  [0, 1):     .49 ∈ [.2, .7) = b's interval                                    ⇒ b
  [.2, .7):   sub-intervals a=[.2,.3), b=[.3,.55), c=[.55,.7);  .49 ∈ [.3,.55)  ⇒ b
  [.3, .55):  sub-intervals a=[.3,.35), b=[.35,.475), c=[.475,.55);  .49 ∈ [.475,.55)  ⇒ c

The message is bbc.

Representing a real number
Binary fractional representation:
  .75 = .11      1/3 = .010101…      11/16 = .1011

Algorithm (emit the binary expansion of x ∈ [0,1)):
  1. x = 2·x
  2. if x < 1, output 0
  3. else x = x − 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
  e.g. [0, .33) → .01      [.33, .66) → .1      [.66, 1) → .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions:

  code    min     max      interval
  .11     .110    .111…    [.75, 1.0)
  .101    .1010   .1011…   [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval is contained in the sequence interval (a dyadic number).

  Sequence interval: [.61, .79)
  Code interval (.101): [.625, .75) ⊆ [.61, .79)

Can use L + s/2, truncated to 1 + ⌈log (1/s)⌉ bits.

Bound on Arithmetic length
Note that −⌈log s⌉ + 1 = ⌈log (2/s)⌉.

Bound on Length
Theorem: for a text of length n, the arithmetic encoder generates at most
  1 + ⌈log (1/s)⌉ = 1 + ⌈log ∏_i (1/pi)⌉
  ≤ 2 + ∑_{j=1,n} log (1/pj)
  = 2 + ∑_{k=1,|S|} n·pk·log (1/pk)
  = 2 + n·H0   bits

In practice it takes nH0 + 0.02·n bits, because of rounding.

Integer Arithmetic Coding
The problem is that operations on arbitrary-precision real numbers are expensive.
Key ideas of the integer version:
• Keep integers in range [0..R), where R = 2^k
• Use rounding to generate the integer interval
• Whenever the sequence interval falls into the top, bottom or middle half, expand the interval by a factor of 2
Integer arithmetic coding is an approximation.

Integer Arithmetic (scaling)
• If l ≥ R/2 (top half): output 1 followed by m 0s; m = 0; the message interval is expanded by 2.
• If u < R/2 (bottom half): output 0 followed by m 1s; m = 0; the message interval is expanded by 2.
• If l ≥ R/4 and u < 3R/4 (middle half): increment m; the message interval is expanded by 2.
• In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine: the encoder keeps the current interval (L, s); given the next symbol c and the distribution (p1,…,p|S|), it moves to the sub-interval (L', s') of (L, s) assigned to c:

  ATB: (L, s) × c → (L', s')

Therefore, even the distribution can change over time.

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
At each step the model supplies p[s | context], where s = c or esc, and the ATB maps (L, s) to (L', s').
Encoder and Decoder must know the protocol for selecting the same conditional probability distribution (PPM-variant).

PPM: Example Contexts   (k = 2)
String = ACCBACCACBA, next character to encode: B

  Context    Counts
  (empty)    A=4  B=2  C=5  $=3

  A          C=3  $=1
  B          A=2  $=1
  C          A=1  B=2  C=2  $=3

  AC         B=1  C=2  $=2
  BA         C=1  $=1
  CA         C=1  $=1
  CB         A=2  $=1
  CC         A=1  B=1  $=2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) for n → ∞ !!

LZ77
  a a c a a c a b c a b a b a c
[Figure: the Cursor splits the text into the dictionary (all substrings starting to its left) and the part still to be encoded; the next output triple in the example is <2,3,c>.]

Algorithm's step:
• Output <d, len, c> where
  d = distance of the copied string wrt the current position
  len = length of the longest match
  c = next char in the text beyond the longest match
• Advance by len + 1

A buffer "window" has fixed length and moves.

Example: LZ77 with window (window size = 6)

  a a c a a c a b c a b a a a c   →  (0,0,a)
  a a c a a c a b c a b a a a c   →  (1,1,c)
  a a c a a c a b c a b a a a c   →  (3,4,b)
  a a c a a c a b c a b a a a c   →  (3,3,a)
  a a c a a c a b c a b a a a c   →  (1,2,c)

Each triple is (distance of the longest match within the window W, its length, next character).
(a parser sketch follows below)
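A greedy LZ77 parser sketch reproducing the example above; the window bounds where a copy may start, and the last character of the text is always emitted as the literal of the final triple.

  def lz77_parse(text, window=6):
      i, out = 0, []
      while i < len(text):
          best_d, best_len = 0, 0
          for j in range(max(0, i - window), i):            # candidate copy starts
              l = 0
              while i + l < len(text) - 1 and text[j + l] == text[i + l]:
                  l += 1                                     # overlapping copies allowed
              if l > best_len:
                  best_d, best_len = i - j, l
          out.append((best_d, best_len, text[i + best_len])) # <d, len, next char>
          i += best_len + 1
      return out

  print(lz77_parse("aacaacabcabaaac"))
  # [(0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), (3, 3, 'a'), (1, 2, 'c')]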

LZ77 Decoding
The decoder keeps the same dictionary window as the encoder:
• it finds the referenced substring and inserts a copy of it.

What if len > d? (overlap with the text still to be decompressed)
• E.g. seen = abcd, next codeword is (2,9,e)
• Simply copy starting at the cursor:
    for (i = 0; i < len; i++)
      out[cursor+i] = out[cursor-d+i];
• Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Input: a a b a a c a b c a b c b

  Phrase   Output   Dict.
  a        (0,a)    1 = a
  ab       (1,b)    2 = ab
  aa       (1,a)    3 = aa
  c        (0,c)    4 = c
  abc      (2,c)    5 = abc
  abcb     (5,b)    6 = abcb

LZ78: Decoding Example

  Input    Decoded so far          Dict.
  (0,a)    a                       1 = a
  (1,b)    a ab                    2 = ab
  (1,a)    a ab aa                 3 = aa
  (0,c)    a ab aa c               4 = c
  (2,c)    a ab aa c abc           5 = abc
  (5,b)    a ab aa c abc abcb      6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Input: a a b a a c a b a b a c b   (a = 112, b = 113, c = 114)

  Phrase   Output   Dict.
  a        112      256 = aa
  a        112      257 = ab
  b        113      258 = ba
  aa       256      259 = aac
  c        114      260 = ca
  ab       257      261 = aba
  aba      261      262 = abac
  c        114      263 = cb

LZW: Decoding Example

  Input    Decoded so far           Dict.
  112      a
  112      a a                      256 = aa
  113      a a b                    257 = ab
  256      a a b a a                258 = ba
  114      a a b a a c              259 = aac
  257      a a b a a c a b ?        260 = ca
  261      a a b a a c a b a b a    261 = aba

The dictionary entry for 261 is created "one step later" than at the encoder: when 261 arrives it is not yet defined, and it is resolved by the special SSc rule (S = ab, c = a).

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform   (1994)
Take the text T = mississippi#, form all its cyclic rotations, and sort the rows:

  Sorted rotations (F = first column, L = last column)
  #mississippi
  i#mississipp
  ippi#mississ
  issippi#miss
  ississippi#m
  mississippi#
  pi#mississip
  ppi#mississi
  sippi#missis
  sissippi#mis
  ssippi#missi
  ssissippi#mi

  F = #iiiimppssss        L = ipssm#pissii   (the BWT of T)

A famous example: [Figure: the BWT of a much longer text, showing long runs of equal characters in L.]

A useful tool: the L → F mapping
How do we map L's chars onto F's chars? We need to distinguish equal chars in F...
Take two equal chars of L and rotate their rows rightward by one position: the rows remain sorted, hence equal chars keep the same relative order in L and in F.

[Figure: the sorted BWT matrix of mississippi#, with arrows mapping each char of L to the corresponding char of F.]

The BWT is invertible
Two key properties:
1. The LF-array maps L's chars to F's chars
2. L[i] precedes F[i] in T

Reconstruct T backward:  T = .... i p p i #

  InvertBWT(L)
    Compute LF[0,n-1];
    r = 0; i = n;
    while (i > 0) {
      T[i] = L[r];
      r = LF[r]; i--;
    }
(a runnable version follows below)
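A Python sketch of the inversion above: it builds LF by stably sorting the positions of L and then walks backwards, starting from the row that ends with the end-marker #.

  def inverse_bwt(L, eos="#"):
      n = len(L)
      # LF[i] = rank of L[i] in F: (#chars smaller than L[i]) + (#earlier equal chars)
      order = sorted(range(n), key=lambda i: (L[i], i))   # stable by construction
      LF = [0] * n
      for f_pos, l_pos in enumerate(order):
          LF[l_pos] = f_pos
      r, out = L.index(eos), []          # the row whose LAST character is the end-marker
      for _ in range(n):
          out.append(L[r])               # L[r] precedes the current row's suffix in T
          r = LF[r]
      return "".join(reversed(out))

  print(inverse_bwt("ipssm#pissii"))     # mississippi#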

How to compute the BWT ?
We said that L[i] precedes F[i] in T: row i of the sorted BWT matrix corresponds to the suffix starting at SA[i], and its last character is the one preceding that suffix, e.g. L[3] = T[SA[3]−1] = T[7].
Given SA and T, we have   L[i] = T[SA[i]−1]

  SA = 12 11 8 5 2 1 10 9 7 4 6 3        L = i p s s m # p i s s i i

How to construct SA from T ?
Input: T = mississippi#. Sort the suffixes lexicographically:
  SA = 12 11 8 5 2 1 10 9 7 4 6 3
  (#, i#, ippi#, issippi#, ississippi#, mississippi#, pi#, ppi#, sippi#, sissippi#, ssippi#, ssissippi#)

Elegant but inefficient — obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now... (a naïve sketch follows below)
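A naive sketch (Θ(n² log n), exactly the inefficiency noted above) that builds SA by sorting the suffixes and then derives L[i] = T[SA[i]−1].

  def suffix_array(T):
      # sort suffix starting positions lexicographically (naive construction)
      return sorted(range(len(T)), key=lambda i: T[i:])

  def bwt(T):
      # L[i] = T[SA[i]-1]: the character preceding each suffix in SA order
      SA = suffix_array(T)
      return "".join(T[i - 1] for i in SA)   # index -1 wraps to the final '#'

  T = "mississippi#"
  print([i + 1 for i in suffix_array(T)])    # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3] (1-based)
  print(bwt(T))                              # ipssm#pissii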

Compressing L seems promising...
Key observation: L is locally homogeneous ⇒ L is highly compressible

Algorithm Bzip:
• Move-to-Front coding of L
• Run-Length coding
• Statistical coder
• Bzip vs. Gzip: 20% vs. 33% compression, but Bzip is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii    (# at position 16)

MTF-list = [i,m,p,s]
MTF  = 020030000030030200300300000100000
     → 030040000040040300400400000200000   (Bin(6)=110, Wheeler's code)
RLE0 = 03141041403141410210                (alphabet of size |S|+1)

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus γ(16), plus the original MTF-list (i,m,p,s).

You find this in your Linux distribution.

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…
• Physical network graph
  • V = routers, E = communication links
• The "cosine" graph (undirected, weighted)
  • V = static web pages, E = semantic distance between pages
• Query-Log graph (bipartite, weighted)
  • V = queries and URLs, E = (q,u) if u is a result for q and has been clicked by some user who issued q
• Social graph (undirected, unweighted)
  • V = users, E = (x,y) if x knows y (facebook, address book, email, ..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution
Altavista crawl (1999) and WebBase crawl (2001): the indegree follows a power-law distribution

  Pr[in-degree(u) = k] ∝ 1 / k^a ,   a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting (e.g. Berkeley, Stanford, …): URL compression + Delta encoding

The WebGraph library
• Adjacency list with compressed gaps (exploits locality):
  Successor list S(x) = {s1−x, s2−s1−1, ..., sk−sk-1−1}
  For negative entries:
• Copy-lists (exploit similarity), with reference chains possibly limited to length W:
  each bit of y's copy-list tells whether the corresponding successor of the reference list x is also a successor of y;
  the reference index is chosen in [0,W] so as to give the best compression.
• Copy-blocks = RLE(Copy-list), i.e. run-length encoding of the bit sequence:
  the first copy-block is 0 if the copy-list starts with 0;
  the last block is omitted (we know the length…);
  the length is decremented by one for all blocks.

This is a Java and C++ lib (≈3 bits/edge).

Extra-nodes: Compressing Intervals
Consecutive integers among the extra-nodes (those not copied from the reference) are grouped into intervals:
• Intervals: represented by their left extreme and length
• Interval length: decremented by Lmin = 2
• Residuals: differences between consecutive residuals, or wrt the source node

Examples from the figure:
  0 = (15−15)·2            (positive)
  2 = (23−19)−2            (jump ≥ 2)
  600 = (316−16)·2
  3 = |13−15|·2−1          (negative)
  3018 = 3041−22−1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques
• Common knowledge between sender & receiver
  • Unstructured file: delta compression
• "Partial" knowledge
  • Unstructured files: file synchronization
  • Record-based data: set reconciliation

Formalization
• Delta compression   [diff, zdelta, REBL,…]
  • Compress file f deploying file f'
  • Compress a group of files
  • Speed-up web access by sending differences between the requested page and the ones available in cache
• File synchronization   [rsync, zsync]
  • Client updates old file f_old with f_new available on a server
  • Mirroring, Shared Crawling, Content Distribution Networks
• Set reconciliation
  • Client updates structured old file f_old with f_new available on a server
  • Update of contacts or appointments, intersect inverted lists in a P2P search engine

Z-delta compression   (one-to-one)

Problem: we have two files f_known and f_new, and the goal is to compute a file f_d of minimum size such that f_new can be derived from f_known and f_d.
• Assume that block moves and copies are allowed
• Find an optimal covering set of f_new based on f_known
• The LZ77-scheme provides an efficient, optimal solution: f_known is the "previously encoded text"; compress f_known·f_new starting from f_new
• zdelta is one of the best implementations

            Emacs size   Emacs time
  uncompr   27Mb         ---
  gzip      8Mb          35 secs
  zdelta    1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: we wish to compress a group of files F
• Useful on a dynamic collection of web pages, back-ups, …
• Apply pairwise zdelta: find for each f ∈ F a good reference
• Reduction to the Min Branching problem on DAGs:
  • Build a weighted graph G_F: nodes = files, edge weights = zdelta-sizes
  • Insert a dummy node connected to all, whose edge weights are the gzip-sizes
  • Compute the min branching = directed spanning tree of minimum total cost covering G's nodes

[Figure: example branching with edge weights 20, 123, 220, 620, 2000, …]

            space   time
  uncompr   30Mb    ---
  tgz       20%     linear
  THIS      8%      quadratic

Improvement — what about many-to-one compression (a group of files)?
Problem: constructing G is very costly, n² edge calculations (zdelta executions). We wish to exploit some pruning approach:
• Collection analysis: cluster the files that appear similar and thus are good candidates for zdelta-compression; build a sparse weighted graph G'_F containing only the edges between those pairs of files
• Assign weights: estimate appropriate edge weights for G'_F, thus saving zdelta executions. Nonetheless, strictly n² time.

            space   time
  uncompr   260Mb   ---
  tgz       12%     2 mins
  THIS      8%      16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

            gcc size   emacs size
  total     27288      27326
  gzip       7563       8577
  zdelta      227       1431
  rsync       964       4452

Compressed sizes in KB (slightly outdated numbers).
There is a factor 3-5 gap between rsync and zdelta !!

A new framework: zsync
The server sends the hashes (unlike the client in rsync), and the client checks them.
The server deploys the common f_ref to compress the new f_tar (rsync compresses just f_tar).

A multi-round protocol
• k blocks of n/k elements, log(n/k) levels
• If the distance is k, then on each level at most k hashes do not find a match in the other file.
• The communication complexity is O(k lg n lg(n/k)) bits.

Next lecture

Set reconciliation
Problem: given two sets S_A and S_B of integer values located on two machines A and B, determine the difference between the two sets at one or both of the machines.
Requirement: the cost should be proportional to the size k of the difference, where k may or may not be known in advance to the parties.

Note: set reconciliation is "easier" than file sync [it is record-based]. Not perfectly true, but...

Recurring minimum for improving the estimate + 2 SBF

A multi-round protocol
• k blocks of n/k elements, log(n/k) levels
• If the distance is k, then on each level at most k hashes do not find a match in the other file.
• The communication complexity is O(k lg n lg(n/k)) bits.

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
#

0

s

i

1
ssi

ppi#

4

#

1

p

i

3

2

1

ppi#

6

pi#

7

8
5

T# = mississippi#
2 4 6 8 10

si

ppi#

i#

ppi#

11

mississippi#

12

2

1 10

9

4

3

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
5

Q(N2) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Q(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does it exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does it exist a substring of length ≥ L occurring ≥ C times ?
• Search for Lcp[i,i+C-2] whose entries are ≥ L


B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with i-th bit of U(T[j]) to estabilish if both are true

An example j=1
1 2 3 4 5 6 7 8 9 10

n
m

1

12345

1

0

P=abaac

2

0

3

0

4

0

5

0

T=xabxabaaca

0
 
0
U (x)  0
 
0
 
0

2

3

4

5

6

7

1 
 
0
BitShift ( M ( 0 )) & U (T [1])   0  &
 
0
 
0

8

9

0 0
   
0 0
0  0
   
0 0
   
0 0

1
0

An example j=2
1 2 3 4 5 6 7 8 9 10

n
m

1

2

12345

1

0

1

P=abaac

2

0

0

3

0

0

4

0

0

5

0

0

T=xabxabaaca

1 
 
0
U (a )   1 
 
1 
 
0

3

4

5

6

7

1 
 
0
BitShift ( M (1)) & U (T [ 2 ])   0  &
 
0
 
0

8

9

1  1 
   
0 0
1    0 
   
1   0 
   
0 0

1
0

An example j=3
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

12345

1

0

1

0

P=abaac

2

0

0

1

3

0

0

0

4

0

0

0

5

0

0

0

T=xabxabaaca

0
 
1 
U (b )   0 
 
0
 
0

4

5

6

7

1 
 
1 
BitShift ( M ( 2 )) & U (T [ 3 ])   0  &
 
0
 
0

8

9

0 0
   
1  1 
0  0
   
0 0
   
0 0

1
0

An example j=9
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

4

5

6

7

8

9

12345

1

0

1

0

0

1

0

1

1

0

P=abaac

2

0

0

1

0

0

1

0

0

0

3

0

0

0

0

0

0

1

0

0

4

0

0

0

0

0

0

0

1

0

5

0

0

0

0

0

0

0

0

1

T=xabxabaaca

0
 
0
U (c )   0 
 
0
 
1 

1 
 
1 
BitShift ( M ( 8 )) & U (T [ 9 ])   0  &
 
0
 
1 

0 0
   
0 0
0  0
   
0 0
   
1  1 

1
0

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M ( j - 1))  U (T [ j ])
l

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M

l -1

( j - 1))

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1
1 2 3 4 5 6 7 8 910

M1=

T=xabxabaaca
P=
abaad

1

2

3

4

5

6

7

8

9

1
0

1

1

1

1

1

1

1

1

1

1

1

2

0

0

1

0

0

1

0

1

1

0

3

0

0

0

1

0

0

1

0

0

1

4

0

0

0

0

1

0

0

1

0

0

5

0

0

0

0

0

0

0

0

1

0

M0=

1

2

3

4

5

6

7

8

9

1
0

1

0

1

0

0

1

0

1

1

0

1

2

0

0

1

0

0

1

0

0

0

0

3

0

0

0

0

0

0

1

0

0

0

4

0

0

0

0

0

0

0

1

0

0

5

0

0

0

0

0

0

0

0

0

0

How much do we pay?





The running time is O(kn(1+m/w)
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Delection: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
0 000 ...........0 x in b in ary
Length-1


x > 0 and Length = log2 x +1
e.g., 9 represented as <000,1001>.



g-code for x takes 2 log2 x +1 bits

(ie. factor of 2 from optimal)



Optimal for Pr(x) = 1/2x2, and i.i.d integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8

6

3

59

7

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How much good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
1 ≥Si=1,...,x pi ≥ x * px  x ≤ 1/px

How good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):



p i  | g ( i ) |

i  1 ,..., S

This is:



i  1 ,.., S

p i  [ 2 * log

1

 1]

pi

 2 * H 0 ( X ) 1
No much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1n 2n 3n… nn  Huff = O(n2 log n), MTF = O(n log n) + n2

No much worse than Huffman
...but it may be far better

MTF: how good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
Put S in the front and consider the cost of encoding:
S

O ( S log S ) 



nx

g(p - p )

x 1 i  2

x

x

i

i -1

S

By Jensen’s:

 O ( S log S ) 

n
x 1

x

[ 2 * log

N

 1]

nx

 O ( S log S )  N * [ 2 * H 0 ( X )  1]
L a [ mtf ]  2 * H 0 ( X )  O (1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1n 2n 3n… nn 

There is a memory

Huff(X) = Q(n2 log n) > Rle(X) = Q(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.
1.0

c = .3

0.7

i -1

f (i ) 



p( j)

j 1

b = .5
0.2
0.0

f(a) = .0, f(b) = .2, f(c) = .7
a = .2

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
1.0

0.7
c = .3

0.7

c = .3
0.55

0.0

a = .2

0.3
0.2

c = .3

0.27
b = .5

b = .5
0.2

0.3

a = .2

b = .5
0.22

a = .2

0.2

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l 0  0 l i  l i -1  s i -1 * f c i 

s0  1

s i  s i -1 * p c i 

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

n

sn 

 p c 
i

i 1

The interval for a message sequence will be called the
sequence interval

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
1.0

0.7
c = .3

0.7
0.49

b = .5

c = .3

0.55
0.49

0.2

0.0

0.55

b = .5

0.3
a = .2

The message is bbc.

0.2

0.49
0.475

c = .3

b = .5
0.35

a = .2

0.3

a = .2

Representing a real number
Binary fractional representation:
. 75

 . 11

1/ 3

 . 01 01

11 / 16

 . 1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
m in

m ax

interval

.11

.110

.111

[.75 ,1.0 )

.101

.1010

.1011

[.625 , .75 )

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
Context
Empty

Counts
A
B
C
$

=
=
=
=

4
2
5
3

Context
A
B
C

Counts
C
$
A
$
A
B
C
$

=
=
=
=
=
=
=
=

3
1
2
1
1
2
2
3

Context
AC

BA
CA
CB
CC

String = ACCBACCACBA B

k=2

Counts
B
C
$
C
$
C
$
A
$
A
B
$

=
=
=
=
=
=
=
=
=
=
=
=

1
2
2
1
1
1
1
2
1
1
1
2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Q(n2 log n) time in the worst-case
• Q(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in - degree ( u )  k ] 

a  2.1

1

k

a

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides and efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
Emacs size

Emacs time

uncompr

27Mb

---

gzip

8Mb

35 secs

zdelta

1.5Mb

42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

0

620

2

20

123

2000

20
220

3

1
5

space

time

uncompr

30Mb

---

tgz

20%

linear

THIS

8%

quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
space

time

uncompr

260Mb

---

tgz

12%

2 mins

THIS

8%

16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

gcc size
total
27288
gzip
7563
zdelta
227
rsync
964

emacs size
27326
8577
1431
4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
#

0

s

i

1
ssi

ppi#

4

#

1

p

i

3

2

1

ppi#

6

pi#

7

8
5

T# = mississippi#
2 4 6 8 10

si

ppi#

i#

ppi#

11

mississippi#

12

2

1 10

9

4

3

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
5

Q(N2) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Q(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does it exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does it exist a substring of length ≥ L occurring ≥ C times ?
• Search for Lcp[i,i+C-2] whose entries are ≥ L


Slide 173

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 10^4 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and the algorithm uses all of them:
(1/B) * (p * f/(1+f) * C) ≈ 30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms

CPU (registers) → L1 → L2 (cache) → RAM → HD → net

Cache: few Mbs, some nanosecs, few words fetched
RAM:   few Gbs, tens of nanosecs, some words fetched
HD:    few Tbs, few millisecs, B = 32K page
net:   many Tbs, even secs, packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Running time of the brute-force solutions on growing inputs:

  n      4K    8K   16K   32K    128K   256K   512K   1M
  n^3    22s   3m   26m   3.5h   28h    --     --     --
  n^2    0     0    0     1s     26s    106s   7m     28m

An optimal solution
We assume every subsum ≠ 0

A = [ negative prefix | ... Optimum (sum > 0) ... ]
A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm
  sum = 0; max = -1;
  for i = 1,...,n do
    if (sum + A[i] ≤ 0) then sum = 0;
    else sum += A[i]; max = MAX{max, sum};

Note:
• Sum < 0 when OPT starts;
• Sum > 0 within OPT
(A runnable sketch follows.)
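A minimal runnable sketch of the scan above (not part of the slides); it assumes, as the slide does, that the optimum sum is positive.

  def max_subarray_sum(A):
      # One pass: reset the running sum when it would drop to <= 0,
      # keep the best sum seen so far.
      best, run = float("-inf"), 0
      for x in A:
          if run + x <= 0:
              run = 0
          else:
              run += x
              best = max(best, run)
      return best

  A = [2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]
  print(max_subarray_sum(A))   # 12, achieved by the subarray 6 1 -2 4 3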

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 10^9 random I/Os = 10^9 * 5ms ≈ 2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
  if (i < j) then
    m = (i+j)/2;            // Divide
    Merge-Sort(A,i,m);      // Conquer
    Merge-Sort(A,m+1,j);    // Conquer
    Merge(A,i,m,j)          // Combine

Cost of Mergesort on large data

Take Wikipedia in Italian, compute word freq:
  n = 10^9 tuples ⇒ a few Gbs
  Typical disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:
  It is an indirect sort: Θ(n log2 n) random I/Os
  [5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching... (each merge level is 2 passes, one Read and one Write)

Merge-Sort Recursion Tree

[Figure: recursion tree with log2 N levels; sorted runs of numbers are merged pairwise, level by level, until one sorted run remains. The lowest levels of the tree fit in internal memory M.]

If the run-size is larger than B (i.e. after the first step!!), fetching all of it in memory for merging does not help.

How do we deploy the disk/memory features?
  N/M runs, each sorted in internal memory (no I/Os)
  I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort

The key is to balance run-size and #runs to merge.
Sort N items with main memory M and disk pages of B items:
  Pass 1: produce N/M sorted runs.
  Pass i: merge X ≤ M/B runs at a time ⇒ log_{M/B}(N/M) merge passes.

[Figure: X input buffers (INPUT 1 ... INPUT X), one per run, plus one OUTPUT buffer; each buffer holds B items in main memory, runs are streamed from and to disk.]

Multiway Merging

[Figure: buffers Bf1,...,Bfx hold the current page of runs 1..X = M/B, with cursors p1,...,pX. At each step the minimum of Bf1[p1], Bf2[p2], …, Bfx[pX] is appended to the output buffer Bfo. A buffer Bfi is refilled from disk when pi = B (its page is exhausted); Bfo is flushed to the merged run when full; the merge stops at EOF.]
(A minimal in-memory sketch of this k-way merge follows.)
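A minimal sketch (not part of the slides) of the k-way merging step, done in memory with a heap; in the external-memory version each run would be read one disk page of B items at a time.

  import heapq

  def kway_merge(runs):
      # runs: list of already-sorted lists (the sorted runs).
      # The heap keeps the current head of each run: pop the minimum,
      # then push the next element of the same run (as the buffers Bf1..Bfx do).
      heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
      heapq.heapify(heap)
      out = []
      while heap:
          val, i, j = heapq.heappop(heap)
          out.append(val)
          if j + 1 < len(runs[i]):
              heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
      return out

  print(kway_merge([[1, 5, 13, 19], [7, 9], [3, 4, 8, 15], [6, 11, 12, 17]]))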

Cost of Multi-way Merge-Sort

Number of passes = log_{M/B} #runs ≈ log_{M/B}(N/M)
Optimal cost = Θ((N/B) log_{M/B}(N/M)) I/Os

In practice
  M/B ≈ 1000 ⇒ #passes = log_{M/B}(N/M) ≈ 1
  One multiway merge ⇒ 2 passes = a few mins
  (Tuning depends on disk features)

• Large fan-out (M/B) decreases #passes
• Compression would decrease the cost of a pass!

Can compression help?

Goal: enlarge M and reduce N
  #passes = O(log_{M/B}(N/M))
  Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements

Goal: top queries over a stream of N items (S large).
Math Problem: find the item y whose frequency is > N/2, using the smallest space (i.e. assuming the mode occurs > N/2 times).

A = b a c c c d c b a a a c c b c c c

Algorithm

Use a pair of variables <X,C>; set X = first item, C = 1.
For each subsequent item s of the stream:
  if (X == s) then C++
  else { C--; if (C == 0) { X = s; C = 1; } }
Return X;

Proof

(The returned candidate may be wrong if the mode occurs ≤ N/2 times.)
If X ≠ y at the end, then every one of y’s occurrences has a “negative” mate, i.e. a distinct element it was cancelled with. Hence these mates are ≥ #occ(y).
As a result we would need 2 * #occ(y) > N elements: a contradiction.
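A minimal sketch (not part of the slides) of this streaming scan; it uses the standard formulation of the majority-vote algorithm (candidate adopted when the counter is zero), which gives the same > N/2 guarantee.

  def majority_candidate(stream):
      X, C = None, 0
      for s in stream:
          if C == 0:
              X, C = s, 1        # adopt a new candidate
          elif X == s:
              C += 1             # one more supporter
          else:
              C -= 1             # cancel one supporter with a mismatching item
      return X

  A = list("bacccdcbaaaccbccc")
  print(majority_candidate(A))   # 'c' (it occurs 9 times out of 17)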

Toy problem #4: Indexing

Consider the following TREC collection:
  N = 6 * 10^9 chars ⇒ size = 6Gb
  n = 10^6 documents
  TotT = 10^9 term occurrences (avg term length is 6 chars)
  t = 5 * 10^5 distinct terms

What kind of data structure should we build to support word-based searches?

Solution 1: Term-Doc matrix
(n = 1 million documents as columns, t = 500K terms as rows)

             Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony              1               1             0          0        0        1
Brutus              1               1             0          1        0        0
Caesar              1               1             0          1        1        1
Calpurnia           0               1             0          0        0        0
Cleopatra           1               0             0          0        0        0
mercy               1               0             1          1        1        1
worser              1               0             1          1        1        0

Entry = 1 if the play contains the word, 0 otherwise.
Space is 500Gb !

Solution 2: Inverted index

  Brutus    → 2 4 8 16 32 64 128
  Calpurnia → 1 2 3 5 8 13 21 34
  Caesar    → 13 16

We can still do better: i.e. 30-50% of the original text.

1. Typically each posting takes about 12 bytes
2. We have 10^9 total term occurrences ⇒ at least 12Gb of space
3. Compressing the 6Gb of documents gets 1.5Gb of data
Better than the matrix, but the index is still >10 times the (compressed) text !!!!
(A tiny sketch of inverted-index construction follows.)
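A tiny sketch (not part of the slides) of word-based inverted-index construction with gap-coded postings; the documents, ids and tokenization are illustrative.

  from collections import defaultdict

  def build_inverted_index(docs):
      # docs: dict docid -> text. Returns term -> sorted list of docids.
      index = defaultdict(set)
      for docid, text in docs.items():
          for term in text.lower().split():
              index[term].add(docid)
      return {t: sorted(ds) for t, ds in index.items()}

  def gaps(postings):
      # Store the first docid and then the gaps: smaller numbers compress better.
      return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

  docs = {1: "to be or not to be", 2: "to do or not to do", 4: "be or do"}
  idx = build_inverted_index(docs)
  print(idx["or"], gaps(idx["or"]))    # [1, 2, 4]  ->  [1, 1, 2]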

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM into fewer bits?
NO: they are 2^n, but the shorter compressed messages are at most

  Σ_{i=1}^{n-1} 2^i = 2^n - 2

We need to talk about stochastic sources.

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the self-information of s is:

  i(s) = log2 (1/p(s)) = - log2 p(s)

Lower probability ⇒ higher information.

Entropy is the weighted average of i(s):

  H(S) = Σ_{s∈S} p(s) · log2 (1/p(s))   bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword lengths L[s], the average length is defined as

  La(C) = Σ_{s∈S} p(s) · L[s]

We say that a prefix code C is optimal if for all prefix codes C’, La(C) ≤ La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal uniquely decodable code, there exists a prefix code with the same codeword lengths, and thus the same optimal average length. And vice versa…

Theorem (golden rule). If C is an optimal prefix code for a source with probabilities {p1, …, pn}, then pi < pj ⇒ L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any uniquely decodable code C, we have

  H(S) ≤ La(C)

Theorem (upper bound, Shannon). For any probability distribution, there exists a prefix code C such that

  La(C) ≤ H(S) + 1

(The Shannon code assigns ⌈log2 1/p⌉ bits to a symbol of probability p.)

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

Merge a(.1) and b(.2) into a node of weight .3; merge it with c(.2) into a node of weight .5; merge with d(.5) into the root (1).
Resulting codewords: a = 000, b = 001, c = 01, d = 1

There are 2^(n-1) “equivalent” Huffman trees (each internal node’s children can be swapped).

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: emit the root-to-leaf path leading to the symbol to be encoded.
Decoding: start at the root and take the branch corresponding to each bit received. When at a leaf, output its symbol and return to the root.

With the tree above (a = 000, b = 001, c = 01, d = 1):
  abc...  →  00000101          101001...  →  dcb
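A minimal sketch (not part of the slides) of Huffman-tree construction by repeatedly merging the two lightest subtrees; ties may give a different, equally optimal code.

  import heapq

  def huffman_codes(probs):
      # probs: dict symbol -> probability.
      heap = [(p, i, (sym,)) for i, (sym, p) in enumerate(sorted(probs.items()))]
      heapq.heapify(heap)
      code = {s: "" for s in probs}
      counter = len(heap)
      while len(heap) > 1:
          p1, _, left = heapq.heappop(heap)     # two lightest trees
          p2, _, right = heapq.heappop(heap)
          for s in left:
              code[s] = "0" + code[s]           # prepend the branch bit
          for s in right:
              code[s] = "1" + code[s]
          heapq.heappush(heap, (p1 + p2, counter, left + right))
          counter += 1
      return code

  print(huffman_codes({"a": .1, "b": .2, "c": .2, "d": .5}))
  # e.g. {'a': '110', 'b': '111', 'c': '10', 'd': '0'}: same codeword lengths
  # (3,3,2,1) as the slide's a=000, b=001, c=01, d=1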

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. Its self-information is

  -log2(.999) ≈ .00144

If we were to send 1000 such symbols we might hope to use 1000 * .00144 ≈ 1.44 bits.

Using Huffman, we take at least 1 bit per symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching

We show methods in which arithmetic and bit operations replace comparisons.
We will survey two examples of such methods:
  the Random Fingerprint method due to Karp and Rabin;
  the Shift-And method due to Baeza-Yates and Gonnet.

Rabin-Karp Fingerprint

We will use a class of functions from strings to integers in order to obtain:
  an efficient randomized algorithm that makes an error with small probability;
  a randomized algorithm that never errs, whose running time is efficient with high probability.

We will consider a binary alphabet (i.e., T ∈ {0,1}^n).

Arithmetic replaces Comparisons

Strings are also numbers; H: strings → numbers.
Let s be a string of length m:

  H(s) = Σ_{i=1}^{m} 2^{m-i} · s[i]

P = 0101
H(P) = 2^3·0 + 2^2·1 + 2^1·0 + 2^0·1 = 5

s = s’ if and only if H(s) = H(s’)

Definition: let Tr denote the m-length substring of T starting at position r (i.e., Tr = T[r, r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons

We can compute H(Tr) from H(Tr-1):

  H(Tr) = 2·H(Tr-1) - 2^m·T(r-1) + T(r+m-1)

T = 10110101
T1 = 1011, T2 = 0110
H(T1) = H(1011) = 11
H(T2) = 2·11 - 2^4·1 + 0 = 22 - 16 = 6 = H(0110)

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P = 101111, q = 7
H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally, one bit at a time:
  (1·2 (mod 7)) + 0 = 2
  (2·2 (mod 7)) + 1 = 5
  (5·2 (mod 7)) + 1 = 11 ≡ 4 (mod 7)
  (4·2 (mod 7)) + 1 = 9 ≡ 2 (mod 7)
  (2·2 (mod 7)) + 1 = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1), since
  2^m (mod q) = 2·(2^(m-1) (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
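A minimal sketch (not part of the slides) of the fingerprint scan on a binary text; here q is a fixed prime rather than a random one, and every fingerprint match is verified, so the sketch never errs.

  def karp_rabin(T, P, q=2**31 - 1):
      n, m = len(T), len(P)
      if m > n:
          return []
      pow_m = pow(2, m - 1, q)                  # 2^(m-1) mod q, used to drop the leftmost bit
      hP = hT = 0
      for i in range(m):                        # fingerprints of P and of T[1..m]
          hP = (2 * hP + int(P[i])) % q
          hT = (2 * hT + int(T[i])) % q
      occ = []
      for r in range(n - m + 1):
          if hP == hT and T[r:r + m] == P:      # verify, to rule out false matches
              occ.append(r + 1)                 # 1-based position, as in the slides
          if r + m < n:                         # roll the window: drop T[r], append T[r+m]
              hT = (2 * (hT - int(T[r]) * pow_m) + int(T[r + m])) % q
      return occ

  print(karp_rabin("10110101", "0101"))         # [5]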

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M is an m × n binary matrix (m = |P| rows, n = |T| columns). Its only 1-entries are:
  M(1,5) = 1   (P[1..1] = f   = T[5..5])
  M(2,6) = 1   (P[1..2] = fo  = T[5..6])
  M(3,7) = 1   (P[1..3] = for = T[5..7])
All other entries are 0.

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M

We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one.
We define the m-length binary vector U(x) for each character x in the alphabet: U(x) is set to 1 at the positions in P where character x appears.

Example: P = abaac
  U(a) = (1,0,1,1,0)
  U(b) = (0,1,0,0,0)
  U(c) = (0,0,0,0,1)

How to construct M

Initialize column 0 of M to all zeros.
For j > 0, the j-th column is obtained as

  M(j) = BitShift( M(j-1) ) & U( T[j] )

For i > 1, entry M(i,j) = 1 iff
  (1) the first i-1 characters of P match the i-1 characters of T ending at position j-1   ⇔ M(i-1,j-1) = 1
  (2) P[i] = T[j]   ⇔ the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) into the i-th position; ANDing with the i-th bit of U(T[j]) establishes whether both conditions hold.

An example   (T = xabxabaaca, P = abaac; U(a)=(1,0,1,1,0), U(b)=(0,1,0,0,0), U(c)=(0,0,0,0,1), U(x)=(0,0,0,0,0))

j=1: T[1]=x ⇒ M(1) = BitShift(M(0)) & U(x) = (1,0,0,0,0) & (0,0,0,0,0) = (0,0,0,0,0)
j=2: T[2]=a ⇒ M(2) = BitShift(M(1)) & U(a) = (1,0,0,0,0) & (1,0,1,1,0) = (1,0,0,0,0)
j=3: T[3]=b ⇒ M(3) = BitShift(M(2)) & U(b) = (1,1,0,0,0) & (0,1,0,0,0) = (0,1,0,0,0)
  ...
j=9: T[9]=c ⇒ M(9) = BitShift(M(8)) & U(c) = (1,1,0,0,1) & (0,0,0,0,1) = (0,0,0,0,1)

The 1 in row 5 (= |P|) of column 9 signals an occurrence of P ending at position 9 of T.

Shift-And method: Complexity

If m ≤ w, any column and any vector U() fit in a memory word ⇒ each step requires O(1) time.
If m > w, any column and any vector U() can be divided into ⌈m/w⌉ memory words ⇒ each step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when the pattern length is close to the word size — very often the case in practice. Recall that w = 64 bits in modern architectures.
(A minimal sketch follows.)
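A minimal sketch (not part of the slides) of the Shift-And scan, using a Python integer as the bit vector: bit i-1 of the word plays the role of row i of M.

  def shift_and(T, P):
      m = len(P)
      U = {}                                  # U[c] has bit i set iff P[i+1] == c
      for i, c in enumerate(P):
          U[c] = U.get(c, 0) | (1 << i)
      M, occ = 0, []
      last = 1 << (m - 1)                     # the bit corresponding to a full match
      for j, c in enumerate(T, start=1):
          # BitShift: shift the previous column by one and set the first bit to 1.
          M = ((M << 1) | 1) & U.get(c, 0)
          if M & last:
              occ.append(j)                   # an occurrence of P ends at position j
      return occ

  print(shift_and("xabxabaaca", "abaac"))     # [9]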

Some simple extensions

We want to allow the pattern to contain special symbols, like the class of chars [a-f].

P = [a-b]baac
  U(a) = (1,0,1,1,0)
  U(b) = (1,1,0,0,0)
  U(c) = (0,0,0,0,1)

What about ‘?’, ‘[^…]’ (not) ?

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

S is the concatenation of the patterns in P.
R is a bitmap of length m, with R[i] = 1 iff S[i] is the first symbol of a pattern.

Use a variant of the Shift-And method searching for S:
  For any symbol c, U’(c) = U(c) AND R, so U’(c)[i] = 1 iff S[i] = c and it is the first symbol of a pattern.
  For any step j:
    compute M(j);
    then set M(j) = M(j) OR U’(T[j]). Why? To set to 1 the first bit of each pattern that starts with T[j].
    Check if there are occurrences ending in j. How? Look at the bits of M(j) at the last position of each pattern.

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep

Our current goal: given k, find all the occurrences of P in T with up to k mismatches.
We define the matrix Ml to be an m by n binary matrix, such that:
  Ml(i,j) = 1 iff there are no more than l mismatches between the first i characters of P and the i characters of T ending at position j.

What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1

The first i-1 characters of P match a substring of T ending at j-1 with at most l mismatches, and the next pair of characters in P and T are equal:

  BitShift( Ml(j-1) ) & U( T[j] )

Computing Ml: case 2

The first i-1 characters of P match a substring of T ending at j-1 with at most l-1 mismatches (one further mismatch is spent on position j):

  BitShift( Ml-1(j-1) )

Computing Ml

We compute Ml for all l = 0, …, k; for each j we compute M0(j), M1(j), …, Mk(j), initializing every Ml(0) to the zero vector.
Combining the two cases:

  Ml(j) = [ BitShift( Ml(j-1) ) & U(T[j]) ]  OR  BitShift( Ml-1(j-1) )

Example
T = xabxabaaca,  P = abaad,  k = 1

        j: 1 2 3 4 5 6 7 8 9 10
M1 = i=1:  1 1 1 1 1 1 1 1 1 1
     i=2:  0 0 1 0 0 1 0 1 1 0
     i=3:  0 0 0 1 0 0 1 0 0 1
     i=4:  0 0 0 0 1 0 0 1 0 0
     i=5:  0 0 0 0 0 0 0 0 1 0

M0 = i=1:  0 1 0 0 1 0 1 1 0 1
     i=2:  0 0 1 0 0 1 0 0 0 0
     i=3:  0 0 0 0 0 0 1 0 0 0
     i=4:  0 0 0 0 0 0 0 1 0 0
     i=5:  0 0 0 0 0 0 0 0 0 0

M1(5,9) = 1 ⇒ P occurs with at most 1 mismatch ending at position 9 of T.

How much do we pay?

The running time is O(k·n·(1+m/w)).
Again, the method is practically efficient for small m.
Only O(k) columns of M are needed at any given time, hence the space used by the algorithm is O(k) memory words.
(A minimal sketch follows.)
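A minimal sketch (not part of the slides) of the k-mismatch variant, implementing the recurrence Ml(j) = [BitShift(Ml(j-1)) & U(T[j])] OR BitShift(Ml-1(j-1)) with one machine word per level l.

  def shift_and_k_mismatches(T, P, k):
      m = len(P)
      U = {}
      for i, c in enumerate(P):
          U[c] = U.get(c, 0) | (1 << i)
      last = 1 << (m - 1)
      M = [0] * (k + 1)                        # M[l] = current column of M^l
      occ = []
      for j, c in enumerate(T, start=1):
          prev = M[:]                          # columns for position j-1
          for l in range(k + 1):
              col = ((prev[l] << 1) | 1) & U.get(c, 0)      # case 1: characters match
              if l > 0:
                  col |= (prev[l - 1] << 1) | 1             # case 2: spend one mismatch
              M[l] = col
          if M[k] & last:
              occ.append(j)                    # P occurs ending at j with <= k mismatches
      return occ

  print(shift_and_k_mismatches("xabxabaaca", "abaad", 1))   # [9]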

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations

The Shift-And method can solve other operations as well.

The edit distance between two strings p and s is d(p,s) = the minimum number of operations needed to transform p into s via three ops:
  Insertion: insert a symbol in p
  Deletion: delete a symbol from p
  Substitution: change a symbol of p into a different one

Example: d(ananas, banane) = 3

Search by regular expressions
  Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword lengths, and thus to build the tree…
This may be extremely time/space costly when you deal with Gbs of textual data.

A simple algorithm
Sort the pi in decreasing order, and encode si via the variable-length code for the integer i.

γ-code for integer encoding
  γ(x) = (Length-1) zeros, followed by x in binary, where x > 0 and Length = ⌊log2 x⌋ + 1
  e.g., 9 is represented as <000,1001>.
  The γ-code for x takes 2⌊log2 x⌋ + 1 bits (i.e. a factor of 2 from optimal).
  Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers.
  It is a prefix-free encoding…

Exercise: given the following sequence of γ-coded integers, reconstruct the original sequence:
  0001000001100110000011101100111
  (answer: 8, 6, 3, 59, 7 — checked in the sketch below)
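A minimal sketch (not part of the slides) of γ-encoding/decoding, with bit strings kept as Python strings just to check the exercise.

  def gamma_encode(x):
      b = bin(x)[2:]                      # binary representation of x > 0
      return "0" * (len(b) - 1) + b       # (Length-1) zeros, then x in binary

  def gamma_decode_all(bits):
      out, i = [], 0
      while i < len(bits):
          zeros = 0
          while bits[i] == "0":           # count the unary length prefix
              zeros += 1
              i += 1
          out.append(int(bits[i:i + zeros + 1], 2))
          i += zeros + 1
      return out

  code = "".join(gamma_encode(x) for x in (8, 6, 3, 59, 7))
  print(code)                             # 0001000001100110000011101100111
  print(gamma_decode_all(code))           # [8, 6, 3, 59, 7]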

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).
Recall that |γ(i)| ≤ 2·log2 i + 1.

How good is this approach wrt Huffman?  Compression ratio ≤ 2·H0(s) + 1

Key fact:   1 ≥ Σ_{i=1,...,x} pi ≥ x · px   ⇒   x ≤ 1/px

The cost of the encoding is (recall i ≤ 1/pi):

  Σ_{i=1,..,|S|} pi · |γ(i)|  ≤  Σ_{i=1,..,|S|} pi · [ 2·log2(1/pi) + 1 ]  =  2·H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
The distribution of words is skewed: ∝ 1/i^q, where 1 < q < 2

A new concept: Continuers vs Stoppers
  Previously we used s = c = 128.
The main idea is:
  s + c = 256 (we are playing with 8 bits)
  Thus s items are encoded with 1 byte,
  s·c with 2 bytes, s·c^2 with 3 bytes, ...

An example
  5000 distinct words.
  ETDC encodes 128 + 128^2 = 16512 words on at most 2 bytes.
  A (230,26)-dense code encodes 230 + 230·26 = 6210 words on at most 2 bytes, hence more words on 1 byte, and thus it wins if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move-to-Front Coding
Transforms a char sequence into an integer sequence, that can then be var-length coded.

  Start with the list of symbols L = [a,b,c,d,…]
  For each input symbol s:
    1) output the position of s in L
    2) move s to the front of L

There is a memory.
Properties: it exploits temporal locality, and it is dynamic.
  X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n^2 log n) bits, MTF = O(n log n) + n^2 bits

Not much worse than Huffman ... but it may be far better.
(A minimal sketch follows.)
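A minimal sketch (not part of the slides) of the MTF transform and its inverse; the list update here is O(|S|) per symbol, while the tree/hash variant on the next slide achieves O(log |S|).

  def mtf_encode(text, alphabet):
      L = list(alphabet)
      out = []
      for s in text:
          pos = L.index(s)                # position of s in the list (0-based here)
          out.append(pos)
          L.pop(pos)
          L.insert(0, s)                  # move s to the front
      return out

  def mtf_decode(codes, alphabet):
      L = list(alphabet)
      out = []
      for pos in codes:
          s = L.pop(pos)
          out.append(s)
          L.insert(0, s)
      return "".join(out)

  codes = mtf_encode("aaabbbbaac", "abc")
  print(codes, mtf_decode(codes, "abc"))  # [0, 0, 0, 1, 0, 0, 0, 1, 0, 2] aaabbbbaac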

MTF: how good is it?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log2 i + 1.
Put the alphabet S in front and consider the cost of encoding; symbol x (with n_x occurrences at positions p_1 < p_2 < ... < p_{n_x}) costs the γ-codes of its gaps:

  O(|S| log |S|) + Σ_{x=1..|S|} Σ_{i=2..n_x} |γ( p_i^x - p_{i-1}^x )|

By Jensen’s inequality (the gaps of symbol x sum to at most N):

  ≤ O(|S| log |S|) + Σ_{x=1..|S|} n_x · [ 2·log2(N/n_x) + 1 ]
  = O(|S| log |S|) + N · [ 2·H0(X) + 1 ]

Hence La[mtf] ≤ 2·H0(X) + O(1).

MTF: higher compression
To achieve higher compression we consider words (and separators) as the symbols to be encoded.
How to maintain the MTF-list efficiently:
  Search tree
    leaves contain the symbols, ordered as in the MTF-list;
    nodes contain the size of their descending subtree.
  Hash table
    key is a symbol;
    data is a pointer to the corresponding tree leaf.
Each tree operation takes O(log |S|) time; the total is O(n log |S|), where n = #symbols to be compressed.

Run Length Encoding (RLE)
If spatial locality is very high, then
  abbbaacccca ⇒ (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings ⇒ just the run lengths and one starting bit.
Properties: it exploits spatial locality, it is a dynamic code, and there is a memory.
  X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = Θ(n^2 log n)  >  Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol an interval range in [0,1), using the cumulative probabilities

  f(i) = Σ_{j=1}^{i-1} p(j)

e.g.  p(a) = .2, p(b) = .5, p(c) = .3  ⇒  f(a) = .0, f(b) = .2, f(c) = .7
so a ↦ [0, .2), b ↦ [.2, .7), c ↦ [.7, 1.0).

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7)).

Arithmetic Coding: Encoding Example
Coding the message sequence: bac   (p(a)=.2, p(b)=.5, p(c)=.3)

  start:         [0, 1)
  after b (.5):  [0.2, 0.7)
  after a (.2):  [0.2, 0.3)
  after c (.3):  [0.27, 0.3)

The final sequence interval is [.27, .3).

Arithmetic Coding
To code a sequence of symbols c1 c2 ... cn with probabilities p[c], use:

  l_0 = 0,   l_i = l_{i-1} + s_{i-1} · f[c_i]
  s_0 = 1,   s_i = s_{i-1} · p[c_i]

f[c] is the cumulative probability up to symbol c (not included).
The final interval size is

  s_n = Π_{i=1}^{n} p[c_i]

The interval for a message sequence will be called the sequence interval.
(A minimal sketch follows.)
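A minimal sketch (not part of the slides) of the interval computation above, on the same toy distribution; it uses floating point, so it is only meaningful for short messages (real coders use the integer version discussed later).

  def sequence_interval(msg, p):
      # p: dict symbol -> probability; f[c] = cumulative probability of the symbols before c.
      f, acc = {}, 0.0
      for c in sorted(p):
          f[c] = acc
          acc += p[c]
      l, s = 0.0, 1.0
      for c in msg:
          l = l + s * f[c]                  # l_i = l_{i-1} + s_{i-1} * f[c_i]
          s = s * p[c]                      # s_i = s_{i-1} * p[c_i]
      return l, l + s                       # the sequence interval [l, l+s)

  print(sequence_interval("bac", {"a": .2, "b": .5, "c": .3}))   # (0.27, 0.3) up to rounding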

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3 (p(a)=.2, p(b)=.5, p(c)=.3):

  in [0, 1):      .49 falls in [.2, .7)     ⇒ b, restrict to [.2, .7)
  in [.2, .7):    .49 falls in [.3, .55)    ⇒ b, restrict to [.3, .55)
  in [.3, .55):   .49 falls in [.475, .55)  ⇒ c

The message is bbc.

Representing a real number
Binary fractional representation:
  .75 = .11      1/3 = .0101...      11/16 = .1011

Algorithm (emit the bits of x ∈ [0,1)):
  1. x = 2·x
  2. if x < 1 output 0
  3. else x = x - 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
e.g.  [0,.33) → .01     [.33,.66) → .1     [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions:

  number   min     max     interval
  .11      .110    .111    [.75, 1.0)
  .101     .1010   .1011   [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval is contained in the sequence interval (a dyadic number).

e.g. the sequence interval [.61, .79) contains the code interval [.625, .75) of .101.

One can use L + s/2 truncated to 1 + ⌈log2(1/s)⌉ bits.

Bound on Arithmetic length
Note that -log2 s + 1 = log2(2/s).

Bound on Length
Theorem: for a text of length n, the Arithmetic encoder generates at most

  1 + ⌈log2(1/s)⌉ = 1 + ⌈log2 Π_i (1/p_i)⌉
                  ≤ 2 + Σ_{j=1,n} log2(1/p_j)
                  = 2 + Σ_{k=1,|S|} n·p_k · log2(1/p_k)
                  = 2 + n·H0   bits

In practice ≈ n·H0 + 0.02·n bits, because of rounding.

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
  output 1 followed by m 0s; m = 0; the message interval is expanded by 2.
If u < R/2 then (bottom half)
  output 0 followed by m 1s; m = 0; the message interval is expanded by 2.
If l ≥ R/4 and u < 3R/4 then (middle half)
  increment m; the message interval is expanded by 2.
In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
String = ACCBACCACBA, next symbol = B, k = 2

Order 0 (empty context):   A = 4   B = 2   C = 5   $ = 3
Order 1:
  A:   C = 3   $ = 1
  B:   A = 2   $ = 1
  C:   A = 1   B = 2   C = 2   $ = 3
Order 2:
  AC:  B = 1   C = 2   $ = 2
  BA:  C = 1   $ = 1
  CA:  C = 1   $ = 1
  CB:  A = 2   $ = 1
  CC:  A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
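A minimal sketch (not part of the slides) of LZ77 parsing with a bounded window; the longest-match search is naive, whereas real implementations such as gzip use hashing.

  def lz77_compress(s, W=6):
      out, cur = [], 0
      while cur < len(s):
          best_len, best_dist = 0, 0
          start = max(0, cur - W)
          for d in range(1, cur - start + 1):            # candidate copy distances
              length = 0
              while (cur + length < len(s) - 1 and       # keep one char for the literal
                     s[cur + length - d] == s[cur + length]):
                  length += 1                            # copies may overlap the cursor
              if length > best_len:
                  best_len, best_dist = length, d
          out.append((best_dist, best_len, s[cur + best_len]))
          cur += best_len + 1
      return out

  print(lz77_compress("aacaacabcabaaac"))
  # [(0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), (3, 3, 'a'), (1, 2, 'c')], as in the example above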

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later
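A minimal sketch (not part of the slides) of the LZW encoder, starting from the toy dictionary of the slides (a = 112, b = 113, c = 114, new entries from id 256); the final output also flushes the pending phrase.

  def lzw_encode(s, base={"a": 112, "b": 113, "c": 114}):
      dict_ = dict(base)
      next_id = 256
      out, S = [], ""
      for c in s:
          if S + c in dict_:
              S += c                      # extend the current match
          else:
              out.append(dict_[S])        # emit the id of the longest match S
              dict_[S + c] = next_id      # add Sc to the dictionary
              next_id += 1
              S = c
      out.append(dict_[S])
      return out

  print(lzw_encode("aabaacababacb"))
  # [112, 112, 113, 256, 114, 257, 261, 114, 113]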

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n^2 log n) time in the worst case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics

Size
  1 trillion pages available (Google, 7/2008)
  5-40K per page ⇒ hundreds of terabytes
  Size grows every day!!

Change
  8% new pages and 25% new links change weekly
  Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)
  V = URLs, E = (u,v) if u has a hyperlink to v
Isolated URLs are ignored (no IN & no OUT)

Three key properties:
  Skewed distribution: the probability that a node has x links is ∝ 1/x^α, α ≈ 2.1

The In-degree distribution
Altavista crawl (1999) and WebBase crawl (2001): the indegree follows a power law

  Pr[ in-degree(u) = k ] ∝ 1/k^α,   α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed adjacency list vs adjacency list with compressed gaps (this exploits locality):

  Successor list S(x) = {s1, s2, ..., sk} is encoded by the gaps {s1-x, s2-s1-1, ..., sk-sk-1-1}

Only the first gap s1-x may be negative; negative entries are mapped to non-negative integers (see the residuals example below: 2x for x ≥ 0, 2|x|-1 for x < 0).
(A minimal sketch follows.)
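A minimal sketch (not part of the slides) of this gap encoding; the node id and successor list are illustrative, and only the plain gap scheme is shown (copy-lists and intervals, described next, are omitted).

  def to_nat(x):
      # Map an integer to a non-negative one: x >= 0 -> 2x,  x < 0 -> 2|x| - 1.
      return 2 * x if x >= 0 else 2 * (-x) - 1

  def encode_successors(x, succ):
      # succ: sorted successor list of node x; only the first gap can be negative.
      return [to_nat(succ[0] - x)] + [succ[i] - succ[i - 1] - 1 for i in range(1, len(succ))]

  print(encode_successors(15, [13, 15, 16, 17, 18, 19, 23, 24, 203]))
  # [3, 1, 0, 0, 0, 0, 3, 0, 178]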

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides and efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
            Emacs size   Emacs time
  uncompr   27Mb         ---
  gzip      8Mb          35 secs
  zdelta    1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: weighted graph over the files plus a dummy node 0; edge weights are the zdelta-compressed sizes, edges from the dummy node weigh the gzip-compressed sizes; the min branching picks the cheapest reference for each file.]

            space   time
  uncompr   30Mb    ---
  tgz       20%     linear
  THIS      8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
            space    time
  uncompr   260Mb    ---
  tgz       12%      2 mins
  THIS      8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

            gcc size   emacs size
  total     27288      27326
  gzip      7563       8577
  zdelta    227        1431
  rsync     964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elements, log(n/k) levels.
If the distance is k, then on each level at most k hashes fail to find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits.

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
#

0

s

i

1
ssi

ppi#

4

#

1

p

i

3

2

1

ppi#

6

pi#

7

8
5

T# = mississippi#
2 4 6 8 10

si

ppi#

i#

ppi#

11

mississippi#

12

2

1 10

9

4

3

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
5

Q(N2) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
⇒ in practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
  ⇒ overall, O(p log2 N) time
  Improvable to O(p + log2 N) [Manber-Myers, ’90] and to O(p + log2 |S|) [Cole et al., ’06]
(A minimal sketch of the binary search follows.)
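A minimal sketch (not part of the slides) of the indirect binary search over SA, returning the range of suffixes prefixed by P; the suffix array is built naively here just for the example.

  def sa_range(T, SA, P):
      # Each comparison inspects at most |P| characters, hence O(p log2 N) time.
      def lower(key):
          lo, hi = 0, len(SA)
          while lo < hi:
              mid = (lo + hi) // 2
              if T[SA[mid]:SA[mid] + len(key)] < key:
                  lo = mid + 1
              else:
                  hi = mid
          return lo
      l = lower(P)
      r = lower(P + chr(0x10FFFF))        # any key larger than every extension of P
      return l, r                         # SA[l:r] are the suffixes prefixed by P

  T = "mississippi#"
  SA = sorted(range(len(T)), key=lambda i: T[i:])
  l, r = sa_range(T, SA, "si")
  print(r - l, [SA[i] + 1 for i in range(l, r)])   # 2 occurrences, at positions 7 and 4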

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...]? It is the min of the subarray Lcp[h,k-1], where SA[h]=i and SA[k]=j.
• Is there a repeated substring of length ≥ L? Search for an entry Lcp[i] ≥ L.
• Is there a substring of length ≥ L occurring ≥ C times? Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.


Slide 174

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 10^4 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and the algorithm uses all of it:
(1/B) * (p * f/(1+f) * C) ≈ 30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
Running times for different input sizes n:

 n       4K    8K    16K   32K    128K   256K   512K   1M
 n^3     22s   3m    26m   3.5h   28h    --     --     --
 n^2     0     0     0     1s     26s    106s   7m     28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm

  sum = 0; max = -1;
  For i = 1,...,n do
      if (sum + A[i] ≤ 0) then sum = 0;
      else { sum += A[i]; max = MAX{max, sum}; }

Note:
• Sum < 0 when OPT starts;
• Sum > 0 within OPT
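A minimal Python sketch of the linear-time scan above (variable names are mine; unlike the slide it initializes the maximum to -infinity, so it also works when the assumption of a positive-sum window fails only by returning -inf):

  def max_subarray_sum(A):
      # Reset the running sum when it would drop to <= 0, otherwise extend it
      # and keep the best value seen so far (one pass, O(1) extra space).
      best = float('-inf')
      run = 0
      for x in A:
          if run + x <= 0:
              run = 0
          else:
              run += x
              best = max(best, run)
      return best

  # Example from the slide: the optimum window is 6 1 -2 4 3, of sum 12.
  print(max_subarray_sum([2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]))   # 12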

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort ⇒ Θ(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
  if (i < j) then
      m = (i+j)/2;           // Divide
      Merge-Sort(A,i,m);     // Conquer
      Merge-Sort(A,m+1,j);   // Conquer
      Merge(A,i,m,j)         // Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

[figure: the recursion tree of binary merge-sort on a sample array, with log_2 N levels of pairwise merges]

If the run-size is larger than B (i.e. after the first step!!), fetching all of it in memory for merging does not help.

How do we deploy the disk/memory features?
Produce N/M runs, each sorted in internal memory (no extra I/Os); with binary merging, the I/O-cost for merging them is ≈ 2 (N/B) log_2 (N/M).

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge.
Sort N items with main memory M and disk pages of B items:
  Pass 1: produce (N/M) sorted runs.
  Pass i: merge X = M/B runs at a time → log_{M/B} (N/M) passes.

[figure: X input buffers and one output buffer, each of B items, kept in main memory; runs are streamed from disk and the merged run is streamed back to disk]

Multiway Merging

[figure: one buffer Bf_i (holding the current page of run i, with cursor p_i) for each of the X = M/B runs, plus an output buffer Bf_o; repeatedly emit min(Bf_1[p_1], Bf_2[p_2], …, Bf_X[p_X]) into Bf_o, fetch the next page of a run when its cursor reaches B, and flush Bf_o when it is full, until EOF]
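A Python sketch of the multiway-merge step, using heapq to pick the minimum among the heads of the X runs (file handling and page buffers are abstracted away, which is an assumption for brevity):

  import heapq

  def multiway_merge(runs):
      # runs: list of already-sorted iterables (the X = M/B sorted runs).
      # heapq.merge keeps one "current" item per run, mimicking the input
      # buffers of the figure; output is produced in globally sorted order.
      return list(heapq.merge(*runs))

  # Toy usage: three sorted runs merged in a single pass.
  print(multiway_merge([[1, 5, 9], [2, 6, 10], [3, 4, 8]]))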

Cost of Multi-way Merge-Sort


Number of passes = log_{M/B} #runs ≈ log_{M/B} (N/M)
Optimal cost = Θ((N/B) log_{M/B} (N/M)) I/Os

In practice:
  M/B ≈ 1000 → #passes = log_{M/B} (N/M) ≈ 1
  One multiway merge → 2 passes = a few minutes
  (tuning depends on disk features)

• A large fan-out (M/B) decreases #passes
• Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (with S large).
Math Problem: Find the item y whose frequency is > N/2, using the smallest space (i.e. assuming the mode occurs > N/2 times).

A = b a c c c d c b a a a c c b c c c

Algorithm
  Use a pair of variables <X,C>, with C initially 0.
  For each item s of the stream:
      if (X == s) then C++
      else { C--; if (C == 0) { X = s; C = 1; } }
  Return X;

Proof sketch
If the returned X ≠ y, then every one of y's occurrences has been cancelled by a distinct "negative" mate, hence the mates are ≥ #occ(y). As a result 2 * #occ(y) ≤ N, contradicting #occ(y) > N/2.
(Problems arise if the mode occurs ≤ N/2 times: the returned X need not be the most frequent item.)
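A Python sketch of the constant-space scan above (an equivalent formulation of the same majority-vote idea: the counter is reset before the comparison rather than after):

  def majority_candidate(stream):
      X, C = None, 0
      for s in stream:
          if C == 0:
              X, C = s, 1
          elif X == s:
              C += 1
          else:
              C -= 1
      return X

  # The slide's stream: 'c' occurs 9 times out of 17 > N/2.
  A = list("bacccdcbaaaccbccc")
  print(majority_candidate(A))   # 'c'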

Toy problem #4: Indexing


Consider the following TREC collection:
  N = 6 * 10^9 chars → size = 6GB
  n = 10^6 documents
  TotT = 10^9 total terms (avg term length is 6 chars)
  t = 5 * 10^5 distinct terms

What kind of data structure should we build to support word-based searches ?

Solution 1: Term-Doc matrix     (t = 500K distinct terms, n = 1 million documents)

             Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
 Antony             1                1             0          0        0        1
 Brutus             1                1             0          1        0        0
 Caesar             1                1             0          1        1        1
 Calpurnia          0                1             0          0        0        0
 Cleopatra          1                0             0          0        0        0
 mercy              1                0             1          1        1        1
 worser             1                0             1          1        1        0

(1 if the play contains the word, 0 otherwise)      Space is 500Gb !

Solution 2: Inverted index

 Brutus    → 2 4 8 16 32 64 128
 Calpurnia → 1 2 3 5 8 13 21 34
 Caesar    → 13 16

We can still do better: i.e. 30-50% of the original text.

1. Typically about 12 bytes are used per posting
2. We have 10^9 total terms → at least 12Gb of space
3. Compressing the 6Gb of documents gets 1.5Gb of data
A better index, but it is still >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in fewer bits ?
NO: there are 2^n of them, but the number of shorter bit strings available as compressed messages is only

   Σ_{i=1}^{n-1} 2^i = 2^n − 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S, where symbol s has probability p(s), the self-information of s is:

   i(s) = log_2 (1/p(s)) = − log_2 p(s)

Lower probability → higher information.

Entropy is the weighted average of i(s):

   H(S) = Σ_{s∈S} p(s) · log_2 (1/p(s))    bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword lengths L[s], the average length is defined as

   L_a(C) = Σ_{s∈S} p(s) · L[s]

We say that a prefix code C is optimal if for all prefix codes C', L_a(C) ≤ L_a(C').

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal uniquely decodable code, there exists a prefix code with the same codeword lengths, and thus the same optimal average length. And vice versa…

Theorem (golden rule). If C is an optimal prefix code for a source with probabilities {p1, …, pn}, then p_i < p_j ⇒ L[s_i] ≥ L[s_j].

Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any uniquely decodable code C, we have

   H(S) ≤ L_a(C)

Theorem (upper bound, Shannon). For any probability distribution, there exists a prefix code C such that

   L_a(C) ≤ H(S) + 1

(the Shannon code takes ⌈log_2 1/p(s)⌉ bits per symbol)

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

[figure: Huffman tree construction — merge a(.1) and b(.2) into (.3), then (.3) and c(.2) into (.5), then (.5) and d(.5) into (1)]

a = 000, b = 001, c = 01, d = 1
There are 2^(n-1) "equivalent" Huffman trees.

What about ties (and thus, tree depth) ?
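A compact Python sketch of the construction on the running example, using a heap of (probability, counter, code-table) triples; the tie-breaking by insertion counter is my own arbitrary choice, which is exactly the "ties" issue raised above:

  import heapq

  def huffman_code(probs):
      # probs: dict symbol -> probability. Build the tree bottom-up by
      # repeatedly merging the two least-probable subtrees.
      heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
      heapq.heapify(heap)
      count = len(heap)
      while len(heap) > 1:
          p0, _, c0 = heapq.heappop(heap)
          p1, _, c1 = heapq.heappop(heap)
          merged = {s: "0" + w for s, w in c0.items()}
          merged.update({s: "1" + w for s, w in c1.items()})
          heapq.heappush(heap, (p0 + p1, count, merged))
          count += 1
      return heap[0][2]

  print(huffman_code({'a': .1, 'b': .2, 'c': .2, 'd': .5}))
  # an optimal prefix code with lengths 3,3,2,1:
  # {'d': '0', 'c': '10', 'a': '110', 'b': '111'}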

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. Its self-information is

   −log_2(.999) ≈ .00144 bits

If we were to send 1000 such symbols we might hope to use 1000 * .00144 ≈ 1.44 bits.
Using Huffman, we take at least 1 bit per symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
  Symbols of the Huffman tree are the words of T
  The Huffman tree has fan-out 128
  Codewords are byte-aligned and tagged: the tag bit marks the first byte of a codeword, the remaining 7 bits per byte carry the Huffman code

[figure: the word-based Huffman tree over the words of T = "bzip or not bzip" (bzip, or, not, space), and the resulting byte-aligned, tagged codewords forming C(T)]

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which arithmetic and bit operations replace comparisons.
We will survey two examples of such methods:




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m:

   H(s) = Σ_{i=1}^{m} 2^{m−i} · s[i]

Example: P = 0101,  H(P) = 2^3·0 + 2^2·1 + 2^1·0 + 2^0·1 = 5

s = s' if and only if H(s) = H(s')

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

   H(T_r) = 2·H(T_{r−1}) − 2^m·T[r−1] + T[r+m−1]

T = 10110101
T_1 = 1 0 1 1,  T_2 = 0 1 1 0
H(T_1) = H(1011) = 11
H(T_2) = H(0110) = 2·11 − 2^4·1 + 0 = 22 − 16 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P = 101111, q = 7
H(P) = 47,  Hq(P) = 47 mod 7 = 5

Hq(P) can be computed incrementally:
  1·2 + 0 ≡ 2 (mod 7)
  2·2 + 1 ≡ 5 (mod 7)
  5·2 + 1 ≡ 4 (mod 7)
  4·2 + 1 ≡ 2 (mod 7)
  2·2 + 1 ≡ 5 (mod 7)  =  Hq(P)

We can still compute Hq(T_r) from Hq(T_{r−1}), since
   2^m (mod q) = 2·(2^{m−1} (mod q)) (mod q)
Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




• Choose a positive integer I.
• Pick a random prime q ≤ I, and compute P's fingerprint Hq(P).
• For each position r in T, compute Hq(T_r) and test whether it equals Hq(P). If the two values are equal, either
    - declare a probable match (randomized algorithm), or
    - check the characters and declare a definite match (deterministic algorithm).

• Running time: excluding verification, O(n+m).
• The randomized algorithm is correct w.h.p.
• The deterministic algorithm has expected running time O(n+m).

Proof on the board
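A Python sketch of the fingerprint scan over a binary text, in the deterministic flavour (the verification step rules out false matches); fixing q instead of picking it at random is an assumption made only for illustration:

  def karp_rabin(T, P, q=2**31 - 1):
      n, m = len(T), len(P)
      if m > n:
          return []
      base = 2                      # binary alphabet, as in the slides
      top = pow(base, m - 1, q)     # base^(m-1) mod q, used to drop the leftmost char
      hp = ht = 0
      for i in range(m):
          hp = (hp * base + int(P[i])) % q
          ht = (ht * base + int(T[i])) % q
      occ = []
      for r in range(n - m + 1):
          if hp == ht and T[r:r+m] == P:   # verify, to rule out false matches
              occ.append(r + 1)            # 1-based positions, as in the slides
          if r + m < n:
              ht = ((ht - int(T[r]) * top) * base + int(T[r+m])) % q
      return occ

  print(karp_rabin("10110101", "0101"))    # [5], as in the example above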

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

          c  a  l  i  f  o  r  n  i  a
          1  2  3  4  5  6  7  8  9  10
 f   1    0  0  0  0  1  0  0  0  0  0
 o   2    0  0  0  0  0  1  0  0  0  0
 r   3    0  0  0  0  0  0  1  0  0  0

(M is an m × n matrix; here M(3,7) = 1 because P = for ends at position 7 of T)

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is the bit-wise AND between A and B.
BitShift(A) is the value derived by shifting A's bits down by one position and setting the first bit to 1, i.e. BitShift((a_1, a_2, …, a_m)) = (1, a_1, …, a_{m−1}).


Let w be the word size (e.g., 32 or 64 bits). We'll assume m ≤ w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit bit-parallelism to compute the j-th column of M from the (j−1)-th one.
We define the m-length binary vector U(x) for each character x of the alphabet: U(x) is set to 1 at the positions in P where character x appears.
Example: P = abaac

  U(a) = (1, 0, 1, 1, 0)
  U(b) = (0, 1, 0, 0, 0)
  U(c) = (0, 0, 0, 0, 1)

How to construct M



Initialize column 0 of M to all zeros.
For j > 0, the j-th column is obtained by

   M(j) = BitShift(M(j−1)) & U(T[j])

For i > 1, entry M(i,j) = 1 iff
 (1) the first i−1 characters of P match the i−1 characters of T ending at character j−1  ⇔  M(i−1, j−1) = 1
 (2) P[i] = T[j]  ⇔  the i-th bit of U(T[j]) = 1

BitShift moves bit M(i−1, j−1) into the i-th position; AND-ing this with the i-th bit of U(T[j]) establishes whether both conditions are true.
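A bit-parallel Python sketch of the construction just described; a Python integer plays the role of the machine word (bit i−1 of the integer stores row i of the current column), so the m ≤ w restriction is not enforced here:

  def shift_and(T, P):
      m = len(P)
      # U[x]: bitmask with bit (i-1) set iff P[i] == x
      U = {}
      for i, x in enumerate(P):
          U[x] = U.get(x, 0) | (1 << i)
      occ = []
      M = 0
      for j, c in enumerate(T, start=1):
          # M(j) = BitShift(M(j-1)) & U(T[j]): shift, force a 1 in the first row
          M = ((M << 1) | 1) & U.get(c, 0)
          if M & (1 << (m - 1)):       # row m set => an occurrence ends at j
              occ.append(j - m + 1)    # 1-based starting position
      return occ

  print(shift_and("xabxabaaca", "abaac"))   # [5]: P occurs at T[5..9]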

An example, j=1    (T = xabxabaaca, P = abaac)

   M(1) = BitShift(M(0)) & U(T[1]) = (1,0,0,0,0) & U(x) = (1,0,0,0,0) & (0,0,0,0,0) = (0,0,0,0,0)

An example, j=2

   M(2) = BitShift(M(1)) & U(T[2]) = (1,0,0,0,0) & U(a) = (1,0,0,0,0) & (1,0,1,1,0) = (1,0,0,0,0)

An example, j=3

   M(3) = BitShift(M(2)) & U(T[3]) = (1,1,0,0,0) & U(b) = (1,1,0,0,0) & (0,1,0,0,0) = (0,1,0,0,0)

An example, j=9

   M(9) = BitShift(M(8)) & U(T[9]) = (1,1,0,0,1) & U(c) = (1,1,0,0,1) & (0,0,0,0,1) = (0,0,0,0,1)

The columns computed so far (T = xabxabaaca, P = abaac):

          x  a  b  x  a  b  a  a  c
          1  2  3  4  5  6  7  8  9
 a   1    0  1  0  0  1  0  1  1  0
 b   2    0  0  1  0  0  1  0  0  0
 a   3    0  0  0  0  0  0  1  0  0
 a   4    0  0  0  0  0  0  0  1  0
 c   5    0  0  0  0  0  0  0  0  1

M(5,9) = 1: an occurrence of P ends at position 9 (it starts at position 5).

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac

  U(a) = (1, 0, 1, 1, 0)
  U(b) = (1, 1, 0, 0, 0)
  U(c) = (0, 0, 0, 0, 1)

What about '?', '[^…]' (not) ?

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

  S is the concatenation of the patterns in P.
  R is a bitmap of length m: R[i] = 1 iff S[i] is the first symbol of a pattern.

Use a variant of the Shift-And method searching for S:
  For any symbol c, U'(c) = U(c) AND R, i.e. U'(c)[i] = 1 iff S[i] = c and it is the first symbol of a pattern.
  For any step j:
    compute M(j);
    then OR it with U'(T[j]). Why? To set to 1 the first bit of each pattern that starts with T[j];
    check whether there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal: given k, find all the occurrences of P in T with up to k mismatches.
We define the matrix M^l to be an m by n binary matrix such that:
  M^l(i,j) = 1 iff there are no more than l mismatches between the first i characters of P and the i characters of T ending at character j.

What is M^0?
How does M^k solve the k-mismatch problem?

Computing M^k

We compute M^l for all l = 0, …, k: for each j we compute M^0(j), M^1(j), …, M^k(j), initializing every M^l(0) to the zero vector. In order to compute M^l(j), we observe that M^l(i,j) = 1 iff one of the two cases below holds.

Computing M^l: case 1

The first i−1 characters of P match a substring of T ending at j−1 with at most l mismatches, and the next pair of characters in P and T are equal:

   BitShift(M^l(j−1)) & U(T[j])

Computing M^l: case 2

The first i−1 characters of P match a substring of T ending at j−1 with at most l−1 mismatches (the i-th character is then allowed to mismatch):

   BitShift(M^{l−1}(j−1))

Computing M^l

Combining the two cases, for l = 1, …, k:

   M^l(j) = [ BitShift(M^l(j−1)) & U(T[j]) ]  OR  BitShift(M^{l−1}(j−1))
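A Python sketch of the k-mismatch extension, directly transcribing the recurrence above with one bit-vector per mismatch level l = 0..k (integers again stand in for machine words):

  def shift_and_k_mismatches(T, P, k):
      m = len(P)
      U = {}
      for i, x in enumerate(P):
          U[x] = U.get(x, 0) | (1 << i)
      occ = []
      M = [0] * (k + 1)                       # M[l] = current column of M^l
      for j, c in enumerate(T, start=1):
          prev = M[:]                         # columns for position j-1
          for l in range(k + 1):
              match = ((prev[l] << 1) | 1) & U.get(c, 0)        # case 1
              spend = ((prev[l-1] << 1) | 1) if l > 0 else 0    # case 2: use one mismatch
              M[l] = match | spend
          if M[k] & (1 << (m - 1)):
              occ.append(j - m + 1)           # occurrence with <= k mismatches ends at j
      return occ

  print(shift_and_k_mismatches("xabxabaaca", "abaad", 1))   # [5]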

Example: M^1 and M^0 for T = xabxabaaca, P = abaad

 M^1 =        x  a  b  x  a  b  a  a  c  a
              1  2  3  4  5  6  7  8  9  10
    a   1     1  1  1  1  1  1  1  1  1  1
    b   2     0  0  1  0  0  1  0  1  1  0
    a   3     0  0  0  1  0  0  1  0  0  1
    a   4     0  0  0  0  1  0  0  1  0  0
    d   5     0  0  0  0  0  0  0  0  1  0

 M^0 =        x  a  b  x  a  b  a  a  c  a
              1  2  3  4  5  6  7  8  9  10
    a   1     0  1  0  0  1  0  1  1  0  1
    b   2     0  0  1  0  0  1  0  0  0  0
    a   3     0  0  0  0  0  0  1  0  0  0
    a   4     0  0  0  0  0  0  0  1  0  0
    d   5     0  0  0  0  0  0  0  0  0  0

M^1(5,9) = 1: P = abaad occurs at T[5,9] = abaac with 1 mismatch.

How much do we pay?





The running time is O(kn(1+m/w)
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is d(p,s) = the minimum number of operations needed to transform p into s, via three ops:
  Insertion: insert a symbol in p
  Deletion: delete a symbol from p
  Substitution: change a symbol of p into a different one

Example: d(ananas, banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding

   (Length − 1 zeros) followed by x in binary,  where x > 0 and Length = ⌊log_2 x⌋ + 1

e.g., 9 is represented as <000,1001>.

The γ-code for x takes 2⌊log_2 x⌋ + 1 bits (i.e. a factor of 2 from optimal).
It is optimal for Pr(x) = 1/(2x^2) over i.i.d. integers, and it is a prefix-free encoding…

Exercise: given the following sequence of γ-coded integers, reconstruct the original sequence:

   0001000001100110000011101100111

(Answer: 8, 6, 3, 59, 7.)
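A Python sketch of γ-encoding and decoding, which can be checked against the exercise above (bit strings are plain Python strings, an assumption made only for readability):

  def gamma_encode(x):
      b = bin(x)[2:]                   # x > 0 in binary: Length = floor(log2 x) + 1 bits
      return "0" * (len(b) - 1) + b    # Length-1 zeros, then x in binary

  def gamma_decode(bits):
      out, i = [], 0
      while i < len(bits):
          z = 0
          while bits[i] == "0":        # count the leading zeros = Length - 1
              z += 1; i += 1
          out.append(int(bits[i:i + z + 1], 2))
          i += z + 1
      return out

  print(gamma_encode(9))                                     # 0001001
  print(gamma_decode("0001000001100110000011101100111"))     # [8, 6, 3, 59, 7]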

Analysis
Sort the p_i in decreasing order, and encode s_i via the variable-length code γ(i).

Recall that |γ(i)| ≤ 2·log i + 1.
How good is this approach w.r.t. Huffman?  Compression ratio ≤ 2·H_0(s) + 1.
Key fact:  1 ≥ Σ_{i=1,…,x} p_i ≥ x · p_x  ⇒  x ≤ 1/p_x

How good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1.
The cost of the encoding is (recall i ≤ 1/p_i):

   Σ_{i=1,…,|S|} p_i · |γ(i)|  ≤  Σ_{i=1,…,|S|} p_i · [2·log(1/p_i) + 1]  =  2·H_0(X) + 1

Not much worse than Huffman, and improvable to H_0(X) + 2 + …

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move-to-Front Coding
Transforms a char sequence into an integer sequence, which can then be var-length coded.
  Start with the list of symbols L = [a,b,c,d,…]
  For each input symbol s:
    1) output the position of s in L
    2) move s to the front of L

There is a memory. Properties: it exploits temporal locality, and it is dynamic.
  X = 1^n 2^n 3^n … n^n  →  Huff = O(n^2 log n) bits, MTF = O(n log n) + n^2 bits

Not much worse than Huffman ... but it may be far better.
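A Python sketch of MTF encoding/decoding over a fixed initial symbol list (the initial ordering is an assumption, the slide leaves it generic):

  def mtf_encode(text, alphabet):
      L = list(alphabet)
      out = []
      for s in text:
          i = L.index(s)          # 1) output the position of s in L (0-based here)
          out.append(i)
          L.insert(0, L.pop(i))   # 2) move s to the front of L
      return out

  def mtf_decode(codes, alphabet):
      L = list(alphabet)
      out = []
      for i in codes:
          s = L[i]
          out.append(s)
          L.insert(0, L.pop(i))
      return "".join(out)

  c = mtf_encode("mississippi", "imps")
  print(c)                              # [1, 1, 3, 0, 1, 1, 0, 1, 3, 0, 1]
  print(mtf_decode(c, "imps"))          # mississippi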

MTF: how good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1.
Put the whole alphabet S at the front of the list and consider the cost of encoding (p^x_i = position of the i-th occurrence of symbol x, n_x = #occurrences of x, N = total length):

   O(|S| log |S|) + Σ_{x=1}^{|S|} Σ_{i=2}^{n_x} |γ(p^x_i − p^x_{i−1})|

By Jensen's inequality:

   ≤ O(|S| log |S|) + Σ_{x=1}^{|S|} n_x · [2·log(N/n_x) + 1]
   = O(|S| log |S|) + N·[2·H_0(X) + 1]

Hence L_a[mtf] ≤ 2·H_0(X) + O(1).

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
  abbbaacccca ⇒ (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings → just the run lengths and one initial bit.
There is a memory. Properties: it exploits spatial locality, and it is a dynamic code.
  X = 1^n 2^n 3^n … n^n  →  Huff(X) = Θ(n^2 log n) > Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive), of width equal to its probability:

   f(i) = Σ_{j=1}^{i−1} p(j)

e.g. with p(a) = .2, p(b) = .5, p(c) = .3:  f(a) = .0, f(b) = .2, f(c) = .7, so a → [0,.2), b → [.2,.7), c → [.7,1.0).

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7)).

Arithmetic Coding: Encoding Example
Coding the message sequence: bac

   start: [0, 1)  →  after b: [.2, .7)  →  after a: [.2, .3)  →  after c: [.27, .3)

The final sequence interval is [.27, .3).

Arithmetic Coding
To code a sequence of symbols c_1 c_2 … c_n with probabilities p[c], use the following:

   l_0 = 0,  s_0 = 1
   l_i = l_{i−1} + s_{i−1} · f[c_i]
   s_i = s_{i−1} · p[c_i]

where f[c] is the cumulative probability up to symbol c (not included). The final interval size is

   s_n = Π_{i=1}^{n} p[c_i]

The interval for a message sequence will be called the sequence interval.

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:

   .49 ∈ [.2, .7)   → b ; rescale: (.49 − .2)/.5 = .58
   .58 ∈ [.2, .7)   → b ; rescale: (.58 − .2)/.5 = .76
   .76 ∈ [.7, 1.0)  → c

The message is bbc.
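A Python sketch of the interval computations above, using exact fractions instead of the integer scaling discussed later (so this is the "ideal" arithmetic coder, an assumption made for clarity):

  from fractions import Fraction as F

  p = {'a': F(2, 10), 'b': F(5, 10), 'c': F(3, 10)}
  f = {'a': F(0), 'b': F(2, 10), 'c': F(7, 10)}     # cumulative prob., symbol excluded

  def encode_interval(msg):
      l, s = F(0), F(1)
      for c in msg:
          l, s = l + s * f[c], s * p[c]   # l_i = l_{i-1} + s_{i-1}*f[c_i]; s_i = s_{i-1}*p[c_i]
      return l, s

  def decode(x, n):
      msg = []
      for _ in range(n):
          c = max((ch for ch in p if f[ch] <= x), key=lambda ch: f[ch])
          msg.append(c)
          x = (x - f[c]) / p[c]           # rescale x inside the chosen symbol interval
      return "".join(msg)

  print(encode_interval("bac"))           # Fraction(27, 100), Fraction(3, 100): [.27, .30)
  print(decode(F(49, 100), 3))            # bbc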

Representing a real number
Binary fractional representation:

   .75 = .11        1/3 = .010101…        11/16 = .1011

Algorithm:
  1. x = 2·x
  2. if x < 1, output 0
  3. else x = x − 1; output 1

So how about just using the shortest binary fractional representation lying in the sequence interval?
e.g.  [0,.33) → .01     [.33,.66) → .1     [.66,1) → .11

Representing a code interval
We can view binary fractional numbers as intervals by considering all their completions:

   number   min      max      interval
   .11      .110…    .111…    [.75, 1.0)
   .101     .1010…   .1011…   [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

   1 + ⌈log (1/s)⌉ = 1 + ⌈log Π_i (1/p_i)⌉
   ≤ 2 + Σ_{j=1,…,n} log (1/p_j)
   = 2 + Σ_{k=1,…,|S|} n·p_k · log (1/p_k)
   = 2 + n·H_0    bits

In practice ≈ nH_0 + 0.02·n bits, because of rounding.

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts      (String = ACCBACCACBA B,  k = 2)

 Context: Empty       Contexts of order 1        Contexts of order 2
   A = 4                A:  C = 3, $ = 1           AC:  B = 1, C = 2, $ = 2
   B = 2                B:  A = 2, $ = 1           BA:  C = 1, $ = 1
   C = 5                C:  A = 1, B = 2,          CA:  C = 1, $ = 1
   $ = 3                    C = 2, $ = 3           CB:  A = 2, $ = 1
                                                   CC:  A = 1, B = 1, $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
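A Python sketch of the LZW encoder on the running example; a tiny 3-symbol initial dictionary is an assumption (the slides use the 256 ASCII entries), so the numeric ids below differ from the slide's. The decoder, and the special SSc case it must handle, are omitted:

  def lzw_encode(text, symbols="abc"):
      dic = {s: i for i, s in enumerate(symbols)}   # toy initial dictionary
      out, S = [], ""
      for c in text:
          if S + c in dic:
              S += c                                # extend the current match
          else:
              out.append(dic[S])                    # emit the id of the longest match S
              dic[S + c] = len(dic)                 # add Sc to the dictionary
              S = c
      out.append(dic[S])
      return out

  print(lzw_encode("aabaacababacb"))                # [0, 0, 1, 3, 2, 4, 8, 2, 1]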

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform     (1994)
We are given the text T = mississippi#.

 All rotations of T:       After sorting the rows:   F                 L
 mississippi#                                        #  mississipp     i
 ississippi#m                                        i  #mississip     p
 ssissippi#mi                                        i  ppi#missis     s
 sissippi#mis                                        i  ssippi#mis     s
 issippi#miss                                        i  ssissippi#     m
 ssippi#missi                                        m  ississippi     #
 sippi#missis                                        p  i#mississi     p
 ippi#mississ                                        p  pi#mississ     i
 ppi#mississi                                        s  ippi#missi     s
 pi#mississip                                        s  issippi#mi     s
 i#mississipp                                        s  sippi#miss     i
 #mississippi                                        s  sissippi#m     i

L, the last column of the sorted matrix, is the BWT of T.

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible

[the sorted-rotation matrix again, with first column F and last column L]

Two key properties:
1. The LF-array maps L's chars to F's chars.
2. L[i] precedes F[i] in T.

Reconstruct T backward:   T = .... i ppi #

InvertBWT(L)
  Compute LF[0, n-1];
  r = 0; i = n;
  while (i > 0) {
      T[i] = L[r];
      r = LF[r]; i--;
  }
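A Python sketch of the inversion: the LF-array is built by stable-sorting the positions of L (which is exactly the "same relative order" property above). Indices are 0-based, and the starting row is chosen as the one whose L-char is the end-marker '#' (a detail the slide handles through its own indexing), so this is a sketch of the idea rather than a literal transcription:

  def invert_bwt(L):
      # LF[i] = position in F (the sorted column) of the char L[i];
      # Python's sort is stable, so equal chars keep their relative order.
      order = sorted(range(len(L)), key=lambda i: L[i])
      LF = [0] * len(L)
      for f_pos, l_pos in enumerate(order):
          LF[l_pos] = f_pos
      T = []
      r = L.index("#")            # the row whose last char is '#' is T itself as a rotation
      for _ in range(len(L)):
          T.append(L[r])          # L[r] precedes F[r] in T: emit T backwards
          r = LF[r]
      return "".join(reversed(T))

  print(invert_bwt("ipssm#pissii"))    # mississippi#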

How to compute the BWT ?

   SA = 12 11 8 5 2 1 10 9 7 4 6 3        L = i p s s m # p i s s i i

[the BWT matrix: its i-th row is the rotation of T starting at position SA[i]]

We said that L[i] precedes F[i] in T.
Given SA and T, we have L[i] = T[SA[i] − 1]   (e.g. L[3] = T[SA[3] − 1] = T[7]).

How to construct SA from T ?
Input: T = mississippi#

   SA = 12   #
        11   i#
         8   ippi#
         5   issippi#
         2   ississippi#
         1   mississippi#
        10   pi#
         9   ppi#
         7   sippi#
         4   sissippi#
         6   ssippi#
         3   ssissippi#

Elegant but inefficient. Obvious inefficiencies:
 • Θ(n^2 log n) time in the worst case
 • Θ(n log n) cache misses or I/O faults

Many algorithms are known, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in - degree ( u )  k ] 

a  2.1

1

k

a

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:
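A tiny Python sketch of the plain gap transformation above (just the arithmetic; the handling of negative first-gaps and the instantaneous codes applied afterwards are the library's business, not shown here). The successor list used in the example is hypothetical:

  def to_gaps(x, succ):
      # S(x) = {s1-x, s2-s1-1, ..., sk-s_{k-1}-1}: locality makes the gaps small.
      gaps = [succ[0] - x]
      gaps += [succ[i] - succ[i-1] - 1 for i in range(1, len(succ))]
      return gaps

  def from_gaps(x, gaps):
      succ = [x + gaps[0]]
      for g in gaps[1:]:
          succ.append(succ[-1] + g + 1)
      return succ

  s = [15, 16, 17, 22, 23, 24, 315]         # hypothetical successors of node x = 15
  print(to_gaps(15, s))                     # [0, 0, 0, 4, 0, 0, 290]
  print(from_gaps(15, to_gaps(15, s)))      # round-trips to the original list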

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides and efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
Emacs size

Emacs time

uncompr

27Mb

---

gzip

8Mb

35 secs

zdelta

1.5Mb

42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

0

620

2

20

123

2000

20
220

3

1
5

space

time

uncompr

30Mb

---

tgz

20%

linear

THIS

8%

quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
space

time

uncompr

260Mb

---

tgz

12%

2 mins

THIS

8%

16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

          gcc size   emacs size
 total    27288      27326
 gzip      7563       8577
 zdelta     227       1431
 rsync      964       4452

Compressed size in KB (slightly outdated numbers).
Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree

T# = mississippi#
     1 2 3 4 5 6 7 8 9 10 11 12

[figure: the suffix tree of T# — edges are labelled with substrings of T# (e.g. i, si, ssi, p, pi#, ppi#, mississippi#, #), and each of the 12 leaves is labelled with the starting position of its suffix]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

Storing SUF(T) explicitly would need Θ(N^2) space; we keep only the suffix pointers:

   SA      SUF(T)
   12      #
   11      i#
    8      ippi#
    5      issippi#
    2      ississippi#
    1      mississippi#
   10      pi#
    9      ppi#
    7      sippi#
    4      sissippi#
    6      ssippi#
    3      ssissippi#

T = mississippi#     (query example: P = si)

Suffix Array space: SA takes Θ(N log_2 N) bits, the text T takes N chars → in practice, a total of 5N bytes.

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step (one to SA, one to T).

[figure: two binary-search steps for P = si over the SA of T = mississippi#, branching on "P is larger" / "P is smaller"]

Suffix Array search:
 • O(log_2 N) binary-search steps
 • each step takes O(p) char comparisons
 → overall, O(p log_2 N) time  (improved by [Manber-Myers, '90] and [Cole et al, '06])
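A Python sketch of the indirect binary search (using slicing for the O(p) comparison of the pattern against a suffix; the double search for the two range extremes is an implementation convenience, not the slide's literal code):

  def sa_search(T, SA, P):
      # SA holds 0-based suffix starting positions, sorted lexicographically.
      def lower(strict):
          lo, hi = 0, len(SA)
          while lo < hi:
              mid = (lo + hi) // 2
              suf = T[SA[mid]:SA[mid] + len(P)]     # O(p) chars compared per step
              if suf < P or (strict and suf == P):
                  lo = mid + 1
              else:
                  hi = mid
          return lo
      l, r = lower(False), lower(True)
      return [SA[i] + 1 for i in range(l, r)]       # 1-based positions, as in the slides

  T = "mississippi#"
  SA = [11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]       # 0-based version of the SA above
  print(sorted(sa_search(T, SA, "si")))             # [4, 7]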

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1, N−1] = longest common prefix between suffixes adjacent in SA.

   SA    suffix          Lcp
   12    #                0
   11    i#               1
    8    ippi#            1
    5    issippi#         4
    2    ississippi#      0
    1    mississippi#     0
   10    pi#              1
    9    ppi#             0
    7    sippi#           2
    4    sissippi#        1
    6    ssippi#          3
    3    ssissippi#

T = mississippi#   (e.g. the Lcp between issippi# and ississippi# is 4)

• How long is the common prefix between T[i,...] and T[j,...] ?
  Min of the subarray Lcp[h, k−1] s.t. SA[h] = i and SA[k] = j.
• Does there exist a repeated substring of length ≥ L ?
  Search for an entry Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  Search for a window Lcp[i, i+C−2] whose entries are all ≥ L.
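A small Python sketch of the three checks above over the (SA, Lcp) arrays; plain linear scans are used, an assumption for clarity (a range-minimum structure over Lcp would answer the first query in O(1)):

  def lcp_of_positions(SA, Lcp, i, j):
      # lcp(T[i,...], T[j,...]) = min of Lcp[h, k-1], where SA[h]=i and SA[k]=j.
      h, k = sorted((SA.index(i), SA.index(j)))
      return min(Lcp[h:k])

  def has_repeat_of_length(Lcp, L):
      # a repeated substring of length >= L exists iff some Lcp entry is >= L
      return any(v >= L for v in Lcp)

  def has_substring_occurring(Lcp, L, C):
      # a substring of length >= L occurring >= C times iff a window of
      # C-1 consecutive Lcp entries is entirely >= L
      w = C - 1
      return any(all(v >= L for v in Lcp[i:i+w]) for i in range(len(Lcp) - w + 1))

  SA  = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]     # 1-based positions, as above
  Lcp = [0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3]
  print(lcp_of_positions(SA, Lcp, 5, 2))    # 4: issippi# vs ississippi#
  print(has_repeat_of_length(Lcp, 4))       # True ("issi" is repeated)
  print(has_substring_occurring(Lcp, 1, 4)) # True ("i" occurs 4 times)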


Slide 175

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
Running times of the two algorithms on growing inputs:

        4K    8K    16K   32K    128K   256K   512K   1M
n^3     22s   3m    26m   3.5h   28h    --     --     --
n^2     0     0     0     1s     26s    106s   7m     28m

An optimal solution
We assume every subsum≠0

[Figure: A is split into maximal blocks of negative and positive prefix sums; the optimum subarray starts right after a block whose sum is < 0]

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm

  sum = 0; max = -1;
  For i = 1,...,n do
    If (sum + A[i] ≤ 0) then sum = 0;
    else sum += A[i]; max = MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT
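A minimal Python sketch of the scanning algorithm above (a Kadane-style scan); the function and variable names are illustrative, and it returns only the maximum sum, not the window endpoints.

  def max_subarray_sum(A):
      best = A[0]          # best sum seen so far (also handles all-negative inputs)
      run = 0              # sum of the current candidate window
      for x in A:
          run += x
          best = max(best, run)
          if run <= 0:     # a non-positive prefix can never help: restart
              run = 0
      return best

  # Example from the slide:
  # max_subarray_sum([2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]) == 12   (the block 6 1 -2 4 3)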

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort: Θ(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 10^9 random I/Os = 10^9 * 5ms ≈ 2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02   m = (i+j)/2;           // Divide
03   Merge-Sort(A,i,m);     // Conquer
04   Merge-Sort(A,m+1,j);   // Conquer
05   Merge(A,i,m,j)         // Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n = 10^9 tuples  ⇒  a few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Θ(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
[Figure: recursion tree of Merge-Sort over the input keys; runs are merged pairwise level by level, and the lowest levels of the tree fit entirely in main memory M]

How do we deploy the disk/mem features ?

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce N/M sorted runs.
Pass i: merge X ≤ M/B runs at a time  ⇒  log_{M/B}(N/M) passes

[Figure: X input buffers (INPUT 1 … INPUT X) and one OUTPUT buffer, each of B items, sit in main memory; runs are streamed in from disk and the merged run is written back to disk]

Multiway Merging

[Figure: X = M/B sorted runs are merged in one pass. Each run has an input buffer Bf_i with a pointer p_i; the minimum of Bf1[p1], Bf2[p2], …, Bfx[pX] is moved to the output buffer Bfo. Bf_i is refilled from disk when p_i = B (Fetch); Bfo is flushed to the merged output run when full (Flush), until EOF.]
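A minimal Python sketch of one multiway-merge pass, assuming the runs are already-sorted iterables (on disk they would be buffered streams of B items); names are illustrative.

  import heapq

  def multiway_merge(runs):
      iters = [iter(r) for r in runs]
      heap = []                              # (current head, run-id), one entry per run
      for i, it in enumerate(iters):
          first = next(it, None)
          if first is not None:
              heapq.heappush(heap, (first, i))
      while heap:
          key, i = heapq.heappop(heap)       # min over the heads of all runs
          yield key                          # in practice: append to the output buffer Bfo
          nxt = next(iters[i], None)         # refill from run i ("Fetch")
          if nxt is not None:
              heapq.heappush(heap, (nxt, i))

  # Example: list(multiway_merge([[1,5,9], [2,3,8], [4,6,7]])) == [1,2,3,4,5,6,7,8,9]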

Cost of Multi-way Merge-Sort


Number of passes = log_{M/B}(#runs)  ≈  log_{M/B}(N/M)

Optimal cost = Θ((N/B) log_{M/B}(N/M)) I/Os

In practice:
  M/B ≈ 1000  ⇒  #passes = log_{M/B}(N/M) ≈ 1
  One multiway merge  ⇒  2 passes = few mins
  Tuning depends on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Can compression help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (alphabet S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest possible space (i.e., assuming the mode occurs > N/2 times).

A=b a c c c d c b a a a c c b c c c
.

Algorithm

  Use a pair of variables X, C
  For each item s of the stream:
    if (X == s) then C++
    else { C--; if (C == 0) { X = s; C = 1; } }
  Return X;

Proof

(Problems arise if the mode occurs ≤ N/2 times.)

If X ≠ y at the end, then every one of y's occurrences has a "negative" mate.
Hence these mates are ≥ #occ(y).

As a result, 2 * #occ(y) > N... a contradiction.
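A minimal Python sketch of the one-pass algorithm above (Boyer-Moore majority voting), written in an equivalent form; it returns the majority element assuming one item occurs > N/2 times, otherwise its output needs a verification pass.

  def majority(stream):
      X, C = None, 0
      for s in stream:
          if C == 0:
              X, C = s, 1
          elif X == s:
              C += 1
          else:
              C -= 1
      return X

  # Example from the slide: majority("baccc" "dcbaa" "accbc" "cc") == 'c'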

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 10^9  ⇒  size = 6Gb
n = 10^6 documents
TotT = 10^9 total terms (avg term length is 6 chars)
t = 5 * 10^5 distinct terms

What kind of data structure should we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million documents, t = 500K terms

            Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony             1                1             0          0       0        1
Brutus             1                1             0          1       0        0
Caesar             1                1             0          1       1        1
Calpurnia          0                1             0          0       0        0
Cleopatra          1                0             0          0       0        0
mercy              1                0             1          1       1        1
worser             1                0             1          1       1        0

1 if play contains word, 0 otherwise

Space is 500Gb !

Solution 2: Inverted index

Brutus     →  2  4  8  16  32  64  128
Caesar     →  1  2  3  5  8  13  21  34
Calpurnia  →  13  16

We can still do better: i.e. 30÷50% of the original text.

1. Typically we use about 12 bytes per posting
2. We have 10^9 total terms  ⇒  at least 12Gb space
3. Compressing the 6Gb of documents gets 1.5Gb of data
A better index, but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (losslessly) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL of them into fewer bits?
NO: they are 2^n, but there are fewer shorter compressed messages:

  ∑_{i=1}^{n-1} 2^i = 2^n − 2

We need to talk about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the self-information of s is:

  i(s) = log2 (1/p(s)) = −log2 p(s)

Lower probability  ⇒  higher information

Entropy is the weighted average of i(s):

  H(S) = ∑_{s∈S} p(s) · log2 (1/p(s))   bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into its codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie:
[Figure: binary trie with leaves a (path 0), b (path 100), c (path 101), d (path 11)]

Average Length
For a code C with codeword lengths L[s], the average length is defined as

  L_a(C) = ∑_{s∈S} p(s) · L[s]

We say that a prefix code C is optimal if for all prefix codes C', L_a(C) ≤ L_a(C')

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal uniquely decodable code, there exists a prefix code with the same codeword lengths, and thus the same optimal average length. And vice versa…

Theorem (golden rule). If C is an optimal prefix code for a source with probabilities {p1, …, pn}, then p_i < p_j  ⇒  L[s_i] ≥ L[s_j]

Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any uniquely decodable code C, we have

  H(S) ≤ L_a(C)

Theorem (upper bound, Shannon). For any probability distribution, there exists a prefix code C such that

  L_a(C) ≤ H(S) + 1

(The Shannon code takes ⌈log 1/p(s)⌉ bits per symbol.)

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

[Figure: Huffman tree built bottom-up: a(.1) and b(.2) are merged into (.3); (.3) and c(.2) into (.5); (.5) and d(.5) into the root (1)]

a = 000, b = 001, c = 01, d = 1
There are 2^(n-1) “equivalent” Huffman trees

What about ties (and thus, tree depth) ?
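A minimal Python sketch of Huffman-tree construction via a min-heap; ties are broken by insertion order, which picks one of the "equivalent" trees mentioned above, and the codeword assignment (0 to the lighter subtree) is an assumption of this sketch.

  import heapq

  def huffman_codes(probs):                     # probs: dict symbol -> probability
      heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
      heapq.heapify(heap)
      count = len(heap)
      while len(heap) > 1:
          p0, _, c0 = heapq.heappop(heap)       # the two least-probable subtrees...
          p1, _, c1 = heapq.heappop(heap)
          merged = {s: "0" + w for s, w in c0.items()}
          merged.update({s: "1" + w for s, w in c1.items()})
          heapq.heappush(heap, (p0 + p1, count, merged))   # ...are merged
          count += 1
      return heap[0][2]

  # Example: huffman_codes({'a': .1, 'b': .2, 'c': .2, 'd': .5})
  # gives codeword lengths 3, 3, 2, 1 as in the running example.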

Encoding and Decoding
Encoding: emit the root-to-leaf path leading to the symbol to be encoded.
Decoding: start at the root and take the branch for each bit received. When at a leaf, output its symbol and return to the root.

  abc...    →  000 001 01 ...  =  00000101...
  101001... →  d c b ...

[Figure: the same Huffman tree as above, with a(.1) and b(.2) below the internal node (.3), c(.2) below (.5), and d(.5) below the root]

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large.
Huffman codes can be made succinct in the representation of the codeword tree, and fast in (de)coding.

Canonical Huffman tree
We store, for any level L:
  firstcode[L]  (the first, i.e. smallest, codeword on level L, of the form 00.....0 on the deepest level)
  Symbol[L,i], for each i in level L

This takes ≤ h^2 + |S| log |S| bits

Canonical Huffman
Encoding
[Table: levels 1…5 with their firstcode values and the symbols assigned to each level]

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. Its self-information is

  −log(.999) ≈ .00144

If we were to send 1000 such symbols we might hope to use 1000 · .00144 ≈ 1.44 bits.

Using Huffman, we take at least 1 bit per symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
  ≈ 1 extra bit per macro-symbol = 1/k extra bits per symbol
  Larger model to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:
  The model takes |S|^k · (k · log |S|) + h^2 bits (where h might be |S|)
  It is H_0(S^L) ≤ L · H_k(S) + O(k · log |S|), for each k ≤ L

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
[Figure: the codeword of “or” under plain Huffman vs. the tagged scheme, where each byte carries 1 tag bit + 7 bits of the codeword]

T = “bzip or not bzip”

[Figure: 128-ary Huffman tree over the word dictionary {bzip, not, or, space} of T, and the byte-aligned, tagged codewords forming C(T)]

CGrep and other ideas...
P= bzip = 1a 0b

T = “bzip or not bzip”

[Figure: the tagged codeword of P is searched directly, byte-aligned, inside C(T); tag bits mark codeword boundaries, so each candidate match is verified (yes/no)]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary = {bzip, not, or, space}
P = bzip = 1a 0b
S = “bzip or not bzip”

[Figure: the codeword of P is searched directly, byte-aligned, inside C(S); tag bits mark codeword boundaries, so each candidate match is verified (yes/no)]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
[Figure: pattern P slides along text T; every occurrence of P in T must be reported]
 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m:

  H(s) = ∑_{i=1}^{m} 2^(m−i) · s[i]

P = 0101
H(P) = 2^3·0 + 2^2·1 + 2^1·0 + 2^0·1 = 5

s = s'  if and only if  H(s) = H(s')

Definition: let T_r denote the m-length substring of T starting at position r (i.e., T_r = T[r, r+m−1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(T_r) from H(T_{r−1}):

  H(T_r) = 2·H(T_{r−1}) − 2^m·T[r−1] + T[r+m−1]

T = 10110101
T1 = 1011, T2 = 0110

H(T1) = H(1011) = 11
H(T2) = H(0110) = 2·11 − 2^4·1 + 0 = 22 − 16 = 6 = H(0110)

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P = 101111, q = 7

H(P) = 47
H_q(P) = 47 (mod 7) = 5

H_q(P) can be computed incrementally:
  1·2 (mod 7) + 0 = 2
  2·2 (mod 7) + 1 = 5
  5·2 (mod 7) + 1 = 4
  4·2 (mod 7) + 1 = 2
  2·2 (mod 7) + 1 = 5 (mod 7) = 5 = H_q(P)

We can still compute H_q(T_r) from H_q(T_{r−1}):
  2^m (mod q) = 2·(2^(m−1) (mod q)) (mod q)

Intermediate values are also small! (< 2q)
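A minimal Python sketch of the Karp-Rabin scan with a rolling fingerprint, assuming a binary alphabet as in the slides; the fixed prime q and the function names are illustrative, and the explicit verification makes it the deterministic (no false match) variant.

  def karp_rabin(T, P, q=2_147_483_647):
      n, m = len(T), len(P)
      pow_m = pow(2, m - 1, q)                  # 2^(m-1) mod q, weight of the leading bit
      hP = ht = 0
      for i in range(m):                        # fingerprints of P and of T[0, m-1]
          hP = (2 * hP + int(P[i])) % q
          ht = (2 * ht + int(T[i])) % q
      matches = []
      for r in range(n - m + 1):
          if ht == hP and T[r:r + m] == P:      # verification => a definite match
              matches.append(r)
          if r + m < n:                         # roll: drop T[r], append T[r+m]
              ht = (2 * (ht - int(T[r]) * pow_m) + int(T[r + m])) % q
      return matches

  # Example from the slides: karp_rabin("10110101", "0101") == [4]   (0-based position)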

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

The m×n matrix M (here 3×10):

  T =   c a l i f o r n i a
        1 2 3 4 5 6 7 8 9 10
  f     0 0 0 0 1 0 0 0 0 0
  o     0 0 0 0 0 1 0 0 0 0
  r     0 0 0 0 0 0 1 0 0 0

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M

We want to exploit bit-parallelism to compute the j-th column of M from the (j−1)-th one.
We define the m-length binary vector U(x) for each character x in the alphabet: U(x) is set to 1 in the positions where character x appears in P.

Example: P = abaac
  U(a) = (1,0,1,1,0)    U(b) = (0,1,0,0,0)    U(c) = (0,0,0,0,1)

How to construct M

Initialize column 0 of M to all zeros.
For j > 0, the j-th column is obtained as

  M(j) = BitShift(M(j−1)) & U(T[j])

For i > 1, entry M(i,j) = 1 iff
  (1) the first i−1 characters of P match the i−1 characters of T ending at character j−1   ⇔ M(i−1, j−1) = 1
  (2) P[i] = T[j]   ⇔ the i-th bit of U(T[j]) = 1

BitShift moves bit M(i−1, j−1) into the i-th position;
AND-ing it with the i-th bit of U(T[j]) establishes whether both are true.
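A minimal Python sketch of the Shift-And scan, using Python integers as bit vectors (bit i−1 of the word plays the role of row i of M); names are illustrative.

  def shift_and(T, P):
      m = len(P)
      U = {}                                   # U[c]: bits set where c occurs in P
      for i, c in enumerate(P):
          U[c] = U.get(c, 0) | (1 << i)
      match_bit = 1 << (m - 1)                 # bit of row m: a full occurrence
      M, occ = 0, []
      for j, c in enumerate(T):
          M = ((M << 1) | 1) & U.get(c, 0)     # M(j) = BitShift(M(j-1)) & U(T[j])
          if M & match_bit:
              occ.append(j - m + 1)            # 0-based starting position
      return occ

  # Example from the slides: shift_and("xabxabaaca", "abaac") == [4],
  # i.e. the occurrence ending at position 9 (1-based).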

An example (j = 1, 2, 3, …, 9)

T = x a b x a b a a c a   (positions 1..10)
P = a b a a c             (m = 5)

Each column is M(j) = BitShift(M(j−1)) & U(T[j]):

        j = 1 2 3 4 5 6 7 8 9
  i=1       0 1 0 0 1 0 1 1 0
  i=2       0 0 1 0 0 1 0 0 0
  i=3       0 0 0 0 0 0 1 0 0
  i=4       0 0 0 0 0 0 0 1 0
  i=5       0 0 0 0 0 0 0 0 1

E.g.:
  M(1) = BitShift(M(0)) & U(x) = (1,0,0,0,0) & (0,0,0,0,0) = (0,0,0,0,0)
  M(2) = BitShift(M(1)) & U(a) = (1,0,0,0,0) & (1,0,1,1,0) = (1,0,0,0,0)
  M(9) = BitShift(M(8)) & U(c) = (1,1,0,0,1) & (0,0,0,0,1) = (0,0,0,0,1)
so M(5,9) = 1: an occurrence of P ends at position 9.

Shift-And method: Complexity

If m ≤ w, any column and any vector U() fit in a memory word  ⇒  each step requires O(1) time.
If m > w, any column and any vector U() can be divided into ⌈m/w⌉ memory words  ⇒  each step requires O(m/w) time.
Overall O(n(1 + m/w) + m) time.
Thus, it is very fast when the pattern length is close to the word size, which is very often the case in practice. Recall that w = 64 bits in modern architectures.

Some simple extensions

We want to allow the pattern to contain special symbols, like [a-f] classes of chars.

P = [a-b]baac
  U(a) = (1,0,1,1,0)    U(b) = (1,1,0,0,0)    U(c) = (0,0,0,0,1)

What about ‘?’, ‘[^…]’ (not)?

Problem 1: An other solution
Dictionary = {bzip, not, or, space}
P = bzip = 1a 0b
S = “bzip or not bzip”

[Figure: the codeword of P is searched directly in C(S) with a bit-parallel scan; each byte-aligned candidate is verified (yes/no)]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

P = o
Dictionary = {bzip, not, or, space}
S = “bzip or not bzip”

not = 1g 0g 0a
or  = 1g 0a 0b

[Figure: both dictionary terms containing “o” are located, and their codewords are then searched in C(S)]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: patterns P1 and P2 of the set P occurring at several positions of the text T]

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

S is the concatenation of the patterns in P.
R is a bitmap of length m: R[i] = 1 iff S[i] is the first symbol of a pattern.

Use a variant of the Shift-And method searching for S:
  For any symbol c, U’(c) = U(c) AND R, so U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern.
  For any step j:
    compute M(j);
    then M(j) OR U’(T[j]). Why? It sets to 1 the first bit of each pattern that starts with T[j].
    Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

Dictionary = {bzip, not, or, space}
S = “bzip or not bzip”
P = bot, k = 2

[Figure: the codewords of the dictionary terms and the compressed text C(S)]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

M^l(i,j) = 1 iff there are at most l mismatches between the
first i characters of P and the i characters of T ending at
character j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing M^l: case 1

The first i−1 characters of P match a substring of T ending at j−1, with at most l mismatches, and the next pair of characters in P and T are equal:

  BitShift(M^l(j−1)) & U(T[j])

Computing M^l: case 2

The first i−1 characters of P match a substring of T ending at j−1, with at most l−1 mismatches (the next pair of characters may then mismatch):

  BitShift(M^(l−1)(j−1))

Computing M^l

We compute M^l for all l = 0, …, k; for each j we compute M(j), M^1(j), …, M^k(j).
For all l, initialize M^l(0) to the zero vector.
In order to compute M^l(j), we observe that there is a match iff one of the two cases above holds:

  M^l(j) = [BitShift(M^l(j−1)) & U(T[j])]  OR  BitShift(M^(l−1)(j−1))

Example M^1

T = x a b x a b a a c a   (positions 1..10)
P = a b a a d

M^1 =     j = 1 2 3 4 5 6 7 8 9 10
  i=1         1 1 1 1 1 1 1 1 1 1
  i=2         0 0 1 0 0 1 0 1 1 0
  i=3         0 0 0 1 0 0 1 0 0 1
  i=4         0 0 0 0 1 0 0 1 0 0
  i=5         0 0 0 0 0 0 0 0 1 0

M^0 =     j = 1 2 3 4 5 6 7 8 9 10
  i=1         0 1 0 0 1 0 1 1 0 1
  i=2         0 0 1 0 0 1 0 0 0 0
  i=3         0 0 0 0 0 0 1 0 0 0
  i=4         0 0 0 0 0 0 0 1 0 0
  i=5         0 0 0 0 0 0 0 0 0 0

How much do we pay?

The running time is O(kn(1 + m/w)).
Again, the method is practically efficient for small m.
Only O(k) columns of M are needed at any given time. Hence, the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

Dictionary = {bzip, not, or, space}
S = “bzip or not bzip”
P = bot, k = 2

not = 1g 0g 0a

[Figure: the Agrep scan over the dictionary codewords and C(S) reports “not” as the term matching P = bot with at most 2 mismatches]

Agrep: more sophisticated operations

The Shift-And method can solve other ops.

The edit distance between two strings p and s is d(p,s) = the minimum number of operations needed to transform p into s via three ops:
  Insertion: insert a symbol in p
  Deletion: delete a symbol from p
  Substitution: change a symbol in p with a different one
Example: d(ananas, banane) = 3

Search by regular expressions
  Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword lengths, and thus build the tree…
This may be extremely time/space costly when you deal with Gbs of textual data.

A simple algorithm: sort the p_i in decreasing order, and encode s_i via the variable-length code for the integer i.

γ-code for integer encoding
  γ(x) = (Length − 1) zeroes, followed by x in binary
  x > 0 and Length = ⌊log2 x⌋ + 1
  e.g., 9 is represented as <000, 1001>.
  The γ-code for x takes 2⌊log2 x⌋ + 1 bits (i.e. a factor of 2 from optimal)
  Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers

It is a prefix-free encoding…

Exercise: given the following sequence of γ-coded integers, reconstruct the original sequence:

  0001000001100110000011101100111
  = 0001000 | 00110 | 011 | 00000111011 | 00111
  = 8, 6, 3, 59, 7
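A minimal Python sketch of γ-encoding and γ-decoding over a bit string; function names are illustrative, and the decoder assumes a well-formed input.

  def gamma_encode(x):
      b = bin(x)[2:]                    # x in binary, |b| = floor(log2 x) + 1
      return "0" * (len(b) - 1) + b     # (Length - 1) zeroes, then x in binary

  def gamma_decode_all(bits):
      out, i = [], 0
      while i < len(bits):
          z = 0
          while bits[i] == "0":         # count the leading zeroes = Length - 1
              z += 1; i += 1
          out.append(int(bits[i:i + z + 1], 2))
          i += z + 1
      return out

  # Example from the slide:
  # gamma_encode(9) == "0001001"
  # gamma_decode_all("0001000001100110000011101100111") == [8, 6, 3, 59, 7]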

Analysis
Sort the p_i in decreasing order, and encode s_i via the variable-length code γ(i).

Recall that: |γ(i)| ≤ 2·log i + 1
How good is this approach wrt Huffman?  Compression ratio ≤ 2·H_0(s) + 1
Key fact:
  1 ≥ ∑_{i=1,…,x} p_i ≥ x·p_x   ⇒   x ≤ 1/p_x

How good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1
The cost of the encoding is (recall i ≤ 1/p_i):

  ∑_{i=1,…,|S|} p_i · |γ(i)|  ≤  ∑_{i=1,…,|S|} p_i · [2·log(1/p_i) + 1]  =  2·H_0(X) + 1

Not much worse than Huffman, and improvable to H_0(X) + 2 + ....

A better encoding

Byte-aligned and tagged Huffman
  128-ary Huffman tree
  First bit of the first byte is tagged
  Configurations on 7 bits: just those of Huffman

End-tagged dense code
  The rank r is mapped to the r-th binary sequence on 7·k bits
  First bit of the last byte is tagged

Surprising changes:
  It is a prefix-code
  Better compression: it uses all 7-bit configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2

A new concept: Continuers vs Stoppers
  Previously we used: s = c = 128

The main idea is:
  s + c = 256 (we are playing with 8 bits)
  Thus s items are encoded with 1 byte,
  s·c with 2 bytes, s·c^2 with 3 bytes, ...

An example
  5000 distinct words
  ETDC encodes 128 + 128^2 = 16512 words on up to 2 bytes
  A (230,26)-dense code encodes 230 + 230·26 = 6210 on up to 2 bytes, hence more words on 1 byte, and thus it wins if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.
  Brute-force approach
  Binary search: on real distributions, there seems to be a unique minimum

K_s = max codeword length
F_s^k = cumulative probability of the symbols whose |cw| ≤ k

Experiments: (s,c)-DC is quite interesting…
Search is 6% faster than byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?

Move-to-Front (MTF):
  as a freq-sorting approximator
  as a caching strategy
  as a compressor
Run-Length-Encoding (RLE):
  FAX compression

Move-to-Front Coding
Transforms a char sequence into an integer sequence, that can then be var-length coded:
  Start with the list of symbols L = [a,b,c,d,…]
  For each input symbol s:
    1) output the position of s in L
    2) move s to the front of L
There is a memory.
Properties:
  Exploits temporal locality, and it is dynamic
  X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n^2 log n), MTF = O(n log n) + n^2 bits
  Not much worse than Huffman
  ...but it may be far better
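A minimal Python sketch of MTF coding and decoding; the starting list here is the sorted set of symbols, which is an assumption of this sketch (the two sides only need to agree on it).

  def mtf_encode(text):
      L = sorted(set(text))             # the list of symbols
      out = []
      for s in text:
          i = L.index(s)                # 1) output the position of s in L
          out.append(i)
          L.pop(i); L.insert(0, s)      # 2) move s to the front of L
      return out

  def mtf_decode(codes, symbols):
      L = sorted(symbols)
      out = []
      for i in codes:
          s = L[i]
          out.append(s)
          L.pop(i); L.insert(0, s)
      return "".join(out)

  # Example: mtf_encode("aaabbb") == [0, 0, 0, 1, 0, 0]   (temporal locality -> small ints)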

MTF: how good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1
Put S in front of the list and consider the cost of encoding:

  O(|S| log |S|) + ∑_{x=1}^{|S|} ∑_{i=2}^{n_x} γ(p_{x,i} − p_{x,i−1})

By Jensen’s inequality:

  ≤ O(|S| log |S|) + ∑_{x=1}^{|S|} n_x · [2·log(N/n_x) + 1]
  = O(|S| log |S|) + N·[2·H_0(X) + 1]

Hence L_a[mtf] ≤ 2·H_0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and separators) as the symbols to be encoded.
How to keep the MTF-list efficiently:
  Search tree
    leaves contain the symbols, ordered as in the MTF-list
    nodes contain the size of their descending subtree
  Hash table
    key is a symbol
    data is a pointer to the corresponding tree leaf

Each tree operation takes O(log |S|) time.
Total is O(n log |S|), where n = #symbols to be compressed.

Run Length Encoding (RLE)
If spatial locality is very high, then
  abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  ⇒  just the run lengths and one bit.
There is a memory.
Properties:
  Exploits spatial locality, and it is a dynamic code
  X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = Θ(n^2 log n) > Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive):

  f(i) = ∑_{j=1}^{i−1} p(j)

e.g. p(a) = .2, p(b) = .5, p(c) = .3   ⇒   f(a) = .0, f(b) = .2, f(c) = .7

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
  start with [0, 1)
  b  ⇒  [.2, .7)
  a  ⇒  [.2, .3)
  c  ⇒  [.27, .3)
The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c_1 … c_n with probabilities p[c], use the following:

  l_0 = 0      l_i = l_{i−1} + s_{i−1} · f[c_i]
  s_0 = 1      s_i = s_{i−1} · p[c_i]

f[c] is the cumulative probability up to symbol c (not included).
The final interval size is

  s_n = ∏_{i=1}^{n} p[c_i]

The interval for a message sequence will be called the sequence interval.
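A minimal Python sketch of the interval computation above (the real-valued version, not the integer/scaled one); names and the cumulative-probability ordering are illustrative.

  def sequence_interval(msg, p):
      f, acc = {}, 0.0
      for c in sorted(p):               # f[c] = cumulative probability up to c (excluded)
          f[c] = acc; acc += p[c]
      l, s = 0.0, 1.0                   # l_0 = 0, s_0 = 1
      for c in msg:
          l = l + s * f[c]              # l_i = l_{i-1} + s_{i-1} * f[c_i]
          s = s * p[c]                  # s_i = s_{i-1} * p[c_i]
      return l, l + s                   # the sequence interval [l, l+s)

  # Example from the slide (up to floating-point rounding):
  # sequence_interval("bac", {'a': .2, 'b': .5, 'c': .3}) == (0.27, 0.3)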

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:
  .49 ∈ [.2, .7)     ⇒  b   (new interval [.2, .7))
  .49 ∈ [.3, .55)    ⇒  b   (new interval [.3, .55))
  .49 ∈ [.475, .55)  ⇒  c
The message is bbc.

Representing a real number
Binary fractional representation:
  .75   = .11
  1/3   = .0101...
  11/16 = .1011

Algorithm:
  1. x = 2·x
  2. if x < 1 output 0
  3. else x = x − 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
e.g. [0,.33) = .01    [.33,.66) = .1    [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions:

  number   min      max      interval
  .11      .110     .111     [.75, 1.0)
  .101     .1010    .1011    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval is contained in the sequence interval (a dyadic number).

  Sequence interval: [.61, .79)
  Code interval (.101): [.625, .75)

Can use L + s/2 truncated to 1 + ⌈log (1/s)⌉ bits.

Bound on Arithmetic length
Note that −log s + 1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

  1 + ⌈log (1/s)⌉ = 1 + ⌈log ∏_{i=1,n} (1/p_i)⌉
                  ≤ 2 + ∑_{i=1,n} log (1/p_i)
                  = 2 + ∑_{k=1,|S|} n·p_k·log (1/p_k)
                  = 2 + n·H_0  bits

In practice nH_0 + 0.02·n bits, because of rounding.

Integer Arithmetic Coding
Problem: operations on arbitrary-precision real numbers are expensive.
Key ideas of the integer version:
  Keep integers in the range [0..R) where R = 2^k
  Use rounding to generate integer intervals
  Whenever the sequence interval falls into the top, bottom or middle half, expand the interval by a factor 2
Integer Arithmetic is an approximation.

Integer Arithmetic (scaling)
  If l ≥ R/2 (top half): output 1 followed by m 0s; m = 0; the message interval is expanded by 2.
  If u < R/2 (bottom half): output 0 followed by m 1s; m = 0; the message interval is expanded by 2.
  If l ≥ R/4 and u < 3R/4 (middle half): increment m; the message interval is expanded by 2.
  In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine: given the current interval (L,s) and a symbol c with distribution (p1,…,p|S|), the ATB maps (L,s) to the new interval (L’,s’).
[Figure: ATB state machine taking (L,s) and c as input and producing (L’,s’)]
Therefore, even the distribution can change over time.

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
[Figure: at each step the ATB maps (L,s) to (L’,s’) using p[s | context], where s = c or esc]
Encoder and Decoder must know the protocol for selecting the same conditional probability distribution (PPM-variant).

PPM: Example Contexts (k = 2)
String = ACCBACCACBA B

Context: Empty      A = 4   B = 2   C = 5   $ = 3

Context: A          C = 3   $ = 1
Context: B          A = 2   $ = 1
Context: C          A = 1   B = 2   C = 2   $ = 3

Context: AC         B = 1   C = 2   $ = 2
Context: BA         C = 1   $ = 1
Context: CA         C = 1   $ = 1
Context: CB         A = 2   $ = 1
Context: CC         A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
[Figure: the cursor splits the text into the dictionary (all substrings starting before the cursor) and the part still to be compressed; here the step outputs <2,3,c>]

Algorithm’s step:
  Output <d, len, c> where
    d   = distance of the copied string wrt the current position
    len = length of the longest match
    c   = next char in the text beyond the longest match
  Advance by len + 1

A buffer “window” has fixed length and moves.

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
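A minimal Python sketch of LZ77 decoding with the byte-by-byte copy that also handles the overlapping case len > d discussed above; the literal-only triples in the usage example (d irrelevant when len = 0) are illustrative.

  def lz77_decode(triples):
      out = []
      for d, length, c in triples:
          start = len(out) - d                   # copy source starts d chars back
          for i in range(length):
              out.append(out[start + i])         # works even when length > d (overlap)
          out.append(c)
      return "".join(out)

  # Example from the slide: after "abcd", the codeword (2, 9, 'e') yields
  # lz77_decode([(0,0,'a'),(1,0,'b'),(1,0,'c'),(1,0,'d'),(2,9,'e')]) == "abcdcdcdcdcdce"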

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later
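A minimal Python sketch of LZW decoding, including the special case of a code that is not yet in the dictionary (the SSc pattern mentioned above); the dictionary here is initialized with plain ASCII codes, so the slides' numbering (a = 112) is not assumed.

  def lzw_decode(codes):
      dic = {i: chr(i) for i in range(256)}
      prev = dic[codes[0]]
      out = [prev]
      next_code = 256
      for code in codes[1:]:
          if code in dic:
              cur = dic[code]
          else:                                  # code defined one step later: SSc case
              cur = prev + prev[0]
          out.append(cur)
          dic[next_code] = prev + cur[0]         # the encoder added this entry earlier
          next_code += 1
          prev = cur
      return "".join(out)

  # Example (ASCII numbering): lzw_decode([97, 97, 98, 256, 99, 257, 261, 99, 98])
  # reproduces "aabaacababacb"; code 261 is resolved via the special case.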

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
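A minimal Python sketch of the BWT and of the LF-based inversion above; it builds the rotation matrix explicitly, i.e. the simple (and, as the next slide notes, inefficient) way, and assumes T ends with a unique smallest sentinel '#'.

  def bwt(T):
      rot = sorted(T[i:] + T[:i] for i in range(len(T)))
      return "".join(row[-1] for row in rot)      # the L column

  def inverse_bwt(L):
      n = len(L)
      order = sorted(range(n), key=lambda r: (L[r], r))   # stable: maps L's chars onto F's
      LF = [0] * n
      for f_row, r in enumerate(order):
          LF[r] = f_row
      # Row 0 is the rotation starting with '#', so walking LF from row 0 spells T
      # backward (cyclically).
      chars, r = [], 0
      for _ in range(n):
          chars.append(L[r])
          r = LF[r]
      rev = "".join(reversed(chars))              # '#' followed by T without its final '#'
      return rev[1:] + rev[0]                     # rotate the sentinel back to the end

  # Example from the slides:
  # bwt("mississippi#") == "ipssm#pissii" and inverse_bwt("ipssm#pissii") == "mississippi#"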

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n^2 log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999 / WebBase crawl, 2001
Indegree follows a power law distribution:

  Pr[in-degree(u) = k]  ∝  1/k^a,   a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77-scheme provides an efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
            Emacs size   Emacs time
uncompr     27Mb         ---
gzip        8Mb          35 secs
zdelta      1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: weighted graph over the files plus a dummy node 0 connected to all of them; the min branching picks, for each file, the cheapest reference (another file, or gzip from scratch)]

            space   time
uncompr     30Mb    ---
tgz         20%     linear
THIS        8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
            space    time
uncompr     260Mb    ---
tgz         12%      2 mins
THIS        8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm (contd)

  simple, widely used, single roundtrip
  optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
  choice of block size problematic (default: max{700, √n} bytes)
  not good in theory: granularity of changes may disrupt use of blocks
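A toy Python sketch of the block-matching idea behind rsync, not the real protocol: here a plain MD5 of each block stands in for the rolling-hash + MD5 pair, and the encoder recomputes the hash at every offset instead of rolling it; all names are illustrative.

  import hashlib

  def block_hashes(f_old, B):
      # client side: one hash per B-byte block of the old file
      return {hashlib.md5(f_old[i:i+B]).digest(): i
              for i in range(0, len(f_old), B)}

  def encode_new(f_new, hashes, B):
      # server side: emit copies of old blocks plus literal bytes
      out, i, lit = [], 0, bytearray()
      while i < len(f_new):
          h = hashlib.md5(f_new[i:i+B]).digest()
          if len(f_new) - i >= B and h in hashes:      # block already at the client
              if lit: out.append(("lit", bytes(lit))); lit = bytearray()
              out.append(("copy", hashes[h]))          # reference to old-file offset
              i += B
          else:                                        # no match: emit one literal byte
              lit.append(f_new[i]); i += 1
      if lit: out.append(("lit", bytes(lit)))
      return out

  # Usage: ops = encode_new(f_new, block_hashes(f_old, 700), 700); the client rebuilds
  # f_new by concatenating literals and B-byte blocks copied from f_old.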

Rsync: some experiments

          gcc size   emacs size
total     27288      27326
gzip      7563       8577
zdelta    227        1431
rsync     964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends the hashes (unlike the client in rsync), and the client checks them.
The server deploys the common f_ref to compress the new f_tar (rsync compresses just it).

A multi-round protocol

  k blocks of n/k elements
  log(n/k) levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits.

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
T# = mississippi#
     1 2 3 4 5 6 7 8 9 10 11 12

[Figure: suffix tree of T#; edges are labelled with substrings (#, i, s, p, si, ssi, ppi#, pi#, mississippi#, …) and the 12 leaves store the starting positions of the suffixes]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Θ(N^2) space if the suffixes are stored explicitly
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char comparisons
 ⇒ overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
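A minimal Python sketch of the indirect binary search described above; the construction sorts the suffixes explicitly (the simple, inefficient way), names are illustrative, and positions are 0-based.

  def suffix_array(T):
      return sorted(range(len(T)), key=lambda i: T[i:])

  def search(T, SA, P):
      lo, hi = 0, len(SA)
      while lo < hi:                                   # leftmost suffix with prefix >= P
          mid = (lo + hi) // 2
          if T[SA[mid]:SA[mid] + len(P)] < P: lo = mid + 1
          else: hi = mid
      occ = []
      while lo < len(SA) and T[SA[lo]:SA[lo] + len(P)] == P:
          occ.append(SA[lo]); lo += 1                  # the occurrences are contiguous in SA
      return sorted(occ)

  # Example from the slides: T = "mississippi#"
  # search(T, suffix_array(T), "si") == [3, 6]   (0-based; slide positions 4 and 7)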

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does it exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does it exist a substring of length ≥ L occurring ≥ C times ?
• Search for Lcp[i,i+C-2] whose entries are ≥ L


Slide 176

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
Running time by input size:

        4K    8K    16K   32K   128K   256K   512K   1M
n^3     22s   3m    26m   3.5h  28h    --     --     --
n^2     0     0     0     1s    26s    106s   7m     28m

An optimal solution
We assume every subsum≠0

[Figure: A is split into a prefix of sum < 0 followed by the Optimum subarray of sum > 0]

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm

 sum = 0; max = -1;
 For i = 1, ..., n do
     if (sum + A[i] ≤ 0) then sum = 0;
     else { sum += A[i]; max = MAX{max, sum}; }

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT
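A minimal C++ sketch of this single scan (hypothetical names; as in the slide, it assumes the array contains at least one positive element, so resetting sum to 0 when it becomes non-positive is safe):

#include <vector>
#include <algorithm>

// Max-subarray sum in one left-to-right scan, as in the slide's algorithm.
long long maxSubarraySum(const std::vector<long long>& A) {
    long long sum = 0, best = -1;           // best = -1 as in the slide
    for (long long x : A) {
        if (sum + x <= 0) sum = 0;          // a window ending here cannot start the optimum
        else { sum += x; best = std::max(best, sum); }
    }
    return best;
}
// Example: A = {2,-5,6,1,-2,4,3,-13,9,-6,7}  ->  maxSubarraySum(A) == 12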

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort: Θ(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 10^9 random I/Os = 10^9 * 5ms  ≈  2 months

Binary Merge-Sort

Merge-Sort(A,i,j)
01  if (i < j) then
02      m = (i+j)/2;            // Divide
03      Merge-Sort(A,i,m);      // Conquer
04      Merge-Sort(A,m+1,j);
05      Merge(A,i,m,j)          // Combine

Cost of Mergesort on large data




 Take Wikipedia in Italian, compute word freq:
   n = 10^9 tuples  =>  a few Gbs
 Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

 Analysis of mergesort on disk:
   It is an indirect sort: Θ(n log2 n) random I/Os
   [5ms] * n log2 n  ≈  1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
[Figure: recursion tree of Merge-Sort on a sample array — leaves are single elements, and each level merges sorted runs of doubling size]

How do we deploy the disk/mem features ?

[Figure: the bottom log2 M levels of the recursion are replaced by sorting runs of M elements directly in internal memory]
N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


 Pass 1: Produce (N/M) sorted runs.
 Pass i: merge X = M/B runs at a time  =>  log_{M/B} (N/M) merge passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
[Figure: X = M/B input buffers Bf1..BfX (one page of B items per run, each with a pointer p1..pX over its current page) and one output buffer Bfo; repeatedly emit min(Bf1[p1], Bf2[p2], …, BfX[pX]) into Bfo, fetch the next page of run i when pi = B, flush Bfo to the merged output run when full, until EOF]
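A minimal in-memory C++ sketch of the X-way merging step (hypothetical names; real external-memory code would read and write pages of B items instead of whole vectors):

#include <vector>
#include <queue>
#include <tuple>

// Merge k sorted runs into one sorted output using a min-heap of "current heads".
// This mirrors the buffer picture above, ignoring the page-level I/O.
std::vector<int> kWayMerge(const std::vector<std::vector<int>>& runs) {
    using Head = std::tuple<int, size_t, size_t>;        // (value, run index, position in run)
    std::priority_queue<Head, std::vector<Head>, std::greater<Head>> heap;
    for (size_t r = 0; r < runs.size(); ++r)
        if (!runs[r].empty()) heap.emplace(runs[r][0], r, 0);
    std::vector<int> out;
    while (!heap.empty()) {
        auto [v, r, i] = heap.top(); heap.pop();
        out.push_back(v);                                 // emit the minimum head
        if (i + 1 < runs[r].size()) heap.emplace(runs[r][i + 1], r, i + 1);
    }
    return out;
}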

Cost of Multi-way Merge-Sort


Number of passes = log_{M/B} #runs  ≈  log_{M/B} (N/M)

Optimal cost = Θ( (N/B) log_{M/B} (N/M) ) I/Os

In practice

 M/B ≈ 1000  =>  #passes = log_{M/B} (N/M) ≈ 1
 One multiway merge  =>  2 passes = few mins
 (Tuning depends on disk features)

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Can compression help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm

 Use a pair of variables X (candidate) and C (counter); start with X = first item, C = 1
 For each subsequent item s of the stream:
     if (X == s) then C++
     else { C--; if (C == 0) { X = s; C = 1; } }
 Return X;
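A small C++ sketch of this one-pass majority scan (hypothetical names; the returned candidate equals the true mode only when some item really occurs more than N/2 times):

#include <vector>

// Majority vote: O(N) time, two words of space.
char majorityCandidate(const std::vector<char>& stream) {
    char X = 0; long long C = 0;            // candidate and counter
    for (char s : stream) {
        if (C == 0) { X = s; C = 1; }       // adopt a new candidate
        else if (X == s) ++C;               // same candidate: one more "vote"
        else --C;                           // different item: cancel one vote
    }
    return X;                               // correct if the mode occurs > N/2 times
}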

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







 N = 6 * 10^9 chars, size = 6Gb
 n = 10^6 documents
 TotT = 10^9 total term occurrences (avg term length is 6 chars)
 t = 5 * 10^5 distinct terms

What kind of data structure should we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million documents, t = 500K terms  (1 if the play contains the word, 0 otherwise)

            Antony&Cleopatra  JuliusCaesar  TheTempest  Hamlet  Othello  Macbeth
Antony             1               1            0          0       0        1
Brutus             1               1            0          1       0        0
Caesar             1               1            0          1       1        1
Calpurnia          0               1            0          0       0        0
Cleopatra          1               0            0          0       0        0
mercy              1               0            1          1       1        1
worser             1               0            1          1       1        0

Space is 500Gb !

Solution 2: Inverted index
Brutus    ->  2  4  8  16  32  64  128
Caesar    ->  1  2  3  5  8  13  21  34
Calpurnia ->  13  16

We can still do better: i.e. 30-50% of the original text

1. Each entry typically uses about 12 bytes
2. We have 10^9 total terms  =>  at least 12Gb of space
3. Compressing the 6Gb of documents gets 1.5Gb of data
A better index, but it is still >10 times the (compressed) text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM into fewer bits ?
NO: they are 2^n, but the shorter strings available as codewords are fewer:

    Σ_{i=1}^{n-1} 2^i  =  2^n - 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probabilities p(s), the
self-information of s is:

    i(s) = log2 (1/p(s)) = -log2 p(s)

Lower probability  =>  higher information

Entropy is the weighted average of i(s):

    H(S) = Σ_{s∈S} p(s) · log2 (1/p(s))    bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g. a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie:

[Figure: binary trie with 0/1-labelled edges; leaves a (0), d (11), b (100), c (101)]

Average Length
For a code C with codeword length L[s], the
average length is defined as
    La(C) = Σ_{s∈S} p(s) · L[s]

We say that a prefix code C is optimal if for
all prefix codes C', La(C) ≤ La(C')

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, there exists a prefix
code with the same codeword lengths and thus the
same optimal average length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn},
then pi < pj  =>  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

    H(S) ≤ La(C)

Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

    La(C) ≤ H(S) + 1

(the Shannon code, which assigns ⌈log2 1/p(s)⌉ bits to symbol s)

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
[Figure: merge a(.1) and b(.2) into a node of weight (.3); merge it with c(.2) into (.5); merge with d(.5) into the root (1)]

a=000, b=001, c=01, d=1

There are 2^(n-1) "equivalent" Huffman trees (by swapping the two children of each internal node)

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
Example (on the tree above):   abc...  ->  000 001 01 ...        101001...  ->  d c b ...
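A compact C++ sketch of Huffman-tree construction and codeword assignment (hypothetical names; greedy merging of the two least-probable subtrees, as in the example above):

#include <queue>
#include <string>
#include <utility>
#include <vector>

// Build Huffman codewords from symbol probabilities by repeatedly merging
// the two least-probable subtrees.
std::vector<std::string> huffmanCodes(const std::vector<double>& prob) {
    if (prob.empty()) return {};
    struct Node { double p; int left = -1, right = -1; };
    std::vector<Node> t;
    using Item = std::pair<double, int>;                     // (subtree probability, node id)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> pq;
    for (double p : prob) { t.push_back({p}); pq.push({p, (int)t.size() - 1}); }
    while (pq.size() > 1) {
        auto [pa, a] = pq.top(); pq.pop();                   // two smallest subtrees
        auto [pb, b] = pq.top(); pq.pop();
        t.push_back({pa + pb, a, b});
        pq.push({pa + pb, (int)t.size() - 1});
    }
    std::vector<std::string> code(prob.size());
    std::vector<std::pair<int, std::string>> stack{{(int)t.size() - 1, ""}};
    while (!stack.empty()) {                                 // root-to-leaf paths = codewords
        auto [u, w] = stack.back(); stack.pop_back();
        if (t[u].left < 0) code[u] = w.empty() ? "0" : w;    // leaves are nodes 0..|S|-1
        else { stack.push_back({t[u].left, w + "0"}); stack.push_back({t[u].right, w + "1"}); }
    }
    return code;
}
// e.g. huffmanCodes({.1, .2, .2, .5}) yields codeword lengths 3, 3, 2, 1, as in the example above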

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store, for every level L of the tree:

 firstcode[L]   (the first codeword on level L; on the deepest level it is 00.....0)

 Symbol[L,i], for each symbol i on level L

This is ≤ h^2 + |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. Its
self-information is

    -log2(.999) ≈ .00144 bits

If we were to send 1000 such symbols we
might hope to use 1000 * .00144 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:

 The model takes |S|^k * (k * log |S|) + h^2 bits   (where h might be |S|)

 It is H0(S^L) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
[Figure: byte-aligned tagged Huffman over words, on T = "bzip or not bzip" — the symbols (bzip, or, not, space, …) are the leaves of a fan-out-128 Huffman tree; each codeword is a sequence of bytes carrying 7 bits of Huffman code plus 1 tag bit that marks the first byte of a codeword, so C(T) is a byte-aligned sequence of tagged codewords]

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a binary string of length m:

    H(s) = Σ_{i=1}^{m} 2^{m-i} · s[i]

P = 0101
H(P) = 2^3·0 + 2^2·1 + 2^1·0 + 2^0·1 = 5

s = s'  if and only if  H(s) = H(s')

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1):

    H(Tr) = 2·H(Tr-1) - 2^m·T[r-1] + T[r+m-1]

T = 10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H(T1) = H(1011) = 11
H(T2) = H(0110) = 2·11 - 2^4·1 + 0 = 22 - 16 + 0 = 6 = H(0110)

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally, scanning P left to right:
    (1·2 mod 7) + 0 = 2
    (2·2 mod 7) + 1 = 5
    (5·2 mod 7) + 1 = 3 + 1 = 4
    (4·2 mod 7) + 1 = 1 + 1 = 2
    (2·2 mod 7) + 1 = 5
    5 mod 7 = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1), using
    2^m (mod q) = 2·(2^{m-1} (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
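A minimal C++ sketch of the Karp–Rabin scan over a binary text (hypothetical names; here q is a fixed prime rather than drawn at random, and every fingerprint hit is re-checked, so it behaves like the deterministic variant):

#include <string>
#include <vector>

// Report all occurrences of P (over {0,1}) in T using rolling fingerprints mod q.
std::vector<size_t> karpRabin(const std::string& T, const std::string& P, long long q = 1000003) {
    size_t n = T.size(), m = P.size();
    std::vector<size_t> occ;
    if (m == 0 || n < m) return occ;
    long long hP = 0, hT = 0, pow = 1;                 // pow = 2^{m-1} mod q
    for (size_t i = 0; i < m; ++i) {
        hP = (2 * hP + (P[i] - '0')) % q;
        hT = (2 * hT + (T[i] - '0')) % q;
        if (i + 1 < m) pow = (2 * pow) % q;
    }
    for (size_t r = 0; ; ++r) {                        // r = starting position (0-based)
        if (hP == hT && T.compare(r, m, P) == 0)       // verify to rule out false matches
            occ.push_back(r);
        if (r + m == n) break;
        hT = (2 * (hT - (T[r] - '0') * pow % q + q) + (T[r + m] - '0')) % q;  // roll the window
    }
    return occ;
}
// e.g. karpRabin("10110101", "0101") reports position 4 (0-based), i.e. T5 in the slides' numbering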

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

        T:   c  a  l  i  f  o  r  n  i  a
        j:   1  2  3  4  5  6  7  8  9  10
P[1]=f       0  0  0  0  1  0  0  0  0  0
P[2]=o       0  0  0  0  0  1  0  0  0  0
P[3]=r       0  0  0  0  0  0  1  0  0  0
How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit bit-parallelism to
compute the j-th column of M from the (j-1)-th one.
We define the m-length binary vector U(x) for each
character x in the alphabet: U(x) has a 1 in the
positions of P where character x appears.
Example:
P = abaac

    U(a) = (1,0,1,1,0)     U(b) = (0,1,0,0,0)     U(c) = (0,0,0,0,1)

How to construct M



 Initialize column 0 of M to all zeros
 For j > 0, the j-th column is obtained by

    M(j) = BitShift( M(j-1) ) & U( T[j] )

 For i > 1, entry M(i,j) = 1 iff
   (1) the first i-1 characters of P match the i-1 characters of T ending at position j-1   ⇔  M(i-1,j-1) = 1
   (2) P[i] = T[j]   ⇔  the i-th bit of U(T[j]) = 1

 BitShift moves bit M(i-1,j-1) into the i-th position;
 ANDing with the i-th bit of U(T[j]) establishes whether both conditions hold

An example
T = xabxabaaca,   P = abaac,   U(x) = (0,0,0,0,0)

Columns are computed left to right via M(j) = BitShift(M(j-1)) & U(T[j]); the full matrix is

         j:  1  2  3  4  5  6  7  8  9  10
    i=1      0  1  0  0  1  0  1  1  0  1
    i=2      0  0  1  0  0  1  0  0  0  0
    i=3      0  0  0  0  0  0  1  0  0  0
    i=4      0  0  0  0  0  0  0  1  0  0
    i=5      0  0  0  0  0  0  0  0  1  0

M(5,9) = 1, so an occurrence of P ends at position 9 (i.e. it starts at position 5).

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.
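A small C++ sketch of Shift-And for m ≤ w, using one 64-bit word per column (hypothetical names; bit i-1 of the word plays the role of row i of M):

#include <cstdint>
#include <string>
#include <vector>

// Shift-And exact matching for patterns of length <= 64.
// Returns the 0-based ending positions of all occurrences of P in T.
std::vector<size_t> shiftAnd(const std::string& T, const std::string& P) {
    size_t m = P.size();
    std::vector<size_t> occ;
    if (m == 0 || m > 64) return occ;
    uint64_t U[256] = {0};                      // U[c]: bit i set iff P[i] == c
    for (size_t i = 0; i < m; ++i) U[(unsigned char)P[i]] |= 1ULL << i;
    uint64_t M = 0;                             // current column of the matrix
    for (size_t j = 0; j < T.size(); ++j) {
        M = ((M << 1) | 1ULL) & U[(unsigned char)T[j]];   // BitShift + AND with U(T[j])
        if (M & (1ULL << (m - 1))) occ.push_back(j);      // last row set: a match ends at j
    }
    return occ;
}
// e.g. shiftAnd("xabxabaaca", "abaac") reports ending position 8 (0-based), i.e. 9 in the slides' 1-based numbering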

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

[Figure: P aligned below T ending at position j-1, with at most l mismatching positions among the first i-1 characters]

    BitShift( M^l(j-1) ) & U( T[j] )

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

[Figure: P aligned below T ending at position j-1, with at most l-1 mismatching positions among the first i-1 characters]

    BitShift( M^{l-1}(j-1) )

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute M^l(j), we combine the two cases:

    M^l(j) = [ BitShift( M^l(j-1) ) & U( T[j] ) ]  OR  BitShift( M^{l-1}(j-1) )

Example M^1
T = xabxabaaca,   P = abaad

M^1 =    j:  1  2  3  4  5  6  7  8  9  10
    i=1      1  1  1  1  1  1  1  1  1  1
    i=2      0  0  1  0  0  1  0  1  1  0
    i=3      0  0  0  1  0  0  1  0  0  1
    i=4      0  0  0  0  1  0  0  1  0  0
    i=5      0  0  0  0  0  0  0  0  1  0

M^0 =    j:  1  2  3  4  5  6  7  8  9  10
    i=1      0  1  0  0  1  0  1  1  0  1
    i=2      0  0  1  0  0  1  0  0  0  0
    i=3      0  0  0  0  0  0  1  0  0  0
    i=4      0  0  0  0  0  0  0  1  0  0
    i=5      0  0  0  0  0  0  0  0  0  0

M^1(5,9) = 1: P occurs ending at position 9 with at most 1 mismatch.

How much do we pay?





 The running time is O(k·n·(1 + m/w))
 Again, the method is practically efficient for small m.
 Only O(k) columns of M are needed at any given time,
 hence the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol into p
Deletion: delete a symbol from p
Substitution: replace a symbol of p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding

    g(x) =  0 0 ..... 0  followed by  x in binary
           (Length - 1 zeros)

 x > 0 and Length = ⌊log2 x⌋ + 1
 e.g., 9 is represented as <000, 1001>.

 The g-code for x takes 2⌊log2 x⌋ + 1 bits
   (i.e. a factor of 2 from the optimal ⌊log2 x⌋ + 1)

 Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

Answer: 8, 6, 3, 59, 7
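A brief C++ sketch of g-encoding/decoding over a string of '0'/'1' characters (hypothetical names; bit-packing is omitted for clarity):

#include <string>
#include <vector>

// gamma(x): floor(log2 x) zeros followed by x written in binary (requires x > 0).
std::string gammaEncode(unsigned long long x) {
    std::string bin;
    for (unsigned long long v = x; v > 0; v >>= 1) bin = char('0' + (v & 1)) + bin;
    return std::string(bin.size() - 1, '0') + bin;
}

// Decode a concatenation of gamma codes back into the integer sequence.
std::vector<unsigned long long> gammaDecode(const std::string& bits) {
    std::vector<unsigned long long> out;
    size_t i = 0;
    while (i < bits.size()) {
        size_t zeros = 0;
        while (bits[i] == '0') { ++zeros; ++i; }          // count the "Length - 1" zeros
        unsigned long long x = 0;
        for (size_t k = 0; k <= zeros; ++k, ++i) x = (x << 1) | (bits[i] - '0');
        out.push_back(x);
    }
    return out;
}
// gammaDecode("0001000001100110000011101100111") == {8, 6, 3, 59, 7}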

Analysis
Sort the pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that:  |g(i)| ≤ 2 * log2 i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(S) + 1
Key fact:
    1 ≥ Σ_{i=1,...,x} pi ≥ x * px   =>   x ≤ 1/px

How good is it ?
Encode the integers via g-coding:  |g(i)| ≤ 2 * log2 i + 1
The cost of the encoding is (recall i ≤ 1/pi):

    Σ_{i=1,...,|S|} pi · |g(i)|  ≤  Σ_{i=1,...,|S|} pi · [ 2 * log2 (1/pi) + 1 ]  =  2 * H0(X) + 1

Not much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






 Previously we used: s = c = 128

 s + c = 256 (we are playing with 8 bits)
 Thus s items are encoded with 1 byte
 And s*c with 2 bytes, s*c^2 with 3 bytes, ...

An example

 5000 distinct words
 ETDC encodes 128 + 128^2 = 16512 words on at most 2 bytes
 A (230,26)-dense code encodes 230 + 230*26 = 6210 words on at most 2
 bytes, hence more words on 1 byte, and thus it wins if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


 X = 1^n 2^n 3^n … n^n  =>  Huff = O(n^2 log n) bits, MTF = O(n log n) + n^2 bits

Not much worse than Huffman
...but it may be far better
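A tiny C++ sketch of the MTF transform over bytes (hypothetical names; the list is kept as a plain vector, so each step costs O(|S|) rather than the O(log |S|) of the tree/hash solution discussed below):

#include <algorithm>
#include <string>
#include <vector>

// Move-to-Front: emit the current position of each symbol in the list,
// then move that symbol to the front of the list.
std::vector<int> mtfEncode(const std::string& text) {
    std::vector<unsigned char> list(256);
    for (int c = 0; c < 256; ++c) list[c] = (unsigned char)c;   // initial list [0,1,...,255]
    std::vector<int> out;
    for (unsigned char c : text) {
        int pos = std::find(list.begin(), list.end(), c) - list.begin();
        out.push_back(pos);                                      // emit current position
        list.erase(list.begin() + pos);
        list.insert(list.begin(), c);                            // move symbol to the front
    }
    return out;
}
// Runs of equal symbols become runs of 0s, which RLE then squeezes (as in bzip).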

MTF: how good is it ?
Encode the output integers via g-coding:  |g(i)| ≤ 2 * log2 i + 1
Put the alphabet S in front of the sequence and consider the cost of encoding:

    O(|S| log |S|)  +  Σ_{x=1}^{|S|} Σ_{i=2}^{n_x} |g( p_x^i - p_x^{i-1} )|

where p_x^i is the position of the i-th occurrence of symbol x. By Jensen's inequality:

    ≤  O(|S| log |S|)  +  Σ_{x=1}^{|S|} n_x · [ 2 * log2 (N / n_x) + 1 ]
    =  O(|S| log |S|)  +  N · [ 2 * H0(X) + 1 ]

    La[mtf]  ≤  2 * H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




 Exploits spatial locality, and it is a dynamic code (there is a memory)
 X = 1^n 2^n 3^n … n^n  =>  Huff(X) = Θ(n^2 log n)  >  Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.  p(a) = .2, p(b) = .5, p(c) = .3

    f(i) = Σ_{j=1}^{i-1} p(j)        f(a) = .0, f(b) = .2, f(c) = .7

[Figure: the unit interval [0,1) split into a = [0,.2), b = [.2,.7), c = [.7,1)]

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
1.0

0.7
c = .3

0.7

c = .3
0.55

0.0

a = .2

0.3
0.2

c = .3

0.27
b = .5

b = .5
0.2

0.3

a = .2

b = .5
0.22

a = .2

0.2

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
    l_0 = 0      l_i = l_{i-1} + s_{i-1} * f[c_i]
    s_0 = 1      s_i = s_{i-1} * p[c_i]

f[c] is the cumulative probability up to symbol c (not included).
The final interval size is

    s_n = Π_{i=1}^{n} p[c_i]

The interval for a message sequence will be called the
sequence interval
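A short C++ sketch of this interval computation with real numbers (hypothetical names; a real coder uses the integer renormalization described later instead of doubles):

#include <string>
#include <utility>

// Compute the sequence interval [l, l+s) for a message, given per-symbol
// probabilities p[] and cumulative probabilities f[] (f[c] excludes c itself).
std::pair<double, double> sequenceInterval(const std::string& msg,
                                           const double p[256], const double f[256]) {
    double l = 0.0, s = 1.0;                 // l_0 = 0, s_0 = 1
    for (unsigned char c : msg) {
        l = l + s * f[c];                    // l_i = l_{i-1} + s_{i-1} * f[c_i]
        s = s * p[c];                        // s_i = s_{i-1} * p[c_i]
    }
    return {l, s};
}
// With p[a]=.2, p[b]=.5, p[c]=.3 and f[a]=0, f[b]=.2, f[c]=.7,
// the message "bac" yields l = .27 and s = .03, i.e. the interval [.27, .30) of the example.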

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
1.0

0.7
c = .3

0.7
0.49

b = .5

c = .3

0.55
0.49

0.2

0.0

0.55

b = .5

0.3
a = .2

The message is bbc.

0.2

0.49
0.475

c = .3

b = .5
0.35

a = .2

0.3

a = .2

Representing a real number
Binary fractional representation:

    .75 = .11        1/3 = .0101...        11/16 = .1011

Algorithm
1.  x = 2 * x
2.  if x < 1 output 0
3.  else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval?
e.g.  [0,.33) -> .01      [.33,.66) -> .1      [.66,1) -> .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
    code    min      max      interval
    .11     .110     .111     [.75, 1.0)
    .101    .1010    .1011    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts    (String = ACCBACCACBA B,  k = 2)

Order 0              Order 1                      Order 2
Context   Counts     Context   Counts             Context   Counts
(empty)   A = 4      A         C = 3, $ = 1       AC        B = 1, C = 2, $ = 2
          B = 2      B         A = 2, $ = 1       BA        C = 1, $ = 1
          C = 5      C         A = 1, B = 2,      CA        C = 1, $ = 1
          $ = 3                C = 2, $ = 3       CB        A = 2, $ = 1
                                                  CC        A = 1, B = 1, $ = 2
You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
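A small C++ sketch of LZ77 decoding from (d, len, c) triples, including the overlapping-copy case just described (hypothetical names):

#include <string>
#include <tuple>
#include <vector>

// Decode a sequence of LZ77 triples (distance, length, next char).
// Copying byte by byte makes the overlapping case (len > d) work automatically.
std::string lz77Decode(const std::vector<std::tuple<size_t, size_t, char>>& triples) {
    std::string out;
    for (auto [d, len, c] : triples) {
        size_t cursor = out.size();
        for (size_t i = 0; i < len; ++i)
            out.push_back(out[cursor - d + i]);    // may re-read bytes written in this loop
        out.push_back(c);                          // the explicit next character
    }
    return out;
}
// e.g. {{0,0,'a'},{1,1,'c'},{3,4,'b'},{3,3,'a'},{1,2,'c'}} decodes to "aacaacabcabaaac",
// as in the windowed LZ77 example above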

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
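A compact C++ sketch of BWT construction (by sorting rotations) and of its inversion via the LF-mapping, in the spirit of the InvertBWT pseudocode above (hypothetical names; '#' is assumed to be a unique, smallest end marker). The naive rotation sort costs Θ(n^2 log n), which is exactly the inefficiency discussed in the next slides:

#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// BWT by sorting all rotations of T (T must end with a unique smallest char, e.g. '#').
std::string bwt(const std::string& T) {
    size_t n = T.size();
    std::vector<size_t> rot(n);
    std::iota(rot.begin(), rot.end(), 0);
    std::sort(rot.begin(), rot.end(), [&](size_t a, size_t b) {
        return T.substr(a) + T.substr(0, a) < T.substr(b) + T.substr(0, b);
    });
    std::string L;
    for (size_t r : rot) L += T[(r + n - 1) % n];     // last column = char preceding each rotation
    return L;
}

// Invert the BWT with the LF-mapping: equal chars of L keep their relative order in F.
std::string inverseBwt(const std::string& L) {
    size_t n = L.size();
    std::vector<size_t> idx(n), LF(n);
    std::iota(idx.begin(), idx.end(), 0);
    std::stable_sort(idx.begin(), idx.end(), [&](size_t a, size_t b) { return L[a] < L[b]; });
    for (size_t f = 0; f < n; ++f) LF[idx[f]] = f;    // L[i] corresponds to F[LF[i]]
    std::string T(n, ' ');
    size_t r = L.find('#');                           // row of the rotation that equals T itself
    for (size_t i = n; i-- > 0; ) { T[i] = L[r]; r = LF[r]; }   // rebuild T backwards
    return T;
}
// bwt("mississippi#") == "ipssm#pissii"  and  inverseBwt("ipssm#pissii") == "mississippi#"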

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n^2 log n) time in the worst case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution

Indegree follows a power-law distribution:

    Pr[ in-degree(u) = k ]  ∝  1 / k^a ,    a ≈ 2.1

(Altavista crawl 1999, WebBase crawl 2001)

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides an efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
           Emacs size   Emacs time
uncompr    27Mb         ---
gzip       8Mb          35 secs
zdelta     1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

0

620

2

20

123

2000

20
220

3

1
5

           space    time
uncompr    30Mb     ---
tgz        20%      linear
THIS       8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta executions. Nonetheless, it still takes n^2 time.

           space    time
uncompr    260Mb    ---
tgz        12%      2 mins
THIS       8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

           gcc size   emacs size
total      27288      27326
gzip       7563       8577
zdelta     227        1431
rsync      964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
#

0

s

i

1
ssi

ppi#

4

#

1

p

i

3

2

1

ppi#

6

pi#

7

8
5

T# = mississippi#
2 4 6 8 10

si

ppi#

i#

ppi#

11

mississippi#

12

2

1 10

9

4

3

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
5

Θ(N^2) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time
   improvable to O(p + log2 N) [Manber-Myers, ’90]
   and to O(p + log2 |S|) [Cole et al, ’06]
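A minimal C++ sketch of the indirect binary search over SA (hypothetical names; it returns the range of suffixes having P as a prefix, each comparison costing O(p) chars):

#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Given text T and its suffix array SA (0-based starting positions of the sorted suffixes),
// return [lo, hi) = the range of SA whose suffixes start with P.
std::pair<size_t, size_t> saSearch(const std::string& T, const std::vector<size_t>& SA,
                                   const std::string& P) {
    auto lo = std::lower_bound(SA.begin(), SA.end(), P,
        [&](size_t suf, const std::string& pat) {            // suffix < pattern ?
            return T.compare(suf, pat.size(), pat) < 0;
        });
    auto hi = std::upper_bound(lo, SA.end(), P,
        [&](const std::string& pat, size_t suf) {            // pattern < suffix ?
            return T.compare(suf, pat.size(), pat) > 0;
        });
    return { size_t(lo - SA.begin()), size_t(hi - SA.begin()) };
}
// Every SA[i] with lo <= i < hi is an occurrence of P; e.g. P = "si" in T = "mississippi#"
// yields the suffixes starting at 0-based positions 3 and 6 (4 and 7 in the slides' numbering).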

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does it exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does it exist a substring of length ≥ L occurring ≥ C times ?
• Search for Lcp[i,i+C-2] whose entries are ≥ L


Slide 177

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort ⇒ Θ(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 10^9 random I/Os = 10^9 * 5ms ⇒ 2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
  if (i < j) then
    m = (i+j)/2;           (Divide)
    Merge-Sort(A,i,m);     (Conquer)
    Merge-Sort(A,m+1,j);
    Merge(A,i,m,j)         (Combine)

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:
  n = 10^9 tuples ⇒ few GBs
  Typical Disk (Seagate Cheetah 150GB): seek time ~5ms
Analysis of mergesort on disk:
  It is an indirect sort: Θ(n log2 n) random I/Os
  ⇒ [5ms] * n log2 n ≈ 1.5 years
In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
[Figure: merge-sort recursion tree over the items, log2 N levels; runs of size M are sorted directly in memory]
How do we deploy the disk/mem features ?
N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:
  Pass 1: Produce (N/M) sorted runs.
  Pass i: merge X ≤ M/B runs  ⇒  log_{M/B} (N/M) passes
[Figure: X input buffers and one output buffer of B items in main memory, streaming the runs from disk and the merged run back to disk]

Multiway Merging
[Figure: X = M/B runs are merged through one buffer page Bf_i per run plus an output buffer Bf_o; at each step the minimum of the current heads, min(Bf1[p1], Bf2[p2], …, BfX[pX]), is appended to Bf_o; a run's next page is fetched when its pointer p_i reaches B, and Bf_o is flushed to the output file when full, until EOF]
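A minimal Python sketch of this merging loop; a heap stands in for the explicit minimum over the X buffer heads, and the runs are plain in-memory iterators rather than page-buffered files, which is the only simplification here.

    import heapq

    def merge_runs(run_iterators, out):
        """K-way merge: repeatedly extract the minimum among the current run heads."""
        heap = []
        for r, it in enumerate(run_iterators):
            first = next(it, None)
            if first is not None:
                heapq.heappush(heap, (first, r, it))
        while heap:
            key, r, it = heapq.heappop(heap)
            out.append(key)                      # in practice: append to Bf_o, flush when full
            nxt = next(it, None)
            if nxt is not None:                  # in practice: fetch the next page of run r
                heapq.heappush(heap, (nxt, r, it))

    runs = [iter([1, 2, 5, 10]), iter([2, 7, 8, 13]), iter([3, 4, 11, 12])]
    out = []
    merge_runs(runs, out)
    print(out)    # [1, 2, 2, 3, 4, 5, 7, 8, 10, 11, 12, 13]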

Cost of Multi-way Merge-Sort


Number of passes = log_{M/B} (#runs)  ≤  log_{M/B} (N/M)
Optimal cost = Θ((N/B) log_{M/B} (N/M)) I/Os

In practice
  M/B ≈ 1000  ⇒  #passes = log_{M/B} (N/M) ≈ 1
  One multiway merge  ⇒  2 passes = few mins
  (Tuning depends on disk features)
 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Can compression help?
Goal: enlarge M and reduce N
  #passes = O(log_{M/B} (N/M))
  Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm
  Use a pair of variables <X, C> (candidate and counter; start with X = first item, C = 1)
  For each subsequent item s of the stream:
    if (X == s) then C++
    else { C--; if (C == 0) { X = s; C = 1; } }
  Return X;
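The idea above is the classic majority-vote scheme; a small Python sketch, written in the standard candidate/counter form (not a literal transcription of the slide's pseudocode), is:

    def majority_candidate(stream):
        """One pass, two variables: returns the only element that CAN occur
        more than N/2 times (a second pass is needed to verify it)."""
        X, C = None, 0
        for s in stream:
            if C == 0:
                X, C = s, 1
            elif X == s:
                C += 1
            else:
                C -= 1
        return X

    A = list("bacccdcbaaaccbccc")
    print(majority_candidate(A))    # 'c'  (c occurs 9 times out of 17)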

Proof
(The guarantee fails if the most frequent item occurs ≤ N/2 times.)
If X ≠ y at the end, then every one of y's occurrences has a distinct "negative" mate, hence the number of mates is ≥ #occ(y), so N ≥ 2 * #occ(y).
But y is the majority, i.e. 2 * #occ(y) > N: contradiction.

Toy problem #4: Indexing


Consider the following TREC collection:
  N = 6 * 10^9  ⇒  size = 6Gb
  n = 10^6 documents
  TotT = 10^9 total terms (avg term length is 6 chars)
  t = 5 * 10^5 distinct terms

What kind of data structure should we build to support word-based searches ?

Solution 1: Term-Doc matrix  (t = 500K terms, n = 1 million documents)

             Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
  Antony            1                1             0          0        0        1
  Brutus            1                1             0          1        0        0
  Caesar            1                1             0          1        1        1
  Calpurnia         0                1             0          0        0        0
  Cleopatra         1                0             0          0        0        0
  mercy             1                0             1          1        1        1
  worser            1                0             1          1        1        0

Entry is 1 if the play contains the word, 0 otherwise.
Space is 500Gb !

Solution 2: Inverted index

  Brutus    → 2 4 8 16 32 64 128
  Caesar    → 1 2 3 5 8 13 21 34
  Calpurnia → 2 13 16

We can still do better: i.e. 30÷50% of the original text

1. Typically use about 12 bytes
2. We have 10^9 total terms ⇒ at least 12Gb space
3. Compressing the 6Gb of documents gets 1.5Gb of data
Better index, but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string is (losslessly) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM into fewer bits ?
NO: they are 2^n, but the shorter bit-strings available are fewer:
  ∑_{i=1,...,n-1} 2^i = 2^n − 2

We need to talk about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
  i(s) = log2 (1/p(s)) = −log2 p(s)

Lower probability ⇒ higher information

Entropy is the weighted average of i(s):
  H(S) = ∑_{s∈S} p(s) · log2 (1/p(s))    bits
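A tiny Python check of these two definitions (the distribution below is only an example; it is the same one used later for Huffman):

    from math import log2

    def self_information(p):          # i(s) = -log2 p(s)
        return -log2(p)

    def entropy(probs):               # H(S) = sum of p(s) * log2(1/p(s))
        return sum(p * log2(1.0 / p) for p in probs if p > 0)

    probs = [0.1, 0.2, 0.2, 0.5]
    print(self_information(0.5))      # 1.0 bit
    print(entropy(probs))             # about 1.76 bits per symbol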

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie:
[Figure: binary trie, edge labels 0/1; the path 0 leads to a, 100 to b, 101 to c, 11 to d]

Average Length
For a code C with codeword length L[s], the
average length is defined as
  La(C) = ∑_{s∈S} p(s) · L[s]

We say that a prefix code C is optimal if for all prefix codes C', La(C) ≤ La(C')

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal uniquely decodable code, there exists a prefix code with the same codeword lengths, and thus the same optimal average length. And vice versa…

Theorem (golden rule). If C is an optimal prefix code for a source with probabilities {p1, …, pn}, then pi < pj ⇒ L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any uniquely decodable code C, we have
  H(S) ≤ La(C)
Theorem (upper bound, Shannon). For any probability distribution, there exists a prefix code C such that
  La(C) ≤ H(S) + 1
(The Shannon code takes ⌈log2 (1/p(s))⌉ bits per symbol.)

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5
[Figure: Huffman tree; a(.1) and b(.2) merge into (.3); (.3) and c(.2) merge into (.5); (.5) and d(.5) merge into (1)]
a=000, b=001, c=01, d=1
There are 2^(n-1) "equivalent" Huffman trees

What about ties (and thus, tree depth) ?
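A compact Python sketch of the construction (a heap of partial trees, repeatedly merging the two smallest weights); tie-breaking is arbitrary here, which is exactly why several "equivalent" trees, and different depths, can arise.

    import heapq
    from itertools import count

    def huffman_codes(freqs):
        """freqs: dict symbol -> probability (or count).
        Returns dict symbol -> codeword (string of '0'/'1')."""
        tiebreak = count()                       # avoids comparing dicts on equal weights
        heap = [(p, next(tiebreak), {s: ""}) for s, p in freqs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            p1, _, c1 = heapq.heappop(heap)      # two smallest-weight trees
            p2, _, c2 = heapq.heappop(heap)
            merged = {s: "0" + w for s, w in c1.items()}
            merged.update({s: "1" + w for s, w in c2.items()})
            heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
        return heap[0][2]

    code = huffman_codes({"a": .1, "b": .2, "c": .2, "d": .5})
    print(code)   # codeword lengths 3,3,2,1 as in the running example (bits depend on tie-breaking)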

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
abc...  ⇒  00000101
101001...  ⇒  dcb
[Figure: the Huffman tree of the running example, used for both encoding and decoding]

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman
Model size may be large.
Huffman codes can be made succinct in the representation of the codeword tree, and fast in (de)coding: the Canonical Huffman tree.

We store, for any level L:
  firstcode[L]   (e.g. 00.....0 on the deepest level)
  Symbol[L,i], for each i in level L
This is ≤ h² + |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. Its self-information is
  −log2(.999) ≈ .00144
If we were to send 1000 such symbols we might hope to use 1000 · .00144 ≈ 1.44 bits.
Using Huffman, we take at least 1 bit per symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
  ⇒ 1 extra bit per macro-symbol = 1/k extra bits per symbol
  ⇒ Larger model to be transmitted
Shannon took infinite sequences, and k → ∞ !!
In practice, we have:
  The model takes |S|^k · (k · log |S|) + h² bits (where h might be |S|)
  It is H0(S^L) ≤ L · Hk(S) + O(k · log |S|), for each k ≤ L

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
[Figure: word-based tagged Huffman over T = "bzip or not bzip". The Huffman tree assigns each word a sequence of 7-bit symbols; each symbol is packed into one byte, and the spare bit of the first byte tags the start of a codeword, so codewords in C(T) are byte-aligned.]

CGrep and other ideas...
P = bzip = 1a 0b
[Figure: the pattern is encoded with the same word-based codebook and then searched directly (GREP-style) in C(T), T = "bzip or not bzip"; the tag bits prevent false matches inside other codewords]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary = {bzip, not, or, space}
P = bzip = 1a 0b
S = "bzip or not bzip"
[Figure: the encoded pattern is compared against the codewords of C(S), answering yes/no at each codeword]
Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the pattern P[1,m] in the text T[1,n].
[Figure: the pattern P slid along the text T]
 Naïve solution
   For any position i of T, check if T[i,i+m-1] = P[1,m]
   Complexity: O(nm) time
 (Classical) Optimal solutions based on comparisons
   Knuth-Morris-Pratt
   Boyer-Moore
   Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bit-operations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with small probability.
A randomized algorithm that never errs, whose running time is efficient with high probability.

We will consider a binary alphabet (i.e., T ∈ {0,1}^n).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m:
  H(s) = ∑_{i=1,...,m} 2^{m−i} · s[i]

P = 0101
H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5
s = s' if and only if H(s) = H(s')

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(T_r) from H(T_{r−1}):
  H(T_r) = 2·H(T_{r−1}) − 2^m · T(r−1) + T(r+m−1)

T = 10110101
T1 = 1011, T2 = 0110
H(T1) = H(1011) = 11
H(T2) = H(0110) = 2·11 − 2^4·1 + 0 = 22 − 16 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P = 101111, q = 7
H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally (Horner's rule):
  1
  1·2 + 0 = 2 (mod 7)
  2·2 + 1 = 5 (mod 7)
  5·2 + 1 = 11 = 4 (mod 7)
  4·2 + 1 = 9 = 2 (mod 7)
  2·2 + 1 = 5 (mod 7) = Hq(P)

We can still compute Hq(T_r) from Hq(T_{r−1}):
  2^m (mod q) = 2·(2^{m−1} (mod q)) (mod q)
Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
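A short Python sketch of the fingerprint scan; base 256 over characters stands in for the binary H() of the slides, and q is fixed rather than drawn at random, so this corresponds to the "check and declare a definite match" variant.

    def karp_rabin(T, P, q=2**61 - 1):
        """Report all 0-based positions r where P occurs in T.
        Hq(T_r) is updated in O(1) per shift; candidate matches are verified."""
        n, m = len(T), len(P)
        if m > n:
            return []
        B = 256
        Bm = pow(B, m - 1, q)                     # B^(m-1) mod q, to drop the leftmost char
        hP = hT = 0
        for i in range(m):
            hP = (hP * B + ord(P[i])) % q
            hT = (hT * B + ord(T[i])) % q
        occ = []
        for r in range(n - m + 1):
            if hP == hT and T[r:r+m] == P:        # explicit check -> never errs
                occ.append(r)
            if r < n - m:                         # roll: drop T[r], add T[r+m]
                hT = ((hT - ord(T[r]) * Bm) * B + ord(T[r + m])) % q
        return occ

    print(karp_rabin("10110101", "0101"))   # [4], i.e. position 5 in the slide's 1-based counting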

Problem 1: Solution
Dictionary = {bzip, not, or, space}
P = bzip = 1a 0b
S = "bzip or not bzip"
[Figure: the encoded pattern is searched directly in C(S) with an exact string-matching algorithm, answering yes/no at each codeword boundary]
Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for
[Figure: the 3×10 matrix M for P = for over T = california; all entries are 0 except the diagonal M(1,5), M(2,6), M(3,7): the 1 in M(3,7) marks the occurrence of "for" ending at position 7]

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M
We want to exploit bit-parallelism to compute the j-th column of M from the (j−1)-th one.
We define the m-length binary vector U(x) for each character x in the alphabet: U(x) is set to 1 in the positions where x appears in P.
Example: P = abaac
  U(a) = (1,0,1,1,0)
  U(b) = (0,1,0,0,0)
  U(c) = (0,0,0,0,1)

How to construct M
Initialize column 0 of M to all zeros.
For j > 0, the j-th column is obtained as
  M(j) = BitShift( M(j−1) ) & U( T[j] )
For i > 1, entry M(i,j) = 1 iff
  (1) the first i−1 characters of P match the i−1 characters of T ending at position j−1   ⇔   M(i−1,j−1) = 1
  (2) P[i] = T[j]   ⇔   the i-th bit of U(T[j]) = 1
BitShift moves bit M(i−1,j−1) into the i-th position; AND-ing with the i-th bit of U(T[j]) establishes whether both conditions hold.

An example (P = abaac, T = xabxabaaca)
[Figure: the columns M(1), M(2), …, M(9) computed with M(j) = BitShift(M(j−1)) & U(T[j]); at j = 9 the last bit of the column is set, i.e. M(5,9) = 1, signalling an occurrence of P ending at position 9 of T]

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.
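A minimal Python sketch of the bit-parallel scan; the columns of M are plain integers, and since Python integers are unbounded the m ≤ w restriction is not enforced here.

    def shift_and(T, P):
        """Report all 0-based positions where an occurrence of P ends in T."""
        m = len(P)
        U = {}                                   # U[c]: bit i set iff P[i+1] == c
        for i, c in enumerate(P):
            U[c] = U.get(c, 0) | (1 << i)
        M = 0                                    # current column of the matrix
        last = 1 << (m - 1)                      # bit corresponding to a full match
        occ = []
        for j, c in enumerate(T):
            M = ((M << 1) | 1) & U.get(c, 0)     # BitShift + AND with U(T[j])
            if M & last:
                occ.append(j)
        return occ

    print(shift_and("xabxabaaca", "abaac"))      # [8] -> P ends at position 9 in 1-based counting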

Some simple extensions
We want to allow the pattern to contain special symbols, like the class of chars [a-f].
Example: P = [a-b]baac
  U(a) = (1,0,1,1,0)
  U(b) = (1,1,0,0,0)
  U(c) = (0,0,0,0,1)
What about '?', '[^…]' (not) ?

Problem 1: Another solution
Dictionary = {bzip, not, or, space}
P = bzip = 1a 0b
S = "bzip or not bzip"
[Figure: the same search over C(S), now cast as an exact pattern-matching problem on the codeword sequence]
Speed ≈ Compression ratio

Problem 2
Dictionary = {bzip, not, or, space}
Given a pattern P, find all the occurrences in S of all terms containing P as substring.
P = o
S = "bzip or not bzip"
[Figure: the dictionary terms containing "o" are or = 1g 0a 0b and not = 1g 0g 0a; each of their codewords is searched in C(S)]
Speed ≈ Compression ratio? No! Why?
A scan of C(S) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want to find all the occurrences of those patterns in the text T[1,n].
[Figure: several patterns matched simultaneously against T]
 Naïve solution
   Use an (optimal) exact-matching algorithm to search for each pattern of P
   Complexity: O(nl + m) time, not good with many patterns
 Optimal solution due to Aho and Corasick
   Complexity: O(n + l + m) time

A simple extension of Shift-And
  S is the concatenation of the patterns in P
  R is a bitmap of length m: R[i] = 1 iff S[i] is the first symbol of a pattern
Use a variant of the Shift-And method searching for S:
  For any symbol c, U'(c) = U(c) AND R
    U'(c)[i] = 1 iff S[i] = c and it is the first symbol of a pattern
  For any step j:
    compute M(j)
    then M(j) OR U'(T[j]). Why?
      It sets to 1 the first bit of each pattern that starts with T[j]
    Check if there are occurrences ending in j. How?

Problem 3
Dictionary = {bzip, not, or, space}
Given a pattern P, find all the occurrences in S of all terms containing P as substring, allowing at most k mismatches.
P = bot, k = 2
S = "bzip or not bzip"
[Figure: the dictionary codewords are scanned in C(S) looking for approximate matches of P]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix M^l to be an m by n binary matrix, such that:
  M^l(i,j) = 1 iff there are no more than l mismatches between the first i characters of P and the i characters of T ending at position j.

What is M^0?
How does M^k solve the k-mismatch problem?

Computing M^k
We compute M^l for all l = 0, …, k. For each j we compute M(j), M^1(j), …, M^k(j).
For all l, initialize M^l(0) to the zero vector.
In order to compute M^l(j), we observe that there is a match iff one of the two cases below holds.

Computing M^l: case 1
The first i−1 characters of P match a substring of T ending at j−1, with at most l mismatches, and the next pair of characters in P and T are equal:
  BitShift( M^l(j−1) ) & U( T[j] )

Computing M^l: case 2
The first i−1 characters of P match a substring of T ending at j−1, with at most l−1 mismatches:
  BitShift( M^{l−1}(j−1) )

Computing M^l
Putting the two cases together:
  M^l(j) = [ BitShift( M^l(j−1) ) & U( T[j] ) ]  OR  BitShift( M^{l−1}(j−1) )

Example M1
T = xabxabaaca, P = abaad
[Figure: the matrices M^0 and M^1 for P = abaad over T = xabxabaaca, filled column by column with the recurrence above; M^1(5,9) = 1, i.e. P occurs ending at position 9 of T with at most 1 mismatch]

How much do we pay?
The running time is O(k·n·(1 + m/w)).
Again, the method is practically efficient for small m.
Only O(k) columns of M are needed at any given time. Hence, the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary = {bzip, not, or, space}
Given a pattern P, find all the occurrences in S of all terms containing P as substring, allowing k mismatches.
P = bot, k = 2
S = "bzip or not bzip"
[Figure: the Shift-And-with-mismatches scan over C(S); e.g. not = 1g 0g 0a is reported as a match of P with ≤ 2 mismatches]

Agrep: more sophisticated operations
The Shift-And method can solve other ops.
The edit distance between two strings p and s is d(p,s) = the minimum number of operations needed to transform p into s via three ops:
  Insertion: insert a symbol in p
  Deletion: delete a symbol from p
  Substitution: change a symbol in p with a different one
Example: d(ananas, banane) = 3

Search by regular expressions
  Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding
  γ(x) = 0^{ℓ−1} followed by x in binary, where ℓ = ⌊log2 x⌋ + 1 and x > 0
  e.g., 9 is represented as <000, 1001>
  The γ-code for x takes 2⌊log2 x⌋ + 1 bits (i.e. a factor of 2 from optimal)
  Optimal for Pr(x) = 1/(2x²), and i.i.d. integers
It is a prefix-free encoding…

Given the following sequence of γ-coded integers, reconstruct the original sequence:
  0001000001100110000011101100111
Answer: 8, 6, 3, 59, 7
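A small Python sketch of γ-encoding and of decoding a concatenation of codes (operating on '0'/'1' strings for readability):

    def gamma_encode(x):
        """x > 0: (len-1) zeros followed by the binary representation of x."""
        b = bin(x)[2:]
        return "0" * (len(b) - 1) + b

    def gamma_decode_all(bits):
        out, i = [], 0
        while i < len(bits):
            z = 0
            while bits[i + z] == "0":            # count the leading zeros
                z += 1
            out.append(int(bits[i + z : i + 2 * z + 1], 2))
            i += 2 * z + 1
        return out

    print(gamma_encode(9))                                        # 0001001
    print(gamma_decode_all("0001000001100110000011101100111"))    # [8, 6, 3, 59, 7]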

Analysis
Sort the p_i in decreasing order, and encode the symbol s_i via the variable-length code γ(i).
Recall that: |γ(i)| ≤ 2·log i + 1
How good is this approach wrt Huffman?
  Compression ratio ≤ 2·H0(s) + 1
Key fact:
  1 ≥ ∑_{i=1,...,x} p_i ≥ x·p_x   ⇒   x ≤ 1/p_x

How good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1
The cost of the encoding is (recall i ≤ 1/p_i):
  ∑_{i=1,...,|S|} p_i · |γ(i)|  ≤  ∑_{i=1,...,|S|} p_i · [2·log(1/p_i) + 1]  =  2·H0(X) + 1
Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
The distribution of words is skewed: 1/i^q, where 1 < q < 2
A new concept: Continuers vs Stoppers
  Previously we used: s = c = 128
The main idea is:
  s + c = 256 (we are playing with 8 bits)
  Thus s items are encoded with 1 byte
  And s·c with 2 bytes, s·c² with 3 bytes, ...

An example
  5000 distinct words
  ETDC encodes 128 + 128² = 16512 words on 2 bytes
  A (230,26)-dense code encodes 230 + 230·26 = 6210 words on 2 bytes, hence more on 1 byte, and thus better if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:
  It exploits temporal locality, and it is dynamic
  X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n² log n) bits, MTF = O(n log n) + n² bits
Not much worse than Huffman
...but it may be far better
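A direct Python transcription of the two steps (output the position, then move the symbol to the front); positions are emitted 1-based so that they can be fed to a γ-coder like the one sketched earlier.

    def mtf_encode(text, alphabet):
        L = list(alphabet)                 # initial symbol list
        out = []
        for s in text:
            pos = L.index(s)               # 0-based position of s in L
            out.append(pos + 1)            # emit it 1-based (ready for gamma-coding)
            L.pop(pos)
            L.insert(0, s)                 # move s to the front
        return out

    print(mtf_encode("aaabbbba", "ab"))    # [1, 1, 1, 2, 1, 1, 1, 2]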

MTF: how good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1
Put S in front of the list and consider the cost of encoding:

  O(|S| log |S|) + ∑_{x=1,...,|S|} ∑_{i=2,...,n_x} |γ( p_i^x − p_{i−1}^x )|

where p_1^x < p_2^x < ... are the positions of the n_x occurrences of symbol x. By Jensen's inequality this is

  ≤ O(|S| log |S|) + ∑_{x=1,...,|S|} n_x · [ 2·log(N/n_x) + 1 ]
  = O(|S| log |S|) + N · [ 2·H0(X) + 1 ]

so La[mtf] ≤ 2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
  abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings ⇒ just the run lengths and one bit
Properties:
  There is a memory: it exploits spatial locality, and it is a dynamic code
  X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = Θ(n² log n) > Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive):
  f(i) = ∑_{j=1,...,i−1} p(j)
e.g. p(a) = .2, p(b) = .5, p(c) = .3  ⇒  f(a) = .0, f(b) = .2, f(c) = .7
[Figure: the unit interval split into a = [0,.2), b = [.2,.7), c = [.7,1.0)]
The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
[Figure: b restricts [0,1) to [.2,.7); then a restricts it to [.2,.3); then c restricts it to [.27,.3)]
The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities p[c] use the following:
  l_0 = 0,  l_i = l_{i−1} + s_{i−1} · f[c_i]
  s_0 = 1,  s_i = s_{i−1} · p[c_i]
f[c] is the cumulative probability up to symbol c (not included).
The final interval size is
  s_n = ∏_{i=1,...,n} p[c_i]
The interval for a message sequence will be called the sequence interval.
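A few lines of Python reproduce this interval arithmetic (exact rationals avoid rounding; the real coders use the integer scaling discussed later, and the distribution below is the example one):

    from fractions import Fraction as F

    p = {"a": F(2, 10), "b": F(5, 10), "c": F(3, 10)}
    f = {"a": F(0), "b": F(2, 10), "c": F(7, 10)}     # cumulative prob. up to the symbol (excluded)

    def sequence_interval(msg):
        l, s = F(0), F(1)                 # l_0 = 0, s_0 = 1
        for c in msg:
            l = l + s * f[c]              # l_i = l_{i-1} + s_{i-1} * f[c_i]
            s = s * p[c]                  # s_i = s_{i-1} * p[c_i]
        return l, l + s

    print(sequence_interval("bac"))       # (Fraction(27, 100), Fraction(3, 10)), i.e. [.27, .3)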

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:
[Figure: .49 falls in b = [.2,.7); within that interval, .49 falls in b = [.3,.55); within that, .49 falls in c = [.475,.55)]
The message is bbc.

Representing a real number
Binary fractional representation:
  .75 = .11      1/3 = .0101…      11/16 = .1011
Algorithm:
  1. x = 2*x
  2. If x < 1, output 0
  3. else x = x − 1; output 1
So how about just using the shortest binary fractional representation in the sequence interval?
e.g. [0,.33) → .01     [.33,.66) → .1     [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions:

  code    min      max      interval
  .11     .110     .111     [.75, 1.0)
  .101    .1010    .1011    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval is contained in the sequence interval (a dyadic number).
[Figure: sequence interval [.61, .79) contains the code interval of .101, i.e. [.625, .75)]
Can use L + s/2 truncated to 1 + ⌈log (1/s)⌉ bits

Bound on Arithmetic length
Note that −log s + 1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most
  1 + ⌈log (1/s)⌉ = 1 + ⌈log ∏_{i=1,...,n} (1/p_i)⌉
  ≤ 2 + ∑_{i=1,...,n} log (1/p_i)
  = 2 + ∑_{k=1,...,|S|} n·p_k · log (1/p_k)
  = 2 + n·H0 bits

In practice ≈ n·H0 + 0.02·n bits, because of rounding

Integer Arithmetic Coding
The problem is that operations on arbitrary-precision real numbers are expensive.
Key ideas of the integer version:
  Keep integers in range [0..R) where R = 2^k
  Use rounding to generate the integer interval
  Whenever the sequence interval falls into the top, bottom or middle half, expand the interval by a factor 2
Integer Arithmetic is an approximation.

Integer Arithmetic (scaling)
  If l ≥ R/2 (top half): output 1 followed by m 0s; m = 0; the message interval is expanded by 2
  If u < R/2 (bottom half): output 0 followed by m 1s; m = 0; the message interval is expanded by 2
  If l ≥ R/4 and u < 3R/4 (middle half): increment m; the message interval is expanded by 2
  In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine:
[Figure: the ATB takes the current state (L,s), the next symbol c and its distribution (p1,....,pS), and produces the new state (L',s') with L' = L + s·f[c] and s' = s·p[c]]
Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
[Figure: at each step the ATB is fed with p[ s | context ], where s = c or esc]
Encoder and Decoder must know the protocol for selecting the same conditional probability distribution (PPM-variant)

PPM: Example Contexts (k = 2)
String = ACCBACCACBA B

Order 0 (empty context):  A = 4, B = 2, C = 5, $ = 3

Order 1:
  A:  C = 3, $ = 1
  B:  A = 2, $ = 1
  C:  A = 1, B = 2, C = 2, $ = 3

Order 2:
  AC: B = 1, C = 2, $ = 2
  BA: C = 1, $ = 1
  CA: C = 1, $ = 1
  CB: A = 2, $ = 1
  CC: A = 1, B = 1, $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
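A toy Python LZ77 with a sliding window, emitting (distance, length, next-char) triples as in the example above; there is no output coding and no optimization, only the parsing and the char-by-char copy decoding that makes the overlapping case work.

    def lz77_encode(s, window=6):
        out, i = [], 0
        while i < len(s):
            best_d = best_len = 0
            start = max(0, i - window)
            for j in range(start, i):                       # candidate copy sources in the window
                l = 0
                while i + l < len(s) - 1 and s[j + l] == s[i + l]:
                    l += 1                                   # the match may run past i (overlap)
                if l > best_len:
                    best_d, best_len = i - j, l
            out.append((best_d, best_len, s[i + best_len]))
            i += best_len + 1
        return out

    def lz77_decode(codes):
        s = []
        for d, l, c in codes:
            for _ in range(l):
                s.append(s[len(s) - d])                      # copy, works also when l > d
            s.append(c)
        return "".join(s)

    c = lz77_encode("aacaacabcabaaac")
    print(c)                 # [(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')] as in the slide
    print(lz77_decode(c))    # aacaacabcabaaac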

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later
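A Python sketch of both LZW directions; as a simplification the dictionary is seeded with single characters rather than the full 256-entry ASCII table, and the decoder handles the special SSc case where it receives a code it has not built yet.

    def lzw_encode(text, alphabet):
        dic = {c: i for i, c in enumerate(alphabet)}
        out, S = [], ""
        for c in text:
            if S + c in dic:
                S += c                                   # extend the current match
            else:
                out.append(dic[S])
                dic[S + c] = len(dic)                    # add Sc, but do NOT emit c
                S = c
        out.append(dic[S])
        return out

    def lzw_decode(codes, alphabet):
        dic = {i: c for i, c in enumerate(alphabet)}
        prev = dic[codes[0]]
        out = [prev]
        for code in codes[1:]:
            if code in dic:
                cur = dic[code]
            else:                                        # the special SSc case: code not built yet
                cur = prev + prev[0]
            dic[len(dic)] = prev + cur[0]                # the decoder is one step behind
            out.append(cur)
            prev = cur
        return "".join(out)

    codes = lzw_encode("aabaacababacb", "abc")
    print(codes)                                         # [0, 0, 1, 3, 2, 4, 8, 2, 1]
    print(lzw_decode(codes, "abc") == "aabaacababacb")   # True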

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us be given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
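A small Python sketch of both directions; the forward transform naively sorts all rotations (the Θ(n² log n) approach criticized a few slides below), while the inverse follows the LF-mapping / backward reconstruction just described.

    def bwt(T):
        """T must end with a unique smallest char, e.g. '#'."""
        n = len(T)
        rotations = sorted(T[i:] + T[:i] for i in range(n))
        return "".join(row[-1] for row in rotations)

    def ibwt(L):
        n = len(L)
        # LF-mapping: the k-th occurrence of char c in L is the k-th occurrence of c in F
        order = sorted(range(n), key=lambda r: (L[r], r))   # stable: ties by position in L
        LF = [0] * n
        for f_row, l_row in enumerate(order):
            LF[l_row] = f_row
        out, r = [], 0                     # row 0 is the rotation starting with the sentinel
        for _ in range(n):
            out.append(L[r])               # L[r] precedes F[r] in T: emit backwards
            r = LF[r]
        t = "".join(reversed(out))         # this is T rotated so that '#' comes first
        return t[1:] + t[0]                # put the sentinel back at the end

    B = bwt("mississippi#")
    print(B)                               # ipssm#pissii
    print(ibwt(B))                         # mississippi#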

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion pages are available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution
Altavista crawl, 1999 / WebBase crawl, 2001: the indegree follows a power-law distribution
  Pr[ in-degree(u) = k ]  ∝  1/k^a,   a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 million pages, 150 million links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides an efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
            Emacs size   Emacs time
  uncompr   27Mb         ---
  gzip      8Mb          35 secs
  zdelta    1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

[Figure: dual-proxy architecture: the client-side proxy sends the request plus a reference version over the slow link; the server-side proxy fetches the page from the web over the fast link and returns a delta-encoding]

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: example weighted graph GF with a dummy node; the min branching chooses, for each file, either the gzip edge from the dummy node or a zdelta edge from another file]

            space   time
  uncompr   30Mb    ---
  tgz       20%     linear
  THIS      8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly: n² edge calculations (zdelta executions).
We wish to exploit some pruning approach:
  Collection analysis: Cluster the files that appear similar and thus are good candidates for zdelta-compression. Build a sparse weighted graph G'F containing only edges between those pairs of files.
  Assign weights: Estimate appropriate edge weights for G'F, thus saving zdelta executions. Nonetheless, this still takes n² time.

            space    time
  uncompr   260Mb    ---
  tgz       12%      2 mins
  THIS      8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
[Figure: the client holds f_old and sends an update request; the server holds f_new and must encode it so the client can rebuild it]

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
[Figure: the client sends the block hashes of f_old; the server replies with f_new encoded as block references plus literal data]

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
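A toy Python sketch of the block-matching idea; the whole-block Python hash recomputed at every offset stands in for the 4-byte rolling hash + MD5 pair that rsync actually uses, and the tiny block size is only for readability.

    BLOCK = 4

    def block_hashes(f_old):
        blocks = [f_old[i:i+BLOCK] for i in range(0, len(f_old), BLOCK)]
        return {hash(b): k for k, b in enumerate(blocks)}, blocks

    def encode_new(f_new, table):
        """Scan f_new; emit ('copy', block_id) when the current window matches a
        block of f_old, otherwise emit one literal char and slide by one."""
        out, i = [], 0
        while i < len(f_new):
            w = f_new[i:i+BLOCK]
            k = table.get(hash(w))
            if len(w) == BLOCK and k is not None:
                out.append(("copy", k)); i += BLOCK
            else:
                out.append(("lit", f_new[i])); i += 1
        return out

    def decode(encoded, blocks):
        return "".join(blocks[k] if tag == "copy" else k for tag, k in encoded)

    f_old = "the quick brown fox jumps"
    f_new = "the quick red fox jumps!"
    table, blocks = block_hashes(f_old)
    enc = encode_new(f_new, table)
    print(enc)
    print(decode(enc, blocks) == f_new)     # True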

Rsync: some experiments

           gcc size   emacs size
  total    27288      27326
  gzip     7563       8577
  zdelta   227        1431
  rsync    964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync
The server sends the hashes (unlike the client in rsync); the client checks them.
The server deploys the common f_ref to compress the new f_tar (rsync compresses just it).

A multi-round protocol
  k blocks of n/k elems
  log(n/k) levels
If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
T# = mississippi#
[Figure: the suffix tree of T#; edge labels include #, i, p, s, si, ssi, i#, pi#, ppi#, mississippi#, and the 12 leaves store the starting positions of the suffixes]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
5

Θ(N²) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]
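A minimal Python sketch of building SA (by plain sorting of the suffixes, fine only for small texts) and of the O(p log n) binary search over it:

    def suffix_array(T):
        return sorted(range(len(T)), key=lambda i: T[i:])      # toy construction by plain sorting

    def sa_search(T, SA, P):
        """Binary search over SA. Returns the 0-based occurrences of P in T."""
        n, p = len(SA), len(P)
        lo, hi = 0, n
        while lo < hi:                       # leftmost suffix whose p-prefix is >= P
            mid = (lo + hi) // 2
            if T[SA[mid]:SA[mid] + p] < P:
                lo = mid + 1
            else:
                hi = mid
        first = lo
        lo, hi = first, n
        while lo < hi:                       # leftmost suffix whose p-prefix is > P
            mid = (lo + hi) // 2
            if T[SA[mid]:SA[mid] + p] <= P:
                lo = mid + 1
            else:
                hi = mid
        return sorted(SA[first:lo])

    T = "mississippi#"
    SA = suffix_array(T)
    print([i + 1 for i in SA])    # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3] (1-based, as on the slide)
    print(sa_search(T, SA, "si")) # [3, 6] -> positions 4 and 7 in 1-based counting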

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
  • It is the min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  • Search for Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  • Search for a window Lcp[i,i+C-2] whose entries are all ≥ L


Slide 178

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Θ(n log2 n) random I/Os

[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
How do we deploy the disk/mem features ?

[Figure: recursion tree of Merge-Sort on a small example; the bottom levels merge runs that fit in memory M, while the upper log2(N/M) levels merge runs residing on disk]
N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
[Figure: X = M/B input buffers of B items each plus one output buffer in main memory; the sorted runs are streamed in from disk and the merged run is written back to disk]

Multiway Merging
[Figure: each run i has an in-memory page Bf_i with a pointer p_i; the merger repeatedly outputs min(Bf1[p1], ..., BfX[pX]) into the output buffer Bfo, fetches the next page of run i when p_i = B, and flushes Bfo to the merged output file when it is full, until EOF on all X = M/B runs]
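A small sketch of the X-way merge step (Python; heapq stands in for the min-selection over the X buffer heads, and plain in-memory lists stand in for the disk-resident sorted runs):

import heapq

def multiway_merge(runs):
    # runs: list of already-sorted sequences (each one models a sorted run on disk).
    # The heap always holds one "current head" per run, mimicking min(Bf1[p1],...,BfX[pX]).
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []                                  # stands in for the output buffer Bfo
    while heap:
        val, i, pos = heapq.heappop(heap)
        out.append(val)
        if pos + 1 < len(runs[i]):            # "fetch" the next element of run i
            heapq.heappush(heap, (runs[i][pos + 1], i, pos + 1))
    return out

print(multiway_merge([[1, 5, 13, 19], [7, 9], [3, 4, 8, 15], [6, 11, 12, 17]]))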

Cost of Multi-way Merge-Sort


Number of passes = log_{M/B} #runs = log_{M/B} (N/M)

Optimal cost = Θ( (N/B) · log_{M/B} (N/M) ) I/Os

In practice

  M/B ≈ 1000  ⇒  #passes = log_{M/B} (N/M) ≈ 1
  One multiway merge  ⇒  2 passes (R/W) = few mins

Tuning depends on disk features:
  Large fan-out (M/B) decreases #passes
  Compression would decrease the cost of a pass!

Can compression help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm

Use a pair of variables X, C
For each item s of the stream:
  if (X == s) then C++
  else { C--; if (C == 0) { X = s; C = 1; } }

Return X;

Proof
Suppose the algorithm returns X ≠ y. Then every one of y's occurrences has a distinct "negative" mate (a decrement caused by some other item). Hence the mates are at least #occ(y), so the stream holds at least 2 * #occ(y) > N items: a contradiction.
(Problems arise if the top frequency is ≤ N/2.)
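A runnable sketch of the two-variable scan (Python; function name and the initialisation of X are ours — the stream is the slide's example, with the majority element 'c'):

def majority_candidate(stream):
    # Keep one candidate X and a counter C; a true majority (> N/2 occurrences)
    # can never be cancelled out by all the other items together.
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1
        elif X == s:
            C += 1
        else:
            C -= 1
    return X          # verify with a second pass if a majority is not guaranteed

A = list("bacccdcbaaaccbccc")
print(majority_candidate(A))   # 'c'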

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 10^9 chars, size = 6Gb
n = 10^6 documents
TotT = 10^9 term occurrences (avg term length is 6 chars)
t = 5 * 10^5 distinct terms

What kind of data structure should we build to support word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million documents, t = 500K terms; the entry is 1 if the play contains the word, 0 otherwise.

            Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony            1                1             0          0        0        1
Brutus            1                1             0          1        0        0
Caesar            1                1             0          1        1        1
Calpurnia         0                1             0          0        0        0
Cleopatra         1                0             0          0        0        0
mercy             1                0             1          1        1        1
worser            1                0             1          1        1        0

Space is 500Gb !

Solution 2: Inverted index
Brutus     →  2  4  8  16  32  64  128
Caesar     →  1  2  3  5  8  13  21  34
Calpurnia  →  13  16

We can still do better: i.e. 30-50% of the original text
1. Typically about 12 bytes are used per posting
2. We have 10^9 total terms ⇒ at least 12Gb of space
3. Compressing the 6Gb of documents gets 1.5Gb of data
A better index, but it is still >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO: they are 2^n, but we have fewer compressed messages:

   Σ_{i=1}^{n-1} 2^i = 2^n − 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits
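For concreteness, a tiny sketch computing H(S) for a given distribution (Python; the probabilities below are just an example):

from math import log2

def entropy(probs):
    # H(S) = sum over symbols of p(s) * log2(1/p(s))
    return sum(p * log2(1.0 / p) for p in probs if p > 0)

print(entropy([0.5, 0.25, 0.25]))   # 1.5 bits per symbol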

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
[Figure: the binary trie of the prefix code a = 0, b = 100, c = 101, d = 11]

Average Length
For a code C with codeword length L[s], the
average length is defined as
La(C) = Σ_{s ∈ S} p(s) · L[s]

We say that a prefix code C is optimal if for all prefix codes C', La(C) ≤ La(C')

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal uniquely decodable code, there exists a prefix code with the same codeword lengths, and thus the same optimal average length. And vice versa…

Theorem (golden rule). If C is an optimal prefix code for a source with probabilities {p1, …, pn},

then pi < pj ⇒ L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
[Figure: Huffman tree — merge a(.1) + b(.2) = (.3); then (.3) + c(.2) = (.5); then (.5) + d(.5) = (1)]

a=000, b=001, c=01, d=1
There are 2^(n−1) “equivalent” Huffman trees

What about ties (and thus, tree depth) ?
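A compact sketch of the greedy construction on the running example (Python; heapq repeatedly merges the two least-probable subtrees, ties broken arbitrarily, so the codeword lengths match the slide even if the individual bits may differ):

import heapq
from itertools import count

def huffman_codes(freqs):
    # Each heap entry is (probability, tiebreak, partial code table).
    tiebreak = count()
    heap = [(p, next(tiebreak), {sym: ''}) for sym, p in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)      # two least-probable subtrees
        p1, _, c1 = heapq.heappop(heap)
        merged = {s: '0' + w for s, w in c0.items()}
        merged.update({s: '1' + w for s, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]

print(huffman_codes({'a': .1, 'b': .2, 'c': .2, 'd': .5}))
# {'d': '0', 'c': '10', 'a': '110', 'b': '111'}: lengths 1,2,3,3 as in the slide
# (the bit labels differ from the slide only by a 0/1 relabeling of the tree)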

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
Encoding:  abc → 000 001 01 = 00000101
Decoding:  101001… → d c b …

[Figure: the Huffman tree of the running example, traversed root-to-leaf for encoding and bit-by-bit for decoding]

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
[Figure: the tagged word-based Huffman tree and the compressed text C(T) for T = “bzip or not bzip”; each codeword is a sequence of 7-bit symbols packed into bytes, with the tag bit marking codeword boundaries]

CGrep and other ideas...
P = bzip, encoded as 1a 0b

[Figure: GREP-like scan of the compressed text C(T) of T = “bzip or not bzip”: the tagged, byte-aligned codeword of P is matched directly against C(T), answering yes/no at each codeword boundary]
Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

[Figure: dictionary = {bzip, not, or, space}, S = “bzip or not bzip”; the codeword of P = bzip is searched directly in the compressed file C(S)]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
[Figure: text T and pattern P as character arrays, with P slid along T]

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bit operations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with small probability.
A randomized algorithm that never errs, whose running time is efficient with high probability.

We will consider a binary alphabet (i.e., T ∈ {0,1}^n).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H(s) = Σ_{i=1}^{m} 2^{m−i} · s[i]

P = 0101
H(P) = 2^3·0 + 2^2·1 + 2^1·0 + 2^0·1 = 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H(T_r) = 2 · H(T_{r−1}) − 2^m · T(r−1) + T(r+m−1)

T = 10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H(T1) = H(1011) = 11
H(T2) = H(0110) = 2·11 − 2^4·1 + 0 = 22 − 16 + 0 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally (reducing mod 7 at each step):
  1·2 (mod 7) + 0 = 2
  2·2 (mod 7) + 1 = 5
  5·2 (mod 7) + 1 = 11 ≡ 4 (mod 7)
  4·2 (mod 7) + 1 = 9 ≡ 2 (mod 7)
  2·2 (mod 7) + 1 = 5
  5 (mod 7) = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1):
  2^m (mod q) = 2 · (2^{m−1} (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
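A small sketch of the fingerprint scan (Python; a fixed prime q is hard-coded here just for illustration, whereas the algorithm above picks it at random, and every fingerprint match is re-checked, i.e. the deterministic variant):

def karp_rabin(T, P, q=2_147_483_647):
    # Hq(s) = (sum of 2^(m-i) * s[i]) mod q, maintained by a rolling update.
    n, m = len(T), len(P)
    if m > n:
        return []
    top = pow(2, m - 1, q)                  # 2^(m-1) mod q: weight of the outgoing bit
    hp = ht = 0
    for i in range(m):                      # fingerprints of P and of T[0..m-1]
        hp = (2 * hp + P[i]) % q
        ht = (2 * ht + T[i]) % q
    occ = []
    for r in range(n - m + 1):
        if hp == ht and T[r:r + m] == P:    # verify, to rule out false matches
            occ.append(r)
        if r + m < n:                       # slide the window: drop T[r], add T[r+m]
            ht = (2 * (ht - top * T[r]) + T[r + m]) % q
    return occ

T = [1, 0, 1, 1, 0, 1, 0, 1]
P = [0, 1, 0, 1]
print(karp_rabin(T, P))   # [4]  (0-based; position 5 in the slide's 1-based indexing)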

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M is an m × n binary matrix.  Example (T = california, P = for):

         c a l i f o r n i a
    f :  0 0 0 0 1 0 0 0 0 0
    o :  0 0 0 0 0 1 0 0 0 0
    r :  0 0 0 0 0 0 1 0 0 0

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the (j−1)-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M(j) = BitShift( M(j−1) ) & U(T[j])

For i > 1, entry M(i,j) = 1 iff
(1) the first i−1 characters of P match the i−1 characters of T ending at character j−1   ⇔   M(i−1, j−1) = 1
(2) P[i] = T[j]   ⇔   the i-th bit of U(T[j]) = 1

BitShift moves bit M(i−1, j−1) into the i-th position;
AND this with the i-th bit of U(T[j]) to establish whether both conditions hold.

An example (P = abaac, T = xabxabaaca)

U(a) = (1,0,1,1,0)   U(b) = (0,1,0,0,0)   U(c) = (0,0,0,0,1)   U(x) = (0,0,0,0,0)

j=1:  M(1) = BitShift(M(0)) & U(x) = (1,0,0,0,0) & (0,0,0,0,0) = (0,0,0,0,0)
j=2:  M(2) = BitShift(M(1)) & U(a) = (1,0,0,0,0) & (1,0,1,1,0) = (1,0,0,0,0)
j=3:  M(3) = BitShift(M(2)) & U(b) = (1,1,0,0,0) & (0,1,0,0,0) = (0,1,0,0,0)
…
j=9:  M(9) = BitShift(M(8)) & U(c) = (1,1,0,0,1) & (0,0,0,0,1) = (0,0,0,0,1)
      ⇒ M(5,9) = 1: an occurrence of P ends at position 9

Shift-And method: Complexity








If m ≤ w, any column and vector U() fit in a memory word.
  ⇒ Any step requires O(1) time.
If m > w, any column and vector U() can be divided into ⌈m/w⌉ memory words.
  ⇒ Any step requires O(m/w) time.
Overall O(n(1 + m/w) + m) time.
Thus, it is very fast when the pattern length is close to the word size.
  ⇒ Very often in practice. Recall that w = 64 bits in modern architectures.
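A bit-parallel sketch of the method (Python; an unbounded integer plays the role of the m-bit machine word, bit i−1 of the mask corresponds to row i of M, and the reported positions are 0-based):

def shift_and(T, P):
    # U[c] has bit i set iff P[i] == c; M is the current column, kept in one integer.
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    occ, M = [], 0
    for j, c in enumerate(T):
        # BitShift: shift the previous column down by one and set the first bit to 1.
        M = ((M << 1) | 1) & U.get(c, 0)
        if M & (1 << (m - 1)):              # row m set: an occurrence ends at j
            occ.append(j - m + 1)
    return occ

print(shift_and("xabxabaaca", "abaac"))     # [4]  (i.e. the occurrence starting at position 5, 1-based)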

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
U(a) = (1,0,1,1,0)    U(b) = (1,1,0,0,0)    U(c) = (0,0,0,0,1)

What about ‘?’, ‘[^…]’ (not) ?

Problem 1: Another solution
Dictionary

P = bzip = 1a 0b

[Figure: dictionary = {bzip, not, or, space}, S = “bzip or not bzip”; the codeword of P = bzip is matched against the compressed file C(S)]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

P = o

[Figure: dictionary = {bzip, not, or, space}, S = “bzip or not bzip”; every dictionary term containing P as a substring must then be located in C(S)]
not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: text T with occurrences of patterns P1 and P2 highlighted]

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

S is the concatenation of the patterns in P
R is a bitmap of length m
  R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of the Shift-And method searching for S:
  For any symbol c, U’(c) = U(c) AND R
    ⇒ U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
  For any step j,
    compute M(j)
    then M(j) = M(j) OR U’(T[j]). Why?
    ⇒ it sets to 1 the first bit of each pattern that starts with T[j]
  Check if there are occurrences ending at j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

P = bot, k = 2

[Figure: dictionary = {bzip, not, or, space}, S = “bzip or not bzip”; terms are matched against P with up to k mismatches inside C(S)]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal: given k, find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
there are no more than l mismatches between the
first i characters of P and the i characters of T
ending at character j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

[Figure: P[1..i−1] aligned against T[..j−1] with at most l mismatches, and P[i] = T[j]]

   BitShift( M^l(j−1) ) & U(T[j])

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

[Figure: P[1..i−1] aligned against T[..j−1] with at most l−1 mismatches]

   BitShift( M^{l−1}(j−1) )

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1
T = x a b x a b a a c a      P = a b a a d

           1 2 3 4 5 6 7 8 9 10
M1 =  1 :  1 1 1 1 1 1 1 1 1 1
      2 :  0 0 1 0 0 1 0 1 1 0
      3 :  0 0 0 1 0 0 1 0 0 1
      4 :  0 0 0 0 1 0 0 1 0 0
      5 :  0 0 0 0 0 0 0 0 1 0

           1 2 3 4 5 6 7 8 9 10
M0 =  1 :  0 1 0 0 1 0 1 1 0 1
      2 :  0 0 1 0 0 1 0 0 0 0
      3 :  0 0 0 0 0 0 1 0 0 0
      4 :  0 0 0 0 0 0 0 1 0 0
      5 :  0 0 0 0 0 0 0 0 0 0

How much do we pay?





The running time is O(k n (1 + m/w))
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

P = bot, k = 2

[Figure: dictionary = {bzip, not, or, space}, S = “bzip or not bzip”; dictionary terms matching P = bot with at most 2 mismatches (e.g. “not”) are then searched for in C(S)]
not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p into a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding

  γ(x) = 0^(L−1) followed by x in binary, where x > 0 and L = ⌊log2 x⌋ + 1
  e.g., 9 is represented as <000, 1001>.

  The γ-code for x takes 2·⌊log2 x⌋ + 1 bits

  (i.e. a factor of 2 from optimal)

  Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of γ-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

Answer: 8, 6, 3, 59, 7
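A small sketch of γ-encoding and decoding (Python; bits are represented as a string of '0'/'1' characters just for readability):

def gamma_encode(x):
    # gamma(x) = (L-1) zeroes followed by x in binary, for x > 0.
    b = bin(x)[2:]
    return '0' * (len(b) - 1) + b

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == '0':               # count the unary prefix
            z += 1
            i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out

print(gamma_encode(9))                                   # 0001001
print(gamma_decode('0001000001100110000011101100111'))   # [8, 6, 3, 59, 7]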

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).

Recall that: |γ(i)| ≤ 2·log2 i + 1
How good is this approach w.r.t. Huffman?
Compression ratio ≤ 2·H0(S) + 1
Key fact:
  1 ≥ Σ_{i=1,...,x} pi ≥ x · px   ⇒   x ≤ 1/px

How good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log2 i + 1
The cost of the encoding is (recall i ≤ 1/pi):

   Σ_{i=1,...,|S|} pi · |γ(i)|  ≤  Σ_{i=1,...,|S|} pi · [ 2·log2 (1/pi) + 1 ]  =  2·H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s·c with 2 bytes, s·c² with 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 128² = 16512 words on up to 2 bytes
A (230,26)-dense code encodes 230 + 230·26 = 6210 words on up to 2 bytes, hence more words on 1 byte, and thus it wins if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1^n 2^n 3^n … n^n  ⇒  Huff = Θ(n² log n), MTF = O(n log n) + n²

Not much worse than Huffman
...but it may be far better
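A minimal sketch of MTF coding over an explicit symbol list (Python; the list operations here cost O(|S|) per symbol, whereas the following slides discuss how to support the MTF-list in O(log |S|) per operation):

def mtf_encode(text, alphabet):
    L = list(alphabet)                # the current MTF list
    out = []
    for s in text:
        i = L.index(s)                # 1) output the position of s in L
        out.append(i)
        L.insert(0, L.pop(i))         # 2) move s to the front of L
    return out

def mtf_decode(codes, alphabet):
    L = list(alphabet)
    out = []
    for i in codes:
        s = L[i]
        out.append(s)
        L.insert(0, L.pop(i))
    return ''.join(out)

codes = mtf_encode("aabbbba", "abcd")
print(codes)                          # [0, 0, 1, 0, 0, 0, 1]
print(mtf_decode(codes, "abcd"))      # aabbbba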

MTF: how good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log2 i + 1
Put S at the front and consider the cost of encoding
(p_x^i denotes the position of the i-th occurrence of symbol x):

  O(|S| log |S|)  +  Σ_{x=1,...,|S|} Σ_{i=2,...,n_x} |γ( p_x^i − p_x^{i−1} )|

By Jensen's inequality:

  ≤ O(|S| log |S|)  +  Σ_{x=1,...,|S|} n_x · [ 2·log2 (N / n_x) + 1 ]
  = O(|S| log |S|)  +  N · [ 2·H0(X) + 1 ]

Hence  La[mtf] ≤ 2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1^n 2^n 3^n … n^n   (there is a memory)  ⇒

Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)
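A tiny sketch of RLE (Python; groupby collapses each maximal run into a (symbol, length) pair):

from itertools import groupby

def rle(s):
    # abbbaacccca => [('a',1), ('b',3), ('a',2), ('c',4), ('a',1)]
    return [(c, len(list(g))) for c, g in groupby(s)]

print(rle("abbbaacccca"))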

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.  f(a) = .0, f(b) = .2, f(c) = .7,   where   f(i) = Σ_{j=1}^{i−1} p(j)

[Figure: the unit interval [0,1) partitioned into a = [0, .2), b = [.2, .7), c = [.7, 1.0)]

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
Start with [0, 1):
  b → [.2, .7)
  a → [.2, .3)
  c → [.27, .3)

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l_0 = 0,   l_i = l_{i−1} + s_{i−1} · f[c_i]
s_0 = 1,   s_i = s_{i−1} · p[c_i]

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

  s_n = Π_{i=1}^{n} p[c_i]

The interval for a message sequence will be called the
sequence interval
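A sketch of the interval computation above (Python; it only derives the final sequence interval [l_n, l_n + s_n), not the integer/bit-level encoder of the later slides, and symbols are assumed to be ordered alphabetically when building f):

def sequence_interval(msg, p):
    # f[c] = cumulative probability of the symbols preceding c (c excluded).
    f, acc = {}, 0.0
    for c in sorted(p):
        f[c] = acc
        acc += p[c]
    l, s = 0.0, 1.0
    for c in msg:                 # l_i = l_{i-1} + s_{i-1}*f[c_i],  s_i = s_{i-1}*p[c_i]
        l = l + s * f[c]
        s = s * p[c]
    return l, l + s

print(sequence_interval("bac", {'a': .2, 'b': .5, 'c': .3}))   # ≈ (0.27, 0.3)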

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
[0, 1):      0.49 ∈ [.2, .7)    →  b
[.2, .7):    0.49 ∈ [.3, .55)   →  b
[.3, .55):   0.49 ∈ [.475, .55) →  c

The message is bbc.

Representing a real number
Binary fractional representation:
.75 = .11      1/3 = .010101…      11/16 = .1011

Algorithm
  1. x = 2·x
  2. if x < 1, output 0
  3. else x = x − 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
          min      max      interval
  .11     .110…    .111…    [.75, 1.0)
  .101    .1010…   .1011…   [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most
  1 + ⌈ log2 (1/s) ⌉ = 1 + ⌈ log2 Π_{i=1,...,n} (1/p_i) ⌉
                     ≤ 2 + Σ_{i=1,...,n} log2 (1/p_i)
                     = 2 + Σ_{k=1,...,|S|} n·p_k · log2 (1/p_k)
                     = 2 + n·H0    bits

In practice ≈ n·H0 + 0.02·n bits, because of rounding.

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R = 2^k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
  Output 1 followed by m 0s; m = 0
  Message interval is expanded by 2

If u < R/2 then (bottom half)
  Output 0 followed by m 1s; m = 0
  Message interval is expanded by 2

If l ≥ R/4 and u < 3R/4 then (middle half)
  Increment m
  Message interval is expanded by 2

In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
String = ACCBACCACBA B,  k = 2

Order-0 context:
  (empty): A = 4, B = 2, C = 5, $ = 3

Order-1 contexts:
  A:  C = 3, $ = 1
  B:  A = 2, $ = 1
  C:  A = 1, B = 2, C = 2, $ = 3

Order-2 contexts:
  AC: B = 1, C = 2, $ = 2
  BA: C = 1, $ = 1
  CA: C = 1, $ = 1
  CB: A = 2, $ = 1
  CC: A = 1, B = 1, $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb
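A compact sketch of LZ78 coding and decoding (Python; a plain dict from strings to ids stands in for the trie-based dictionary of the slides):

def lz78_encode(text):
    D, out, cur = {}, [], ''
    for c in text:
        if cur + c in D:                     # keep extending the longest match
            cur += c
        else:
            out.append((D.get(cur, 0), c))   # (id of longest match, next char)
            D[cur + c] = len(D) + 1          # add the new phrase to the dictionary
            cur = ''
    if cur:                                  # flush a possible final match (no next char)
        out.append((D[cur], ''))
    return out

def lz78_decode(codes):
    D, out = {0: ''}, []
    for i, c in codes:
        phrase = D[i] + c
        out.append(phrase)
        D[len(D)] = phrase
    return ''.join(out)

codes = lz78_encode("aabaacabcabcb")
print(codes)              # [(0,'a'), (1,'b'), (1,'a'), (0,'c'), (2,'c'), (5,'b')], as in the slide
print(lz78_decode(codes)) # aabaacabcabcb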

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
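A minimal sketch of the forward and inverse transform (Python; the forward step naively sorts all rotations, i.e. the "elegant but inefficient" approach of a later slide, and the inverse follows InvertBWT via the LF mapping):

def bwt(T, end='#'):
    T += end
    rots = sorted(T[i:] + T[:i] for i in range(len(T)))   # all rotations, sorted
    return ''.join(r[-1] for r in rots)                    # last column L

def ibwt(L):
    # LF mapping: the k-th occurrence of a char in L corresponds to its
    # k-th occurrence in F = sorted(L); Python's stable sort preserves the ranks.
    order = sorted(range(len(L)), key=lambda i: L[i])
    LF = [0] * len(L)
    for f_pos, l_pos in enumerate(order):
        LF[l_pos] = f_pos
    out, r = [], 0                     # row 0 is the rotation starting with '#'
    for _ in range(len(L) - 1):        # rebuild T backwards: L[r] precedes F[r] in T
        out.append(L[r])
        r = LF[r]
    return ''.join(reversed(out))

L = bwt("mississippi")
print(L)          # ipssm#pissii
print(ibwt(L))    # mississippi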

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution

Altavista crawl (1999) and WebBase crawl (2001): the indegree follows a power-law distribution

  Pr[ in-degree(u) = k ]  ∝  1 / k^α,   α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/x^α, α ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 million pages, 150 million links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77-scheme provides an efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
            Emacs size   Emacs time
 uncompr    27Mb         ---
 gzip       8Mb          35 secs
 zdelta     1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: a weighted graph over the files plus a dummy node 0; edge weights are zdelta sizes, and the dummy edges carry the gzip sizes]

            space    time
 uncompr    30Mb     ---
 tgz        20%      linear
 THIS       8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
            space    time
 uncompr    260Mb    ---
 tgz        12%      2 mins
 THIS       8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

            gcc size   emacs size
 total      27288      27326
 gzip       7563       8577
 zdelta     227        1431
 rsync      964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
T# = mississippi#

[Figure: the suffix tree of T#; edges are labeled with substrings (e.g. “i”, “ssi”, “ppi#”, “p”, “mississippi#”), and each leaf stores the starting position of its suffix]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
5

Q(N2) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
⇒ In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]
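A compact sketch of suffix-array construction and pattern search (Python; construction is the naive explicit sort of all suffixes, fine only for small texts, and the search is the binary search of these slides via bisect; positions are 1-based as in the slides):

from bisect import bisect_left, bisect_right

def suffix_array(T):
    # Naive construction: sort the suffixes explicitly.
    return sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])

def occurrences(T, SA, P):
    # All suffixes prefixed by P form a contiguous range of SA (Prop 1 above).
    suffixes = [T[i - 1:] for i in SA]
    lo = bisect_left(suffixes, P)
    hi = bisect_right(suffixes, P + chr(0x10FFFF))   # upper bound: anything starting with P
    return sorted(SA[lo:hi])

T = "mississippi#"
SA = suffix_array(T)
print(SA)                          # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(occurrences(T, SA, "si"))    # [4, 7]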

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does it exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does it exist a substring of length ≥ L occurring ≥ C times ?
• Search for Lcp[i,i+C-2] whose entries are ≥ L


Slide 179

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log₂ N levels

If the run-size is larger than B (i.e. after the first step!!),
fetching all of it in memory for merging does not help

[figure: the recursion tree of binary Merge-Sort over the input keys]

How do we deploy the disk/mem features?

N/M runs, each sorted in internal memory (no I/Os)
— I/O-cost for merging is ≈ 2 (N/B) log₂ (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X ≤ M/B runs  ⇒  log_{M/B} (N/M) passes

[figure: X input runs on disk, streamed through main-memory buffers of B items each and merged into one output run written back to disk]

Multiway Merging
[figure: multiway merging: each of the X = M/B runs is read through a buffer Bf_i of B items with a pointer p_i; at every step the minimum min(Bf1[p1], Bf2[p2], …, BfX[pX]) is appended to the output buffer Bfo; the next page of run i is fetched when p_i = B, and Bfo is flushed to the merged output run when full, until EOF]
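A minimal in-memory sketch of this merging step in Python (hypothetical run lists instead of disk pages; the heap plays the role of the min(Bf1[p1], …, BfX[pX]) selection):

import heapq

def multiway_merge(runs):
    # runs: list of already-sorted lists; the heap holds (value, run index, position)
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        val, i, j = heapq.heappop(heap)      # the minimum among the current run heads
        out.append(val)
        if j + 1 < len(runs[i]):             # "fetch" the next element of run i
            heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
    return out

print(multiway_merge([[1, 5, 13, 19], [7, 9], [4, 15], [3, 8, 12, 17], [6, 11]]))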

Cost of Multi-way Merge-Sort


Number of passes = log_{M/B} #runs  ≈  log_{M/B} (N/M)

Optimal cost = Θ((N/B) log_{M/B} (N/M)) I/Os

In practice

M/B ≈ 1000  ⇒  #passes = log_{M/B} (N/M) ≈ 1
One multiway merge  ⇒  2 passes = few mins
Tuning depends on disk features

• Large fan-out (M/B) decreases #passes
• Compression would decrease the cost of a pass!

May compression help?


Goal: enlarge M and reduce N



#passes = O(log_{M/B} (N/M))
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how far we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) { X=s; C=1; } }

Return X;
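A runnable version of this majority-vote scan (a Python sketch; the initialization of X and C, left implicit on the slide, is made explicit here):

def majority_candidate(stream):
    X, C = None, 0
    for s in stream:
        if C == 0:              # adopt a new candidate when the counter is exhausted
            X, C = s, 1
        elif X == s:
            C += 1
        else:
            C -= 1
    return X                    # the majority item, provided one occurs > N/2 times

A = "bacccdcbaaaccbccc"
print(majority_candidate(A))    # 'c'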

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 10^9   size = 6Gb
n = 10^6 documents
TotT = 10^9 (avg term length is 6 chars)
t = 5 * 10^5 distinct terms

What kind of data structure do we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million documents, t = 500K terms        (1 if play contains word, 0 otherwise)

            Antony and  Julius  The      Hamlet  Othello  Macbeth
            Cleopatra   Caesar  Tempest
Antony         1          1       0        0       0        1
Brutus         1          1       0        1       0        0
Caesar         1          1       0        1       1        1
Calpurnia      0          1       0        0       0        0
Cleopatra      1          0       0        0       0        0
mercy          1          0       1        1       1        1
worser         1          0       1        1       1        0

Space is 500Gb !

Solution 2: Inverted index
Brutus     →  2  4  8  16  32  64  128
Calpurnia  →  1  2  3  5  8  13  21  34
Caesar     →  13  16

We can still do better:
i.e. 30÷50% of the original text

1. Typically use about 12 bytes
2. We have 10^9 total terms  ⇒  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2^n but we have fewer compressed msg…

Σ_{i=1,…,n−1} 2^i  =  2^n − 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H(S) = Σ_{s∈S} p(s) · log₂ (1/p(s))   bits
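As a tiny illustration (a sketch, not from the slides), the entropy can be computed directly from the definition above:

from math import log2

def entropy(probs):
    # H(S) = sum over the symbols of p(s) * log2(1/p(s)), ignoring zero-probability symbols
    return sum(p * log2(1.0 / p) for p in probs if p > 0)

print(entropy([0.1, 0.2, 0.2, 0.5]))   # ≈ 1.76 bits per symbol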

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La(C) = Σ_{s∈S} p(s) · L[s]

We say that a prefix code C is optimal if for
all prefix codes C’, La(C) ≤ La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, there exists a prefix
code with the same symbol lengths and thus the
same optimal average length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then p_i < p_j  ⇒  L[s_i] ≥ L[s_j]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

[figure: Huffman tree: merge a(.1)+b(.2) → (.3), then (.3)+c(.2) → (.5), then (.5)+d(.5) → (1)]

a=000, b=001, c=01, d=1
There are 2^(n−1) “equivalent” Huffman trees

What about ties (and thus, tree depth) ?
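A compact sketch of the construction in Python (ties on equal weights are broken by an arbitrary counter, which is exactly the point of the question above; names are mine, not library code):

import heapq
from itertools import count

def huffman_lengths(freqs):
    # freqs: dict symbol -> weight; returns dict symbol -> codeword length
    tiebreak = count()
    heap = [(w, next(tiebreak), {s: 0}) for s, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)     # merge the two least-frequent trees
        w2, _, d2 = heapq.heappop(heap)
        merged = {s: l + 1 for s, l in {**d1, **d2}.items()}   # every leaf goes one level deeper
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]

print(huffman_lengths({'a': .1, 'b': .2, 'c': .2, 'd': .5}))
# lengths a:3, b:3, c:2, d:1 (up to dict ordering), matching a=000, b=001, c=01, d=1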

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
abc...  →  00000101
101001…  →  dcb

[figure: the Huffman tree of the running example, traversed root-to-leaf for encoding and bit-by-bit for decoding]

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h² + |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
−log₂(.999) ≈ .00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:

Model takes |S|^k · (k · log |S|) + h² bits   (where h might be |S|)

It is H0(S^L) ≤ L · Hk(S) + O(k · log |S|), for each k ≤ L

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
[figure: tagged word-based Huffman: a tree of fan-out 128 whose leaves are the words of T; each codeword is a sequence of 7-bit symbols packed into bytes whose first bit is a tag; e.g. the word “bzip” gets the byte-aligned codeword 1a 0b, and C(T) is the concatenation of the codewords of the words and separators of T = “bzip or not bzip”]

CGrep and other ideas...
P = bzip = 1a 0b

[figure: GREP on the compressed text: the byte-aligned, tagged codeword of P is searched directly in C(T) for T = “bzip or not bzip”, with a yes/no outcome at each byte-aligned position]
Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary = { bzip, not, or, space }

P = bzip = 1a 0b

S = “bzip or not bzip”

[figure: C(S) scanned against the tagged byte-aligned codeword of P; each byte-aligned comparison answers yes/no]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
[figure: text T and pattern P, with P slid along T]

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bit-operations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errs, whose running
time is efficient with high probability.

We will consider a binary alphabet (i.e., T ∈ {0,1}^n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H(s) = Σ_{i=1,…,m} 2^{m−i} · s[i]

P = 0101
H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5
s = s’ if and only if H(s) = H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H(T_r) = 2·H(T_{r−1}) − 2^m · T(r−1) + T(r+m−1)

T = 10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H(T1) = H(1011) = 11
H(T2) = H(0110) = 2·11 − 2⁴·1 + 0 = 22 − 16 + 0 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
  1·2 (mod 7) + 0 = 2
  2·2 (mod 7) + 1 = 5
  5·2 (mod 7) + 1 = 4
  4·2 (mod 7) + 1 = 2
  2·2 (mod 7) + 1 = 5
  5 (mod 7) = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(Tr−1):
  2^m (mod q) = 2·(2^{m−1} (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
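A minimal sketch of the fingerprint scan in Python (binary text as on the slides, a fixed toy prime q = 7, and an explicit verification so that no false match is ever reported; names are mine):

def karp_rabin(T, P, q=7):
    n, m = len(T), len(P)
    if m > n: return []
    two_m = pow(2, m, q)                            # 2^m (mod q), used to drop the leftmost bit
    hp = ht = 0
    for i in range(m):                              # fingerprints of P and of T[1..m]
        hp = (2 * hp + int(P[i])) % q
        ht = (2 * ht + int(T[i])) % q
    occ = []
    for r in range(n - m + 1):
        if hp == ht and T[r:r+m] == P:              # verify, to rule out false matches
            occ.append(r + 1)                       # 1-based positions, as on the slides
        if r + m < n:                               # roll: remove T[r], append T[r+m]
            ht = (2 * ht - two_m * int(T[r]) + int(T[r + m])) % q
    return occ

print(karp_rabin("10110101", "0101"))               # [5]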

Problem 1: Solution
Dictionary = { bzip, not, or, space }

P = bzip = 1a 0b

S = “bzip or not bzip”

[figure: C(S) scanned for the codeword of P at each byte-aligned position, answering yes/no]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

[figure: the m×n matrix M for P = for and T = california; all entries are 0 except M(1,5), M(2,6) and M(3,7): “f”, “fo” and “for” end at positions 5, 6 and 7 of T]

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the (j−1)-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
U(a) = (1,0,1,1,0)ᵀ     U(b) = (0,1,0,0,0)ᵀ     U(c) = (0,0,0,0,1)ᵀ

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M(j) = BitShift( M(j−1) ) & U( T[j] )

For i > 1, entry M(i,j) = 1 iff
(1) the first i−1 characters of P match the i−1 characters of T
    ending at character j−1      ⇔  M(i−1, j−1) = 1
(2) P[i] = T[j]                  ⇔  the i-th bit of U(T[j]) = 1

BitShift moves bit M(i−1, j−1) into the i-th position;
AND this with the i-th bit of U(T[j]) to establish whether both are true
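A sketch of the whole Shift-And scan in Python, using an unbounded integer as the bit column (bit i−1 of the word stands for row i of M); just the recurrence above made runnable, with names of my own choosing:

def shift_and(T, P):
    m = len(P)
    U = {}                                   # U[c] has bit i-1 set iff P[i] = c
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M, occ = 0, []
    for j, c in enumerate(T, start=1):
        # M(j) = BitShift(M(j-1)) & U(T[j]); BitShift = shift down by one, first bit set to 1
        M = ((M << 1) | 1) & U.get(c, 0)
        if M & (1 << (m - 1)):               # row m set: P ends at position j
            occ.append(j)
    return occ

print(shift_and("xabxabaaca", "abaac"))      # [9]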

An example j=1
P = abaac      T = xabxabaaca

M(1) = BitShift(M(0)) & U(T[1]) = (1,0,0,0,0)ᵀ & U(x) = (0,0,0,0,0)ᵀ
(T[1] = x does not occur in P)

An example j=2
M(2) = BitShift(M(1)) & U(T[2]) = (1,0,0,0,0)ᵀ & U(a) = (1,0,0,0,0)ᵀ
(M(1,2) = 1: the prefix “a” ends at position 2)

An example j=3
M(3) = BitShift(M(2)) & U(T[3]) = (1,1,0,0,0)ᵀ & U(b) = (0,1,0,0,0)ᵀ
(M(2,3) = 1: the prefix “ab” ends at position 3)

An example j=9
M(9) = BitShift(M(8)) & U(T[9]) = (1,1,0,0,1)ᵀ & U(c) = (0,0,0,0,1)ᵀ
(M(5,9) = 1: the whole pattern abaac ends at position 9, an occurrence!)

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: Another solution

Dictionary = { bzip, not, or, space }

P = bzip = 1a 0b

S = “bzip or not bzip”

[figure: C(S) scanned for the codeword of P, answering yes/no at each byte-aligned position]

Speed ≈ Compression ratio

Problem 2
Dictionary = { bzip, not, or, space }

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

P = o

S = “bzip or not bzip”

[figure: the dictionary terms containing P = o, i.e. “or” and “not”, located in C(S)]

not = 1g 0g 0a
or  = 1g 0a 0b

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[figure: text T with occurrences of patterns P1 and P2 marked]

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of length m:
  R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of the Shift-And method searching for S:
  For any symbol c, U’(c) = U(c) AND R
    U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
  For any step j,
    compute M(j)
    then M(j) = M(j) OR U’(T[j]). Why?
      Set to 1 the first bit of each pattern that starts with T[j]
    Check if there are occurrences ending in j. How?

Problem 3
Dictionary = { bzip, not, or, space }

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

S = “bzip or not bzip”
P = bot    k = 2

[figure: C(S) and the dictionary terms, to be matched against P allowing up to k mismatches]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
there are no more than l mismatches between the
first i characters of P and the i characters of T
ending at character j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

[figure: the first i−1 characters of P aligned, with ≤ l mismatches, against T ending at j−1, and P[i] = T[j]]

BitShift( M^l(j−1) ) & U( T[j] )

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

[figure: the first i−1 characters of P aligned, with ≤ l−1 mismatches, against T ending at j−1; position i of P absorbs one more mismatch]

BitShift( M^{l−1}(j−1) )

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))
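A sketch of the k-mismatch scan in Python, directly implementing the recurrence above with one bit column per error level (the names and the bit encoding are mine, not the slides'):

def shift_and_mismatch(T, P, k):
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M = [0] * (k + 1)                                   # M[l] = current column of M^l
    occ = []
    for j, c in enumerate(T, start=1):
        prev = M[:]                                     # columns at position j-1
        for l in range(k + 1):
            exact = ((prev[l] << 1) | 1) & U.get(c, 0)            # case 1: extend with a match
            if l == 0:
                M[l] = exact
            else:
                M[l] = exact | ((prev[l - 1] << 1) | 1)           # case 2: spend one more mismatch
        if M[k] & (1 << (m - 1)):
            occ.append(j)                               # P ends at j with <= k mismatches
    return occ

print(shift_and_mismatch("xabxabaaca", "abaad", 1))     # [9]: abaac matches abaad with one mismatch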

Example M1
T = xabxabaaca      P = abaad

[figure: the 5×10 matrices M0 and M1, column by column; M1(5,9) = 1, i.e. abaad occurs ending at position 9 with one mismatch]

How much do we pay?





The running time is O(kn(1 + m/w)).
Again, the method is practically efficient for
small m.
Only O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary = { bzip, not, or, space }

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

S = “bzip or not bzip”
P = bot    k = 2

[figure: the Agrep scan of C(S); the dictionary terms within k mismatches of P are reported]

not = 1g 0g 0a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum number of operations
needed to transform p into s via three ops:

Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding

γ(x), for x > 0:  (Length − 1) zeros, followed by x in binary,
where Length = ⌊log₂ x⌋ + 1

e.g., 9 is represented as <000, 1001>.

γ-code for x takes 2⌊log₂ x⌋ + 1 bits

(i.e. a factor of 2 from optimal)



Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…
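A sketch of γ-encoding and decoding in Python (bit strings represented as plain strings for readability; this just mirrors the definition above, it is not library code, and the decoder assumes a well-formed input):

def gamma_encode(x):
    assert x > 0
    b = bin(x)[2:]                        # x in binary; Length = len(b) = floor(log2 x) + 1
    return "0" * (len(b) - 1) + b         # (Length - 1) zeros, then x in binary

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":             # count the leading zeros = Length - 1
            z += 1; i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out

print(gamma_encode(9))                                       # 0001001
print(gamma_decode("0001000001100110000011101100111"))       # [8, 6, 3, 59, 7]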


Given the following sequence of γ-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

→  8   6   3   59   7

Analysis
Sort the p_i in decreasing order, and encode s_i via
the variable-length code γ(i).

Recall that: |γ(i)| ≤ 2·log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2·H0(s) + 1
Key fact:
1 ≥ Σ_{i=1,…,x} p_i ≥ x · p_x   ⇒   x ≤ 1/p_x

How good is it ?
Encode the integers via γ-coding:
|γ(i)| ≤ 2·log i + 1
The cost of the encoding is (recall i ≤ 1/p_i):

Σ_{i=1,…,|S|} p_i · |γ(i)|  ≤  Σ_{i=1,…,|S|} p_i · [ 2·log(1/p_i) + 1 ]  =  2·H0(X) + 1

Not much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s·c with 2 bytes, s·c² on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 128² = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 256 − s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n² log n), MTF = O(n log n) + n²

Not much worse than Huffman
...but it may be far better
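A direct sketch of the MTF transform in Python (positions are 0-based; the list L is kept as a plain list, so each step costs O(|S|); the efficient search-tree version is the one discussed in the “MTF: higher compression” slide below):

def mtf_encode(text, alphabet):
    L = list(alphabet)                  # the MTF list, e.g. [i, m, p, s]
    out = []
    for s in text:
        i = L.index(s)                  # 1) output the position of s in L
        out.append(i)
        L.pop(i); L.insert(0, s)        # 2) move s to the front of L
    return out

print(mtf_encode("mississippi", "imps"))   # [1, 1, 3, 0, 1, 1, 0, 1, 3, 0, 1]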

MTF: how good is it ?
Encode the integers via γ-coding:
|γ(i)| ≤ 2·log i + 1
Put S in the front and consider the cost of encoding:

O(|S| log |S|)  +  Σ_{x=1,…,|S|}  Σ_{i=2,…,n_x}  γ( p_i^x − p_{i−1}^x )

By Jensen’s inequality:

≤  O(|S| log |S|)  +  Σ_{x=1,…,|S|}  n_x · [ 2·log(N/n_x) + 1 ]
=  O(|S| log |S|)  +  N · [ 2·H0(X) + 1 ]

La[mtf]  ≤  2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)

There is a memory
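A sketch of run-length encoding in Python, emitting (symbol, run-length) pairs as in the abbbaacccca example above:

from itertools import groupby

def rle(s):
    # collapse each maximal run of equal symbols into a (symbol, length) pair
    return [(c, len(list(g))) for c, g in groupby(s)]

print(rle("abbbaacccca"))   # [('a', 1), ('b', 3), ('a', 2), ('c', 4), ('a', 1)]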

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.
  a = .2  →  [0.0, 0.2)
  b = .5  →  [0.2, 0.7)
  c = .3  →  [0.7, 1.0)

f(i) = Σ_{j=1,…,i−1} p(j)

f(a) = .0, f(b) = .2, f(c) = .7

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
[figure: nested intervals: start from [0,1); after b the interval is [0.2, 0.7); after a it becomes [0.2, 0.3); after c it becomes [0.27, 0.3)]

The final sequence interval is [.27, .3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l 0  0 l i  l i -1  s i -1 * f c i 

s0  1

s i  s i -1 * p c i 

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

n

sn 

 p c 
i

i 1

The interval for a message sequence will be called the
sequence interval
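A sketch of the interval computation in Python floating point (fine for a toy message like bac; a real coder uses the integer version of the later slides). Here p and f are the plain and cumulative probabilities defined above:

def sequence_interval(msg, p, f):
    l, s = 0.0, 1.0                      # l_0 = 0, s_0 = 1
    for c in msg:
        l = l + s * f[c]                 # l_i = l_{i-1} + s_{i-1} * f(c_i)
        s = s * p[c]                     # s_i = s_{i-1} * p(c_i)
    return l, s                          # the sequence interval is [l, l+s)

p = {'a': .2, 'b': .5, 'c': .3}
f = {'a': .0, 'b': .2, 'c': .7}
l, s = sequence_interval("bac", p, f)
print(l, l + s)                          # ≈ 0.27 and 0.30, i.e. the interval [.27, .3)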

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
[figure: .49 falls in b’s interval [0.2, 0.7); then in b’s sub-interval [0.3, 0.55); then in c’s sub-interval [0.475, 0.55)]

The message is bbc.

Representing a real number
Binary fractional representation:
  .75   = .11
  1/3   = .010101…
  11/16 = .1011

Algorithm
  1. x = 2·x
  2. If x < 1 output 0
  3. else x = x − 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0, .33) → .01     [.33, .66) → .1     [.66, 1) → .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
          min       max       interval
  .11     .110…     .111…     [.75, 1.0)
  .101    .1010…    .1011…    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

[figure: sequence interval [.61, .79) containing the code interval of .101, i.e. [.625, .75)]

Can use L + s/2 truncated to 1 + ⌈log₂ (1/s)⌉ bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most

1 + ⌈log₂ (1/s)⌉
  = 1 + ⌈log₂ Π_{i=1,…,n} (1/p_i)⌉
  ≤ 2 + Σ_{i=1,…,n} log₂ (1/p_i)
  = 2 + Σ_{k=1,…,|S|} n·p_k · log₂ (1/p_k)
  = 2 + n·H0   bits

nH0 + 0.02·n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R = 2^k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

[figure: the Arithmetic ToolBox as a state machine: given the current interval (L,s), a symbol c and its distribution (p1, …, p_|S|), ATB outputs the refined interval (L’, s’)]

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
[figure: PPM feeds the ATB with p[ s | context ], where s is either a character c or the escape symbol esc; ATB maps (L,s) to (L’,s’)]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
String = ACCBACCACBA B        k = 2

Context Empty:   A = 4   B = 2   C = 5   $ = 3

Context A:    C = 3   $ = 1
Context B:    A = 2   $ = 1
Context C:    A = 1   B = 2   C = 2   $ = 3

Context AC:   B = 1   C = 2   $ = 2
Context BA:   C = 1   $ = 1
Context CA:   C = 1   $ = 1
Context CB:   A = 2   $ = 1
Context CC:   A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
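A sketch of the full LZ77 decoder in Python; the byte-by-byte copy handles the overlapping case exactly as in the loop above (triple format and names are mine):

def lz77_decode(triples):
    out = []
    for d, length, c in triples:
        start = len(out) - d                   # the copy starts d positions back
        for i in range(length):                # works even when length > d (self-overlap)
            out.append(out[start + i])
        out.append(c)                          # then the explicit next character
    return "".join(out)

print(lz77_decode([(0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), (3, 3, 'a'), (1, 2, 'c')]))
# aacaacabcabaaac, the text of the windowed example above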

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
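A sketch of the LZW encoder in Python (the single-character codes a=112, b=113, c=114 are the toy values of the slide, not real ASCII):

def lzw_encode(text):
    dictionary = {'a': 112, 'b': 113, 'c': 114}   # toy initial codes, as on the slide
    next_code = 256
    S, out = "", []
    for c in text:
        if S + c in dictionary:
            S += c                                # extend the current match
        else:
            out.append(dictionary[S])             # emit the longest match S ...
            dictionary[S + c] = next_code         # ... and add Sc to the dictionary
            next_code += 1
            S = c
    out.append(dictionary[S])
    return out

print(lzw_encode("aabaacababacb"))
# [112, 112, 113, 256, 114, 257, 261, 114, 113], with 256=aa, 257=ab, 258=ba, 259=aac, ...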

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
We are given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

(1994)

F                 L
#  mississipp   i
i  #mississip   p
i  ppi#missis   s
i  ssippi#mis   s
i  ssissippi#   m
m  ississippi   #
p  i#mississi   p
p  pi#mississ   i
s  ippi#missi   s
s  issippi#mi   s
s  sippi#miss   i
s  sissippi#m   i

T

A famous example

Much
longer...

A useful tool: the L → F mapping

[figure: the same sorted matrix, showing column F (the sorted characters) and column L (the BWT); the middle of each row is unknown to the decoder]

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
[figure: the sorted matrix once more: F = # i i i i m p p s s s s, L = i p s s m # p i s s i i]

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
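A sketch in Python: first compute the LF mapping from L alone (stable counting of characters, which is exactly the “same relative order” property above), then walk it backward as in InvertBWT. Here the walk starts from the row whose last character is the sentinel '#' (that row is T itself), so the output is exactly T:

def compute_LF(L):
    first, count = {}, {}
    for i, c in enumerate(sorted(L)):
        first.setdefault(c, i)                 # position in F of the first occurrence of c
    LF = []
    for c in L:
        LF.append(first[c] + count.get(c, 0))  # equal chars keep their relative order
        count[c] = count.get(c, 0) + 1
    return LF

def invert_bwt(L):
    LF, n = compute_LF(L), len(L)
    T = [None] * n
    r = L.index('#')                           # the row ending with '#' is T itself
    for i in range(n - 1, -1, -1):             # T[i] = L[r]; r = LF[r]  (backward)
        T[i] = L[r]
        r = LF[r]
    return "".join(T)

print(invert_bwt("ipssm#pissii"))              # mississippi#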

How to compute the BWT ?
SA     BWT matrix row      L
12     #mississipp         i
11     i#mississip         p
 8     ippi#missis         s
 5     issippi#mis         s
 2     ississippi#         m
 1     mississippi         #
10     pi#mississi         p
 9     ppi#mississ         i
 7     sippi#missi         s
 4     sissippi#mi         s
 6     ssippi#miss         i
 3     ssissippi#m         i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...
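The “elegant but inefficient” construction above, as a few lines of Python: sort the suffixes, then read L off as L[i] = T[SA[i]−1], with the wrap-around for SA[i] = 1 (a sketch; comparing suffixes directly is exactly what makes it Θ(n² log n) in the worst case):

def bwt_by_sorting(T):
    n = len(T)
    SA = sorted(range(1, n + 1), key=lambda i: T[i - 1:])   # 1-based suffix starting positions
    L = "".join(T[i - 2] for i in SA)                       # char preceding each suffix; T[-1] wraps to '#'
    return SA, L

SA, L = bwt_by_sorting("mississippi#")
print(SA)   # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(L)    # ipssm#pissii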

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion pages available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node one can go to any other node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node one can go to any other node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph
  V = Routers
  E = communication links

The “cosine” graph (undirected, weighted)
  V = static web pages
  E = semantic distance between pages

Query-Log graph (bipartite, weighted)
  V = queries and URLs
  E = (q,u) if u is a result for q and has been clicked by some user who issued q

Social graph (undirected, unweighted)
  V = users
  E = (x,y) if x knows y (facebook, address book, email, ..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has a hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution

Altavista crawl 1999, WebBase Crawl 2001
Indegree follows a power law distribution:

Pr[ in-degree(u) = k ]  ∝  1 / k^α ,    α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has a hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/x^α, α ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
[figure: adjacency-matrix picture of the web graph]

21 million pages, 150 million links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides an efficient, optimal solution

fknown is the “previously encoded” text: compress the concatenation fknown·fnew starting from fnew

zdelta is one of the best implementations
           Emacs size    Emacs time
uncompr    27Mb          ---
gzip       8Mb           35 secs
zdelta     1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

[figure: Client ↔ (slow link, delta-encoded requests/pages) ↔ Proxy ↔ (fast link) ↔ web; both endpoints keep the reference page]

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[figure: example weighted graph GF with a dummy node, and its minimum branching]

           space    time
uncompr    30Mb     ---
tgz        20%      linear
THIS       8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n² edge calculations (zdelta executions)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F, thus saving
zdelta executions. Nonetheless, strictly n² time

           space    time
uncompr    260Mb    ---
tgz        12%      2 mins
THIS       8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
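Not the rsync implementation, just a sketch of the idea behind the “4-byte rolling hash” mentioned above: an additive checksum over a window that can slide by one byte in O(1) time (an Adler-32-like toy; the modulus and the pair structure are illustrative assumptions of mine):

def rolling_hashes(data, block, mod=65521):
    # hash of a window = (sum of its bytes, sum of its prefix sums), both mod `mod`
    a = sum(data[:block]) % mod
    b = sum((block - i) * data[i] for i in range(block)) % mod
    out = [(a, b)]
    for r in range(block, len(data)):
        a = (a - data[r - block] + data[r]) % mod          # slide the plain sum
        b = (b - block * data[r - block] + a) % mod        # slide the weighted sum
        out.append((a, b))
    return out

print(rolling_hashes(b"abracadabra", 4)[:3])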

Rsync: some experiments

          gcc size    emacs size
total     27288       27326
gzip      7563        8577
zdelta    227         1431
rsync     964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), the client checks them
Server deploys the common fref to compress the new ftar (rsync just compresses it on its own).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
T# = mississippi#

[figure: the suffix tree of T#: edges labelled with substrings such as “ssi”, “si”, “ppi#”, “pi#”, “i#”, “mississippi#”, and 12 leaves labelled with the starting positions of the corresponding suffixes]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.

Θ(N²) space (if SUF(T) is stored explicitly)

SA      SUF(T)
12      #
11      i#
 8      ippi#
 5      issippi#
 2      ississippi#
 1      mississippi#
10      pi#
 9      ppi#
 7      sippi#
 4      sissippi#
 6      ssippi#
 3      ssissippi#

T = mississippi#        P = si

Suffix Array
• SA: Θ(N log₂ N) bits
• Text T: N chars
⇒ In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix cmp

[figure: first binary-search step on SA of T = mississippi# for P = si; P is larger than the middle suffix; 2 accesses per step]

P = si

Searching a pattern
Indirect binary search on SA: O(p) time per suffix cmp

[figure: next binary-search step; P is smaller than the new middle suffix]

P = si

Suffix Array search
• O(log₂ N) binary-search steps
• Each step takes O(p) char cmp
⇒ overall, O(p log₂ N) time

Improvable to O(p + log₂ N) [Manber-Myers, ’90]
and to O(p + log₂ |S|) [Cole et al, ’06]
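A sketch of the indirect binary search in Python (it returns the starting positions of the occurrences of P; the keys list is materialized only for clarity, a real search would compare P against T[SA[mid]…] in place, paying O(p) per comparison as above):

from bisect import bisect_left, bisect_right

def sa_search(T, SA, P):
    # SA holds 1-based suffix starting positions, sorted by the suffixes of T
    keys = [T[i - 1:i - 1 + len(P)] for i in SA]       # prefix of each suffix, length |P|
    lo, hi = bisect_left(keys, P), bisect_right(keys, P)
    return sorted(SA[lo:hi])                           # starting positions of the occurrences

T = "mississippi#"
SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(sa_search(T, SA, "si"))                          # [4, 7]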

Locating the occurrences
occ = 2

T = mississippi#

[figure: the occurrences of P = si are a contiguous range of SA, here the entries pointing to positions 4 (sissippi#) and 7 (sippi#); the range is delimited by two binary searches]

Suffix Array search
• O(p + log₂ N + occ) time

Suffix Trays: O(p + log₂ |S| + occ)    [Cole et al., ’06]
String B-tree    [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays    [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

T = mississippi#

[figure: the Lcp array aligned with SA; e.g. the adjacent suffixes issippi# and ississippi# share a prefix of length 4]
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
• Search for Lcp[i,i+C-2] whose entries are ≥ L


Slide 180

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = log_{M/B} #runs  ≈  log_{M/B} (N/M)

Optimal cost = Θ( (N/B) log_{M/B} (N/M) ) I/Os

In practice

M/B ≈ 1000  ⇒  #passes = log_{M/B} (N/M) ≈ 1
One multiway merge  ⇒  2 passes (R/W) = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

May compression help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X == s) then C++
else { C--; if (C == 0) { X = s; C = 1; } }

Return X;
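A small C++ rendering of this stream scan (the Boyer–Moore majority-vote idea); the variable names X and C follow the slide, and the initialisation of X to the first item is an assumption:

#include <string>
#include <iostream>

// One-pass majority candidate over a stream of symbols.
char majorityCandidate(const std::string& stream) {
    char X = stream.empty() ? 0 : stream[0];
    long long C = 0;
    for (char s : stream) {
        if (X == s) ++C;
        else { --C; if (C == 0) { X = s; C = 1; } }
    }
    return X;   // guaranteed to be the mode only if the mode occurs > N/2 times
}

int main() {
    std::cout << majorityCandidate("bacccdcbaaaccbccc") << "\n";  // prints 'c'
}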

Proof

Problems arise
if the mode occurs ≤ N/2 times

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥ #occ(y).

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 10^9 chars, i.e. size = 6Gb
n = 10^6 documents
TotT = 10^9 term occurrences (avg term length is 6 chars)
t = 5 * 10^5 distinct terms

What kind of data structure should we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million documents (plays),  t = 500K distinct terms

             Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony              1                1             0          0       0        1
Brutus              1                1             0          1       0        0
Caesar              1                1             0          1       1        1
Calpurnia           0                1             0          0       0        0
Cleopatra           1                0             0          0       0        0
mercy               1                0             1          1       1        1
worser              1                0             1          1       1        0

Entry is 1 if the play contains the word, 0 otherwise.

Space is 500Gb !

Solution 2: Inverted index
Brutus     →  2 4 8 16 32 64 128
Calpurnia  →  1 2 3 5 8 13 21 34
Caesar     →  13 16

We can do still better: i.e. 30÷50% of the original text

1. Typically each posting uses about 12 bytes
2. We have 10^9 total term occurrences  ⇒  at least 12Gb space
3. Compressing the 6Gb of documents gets 1.5Gb of data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in fewer bits ?
NO, they are 2^n but we have fewer compressed messages:

∑_{i=1}^{n-1} 2^i  =  2^n - 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i(s) = log_2 (1/p(s)) = - log_2 p(s)

Lower probability  ⇒  higher information

Entropy is the weighted average of i(s):

H(S) = ∑_{s∈S} p(s) * log_2 (1/p(s))    bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
[Figure: binary trie of the code — left edges labelled 0, right edges 1; leaves a (0),
b (100), c (101), d (11).]

Average Length
For a code C with codeword length L[s], the
average length is defined as
L_a(C) = ∑_{s∈S} p(s) * L[s]

We say that a prefix code C is optimal if for
all prefix codes C’, L_a(C) ≤ L_a(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then p_i < p_j  ⇒  L[s_i] ≥ L[s_j]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H(S) ≤ L_a(C)

Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

L_a(C) ≤ H(S) + 1

Shannon code
takes ⌈log_2 1/p⌉ bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
[Figure: Huffman tree construction — merge a(.1) and b(.2) into (.3); merge (.3) and c(.2)
into (.5); merge (.5) and d(.5) into the root (1); left edges labelled 0, right edges 1.]

a=000, b=001, c=01, d=1
There are 2^(n-1) “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
abc...  →  000 001 01  =  00000101
101001...  →  dcb

[Figure: the same Huffman tree — a(.1) and b(.2) under node (.3); (.3) and c(.2) under (.5);
(.5) and d(.5) under the root (1).]
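For concreteness, a compact C++ sketch that builds a Huffman tree with a priority queue and prints one optimal codeword assignment; the probabilities are those of the running example, and the exact codewords depend on how ties are broken (the "ties" question above):

#include <queue>
#include <vector>
#include <string>
#include <iostream>

struct Node { double p; char sym; Node *left = nullptr, *right = nullptr; };

// Repeatedly merge the two least-probable trees (Huffman's greedy rule).
Node* buildHuffman(const std::vector<std::pair<char,double>>& dist) {
    auto cmp = [](Node* a, Node* b){ return a->p > b->p; };
    std::priority_queue<Node*, std::vector<Node*>, decltype(cmp)> pq(cmp);
    for (auto [c, p] : dist) pq.push(new Node{p, c});
    while (pq.size() > 1) {
        Node* a = pq.top(); pq.pop();
        Node* b = pq.top(); pq.pop();
        pq.push(new Node{a->p + b->p, 0, a, b});
    }
    return pq.top();
}

void printCodes(Node* n, const std::string& code) {
    if (!n->left && !n->right) { std::cout << n->sym << " = " << code << "\n"; return; }
    printCodes(n->left,  code + "0");
    printCodes(n->right, code + "1");
}

int main() {
    Node* root = buildHuffman({{'a',.1},{'b',.2},{'c',.2},{'d',.5}});
    printCodes(root, "");   // one optimal assignment, e.g. a=000 b=001 c=01 d=1
}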

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log_2 (.999) ≈ .00144

If we were to send 1000 such symbols we
might hope to use 1000 * .00144 ≈ 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:


Model takes |S|^k * (k * log |S|) + h^2 bits

It is H_0(S_L) ≤ L * H_k(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
[Figure: word-based Huffman vs tagged Huffman over T = “bzip or not bzip”. The symbols of the
tree are the words “bzip”, “or”, “not” and the space; codewords are sequences of 7-bit
configurations, and in the tagged version the extra (first) bit of each byte marks whether
that byte starts a codeword, so codewords are byte-aligned. C(T) is the concatenation of the
codewords of “bzip”, space, “or”, space, “not”, space, “bzip”.]

CGrep and other ideas...
P = bzip = 1a 0b

[Figure: the pattern’s codeword sequence is matched directly against the byte-aligned,
tagged codewords of C(T), T = “bzip or not bzip”; the tag bits mark where a comparison may
start, so each candidate position is answered yes/no without decompressing the text.]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary = { bzip, not, or, space }      P = bzip = 1a 0b

[Figure: the dictionary words and the separator are the symbols of the word-based Huffman
tree; S = “bzip or not bzip” is stored as the concatenation C(S) of their codewords, and P is
searched by comparing its codeword against C(S) at codeword boundaries (yes/no at each try).]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H(s) = ∑_{i=1}^{m} 2^(m-i) * s[i]

P = 0101
H(P) = 2^3*0 + 2^2*1 + 2^1*0 + 2^0*1 = 5

s = s’ if and only if H(s) = H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H(T_r) = 2 * H(T_{r-1}) - 2^m * T(r-1) + T(r+m-1)

T = 10110101
T_1 = 1 0 1 1
T_2 = 0 1 1 0

H(T_1) = H(1011) = 11
H(T_2) = 2*11 - 2^4*1 + 0 = 22 - 16 + 0 = 6 = H(0110)

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P = 101111
q = 7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally (Horner’s rule, mod 7):
(1*2 + 0) mod 7 = 2
(2*2 + 1) mod 7 = 5
(5*2 + 1) mod 7 = 4
(4*2 + 1) mod 7 = 2
(2*2 + 1) mod 7 = 5 = Hq(P)

We can still compute Hq(T_r) from Hq(T_{r-1}):
2^m (mod q) = 2 * (2^{m-1} (mod q))  (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
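A compact C++ sketch of the randomized (probable-match) variant over the slide's binary text; here q is a fixed prime rather than a random one, so the sketch only illustrates the rolling computation of Hq:

#include <string>
#include <vector>
#include <iostream>

// Report all positions r where Hq(T_r) == Hq(P) (probable matches, 0-based).
std::vector<size_t> karpRabin(const std::string& T, const std::string& P) {
    const unsigned long long q = 1000000007ULL;   // fixed prime (assumption: not drawn at random)
    size_t n = T.size(), m = P.size();
    std::vector<size_t> hits;
    if (m == 0 || n < m) return hits;
    unsigned long long hP = 0, hT = 0, pw = 1;    // pw = 2^(m-1) mod q
    for (size_t i = 0; i < m; ++i) {
        hP = (2*hP + (P[i] & 1)) % q;             // binary alphabet: '0'/'1' -> 0/1 via last bit
        hT = (2*hT + (T[i] & 1)) % q;
        if (i) pw = (2*pw) % q;
    }
    for (size_t r = 0; ; ++r) {
        if (hP == hT) hits.push_back(r);          // probable occurrence starting at r
        if (r + m == n) break;
        // slide the window: Hq(T_{r+1}) = 2*(Hq(T_r) - 2^{m-1}*T[r]) + T[r+m]  (mod q)
        hT = (2*(hT + q - (pw * (T[r] & 1)) % q) + (T[r+m] & 1)) % q;
    }
    return hits;
}

int main() {
    for (size_t r : karpRabin("10110101", "0101"))
        std::cout << r << "\n";                   // prints 4 (position 5 in 1-based indexing)
}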

Problem 1: Solution
Dictionary = { bzip, not, or, space }      P = bzip = 1a 0b

[Figure: the compressed text C(S) of S = “bzip or not bzip” is scanned and the codeword of P
is compared at each tagged codeword boundary, answering yes/no; the two “yes” hits are the
two occurrences of “bzip”.]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

          j:  1 2 3 4 5 6 7 8 9 10
          T:  c a l i f o r n i a
   i=1  f     0 0 0 0 1 0 0 0 0 0
   i=2  o     0 0 0 0 0 1 0 0 0 0
   i=3  r     0 0 0 0 0 0 1 0 0 0

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the (j-1)-th one.
We define the m-length binary vector U(x) for each
character x in the alphabet: U(x) is set to 1 for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, the j-th column is obtained by

M(j) = BitShift( M(j-1) )  &  U( T[j] )

For i > 1, entry M(i,j) = 1 iff
(1) the first i-1 characters of P match the i-1 characters of T
    ending at character j-1     ⇔   M(i-1,j-1) = 1
(2) P[i] = T[j]                 ⇔   the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) into the i-th position;
AND-ing this with the i-th bit of U(T[j]) establishes whether both conditions hold.
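A bit-parallel C++ sketch of the construction above for m ≤ w, keeping one 64-bit word per column (bit i-1 of the word plays the role of row i):

#include <cstdint>
#include <string>
#include <vector>
#include <iostream>

// Shift-And exact matching for patterns of length m <= 64.
std::vector<size_t> shiftAnd(const std::string& T, const std::string& P) {
    size_t m = P.size();
    uint64_t U[256] = {0};
    for (size_t i = 0; i < m; ++i) U[(unsigned char)P[i]] |= (1ULL << i);   // the U(x) vectors
    uint64_t M = 0;                                    // column j of the matrix, initially 0
    std::vector<size_t> occ;
    for (size_t j = 0; j < T.size(); ++j) {
        // BitShift(M(j-1)) & U(T[j]): shift, set the first bit to 1, then mask
        M = ((M << 1) | 1ULL) & U[(unsigned char)T[j]];
        if (M & (1ULL << (m - 1)))                     // row m is set: an occurrence ends at j
            occ.push_back(j - m + 1);
    }
    return occ;
}

int main() {
    for (size_t p : shiftAnd("xabxabaaca", "abaac"))
        std::cout << p << "\n";                        // prints 4 (position 5 in 1-based indexing)
}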

An example, j=1:    T = xabxabaaca,   P = abaac

M(1) = BitShift(M(0)) & U(T[1]) = (1,0,0,0,0)ᵀ & U(x) = (1,0,0,0,0)ᵀ & (0,0,0,0,0)ᵀ = (0,0,0,0,0)ᵀ

An example, j=2:    T = xabxabaaca,   P = abaac

M(2) = BitShift(M(1)) & U(T[2]) = (1,0,0,0,0)ᵀ & U(a) = (1,0,0,0,0)ᵀ & (1,0,1,1,0)ᵀ = (1,0,0,0,0)ᵀ

An example, j=3:    T = xabxabaaca,   P = abaac

M(3) = BitShift(M(2)) & U(T[3]) = (1,1,0,0,0)ᵀ & U(b) = (1,1,0,0,0)ᵀ & (0,1,0,0,0)ᵀ = (0,1,0,0,0)ᵀ

An example, j=9:    T = xabxabaaca,   P = abaac

The matrix over columns 1..9 (rows i = 1..5):
   1:  0 1 0 0 1 0 1 1 0
   2:  0 0 1 0 0 1 0 0 0
   3:  0 0 0 0 0 0 1 0 0
   4:  0 0 0 0 0 0 0 1 0
   5:  0 0 0 0 0 0 0 0 1

M(9) = BitShift(M(8)) & U(T[9]) = (1,1,0,0,1)ᵀ & U(c) = (1,1,0,0,1)ᵀ & (0,0,0,0,1)ᵀ = (0,0,0,0,1)ᵀ
The last (5th) bit is set: an occurrence of P ends at position 9.

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: Another solution
Dictionary = { bzip, not, or, space }      P = bzip = 1a 0b

[Figure: the same word-based Huffman dictionary and compressed text C(S) of
S = “bzip or not bzip”; the pattern’s codeword is compared at the tagged codeword
boundaries of C(S), answering yes/no at each candidate position.]

Speed ≈ Compression ratio

Problem 2
Dictionary = { bzip, not, or, space }

Given a pattern P, find
all the occurrences in S
of all terms containing
P as a substring

P = o

[Figure: the dictionary words and the compressed text C(S) of S = “bzip or not bzip”;
the codewords of the matching terms are searched in C(S).]

not = 1g 0g 0a
or  = 1g 0a 0b

Speed ≈ Compression ratio? No! Why?
A scan of C(S) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: a text T over the alphabet {A,B,C,D} with the occurrences of the patterns P1 and P2
highlighted.]

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of length m:
  R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of the Shift-And method searching for S:

For any symbol c, U’(c) = U(c) AND R
  i.e. U’(c)[i] = 1 iff S[i] = c and it is the first symbol of a pattern
For any step j,
  compute M(j)
  then M(j) = M(j) OR U’(T[j]). Why?
  This sets to 1 the first bit of each pattern that starts with T[j]
  Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

Dictionary = { bzip, not, or, space }      P = bot,  k = 2

[Figure: the word-based Huffman dictionary and the compressed text C(S) of
S = “bzip or not bzip”.]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
there are no more than l mismatches between the
first i characters of P and the i characters of T
ending at character j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

[Figure: P[1..i-1] aligned with T[..j-1] with at most l mismatches (marked *), extended by
the matching pair P[i] = T[j].]

BitShift( M^l(j-1) )  &  U( T[j] )

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

[Figure: P[1..i-1] aligned with T[..j-1] with at most l-1 mismatches (marked *); position i
of P is charged as the l-th mismatch.]

BitShift( M^(l-1)(j-1) )

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1
T = xabxabaaca,   P = abaad

M1 =      1 2 3 4 5 6 7 8 9 10
     1    1 1 1 1 1 1 1 1 1 1
     2    0 0 1 0 0 1 0 1 1 0
     3    0 0 0 1 0 0 1 0 0 1
     4    0 0 0 0 1 0 0 1 0 0
     5    0 0 0 0 0 0 0 0 1 0

M0 =      1 2 3 4 5 6 7 8 9 10
     1    0 1 0 0 1 0 1 1 0 1
     2    0 0 1 0 0 1 0 0 0 0
     3    0 0 0 0 0 0 1 0 0 0
     4    0 0 0 0 0 0 0 1 0 0
     5    0 0 0 0 0 0 0 0 0 0

How much do we pay?





The running time is O( k*n*(1 + m/w) ).
Again, the method is practically efficient for
small m.
Still, only O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary = { bzip, not, or, space }      P = bot,  k = 2

Given a pattern P, find
all the occurrences in S
of all terms containing
P as a substring allowing
k mismatches

[Figure: each dictionary term’s codeword is searched in C(S) = C(“bzip or not bzip”) with
Shift-And allowing k = 2 mismatches; the term “not” (= 1g 0g 0a) matches P = bot.]

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding

γ(x) = (Length - 1) zeroes, followed by x in binary

x > 0 and Length = ⌊log_2 x⌋ + 1
e.g., 9 is represented as <000,1001>.

γ-code for x takes 2⌊log_2 x⌋ + 1 bits

(i.e. a factor of 2 from optimal)

Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

→  8, 6, 3, 59, 7
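A small C++ sketch of γ-encoding and decoding over a string of '0'/'1' characters (a toy bitstream abstraction assumed here just for illustration):

#include <string>
#include <vector>
#include <iostream>

// gamma(x): (Length-1) zeroes followed by x in binary, Length = floor(log2 x) + 1.
std::string gammaEncode(unsigned x) {
    std::string bin;
    for (unsigned v = x; v > 0; v >>= 1) bin = char('0' + (v & 1)) + bin;
    return std::string(bin.size() - 1, '0') + bin;
}

// Decode a well-formed concatenation of gamma codes.
std::vector<unsigned> gammaDecode(const std::string& bits) {
    std::vector<unsigned> out;
    size_t i = 0;
    while (i < bits.size()) {
        size_t z = 0;
        while (bits[i] == '0') { ++z; ++i; }       // count the leading zeroes
        unsigned x = 0;
        for (size_t k = 0; k <= z; ++k, ++i)       // read z+1 binary digits
            x = 2*x + (bits[i] - '0');
        out.push_back(x);
    }
    return out;
}

int main() {
    std::cout << gammaEncode(9) << "\n";           // 0001001
    for (unsigned v : gammaDecode("0001000001100110000011101100111"))
        std::cout << v << " ";                     // 8 6 3 59 7
}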

Analysis
Sort the p_i in decreasing order, and encode s_i via
the variable-length code γ(i).

Recall that: |γ(i)| ≤ 2 log_2 i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2 * H_0(S) + 1
Key fact:
1 ≥ ∑_{i=1,...,x} p_i ≥ x * p_x   ⇒   x ≤ 1/p_x

How good is it ?
Encode the integers via γ-coding:
|γ(i)| ≤ 2 log_2 i + 1
The cost of the encoding is (recall i ≤ 1/p_i):

∑_{i=1,...,|S|} p_i * |γ(i)|  ≤  ∑_{i=1,...,|S|} p_i * [ 2 log_2 (1/p_i) + 1 ]  =  2 H_0(X) + 1

Not much worse than Huffman,
and improvable to H_0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n^2 log n),  MTF = O(n log n) + n^2

Not much worse than Huffman
...but it may be far better
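A short C++ sketch of the MTF transform on bytes (the list is kept as a plain vector, so each step costs O(|Σ|); the search-tree/hash-table variants described below bring this down to O(log |Σ|) per symbol):

#include <vector>
#include <string>
#include <iostream>
#include <algorithm>

// Move-to-Front: output the current rank of each symbol, then move it to the front.
std::vector<int> mtfEncode(const std::string& s) {
    std::vector<unsigned char> list(256);
    for (int i = 0; i < 256; ++i) list[i] = (unsigned char)i;   // initial symbol list
    std::vector<int> out;
    for (unsigned char c : s) {
        int pos = std::find(list.begin(), list.end(), c) - list.begin();
        out.push_back(pos);
        list.erase(list.begin() + pos);
        list.insert(list.begin(), c);                           // move c to the front
    }
    return out;
}

int main() {
    for (int r : mtfEncode("aaabbbaaa")) std::cout << r << " ";
    // 97 0 0 98 0 0 1 0 0  -- runs of equal symbols become runs of zeroes
}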

MTF: how good is it ?
Encode the integers via γ-coding:
|γ(i)| ≤ 2 log_2 i + 1
Put S in front and consider the cost of encoding:

O(|S| log |S|)  +  ∑_{x=1}^{|S|} ∑_{i=2}^{n_x} |γ( p_{x,i} - p_{x,i-1} )|

By Jensen’s inequality:

≤  O(|S| log |S|)  +  ∑_{x=1}^{|S|} n_x * [ 2 log_2 (N / n_x) + 1 ]
=  O(|S| log |S|)  +  N * [ 2 H_0(X) + 1 ]

L_a[mtf]  ≤  2 H_0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1^n 2^n 3^n … n^n  ⇒

There is a memory

Huff(X) = Θ(n^2 log n)  >  Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.   p(a) = .2,   p(b) = .5,   p(c) = .3

f(i) = ∑_{j=1}^{i-1} p(j)      ⇒      f(a) = .0,  f(b) = .2,  f(c) = .7

[Figure: the unit interval [0,1) partitioned into a = [0,.2), b = [.2,.7), c = [.7,1).]

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
[Figure: successive interval refinement — start from [0,1); after ‘b’ the interval is
[.2,.7); after ‘a’ it is [.2,.3); after ‘c’ it is [.27,.3).]

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l_0 = 0        l_i = l_{i-1} + s_{i-1} * f(c_i)
s_0 = 1        s_i = s_{i-1} * p(c_i)

f(c) is the cumulative prob. up to symbol c (not included)

Final interval size is    s_n = ∏_{i=1}^{n} p(c_i)

The interval for a message sequence will be called the
sequence interval
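The recurrence above, applied in a tiny C++ sketch to the running example (alphabet a, b, c with p = .2, .5, .3), reproduces the sequence interval of "bac":

#include <string>
#include <iostream>

int main() {
    double p[3] = {0.2, 0.5, 0.3};          // p[a], p[b], p[c] as in the example
    double f[3] = {0.0, 0.2, 0.7};          // cumulative probabilities f[c]
    std::string msg = "bac";
    double l = 0.0, s = 1.0;                // l_0 = 0, s_0 = 1
    for (char c : msg) {
        int i = c - 'a';
        l = l + s * f[i];                   // l_i = l_{i-1} + s_{i-1} * f(c_i)
        s = s * p[i];                       // s_i = s_{i-1} * p(c_i)
    }
    std::cout << "[" << l << "," << l + s << ")\n";   // prints [0.27,0.3)
}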

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
[Figure: .49 ∈ [.2,.7) ⇒ ‘b’;   .49 ∈ [.3,.55) ⇒ ‘b’;   .49 ∈ [.475,.55) ⇒ ‘c’.]

The message is bbc.

Representing a real number
Binary fractional representation:
.75 = .11        1/3 = .010101...        11/16 = .1011

Algorithm
1.  x = 2 * x
2.  If x < 1 output 0
3.  else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval?
e.g.  [0,.33) → .01      [.33,.66) → .1      [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
          min      max      interval
.11       .110     .111     [.75, 1.0)
.101      .1010    .1011    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

[Figure: the sequence interval [.61,.79) contains the code interval of .101 = [.625,.75).]

Can use L + s/2 truncated to 1 + ⌈log_2 (1/s)⌉ bits

Bound on Arithmetic length

Note that  -log_2 s + 1 = log_2 (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most

1 + ⌈log_2 (1/s)⌉  =  1 + ⌈log_2 ∏_i (1/p_i)⌉
                   ≤  2 + ∑_{i=1,n} log_2 (1/p_i)
                   =  2 + ∑_{k=1,|S|} n p_k log_2 (1/p_k)
                   =  2 + n H_0   bits

nH_0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts           String = ACCBACCACBA B,   k = 2

Context (empty):    A = 4    B = 2    C = 5    $ = 3

Context (order 1):
  A:   C = 3    $ = 1
  B:   A = 2    $ = 1
  C:   A = 1    B = 2    C = 2    $ = 3

Context (order 2):
  AC:  B = 1    C = 2    $ = 2
  BA:  C = 1    $ = 1
  CA:  C = 1    $ = 1
  CB:  A = 2    $ = 1
  CC:  A = 1    B = 1    $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb
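A compact C++ sketch of the LZ78 coding loop above; the dictionary is kept as a map from (parent id, next char) to phrase id, with id 0 denoting the empty string:

#include <map>
#include <string>
#include <vector>
#include <iostream>

// LZ78: emit (id of longest dictionary match, next char) and add the extended phrase.
std::vector<std::pair<int,char>> lz78Encode(const std::string& s) {
    std::map<std::pair<int,char>, int> dict;   // (parent id, char) -> id of "parent + char"
    std::vector<std::pair<int,char>> out;
    int nextId = 1, cur = 0;                   // cur = id of the match built so far
    for (size_t i = 0; i < s.size(); ++i) {
        auto it = dict.find({cur, s[i]});
        if (it != dict.end() && i + 1 < s.size()) { cur = it->second; continue; }
        // no extension possible (or end of input): emit and insert the new phrase
        out.push_back({cur, s[i]});
        dict[{cur, s[i]}] = nextId++;
        cur = 0;
    }
    return out;
}

int main() {
    for (auto [id, c] : lz78Encode("aabaacabcabcb"))
        std::cout << "(" << id << "," << c << ") ";
    // (0,a) (1,b) (1,a) (0,c) (2,c) (5,b)  -- the coding example above
}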

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
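A naive C++ sketch of both directions: L is built by sorting all rotations (the Θ(n² log n) "elegant but inefficient" way discussed below) and inverted with the LF mapping; the text is assumed to end with a unique smallest sentinel '#':

#include <algorithm>
#include <string>
#include <vector>
#include <iostream>

// Forward BWT: sort all rotations of T and take the last column.
std::string bwt(const std::string& T) {
    size_t n = T.size();
    std::vector<size_t> rot(n);
    for (size_t i = 0; i < n; ++i) rot[i] = i;
    std::sort(rot.begin(), rot.end(), [&](size_t a, size_t b) {
        return T.substr(a) + T.substr(0, a) < T.substr(b) + T.substr(0, b);
    });
    std::string L(n, ' ');
    for (size_t i = 0; i < n; ++i) L[i] = T[(rot[i] + n - 1) % n];   // last column
    return L;
}

// Inverse BWT via the LF mapping: L[i] precedes F[i] in T.
std::string inverseBwt(const std::string& L, char sentinel = '#') {
    size_t n = L.size();
    std::vector<size_t> count(256, 0), first(256, 0), seen(256, 0), LF(n);
    for (char c : L) ++count[(unsigned char)c];
    for (int c = 1; c < 256; ++c) first[c] = first[c-1] + count[c-1]; // start of c in F
    for (size_t i = 0; i < n; ++i)
        LF[i] = first[(unsigned char)L[i]] + seen[(unsigned char)L[i]]++;
    std::string T(n, ' ');
    size_t r = std::find(L.begin(), L.end(), sentinel) - L.begin();   // row whose rotation is T
    for (size_t i = n; i > 0; --i) { T[i-1] = L[r]; r = LF[r]; }      // reconstruct backward
    return T;
}

int main() {
    std::string L = bwt("mississippi#");
    std::cout << L << "\n" << inverseBwt(L) << "\n";   // ipssm#pissii / mississippi#
}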

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Q(n2 log n) time in the worst-case
• Q(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in-degree(u) = k ]  ∝  1/k^α ,    α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/x^α, α ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries: code x ≥ 0 as 2x and x < 0 as 2|x|-1 (cf. the residual examples below)
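A tiny C++ sketch of the gap transformation above (a simplified sketch of the idea only, not the full WebGraph layout with copy-lists and intervals; the successor list in the example is made up, and the non-negative remapping of the first gap is the 2x / 2|x|-1 trick assumed from the residual examples below):

#include <vector>
#include <iostream>

// Map a possibly-negative value to a non-negative code: v >= 0 -> 2v, v < 0 -> 2|v|-1.
long long toNonNegative(long long v) { return v >= 0 ? 2*v : 2*(-v) - 1; }

// S(x) = {s1, s2, ..., sk} sorted  ->  gaps {s1-x, s2-s1-1, ..., sk-s_{k-1}-1}.
std::vector<long long> encodeGaps(long long x, const std::vector<long long>& succ) {
    std::vector<long long> out;
    for (size_t i = 0; i < succ.size(); ++i)
        out.push_back(i == 0 ? toNonNegative(succ[0] - x)   // only the first gap can be negative
                             : succ[i] - succ[i-1] - 1);
    return out;
}

int main() {
    // node 15 with (hypothetical) successors 13, 15, 16, 17, 18, 19, 23, 24, 203
    for (long long g : encodeGaps(15, {13,15,16,17,18,19,23,24,203})) std::cout << g << " ";
    // prints 3 1 0 0 0 0 3 0 178 -- small numbers, well suited to gamma/variable-byte coding
}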

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides and efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
           Emacs size    Emacs time
uncompr      27Mb           ---
gzip          8Mb          35 secs
zdelta      1.5Mb          42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: weighted graph over the files plus a dummy node; edge weights are zdelta sizes
(e.g. 0, 620, 2, 20, 123, 2000, 220, …) and the min branching picks the cheapest reference
for each file.]

            space      time
uncompr     30Mb        ---
tgz         20%        linear
THIS         8%        quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
            space      time
uncompr     260Mb       ---
tgz          12%       2 mins
THIS          8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

          gcc size    emacs size
total       27288       27326
gzip         7563        8577
zdelta        227        1431
rsync         964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Figure: suffix tree of T# = mississippi# — the root branches on #, i, m, p, s; internal
edges carry labels such as “ssi”, “si”, “ppi#”, “pi#”, “i#”; the 12 leaves store the
starting positions of the suffixes.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
5

Q(N2) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Q(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time
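A naive C++ sketch of this indirected binary search (the suffix array is built by plainly sorting suffix indices, which is enough for small texts; the O(p) cost per comparison is visible in the compare calls):

#include <algorithm>
#include <string>
#include <vector>
#include <iostream>

// Build SA by sorting suffix starting positions (naive: O(n^2 log n) worst case).
std::vector<int> buildSA(const std::string& T) {
    std::vector<int> sa(T.size());
    for (size_t i = 0; i < T.size(); ++i) sa[i] = (int)i;
    std::sort(sa.begin(), sa.end(), [&](int a, int b) {
        return T.compare(a, std::string::npos, T, b, std::string::npos) < 0;
    });
    return sa;
}

// SA range [lo,hi) of suffixes having P as a prefix = all occurrences of P in T.
std::pair<int,int> saRange(const std::string& T, const std::vector<int>& sa, const std::string& P) {
    auto lo = std::lower_bound(sa.begin(), sa.end(), P,
        [&](int suf, const std::string& pat) { return T.compare(suf, pat.size(), pat) < 0; });
    auto hi = std::upper_bound(lo, sa.end(), P,
        [&](const std::string& pat, int suf) { return T.compare(suf, pat.size(), pat) > 0; });
    return { int(lo - sa.begin()), int(hi - sa.begin()) };
}

int main() {
    std::string T = "mississippi#";
    std::vector<int> sa = buildSA(T);
    auto [lo, hi] = saRange(T, sa, "si");
    for (int i = lo; i < hi; ++i) std::cout << sa[i] << " ";   // prints 6 3 (positions 7 and 4, 1-based)
}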

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]

Locating the occurrences
SA = 12 11 8 5 2 1 10 9 7 4 6 3          T = mississippi#

occ = 2: the contiguous SA range for P = si contains the suffixes sippi… and sissippi…,
i.e. the occurrences at positions 7 and 4.

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does it exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does it exist a substring of length ≥ L occurring ≥ C times ?
• Search for Lcp[i,i+C-2] whose entries are ≥ L


Slide 181

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (the alphabet S is large).
Math Problem: Find the item y whose frequency is > N/2, using the smallest space
(i.e. assuming the mode occurs > N/2 times).

A=b a c c c d c b a a a c c b c c c
.

Algorithm

  Use a pair of variables <X, C>
  For each item s of the stream:
    if (X == s) then C++
    else { C--; if (C == 0) { X = s; C = 1; } }
  Return X;

Proof
(Problems arise only if the mode occurs ≤ N/2 times.)

If X ≠ y at the end, then every one of y’s occurrences has a “negative” mate.
Hence these mates are ≥ #occ(y), so N ≥ 2 * #occ(y): this contradicts #occ(y) > N/2.
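A runnable sketch of the streaming majority scan above (my variable names; the candidate check at C == 0 is folded into the loop):

def majority_candidate(stream):
    # Keep one candidate X and a counter C; a true majority item (> N/2
    # occurrences) always survives the cancellations.
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1
        elif X == s:
            C += 1
        else:
            C -= 1
    return X

A = "b a c c c d c b a a a c c b c c c".split()
print(majority_candidate(A))  # 'c' (9 occurrences out of 17 items)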

Toy problem #4: Indexing


Consider the following TREC collection:

  N = 6 * 10^9 → size = 6GB
  n = 10^6 documents
  TotT = 10^9 term occurrences (avg term length is 6 chars)
  t = 5 * 10^5 distinct terms

What kind of data structure should we build to support word-based searches?

Solution 1: Term-Doc matrix          (t = 500K terms, n = 1 million docs)

              Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
  Antony             1                1             0          0       0        1
  Brutus             1                1             0          1       0        0
  Caesar             1                1             0          1       1        1
  Calpurnia          0                1             0          0       0        0
  Cleopatra          1                0             0          0       0        0
  mercy              1                0             1          1       1        1
  worser             1                0             1          1       1        0

  (entry = 1 if the play contains the word, 0 otherwise)

Space is 500GB !

Solution 2: Inverted index

[figure: each dictionary term (Brutus, Calpurnia, Caesar, …) points to its sorted posting list of doc-IDs, e.g. Brutus → 2 4 8 16 32 64 128]

We can still do better: i.e. 30–50% of the original text

1. Typically about 12 bytes are used per posting
2. We have 10^9 total terms → at least 12GB space
3. Compressing the 6GB of documents gets 1.5GB of data
Better index, but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (losslessly) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL of them into fewer bits?
NO: they are 2^n, but there are fewer shorter compressed messages:

  ∑_{i=1}^{n-1} 2^i = 2^n − 2

We need to talk about stochastic sources.

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the self information of s is:

  i(s) = log2 (1/p(s)) = − log2 p(s)

Lower probability → higher information.

Entropy is the weighted average of i(s):

  H(S) = ∑_{s∈S} p(s) · log2 (1/p(s))   bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be uniquely decomposed into its codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g. a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie:

[figure: the codeword trie, with leaves a (0), d (11), b (100), c (101)]

Average Length
For a code C with codeword lengths L[s], the average length is defined as

  La(C) = ∑_{s∈S} p(s) · L[s]

We say that a prefix code C is optimal if for all prefix codes C’, La(C) ≤ La(C’).

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal uniquely decodable code, there exists a prefix
code with the same codeword lengths, and thus the same optimal average length. And vice versa…

Theorem (golden rule). If C is an optimal prefix code for a source with probabilities {p1, …, pn},
then pi < pj ⇒ L[si] ≥ L[sj].

Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any uniquely
decodable code C, we have

  H(S) ≤ La(C)

Theorem (upper bound, Shannon). For any probability distribution, there exists a prefix
code C such that

  La(C) ≤ H(S) + 1

(The Shannon code takes ⌈log2 1/p(s)⌉ bits per symbol.)

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

[figure: Huffman tree construction — merge a(.1) and b(.2) into (.3), merge (.3) and c(.2) into (.5), merge (.5) and d(.5) into (1)]

a = 000, b = 001, c = 01, d = 1
There are 2^(n-1) “equivalent” Huffman trees.

What about ties (and thus, tree depth)?
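A compact sketch of the Huffman construction on this example, using a heap of (probability, subtree) pairs (a generic illustration, not the slides' own code; ties may be resolved differently, giving one of the "equivalent" trees):

import heapq
from itertools import count

def huffman_codes(probs):
    # probs: dict symbol -> probability. Repeatedly merge the two least
    # probable subtrees; the root-to-leaf paths give the codewords.
    tie = count()                      # breaks ties between equal probabilities
    heap = [(p, next(tie), s) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, t1 = heapq.heappop(heap)
        p2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(tie), (t1, t2)))
    codes = {}
    def walk(tree, prefix=""):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"
    walk(heap[0][2])
    return codes

print(huffman_codes({'a': .1, 'b': .2, 'c': .2, 'd': .5}))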

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
[figure: the Huffman tree above, used in both directions]

  encoding: abc…  →  000 001 01 …  =  00000101…
  decoding: 101001…  →  d c b …

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large.
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding: the Canonical Huffman tree.

We store, for every level L:
  firstcode[L]   (in canonical form, the deepest level starts at 00…0)
  Symbol[L,i], for each i in level L

This is ≤ h^2 + |S| log |S| bits (h = height of the codeword tree).

Canonical Huffman: Encoding
[figure: the codeword-length levels 1…5 of the canonical tree]

Canonical Huffman: Decoding
  firstcode[1] = 2
  firstcode[2] = 1
  firstcode[3] = 1
  firstcode[4] = 2
  firstcode[5] = 0

  T = …00010…

Problem with Huffman Coding
Consider a symbol with probability .999. Its self information is

  −log2(.999) ≈ .00144 bits

If we were to send 1000 such symbols we might hope to use 1000 × .00144 ≈ 1.44 bits.

Using Huffman, we take at least 1 bit per symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
  → 1 extra bit per macro-symbol = 1/k extra bits per symbol
  → but a larger model has to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:
  Model takes |S|^k · (k · log |S|) + h^2 bits
  It is H0(S^L) ≤ L · Hk(S) + O(k · log |S|), for each k ≤ L
  (where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
[figure: the word-based Huffman tree with fan-out 128 over the words of T = “bzip or not bzip”; each codeword is a sequence of bytes, with 7 bits of code per byte plus 1 tag bit enforcing byte-alignment, e.g. the codeword of “bzip” is written as 1a 0b]

CGrep and other ideas...

P = bzip → its codeword 1a 0b

[figure: a GREP-like scan of the compressed text C(T), T = “bzip or not bzip”: the pattern’s codeword is compared against the byte-aligned codewords of C(T), marking yes/no at each candidate position]

Speed ≈ Compression ratio

You find it under my Software projects.

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary: {bzip, not, or, space}

P = bzip = 1a 0b

[figure: scan the compressed text C(S), S = “bzip or not bzip”, codeword by codeword, comparing each byte-aligned codeword against the codeword of P; matches are marked yes]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
[figure: the pattern P (e.g. AB) is aligned against every position of the text T (e.g. ABCABDAB…)]
 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching

We show methods in which arithmetic and bit operations replace comparisons.
We will survey two examples of such methods:

  The Random Fingerprint method, due to Karp and Rabin
  The Shift-And method, due to Baeza-Yates and Gonnet

Rabin-Karp Fingerprint

We will use a class of functions from strings to integers in order to obtain:

  An efficient randomized algorithm that makes an error with small probability.
  A randomized algorithm that never errs, whose running time is efficient with high probability.

We will consider a binary alphabet (i.e., T ∈ {0,1}^n).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.

Let s be a string of length m:

  H(s) = ∑_{i=1}^{m} 2^{m−i} · s[i]

  P = 0101
  H(P) = 2^3·0 + 2^2·1 + 2^1·0 + 2^0·1 = 5

s = s’ if and only if H(s) = H(s’).

Definition:
let Tr denote the m-length substring of T starting at position r (i.e., Tr = T[r, r+m−1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr−1):

  H(Tr) = 2·H(Tr−1) − 2^m·T[r−1] + T[r+m−1]

T = 10110101
  T1 = 1011,  T2 = 0110
  H(T1) = H(1011) = 11
  H(T2) = 2·11 − 2^4·1 + 0 = 22 − 16 + 0 = 6 = H(0110)

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P = 101111, q = 7

  H(P) = 47
  Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally:
  (1·2 + 0) mod 7 = 2
  (2·2 + 1) mod 7 = 5
  (5·2 + 1) mod 7 = 4
  (4·2 + 1) mod 7 = 2
  (2·2 + 1) mod 7 = 5  →  Hq(P) = 5

We can still compute Hq(Tr) from Hq(Tr−1), using
  2^m (mod q) = 2·(2^{m−1} (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
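A runnable sketch of the Karp-Rabin scan for a binary text, with the verification step included (an illustration: here the prime q is fixed rather than drawn at random up to I, as the algorithm above prescribes):

def karp_rabin(T, P, q=2**31 - 1):
    # T, P: strings over {0,1}. Slide the window keeping H_q(T_r) incrementally:
    # H_q(T_r) = (2*H_q(T_{r-1}) - 2^m * T[r-1] + T[r+m-1]) mod q.
    n, m = len(T), len(P)
    if m > n:
        return []
    pow_m = pow(2, m, q)                       # 2^m mod q, computed once
    hp = ht = 0
    for i in range(m):                         # fingerprints of P and of T_1
        hp = (2 * hp + int(P[i])) % q
        ht = (2 * ht + int(T[i])) % q
    occ = []
    for r in range(n - m + 1):
        if hp == ht and T[r:r + m] == P:       # verify, to rule out false matches
            occ.append(r + 1)                  # 1-based position, as in the slides
        if r + m < n:
            ht = (2 * ht - pow_m * int(T[r]) + int(T[r + m])) % q
    return occ

print(karp_rabin("10110101", "0101"))  # [5]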

Problem 1: Solution
Dictionary: {bzip, not, or, space}

P = bzip = 1a 0b

[figure: the scan of C(S), S = “bzip or not bzip”, proceeds codeword by codeword thanks to the byte-aligned tags, comparing each codeword with that of P and marking the two matches]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

[figure: the m×n matrix M for T = california, P = for; e.g. M(1,5) = 1 (f = T[5]), M(2,6) = 1 (fo = T[5..6]), and M(3,7) = 1 marks the occurrence of P ending at position 7]

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M

We want to exploit bit-parallelism to compute the j-th column of M from the (j−1)-th one.
We define the m-length binary vector U(x) for each character x in the alphabet:
U(x) is set to 1 at the positions in P where character x appears.

Example: P = abaac

  U(a) = (1,0,1,1,0)   U(b) = (0,1,0,0,0)   U(c) = (0,0,0,0,1)

How to construct M

Initialize column 0 of M to all zeros.
For j > 0, the j-th column is obtained by

  M(j) = BitShift(M(j−1)) & U(T[j])

For i > 1, entry M(i,j) = 1 iff
  (1) the first i−1 characters of P match the i−1 characters of T ending at character j−1  ⇔  M(i−1,j−1) = 1
  (2) P[i] = T[j]  ⇔  the i-th bit of U(T[j]) = 1

BitShift moves bit M(i−1,j−1) into the i-th position;
ANDing it with the i-th bit of U(T[j]) establishes whether both conditions hold.

An example (j = 1, 2, 3, …, 9)
T = xabxabaaca, P = abaac

[figure: the columns M(1), M(2), M(3), …, M(9) are computed via M(j) = BitShift(M(j−1)) & U(T[j]); at j = 9 the 5th bit of M(9) is 1, i.e. P = abaac occurs in T ending at position 9]
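A short sketch of the Shift-And scan, packing each column of M into one Python integer (bit i−1 of the word stands for row i); a generic illustration of the method, not the slides' code:

def shift_and(T, P):
    # U[c] has bit i-1 set iff P[i] == c; bit m-1 set in the column means
    # "the first m chars of P match T ending here", i.e. an occurrence.
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    occ, col, goal = [], 0, 1 << (m - 1)
    for j, c in enumerate(T, start=1):
        # BitShift: shift the previous column down by one and set its first bit to 1
        col = ((col << 1) | 1) & U.get(c, 0)
        if col & goal:
            occ.append(j - m + 1)      # starting position (1-based)
    return occ

print(shift_and("xabxabaaca", "abaac"))  # [5], i.e. the occurrence ending at position 9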

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions

We want to allow the pattern to contain special symbols, like [a-f] classes of chars.

P = [a-b]baac

  U(a) = (1,0,1,1,0)   U(b) = (1,1,0,0,0)   U(c) = (0,0,0,0,1)

What about ‘?’, ‘[^…]’ (not)?

Problem 1: Another solution
Dictionary: {bzip, not, or, space}

P = bzip = 1a 0b

[figure: the scan of C(S), S = “bzip or not bzip”, again compares P’s codeword against the byte-aligned codewords of C(S), marking yes/no at each candidate position]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

Dictionary: {bzip, not, or, space},  P = o

[figure: both dictionary terms “not” (codeword 1g 0g 0a) and “or” (codeword 1g 0a 0b) contain P = o, so the scan of C(S), S = “bzip or not bzip”, must look for both codewords]
Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[figure: the text T is scanned once while looking for all the patterns P1, P2, … simultaneously]
 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

S is the concatenation of the patterns in P.
R is a bitmap of length m: R[i] = 1 iff S[i] is the first symbol of a pattern.

Use a variant of the Shift-And method searching for S:

  For any symbol c, U’(c) = U(c) AND R
    → U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
  For any step j:
    compute M(j), then M(j) OR U’(T[j]). Why?
    → it sets to 1 the first bit of each pattern that starts with T[j]
    then check whether there are occurrences ending in j. How?

Problem 3
Dictionary: {bzip, not, or, space}

Given a pattern P, find all the occurrences in S of all terms containing
P as a substring, allowing at most k mismatches.

P = bot, k = 2

[figure: the compressed scan of C(S), S = “bzip or not bzip”; e.g. the term “not” matches P = bot with 1 mismatch]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal: given k, find all the occurrences of P in T with up to k mismatches.
We define the matrix M^l to be an m by n binary matrix, such that:

  M^l(i,j) = 1 iff there are no more than l mismatches between the first i
  characters of P and the i characters of T ending at character j.

What is M^0?
How does M^k solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing M^l: case 1

The first i−1 characters of P match a substring of T ending at j−1 with at most l mismatches,
and the next pair of characters in P and T are equal:

  BitShift(M^l(j−1)) & U(T[j])

Computing M^l: case 2

The first i−1 characters of P match a substring of T ending at j−1 with at most l−1 mismatches:

  BitShift(M^{l−1}(j−1))

Computing M^l

We compute M^l for all l = 0, …, k; for each j we compute M(j), M^1(j), …, M^k(j).
For all l, initialize M^l(0) to the zero vector.
In order to compute M^l(j), we observe that there is a match iff case 1 or case 2 holds:

  M^l(j) = [ BitShift(M^l(j−1)) & U(T[j]) ]  OR  BitShift(M^{l−1}(j−1))
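A sketch of this k-mismatch recurrence on top of the Shift-And scan shown earlier (columns packed into integers; an illustration under the same assumptions):

def shift_and_k_mismatch(T, P, k):
    # M[l] is the current column of matrix M^l, packed into an integer.
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    goal = 1 << (m - 1)
    M = [0] * (k + 1)
    occ = []
    for j, c in enumerate(T, start=1):
        prev = M[:]                                   # columns at position j-1
        M[0] = ((prev[0] << 1) | 1) & U.get(c, 0)     # exact-match column
        for l in range(1, k + 1):
            # case 1: extend an l-mismatch prefix with an equal character
            # case 2: extend an (l-1)-mismatch prefix (the new character may mismatch)
            M[l] = (((prev[l] << 1) | 1) & U.get(c, 0)) | ((prev[l - 1] << 1) | 1)
        if M[k] & goal:
            occ.append(j - m + 1)
    return occ

print(shift_and_k_mismatch("aatatccacaa", "atcgaa", 2))  # [4]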

Example M^1
T = xabxabaaca, P = abaad

[figure: the 5×10 matrices M^0 and M^1; e.g. M^1(5,9) = 1 because P = abaad matches T[5..9] = abaac with one mismatch]

How much do we pay?

The running time is O(k·n·(1 + m/w)).
Again, the method is practically efficient for small m.
Only O(k) columns of M are needed at any given time; hence the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary: {bzip, not, or, space}

Given a pattern P, find all the occurrences in S of all terms containing
P as a substring, allowing k mismatches.

P = bot, k = 2

[figure: the compressed scan of C(S), S = “bzip or not bzip”, marks the codeword of “not” (1g 0g 0a) as a match of P]

Agrep: more sophisticated operations

The Shift-And method can solve other ops.

The edit distance between two strings p and s is d(p,s) = the minimum number of
operations needed to transform p into s via three ops:
  Insertion: insert a symbol in p
  Deletion: delete a symbol from p
  Substitution: change a symbol of p into a different one

Example: d(ananas, banane) = 3

Search by regular expressions
  Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding

  γ(x) = 000…0 (Length−1 zeros) followed by x in binary,
  with x > 0 and Length = ⌊log2 x⌋ + 1
  e.g., 9 is represented as <000, 1001>.

The γ-code for x takes 2⌊log2 x⌋ + 1 bits (i.e. a factor of 2 from optimal).

Optimal for Pr(x) = 1/(2x²), and i.i.d. integers.

It is a prefix-free encoding…
Given the following sequence of γ-coded integers, reconstruct the original sequence:

  0001000001100110000011101100111
  = 0001000 | 00110 | 011 | 00000111011 | 00111
  = 8, 6, 3, 59, 7
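A small encoder/decoder for the γ-code, as a sanity check of the example above (a generic illustration):

def gamma_encode(x):
    # Length-1 zeros, then x in binary (x > 0).
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":          # count the unary length prefix
            zeros += 1
            i += 1
        out.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return out

code = "".join(gamma_encode(x) for x in (8, 6, 3, 59, 7))
print(code)                  # 0001000001100110000011101100111
print(gamma_decode(code))    # [8, 6, 3, 59, 7]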

Analysis
Sort the p_i in decreasing order, and encode symbol s_i via the variable-length code γ(i).

Recall that |γ(i)| ≤ 2·log i + 1.
How good is this approach w.r.t. Huffman?
Compression ratio ≤ 2·H0(S) + 1.

Key fact:  1 ≥ ∑_{i=1,…,x} p_i ≥ x · p_x  ⇒  x ≤ 1/p_x

How good is it?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1.
The cost of the encoding is (recall i ≤ 1/p_i):

  ∑_{i=1,…,|S|} p_i · |γ(i)|  ≤  ∑_{i=1,…,|S|} p_i · [2·log(1/p_i) + 1]  =  2·H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2.

A new concept: Continuers vs Stoppers
  Previously we used: s = c = 128

The main idea is:
  s + c = 256 (we are playing with 8 bits)
  Thus s items are encoded with 1 byte,
  s·c with 2 bytes, s·c² with 3 bytes, ...

An example
  5000 distinct words
  ETDC encodes 128 + 128² = 16512 words on at most 2 bytes;
  the (230,26)-dense code encodes 230 + 230·26 = 6210 words on at most 2 bytes,
  hence more words fit on 1 byte — thus, if the distribution is skewed, it compresses better...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128 − s.

  Brute-force approach
  Binary search: on real distributions there seems to be a unique minimum

  K_s = max codeword length
  F_s^k = cumulative probability of the symbols whose |cw| ≤ k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer sequence, which can then be var-length coded.

  Start with the list of symbols L = [a,b,c,d,…]
  For each input symbol s:
    1) output the position of s in L
    2) move s to the front of L

There is a memory. Properties:
  It exploits temporal locality, and it is dynamic.
  X = 1^n 2^n 3^n … n^n  →  Huff = O(n² log n),  MTF = O(n log n) + n² bits

Not much worse than Huffman... but it may be far better.
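A tiny sketch of the MTF transform (positions are output 1-based here; an illustration, not the slides' code):

def mtf_encode(text, alphabet):
    # Output the 1-based position of each symbol in the list, then move it to the front.
    L = list(alphabet)
    out = []
    for s in text:
        pos = L.index(s)
        out.append(pos + 1)
        L.pop(pos)
        L.insert(0, s)
    return out

print(mtf_encode("abcddcba", "abcd"))  # [1, 2, 3, 4, 1, 2, 3, 4]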

MTF: how good is it?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1.
Put S in front and consider the cost of encoding (n_x = #occurrences of symbol x,
p_i^x = position of its i-th occurrence):

  O(|S| log |S|)  +  ∑_{x=1,…,|S|} ∑_{i=2,…,n_x} γ(p_i^x − p_{i−1}^x)

By Jensen’s inequality:

  ≤  O(|S| log |S|)  +  ∑_{x=1,…,|S|} n_x · [2·log(N/n_x) + 1]
  =  O(|S| log |S|)  +  N · [2·H0(X) + 1]

Hence  La[mtf] ≤ 2·H0(X) + O(1).

MTF: higher compression
To achieve higher compression we consider words (and separators) as the symbols to be encoded.

How to keep the MTF-list efficiently:
  Search tree
    leaves contain the symbols, ordered as in the MTF-list
    internal nodes contain the size of their descending subtree
  Hash Table
    key is a symbol
    data is a pointer to the corresponding tree leaf

Each tree operation takes O(log |S|) time.
Total is O(n log |S|), where n = #symbols to be compressed.

Run Length Encoding (RLE)
If spatial locality is very high, then
  abbbaacccca  ⇒  (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings → just the run lengths and one starting bit.

There is a memory. Properties:
  It exploits spatial locality, and it is a dynamic code.
  X = 1^n 2^n 3^n … n^n  →  Huff(X) = Θ(n² log n)  >  RLE(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol an interval range from 0 (inclusive) to 1 (exclusive):

  f(i) = ∑_{j=1}^{i−1} p(j)

e.g. with p(a) = .2, p(b) = .5, p(c) = .3:
  f(a) = .0, f(b) = .2, f(c) = .7
  a → [0, .2),  b → [.2, .7),  c → [.7, 1.0)

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7)).

Arithmetic Coding: Encoding Example
Coding the message sequence: bac   (with p(a) = .2, p(b) = .5, p(c) = .3)

[figure: the interval [0,1) is refined symbol by symbol — b → [.2,.7), then a → [.2,.3), then c → [.27,.3)]

The final sequence interval is [.27, .3).

Arithmetic Coding
To code a sequence of symbols c_1 … c_n with probabilities p[c], use:

  l_0 = 0,   l_i = l_{i−1} + s_{i−1} · f[c_i]
  s_0 = 1,   s_i = s_{i−1} · p[c_i]

f[c] is the cumulative probability up to symbol c (not included).
The final interval size is

  s_n = ∏_{i=1}^{n} p[c_i]

The interval for a message sequence will be called the sequence interval.
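A sketch of the interval computation above, using plain floats only for illustration (a real coder uses the integer renormalization described later):

def sequence_interval(msg, p):
    # p: dict symbol -> probability; f[c] = cumulative probability of the symbols before c.
    f, acc = {}, 0.0
    for c in p:
        f[c] = acc
        acc += p[c]
    l, s = 0.0, 1.0
    for c in msg:
        l = l + s * f[c]          # l_i = l_{i-1} + s_{i-1} * f[c_i]
        s = s * p[c]              # s_i = s_{i-1} * p[c_i]
    return l, l + s

print(sequence_interval("bac", {'a': .2, 'b': .5, 'c': .3}))  # ≈ (0.27, 0.30)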

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3 (p(a) = .2, p(b) = .5, p(c) = .3):

[figure: .49 ∈ [.2,.7) → b;  then .49 ∈ [.3,.55) → b;  then .49 ∈ [.475,.55) → c]

The message is bbc.

Representing a real number
Binary fractional representation:

  .75 = .11     1/3 = .0101…     11/16 = .1011

Algorithm:
  1. x = 2·x
  2. if x < 1 output 0
  3. else x = x − 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
  e.g. [0,.33) → .01     [.33,.66) → .1     [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions:

  code    min      max       interval
  .11     .110…    .111…     [.75, 1.0)
  .101    .1010…   .1011…    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval is
contained in the sequence interval (a dyadic number).

[figure: the sequence interval [.61, .79) contains the code interval [.625, .75) of .101]

Can use L + s/2 truncated to 1 + ⌈log2 (1/s)⌉ bits.

Bound on Arithmetic length

Note that −log2 s + 1 = log2 (2/s).

Theorem: For a text of length n, the Arithmetic encoder generates at most

  1 + ⌈log2 (1/s)⌉ = 1 + ⌈log2 ∏_i (1/p_i)⌉
                   ≤ 2 + ∑_{i=1,…,n} log2 (1/p_i)
                   = 2 + ∑_{k=1,…,|S|} n·p_k·log2 (1/p_k)
                   = 2 + n·H0   bits

In practice ≈ nH0 + 0.02·n bits, because of rounding.

Integer Arithmetic Coding
The problem is that operations on arbitrary-precision real numbers are expensive.
Key ideas of the integer version:
  Keep integers in the range [0, R) where R = 2^k
  Use rounding to generate the integer interval
  Whenever the sequence interval falls into the top, bottom or middle half,
  expand the interval by a factor of 2

Integer Arithmetic is an approximation.

Integer Arithmetic (scaling)
  If l ≥ R/2 (top half): output 1 followed by m 0s; m = 0; the interval is expanded by 2
  If u < R/2 (bottom half): output 0 followed by m 1s; m = 0; the interval is expanded by 2
  If l ≥ R/4 and u < 3R/4 (middle half): increment m; the interval is expanded by 2
  In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine:

[figure: the ATB maps the current interval (L,s) and the next symbol c, drawn from the distribution (p1,…,p|S|), to the new interval (L’,s’) ⊆ [L, L+s)]

Therefore, even the distribution can change over time.

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox

[figure: at each step the ATB is fed the conditional distribution p[s | context]; the encoded symbol s is either a real character c or the escape esc, and the interval (L,s) is refined into (L’,s’)]

Encoder and Decoder must know the protocol for selecting the same conditional
probability distribution (PPM-variant).

PPM: Example Contexts          (String = ACCBACCACBA B,  k = 2)

  Context Empty:   A = 4   B = 2   C = 5   $ = 3

  Context A:    C = 3   $ = 1
  Context B:    A = 2   $ = 1
  Context C:    A = 1   B = 2   C = 2   $ = 3

  Context AC:   B = 1   C = 2   $ = 2
  Context BA:   C = 1   $ = 1
  Context CA:   C = 1   $ = 1
  Context CB:   A = 2   $ = 1
  Context CC:   A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c

[figure: the dictionary is the text to the left of the cursor (all substrings starting there); the next output triple in the example is <2,3,c>]

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character
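A compact sketch of the LZ77 parser with a sliding window (greedy longest match; a generic illustration, not gzip's tuned code), reproducing the example above:

def lz77_encode(T, W=6):
    # Emit triples (d, len, c): copy `len` chars found at distance d inside the
    # window of the last W positions, then append the next literal c.
    i, out = 0, []
    while i < len(T):
        best_len, best_d = 0, 0
        for d in range(1, min(i, W) + 1):
            l = 0
            # the copy may overlap the cursor, so compare char by char
            while i + l < len(T) - 1 and T[i + l - d] == T[i + l]:
                l += 1
            if l > best_len:
                best_len, best_d = l, d
        out.append((best_d, best_len, T[i + best_len]))
        i += best_len + 1
    return out

print(lz77_encode("aacaacabcabaaac"))
# [(0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), (3, 3, 'a'), (1, 2, 'c')]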

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later
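A minimal LZW encoder matching the example above; the dictionary is seeded only with the symbols that occur, and the code 112 for 'a' follows the slides' toy numbering (an assumption of this sketch, not real ASCII):

def lzw_encode(T):
    D = {'a': 112, 'b': 113, 'c': 114}   # toy seed dictionary, as in the slides
    next_code = 256
    out, S = [], ""
    for c in T:
        if S + c in D:                   # extend the current dictionary match
            S += c
        else:
            out.append(D[S])             # LZW emits only the code, not the extra char
            D[S + c] = next_code
            next_code += 1
            S = c
    out.append(D[S])                     # flush the pending match
    return out

print(lzw_encode("aabaacababacb"))
# [112, 112, 113, 256, 114, 257, 261, 114, 113]  (the final 113 flushes the trailing 'b')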

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform                                  (1994)
Let us be given a text T = mississippi#

Take all the cyclic rotations of T (mississippi#, ississippi#m, ssissippi#mi, …)
and sort the rows:

  F                  L
  #  mississipp      i
  i  #mississip      p
  i  ppi#missis      s
  i  ssippi#mis      s
  i  ssissippi#      m
  m  ississippi      #
  p  i#mississi      p
  p  pi#mississ      i
  s  ippi#missi      s
  s  issippi#mi      s
  s  sippi#miss      i
  s  sissippi#m      i

F = first column,  L = last column = BWT(T) = ipssm#pissii

A famous example: a much longer text...

A useful tool: L → F mapping

[figure: the same sorted matrix, with L = ipssm#pissii and F = #iiiimppssss]

How do we map L’s chars onto F’s chars?
… we need to distinguish equal chars in F…

Take two equal chars of L: rotating their rows rightward by one position shows
that they keep the same relative order in F.

The BWT is invertible

[figure: the sorted matrix again, with its columns F and L]

Two key properties:
  1. The LF-array maps L’s chars to F’s chars
  2. L[i] precedes F[i] in T

Reconstruct T backward:   T = .... i ppi #

  InvertBWT(L)
    Compute LF[0,n-1];
    r = 0; i = n;
    while (i > 0) {
      T[i] = L[r];
      r = LF[r]; i--;
    }
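A small sketch of both directions (the forward transform sorts all rotations explicitly, so it is the "elegant but inefficient" version shown below; the inversion follows the LF walk above, with the sentinel rotated back to the end):

def bwt(T):
    # T must end with a unique smallest sentinel, e.g. '#'.
    rotations = sorted(T[i:] + T[:i] for i in range(len(T)))
    return "".join(row[-1] for row in rotations)

def inverse_bwt(L):
    n = len(L)
    # LF[i] = row of F holding the same character occurrence as L[i]:
    # equal characters keep their relative order between L and F.
    F_rows = sorted(range(n), key=lambda i: (L[i], i))   # stable by construction
    LF = [0] * n
    for f_row, l_row in enumerate(F_rows):
        LF[l_row] = f_row
    # L[i] precedes F[i] in T, so following LF from row 0 (whose first char is
    # the sentinel '#') spells T backwards, with the sentinel emitted last.
    out, r = [], 0
    for _ in range(n):
        out.append(L[r])
        r = LF[r]
    out.reverse()
    return "".join(out[1:] + out[:1])   # rotate: the sentinel goes back to the end

print(bwt("mississippi#"))          # ipssm#pissii
print(inverse_bwt("ipssm#pissii"))  # mississippi#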

How to compute the BWT ?

[figure: the BWT matrix side by side with the suffix array SA = 12 11 8 5 2 1 10 9 7 4 6 3 of T = mississippi#]

We said that L[i] precedes F[i] in T; e.g. L[3] = T[7].
Given SA and T, we have L[i] = T[SA[i] − 1].

How to construct SA from T ?

Input: T = mississippi#

  SA      suffix
  12      #
  11      i#
   8      ippi#
   5      issippi#
   2      ississippi#
   1      mississippi#
  10      pi#
   9      ppi#
   7      sippi#
   4      sissippi#
   6      ssippi#
   3      ssissippi#

Elegant but inefficient. Obvious inefficiencies:
  • Θ(n² log n) time in the worst case
  • Θ(n log n) cache misses or I/O faults
Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus γ(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size
  1 trillion pages available (Google, 7/08)
  5–40KB per page ⇒ hundreds of terabytes
  Size grows every day!!

Change
  8% new pages, 25% new links change weekly
  Lifetime of about 10 days

The Bow Tie

Some definitions

  Weakly connected components (WCC)
    Set of nodes such that from any node you can reach any other node via an undirected path.

  Strongly connected components (SCC)
    Set of nodes such that from any node you can reach any other node via a directed path.

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?

The Web graph is the largest artifact ever conceived by humans.

Exploit the structure of the Web for:
  Crawl strategies
  Search
  Spam detection
  Discovering communities on the web
  Classification/organization

Predict the evolution of the Web
  Sociological understanding

Many other large graphs…

  Physical network graph
    V = Routers
    E = communication links

  The “cosine” graph (undirected, weighted)
    V = static web pages
    E = semantic distance between pages

  Query-Log graph (bipartite, weighted)
    V = queries and URLs
    E = (q,u) if u is a result for q, and has been clicked by some user who issued q

  Social graph (undirected, unweighted)
    V = users
    E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E):
  V = URLs, E = (u,v) if u has a hyperlink to v
  Isolated URLs are ignored (no IN & no OUT)

Three key properties:
  Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution

[figure: in-degree distributions from the Altavista crawl (1999) and the WebBase crawl (2001); the in-degree follows a power-law distribution]

  Pr[in-degree(u) = k]  ∝  1/k^α,   α ≈ 2.1
A Picture of the Web Graph

Definition
Directed graph G = (V,E):
  V = URLs, E = (u,v) if u has a hyperlink to v
  Isolated URLs are ignored (no IN, no OUT)

Three key properties:
  Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1
  Locality: usually most of the hyperlinks point to other URLs on the same host (about 80%).
  Similarity: pages close in lexicographic order tend to share many outgoing lists.

A Picture of the Web Graph

[figure: the adjacency matrix (i,j) of a crawl with 21 million pages and 150 million links]

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph

Adjacency list with compressed gaps (exploits locality):
  the successor list of node x, S(x) = {s1, s2, …, sk}, is stored as the gaps
  {s1−x, s2−s1−1, ..., sk−s(k−1)−1}; a special encoding handles the (possibly
  negative) first entry.

Copy-lists (exploit similarity), with possibly limited reference chains:
  each bit of the copy-list of y tells whether the corresponding successor of the
  reference node x is also a successor of y;
  the reference index is chosen in [0,W] so as to give the best compression.

Copy-blocks = RLE(Copy-list):
  the first copy-block is 0 if the copy-list starts with 0;
  the last block is omitted (we know the length…);
  the length is decremented by one for all blocks.

This is a Java and C++ lib (≈3 bits/edge).
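A toy sketch of the gap encoding of one successor list (only the first, locality-based idea; the folding of the possibly negative first gap into a non-negative integer uses the 2d / 2|d|−1 rule shown in the examples below, and the sample node/list is purely illustrative):

def encode_gaps(x, successors):
    # successors: the sorted out-neighbour list of node x.
    gaps, prev = [], None
    for i, s in enumerate(successors):
        if i == 0:
            d = s - x                                         # may be negative
            gaps.append(2 * d if d >= 0 else 2 * abs(d) - 1)  # fold to non-negative
        else:
            gaps.append(s - prev - 1)                         # increasing list => gap >= 0
        prev = s
    return gaps

print(encode_gaps(15, [13, 16, 17, 18, 23, 203]))  # [3, 2, 0, 0, 4, 179]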

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background

[figure: a sender transmits data over a network link to a receiver, which already has some knowledge about the data]

  network links are getting faster and faster, but
  many clients are still connected by fairly slow links (mobile?)
  people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression                                   (one-to-one)

Problem: We have two files f_known and f_new, and the goal is to compute a file f_d of
minimum size such that f_new can be derived from f_known and f_d.

  Assume that block moves and copies are allowed
  Find an optimal covering set of f_new based on f_known
  The LZ77-scheme provides an efficient, optimal solution:
    f_known is the “previously encoded text”; compress f_known·f_new starting from f_new
  zdelta is one of the best implementations

             Emacs size    Emacs time
  uncompr      27Mb           ---
  gzip          8Mb          35 secs
  zdelta      1.5Mb          42 secs

Efficient Web Access
Dual proxy architecture: a pair of proxies, located on each side of the slow link,
use a proprietary protocol to increase performance over this link.

[figure: Client ↔ client-side proxy ↔ (slow link, delta-encoded pages) ↔ server-side proxy ↔ (fast link) ↔ web]

Use zdelta to reduce traffic:
  the old version of the page (the reference) is available at both proxies
  restricted to pages already visited (30% hits), URL-prefix match
  small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F
  Useful on a dynamic collection of web pages, back-ups, …

Apply pairwise zdelta: find for each f ∈ F a good reference
  Reduction to the Min Branching problem on DAGs:
    build a weighted graph G_F: nodes = files, weights = zdelta-sizes;
    insert a dummy node connected to all files, whose edge weights are the gzip-sizes;
    compute the min branching = directed spanning tree of minimum total cost covering G’s nodes.

[figure: a small example graph with a dummy root 0 and file nodes, with edge weights such as 20, 123, 220, 620, 2000]

              space     time
  uncompr     30Mb       ---
  tgz         20%       linear
  THIS         8%       quadratic

Improvement                         (many-to-one compression of a group of files)

Problem: Constructing G is very costly: n² edge calculations (zdelta executions).
We wish to exploit some pruning approach:

  Collection analysis: cluster the files that appear similar and are thus good
  candidates for zdelta-compression; build a sparse weighted graph G’_F containing
  only the edges between those pairs of files.

  Assign weights: estimate appropriate edge weights for G’_F, thus saving zdelta
  executions. Nonetheless, this still takes n² time.

              space     time
  uncompr     260Mb      ---
  tgz          12%      2 mins
  THIS          8%      16 mins

Algoritmi per IR

File Synchronization

File synch: The problem

[figure: the client sends a request for f_new; the server sends back an update that the client applies to its old copy f_old]

  the client wants to update an out-dated file
  the server has the new file but does not know the old file
  update without sending the entire f_new (using similarity)
  rsync: file synch tool, distributed with Linux

Delta compression is a sort of “local” synch, since the server has both copies of the files.

The rsync algorithm

[figure: the client splits f_old into blocks and sends their hashes to the server; the server matches them against f_new and sends back the encoded file (copy instructions for matched blocks + literals)]

The rsync algorithm (contd)
  simple, widely used, single roundtrip
  optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
  choice of block size problematic (default: max{700, √n} bytes)
  not good in theory: the granularity of changes may disrupt the use of blocks

Rsync: some experiments

            gcc size    emacs size
  total      27288        27326
  gzip        7563         8577
  zdelta       227         1431
  rsync        964         4452

Compressed size in KB (slightly outdated numbers).

Factor 3–5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol
  k blocks of n/k elements each
  log(n/k) levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits.

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol
  k blocks of n/k elements each
  log(n/k) levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits.

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts

Pattern P occurs at position i of T
  iff  P is a prefix of the i-th suffix of T (i.e. T[i,N])

[figure: P aligned on T at position i, covering a prefix of the suffix T[i,N]]

Occurrences of P in T = all suffixes of T having P as a prefix

  P = si,  T = mississippi  →  occurrences at 4, 7

SUF(T) = sorted set of the suffixes of T

Reduction: from substring search to prefix search (over the suffixes).

The Suffix Tree

T# = mississippi#   (positions 1..12)

[figure: the suffix tree of T#; edges are labelled with substrings (e.g. “si”, “ssi”, “ppi#”, “mississippi#”) and the 12 leaves store the starting positions 1..12 of the suffixes]

The Suffix Array
Prop 1. All suffixes in SUF(T) sharing the prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

Storing SUF(T) explicitly takes Θ(N²) space; store the suffix pointers instead:

  SA      SUF(T)                       T = mississippi#
  12      #
  11      i#
   8      ippi#
   5      issippi#
   2      ississippi#
   1      mississippi#
  10      pi#
   9      ppi#
   7      sippi#
   4      sissippi#
   6      ssippi#
   3      ssissippi#

Suffix Array:
  • SA: Θ(N log2 N) bits
  • Text T: N chars
  → in practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step.

[figure: binary search for P = si on the SA of T = mississippi#; at each step P is compared with the suffix in the middle of the current SA range, which is halved according to whether P is smaller or larger]

Suffix Array search:
  • O(log2 N) binary-search steps
  • each step takes O(p) char comparisons
  → overall, O(p log2 N) time

  improvable to O(p + log2 N)    [Manber-Myers, ’90]
  or to O(p + log2 |S|)          [Cole et al, ’06]
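A sketch of the indirect binary search over SA (a plain illustration; SA positions are 1-based as in the slides, and the naive construction is the quadratic one discussed earlier):

def suffix_array(T):
    return sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])

def search(T, SA, P):
    # Occurrences of P = the contiguous SA range of suffixes having P as a prefix.
    def pref(i):                       # first |P| chars of the i-th suffix: O(p) per comparison
        return T[i - 1:i - 1 + len(P)]
    lo, hi = 0, len(SA)
    while lo < hi:                     # leftmost suffix whose prefix is >= P
        mid = (lo + hi) // 2
        if pref(SA[mid]) < P:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(SA)
    while lo < hi:                     # leftmost suffix whose prefix is > P
        mid = (lo + hi) // 2
        if pref(SA[mid]) <= P:
            lo = mid + 1
        else:
            hi = mid
    return sorted(SA[start:lo])

T = "mississippi#"
SA = suffix_array(T)
print(SA)                    # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(search(T, SA, "si"))   # [4, 7]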

Locating the occurrences

[figure: the SA range found by the binary search for P = si contains occ = 2 suffixes of T = mississippi#, i.e. the occurrences at positions 4 and 7; the two range boundaries can be found by searching for P# and P$, where # < S < $]

Suffix Array search:  O(p + log2 N + occ) time

Suffix Trays:  O(p + log2 |S| + occ)        [Cole et al., ‘06]
String B-tree                               [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays                [Ciriani et al., ’02]

Text mining
Lcp[1, N−1] = longest-common-prefix between suffixes adjacent in SA.

[figure: the Lcp and SA arrays of T = mississippi#; e.g. the adjacent suffixes issippi# and ississippi# share an Lcp of 4]

• How long is the common prefix between T[i,...] and T[j,...] ?
  → the min of the subarray Lcp[h, k−1], where SA[h] = i and SA[k] = j.
• Is there a repeated substring of length ≥ L ?
  → search for an entry Lcp[i] ≥ L.
• Is there a substring of length ≥ L occurring ≥ C times ?
  → search for a window Lcp[i, i+C−2] whose entries are all ≥ L.

Slide 182

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million documents (columns), t = 500K terms (rows)

             Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony              1                1             0          0       0        1
Brutus              1                1             0          1       0        0
Caesar              1                1             0          1       1        1
Calpurnia           0                1             0          0       0        0
Cleopatra           1                0             0          0       0        0
mercy               1                0             1          1       1        1
worser              1                0             1          1       1        0

(1 if the play contains the word, 0 otherwise)

Space is 500Gb !

Solution 2: Inverted index
[Figure: the postings lists of Brutus, Calpurnia and Caesar, e.g. Brutus → 2 4 8 16 32 64 128]

We can still do better: i.e. 30÷50% of the original text

1. Typically about 12 bytes are used per entry
2. We have 10^9 total terms → at least 12Gb space
3. Compressing the 6Gb documents gets 1.5Gb of data
A better index, but it is still >10 times the text !!!!
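To make Solution 2 concrete, here is a minimal Python sketch of an inverted index: a map from each term to the sorted list of docIDs containing it, plus the AND of two terms obtained by merging their postings lists. The tiny collection below is made up for illustration.

docs = {1: "antony and cleopatra", 2: "julius caesar", 3: "hamlet", 4: "othello"}

index = {}
for doc_id, text in docs.items():
    for term in set(text.split()):
        index.setdefault(term, []).append(doc_id)
for postings in index.values():
    postings.sort()                      # postings lists are kept sorted by docID

def intersect(p1, p2):                   # merge two sorted postings lists
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

print(intersect(index["antony"], index["cleopatra"]))   # [1]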

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM into fewer bits ?
NO: there are 2^n of them, but the compressed messages shorter than n bits are at most

Σ_{i=1}^{n-1} 2^i = 2^n - 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits
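A minimal Python sketch computing the empirical order-0 entropy H(S) of a string from the formula above, i.e. the weighted average of the self-information -log2 p(s); the example string is illustrative.

from collections import Counter
from math import log2

def H0(text):
    counts = Counter(text)
    n = len(text)
    return sum((c / n) * log2(n / c) for c in counts.values())   # bits per symbol

print(round(H0("abracadabra"), 3))   # ~2.04 bits/symbol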

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into its codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
[Figure: the binary trie of the code, with leaves a = 0, b = 100, c = 101, d = 11]

Average Length
For a code C with codeword length L[s], the
average length is defined as
La(C) = Σ_{s∈S} p(s) * L[s]

We say that a prefix code C is optimal if for
all prefix codes C', La(C) ≤ La(C')

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, there exists a prefix
code with the same codeword lengths and thus the
same optimal average length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj → L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
[Figure: Huffman tree built bottom-up: a(.1) + b(.2) → (.3); (.3) + c(.2) → (.5); (.5) + d(.5) → (1)]

a=000, b=001, c=01, d=1
There are 2^{n-1} "equivalent" Huffman trees

What about ties (and thus, tree depth) ?
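A minimal Python sketch of Huffman-code construction with a min-heap, applied to the running example; ties are broken by insertion order (which is exactly where the "equivalent" trees come from), so the printed 0/1 labels may differ from the slide while the codeword lengths are the same.

import heapq
from itertools import count

def huffman_codes(probs):
    tick = count()                       # tie-breaker, so equal weights never compare the code dicts
    heap = [(p, next(tick), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)  # two smallest weights
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (p1 + p2, next(tick), merged))
    return heap[0][2]

print(huffman_codes({"a": .1, "b": .2, "c": .2, "d": .5}))
# codeword lengths match the slide: |a| = |b| = 3, |c| = 2, |d| = 1 (the 0/1 labels may differ)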

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
[Figure: the Huffman tree of the running example, used for encoding and decoding]

abc...  →  00000101...
101001...  →  dcb...

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:
  firstcode[L]   (the numerically smallest codeword of the level; e.g. 00.....0 on the deepest level)
  Symbol[L,i], for each i in level L

This is ≤ h^2 + |S| log |S| bits

Canonical Huffman
Encoding

[Figure: canonical Huffman codeword tree with levels 1..5]

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
  1 extra bit per macro-symbol = 1/k extra bits per symbol
  Larger model to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:
  the model takes |S|^k * (k * log |S|) + h^2 bits (where h might be |S|)
  and H0(S^L) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
[Figure: a word-based Huffman tree with fan-out 128 over the words of T = "bzip or not bzip"; each codeword is a sequence of bytes, every byte carrying 1 tagging bit + 7 Huffman bits, so codewords in C(T) are byte-aligned and tagged]

CGrep and other ideas...
P = bzip = 1a 0b

[Figure: GREP is run directly over the compressed text C(T) of T = "bzip or not bzip": the tagged, byte-aligned codeword of P is compared against C(T) at codeword boundaries, answering yes/no at each position]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary: bzip, not, or (plus the space)

P = bzip = 1a 0b

[Figure: the codeword of P is searched directly in the compressed sequence C(S), S = "bzip or not bzip": a yes/no answer is produced at each codeword boundary]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
[Figure: the pattern P aligned below the text T at a candidate position]

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errs, whose running
time is efficient with high probability.

We will consider a binary alphabet (i.e., T ∈ {0,1}^n).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H(s) = Σ_{i=1}^{m} 2^{m-i} * s[i]

P = 0101
H(P) = 2^3*0 + 2^2*1 + 2^1*0 + 2^0*1 = 5

s = s' if and only if H(s) = H(s')

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1*2 + 0 ≡ 2 (mod 7)
2*2 + 1 ≡ 5 (mod 7)
5*2 + 1 ≡ 4 (mod 7)
4*2 + 1 ≡ 2 (mod 7)
2*2 + 1 ≡ 5 (mod 7)  →  Hq(P) = 5

We can still compute Hq(Tr) from Hq(Tr-1):
2^m (mod q) = 2 * (2^{m-1} (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
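A minimal Python sketch of the Karp-Rabin scan over a binary text: Hq(Tr) is rolled from Hq(Tr-1) in O(1) time and every fingerprint match is verified, as in the deterministic variant above; as an assumption, q is a fixed prime here rather than a randomly chosen one.

def karp_rabin(T, P, q=1000003):
    n, m = len(T), len(P)
    if m > n:
        return []
    top = pow(2, m - 1, q)                    # 2^(m-1) mod q, used to drop the leading bit
    hp = ht = 0
    for i in range(m):                        # fingerprints of P and of the first window
        hp = (2 * hp + int(P[i])) % q
        ht = (2 * ht + int(T[i])) % q
    occ = []
    for r in range(n - m + 1):
        if hp == ht and T[r:r + m] == P:      # verification step (no false matches reported)
            occ.append(r + 1)                 # 1-based position, as in the slides
        if r + m < n:                         # roll the fingerprint to the next window
            ht = (2 * (ht - int(T[r]) * top) + int(T[r + m])) % q
    return occ

print(karp_rabin("10110101", "0101"))   # [5]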

Problem 1: Solution
Dictionary: bzip, not, or (plus the space)

P = bzip = 1a 0b

[Figure: the tagged, byte-aligned codeword of P is compared against C(S), S = "bzip or not bzip", at codeword boundaries, answering yes/no at each position]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

[Figure: the m×n matrix M for T = california and P = for; the only 1-entries are M(1,5), M(2,6) and M(3,7), since "f", "fo" and "for" end at positions 5, 6 and 7 of T, all other entries are 0]

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size (e.g., 32 or 64 bits). We'll assume
m ≤ w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the (j-1)-th one.
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1 for the
positions in P where character x appears.
Example: P = abaac
U(a) = (1,0,1,1,0)
U(b) = (0,1,0,0,0)
U(c) = (0,0,0,0,1)

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) into the i-th position;
AND this with the i-th bit of U(T[j]) to establish whether both conditions hold
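A minimal Python sketch of the Shift-And scan, using Python integers as the bit columns of M (bit i-1 of the integer is row i of the matrix): M is updated exactly by M(j) = BitShift(M(j-1)) & U(T[j]), and an occurrence is reported when the m-th bit becomes 1.

def shift_and(T, P):
    m = len(P)
    U = {}
    for i, c in enumerate(P):                 # U(c) has a 1 in every position of P where c occurs
        U[c] = U.get(c, 0) | (1 << i)
    M, occ = 0, []
    for j, c in enumerate(T):
        M = ((M << 1) | 1) & U.get(c, 0)      # BitShift then AND with U(T[j])
        if M & (1 << (m - 1)):                # row m is 1: P ends at position j
            occ.append(j - m + 2)             # 1-based starting position
    return occ

print(shift_and("xabxabaaca", "abaac"))   # [5]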

An example (P = abaac, T = xabxabaaca)

[Figures: the columns M(1), M(2), M(3), …, M(9) computed position by position via M(j) = BitShift(M(j-1)) & U(T[j]); at j = 9 the 5-th bit of the column is 1, i.e. P occurs in T ending at position 9]

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
U(a) = (1,0,1,1,0)
U(b) = (1,1,0,0,0)
U(c) = (0,0,0,0,1)

What about '?', '[^…]' (not).

Problem 1: Another solution
Dictionary: bzip, not, or (plus the space)

P = bzip = 1a 0b

[Figure: as before, the tagged codeword of P is searched directly in C(S), S = "bzip or not bzip"]

Speed ≈ Compression ratio

Problem 2
Dictionary: bzip, not, or (plus the space)

Given a pattern P, find all the occurrences in S of all terms containing P as a substring.

P = o

[Figure: the dictionary terms containing 'o' are not = 1g 0g 0a and or = 1g 0a 0b; their codewords are then searched in C(S), S = "bzip or not bzip"]

Speed ≈ Compression ratio? No! Why?
A scan of C(S) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: the patterns P1 and P2 aligned at their occurrences inside the text T]

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

S is the concatenation of the patterns in P
R is a bitmap of length m:
  R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of the Shift-And method searching for S:
  For any symbol c, U'(c) = U(c) AND R
    so U'(c)[i] = 1 iff S[i] = c and it is the first symbol of a pattern
  For any step j,
    compute M(j)
    then OR it with U'(T[j]). Why?
      It sets to 1 the first bit of each pattern that starts with T[j]
    Check if there are occurrences ending in j. How?

Problem 3
Dictionary: bzip, not, or (plus the space)

Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing at most k mismatches.

P = bot, k = 2

[Figure: the codewords of the dictionary terms are matched against C(S), S = "bzip or not bzip", allowing up to k mismatches]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal: given k, find all the occurrences of P
in T with up to k mismatches.
We define the matrix M^l to be an m by n binary
matrix, such that:

M^l(i,j) = 1 iff
there are no more than l mismatches between the
first i characters of P and the i characters of T
ending at character j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

[Figure: P[1..i-1] aligned with the substring of T ending at position j-1, with at most l mismatches, followed by the matching pair P[i] = T[j]]

This case contributes:  BitShift( M^l(j-1) ) & U(T[j])

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

[Figure: P[1..i-1] aligned with the substring of T ending at position j-1, with at most l-1 mismatches; position j is then allowed to be a mismatch]

This case contributes:  BitShift( M^{l-1}(j-1) )

Computing Ml


We compute M^l for all l = 0, …, k.
For each j compute M^0(j), M^1(j), …, M^k(j)
For all l initialize M^l(0) to the zero vector.
In order to compute M^l(j), we combine the two cases above:

M^l(j) = [ BitShift( M^l(j-1) ) & U(T[j]) ]  OR  BitShift( M^{l-1}(j-1) )

Example M1

T = xabxabaaca,  P = abaad

[Figures: the 5×10 bit matrices M^0 (exact matching) and M^1 (at most 1 mismatch); in M^1 row 5 has a 1 in column 9, i.e. P occurs ending at position 9 of T with at most one mismatch]

How much do we pay?





The running time is O( k * n * (1 + m/w) ).
Again, the method is practically efficient for
small m.
Still, only O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary: bzip, not, or (plus the space)

Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing k mismatches.

P = bot, k = 2

[Figure: the only term matching P with ≤ 2 mismatches is not = 1g 0g 0a; its codeword is then searched in C(S), S = "bzip or not bzip"]

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding

g(x) = 0 0 … 0 (Length-1 zeroes) followed by x in binary,
where x > 0 and Length = ⌊log2 x⌋ + 1
e.g., 9 is represented as <000, 1001>.

The g-code for x takes 2⌊log2 x⌋ + 1 bits  (i.e. a factor of 2 from optimal)

Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111
= 0001000 | 00110 | 011 | 00000111011 | 00111  →  8, 6, 3, 59, 7
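A minimal Python sketch of g-coding and decoding as defined above; the encoder reproduces the exercise's bit string and the decoder recovers the original integers.

def gamma_encode(x):
    b = bin(x)[2:]                  # x > 0 in binary, e.g. 9 -> "1001"
    return "0" * (len(b) - 1) + b   # Length-1 zeroes, then the binary digits: 9 -> "0001001"

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":       # the unary part gives Length-1
            zeros += 1; i += 1
        out.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return out

code = "".join(gamma_encode(x) for x in (8, 6, 3, 59, 7))
print(code)                  # 0001000001100110000011101100111
print(gamma_decode(code))    # [8, 6, 3, 59, 7]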

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(S) + 1
Key fact:
1 ≥ Σ_{i=1,...,x} pi ≥ x * px  →  x ≤ 1/px

How good is it ?
Encode the integers via g-coding: |g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):

Σ_{i=1,...,|S|} pi * |g(i)|  ≤  Σ_{i=1,...,|S|} pi * [ 2 * log (1/pi) + 1 ]  ≤  2 * H0(X) + 1

Not much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte,
s*c with 2 bytes, s*c^2 with 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 128^2 = 16512 words on at most 2 bytes
A (230,26)-dense code encodes 230 + 230*26 = 6210 words on at most 2
bytes, hence more words on 1 byte, and thus it wins if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1^n 2^n 3^n … n^n  →  Huff = O(n^2 log n), MTF = O(n log n) + n^2

Not much worse than Huffman
...but it may be far better
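A minimal Python sketch of Move-to-Front coding: each symbol is replaced by its current position in the list L (positions start at 1 here) and then moved to the front, so runs of equal symbols quickly turn into runs of 1s.

def mtf_encode(text, alphabet):
    L = list(alphabet)
    out = []
    for s in text:
        pos = L.index(s) + 1      # output the position of s in L
        out.append(pos)
        L.pop(pos - 1)
        L.insert(0, s)            # move s to the front of L
    return out

print(mtf_encode("aaabbbccc", "abc"))   # [1, 1, 1, 2, 1, 1, 3, 1, 1]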

MTF: how good is it ?
Encode the integers via g-coding: |g(i)| ≤ 2 * log i + 1
Put the alphabet S at the front of the list and consider the cost of encoding:

O(|S| log |S|) + Σ_{x=1,...,|S|} Σ_{i=2,...,n_x} | g( p_x^i - p_x^{i-1} ) |

By Jensen's inequality this is

≤ O(|S| log |S|) + Σ_{x=1,...,|S|} n_x * [ 2 * log (N/n_x) + 1 ]
= O(|S| log |S|) + N * [ 2 * H0(X) + 1 ]

so La[mtf] ≤ 2 * H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep the MTF-list efficiently:

Search tree:
  leaves contain the symbols, ordered as in the MTF-list
  nodes contain the size of their descending subtree

Hash Table:
  key is a symbol
  data is a pointer to the corresponding tree leaf

Each tree operation takes O(log |S|) time
Total is O(n log |S|), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca ⇒ (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings → just the run lengths and one bit

Properties:
  It exploits spatial locality, and it is a dynamic code (there is a memory)
  X = 1^n 2^n 3^n … n^n  →  Huff(X) = Θ(n^2 log n)  >  Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.  p(a) = .2, p(b) = .5, p(c) = .3

f(i) = Σ_{j=1}^{i-1} p(j),   so   f(a) = .0, f(b) = .2, f(c) = .7

[Figure: the unit interval [0,1) partitioned into a = [0,.2), b = [.2,.7), c = [.7,1)]

The interval for a particular symbol will be called
the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
[Figure: coding the sequence bac. Start from [0,1); after b the interval is [.2,.7); after a it is [.2,.3); after c it is [.27,.3)]

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l 0  0 l i  l i -1  s i -1 * f c i 

s0  1

s i  s i -1 * p c i 

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

n

sn 

 p c 
i

i 1

The interval for a message sequence will be called the
sequence interval
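A minimal Python sketch of the sequence-interval computation above, updating l_i and s_i one symbol at a time with Python floats (a real coder uses the integer version discussed later); the probabilities are those of the running example.

def sequence_interval(msg, p, f):
    l, s = 0.0, 1.0
    for c in msg:
        l = l + s * f[c]          # l_i = l_{i-1} + s_{i-1} * f[c_i]
        s = s * p[c]              # s_i = s_{i-1} * p[c_i]
    return l, s

p = {"a": .2, "b": .5, "c": .3}
f = {"a": .0, "b": .2, "c": .7}
l, s = sequence_interval("bac", p, f)
print(l, l + s)                   # ≈ 0.27 0.3, i.e. the interval [.27,.3) of the encoding example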

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
[Figure: decoding .49 with three refinements: .49 ∈ [.2,.7) → b; then .49 ∈ [.3,.55) → b; then .49 ∈ [.475,.55) → c]

The message is bbc.

Representing a real number
Binary fractional representation:
  .75 = .11      1/3 = .010101…      11/16 = .1011

Algorithm:
  1. x = 2*x
  2. If x < 1 output 0
  3. else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence interval?
  e.g.  [0,.33) → .01     [.33,.66) → .1     [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
         min      max      interval
.11      .110     .111     [.75, 1.0)
.101     .1010    .1011    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

[Figure: the sequence interval [.61,.79) contains the code interval [.625,.75) of the dyadic number .101]

Can use L + s/2 truncated to 1 + ⌈log2 (1/s)⌉ bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in the range [0..R) where R = 2^k
Use rounding to generate the integer interval
Whenever the sequence interval falls into the top,
bottom or middle half, expand the interval
by a factor of 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
  Output 1 followed by m 0s;  m = 0
  Message interval is expanded by 2

If u < R/2 then (bottom half)
  Output 0 followed by m 1s;  m = 0
  Message interval is expanded by 2

If l ≥ R/4 and u < 3R/4 then (middle half)
  Increment m
  Message interval is expanded by 2

In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine

[Figure: the ATB takes the current interval (L,s), the distribution (p1,....,p|S|) and the next symbol c, and returns the new interval (L',s')]

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
[Figure: the ATB is driven by p[ s | context ], where s = c or esc; given (L,s) it produces (L',s')]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts          String = ACCBACCACBA (next symbol: B),  k = 2

Context: Empty     Context (order 1)          Context (order 2)
  A = 4              A:  C = 3, $ = 1           AC:  B = 1, C = 2, $ = 2
  B = 2              B:  A = 2, $ = 1           BA:  C = 1, $ = 1
  C = 5              C:  A = 1, B = 2,          CA:  C = 1, $ = 1
  $ = 3                  C = 2, $ = 3           CB:  A = 2, $ = 1
                                                CC:  A = 1, B = 1, $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output ⟨d, len, c⟩:
  d = distance of the copied string wrt the current position
  len = length of the longest match
  c = next char in the text beyond the longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
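A minimal Python sketch of LZ77 decoding for the ⟨d, len, c⟩ triples used above; copying one character at a time makes the overlapping case (len > d) come out right, and the triples below are those of the windowed encoding example.

def lz77_decode(triples):
    out = []
    for d, length, c in triples:
        start = len(out) - d
        for i in range(length):               # out[cursor+i] = out[cursor-d+i]
            out.append(out[start + i])
        out.append(c)                         # the extra character of the triple
    return "".join(out)

print(lz77_decode([(0, 0, "a"), (1, 1, "c"), (3, 4, "b"), (3, 3, "a"), (1, 2, "c")]))
# aacaacabcabaaac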

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
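A minimal Python sketch of LZW decoding, including the special SSc case just mentioned (a code emitted by the encoder one step before the decoder can know it). As an assumption for brevity, the dictionary is initialized with a toy 3-symbol alphabet {a,b,c} → codes 0,1,2 instead of the 256 ASCII entries; the code sequence below is the slides' example string re-encoded over that toy alphabet.

def lzw_decode(codes, alphabet):
    dictionary = {i: ch for i, ch in enumerate(alphabet)}
    prev = dictionary[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        else:                                   # the "SSc with S[0]=c" case: code not known yet
            entry = prev + prev[0]
        dictionary[len(dictionary)] = prev + entry[0]   # the decoder is one step behind the coder
        out.append(entry)
        prev = entry
    return "".join(out)

print(lzw_decode([0, 0, 1, 3, 2, 4, 8, 2, 1], "abc"))   # aabaacababacb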

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
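A minimal Python sketch of the BWT (built naively from the sorted rotations) and of its inversion via the LF-mapping used by InvertBWT above; '#' is assumed to be the unique, smallest sentinel that terminates the text.

def bwt(T):
    rot = sorted(T[i:] + T[:i] for i in range(len(T)))
    return "".join(r[-1] for r in rot)           # the last column L of the sorted rotations

def inverse_bwt(L):
    n = len(L)
    order = sorted(range(n), key=lambda r: (L[r], r))   # F-row -> L-row, stable on equal chars
    LF = [0] * n
    for f_row, l_row in enumerate(order):               # LF maps a row of L to its row in F
        LF[l_row] = f_row
    r, chars = 0, []
    for _ in range(n):                                   # L[r] precedes F[r] in T
        chars.append(L[r])
        r = LF[r]
    s = "".join(reversed(chars))                         # this is '#' followed by T without '#'
    return s[1:] + s[:1]                                 # put the sentinel back at the end

L = bwt("mississippi#")
print(L)                   # ipssm#pissii
print(inverse_bwt(L))      # mississippi#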

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n^2 log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in-degree(u) = k ]  ∝  1/k^a,   a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/x^a, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s_1 - x, s_2 - s_1 - 1, ..., s_k - s_{k-1} - 1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77-scheme provides an efficient, optimal solution:

  fknown is the "previously encoded text": compress the concatenation fknown·fnew, starting the emission from fnew

zdelta is one of the best implementations
            Emacs size    Emacs time
uncompr     27Mb          ---
gzip        8Mb           35 secs
zdelta      1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: the weighted graph GF with the dummy node 0 connected to all files; the min branching picks, for each file, the cheapest reference (another file, or gzip from scratch via the dummy node)]

            space    time
uncompr     30Mb     ---
tgz         20%      linear
THIS        8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strictly n^2 time.

            space     time
uncompr     260Mb     ---
tgz         12%       2 mins
THIS        8%        16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

          gcc size    emacs size
total     27288       27326
gzip      7563        8577
zdelta    227         1431
rsync     964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
T# = mississippi#

[Figure: the suffix tree of T#: edges are labelled with substrings (e.g. ssi, ppi#, si, i#, mississippi#, …) and the 12 leaves are labelled with the starting positions 1..12 of the corresponding suffixes]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Θ(N^2) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
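A minimal Python sketch of the indirected binary search on SA: the suffix array is built naively (fine for a toy text), each binary-search step compares P with at most p characters of one suffix, for O(p log2 N) time overall, and the matching range of SA gives the occurrences; positions are 1-based as in the slides.

def suffix_array(T):
    # toy O(N^2 log N) construction, just for illustration
    return sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])

def locate(T, SA, P):
    def first(strictly_greater):
        lo, hi = 0, len(SA)
        while lo < hi:
            mid = (lo + hi) // 2
            suf = T[SA[mid] - 1:SA[mid] - 1 + len(P)]   # compare only the first |P| chars
            if suf < P or (strictly_greater and suf == P):
                lo = mid + 1
            else:
                hi = mid
        return lo
    l, r = first(False), first(True)        # SA[l:r] = suffixes having P as a prefix
    return sorted(SA[l:r])                  # 1-based starting positions of the occurrences

T = "mississippi#"
SA = suffix_array(T)
print(SA)                    # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(locate(T, SA, "si"))   # [4, 7]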

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does it exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does it exist a substring of length ≥ L occurring ≥ C times ?
• Search for Lcp[i,i+C-2] whose entries are ≥ L


Slide 183

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (losslessly) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM into fewer bits?
NO: they are 2^n, but the shorter compressed messages are fewer:

  Σ_{i=1}^{n-1} 2^i = 2^n - 2

We need to talk about stochastic sources.

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the self-information of s is:

  i(s) = log2 (1/p(s)) = - log2 p(s)

Lower probability  ⇒  higher information.

Entropy is the weighted average of i(s):

  H(S) = Σ_{s∈S} p(s) * log2 (1/p(s))   bits
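A minimal sketch computing H(S) from empirical symbol frequencies (an illustrative helper, not part of the original slides):

from collections import Counter
from math import log2

def entropy(text):
    # H(S) = sum over symbols of p(s) * log2(1/p(s)), with empirical p(s).
    counts = Counter(text)
    n = len(text)
    return sum((c / n) * log2(n / c) for c in counts.values())

print(entropy("mississippi"))   # ≈ 1.82 bits per symbol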

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword lengths L[s], the average length is defined as

  La(C) = Σ_{s∈S} p(s) * L[s]

We say that a prefix code C is optimal if, for all prefix codes C’, La(C) ≤ La(C’).

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal uniquely decodable code,
there exists a prefix code with the same codeword lengths and thus
the same optimal average length. And vice versa…

Theorem (golden rule). If C is an optimal prefix code for a source
with probabilities {p1, …, pn}, then pi < pj  ⇒  L[si] ≥ L[sj].

Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and
any uniquely decodable code C, we have

  H(S) ≤ La(C)

Theorem (upper bound, Shannon). For any probability distribution,
there exists a prefix code C such that

  La(C) ≤ H(S) + 1

(the Shannon code, which assigns ⌈log2 1/p(s)⌉ bits to symbol s)

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

[figure: Huffman tree built bottom-up: (a,b) → .3, (.3,c) → .5, (.5,d) → 1]

a = 000, b = 001, c = 01, d = 1
There are 2^(n-1) “equivalent” Huffman trees (each internal node can swap its children).

What about ties (and thus, tree depth)?
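A minimal sketch of the Huffman construction with a heap, on the running example; ties are broken arbitrarily here, which is exactly the point of the question above (so the 0/1 labels may differ, but the codeword lengths match):

import heapq

def huffman_codes(probs):
    # probs: dict symbol -> probability. Returns dict symbol -> codeword.
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)   # the two least probable trees...
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))  # ...are merged
        counter += 1
    return heap[0][2]

print(huffman_codes({"a": .1, "b": .2, "c": .2, "d": .5}))
# codeword lengths 3,3,2,1 as in the slide; the 0/1 labelling may be swapped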

Encoding and Decoding
Encoding: emit the root-to-leaf path leading to the symbol to be encoded.
Decoding: start at the root and take the branch for each bit received;
when at a leaf, output its symbol and return to the root.

With the code above (a=000, b=001, c=01, d=1):
  abc…  →  000 001 01 …
  101001…  →  d c b …

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. Its self-information is

  -log2(.999) ≈ .00144 bits

If we were to send 1000 such symbols we might hope to use
1000 * .00144 ≈ 1.44 bits.

Using Huffman, we take at least 1 bit per symbol,
so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols:
  ≤ 1 extra bit per macro-symbol = 1/k extra bits per symbol
  but a larger model has to be transmitted.

Shannon took infinite sequences, with k → ∞ !!

In practice we have:
  the model takes |S|^k * (k * log |S|) + h^2 bits (where h might be as large as |S|^k),
  and H0(S^L) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L.

Compress + Search ?      [Moura et al, 98]

Compressed text derived from a word-based Huffman:
  Symbols of the Huffman tree are the words of T
  The Huffman tree has fan-out 128
  Codewords are byte-aligned and tagged: the extra bit of each byte tags the
  first byte of a codeword, the remaining 7 bits carry the Huffman code.

[figure: 128-ary tagged Huffman tree over the words of T = “bzip or not bzip”
and the byte-aligned codewords of C(T)]

CGrep and other ideas...

Search the compressed text directly: compress the pattern with the same
word-based code (e.g. P = bzip = 1a 0b) and run a GREP-like scan over the
bytes of C(T), T = “bzip or not bzip”; the tag bit prevents false matches
inside other codewords.

[figure: scanning C(T) for the codeword of “bzip”, answering yes/no at each codeword]

Speed ≈ Compression ratio

You find this under my Software projects.

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary = { bzip, not, or, space }

Given the compressed text C(S) of S = “bzip or not bzip” and a pattern
P = bzip (i.e. its codeword 1a 0b), find all the occurrences of P in S
by scanning C(S) codeword by codeword.

[figure: byte-aligned codewords of C(S); the scan answers yes/no at each codeword]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].

[figure: the pattern P slid along the text T]

 Naïve solution
   For any position i of T, check whether T[i,i+m-1] = P[1,m]
   Complexity: O(nm) time

 (Classical) optimal solutions based on comparisons
   Knuth-Morris-Pratt
   Boyer-Moore
   Complexity: O(n + m) time

Semi-numerical pattern matching

We show methods in which arithmetic and bit operations replace comparisons.
We will survey two examples of such methods:
  the Random Fingerprint method due to Karp and Rabin
  the Shift-And method due to Baeza-Yates and Gonnet

Rabin-Karp Fingerprint

We will use a class of functions from strings to integers in order to obtain:
  an efficient randomized algorithm that makes an error with small probability;
  a randomized algorithm that never errs, whose running time is efficient
  with high probability.

We will consider a binary alphabet (i.e., T ∈ {0,1}^n).

Arithmetic replaces Comparisons

Strings are also numbers; H: strings → numbers.
Let s be a string of length m:

  H(s) = Σ_{i=1}^{m} 2^{m-i} * s[i]

Example: P = 0101
  H(P) = 2^3*0 + 2^2*1 + 2^1*0 + 2^0*1 = 5

s = s’ if and only if H(s) = H(s’).

Definition: let Tr denote the m-length substring of T starting at
position r (i.e., Tr = T[r, r+m-1]).

Arithmetic replaces Comparisons

Exact match = scan T and compare H(Tr) with H(P):
there is an occurrence of P starting at position r of T if and only if H(P) = H(Tr).

  T = 10110101,  P = 0101,  H(P) = 5
  H(T2) = H(0110) = 6 ≠ H(P)
  H(T5) = H(0101) = 5 = H(P)    Match!

Arithmetic replaces Comparisons

We can compute H(Tr) from H(Tr-1) in constant time:

  H(Tr) = 2 * H(Tr-1) - 2^m * T[r-1] + T[r+m-1]

Example: T = 10110101, m = 4
  T1 = 1011, T2 = 0110
  H(T1) = H(1011) = 11
  H(T2) = 2*11 - 2^4*1 + 0 = 22 - 16 = 6 = H(0110)

Arithmetic replaces Comparisons

A simple efficient algorithm:
  compute H(P) and H(T1);
  run over T, computing H(Tr) from H(Tr-1) in constant time,
  and compare H(P) with H(Tr).

Total running time O(n+m)?
NO! Why? When m is large, it is unreasonable to assume that each arithmetic
operation takes O(1) time: the values of H() are m-bit numbers, in general
too BIG to fit in a machine word.

IDEA! Let’s use modular arithmetic:
for some prime q, the Karp-Rabin fingerprint of a string s is defined by
Hq(s) = H(s) (mod q).

An example
P = 101111,  q = 7
H(P) = 47,  Hq(P) = 47 mod 7 = 5

Hq(P) can be computed incrementally:
  1*2 + 0 (mod 7) = 2
  2*2 + 1 (mod 7) = 5
  5*2 + 1 (mod 7) = 4
  4*2 + 1 (mod 7) = 2
  2*2 + 1 (mod 7) = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1), using
2^m (mod q) = 2 * (2^{m-1} (mod q)) (mod q).
Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm

  Choose a positive integer I.
  Pick a random prime q ≤ I, and compute P’s fingerprint Hq(P).
  For each position r in T, compute Hq(Tr) and test whether it equals Hq(P).
  If the numbers are equal, either
    declare a probable match (randomized algorithm), or
    check and declare a definite match (deterministic algorithm).

Running time: excluding verification, O(n+m).
The randomized algorithm is correct w.h.p.
The deterministic algorithm has expected running time O(n+m).

Proof on the board
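A minimal sketch of the Karp-Rabin scan over a binary text, with verification on fingerprint hits; the prime q is fixed here instead of being drawn at random, just to keep the example short:

def karp_rabin(T, P, q=2**31 - 1):
    # T, P: strings over {0,1}. Reports the 1-based starting positions of P in T.
    n, m = len(T), len(P)
    if m > n:
        return []
    pow_m = pow(2, m, q)                      # 2^m mod q, for the rolling update
    hP = hT = 0
    for i in range(m):                        # Horner, mod q
        hP = (2 * hP + int(P[i])) % q
        hT = (2 * hT + int(T[i])) % q
    occ = []
    for r in range(n - m + 1):
        if hT == hP and T[r:r+m] == P:        # verify, to rule out false matches
            occ.append(r + 1)
        if r + m < n:                         # roll the window one position
            hT = (2 * hT - pow_m * int(T[r]) + int(T[r+m])) % q
    return occ

print(karp_rabin("10110101", "0101"))         # [5]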

Problem 1: Solution
Dictionary = { bzip, not, or, space }

Compress the pattern with the same word-based code: P = bzip = 1a 0b.
Then scan the bytes of C(S), S = “bzip or not bzip”, comparing codewords;
the tag bit tells where codewords start, so the scan never reports false matches.

Speed ≈ Compression ratio

The Shift-And method

Define M to be a binary m-by-n matrix such that:
  M(i,j) = 1 iff the first i characters of P exactly match the i
  characters of T ending at character j,
  i.e., M(i,j) = 1 iff P[1…i] = T[j-i+1…j].

Example: T = california and P = for

        c a l i f o r n i a
   f    0 0 0 0 1 0 0 0 0 0
   o    0 0 0 0 0 1 0 0 0 0
   r    0 0 0 0 0 0 1 0 0 0

How does M solve the exact match problem?
(P occurs ending at position j iff M(m,j) = 1.)

How to construct M

We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one.
Machines can perform bit and arithmetic operations between two words in constant time.
Examples:
  And(A,B) is the bit-wise and between A and B.
  BitShift(A) is the value derived by shifting A’s bits down by one position
  and setting the first bit to 1.

Let w be the word size (e.g., 32 or 64 bits). We’ll assume m = w.
NOTICE: any column of M fits in a memory word.

How to construct M

We define the m-length binary vector U(x) for each character x of the alphabet:
U(x) is set to 1 in the positions of P where character x appears.

Example: P = abaac
  U(a) = [1,0,1,1,0]    U(b) = [0,1,0,0,0]    U(c) = [0,0,0,0,1]
How to construct M

Initialize column 0 of M to all zeros.
For j > 0, the j-th column is obtained by

  M(j) = BitShift( M(j-1) )  &  U( T[j] )

For i > 1, entry M(i,j) = 1 iff
  (1) the first i-1 characters of P match the i-1 characters of T
      ending at character j-1    ⇔  M(i-1,j-1) = 1
  (2) P[i] = T[j]                ⇔  the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) into the i-th position;
AND-ing this with the i-th bit of U(T[j]) establishes whether both are true.
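Before the step-by-step examples, here is a minimal sketch of the Shift-And scan as an integer bit-parallel program (bit i-1 of the machine word plays the role of row i of column M(j)); it assumes m ≤ w:

def shift_and(T, P):
    # Precompute U(x): bit i-1 set iff P[i] == x.
    m = len(P)
    U = {}
    for i, x in enumerate(P):
        U[x] = U.get(x, 0) | (1 << i)
    occ, col = [], 0                          # col = column M(j-1), initially zero
    for j, c in enumerate(T, start=1):
        # M(j) = BitShift(M(j-1)) & U(T[j]); the shift also sets the first bit to 1
        col = ((col << 1) | 1) & U.get(c, 0)
        if col & (1 << (m - 1)):              # M(m,j) = 1: an occurrence ends at j
            occ.append(j - m + 1)             # report its starting position (1-based)
    return occ

print(shift_and("xabxabaaca", "abaac"))       # [5]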

An example, step by step
T = xabxabaaca,  P = abaac

j=1: T[1]=x, BitShift(M(0)) & U(x) = [1,0,0,0,0] & [0,0,0,0,0]  ⇒  M(1) = [0,0,0,0,0]
j=2: T[2]=a, BitShift(M(1)) & U(a) = [1,0,0,0,0] & [1,0,1,1,0]  ⇒  M(2) = [1,0,0,0,0]
j=3: T[3]=b, BitShift(M(2)) & U(b) = [1,1,0,0,0] & [0,1,0,0,0]  ⇒  M(3) = [0,1,0,0,0]
…
j=9: T[9]=c, BitShift(M(8)) & U(c) = [1,1,0,0,1] & [0,0,0,0,1]  ⇒  M(9) = [0,0,0,0,1]

M(5,9) = 1, so an occurrence of P ends at position 9 (it starts at position 5).
Shift-And method: Complexity

If m ≤ w, any column and any vector U() fit in a memory word:
  each step requires O(1) time.
If m > w, any column and any vector U() can be divided into ⌈m/w⌉ memory words:
  each step requires O(m/w) time.
Overall O(n(1 + m/w) + m) time.

Thus it is very fast when the pattern length is close to the word size,
which is very often the case in practice. Recall that w = 64 bits in
modern architectures.

Some simple extensions

We want to allow the pattern to contain special symbols,
like the character class [a-f].

Example: P = [a-b]baac
  U(a) = [1,0,1,1,0]    U(b) = [1,1,0,0,0]    U(c) = [0,0,0,0,1]

What about ‘?’, ‘[^…]’ (negation)?

Problem 1: Another solution
Dictionary = { bzip, not, or, space }

P = bzip = 1a 0b. Search the 2-byte codeword of P directly in the byte
sequence C(S), S = “bzip or not bzip”, with a string-matching algorithm
(e.g. the previous Shift-And or Karp-Rabin scans over bytes);
the tag bits rule out false alignments inside other codewords.

Speed ≈ Compression ratio

Problem 2
Dictionary = { bzip, not, or, space }

Given a pattern P, find all the occurrences in S of all dictionary terms
containing P as a substring.
Example: P = o occurs inside “or” (= 1g 0a 0b) and “not” (= 1g 0g 0a),
S = “bzip or not bzip”.

Speed ≈ Compression ratio? No! Why?
Because this requires a scan of C(S) for each term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1, P2, …, Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].

 Naïve solution
   Use an (optimal) exact-matching algorithm to search for each pattern of P
   Complexity: O(nl + m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
   Complexity: O(n + l + m) time

A simple extension of Shift-And

  S is the concatenation of the patterns in P.
  R is a bitmap of length m: R[i] = 1 iff S[i] is the first symbol of a pattern.

Use a variant of the Shift-And method searching for S:
  For any symbol c, U’(c) = U(c) AND R
    (U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern).
  For any step j:
    compute M(j), then M(j) OR U’(T[j]). Why?
    This sets to 1 the first bit of each pattern that starts with T[j].
    Check whether there are occurrences ending in j. How?
    (A pattern ends at j iff the bit of its last symbol is 1 in M(j).)

Problem 3
Dictionary = { bzip, not, or, space }

Given a pattern P, find all the occurrences in S of all dictionary terms
containing P as a substring, allowing at most k mismatches.
Example: P = bot, k = 2,  S = “bzip or not bzip”.
Agrep: Shift-And method with errors

We extend the Shift-And method for finding inexact occurrences of a pattern in a text.
Example: T = aatatccacaa, P = atcgaa
P appears in T with 2 mismatches starting at position 4;
it also occurs with 4 mismatches starting at position 2.

  aatatccacaa
     atcgaa        (2 mismatches, position 4)

  aatatccacaa
   atcgaa          (4 mismatches, position 2)

Agrep

Our current goal: given k, find all the occurrences of P in T with up to k mismatches.
We define the matrix M^l to be an m-by-n binary matrix such that:
  M^l(i,j) = 1 iff there are no more than l mismatches between the first i
  characters of P and the i characters of T ending at position j.

What is M^0? (It is the matrix M of the exact Shift-And method.)
How does M^k solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing M^l: case 1

The first i-1 characters of P match a substring of T ending at j-1
with at most l mismatches, and the next pair of characters in P and T are equal:

  BitShift( M^l(j-1) )  &  U( T[j] )

Computing M^l: case 2

The first i-1 characters of P match a substring of T ending at j-1
with at most l-1 mismatches (T[j] is the extra mismatch):

  BitShift( M^{l-1}(j-1) )

Computing M^l

We compute M^l for all l = 0, …, k: for each j we compute M^0(j), M^1(j), …, M^k(j),
after initializing every M^l(0) to the zero vector.
Putting the two cases together, for l ≥ 1:

  M^l(j) = [ BitShift( M^l(j-1) ) & U( T[j] ) ]  OR  BitShift( M^{l-1}(j-1) )
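A minimal bit-parallel sketch of the k-mismatch scan above (one machine word per level l, m ≤ w assumed):

def shift_and_k_mismatches(T, P, k):
    m = len(P)
    U = {}
    for i, x in enumerate(P):
        U[x] = U.get(x, 0) | (1 << i)
    last = 1 << (m - 1)
    cols = [0] * (k + 1)                 # cols[l] = column M^l(j-1)
    occ = []
    for j, c in enumerate(T, start=1):
        prev = cols[:]                   # keep the (j-1)-th columns
        cols[0] = ((prev[0] << 1) | 1) & U.get(c, 0)
        for l in range(1, k + 1):
            # case 1: extend an l-mismatch prefix with a matching character
            # case 2: extend an (l-1)-mismatch prefix with one more mismatch
            cols[l] = (((prev[l] << 1) | 1) & U.get(c, 0)) | ((prev[l - 1] << 1) | 1)
        if cols[k] & last:               # M^k(m,j) = 1: occurrence ends at j
            occ.append(j - m + 1)
    return occ

print(shift_and_k_mismatches("aatatccacaa", "atcgaa", 2))   # [4]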

Example: M^1 (and M^0) for T = xabxabaaca, P = abaad

         j: 1 2 3 4 5 6 7 8 9 10
  M^1  i=1: 1 1 1 1 1 1 1 1 1 1
       i=2: 0 0 1 0 0 1 0 1 1 0
       i=3: 0 0 0 1 0 0 1 0 0 1
       i=4: 0 0 0 0 1 0 0 1 0 0
       i=5: 0 0 0 0 0 0 0 0 1 0

  M^0  i=1: 0 1 0 0 1 0 1 1 0 1
       i=2: 0 0 1 0 0 1 0 0 0 0
       i=3: 0 0 0 0 0 0 1 0 0 0
       i=4: 0 0 0 0 0 0 0 1 0 0
       i=5: 0 0 0 0 0 0 0 0 0 0

M^1(5,9) = 1: P occurs ending at position 9 with at most 1 mismatch.
How much do we pay?

The running time is O(k n (1 + m/w)).
Again, the method is practically efficient for small m.
Only O(k) columns of M are needed at any given time, hence the space
used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary = { bzip, not, or, space }

Given P, find all occurrences in S of dictionary terms containing P as a
substring with at most k mismatches: run the Agrep scan (the M^l matrices)
over the encoded terms.
Example: P = bot, k = 2 matches “not” (= 1g 0g 0a) in S = “bzip or not bzip”.

Agrep: more sophisticated operations

The Shift-And method can solve other ops.

The edit distance between two strings p and s is
d(p,s) = the minimum number of operations needed to transform p into s via three ops:
  Insertion: insert a symbol in p
  Deletion: delete a symbol from p
  Substitution: change a symbol of p into a different one
Example: d(ananas, banane) = 3

Search by regular expressions
  Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword lengths, and thus to build
the tree… This may be extremely time/space costly when you deal with GBs of
textual data.

A simple algorithm: sort the pi in decreasing order, and encode si via
the variable-length code for the integer i (its rank).

γ-code for integer encoding
  γ(x) = (Length-1) zeroes, followed by x in binary,
  where x > 0 and Length = ⌊log2 x⌋ + 1
  e.g., 9 is represented as <000, 1001>.

  The γ-code for x takes 2⌊log2 x⌋ + 1 bits (i.e. a factor of 2 from optimal).
  It is optimal for Pr(x) = 1/(2x^2), and i.i.d. integers.

It is a prefix-free encoding…
Exercise: given the following sequence of γ-coded integers, reconstruct
the original sequence:
  0001000001100110000011101100111
Answer: 8, 6, 3, 59, 7.
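A minimal sketch of γ-encoding and decoding, which can be checked against the exercise above:

def gamma_encode(x):
    # x > 0: (Length-1) zeroes followed by the binary representation of x.
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":              # count the unary length prefix
            zeros += 1
            i += 1
        out.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return out

code = "".join(gamma_encode(x) for x in [8, 6, 3, 59, 7])
print(code)                  # 0001000001100110000011101100111
print(gamma_decode(code))    # [8, 6, 3, 59, 7]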

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).
Recall that |γ(i)| ≤ 2 log i + 1.
How good is this approach wrt Huffman? Compression ratio ≤ 2 H0(S) + 1.

Key fact:  1 ≥ Σ_{j=1,…,i} pj ≥ i * pi   ⇒   i ≤ 1/pi

Hence the cost of the encoding is:

  Σ_{i=1,…,|S|} pi * |γ(i)|  ≤  Σ_{i=1,…,|S|} pi * [ 2 log(1/pi) + 1 ]  =  2 H0(S) + 1

Not much worse than Huffman, and improvable to H0(S) + 2 + ….

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, assuming c = 256 - s:
  brute-force approach, or
  binary search (on real distributions there seems to be one unique minimum).
  (Ks = max codeword length; Fs,k = cumulative prob. of the symbols whose |cw| ≤ k)

Experiments: (s,c)-DC is quite interesting…
Search is 6% faster than byte-aligned Huffword.

Streaming compression
Still you need to determine and sort all terms…. Can we do everything in one pass?

  Move-to-Front (MTF):
    as a freq-sorting approximator
    as a caching strategy
    as a compressor
  Run-Length Encoding (RLE):
    FAX compression

Move-to-Front Coding
Transforms a char sequence into an integer sequence, that can then be
var-length coded:
  Start with the list of symbols L = [a,b,c,d,…]
  For each input symbol s:
    1) output the position of s in L
    2) move s to the front of L

There is a memory: it exploits temporal locality, and it is dynamic.
  X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n^2 log n) bits, MTF = O(n log n) + n^2 bits

Not much worse than Huffman ... but it may be far better.
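A minimal sketch of MTF encoding/decoding over a small alphabet (list updates are linear here; the search-tree/hash-table organisation discussed later makes them logarithmic):

def mtf_encode(text, alphabet):
    L, out = list(alphabet), []
    for s in text:
        i = L.index(s)          # 1) position of s in L (0-based here)
        out.append(i)
        L.pop(i)                # 2) move s to the front of L
        L.insert(0, s)
    return out

def mtf_decode(codes, alphabet):
    L, out = list(alphabet), []
    for i in codes:
        s = L[i]
        out.append(s)
        L.pop(i)
        L.insert(0, s)
    return "".join(out)

codes = mtf_encode("aaabbbaaa", "abcd")
print(codes)                                     # [0, 0, 0, 1, 0, 0, 1, 0, 0]
print(mtf_decode(codes, "abcd") == "aaabbbaaa")  # True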

MTF: how good is it?
Encode the output integers via γ-coding: |γ(i)| ≤ 2 log i + 1.
Put the |S| symbols at the front, and bound the cost of encoding a symbol x
that occurs nx times at positions p1 < p2 < … < p_nx:

  cost(x) ≤ O(|S| log |S|) + Σ_{i=2,…,nx} γ( p_i - p_{i-1} )

By Jensen’s inequality, summing over all symbols:

  total ≤ O(|S| log |S|) + Σ_x nx * [ 2 log(N/nx) + 1 ]
        = O(|S| log |S|) + N * [ 2 H0(X) + 1 ]

Hence La[mtf] ≤ 2 H0(X) + O(1).

MTF: higher compression
To achieve higher compression we consider words (and separators) as the
symbols to be encoded.
How to keep the MTF-list efficiently:
  Search tree
    leaves contain the symbols, ordered as in the MTF-list
    nodes contain the size of their descending subtree
  Hash table
    key is a symbol
    data is a pointer to the corresponding tree leaf
Each tree operation takes O(log |S|) time;
the total is O(n log |S|), where n = #symbols to be compressed.

Run-Length Encoding (RLE)
If spatial locality is very high, then
  abbbaacccca  =>  (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings, just the run lengths and one initial bit suffice.
There is a memory: it exploits spatial locality, and it is a dynamic code.
  X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = Θ(n^2 log n)  >  Rle(X) = Θ(n log n)
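A minimal RLE sketch matching the example above:

from itertools import groupby

def rle(s):
    # abbbaacccca -> [('a',1), ('b',3), ('a',2), ('c',4), ('a',1)]
    return [(ch, len(list(run))) for ch, run in groupby(s)]

print(rle("abbbaacccca"))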

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as an option), bzip.
More time-costly than Huffman, but the integer implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol an interval in [0,1) of width p(symbol),
starting at the cumulative probability of the preceding symbols:

  f(i) = Σ_{j<i} p(j)

e.g. with p(a) = .2, p(b) = .5, p(c) = .3:
  f(a) = .0, f(b) = .2, f(c) = .7
  a → [0, .2),  b → [.2, .7),  c → [.7, 1.0)

The interval for a particular symbol is called the symbol interval
(e.g. for b it is [.2, .7)).

Arithmetic Coding: Encoding Example
Coding the message sequence: bac  (p(a) = .2, p(b) = .5, p(c) = .3)

  start:    [0, 1)
  after b:  [.2, .7)      (width .5)
  after a:  [.2, .3)      (width .5 * .2 = .1)
  after c:  [.27, .3)     (width .1 * .3 = .03)

The final sequence interval is [.27, .3).

Arithmetic Coding
To code a sequence of symbols c1 c2 … cn with probabilities p[c], use:

  l_0 = 0,   l_i = l_{i-1} + s_{i-1} * f[c_i]
  s_0 = 1,   s_i = s_{i-1} * p[c_i]

where f[c] is the cumulative probability up to symbol c (not included).
The final interval size is

  s_n = Π_{i=1..n} p[c_i]

The interval [l_n, l_n + s_n) for a message sequence will be called the
sequence interval.
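A minimal sketch of the interval computation above, checked on the “bac” example; floating point is used only for illustration, real coders use the integer renormalization described later:

def sequence_interval(msg, p):
    # f[c] = cumulative probability of the symbols preceding c (insertion order of p).
    f, acc = {}, 0.0
    for c in p:
        f[c] = acc
        acc += p[c]
    l, s = 0.0, 1.0                 # l_0 = 0, s_0 = 1
    for c in msg:
        l = l + s * f[c]            # l_i = l_{i-1} + s_{i-1} * f[c_i]
        s = s * p[c]                # s_i = s_{i-1} * p[c_i]
    return l, s

l, s = sequence_interval("bac", {"a": .2, "b": .5, "c": .3})
print(l, l + s)                     # ≈ 0.27 0.3  ->  sequence interval [.27, .3)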

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message has length 3
(p(a) = .2, p(b) = .5, p(c) = .3):

  .49 ∈ [.2, .7)     →  b,  restrict to [.2, .7)
  .49 ∈ [.3, .55)    →  b,  restrict to [.3, .55)
  .49 ∈ [.475, .55)  →  c

The message is bbc.

Representing a real number
Binary fractional representation:
  .75   = .11
  1/3   = .010101…
  11/16 = .1011

Algorithm (emit the binary expansion of x ∈ [0,1)):
  1. x = 2*x
  2. if x < 1, output 0
  3. else x = x - 1; output 1

So how about just using the shortest binary fractional representation
inside the sequence interval?
e.g. [0,.33) → .01,  [.33,.66) → .1,  [.66,1) → .11

Representing a code interval
A binary fractional number can be viewed as an interval, by considering
all its possible completions:

  number   min       max       interval
  .11      .110…     .111…     [.75, 1.0)
  .101     .1010…    .1011…    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: for a text of length n, the Arithmetic encoder generates at most

  1 + ⌈log2(1/s)⌉ = 1 + ⌈log2 Π_i (1/p_i)⌉
                  ≤ 2 + Σ_{j=1..n} log2(1/p_j)
                  = 2 + Σ_{k=1..|S|} n*p_k * log2(1/p_k)
                  = 2 + n*H0   bits

In practice ≈ n*H0 + 0.02*n bits, because of rounding.

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox (ATB)
As a state machine: the current state is the interval (L,s); feeding a symbol c
with distribution (p1,…,p|S|) moves the ATB to the new state
(L’, s’) = (L + s*f[c], s*p[c]).

Therefore, even the distribution can change over time.

K-th order models: PPM
Use the previous k characters as the context.
  Makes use of conditional probabilities: this is the changing distribution.
Base the probabilities on counts:
  e.g. if "th" has been seen 12 times, followed by "e" 7 times, then
  the conditional probability p(e|th) = 7/12.
Need to keep k small so that the dictionary does not get too large
(typically less than 8).

PPM: Partial Matching
Problem: what do we do if the current context has never been seen followed
by the next character? We cannot code 0 probabilities!
The key idea of PPM is to reduce the context size if the previous match
has not been seen:
  if the character has not been seen before with the current context of
  size 3, send an escape-msg and then try the context of size 2,
  then again an escape-msg and the context of size 1, ….
Keep statistics for each context size < k.
The escape is a special character with some probability; different variants
of PPM use different heuristics for this probability.

PPM + Arithmetic ToolBox
At each step the model feeds the ATB with the pair (s, p[s|context]),
where s is the next character c or the escape symbol.
Encoder and decoder must know the protocol for selecting the same
conditional probability distribution (the PPM variant).

PPM: Example Contexts   (k = 2)
String = ACCBACCACBA, next char = B

  Context ∅:    A=4  B=2  C=5  $=3

  Context A:    C=3  $=1
  Context B:    A=2  $=1
  Context C:    A=1  B=2  C=2  $=3

  Context AC:   B=1  C=2  $=2
  Context BA:   C=1  $=1
  Context CA:   C=1  $=1
  Context CB:   A=2  $=1
  Context CC:   A=1  B=1  $=2
You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
  a a c a a c a b c a b a b a c          Output e.g. <2,3,c>
  Dictionary = all substrings starting before the cursor
  Cursor = current position

Algorithm’s step:
  Output <d, len, c> where
    d = distance of the copied string wrt the current position
    len = length of the longest match
    c = next char in the text beyond the longest match
  Advance by len + 1

A buffer “window” of fixed length slides over the text.

Example: LZ77 with window (window size = 6)
  a a c a a c a b c a b a a a c    →  (0,0,a)
  a a c a a c a b c a b a a a c    →  (1,1,c)
  a a c a a c a b c a b a a a c    →  (3,4,b)
  a a c a a c a b c a b a a a c    →  (3,3,a)
  a a c a a c a b c a b a a a c    →  (1,2,c)
  (d, len) refer to the longest match within the window W; c is the next character.

LZ77 Decoding
The decoder keeps the same dictionary window as the encoder:
it finds the substring and inserts a copy of it.

What if len > d? (overlap with the text still to be decompressed)
  E.g. seen = abcd, next codeword is (2,9,e):
  simply copy starting at the cursor
    for (i = 0; i < len; i++)
      out[cursor+i] = out[cursor-d+i]
  Output is correct: abcdcdcdcdcdce
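A minimal sketch of the decoder loop above, handling the overlapping-copy case exactly as described:

def lz77_decode(codewords):
    out = []
    for d, length, c in codewords:
        cursor = len(out)
        for i in range(length):             # works even when length > d (overlap)
            out.append(out[cursor - d + i])
        out.append(c)
    return "".join(out)

print(lz77_decode([(0, 0, 'a'), (0, 0, 'b'), (0, 0, 'c'), (0, 0, 'd'), (2, 9, 'e')]))
# abcdcdcdcdcdce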

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
  a a b a a c a b c a b c b     Output (0,a)   Dict: 1 = a
  a a b a a c a b c a b c b     Output (1,b)   Dict: 2 = ab
  a a b a a c a b c a b c b     Output (1,a)   Dict: 3 = aa
  a a b a a c a b c a b c b     Output (0,c)   Dict: 4 = c
  a a b a a c a b c a b c b     Output (2,c)   Dict: 5 = abc
  a a b a a c a b c a b c b     Output (5,b)   Dict: 6 = abcb

LZ78: Decoding Example
  (0,a)  →  a                           Dict: 1 = a
  (1,b)  →  a ab                        Dict: 2 = ab
  (1,a)  →  a ab aa                     Dict: 3 = aa
  (0,c)  →  a ab aa c                   Dict: 4 = c
  (2,c)  →  a ab aa c abc               Dict: 5 = abc
  (5,b)  →  a ab aa c abc abcb          Dict: 6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example   (a = 112, b = 113, c = 114)
  Input: a a b a a c a b a b a c b
  Output 112   add 256 = aa
  Output 112   add 257 = ab
  Output 113   add 258 = ba
  Output 256   add 259 = aac
  Output 114   add 260 = ca
  Output 257   add 261 = aba
  Output 261   add 262 = abac
  Output 114   add 263 = cb

LZW: Decoding Example
  Input 112  →  a
  Input 112  →  a a                add 256 = aa
  Input 113  →  a a b              add 257 = ab
  Input 256  →  a a b a a          add 258 = ba
  Input 114  →  a a b a a c        add 259 = aac
  Input 257  →  a a b a a c a b ?  add 260 = ca
  Input 261  →  … a b a            261 is not yet in the dictionary: SSc case,
                                   261 = “ab” + “a” = aba (defined one step later)
  Input 114  →  … c                add 262 = abac
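A minimal LZW sketch with the SSc special case handled in the decoder; the dictionary is seeded with the symbols of the input rather than all 256 ASCII codes, just to keep the example readable:

def lzw_encode(text):
    dic = {c: i for i, c in enumerate(sorted(set(text)))}
    out, cur = [], ""
    for c in text:
        if cur + c in dic:
            cur += c
        else:
            out.append(dic[cur])
            dic[cur + c] = len(dic)        # add Sc to the dictionary
            cur = c
    out.append(dic[cur])
    return out, sorted(set(text))

def lzw_decode(codes, alphabet):
    dic = {i: c for i, c in enumerate(alphabet)}
    prev = dic[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in dic:
            cur = dic[code]
        else:                               # SSc case: code is being defined now
            cur = prev + prev[0]
        out.append(cur)
        dic[len(dic)] = prev + cur[0]
        prev = cur
    return "".join(out)

codes, alpha = lzw_encode("aabaacababacb")
print(codes)
print(lzw_decode(codes, alpha) == "aabaacababacb")   # True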

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform   (1994)
Let us be given a text T = mississippi#

  Rotations of T:        Sort the rows:    F              L
  mississippi#                             # mississipp   i
  ississippi#m                             i #mississip   p
  ssissippi#mi                             i ppi#missis   s
  sissippi#mis                             i ssippi#mis   s
  issippi#miss                             i ssissippi#   m
  ssippi#missi                             m ississippi   #
  sippi#missis                             p i#mississi   p
  ippi#mississ                             p pi#mississ   i
  ppi#mississi                             s ippi#missi   s
  pi#mississip                             s issippi#mi   s
  i#mississipp                             s sippi#miss   i
  #mississippi                             s sissippi#m   i

L is the last column of the sorted rotation matrix: L = ipssm#pissii

A famous example

Much
longer...

A useful tool: the L → F mapping

[the same sorted-rotation matrix as above, with first column F and last column L]

How do we map L’s chars onto F’s chars?
... we need to distinguish equal chars in F...
Take two equal chars of L and rotate their rows rightward by one position:
the rows remain sorted, hence equal chars keep the same relative order
in L and in F.

The BWT is invertible

[the same sorted-rotation matrix, with columns F and L]

Two key properties:
  1. The LF-array maps L’s chars to F’s chars
  2. L[i] precedes F[i] in T

Reconstruct T backward:  T = .... i p p i #

InvertBWT(L)
  Compute LF[0,n-1];
  r = 0; i = n;
  while (i > 0) {
    T[i] = L[r];
    r = LF[r]; i--;
  }
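A minimal sketch of the forward and inverse transform; the forward step sorts all rotations explicitly, i.e. the “elegant but inefficient” construction discussed right below:

def bwt(T):
    # T must end with a unique smallest sentinel, e.g. '#'
    rots = sorted(T[i:] + T[:i] for i in range(len(T)))
    return "".join(row[-1] for row in rots)

def ibwt(L, sentinel="#"):
    F = sorted(L)
    # LF-mapping: the k-th occurrence of a char in L is its k-th occurrence in F.
    count, rank = {}, []
    for c in L:
        rank.append(count.get(c, 0))
        count[c] = count.get(c, 0) + 1
    first = {}
    for i, c in enumerate(F):
        first.setdefault(c, i)
    LF = [first[c] + rank[i] for i, c in enumerate(L)]
    # Rebuild T backward: row 0 starts with the sentinel, so L[0] precedes it in T.
    out, r = [], 0
    for _ in range(len(L) - 1):
        out.append(L[r])
        r = LF[r]
    return "".join(reversed(out)) + sentinel

L = bwt("mississippi#")
print(L)            # ipssm#pissii
print(ibwt(L))      # mississippi#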

How to compute the BWT ?

  SA   sorted rotation    L
  12   #mississipp        i
  11   i#mississip        p
   8   ippi#missis        s
   5   issippi#mis        s
   2   ississippi#        m
   1   mississippi        #
  10   pi#mississi        p
   9   ppi#mississ        i
   7   sippi#missi        s
   4   sissippi#mi        s
   6   ssippi#miss        i
   3   ssissippi#m        i

We said that L[i] precedes F[i] in T.
Given SA and T, we have L[i] = T[SA[i]-1]  (e.g. L[3] = T[SA[3]-1] = T[7]).
How to construct SA from T?

  SA    sorted suffixes
  12    #
  11    i#
   8    ippi#
   5    issippi#
   2    ississippi#
   1    mississippi#
  10    pi#
   9    ppi#
   7    sippi#
   4    sissippi#
   6    ssippi#
   3    ssissippi#

Input: T = mississippi#
Elegant but inefficient; obvious inefficiencies:
  Θ(n^2 log n) time in the worst case
  Θ(n log n) cache misses or I/O faults
Many algorithms, now...

Compressing L seems promising...
Key observation: L is locally homogeneous  ⇒  L is highly compressible.

Algorithm Bzip:
  Move-to-Front coding of L
  Run-Length coding
  Statistical coder
Bzip vs. Gzip: 20% vs. 33% compression ratio, but Bzip is slower in (de)compression!

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii     (# at position 16)
MTF-list = [i,m,p,s]
MTF  = 020030000030030200300300000100000
(shifted: 030040000040040300400400000200000, alphabet of |S|+1 symbols;
 runs of 0s are then coded, e.g. Bin(6)=110, with Wheeler’s code)
RLE0 = 03141041403141410210

Bzip2 output = Arithmetic/Huffman on |S|+1 symbols...
... plus γ(16), plus the original MTF-list (i,m,p,s)

You find this in your Linux distribution.

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…

  Physical network graph: V = routers, E = communication links
  The “cosine” graph (undirected, weighted): V = static web pages,
    E = semantic distance between pages
  Query-Log graph (bipartite, weighted): V = queries and URLs,
    E = (q,u) if u is a result for q that has been clicked by some user who issued q
  Social graph (undirected, unweighted): V = users,
    E = (x,y) if x knows y (facebook, address book, email, …)

Definition
Directed graph G = (V,E):
  V = URLs, E = (u,v) if u has a hyperlink to v
  Isolated URLs are ignored (no IN & no OUT)

First key property:
  Skewed distribution: the probability that a node has x links is ∝ 1/x^a, a ≈ 2.1

The In-degree distribution
Altavista crawl (1999) and WebBase crawl (2001): the indegree follows a power law,

  Pr[ in-degree(u) = k ]  ∝  1 / k^a,   a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph

Uncompressed adjacency list  →  adjacency list with compressed gaps (locality):
  successor list S(x) = {s1, s2, …, sk} is encoded as
  {s1 - x, s2 - s1 - 1, …, sk - s_{k-1} - 1}.  For negative entries: …

Copy-lists (similarity), with reference chains possibly limited:
  each bit of y’s copy-list tells whether the corresponding successor of the
  reference list x is also a successor of y;
  the reference index is chosen in [0,W] so as to give the best compression.

Copy-blocks = RLE(Copy-list):
  the first copy-block is 0 if the copy-list starts with 0;
  the last block is omitted (we know the length…);
  the length is decremented by one for all blocks.

This is a Java and C++ lib  (≈ 3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression (one-to-one)

Problem: we have two files f_known and f_new, and the goal is to compute a file
f_d of minimum size such that f_new can be derived from f_known and f_d.
  Assume that block moves and copies are allowed:
  find an optimal covering set of f_new based on f_known.
  The LZ77 scheme provides an efficient, optimal solution:
  f_known is the “previously encoded text”; compress the concatenation
  f_known · f_new starting from f_new.
zdelta is one of the best implementations.

              Emacs size    Emacs time
  uncompr        27Mb           ---
  gzip            8Mb          35 secs
  zdelta         1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: we wish to compress a group of files F
(useful on a dynamic collection of web pages, back-ups, …).

Apply pairwise zdelta: find for each f ∈ F a good reference.
Reduction to the Min Branching problem on DAGs:
  build a weighted graph G_F: nodes = files, edge weights = zdelta-sizes;
  insert a dummy node connected to all files, whose edge weights are the gzip-coded sizes;
  compute the min branching = directed spanning tree of minimum total cost covering G’s nodes.

              space      time
  uncompr     30Mb        ---
  tgz         20%        linear
  THIS         8%        quadratic

Improvement: what about many-to-one compression (a group of files)?
Problem: constructing G is very costly, n^2 edge calculations (zdelta executions).
We wish to exploit some pruning approach:
  Collection analysis: cluster the files that appear similar and are thus good
  candidates for zdelta-compression; build a sparse weighted graph G’_F containing
  only the edges between those pairs of files.
  Assign weights: estimate appropriate edge weights for G’_F, thus saving zdelta
  executions. Nonetheless, strictly n^2 time.

              space      time
  uncompr     260Mb       ---
  tgz         12%        2 mins
  THIS         8%        16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm

  The client splits f_old into blocks and sends their hashes to the server;
  the server scans f_new and sends back an encoding of f_new as a mix of
  literal bytes and references to blocks whose hashes matched.

The rsync algorithm (contd)
  simple, widely used, single roundtrip
  optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
  choice of block size is problematic (default: max{700, √n} bytes)
  not good in theory: the granularity of changes may disrupt the use of blocks

Rsync: some experiments

            gcc size    emacs size
  total      27288        27326
  gzip        7563         8577
  zdelta       227         1431
  rsync        964         4452

  (compressed size in KB; slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta!!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree

[figure: suffix tree of T# = mississippi#, with edge labels such as
 “mississippi#”, “i”, “s”, “p”, “ssi”, “ppi#”, “#”, and the leaves storing
 the starting positions 1…12 of the suffixes]

The Suffix Array
Prop 1. All suffixes in SUF(T) sharing prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

Storing SUF(T) explicitly would take Θ(N^2) space; store suffix pointers instead:

  SA            SUF(T)
  12            #
  11            i#
   8            ippi#
   5            issippi#
   2            ississippi#
   1            mississippi#
  10            pi#
   9            ppi#
   7            sippi#
   4            sissippi#
   6            ssippi#
   3            ssissippi#

T = mississippi#,  P = si

Suffix Array space:
  SA: Θ(N log2 N) bits
  Text T: N chars
  In practice, a total of 5N bytes.

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison,
2 text accesses per step.

T = mississippi#,  P = si:
at each step compare P against the suffix T[SA[mid], …];
move right if P is larger, left if P is smaller.

Suffix Array search:
  O(log2 N) binary-search steps
  each step takes O(p) char comparisons
  ⇒ overall, O(p log2 N) time
  [improved to O(p + log2 N) by Manber-Myers ’90, and to O(p + log2 |S|) by Cole et al. ’06]
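A minimal sketch of the indirect binary search over SA, on the running example (Python string slicing stands in for the O(p) suffix comparison; the SA construction is the toy "elegant but inefficient" one):

def suffix_array(T):
    # Toy construction by explicit sorting of the suffix start positions (1-based).
    return sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])

def sa_search(T, SA, P):
    # Indirect binary search: find the SA range of suffixes prefixed by P.
    lo, hi = 0, len(SA)
    while lo < hi:                              # leftmost suffix >= P
        mid = (lo + hi) // 2
        if T[SA[mid] - 1:SA[mid] - 1 + len(P)] < P:
            lo = mid + 1
        else:
            hi = mid
    left, hi = lo, len(SA)
    while lo < hi:                              # leftmost suffix with prefix > P
        mid = (lo + hi) // 2
        if T[SA[mid] - 1:SA[mid] - 1 + len(P)] <= P:
            lo = mid + 1
        else:
            hi = mid
    return sorted(SA[left:lo])                  # starting positions (1-based)

T = "mississippi#"
SA = suffix_array(T)
print(sa_search(T, SA, "si"))                   # [4, 7]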

Locating the occurrences

Once the SA range of P is found (binary search between P# and P$, i.e. the
smallest and largest strings prefixed by P), all the occ occurrences are the
contiguous SA entries in that range; e.g. P = si in T = mississippi# gives
the SA entries 7 (sippi…) and 4 (sissippi…), so occ = 2.

Suffix Array search: O(p + log2 N + occ) time
Suffix Trays: O(p + log2 |S| + occ)   [Cole et al., ’06]
String B-tree   [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays   [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA.

  Lcp:    0  1  1  4  0  0  1  0  2  1  3
  SA:  12 11  8  5  2  1 10  9  7  4  6  3        T = mississippi#

  (e.g. lcp(issippi#, ississippi#) = 4)

• How long is the common prefix between T[i,…] and T[j,…]?
  It is the minimum of the subarray Lcp[h,k-1], where SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L?
  Search for some Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times?
  Search for a window Lcp[i, i+C-2] whose entries are all ≥ L.
Slide 184

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:

  i(s) = log2 ( 1/p(s) ) = - log2 p(s)

Lower probability  ⇒  higher information

Entropy is the weighted average of i(s):

  H(S) = Σ_{s∈S} p(s) * log2 ( 1/p(s) )   bits
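As a quick sanity check of the formula above, a few lines of Python (ours, not the slides') that compute H(S) for a given distribution:

  from math import log2

  def entropy(p):
      """H(S) = sum over symbols of p(s) * log2(1/p(s))."""
      return sum(ps * log2(1.0 / ps) for ps in p.values() if ps > 0)

  print(entropy({'a': 0.5, 'b': 0.5}))                       # 1.0 bit per symbol
  print(entropy({'a': 0.1, 'b': 0.2, 'c': 0.2, 'd': 0.5}))   # ≈ 1.76 bits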

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g. a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie:
  [figure: binary trie with leaf a at path 0, b at 100, c at 101, d at 11]

Average Length
For a code C with codeword lengths L[s], the
average length is defined as

  La(C) = Σ_{s∈S} p(s) * L[s]

We say that a prefix code C is optimal if for
all prefix codes C', La(C) ≤ La(C')

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, there exists a prefix
code with the same codeword lengths and thus the
same (optimal) average length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then  pi < pj  ⇒  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

  H(S) ≤ La(C)

Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

  La(C) ≤ H(S) + 1

Shannon code: symbol s takes ⌈log2 (1/p(s))⌉ bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

  [figure: merge a(.1)+b(.2) into (.3); merge (.3)+c(.2) into (.5); merge (.5)+d(.5) into (1); label 0 and 1 on the two branches of each merge]

a=000, b=001, c=01, d=1
There are 2^(n-1) "equivalent" Huffman trees

What about ties (and thus, tree depth) ?
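A minimal Python sketch (ours) of the greedy construction behind this example: repeatedly merge the two least-probable trees, prefixing 0/1 to the codewords of the merged subtrees. Tie-breaking is arbitrary, which is exactly why several equivalent trees exist.

  import heapq
  from itertools import count

  def huffman_codes(probs):
      """Return a prefix code built by repeatedly merging the two lightest trees."""
      tick = count()                               # tie-breaker for equal probabilities
      heap = [(p, next(tick), {s: ""}) for s, p in probs.items()]
      heapq.heapify(heap)
      while len(heap) > 1:
          p0, _, c0 = heapq.heappop(heap)          # lightest tree gets bit '0'
          p1, _, c1 = heapq.heappop(heap)          # second lightest gets bit '1'
          merged = {s: "0" + w for s, w in c0.items()}
          merged.update({s: "1" + w for s, w in c1.items()})
          heapq.heappush(heap, (p0 + p1, next(tick), merged))
      return heap[0][2]

  print(huffman_codes({'a': .1, 'b': .2, 'c': .2, 'd': .5}))
  # codeword lengths match the running example: a,b -> 3 bits, c -> 2 bits, d -> 1 bit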

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at the root and take the branch for
each bit received. When at a leaf, output its
symbol and return to the root.

  abc...  ⇒  00000101
  101001...  ⇒  dcb

  [figure: the Huffman tree of the running example, with a(.1) and b(.2) under node (.3), that node and c(.2) under node (.5), and d(.5) as the other child of the root]

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

  firstcode[L]   (the first codeword of level L; for the deepest level it is 00…0)
  Symbol[L,i], for each i in level L

This is ≤ h² + |S| log |S| bits

Canonical Huffman: Encoding
  [figure: the codeword table organized by levels 1–5]

Canonical Huffman: Decoding
  firstcode[1]=2
  firstcode[2]=1
  firstcode[3]=1
  firstcode[4]=2
  firstcode[5]=0

  T=...00010...
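A small Python sketch (ours, with a made-up toy code, not the 5-level table of the slide) of the decoding loop that the firstcode[] array enables: read bits one at a time and, at each length, check whether the value read so far is one of the canonical codewords of that length.

  def canonical_decode(bits, firstcode, symbols):
      """Decode a bit-string with a canonical Huffman code.
      firstcode[l] = numeric value of the first codeword of length l;
      symbols[l]   = the symbols of length l, in canonical order."""
      out, v, l = [], 0, 0
      for b in bits:
          v = (v << 1) | int(b)
          l += 1
          if l in firstcode and 0 <= v - firstcode[l] < len(symbols[l]):
              out.append(symbols[l][v - firstcode[l]])   # hit: emit and restart
              v, l = 0, 0
      return "".join(out)

  # toy canonical code with the lengths of the running example: d=0, c=10, a=110, b=111
  firstcode = {1: 0, 2: 2, 3: 6}
  symbols   = {1: ['d'], 2: ['c'], 3: ['a', 'b']}
  print(canonical_decode("0101110111", firstcode, symbols))   # dcbdb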

Problem with Huffman Coding
Consider a symbol with probability .999. Its
self information is

  -log2(.999) ≈ .00144 bits

If we were to send 1000 such symbols we
might hope to use 1000 * .00144 ≈ 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
   1 extra bit per macro-symbol = 1/k extra bits per symbol
   Larger model to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:
  Model takes |S|^k * (k * log |S|) + h² bits  (where h might be |S|)
  It is H0(S^L) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
  [figure: word-based Huffman tree over the words of T — bzip, or, not, space — with byte-aligned codewords; in each codeword the first ("tagging") bit marks the codeword start and the remaining 7 bits carry the Huffman configuration]

  T = "bzip or not bzip"
  [figure: the coded text C(T), one tagged byte-aligned codeword per word]

CGrep and other ideas...

  P = bzip = 1a 0b
  T = "bzip or not bzip"
  [figure: GREP run directly on C(T): compare C(P) against the tagged byte-aligned codewords of C(T), marking yes/no at each codeword boundary]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary = { bzip, not, or, space }
P = bzip = 1a 0b
S = "bzip or not bzip"

  [figure: scan C(S) codeword by codeword, comparing each tagged codeword with C(P); matches are marked yes/no]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
  [figure: text T = ...ABCABDAB... with the pattern P = AB aligned at an occurrence]
 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which arithmetic and bit operations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






  An efficient randomized algorithm that makes an error with small probability.
  A randomized algorithm that never errs, whose running time is efficient with high probability.

We will consider a binary alphabet (i.e., T ∈ {0,1}^n).

Arithmetic replaces Comparisons

Strings are also numbers, H: strings → numbers.
Let s be a string of length m:

  H(s) = Σ_{i=1,…,m} 2^(m-i) * s[i]

P = 0101
H(P) = 2^3*0 + 2^2*1 + 2^1*0 + 2^0*1 = 5

s = s' if and only if H(s) = H(s')

Definition:
let Tr denote the m-length substring of T starting at
position r (i.e., Tr = T[r, r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons

We can compute H(Tr) from H(T(r-1)):

  H(Tr) = 2 * H(T(r-1)) - 2^m * T[r-1] + T[r+m-1]

T = 10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

  H(T1) = H(1011) = 11
  H(T2) = 2*11 - 2^4*1 + 0 = 22 - 16 + 0 = 6 = H(0110)

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P = 101111,  q = 7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally (scanning the bits, h ← (2h + bit) mod 7):
  1*2 (mod 7) + 0 = 2
  2*2 (mod 7) + 1 = 5
  5*2 (mod 7) + 1 = 4
  4*2 (mod 7) + 1 = 2
  2*2 (mod 7) + 1 = 5
  5 (mod 7) = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(T(r-1)):
  2^m (mod q) = 2 * (2^(m-1) (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
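A compact Python sketch (ours) of the fingerprint scan over a binary text: hashes are kept modulo a prime q and every fingerprint hit is verified, so this variant never reports a false match; for simplicity q is fixed here rather than drawn at random below some bound I.

  def karp_rabin(T, P, q=2_147_483_647):
      """All occurrences of P in the binary text T, via rolling fingerprints mod q."""
      n, m = len(T), len(P)
      if m > n:
          return []
      top = pow(2, m - 1, q)                      # 2^(m-1) mod q, used to drop the old bit
      hP = hT = 0
      for i in range(m):                          # fingerprints of P and of T_1
          hP = (2 * hP + int(P[i])) % q
          hT = (2 * hT + int(T[i])) % q
      occ = []
      for r in range(n - m + 1):
          if hP == hT and T[r:r + m] == P:        # verify, so no false match is reported
              occ.append(r + 1)                   # 1-based positions, as in the slides
          if r + m < n:                           # roll: Hq(T_{r+1}) from Hq(T_r)
              hT = (2 * (hT - int(T[r]) * top) + int(T[r + m])) % q
      return occ

  print(karp_rabin("10110101", "0101"))           # [5]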

Problem 1: Solution
Dictionary = { bzip, not, or, space }
P = bzip = 1a 0b
S = "bzip or not bzip"

  [figure: the scan of C(S), with the two occurrences of C(P) marked yes]

Speed ≈ Compression ratio

The Shift-And method

Define M to be a binary m by n matrix such that:

  M(i,j) = 1 iff the first i characters of P exactly match the i
  characters of T ending at character j,
  i.e., M(i,j) = 1 iff P[1…i] = T[j-i+1…j]

Example: T = california and P = for
  [figure: the 3x10 matrix M; M(1,5)=1, M(2,6)=1, M(3,7)=1 (an occurrence of P ends at position 7), all other entries are 0]

How does M solve the exact match problem?

How to construct M

We want to exploit bit-parallelism to compute the j-th
column of M from the (j-1)-th one.

Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:
  And(A,B) is the bit-wise AND between A and B.
  BitShift(A) is the value derived by shifting A's bits down by
  one and setting the first bit to 1.

  e.g.  BitShift( [0,1,1,0,1]ᵗ ) = [1,0,1,1,0]ᵗ

Let w be the word size (e.g., 32 or 64 bits). We'll assume
m = w. NOTICE: any column of M fits in a memory word.

How to construct M

We want to exploit bit-parallelism to compute the j-th column
of M from the (j-1)-th one.
We define the m-length binary vector U(x) for each character x
in the alphabet: U(x) is set to 1 in the positions where
character x appears in P.
Example: P = abaac
  U(a) = [1,0,1,1,0]ᵗ    U(b) = [0,1,0,0,0]ᵗ    U(c) = [0,0,0,0,1]ᵗ

How to construct M

Initialize column 0 of M to all zeros.
For j > 0, the j-th column is obtained as

  M(j) = BitShift( M(j-1) ) & U( T[j] )

For i > 1, entry M(i,j) = 1 iff
  (1) the first i-1 characters of P match the i-1 characters of T
      ending at character j-1     ⇔  M(i-1,j-1) = 1
  (2) P[i] = T[j]                 ⇔  the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) into the i-th position;
AND-ing with the i-th bit of U(T[j]) establishes whether both conditions hold.
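A runnable Python sketch (ours) of the Shift-And scan just described, with each column of M packed into an integer; bit i-1 of the word plays the role of row i, so BitShift becomes (M << 1) | 1.

  def shift_and(T, P):
      """Report all positions j (1-based) where an occurrence of P ends in T."""
      m = len(P)
      U = {}                                   # U[c] has bit i-1 set iff P[i] == c
      for i, c in enumerate(P):
          U[c] = U.get(c, 0) | (1 << i)
      M, last, occ = 0, 1 << (m - 1), []
      for j, c in enumerate(T, start=1):
          # column j from column j-1:  M(j) = BitShift(M(j-1)) & U(T[j])
          M = ((M << 1) | 1) & U.get(c, 0)
          if M & last:                         # row m is set: P ends at position j
              occ.append(j)
      return occ

  print(shift_and("xabxabaaca", "abaac"))      # [9]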

An example (j = 1, 2, 3, …, 9)
P = abaac,  T = xabxabaaca

  [figure: the columns of M computed one by one via M(j) = BitShift(M(j-1)) & U(T[j]); at j = 9 the bottom entry M(5,9) = 1, so an occurrence of P ends at position 9 of T]

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions

We want to allow the pattern to contain special
symbols, like the character class [a-f].

P = [a-b]baac
  U(a) = [1,0,1,1,0]ᵗ    U(b) = [1,1,0,0,0]ᵗ    U(c) = [0,0,0,0,1]ᵗ

What about '?', '[^…]' (not) ?

Problem 1: Another solution
Dictionary = { bzip, not, or, space }
P = bzip = 1a 0b
S = "bzip or not bzip"

  [figure: as before, the scan of C(S) marking yes/no at each codeword]

Speed ≈ Compression ratio

Problem 2
Dictionary = { bzip, not, or, space }

Given a pattern P, find all the occurrences in S
of all terms containing P as a substring.

P = o
S = "bzip or not bzip"

  [figure: the terms "or" and "not" contain P, so their codewords are searched in C(S)]
  not = 1g 0g 0a
  or  = 1g 0a 0b

Speed ≈ Compression ratio? No! Why?
A scan of C(S) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1, P2, …, Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].

  [figure: text T with occurrences of P1 and P2 marked]

 Naïve solution
   Use an (optimal) exact-matching algorithm, searching for each pattern of P separately
   Complexity: O(nl + m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
   Complexity: O(n + l + m) time

A simple extension of Shift-And

  S is the concatenation of the patterns in P
  R is a bitmap of length m:
    R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of the Shift-And method searching for S:
  For any symbol c, U'(c) = U(c) AND R
     U'(c)[i] = 1 iff S[i] = c and it is the first symbol of a pattern
  For any step j,
     compute M(j)
     then M(j) OR U'(T[j]). Why?
        It sets to 1 the first bit of each pattern that starts with T[j]
     Check if there are occurrences ending in j. How?

Problem 3
Dictionary = { bzip, not, or, space }

Given a pattern P, find all the occurrences in S
of all terms containing P as a substring, allowing
at most k mismatches.

P = bot,  k = 2
S = "bzip or not bzip"

  [figure: the compressed text C(S) and the dictionary, as before]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep

Our current goal: given k, find all the occurrences of P
in T with up to k mismatches.
We define the matrix M^l to be an m by n binary matrix, such that:

  M^l(i,j) = 1 iff there are no more than l mismatches between the
  first i characters of P and the i characters of T ending at character j.

What is M^0?
How does M^k solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing M^l: case 1

The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal:

  BitShift( M^l(j-1) ) & U( T[j] )

Computing M^l: case 2

The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches (the l-th mismatch is spent
on position j, whatever T[j] is):

  BitShift( M^(l-1)(j-1) )

Computing M^l

We compute M^l for all l = 0, …, k.
For each j compute M^0(j), M^1(j), …, M^k(j).
For all l initialize M^l(0) to the zero vector.
In order to compute M^l(j), we observe that there is a match iff case 1 or case 2 holds:

  M^l(j) = [ BitShift( M^l(j-1) ) & U( T[j] ) ]  OR  BitShift( M^(l-1)(j-1) )
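A Python sketch (ours) extending the earlier Shift-And code to k mismatches, following the recurrence above; M[l] holds the current column of M^l packed into an integer.

  def agrep_mismatches(T, P, k):
      """Positions j (1-based) where P matches T ending at j with at most k mismatches."""
      m = len(P)
      U = {}
      for i, c in enumerate(P):
          U[c] = U.get(c, 0) | (1 << i)
      last = 1 << (m - 1)
      M = [0] * (k + 1)                         # M[l] = current column of M^l
      occ = []
      for j, c in enumerate(T, start=1):
          prev = M[:]                           # the columns j-1, for all l
          M[0] = ((prev[0] << 1) | 1) & U.get(c, 0)          # exact matching (l = 0)
          for l in range(1, k + 1):
              # M^l(j) = [BitShift(M^l(j-1)) & U(T[j])]  OR  BitShift(M^(l-1)(j-1))
              M[l] = (((prev[l] << 1) | 1) & U.get(c, 0)) | ((prev[l - 1] << 1) | 1)
          if M[k] & last:
              occ.append(j)
      return occ

  print(agrep_mismatches("xabxabaaca", "abaad", 1))   # [9]: abaac vs abaad, 1 mismatch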

Example: M^0 and M^1
T = xabxabaaca,  P = abaad

  [figure: the 5x10 matrices M^0 and M^1; e.g. M^1(5,9) = 1 because P = abaad matches T[5..9] = abaac with one mismatch]

How much do we pay?

  The running time is O( k n (1 + m/w) ).
  Again, the method is practically efficient for small m.
  Only O(k) columns of the M^l matrices are needed at any
  given time; hence, the space used by the algorithm is O(k)
  memory words.

Problem 3: Solution
Dictionary = { bzip, not, or, space }

Given a pattern P, find all the occurrences in S
of all terms containing P as a substring, allowing
k mismatches.

P = bot,  k = 2
S = "bzip or not bzip"

  [figure: "not" matches P within k mismatches, so its codeword not = 1g 0g 0a is searched in C(S); its occurrence is marked yes]

Agrep: more sophisticated operations

The Shift-And method can solve other ops:

The edit distance between two strings p and s is
d(p,s) = the minimum number of operations needed to
transform p into s via three ops:
  Insertion: insert a symbol in p
  Deletion: delete a symbol from p
  Substitution: change a symbol of p into a different one

  Example: d(ananas, banane) = 3

Search by regular expressions
  Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding

  γ(x) = 000…0 <x in binary>      (Length-1 zeros, then the binary representation of x)

  x > 0 and Length = ⌊log2 x⌋ + 1
  e.g., 9 is represented as <000, 1001>.

  The γ-code for x takes 2⌊log2 x⌋ + 1 bits
  (i.e., a factor of 2 from optimal)

  Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of γ-coded
integers, reconstruct the original sequence:

  0001000001100110000011101100111

  →  8, 6, 3, 59, 7
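A small Python sketch (ours) of γ-encoding and decoding; running it on the exercise string above reproduces 8, 6, 3, 59, 7.

  def gamma_encode(x):
      """γ(x): (Length-1) zeros followed by the binary representation of x, x > 0."""
      b = bin(x)[2:]
      return "0" * (len(b) - 1) + b

  def gamma_decode(bits):
      out, i = [], 0
      while i < len(bits):
          z = 0
          while bits[i] == "0":                    # count the leading zeros ...
              z += 1
              i += 1
          out.append(int(bits[i:i + z + 1], 2))    # ... then read z+1 bits of binary
          i += z + 1
      return out

  s = "".join(gamma_encode(x) for x in (8, 6, 3, 59, 7))
  print(s)                  # 0001000001100110000011101100111
  print(gamma_decode(s))    # [8, 6, 3, 59, 7]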

Analysis
Sort the p_i in decreasing order, and encode symbol s_i via
the variable-length code γ(i).

Recall that: |γ(i)| ≤ 2 * log2 i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(S) + 1
Key fact:
  1 ≥ Σ_{j=1,…,i} p_j ≥ i * p_i   ⇒   i ≤ 1/p_i

How good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2 * log2 i + 1
The cost of the encoding is (recall i ≤ 1/p_i):

  Σ_{i=1,…,|S|} p_i * |γ(i)|  ≤  Σ_{i=1,…,|S|} p_i * [ 2 * log2 (1/p_i) + 1 ]  =  2 * H0(X) + 1

Not much worse than Huffman,
and improvable to H0(X) + 2 + ...

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2

A new concept: Continuers vs Stoppers
  Previously we used: s = c = 128

The main idea is:
  s + c = 256 (we are playing with 8 bits)
  Thus s items are encoded with 1 byte
  And s*c with 2 bytes, s*c² with 3 bytes, ...

An example
  5000 distinct words
  ETDC encodes 128 + 128² = 16512 words on at most 2 bytes
  A (230,26)-dense code encodes 230 + 230*26 = 6210 words on at most 2
  bytes, hence more words on 1 byte, and thus it wins if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded

  Start with the list of symbols L = [a,b,c,d,…]
  For each input symbol s
    1) output the position of s in L
    2) move s to the front of L

There is a memory
Properties:
  Exploits temporal locality, and it is dynamic
  X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n² log n) bits, MTF = O(n log n) + n² bits

Not much worse than Huffman
...but it may be far better
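A minimal Python sketch (ours) of the MTF transform; the output integers would then be fed to a variable-length coder such as the γ-code above.

  def mtf_encode(text, alphabet):
      L = list(alphabet)                 # the dynamic symbol list
      out = []
      for s in text:
          i = L.index(s)                 # 1) output the position of s in L
          out.append(i)
          L.insert(0, L.pop(i))          # 2) move s to the front of L
      return out

  print(mtf_encode("abbbaa", "abcd"))    # [0, 1, 0, 0, 1, 0]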

MTF: how good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2 * log2 i + 1
Put the alphabet S in front and consider the cost of encoding: for each symbol x,
occurring n_x times at positions p_1 < p_2 < … < p_{n_x},

  O(|S| log |S|) + Σ_{x=1,…,|S|} Σ_{i=2,…,n_x} |γ( p_i - p_{i-1} )|

By Jensen's inequality:

  ≤ O(|S| log |S|) + Σ_{x=1,…,|S|} n_x * [ 2 * log2 (N / n_x) + 1 ]
  = O(|S| log |S|) + N * [ 2 * H0(X) + 1 ]

  La[mtf] ≤ 2 * H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
  abbbaacccca  =>  (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  ⇒  just the run lengths and one bit

Properties:
  Exploits spatial locality, and it is a dynamic code; there is a memory
  X = 1^n 2^n 3^n … n^n  ⇒
  Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)
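For completeness, a few lines of Python (ours) implementing the run-length transform of the example:

  from itertools import groupby

  def rle(s):
      return [(c, len(list(g))) for c, g in groupby(s)]

  print(rle("abbbaacccca"))   # [('a', 1), ('b', 3), ('a', 2), ('c', 4), ('a', 1)]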

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive):

  f(i) = Σ_{j=1,…,i-1} p(j)

e.g. p(a) = .2, p(b) = .5, p(c) = .3   ⇒   f(a) = .0, f(b) = .2, f(c) = .7
  [figure: the unit interval split into a = [0,.2), b = [.2,.7), c = [.7,1.0)]

The interval for a particular symbol will be called
the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac

  [figure: start from [0,1); after b restrict to [.2,.7); after a restrict to [.2,.3); after c restrict to [.27,.3)]

The final sequence interval is [.27, .3)

Arithmetic Coding
To code a sequence of symbols c_1 … c_n with probabilities p[c], use:

  l_0 = 0,   l_i = l_{i-1} + s_{i-1} * f[c_i]
  s_0 = 1,   s_i = s_{i-1} * p[c_i]

f[c] is the cumulative probability up to symbol c (not included).
The final interval size is

  s_n = Π_{i=1,…,n} p[c_i]

The interval for a message sequence will be called the
sequence interval
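A short Python sketch (ours) of the interval computation above, using exact fractions to avoid rounding; it reproduces the [.27, .3) interval of the encoding example and the bbc decoding shown next.

  from fractions import Fraction as F

  p = {'a': F(2, 10), 'b': F(5, 10), 'c': F(3, 10)}
  f = {'a': F(0), 'b': F(2, 10), 'c': F(7, 10)}     # cumulative prob., symbol excluded

  def sequence_interval(msg):
      l, s = F(0), F(1)
      for c in msg:
          l, s = l + s * f[c], s * p[c]             # l_i, s_i from l_{i-1}, s_{i-1}
      return l, s

  def decode(x, n):
      out = []
      for _ in range(n):
          c = max((sym for sym in p if f[sym] <= x), key=lambda sym: f[sym])
          out.append(c)
          x = (x - f[c]) / p[c]                     # rescale x inside c's symbol interval
      return "".join(out)

  l, s = sequence_interval("bac")
  print(float(l), float(l + s))          # 0.27 0.3
  print(decode(F(49, 100), 3))           # bbc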

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:

  [figure: .49 falls in b = [.2,.7); then in b = [.3,.55); then in c = [.475,.55)]

The message is bbc.

Representing a real number
Binary fractional representation:

  .75 = .11       1/3 = .0101…       11/16 = .1011

Algorithm (to emit the bits of x ∈ [0,1)):
  1. x = 2 * x
  2. if x < 1, output 0
  3. else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence interval?
  e.g.  [0,.33) → .01     [.33,.66) → .1     [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions:

           min      max      interval
  .11      .110…    .111…    [.75, 1.0)
  .101     .1010…   .1011…   [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (a dyadic number).

  [figure: sequence interval [.61, .79) contains the code interval of .101 = [.625, .75)]

Can use L + s/2 truncated to 1 + ⌈log2 (1/s)⌉ bits

Bound on Arithmetic length

Note that -log2 s + 1 = log2 (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most

  1 + ⌈log2 (1/s)⌉ = 1 + ⌈ log2 Π_i (1/p_i) ⌉
                   ≤ 2 + Σ_{j=1,…,n} log2 (1/p_j)
                   = 2 + Σ_{k=1,…,|S|} n p_k log2 (1/p_k)
                   = 2 + n H0   bits

In practice ≈ n H0 + 0.02 n bits, because of rounding.

Integer Arithmetic Coding
Problem: operations on arbitrary-precision real numbers are expensive.
Key ideas of the integer version:
  Keep integers in range [0..R) where R = 2^k
  Use rounding to generate the integer interval
  Whenever the sequence interval falls into the top,
  bottom or middle half, expand the interval by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
  If l ≥ R/2 (top half):
    Output 1 followed by m 0s; m = 0; the message interval is expanded by 2
  If u < R/2 (bottom half):
    Output 0 followed by m 1s; m = 0; the message interval is expanded by 2
  If l ≥ R/4 and u < 3R/4 (middle half):
    Increment m; the message interval is expanded by 2
  In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine:

  [figure: the ATB keeps the current interval (L,s); given the distribution (p1,…,p|S|) and the next symbol c, it outputs the new interval (L', s')]

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox

  [figure: at each step the ATB is fed p[ s | context ], where s = c or esc, and maps (L,s) to (L',s')]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
String = ACCBACCACBA B,   k = 2

  Context: Empty      A = 4   B = 2   C = 5   $ = 3

  Context: A          C = 3   $ = 1
  Context: B          A = 2   $ = 1
  Context: C          A = 1   B = 2   C = 2   $ = 3

  Context: AC         B = 1   C = 2   $ = 2
  Context: BA         C = 1   $ = 1
  Context: CA         C = 1   $ = 1
  Context: CB         A = 2   $ = 1
  Context: CC         A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
  a a c a a c a b c a b a b a c
  [figure: Dictionary = the text before the Cursor (all substrings starting there); the triple emitted at the pictured step is <2,3,c>]

Algorithm's step:
  Output <d, len, c>
    d = distance of the copied string wrt the current position
    len = length of the longest match
    c = next char in the text beyond the longest match
  Advance by len + 1

A buffer "window" has fixed length and moves

Example: LZ77 with window (window size = 6)

  a a c a a c a b c a b a a a c   →  (0,0,a)
  a a c a a c a b c a b a a a c   →  (1,1,c)
  a a c a a c a b c a b a a a c   →  (3,4,b)
  a a c a a c a b c a b a a a c   →  (3,3,a)
  a a c a a c a b c a b a a a c   →  (1,2,c)

  [in each step the longest match within W and the next character are emitted]

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
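A compact Python sketch (ours) of LZ77 with a sliding window, emitting (d, len, c) triples as in the example above; it illustrates the parsing (overlapping copies included) and is not meant to be an efficient implementation.

  def lz77(T, W=6):
      i, out = 0, []
      while i < len(T):
          d = length = 0
          lo = max(0, i - W)
          for s in range(lo, i):                       # try every start in the window
              l = 0
              while i + l < len(T) - 1 and T[s + l] == T[i + l]:
                  l += 1                               # copies may overlap the cursor
              if l > length:
                  d, length = i - s, l
          nxt = T[i + length]                          # next char beyond the match
          out.append((d, length, nxt))
          i += length + 1
      return out

  print(lz77("aacaacabcabaaac"))
  # [(0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), (3, 3, 'a'), (1, 2, 'c')]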

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
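A small Python sketch (ours) of LZW encoding; to keep it self-contained the dictionary is seeded with just the symbols of the input instead of the 256 ASCII entries, and codes are returned as integers.

  def lzw_encode(T):
      # seed the dictionary with single symbols (the slides seed it with all of ASCII)
      dic = {c: i for i, c in enumerate(sorted(set(T)))}
      out, S = [], ""
      for c in T:
          if S + c in dic:
              S += c                       # extend the current match
          else:
              out.append(dic[S])           # emit the id of the longest match S
              dic[S + c] = len(dic)        # add Sc to the dictionary
              S = c
      out.append(dic[S])
      return out, dic

  codes, dic = lzw_encode("aabaacababacb")
  print(codes)   # [0, 0, 1, 3, 2, 4, 8, 2, 1]: same parse as the slide (a, a, b, aa, c, ab, aba, c, b)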

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform   (1994)
Let us be given a text T = mississippi#

  [figure: all cyclic rotations of T, from mississippi# down to #mississippi]

Sort the rows; F = first column, L = last column:

  F               L
  # mississipp    i
  i #mississip    p
  i ppi#missis    s
  i ssippi#mis    s
  i ssissippi#    m
  m ississippi    #
  p i#mississi    p
  p pi#mississ    i
  s ippi#missi    s
  s issippi#mi    s
  s sippi#miss    i
  s sissippi#m    i

A famous example: [figure of a much longer text and its transform]

A useful tool: the L → F mapping

  [figure: the same sorted matrix, with the F and L columns highlighted]

How do we map L's chars onto F's chars ?
... We need to distinguish equal chars in F...

Take two equal chars of L and rotate their rows rightward:
they keep the same relative order !!

The BWT is invertible

  [figure: the same sorted matrix, with F and L columns]

Two key properties:
  1. The LF-array maps L's chars to F's chars
  2. L[i] precedes F[i] in T

Reconstruct T backward:   T = .... i p p i #

InvertBWT(L)
  Compute LF[0,n-1];
  r = 0; i = n;
  while (i>0) {
    T[i] = L[r];
    r = LF[r]; i--;
  }
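A tiny Python sketch (ours) of the forward transform and of the LF-based inversion just described; it is quadratic in time and space and only meant to mirror the mississippi# example.

  def bwt(T):
      rots = sorted(T[i:] + T[:i] for i in range(len(T)))   # sorted cyclic rotations
      return "".join(row[-1] for row in rots)                # last column L

  def inverse_bwt(L):
      n = len(L)
      order = sorted(range(n), key=lambda i: L[i])   # stable sort = the LF correspondence
      LF = [0] * n
      for f_row, l_row in enumerate(order):
          LF[l_row] = f_row
      r, chars = 0, []
      for _ in range(n):                  # walk T backwards, starting before the final '#'
          chars.append(L[r])
          r = LF[r]
      chars.reverse()                     # now chars = '#' followed by T without its final '#'
      return "".join(chars[1:]) + chars[0]

  print(bwt("mississippi#"))              # ipssm#pissii
  print(inverse_bwt("ipssm#pissii"))      # mississippi#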

How to compute the BWT ?

  SA    BWT matrix      L
  12    #mississipp     i
  11    i#mississip     p
   8    ippi#missis     s
   5    issippi#mis     s
   2    ississippi#     m
   1    mississippi     #
  10    pi#mississi     p
   9    ppi#mississ     i
   7    sippi#missi     s
   4    sissippi#mi     s
   6    ssippi#miss     i
   3    ssissippi#m     i

We said that: L[i] precedes F[i] in T
  e.g. L[3] = T[7]
Given SA and T, we have L[i] = T[SA[i] - 1]

How to construct SA from T ?
Input: T = mississippi#

  SA    suffix
  12    #
  11    i#
   8    ippi#
   5    issippi#
   2    ississippi#
   1    mississippi#
  10    pi#
   9    ppi#
   7    sippi#
   4    sissippi#
   6    ssippi#
   3    ssissippi#

Elegant but inefficient.
Obvious inefficiencies:
  • Θ(n² log n) time in the worst-case
  • Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999 / WebBase crawl, 2001
Indegree follows a power law distribution:

  Pr[ in-degree(u) = k ]  ∝  1 / k^a,     a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides and efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
             Emacs size    Emacs time
  uncompr    27Mb          ---
  gzip       8Mb           35 secs
  zdelta     1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

  [figure: Client ↔ (slow link) ↔ Proxy ↔ (fast link) ↔ web; both sides keep a reference version of the requested page, and only the delta-encoding of the new page travels over the slow link]

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

  [figure: a small weighted graph over the files plus a dummy node 0; edge weights are zdelta sizes, dummy edges carry the gzip cost; the min branching picks the cheapest reference for each file]

             space    time
  uncompr    30Mb     ---
  tgz        20%      linear
  THIS       8%       quadratic

Improvement: what about many-to-one compression? (group of files)

Problem: Constructing G is very costly: n² edge calculations (zdelta executions)

We wish to exploit some pruning approach:

  Collection analysis: cluster the files that appear similar and are thus
  good candidates for zdelta-compression. Build a sparse weighted graph
  G'_F containing only the edges between those pairs of files.

  Assign weights: estimate appropriate edge weights for G'_F, thus saving
  zdelta executions. Nonetheless, still strictly n² time.
             space    time
  uncompr    260Mb    ---
  tgz        12%      2 mins
  THIS       8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm

  [figure: the Client sends the block hashes of its f_old; the Server replies with the encoded file, i.e. references to matching blocks plus literals of f_new]
The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

            gcc size    emacs size
  total     27288       27326
  gzip      7563        8577
  zdelta    227         1431
  rsync     964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends the hashes (unlike the client in rsync), and the client checks them.
The server deploys the common f_ref to compress the new f_tar (rsync just compresses it).

A multi-round protocol
  k blocks of n/k elems
  log (n/k) levels
  If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
  The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

  [figure: pattern P aligned at position i of T, i.e. P is a prefix of the suffix T[i,N]]

Occurrences of P in T = all suffixes of T having P as a prefix

  P = si,  T = mississippi   →   occurrences at positions 4, 7

SUF(T) = sorted set of suffixes of T

Reduction: from substring search to prefix search

The Suffix Tree

T# = mississippi#   (positions 1..12)

  [figure: the suffix tree of T#; edges are labeled with substrings (e.g. i, si, ssi, p, pi#, ppi#, mississippi#, #) and the 12 leaves store the starting positions of the corresponding suffixes]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

T = mississippi#

  SA    SUF(T)            (written explicitly: Θ(N²) space)
  12    #
  11    i#
   8    ippi#
   5    issippi#
   2    ississippi#
   1    mississippi#
  10    pi#
   9    ppi#
   7    sippi#
   4    sissippi#
   6    ssippi#
   3    ssissippi#

Suffix Array:
  • SA: Θ(N log2 N) bits
  • Text T: N chars
  ⇒ In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison

  P = si,  T = mississippi#
  [figure: the binary-search steps on the SA column; each step makes 2 accesses (an SA entry plus the text) and decides whether P is larger or smaller than the middle suffix]

Suffix Array search
  • O(log2 N) binary-search steps
  • Each step takes O(p) char comparisons
  ⇒ overall, O(p log2 N) time
  (improvable to O(p + log2 N) [Manber-Myers, '90], and to O(p + log2 |S|) [Cole et al, '06])

Locating the occurrences

  T = mississippi#,  P = si   ⇒   occ = 2 (positions 4 and 7)
  [figure: the contiguous SA range of the suffixes prefixed by si (sippi#, sissippi#), delimited by binary searches for si# and si$, where # < S < $]

Suffix Array search:
  • O(p + log2 N + occ) time

Suffix Trays: O(p + log2 |S| + occ)     [Cole et al., '06]
String B-tree                           [Ferragina-Grossi, '95]
Self-adjusting Suffix Arrays            [Ciriani et al., '02]

Text mining
Lcp[1, N-1] = longest-common-prefix between suffixes adjacent in SA

T = mississippi#
  [figure: the SA and Lcp arrays of T; e.g. the Lcp between issippi# and ississippi# is 4]

• How long is the common prefix between T[i,...] and T[j,...] ?
  • Min of the subarray Lcp[h, k-1] s.t. SA[h] = i and SA[k] = j.
• Does there exist a repeated substring of length ≥ L ?
  • Search for Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  • Search for a window Lcp[i, i+C-2] whose entries are all ≥ L

Slide 185

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as

  L_a(C) = Σ_{s ∈ S} p(s) · L[s]

We say that a prefix code C is optimal if for
all prefix codes C’, L_a(C) ≤ L_a(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal uniquely decodable code, there exists a prefix code with the same codeword lengths, and thus the same optimal average length. And vice versa…

Theorem (golden rule). If C is an optimal prefix code for a source with probabilities {p1, …, pn},
then p_i < p_j ⇒ L[s_i] ≥ L[s_j]

Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any uniquely decodable code C, we have

  H(S) ≤ L_a(C)

Theorem (upper bound, Shannon). For any probability distribution, there exists a prefix code C such that

  L_a(C) ≤ H(S) + 1

(The Shannon code takes ⌈log_2 1/p(s)⌉ bits per symbol.)

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

Merge a(.1) + b(.2) → (.3); merge (.3) + c(.2) → (.5); merge (.5) + d(.5) → (1).
Labelling each merge with 0/1 gives the codewords:

  a = 000, b = 001, c = 01, d = 1

There are 2^(n-1) “equivalent” Huffman trees.

What about ties (and thus, tree depth) ?
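A short Python sketch of the greedy construction (my own code, not the slides'): repeatedly merge the two least probable subtrees, prepending 0/1 to the codewords of each side. The exact bit labels depend on tie-breaking, but the codeword lengths match the optimal ones of the example above.

  import heapq

  def huffman_codes(freqs):
      # freqs: {symbol: probability}. Returns {symbol: codeword string}.
      heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(sorted(freqs.items()))]
      heapq.heapify(heap)
      counter = len(heap)
      while len(heap) > 1:
          p1, _, c1 = heapq.heappop(heap)   # least probable subtree
          p2, _, c2 = heapq.heappop(heap)   # second least probable subtree
          merged = {s: "0" + w for s, w in c1.items()}
          merged.update({s: "1" + w for s, w in c2.items()})
          heapq.heappush(heap, (p1 + p2, counter, merged))
          counter += 1
      return heap[0][2]

  # huffman_codes({'a': .1, 'b': .2, 'c': .2, 'd': .5})
  # -> lengths 3,3,2,1 for a,b,c,d (same as a=000, b=001, c=01, d=1, up to ties)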

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at the root and take a branch for
each bit received. When at a leaf, output its
symbol and return to the root.

With the tree above (a=000, b=001, c=01, d=1):
  abc…     → 000 001 01 …
  101001… → d c b …

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store, for any level L:
  firstcode[L] — the numeric value of the first codeword of length L (at the deepest level it is 00…0)
  Symbol[L,i], for each i in level L

This is ≤ h² + |S| log |S| bits, where h is the tree height.

Canonical Huffman: Encoding
[Figure: a canonical Huffman tree with levels 1–5.]

Canonical Huffman: Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...
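A sketch of the decoding loop in Python, following one common convention for firstcode (the one used in Managing Gigabytes, where a numeric value smaller than firstcode[l] means the codeword must be longer); the symbol table argument is hypothetical.

  def canonical_decode(bits, firstcode, symbol):
      # bits: iterator of 0/1; firstcode[l]: value of the first codeword of length l;
      # symbol[l]: symbols of level l in canonical order. Decodes ONE codeword.
      v = next(bits)
      l = 1
      while v < firstcode[l]:      # codeword is longer: read one more bit
          v = 2 * v + next(bits)
          l += 1
      return symbol[l][v - firstcode[l]]

  # With the slide's table firstcode = {1:2, 2:1, 3:1, 4:2, 5:0}, the bits
  # 0,0,0,1,0 are consumed as a single length-5 codeword whose canonical
  # index at level 5 is 2, i.e. Symbol[5,2].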

Problem with Huffman Coding
Consider a symbol with probability .999. Its
self information is

  -log_2(0.999) ≈ 0.00144 bits

If we were to send 1000 such symbols we
might hope to use 1000 · 0.00144 ≈ 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:

  The model takes |S|^k · (k · log |S|) + h² bits (where h might be |S|)
  It is H_0(S^L) ≤ L · H_k(S) + O(k · log |S|), for each k ≤ L

Compress + Search ?    [Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the Huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged: 1 tag bit + 7 bits of Huffman code per byte

[Figure: the word-based Huffman tree for T = “bzip or not bzip” (words: bzip, or, not, space) and the byte-aligned, tagged codewords forming C(T); e.g. the codeword of “bzip” is written 1a 0b, i.e. two bytes whose tag bits are 1 and 0.]

CGrep and other ideas...

P = bzip = 1a 0b

[Figure: a GREP-like scan run directly over C(T), T = “bzip or not bzip”: thanks to the tag bits, the byte-aligned codeword of P is matched inside C(T) without decompression, reporting yes/no at each codeword boundary.]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary = { bzip, not, or } (plus the space symbol)

P = bzip = 1a 0b

[Figure: the codeword of P is searched directly inside C(S), S = “bzip or not bzip”; byte-aligned, tagged comparisons give a yes/no answer at each codeword boundary.]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
[Figure: the pattern P aligned below the text T at a candidate position.]
 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with small probability.
A randomized algorithm that never errs, whose running time is efficient with high probability.

We will consider a binary alphabet (i.e., T ∈ {0,1}^n).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

  H(s) = Σ_{i=1}^{m} 2^(m-i) · s[i]

P = 0101
H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5
s = s’ if and only if H(s) = H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(T_r) from H(T_{r-1}):

  H(T_r) = 2·H(T_{r-1}) - 2^m·T[r-1] + T[r+m-1]

T = 10110101
T_1 = 1 0 1 1
T_2 = 0 1 1 0

H(T_1) = H(1011) = 11
H(T_2) = H(0110) = 2·11 - 2⁴·1 + 0 = 22 - 16 + 0 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally! (P = 101111, q = 7)

  (1·2 + 0) mod 7 = 2
  (2·2 + 1) mod 7 = 5
  (5·2 + 1) mod 7 = 4
  (4·2 + 1) mod 7 = 2
  (2·2 + 1) mod 7 = 5 = Hq(P)

We can still compute Hq(T_r) from Hq(T_{r-1}), using

  2^m (mod q) = 2·( 2^(m-1) (mod q) ) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
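A compact Python sketch of the fingerprint algorithm (mine, not the slides'): it uses a fixed prime q and base 256 over characters instead of the binary hash H() above, rolls the fingerprint in O(1) per position, and verifies every fingerprint hit, so the reported occurrences are exact.

  def karp_rabin(T, P, q=2_147_483_647):
      n, m = len(T), len(P)
      if m == 0 or m > n:
          return []
      base = 256
      hp = ht = 0
      for i in range(m):                      # Hq(P) and Hq(T_1)
          hp = (hp * base + ord(P[i])) % q
          ht = (ht * base + ord(T[i])) % q
      high = pow(base, m - 1, q)              # base^(m-1) mod q, used to roll
      occ = []
      for r in range(n - m + 1):
          if hp == ht and T[r:r + m] == P:    # verify to rule out false matches
              occ.append(r + 1)               # 1-based position, as in the slides
          if r < n - m:                       # roll: drop T[r], append T[r+m]
              ht = ((ht - ord(T[r]) * high) * base + ord(T[r + m])) % q
      return occ

  # karp_rabin("10110101", "0101") -> [5]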

Problem 1: Solution
Dictionary = { bzip, not, or } (plus the space symbol)

P = bzip = 1a 0b

[Figure: the codeword of P is searched directly in C(S), S = “bzip or not bzip”, checking only byte-aligned positions; the two occurrences of bzip answer “yes”, the other codewords answer “no”.]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

[Figure: the m×n matrix M for P = for, T = california. M(i,j) = 1 iff P[1…i] = T[j-i+1…j]; e.g. column 7 has M(3,7) = 1, since “for” ends at position 7 of T.]
How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M

We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one.
We define the m-length binary vector U(x) for each character x of the alphabet: U(x) is set to 1 at the positions of P where character x appears.

Example: P = abaac

  U(a) = (1,0,1,1,0)ᵀ   U(b) = (0,1,0,0,0)ᵀ   U(c) = (0,0,0,0,1)ᵀ

How to construct M

Initialize column 0 of M to all zeros.
For j > 0, the j-th column is obtained by

  M(j) = BitShift( M(j-1) ) & U( T[j] )

For i > 1, entry M(i,j) = 1 iff
(1) the first i-1 characters of P match the i-1 characters of T ending at character j-1  ⇔ M(i-1, j-1) = 1
(2) P[i] = T[j]  ⇔ the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1, j-1) into the i-th position;
the AND with the i-th bit of U(T[j]) establishes whether both conditions hold.

An example: T = xabxabaaca, P = abaac

[Figure: the columns M(1), M(2), M(3), …, M(9) computed as M(j) = BitShift(M(j-1)) & U(T[j]). For j = 9 the 5-th bit of the column is 1, signalling an occurrence of P ending at position 9 (i.e. starting at position 5).]

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.
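A Python sketch of the method (mine, not the slides'): each column of M is kept as an integer whose bit i-1 is M(i,j), so BitShift becomes (M << 1) | 1 and a set top bit signals an occurrence ending at the current position.

  def shift_and(T, P):
      m = len(P)
      U = {}
      for i, c in enumerate(P):          # U[c] has bit i set iff P[i+1] == c
          U[c] = U.get(c, 0) | (1 << i)
      M = 0
      occ = []
      for j, c in enumerate(T, start=1):
          M = ((M << 1) | 1) & U.get(c, 0)
          if M & (1 << (m - 1)):         # row m is set: P ends at position j
              occ.append(j - m + 1)      # 1-based starting position
      return occ

  # shift_and("xabxabaaca", "abaac") -> [5]
  # shift_and("california", "for")   -> [5]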

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: Another solution
Dictionary = { bzip, not, or } (plus the space symbol)

P = bzip = 1a 0b

[Figure: the codeword of P is again located directly in C(S), S = “bzip or not bzip”, reporting yes/no at the byte-aligned candidate positions.]

Speed ≈ Compression ratio

Problem 2
Dictionary = { bzip, not, or } (plus the space symbol)

Given a pattern P, find all the occurrences in S of all terms containing P as a substring.

P = o

[Figure: the dictionary terms containing “o” are not and or, with codewords not = 1g 0g 0a and or = 1g 0a 0b; their occurrences are then located in C(S), S = “bzip or not bzip”.]

Speed ≈ Compression ratio? No! Why?
A scan of C(S) is needed for each term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: the patterns P1 and P2 aligned below the text T at their candidate positions.]
 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

S is the concatenation of the patterns in P.
R is a bitmap of length m: R[i] = 1 iff S[i] is the first symbol of a pattern.

Use a variant of the Shift-And method, searching for S:

 For any symbol c, U’(c) = U(c) AND R
   U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
 For any step j,
   compute M(j)
   then OR it with U’(T[j]). Why?
   This sets to 1 the first bit of each pattern that starts with T[j]
   Check if there are occurrences ending in j. How?

Problem 3
Dictionary = { bzip, not, or } (plus the space symbol)

Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing at most k mismatches.

P = bot, k = 2

[Figure: the dictionary terms are matched against P = bot with up to 2 mismatches over C(S), S = “bzip or not bzip”.]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix M^l to be an m by n binary matrix, such that:

M^l(i,j) = 1 iff the first i characters of P match the i characters of T ending at character j, with no more than l mismatches.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing M^l: case 1

The first i-1 characters of P match a substring of T ending at j-1 with at most l mismatches, and the next pair of characters in P and T are equal:

  BitShift( M^l(j-1) ) & U( T[j] )

Computing M^l: case 2

The first i-1 characters of P match a substring of T ending at j-1 with at most l-1 mismatches (the j-th character is charged as a mismatch):

  BitShift( M^(l-1)(j-1) )

Computing M^l

We compute M^l for all l = 0, …, k.
For each j compute M(j), M^1(j), …, M^k(j).
For all l, initialize M^l(0) to the zero vector.
In order to compute M^l(j), we observe that there is a match iff case 1 or case 2 holds:

  M^l(j) = [ BitShift( M^l(j-1) ) & U(T[j]) ]  OR  BitShift( M^(l-1)(j-1) )
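A Python sketch of this recurrence (my own agrep-style code, with the same integer encoding of columns used above): M[l] holds the current column of M^l, and a set top bit in M[k] reports an occurrence with at most k mismatches.

  def shift_and_k_mismatches(T, P, k):
      m = len(P)
      U = {}
      for i, c in enumerate(P):
          U[c] = U.get(c, 0) | (1 << i)
      top = 1 << (m - 1)
      M = [0] * (k + 1)
      occ = []
      for j, c in enumerate(T, start=1):
          prev = M[:]                                   # columns at position j-1
          M[0] = ((prev[0] << 1) | 1) & U.get(c, 0)     # exact-match matrix M^0
          for l in range(1, k + 1):
              M[l] = (((prev[l] << 1) | 1) & U.get(c, 0)) | ((prev[l - 1] << 1) | 1)
          if M[k] & top:
              occ.append(j)                             # occurrence ending at j
      return occ

  # shift_and_k_mismatches("xabxabaaca", "abaad", 1) -> [9]   (abaac vs abaad)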

Example: T = xabxabaaca, P = abaad, k = 1

[Figure: the matrices M⁰ and M¹ (5 rows × 10 columns). In M¹, row 5 has a 1 in column 9: abaac = T[5…9] matches P = abaad with a single mismatch.]

How much do we pay?

The running time is O( k·n·(1 + m/w) ).
Again, the method is practically efficient for small m.
Only O(k) columns of M are needed at any given time; hence the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary = { bzip, not, or } (plus the space symbol)

Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing k mismatches.

P = bot, k = 2

[Figure: scanning C(S), S = “bzip or not bzip”: the term not (= 1g 0g 0a) matches bot within the allowed mismatches, so its occurrences answer “yes”.]

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p into a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding

  γ(x) = 0^(Length-1) · (x in binary),  where x > 0 and Length = ⌊log_2 x⌋ + 1

e.g., 9 is represented as <000, 1001>.

The γ-code for x takes 2⌊log_2 x⌋ + 1 bits (i.e. a factor of 2 from optimal).

Optimal for Pr(x) = 1/(2x²), and i.i.d. integers.

It is a prefix-free encoding…

Exercise: given the following sequence of γ-coded integers, reconstruct the original sequence:

  0001000001100110000011101100111  →  8  6  3  59  7
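A small Python sketch of the encoder/decoder (mine), which also solves the exercise above:

  def gamma_encode(x):
      # gamma-code of x > 0: (Length-1) zeros followed by x in binary.
      b = bin(x)[2:]
      return "0" * (len(b) - 1) + b

  def gamma_decode(bits):
      # Decode a concatenation of gamma-codes into the list of integers.
      out, i = [], 0
      while i < len(bits):
          z = 0
          while bits[i] == "0":          # count leading zeros = Length - 1
              z += 1
              i += 1
          out.append(int(bits[i:i + z + 1], 2))
          i += z + 1
      return out

  # gamma_encode(9)  -> '0001001'
  # gamma_decode('0001000001100110000011101100111') -> [8, 6, 3, 59, 7]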

Analysis
Sort the p_i in decreasing order, and encode s_i via the variable-length code γ(i).

Recall that |γ(i)| ≤ 2·log_2 i + 1.
How good is this approach wrt Huffman?

Key fact:  1 ≥ Σ_{i=1,…,x} p_i ≥ x · p_x  ⇒  x ≤ 1/p_x

The cost of the encoding is (recall i ≤ 1/p_i):

  Σ_{i=1,…,|S|} p_i · |γ(i)|  ≤  Σ_{i=1,…,|S|} p_i · [ 2·log_2 (1/p_i) + 1 ]  =  2·H_0(X) + 1

So the compression ratio is ≤ 2·H_0(X) + 1: not much worse than Huffman,
and improvable to H_0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

  s + c = 256 (we are playing with 8 bits)
  Thus s items are encoded with 1 byte
  and s·c with 2 bytes, s·c² with 3 bytes, ...

An example

  5000 distinct words
  ETDC encodes 128 + 128² = 16512 words within 2 bytes
  A (230,26)-dense code encodes 230 + 230·26 = 6210 words within 2 bytes,
  hence more words on 1 byte, and thus it wins if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1ⁿ 2ⁿ 3ⁿ … nⁿ  ⇒  Huff = O(n² log n) bits, MTF = O(n log n) + n² bits

Not much worse than Huffman
...but it may be far better
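A Python sketch of the transform (mine, not the slides'); positions here are 0-based, and the list L plays the role of the MTF list:

  def mtf_encode(text, alphabet):
      # Emit the current position of each symbol in L, then move it to the front.
      L = list(alphabet)
      out = []
      for s in text:
          i = L.index(s)
          out.append(i)
          L.pop(i)
          L.insert(0, s)
      return out

  def mtf_decode(codes, alphabet):
      L = list(alphabet)
      out = []
      for i in codes:
          s = L.pop(i)
          out.append(s)
          L.insert(0, s)
      return "".join(out)

  # mtf_encode("abbba", "abc") -> [0, 1, 0, 0, 1]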

MTF: how good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log_2 i + 1.
Put the alphabet S in front of the encoding and consider its cost
(n_x = #occurrences of symbol x, p_x^i = position of its i-th occurrence):

  O(|S| log |S|) + Σ_{x} Σ_{i ≥ 2} | γ( p_x^i − p_x^(i-1) ) |

By Jensen’s inequality:

  ≤ O(|S| log |S|) + Σ_{x} n_x · [ 2·log_2 (N / n_x) + 1 ]
  = O(|S| log |S|) + N · [ 2·H_0(X) + 1 ]

Hence  L_a[mtf] ≤ 2·H_0(X) + O(1).

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1ⁿ 2ⁿ 3ⁿ … nⁿ

There is a memory

  Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)
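A trivial Python sketch of the encoder (mine):

  def rle_encode(s):
      # abbbaacccca -> [('a',1), ('b',3), ('a',2), ('c',4), ('a',1)]
      out = []
      for ch in s:
          if out and out[-1][0] == ch:
              out[-1] = (ch, out[-1][1] + 1)
          else:
              out.append((ch, 1))
      return out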

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval of [0,1): symbol i gets [f(i), f(i)+p(i)), where

  f(i) = Σ_{j=1}^{i-1} p(j)

e.g. with p(a) = .2, p(b) = .5, p(c) = .3:  f(a) = .0, f(b) = .2, f(c) = .7,
so a ↦ [0, .2), b ↦ [.2, .7), c ↦ [.7, 1.0).

The interval for a particular symbol will be called
the symbol interval (e.g. for b it is [.2, .7)).

Arithmetic Coding: Encoding Example
Coding the message sequence: bac  (p(a) = .2, p(b) = .5, p(c) = .3)

  start:    [0.0, 1.0)
  after b:  [0.2, 0.7)
  after a:  [0.2, 0.3)
  after c:  [0.27, 0.3)

The final sequence interval is [.27, .3).

Arithmetic Coding
To code a sequence of symbols c_1 … c_n with probabilities p[c], use:

  l_0 = 0,   l_i = l_(i-1) + s_(i-1) · f[c_i]
  s_0 = 1,   s_i = s_(i-1) · p[c_i]

f[c] is the cumulative probability up to symbol c (not included).
The final interval size is

  s_n = Π_{i=1}^{n} p[c_i]

The interval for a message sequence will be called the
sequence interval.
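A direct Python transcription of these recurrences (my own helper; it fixes an arbitrary symbol order to define f):

  def sequence_interval(msg, p):
      # Returns [l, l+s) for the message, following l_i = l_{i-1} + s_{i-1}*f(c_i),
      # s_i = s_{i-1}*p(c_i).
      symbols = sorted(p)
      f, acc = {}, 0.0
      for c in symbols:            # f(c) = cumulative probability before c
          f[c] = acc
          acc += p[c]
      l, s = 0.0, 1.0
      for c in msg:
          l = l + s * f[c]
          s = s * p[c]
      return l, l + s

  # With p = {'a': .2, 'b': .5, 'c': .3} (so f(a)=0, f(b)=.2, f(c)=.7):
  # sequence_interval("bac", p) -> approximately (0.27, 0.3)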

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message has length 3 (p(a) = .2, p(b) = .5, p(c) = .3):

  .49 ∈ [.2, .7)    → b    (sequence interval becomes [.2, .7))
  .49 ∈ [.3, .55)   → b    (sequence interval becomes [.3, .55))
  .49 ∈ [.475, .55) → c

The message is bbc.

Representing a real number
Binary fractional representation:

  .75 = .11     1/3 = .010101…     11/16 = .1011

Algorithm:
  1. x = 2·x
  2. if x < 1, output 0
  3. else x = x - 1; output 1

So how about just using the shortest binary fractional
representation in the sequence interval?
e.g. [0, .33) → .01    [.33, .66) → .1    [.66, 1) → .11

Representing a code interval
Binary fractional numbers can be viewed as intervals by considering all their completions:

  code   min      max      interval
  .11    .110…    .111…    [.75, 1.0)
  .101   .1010…   .1011…   [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval
is contained in the sequence interval (a dyadic number).

[Figure: sequence interval [.61, .79) containing the code interval [.625, .75) of the number .101.]

One can use L + s/2 truncated to 1 + ⌈log_2 (1/s)⌉ bits.

Bound on Arithmetic length

Note that -log_2 s + 1 = log_2 (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

  1 + ⌈log_2 (1/s)⌉ = 1 + ⌈log_2 Π_i (1/p_i)⌉
                    ≤ 2 + Σ_{i=1,…,n} log_2 (1/p_i)
                    = 2 + Σ_{k=1,…,|S|} n·p_k · log_2 (1/p_k)
                    = 2 + n·H_0   bits

In practice ≈ n·H_0 + 0.02·n bits, because of rounding.

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in the range [0, R) where R = 2^k
Use rounding to generate integer intervals
Whenever the sequence interval falls into the top, bottom or middle half, expand the interval by a factor of 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 (top half):
  output 1 followed by m 0s; set m = 0; the message interval is expanded by 2
If u < R/2 (bottom half):
  output 0 followed by m 1s; set m = 0; the message interval is expanded by 2
If l ≥ R/4 and u < 3R/4 (middle half):
  increment m; the message interval is expanded by 2
In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine:

[Figure: ATB takes the current interval (L, s), the distribution (p_1, …, p_|S|) and the next symbol c, and returns the new interval (L’, s’) with L’ = L + s·f(c) and s’ = s·p(c).]

Therefore, even the distribution can change over time.

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
[Figure: PPM feeds the ATB state machine with the conditional distribution p[s | context], where s is either a real character c or the escape symbol esc; (L, s) is mapped to (L’, s’) as before.]

The Encoder and the Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant).

PPM: Example Contexts    (String = ACCBACCACBA B,  k = 2)

  Context Empty        Context (order 1)        Context (order 2)
    A = 4                A: C = 3, $ = 1          AC: B = 1, C = 2, $ = 2
    B = 2                B: A = 2, $ = 1          BA: C = 1, $ = 1
    C = 5                C: A = 1, B = 2,         CA: C = 1, $ = 1
    $ = 3                   C = 2, $ = 3          CB: A = 2, $ = 1
                                                  CC: A = 1, B = 1, $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algorithms are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
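A small Python decoder (mine) implementing exactly this left-to-right copy, so overlapping copies (len > d) work as in the example:

  def lz77_decode(triples):
      # Decode a sequence of (d, len, c) triples into the original string.
      out = []
      for d, length, c in triples:
          start = len(out) - d
          for i in range(length):
              out.append(out[start + i])   # may read chars written by this same copy
          out.append(c)
      return "".join(out)

  # lz77_decode([(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')])
  # -> 'aacaacabcabaaac'   (the windowed example above)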

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later
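A Python sketch of the encoder (mine): only dictionary ids are emitted, and S+c is added after each output. To mirror the slide's example numbering, the initial dictionary below maps a, b, c to 112, 113, 114 (a real implementation would preload all 256 byte values); the trailing 113 is the flush of the last pending symbol.

  def lzw_encode(text, dictionary):
      d = dict(dictionary)
      next_id = 256                      # first free id after the initial entries
      out = []
      S = ""
      for c in text:
          if S + c in d:
              S += c                     # extend the current match
          else:
              out.append(d[S])           # longest match found: emit its id
              d[S + c] = next_id         # add S+c to the dictionary
              next_id += 1
              S = c
      out.append(d[S])
      return out

  # lzw_encode("aabaacababacb", {'a': 112, 'b': 113, 'c': 114})
  # -> [112, 112, 113, 256, 114, 257, 261, 114, 113]
  #    with 256=aa, 257=ab, 258=ba, 259=aac, 260=ca, 261=aba, 262=abac, 263=cb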

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L → F mapping
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i


How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
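A compact Python sketch of both directions (mine, not the slides' code): the forward transform sorts all rotations, and the inverse follows the two key properties above, building T backwards through the LF-mapping.

  def bwt(T):
      # T must end with a unique, lexicographically smallest sentinel '#'.
      n = len(T)
      rotations = sorted(T[i:] + T[:i] for i in range(n))
      return "".join(row[-1] for row in rotations)

  def inverse_bwt(L):
      n = len(L)
      # LF-mapping: stable sort of L gives, for each position in L, its row in F.
      order = sorted(range(n), key=lambda i: (L[i], i))
      LF = [0] * n
      for f_row, l_row in enumerate(order):
          LF[l_row] = f_row
      out = []
      r = 0                               # row 0 starts with the sentinel '#'
      for _ in range(n):
          out.append(L[r])                # L[r] precedes F[r] in T
          r = LF[r]
      out.reverse()                       # now reads '#' followed by T without its '#'
      return "".join(out[1:]) + out[0]    # rotate the sentinel back to the end

  # bwt("mississippi#")         -> 'ipssm#pissii'
  # inverse_bwt('ipssm#pissii') -> 'mississippi#'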

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)

  Set of nodes such that from any node one can reach any other node via an undirected path.

Strongly connected components (SCC)

  Set of nodes such that from any node one can reach any other node via a directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…

 Physical network graph
   V = routers, E = communication links

 The “cosine” graph (undirected, weighted)
   V = static web pages, E = semantic distance between pages

 Query-Log graph (bipartite, weighted)
   V = queries and URLs, E = (q,u) if u is a result for q and has been clicked by some user who issued q

 Social graph (undirected, unweighted)
   V = users, E = (x,y) if x knows y (facebook, address book, email, ..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution

Indegree follows a power-law distribution (Altavista crawl 1999, WebBase crawl 2001):

  Pr[ in-degree(u) = k ]  ∝  1/k^α,   α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77-scheme provides an efficient, optimal solution:

  fknown is the “previously encoded text”: compress the concatenation fknown·fnew, emitting codewords only from fnew onwards.

zdelta is one of the best implementations
            Emacs size   Emacs time
  uncompr     27Mb          ---
  gzip         8Mb         35 secs
  zdelta      1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

[Figure: dual-proxy architecture. The client-side proxy and the server-side proxy sit at the two ends of the slow link; the request and a reference to a cached page travel over the slow link, the page is fetched from the web over the fast link, and only the delta-encoding of the page w.r.t. the reference goes back over the slow link.]

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: the weighted graph G_F over the files (nodes 1, 2, 3, 5, …) plus a dummy node 0; edge weights are zdelta sizes (e.g. 20, 123, 220, 620, 2000), edges from the dummy node are gzip sizes; the min branching selects, for each file, either plain gzip or zdelta against its best reference.]

            space    time
  uncompr   30Mb     ---
  tgz       20%      linear
  THIS       8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: estimate appropriate edge weights for G’_F, thus saving zdelta executions. Nonetheless, strictly n² time.

            space    time
  uncompr   260Mb    ---
  tgz       12%      2 mins
  THIS       8%      16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

           gcc size   emacs size
  total     27288       27326
  gzip       7563        8577
  zdelta      227        1431
  rsync       964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
The communication complexity is O(k · lg n · lg(n/k)) bits.

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree

T# = mississippi#
     1 2 3 4 5 6 7 8 9 10 11 12

[Figure: the compacted trie of all suffixes of T#. Edges are labelled with substrings (e.g. i, s, si, ssi, p, pi#, ppi#, #, mississippi#), and the 12 leaves store the starting positions 1…12 of the corresponding suffixes.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
5

Q(N2) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Q(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
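A Python sketch of this search (mine): a naive suffix-array builder (fine for small texts, and exactly the inefficiency discussed earlier) plus the indirect binary search that compares at most p characters per step.

  def suffix_array(T):
      # 1-based starting positions, sorted by suffix text.
      return sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])

  def sa_search(T, SA, P):
      # Returns the starting positions of P in T in O(p log N) char comparisons.
      p = len(P)

      def suffix_prefix(k):                 # first p chars of the k-th smallest suffix
          i = SA[k]
          return T[i - 1:i - 1 + p]

      lo, hi = 0, len(SA)
      while lo < hi:                        # leftmost suffix with prefix >= P
          mid = (lo + hi) // 2
          if suffix_prefix(mid) < P:
              lo = mid + 1
          else:
              hi = mid
      first = lo
      lo, hi = first, len(SA)
      while lo < hi:                        # leftmost suffix with prefix > P
          mid = (lo + hi) // 2
          if suffix_prefix(mid) <= P:
              lo = mid + 1
          else:
              hi = mid
      return sorted(SA[first:lo])

  # T = "mississippi#"
  # SA = suffix_array(T)        # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
  # sa_search(T, SA, "si")      # -> [4, 7]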

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does it exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does it exist a substring of length ≥ L occurring ≥ C times ?
• Search for Lcp[i,i+C-2] whose entries are ≥ L


Slide 186

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any uniquely decodable code C, we have

   H(S) ≤ La(C)

Theorem (upper bound, Shannon). For any probability distribution, there exists a prefix code C such that

   La(C) ≤ H(S) + 1

(The Shannon code, which takes ⌈log2 1/p(s)⌉ bits per symbol, achieves the upper bound.)

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5
(Huffman tree: merge a(.1)+b(.2) → (.3), then (.3)+c(.2) → (.5), then (.5)+d(.5) → (1).)
Resulting code: a=000, b=001, c=01, d=1
There are 2^(n-1) "equivalent" Huffman trees (flip the 0/1 labels at the internal nodes).

What about ties (and thus, tree depth) ?
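A compact sketch of the greedy construction on the running example (the tie-breaking by an insertion counter is my own choice, so the exact bit values may differ from the slide while the codeword lengths do not):

    import heapq

    def huffman_codes(probs):
        """probs: dict symbol -> probability. Returns dict symbol -> codeword string."""
        heap = [(p, i, sym) for i, (sym, p) in enumerate(probs.items())]  # (prob, tiebreak, tree)
        heapq.heapify(heap)
        count = len(heap)
        while len(heap) > 1:
            p1, _, t1 = heapq.heappop(heap)   # the two least probable trees...
            p2, _, t2 = heapq.heappop(heap)
            heapq.heappush(heap, (p1 + p2, count, (t1, t2)))  # ...are merged
            count += 1
        codes = {}
        def walk(tree, prefix):
            if isinstance(tree, tuple):
                walk(tree[0], prefix + "0")
                walk(tree[1], prefix + "1")
            else:
                codes[tree] = prefix or "0"
        walk(heap[0][2], "")
        return codes

    print(huffman_codes({"a": .1, "b": .2, "c": .2, "d": .5}))
    # codeword lengths 3, 3, 2, 1 as in the running example (bit values may be flipped)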

Encoding and Decoding
Encoding: emit the root-to-leaf path leading to the symbol to be encoded.
Decoding: start at the root and take the branch indicated by each bit received; when at a leaf, output its symbol and return to the root.
Example (tree above): abc… → 00000101…, and 101001… → dcb…

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store, for every level L of the tree:
 - firstcode[L]: the value of the first codeword on level L (the deepest level starts at 00.....0)
 - Symbol[L,i], for each i in level L

This takes ≤ h² + |S| log |S| bits (h = height of the tree).

Canonical Huffman: Encoding and Decoding
Levels 1 … 5, with
  firstcode[1]=2, firstcode[2]=1, firstcode[3]=1, firstcode[4]=2, firstcode[5]=0

T=...00010...
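A hedged sketch of the decoding loop driven by the firstcode[] values above; the Symbol[L,i] table and the bit reader are assumed inputs, and this follows one common canonical-code convention, not necessarily the exact one of the slides:

    def canonical_decode_one(next_bit, firstcode, symbol):
        """next_bit(): returns the next input bit (0/1).
        firstcode[l]: numeric value of the first codeword of length l (1-based levels).
        symbol[l][i]: i-th symbol on level l. Returns one decoded symbol."""
        l, v = 1, next_bit()
        while v < firstcode[l]:          # not yet a complete codeword of length l
            v = 2 * v + next_bit()       # read one more bit
            l += 1
        return symbol[l][v - firstcode[l]]

    # With firstcode = {1: 2, 2: 1, 3: 1, 4: 2, 5: 0} as on the slide,
    # the bits 0,0,0,1,0 stop at level 5 with offset 2, i.e. Symbol[5,2].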

Problem with Huffman Coding
Consider a symbol with probability .999. Its self information is

   − log2(.999) ≈ .00144 bits

If we were to send 1000 such symbols we might hope to use 1000 · .00144 ≈ 1.44 bits.

Using Huffman, we take at least 1 bit per symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols:
 - ≤ 1 extra bit per macro-symbol = 1/k extra bits per symbol
 - but a larger model has to be transmitted

Shannon took infinite sequences, i.e. k → ∞ !!

In practice, we have:
 - the model takes |S|^k · (k · log |S|) + h² bits (where h might be as large as |S|^k)
 - it holds H0(S^L) ≤ L · Hk(S) + O(k · log |S|), for each k ≤ L

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 - the symbols of the Huffman tree are the words of T
 - the Huffman tree has fan-out 128
 - codewords are byte-aligned and tagged: 7 bits of each byte carry the Huffman code, 1 bit is the tag marking the first byte of a codeword

(Figure: the 128-ary tree built over the words of T = "bzip or not bzip" — bzip, or, not, space — and the resulting byte-aligned, tagged compressed text C(T).)

CGrep and other ideas...
Encode the pattern with the same word-based code (e.g. P = bzip has codeword "1a 0b") and run GREP directly on C(T), T = "bzip or not bzip"; the tag bits rule out false matches starting in the middle of a codeword.

(Figure: scan of C(T) with yes/no decisions per codeword.)

Speed ≈ Compression ratio

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary = {bzip, not, or, space}; S = "bzip or not bzip"; the pattern is a dictionary term, e.g. P = bzip with codeword "1a 0b".
Find all the occurrences of P in S by scanning the compressed text C(S).

(Figure: word-based Huffman tree and scan of C(S) with yes/no decisions per codeword.)

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
   T = … A B C A B D A B …        P = A B
 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching

We show methods in which arithmetic and bit operations replace comparisons. We will survey two examples of such methods:
 - the Random Fingerprint method due to Karp and Rabin
 - the Shift-And method due to Baeza-Yates and Gonnet

Rabin-Karp Fingerprint

We will use a class of functions from strings to integers in order to obtain:
 - an efficient randomized algorithm that makes an error with small probability;
 - a randomized algorithm that never errs, whose running time is efficient with high probability.

We will consider a binary alphabet (i.e., T ∈ {0,1}^n).

Arithmetic replaces Comparisons

Strings are also numbers: H maps strings to numbers. Let s be a string of length m:

   H(s) = Σ_{i=1}^{m} 2^{m−i} · s[i]

Example: P = 0101 → H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5

s = s' if and only if H(s) = H(s').

Definition: let Tr denote the m-length substring of T starting at position r (i.e., Tr = T[r, r+m−1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons

We can compute H(Tr) from H(T_{r−1}):

   H(Tr) = 2 · H(T_{r−1}) − 2^m · T[r−1] + T[r+m−1]

Example (m = 4): T = 10110101, T1 = 1011, T2 = 0110
   H(T1) = H(1011) = 11
   H(T2) = 2·11 − 2⁴·1 + 0 = 22 − 16 + 0 = 6 = H(0110)

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P = 101111, q = 7;  H(P) = 47,  Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally:
   (1·2 mod 7) + 0 = 2
   (2·2 mod 7) + 1 = 5
   (5·2 mod 7) + 1 = 4
   (4·2 mod 7) + 1 = 2
   (2·2 mod 7) + 1 = 5   →   Hq(P) = 5

We can still compute Hq(Tr) from Hq(T_{r−1}), since 2^m (mod q) = 2 · (2^{m−1} (mod q)) (mod q).

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm
 - Choose a positive integer I.
 - Pick a random prime q ≤ I and compute P's fingerprint Hq(P).
 - For each position r in T, compute Hq(Tr) and test whether it equals Hq(P). If the numbers are equal, either
    - declare a probable match (randomized algorithm), or
    - check and declare a definite match (deterministic algorithm).

Running time: excluding verification, O(n+m).
The randomized algorithm is correct w.h.p.; the deterministic algorithm has expected running time O(n+m).

Proof on the board
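A small sketch of the whole fingerprint scan on a binary text, with verification, so it is the deterministic variant; the prime is hard-wired here instead of being drawn at random below I (my rendering):

    def karp_rabin(T, P, q=2**31 - 1):
        """T, P: strings over {0,1}. Returns the verified occurrence positions (0-based)."""
        n, m = len(T), len(P)
        if m > n:
            return []
        two_m = pow(2, m, q)                      # 2^m mod q, used to drop the leading bit
        hp = ht = 0
        for i in range(m):                        # Hq(P) and Hq(T1) by Horner's rule
            hp = (2 * hp + int(P[i])) % q
            ht = (2 * ht + int(T[i])) % q
        occ = []
        for r in range(n - m + 1):
            if hp == ht and T[r:r + m] == P:      # fingerprints equal -> verify (no false match)
                occ.append(r)
            if r + m < n:                         # roll: Hq(T_{r+1}) from Hq(T_r)
                ht = (2 * ht - two_m * int(T[r]) + int(T[r + m])) % q
        return occ

    print(karp_rabin("10110101", "0101"))   # [4]  (position 5, 1-based, as in the slide)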

Problem 1: Solution
Encode the dictionary term (P = bzip = "1a 0b") and scan the compressed text C(S), S = "bzip or not bzip", with an exact string-matching algorithm over the byte-aligned, tagged codewords.

(Figure: scan of C(S) with yes/no decisions per codeword.)

Speed ≈ Compression ratio

The Shift-And method

Define M to be a binary m × n matrix such that:
   M(i,j) = 1  iff  the first i characters of P exactly match the i characters of T ending at character j,
   i.e.  M(i,j) = 1  iff  P[1 … i] = T[j−i+1 … j].

Example: T = california (n = 10), P = for (m = 3). The only 1-entries are M(1,5), M(2,6), M(3,7): row m has a 1 in column 7, i.e. an occurrence of P ends at position 7.

(Figure: the 3 × 10 matrix M, almost entirely made of 0s.)

How does M solve the exact match problem?

How to construct M

We want to exploit bit-parallelism to compute the j-th column of M from the (j−1)-th one.
Machines can perform bit and arithmetic operations between two words in constant time. Examples:
 - And(A,B) is the bit-wise AND between A and B.
 - BitShift(A) is the value obtained by shifting A's bits down by one position and setting the first bit to 1 (e.g. BitShift((0,1,1,0,1)ᵀ) = (1,0,1,1,0)ᵀ).

Let w be the word size (e.g., 32 or 64 bits). We'll assume m ≤ w; NOTICE: any column of M then fits in a memory word.

How to construct M

We define the m-length binary vector U(x) for each character x of the alphabet: U(x) has a 1 exactly at the positions where x appears in P.

Example: P = abaac
   U(a) = (1,0,1,1,0)ᵀ    U(b) = (0,1,0,0,0)ᵀ    U(c) = (0,0,0,0,1)ᵀ

How to construct M

Initialize column 0 of M to all zeros. For j > 0, the j-th column is obtained by

   M(j) = BitShift(M(j−1)) & U(T[j])

Indeed, for i > 1, entry M(i,j) = 1 iff
 (1) the first i−1 characters of P match the i−1 characters of T ending at position j−1, i.e. M(i−1,j−1) = 1, and
 (2) P[i] = T[j], i.e. the i-th bit of U(T[j]) is 1.

BitShift moves bit M(i−1,j−1) into the i-th position; ANDing it with the i-th bit of U(T[j]) establishes whether both conditions hold (the leading 1 set by BitShift handles the case i = 1).
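A bit-parallel sketch using Python integers as columns: bit i−1 of the integer plays the role of row i, so BitShift becomes (M << 1) | 1 in this orientation (my rendering, assuming m fits in a machine word as discussed above):

    def shift_and(T, P):
        """Returns the ending positions (1-based) of the occurrences of P in T."""
        m = len(P)
        U = {}                                  # U[c]: bit i-1 set iff P[i] == c
        for i, c in enumerate(P):
            U[c] = U.get(c, 0) | (1 << i)
        last = 1 << (m - 1)                     # bit of row m: a full match ends here
        M, occ = 0, []
        for j, c in enumerate(T, start=1):
            M = ((M << 1) | 1) & U.get(c, 0)    # M(j) = BitShift(M(j-1)) & U(T[j])
            if M & last:
                occ.append(j)
        return occ

    print(shift_and("xabxabaaca", "abaac"))   # [9]
    print(shift_and("california", "for"))     # [7]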

Examples (T = xabxabaaca, P = abaac, M(0) = (0,0,0,0,0)ᵀ):
 j=1: T[1]=x, U(x)=(0,0,0,0,0)ᵀ → M(1) = BitShift(M(0)) & U(x) = (0,0,0,0,0)ᵀ
 j=2: T[2]=a → M(2) = (1,0,0,0,0)ᵀ & U(a) = (1,0,0,0,0)ᵀ
 j=3: T[3]=b → M(3) = (1,1,0,0,0)ᵀ & U(b) = (0,1,0,0,0)ᵀ
 …
 j=9: T[9]=c → M(9) = BitShift(M(8)) & U(c) = (1,1,0,0,1)ᵀ & (0,0,0,0,1)ᵀ = (0,0,0,0,1)ᵀ
Since M(5,9) = 1, an occurrence of P ends at position 9 of T.

Shift-And method: Complexity
 - If m ≤ w, any column and any vector U() fit in a memory word: each step requires O(1) time.
 - If m > w, any column and any vector U() is divided into ⌈m/w⌉ memory words: each step requires O(m/w) time.
 - Overall: O(n(1+m/w)+m) time.
Thus it is very fast when the pattern length is close to the word size — very often the case in practice; recall that w = 64 bits in modern architectures.

Some simple extensions

We want to allow the pattern to contain special symbols, like the character classes [a-f].
Example: P = [a-b]baac
   U(a) = (1,0,1,1,0)ᵀ    U(b) = (1,1,0,0,0)ᵀ    U(c) = (0,0,0,0,1)ᵀ

What about '?' and '[^…]' (negation)?

Problem 1: Another solution
Again P = bzip = "1a 0b" over the dictionary {bzip, not, or, space}: scan C(S), S = "bzip or not bzip", this time driving the search with the Shift-And machinery over the tagged codewords.

(Figure: scan of C(S) with yes/no decisions per codeword.)

Speed ≈ Compression ratio

Problem 2
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring.
Example: P = o, dictionary {bzip, not, or, space}, S = "bzip or not bzip"; the matching terms are not (= 1g 0g 0a) and or (= 1g 0a 0b), and each of them must then be located in C(S).

(Figure: scan of C(S) marking the occurrences of "or" and "not".)

Speed ≈ Compression ratio? No! Why? A scan of C(S) is needed for each term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1, P2, …, Pl} of total length m, we want to find all the occurrences of those patterns in the text T[1,n].

(Figure: a text T with the occurrences of P1 and P2 marked.)

 - Naïve solution: run an (optimal) exact-matching algorithm for each pattern in P. Complexity: O(nl + m) time — not good with many patterns.
 - Optimal solution due to Aho and Corasick. Complexity: O(n + l + m) time.

A simple extension of Shift-And
 - S is the concatenation of the patterns in P.
 - R is a bitmap of length m: R[i] = 1 iff S[i] is the first symbol of a pattern.
Use a variant of the Shift-And method searching for S:
 - for any symbol c, U'(c) = U(c) AND R, so that U'(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern;
 - at any step j, compute M(j) and then M(j) OR U'(T[j]). Why? This sets to 1 the first bit of each pattern that starts with T[j];
 - check if there are occurrences ending at j. How? Test the bits of M(j) corresponding to the last symbol of each pattern.

Problem 3
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring, allowing at most k mismatches.
Example: P = bot, k = 2, dictionary {bzip, not, or, space}, S = "bzip or not bzip".

(Figure: scan of C(S); the term "not" matches P within 2 mismatches.)

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep

Our current goal: given k, find all the occurrences of P in T with up to k mismatches.
We define the matrix Ml to be an m × n binary matrix such that

   Ml(i,j) = 1 iff there are no more than l mismatches between the first i characters of P and the i characters of T ending at position j.

What is M0? How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1

The first i−1 characters of P match a substring of T ending at j−1 with at most l mismatches, and the next pair of characters in P and T are equal:

   BitShift(Ml(j−1)) & U(T[j])

Computing Ml: case 2

The first i−1 characters of P match a substring of T ending at j−1 with at most l−1 mismatches (position i is then allowed to mismatch):

   BitShift(Ml−1(j−1))

Computing Ml

We compute Ml for all l = 0, …, k; for each j we compute M0(j), M1(j), …, Mk(j), initializing every Ml(0) to the zero vector. Combining the two cases:

   Ml(j) = [ BitShift(Ml(j−1)) & U(T[j]) ]  OR  BitShift(Ml−1(j−1))
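A sketch of the k-mismatch recurrence above, reusing the integer-as-column convention of the Shift-And sketch (my rendering):

    def shift_and_k_mismatches(T, P, k):
        """Ending positions (1-based) of substrings of T matching P with <= k mismatches."""
        m = len(P)
        U = {}
        for i, c in enumerate(P):
            U[c] = U.get(c, 0) | (1 << i)
        last = 1 << (m - 1)
        M = [0] * (k + 1)                        # M[l] = current column of M^l
        occ = []
        for j, c in enumerate(T, start=1):
            prev = M[:]                          # the columns M^l(j-1)
            for l in range(k + 1):
                exact = ((prev[l] << 1) | 1) & U.get(c, 0)     # case 1: chars equal
                if l == 0:
                    M[l] = exact
                else:
                    M[l] = exact | ((prev[l - 1] << 1) | 1)    # case 2: spend one mismatch
            if M[k] & last:
                occ.append(j)
        return occ

    print(shift_and_k_mismatches("xabxabaaca", "abaad", 1))   # [9]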

Example (T = xabxabaaca, P = abaad, k = 1)
(Figure: the 5 × 10 matrices M0 and M1; M1(5,9) = 1, i.e. P matches T[5..9] = abaac with one mismatch.)

How much do we pay?
 - The running time is O(kn(1+m/w)).
 - Again, the method is practically efficient for small m.
 - Only O(k) columns of M are needed at any given time; hence the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Run the Agrep scan (Shift-And with mismatches) over C(S) for P = bot and k = 2: the dictionary term "not" (= 1g 0g 0a) matches P within 2 mismatches, so its occurrences in S = "bzip or not bzip" are reported.

(Figure: scan of C(S) with yes decisions on the codeword of "not".)

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding
   γ(x) = (Length−1) zeros, followed by the binary representation of x,   where x > 0 and Length = ⌊log2 x⌋ + 1
   e.g., 9 is represented as <000, 1001>.

 - The γ-code for x takes 2⌊log2 x⌋ + 1 bits (i.e. a factor of 2 from optimal).
 - It is optimal for Pr(x) = 1/(2x²), and i.i.d. integers.

It is a prefix-free encoding…

Exercise: given the following sequence of γ-coded integers, reconstruct the original sequence:
   0001000001100110000011101100111
Answer: 8, 6, 3, 59, 7
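A small sketch of γ-encoding and of the decoding used in the exercise above (my rendering):

    def gamma_encode(x):
        # (Length-1) zeros followed by the binary representation of x, x > 0
        b = bin(x)[2:]
        return "0" * (len(b) - 1) + b

    def gamma_decode_all(bits):
        out, i = [], 0
        while i < len(bits):
            zeros = 0
            while bits[i] == "0":              # count the unary length prefix
                zeros += 1
                i += 1
            out.append(int(bits[i:i + zeros + 1], 2))
            i += zeros + 1
        return out

    print(gamma_encode(9))                                        # 0001001
    print(gamma_decode_all("0001000001100110000011101100111"))    # [8, 6, 3, 59, 7]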

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).
Recall that |γ(i)| ≤ 2·log i + 1. How good is this approach wrt Huffman? Compression ratio ≤ 2·H0(S) + 1.

Key fact:   1 ≥ Σ_{j=1..i} pj ≥ i · pi   ⇒   i ≤ 1/pi

How good is it? The cost of the encoding is (recall i ≤ 1/pi):

   Σ_{i=1..|S|} pi · |γ(i)|  ≤  Σ_{i=1..|S|} pi · [2·log(1/pi) + 1]  =  2·H0(S) + 1   bits per symbol

Not much worse than Huffman, and improvable to H0(S) + 2 + …

A better encoding
 - Byte-aligned and tagged Huffman:
    - 128-ary Huffman tree
    - the first bit of the first byte is tagged
    - the configurations on 7 bits are just those produced by Huffman
 - End-tagged dense code:
    - the rank r is mapped to the r-th binary sequence on 7·k bits
    - the first bit of the last byte is tagged
Surprising changes:
 - it is a prefix code
 - better compression: it uses all the 7-bit configurations

(s,c)-dense codes
The distribution of words is skewed: 1/i^q, where 1 < q < 2.
A new concept: Continuers vs Stoppers (previously we used s = c = 128). The main idea is:
 - s + c = 256 (we are playing with 8 bits)
 - s items are encoded with 1 byte
 - s·c items with 2 bytes, s·c² with 3 bytes, ...
An example: 5000 distinct words
 - ETDC encodes 128 + 128² = 16512 words on up to 2 bytes
 - a (230,26)-dense code encodes 230 + 230·26 = 6210 words on up to 2 bytes, hence more words on 1 byte; thus, if the distribution is skewed, it compresses better.

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move-to-Front Coding
Transforms a char sequence into an integer sequence, which can then be var-length coded.
 - Start with the list of symbols L = [a,b,c,d,…]
 - For each input symbol s: (1) output the position of s in L; (2) move s to the front of L.
There is a memory. Properties: it exploits temporal locality, and it is dynamic.
 - X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n² log n) bits, MTF = O(n log n) + n² bits
Not much worse than Huffman… but it may be far better.
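A direct sketch of the two MTF steps (list-based, so O(|S|) per symbol; the balanced-tree variant mentioned a couple of slides below gets this to O(log |S|)). Positions are 0-based here:

    def mtf_encode(text, alphabet):
        L = list(alphabet)                 # start with the list of symbols
        out = []
        for s in text:
            i = L.index(s)                 # 1) output the position of s in L
            out.append(i)
            L.pop(i)                       # 2) move s to the front of L
            L.insert(0, s)
        return out

    def mtf_decode(codes, alphabet):
        L = list(alphabet)
        out = []
        for i in codes:
            s = L[i]
            out.append(s)
            L.pop(i)
            L.insert(0, s)
        return "".join(out)

    codes = mtf_encode("abbbaacccca", "abcd")
    print(codes)                             # [0, 1, 0, 0, 1, 0, 2, 0, 0, 0, 1]
    print(mtf_decode(codes, "abcd"))         # abbbaacccca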

MTF: how good is it ?
Encode the output integers via γ-coding: |γ(i)| ≤ 2·log i + 1. Let X have length N over alphabet S, with nx occurrences of symbol x at positions p_{x,1} < p_{x,2} < …. Putting S at the front, the cost of the encoding is at most

   O(|S| log |S|) + Σ_{x∈S} Σ_{i=2..nx} |γ(p_{x,i} − p_{x,i−1})|

By Jensen's inequality this is

   ≤ O(|S| log |S|) + Σ_{x∈S} nx · [2·log(N/nx) + 1]  =  O(|S| log |S|) + N · [2·H0(X) + 1]

Hence  La[mtf] ≤ 2·H0(X) + O(1).

MTF: higher compression
To achieve higher compression we consider words (and separators) as the symbols to be encoded.
How to maintain the MTF-list efficiently:
 - Search tree: the leaves contain the symbols, ordered as in the MTF-list; each node stores the size of its descending subtree.
 - Hash table: the key is a symbol, the data is a pointer to the corresponding tree leaf.
Each tree operation takes O(log |S|) time; the total is O(n log |S|), where n = #symbols to be compressed.

Run-Length Encoding (RLE)
If spatial locality is very high, then abbbaacccca ⇒ (a,1),(b,3),(a,2),(c,4),(a,1).
In the case of binary strings ⇒ just the run lengths and one initial bit.
There is a memory. Properties: it exploits spatial locality, and it is a dynamic code.
 - X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = Θ(n² log n) > Rle(X) = Θ(n log n)
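A tiny sketch of the run-length transform on the example above (my rendering):

    from itertools import groupby

    def rle(s):
        # abbbaacccca -> [('a',1), ('b',3), ('a',2), ('c',4), ('a',1)]
        return [(c, len(list(g))) for c, g in groupby(s)]

    def unrle(runs):
        return "".join(c * n for c, n in runs)

    print(rle("abbbaacccca"))
    print(unrle(rle("abbbaacccca")) == "abbbaacccca")   # True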

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol an interval range in [0,1): symbol i gets [f(i), f(i)+p(i)), where

   f(i) = Σ_{j<i} p(j)

e.g. with p(a) = .2, p(b) = .5, p(c) = .3:  f(a) = .0, f(b) = .2, f(c) = .7, so a → [0,.2), b → [.2,.7), c → [.7,1.0).

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7)).

Arithmetic Coding: Encoding Example
Coding the message sequence bac (with a = .2, b = .5, c = .3 as above):
   start:    [0, 1)
   after b:  [.2, .7)
   after a:  [.2, .3)
   after c:  [.27, .3)
The final sequence interval is [.27, .3).

Arithmetic Coding
To code a sequence of symbols c1 c2 … cn with probabilities p[c], use:

   l0 = 0,  s0 = 1
   li = l_{i−1} + s_{i−1} · f[ci]
   si = s_{i−1} · p[ci]

where f[c] is the cumulative probability up to symbol c (not included).
The final interval size is  sn = Π_{i=1..n} p[ci].
The interval [ln, ln+sn) for a message sequence will be called the sequence interval.
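A sketch of the interval computation above on plain floating-point numbers (so it suffers from precision limits; the integer version of the later slides fixes that). The probabilities are the ones of the running example:

    def sequence_interval(msg, p):
        f, tot = {}, 0.0                        # cumulative probabilities f[c]
        for c in sorted(p):                     # a, b, c -> f(a)=0, f(b)=.2, f(c)=.7
            f[c] = tot
            tot += p[c]
        l, s = 0.0, 1.0                         # l0 = 0, s0 = 1
        for c in msg:
            l = l + s * f[c]                    # l_i = l_{i-1} + s_{i-1} * f[c_i]
            s = s * p[c]                        # s_i = s_{i-1} * p[c_i]
        return l, s

    l, s = sequence_interval("bac", {"a": .2, "b": .5, "c": .3})
    print(l, l + s)    # ~0.27  ~0.30  ->  the sequence interval [.27, .3)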

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message has length 3 (a = .2, b = .5, c = .3):
   .49 ∈ [.2, .7)   = b's interval                      → first symbol b
   .49 ∈ [.3, .55)  = b's sub-interval of [.2, .7)      → second symbol b
   .49 ∈ [.475, .55) = c's sub-interval of [.3, .55)    → third symbol c
The message is bbc.

Representing a real number
Binary fractional representation:   .75 = .11    1/3 = .0101…    11/16 = .1011
Algorithm (to emit the bits of x ∈ [0,1)):
 1. x = 2·x
 2. if x < 1, output 0
 3. else x = x − 1 and output 1
So how about just using the shortest binary fractional representation lying in the sequence interval?
e.g. [0,.33) → .01    [.33,.66) → .1    [.66,1) → .11

Representing a code interval
Binary fractional numbers can be viewed as intervals by considering all their completions:
   number   min      max      interval
   .11      .110…    .111…    [.75, 1.0)
   .101     .1010…   .1011…   [.625, .75)
We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval is contained in the sequence interval (a dyadic interval).
Example: sequence interval [.61, .79); code interval of .101 = [.625, .75) ⊆ [.61, .79).
One can use L + s/2 truncated to 1 + ⌈log(1/s)⌉ bits.
Bound on the Arithmetic length: note that ⌈−log s⌉ + 1 = ⌈log(2/s)⌉.

Bound on Length
Theorem: for a text of length n, the Arithmetic encoder generates at most
   1 + ⌈log(1/s)⌉ = 1 + ⌈log Π_{i=1..n} (1/pi)⌉
   ≤ 2 + Σ_{i=1..n} log(1/pi)
   = 2 + Σ_{k=1..|S|} n·pk·log(1/pk)
   = 2 + n·H0  bits.
In practice it takes about nH0 + 0.02·n bits, because of rounding.

Integer Arithmetic Coding
The problem is that operations on arbitrary-precision real numbers are expensive. Key ideas of the integer version:
 - keep the integers in the range [0, R), where R = 2^k
 - use rounding to generate integer intervals
 - whenever the sequence interval falls into the top, bottom or middle half, expand the interval by a factor of 2.
Integer Arithmetic is an approximation.

Integer Arithmetic (scaling):
 - if l ≥ R/2 (top half): output 1 followed by m 0s, set m = 0, expand the message interval by 2;
 - if u < R/2 (bottom half): output 0 followed by m 1s, set m = 0, expand the message interval by 2;
 - if l ≥ R/4 and u < 3R/4 (middle half): increment m, expand the message interval by 2;
 - in all other cases, just continue…

You find this at

Arithmetic ToolBox
Seen as a state machine: given the current interval (L,s), the next symbol c and the distribution (p1,…,p|S|), the ATB maps (L,s) → (L',s') as above.
Therefore, even the distribution can change over time.

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
The ATB is driven by p[s | context], where the coded symbol s is either a character c or the escape esc; it maps the current interval (L,s) to (L',s').

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts  (String = ACCBACCACBA, next symbol B, k = 2)

  Context Empty:  A = 4   B = 2   C = 5   $ = 3
  Context A:      C = 3   $ = 1
  Context B:      A = 2   $ = 1
  Context C:      A = 1   B = 2   C = 2   $ = 3
  Context AC:     B = 1   C = 2   $ = 2
  Context BA:     C = 1   $ = 1
  Context CA:     C = 1   $ = 1
  Context CB:     A = 2   $ = 1
  Context CC:     A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) for n → ∞ !!

LZ77
   a a c a a c a b c a b a b a c
The dictionary is the text seen so far (all the substrings starting before the cursor).
Algorithm's step: output the triple <d, len, c> (e.g. <2,3,c> at the cursor position shown on the slide), where
 - d = backward distance of the copied string wrt the current position
 - len = length of the longest match
 - c = next char in the text beyond the longest match
and then advance by len + 1.
In practice the dictionary is a buffer "window" of fixed length that moves with the cursor.

Example: LZ77 with window (window size = 6)
   a a c a a c a b c a b a a a c
Outputs, in order:  (0,0,a)  (1,1,c)  (3,4,b)  (3,3,a)  (1,2,c)
At each step: <distance, length of the longest match within the window W, next character>.

LZ77 Decoding
The decoder keeps the same dictionary window as the encoder: it locates the referenced substring and inserts a copy of it.
What if len > d? (the copy overlaps the part of the text still to be written)
E.g. seen = abcd, next codeword is (2,9,e). Simply copy starting at the cursor, one character at a time:
   for (i = 0; i < len; i++)
      out[cursor+i] = out[cursor-d+i];
Output is correct: abcdcdcdcdcdce
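A sketch of a decoder for the <d, len, c> triples, copying char-by-char exactly so that the overlapping case (len > d) works as described above (my rendering):

    def lz77_decode(triples):
        out = []
        for d, length, c in triples:
            start = len(out) - d
            for i in range(length):            # char-by-char copy handles len > d
                out.append(out[start + i])
            out.append(c)
        return "".join(out)

    # The windowed example above:
    print(lz77_decode([(0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), (3, 3, 'a'), (1, 2, 'c')]))
    # aacaacabcabaaac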

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later
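A sketch of an LZW encoder over plain Python strings; the dictionary is seeded with the single characters actually occurring (instead of the 256 ASCII entries), just to keep the example small, so the numeric ids differ from the slide's while the segmentation is the same:

    def lzw_encode(text):
        dic = {c: i for i, c in enumerate(sorted(set(text)))}   # seed: single chars
        out, S = [], ""
        for c in text:
            if S + c in dic:
                S = S + c                       # extend the current match
            else:
                out.append(dic[S])              # emit the id of the longest match S
                dic[S + c] = len(dic)           # add Sc to the dictionary (no char emitted)
                S = c
        out.append(dic[S])
        return out, dic

    codes, dic = lzw_encode("aabaacababacb")    # the string of the encoding example
    print(codes)   # [0, 0, 1, 3, 2, 4, 8, 2, 1] -> a, a, b, aa, c, ab, aba, c, b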

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: the L → F mapping
(Same BWT matrix as above, with first column F and last column L.)
How do we map L's chars onto F's chars? … We need to distinguish equal chars in F…
Take two equal chars of L: rotating their rows rightward by one position shows that they appear in F in the same relative order !!

The BWT is invertible
Two key properties:
 1. the LF-array maps L's chars to F's chars;
 2. L[i] precedes F[i] in T.
Reconstruct T backward:   T = .... i p p i #

   InvertBWT(L)
     Compute LF[0,n-1];
     r = 0; i = n;
     while (i > 0) {
       T[i] = L[r];
       r = LF[r]; i--;
     }

How to compute the BWT ?
Use the suffix array: for T = mississippi#, SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3] lists the starting positions of the suffixes in lexicographic order, i.e. the sorted rows of the BWT matrix.
The resulting last column is L = i p s s m # p i s s i i.
We said that L[i] precedes F[i] in T; for instance L[3] = T[7].
Given SA and T, we have L[i] = T[SA[i]-1]
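A direct sketch of exactly this route (sort the suffixes to get SA, then read L[i] = T[SA[i]-1]); this is the "elegant but inefficient" construction of the next slide:

    def bwt_via_sa(T):
        """T must end with a unique smallest char (here '#')."""
        n = len(T)
        sa = sorted(range(n), key=lambda i: T[i:])   # suffix array (0-based starts)
        L = "".join(T[i - 1] for i in sa)            # L[i] = T[SA[i]-1]; i=0 wraps to T[n-1]
        return sa, L

    sa, L = bwt_via_sa("mississippi#")
    print([i + 1 for i in sa])    # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
    print(L)                      # ipssm#pissii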

How to construct SA from T ?
Sort the suffixes of T = mississippi# lexicographically:
   #, i#, ippi#, issippi#, ississippi#, mississippi#, pi#, ppi#, sippi#, sissippi#, ssippi#, ssissippi#
   → SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
Elegant but inefficient. Obvious inefficiencies:
 • Θ(n² log n) time in the worst case
 • Θ(n log n) cache misses or I/O faults
Many algorithms exist, now…

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion pages are available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…
 - Physical network graph: V = routers, E = communication links
 - The "cosine" graph (undirected, weighted): V = static web pages, E = semantic distance between pages
 - Query-Log graph (bipartite, weighted): V = queries and URLs, E = (q,u) if u is a result for q and has been clicked by some user who issued q
 - Social graph (undirected, unweighted): V = users, E = (x,y) if x knows y (facebook, address book, email, ..)

Definition
Directed graph G = (V,E): V = URLs, E = (u,v) if u has a hyperlink to v. Isolated URLs are ignored (no IN & no OUT).
Three key properties. The first one:
 - Skewed distribution: the probability that a node has x links is 1/x^a, with a ≈ 2.1.
The In-degree distribution (Altavista crawl 1999, WebBase crawl 2001): the indegree follows a power-law distribution

   Pr[in-degree(u) = k] ∝ 1/k^a,   a ≈ 2.1

A Picture of the Web Graph
Definition (continued). The three key properties:
 - Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1.
 - Locality: usually most of the hyperlinks point to other URLs on the same host (about 80%).
 - Similarity: pages close to each other in lexicographic (URL) order tend to share many outgoing lists.
(Picture of the Web Graph: 21 millions of pages, 150 millions of links.)

URL-sorting (e.g. Berkeley, Stanford): URL compression + Delta encoding.

The library WebGraph
From the uncompressed adjacency list to an adjacency list with compressed gaps (exploits locality):
   successor list S(x) = {s1 − x, s2 − s1 − 1, ..., sk − s_{k−1} − 1}
   (negative entries are remapped to non-negative codes)
Copy-lists (exploit similarity), with reference chains possibly limited in length:
   each bit of the copy-list tells whether the corresponding successor of the reference x is also a successor of the current node; the reference index is chosen in [0,W] so as to give the best compression.
Copy-blocks = RLE(Copy-list):
   the first copy-block is 0 if the copy-list starts with 0; the last block is omitted (we know the length…); the length is decremented by one for all blocks.
This is a Java and C++ library (≈ 3 bits/edge).

Extra-nodes: Compressing Intervals
The residual (extra) nodes are encoded exploiting consecutivity:
 - intervals are represented by their left extreme and length; the interval length is decremented by Lmin = 2;
 - residuals are coded as differences between consecutive residuals, or wrt the source node.
Examples from the slide:
   0 = (15−15)·2 (positive)      2 = (23−19)−2 (jump ≥ 2)      600 = (316−16)·2
   3 = |13−15|·2−1 (negative)    3018 = 3041−22−1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques
 - Common knowledge between sender & receiver — unstructured file: delta compression.
 - "Partial" knowledge — unstructured files: file synchronization; record-based data: set reconciliation.

Formalization
 - Delta compression [diff, zdelta, REBL,…]: compress file f deploying file f'; compress a group of files; speed up web access by sending the difference between the requested page and the ones available in cache.
 - File synchronization [rsync, zsync]: a client updates its old file f_old with f_new available on a server; mirroring, shared crawling, content distribution networks.
 - Set reconciliation: a client updates a structured old file f_old with f_new available on a server; update of contacts or appointments, intersection of inverted lists in a P2P search engine.

Z-delta compression (one-to-one)
Problem: we have two files f_known and f_new, and the goal is to compute a file f_d of minimum size such that f_new can be derived from f_known and f_d.
 - Assume that block moves and copies are allowed.
 - Find an optimal covering set of f_new based on f_known.
 - The LZ77 scheme provides an efficient, optimal solution: f_known is the "previously encoded text"; compress the concatenation f_known·f_new, emitting output only from f_new onwards.
 - zdelta is one of the best implementations.
Example (Emacs):  uncompressed 27Mb (—),  gzip 8Mb (35 secs),  zdelta 1.5Mb (42 secs).

Efficient Web Access
Dual proxy architecture: a pair of proxies located on the two sides of the slow link use a proprietary protocol to increase performance over that link:
   Client ⇄ client-side proxy ⇄ (slow link, delta-encoded pages) ⇄ server-side proxy ⇄ (fast link) ⇄ web.
Use zdelta to reduce traffic:
 - the old version (reference) is available at both proxies;
 - restricted to pages already visited (30% hits), URL-prefix match;
 - small cache.

Cluster-based delta compression
Problem: we wish to compress a group of files F (useful on a dynamic collection of web pages, back-ups, …).
Apply pairwise zdelta: find for each f ∈ F a good reference. Reduction to the Min Branching problem on DAGs:
 - build a weighted graph G_F: nodes = files, weights = zdelta-sizes;
 - insert a dummy node connected to all files, with weights equal to the gzip-coded sizes;
 - compute the min branching = directed spanning tree of minimum total cost covering G's nodes.
(Figure: a small instance with edge weights 0, 620, 2000, 220, 123, 20, …)
Results:  uncompressed 30Mb (—),  tgz 20% (linear time),  THIS 8% (quadratic time).

Improvement: what about many-to-one compression of a group of files?
Problem: constructing G is very costly — n² edge calculations (zdelta executions). We wish to exploit some pruning approach:
 - Collection analysis: cluster the files that appear similar and are thus good candidates for zdelta-compression; build a sparse weighted graph G'_F containing only the edges between those pairs of files.
 - Assign weights: estimate appropriate edge weights for G'_F, thus saving zdelta executions. Nonetheless, strictly n² time.
Results:  uncompressed 260Mb (—),  tgz 12% (2 mins),  THIS 8% (16 mins).

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
The client splits its f_old into blocks and sends their hashes; the server scans f_new, finds the blocks that match, and sends back an encoded file made of block references plus literal bytes.
(contd)
 - simple, widely used, single roundtrip;
 - optimizations: 4-byte rolling hash + 2-byte MD5, gzip for the literals;
 - choice of the block size is problematic (default: max{700, √n} bytes);
 - not good in theory: the granularity of the changes may disrupt the use of blocks.

Rsync: some experiments (compressed size in KB, slightly outdated numbers)

              gcc     emacs
   total     27288    27326
   gzip       7563     8577
   zdelta      227     1431
   rsync       964     4452

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync
The server sends the hashes (unlike rsync, where the client sends them), and the client checks them.
The server deploys the common f_ref to compress the new f_tar (rsync compresses just f_tar on its own).

A multi-round protocol: split the file into k blocks of n/k elements and recurse, over log(n/k) levels.
If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k · log n · log(n/k)) bits.

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T  iff  P is a prefix of the i-th suffix of T (i.e. T[i,N]).
Occurrences of P in T = all the suffixes of T having P as a prefix.
Example: T = mississippi, P = si → occurrences at positions 4 and 7.
SUF(T) = sorted set of the suffixes of T.
Reduction: from substring search to prefix search over SUF(T).

The Suffix Tree
(Figure: the suffix tree of T# = mississippi#, i.e. the compacted trie of its 12 suffixes; edges are labelled with substrings such as #, i, s, p, ssi, si, ppi#, pi#, i#, mississippi#, and the leaves store the starting positions 1..12 of the suffixes.)

The Suffix Array
Prop 1. All the suffixes in SUF(T) having prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.
Storing SUF(T) explicitly would take Θ(N²) space; we store instead the suffix array SA of the starting positions:
   T = mississippi#
   SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
   SUF(T) = #, i#, ippi#, issippi#, ississippi#, mississippi#, pi#, ppi#, sippi#, sissippi#, ssippi#, ssissippi#
Suffix Array space:
 • SA: Θ(N log2 N) bits
 • Text T: N chars
 ⇒ in practice, a total of 5N bytes.

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step.
Example: P = si on T = mississippi#; at each step compare P with the suffix starting at SA[mid] and move right if P is larger, left if P is smaller.
Suffix Array search:
 • O(log2 N) binary-search steps
 • each step takes O(p) char comparisons
 ⇒ overall, O(p · log2 N) time
Improvable to O(p + log2 N) [Manber-Myers, '90], and further in terms of |S| [Cole et al, '06].
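A sketch of the indirect binary search: two searches delimit the range of suffixes having P as a prefix (my rendering; suffix comparisons are truncated to |P| chars, which costs O(p) per step as stated above):

    def sa_search(T, SA, P):
        """SA: suffix array of T (0-based starting positions). Returns occurrences of P."""
        def suffix_ge(i):               # is the |P|-prefix of suffix T[SA[i]:] >= P ?
            return T[SA[i]:SA[i] + len(P)] >= P
        def suffix_gt(i):
            return T[SA[i]:SA[i] + len(P)] > P
        def lower(pred):                # first index whose suffix satisfies pred
            lo, hi = 0, len(SA)
            while lo < hi:
                mid = (lo + hi) // 2
                if pred(mid):
                    hi = mid
                else:
                    lo = mid + 1
            return lo
        l, r = lower(suffix_ge), lower(suffix_gt)
        return sorted(SA[i] + 1 for i in range(l, r))     # 1-based positions

    T = "mississippi#"
    SA = [i - 1 for i in [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]]
    print(sa_search(T, SA, "si"))     # [4, 7]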

Locating the occurrences
The occurrences of P = si in T = mississippi# form a contiguous range of SA, delimited by the lexicographic positions of si# and si$ (with # < Σ < $): the suffixes sippi# and sissippi#, i.e. positions 7 and 4, so occ = 2.
Suffix Array search: O(p + log2 N + occ) time.
Suffix Trays: O(p + log2 |S| + occ)   [Cole et al., '06]
String B-tree   [Ferragina-Grossi, '95]
Self-adjusting Suffix Arrays   [Ciriani et al., '02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA (for T = mississippi#, e.g. lcp(issippi#, ississippi#) = 4).
 • How long is the common prefix between T[i,…] and T[j,…]? It is the min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
 • Does there exist a repeated substring of length ≥ L? Search for an entry Lcp[i] ≥ L.
 • Does there exist a substring of length ≥ L occurring ≥ C times? Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.


Slide 187

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
Examples, with the code above:
  abc…  →  000 001 01 …  =  00000101…
  101001…  →  d c b …
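A minimal Huffman construction for the running example (a Python sketch using a heap; ties are broken by insertion order, so the 0/1 labels may differ from the tree above, but the codeword lengths are the same):

  import heapq

  def huffman_codes(freqs):
      # freqs: dict symbol -> probability; returns dict symbol -> codeword
      heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(sorted(freqs.items()))]
      heapq.heapify(heap)
      counter = len(heap)
      while len(heap) > 1:
          p1, _, c1 = heapq.heappop(heap)          # two least probable trees
          p2, _, c2 = heapq.heappop(heap)
          merged = {s: "0" + w for s, w in c1.items()}
          merged.update({s: "1" + w for s, w in c2.items()})
          heapq.heappush(heap, (p1 + p2, counter, merged))
          counter += 1
      return heap[0][2]

  codes = huffman_codes({"a": .1, "b": .2, "c": .2, "d": .5})
  print(codes)                              # lengths: a,b -> 3 bits, c -> 2 bits, d -> 1 bit
  print("".join(codes[s] for s in "abc"))   # encoding of "abc"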

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store, for any level L of the codeword tree:
• firstcode[L], the first (numerically smallest) codeword on level L
• Symbol[L,i], for each i in level L

This takes ≤ h^2 + |S| log |S| bits (h = tree height)

Canonical Huffman
Encoding
[figure: canonical codeword assignment over tree levels 1–5]

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
  −log2(.999) ≈ .00144

If we were to send 1000 such symbols we
might hope to use 1000 · .00144 ≈ 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
• 1 extra bit per macro-symbol = 1/k extra bits per symbol
• Larger model to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:
• The model takes |S|^k · (k · log |S|) + h^2 bits (where h might be |S|)
• It is H0(S^L) ≤ L · Hk(S) + O(k · log |S|), for each k ≤ L

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
[figure: word-based Huffman tree (“huffman” + “tagging”) over the words of T = “bzip or not bzip” (bzip, space, or, not); each codeword is a sequence of bytes carrying 7 bits of the Huffman code plus 1 tag bit, so codewords are byte-aligned; C(T) is the concatenation of the tagged codewords]

CGrep and other ideas...
P= bzip = 1a 0b

[figure: GREP run directly on C(T), T = “bzip or not bzip”: the byte-aligned, tagged codeword of P = bzip (1a 0b) is searched in C(T); the tag bits mark codeword boundaries, so each alignment is answered yes/no without decompression]
Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

The dictionary contains: bzip, not, or, space.
S = “bzip or not bzip”

[figure: scan of C(S), comparing the tagged codeword of P byte-by-byte against each codeword; the two occurrences of “bzip” answer yes, the others no]
Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
[figure: text T = … A B C A B D A B …, pattern P = A B]
 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching

We show methods in which arithmetic and bit operations replace comparisons.
We will survey two examples of such methods:
• The Random Fingerprint method due to Karp and Rabin
• The Shift-And method due to Baeza-Yates and Gonnet

Rabin-Karp Fingerprint

We will use a class of functions from strings to integers in order to obtain:
• An efficient randomized algorithm that makes an error with small probability.
• A randomized algorithm that never errs, whose running time is efficient with high probability.

We will consider a binary alphabet (i.e., T ∈ {0,1}^n).

Arithmetic replaces Comparisons

Strings are also numbers, H: strings → numbers.
Let s be a string of length m:

  H(s) = ∑_{i=1,…,m} 2^(m−i) · s[i]

P = 0101
H(P) = 2^3·0 + 2^2·1 + 2^1·0 + 2^0·1 = 5

s = s’ if and only if H(s) = H(s’)

Definition:
let Tr denote the m-length substring of T starting at
position r (i.e., Tr = T[r, r+m−1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons

We can compute H(Tr) from H(Tr−1):

  H(Tr) = 2·H(Tr−1) − 2^m·T(r−1) + T(r+m−1)

T = 10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

  H(T1) = H(1011) = 11
  H(T2) = 2·11 − 2^4·1 + 0 = 22 − 16 = 6 = H(0110)

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P = 101111,  q = 7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally, one bit at a time:
  1·2 (mod 7) + 0 = 2
  2·2 (mod 7) + 1 = 5
  5·2 (mod 7) + 1 = 4
  4·2 (mod 7) + 1 = 2
  2·2 (mod 7) + 1 = 5
  5 (mod 7) = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(Tr−1), using
  2^m (mod q) = 2·( 2^(m−1) (mod q) ) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
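A compact sketch of the fingerprint scan (Python; the prime q is fixed here instead of being drawn at random, and every fingerprint hit is verified, i.e. the deterministic variant):

  def karp_rabin(T, P, q=2**31 - 1):
      # T, P: strings over {0,1}; returns the 1-based occurrences of P in T
      n, m = len(T), len(P)
      if m > n:
          return []
      top = pow(2, m - 1, q)                    # 2^(m-1) mod q, weight of the leading bit
      hp = ht = 0
      for i in range(m):                        # fingerprints of P and of T_1
          hp = (2 * hp + int(P[i])) % q
          ht = (2 * ht + int(T[i])) % q
      hits = [1] if hp == ht and T[:m] == P else []
      for r in range(2, n - m + 2):             # slide the window: T_r = T[r, r+m-1]
          ht = (2 * (ht - int(T[r - 2]) * top) + int(T[r + m - 2])) % q
          if ht == hp and T[r - 1:r - 1 + m] == P:    # verify the candidate match
              hits.append(r)
      return hits

  print(karp_rabin("10110101", "0101"))   # -> [5]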

Problem 1: Solution
Dictionary: bzip, not, or, space
P = bzip = 1a 0b
S = “bzip or not bzip”

[figure: the tagged codeword of P is compared against the codewords of C(S), one by one; the two positions holding “bzip” answer yes, the others no]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

  T:        c  a  l  i  f  o  r  n  i  a
  j:        1  2  3  4  5  6  7  8  9  10
  f (i=1):  0  0  0  0  1  0  0  0  0  0
  o (i=2):  0  0  0  0  0  1  0  0  0  0
  r (i=3):  0  0  0  0  0  0  1  0  0  0

M is an m × n binary matrix (here m = 3, n = 10); the 1 in row m at column 7 marks the occurrence of “for” ending at position 7.

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M

We want to exploit bit-parallelism to compute the j-th
column of M from the (j−1)-th one.
We define the m-length binary vector U(x) for each
character x in the alphabet: U(x) is set to 1 at the
positions in P where character x appears.

Example: P = abaac
  U(a) = (1,0,1,1,0)^T    U(b) = (0,1,0,0,0)^T    U(c) = (0,0,0,0,1)^T

How to construct M

Initialize column 0 of M to all zeros.
For j > 0, the j-th column is obtained by

  M(j) = BitShift( M(j−1) ) & U( T[j] )

For i > 1, entry M(i,j) = 1 iff
 (1) the first i−1 characters of P match the i−1 characters of T ending at character j−1, i.e. M(i−1, j−1) = 1
 (2) P[i] = T[j], i.e. the i-th bit of U(T[j]) = 1

BitShift moves bit M(i−1, j−1) into the i-th position;
AND-ing it with the i-th bit of U(T[j]) establishes whether both conditions hold.

Example: T = xabxabaaca, P = abaac   (m = 5, n = 10)

  U(a) = 10110,  U(b) = 01000,  U(c) = 00001,  U(x) = 00000   (bit i set iff P[i] = x)

Columns of M, computed as M(j) = BitShift(M(j−1)) & U(T[j]):

   j:    1  2  3  4  5  6  7  8  9  10
   T[j]: x  a  b  x  a  b  a  a  c  a
   i=1:  0  1  0  0  1  0  1  1  0  1
   i=2:  0  0  1  0  0  1  0  0  0  0
   i=3:  0  0  0  0  0  0  1  0  0  0
   i=4:  0  0  0  0  0  0  0  1  0  0
   i=5:  0  0  0  0  0  0  0  0  1  0

e.g. j = 9:  BitShift(M(8)) & U(T[9]) = 11001 & 00001 = 00001,
so M(5,9) = 1: P occurs ending at position 9 (i.e. starting at position 5).

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.
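A sketch of the Shift-And scan using Python integers as bit masks (bit i−1 of the mask plays the role of row i of M; Python integers are unbounded, so the m ≤ w restriction disappears but each operation costs O(m/w) as noted above):

  def shift_and(T, P):
      m = len(P)
      U = {}                                   # U[c] has bit i set iff P[i+1] == c
      for i, c in enumerate(P):
          U[c] = U.get(c, 0) | (1 << i)
      match_bit = 1 << (m - 1)
      M, occ = 0, []
      for j, c in enumerate(T):
          M = ((M << 1) | 1) & U.get(c, 0)     # M(j) = BitShift(M(j-1)) & U(T[j])
          if M & match_bit:
              occ.append(j - m + 2)            # 1-based starting position
      return occ

  print(shift_and("xabxabaaca", "abaac"))      # -> [5]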

Some simple extensions

We want to allow the pattern to contain special symbols,
like the class of chars [a-f].

P = [a-b]baac
  U(a) = (1,0,1,1,0)^T    U(b) = (1,1,0,0,0)^T    U(c) = (0,0,0,0,1)^T

What about ‘?’, ‘[^…]’ (not).

Problem 1: Another solution
Dictionary: bzip, not, or, space
P = bzip = 1a 0b
S = “bzip or not bzip”

[figure: the Shift-And scan is run over the bytes of C(S), matching the tagged codeword of P; each candidate position answers yes/no]
Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

The dictionary contains: bzip, not, or, space.    P = o
S = “bzip or not bzip”

The terms containing “o” are:  not = 1g 0g 0a,   or = 1g 0a 0b

[figure: the codeword of each such term is searched in C(S); matching positions answer yes]
Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[figure: text T with the occurrences of patterns P1 and P2 highlighted]

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

• S is the concatenation of the patterns in P
• R is a bitmap of length m, with R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of the Shift-And method searching for S:
• For any symbol c, U’(c) = U(c) AND R, so that U’(c)[i] = 1 iff S[i] = c and it is the first symbol of a pattern
• For any step j:
   – compute M(j)
   – then M(j) OR U’(T[j]). Why? It sets to 1 the first bit of each pattern that starts with T[j]
   – check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

Dictionary: bzip, not, or, space
P = bot,  k = 2
S = “bzip or not bzip”

[figure: the codewords of C(S) are compared against the codeword of P allowing up to k mismatches]
Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff there are no more than l mismatches
between the first i characters of P and the i characters
of T ending at character j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

  BitShift( M^l(j−1) ) & U(T[j])

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

  BitShift( M^(l−1)(j−1) )

Computing Ml

We compute M^l for all l = 0, …, k; for each j we compute M(j), M^1(j), …, M^k(j).
For all l, initialize M^l(0) to the zero vector.
In order to compute M^l(j), we observe that there is a match iff case 1 or case 2 holds:

  M^l(j)  =  [ BitShift( M^l(j−1) ) & U(T[j]) ]  OR  BitShift( M^(l−1)(j−1) )

Example: M1 (and M0) for T = xabxabaaca, P = abaad

        j:   1  2  3  4  5  6  7  8  9  10
        T:   x  a  b  x  a  b  a  a  c  a

  M1:  i=1:  1  1  1  1  1  1  1  1  1  1
       i=2:  0  0  1  0  0  1  0  1  1  0
       i=3:  0  0  0  1  0  0  1  0  0  1
       i=4:  0  0  0  0  1  0  0  1  0  0
       i=5:  0  0  0  0  0  0  0  0  1  0

  M0:  i=1:  0  1  0  0  1  0  1  1  0  1
       i=2:  0  0  1  0  0  1  0  0  0  0
       i=3:  0  0  0  0  0  0  1  0  0  0
       i=4:  0  0  0  0  0  0  0  1  0  0
       i=5:  0  0  0  0  0  0  0  0  0  0

M1(5,9) = 1: P occurs ending at position 9 with at most one mismatch.

How much do we pay?

• The running time is O(k·n·(1 + m/w))
• Again, the method is practically efficient for small m.
• Only O(k) columns of M^0,…,M^k are needed at any given time; hence the space used by the algorithm is O(k) memory words.
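A sketch of the k-mismatch variant, with one mask per error level l updated exactly as in the recurrence above (Python; not the actual agrep code):

  def shift_and_mismatches(T, P, k):
      m = len(P)
      U = {}
      for i, c in enumerate(P):
          U[c] = U.get(c, 0) | (1 << i)
      match_bit = 1 << (m - 1)
      M, occ = [0] * (k + 1), []
      for j, c in enumerate(T):
          prev = M[:]                          # the columns for position j-1
          M[0] = ((prev[0] << 1) | 1) & U.get(c, 0)
          for l in range(1, k + 1):
              # case 1: extend an <= l-mismatch prefix with a matching char
              # case 2: extend an <= (l-1)-mismatch prefix, charging one mismatch
              M[l] = (((prev[l] << 1) | 1) & U.get(c, 0)) | ((prev[l - 1] << 1) | 1)
          if M[k] & match_bit:
              occ.append(j - m + 2)            # 1-based start, <= k mismatches
      return occ

  print(shift_and_mismatches("xabxabaaca", "abaad", 1))   # -> [5]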

Problem 3: Solution
Dictionary: bzip, not, or, space
P = bot,  k = 2
S = “bzip or not bzip”

[figure: the codeword of each candidate term, e.g. not = 1g 0g 0a, is compared against the codewords in C(S) allowing up to k mismatches; matching positions answer yes]

Agrep: more sophisticated operations

The Shift-And method can solve other ops.

The edit distance between two strings p and s is
d(p,s) = minimum number of operations needed to transform p into s via three ops:
• Insertion: insert a symbol in p
• Deletion: delete a symbol from p
• Substitution: change a symbol of p into a different one

Example: d(ananas, banane) = 3

Search by regular expressions
Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding

  γ(x) = 0 0 … 0 x-in-binary      (Length−1 zeros, with x > 0 and Length = ⌊log2 x⌋ + 1)

  e.g., 9 is represented as <000, 1001>

• The γ-code for x takes 2⌊log2 x⌋ + 1 bits (i.e. a factor of 2 from optimal)
• It is optimal for Pr(x) = 1/(2x^2), and i.i.d. integers
• It is a prefix-free encoding…

Exercise: given the following sequence of γ-coded integers, reconstruct the original sequence:

  0001000001100110000011101100111   →   8  6  3  59  7
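A γ encode/decode sketch matching the definition above (Python):

  def gamma_encode(x):
      # x > 0: (Length-1) zeros followed by x in binary
      b = bin(x)[2:]
      return "0" * (len(b) - 1) + b

  def gamma_decode(bits):
      out, i = [], 0
      while i < len(bits):
          zeros = 0
          while bits[i] == "0":                # count the unary length prefix
              zeros += 1
              i += 1
          out.append(int(bits[i:i + zeros + 1], 2))
          i += zeros + 1
      return out

  print(gamma_encode(9))                                      # 0001001
  print(gamma_decode("0001000001100110000011101100111"))      # [8, 6, 3, 59, 7]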

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).

Recall that: |γ(i)| ≤ 2·log2 i + 1
How good is this approach wrt Huffman?  Compression ratio ≤ 2·H0(S) + 1
Key fact:  1 ≥ ∑_{i=1,…,x} pi ≥ x·px   ⇒   x ≤ 1/px

How good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log2 i + 1.
The cost of the encoding is (recall i ≤ 1/pi):

  ∑_{i=1,…,|S|} pi · |γ(i)|  ≤  ∑_{i=1,…,|S|} pi · [ 2·log2(1/pi) + 1 ]  =  2·H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + …

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes:
• It is a prefix-code
• Better compression: it uses all the 7-bit configurations

(s,c)-dense codes
The distribution of words is skewed: 1/i^q, where 1 < q < 2
A new concept: Continuers vs Stoppers
• Previously we used s = c = 128
The main idea is:
• s + c = 256 (we are playing with 8 bits)
• Thus s items are encoded with 1 byte
• And s·c with 2 bytes, s·c^2 with 3 bytes, ...

An example
• 5000 distinct words
• ETDC encodes 128 + 128^2 = 16512 words on up to 2 bytes
• A (230,26)-dense code encodes 230 + 230·26 = 6210 words on up to 2 bytes, hence more words on 1 byte, and thus it wins if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n^2 log n), MTF = O(n log n) + n^2

Not much worse than Huffman
...but it may be far better
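A minimal MTF coder (Python sketch; positions are 0-based, as in the Mtf strings used later for bzip):

  def mtf_encode(seq, alphabet):
      L, out = list(alphabet), []
      for s in seq:
          pos = L.index(s)
          out.append(pos)               # 1) output the position of s in L
          L.insert(0, L.pop(pos))       # 2) move s to the front of L
      return out

  def mtf_decode(codes, alphabet):
      L, out = list(alphabet), []
      for pos in codes:
          s = L.pop(pos)
          out.append(s)
          L.insert(0, s)
      return "".join(out)

  codes = mtf_encode("aabbbaac", "abcd")
  print(codes)                          # [0, 0, 1, 0, 0, 1, 0, 2]
  print(mtf_decode(codes, "abcd"))      # aabbbaac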

MTF: how good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log2 i + 1.
Put S at the front of the list; for a symbol x occurring nx times at positions px^1 < px^2 < … the emitted MTF positions are bounded by the gaps, so the cost of encoding is at most

  O(|S| log |S|)  +  ∑_{x=1,…,|S|}  ∑_{i=2,…,nx}  |γ( px^i − px^(i−1) )|

By Jensen’s inequality:

  ≤  O(|S| log |S|)  +  ∑_{x=1,…,|S|}  nx · [ 2·log2(N/nx) + 1 ]
  =  O(|S| log |S|)  +  N · [ 2·H0(X) + 1 ]

  La[mtf]  ≤  2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep the MTF-list efficiently:
• Search tree
   – leaves contain the symbols, ordered as in the MTF-list
   – nodes contain the size of their descending subtree
• Hash table
   – key is a symbol
   – data is a pointer to the corresponding tree leaf

Each tree operation takes O(log |S|) time.
Total is O(n log |S|), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code (there is a memory)
X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = Θ(n^2 log n)  >  Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive):

  f(i) = ∑_{j=1,…,i−1} p(j)

e.g.  p(a) = .2, p(b) = .5, p(c) = .3   ⇒   f(a) = .0, f(b) = .2, f(c) = .7
      a = [0.0, 0.2),  b = [0.2, 0.7),  c = [0.7, 1.0)

The interval for a particular symbol will be called
the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac   (p(a) = .2, p(b) = .5, p(c) = .3)

  start:     [0.0, 1.0)
  after b:   [0.2, 0.7)
  after a:   [0.2, 0.3)
  after c:   [0.27, 0.3)

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities p[c] use the following:

  l_0 = 0     l_i = l_(i−1) + s_(i−1) · f[c_i]
  s_0 = 1     s_i = s_(i−1) · p[c_i]

f[c] is the cumulative prob. up to symbol c (not included).
The final interval size is

  s_n = ∏_{i=1,…,n} p[c_i]

The interval for a message sequence will be called the
sequence interval
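A sketch computing the sequence interval with the recurrences above (Python, floating point; real coders use the integer/scaling version described later to avoid precision loss):

  def sequence_interval(msg, p):
      # p: dict symbol -> probability; returns (l, s) of the sequence interval
      f, acc = {}, 0.0
      for c in sorted(p):                    # fix an order to define the cumulative f[]
          f[c] = acc
          acc += p[c]
      l, s = 0.0, 1.0
      for c in msg:
          l = l + s * f[c]                   # l_i = l_(i-1) + s_(i-1) * f[c_i]
          s = s * p[c]                       # s_i = s_(i-1) * p[c_i]
      return l, s

  l, s = sequence_interval("bac", {"a": .2, "b": .5, "c": .3})
  print(l, l + s)                            # ~0.27 0.30, i.e. the interval [.27, .3)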

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:

  .49 ∈ [0.2, 0.7)     →  b
  .49 ∈ [0.3, 0.55)    →  b   (interval of b within [0.2, 0.7))
  .49 ∈ [0.475, 0.55)  →  c   (interval of c within [0.3, 0.55))

The message is bbc.

Representing a real number
Binary fractional representation:

  .75 = .11      1/3 = .010101…      11/16 = .1011

Algorithm (emit the binary expansion of x ∈ [0,1)):
  1. x = 2·x
  2. if x < 1, output 0
  3. else x = x − 1; output 1

So how about just using the shortest binary fractional
representation in the sequence interval?
e.g.  [0,.33) = .01     [.33,.66) = .1     [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions:

  number   min     max     interval
  .11      .110    .111    [.75, 1.0)
  .101     .1010   .1011   [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (a dyadic number).

  Sequence interval: [.61, .79)      Code interval (.101): [.625, .75)

Can use L + s/2 truncated to 1 + ⌈log2 (1/s)⌉ bits

Bound on Arithmetic length
Note that −log2 s + 1 = log2 (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

  1 + ⌈log2 (1/s)⌉  =  1 + ⌈log2 ∏_i (1/p_i)⌉
                    ≤  2 + ∑_{i=1,…,n} log2 (1/p_i)
                    =  2 + ∑_{k=1,…,|S|} n·p_k · log2 (1/p_k)
                    =  2 + n·H0    bits

In practice it is ≈ n·H0 + 0.02·n bits, because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
  Output 1 followed by m 0s;  m = 0
  Message interval is expanded by 2

If u < R/2 then (bottom half)
  Output 0 followed by m 1s;  m = 0
  Message interval is expanded by 2

If l ≥ R/4 and u < 3R/4 then (middle half)
  Increment m
  Message interval is expanded by 2

All other cases: just continue...

You find this at

Arithmetic ToolBox
As a state machine: the ATB takes the current interval (L, s) and a symbol c with distribution (p1, …, p|S|), and returns the sub-interval (L’, s’) of (L, s) associated with c.

  ATB: (L, s), c  →  (L’, s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
The ATB is driven by p[s | context], where s = c or esc:  (L, s) → (L’, s’) as before.

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts   (k = 2)
String = ACCBACCACBA B

  Empty context:   A = 4   B = 2   C = 5   $ = 3

  Context A:   C = 3   $ = 1
  Context B:   A = 2   $ = 1
  Context C:   A = 1   B = 2   C = 2   $ = 3

  Context AC:  B = 1   C = 2   $ = 2
  Context BA:  C = 1   $ = 1
  Context CA:  C = 1   $ = 1
  Context CB:  A = 2   $ = 1
  Context CC:  A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:
• Output (d, len, c) where
   – d = distance of the copied string wrt the current position
   – len = length of the longest match
   – c = next char in the text beyond the longest match
• Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window   (window size = 6)
T = a a c a a c a b c a b a a a c

  (0,0,a)   (1,1,c)   (3,4,b)   (3,3,a)   (1,2,c)

At each step: longest match within W, then the next character.

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code
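A tiny LZ77 parser over an unbounded window (a sketch of the scheme above; no window limit, no LZSS format, no hash tables):

  def lz77_parse(T):
      out, cur = [], 0
      while cur < len(T):
          best_len, best_d = 0, 0
          for start in range(cur):                       # candidate copy sources
              l = 0
              while cur + l < len(T) - 1 and T[start + l] == T[cur + l]:
                  l += 1                                 # overlap with the cursor is allowed
              if l > best_len:
                  best_len, best_d = l, cur - start
          nxt = T[cur + best_len]                        # next char beyond the longest match
          out.append((best_d, best_len, nxt))
          cur += best_len + 1
      return out

  print(lz77_parse("aabaacabc"))   # [(0, 0, 'a'), (1, 1, 'b'), (3, 2, 'c'), (5, 2, 'c')]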

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later
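A small LZW encoder (Python sketch; the dictionary starts from the symbols actually used instead of the 256 ASCII entries, and the special SSc case concerns only the decoder, not shown here):

  def lzw_encode(T, alphabet):
      dic = {c: i for i, c in enumerate(alphabet)}     # initial dictionary
      out, S = [], ""
      for c in T:
          if S + c in dic:
              S = S + c                                # extend the current match
          else:
              out.append(dic[S])                       # emit the id of the longest match S
              dic[S + c] = len(dic)                    # add Sc to the dictionary
              S = c
      out.append(dic[S])
      return out

  print(lzw_encode("aabaacababacb", "abc"))   # [0, 0, 1, 3, 2, 4, 8, 2, 1]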

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform   (1994)
We are given a text T = mississippi#. Take all its cyclic rotations:

  mississippi#
  ississippi#m
  ssissippi#mi
  sissippi#mis
  issippi#miss
  ssippi#missi
  sippi#missis
  ippi#mississ
  ppi#mississi
  pi#mississip
  i#mississipp
  #mississippi

Sort the rows:

  F           L
  #mississipp i
  i#mississip p
  ippi#missis s
  issippi#mis s
  ississippi# m
  mississippi #
  pi#mississi p
  ppi#mississ i
  sippi#missi s
  sissippi#mi s
  ssippi#miss i
  ssissippi#m i

L is the BWT of T (a famous example; real texts are much longer...).

A useful tool: the L → F mapping
How do we map L’s chars onto F’s chars? …we need to distinguish equal chars in F…
Take two equal chars of L and rotate their rows rightward by one position: they keep the same relative order !!

The BWT is invertible
Two key properties:
 1. The LF-array maps L’s chars to F’s chars
 2. L[i] precedes F[i] in T
Reconstruct T backward:   T = .... i p p i #

InvertBWT(L)
  Compute LF[0,n-1];    // LF[r] = position in F of the character L[r] (property 1)
  r = 0; i = n;         // row 0 is the rotation starting with the sentinel #
  while (i > 0) {
    T[i] = L[r];        // property 2: L[r] precedes F[r] in T, so T is filled backward
    r = LF[r]; i--;     // jump to the row whose first char is the char just written
  }

How to compute the BWT ?

  SA:  12  11   8   5   2   1  10   9   7   4   6   3
  L:    i   p   s   s   m   #   p   i   s   s   i   i

(the rows of the BWT matrix correspond to the suffixes of T in lexicographic order)

We said that L[i] precedes F[i] in T, e.g. L[3] = T[7].
Given SA and T, we have L[i] = T[SA[i] − 1].
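A direct (inefficient but short) sketch: build SA by sorting the suffixes, derive L via L[i] = T[SA[i]−1], and invert through the LF mapping (Python; assumes T ends with a unique smallest terminator ‘#’):

  def bwt(T):
      sa = sorted(range(len(T)), key=lambda i: T[i:])    # naive suffix sorting
      return "".join(T[i - 1] for i in sa), sa           # L[i] = T[SA[i]-1], cyclically

  def inverse_bwt(L):
      n = len(L)
      # LF[i] = position in F = sorted(L) of the char L[i]; equal chars keep their order
      order = sorted(range(n), key=lambda i: (L[i], i))
      LF = [0] * n
      for f_pos, l_pos in enumerate(order):
          LF[l_pos] = f_pos
      row, out = 0, []                  # row 0 is the rotation starting with '#'
      for _ in range(n):
          out.append(L[row])            # emits T backward: T[n-2], ..., T[0], then '#'
          row = LF[row]
      s = "".join(reversed(out))        # this is '#' + T[:-1]
      return s[1:] + s[0]               # rotate the terminator back to the end

  L, sa = bwt("mississippi#")
  print(L)                  # ipssm#pissii
  print(inverse_bwt(L))     # mississippi#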

How to construct SA from T ?
Input: T = mississippi#. Sort its suffixes lexicographically:

  SA
  12   #
  11   i#
   8   ippi#
   5   issippi#
   2   ississippi#
   1   mississippi#
  10   pi#
   9   ppi#
   7   sippi#
   4   sissippi#
   6   ssippi#
   3   ssissippi#

Elegant but inefficient. Obvious inefficiencies:
• Θ(n^2 log n) time in the worst case
• Θ(n log n) cache misses or I/O faults
Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…

• Physical network graph:  V = routers,  E = communication links
• The “cosine” graph (undirected, weighted):  V = static web pages,  E = semantic distance between pages
• Query-Log graph (bipartite, weighted):  V = queries and URLs,  E = (q,u) if u is a result for q and has been clicked by some user who issued q
• Social graph (undirected, unweighted):  V = users,  E = (x,y) if x knows y (facebook, address book, email, ..)

Definition
Directed graph G = (V,E):
• V = URLs,  E = (u,v) if u has a hyperlink to v
• Isolated URLs are ignored (no IN & no OUT)

Three key properties; the first one:
• Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution
Altavista crawl (1999) and WebBase crawl (2001): the indegree follows a power-law distribution

  Pr[ in-degree(u) = k ]  ∝  1/k^a,   a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/x^a, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
[figure: adjacency dot-plot (i,j) of a crawl with 21 million pages and 150 million links; after URL-sorting, hosts such as Berkeley and Stanford show up as dense blocks]
URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



• The LZ77 scheme provides an efficient, optimal solution
• fknown is the “previously encoded text”: compress the concatenation fknown·fnew, starting from fnew

zdelta is one of the best implementations
             Emacs size   Emacs time
  uncompr    27Mb         ---
  gzip       8Mb          35 secs
  zdelta     1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[figure: weighted graph over the files plus a dummy node; edge weights = zdelta sizes, dummy-node edges = gzip sizes; the min branching picks the cheapest reference for each file]

            space   time
  uncompr   30Mb    ---
  tgz       20%     linear
  THIS      8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: estimate appropriate edge weights for G’F, thus saving
zdelta executions. Nonetheless, strictly n^2 time.

            space    time
  uncompr   260Mb    ---
  tgz       12%      2 mins
  THIS      8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

            gcc size   emacs size
  total      27288       27326
  gzip        7563        8577
  zdelta       227        1431
  rsync        964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

[figure: P aligned with the prefix of the suffix T[i,N]]
Occurrences of P in T = All suffixes of T having P as a prefix
Example: P = si, T = mississippi  →  occurrences at positions 4, 7
SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[figure: suffix tree of T# = mississippi# — a compacted trie of all its suffixes; edge labels such as “ssi”, “si”, “ppi#”, “i#”, “#”, and 12 leaves storing the starting positions of the suffixes]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

T = mississippi#       (storing SUF(T) explicitly would take Θ(N^2) space)

  SA    SUF(T)
  12    #
  11    i#
   8    ippi#
   5    issippi#
   2    ississippi#
   1    mississippi#
  10    pi#
   9    ppi#
   7    sippi#
   4    sissippi#
   6    ssippi#
   3    ssissippi#

Suffix Array:
• SA: Θ(N log2 N) bits
• Text T: N chars
• In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, with 2 accesses per step (one to SA, one to the text).

T = mississippi#,  P = si
Compare P with the suffix starting at T[SA[mid]]: if P is larger move right, if P is smaller move left.

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char comparisons
  ⇒ overall, O(p log2 N) time
(improvable to O(p + log2 N) [Manber-Myers, ’90]; see also [Cole et al, ’06])
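A sketch of the indirect binary search (Python; each comparison slices a p-long prefix of a suffix, so it costs O(p) as stated above):

  def sa_search(T, sa, P):
      n, p = len(sa), len(P)
      lo, hi = 0, n                                # first suffix with prefix >= P
      while lo < hi:
          mid = (lo + hi) // 2
          if T[sa[mid]:sa[mid] + p] < P:           # O(p) char comparisons per step
              lo = mid + 1
          else:
              hi = mid
      first = lo
      lo, hi = first, n                            # end of the block of suffixes prefixed by P
      while lo < hi:
          mid = (lo + hi) // 2
          if T[sa[mid]:sa[mid] + p] <= P:
              lo = mid + 1
          else:
              hi = mid
      return [sa[i] + 1 for i in range(first, lo)] # 1-based occurrences

  T = "mississippi#"
  sa = sorted(range(len(T)), key=lambda i: T[i:])  # naive suffix array construction
  print(sorted(sa_search(T, sa, "si")))            # -> [4, 7]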

Locating the occurrences
T = mississippi#, P = si: two binary searches, for si# and si$ (# smaller and $ larger than every character), delimit the SA range of the suffixes prefixed by si; here occ = 2, at positions 4 and 7 (suffixes sissippi#, sippi#).

Suffix Array search:  O(p + log2 N + occ) time
Suffix Trays:  O(p + log2 |S| + occ)   [Cole et al., ‘06]
String B-tree   [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays   [Ciriani et al., ’02]

Text mining
Lcp[1,N−1] = longest-common-prefix between suffixes adjacent in SA
(e.g. for T = mississippi#, Lcp(issippi#, ississippi#) = 4)

• How long is the common prefix between T[i,...] and T[j,...] ?
  Min of the subarray Lcp[h,k−1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  Search for Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  Search for a run Lcp[i,i+C−2] whose entries are all ≥ L.


Slide 188

N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store, for each level L of the code tree:
 firstcode[L] = the (numeric value of the) first codeword on level L; on the deepest level it is 00.....0
 Symbol[L,i], for each i in level L

This takes ≤ h^2 + |S| log |S| bits, where h is the tree height

Canonical Huffman
Encoding

[Figure: the canonical Huffman tree, with levels 1–5]

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...
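A hedged decoding sketch using the firstcode[] values above (the Symbol[L][·] table is assumed, as hinted by the slide; bits are read until the value seen so far becomes a valid codeword of its length):

  def canonical_decode(bits, firstcode, symbol):
      # bits: sequence of 0/1; firstcode[L]: value of the first codeword of length L
      # symbol[L][i]: i-th symbol on level L (hypothetical table, as in the slide)
      out, it = [], iter(bits)
      for b in it:
          v, L = b, 1
          while v < firstcode[L]:            # not yet a valid codeword: read one more bit
              v = 2 * v + next(it)
              L += 1
          out.append(symbol[L][v - firstcode[L]])
      return out

With firstcode = {1:2, 2:1, 3:1, 4:2, 5:0}, the prefix 00010 of T is decoded as Symbol[5][2].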

Problem with Huffman Coding
Consider a symbol with probability .999. Its self information is
  -log2(.999) ≈ .00144 bits

If we were to send 1000 such symbols we might hope to use
1000 · .00144 ≈ 1.44 bits.

Using Huffman, we take at least 1 bit per symbol, so we would
require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
  1 extra bit per macro-symbol = 1/k extra bits per symbol
  Larger model to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:
 The model takes |S|^k · (k · log |S|) + h^2 bits (where h might be as large as |S|^k)
 It is H0(S^L) ≤ L · Hk(S) + O(k · log |S|), for each k ≤ L

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
[Figure: word-based tagged Huffman over T = “bzip or not bzip”. The words (bzip, or, not, space) are the symbols of a 128-ary Huffman tree; each codeword is a sequence of bytes, the first bit of each byte is the tag and the remaining 7 bits carry the Huffman output, so codewords are byte-aligned.]

CGrep and other ideas...
P = bzip = 1a 0b

[Figure: GREP on the compressed text. The pattern bzip is Huffman-encoded once (1a 0b) and then matched, byte-aligned, directly against C(T) of T = “bzip or not bzip”; the tag bits rule out false matches inside other codewords (yes/no marks in the figure).]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary = {bzip, not, or, space};  P = bzip = 1a 0b;  S = “bzip or not bzip”

[Figure: the occurrences of the dictionary word P are found by scanning the compressed text C(S) and matching the tagged, byte-aligned codeword of P.]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
[Figure: pattern P slid along text T; occurrences of P in T are highlighted]

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which arithmetic and bit operations replace comparisons.
We will survey two examples of such methods:
 The Random Fingerprint method, due to Karp and Rabin
 The Shift-And method, due to Baeza-Yates and Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in order to obtain:
 An efficient randomized algorithm that makes an error with small probability.
 A randomized algorithm that never errs, whose running time is efficient with high probability.

We will consider a binary alphabet (i.e., T ∈ {0,1}^n).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m:

  H(s) = Σ_{i=1}^{m} 2^{m-i} · s[i]

P = 0101
H(P) = 2^3·0 + 2^2·1 + 2^1·0 + 2^0·1 = 5

s = s’ if and only if H(s) = H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T = 10110101,  P = 0101,  H(P) = 5

  T = 10110101
  P =  0101         H(T2) = 6 ≠ H(P)

  T = 10110101
  P =     0101      H(T5) = 5 = H(P)   Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1):

  H(T_r) = 2 · H(T_{r-1}) - 2^m · T(r-1) + T(r+m-1)

T = 10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H(T1) = H(1011) = 11
H(T2) = H(0110) = 2·11 - 2^4·1 + 0 = 22 - 16 + 0 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
 Compute H(P) and H(T1)
 Run over T: compute H(Tr) from H(Tr-1) in constant time, and make the comparison H(P) = H(Tr)

Total running time O(n+m)?
 NO! Why?
 The problem is that when m is large, it is unreasonable to assume that each arithmetic operation can be done in O(1) time.
 Values of H() are m-bit long numbers: in general, they are too BIG to fit in a machine word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P = 101111,  q = 7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally (Horner's rule, reducing mod q at each step):
  (1·2 + 0) mod 7 = 2
  (2·2 + 1) mod 7 = 5
  (5·2 + 1) mod 7 = 4
  (4·2 + 1) mod 7 = 2
  (2·2 + 1) mod 7 = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1), since
  2^m (mod q) = 2 · (2^{m-1} (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
 Arithmetic: there is an occurrence of P starting at position r of T if and only if H(P) = H(Tr)
 Modular arithmetic: if there is an occurrence of P starting at position r of T, then Hq(P) = Hq(Tr)
 False match! There are values of q for which the converse is not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!

Our goal will be to choose a modulus q such that
 q is small enough to keep computations efficient (i.e., Hq() values fit in a machine word)
 q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
 Pick a random prime q less than or equal to I, and compute P’s fingerprint Hq(P).
 For each position r in T, compute Hq(Tr) and test whether it equals Hq(P). If the numbers are equal, either
   declare a probable match (randomized algorithm), or
   check and declare a definite match (deterministic algorithm).

Running time: excluding verification, O(n+m).
 The randomized algorithm is correct w.h.p.
 The deterministic algorithm has expected running time O(n+m).

Proof on the board
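A minimal sketch of the matcher just outlined (a fixed prime is used here instead of a randomly drawn one, purely for illustration; names are mine):

  def karp_rabin(T, P, q=2**31 - 1):
      # Binary strings T, P; q plays the role of the random prime of the algorithm
      n, m = len(T), len(P)
      hP = hT = 0
      for i in range(m):                       # fingerprints of P and of T_1 = T[1,m]
          hP = (2 * hP + int(P[i])) % q
          hT = (2 * hT + int(T[i])) % q
      pow_m = pow(2, m - 1, q)                 # 2^(m-1) mod q, to drop the leading bit
      occ = []
      for r in range(n - m + 1):
          if hT == hP and T[r:r+m] == P:       # verification -> no false matches reported
              occ.append(r + 1)                # 1-based positions, as in the slides
          if r + m < n:                        # slide the window: Hq(T_{r+1}) from Hq(T_r)
              hT = (2 * (hT - int(T[r]) * pow_m) + int(T[r + m])) % q
      return occ

  print(karp_rabin("10110101", "0101"))        # -> [5], as in the example above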

Problem 1: Solution
Dictionary = {bzip, not, or, space};  P = bzip = 1a 0b;  S = “bzip or not bzip”

[Figure: the occurrences of P are found by running an exact string-matching algorithm for the (tagged, byte-aligned) codeword of P directly over the compressed text C(S).]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

[Figure: the 3 × 10 matrix M for T = california and P = for. The only 1-entries are M(1,5) (f), M(2,6) (fo) and M(3,7) (for): column 7 has a 1 in its last row, so P occurs ending at position 7.]

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is the bit-wise AND between A and B.
BitShift(A) is the value derived by shifting A’s bits down by one position and setting the first bit to 1.

[Example: BitShift applied to a 6-bit column vector.]

Let w be the word size (e.g., 32 or 64 bits). We’ll assume m ≤ w.
NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one.
We define the m-length binary vector U(x) for each character x of the alphabet: U(x) is set to 1 in the positions where x appears in P.

Example: P = abaac
  U(a) = (1,0,1,1,0)   U(b) = (0,1,0,0,0)   U(c) = (0,0,0,0,1)

How to construct M



Initialize column 0 of M to all zeros.
For j > 0, the j-th column is obtained as

  M(j) = BitShift(M(j-1)) & U(T[j])

For i > 1, entry M(i,j) = 1 iff
 (1) the first i-1 characters of P match the i-1 characters of T ending at character j-1  ⇔  M(i-1,j-1) = 1
 (2) P[i] = T[j]  ⇔  the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) into the i-th position; AND-ing it with the i-th bit of U(T[j]) establishes whether both conditions hold (a minimal code sketch follows).
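A bit-parallel sketch of the recurrence above (my own illustration, not the slides' code; integers serve as bit masks, bit i-1 of a mask plays the role of row i, and m ≤ w is assumed):

  def shift_and(T, P):
      m = len(P)
      U = {}                                    # U[c] has bit i-1 set iff P[i] == c
      for i, c in enumerate(P):
          U[c] = U.get(c, 0) | (1 << i)
      M, occ = 0, []                            # column j = 0: all zeros
      for j, c in enumerate(T, start=1):
          M = ((M << 1) | 1) & U.get(c, 0)      # M(j) = BitShift(M(j-1)) & U(T[j])
          if M & (1 << (m - 1)):                # row m is set -> P ends at position j
              occ.append(j)
      return occ

  print(shift_and("xabxabaaca", "abaac"))       # -> [9], as in the worked example below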

An example: T = xabxabaaca, P = abaac

[Figure: the columns j = 1, 2, 3, …, 9 of M, each computed as M(j) = BitShift(M(j-1)) & U(T[j]) using U(a) = (1,0,1,1,0), U(b) = (0,1,0,0,0), U(c) = (0,0,0,0,1), U(x) = (0,0,0,0,0). At j = 9 the 5-th (last) entry of the column becomes 1, so an occurrence of P ends at position 9.]

Shift-And method: Complexity








If m ≤ w, any column and any vector U() fit in a memory word
  each step requires O(1) time.
If m > w, any column and any vector U() can be divided into ⌈m/w⌉ memory words
  each step requires O(m/w) time.
Overall O(n(1 + m/w) + m) time.
Thus it is very fast when the pattern length is close to the word size
  very often the case in practice: recall that w = 64 bits in modern architectures.

Some simple extensions


We want to allow the pattern to contain special symbols, like the character class [a-f].

Example: P = [a-b]baac
  U(a) = (1,0,1,1,0)   U(b) = (1,1,0,0,0)   U(c) = (0,0,0,0,1)

What about ‘?’ and ‘[^…]’ (negation)?

Problem 1: Another solution
Dictionary = {bzip, not, or, space};  P = bzip = 1a 0b;  S = “bzip or not bzip”

[Figure: the same search carried out with the Shift-And method over the compressed text C(S).]

Speed ≈ Compression ratio

Problem 2
Dictionary = {bzip, not, or, space};  S = “bzip or not bzip”;  P = o

Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring.

[Figure: both or = 1g 0a 0b and not = 1g 0g 0a contain P = o, so both codewords must be searched for in C(S).]

Speed ≈ Compression ratio? No! Why?
One scan of C(S) for each term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: text T with occurrences of the patterns P1 and P2 highlighted]

 Naïve solution
  Use an (optimal) exact-matching algorithm to search for each pattern of P separately
  Complexity: O(nl + m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

 S is the concatenation of the patterns in P.
 R is a bitmap of length m: R[i] = 1 iff S[i] is the first symbol of a pattern.

Use a variant of the Shift-And method searching for S:
 For any symbol c, U’(c) = U(c) AND R, so U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern.
 For any step j:
   compute M(j), then M(j) OR U’(T[j]). Why? This sets to 1 the first bit of each pattern that starts with T[j].
   Check if there are occurrences ending in j. How?

Problem 3
Dictionary = {bzip, not, or, space};  S = “bzip or not bzip”;  P = bot, k = 2

Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring, allowing at most k mismatches.

[Figure: the dictionary terms within distance k of P are identified, and their codewords are then searched for in C(S).]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact occurrences of a pattern in a text.

Example:
  T = aatatccacaa
  P = atcgaa

P appears in T with 2 mismatches starting at position 4; it also occurs with 4 mismatches starting at position 2.

  aatatccacaa          aatatccacaa
     atcgaa              atcgaa

Agrep




Our current goal: given k, find all the occurrences of P in T with up to k mismatches.
We define the matrix M^l to be an m by n binary matrix such that:

  M^l(i,j) = 1 iff there are no more than l mismatches between the first i characters of P and the i characters of T ending at character j.

What is M^0?
How does M^k solve the k-mismatch problem?

Computing M^k

 We compute M^l for all l = 0, …, k.
 For each j we compute M^0(j), M^1(j), …, M^k(j).
 For all l we initialize M^l(0) to the zero vector.
 In order to compute M^l(j), we observe that there is a match iff one of the two cases below holds.

Computing M^l: case 1

The first i-1 characters of P match a substring of T ending at j-1 with at most l mismatches, and the next pair of characters in P and T are equal:

  BitShift(M^l(j-1)) & U(T[j])

Computing M^l: case 2

The first i-1 characters of P match a substring of T ending at j-1 with at most l-1 mismatches (the remaining mismatch pays for position i):

  BitShift(M^{l-1}(j-1))

Computing M^l

We compute M^l for all l = 0, …, k; for each j we compute M^0(j), M^1(j), …, M^k(j); for all l we initialize M^l(0) to the zero vector.

Combining the two cases, M^l(j) is computed as

  M^l(j) = [ BitShift(M^l(j-1)) & U(T(j)) ]  OR  BitShift(M^{l-1}(j-1))
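A hedged sketch of this k-mismatch recurrence (assumptions as before: m ≤ w, integers as bit masks, BitShift = shift-by-one with the first bit set; names are mine):

  def shift_and_k_mismatches(T, P, k):
      m = len(P)
      U = {}
      for i, c in enumerate(P):
          U[c] = U.get(c, 0) | (1 << i)
      M = [0] * (k + 1)                                  # M^l(0) = zero vector for every l
      occ = []
      for j, c in enumerate(T, start=1):
          prev = M[:]                                    # the columns at position j-1
          M[0] = ((prev[0] << 1) | 1) & U.get(c, 0)
          for l in range(1, k + 1):
              M[l] = (((prev[l] << 1) | 1) & U.get(c, 0)) | ((prev[l - 1] << 1) | 1)
          if M[k] & (1 << (m - 1)):                      # P ends at j with <= k mismatches
              occ.append(j)
      return occ

  print(shift_and_k_mismatches("aatatccacaa", "atcgaa", 2))
  # -> [9], i.e. the occurrence with 2 mismatches starting at position 4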

Example: M^1
T = xabxabaaca,  P = abaad

[Figure: the matrices M^0 and M^1 for T = xabxabaaca and P = abaad. M^1(5,9) = 1, i.e. P occurs ending at position 9 with at most one mismatch.]

How much do we pay?





The running time is O(kn(1 + m/w)).
Again, the method is practically efficient for small m.
Only O(k) columns of M are needed at any given time: hence the space used by the algorithm is O(k) memory words (for m ≤ w).

Problem 3: Solution
Dictionary = {bzip, not, or, space};  S = “bzip or not bzip”;  P = bot, k = 2

Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring, allowing k mismatches.

[Figure: the approximate Shift-And search identifies the matching term(s), e.g. not = 1g 0g 0a, whose codewords are then located in C(S).]

Agrep: more sophisticated operations


The Shift-And method can solve other operations as well.

The edit distance between two strings p and s is d(p,s) = the minimum number of operations needed to transform p into s via three ops:
 Insertion: insert a symbol in p
 Deletion: delete a symbol from p
 Substitution: change a symbol in p into a different one
Example: d(ananas, banane) = 3

Search by regular expressions
 Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding
  γ(x) = (Length-1) zeroes, followed by x in binary
  where x > 0 and Length = ⌊log2 x⌋ + 1
  e.g., 9 is represented as <000, 1001>.

  The γ-code for x takes 2⌊log2 x⌋ + 1 bits (i.e. a factor of 2 from optimal)
  It is optimal for Pr(x) = 1/(2x^2), and i.i.d. integers

It is a prefix-free encoding…
Given the following sequence of γ-coded integers, reconstruct the original sequence:

  0001000 00110 011 00000111011 00111   →   8, 6, 3, 59, 7
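A small sketch of γ-encoding/decoding as just defined (function names are mine):

  def gamma_encode(x):
      # (Length-1) zeroes followed by x in binary, x > 0
      b = bin(x)[2:]
      return "0" * (len(b) - 1) + b

  def gamma_decode(bits):
      out, i = [], 0
      while i < len(bits):
          z = 0
          while bits[i] == "0":                  # count the unary prefix of zeroes
              z += 1
              i += 1
          out.append(int(bits[i:i + z + 1], 2))  # the next z+1 bits are x in binary
          i += z + 1
      return out

  print(gamma_encode(9))                                        # -> 0001001
  print(gamma_decode("0001000001100110000011101100111"))        # -> [8, 6, 3, 59, 7]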

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).
Recall that |γ(i)| ≤ 2 log i + 1.
How good is this approach w.r.t. Huffman?
  Compression ratio ≤ 2 H0(S) + 1
Key fact:  1 ≥ Σ_{i=1,...,x} pi ≥ x · px  ⇒  x ≤ 1/px

How good is it?
Encode the integers via γ-coding: |γ(i)| ≤ 2 log i + 1.
The cost of the encoding is (recall i ≤ 1/pi):

  Σ_{i=1,...,|S|} pi · |γ(i)|  ≤  Σ_{i=1,...,|S|} pi · [2 log(1/pi) + 1]  =  2 H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman
 128-ary Huffman tree
 The first bit of the first byte is tagged
 Configurations on 7 bits: just those of Huffman

End-tagged dense code
 The rank r is mapped to the r-th binary sequence on 7·k bits
 The first bit of the last byte is tagged

A better encoding
Surprising changes
 It is a prefix-code
 Better compression: it uses all the 7-bit configurations

(s,c)-dense codes
The distribution of words is skewed: 1/i^q, where 1 < q < 2

A new concept: Continuers vs. Stoppers
 Previously we used s = c = 128
The main idea is:
 s + c = 256 (we are playing with 8 bits)
 Thus s items are encoded with 1 byte,
 s·c with 2 bytes, s·c^2 with 3 bytes, ...

An example
 5000 distinct words
 ETDC encodes 128 + 128^2 = 16512 words within 2 bytes
 A (230,26)-dense code encodes 230 + 230·26 = 6210 words within 2 bytes, hence more words on 1 byte, and thus it wins if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 256 - s.
 Brute-force approach
 Binary search: on real distributions there seems to be one unique minimum
   Ks = max codeword length
   Fs^k = cumulative probability of the symbols whose |cw| ≤ k

Experiments: (s,c)-DC is very interesting…
 Search is 6% faster than byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms…. Can we do everything in one pass?

 Move-to-Front (MTF):
   as a frequency-sorting approximator
   as a caching strategy
   as a compressor

 Run-Length Encoding (RLE):
   FAX compression

Move-to-Front Coding
Transforms a char sequence into an integer sequence, which can then be var-length coded.
 Start with the list of symbols L = [a,b,c,d,…]
 For each input symbol s:
   1) output the position of s in L
   2) move s to the front of L
 (a minimal sketch follows below)
There is a memory.

Properties:
 Exploits temporal locality, and it is dynamic
 X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n^2 log n), MTF = O(n log n) + n^2
 Not much worse than Huffman ... but it may be far better
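A minimal sketch of the MTF step just listed (0-based positions; names are mine):

  def mtf_encode(text, alphabet):
      L = list(alphabet)
      out = []
      for s in text:
          i = L.index(s)            # position of s in the current list
          out.append(i)
          L.pop(i)
          L.insert(0, s)            # move s to the front: recent symbols get small codes
      return out

  print(mtf_encode("abbbaacccca", "abcd"))   # -> [0, 1, 0, 0, 1, 0, 2, 0, 0, 0, 1]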

MTF: how good is it?
Encode the integers via γ-coding: |γ(i)| ≤ 2 log i + 1.
Put S at the front and consider the cost of encoding a text X of length N in which symbol x occurs n_x times, at positions p_1^x < p_2^x < ... :

  O(|S| log |S|) + Σ_{x=1}^{|S|} Σ_{i=2}^{n_x} |γ(p_i^x - p_{i-1}^x)|

By Jensen’s inequality this is

  ≤ O(|S| log |S|) + Σ_{x=1}^{|S|} n_x · [2 log(N/n_x) + 1]  =  O(|S| log |S|) + N · [2 H0(X) + 1]

Hence La[mtf] ≤ 2 H0(X) + O(1).

MTF: higher compression
To achieve higher compression we consider words (and separators) as the symbols to be encoded.
How to keep the MTF-list efficiently:
 Search tree
   leaves contain the symbols, ordered as in the MTF-list
   nodes contain the size of their descending subtree
 Hash table
   key is a symbol
   data is a pointer to the corresponding tree leaf

Each tree operation takes O(log |S|) time.
Total is O(n log |S|), where n = #symbols to be compressed.

Run-Length Encoding (RLE)
If spatial locality is very high, then
  abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In the case of binary strings  just the run lengths and one initial bit.
There is a memory.

Properties:
 Exploits spatial locality, and it is a dynamic code
 X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = Θ(n^2 log n) > Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive):

  f(i) = Σ_{j=1}^{i-1} p(j)

e.g. p(a) = .2, p(b) = .5, p(c) = .3  ⇒  f(a) = .0, f(b) = .2, f(c) = .7

[Figure: the unit interval [0,1) split into a = [0,.2), b = [.2,.7), c = [.7,1)]

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7)).

Arithmetic Coding: Encoding Example
Coding the message sequence: bac

[Figure: start from [0,1); after b the interval is [.2,.7); after a it is [.2,.3); after c it is [.27,.3)]

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c_1 c_2 ... c_n with probabilities p[c], use the following:

  l_0 = 0     l_i = l_{i-1} + s_{i-1} · f[c_i]
  s_0 = 1     s_i = s_{i-1} · p[c_i]

f[c] is the cumulative probability up to symbol c (not included).
The final interval size is

  s_n = Π_{i=1}^{n} p[c_i]

The interval for a message sequence will be called the sequence interval (a small code sketch follows).
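A small sketch of the recurrences above (floating-point version, just for illustration; the integer scaling of the later slides is what one would use in practice):

  def sequence_interval(msg, p):
      # Computes the sequence interval [l, l+s) via l_i, s_i as defined above
      f, acc = {}, 0.0
      for c in sorted(p):              # cumulative probabilities f[c]
          f[c] = acc
          acc += p[c]
      l, s = 0.0, 1.0
      for c in msg:
          l = l + s * f[c]
          s = s * p[c]
      return l, s

  l, s = sequence_interval("bac", {'a': .2, 'b': .5, 'c': .3})
  print(l, l + s)                      # -> 0.27 0.3 (up to float rounding)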

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:

[Figure: .49 falls in [.2,.7) → b; within [.2,.7), .49 falls in the b-subinterval [.3,.55) → b; within [.3,.55), .49 falls in the c-subinterval [.475,.55) → c]

The message is bbc.

Representing a real number
Binary fractional representation:
  .75 = .11       1/3 = .0101...       11/16 = .1011

Algorithm (emitting the binary expansion of x ∈ [0,1)):
  1. x = 2·x
  2. if x < 1 output 0
  3. else x = x - 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
  e.g. [0,.33) → .01      [.33,.66) → .1      [.66,1) → .11

Representing a code interval
We can view binary fractional numbers as intervals by considering all their completions:

  code    min       max       interval
  .11     .110...   .111...   [.75, 1.0)
  .101    .1010...  .1011...  [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval is contained in the sequence interval (a dyadic interval).

  Sequence interval: [.61, .79)     Code interval (.101): [.625, .75)

Can use L + s/2 truncated to 1 + ⌈log2 (1/s)⌉ bits.

Bound on Arithmetic length
Note that -log2 s + 1 = log2 (2/s)

Bound on Length
Theorem: For a text of n symbols, the Arithmetic encoder generates at most

  1 + ⌈log2 (1/s)⌉ = 1 + ⌈log2 Π_{i=1}^{n} (1/p_i)⌉
                   ≤ 2 + Σ_{i=1}^{n} log2 (1/p_i)
                   = 2 + Σ_{k=1}^{|S|} n·p_k · log2 (1/p_k)
                   = 2 + n·H0   bits

In practice nH0 + 0.02·n bits, because of rounding.

Integer Arithmetic Coding
Problem: operations on arbitrary-precision real numbers are expensive.
Key ideas of the integer version:
 Keep integers in the range [0..R), where R = 2^k
 Use rounding to generate integer intervals
 Whenever the sequence interval falls into the top, bottom or middle half, expand the interval by a factor 2

Integer Arithmetic is an approximation.

Integer Arithmetic (scaling)
 If l ≥ R/2 (top half): output 1 followed by m 0s; m = 0; the interval is expanded by 2
 If u < R/2 (bottom half): output 0 followed by m 1s; m = 0; the interval is expanded by 2
 If l ≥ R/4 and u < 3R/4 (middle half): increment m; the interval is expanded by 2
 In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine: given the current interval (L,s), the next symbol c and the distribution (p1,....,p|S|), the ATB maps (L,s) into the new interval (L’,s’).

[Figure: the ATB as a state machine, (L,s) → (L’,s’)]

Therefore, even the distribution can change over time.

K-th order models: PPM
Use the previous k characters as the context.
 Makes use of conditional probabilities
 This is the changing distribution

Base probabilities on counts:
 e.g. if th has been seen 12 times, followed by e 7 times, then the conditional probability p(e|th) = 7/12.

Need to keep k small so that the dictionary does not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox

[Figure: at each step the ATB is fed the conditional distribution p[ s | context ], where s is either the next character c or the escape symbol esc, and refines the interval (L,s) into (L’,s’).]

Encoder and Decoder must know the protocol for selecting the same conditional probability distribution (PPM-variant).

PPM: Example Contexts
String = ACCBACCACBA, next symbol B,  k = 2   ($ denotes the escape symbol)

  Context: Empty    A = 4,  B = 2,  C = 5,  $ = 3

  Context: A        C = 3,  $ = 1
  Context: B        A = 2,  $ = 1
  Context: C        A = 1,  B = 2,  C = 2,  $ = 3

  Context: AC       B = 1,  C = 2,  $ = 2
  Context: BA       C = 1,  $ = 1
  Context: CA       C = 1,  $ = 1
  Context: CB       A = 2,  $ = 1
  Context: CC       A = 1,  B = 1,  $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings. The differences among the algorithms are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit frequency estimation.

LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) as n → ∞ !!

LZ77
  a a c a a c a b c a b a b a c

[Figure: the Dictionary (???) lies before the Cursor, from which all candidate substrings start; the step shown outputs <2,3,c>]

Algorithm’s step: output <d, len, c>, where
  d   = distance of the copied string w.r.t. the current position
  len = length of the longest match
  c   = next char in the text beyond the longest match
Then advance by len + 1.

A buffer “window” has fixed length and moves over the text.

Example: LZ77 with window (window size = 6; at each step, the longest match within W and the next character are emitted)

  a a c a a c a b c a b a a a c      (0,0,a)
  a a c a a c a b c a b a a a c      (1,1,c)
  a a c a a c a b c a b a a a c      (3,4,b)
  a a c a a c a b c a b a a a c      (3,3,a)
  a a c a a c a b c a b a a a c      (1,2,c)

LZ77 Decoding
The decoder keeps the same dictionary window as the encoder.
 It finds the substring and inserts a copy of it.

What if len > d? (overlap with the text being decompressed)
 E.g. seen = abcd, next codeword is (2,9,e)
 Simply copy starting at the cursor:
   for (i = 0; i < len; i++)
     out[cursor+i] = out[cursor-d+i];
 Output is correct: abcdcdcdcdcdce
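A hedged decoder sketch, handling the overlapping-copy case exactly as described (names are mine):

  def lz77_decode(triples):
      out = []
      for d, length, c in triples:
          start = len(out) - d
          for i in range(length):
              out.append(out[start + i])   # when length > d this re-reads chars just written
          out.append(c)
      return "".join(out)

  print(lz77_decode([(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')]))
  # -> "aacaacabcabaaac", the text of the window example above
  print(lz77_decode([(0,0,'a'), (0,0,'b'), (0,0,'c'), (0,0,'d'), (2,9,'e')]))
  # -> "abcdcdcdcdcdce", the len > d case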

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example (input: a a b a a c a b c a b c b)

  parsed    output    dictionary
  a         (0,a)     1 = a
  ab        (1,b)     2 = ab
  aa        (1,a)     3 = aa
  c         (0,c)     4 = c
  abc       (2,c)     5 = abc
  abcb      (5,b)     6 = abcb

LZ78: Decoding Example

  input     decoded so far               dictionary
  (0,a)     a                            1 = a
  (1,b)     a ab                         2 = ab
  (1,a)     a ab aa                      3 = aa
  (0,c)     a ab aa c                    4 = c
  (2,c)     a ab aa c abc                5 = abc
  (5,b)     a ab aa c abc abcb           6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send the extra character c, but still add Sc to the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
The decoder is one step behind the coder since it does not know c
 There is an issue for strings of the form SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example (input: a a b a a c a b a b a c b)

  parsed    output    added to dict.
  a         112       256 = aa
  a         112       257 = ab
  b         113       258 = ba
  aa        256       259 = aac
  c         114       260 = ca
  ab        257       261 = aba
  aba       261       262 = abac
  c         114       263 = cb

LZW: Decoding Example

  input     decoded so far            added to dict.
  112       a
  112       a a                       256 = aa
  113       a a b                     257 = ab
  256       a a b aa                  258 = ba
  114       a a b aa c                259 = aac
  257       a a b aa c ab ?           260 = ca
  261       a a b aa c ab aba         261 = aba    (one step later: 261 was not yet defined,
                                                    the special SSc case → previous string ab + its first char a)
  114       a a b aa c ab aba c       262 = abac
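A minimal LZW encoder sketch following the rule above (real ASCII codes are used, so a = 97 rather than the 112 of the example; names are mine):

  def lzw_encode(text):
      dictionary = {chr(i): i for i in range(256)}     # 256 initial single-char entries
      next_id, S, out = 256, "", []
      for c in text:
          if S + c in dictionary:
              S = S + c                                # keep extending the current match
          else:
              out.append(dictionary[S])                # emit the id of the longest match
              dictionary[S + c] = next_id              # the extra char is not sent, only stored
              next_id += 1
              S = c
      if S:
          out.append(dictionary[S])
      return out

  print(lzw_encode("aabaacababacb"))
  # -> [97, 97, 98, 256, 99, 257, 261, 99, 98], mirroring the encoding example above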

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a certain size (used in GIF)
 Throw the dictionary away when it is no longer effective at compressing (e.g. compress)
 Throw the least-recently-used (LRU) entry away when it reaches a certain size (used in BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform (1994)
Let us be given a text T = mississippi#

All rotations of T:          Sorted rows (F = first column, L = last column):
mississippi#                   F                    L
ississippi#m                   # mississipp i
ssissippi#mi                   i #mississip p
sissippi#mis                   i ppi#missis s
issippi#miss                   i ssippi#mis s
ssippi#missi                   i ssissippi# m
sippi#missis                   m ississippi #
ippi#mississ                   p i#mississi p
ppi#mississi                   p pi#mississ i
pi#mississip                   s ippi#missi s
i#mississipp                   s issippi#mi s
#mississippi                   s sippi#miss i
                               s sissippi#m i

L = BWT(T) = ipssm#pissii

A famous example: [Figure: the same transform applied to a much longer text]

A useful tool: the L → F mapping
[Figure: the sorted rotation matrix again, with the columns F and L highlighted]

How do we map L’s chars onto F’s chars?
... we need to distinguish equal chars in F...
Take two equal chars of L: rotate their rows rightward by one position; they keep the same relative order !!

The BWT is invertible

[Figure: the sorted rotation matrix with columns F and L]

Two key properties:
1. The LF-array maps L’s chars to F’s chars
2. L[ i ] precedes F[ i ] in T

Reconstruct T backward:   T = .... i p p i #

InvertBWT(L)
  Compute LF[0,n-1];
  r = 0; i = n;
  while (i > 0) {
    T[i] = L[r];
    r = LF[r]; i--;
  }
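A naive sketch of the transform and of its LF-based inversion as just described ('#' is assumed to be the unique, smallest end-marker; the quadratic construction and the linear lookups are fine only for illustration):

  def bwt(T):
      n = len(T)
      rotations = sorted(T[i:] + T[:i] for i in range(n))
      return "".join(row[-1] for row in rotations)          # L = last column

  def inverse_bwt(L):
      F = sorted(L)
      count, LF = {}, []
      for c in L:                                           # the k-th c in L maps to the k-th c in F
          count[c] = count.get(c, 0) + 1
          LF.append(F.index(c) + count[c] - 1)
      out, r = [], 0                                        # row 0 is the rotation starting with '#'
      for _ in range(len(L)):
          out.append(L[r])                                  # L[r] precedes F[r] in T
          r = LF[r]
      s = "".join(reversed(out))                            # = T rotated so that '#' comes first
      return s[1:] + s[0]

  L = bwt("mississippi#")
  print(L)                    # -> ipssm#pissii
  print(inverse_bwt(L))       # -> mississippi#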

How to compute the BWT?

  SA    BWT matrix (sorted rotations)    L
  12    #mississipp                      i
  11    i#mississip                      p
   8    ippi#missis                      s
   5    issippi#mis                      s
   2    ississippi#                      m
   1    mississippi                      #
  10    pi#mississi                      p
   9    ppi#mississ                      i
   7    sippi#missi                      s
   4    sissippi#mi                      s
   6    ssippi#miss                      i
   3    ssissippi#m                      i

We said that L[i] precedes F[i] in T; for example, L[3] = T[7].
Given SA and T, we have L[i] = T[SA[i] - 1].

How to construct SA from T?
Input: T = mississippi#

  SA    suffix
  12    #
  11    i#
   8    ippi#
   5    issippi#
   2    ississippi#
   1    mississippi#
  10    pi#
   9    ppi#
   7    sippi#
   4    sissippi#
   6    ssippi#
   3    ssissippi#

Elegant but inefficient: sorting the suffixes directly has obvious inefficiencies
 • Θ(n^2 log n) time in the worst case
 • Θ(n log n) cache misses or I/O faults

Many algorithms exist, now...

Compressing L seems promising...
Key observation: L is locally homogeneous  L is highly compressible

Algorithm Bzip:
 Move-to-Front coding of L
 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression!

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii      (# at position 16)
MTF-list = [i,m,p,s]

MTF  = 020030000030030200300300000100000
MTF’ = 030040000040040300400400000200000      (Bin(6)=110, Wheeler’s code)
RLE0 = 03141041403141410210

Alphabet of size |S|+1
Bzip2 output = Arithmetic/Huffman on |S|+1 symbols...
... plus γ(16), plus the original MTF-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics
 Size
   1 trillion pages reported available (Google, 7/08)
   5-40K per page => hundreds of terabytes
   Size grows every day!!
 Change
   8% new pages and 25% new links change weekly
   Page lifetime of about 10 days

The Bow Tie

Some definitions
 Weakly connected component (WCC): a set of nodes such that from any node you can reach any other node via an undirected path.
 Strongly connected component (SCC): a set of nodes such that from any node you can reach any other node via a directed path.

Observing the Web graph
 We do not know which percentage of it we know
 The only way to discover the graph structure of the web as hypertext is via large-scale crawls
 Warning: the picture might be distorted by
   size limitations of the crawl
   crawling rules
   perturbations of the "natural" process of birth and death of nodes and links

Why is it interesting?
 Largest artifact ever conceived by humans
 Exploit the structure of the Web for
   crawl strategies
   search
   spam detection
   discovering communities on the web
   classification/organization
 Predict the evolution of the Web
   sociological understanding

Many other large graphs…
 Physical network graph: V = routers, E = communication links
 The “cosine” graph (undirected, weighted): V = static web pages, E = semantic distance between pages
 Query-Log graph (bipartite, weighted): V = queries and URLs, E = (q,u) if u is a result for q and has been clicked by some user who issued q
 Social graph (undirected, unweighted): V = users, E = (x,y) if x knows y (facebook, address book, email, ...)

Definition
Directed graph G = (V,E):
 V = URLs, E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN & no OUT)
Three key properties; the first:
 Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution
 Altavista crawl (1999) and WebBase crawl (2001): the indegree follows a power-law distribution

  Pr[ in-degree(u) = k ] ∝ 1/k^a,   a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E):
 V = URLs, E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN, no OUT)
Three key properties:
 Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1
 Locality: usually most of the hyperlinks point to other URLs on the same host (about 80%).
 Similarity: pages close to each other in lexicographic (URL) order tend to share many outgoing links

A Picture of the Web Graph
[Figure: adjacency matrix of a web graph (21 million pages, 150 million links); after URL-sorting, hosts such as Berkeley and Stanford cluster along the diagonal]
URL compression + Delta encoding

The WebGraph library
 Uncompressed adjacency list  →  adjacency list with compressed gaps (exploits locality)
   Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}
   For negative entries: …

Copy-lists (exploit similarity)
 Uncompressed adjacency list  →  adjacency list with copy lists
   Each bit of y’s copy list tells whether the corresponding successor of the reference list x is also a successor of y;
   the reference index is chosen in [0,W] so as to give the best compression (reference chains possibly limited).

Copy-blocks = RLE(Copy-list)
 Adjacency list with copy lists  →  adjacency list with copy blocks (RLE on the bit sequences)
   The first copy block is 0 if the copy list starts with 0;
   the last block is omitted (we know the length…);
   the length is decremented by one for all blocks.

This is a Java and C++ lib (≈ 3 bits/edge)

Extra-nodes: Compressing Intervals
 Adjacency list with copy blocks  →  exploit consecutivity in the extra-nodes
 Intervals: use their left extreme and length
   interval length: decremented by Lmin = 2
 Residuals: differences between residuals, or w.r.t. the source, e.g.
   0 = (15-15)·2 (positive)
   2 = (23-19)-2 (jump ≥ 2)
   600 = (316-16)·2
   3 = |13-15|·2-1 (negative)
   3018 = 3041-22-1
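A small sketch of the gap transformation S(x) and of the positive/negative mapping used for the residuals above (the successor values are hypothetical, not taken from the slides):

  def gaps(x, successors):
      # first gap is relative to the source x, later gaps are successive differences minus 1
      s = sorted(successors)
      return [s[0] - x] + [s[i] - s[i-1] - 1 for i in range(1, len(s))]

  def map_signed(v):
      # non-negative mapping for possibly-negative gaps, as in the residuals above:
      # v >= 0 -> 2v (positive), v < 0 -> 2|v| - 1 (negative)
      return 2 * v if v >= 0 else 2 * (-v) - 1

  g = gaps(15, [13, 15, 16, 17, 18, 19, 23, 24, 203])   # hypothetical successor list of node 15
  print(g)                          # -> [-2, 1, 0, 0, 0, 0, 3, 0, 178]
  print([map_signed(v) for v in g]) # -> [3, 2, 0, 0, 0, 0, 6, 0, 356]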

Algoritmi per IR

Compression of file collections

Background
[Figure: a sender transmits data to a receiver over a network; the receiver already has some knowledge about the data]

 network links are getting faster and faster, but
 many clients are still connected by fairly slow links (mobile?)
 people wish to send more and more data

How can we make this transparent to the user?

Two standard techniques
 caching: “avoid sending the same object again”
   done on the basis of whole objects
   only works if objects are completely unchanged
   How about objects that are slightly changed?
 compression: “remove redundancy in transmitted data”
   avoid repeated substrings in the data (overhead)
   can be extended to the history of past transmissions
   How if the sender has never seen the data at the receiver?

Types of Techniques
 Common knowledge between sender & receiver
   Unstructured file: delta compression
 “Partial” knowledge
   Unstructured files: file synchronization
   Record-based data: set reconciliation

Formalization
 Delta compression   [diff, zdelta, REBL, …]
   Compress file f deploying file f’
   Compress a group of files
   Speed up web access by sending the difference between the requested page and the ones available in cache
 File synchronization   [rsync, zsync]
   Client updates its old file f_old with f_new available on a server
   Mirroring, shared crawling, content distribution networks
 Set reconciliation
   Client updates a structured old file f_old with f_new available on a server
   Update of contacts or appointments, intersecting inverted lists in a P2P search engine

Z-delta compression (one-to-one)

Problem: we have two files f_known and f_new and the goal is to compute a file f_d of minimum size such that f_new can be derived from f_known and f_d.
 Assume that block moves and copies are allowed
 Find an optimal covering set of f_new based on f_known
 The LZ77 scheme provides an efficient, optimal solution:
   f_known is the “previously encoded text”: compress the concatenation f_known f_new, starting from f_new
 zdelta is one of the best implementations

          Emacs size   Emacs time
uncompr   27Mb         ---
gzip       8Mb         35 secs
zdelta    1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: a pair of proxies located on each side of the slow link use a proprietary protocol to increase performance over this link.

[Figure: Client ↔ (slow link, delta-encoding) ↔ Proxy ↔ (fast link, request/page) ↔ web; both ends keep the reference version of the page]

Use zdelta to reduce traffic:
 the old version is available at both proxies
 restricted to pages already visited (30% hits), URL-prefix match
 small cache

Cluster-based delta compression
Problem: we wish to compress a group of files F
 Useful on a dynamic collection of web pages, back-ups, …
 Apply pairwise zdelta: find for each f ∈ F a good reference
 Reduction to the Min Branching problem on DAGs:
   build a weighted graph G_F: nodes = files, weights = zdelta-sizes
   insert a dummy node connected to all files, whose edge weights are the gzip-coding sizes
   compute the min branching = directed spanning tree of minimum total cost, covering G’s nodes

[Figure: example graph with a dummy node (0) and file nodes; edge weights such as 20, 123, 220, 620, 2000 are the candidate compressed sizes]

          space   time
uncompr   30Mb    ---
tgz       20%     linear
THIS       8%     quadratic

Improvement (what about many-to-one compression of a group of files?)
Problem: constructing G is very costly, n^2 edge calculations (zdelta executions)
 We wish to exploit some pruning approach
   Collection analysis: cluster the files that appear similar and thus are good candidates for zdelta-compression; build a sparse weighted graph G’_F containing only edges between those pairs of files
   Assign weights: estimate appropriate edge weights for G’_F, thus saving zdelta executions. Nonetheless, strictly n^2 time

          space    time
uncompr   260Mb    ---
tgz       12%      2 mins
THIS       8%      16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
[Figure: the Client holds f_old and sends a request to the Server, which holds f_new and sends back an update]

 the client wants to update an out-dated file
 the server has the new file but does not know the old file
 update without sending the entire f_new (using similarity)
 rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch, since the server has both copies of the files.

The rsync algorithm
[Figure: the Client splits f_old into blocks and sends their hashes; the Server, holding f_new, sends back an encoded file made of block references and literals]

The rsync algorithm (contd)
 simple, widely used, single roundtrip
 optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
 choice of block size problematic (default: max{700, √n} bytes)
 not good in theory: the granularity of changes may disrupt the use of blocks

Rsync: some experiments

            gcc size   emacs size
  total     27288      27326
  gzip       7563       8577
  zdelta      227       1431
  rsync       964       4452

Compressed size in KB (slightly outdated numbers).
Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync
 The server sends hashes (unlike the client in rsync); the client checks them.
 The server deploys the common f_ref to compress the new f_tar (rsync compresses just it).

A multi-round protocol
 k blocks of n/k elements, log(n/k) levels
 If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
 The communication complexity is O(k lg n lg(n/k)) bits.

Next lecture

Set reconciliation
Problem: given two sets S_A and S_B of integer values located on two machines A and B, determine the difference between the two sets at one or both of the machines.

Requirements: the cost should be proportional to the size k of the difference, where k may or may not be known in advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]
 not perfectly true but...
 recurring minimum for improving the estimate + 2 SBF

A multi-round protocol
 k blocks of n/k elements, log(n/k) levels
 If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
 The communication complexity is O(k lg n lg(n/k)) bits.

Algoritmi per IR

Text Indexing

What do we mean by “Indexing”?
 Word-based indexes, here a notion of “word” must be devised!
   » Inverted files, Signature files, Bitmaps.
 Full-text indexes, no constraint on text and queries!
   » Suffix Array, Suffix tree, String B-tree, ...

How do we solve Prefix Search?
 A trie!
 An array of string pointers!
What about Substring Search?

Basic notation and facts
Pattern P occurs at position i of T
  iff
P is a prefix of the i-th suffix of T (i.e. T[i,N])

[Figure: P aligned at position i of T, covering a prefix of the suffix T[i,N]]

Occurrences of P in T = all suffixes of T having P as a prefix
Example: P = si, T = mississippi  →  occurrences at 4, 7

SUF(T) = sorted set of suffixes of T

Reduction: from substring search to prefix search (over the suffixes of T).

The Suffix Tree

[Figure: the suffix tree of T# = mississippi#, with edge labels such as #, i, s, si, ssi, p, pi#, ppi#, i#, mississippi#, …, and leaves storing the starting positions 1..12 of the suffixes]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

Storing SUF(T) explicitly would take Θ(N^2) space; we store the array SA of suffix pointers instead.

  SA    SUF(T)
  12    #
  11    i#
   8    ippi#
   5    issippi#
   2    ississippi#
   1    mississippi#
  10    pi#
   9    ppi#
   7    sippi#
   4    sissippi#
   6    ssippi#
   3    ssissippi#

T = mississippi#,   P = si

Suffix Array
 • SA: Θ(N log2 N) bits
 • Text T: N chars
  In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step.

[Figure: binary search over SA for P = si on T = mississippi#; at each step the middle suffix is compared with P and the search continues in the half where P is larger or smaller]

Suffix Array search
 • O(log2 N) binary-search steps
 • Each step takes O(p) char comparisons
  overall, O(p log2 N) time
 • Improvable to O(p + log2 N) [Manber-Myers, ’90], and to O(p + log2 |S|) [Cole et al, ’06]
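A hedged sketch of the indirect binary search over SA (0-based SA internally, 1-based positions reported; the naive construction mirrors the “elegant but inefficient” approach above):

  def sa_search(T, SA, P):
      lo, hi = 0, len(SA)
      while lo < hi:                                   # leftmost suffix >= P
          mid = (lo + hi) // 2
          if T[SA[mid]:] < P: lo = mid + 1
          else: hi = mid
      first = lo
      lo, hi = first, len(SA)
      while lo < hi:                                   # leftmost suffix whose prefix > P
          mid = (lo + hi) // 2
          if T[SA[mid]:SA[mid] + len(P)] <= P: lo = mid + 1
          else: hi = mid
      return sorted(SA[i] + 1 for i in range(first, lo))

  T = "mississippi#"
  SA = sorted(range(len(T)), key=lambda i: T[i:])      # naive O(n^2 log n) construction
  print(sa_search(T, SA, "si"))                        # -> [4, 7]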

Locating the occurrences
[Figure: the contiguous portion of SA whose suffixes are prefixed by P = si (occ = 2), delimited by the lexicographic positions of si followed by the smallest and by the largest symbol; the occurrences are at text positions 4 and 7]

Suffix Array search
 • O(p + log2 N + occ) time

Suffix Trays: O(p + log2 |S| + occ)   [Cole et al., ’06]
String B-tree   [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays   [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest common prefix between suffixes adjacent in SA.

  SA:   12 11  8  5  2  1 10  9  7  4  6  3
  Lcp:     0  1  1  4  0  0  1  0  2  1  3

T = mississippi#   (e.g. the Lcp between issippi# and ississippi# is 4)

 • How long is the common prefix between T[i,...] and T[j,...]?
   It is the minimum of the subarray Lcp[h,k-1] such that SA[h]=i and SA[k]=j.
 • Does there exist a repeated substring of length ≥ L?
   Search for some Lcp[i] ≥ L.
 • Does there exist a substring of length ≥ L occurring ≥ C times?
   Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.

Slide 189

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k  ∞ !!

In practice, we have:
 the model takes |S|^k * (k * log |S|) + h2 bits (where h might be |S|)
 it is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the Huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged

Tagging: the first bit of every byte says whether that byte starts a codeword;
the remaining 7 bits carry the Huffman information.

Example (figure omitted): T = “bzip or not bzip”, dictionary words
bzip, or, not, space; each word gets a byte-aligned codeword, e.g.
bzip  1a 0b (two bytes, only the first one tagged), and C(T) is the
concatenation of the codewords.

CGrep and other ideas...
Search for P = bzip directly in the compressed text: encode P with the same
word-based tagged Huffman (P = bzip = 1a 0b) and run GREP over C(T),
T = “bzip or not bzip” (figure omitted). The tag bit marks the first byte of
every codeword, so each candidate match is checked yes/no on codeword
boundaries and cannot be a spurious match inside another codeword.

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary = { bzip, not, or, space }

Given the compressed text C(S) of S = “bzip or not bzip” (word-based tagged
Huffman, as above), find all the occurrences of a dictionary word P in S:
encode P with the dictionary (P = bzip = 1a 0b) and scan C(S) for its codeword,
answering yes/no at each codeword boundary (figure omitted).

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
(figure omitted: T and P drawn as character arrays)
 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching

 We show methods in which arithmetic and bit-operations replace comparisons
 We will survey two examples of such methods:
  The Random Fingerprint method due to Karp and Rabin
  The Shift-And method due to Baeza-Yates and Gonnet

Rabin-Karp Fingerprint

We will use a class of functions from strings to integers in order to obtain:
 An efficient randomized algorithm that makes an error with small probability.
 A randomized algorithm that never errs, whose running time is efficient with high probability.

We will consider a binary alphabet (i.e., T ∈ {0,1}n).

Arithmetic replaces Comparisons

Strings are also numbers, H: strings → numbers.
Let s be a string of length m:

  H(s) = Σi=1..m 2^(m-i) · s[i]

P=0101
H(P) = 2^3·0 + 2^2·1 + 2^1·0 + 2^0·1 = 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m-length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

  H(Tr) = 2·H(Tr-1) - 2^m·T[r-1] + T[r+m-1]

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H(T1) = H(1011) = 11
H(T2) = H(0110) = 2·11 - 2^4·1 + 0 = 22 - 16 = 6

Arithmetic replaces Comparisons

A simple efficient algorithm:
 Compute H(P) and H(T1)
 Run over T: compute H(Tr) from H(Tr-1) in constant time,
  and make the comparison H(P) = H(Tr).

Total running time O(n+m)?
 NO! Why? When m is large, it is unreasonable to assume that each arithmetic
 operation can be done in O(1) time: values of H() are m-bit numbers, in
 general too BIG to fit in a machine word.

IDEA! Let’s use modular arithmetic:
for some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally (Horner’s rule, reducing mod 7 at each step):
  1·2 + 0 = 2
  2·2 + 1 = 5
  5·2 + 1 = 11 = 4 (mod 7)
  4·2 + 1 = 9 = 2 (mod 7)
  2·2 + 1 = 5  = Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1), since
  2^m (mod q) = 2·(2^(m-1) (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint

How about the comparisons?
Arithmetic:
 There is an occurrence of P starting at position r of T if and only if H(P) = H(Tr)

Modular arithmetic:
 If there is an occurrence of P starting at position r of T then Hq(P) = Hq(Tr)
 False match! There are values of q for which the converse is not true
 (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!

Our goal will be to choose a modulus q such that
 q is small enough to keep computations efficient (i.e., Hq() fits in a machine word)
 q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm

 Choose a positive integer I
 Pick a random prime q ≤ I, and compute P’s fingerprint Hq(P).
 For each position r in T, compute Hq(Tr) and test whether it equals Hq(P).
  If the two values are equal, either
   declare a probable match (randomized algorithm), or
   check and declare a definite match (deterministic algorithm)

 Running time: excluding verification, O(n+m).
 The randomized algorithm is correct w.h.p.
 The deterministic algorithm has expected running time O(n+m).

Proof on the board
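A compact sketch of mine of the fingerprint scan over a binary text, as assumed
above; the prime q is a parameter chosen by the caller (here a default Mersenne
prime), and the deterministic variant verifies every candidate.

def karp_rabin(T, P, q=2**31 - 1):
    # T, P: strings over {0,1}; reports the 1-based positions r with
    # Hq(Tr) == Hq(P), verified to rule out false matches.
    n, m = len(T), len(P)
    if m > n:
        return []
    two_m = pow(2, m, q)                      # 2^m mod q, used to drop the leading bit
    hp = ht = 0
    for i in range(m):                        # Horner's rule, mod q
        hp = (2 * hp + int(P[i])) % q
        ht = (2 * ht + int(T[i])) % q
    occ = []
    for r in range(n - m + 1):
        if ht == hp and T[r:r+m] == P:        # verification step (deterministic variant)
            occ.append(r + 1)
        if r + m < n:                         # rolling update to the next window
            ht = (2 * ht - two_m * int(T[r]) + int(T[r + m])) % q
    return occ

# karp_rabin("10110101", "0101") -> [5], as in the example above.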

Problem 1: Solution
Dictionary = { bzip, not, or, space }

Encode P = bzip with its dictionary codeword (1a 0b) and scan C(S),
S = “bzip or not bzip”: thanks to the tag bits every comparison at a codeword
boundary answers yes/no, and the two occurrences of bzip are found (figure omitted).

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

(figure omitted) M is an m x n matrix. For T = california and P = for:
row 1 (“f”) has a 1 only at column 5, row 2 (“fo”) only at column 6,
row 3 (“for”) only at column 7.

How does M solve the exact match problem?
P occurs ending at position j of T iff M(m,j) = 1.

How to construct M

We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one.
Machines can perform bit and arithmetic operations between two words in constant time.
Examples:
 And(A,B) is the bit-wise and between A and B.
 BitShift(A) is the value derived by shifting A’s bits down by one and setting the
  first bit to 1, e.g. BitShift((0,1,1,0,1)) = (1,0,1,1,0).

Let w be the word size (e.g., 32 or 64 bits). We’ll assume m ≤ w.
NOTICE: any column of M fits in a memory word.

How to construct M

We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one.
We define the m-length binary vector U(x) for each character x in the alphabet:
U(x) is set to 1 in the positions where character x appears in P.
Example: P = abaac
  U(a) = (1,0,1,1,0)   U(b) = (0,1,0,0,0)   U(c) = (0,0,0,0,1)

How to construct M

 Initialize column 0 of M to all zeros
 For j > 0, the j-th column is obtained by

  M(j) = BitShift( M(j-1) ) & U( T[j] )

For i > 1, entry M(i,j) = 1 iff
 (1) the first i-1 characters of P match the i-1 characters of T ending at j-1,
     i.e. M(i-1,j-1) = 1
 (2) P[i] = T[j], i.e. the i-th bit of U(T[j]) is 1

BitShift moves bit M(i-1,j-1) into the i-th position;
AND-ing with the i-th bit of U(T[j]) establishes whether both conditions hold.
A working sketch follows.
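A minimal Shift-And sketch of mine (not from the slides), representing each
column of M as a Python integer whose bit i-1 corresponds to row i.

def shift_and(T, P):
    # Returns the 1-based end positions of the exact occurrences of P in T.
    m = len(P)
    U = {}                                   # U[c]: bitmask of the positions of c in P
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M, occ = 0, []
    for j, c in enumerate(T, start=1):
        # BitShift: shift the previous column down by one and set the first bit,
        # then AND with U(T[j])
        M = ((M << 1) | 1) & U.get(c, 0)
        if M & (1 << (m - 1)):               # row m set: P ends at position j
            occ.append(j)
    return occ

# shift_and("xabxabaaca", "abaac") -> [9]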

Examples (figures omitted), for P = abaac, T = xabxabaaca,
U(x) = (0,0,0,0,0), U(a) = (1,0,1,1,0), U(b) = (0,1,0,0,0), U(c) = (0,0,0,0,1):
 j=1: T[1]=x, so column M(1) = BitShift(M(0)) & U(x) = (0,0,0,0,0)
 j=2: T[2]=a, so M(2) = BitShift(M(1)) & U(a) = (1,0,0,0,0)
 j=3: T[3]=b, so M(3) = BitShift(M(2)) & U(b) = (0,1,0,0,0)
 …
 j=9: T[9]=c, so M(9) = BitShift(M(8)) & U(c) = (0,0,0,0,1):
      the 1 in row 5 signals an occurrence of abaac ending at position 9.

Shift-And method: Complexity

 If m ≤ w, any column and any vector U() fit in a memory word:
  each step requires O(1) time.
 If m > w, any column and any vector U() can be divided into m/w memory words:
  each step requires O(m/w) time.
 Overall O(n(1+m/w)+m) time.
 Thus, it is very fast when the pattern length is close to the word size —
  very often the case in practice. Recall that w = 64 bits in modern architectures.

Some simple extensions

We want to allow the pattern to contain special symbols, like the class of chars [a-f].
Example: P = [a-b]baac
  U(a) = (1,0,1,1,0)   U(b) = (1,1,0,0,0)   U(c) = (0,0,0,0,1)

What about ‘?’, ‘[^…]’ (not)?

Problem 1: Another solution
Dictionary = { bzip, not, or, space }

Search the codeword of P = bzip (1a 0b) in C(S), S = “bzip or not bzip”, using
the Shift-And scan just described instead of GREP (figure omitted).

Speed ≈ Compression ratio

Problem 2
Dictionary = { bzip, not, or, space }

Given a pattern P, find all the occurrences in S of all dictionary terms
containing P as a substring. Example (figure omitted): P = o, which occurs
inside “or” and “not”, whose codewords are
  or  = 1g 0a 0b
  not = 1g 0g 0a
S = “bzip or not bzip”, so both terms must be located in C(S).

Speed ≈ Compression ratio? No! Why?
A scan of C(S) is needed for each term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n] (figure omitted).

 Naïve solution
 Use an (optimal) exact matching algorithm to search each pattern of P
 Complexity: O(nl + m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

 S is the concatenation of the patterns in P
 R is a bitmap of length m:
  R[i] = 1 iff S[i] is the first symbol of a pattern
 Use a variant of the Shift-And method searching for S:
  For any symbol c, U’(c) = U(c) AND R
   U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
  For any step j:
   compute M(j), then M(j) OR U’(T[j]). Why?
    it sets to 1 the first bit of each pattern that starts with T[j]
   check if there are occurrences ending in j. How?

Problem 3
Dictionary = { bzip, not, or, space }

Given a pattern P, find all the occurrences in S of all dictionary terms
containing P as a substring, allowing at most k mismatches.
Example (figure omitted): P = bot, k = 2, S = “bzip or not bzip”.

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep

 Our current goal: given k, find all the occurrences of P in T with up to k mismatches.
 We define the matrix Ml to be an m by n binary matrix, such that:
  Ml(i,j) = 1 iff there are no more than l mismatches between the
  first i characters of P and the i characters of T ending at character j.

 What is M0?
 How does Mk solve the k-mismatch problem?

Computing Mk

 We compute Ml for all l = 0, … , k.
 For each j compute M(j), M1(j), … , Mk(j)
 For all l, initialize Ml(0) to the zero vector.
 In order to compute Ml(j), we observe that Ml(i,j) = 1 iff one of two cases holds.

Computing Ml: case 1
The first i-1 characters of P match a substring of T ending at j-1
with at most l mismatches, and the next pair of characters in P and T are equal:

  BitShift( Ml(j-1) ) & U( T[j] )

Computing Ml: case 2
The first i-1 characters of P match a substring of T ending at j-1
with at most l-1 mismatches (position i is then allowed to mismatch):

  BitShift( Ml-1(j-1) )

Putting the two cases together:

  Ml(j) = [ BitShift( Ml(j-1) ) & U(T[j]) ]  OR  BitShift( Ml-1(j-1) )

A sketch follows.
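A small sketch of mine of the k-mismatch recurrence with bitmask columns;
it extends the shift_and sketch given earlier.

def agrep_mismatch(T, P, k):
    # Returns the 1-based end positions of substrings of T matching P
    # with at most k mismatches (Shift-And with errors).
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M = [0] * (k + 1)                 # M[l] = current column of M^l
    occ = []
    for j, c in enumerate(T, start=1):
        prev = M[:]                   # the columns at position j-1
        M[0] = ((prev[0] << 1) | 1) & U.get(c, 0)
        for l in range(1, k + 1):
            # case 1: extend an l-mismatch prefix with a matching character
            # case 2: extend an (l-1)-mismatch prefix, charging a mismatch at position i
            M[l] = (((prev[l] << 1) | 1) & U.get(c, 0)) | ((prev[l - 1] << 1) | 1)
        if M[k] & (1 << (m - 1)):
            occ.append(j)
    return occ

# agrep_mismatch("xabxabaaca", "abaad", 1) -> [9], matching the M1 example below.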

Example M1 (figures omitted)
T = xabxabaaca, P = abaad, k = 1.
M0 (exact matching) has no 1 in its last row: P does not occur exactly in T.
M1 has a 1 in row 5 at column 9: abaad matches T[5,9] = abaac with one mismatch.

How much do we pay?

 The running time is O(kn(1+m/w))
 Again, the method is practically efficient for small m.
 Only O(k) columns of M are needed at any given time; hence the space
 used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary = { bzip, not, or, space }

Given P, find all the occurrences in S of all dictionary terms containing P as a
substring, allowing k mismatches. Example (figure omitted): P = bot, k = 2,
S = “bzip or not bzip”; the term “not” (codeword 1g 0g 0a) is reported.

Agrep: more sophisticated operations

The Shift-And method can solve other ops.

The edit distance between two strings p and s is
d(p,s) = minimum number of operations needed to transform p into s via three ops:
 Insertion: insert a symbol in p
 Deletion: delete a symbol from p
 Substitution: change a symbol of p into a different one

Example: d(ananas,banane) = 3

Search by regular expressions
 Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code (Elias γ) for integer encoding

 g(x) = (Length-1 zeroes) followed by x in binary, with x > 0 and Length = log2 x + 1
 e.g., 9 is represented as <000,1001>.
 The g-code for x takes 2 log2 x + 1 bits (i.e. a factor of 2 from optimal)
 Optimal for Pr(x) = 1/(2x2), and i.i.d. integers

It is a prefix-free encoding…

Given the following sequence of g-coded integers, reconstruct the original sequence:

  0001000001100110000011101100111

  Answer: 8, 6, 3, 59, 7
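A small sketch of mine of g-encoding and decoding, which can be checked against
the exercise above.

def gamma_encode(x):
    # x > 0: |binary(x)|-1 zeroes followed by binary(x)
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":                    # count the leading zeroes
            z += 1
            i += 1
        out.append(int(bits[i:i + z + 1], 2))    # read z+1 bits of binary(x)
        i += z + 1
    return out

# gamma_decode("0001000001100110000011101100111") -> [8, 6, 3, 59, 7]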

Analysis
Sort pi in decreasing order, and encode si via the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(S) + 1
Key fact:
  1 ≥ Σi=1,...,x pi ≥ x * px    x ≤ 1/px

How good is it ?
The cost of the encoding is (recall i ≤ 1/pi):

  Σi=1,..,|S| pi * |g(i)|  ≤  Σi=1,..,|S| pi * [ 2 * log (1/pi) + 1 ]  =  2 * H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding

 Byte-aligned and tagged Huffman
  128-ary Huffman tree
  First bit of the first byte is tagged
  Configurations on 7 bits: just those of Huffman

 End-tagged dense code
  The rank r is mapped to the r-th binary sequence on 7*k bits
  First bit of the last byte is tagged

Surprising changes
 It is a prefix-code
 Better compression: it uses all 7-bit configurations

(s,c)-dense codes
The distribution of words is skewed: 1/iq, where 1 < q < 2

A new concept: Continuers vs Stoppers
 Previously we used s = c = 128
The main idea is:
 s + c = 256 (we are playing with 8 bits)
 Thus s items are encoded with 1 byte
 and s*c with 2 bytes, s*c2 with 3 bytes, ...

An example
 5000 distinct words
 ETDC encodes 128 + 1282 = 16512 words on at most 2 bytes
 (230,26)-dense code encodes 230 + 230*26 = 6210 on at most 2 bytes,
 hence more words fit on 1 byte; thus, if the distribution is skewed, it compresses better...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


 Brute-force approach
 Binary search: on real distributions there seems to be a unique minimum
  Ks = max codeword length
  Fsk = cumulative probability of the symbols whose |cw| ≤ k

Experiments: (s,c)-DC is quite interesting…
 Search is 6% faster than byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?

 Move-to-Front (MTF):
  as a freq-sorting approximator
  as a caching strategy
  as a compressor

 Run-Length-Encoding (RLE):
  FAX compression

Move to Front Coding
Transforms a char sequence into an integer sequence, which can then be var-length coded.

 Start with the list of symbols L=[a,b,c,d,…]
 For each input symbol s
  1) output the position of s in L
  2) move s to the front of L

There is a memory
Properties:
 Exploits temporal locality, and it is dynamic
 X = 1^n 2^n 3^n … n^n    Huff = O(n2 log n) bits, MTF = O(n log n) + n2 bits

Not much worse than Huffman ...but it may be far better
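A tiny sketch of mine of the MTF transform; the output integers would then be
fed to a variable-length coder such as g.

def mtf_encode(text, alphabet):
    # alphabet: initial symbol list L; outputs 1-based positions.
    L, out = list(alphabet), []
    for s in text:
        i = L.index(s)
        out.append(i + 1)
        L.insert(0, L.pop(i))        # move s to the front of L
    return out

# mtf_encode("aabbbc", "abc") -> [1, 1, 2, 1, 1, 3]: repeated symbols cost 1.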

MTF: how good is it ?
Encode the integers via g-coding: |g(i)| ≤ 2 * log i + 1
Put S in front of the sequence; each later occurrence of a symbol x is encoded
with g applied to (at most) the gap from its previous occurrence, so the cost is

  ≤ O(|S| log |S|) + Σx=1..|S| Σi=2..nx g( pxi - pxi-1 )

By Jensen’s inequality:

  ≤ O(|S| log |S|) + Σx=1..|S| nx * [ 2 * log (N/nx) + 1 ]
  = O(|S| log |S|) + N * [ 2 * H0(X) + 1 ]

  La[mtf] ≤ 2 * H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and separators) as the symbols to be encoded.
How to maintain the MTF-list efficiently:
 Search tree
  Leaves contain the symbols, ordered as in the MTF-list
  Nodes contain the size of their descending subtree
 Hash Table
  key is a symbol
  data is a pointer to the corresponding tree leaf

Each tree operation takes O(log |S|) time
Total is O(n log |S|), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just the run lengths and one bit
Properties:
 Exploits spatial locality, and it is a dynamic code; there is a memory
 X = 1^n 2^n 3^n … n^n    Huff(X) = Q(n2 log n) > Rle(X) = Q(n log n)
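A minimal RLE sketch of mine:

from itertools import groupby

def rle_encode(s):
    # "abbbaacccca" -> [('a',1), ('b',3), ('a',2), ('c',4), ('a',1)]
    return [(c, len(list(g))) for c, g in groupby(s)]

def rle_decode(pairs):
    return "".join(c * k for c, k in pairs)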

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive):

  f(i) = Σj=1..i-1 p(j)

e.g. with p(a) = .2, p(b) = .5, p(c) = .3:
  a  [0, .2)    b  [.2, .7)    c  [.7, 1.0)
  i.e. f(a) = .0, f(b) = .2, f(c) = .7

The interval for a particular symbol will be called
the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac (figure omitted)

 start with [0,1); after b the interval is [.2,.7);
 after a it is [.2,.3); after c it is [.27,.3).

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c1…cn with probabilities p[c], use:

  l0 = 0      li = li-1 + si-1 * f[ci]
  s0 = 1      si = si-1 * p[ci]

f[c] is the cumulative prob. up to symbol c (not included).
The final interval is [ln, ln+sn), and its size is

  sn = ∏i=1..n p[ci]

The interval for a message sequence will be called the sequence interval.
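A direct sketch of mine of this interval computation with floats, good enough to
reproduce the bac example; a real coder uses the integer renormalization described
later.

def sequence_interval(msg, p):
    # p: dict symbol -> probability; returns (l, s), the sequence interval [l, l+s).
    cum, f = 0.0, {}
    for c in sorted(p):           # f[c] = cumulative probability of the symbols before c
        f[c] = cum
        cum += p[c]
    l, s = 0.0, 1.0
    for c in msg:
        l = l + s * f[c]
        s = s * p[c]
    return l, s

# sequence_interval("bac", {'a': .2, 'b': .5, 'c': .3}) ≈ (0.27, 0.03), i.e. [.27, .3)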

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3 (figure omitted):

 .49 falls in [.2,.7)  b; rescaled: (.49-.2)/.5 = .58
 .58 falls in [.2,.7)  b; rescaled: (.58-.2)/.5 = .76
 .76 falls in [.7,1)   c

The message is bbc.

Representing a real number
Binary fractional representation:
  .75 = .11     1/3 = .0101…     11/16 = .1011

Algorithm to emit the bits of x ∈ [0,1):
 1. x = 2*x
 2. if x < 1 output 0
 3. else x = x - 1; output 1

So how about just using the shortest binary fractional representation in the
sequence interval? e.g. [0,.33)  .01    [.33,.66)  .1    [.66,1)  .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions:

  code    min      max      interval
  .11     .110     .111     [.75, 1.0)
  .101    .1010    .1011    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval is
contained in the sequence interval (a dyadic number); e.g. (figure omitted) the
code interval of .101, [.625,.75), fits inside the sequence interval [.61,.79).

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length
Note that - log s  + 1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

  1 + log (1/s)  =  1 +  log ∏i (1/pi) 
                  ≤  2 + Σi=1,n log (1/pi)
                  =  2 + Σk=1,|S| n pk log (1/pk)
                  =  2 + n H0   bits

In practice it takes nH0 + 0.02 n bits, because of rounding.

Integer Arithmetic Coding
The problem is that operations on arbitrary-precision real numbers are expensive.
Key ideas of the integer version:
 Keep integers in range [0..R) where R=2k
 Use rounding to generate integer intervals
 Whenever the sequence interval falls into the top, bottom or middle half,
  expand the interval by a factor of 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
 If l ≥ R/2 (top half): output 1 followed by m 0s; m=0; expand the interval by 2
 If u < R/2 (bottom half): output 0 followed by m 1s; m=0; expand the interval by 2
 If l ≥ R/4 and u < 3R/4 (middle half): increment m; expand the interval by 2
 In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine (figure omitted): the ATB keeps the current interval (L,s);
given the next symbol c and the distribution (p1,....,p|S|), it moves to the
sub-interval (L’,s’) associated with c. Encoder and decoder perform the same transitions.

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
(figure omitted) The ATB is driven by p[ s | context ], where s = c or esc.

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
String = ACCBACCACBA B,  k = 2

Order 0 (empty context):  A = 4   B = 2   C = 5   $ = 3

Order 1:
  Context A:   C = 3   $ = 1
  Context B:   A = 2   $ = 1
  Context C:   A = 1   B = 2   C = 2   $ = 3

Order 2:
  Context AC:  B = 1   C = 2   $ = 2
  Context BA:  C = 1   $ = 1
  Context CA:  C = 1   $ = 1
  Context CB:  A = 2   $ = 1
  Context CC:  A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
Text: a a c a a c a b c a b a b a c
Dictionary = all substrings starting before the cursor; at the cursor shown in
the figure (omitted) the emitted triple is <2,3,c>.

Algorithm’s step:
 Output <d, len, c>
  d   = distance of the copied string wrt the current position
  len = length of the longest match
  c   = next char in the text beyond the longest match
 Advance by len + 1

A buffer “window” of fixed length slides over the text.

Example: LZ77 with window (window size = 6)

a a c a a c a b c a b a a a c    (0,0,a)
a a c a a c a b c a b a a a c    (1,1,c)
a a c a a c a b c a b a a a c    (3,4,b)
a a c a a c a b c a b a a a c    (3,3,a)
a a c a a c a b c a b a a a c    (1,2,c)

At each step: longest match within W, then the next character.
A parsing sketch follows.
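A simple quadratic LZ77 parsing sketch of mine; W bounds how far back a copy may
start, and it reproduces the triples of the example above.

def lz77_parse(text, W=None):
    out, i, n = [], 0, len(text)
    while i < n:
        best_d, best_len = 0, 0
        lo = 0 if W is None else max(0, i - W)
        for start in range(lo, i):                      # candidate copy sources in the window
            l = 0
            while i + l < n - 1 and text[start + l] == text[i + l]:
                l += 1                                  # overlap with the cursor is allowed
            if l > best_len:
                best_d, best_len = i - start, l
        nxt = text[i + best_len]                        # next char beyond the longest match
        out.append((best_d, best_len, nxt))
        i += best_len + 1
    return out

# lz77_parse("aacaacabcabaaac", W=6)
# -> [(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')]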

LZ77 Decoding
The decoder keeps the same dictionary window as the encoder:
it finds the substring and inserts a copy of it.

What if len > d? (overlap with the text to be compressed)
 E.g. seen = abcd, next codeword is (2,9,e)
 Simply copy starting at the cursor:
   for (i = 0; i < len; i++)
     out[cursor+i] = out[cursor-d+i]
 Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: output one of the following formats:
  (0, position, length)  or  (1, char)
Typically the second format is used if length < 3.

Special greedy: possibly use a shorter match so that the next match is better
Hash table to speed up the search for matches (indexed on triplets)
The triples are then coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send the extra character c, but still add Sc to the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
The decoder is one step behind the coder, since it does not know c
 There is an issue for strings of the form SSc where S[0] = c,
 and these are handled specially!!!
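A minimal LZW encoder sketch of mine; it initializes the dictionary with ord()
values (the slides use their own numbering, e.g. a = 112), new entries get codes
256, 257, ...

def lzw_encode(text):
    D = {chr(i): i for i in range(256)}
    out, S = [], ""
    for c in text:
        if S + c in D:
            S = S + c                  # extend the current match
        else:
            out.append(D[S])           # emit the code of the longest match S
            D[S + c] = len(D)          # add Sc to the dictionary (no char is sent)
            S = c
    if S:
        out.append(D[S])
    return out

# lzw_encode("aabaacababacb") emits the codes of a, a, b, then 256 (=aa), c,
# 257 (=ab), 261 (=aba), c, b — matching the encoding example below.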

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Take the text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



A useful tool: the L  F mapping
How do we map L’s chars onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal chars of L and rotate their rows rightward:
they appear in F in the same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. The LF-array maps L’s chars to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
  Compute LF[0,n-1];
  r = 0; i = n;            // start from the row of #, fill T right to left
  while (i>0) {
    T[i] = L[r];           // L[r] is the char preceding the current one in T
    r = LF[r]; i--;        // jump to the row whose first char is L[r]
  }
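A compact Python sketch of mine of both directions, using the sort-based
definition; fine for small strings like mississippi#, not for Gbs of data.

def bwt(T):
    # T must end with a unique smallest char, e.g. '#'
    n = len(T)
    rotations = sorted(T[i:] + T[:i] for i in range(n))
    return "".join(row[-1] for row in rotations)         # the L column

def ibwt(L, sentinel="#"):
    n = len(L)
    # F = sorted(L); LF[i] = position in F of the char L[i]
    # (stable tie-breaking = equal chars of L keep their relative order in F)
    order = sorted(range(n), key=lambda i: (L[i], i))
    LF = [0] * n
    for f, l in enumerate(order):
        LF[l] = f
    r, out = L.index(sentinel), []        # start from the row of T itself (it ends with #)
    for _ in range(n):
        out.append(L[r])                  # L[r] precedes the previously output char in T
        r = LF[r]
    return "".join(reversed(out))

# bwt("mississippi#") == "ipssm#pissii" and ibwt("ipssm#pissii") == "mississippi#"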

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Q(n2 log n) time in the worst-case
• Q(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics

 Size
  1 trillion pages reported as available (Google, 7/08)
  5-40K per page => hundreds of terabytes
  Size grows every day!!

 Change
  8% new pages, 25% new links change weekly
  Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…

 Physical network graph
  V = Routers
  E = communication links

 The “cosine” graph (undirected, weighted)
  V = static web pages
  E = semantic distance between pages

 Query-Log graph (bipartite, weighted)
  V = queries and URLs
  E = (q,u) if u is a result for q, and has been clicked by some user who issued q

 Social graph (undirected, unweighted)
  V = users
  E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution

Altavista crawl (1999) and WebBase crawl (2001), plots omitted:
the indegree follows a power law distribution

  Pr[ in-degree(u) = k ]  ≈  1 / k^a ,   a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)
 V = URLs, E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN, no OUT)

Three key properties:
 Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1
 Locality: usually most of the hyperlinks point to other URLs on the same host (about 80%).
 Similarity: pages close in lexicographic order tend to share many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides and efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations

            Emacs size   Emacs time
  uncompr   27Mb         ---
  gzip      8Mb          35 secs
  zdelta    1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

(figure omitted: files as nodes, zdelta sizes as edge weights, plus a dummy
node connected to all files with gzip sizes as weights)

            space   time
  uncompr   30Mb    ---
  tgz       20%     linear
  THIS      8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
            space    time
  uncompr   260Mb    ---
  tgz       12%      2 mins
  THIS      8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

            gcc size   emacs size
  total     27288      27326
  gzip      7563       8577
  zdelta    227        1431
  rsync     964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems, log(n/k) levels.
If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
(figure omitted) The compacted trie of all the suffixes of T# = mississippi#:
edges are labeled with substrings (e.g. ssi, ppi#, si, i#, mississippi#, …) and
each leaf stores the starting position (1…12) of its suffix.

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position in SA is the lexicographic position of P.

Storing SUF(T) explicitly takes Q(N2) space; the suffix array SA stores only
the suffix pointers.

T = mississippi#
SA = 12 11 8 5 2 1 10 9 7 4 6 3
SUF(T) = #, i#, ippi#, issippi#, ississippi#, mississippi#, pi#, ppi#,
sippi#, sissippi#, ssippi#, ssissippi#

Suffix Array space:
• SA: Q(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step.
E.g. P = si on T = mississippi#: compare P against the suffix in the middle of
the current SA range; if P is larger go right, if P is smaller go left.

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char comparisons
 overall, O(p log2 N) time

Improvable to O(p + log2 N) [Manber-Myers, ’90] and, with suffix trays, to
O(p + log2 |S|) [Cole et al, ’06]. A sketch of the plain binary search follows.
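A small sketch of mine of the indirect binary search that finds the SA range of
the suffixes prefixed by P.

def sa_search(T, SA, P):
    # SA holds the 1-based starting positions of the lexicographically sorted
    # suffixes of T (as in the slides). Returns the 1-based positions of the
    # occurrences of P in T.
    def first_index(strict):
        # first SA index whose suffix prefix compares >= P (strict=False) or > P (strict=True)
        lo, hi = 0, len(SA)
        while lo < hi:
            mid = (lo + hi) // 2
            pref = T[SA[mid] - 1:][:len(P)]
            if (pref > P) if strict else (pref >= P):
                hi = mid
            else:
                lo = mid + 1
        return lo
    l, r = first_index(False), first_index(True)
    return sorted(SA[l:r])

# T = "mississippi#"; SA = [12,11,8,5,2,1,10,9,7,4,6,3]
# sa_search(T, SA, "si") -> [4, 7], the two occurrences reported below.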

Locating the occurrences
(figure omitted) For P = si on T = mississippi#, the binary search delimits the
SA range containing the suffixes sippi# and sissippi#, i.e. occ = 2 occurrences,
at positions 4 and 7.

Suffix Array search
• O(p + log2 N + occ) time

Suffix Trays: O(p + log2 |S| + occ)   [Cole et al., ‘06]
String B-tree   [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays   [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

T = mississippi#
SA  = 12 11 8 5 2 1 10 9 7 4 6 3
Lcp =  0  0 1 4 0 0  1 0 2 1 3

e.g. the Lcp between issippi# and ississippi# is 4.

• How long is the common prefix between T[i,...] and T[j,...] ?
 • Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
 • Search for Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
 • Search for a window Lcp[i,i+C-2] whose entries are all ≥ L

Slide 190

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
[Figure: word-based tagged Huffman over T = “bzip or not bzip”. The Huffman tree has fan-out 128; every byte of a codeword carries 7 bits of Huffman code plus 1 tagging bit that marks the first byte of a codeword, so codewords are byte-aligned. Example codewords: [bzip] = 1a 0b, [or] = 1g 0a 0b, [not] = 1g 0g 0a; their concatenation gives C(T).]

CGrep and other ideas...
P= bzip = 1a 0b

[Figure: compressed GREP. The pattern P = bzip is encoded with the same code (P = 1a 0b) and searched directly in C(T) of T = “bzip or not bzip”; the candidate byte-aligned positions are marked yes/no, and the tag bits prevent false matches inside codewords.]
Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary = {bzip, not, or, space};  S = “bzip or not bzip”;  P = bzip = 1a 0b
[Figure: P is encoded with the dictionary’s word-based tagged code and scanned against C(S); yes/no marks show which byte-aligned candidates are real occurrences.]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which arithmetic and bit operations replace comparisons.
We will survey two examples of such methods:
 the Random Fingerprint method, due to Karp and Rabin
 the Shift-And method, due to Baeza-Yates and Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in order to obtain:
 an efficient randomized algorithm that makes an error with small probability;
 a randomized algorithm that never errs, whose running time is efficient with high probability.
We will consider a binary alphabet (i.e., T ∈ {0,1}^n).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H(s) = Σ_{i=1..m} 2^{m−i} · s[i]

P = 0101
H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H(T_r) = 2·H(T_{r−1}) − 2^m·T[r−1] + T[r+m−1]

T = 10110101
T1 = 1011, T2 = 0110
H(T1) = H(1011) = 11
H(T2) = H(0110) = 2·11 − 2⁴·1 + 0 = 22 − 16 + 0 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
 Compute H(P) and H(T1).
 Run over T: compute H(Tr) from H(Tr−1) in constant time, and compare H(P) = H(Tr).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally (reduce mod 7 at each Horner step):
(1·2 + 0) mod 7 = 2
(2·2 + 1) mod 7 = 5
(5·2 + 1) mod 7 = 4
(4·2 + 1) mod 7 = 2
(2·2 + 1) mod 7 = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(Tr−1), since
2^m (mod q) = 2·(2^{m−1} (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I.
Pick a random prime q ≤ I, and compute P’s fingerprint Hq(P).
For each position r in T, compute Hq(Tr) and test whether it equals Hq(P). If the two values are equal, either
 declare a probable match (randomized algorithm), or
 check explicitly and declare a definite match (deterministic algorithm).

Running time: O(n + m), excluding verification.
The randomized algorithm is correct w.h.p.
The deterministic algorithm has expected running time O(n + m).

Proof on the board
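A minimal Python sketch of the Karp-Rabin scan described above; the tiny pool of moduli below is purely illustrative (a real implementation draws a random prime ≤ I).

import random

def karp_rabin(T, P):
    n, m = len(T), len(P)
    q = random.choice([999983, 1000003, 2147483647])   # assumed primes, for illustration only
    pow_m = pow(2, m - 1, q)                            # 2^(m-1) mod q, used by the rolling update
    hp = ht = 0
    for i in range(m):                                  # fingerprints of P and of the first window
        hp = (2 * hp + ord(P[i])) % q
        ht = (2 * ht + ord(T[i])) % q
    occ = []
    for r in range(n - m + 1):
        if hp == ht and T[r:r + m] == P:                # explicit check: the deterministic variant
            occ.append(r + 1)                           # 1-based position, as in the slides
        if r + m < n:                                   # slide the window one position to the right
            ht = (2 * (ht - ord(T[r]) * pow_m) + ord(T[r + m])) % q
    return occ

print(karp_rabin("10110101", "0101"))   # -> [5], i.e. the match at T5 of the example above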

Problem 1: Solution
Dictionary = {bzip, not, or, space};  S = “bzip or not bzip”;  P = bzip = 1a 0b
[Figure: the compressed scan of C(S) for P, with the candidate byte-aligned positions marked yes/no.]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

[Figure: the m×n matrix M for T = california and P = for (m = 3, n = 10). Its only 1-entries are M(1,5), M(2,6), M(3,7): the full match of “for” ends at position 7.]
How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to compute the j-th column of M from the (j−1)-th one.
We define the m-length binary vector U(x), for each character x of the alphabet: U(x) has a 1 in the positions where x appears in P.
Example: P = abaac
U(a) = (1,0,1,1,0)ᵀ    U(b) = (0,1,0,0,0)ᵀ    U(c) = (0,0,0,0,1)ᵀ

How to construct M



Initialize column 0 of M to all zeros.
For j > 0, the j-th column is obtained by

M(j) = BitShift(M(j−1)) & U(T[j])

For i > 1, entry M(i,j) = 1 iff
(1) the first i−1 characters of P match the i−1 characters of T ending at character j−1   ⇔  M(i−1, j−1) = 1
(2) P[i] = T[j]   ⇔  the i-th bit of U(T[j]) = 1
BitShift moves bit M(i−1, j−1) into the i-th position; AND-ing with the i-th bit of U(T[j]) establishes whether both conditions hold.
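Before the worked example, here is a minimal Python sketch of the Shift-And recurrence, packing each column of M into an integer bit-mask (bit i−1 stands for row i); it is an illustration, not the course's reference code.

def shift_and(T, P):
    m = len(P)
    U = {}                                    # U[x]: bit-mask of the positions of x in P
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    full = 1 << (m - 1)                       # a 1 in the last row signals a full match
    M, occ = 0, []
    for j, c in enumerate(T):
        # BitShift(M) = shift down by one and set the first bit to 1  ->  (M << 1) | 1
        M = ((M << 1) | 1) & U.get(c, 0)
        if M & full:
            occ.append(j - m + 2)             # 1-based starting position of the occurrence
    return occ

print(shift_and("xabxabaaca", "abaac"))       # -> [5]: the occurrence ending at position 9 below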

Example: T = xabxabaaca, P = abaac (m = 5, n = 10).
 j = 1: T[1] = x, U(x) = (0,0,0,0,0)ᵀ, so M(1) = BitShift(M(0)) & U(x) = (0,0,0,0,0)ᵀ.
 j = 2: T[2] = a, M(2) = BitShift(M(1)) & U(a) = (1,0,0,0,0)ᵀ.
 j = 3: T[3] = b, M(3) = BitShift(M(2)) & U(b) = (0,1,0,0,0)ᵀ.
 …
 j = 9: T[9] = c, M(9) = BitShift(M(8)) & U(c) = (0,0,0,0,1)ᵀ: the last bit is set, so an occurrence of P ends at position 9.

Shift-And method: Complexity








If m ≤ w, any column and any vector U() fit in a memory word
 ⇒ any step requires O(1) time.
If m > w, any column and vector U() can be divided into ⌈m/w⌉ memory words
 ⇒ any step requires O(m/w) time.
Overall: O(n(1 + m/w) + m) time.
Thus it is very fast when the pattern length is close to the word size — very often the case in practice; recall that w = 64 bits in modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: Another solution
Dictionary = {bzip, not, or, space};  S = “bzip or not bzip”;  P = bzip = 1a 0b
[Figure: the same compressed scan of C(S), this time driven by the Shift-And machinery over the codeword of P; yes/no marks show the candidate byte-aligned positions.]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

P = o;  S = “bzip or not bzip”;  terms = {bzip, not, or, space}
The terms containing P are not (= 1g 0g 0a) and or (= 1g 0a 0b).
[Figure: the occurrences of both terms are located in C(S), one scan per matching term (yes marks).]
Speed ≈ Compression ratio? No! Why? Because we need one scan of C(S) for each term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And
 S is the concatenation of the patterns in P.
 R is a bitmap of length m: R[i] = 1 iff S[i] is the first symbol of a pattern.
Use a variant of the Shift-And method searching for S:
 for any symbol c, U’(c) = U(c) AND R, i.e. U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern;
 at any step j, compute M(j) and then M(j) OR U’(T[j]). Why? To set to 1 the first bit of each pattern that starts with T[j];
 check whether there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

Terms = {bzip, not, or, space};  S = “bzip or not bzip”;  P = bot, k = 2
[Figure: the compressed scan of C(S), now allowing up to k mismatches against each term.]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
[Alignment figure: P placed under T at starting positions 2 and 4.]

Agrep




Our current goal: given k, find all the occurrences of P in T with up to k mismatches.
We define M^l to be an m-by-n binary matrix such that:
M^l(i,j) = 1 iff there are at most l mismatches between the first i characters of P and the i characters of T ending at character j.

What is M^0?
How does M^k solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

[Figure: P[1..i−1] aligned against T ending at j−1 with ≤ l mismatches, extended by the equal pair P[i] = T[j].]

Contribution:  BitShift(M^l(j−1)) & U(T[j])

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

[Figure: P[1..i−1] aligned against T ending at j−1 with ≤ l−1 mismatches; the new pair P[i], T[j] may mismatch.]

Contribution:  BitShift(M^{l−1}(j−1))

Computing M^l
 We compute M^l for all l = 0, …, k; for each j we compute M^0(j), M^1(j), …, M^k(j).
 For all l, initialize M^l(0) to the zero vector.
 Combining the two cases above, for l ≥ 1:

M^l(j) = [ BitShift(M^l(j−1)) & U(T[j]) ]  OR  BitShift(M^{l−1}(j−1))
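A minimal Python sketch of this k-mismatch recurrence, again packing each column M^l(j) into an integer bit-mask; it is an illustration under the conventions used above, not the course's reference code.

def shift_and_k_mismatches(T, P, k):
    # Bit i-1 of M[l] is 1 iff P[1..i] matches the text ending at the current position
    # with at most l mismatches.
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    full = 1 << (m - 1)
    M, occ = [0] * (k + 1), []
    for j, c in enumerate(T):
        prev = M[:]                                            # the columns at position j-1
        M[0] = ((prev[0] << 1) | 1) & U.get(c, 0)              # exact matching (case 1 only)
        for l in range(1, k + 1):
            # case 1: extend a <=l-mismatch prefix with an equal character
            # case 2: extend a <=l-1-mismatch prefix; the new character may mismatch
            M[l] = (((prev[l] << 1) | 1) & U.get(c, 0)) | ((prev[l - 1] << 1) | 1)
        if M[k] & full:
            occ.append(j - m + 2)                              # 1-based starting position
    return occ

print(shift_and_k_mismatches("aatatccacaa", "atcgaa", 2))      # -> [4], as in the Agrep example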

Example (k = 1): T = xabxabaaca, P = abaad.
[Figure: the 5×10 matrices M^0 and M^1. In M^1 the first row is all 1s (any single character matches P[1] with ≤ 1 mismatch), and M^1(5,9) = 1: P occurs with at most one mismatch ending at position 9 (abaac vs. abaad).]

How much do we pay?





The running time is O(k · n · (1 + m/w)).
Again, the method is practically efficient for small m.
Only O(k) columns of M are needed at any given time; hence the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary = {bzip, not, or, space};  S = “bzip or not bzip”;  P = bot, k = 2
[Figure: the compressed scan of C(S) with the k-mismatch machinery; the only matching term is not (= 1g 0g 0a).]

Agrep: more sophisticated operations


The Shift-And method can solve other operations.
The edit distance between two strings p and s is d(p,s) = the minimum number of operations needed to transform p into s, via three ops:
 Insertion: insert a symbol in p
 Deletion: delete a symbol from p
 Substitution: change a symbol of p into a different one
Example: d(ananas, banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding
γ(x) = 0^(Length−1) followed by the binary representation of x,
where x > 0 and Length = ⌊log2 x⌋ + 1.
e.g., 9 is represented as <000, 1001>.

The γ-code for x takes 2⌊log2 x⌋ + 1 bits (i.e., a factor of 2 from optimal).
It is optimal for Pr(x) = 1/(2x²) and i.i.d. integers.
It is a prefix-free encoding…


Given the following sequence of γ-coded integers, reconstruct the original sequence:
0001000001100110000011101100111
→ 8, 6, 3, 59, 7
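A minimal Python sketch of γ-encoding and γ-decoding, checked on the exercise above.

def gamma_encode(x):
    # gamma(x): (L-1) zeros followed by the L-bit binary representation of x, L = floor(log2 x)+1
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        l = 0
        while bits[i] == "0":                    # count the unary prefix of zeros
            l += 1
            i += 1
        out.append(int(bits[i:i + l + 1], 2))    # read l+1 bits of binary value
        i += l + 1
    return out

print(gamma_encode(9))                                     # -> 0001001
print(gamma_decode("0001000001100110000011101100111"))     # -> [8, 6, 3, 59, 7]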

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).
Recall that |γ(i)| ≤ 2·log2 i + 1.
How good is this approach w.r.t. Huffman?
Compression ratio ≤ 2·H0(S) + 1
Key fact:  1 ≥ Σ_{i=1..x} pi ≥ x·px  ⇒  x ≤ 1/px

How good is it?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log2 i + 1.
The cost of the encoding is (recall that i ≤ 1/pi):

Σ_{i=1..|S|} pi · |γ(i)|  ≤  Σ_{i=1..|S|} pi · [2·log2(1/pi) + 1]  =  2·H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + …

A better encoding


Byte-aligned and tagged Huffman
 128-ary Huffman tree
 the first bit of the first byte is tagged
 configurations on 7 bits: just those of the Huffman code

End-tagged dense code (ETDC)
 the rank r is mapped to the r-th binary sequence on 7·k bits
 the first bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
The distribution of words is skewed: roughly 1/i^q, where 1 < q < 2.
A new concept: Continuers vs. Stoppers.
 Previously we used s = c = 128.
The main idea is:
 s + c = 256 (we are playing with 8 bits);
 thus s items are encoded with 1 byte,
 s·c items with 2 bytes, s·c² with 3 bytes, …

An example
 5000 distinct words.
 ETDC encodes 128 + 128² = 16512 words on at most 2 bytes.
 A (230,26)-dense code encodes 230 + 230·26 = 6210 words on at most 2 bytes, hence more words on 1 byte — and thus, if the distribution is skewed, it compresses better…

Optimal (s,c)-dense codes
Find the optimal s, assuming c = 256 − s.
 Brute-force approach, or
 binary search: on real distributions there seems to be a unique minimum.
Notation: K_s = max codeword length; F_s^k = cumulative probability of the symbols whose codeword length is ≤ k.

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory.
Properties:
 it exploits temporal locality, and it is dynamic
 X = 1^n 2^n 3^n … n^n  ⇒  Huff = Θ(n² log n) bits, MTF = O(n log n) + n² bits
It is never much worse than Huffman…
...but it may be far better
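A minimal Python sketch of the MTF transform just described; the initial list and the input string are illustrative.

def mtf_encode(text, alphabet):
    # Output the current position of each symbol in the list L, then move it to the front.
    L, out = list(alphabet), []
    for s in text:
        i = L.index(s)
        out.append(i)
        L.pop(i)
        L.insert(0, s)
    return out

# The list starts as [a,b,c,d]; repeated symbols quickly get small codes (temporal locality).
print(mtf_encode("aaabbbbccc", "abcd"))   # -> [0, 0, 0, 1, 0, 0, 0, 2, 0, 0]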

MTF: how good is it?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log2 i + 1.
Put S in front of the sequence and consider the cost of encoding; if p_x^1 < p_x^2 < … are the n_x positions of symbol x, the cost is at most

O(|S| log |S|) + Σ_{x=1..|S|} Σ_{i=2..n_x} |γ(p_x^i − p_x^{i−1})|

By Jensen’s inequality this is

≤ O(|S| log |S|) + Σ_{x=1..|S|} n_x · [2·log2(N/n_x) + 1]
= O(|S| log |S|) + N · [2·H0(X) + 1]

hence La[mtf] ≤ 2·H0(X) + O(1).

MTF: higher compression
To achieve higher compression we consider words (and separators) as the symbols to be encoded.
How to maintain the MTF-list efficiently:
 Search tree: leaves contain the symbols, ordered as in the MTF-list; nodes contain the size of their descending subtree.
 Hash table: the key is a symbol, the data is a pointer to the corresponding tree leaf.
Each tree operation takes O(log |S|) time; the total is O(n log |S|), where n = #symbols to be compressed.

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca ⇒ (a,1),(b,3),(a,2),(c,4),(a,1)
In the case of binary strings, just the run lengths and one bit suffice.
There is a memory.
Properties:
 it exploits spatial locality, and it is a dynamic code
 X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = Θ(n² log n) > Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g. with p(a) = .2, p(b) = .5, p(c) = .3:
a → [0.0, 0.2),  b → [0.2, 0.7),  c → [0.7, 1.0)

f(i) = Σ_{j<i} p(j),  so  f(a) = .0, f(b) = .2, f(c) = .7

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2, .7)).

Arithmetic Coding: Encoding Example
Coding the message sequence bac (with the intervals above):
start [0, 1)  →  b: [0.2, 0.7)  →  a: [0.2, 0.3)  →  c: [0.27, 0.3)
The final sequence interval is [.27, .3).

Arithmetic Coding
To code a sequence of symbols c1 c2 … cn with probabilities p[c], use the following:

l0 = 0,   li = li−1 + si−1 · f[ci]
s0 = 1,   si = si−1 · p[ci]

where f[c] is the cumulative probability up to symbol c (not included).
The final interval size is  sn = Π_{i=1..n} p[ci].
The interval for a message sequence will be called the sequence interval.
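A minimal Python sketch of the interval recurrences above, computing the sequence interval of a message with floating-point arithmetic (real coders use the integer version described later).

def sequence_interval(msg, p):
    symbols = sorted(p)                      # fixes the symbol order used by f
    f, acc = {}, 0.0
    for c in symbols:                        # cumulative probability, symbol excluded
        f[c] = acc
        acc += p[c]
    l, s = 0.0, 1.0
    for c in msg:                            # l_i = l_{i-1} + s_{i-1}*f[c_i],  s_i = s_{i-1}*p[c_i]
        l = l + s * f[c]
        s = s * p[c]
    return l, l + s

print(sequence_interval("bac", {"a": .2, "b": .5, "c": .3}))   # -> approximately (0.27, 0.3)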

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3 (same intervals as above):
 .49 ∈ [0.2, 0.7) = b’s interval → first symbol b
 .49 ∈ [0.3, 0.55) = b’s sub-interval of [0.2, 0.7) → second symbol b
 .49 ∈ [0.475, 0.55) = c’s sub-interval of [0.3, 0.55) → third symbol c
The message is bbc.

Representing a real number
Binary fractional representation:
.75 = .11     1/3 = .0101…     11/16 = .1011

Algorithm (to emit the bits of x ∈ [0,1)):
1. x = 2·x
2. if x < 1, output 0
3. else x = x − 1 and output 1

So how about just using the shortest binary fractional representation in the sequence interval?
e.g. [0, .33) → .01     [.33, .66) → .1     [.66, 1) → .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
           min      max      interval
.11        .110…    .111…    [.75, 1.0)
.101       .1010…   .1011…   [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

[Figure: a sequence interval [.61, .79) containing the code interval of .101, i.e. [.625, .75).]

Can use l + s/2 truncated to 1 + log2(1/s) bits.

Bound on Arithmetic length

Note that  −log2 s + 1 = log2 (2/s).

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most
1 + ⌈log2 (1/s)⌉ = 1 + ⌈log2 Π_{i=1..n} (1/pi)⌉
≤ 2 + Σ_{i=1..n} log2 (1/pi)
= 2 + Σ_{k=1..|S|} n·pk·log2 (1/pk)
= 2 + n·H0   bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
The problem is that operations on arbitrary-precision real numbers are expensive.
Key ideas of the integer version:
 keep integers in the range [0..R), where R = 2^k
 use rounding to generate the integer intervals
 whenever the sequence interval falls into the top, bottom or middle half, expand the interval by a factor of 2
Integer Arithmetic is an approximation.

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

[Figure: the Arithmetic ToolBox (ATB) as a state machine: from the current interval (L, s), the incoming symbol c and its distribution (p1, …, p|S|), ATB produces the new interval (L’, s’).]

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
[Figure: the ATB state machine driven by PPM: at each step the symbol s (either c or esc) is coded with the conditional distribution p[s | context], turning the interval (L, s) into (L’, s’).]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
String = ACCBACCACBA, next character = B, k = 2

Context ∅:   A = 4   B = 2   C = 5   $ = 3
Context A:   C = 3   $ = 1
Context B:   A = 2   $ = 1
Context C:   A = 1   B = 2   C = 2   $ = 3
Context AC:  B = 1   C = 2   $ = 2
Context BA:  C = 1   $ = 1
Context CA:  C = 1   $ = 1
Context CB:  A = 2   $ = 1
Context CC:  A = 1   B = 1   $ = 2

($ denotes the escape; here its count equals the number of distinct symbols seen in that context.)
You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) for n → ∞ !!

LZ77
T = a a c a a c a b c a b a b a c
Dictionary = all substrings starting before the cursor (possibly restricted to a window); the cursor marks the current position.
Example output at the cursor: <2, 3, c> — copy 3 characters starting 2 positions back, then emit the next character c.

Algorithm’s step:
 Output <d, len, c>, where
  d = distance of the copied string w.r.t. the current position
  len = length of the longest match
  c = next char in the text beyond the longest match
 Advance by len + 1.
In practice, a buffer “window” of fixed length slides over the text.

Example: LZ77 with window (window size = 6)
T = a a c a a c a b c a b a a a c
Output: (0,0,a) (1,1,c) (3,4,b) (3,3,a) (1,2,c)
Each triple is ⟨distance, length of the longest match within the window W, next character⟩.
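A minimal Python sketch of windowed LZ77 parsing, reproducing the example above; the brute-force match search is for clarity only (gzip-like implementations use hashing, as noted later).

def lz77_encode(T, W=6):
    # At each step emit (d, len, c) and advance by len+1.
    out, i, n = [], 0, len(T)
    while i < n:
        best_d, best_len = 0, 0
        for d in range(1, min(i, W) + 1):                       # candidate distances within the window
            l = 0
            while i + l < n - 1 and T[i + l] == T[i - d + l]:   # overlap with the cursor is allowed
                l += 1
            if l > best_len:
                best_d, best_len = d, l
        out.append((best_d, best_len, T[i + best_len]))
        i += best_len + 1
    return out

print(lz77_encode("aacaacabcabaaac"))
# -> [(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')]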

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
T = a a b a a c a b c a b c b
Output: (0,a) (1,b) (1,a) (0,c) (2,c) (5,b)
Dict:    1=a    2=ab   3=aa   4=c    5=abc  6=abcb

LZ78: Decoding Example
Input: (0,a) (1,b) (1,a) (0,c) (2,c) (5,b)
Text:  a | ab | aa | c | abc | abcb  =  aabaacabcabcb
Dict:  1=a  2=ab  3=aa  4=c  5=abc  6=abcb
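A minimal Python sketch of LZ78 parsing, reproducing the coding example above.

def lz78_encode(T):
    # Find the longest dictionary match S, emit (id(S), next char), insert S+char.
    dictionary, out = {"": 0}, []
    i, n = 0, len(T)
    while i < n:
        s = ""
        while i < n and s + T[i] in dictionary:   # extend the current match
            s += T[i]
            i += 1
        c = T[i] if i < n else ""                 # next char after the match (may be empty at the end)
        out.append((dictionary[s], c))
        dictionary[s + c] = len(dictionary)       # the new phrase gets the next id
        i += 1 if c else 0
    return out

print(lz78_encode("aabaacabcabcb"))
# -> [(0,'a'), (1,'b'), (1,'a'), (0,'c'), (2,'c'), (5,'b')]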

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
T = a a b a a c a b a b a c b     (a = 112, b = 113, c = 114)
Output:       112    112    113    256    114     257    261     114
New entries:  256=aa 257=ab 258=ba 259=aac 260=ca 261=aba 262=abac 263=cb

LZW: Decoding Example
Input: 112 112 113 256 114 257 261 114 …
Text so far: a | a | b | aa | c | ab | ? …
New entries (the decoder is one step behind the encoder): 256=aa 257=ab 258=ba 259=aac 260=ca …
When code 261 arrives it is not yet in the decoder’s dictionary (“one step later”): it is reconstructed as the previous phrase extended with its own first character, i.e. ab + a = aba.
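A minimal Python sketch of LZW decoding, including the “one step later” case; the initial numbering below (a = 0, b = 1, c = 2) is an illustrative toy choice, different from the slides’ 112/113/114.

def lzw_decode(codes, alphabet):
    dictionary = {i: c for i, c in enumerate(alphabet)}   # initial single-char entries
    prev = dictionary[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        else:                                  # SSc-style case: the code was just created by the encoder
            entry = prev + prev[0]
        out.append(entry)
        dictionary[len(dictionary)] = prev + entry[0]      # one step behind the encoder
        prev = entry
    return "".join(out)

# Toy numbering a=0, b=1, c=2; these codes correspond to the slides' example text.
print(lzw_decode([0, 0, 1, 3, 2, 4, 8, 2], "abc"))
# -> "aabaacababac" (the prefix of the example text covered by these codes)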

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
We are given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows (Burrows-Wheeler, 1994):

#mississippi
i#mississipp
ippi#mississ
issippi#miss
ississippi#m
mississippi#
pi#mississip
ppi#mississi
sippi#missis
sissippi#mis
ssippi#missi
ssissippi#mi

F = first column = # i i i i m p p s s s s
L = last column  = i p s s m # p i s s i i   (= BWT(T))

A famous example

Much
longer...

A useful tool: the L → F mapping
[Figure: the sorted rotation matrix again; only F and L are known to the decoder, the middle columns are unknown.]

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
[Figure: the sorted rotation matrix with the F and L columns highlighted.]

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
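A minimal Python sketch of the forward BWT (by sorting rotations) and of InvertBWT via the LF mapping, checked on mississippi#; it mirrors the two key properties above but is not the course’s reference code.

def bwt(T):
    # Forward BWT by sorting all rotations (the "elegant but inefficient" way).
    n = len(T)
    rotations = sorted(T[i:] + T[:i] for i in range(n))
    return "".join(row[-1] for row in rotations)

def invert_bwt(L):
    # LF[r] = row of F holding the same character occurrence as L[r]
    # (equal characters keep their relative order).
    n = len(L)
    counts, rank = {}, []
    for c in L:                                   # rank of each L[r] among equal characters
        rank.append(counts.get(c, 0))
        counts[c] = counts.get(c, 0) + 1
    first, acc = {}, 0
    for c in sorted(counts):                      # first row of F holding each character
        first[c] = acc
        acc += counts[c]
    LF = [first[L[r]] + rank[r] for r in range(n)]
    T = [""] * n
    T[n - 1] = sorted(L)[0]                       # the sentinel # ends T (row 0 starts with it)
    r = 0
    for i in range(n - 2, -1, -1):                # reconstruct T backward: L[r] precedes F[r] in T
        T[i] = L[r]
        r = LF[r]
    return "".join(T)

print(bwt("mississippi#"))          # -> ipssm#pissii
print(invert_bwt("ipssm#pissii"))   # -> mississippi#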

How to compute the BWT ?
SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]   (the rows of the BWT matrix are the sorted rotations of mississippi#)
L  =  i  p  s  s  m  #  p  i  s  s  i  i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA    sorted suffixes of T = mississippi#
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#

Elegant but inefficient. Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus γ(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics
Size
 1 trillion pages available (Google, 7/2008)
 5–40 KB per page ⇒ hundreds of terabytes
 size grows every day!!
Change
 8% new pages and 25% new links change weekly
 lifetime of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…
 Physical network graph: V = routers, E = communication links
 The “cosine” graph (undirected, weighted): V = static web pages, E = semantic distance between pages
 Query-Log graph (bipartite, weighted): V = queries and URLs, E = (q,u) if u is a result for q and has been clicked by some user who issued q
 Social graph (undirected, unweighted): V = users, E = (x,y) if x knows y (facebook, address book, email, …)

Definition
Directed graph G = (V,E): V = URLs, E = (u,v) if u has a hyperlink to v.
Isolated URLs are ignored (no IN & no OUT).
First key property — skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1.

The In-degree distribution
(Altavista crawl 1999, WebBase crawl 2001)
The in-degree follows a power-law distribution:
Pr[in-degree(u) = k] ∝ 1/k^α,   α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E): V = URLs, E = (u,v) if u has a hyperlink to v.
Isolated URLs are ignored (no IN, no OUT).
Three key properties:
 Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1
 Locality: usually most of the hyperlinks point to other URLs on the same host (about 80%)
 Similarity: pages close in lexicographic order tend to share many outgoing lists

A Picture of the Web Graph
[Figure: dot-plot of the link matrix (i,j) for a crawl of 21 million pages and 150 million links, with URLs sorted (Berkeley, Stanford).]
URL compression + Delta encoding

The library WebGraph
 Uncompressed adjacency list
 Adjacency list with compressed gaps (exploits locality)
Successor list:  S(x) = {s1 − x, s2 − s1 − 1, …, sk − s(k−1) − 1}
For negative entries: …
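A minimal Python sketch of the gap encoding of a successor list just described; the node id and its successor list are hypothetical values for illustration.

def gap_encode(x, successors):
    # First gap is s1 - x (possibly negative); the following ones are
    # differences between consecutive successors, decremented by 1.
    gaps = [successors[0] - x]
    for prev, cur in zip(successors, successors[1:]):
        gaps.append(cur - prev - 1)
    return gaps

# Hypothetical successor list of node 15
print(gap_encode(15, [13, 15, 16, 17, 18, 19, 23, 24, 203]))
# -> [-2, 1, 0, 0, 0, 0, 3, 0, 178]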

Copy-lists
Reference chains, possibly limited in length.
 Uncompressed adjacency list
 Adjacency list with copy lists (exploits similarity)
Each bit of y’s copy-list tells whether the corresponding successor of the reference x is also a successor of y; the reference index is chosen in [0, W] so as to give the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Consecutivity in the extra-nodes is exploited:
 Intervals: encoded with their left extreme and length; the length is decremented by Lmin = 2.
 Residuals: encoded as differences between consecutive residuals, or w.r.t. the source node.
Examples from the figure:
 0 = (15 − 15)·2   (positive)
 2 = (23 − 19) − 2   (jump ≥ 2)
 600 = (316 − 16)·2
 3 = |13 − 15|·2 − 1   (negative)
 3018 = 3041 − 22 − 1

Algoritmi per IR

Compression of file collections

Background
[Figure: a sender transmits data over a network to a receiver that already holds some knowledge about the data.]
 Network links are getting faster and faster, but
 many clients are still connected by fairly slow links (mobile?)
 and people wish to send more and more data.
How can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization
 Delta compression   [diff, zdelta, REBL, …]
 » compress file f deploying file f’; compress a group of files
 » speeds up web access by sending the differences between the requested page and the ones available in cache
 File synchronization   [rsync, zsync]
 » the client updates its old file f_old with the f_new available on a server
 » mirroring, shared crawling, content distribution networks
 Set reconciliation
 » the client updates a structured old file f_old with the f_new available on a server
 » update of contacts or appointments, intersecting inverted lists in a P2P search engine

Z-delta compression (one-to-one)
Problem: we have two files f_known and f_new, and the goal is to compute a file f_d of minimum size such that f_new can be derived from f_known and f_d.
 Assume that block moves and copies are allowed.
 Find an optimal covering set of f_new based on f_known.
 The LZ77 scheme provides an efficient, optimal solution:
 » f_known is the “previously encoded text”; compress the concatenation f_known · f_new, starting from f_new.
 zdelta is one of the best implementations.
            Emacs size   Emacs time
uncompr     27 MB        ---
gzip        8 MB         35 secs
zdelta      1.5 MB       42 secs

Efficient Web Access
Dual proxy architecture: a pair of proxies located on the two sides of the slow link use a proprietary protocol to increase performance over that link.
[Figure: Client ↔ client-side proxy ↔ (slow link, delta-encoding) ↔ server-side proxy ↔ (fast link) ↔ web; both proxies hold the reference page.]
Use zdelta to reduce traffic:
 the old version is available at both proxies
 restricted to pages already visited (30% hits), URL-prefix match
 small cache

Cluster-based delta compression
Problem: we wish to compress a group of files F.
 Useful on a dynamic collection of web pages, back-ups, …
 Apply pairwise zdelta: find for each f ∈ F a good reference.
 Reduction to the Min Branching problem on DAGs:
 » build a weighted graph G_F: nodes = files, edge weights = zdelta-sizes
 » insert a dummy node connected to all files, with weights equal to the gzip-coding sizes
 » compute the min branching = directed spanning tree of minimum total cost covering G’s nodes
[Figure: a small example graph with dummy node 0 and edge weights such as 20, 123, 220, 620, 2000.]

            space   time
uncompr     30 MB   ---
tgz         20%     linear
THIS        8%      quadratic

Improvement — what about many-to-one compression (a group of files)?
Problem: constructing G is very costly: n² edge calculations (zdelta executions).
We wish to exploit some pruning approach:
 Collection analysis: cluster the files that appear similar, and thus are good candidates for zdelta-compression; build a sparse weighted graph G’_F containing only the edges between those pairs of files.
 Assign weights: estimate appropriate edge weights for G’_F, thus saving zdelta executions. Nonetheless, strictly n² time.

            space    time
uncompr     260 MB   ---
tgz         12%      2 mins
THIS        8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
[Figure: the client sends the block hashes of f_old; the server matches them against f_new and sends back an encoded file of copy instructions and literals.]
 Simple, widely used, single roundtrip.
 Optimizations: 4-byte rolling hash + 2-byte MD5, gzip for the literals.
 Choice of the block size is problematic (default: max{700, √n} bytes).
 Not good in theory: the granularity of the changes may disrupt the use of blocks.

Rsync: some experiments

          gcc size   emacs size
total      27288      27326
gzip        7563       8577
zdelta       227       1431
rsync        964       4452

Compressed sizes in KB (slightly outdated numbers).
Factor 3–5 gap between rsync and zdelta !!

A new framework: zsync
 The server sends the hashes (unlike rsync, where the client does), and the client checks them.
 The server deploys the common f_ref to compress the new f_tar (rsync compresses just the latter).

A multi-round protocol
 k blocks of n/k elements each; log(n/k) levels.
 If the distance is k, then on each level at most k hashes do not find a match in the other file.
 The communication complexity is O(k · lg n · lg(n/k)) bits.

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff
P is a prefix of the i-th suffix of T (i.e., T[i,N]).
Occurrences of P in T = all suffixes of T having P as a prefix.
Example: P = si, T = mississippi → occurrences at positions 4 and 7.
SUF(T) = sorted set of suffixes of T.
Reduction: from substring search to prefix search (over SUF(T)).

The Suffix Tree
[Figure: the suffix tree of T# = mississippi#; edges are labeled with substrings and the 12 leaves with the starting positions of the suffixes. For instance, the suffixes beginning with “si” share the path labeled s·i and end at the leaves 4 and 7.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position in SUF(T) is the lexicographic position of P.
Storing SUF(T) explicitly would take Θ(N²) space; the suffix array keeps only the suffix pointers.

T = mississippi#
SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
SUF(T) = #, i#, ippi#, issippi#, ississippi#, mississippi#, pi#, ppi#, sippi#, sissippi#, ssippi#, ssissippi#

Space: SA takes Θ(N log2 N) bits, the text T takes N chars; in practice, a total of 5N bytes.

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step (one to SA, one to T).
[Figure: two binary-search steps for P = si on the SA of mississippi#; each comparison tells whether P is larger or smaller than the probed suffix.]

Suffix Array search
 O(log2 N) binary-search steps
 each step takes O(p) char comparisons
 ⇒ overall, O(p log2 N) time
 improvable to O(p + log2 N) [Manber-Myers, ’90], and to an alphabet-dependent bound [Cole et al., ’06]
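A minimal Python sketch of suffix-array construction by suffix sorting and of the O(p log n) binary search, on the running example; it illustrates the idea, not an efficient construction.

def suffix_array(T):
    # "Elegant but inefficient" construction: sort the suffixes directly (1-based positions).
    return sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])

def sa_search(T, SA, P):
    # Binary-search the contiguous SA range of suffixes having P as a prefix.
    def lower(strict):
        lo, hi = 0, len(SA)
        while lo < hi:
            mid = (lo + hi) // 2
            s = T[SA[mid] - 1:][:len(P)]          # O(p) comparison per step
            if s < P or (strict and s == P):
                lo = mid + 1
            else:
                hi = mid
        return lo
    l, r = lower(strict=False), lower(strict=True)
    return SA[l:r]                                # starting positions of the occurrences

T = "mississippi#"
SA = suffix_array(T)
print(SA)                        # -> [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(sa_search(T, SA, "si"))    # -> [7, 4] (occurrences of "si", listed in SA order)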

Locating the occurrences
[Figure: two binary searches delimit the SA range of the suffixes prefixed by P = si (conceptually, between si# and si$); the range contains sippi… and sissippi…, i.e. occ = 2 occurrences, at positions 7 and 4.]

Suffix Array search: O(p + log2 N + occ) time
Suffix Trays: O(p + log2 |S| + occ)   [Cole et al., ’06]
String B-tree   [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays   [Ciriani et al., ’02]

Text mining
Lcp[1, N−1] = longest common prefix between suffixes adjacent in SA.

T = mississippi#
SA  = 12 11  8  5  2  1 10  9  7  4  6  3
Lcp =    0  1  1  4  0  0  1  0  2  1  3

• How long is the common prefix between T[i,…] and T[j,…]?
  It is the minimum of the subarray Lcp[h, k−1], where SA[h] = i and SA[k] = j.
• Does there exist a repeated substring of length ≥ L?
  Search for an entry Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times?
  Search for a window Lcp[i, i+C−2] whose entries are all ≥ L.

Slide 191

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

[Figure: Problem 1. Dictionary = {bzip, not, or, space} with tagged-Huffman codewords;
 S = “bzip or not bzip” is stored compressed as C(S). To find the occurrences of the dictionary
 word P = bzip, its codeword (1a 0b) is searched directly in C(S); positions are marked yes/no.]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
[Figure: pattern P slid along text T, comparing characters position by position]

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bit operations replace comparisons.
We will survey two examples of such methods:
 The Random Fingerprint method due to Karp and Rabin
 The Shift-And method due to Baeza-Yates and Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in order to obtain:
 An efficient randomized algorithm that makes an error with small probability.
 A randomized algorithm that never errs, whose running time is efficient with high probability.

We will consider a binary alphabet (i.e., T ∈ {0,1}^n).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.
Let s be a string of length m:

  H(s) = Σ_{i=1..m} 2^(m-i) * s[i]

P=0101
H(P) = 2^3*0 + 2^2*1 + 2^1*0 + 2^0*1 = 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1):

  H(Tr) = 2 * H(Tr-1) - 2^m * T[r-1] + T[r+m-1]

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H(T1) = H(1011) = 11
H(T2) = 2 * 11 - 2^4 * 1 + 0 = 22 - 16 + 0 = 6 = H(0110)

Arithmetic replaces Comparisons




A simple efficient algorithm:
 Compute H(P) and H(T1)
 Run over T, computing H(Tr) from H(Tr-1) in constant time,
 and making the comparisons (i.e., H(P) = H(Tr)).

Total running time O(n+m)?
 NO! Why? The problem is that when m is large, it is unreasonable to assume
 that each arithmetic operation can be done in O(1) time:
 values of H() are m-bit long numbers and, in general, are too BIG to fit in a machine word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
 1*2 (mod 7) + 0 = 2
 2*2 (mod 7) + 1 = 5
 5*2 (mod 7) + 1 = 4
 4*2 (mod 7) + 1 = 2
 2*2 (mod 7) + 1 = 5
 5 (mod 7) = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1):
 2^m (mod q) = 2 * (2^(m-1) (mod q)) (mod q)

Intermediate values are also small! (< 2q)
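A minimal sketch of the resulting matcher on the running example of the previous slides; q is fixed to 7 only to mirror the example, whereas a real implementation would pick a random prime.

// Karp-Rabin fingerprint matching over a binary text, with rolling update modulo q
#include <cstdio>
#include <string>
using namespace std;

int main() {
    string T = "10110101", P = "0101";
    long long q = 7, m = P.size(), n = T.size();

    long long hp = 0, ht = 0, pow = 1;                 // pow will hold 2^(m-1) mod q
    for (long long i = 0; i < m; i++) {
        hp = (2 * hp + (P[i] - '0')) % q;              // Hq(P), computed incrementally
        ht = (2 * ht + (T[i] - '0')) % q;              // Hq(T1)
        if (i < m - 1) pow = (2 * pow) % q;
    }
    for (long long i = 0; i + m <= n; i++) {           // window T[i .. i+m-1], 1-based position i+1
        if (ht == hp)                                  // equal fingerprints: probable occurrence,
            printf("probable occurrence at position %lld\n", i + 1);   // verify chars for a definite one
        if (i + m < n)                                 // slide the window: drop T[i], append T[i+m]
            ht = (2 * ((ht - pow * (T[i] - '0') % q + q) % q) + (T[i + m] - '0')) % q;
    }
}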

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I.
Pick a random prime q less than or equal to I, and compute P’s fingerprint Hq(P).
For each position r in T, compute Hq(Tr) and test whether it equals Hq(P). If the numbers are equal, either
 declare a probable match (randomized algorithm),
 or check and declare a definite match (deterministic algorithm).

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

[Figure: Problem 1 solved by compressed matching: the codeword of P = bzip is searched directly in
 C(S), S = “bzip or not bzip”; the tagged first byte guarantees byte-aligned matches (yes/no marks).]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

[Figure: the m x n matrix M for T = california and P = for. Column j holds the bits M(1..3, j);
 the only 1-entries are M(1,5) (f matches), M(2,6) (fo matches) and M(3,7) (for matches),
 so column 7 witnesses the occurrence of P ending at position 7.]

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size (e.g., 32 or 64 bits). We’ll assume m ≤ w.
NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to compute the j-th column of M from the (j-1)-th one.
We define the m-length binary vector U(x) for each character x in the alphabet:
U(x) is set to 1 for the positions in P where character x appears.

Example: P = abaac
 U(a) = (1,0,1,1,0)
 U(b) = (0,1,0,0,0)
 U(c) = (0,0,0,0,1)

How to construct M



Initialize column 0 of M to all zeros.
For j > 0, the j-th column is obtained by

  M(j) = BitShift( M(j-1) ) & U( T[j] )

For i > 1, entry M(i,j) = 1 iff
 (1) the first i-1 characters of P match the i-1 characters of T ending at character j-1  ⇔  M(i-1,j-1) = 1
 (2) P[i] = T[j]  ⇔  the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) into the i-th position;
AND-ing it with the i-th bit of U(T[j]) establishes whether both conditions hold.
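A minimal sketch of the resulting matcher, assuming m ≤ 64 so that a column of M fits in one uint64_t (bit i-1 of the word represents M(i,j)); an occurrence is reported when the m-th bit becomes 1.

// Shift-And matcher on the example of the following slides (P = abaac, T = xabxabaaca)
#include <cstdint>
#include <cstdio>
#include <string>
using namespace std;

int main() {
    string T = "xabxabaaca", P = "abaac";
    size_t m = P.size();

    uint64_t U[256] = {0};                       // U[x]: bit i-1 set iff P[i] == x
    for (size_t i = 0; i < m; i++) U[(unsigned char)P[i]] |= 1ULL << i;

    uint64_t M = 0;                              // column 0: all zeros
    for (size_t j = 0; j < T.size(); j++) {
        M = ((M << 1) | 1ULL) & U[(unsigned char)T[j]];   // BitShift, then AND with U(T[j])
        if (M & (1ULL << (m - 1)))                        // M(m,j) = 1: P ends at position j+1
            printf("occurrence ending at position %zu\n", j + 1);
    }
}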

An example (P = abaac, T = xabxabaaca)

[Figure: the columns M(1), M(2), M(3), ..., M(9) computed by the recurrence:
 j=1: T[1]=x, U(x)=(0,0,0,0,0), so M(1) = (0,0,0,0,0).
 j=2: T[2]=a, M(2) = BitShift(M(1)) & U(a) = (1,0,0,0,0).
 j=3: T[3]=b, M(3) = BitShift(M(2)) & U(b) = (0,1,0,0,0).
 ...
 j=9: T[9]=c, M(9) = BitShift(M(8)) & U(c) = (0,0,0,0,1): the 5th bit is 1, so P occurs ending at position 9.]

Shift-And method: Complexity








If m <= w, any column and vector U() fit in a memory word.
 → Any step requires O(1) time.
If m > w, any column and vector U() can be divided into m/w memory words.
 → Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when the pattern length is close to the word size,
as happens very often in practice. Recall that w = 64 bits in modern architectures.

Some simple extensions


We want to allow the pattern to contain special symbols, like [a-f] classes of chars.

P = [a-b]baac
 U(a) = (1,0,1,1,0)
 U(b) = (1,1,0,0,0)
 U(c) = (0,0,0,0,1)

What about ‘?’, ‘[^…]’ (not) ?

Problem 1: Another solution
Dictionary

P = bzip = 1a 0b

[Figure: Problem 1 solved with Shift-And: the codeword of P = bzip (1a 0b) is searched in C(S),
 S = “bzip or not bzip”, scanning the compressed bytes and marking candidate positions yes/no.]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

[Figure: Problem 2 on the same dictionary and C(S), with P = o. The dictionary terms containing P
 are “or” and “not”; their codewords (or = 1g 0a 0b, not = 1g 0g 0a) are then searched in C(S).]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: text T with the patterns P1 and P2 aligned at their occurrences]

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

 S is the concatenation of the patterns in P
 R is a bitmap of length m:
  R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of the Shift-And method searching for S:
 For any symbol c, U’(c) = U(c) AND R
  U’(c)[i] = 1 iff S[i] = c and it is the first symbol of a pattern
 For any step j,
  compute M(j)
  then M(j) OR U’(T[j]). Why?
   Set to 1 the first bit of each pattern that starts with T[j]
  Check if there are occurrences ending in j. How? (a sketch follows)
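A minimal sketch of this variant on a hypothetical pattern set {ab, ca}: R marks the first position of each pattern inside S, and an extra mask F (an assumption of the sketch, not on the slide) marks the last positions, so testing M & F answers the “occurrences ending in j” question above.

// Multi-pattern Shift-And sketch: S = concatenation of the patterns, total length <= 64
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>
using namespace std;

int main() {
    vector<string> patterns = {"ab", "ca"};        // hypothetical pattern set
    string T = "xabcab";

    string S; uint64_t R = 0, F = 0;               // start-bits and end-bits of each pattern in S
    for (auto &p : patterns) {
        R |= 1ULL << S.size();
        S += p;
        F |= 1ULL << (S.size() - 1);
    }
    uint64_t U[256] = {0};
    for (size_t i = 0; i < S.size(); i++) U[(unsigned char)S[i]] |= 1ULL << i;

    uint64_t M = 0;
    for (size_t j = 0; j < T.size(); j++) {
        uint64_t u = U[(unsigned char)T[j]];
        M = ((M << 1) & u) | (u & R);              // extend running matches, restart at pattern starts
        if (M & F)                                 // the last position of some pattern matched
            printf("some pattern ends at position %zu\n", j + 1);
    }
}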

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

P = bot, k = 2

[Figure: Problem 3 on the same dictionary and C(S), S = “bzip or not bzip”: search P = bot in the
 dictionary terms allowing at most k = 2 mismatches.]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal: given k, find all the occurrences of P in T with up to k mismatches.
We define the matrix M^l to be an m by n binary matrix such that:

 M^l(i,j) = 1 iff there are no more than l mismatches between the first i characters of P
 and the i characters of T ending at character j.

What is M^0?
How does M^k solve the k-mismatch problem?

Computing Mk






We compute M^l for all l = 0, …, k.
For each j compute M^0(j), M^1(j), …, M^k(j).
For all l initialize M^l(0) to the zero vector.
In order to compute M^l(j), we observe that there is a match iff …

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending at j-1, with at most l mismatches,
and the next pair of characters in P and T are equal.

  BitShift( M^l(j-1) ) & U( T[j] )

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending at j-1, with at most l-1 mismatches
(so P[i] may mismatch T[j], spending the l-th mismatch).

  BitShift( M^(l-1)(j-1) )

Computing Ml


We compute M^l for all l = 0, …, k.
For each j compute M^0(j), M^1(j), …, M^k(j).
For all l initialize M^l(0) to the zero vector.
In order to compute M^l(j), we observe that entry (i,j) is 1 iff case 1 or case 2 above holds:

  M^l(j) = [ BitShift( M^l(j-1) ) & U(T[j]) ]  OR  BitShift( M^(l-1)(j-1) )
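A minimal sketch of this recurrence, one 64-bit word per M^l column, run on the example of the next slide (P = abaad, T = xabxabaaca, k = 1); it reports the occurrence with ≤ 1 mismatch ending at position 9.

// k-mismatch Shift-And (Agrep) sketch
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>
using namespace std;

int main() {
    string T = "xabxabaaca", P = "abaad";
    int k = 1;                                     // allowed mismatches
    size_t m = P.size();

    uint64_t U[256] = {0};
    for (size_t i = 0; i < m; i++) U[(unsigned char)P[i]] |= 1ULL << i;

    vector<uint64_t> M(k + 1, 0);                  // M[l] = current column of M^l
    for (size_t j = 0; j < T.size(); j++) {
        uint64_t u = U[(unsigned char)T[j]];
        vector<uint64_t> Mnew(k + 1);
        Mnew[0] = ((M[0] << 1) | 1ULL) & u;                        // exact matching
        for (int l = 1; l <= k; l++)                               // match, or spend one mismatch here
            Mnew[l] = (((M[l] << 1) | 1ULL) & u) | ((M[l - 1] << 1) | 1ULL);
        M = Mnew;
        if (M[k] & (1ULL << (m - 1)))
            printf("occurrence with <= %d mismatches ending at position %zu\n", k, j + 1);
    }
}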

Example M^1 (P = abaad, T = xabxabaaca)

       j:  1 2 3 4 5 6 7 8 9 10
M^1 =  1:  1 1 1 1 1 1 1 1 1 1
       2:  0 0 1 0 0 1 0 1 1 0
       3:  0 0 0 1 0 0 1 0 0 1
       4:  0 0 0 0 1 0 0 1 0 0
       5:  0 0 0 0 0 0 0 0 1 0

M^0 =  1:  0 1 0 0 1 0 1 1 0 1
       2:  0 0 1 0 0 1 0 0 0 0
       3:  0 0 0 0 0 0 1 0 0 0
       4:  0 0 0 0 0 0 0 1 0 0
       5:  0 0 0 0 0 0 0 0 0 0

How much do we pay?





The running time is O(kn(1+m/w)).
Again, the method is practically efficient for small m.
Still, only O(k) columns of M are needed at any given time; hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

[Figure: Problem 3 solved with the k-mismatch Shift-And on the dictionary terms: with P = bot and
 k = 2, the term “not” matches (not = 1g 0g 0a), and its codeword is then searched in C(S),
 S = “bzip or not bzip”.]

Agrep: more sophisticated operations


The Shift-And method can solve other ops.

The edit distance between two strings p and s is d(p,s) = minimum number of operations
needed to transform p into s via three ops:
 Insertion: insert a symbol in p
 Deletion: delete a symbol from p
 Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding

  x > 0 and Length = ⌊log2 x⌋ + 1
  γ(x) = (Length - 1) zeroes, followed by x in binary
  e.g., 9 is represented as <000, 1001>.

  γ-code for x takes 2⌊log2 x⌋ + 1 bits
  (i.e. factor of 2 from optimal)

  Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of γ-coded integers, reconstruct the original sequence:

  0001000001100110000011101100111

  → 8, 6, 3, 59, 7
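A minimal sketch of γ-encoding and γ-decoding; running the decoder on the exercise string reproduces the sequence 8, 6, 3, 59, 7.

// Gamma-code sketch: (len-1) zeroes followed by x in binary
#include <cstdio>
#include <string>
#include <vector>
using namespace std;

string gammaEncode(unsigned x) {                  // requires x > 0
    string bin;
    for (unsigned v = x; v > 0; v >>= 1) bin = char('0' + (v & 1)) + bin;
    return string(bin.size() - 1, '0') + bin;     // (Length-1) zeroes, then x in binary
}

vector<unsigned> gammaDecode(const string &bits) {
    vector<unsigned> out;
    size_t i = 0;
    while (i < bits.size()) {
        size_t zeros = 0;
        while (bits[i] == '0') { zeros++; i++; }  // Length-1 in unary
        unsigned x = 0;
        for (size_t k = 0; k <= zeros; k++) x = 2 * x + (bits[i++] - '0');
        out.push_back(x);
    }
    return out;
}

int main() {
    for (unsigned x : gammaDecode("0001000001100110000011101100111"))
        printf("%u ", x);                         // prints: 8 6 3 59 7
    printf("\n%s\n", gammaEncode(9).c_str());     // prints: 0001001
}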

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).

Recall that: |γ(i)| ≤ 2 * log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(S) + 1
Key fact:
  1 ≥ Σ_{i=1..x} pi ≥ x * px   ⇒   x ≤ 1/px

How good is it ?
Encode the integers via γ-coding:  |γ(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):

  Σ_{i=1..|S|} pi * |γ(i)|  ≤  Σ_{i=1..|S|} pi * [ 2 * log (1/pi) + 1 ]  =  2 * H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2

A new concept: Continuers vs Stoppers
 Previously we used: s = c = 128

The main idea is:
 s + c = 256 (we are playing with 8 bits)
 Thus s items are encoded with 1 byte
 And s*c with 2 bytes, s*c² on 3 bytes, ...

An example




 5000 distinct words
 ETDC encodes 128 + 128² = 16512 words on at most 2 bytes
 (230,26)-dense code encodes 230 + 230*26 = 6210 words on at most 2 bytes,
 hence more words on 1 byte and thus better if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:
 Exploits temporal locality, and it is dynamic
 X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n² log n), MTF = O(n log n) + n²

Not much worse than Huffman
...but it may be far better
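A minimal sketch of the MTF transform on bytes, with a small hypothetical symbol list; positions are emitted 0-based.

// Move-to-Front coding sketch
#include <cstdio>
#include <string>
#include <vector>
using namespace std;

int main() {
    string input = "aabbbba";
    vector<char> L = {'a', 'b', 'c', 'd'};        // initial symbol list

    for (char s : input) {
        int pos = 0;
        while (L[pos] != s) pos++;                // 1) output the position of s in L
        printf("%d ", pos);
        L.erase(L.begin() + pos);                 // 2) move s to the front of L
        L.insert(L.begin(), s);
    }
    printf("\n");                                 // prints: 0 0 1 0 0 0 1
}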

MTF: how good is it ?
Encode the integers via γ-coding:  |γ(i)| ≤ 2 * log i + 1
Put S in the front and consider the cost of encoding:

  O(|S| log |S|) + Σ_{x=1..|S|} Σ_{i=2..nx} |γ( p_i^x - p_{i-1}^x )|

(where p_1^x < p_2^x < ... are the positions of the nx occurrences of symbol x)

By Jensen’s inequality:

  ≤ O(|S| log |S|) + Σ_{x=1..|S|} nx * [ 2 * log (N/nx) + 1 ]
  = O(|S| log |S|) + N * [ 2 * H0(X) + 1 ]

  La[mtf] ≤ 2 * H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep the MTF-list efficiently:

 Search tree
  Leaves contain the symbols, ordered as in the MTF-list
  Nodes contain the size of their descending subtree
 Hash Table
  key is a symbol
  data is a pointer to the corresponding tree leaf

Each tree operation takes O(log |S|) time
Total is O(n log |S|), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  ⇒  just numbers and one bit
Properties:
 Exploits spatial locality, and it is a dynamic code
 There is a memory
 X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = Θ(n² log n) > Rle(X) = Θ(n log n)
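A minimal sketch of run-length encoding, reproducing the example above.

// Run-Length Encoding sketch
#include <cstdio>
#include <string>
using namespace std;

int main() {
    string s = "abbbaacccca";
    for (size_t i = 0; i < s.size(); ) {
        size_t j = i;
        while (j < s.size() && s[j] == s[i]) j++;     // extend the current run
        printf("(%c,%zu) ", s[i], j - i);             // emit (symbol, run length)
        i = j;
    }
    printf("\n");                                     // prints: (a,1) (b,3) (a,2) (c,4) (a,1)
}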

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.  p(a) = .2, p(b) = .5, p(c) = .3

  f(i) = Σ_{j=1..i-1} p(j)    ⇒    f(a) = .0, f(b) = .2, f(c) = .7

[Figure: the unit interval [0,1) split into a = [.0,.2), b = [.2,.7), c = [.7,1.0)]

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
[Figure: coding bac. Start with [0,1); symbol b narrows it to [.2,.7); symbol a narrows it to
 [.2,.3); symbol c narrows it to [.27,.3).]

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c1 c2 ... cn with probabilities p[c], use the following:

  l0 = 0      li = li-1 + si-1 * f[ci]
  s0 = 1      si = si-1 * p[ci]

f[c] is the cumulative prob. up to symbol c (not included).
Final interval size is

  sn = Π_{i=1..n} p[ci]

The interval for a message sequence will be called the sequence interval
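A minimal sketch of the interval update (real-valued, so only for illustration; an actual coder uses the integer version described later), run on the message bac of the running example.

// Sequence-interval computation for arithmetic coding
#include <cstdio>
#include <map>
#include <string>
using namespace std;

int main() {
    map<char,double> p = {{'a',0.2},{'b',0.5},{'c',0.3}};
    map<char,double> f = {{'a',0.0},{'b',0.2},{'c',0.7}};   // cumulative prob., symbol excluded

    string msg = "bac";
    double l = 0.0, s = 1.0;                                 // l0 = 0, s0 = 1
    for (char c : msg) {
        l = l + s * f[c];                                    // li = li-1 + si-1 * f[ci]
        s = s * p[c];                                        // si = si-1 * p[ci]
        printf("after '%c': interval [%.4f, %.4f)\n", c, l, l + s);
    }
    // final interval: [0.2700, 0.3000), as in the encoding example
}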

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
[Figure: decoding .49. In [0,1), .49 falls in b’s interval [.2,.7) → output b; within [.2,.7),
 .49 falls in b’s sub-interval [.3,.55) → output b; within [.3,.55), .49 falls in c’s sub-interval
 [.475,.55) → output c.]

The message is bbc.

Representing a real number
Binary fractional representation:

  .75 = .11      1/3 = .0101...      11/16 = .1011

Algorithm:
 1. x = 2*x
 2. If x < 1 output 0
 3. else x = x - 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
e.g. [0,.33) → .01    [.33,.66) → .1    [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions.

  code    min     max     interval
  .11     .110    .111    [.75, 1.0)
  .101    .1010   .1011   [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R = 2^k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l ≥ R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

[Figure: the Arithmetic ToolBox (ATB) as a state machine: given the current interval (L,s), the
 distribution (p1,...,p|S|) and the next symbol c, it outputs the new interval (L',s').]

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
[Figure: PPM feeds the ATB with p[s | context], where s is either the next char c or esc;
 the ATB maps (L,s) to (L',s').]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
String = ACCBACCACBA B      k = 2

Context: Empty       Context (length 1)        Context (length 2)
 A = 4                A:  C = 3   $ = 1         AC:  B = 1  C = 2  $ = 2
 B = 2                B:  A = 2   $ = 1         BA:  C = 1  $ = 1
 C = 5                C:  A = 1   B = 2         CA:  C = 1  $ = 1
 $ = 3                    C = 2   $ = 3         CB:  A = 2  $ = 1
                                                CC:  A = 1  B = 1  $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
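A minimal sketch of LZW decoding, including the special SSc case; it uses the standard byte codes 0..255 (so 'a' = 97 rather than the slide's illustrative 112), and the code sequence in main() is the one an LZW encoder would emit for the string aabaac.

// LZW decoding sketch: the decoder's dictionary lags one step behind the encoder's
#include <cstdio>
#include <string>
#include <vector>
using namespace std;

string lzwDecode(const vector<int> &codes) {
    vector<string> dict(256);
    for (int i = 0; i < 256; i++) dict[i] = string(1, (char)i);

    string out, prev;
    for (int code : codes) {
        string entry;
        if (code < (int)dict.size()) entry = dict[code];
        else entry = prev + prev[0];               // special case: code not yet in the dictionary (SSc)
        out += entry;
        if (!prev.empty())
            dict.push_back(prev + entry[0]);       // add the entry the encoder added one step earlier
        prev = entry;
    }
    return out;
}

int main() {
    vector<int> codes = {97, 97, 98, 256, 99};     // 'a'=97, 'b'=98, 'c'=99
    printf("%s\n", lzwDecode(codes).c_str());      // prints: aabaac
}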

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
We are given the text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: the L → F mapping

[Figure: the sorted-rotations matrix again, with first column F = # i i i i m p p s s s s and
 last column L = i p s s m # p i s s i i; the middle of each row is unknown to the decoder.]

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
[Figure: the same matrix, showing only the known columns F = # i i i i m p p s s s s and
 L = i p s s m # p i s s i i.]

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
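A minimal, runnable version of InvertBWT on the mississippi# example: LF is computed by counting character occurrences (equal characters keep their relative order, as argued above), and T is rebuilt backward.

// BWT inversion sketch (0-based indices, '#' used as the smallest end-marker)
#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>
using namespace std;

int main() {
    string L = "ipssm#pissii";                  // BWT(mississippi#): last column of the sorted rotations
    int n = (int)L.size();

    // C[c] = position in F (the sorted column) of the first occurrence of character c
    vector<int> cnt(256, 0), C(256, 0);
    for (char c : L) cnt[(unsigned char)c]++;
    for (int c = 1; c < 256; c++) C[c] = C[c - 1] + cnt[c - 1];

    // LF[i] = C[L[i]] + number of occurrences of L[i] in L[0..i-1]
    vector<int> LF(n), seen(256, 0);
    for (int i = 0; i < n; i++)
        LF[i] = C[(unsigned char)L[i]] + seen[(unsigned char)L[i]]++;

    // Row 0 starts with the end-marker, so F[0] = '#' is T's last char and L[0] precedes it in T
    string T(n, ' ');
    T[n - 1] = *min_element(L.begin(), L.end());
    int r = 0;
    for (int i = n - 2; i >= 0; i--) {          // reconstruct T backward
        T[i] = L[r];
        r = LF[r];
    }
    printf("%s\n", T.c_str());                  // prints: mississippi#
}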

How to compute the BWT ?
SA = 12 11 8 5 2 1 10 9 7 4 6 3

[Figure: the BWT matrix, i.e. the sorted rotations of T = mississippi#, whose rows start at the
 suffix positions listed in SA; the last column is L = i p s s m # p i s s i i.]

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA = 12 11 8 5 2 1 10 9 7 4 6 3

[Figure: the suffixes of T = mississippi# in lexicographic order: #, i#, ippi#, issippi#,
 ississippi#, mississippi#, pi#, ppi#, sippi#, sissippi#, ssippi#, ssissippi#]

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pr that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution

Altavista crawl, 1999 — WebBase Crawl 2001
Indegree follows a power law distribution:

  Pr[ in-degree(u) = k ]  ∝  1/k^α ,   α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pr that a node has x links is 1/x^α, α ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 million pages, 150 million links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:
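A minimal sketch of the gap transformation on a hypothetical successor list of node 15: thanks to locality most gaps are small, and only the first one may be negative.

// Gap encoding of an adjacency (successor) list: S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}
#include <cstdio>
#include <vector>
using namespace std;

int main() {
    int x = 15;                                                  // source node
    vector<int> succ = {13, 15, 16, 17, 18, 19, 23, 24, 203};    // hypothetical successor list
    vector<int> gaps;
    gaps.push_back(succ[0] - x);                                 // first gap: may be negative
    for (size_t i = 1; i < succ.size(); i++)
        gaps.push_back(succ[i] - succ[i - 1] - 1);               // consecutive successors give 0-gaps
    for (int g : gaps) printf("%d ", g);                         // prints: -2 1 0 0 0 0 3 0 178
    printf("\n");
}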

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides and efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
            Emacs size   Emacs time
  uncompr   27Mb         ---
  gzip      8Mb          35 secs
  zdelta    1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: example weighted graph for the Min Branching reduction: nodes are the files plus the
 dummy node 0; edge weights (e.g. 20, 123, 220, 620, 2000) are zdelta sizes, and the dummy
 node's edges carry the gzip sizes.]

            space   time
  uncompr   30Mb    ---
  tgz       20%     linear
  THIS      8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta executions. Nonetheless, strictly n² time.

            space    time
  uncompr   260Mb    ---
  tgz       12%      2 mins
  THIS      8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

            gcc size   emacs size
  total     27288      27326
  gzip      7563       8577
  zdelta    227        1431
  rsync     964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Figure: the suffix tree of T# = mississippi#, with edge labels such as #, i, p, s, si, ssi,
 ppi#, pi#, i#, mississippi#, and leaves storing the starting positions 1..12 of the suffixes.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Θ(N²) space if the suffixes are stored explicitly.

  SA     = 12 11 8 5 2 1 10 9 7 4 6 3
  SUF(T) = #, i#, ippi#, issippi#, ississippi#, mississippi#, pi#, ppi#, sippi#, sissippi#, ssippi#, ssissippi#

T = mississippi#        P = si

Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison (2 accesses per step).

[Figure: binary search for P = si on the SA of T = mississippi#: P is compared with the suffix
 pointed to by the middle entry of the current SA range; if P is larger the search goes right,
 if P is smaller it goes left.]

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

Improvable to O(p + log2 N) [Manber-Myers, ’90] and to O(p + log2 |S|) [Cole et al, ’06]
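A minimal sketch of the O(p log n) search, using std::lower_bound/upper_bound over a naively built suffix array of mississippi#; the two bounds delimit the contiguous SA range of the suffixes prefixed by P (Prop 1).

// Suffix-array pattern search sketch
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <string>
#include <vector>
using namespace std;

int main() {
    string T = "mississippi#", P = "si";
    int n = (int)T.size();

    vector<int> SA(n);
    iota(SA.begin(), SA.end(), 0);                   // naive SA construction: sort suffix start positions
    sort(SA.begin(), SA.end(), [&](int a, int b) { return T.compare(a, n, T, b, n) < 0; });

    // all suffixes prefixed by P form a contiguous SA range
    auto lo = lower_bound(SA.begin(), SA.end(), P,
                          [&](int s, const string &p) { return T.compare(s, p.size(), p) < 0; });
    auto hi = upper_bound(SA.begin(), SA.end(), P,
                          [&](const string &p, int s) { return T.compare(s, p.size(), p) > 0; });
    for (auto it = lo; it != hi; ++it)
        printf("occurrence at position %d\n", *it + 1);    // prints the occurrences at positions 7 and 4
}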

Locating the occurrences
[Figure: the SA range for P = si contains occ = 2 entries, pointing to the occurrences at
 positions 4 and 7 of T = mississippi#.]
Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

  Lcp = 0 0 1 4 0 0 1 0 2 1 3
  SA  = 12 11 8 5 2 1 10 9 7 4 6 3

T = mississippi#
(e.g. the entry 4 is the lcp between the adjacent suffixes issippi# and ississippi#)
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does it exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does it exist a substring of length ≥ L occurring ≥ C times ?
• Search for Lcp[i,i+C-2] whose entries are ≥ L


code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching

We show methods in which Arithmetic and Bit-operations replace comparisons.
We will survey two examples of such methods:
- The Random Fingerprint method due to Karp and Rabin
- The Shift-And method due to Baeza-Yates and Gonnet

Rabin-Karp Fingerprint

We will use a class of functions from strings to integers in order to obtain:
- An efficient randomized algorithm that makes an error with small probability.
- A randomized algorithm that never errs, whose running time is efficient with high probability.

We will consider a binary alphabet (i.e., T ∈ {0,1}^n).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H(s) = Σ_{i=1..m} 2^(m-i) · s[i]

P = 0101
H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H(T_r) = 2·H(T_{r-1}) − 2^m·T[r-1] + T[r+m-1]

T = 10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H(T1) = H(1011) = 11
H(T2) = 2·11 − 2⁴·1 + 0 = 22 − 16 + 0 = 6 = H(0110)

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
(1·2 (mod 7)) + 0 = 2
(2·2 (mod 7)) + 1 = 5
(5·2 (mod 7)) + 1 = 4
(4·2 (mod 7)) + 1 = 2
(2·2 (mod 7)) + 1 = 5
5 (mod 7) = 5 = Hq(P)

We can still compute Hq(T_r) from Hq(T_{r-1}), since
2^m (mod q) = 2·(2^{m-1} (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
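A minimal Python sketch of the randomized Karp-Rabin scan described above; the prime q is fixed here only for illustration, whereas the algorithm would pick a random prime below the chosen bound I:

def karp_rabin(T, P, q=2**31 - 1):
    # T, P: strings over {0,1}; returns the positions r (1-based) where Hq(Tr) = Hq(P).
    n, m = len(T), len(P)
    if m > n:
        return []
    pow_m = pow(2, m - 1, q)                 # 2^(m-1) mod q, weight of the bit leaving the window
    hp = ht = 0
    for i in range(m):
        hp = (2 * hp + int(P[i])) % q
        ht = (2 * ht + int(T[i])) % q
    matches = []
    for r in range(1, n - m + 2):            # window T[r, r+m-1]
        if ht == hp:
            matches.append(r)                # probable match (verify here for the deterministic variant)
        if r <= n - m:
            ht = (2 * (ht - int(T[r - 1]) * pow_m) + int(T[r + m - 1])) % q
    return matches

print(karp_rabin("10110101", "0101"))        # -> [5], as in the example above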

Problem 1: Solution
Dictionary = { bzip, not, or, space }
P = bzip = 1a 0b
S = “bzip or not bzip”
[Figure: scan C(S) with an exact string-matching algorithm, comparing the codeword of P against the byte-aligned codewords of S; each candidate position is answered yes/no.]
Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

[Figure: the m×n matrix M for T = california and P = for. Row i, column j holds 1 exactly when P[1…i] = T[j-i+1…j]; e.g. M(1,5) = 1 because the prefix “f” ends at T[5], and M(3,7) = 1 because an occurrence of “for” ends at position 7.]

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is the bit-wise AND between A and B.
BitShift(A) is the value derived by shifting A's bits down by one position and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to compute the j-th column of M from the (j-1)-th one.
We define the m-length binary vector U(x) for each character x in the alphabet. U(x) is set to 1 for the positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros.
For j > 0, the j-th column is obtained by

M(j) = BitShift(M(j-1)) & U(T[j])


For i > 1, entry M(i,j) = 1 iff
(1) the first i-1 characters of P match the i-1 characters of T ending at character j-1   ⇔ M(i-1,j-1) = 1
(2) P[i] = T[j]   ⇔ the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) into the i-th position;
AND-ing it with the i-th bit of U(T[j]) establishes whether both conditions hold.
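A small Python sketch of this recurrence, packing each column of M into an integer (bit i-1 of the integer stands for row i); it reports the positions where an occurrence of P ends. This is an illustrative rendering of the formula above, not the slide's own code:

def shift_and(T, P):
    m = len(P)
    U = {}
    for i, x in enumerate(P):                 # U(x): bit i-1 set iff P[i] = x
        U[x] = U.get(x, 0) | (1 << i)
    full = 1 << (m - 1)                       # the bit corresponding to row m
    col = 0                                   # column 0 of M: all zeros
    ends = []
    for j, c in enumerate(T, 1):
        # M(j) = BitShift(M(j-1)) & U(T[j]); BitShift sets the first bit to 1
        col = ((col << 1) | 1) & U.get(c, 0)
        if col & full:
            ends.append(j)                    # an occurrence of P ends at position j
    return ends

print(shift_and("xabxabaaca", "abaac"))       # -> [9], matching the j=9 example below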

An example j=1     (T = xabxabaaca, P = abaac)
U(x) = (0,0,0,0,0)ᵀ
M(1) = BitShift(M(0)) & U(T[1]) = (1,0,0,0,0)ᵀ & (0,0,0,0,0)ᵀ = (0,0,0,0,0)ᵀ

An example j=2
U(a) = (1,0,1,1,0)ᵀ
M(2) = BitShift(M(1)) & U(T[2]) = (1,0,0,0,0)ᵀ & (1,0,1,1,0)ᵀ = (1,0,0,0,0)ᵀ

An example j=3
U(b) = (0,1,0,0,0)ᵀ
M(3) = BitShift(M(2)) & U(T[3]) = (1,1,0,0,0)ᵀ & (0,1,0,0,0)ᵀ = (0,1,0,0,0)ᵀ

An example j=9
U(c) = (0,0,0,0,1)ᵀ
M(9) = BitShift(M(8)) & U(T[9]) = (1,1,0,0,1)ᵀ & (0,0,0,0,1)ᵀ = (0,0,0,0,1)ᵀ
M(5,9) = 1: an occurrence of P = abaac ends at position 9 of T.

Shift-And method: Complexity

- If m ≤ w, any column and any vector U() fit in a memory word.
  → Any step requires O(1) time.
- If m > w, any column and any vector U() can be divided into ⌈m/w⌉ memory words.
  → Any step requires O(m/w) time.
- Overall O(n(1+m/w)+m) time.
- Thus it is very fast when the pattern length is close to the word size, which is very often the case in practice. Recall that w = 64 bits in modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: Another solution
Dictionary = { bzip, not, or, space }
P = bzip = 1a 0b
S = “bzip or not bzip”
[Figure: run the Shift-And scan directly over C(S), matching the bytes of P's codeword; candidate positions are answered yes/no.]
Speed ≈ Compression ratio

Problem 2
Dictionary = { bzip, not, or, space }
Given a pattern P, find all the occurrences in S of all terms containing P as substring.
P = o
S = “bzip or not bzip”
not = 1g 0g 0a
or = 1g 0a 0b
[Figure: both dictionary words containing “o” are then searched in C(S).]
Speed ≈ Compression ratio? No! Why?
A scan of C(S) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: a text T with the occurrences of the patterns P1 and P2 highlighted.]

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

- S is the concatenation of the patterns in P
- R is a bitmap of length m
  - R[i] = 1 iff S[i] is the first symbol of a pattern
- Use a variant of the Shift-And method searching for S:
  - For any symbol c, U'(c) = U(c) AND R
    → U'(c)[i] = 1 iff S[i] = c and it is the first symbol of a pattern
  - For any step j:
    - compute M(j), then M(j) OR U'(T[j]). Why?
      → it sets to 1 the first bit of each pattern that starts with T[j]
    - check if there are occurrences ending in j. How?

Problem 3
Dictionary = { bzip, not, or, space }
Given a pattern P, find all the occurrences in S of all terms containing P as substring, allowing at most k mismatches.
P = bot, k = 2
S = “bzip or not bzip”
[Figure: the dictionary words with their codewords and the compressed text C(S).]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal: given k, find all the occurrences of P in T with up to k mismatches.
We define the matrix M^l to be an m by n binary matrix such that:

M^l(i,j) = 1 iff there are no more than l mismatches between the first i characters of P and the i characters of T ending at character j.

What is M^0?
How does M^k solve the k-mismatch problem?

Computing M^k

- We compute M^l for all l = 0, …, k.
- For each j compute M^0(j), M^1(j), …, M^k(j)
- For all l initialize M^l(0) to the zero vector.
- In order to compute M^l(j), we observe that there is a match iff one of the two cases below holds.

Computing M^l: case 1

The first i-1 characters of P match a substring of T ending at j-1 with at most l mismatches, and the next pair of characters in P and T are equal.

This case contributes   BitShift(M^l(j-1)) & U(T[j])

Computing M^l: case 2

The first i-1 characters of P match a substring of T ending at j-1 with at most l-1 mismatches (the i-th pair may mismatch).

This case contributes   BitShift(M^(l-1)(j-1))

Computing M^l

- We compute M^l for all l = 0, …, k.
- For each j compute M^0(j), M^1(j), …, M^k(j)
- For all l initialize M^l(0) to the zero vector.
- In order to compute M^l(j), we observe that there is a match iff

M^l(j) = [ BitShift(M^l(j-1)) & U(T[j]) ]  OR  BitShift(M^(l-1)(j-1))
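A small Python sketch of this k-mismatch recurrence, again packing each column into an integer; it is an illustrative rendering of the formula above, not the slide's code:

def agrep_mismatches(T, P, k):
    m = len(P)
    U = {}
    for i, x in enumerate(P):
        U[x] = U.get(x, 0) | (1 << i)
    full = 1 << (m - 1)
    cols = [0] * (k + 1)                      # M^0(0), ..., M^k(0): zero vectors
    ends = []
    for j, c in enumerate(T, 1):
        prev = cols[:]                        # the columns at step j-1
        cols[0] = ((prev[0] << 1) | 1) & U.get(c, 0)   # exact Shift-And step
        for l in range(1, k + 1):
            # M^l(j) = [BitShift(M^l(j-1)) & U(T[j])]  OR  BitShift(M^(l-1)(j-1))
            cols[l] = (((prev[l] << 1) | 1) & U.get(c, 0)) | ((prev[l - 1] << 1) | 1)
        if cols[k] & full:
            ends.append(j)                    # P occurs with <= k mismatches ending at j
    return ends

print(agrep_mismatches("xabxabaaca", "abaad", 1))   # -> [9], matching the M^1 example below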

Example M^1     (T = xabxabaaca, P = abaad)
[Figure: the 5×10 matrices M^0 and M^1. M^0 has no full column match, while M^1(5,9) = 1: P occurs with at most one mismatch ending at position 9 of T.]

How much do we pay?

- The running time is O(k·n·(1+m/w))
- Again, the method is practically efficient for small m.
- Still, only O(k) columns of M are needed at any given time. Hence the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary = { bzip, not, or, space }
Given a pattern P, find all the occurrences in S of all terms containing P as substring, allowing k mismatches.
P = bot, k = 2
S = “bzip or not bzip”
not = 1g 0g 0a
[Figure: the Agrep scan over C(S) reports the codewords within k mismatches of P's encoding.]

Agrep: more sophisticated operations

The Shift-And method can solve other ops.

The edit distance between two strings p and s is d(p,s) = minimum number of operations needed to transform p into s via three ops:
- Insertion: insert a symbol in p
- Deletion: delete a symbol from p
- Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding (the “g-code”)

γ(x) = 000...........0 followed by x in binary, i.e. (Length − 1) zeros and then the binary representation of x,
where x > 0 and Length = ⌊log2 x⌋ + 1.
e.g., 9 is represented as <000,1001>.

The γ-code for x takes 2⌊log2 x⌋ + 1 bits (i.e. a factor of 2 from optimal).

It is optimal for Pr(x) = 1/(2x²), and i.i.d. integers.

It is a prefix-free encoding…

Given the following sequence of γ-coded integers, reconstruct the original sequence:

0001000001100110000011101100111

Answer: 8, 6, 3, 59, 7
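A minimal Python sketch of γ-encoding and decoding as defined above:

def gamma_encode(x):
    # (Length-1) zeros followed by x in binary, Length = floor(log2 x) + 1
    assert x > 0
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":                       # count the unary part
            zeros += 1
            i += 1
        out.append(int(bits[i:i + zeros + 1], 2))   # read Length = zeros+1 bits
        i += zeros + 1
    return out

s = "".join(gamma_encode(x) for x in [8, 6, 3, 59, 7])
print(s)                   # 0001000001100110000011101100111
print(gamma_decode(s))     # [8, 6, 3, 59, 7]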

Analysis
Sort the p_i in decreasing order, and encode symbol s_i via the variable-length code γ(i).

Recall that: |γ(i)| ≤ 2·log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2·H0(s) + 1
Key fact:
1 ≥ Σ_{i=1,...,x} p_i ≥ x·p_x   →   x ≤ 1/p_x

How good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1
The cost of the encoding is (recall i ≤ 1/p_i):

Σ_{i=1,...,|S|} p_i · |γ(i)|  ≤  Σ_{i=1,...,|S|} p_i · [2·log(1/p_i) + 1]  =  2·H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding

Byte-aligned and tagged Huffman:
- 128-ary Huffman tree
- First bit of the first byte is tagged
- Configurations on 7 bits: just those of Huffman

End-tagged dense code:
- The rank r is mapped to the r-th binary sequence on 7·k bits
- First bit of the last byte is tagged

A better encoding: surprising changes
- It is a prefix-code
- Better compression: it uses all the 7-bit configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2

A new concept: Continuers vs Stoppers
- Previously we used: s = c = 128

The main idea is:
- s + c = 256 (we are playing with 8 bits)
- Thus s items are encoded with 1 byte
- And s·c with 2 bytes, s·c² with 3 bytes, ...

An example
- 5000 distinct words
- ETDC encodes 128 + 128² = 16512 words on 2 bytes
- A (230,26)-dense code encodes 230 + 230·26 = 6210 on 2 bytes, hence more on 1 byte and thus, if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 256−s.
- Brute-force approach
- Binary search: on real distributions, it seems there is one unique minimum
  (K_s = max codeword length; F_s^k = cumulative probability of the symbols whose |cw| ≤ k)

Experiments: (s,c)-DC is quite interesting…
Search is 6% faster than byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move-to-Front Coding
Transforms a char sequence into an integer sequence, which can then be var-length coded.
- Start with the list of symbols L=[a,b,c,d,…]
- For each input symbol s
  1) output the position of s in L
  2) move s to the front of L

There is a memory
Properties:
- Exploits temporal locality, and it is dynamic
- X = 1ⁿ 2ⁿ 3ⁿ … nⁿ  →  Huff = O(n² log n), MTF = O(n log n) + n²
Not much worse than Huffman ...but it may be far better (a small sketch of the transform follows).
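A minimal Python sketch of the MTF transform just described (the initial symbol list is passed in explicitly to keep the example self-contained):

def mtf_encode(text, alphabet):
    L = list(alphabet)                 # e.g. ['a','b','c','d',...]
    out = []
    for s in text:
        i = L.index(s)                 # 1) output the position of s in L (0-based here)
        out.append(i)
        L.pop(i); L.insert(0, s)       # 2) move s to the front of L
    return out

def mtf_decode(codes, alphabet):
    L = list(alphabet)
    out = []
    for i in codes:
        s = L[i]
        out.append(s)
        L.pop(i); L.insert(0, s)
    return "".join(out)

codes = mtf_encode("abcddcbamnopponm", "abcdmnop")
print(mtf_decode(codes, "abcdmnop") == "abcddcbamnopponm")   # True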

MTF: how good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1
Put the |S| symbols at the front (initial MTF-list) and consider the cost of encoding:

O(|S| log |S|) + Σ_{x=1..|S|} Σ_{i=2..n_x} |γ(p_i^x − p_{i-1}^x)|

where p_1^x < p_2^x < … are the positions of the occurrences of symbol x.
By Jensen's inequality:

≤ O(|S| log |S|) + Σ_{x=1..|S|} n_x · [2·log(N/n_x) + 1]
=  O(|S| log |S|) + N·[2·H0(X) + 1]

Hence L_a[mtf] ≤ 2·H0(X) + O(1).

MTF: higher compression
To achieve higher compression we consider words (and separators) as the symbols to be encoded.
How to keep the MTF-list efficiently:
- Search tree
  - Leaves contain the symbols, ordered as in the MTF-list
  - Nodes contain the size of their descending subtree
- Hash Table
  - key is a symbol
  - data is a pointer to the corresponding tree leaf

Each tree operation takes O(log |S|) time.
Total is O(n log |S|), where n = #symbols to be compressed.

Run-Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings → just the run lengths and one starting bit.
There is a memory
Properties:
- Exploits spatial locality, and it is a dynamic code
- X = 1ⁿ 2ⁿ 3ⁿ … nⁿ  →  Huff(X) = Θ(n² log n) > Rle(X) = Θ(n log n)
A tiny sketch follows.
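A tiny Python sketch of run-length encoding as in the example above:

from itertools import groupby

def rle(s):
    # abbbaacccca -> [('a',1), ('b',3), ('a',2), ('c',4), ('a',1)]
    return [(ch, len(list(g))) for ch, g in groupby(s)]

print(rle("abbbaacccca"))   # [('a', 1), ('b', 3), ('a', 2), ('c', 4), ('a', 1)]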

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive), e.g. with p(a) = .2, p(b) = .5, p(c) = .3:

f(i) = Σ_{j=1..i-1} p(j)

a = .2  →  [0.0, 0.2)
b = .5  →  [0.2, 0.7)
c = .3  →  [0.7, 1.0)

so f(a) = .0, f(b) = .2, f(c) = .7.

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7)).

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
[Figure: start from [0,1); after b the interval becomes [0.2,0.7); after a it becomes [0.2,0.3); after c it becomes [0.27,0.3).]
The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities p[c] use the following:

l_0 = 0,   l_i = l_{i-1} + s_{i-1} · f[c_i]
s_0 = 1,   s_i = s_{i-1} · p[c_i]

f[c] is the cumulative prob. up to symbol c (not included).
Final interval size is

s_n = Π_{i=1..n} p[c_i]

The interval for a message sequence will be called the sequence interval.
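A short Python sketch of these recurrences; plain floats are used only for illustration, whereas a real coder would use the integer version discussed later:

def sequence_interval(msg, p):
    # p: dict symbol -> probability; f[c] = cumulative prob of the symbols before c
    syms = sorted(p)                      # fix an order, e.g. a < b < c
    f, acc = {}, 0.0
    for c in syms:
        f[c] = acc
        acc += p[c]
    l, s = 0.0, 1.0                       # l_0 = 0, s_0 = 1
    for c in msg:
        l = l + s * f[c]                  # l_i = l_{i-1} + s_{i-1} * f[c_i]
        s = s * p[c]                      # s_i = s_{i-1} * p[c_i]
    return l, l + s                       # the sequence interval [l, l+s)

print(sequence_interval("bac", {"a": .2, "b": .5, "c": .3}))   # ~ (0.27, 0.3)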

Uniquely defining an interval
Important property: the intervals for distinct messages of length n will never overlap.
Therefore, specifying any number in the final interval uniquely determines the message.
Decoding is similar to encoding, but at each step we need to determine what the message value is and then reduce the interval.

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:
[Figure: .49 falls in b's interval [.2,.7); within it, .49 falls in b's sub-interval [.3,.55); within that, .49 falls in c's sub-interval [.475,.55).]
The message is bbc.

Representing a real number
Binary fractional representation:
.75 = .11      1/3 = .0101...      11/16 = .1011

Algorithm
1. x = 2·x
2. If x < 1 output 0
3. else x = x − 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
e.g. [0,.33) → .01      [.33,.66) → .1      [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions.

  code    min      max      interval
  .11     .110     .111     [.75, 1.0)
  .101    .1010    .1011    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval is contained in the sequence interval (a dyadic number).

[Figure: sequence interval [.61, .79); the code interval of .101 is [.625, .75), which is contained in it.]

Can use L + s/2 truncated to 1 + ⌈log (1/s)⌉ bits

Bound on Arithmetic length
Note that −log s + 1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most
1 + ⌈log (1/s)⌉ = 1 + ⌈log Π_i (1/p_i)⌉
≤ 2 + Σ_{j=1,n} log (1/p_j)
= 2 + Σ_{k=1,|S|} n·p_k · log (1/p_k)
= 2 + n·H0 bits

nH0 + 0.02·n bits in practice, because of rounding.

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
  Output 1 followed by m 0s; m = 0; the message interval is expanded by 2
If u < R/2 then (bottom half)
  Output 0 followed by m 1s; m = 0; the message interval is expanded by 2
If l ≥ R/4 and u < 3R/4 then (middle half)
  Increment m; the message interval is expanded by 2
All other cases: just continue...

You find this at

Arithmetic ToolBox
As a state machine
[Figure: the arithmetic coder as a state machine ATB: given the current interval (L,s), the distribution (p1,....,pS) and the next symbol c, ATB outputs the new interval (L',s') = (L + s·f(c), s·p(c)).]
Therefore, even the distribution can change over time.

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
[Figure: PPM drives the ATB state machine with the conditional distribution p[s|context], where s is either the next character c or the escape symbol esc; (L,s) is mapped to (L',s') as before.]
Encoder and Decoder must know the protocol for selecting the same conditional probability distribution (PPM-variant).

PPM: Example Contexts     (String = ACCBACCACBA B, k = 2)

  Empty context          Order-1 contexts            Order-2 contexts
  A = 4                  A:  C = 3, $ = 1            AC: B = 1, C = 2, $ = 2
  B = 2                  B:  A = 2, $ = 1            BA: C = 1, $ = 1
  C = 5                  C:  A = 1, B = 2,           CA: C = 1, $ = 1
  $ = 3                      C = 2, $ = 3            CB: A = 2, $ = 1
                                                     CC: A = 1, B = 1, $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
[Figure: the dictionary is the text already scanned, to the left of the cursor; all substrings starting there may be copied. In the example the cursor emits <2,3,c>.]

Algorithm's step:
- Output <d, len, c>, where
  d = distance of the copied string wrt the current position
  len = length of the longest match
  c = next char in the text beyond the longest match
- Advance by len + 1

A buffer “window” has fixed length and moves.
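A small Python sketch of one greedy LZ77 parse with a sliding window; this is a simplification for illustration (brute-force matching, overlapping copies allowed), and with W = 6 it reproduces the parse shown in the window example below:

def lz77_parse(T, W=6):
    # Emits triples (d, len, c): copy `len` chars from `d` positions back, then literal c.
    out, i, n = [], 0, len(T)
    while i < n:
        best_d, best_len = 0, 0
        lo = max(0, i - W)                       # window of the last W positions
        for s in range(lo, i):                   # candidate start of the copy
            l = 0
            while i + l < n - 1 and T[s + l] == T[i + l]:
                l += 1                           # extending past i (overlap) is allowed
            if l > best_len:
                best_d, best_len = i - s, l
        out.append((best_d, best_len, T[i + best_len]))
        i += best_len + 1                        # advance by len + 1
    return out

print(lz77_parse("aacaacabcabaaac"))
# [(0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), (3, 3, 'a'), (1, 2, 'c')]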

Example: LZ77 with window     (window size = 6; longest match within W, then the next character)

a a c a a c a b c a b a a a c   →  (0,0,a)
a a c a a c a b c a b a a a c   →  (1,1,c)
a a c a a c a b c a b a a a c   →  (3,4,b)
a a c a a c a b c a b a a a c   →  (3,3,a)
a a c a a c a b c a b a a a c   →  (1,2,c)

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use a shorter match so that the next match is better
Hash Table to speed up searches on triplets
Triples are coded with Huffman's code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
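A minimal Python sketch of the LZW encoder described above; the dictionary is seeded with the single characters of the input instead of the full 256 ASCII entries, purely to keep the example small, and the decoder's special case is not shown:

def lzw_encode(T):
    # Seed the dictionary with the single characters of T (stand-in for the 256 ASCII entries).
    dic = {c: i for i, c in enumerate(sorted(set(T)))}
    out, S = [], ""
    for c in T:
        if S + c in dic:
            S = S + c                       # extend the current match
        else:
            out.append(dic[S])              # output the id of S
            dic[S + c] = len(dic)           # add Sc to the dictionary, but do NOT emit c
            S = c
    out.append(dic[S])
    return out

print(lzw_encode("aabaacababacb"))
# -> [0, 0, 1, 3, 2, 4, 8, 2, 1], mirroring the slide's 112,112,113,256,114,257,261,114,... with this small dictionary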

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform     (1994)
Take the text T = mississippi# and form all its cyclic rotations:

mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows:

#mississippi
i#mississipp
ippi#mississ
issippi#miss
ississippi#m
mississippi#
pi#mississip
ppi#mississi
sippi#missis
sissippi#mis
ssippi#missi
ssissippi#mi

F = first column = # i i i i m p p s s s s
L = last column  = i p s s m # p i s s i i

A famous example
Much longer...

A useful tool: L → F mapping
[Figure: the same BWT matrix, with the known columns F and L and the unknown text in between.]

How do we map L's chars onto F's chars ?
... we need to distinguish equal chars in F...

Take two equal chars of L and rotate their rows rightward by one: they keep the same relative order in F !!

The BWT is invertible
[Figure: the BWT matrix again, with the unknown text and the known columns F and L.]

Two key properties:
1. The LF-array maps L's chars to F's chars
2. L[ i ] precedes F[ i ] in T

Reconstruct T backward:
T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
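A compact Python sketch of both directions, under the simplifying assumption that T ends with a unique character '#' smaller than every other character; the forward transform sorts the rotations naively, as in the "elegant but inefficient" construction below:

def bwt(T):
    # Build the matrix of cyclic rotations, sort it, take the last column.
    n = len(T)
    rotations = sorted(T[i:] + T[:i] for i in range(n))
    return "".join(r[-1] for r in rotations)

def ibwt(L):
    # LF-mapping: equal chars keep their relative order between L and F (= sorted L).
    n = len(L)
    order = sorted(range(n), key=lambda i: (L[i], i))
    LF = [0] * n
    for f, i in enumerate(order):
        LF[i] = f
    out, r = [], 0                 # row 0 of the sorted matrix starts with '#'
    for _ in range(n - 1):
        out.append(L[r])           # L[r] precedes F[r] in T
        r = LF[r]
    return "".join(reversed(out)) + "#"

L = bwt("mississippi#")
print(L)           # ipssm#pissii
print(ibwt(L))     # mississippi#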

How to compute the BWT ?
[Figure: the BWT matrix side by side with the suffix array SA = 12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3 of T = mississippi#, and the column L = i p s s m # p i s s i i.]

We said that: L[i] precedes F[i] in T.
Given SA and T, we have L[i] = T[SA[i]-1]   (e.g. L[3] = T[SA[3]-1] = T[7])

How to construct SA from T ?
Input: T = mississippi#

  SA   suffix
  12   #
  11   i#
   8   ippi#
   5   issippi#
   2   ississippi#
   1   mississippi#
  10   pi#
   9   ppi#
   7   sippi#
   4   sissippi#
   6   ssippi#
   3   ssissippi#

Elegant but inefficient.
Obvious inefficiencies:
- Θ(n² log n) time in the worst case
- Θ(n log n) cache misses or I/O faults
Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web's Characteristics

Size
- 1 trillion pages are available (Google, 7/08)
- 5-40K per page => hundreds of terabytes
- Size grows every day!!

Change
- 8% new pages, 25% new links change weekly
- Lifetime of about 10 days

The Bow Tie

Some definitions

Weakly connected components (WCC)
- Set of nodes such that from any node you can go to any other node via an undirected path.

Strongly connected components (SCC)
- Set of nodes such that from any node you can go to any other node via a directed path.

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?

- Largest artifact ever conceived by humans
- Exploit the structure of the Web for
  - Crawl strategies
  - Search
  - Spam detection
  - Discovering communities on the web
  - Classification/organization
- Predict the evolution of the Web
  - Sociological understanding

Many other large graphs…

- Physical network graph
  - V = routers
  - E = communication links

- The “cosine” graph (undirected, weighted)
  - V = static web pages
  - E = semantic distance between pages

- Query-Log graph (bipartite, weighted)
  - V = queries and URLs
  - E = (q,u) if u is a result for q, and has been clicked by some user who issued q

- Social graph (undirected, unweighted)
  - V = users
  - E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)
- V = URLs, E = (u,v) if u has a hyperlink to v
- Isolated URLs are ignored (no IN & no OUT)

Three key properties:
- Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution
[Figure: indegree distributions from the Altavista crawl (1999) and the WebBase crawl (2001); the indegree follows a power-law distribution]

Pr[in-degree(u) = k]  ∝  1/k^α,   α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)
- V = URLs, E = (u,v) if u has a hyperlink to v
- Isolated URLs are ignored (no IN, no OUT)

Three key properties:
- Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1
- Locality: usually most of the hyperlinks point to other URLs on the same host (about 80%).
- Similarity: pages close in lexicographic order tend to share many outgoing lists.

A Picture of the Web Graph: 21 million pages, 150 million links

URL-sorting (e.g. Berkeley, Stanford, ...)

URL compression + Delta encoding

The library WebGraph

From the uncompressed adjacency list to an adjacency list with compressed gaps (locality):
Successor list S(x) = {s1−x, s2−s1−1, ..., sk−sk-1−1}
For negative entries:

Copy-lists (similarity), with reference chains possibly limited:
each bit of y's copy-list tells whether the corresponding successor of the reference x is also a successor of y;
the reference index is chosen in [0,W] as the one that gives the best compression.

Copy-blocks = RLE(Copy-list):
- The first copy-block is 0 if the copy-list starts with 0;
- The last block is omitted (we know the length…);
- The length is decremented by one for all blocks.

This is a Java and C++ lib (≈3 bits/edge).

Extra-nodes: Compressing Intervals
From the adjacency list with copy-blocks, exploit consecutivity in the extra-nodes:
- Intervals: use their left extreme and length
- Interval length: decremented by Lmin = 2
- Residuals: differences between residuals, or w.r.t. the source

Example (from the figure):
0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques

- Common knowledge between sender & receiver
  - Unstructured file: delta compression
- “Partial” knowledge
  - Unstructured files: file synchronization
  - Record-based data: set reconciliation

Formalization

- Delta compression   [diff, zdelta, REBL,…]
  - Compress file f deploying file f'
  - Compress a group of files
  - Speed up web access by sending differences between the requested page and the ones available in cache

- File synchronization   [rsync, zsync]
  - Client updates old file f_old with f_new available on a server
  - Mirroring, Shared Crawling, Content Distribution Networks

- Set reconciliation
  - Client updates structured old file f_old with f_new available on a server
  - Update of contacts or appointments, intersect inverted lists (IL) in a P2P search engine

Z-delta compression     (one-to-one)

Problem: We have two files f_known and f_new and the goal is to compute a file f_d of minimum size such that f_new can be derived from f_known and f_d.
- Assume that block moves and copies are allowed
- Find an optimal covering set of f_new based on f_known
- The LZ77-scheme provides an efficient, optimal solution
  - f_known is the “previously encoded text”: compress f_known·f_new starting from f_new
- zdelta is one of the best implementations

             Emacs size   Emacs time
  uncompr    27Mb         ---
  gzip       8Mb          35 secs
  zdelta     1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: a pair of proxies located on each side of the slow link use a proprietary protocol to increase performance over this link.

[Figure: Client — request/reference — slow link with delta-encoding — proxy — request/reference — fast link — web page.]

Use zdelta to reduce traffic:
- Old version available at both proxies
- Restricted to pages already visited (30% hits), URL-prefix match
- Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F.
- Useful on a dynamic collection of web pages, back-ups, …
- Apply pairwise zdelta: find for each f ∈ F a good reference
- Reduction to the Min Branching problem on DAGs
  - Build a weighted graph G_F: nodes = files, weights = zdelta-sizes
  - Insert a dummy node connected to all nodes, whose edge weights are the gzip-coding sizes
  - Compute the min branching = directed spanning tree of minimum total cost, covering G's nodes

[Figure: a small example graph with the dummy node 0 and edge weights such as 620, 2000, 220, 123, 20.]

             space   time
  uncompr    30Mb    ---
  tgz        20%     linear
  THIS       8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly: n² edge calculations (zdelta executions).
We wish to exploit some pruning approach:
- Collection analysis: Cluster the files that appear similar and thus are good candidates for zdelta-compression. Build a sparse weighted graph G'_F containing only edges between those pairs of files.
- Assign weights: Estimate appropriate edge weights for G'_F, thus saving zdelta executions. Nonetheless, strictly n² time.

             space   time
  uncompr    260Mb   ---
  tgz        12%     2 mins
  THIS       8%      16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
[Figure: the client sends the block hashes of f_old to the server; the server answers with an encoded file (copy instructions for matched blocks plus literals) from which the client rebuilds f_new.]

The rsync algorithm (contd)
- simple, widely used, single roundtrip
- optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
- choice of block size problematic (default: max{700, √n} bytes)
- not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

            gcc size   emacs size
  total     27288      27326
  gzip      7563       8577
  zdelta    227        1431
  rsync     964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends the hashes (unlike the client in rsync), and the client checks them.
The server deploys the common f_ref to compress the new f_tar (rsync just compresses it on its own).

A multi-round protocol
- k blocks of n/k elements
- log(n/k) levels
- If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
- The communication complexity is O(k lg n lg(n/k)) bits.

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
T# = mississippi#
     1 2 3 4 5 6 7 8 9 10 11 12
[Figure: the suffix tree of T#. Edges are labeled with substrings (e.g. “i”, “s”, “p”, “mississippi#”, “ssi”, “ppi#”, “pi#”, “i#”, “#”) and the 12 leaves carry the starting positions of the corresponding suffixes.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

Storing SUF(T) explicitly takes Θ(N²) space; we store only the suffix pointers.

T = mississippi#        SA = 12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3
(the i-th entry of SA points to the i-th suffix of T in lexicographic order)

Suffix Array
- SA: Θ(N log2 N) bits
- Text T: N chars
→ In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step.

P = si,  T = mississippi#
[Figure: binary search over SA = 12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3; at each step the compared suffix tells whether P is larger or smaller, halving the SA range.]

Suffix Array search
- O(log2 N) binary-search steps
- Each step takes O(p) char comparisons
→ overall, O(p log2 N) time
  [improvable to O(p + log2 N) with Lcp information, Manber-Myers '90, and to O(p + log2 |S|) with Suffix Trays, Cole et al. '06]
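A small Python sketch of the indirect binary search over SA (a plain lower/upper-bound search; the O(p + log N) refinements with Lcp are not shown, and the explicit suffix list is kept only to make the sketch short, at the price of Θ(N²) space):

from bisect import bisect_left, bisect_right

def sa_build(T):
    # Naive construction, as in the "elegant but inefficient" slide (1-based positions).
    return sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])

def sa_search(T, SA, P):
    # Binary search for the SA range of suffixes having P as a prefix.
    suffixes = [T[i - 1:] for i in SA]
    lo = bisect_left(suffixes, P)
    hi = bisect_right(suffixes, P + "\uffff")   # '\uffff' plays the role of a char larger than any in T
    return sorted(SA[lo:hi])                    # starting positions of the occurrences

T = "mississippi#"
SA = sa_build(T)
print(SA)                                       # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(sa_search(T, SA, "si"))                   # [4, 7]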

Locating the occurrences
[Figure: T = mississippi#; the SA range for P = si spans the suffixes sippi... and sissippi..., i.e. positions 7 and 4, so occ = 2. The range can be delimited by searching for P extended with # (smaller than any char of S) and with $ (larger than any char of S).]

Suffix Array search
- O(p + log2 N + occ) time

Suffix Trays: O(p + log2 |S| + occ)   [Cole et al., '06]
String B-tree   [Ferragina-Grossi, '95]
Self-adjusting Suffix Arrays   [Ciriani et al., '02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

[Figure: the Lcp array aligned with SA for T = mississippi#; e.g. the adjacent suffixes issippi# and ississippi# (starting at 5 and 2) share a prefix of length 4.]

• How long is the common prefix between T[i,...] and T[j,...] ?
  • Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  • Search for Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  • Search for a window Lcp[i,i+C-2] whose entries are all ≥ L

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1):

  H(Tr) = 2·H(Tr−1) − 2^m·T[r−1] + T[r+m−1]

T = 10110101
T1 = 1 0 1 1,   T2 = 0 1 1 0

H(T1) = H(1011) = 11
H(T2) = H(0110) = 2·11 − 2⁴·1 + 0 = 22 − 16 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
 Compute H(P) and H(T1)
 Run over T: compute H(Tr) from H(Tr−1) in constant time, and make the comparisons (i.e., H(P) = H(Tr)).

Total running time O(n+m)?

NO! Why?
The problem is that when m is large, it is unreasonable to assume that each arithmetic operation can be done in O(1) time: values of H() are m-bit long numbers and, in general, they are too BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
  (1·2 (mod 7)) + 0 = 2
  (2·2 (mod 7)) + 1 = 5
  (5·2 (mod 7)) + 1 = 4
  (4·2 (mod 7)) + 1 = 2
  (2·2 (mod 7)) + 1 = 5
  5 (mod 7) = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1), since
  2^m (mod q) = 2·(2^(m−1) (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that
 q is small enough to keep computations efficient (i.e., the Hq() values fit in a machine word)
 q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




 Choose a positive integer I
 Pick a random prime q less than or equal to I, and compute P’s fingerprint Hq(P).
 For each position r in T, compute Hq(Tr) and test to see if it equals Hq(P). If the numbers are equal, either
  declare a probable match (randomized algorithm),
  or check and declare a definite match (deterministic algorithm).

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
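
A minimal Python sketch of the randomized variant over the binary alphabet (the prime q is taken as a parameter instead of being drawn at random, and the verification step of the deterministic variant is omitted):

def karp_rabin(T, P, q):
    # T, P: strings over {0,1};  q: a prime modulus
    n, m = len(T), len(P)
    if m > n:
        return []
    def H(s):                              # Hq(s), computed incrementally as in the example
        h = 0
        for c in s:
            h = (2 * h + int(c)) % q
        return h
    hp, ht = H(P), H(T[:m])
    top = pow(2, m - 1, q)                 # 2^(m-1) (mod q), used to drop the leading bit
    matches = []
    for r in range(n - m + 1):
        if ht == hp:
            matches.append(r)              # probable match at (0-based) position r
        if r + m < n:                      # slide the window: drop T[r], append T[r+m]
            ht = (2 * (ht - int(T[r]) * top) + int(T[r + m])) % q
    return matches

For instance, karp_rabin("10110101", "0101", 7) reports position 4 (0-based), i.e. T5 in the slides’ 1-based notation.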

Problem 1: Solution
Dictionary = {bzip, not, or, space}

P = bzip = 1a 0b

(figure: scan C(S), S = “bzip or not bzip”, comparing the codeword of P against each tagged codeword; the answer is yes/no at each word of S.)

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

        c a l i f o r n i a
   f    0 0 0 0 1 0 0 0 0 0
   o    0 0 0 0 0 1 0 0 0 0
   r    0 0 0 0 0 0 1 0 0 0

How does M solve the exact match problem?

How to construct M


We want to exploit bit-parallelism to compute the j-th column of M from the (j−1)-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit bit-parallelism to compute the j-th column of M from the (j−1)-th one.
We define the m-length binary vector U(x) for each character x in the alphabet: U(x) is set to 1 for the positions in P where character x appears.

Example: P = abaac
  U(a) = (1,0,1,1,0)ᵀ    U(b) = (0,1,0,0,0)ᵀ    U(c) = (0,0,0,0,1)ᵀ

How to construct M



Initialize column 0 of M to all zeros.
For j > 0, the j-th column is obtained by

  M(j) = BitShift( M(j−1) ) & U( T[j] )

For i > 1, entry M(i,j) = 1 iff
 (1) the first i−1 characters of P match the i−1 characters of T ending at character j−1   ⇔ M(i−1,j−1) = 1
 (2) P[i] = T[j]   ⇔ the i-th bit of U(T[j]) = 1

BitShift moves bit M(i−1,j−1) into the i-th position;
AND-ing it with the i-th bit of U(T[j]) establishes whether both conditions are true.

An example, j=1          T = xabxabaaca,  P = abaac

U(x) = (0,0,0,0,0)ᵀ, hence
  M(1) = BitShift(M(0)) & U(T[1]) = (1,0,0,0,0)ᵀ & (0,0,0,0,0)ᵀ = (0,0,0,0,0)ᵀ

An example, j=2          T = xabxabaaca,  P = abaac

U(a) = (1,0,1,1,0)ᵀ, hence
  M(2) = BitShift(M(1)) & U(T[2]) = (1,0,0,0,0)ᵀ & (1,0,1,1,0)ᵀ = (1,0,0,0,0)ᵀ

An example, j=3          T = xabxabaaca,  P = abaac

U(b) = (0,1,0,0,0)ᵀ, hence
  M(3) = BitShift(M(2)) & U(T[3]) = (1,1,0,0,0)ᵀ & (0,1,0,0,0)ᵀ = (0,1,0,0,0)ᵀ

An example, j=9          T = xabxabaaca,  P = abaac

U(c) = (0,0,0,0,1)ᵀ, hence
  M(9) = BitShift(M(8)) & U(T[9]) = (1,1,0,0,1)ᵀ & (0,0,0,0,1)ᵀ = (0,0,0,0,1)ᵀ
The last (5-th) bit of column 9 is 1: P occurs in T ending at position 9.

Shift-And method: Complexity








 If m ≤ w, any column and any vector U() fit in a memory word.
  Any step requires O(1) time.
 If m > w, any column and any vector U() can be divided into m/w memory words.
  Any step requires O(m/w) time.
 Overall O(n(1+m/w)+m) time.
 Thus, it is very fast when the pattern length is close to the word size.
  Very often in practice. Recall that w = 64 bits in modern architectures.
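
A compact Python sketch of the method, with each column of M packed into an integer (bit i−1 of the integer is row i of the column):

def shift_and(T, P):
    m = len(P)
    U = {}                                 # U[x]: bitmask of the positions of x in P
    for i, x in enumerate(P):
        U[x] = U.get(x, 0) | (1 << i)
    col, occ = 0, []                       # col = column M(j-1), packed in an integer
    for j, t in enumerate(T):
        col = ((col << 1) | 1) & U.get(t, 0)   # M(j) = BitShift(M(j-1)) & U(T[j])
        if col & (1 << (m - 1)):               # last bit set: P ends at position j
            occ.append(j - m + 2)              # report the starting position (1-based)
    return occ

For the running example, shift_and("xabxabaaca", "abaac") returns [5]: the occurrence of abaac starting at position 5 and ending at position 9.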

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: Another solution
Dictionary = {bzip, not, or, space}

P = bzip = 1a 0b

(figure: the codeword of P is searched directly inside C(S), S = “bzip or not bzip”; the tag bits guarantee that only matches aligned to codeword boundaries are reported.)

Speed ≈ Compression ratio

Problem 2
Dictionary = {bzip, not, or, space}

Given a pattern P, find all the occurrences in S of all terms containing P as a substring.

P = o

(figure: the dictionary terms containing “o” are “or” and “not”; their codewords, or = 1g 0a 0b and not = 1g 0g 0a, are then searched in C(S), S = “bzip or not bzip”.)

Speed ≈ Compression ratio? No! Why?
A scan of C(S) is needed for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
(figure: the patterns P1 and P2 aligned against the text T)

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

 S is the concatenation of the patterns in P
 R is a bitmap of length m
  R[i] = 1 iff S[i] is the first symbol of a pattern
 Use a variant of the Shift-And method searching for S
  For any symbol c, U’(c) = U(c) AND R
   U’(c)[i] = 1 iff S[i] = c and it is the first symbol of a pattern
  For any step j,
   compute M(j)
   then M(j) OR U’(T[j]). Why?
    It sets to 1 the first bit of each pattern that starts with T[j]
   Check if there are occurrences ending in j. How?

Problem 3
Dictionary = {bzip, not, or, space}

Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing at most k mismatches.

P = bot,  k = 2

(figure: the dictionary, the compressed text C(S) of S = “bzip or not bzip”, and the pattern P to be searched with at most k mismatches.)

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position 4; it also occurs with 4 mismatches starting at position 2.

  aatatccacaa
     atcgaa        (2 mismatches, starting at position 4)

  aatatccacaa
   atcgaa          (4 mismatches, starting at position 2)

Agrep




Our current goal: given k, find all the occurrences of P in T with up to k mismatches.
We define the matrix M^l to be an m by n binary matrix, such that:

  M^l(i,j) = 1 iff there are no more than l mismatches between the first i characters of P and the i characters of T ending at character j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

(figure: P[1..i−1] aligned against the substring of T ending at j−1, with ≤ l mismatches, followed by the equal pair P[i] = T[j])

  BitShift( M^l(j−1) ) & U( T[j] )

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

(figure: P[1..i−1] aligned against the substring of T ending at j−1, with ≤ l−1 mismatches; the pair P[i], T[j] may mismatch)

  BitShift( M^(l−1)(j−1) )

Computing Ml


 We compute M^l for all l = 0, …, k.
 For each j compute M^0(j), M^1(j), …, M^k(j)
 For all l initialize M^l(0) to the zero vector.
 In order to compute M^l(j), we observe that there is a match iff case 1 or case 2 holds:

  M^l(j) = [ BitShift( M^l(j−1) ) & U(T[j]) ]  OR  BitShift( M^(l−1)(j−1) )
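
A Python sketch of the recurrence, keeping the k+1 current columns packed into integers (as in the Shift-And sketch above):

def agrep(T, P, k):
    m = len(P)
    U = {}
    for i, x in enumerate(P):
        U[x] = U.get(x, 0) | (1 << i)
    M = [0] * (k + 1)                      # M[l] = column M^l(j-1)
    occ = []
    for j, t in enumerate(T):
        prev = M[:]
        M[0] = ((prev[0] << 1) | 1) & U.get(t, 0)          # exact Shift-And
        for l in range(1, k + 1):
            # case 1: extend with an equal pair;  case 2: spend one more mismatch
            M[l] = (((prev[l] << 1) | 1) & U.get(t, 0)) | ((prev[l - 1] << 1) | 1)
        if M[k] & (1 << (m - 1)):          # P ends at position j with <= k mismatches
            occ.append(j - m + 2)          # 1-based starting position
    return occ

For the earlier example, agrep("aatatccacaa", "atcgaa", 2) reports the occurrence starting at position 4.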

Example M1
T = xabxabaaca,   P = abaad

M1 =     1  2  3  4  5  6  7  8  9  10
    1    1  1  1  1  1  1  1  1  1  1
    2    0  0  1  0  0  1  0  1  1  0
    3    0  0  0  1  0  0  1  0  0  1
    4    0  0  0  0  1  0  0  1  0  0
    5    0  0  0  0  0  0  0  0  1  0

M0 =     1  2  3  4  5  6  7  8  9  10
    1    0  1  0  0  1  0  1  1  0  1
    2    0  0  1  0  0  1  0  0  0  0
    3    0  0  0  0  0  0  1  0  0  0
    4    0  0  0  0  0  0  0  1  0  0
    5    0  0  0  0  0  0  0  0  0  0

How much do we pay?





The running time is O( k·n·(1+m/w) ).
Again, the method is practically efficient for small m.
Still, only O(k) columns of M are needed at any given time. Hence, the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary = {bzip, not, or, space}

Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing k mismatches.

P = bot,  k = 2

(figure: “not” matches P = bot with one mismatch, so its codeword not = 1g 0g 0a is searched in C(S), S = “bzip or not bzip”.)

Agrep: more sophisticated operations


The Shift-And method can solve other operations too.

The edit distance between two strings p and s is d(p,s) = the minimum number of operations needed to transform p into s via three ops:
 Insertion: insert a symbol in p
 Deletion: delete a symbol from p
 Substitution: change a symbol of p into a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts on some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding

  γ(x) = 000...0 x-in-binary          (Length−1 zeros, then the binary representation of x)

 x > 0 and Length = ⌊log2 x⌋ + 1
 e.g., 9 is represented as <000,1001>.
 The γ-code for x takes 2⌊log2 x⌋ + 1 bits
  (i.e. a factor of 2 from optimal)
 Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…

Given the following sequence of γ-coded integers, reconstruct the original sequence:
  0001000001100110000011101100111
  Answer: 8, 6, 3, 59, 7
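
A small sketch of the code in Python:

def gamma_encode(x):
    b = bin(x)[2:]                         # binary representation of x, without leading zeros
    return "0" * (len(b) - 1) + b          # Length-1 zeros, then x in binary

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":              # count the leading zeros = Length - 1
            z += 1; i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out

gamma_encode(9) gives "0001001", and gamma_decode of the exercise string above gives [8, 6, 3, 59, 7].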

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).

Recall that: |γ(i)| ≤ 2·log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2·H0(S) + 1
Key fact:
  1 ≥ Σi=1,...,x pi ≥ x·px   ⟹   x ≤ 1/px

How good is it?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1
The cost of the encoding is (recall i ≤ 1/pi):

  Σi=1,...,|S| pi·|γ(i)|  ≤  Σi=1,...,|S| pi·[ 2·log(1/pi) + 1 ]  =  2·H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + …

A better encoding


 Byte-aligned and tagged Huffman
  128-ary Huffman tree
  The first bit of the first byte is the tag
  Configurations on 7 bits: just those of Huffman

 End-tagged dense code
  The rank r is mapped to the r-th binary sequence on 7·k bits
  The first bit of the last byte is the tag

A better encoding: surprising changes
 It is a prefix-code
 Better compression: it uses all the 7-bit configurations

(s,c)-dense codes
The distribution of words is skewed: 1/i^q, where 1 < q < 2

A new concept: Continuers vs Stoppers
 Previously we used: s = c = 128

The main idea is:
 s + c = 256 (we are playing with 8 bits)
 Thus s items are encoded with 1 byte
 And s·c with 2 bytes, s·c² with 3 bytes, ...

An example
 5000 distinct words
 ETDC encodes 128 + 128² = 16512 words on ≤ 2 bytes
 A (230,26)-dense code encodes 230 + 230·26 = 6210 words on ≤ 2 bytes, hence more on 1 byte, and thus it wins if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


 Brute-force approach
 Binary search:
  On real distributions, it seems there is a unique minimum
 Ks = max codeword length
 Fs,k = cumulative probability of the symbols whose |codeword| ≤ k

Experiments: (s,c)-DC is quite interesting…
 Search is 6% faster than byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:
 It exploits temporal locality, and it is dynamic
 X = 1^n 2^n 3^n … n^n  ⟹  Huff = Θ(n² log n), MTF = O(n log n) + n²

Not much worse than Huffman ...but it may be far better
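
A minimal sketch of the MTF coder/decoder in Python (positions are 0-based here, while the slides may count from 1):

def mtf_encode(text, alphabet):
    L = list(alphabet)
    out = []
    for s in text:
        i = L.index(s)                 # 1) output the position of s in L
        out.append(i)
        L.pop(i); L.insert(0, s)       # 2) move s to the front of L
    return out

def mtf_decode(codes, alphabet):
    L = list(alphabet)
    out = []
    for i in codes:
        s = L[i]
        out.append(s)
        L.pop(i); L.insert(0, s)
    return "".join(out)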

MTF: how good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1
Put S at the front and consider the cost of encoding:

  O(|S| log |S|)  +  Σx=1,...,|S|  Σi=2,...,nx  |γ( pi(x) − pi−1(x) )|

where pi(x) is the position of the i-th occurrence of symbol x, and nx its frequency. By Jensen’s:

  ≤  O(|S| log |S|)  +  Σx=1,...,|S|  nx · [ 2·log(N/nx) + 1 ]
  =  O(|S| log |S|)  +  N · [ 2·H0(X) + 1 ]

  La[mtf]  ≤  2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep the MTF-list efficiently:
 Search tree
  Leaves contain the symbols, ordered as in the MTF-list
  Nodes contain the size of their descending subtree
 Hash Table
  key is a symbol
  data is a pointer to the corresponding tree leaf

 Each tree operation takes O(log |S|) time
 Total is O(n log |S|), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In the case of binary strings ⟹ just the run lengths and one bit
Properties:
 It exploits spatial locality, and it is a dynamic code
 There is a memory
 X = 1^n 2^n 3^n … n^n  ⟹  Huff(X) = Θ(n² log n) > Rle(X) = Θ(n log n)
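
A tiny Python sketch of the transform:

def rle_encode(s):
    out = []
    for c in s:
        if out and out[-1][0] == c:
            out[-1] = (c, out[-1][1] + 1)      # extend the current run
        else:
            out.append((c, 1))                 # start a new run
    return out

rle_encode("abbbaacccca") gives [(a,1),(b,3),(a,2),(c,4),(a,1)], as in the example above; the run lengths can then be γ-coded.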

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.  p(a) = .2,  p(b) = .5,  p(c) = .3

  f(i) = Σj=1,...,i−1 p(j)

  a = .2  →  [0.0, 0.2)
  b = .5  →  [0.2, 0.7)
  c = .3  →  [0.7, 1.0)

  f(a) = .0,  f(b) = .2,  f(c) = .7

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
(figure: nested intervals)
  start:            [0, 1)
  after b (p=.5):   [0.2, 0.7)
  after a (p=.2):   [0.2, 0.3)
  after c (p=.3):   [0.27, 0.3)

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c1 c2 … cn with probabilities p[c], use the following:

  l0 = 0,   li = li−1 + si−1 · f[ci]
  s0 = 1,   si = si−1 · p[ci]

f[c] is the cumulative probability up to symbol c (not included).
The final interval size is

  sn = Πi=1,...,n p[ci]

The interval for a message sequence will be called the sequence interval.
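
A float-based sketch of the recurrence (the integer version later on avoids the precision issues of real arithmetic):

def sequence_interval(msg, p, f):
    l, s = 0.0, 1.0
    for c in msg:
        l, s = l + s * f[c], s * p[c]      # l_i = l_{i-1} + s_{i-1}*f[c_i];  s_i = s_{i-1}*p[c_i]
    return l, l + s                        # the sequence interval [l, l+s)

With p = {'a': .2, 'b': .5, 'c': .3} and f = {'a': .0, 'b': .2, 'c': .7}, sequence_interval("bac", p, f) gives (0.27, 0.3), up to floating-point rounding.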

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
(figure: nested intervals)
  .49 ∈ [.2, .7)    = the symbol interval of b   →  b
  .49 ∈ [.3, .55)   = b’s sub-interval of [.2,.7)   →  b
  .49 ∈ [.475, .55) = c’s sub-interval of [.3,.55)  →  c

The message is bbc.

Representing a real number
Binary fractional representation:
  .75 = .11      1/3 = .0101…      11/16 = .1011

Algorithm
 1. x = 2·x
 2. If x < 1 output 0
 3. else x = x − 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
  e.g. [0,.33) → .01      [.33,.66) → .1      [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
  code    min       max       interval
  .11     .110…     .111…     [.75, 1.0)
  .101    .1010…    .1011…    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

(figure: a sequence interval [.61, .79) containing the code interval [.625, .75) of the number .101)

Can use L + s/2 truncated to 1 + ⌈log (1/s)⌉ bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most
  1 + ⌈ log (1/sn) ⌉
  = 1 + ⌈ log Πi (1/pi) ⌉
  ≤ 2 + Σi=1,...,n log (1/pi)
  = 2 + Σk=1,...,|S| n·pk·log (1/pk)
  = 2 + n·H0   bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
The problem is that operations on arbitrary-precision real numbers are expensive.
Key ideas of the integer version:
 Keep integers in the range [0..R) where R = 2^k
 Use rounding to generate the integer interval
 Whenever the sequence interval falls into the top, bottom or middle half, expand the interval by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
 If l ≥ R/2 (top half):
  Output 1 followed by m 0s; set m = 0; the message interval is expanded by 2
 If u < R/2 (bottom half):
  Output 0 followed by m 1s; set m = 0; the message interval is expanded by 2
 If l ≥ R/4 and u < 3R/4 (middle half):
  Increment m; the message interval is expanded by 2
 In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine

(figure: the ToolBox as a state machine: given the current pair (L,s), the next symbol c, and the distribution (p1,…,p|S|), ATB returns the new pair (L’,s’).)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
(figure: the same state machine driven by the conditional distribution p[ s | context ], where s = c or esc: ATB maps (L,s) to (L’,s’).)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts          String = ACCBACCACBA B,   k = 2

Order 0                 Order 1                      Order 2
Context  Counts         Context  Counts              Context  Counts
(empty)  A = 4          A        C = 3, $ = 1        AC       B = 1, C = 2, $ = 2
         B = 2          B        A = 2, $ = 1        BA       C = 1, $ = 1
         C = 5          C        A = 1, B = 2,       CA       C = 1, $ = 1
         $ = 3                   C = 2, $ = 3        CB       A = 2, $ = 1
                                                     CC       A = 1, B = 1, $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit frequency estimation

LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:
 Output <d, len, c> where
  d = distance of the copied string wrt the current position
  len = length of the longest match
  c = next char in the text beyond the longest match
 Advance by len + 1

A buffer “window” has fixed length and moves
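
A greedy Python sketch of the parsing step just described (the window size W is a parameter; matches are allowed to overlap the cursor, which matters for decoding, see below):

def lz77_encode(T, W):
    out, i, n = [], 0, len(T)
    while i < n:
        best_d, best_len = 0, 0
        for j in range(max(0, i - W), i):              # candidate copy positions in the window
            l = 0
            while i + l < n - 1 and T[j + l] == T[i + l]:
                l += 1
            if l > best_len:
                best_d, best_len = i - j, l
        out.append((best_d, best_len, T[i + best_len]))  # <d, len, next char>
        i += best_len + 1
    return out

For instance, lz77_encode("aacaacabcabaaac", 6) produces the five triples of the example that follows.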

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6     (each triple reports the longest match within W, followed by the next character)

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if len > d? (overlap with the text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids
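
A sketch of the coding loop in Python (the dictionary is kept here as a hash map of strings rather than the trie suggested above; id 0 denotes the empty string, and a possible leftover match at the end of the input is emitted without a following character):

def lz78_encode(T):
    D, out, S = {"": 0}, [], ""
    for c in T:
        if S + c in D:
            S += c                         # extend the longest match
        else:
            out.append((D[S], c))          # output (id of longest match, next char)
            D[S + c] = len(D)              # add Sc to the dictionary
            S = ""
    if S:
        out.append((D[S], ""))
    return out

On the string of the next slide, aabaacabcabcb, it outputs (0,a) (1,b) (1,a) (0,c) (2,c) (5,b).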

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later
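
A sketch of the LZW decoder in Python; the “one step behind” issue shows up in the else branch, which handles exactly the SSc case mentioned above (the dictionary is assumed to start with alphabet_size single characters):

def lzw_decode(codes, alphabet_size=256):
    D = {i: chr(i) for i in range(alphabet_size)}
    prev = D[codes[0]]
    out = [prev]
    for code in codes[1:]:
        cur = D[code] if code in D else prev + prev[0]   # code not yet in D: the SSc case
        out.append(cur)
        D[len(D)] = prev + cur[0]          # the entry the encoder added one step earlier
        prev = cur
    return "".join(out)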

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
We are given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows     (1994)

  F            L
  # mississipp i
  i #mississip p
  i ppi#missis s
  i ssippi#mis s
  i ssissippi# m
  m ississippi #   ← the row of T
  p i#mississi p
  p pi#mississ i
  s ippi#missi s
  s issippi#mi s
  s sippi#miss i
  s sissippi#m i

A famous example

Much
longer...

A useful tool: L → F mapping

(figure: the same sorted matrix, where the middle columns are unknown — only the F and L columns are available)

How do we map L’s chars onto F’s chars?
... We need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
(figure: the sorted matrix with only the F and L columns known)

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
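
A Python sketch of the inversion, including the computation of the LF-array (assuming the text ends with a unique sentinel ‘#’, as in the example):

def invert_bwt(L, sentinel="#"):
    from collections import Counter
    n = len(L)
    cnt = Counter(L)
    first, tot = {}, 0
    for c in sorted(cnt):                  # first[c] = position in F of the first c
        first[c] = tot
        tot += cnt[c]
    seen, LF = Counter(), [0] * n
    for i, c in enumerate(L):              # equal chars keep their relative order
        LF[i] = first[c] + seen[c]
        seen[c] += 1
    T, r = [""] * n, L.index(sentinel)     # the row of T itself ends with the sentinel
    for i in range(n - 1, -1, -1):         # T[i] = L[r];  r = LF[r]
        T[i] = L[r]
        r = LF[r]
    return "".join(T)

invert_bwt("ipssm#pissii") returns "mississippi#".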

How to compute the BWT ?
SA      BWT matrix       L
12      #mississipp      i
11      i#mississip      p
 8      ippi#missis      s
 5      issippi#mis      s
 2      ississippi#      m
 1      mississippi      #
10      pi#mississi      p
 9      ppi#mississ      i
 7      sippi#missi      s
 4      sissippi#mi      s
 6      ssippi#miss      i
 3      ssissippi#m      i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]
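
In code (SA holds 1-based starting positions, as in the slide):

def bwt_from_sa(T, SA):
    # L[i] = T[SA[i]-1]; when SA[i] = 1 the preceding char is, cyclically, the last one
    return "".join(T[s - 2] if s > 1 else T[-1] for s in SA)

bwt_from_sa("mississippi#", [12,11,8,5,2,1,10,9,7,4,6,3]) returns "ipssm#pissii".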

How to construct SA from T ?
SA
12   #
11   i#
 8   ippi#
 5   issippi#
 2   ississippi#
 1   mississippi#
10   pi#
 9   ppi#
 7   sippi#
 4   sissippi#
 6   ssippi#
 3   ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:
 L is locally homogeneous  ⟹  L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet of |S|+1 symbols
Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus γ(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


 Size
  1 trillion pages available (Google 7/08)
  5-40K per page => hundreds of terabytes
  Size grows every day!!
 Change
  8% new pages, 25% new links change weekly
  Life time of about 10 days

The Bow Tie

Some definitions


 Weakly connected components (WCC)
  Set of nodes such that from any node you can reach any other node via an undirected path.
 Strongly connected components (SCC)
  Set of nodes such that from any node you can reach any other node via a directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


 Largest artifact ever conceived by humans
 Exploit the structure of the Web for
  Crawl strategies
  Search
  Spam detection
  Discovering communities on the web
  Classification/organization
 Predict the evolution of the Web
  Sociological understanding

Many other large graphs…


 Physical network graph
  V = Routers
  E = communication links
 The “cosine” graph (undirected, weighted)
  V = static web pages
  E = semantic distance between pages
 Query-Log graph (bipartite, weighted)
  V = queries and URLs
  E = (q,u) if u is a result for q, and has been clicked by some user who issued q
 Social graph (undirected, unweighted)
  V = users
  E = (x,y) if x knows y (facebook, address book, email, ...)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:
 Skewed distribution: Pr that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution
(figure: in-degree distributions of the Altavista crawl, 1999, and of the WebBase crawl, 2001; the indegree follows a power law distribution)

  Pr[ in-degree(u) = k ]  =  1 / k^a,    a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:
 Skewed distribution: Pr that a node has x links is 1/x^a, a ≈ 2.1
 Locality: usually most of the hyperlinks point to other URLs on the same host (about 80%).
 Similarity: pages close in lexicographic order tend to share many entries of their outgoing lists

A Picture of the Web Graph

(figure: adjacency-matrix plot of a crawl with 21 million pages and 150 million links)

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed adjacency list  →  Adjacency list with compressed gaps   (locality)

  Successor list S(x) = {s1 − x, s2 − s1 − 1, ..., sk − sk−1 − 1}

For negative entries:
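
A tiny sketch of the gap transform above (the remapping of negative entries to non-negative codes is left out, since the slide stops there; the successor list in the comment is made up):

def successor_gaps(x, succ):
    # succ: sorted successor list of node x;  S(x) = {s1-x, s2-s1-1, ..., sk-s(k-1)-1}
    gaps, prev = [], None
    for s in succ:
        gaps.append(s - x if prev is None else s - prev - 1)
        prev = s
    return gaps

# e.g. successor_gaps(15, [13, 15, 16, 17, 18, 19, 23, 24]) == [-2, 1, 0, 0, 0, 0, 3, 0]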

Copy-lists     (reference chains, possibly limited)

Uncompressed adjacency list  →  Adjacency list with copy lists   (similarity)

 Each bit of y’s copy-list tells whether the corresponding successor of the reference x is also a successor of y;
 The reference index is chosen in [0,W] so as to give the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with copy blocks  →  exploit consecutivity in the extra-nodes

 Intervals: use their left extreme and length
 Interval length: decremented by Lmin = 2
 Residuals: differences between residuals, or from the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
(figure: a sender transmits data over the network to a receiver that already holds some knowledge about that data)

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




 caching: “avoid sending the same object again”
  done on the basis of objects
  only works if objects are completely unchanged
  How about objects that are slightly changed?

 compression: “remove redundancy in transmitted data”
  avoid repeated substrings in data
  can be extended to the history of past transmissions (overhead)
  What if the sender has never seen the data at the receiver?

Types of Techniques


 Common knowledge between sender & receiver
  Unstructured file: delta compression
 “partial” knowledge
  Unstructured files: file synchronization
  Record-based data: set reconciliation

Formalization


 Delta compression   [diff, zdelta, REBL,…]
  Compress file f deploying file f’
  Compress a group of files
  Speed-up web access by sending differences between the requested page and the ones available in cache
 File synchronization   [rsync, zsync]
  Client updates old file fold with fnew available on a server
  Mirroring, Shared Crawling, Content Distribution Networks
 Set reconciliation
  Client updates structured old file fold with fnew available on a server
  Update of contacts or appointments, intersect inverted lists in a P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



 The LZ77-scheme provides an efficient, optimal solution
  fknown is the “previously encoded text”: compress fknown·fnew starting from fnew
 zdelta is one of the best implementations

            Emacs size   Emacs time
  uncompr   27Mb         ---
  gzip      8Mb          35 secs
  zdelta    1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

(figure: Client ↔ slow link (delta-encoding) ↔ Proxy ↔ fast link ↔ web; the reference page is available on both sides of the slow link, the request flows towards the web and the page flows back.)

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



 Apply pairwise zdelta: find for each f ∈ F a good reference
 Reduction to the Min Branching problem on DAGs
  Build a weighted graph GF: nodes = files, weights = zdelta-sizes
  Insert a dummy node connected to all, whose edge weights are the gzip-coding sizes
  Compute the min branching = directed spanning tree of min total cost, covering G’s nodes.

(figure: a small example graph with a dummy root 0 and edge weights such as 20, 123, 220, 620, 2000; the min branching picks the cheapest reference for each file.)

            space   time
  uncompr   30Mb    ---
  tgz       20%     linear
  THIS      8%      quadratic

Improvement: what about many-to-one compression? (group of files)

Problem: Constructing G is very costly: n² edge calculations (zdelta executions)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F, thus saving zdelta executions. Nonetheless, strictly n² time.

            space   time
  uncompr   260Mb   ---
  tgz       12%     2 mins
  THIS      8%      16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
(figure: the Client holds f_old and sends a request; the Server holds f_new and answers with an update.)

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
(figure: the Client, holding f_old, sends the block hashes to the Server; the Server, holding f_new, sends back the encoded file.)

The rsync algorithm

(contd)



 simple, widely used, single roundtrip
 optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
 choice of block size problematic (default: max{700, √n} bytes)
 not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

          gcc size   emacs size
  total    27288      27326
  gzip     7563       8577
  zdelta   227        1431
  rsync    964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The Server sends the hashes (unlike rsync, where the Client does), and the Client checks them.
The Server deploys the common fref to compress the new ftar (rsync just compresses it on its own).

A multi-round protocol
 k blocks of n/k elements
 log(n/k) levels
 If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
 The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

(figure: P aligned at position i of T, as a prefix of the suffix T[i,N])

Occurrences of P in T = all the suffixes of T having P as a prefix

  P = si,  T = mississippi  →  occurrences at positions 4, 7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
(figure: the suffix tree of T# = mississippi#, with branching nodes on the first characters #, i, m, p, s, edge labels such as “ssi”, “ppi#”, “i#”, “mississippi#”, and the 12 leaves storing the starting positions of the suffixes.)

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
T = mississippi#           (SUF(T) takes Θ(N²) space; SA stores one suffix pointer per entry)

  SA    SUF(T)
  12    #
  11    i#
   8    ippi#
   5    issippi#
   2    ississippi#
   1    mississippi#
  10    pi#
   9    ppi#
   7    sippi#
   4    sissippi#
   6    ssippi#
   3    ssissippi#

  P = si

Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison

(figure: two steps of the binary search for P = si over the SA of T = mississippi#: in the first step the probed suffix is smaller, so “P is larger”; in the second it is greater, so “P is smaller”; each step costs 2 accesses.)

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

  improvable to O(p + log2 N) [Manber-Myers, ’90] and to O(p + log2 |S|) [Cole et al, ’06]
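
A sketch of the indirect binary search in Python (SA holds 1-based starting positions, as in the slides; only the first p characters of each probed suffix are compared):

def sa_search(T, SA, P):
    p = len(P)
    def pref(i):                           # first p chars of the i-th smallest suffix
        return T[SA[i] - 1 : SA[i] - 1 + p]
    lo, hi = 0, len(SA)
    while lo < hi:                         # leftmost suffix whose prefix is >= P
        mid = (lo + hi) // 2
        if pref(mid) < P: lo = mid + 1
        else: hi = mid
    first = lo
    lo, hi = first, len(SA)
    while lo < hi:                         # leftmost suffix whose prefix is > P
        mid = (lo + hi) // 2
        if pref(mid) <= P: lo = mid + 1
        else: hi = mid
    return SA[first:lo]                    # contiguous range of SA (Prop 1)

sa_search("mississippi#", [12,11,8,5,2,1,10,9,7,4,6,3], "si") returns [7, 4].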

Locating the occurrences
(figure: the range of SA containing the suffixes prefixed by “si” — here sippi… and sissippi…, i.e. occ = 2 occurrences, at positions 4 and 7 of T = mississippi#.)

Suffix Array search
• O(p + log2 N + occ) time

Suffix Trays: O(p + log2 |S| + occ)   [Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

(figure: the Lcp array aligned with the SA of T = mississippi#; e.g. the suffixes issippi… and ississippi…, adjacent in SA, share a prefix of length 4.)
• How long is the common prefix between T[i,...] and T[j,...]?
  • It is the min of the subarray Lcp[h,k−1] s.t. SA[h] = i and SA[k] = j.
• Does there exist a repeated substring of length ≥ L?
  • Search for some Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times?
  • Search for a window Lcp[i,i+C−2] whose entries are all ≥ L.
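
Two of these queries in Python, as a sketch (Lcp is the array of longest-common-prefix lengths between SA-adjacent suffixes):

def has_repeat(Lcp, L):
    # repeated substring of length >= L  <=>  some adjacent pair of suffixes shares >= L chars
    return any(v >= L for v in Lcp)

def occurs_at_least(Lcp, L, C):
    # substring of length >= L occurring >= C times  <=>  a run of C-1 consecutive Lcp entries >= L
    run = 0
    for v in Lcp:
        run = run + 1 if v >= L else 0
        if run >= C - 1:
            return True
    return C <= 1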


Slide 194

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1·2 + 0 (mod 7) = 2
2·2 + 1 (mod 7) = 5
5·2 + 1 (mod 7) = 4
4·2 + 1 (mod 7) = 2
2·2 + 1 (mod 7) = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1):
2^m (mod q) = 2·(2^(m-1) (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
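A compact sketch of the fingerprint scan just described (plain Python; the verification step makes it the deterministic variant). The prime q = 2^31 - 1 is only an illustrative choice, not one prescribed by the slides.

def karp_rabin(T, P, q=(1 << 31) - 1):
    # Binary alphabet assumed: T and P are strings over {'0','1'}.
    n, m = len(T), len(P)
    if m > n:
        return []
    pow_m = pow(2, m - 1, q)                   # 2^(m-1) mod q
    hP = hT = 0
    for i in range(m):                         # fingerprints of P and of T_1
        hP = (2 * hP + int(P[i])) % q
        hT = (2 * hT + int(T[i])) % q
    occ = []
    for r in range(n - m + 1):
        if hP == hT and T[r:r + m] == P:       # verify, to rule out false matches
            occ.append(r)
        if r + m < n:                          # roll the window: drop T[r], add T[r+m]
            hT = (2 * (hT - int(T[r]) * pow_m) + int(T[r + m])) % q
    return occ

print(karp_rabin("10110101", "0101"))          # [4]: the match of the example above (0-based)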

Problem 1: Solution
Dictionary = {bzip, not, or, space}        P = bzip = 1a 0b

[Slide figure: the same picture as in Problem 1: the codeword of P is searched directly in C(S), S = “bzip or not bzip”, reporting a yes/no answer per codeword.]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

[Slide figure: the m×n matrix M for T = california and P = for. Column j stores the bits M(1,j),…,M(m,j); e.g. M(1,5) = 1 because P[1] = f = T[5], and M(3,7) = 1, i.e. “for” occurs ending at position 7.]

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the (j-1)-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the (j-1)-th one.
We define the m-length binary vector U(x) for each
character x in the alphabet: U(x) is set to 1 for the
positions in P where character x appears.
Example:
P = abaac
U(a) = (1,0,1,1,0)ᵀ    U(b) = (0,1,0,0,0)ᵀ    U(c) = (0,0,0,0,1)ᵀ

How to construct M



Initialize column 0 of M to all zeros
For j > 0, the j-th column is obtained by

M(j) = BitShift( M(j-1) ) & U(T[j])

For i > 1, entry M(i,j) = 1 iff
(1) the first i-1 characters of P match the i-1 characters of T
ending at character j-1   ⇔   M(i-1,j-1) = 1
(2) P[i] = T[j]   ⇔   the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) into the i-th position;
AND-ing it with the i-th bit of U(T[j]) establishes whether both conditions hold.
(For i = 1 the leading 1 set by BitShift plays the role of M(0,j-1), so M(1,j) = U(T[j])[1].)
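A minimal sketch of the Shift-And scan just described (plain Python, not the slides’ own code): integers are used as bit-vectors, with bit i-1 of the word playing the role of row i of M.

def shift_and(T, P):
    # Build U(x): bit i-1 is set iff P[i] == x (1-based i).
    m = len(P)
    U = {}
    for i, x in enumerate(P):
        U[x] = U.get(x, 0) | (1 << i)
    occ = []
    col = 0                                  # column M(0): all zeros
    for j, c in enumerate(T):
        # BitShift the previous column (setting its first bit to 1), then AND with U(T[j]).
        col = ((col << 1) | 1) & U.get(c, 0)
        if col & (1 << (m - 1)):             # M(m,j) = 1: an occurrence ends at j
            occ.append(j - m + 1)
    return occ

print(shift_and("xabxabaaca", "abaac"))      # [4]: occurrence starting at position 4 (0-based)

The same text/pattern pair is worked out column by column in the next slides.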

An example
T = xabxabaaca,  P = abaac   (m = 5, n = 10)

[Slide figures for j = 1, 2, 3 and 9: each shows the 5×10 matrix M filled up to column j together with the computation BitShift(M(j-1)) & U(T[j]). E.g. column 2 is (1,0,0,0,0)ᵀ (T[2] = a starts a fresh match at P[1]); column 3 is (0,1,0,0,0)ᵀ (T[3] = b extends that match); at j = 9 the 5-th bit of the column is 1, i.e. P occurs ending at position 9.]

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions

We want to allow the pattern to contain special
symbols, like the class of chars [a-f]
P = [a-b]baac
U(a) = (1,0,1,1,0)ᵀ    U(b) = (1,1,0,0,0)ᵀ    U(c) = (0,0,0,0,1)ᵀ

What about ‘?’, ‘[^…]’ (not)?

Problem 1: Another solution
Dictionary = {bzip, not, or, space}        P = bzip = 1a 0b

[Slide figure: as before, the codeword sequence C(S) of S = “bzip or not bzip” is scanned, this time with the bit-parallel machinery, and a yes/no answer is produced per codeword.]

Speed ≈ Compression ratio

Problem 2
Dictionary = {bzip, not, or, space}

Given a pattern P, find all the occurrences in S of all terms containing P as a substring.    P = o

[Slide figure: C(S) for S = “bzip or not bzip”; both dictionary terms containing “o” are relevant, with codewords not = 1g 0g 0a and or = 1g 0a 0b, and each of them must be searched for in C(S).]

Speed ≈ Compression ratio? No! Why?
A scan of C(S) is needed for each term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Slide figure: a text T with the two patterns P1 and P2 aligned at their occurrences.]

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

S is the concatenation of the patterns in P
R is a bitmap of length m
 R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of the Shift-And method searching for S:
 For any symbol c, U’(c) = U(c) AND R
   U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
 For any step j:
   compute M(j)
   then M(j) OR U’(T[j]). Why?
   This sets to 1 the first bit of each pattern that starts with T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary = {bzip, not, or, space}

Given a pattern P, find all the occurrences in S of all terms containing
P as a substring, allowing at most k mismatches.    P = bot,  k = 2

[Slide figure: C(S) for S = “bzip or not bzip”; e.g. the dictionary term “not” matches P = bot with 1 mismatch.]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix M^l to be an m by n binary matrix such that:

M^l(i,j) = 1 iff there are no more than l mismatches between the
first i characters of P and the i characters of T ending at character j.

What is M^0?
How does M^k solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing M^l: case 1

The first i-1 characters of P match a substring of T ending
at j-1 with at most l mismatches, and the next pair of
characters in P and T are equal.
[Slide figure: P aligned under T, with stars marking the compared positions up to i-1 / j-1.]

BitShift( M^l(j-1) ) & U(T[j])

Computing M^l: case 2

The first i-1 characters of P match a substring of T ending
at j-1 with at most l-1 mismatches (so position i itself may mismatch).
[Slide figure: same picture as in case 1.]

BitShift( M^(l-1)(j-1) )

Computing M^l

We compute M^l for all l = 0, …, k.
For each j compute M(j), M^1(j), …, M^k(j).
For all l, initialize M^l(0) to the zero vector.
In order to compute M^l(j), we observe that there is a match iff case 1 or case 2 holds:

M^l(j) = [ BitShift( M^l(j-1) ) & U(T[j]) ]  OR  BitShift( M^(l-1)(j-1) )
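A small sketch of this k-mismatch recurrence (plain Python, bit-vectors as integers, illustrative only): it simply keeps k+1 Shift-And columns, one per allowed number of mismatches.

def shift_and_k_mismatches(T, P, k):
    m = len(P)
    U = {}
    for i, x in enumerate(P):
        U[x] = U.get(x, 0) | (1 << i)
    cols = [0] * (k + 1)                          # cols[l] = current column of M^l
    occ = []
    for j, c in enumerate(T):
        prev = cols[:]                            # columns for position j-1
        for l in range(k + 1):
            exact = ((prev[l] << 1) | 1) & U.get(c, 0)        # case 1: P[i] = T[j]
            if l == 0:
                cols[l] = exact
            else:
                cols[l] = exact | ((prev[l - 1] << 1) | 1)    # case 2: spend one mismatch at i
        if cols[k] & (1 << (m - 1)):
            occ.append(j - m + 1)                 # occurrence with <= k mismatches ending at j
    return occ

print(shift_and_k_mismatches("xabxabaaca", "abaad", 1))       # [4], as in the example below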

Example M^1
T = xabxabaaca,  P = abaad

[Slide figure: the two 5×10 matrices M^0 and M^1. In M^0 only the exact partial matches appear (e.g. M^0(4,8) = 1 for the prefix “abaa” ending at position 8); in M^1 every entry of row 1 equals 1, and M^1(5,9) = 1, i.e. P occurs ending at position 9 with at most one mismatch.]

How much do we pay?

The running time is O(k n (1 + m/w)).
Again, the method is practically efficient for small m.
Only O(k) columns of M are needed at any given time,
hence the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary = {bzip, not, or, space}

Given a pattern P, find all the occurrences in S of all terms containing
P as a substring, allowing k mismatches.    P = bot,  k = 2

[Slide figure: the Shift-And-with-mismatches scan over C(S), S = “bzip or not bzip”; the matching terms (e.g. not = 1g 0g 0a) are reported with a yes.]

Agrep: more sophisticated operations

The Shift-And method can solve other ops.

The edit distance between two strings p and s is
d(p,s) = minimum number of operations
needed to transform p into s via three ops:
 Insertion: insert a symbol into p
 Deletion: delete a symbol from p
 Substitution: change a symbol of p into a different one

Example: d(ananas, banane) = 3

Search by regular expressions
Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding

γ(x) = 0 0 … 0 (Length-1 zeroes) followed by x in binary
x > 0 and Length = ⌊log2 x⌋ + 1
e.g., 9 is represented as <000, 1001>.

γ-code for x takes 2⌊log2 x⌋ + 1 bits
(i.e. a factor of 2 from optimal)

Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of γ-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

→ 8, 6, 3, 59, 7
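A tiny sketch of γ-encoding/decoding (plain Python, not the slides’ code), enough to check the exercise above.

def gamma_encode(x):
    # x > 0: (Length-1) zeroes followed by x in binary.
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":                    # count the leading zeroes
            z += 1
            i += 1
        out.append(int(bits[i:i + z + 1], 2))    # read z+1 bits of binary value
        i += z + 1
    return out

s = "".join(gamma_encode(x) for x in (8, 6, 3, 59, 7))
print(s)                  # 0001000001100110000011101100111
print(gamma_decode(s))    # [8, 6, 3, 59, 7]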

Analysis
Sort the pi in decreasing order, and encode si via
the variable-length code γ(i).

Recall that: |γ(i)| ≤ 2·log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2·H0(S) + 1
Key fact:
1 ≥ Σ_{i=1..x} pi ≥ x·px   ⟹   x ≤ 1/px

How good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1
The cost of the encoding is (recall i ≤ 1/pi):

Σ_{i=1..|S|} pi·|γ(i)|  ≤  Σ_{i=1..|S|} pi·[2·log(1/pi) + 1]  =  2·H0(X) + 1

Not much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte,
s·c items with 2 bytes, s·c² with 3 bytes, ...

An example

5000 distinct words
ETDC encodes 128 + 128² = 16512 words on at most 2 bytes
A (230,26)-dense code encodes 230 + 230·26 = 6210 words on at most 2
bytes, hence more of them on 1 byte; this wins if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC is quite interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1^n 2^n 3^n … n^n  ⟹  Huff = O(n² log n), MTF = O(n log n) + n²

Not much worse than Huffman
...but it may be far better
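A minimal sketch of Move-to-Front coding as just described (plain Python, list-based; a real implementation would use the search-tree / hash-table organization discussed a few slides below). Illustrative only.

def mtf_encode(text, alphabet):
    L = list(alphabet)                  # start with the list of symbols L = [a,b,c,d,...]
    out = []
    for s in text:
        i = L.index(s)                  # 1) output the position of s in L (0-based here)
        out.append(i)
        L.insert(0, L.pop(i))           # 2) move s to the front of L
    return out

def mtf_decode(codes, alphabet):
    L = list(alphabet)
    out = []
    for i in codes:
        s = L[i]
        out.append(s)
        L.insert(0, L.pop(i))
    return "".join(out)

codes = mtf_encode("abbbaacccca", "abcd")
print(codes)                            # [0, 1, 0, 0, 1, 0, 2, 0, 0, 0, 1]: runs become runs of 0s
print(mtf_decode(codes, "abcd"))        # abbbaacccca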

MTF: how good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1
Put S in front and consider the cost of encoding
(n_x = #occurrences of symbol x, p_x^i = position of its i-th occurrence):

O(|S| log |S|)  +  Σ_{x=1..|S|} Σ_{i=2..n_x} |γ( p_x^i - p_x^{i-1} )|

By Jensen’s inequality this is

≤  O(|S| log |S|)  +  Σ_{x=1..|S|} n_x · [ 2·log(N/n_x) + 1 ]
=  O(|S| log |S|)  +  N · [ 2·H0(X) + 1 ]

La[mtf]  ≤  2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings → just the run lengths and one bit
Properties:
 Exploits spatial locality, and it is a dynamic code (there is a memory)

X = 1^n 2^n 3^n … n^n  ⟹  Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)
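A matching sketch for RLE (plain Python), only to make the example above concrete.

def rle_encode(text):
    # abbbaacccca -> [('a', 1), ('b', 3), ('a', 2), ('c', 4), ('a', 1)]
    runs = []
    for s in text:
        if runs and runs[-1][0] == s:
            runs[-1] = (s, runs[-1][1] + 1)
        else:
            runs.append((s, 1))
    return runs

print(rle_encode("abbbaacccca"))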

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive), e.g. p(a) = .2, p(b) = .5, p(c) = .3, with

f(i) = Σ_{j=1..i-1} p(j),   i.e.   f(a) = .0,  f(b) = .2,  f(c) = .7

[Slide figure: the unit interval split into a = [0,.2), b = [.2,.7), c = [.7,1).]

The interval for a particular symbol will be called
the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
[Slide figure: start from [0,1); after b the interval becomes [.2,.7); after a it shrinks to [.2,.3); after c it becomes [.27,.3).]
The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c1 … cn with probabilities p[c], use:

l_0 = 0      l_i = l_{i-1} + s_{i-1} · f[c_i]
s_0 = 1      s_i = s_{i-1} · p[c_i]

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

s_n = Π_{i=1..n} p[c_i]

The interval for a message sequence will be called the
sequence interval
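A small sketch of the interval computation above (plain Python, floating point, so only for short toy messages; real coders use the integer version discussed later). The probabilities are those of the running example.

PROBS = {"a": 0.2, "b": 0.5, "c": 0.3}
CUM   = {"a": 0.0, "b": 0.2, "c": 0.7}      # f[c]: cumulative prob. up to c, excluded

def sequence_interval(msg):
    l, s = 0.0, 1.0                         # l_0 = 0, s_0 = 1
    for c in msg:
        l = l + s * CUM[c]                  # l_i = l_{i-1} + s_{i-1} * f[c_i]
        s = s * PROBS[c]                    # s_i = s_{i-1} * p[c_i]
    return l, l + s

def decode(x, n):
    out = []
    for _ in range(n):                      # find the symbol interval containing x, then rescale
        for c in "abc":
            lo, hi = CUM[c], CUM[c] + PROBS[c]
            if lo <= x < hi:
                out.append(c)
                x = (x - lo) / PROBS[c]
                break
    return "".join(out)

print(sequence_interval("bac"))             # ~(0.27, 0.30): the sequence interval of "bac"
print(decode(0.49, 3))                      # bbc, as in the decoding example below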

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:
[Slide figure: .49 lies in the symbol interval of b = [.2,.7); refining that interval, .49 lies in its b part [.3,.55); refining again, .49 lies in the c part [.475,.55).]
The message is bbc.

Representing a real number
Binary fractional representation:
.75 = .11      1/3 = .010101…      11/16 = .1011

Algorithm
1. x = 2·x
2. If x < 1 output 0
3. else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence interval?
e.g.  [0,.33) → .01      [.33,.66) → .1      [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.

          min      max      interval
  .11     .110     .111     [.75, 1.0)
  .101    .1010    .1011    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (a dyadic interval).

[Slide figure: sequence interval [.61,.79) containing the code interval [.625,.75) of .101.]

Can use l + s/2 truncated to 1 + ⌈log2(1/s)⌉ bits

Bound on Arithmetic length

Note that ⌈-log2 s⌉ + 1 = ⌈log2(2/s)⌉

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

1 + ⌈log2(1/s)⌉ = 1 + ⌈log2 Π_i (1/p_i)⌉
              ≤ 2 + Σ_{i=1..n} log2(1/p_i)
              = 2 + Σ_{k=1..|S|} n·p_k · log2(1/p_k)
              = 2 + n·H0   bits

nH0 + 0.02·n bits in practice,
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in the range [0..R) where R = 2^k
Use rounding to generate integer intervals
Whenever the sequence interval falls into the top,
bottom or middle half, expand the interval
by a factor of 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
  Output 1 followed by m 0s
  m = 0
  Message interval is expanded by 2

If u < R/2 then (bottom half)
  Output 0 followed by m 1s
  m = 0
  Message interval is expanded by 2

If l ≥ R/4 and u < 3R/4 then (middle half)
  Increment m
  Message interval is expanded by 2

In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine

[Slide figure: the ATB maps the current interval (L,s), given a symbol c and the distribution (p1,…,p|S|), to the new interval (L’,s’), shrinking [L, L+s) around the sub-interval of c.]

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox

[Slide figure: the ATB is driven with p[s | context], where s is either a character c or the escape symbol esc; the interval (L,s) is refined to (L’,s’) exactly as before.]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
String = ACCBACCACBA B        k = 2

Order 0 (empty context):   A = 4    B = 2    C = 5    $ = 3

Order 1:
  A:   C = 3   $ = 1
  B:   A = 2   $ = 1
  C:   A = 1   B = 2   C = 2   $ = 3

Order 2:
  AC:  B = 1   C = 2   $ = 2
  BA:  C = 1   $ = 1
  CA:  C = 1   $ = 1
  CB:  A = 2   $ = 1
  CC:  A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary = all substrings starting before the cursor;  e.g. one step may output <2,3,c>

Algorithm’s step:

Output <d, len, c> where
 d = distance of the copied string wrt the current position
 len = length of the longest match
 c = next char in the text beyond the longest match

Advance by len + 1

A buffer “window” of fixed length slides along the text

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if len > d? (overlap with the text to be compressed)

E.g. seen = abcd, next codeword is (2,9,e)

Simply copy starting at the cursor:
for (i = 0; i < len; i++)
  out[cursor+i] = out[cursor-d+i]

Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
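A compact LZW encoder following the description above (plain Python; the dictionary is a plain dict initialized with single characters only, here just a, b, c with a = 112 as in the slides' toy example, purely for illustration).

def lzw_encode(text):
    dic = {"a": 112, "b": 113, "c": 114}     # toy initialization, as in the slides' example
    next_code = 256                           # new entries start at 256
    out = []
    w = text[0]
    for k in text[1:]:
        if w + k in dic:
            w = w + k                         # extend the current match
        else:
            out.append(dic[w])                # emit the code of w (no extra char is sent)
            dic[w + k] = next_code            # ...but w+k still enters the dictionary
            next_code += 1
            w = k
    out.append(dic[w])                        # final flush
    return out

print(lzw_encode("aabaacababacb"))
# [112, 112, 113, 256, 114, 257, 261, 114, 113]: the slides' output plus the final flush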

LZW: Encoding Example
T = a a b a a c a b a b a c b

  Output     Dict.
   112       256 = aa
   112       257 = ab
   113       258 = ba
   256       259 = aac
   114       260 = ca
   257       261 = aba
   261       262 = abac
   114       263 = cb
LZW: Decoding Example
  Input     Output     New dict entry
   112       a
   112       a          256 = aa
   113       b          257 = ab
   256       aa         258 = ba
   114       c          259 = aac
   257       ab         260 = ca
   261       ?          261 = aba   (the decoder is one step behind: code 261 is not in its
                                     dictionary yet, so the phrase must be the previous
                                     output ab plus its own first character, i.e. aba)
   114       c          262 = abac

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
We are given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L → F mapping
[Slide figure: the sorted BWT matrix again, with the first column F and the last column L highlighted.]

How do we map L’s chars onto F’s chars ?
... we need to distinguish equal chars in F...

Take two equal chars of L
Rotate their rows rightward by one position
Their relative order is preserved !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
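A toy sketch of the transform and of the backward reconstruction just described (plain Python; it builds all rotations explicitly, so it is only for small examples; the following slides show how to do it properly via the suffix array).

def bwt(t):
    # t must end with a unique, lexicographically smallest sentinel, e.g. '#'.
    rows = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(row[-1] for row in rows)            # L = last column of the sorted rotations

def ibwt(L, sentinel="#"):
    n = len(L)
    # LF mapping: the k-th occurrence of a char in L corresponds to
    # the k-th occurrence of the same char in F (the sorted column).
    order = sorted(range(n), key=lambda i: (L[i], i))
    LF = [0] * n
    for f_pos, l_pos in enumerate(order):
        LF[l_pos] = f_pos
    out, r = [], 0                                     # row 0 is the one starting with the sentinel
    for _ in range(n - 1):
        out.append(L[r])                               # L[r] precedes F[r] in T
        r = LF[r]
    return "".join(reversed(out)) + sentinel

L = bwt("mississippi#")
print(L)          # ipssm#pissii
print(ibwt(L))    # mississippi#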

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics

Size
 1 trillion pages are available (Google 7/08)
 5-40K per page => hundreds of terabytes
 Size grows every day!!

Change
 8% new pages, 25% new links change weekly
 Life time of about 10 days

The Bow Tie

Some definitions

Weakly connected components (WCC)
 Set of nodes such that from any node one can reach any other node via an
undirected path.

Strongly connected components (SCC)
 Set of nodes such that from any node one can reach any other node via a
directed path.

Observing the Web graph

We do not know which percentage of it we know.
The only way to discover the graph structure of the
web as hypertext is via large scale crawls.
Warning: the picture might be distorted by
 Size limitation of the crawl
 Crawling rules
 Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?

Largest artifact ever conceived by humans.
Exploit the structure of the Web for
 Crawl strategies
 Search
 Spam detection
 Discovering communities on the web
 Classification/organization
Predict the evolution of the Web
 Sociological understanding

Many other large graphs…

Physical network graph
 V = Routers
 E = communication links

The “cosine” graph (undirected, weighted)
 V = static web pages
 E = semantic distance between pages

Query-Log graph (bipartite, weighted)
 V = queries and URLs
 E = (q,u) if u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)
 V = users
 E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution
[Slide figure: the in-degree follows a power-law distribution in both the Altavista crawl (1999) and the WebBase crawl (2001):]

Pr[ in-degree(u) = k ]  ∝  1/k^α,    α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph

[Slide figure: an uncompressed adjacency list vs. the same list with compressed gaps (exploiting locality).]

Successor list S(x) = {s1 - x, s2 - s1 - 1, ..., sk - s(k-1) - 1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)
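A toy sketch of the gap (delta) encoding of a successor list, i.e. the "locality" step above (plain Python; the node id and successor ids below are made up for illustration, and the gaps would then be γ-coded as sketched earlier).

def gaps(x, successors):
    # Successor list S(x) = {s1 - x, s2 - s1 - 1, ..., sk - s(k-1) - 1}
    out, prev = [], None
    for s in sorted(successors):
        if prev is None:
            out.append(s - x)          # may be negative: handled by a special mapping
        else:
            out.append(s - prev - 1)   # consecutive successors give gap 0 (locality)
        prev = s
    return out

print(gaps(15, [13, 15, 16, 17, 18, 19, 23, 24, 203]))
# [-2, 1, 0, 0, 0, 0, 3, 0, 178]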

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides an efficient, optimal solution:
f_known is the “previously encoded text”; compress the concatenation f_known·f_new starting from f_new

zdelta is one of the best implementations

            Emacs size    Emacs time
uncompr     27Mb          ---
gzip        8Mb           35 secs
zdelta      1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Slide figure: a small weighted graph over the files plus the dummy node; edge weights are the zdelta sizes (gzip sizes for the dummy edges), and the min branching picks the cheapest reference for each file.]

            space    time
uncompr     30Mb     ---
tgz         20%      linear
THIS        8%       quadratic

Improvement

What about many-to-one compression? (group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: estimate appropriate edge weights for G’F, thus saving
zdelta executions. Nonetheless, strictly n² time.

            space    time
uncompr     260Mb    ---
tgz         12%      2 mins
THIS        8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
[Slide figure: the client sends the block hashes of f_old to the server; the server matches them against f_new and replies with an encoded file made of block references and literals.]

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
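A very small sketch of the block-matching idea (plain Python; it uses one strong hash per block instead of the rolling-hash + MD5 pair that rsync actually uses, and the block size below is an arbitrary illustrative choice).

import hashlib

def block_hashes(data, bsize):
    # Client side: hash every bsize-byte block of the old file.
    return {hashlib.md5(data[i:i + bsize]).hexdigest(): i
            for i in range(0, len(data), bsize)}

def encode(new, old_hashes, bsize):
    # Server side: emit ('copy', old_offset) for blocks the client already has,
    # ('lit', bytes) otherwise. Real rsync slides a window byte by byte.
    out, i = [], 0
    while i < len(new):
        h = hashlib.md5(new[i:i + bsize]).hexdigest()
        if h in old_hashes:
            out.append(("copy", old_hashes[h]))
        else:
            out.append(("lit", new[i:i + bsize]))
        i += bsize
    return out

old = b"the quick brown fox jumps over the lazy dog"
new = b"the quick brown cat jumps over the lazy dog"
print(encode(new, block_hashes(old, 8), 8))    # one 'lit' block, all the others are 'copy'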

Rsync: some experiments

          gcc size    emacs size
total     27288       27326
gzip      7563        8577
zdelta    227         1431
rsync     964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends the hashes (unlike the client in rsync), and the client checks them.
Server deploys the common f_ref to compress the new f_tar (rsync compresses just it).

A multi-round protocol

k blocks of n/k elems,  log(n/k) levels

If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
T# = mississippi#
      1 2 3 4 5 6 7 8 9 10 11 12
[Slide figure: the suffix tree of T#; edges are labelled with substrings (e.g. “ssi”, “ppi#”, “mississippi#”), internal nodes branch on the next character, and each leaf stores the starting position (1 … 12) of its suffix.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

T = mississippi#        P = si

SUF(T)  (Θ(N²) space if written explicitly)      SA
#                                                 12
i#                                                11
ippi#                                              8
issippi#                                           5
ississippi#                                        2
mississippi#                                       1
pi#                                               10
ppi#                                               9
sippi#                                             7
sissippi#                                          4
ssippi#                                            6
ssissippi#                                         3

Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
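A compact sketch of suffix-array construction and of the O(p log2 N) binary search described above (plain Python; the construction sorts the suffixes directly, i.e. the “elegant but inefficient” method of the earlier slide, and the key= argument of bisect needs Python 3.10+).

from bisect import bisect_left, bisect_right

def suffix_array(t):
    # Sort the suffixes lexicographically; store their starting positions (0-based).
    return sorted(range(len(t)), key=lambda i: t[i:])

def occurrences(t, sa, p):
    # Binary search for the contiguous SA range of suffixes having p as a prefix.
    lo = bisect_left(sa, p, key=lambda i: t[i:i + len(p)])
    hi = bisect_right(sa, p, key=lambda i: t[i:i + len(p)])
    return sorted(sa[lo:hi])

t = "mississippi#"
sa = suffix_array(t)
print([i + 1 for i in sa])                          # 12 11 8 5 2 1 10 9 7 4 6 3 (1-based)
print([i + 1 for i in occurrences(t, sa, "si")])    # [4, 7]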

Locating the occurrences
T = mississippi#,   P = si
[Slide figure: the SA range of the suffixes prefixed by “si” (sippi#, sissippi#), delimited by binary-searching for si# and si$; it contains occ = 2 entries, namely positions 4 and 7.]

Suffix Array search
• O(p + log2 N + occ) time

Suffix Trays: O(p + log2 |S| + occ)    [Cole et al., ’06]
String B-tree    [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays    [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
• It is the minimum of the subarray Lcp[h,k-1], where SA[h]=i and SA[k]=j.
• Is there a repeated substring of length ≥ L ?
• Search for an entry Lcp[i] ≥ L.
• Is there a substring of length ≥ L occurring ≥ C times ?
• Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.


Slide 195

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally (Horner's rule, mod q):
1·2 + 0 (mod 7) = 2
2·2 + 1 (mod 7) = 5
5·2 + 1 (mod 7) = 4
4·2 + 1 (mod 7) = 2
2·2 + 1 (mod 7) = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1), using
2^m (mod q) = 2·( 2^(m-1) (mod q) ) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
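A minimal Python sketch of the fingerprint scan described above, with verification of every fingerprint hit (so it behaves like the deterministic variant); here q is a fixed prime passed as a parameter, whereas the slide picks it at random.

def karp_rabin(T, P, q=2**31 - 1):
    # T, P: strings over {'0','1'}; returns the 1-based positions r where P occurs in T
    n, m = len(T), len(P)
    if m > n:
        return []
    two_m = pow(2, m, q)                     # 2^m mod q, used to drop the leading bit
    hP = ht = 0
    for i in range(m):                       # Horner: Hq(P) and Hq(T1)
        hP = (2 * hP + int(P[i])) % q
        ht = (2 * ht + int(T[i])) % q
    occ = []
    for r in range(1, n - m + 2):            # window Tr = T[r, r+m-1]
        if hP == ht and T[r-1:r-1+m] == P:   # equal fingerprints -> verify (no false match reported)
            occ.append(r)
        if r <= n - m:                       # roll: Hq(T_{r+1}) from Hq(T_r)
            ht = (2 * ht - two_m * int(T[r-1]) + int(T[r-1+m])) % q
    return occ

# karp_rabin("10110101", "0101") == [5], as in the example above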

Problem 1: Solution
Dictionary = {bzip, not, or, space}, S = "bzip or not bzip".
The codeword of P = bzip = 1a 0b is searched directly in C(S); the scan answers yes at the two occurrences of bzip.

[Figure: the scan of C(S) with yes/no answers.]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

[Figure: the m×n matrix M for T = california and P = for; the 1-entries mark the prefixes of P ending at each text position, and row m has a 1 exactly where an occurrence of P ends.]

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one.
We define the m-length binary vector U(x) for each character x in the alphabet: U(x) is set to 1 for the positions in P where character x appears.

Example: P = abaac
U(a) = (1,0,1,1,0)
U(b) = (0,1,0,0,0)
U(c) = (0,0,0,0,1)

How to construct M



Initialize column 0 of M to all zeros.
For j > 0, the j-th column is obtained by

M(j) = BitShift( M(j-1) ) & U( T[j] )

For i > 1, entry M(i,j) = 1 iff
(1) the first i-1 characters of P match the i-1 characters of T ending at character j-1, i.e. M(i-1,j-1) = 1
(2) P[i] = T[j], i.e. the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) into the i-th position; AND-ing it with the i-th bit of U(T[j]) establishes whether both conditions hold.
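A minimal Python sketch of this construction, packing each column M(j) into an integer (bit i-1 of the integer = row i of the column), under the slide's assumption m ≤ w.

def shift_and(T, P):
    m = len(P)
    U = {}                                    # U[x]: bit (i-1) set iff P[i] == x
    for i, x in enumerate(P):
        U[x] = U.get(x, 0) | (1 << i)
    M, occ = 0, []
    for j, c in enumerate(T, 1):              # column j from column j-1
        M = ((M << 1) | 1) & U.get(c, 0)      # BitShift(M(j-1)) & U(T[j])
        if M & (1 << (m - 1)):                # row m is 1 -> occurrence ending at position j
            occ.append(j - m + 1)
    return occ

# shift_and("xabxabaaca", "abaac") == [5]  (occurrence ending at position 9)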

An example (j = 1, 2, 3, ..., 9)
T = xabxabaaca, P = abaac (m = 5, n = 10).

[Figure: the columns of M computed one per text position via M(j) = BitShift(M(j-1)) & U(T[j]):
M(1) = BitShift(M(0)) & U(x) = (0,0,0,0,0),
M(2) = BitShift(M(1)) & U(a) = (1,0,0,0,0),
M(3) = BitShift(M(2)) & U(b) = (0,1,0,0,0),
... and at j = 9 the 5-th bit of M(9) is 1: an occurrence of P ends at position 9.]

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special symbols, like [a-f] classes of chars.

Example: P = [a-b]baac
U(a) = (1,0,1,1,0)
U(b) = (1,1,0,0,0)
U(c) = (0,0,0,0,1)

What about '?', '[^…]' (not)?

Problem 1: Another solution
Dictionary = {bzip, not, or, space}, S = "bzip or not bzip".

[Figure: the codeword of P = bzip = 1a 0b is searched in C(S) with a Shift-And-like scan, answering yes/no at each candidate position.]

Speed ≈ Compression ratio

Problem 2
Given a pattern P, find all the occurrences in S of all the dictionary terms containing P as a substring.

Example: P = o, dictionary = {bzip, not, or, space}, S = "bzip or not bzip".
The terms containing P are "not" and "or", with codewords
not = 1 g 0 g 0 a
or  = 1 g 0 a 0 b

[Figure: scan of C(S) marking the occurrences of those codewords.]

Speed ≈ Compression ratio? No! Why?
A scan of C(S) is needed for each term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: text T with the patterns P1 and P2 aligned at their occurrences.]

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

S is the concatenation of the patterns in P.
R is a bitmap of length m:
  R[i] = 1 iff S[i] is the first symbol of a pattern.

Use a variant of the Shift-And method searching for S:
  For any symbol c, U'(c) = U(c) AND R
    U'(c)[i] = 1 iff S[i] = c and it is the first symbol of a pattern
  For any step j,
    compute M(j)
    then M(j) OR U'(T[j]). Why?
      It sets to 1 the first bit of each pattern that starts with T[j]
    Check if there are occurrences ending in j. How?

Problem 3
Given a pattern P, find all the occurrences in S of all the dictionary terms containing P as a substring, allowing at most k mismatches.

Example: P = bot, k = 2, dictionary = {bzip, not, or, space}, S = "bzip or not bzip".

[Figure: scan of the compressed text C(S) for approximate matches.]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal: given k, find all the occurrences of P in T with up to k mismatches.
We define the matrix Ml to be an m by n binary matrix such that:

Ml(i,j) = 1 iff the first i characters of P match the i characters of T ending at character j with no more than l mismatches.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

[Figure: alignment of P[1, i-1] against the text ending at position j-1, with at most l mismatches, followed by a matching pair P[i] = T[j].]

BitShift( Ml(j-1) ) & U(T[j])

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

[Figure: alignment of P[1, i-1] against the text ending at position j-1, with at most l-1 mismatches.]

BitShift( Ml-1(j-1) )

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))
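A minimal Python sketch of this recurrence, keeping one bit-vector per level l = 0,…,k; the bit packing is the same as in the exact Shift-And sketch above.

def agrep_mismatches(T, P, k):
    m = len(P)
    U = {}
    for i, x in enumerate(P):
        U[x] = U.get(x, 0) | (1 << i)
    M = [0] * (k + 1)                          # M[l] = column Ml(j); Ml(0) = 0 for all l
    occ = []
    for j, c in enumerate(T, 1):
        prev = M[:]                            # the columns at position j-1
        M[0] = ((prev[0] << 1) | 1) & U.get(c, 0)
        for l in range(1, k + 1):              # case 1 (match, <= l mism.) OR case 2 (<= l-1 mism.)
            M[l] = (((prev[l] << 1) | 1) & U.get(c, 0)) | ((prev[l-1] << 1) | 1)
        if M[k] & (1 << (m - 1)):              # row m of Mk set -> occurrence ending at j
            occ.append(j - m + 1)
    return occ

# agrep_mismatches("xabxabaaca", "abaad", 1) == [5]  (abaac vs abaad: 1 mismatch)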

Example
T = xabxabaaca, P = abaad, k = 1.

[Figure: the matrices M0 and M1 (5 rows × 10 columns). Row 1 of M1 is all 1s; in M1, row 5 has a 1 in column 9, i.e. P occurs ending at position 9 of T with at most 1 mismatch.]

How much do we pay?





The running time is O( k n (1 + m/w) ).
Again, the method is practically efficient for small m.
Only O(k) columns of M are needed at any given time; hence the space used by the algorithm is O(k) memory words.

Problem 3: Solution
P = bot, k = 2, dictionary = {bzip, not, or, space}, S = "bzip or not bzip".
The term "not" (codeword 1 g 0 g 0 a) matches P within k mismatches, and its occurrences are found by scanning C(S).

[Figure: scan of the compressed text C(S).]

Agrep: more sophisticated operations

The Shift-And method can solve other operations as well.

The edit distance between two strings p and s is d(p,s) = the minimum number of operations needed to transform p into s via three ops:
  Insertion: insert a symbol in p
  Deletion: delete a symbol from p
  Substitution: change a symbol of p into a different one

Example: d(ananas, banane) = 3

Search by regular expressions
  Example: (a|b)?(abc|a)
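A minimal sketch of the classical dynamic program for the edit distance defined above (not the bit-parallel variant).

def edit_distance(p, s):
    # d[i][j] = edit distance between p[:i] and s[:j]
    d = [[0] * (len(s) + 1) for _ in range(len(p) + 1)]
    for i in range(len(p) + 1):
        d[i][0] = i                            # i deletions
    for j in range(len(s) + 1):
        d[0][j] = j                            # j insertions
    for i in range(1, len(p) + 1):
        for j in range(1, len(s) + 1):
            cost = 0 if p[i-1] == s[j-1] else 1
            d[i][j] = min(d[i-1][j] + 1,       # delete p[i]
                          d[i][j-1] + 1,       # insert s[j]
                          d[i-1][j-1] + cost)  # substitute (or match)
    return d[len(p)][len(s)]

# edit_distance("ananas", "banane") == 3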

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding
γ(x) = 000...0 followed by x in binary, with (Length − 1) leading zeros,
where x > 0 and Length = ⌊log2 x⌋ + 1.
e.g., 9 is represented as <000, 1001>.

The γ-code for x takes 2⌊log2 x⌋ + 1 bits (i.e., a factor of 2 from optimal).

Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers

It is a prefix-free encoding…

Given the following sequence of γ-coded integers, reconstruct the original sequence:
0001000001100110000011101100111

Answer: 8, 6, 3, 59, 7
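A minimal sketch of γ-encoding/decoding over bit strings; it reproduces the decoding exercise above.

def gamma_encode(x):
    # x > 0: (Length-1) zeros, then x in binary, with Length = floor(log2 x) + 1
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":                  # count leading zeros = Length - 1
            z += 1; i += 1
        out.append(int(bits[i:i + z + 1], 2))  # read Length bits starting at the 1
        i += z + 1
    return out

# gamma_encode(9) == "0001001"
# gamma_decode("0001000001100110000011101100111") == [8, 6, 3, 59, 7]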

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).

Recall that: |γ(i)| ≤ 2·log i + 1
How good is this approach wrt Huffman?  Compression ratio ≤ 2·H0(s) + 1

Key fact:
1 ≥ Σi=1,...,x pi ≥ x·px   ⟹   x ≤ 1/px

How good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1.
The cost of the encoding is (recall i ≤ 1/pi):

Σi=1,...,|S| pi·|γ(i)|  ≤  Σi=1,...,|S| pi·[ 2·log(1/pi) + 1 ]  =  2·H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s·c with 2 bytes, s·c^2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, there seems to be a unique minimum

Ks = max codeword length
Fsk = cumulative probability of the symbols whose |codeword| ≤ k

Experiments: (s,c)-DC is rather interesting…

Search is 6% faster than byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1^n 2^n 3^n … n^n  ⟹  Huff = O(n^2 log n), MTF = O(n log n) + n^2

Not much worse than Huffman
...but it may be far better
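A minimal sketch of the MTF encoder just described (positions are emitted 1-based; the output integers could then be γ-coded as above).

def mtf_encode(text, alphabet):
    L = list(alphabet)                 # initial list of symbols
    out = []
    for s in text:
        pos = L.index(s)               # position of s in L (0-based)
        out.append(pos + 1)            # 1) output the position of s in L (1-based here)
        L.pop(pos); L.insert(0, s)     # 2) move s to the front of L
    return out

# mtf_encode("aabbbb", "abcd") == [1, 1, 2, 1, 1, 1]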

MTF: how good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1
Put S in front and consider the cost of encoding; for each symbol x, its occurrences at positions p1x < p2x < ... < pnxx are encoded by the gaps between consecutive occurrences:

O(|S| log |S|) + Σx=1,...,|S| Σi=2,...,nx |γ( pix − pi-1x )|

By Jensen's inequality this is

≤ O(|S| log |S|) + Σx=1,...,|S| nx·[ 2·log(N/nx) + 1 ]
= O(|S| log |S|) + N·[ 2·H0(X) + 1 ]

hence  La[mtf] ≤ 2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep the MTF-list efficiently:

Search tree
  Leaves contain the symbols, ordered as in the MTF-list
  Nodes contain the size of their descending subtree

Hash Table
  key is a symbol
  data is a pointer to the corresponding tree leaf

Each tree operation takes O(log |S|) time
Total is O(n log |S|), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings ⟹ just the run lengths and one bit

Properties:
  Exploits spatial locality, and it is a dynamic code. There is a memory.
  X = 1^n 2^n 3^n … n^n  ⟹  Huff(X) = Θ(n^2 log n)  >  Rle(X) = Θ(n log n)
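A minimal sketch of RLE matching the example above.

def rle(s):
    out, i = [], 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:    # extend the current run
            j += 1
        out.append((s[i], j - i))             # (symbol, run length)
        i = j
    return out

# rle("abbbaacccca") == [('a',1), ('b',3), ('a',2), ('c',4), ('a',1)]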

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g. p(a) = .2, p(b) = .5, p(c) = .3

f(i) = Σj=1,...,i-1 p(j),   hence f(a) = .0, f(b) = .2, f(c) = .7

[Figure: the unit interval [0,1) split into a = [0,.2), b = [.2,.7), c = [.7,1).]

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac

[Figure: nested intervals. Start with [0,1); after b the interval is [.2,.7); after a it is [.2,.3); after c it is [.27,.3).]

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l0 = 0,   li = li-1 + si-1 · f(ci)
s0 = 1,   si = si-1 · p(ci)

f(c) is the cumulative prob. up to symbol c (not included)

Final interval size is   sn = Πi=1,...,n p(ci)

The interval for a message sequence will be called the
sequence interval
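A minimal sketch of the interval recurrence above, using exact fractions (the integer implementation described later renormalizes instead); the probabilities below are those of the running example.

from fractions import Fraction

def sequence_interval(msg, p):
    # p: dict symbol -> probability (as decimal strings); f(c) = cum. prob. of the symbols before c
    f, acc = {}, Fraction(0)
    for c in sorted(p):                        # symbol order a < b < c, as in the example
        f[c] = acc
        acc += Fraction(p[c])
    l, s = Fraction(0), Fraction(1)            # l0 = 0, s0 = 1
    for c in msg:
        l = l + s * f[c]                       # li = l_{i-1} + s_{i-1} * f(ci)
        s = s * Fraction(p[c])                 # si = s_{i-1} * p(ci)
    return l, s                                # the sequence interval is [l, l+s)

# sequence_interval("bac", {"a": "0.2", "b": "0.5", "c": "0.3"}) == (Fraction(27,100), Fraction(3,100))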

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
[Figure: .49 falls in b = [.2,.7); within that interval it falls again in b = [.3,.55); within that, it falls in c = [.475,.55).]

The message is bbc.

Representing a real number
Binary fractional representation:
  .75 = .11      1/3 = .010101…      11/16 = .1011

Algorithm (emit the binary expansion of x):
  1. x = 2·x
  2. if x < 1, output 0
  3. else x = x − 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
e.g. [0,.33) → .01      [.33,.66) → .1      [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
         min        max        interval
.11      .110…0     .111…1     [.75, 1.0)
.101     .1010…0    .1011…1    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

[Figure: sequence interval [.61, .79); the code interval of .101, i.e. [.625, .75), is contained in it.]

Can use L + s/2 truncated to 1 + ⌈log (1/s)⌉ bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most
1 + ⌈log (1/s)⌉ = 1 + ⌈log Πi (1/pi)⌉
≤ 2 + Σi=1,n log (1/pi)
= 2 + Σk=1,|S| n·pk·log (1/pk)
= 2 + n·H0   bits

nH0 + 0.02·n bits in practice, because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever the sequence interval falls into the top, bottom or middle half, expand the interval by a factor of 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine: the ATB receives the current interval (L,s) and a symbol c with its distribution (p1,…,p|S|), and returns the new interval (L',s').

[Figure: (L,s) → ATB(c) → (L',s').]

Therefore, even the distribution can change over time.

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).
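A minimal sketch of order-k context counting, i.e. just the statistics behind conditional probabilities like p(e|th) = 7/12; escape handling and the arithmetic coder are omitted.

from collections import defaultdict

def context_counts(text, k):
    # counts[ctx][c] = how many times char c followed the length-k context ctx in text
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(k, len(text)):
        counts[text[i-k:i]][text[i]] += 1
    return counts

# e.g., if "th" was seen 12 times, 7 of them followed by "e", then
# counts["th"]["e"] / sum(counts["th"].values()) == 7/12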

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
The ATB is driven by p[ s | context ], where s = c or esc.

[Figure: (L,s) → ATB → (L',s').]

Encoder and Decoder must know the protocol for selecting the same conditional probability distribution (PPM-variant).

PPM: Example Contexts
String = ACCBACCACBA B,   k = 2

Context Empty:  A = 4, B = 2, C = 5, $ = 3
Context A:      C = 3, $ = 1
Context B:      A = 2, $ = 1
Context C:      A = 1, B = 2, C = 2, $ = 3
Context AC:     B = 1, C = 2, $ = 2
Context BA:     C = 1, $ = 1
Context CA:     C = 1, $ = 1
Context CB:     A = 2, $ = 1
Context CC:     A = 1, B = 1, $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character
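A minimal sketch of this LZ77 step with a fixed window W and a naïve longest-match search (no hashing, no LZSS format); it reproduces the triples of the example above.

def lz77(T, W=6):
    i, out = 0, []
    while i < len(T):
        best_d, best_len = 0, 0
        for s in range(max(0, i - W), i):             # candidate copy start inside the window
            l = 0
            while i + l < len(T) - 1 and T[s + l] == T[i + l]:
                l += 1                                # the match may run past the cursor (overlap)
            if l > best_len:
                best_d, best_len = i - s, l
        nxt = T[i + best_len]                         # next char beyond the longest match
        out.append((best_d, best_len, nxt))
        i += best_len + 1                             # advance by len + 1
    return out

# lz77("aacaacabcabaaac") == [(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')]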

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids
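A minimal sketch of the LZ78 coding loop just described, with the dictionary kept in a plain dict instead of a trie.

def lz78_encode(T):
    D = {"": 0}                          # dictionary: phrase -> id (0 = empty phrase)
    out, S = [], ""
    for c in T:
        if S + c in D:                   # keep extending the current match
            S += c
        else:
            out.append((D[S], c))        # output (id of longest match S, next char c)
            D[S + c] = len(D)            # add Sc to the dictionary
            S = ""
    if S:                                # flush a pending match at the end of the input
        out.append((D[S[:-1]], S[-1]))
    return out

# lz78_encode("aabaacabcabcb") == [(0,'a'), (1,'b'), (1,'a'), (0,'c'), (2,'c'), (5,'b')]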

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
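A minimal sketch of LZW decoding including the special SSc case (when a received code is not yet in the dictionary, the phrase must be prev + prev[0]); the initial codes a = 112, b = 113, c = 114 follow the slide's toy numbering, not real ASCII.

def lzw_decode(codes, first=112, alphabet="abc"):
    D = {first + i: ch for i, ch in enumerate(alphabet)}   # toy initial dictionary
    prev = D[codes[0]]
    out, next_id = [prev], 256
    for code in codes[1:]:
        if code in D:
            cur = D[code]
        else:                                  # special case: the code is the one about to be defined
            cur = prev + prev[0]               # so the phrase must be prev + first char of prev (SSc)
        out.append(cur)
        D[next_id] = prev + cur[0]             # what the encoder added one step earlier
        next_id += 1
        prev = cur
    return "".join(out)

# lzw_decode([112,112,113,256,114,257,261,114]) == "aabaacababac"
# (the slide's example, up to its final flushed 'b')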

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
We are given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: the L → F mapping
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i


How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
  Compute LF[0,n-1];      // LF[i] = row of F holding the same character occurrence as L[i]
  r = 0; i = n;
  while (i > 0) {
    T[i] = L[r];          // the char in L precedes the char in F on the same row
    r = LF[r]; i--;       // jump to the row whose F-char is the one just written
  }
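A minimal Python sketch of the inversion: LF is obtained by stably sorting the positions of L (equal chars keep their relative order, as observed above); differently from the pseudocode, the walk starts from the row ending with the sentinel '#', so that T is rebuilt exactly.

def invert_bwt(L, sentinel="#"):
    n = len(L)
    order = sorted(range(n), key=lambda i: (L[i], i))   # stable sort of L gives F and the LF mapping
    LF = [0] * n
    for f_row, l_row in enumerate(order):
        LF[l_row] = f_row                    # L[l_row] sits at row f_row of F
    T = [""] * n
    r = L.index(sentinel)                    # the row equal to T is the one whose last char is '#'
    for i in range(n - 1, -1, -1):           # fill T backwards: L[r] precedes F[r] in T
        T[i] = L[r]
        r = LF[r]
    return "".join(T)

# invert_bwt("ipssm#pissii") == "mississippi#"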

How to compute the BWT ?
SA = 12 11 8 5 2 1 10 9 7 4 6 3

[Figure: the BWT matrix (sorted rotations of mississippi#) side by side with SA; its last column is L = i p s s m # p i s s i i.]

We said that: L[i] precedes F[i] in T
L[3] = T[7]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA:  12  11  8  5  2  1  10  9  7  4  6  3
Sorted suffixes:  #, i#, ippi#, issippi#, ississippi#, mississippi#, pi#, ppi#, sippi#, sissippi#, ssippi#, ssissippi#

Elegant but inefficient.

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n^2 log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion pages are available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in-degree(u) = k ]  ∝  1 / k^α,    α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)
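A minimal sketch of the gap encoding of a successor list as in the formula above (S(x) = {s1−x, s2−s1−1, …, sk−sk−1−1}). The remapping used for the possibly negative first entry is not specified on these slides, so the zig-zag mapping below is only an illustrative assumption, and the adjacency list in the usage line is made up.

def gaps(x, successors):
    # successors: sorted adjacency list of node x
    out = [successors[0] - x]                       # first entry may be negative (locality)
    out += [successors[i] - successors[i-1] - 1 for i in range(1, len(successors))]
    return out

def zigzag(v):
    # map any integer to a non-negative one: 0,-1,1,-2,2,... -> 0,1,2,3,4,... (assumed remapping)
    return 2*v if v >= 0 else -2*v - 1

# gaps(15, [13, 15, 16, 17, 18, 19, 23, 24, 203]) == [-2, 1, 0, 0, 0, 0, 3, 0, 178]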

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides an efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
           Emacs size   Emacs time
uncompr    27Mb         ---
gzip       8Mb          35 secs
zdelta     1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: weighted graph GF with a dummy node; edge weights are zdelta sizes, the dummy-node edges carry the gzip sizes, and the min branching selects the references.]

           space   time
uncompr    30Mb    ---
tgz        20%     linear
THIS       8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
           space    time
uncompr    260Mb    ---
tgz        12%      2 mins
THIS       8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
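A minimal sketch of the block-matching idea: the client hashes the non-overlapping blocks of f_old, the server scans f_new emitting block references or literal bytes. The real rsync uses a 4-byte rolling hash plus a 2-byte MD5 check (and gzip for literals); Python's built-in hash is used here only as a stand-in.

def client_hashes(f_old, B):
    # hash of every non-overlapping block of f_old -> block index
    return {hash(f_old[j:j+B]): j // B for j in range(0, len(f_old) - B + 1, B)}

def rsync_encode(f_new, old_block_hashes, B):
    out, i = [], 0
    while i < len(f_new):
        h = hash(f_new[i:i+B]) if i + B <= len(f_new) else None
        if h in old_block_hashes:
            out.append(("copy", old_block_hashes[h]))   # reference a block of f_old
            i += B
        else:
            out.append(("literal", f_new[i]))           # no block matches here: send the byte
            i += 1
    return out

# old = "the quick brown fox"; new = "the very quick brown fox"
# rsync_encode(new, client_hashes(old, 4), 4) mixes ("copy", k) and ("literal", ch) items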

Rsync: some experiments

          gcc size   emacs size
total     27288      27326
gzip      7563       8577
zdelta    227        1431
rsync     964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes fail to find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes fail to find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Figure: the suffix tree of T# = mississippi#; edge labels are substrings of T# (e.g. "mississippi#", "ssi", "ppi#", "i#", ...) and the leaves store the starting positions 1..12 of the suffixes.]

T# = mississippi#

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position in SUF(T) is the lexicographic position of P.

Storing SUF(T) explicitly takes Θ(N^2) space; the suffix array stores only the starting positions.

T = mississippi#
SA     = 12 11 8 5 2 1 10 9 7 4 6 3
SUF(T) = #, i#, ippi#, issippi#, ississippi#, mississippi#, pi#, ppi#, sippi#, sissippi#, ssippi#, ssissippi#

P = si

Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
⟹ In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
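A minimal sketch of the indirect binary search above (O(p) character comparisons per probed suffix), returning the occurrences' starting positions; SA entries are 1-based as in the slides.

def sa_search(T, SA, P):
    def suffix(k):
        return T[SA[k]-1:]                           # suffix pointed to by SA[k]
    lo, hi = 0, len(SA)
    while lo < hi:                                   # lower bound: first suffix >= P
        mid = (lo + hi) // 2
        if suffix(mid) < P: lo = mid + 1
        else: hi = mid
    start, hi = lo, len(SA)
    while lo < hi:                                   # upper bound: first suffix not prefixed by P
        mid = (lo + hi) // 2
        if suffix(mid)[:len(P)] <= P: lo = mid + 1
        else: hi = mid
    return sorted(SA[start:lo])                      # starting positions of the occurrences

# sa_search("mississippi#", [12,11,8,5,2,1,10,9,7,4,6,3], "si") == [4, 7]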

Locating the occurrences
[Figure: binary search over SA of T = mississippi# for the range of suffixes prefixed by "si"; the range contains the entries 7 (sippi#) and 4 (sissippi#), so occ = 2 and the occurrences start at positions 4 and 7.]

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does it exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does it exist a substring of length ≥ L occurring ≥ C times ?
• Search for Lcp[i,i+C-2] whose entries are ≥ L


Slide 196

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
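A minimal Python sketch, not part of the slides, of the scan just described; the prime q is assumed to be already chosen, and verify=True gives the deterministic variant:

def karp_rabin(T, P, q, verify=True):
    # binary strings T[0..n-1], P[0..m-1]; Hq(s) = H(s) mod q
    n, m = len(T), len(P)
    hp = ht = 0
    for i in range(m):
        hp = (2 * hp + int(P[i])) % q        # Hq(P)
        ht = (2 * ht + int(T[i])) % q        # Hq(T_1)
    top = pow(2, m - 1, q)                   # 2^(m-1) mod q, used to drop the leaving bit
    occ = []
    for r in range(n - m + 1):
        if ht == hp and (not verify or T[r:r + m] == P):
            occ.append(r)                    # 0-based starting position
        if r + m < n:                        # Hq(T_{r+1}) from Hq(T_r) in O(1) time
            ht = (2 * (ht - int(T[r]) * top) + int(T[r + m])) % q
    return occ

# karp_rabin("10110101", "0101", 7) -> [4], i.e. the occurrence reported at position 5 in the slides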

Problem 1: Solution
Dictionary = {bzip, not, or, space},  P = bzip = 1a 0b,  S = "bzip or not bzip"

[Figure: the tagged-Huffman codeword tree of the dictionary and the compressed text C(S); the alignments of P's codeword inside C(S) are marked as matches (yes) or non-matches (no).]
Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M is an m × n binary matrix. For T = california and P = for:

        c a l i f o r n i a
        1 2 3 4 5 6 7 8 9 10
   f    0 0 0 0 1 0 0 0 0 0
   o    0 0 0 0 0 1 0 0 0 0
   r    0 0 0 0 0 0 1 0 0 0

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting A's bits down by
one and setting the first bit to 1.

Example: BitShift( (0,1,1,0,1) ) = (1,0,1,1,0)


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the (j-1)-th one.
We define the m-length binary vector U(x) for each
character x in the alphabet: U(x) is set to 1 for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M(j) = BitShift(M(j-1)) & U(T[j])


For i > 1, entry M(i,j) = 1 iff
(1) the first i-1 characters of P match the i-1 characters of T
ending at character j-1,  i.e. M(i-1,j-1) = 1
(2) P[i] = T[j],  i.e. the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) to the i-th position;
AND-ing it with the i-th bit of U(T[j]) establishes whether both conditions hold.

An example j=1
T = xabxabaaca,  P = abaac,  U(x) = (0,0,0,0,0)

M(1) = BitShift(M(0)) & U(T[1]) = (1,0,0,0,0) & (0,0,0,0,0) = (0,0,0,0,0)

An example j=2
T = xabxabaaca,  P = abaac,  U(a) = (1,0,1,1,0)

M(2) = BitShift(M(1)) & U(T[2]) = (1,0,0,0,0) & (1,0,1,1,0) = (1,0,0,0,0)

An example j=3
T = xabxabaaca,  P = abaac,  U(b) = (0,1,0,0,0)

M(3) = BitShift(M(2)) & U(T[3]) = (1,1,0,0,0) & (0,1,0,0,0) = (0,1,0,0,0)

An example j=9
T = xabxabaaca,  P = abaac,  U(c) = (0,0,0,0,1)

M after column 9 (columns j = 1..9):
        x a b x a b a a c
   1    0 1 0 0 1 0 1 1 0
   2    0 0 1 0 0 1 0 0 0
   3    0 0 0 0 0 0 1 0 0
   4    0 0 0 0 0 0 0 1 0
   5    0 0 0 0 0 0 0 0 1

M(9) = BitShift(M(8)) & U(T[9]) = (1,1,0,0,1) & (0,0,0,0,1) = (0,0,0,0,1)

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.
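A minimal Python sketch, not part of the slides, of the Shift-And scan; each column of M is kept as an integer, so each step is a couple of word operations when m ≤ w:

def shift_and(T, P):
    m = len(P)
    U = {}                                   # U[x] has bit i set iff P[i] == x (0-based)
    for i, x in enumerate(P):
        U[x] = U.get(x, 0) | (1 << i)
    M, occ = 0, []
    for j, c in enumerate(T):
        M = ((M << 1) | 1) & U.get(c, 0)     # M(j) = BitShift(M(j-1)) & U(T[j])
        if M & (1 << (m - 1)):               # last bit set: P ends at position j
            occ.append(j)                    # 0-based ending position
    return occ

# shift_and("xabxabaaca", "abaac") -> [8], i.e. the occurrence ending at position 9 in the slides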

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: Another solution
Dictionary = {bzip, not, or, space},  P = bzip = 1a 0b,  S = "bzip or not bzip"

[Figure: the tagged-Huffman codeword tree of the dictionary and the compressed text C(S), with the alignments of P's codeword marked as matches (yes) or non-matches (no).]
Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

P = o

Dictionary = {bzip, not, or, space},  S = "bzip or not bzip"

[Figure: the tagged-Huffman codeword tree of the dictionary and the compressed text C(S).]

not = 1 g 0 g 0 a
or  = 1 g 0 a 0 b

Speed ≈ Compression ratio? No! Why?
A scan of C(S) is needed for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: a text T with the occurrences of the patterns P1 and P2 highlighted.]

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of length m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) AND R, i.e.
U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a
pattern
For any step j, compute M(j) and then M(j) OR U’(T[j]). Why?
This sets to 1 the first bit of each pattern that starts with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

Dictionary = {bzip, not, or, space},  S = "bzip or not bzip",  P = bot,  k = 2

[Figure: the tagged-Huffman codeword tree of the dictionary and the compressed text C(S).]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal: given k, find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
there are no more than l mismatches between the
first i characters of P and the i characters of T
ending at character j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

[Figure: P[1, i-1] aligned against T ending at position j-1 with at most l mismatches, followed by the matching pair P[i] = T[j].]

BitShift(M^l(j-1)) & U(T[j])

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

[Figure: P[1, i-1] aligned against T ending at position j-1 with at most l-1 mismatches.]

BitShift(M^{l-1}(j-1))

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1
T = xabxabaaca,  P = abaad

M1 =
        x a b x a b a a c a
   1    1 1 1 1 1 1 1 1 1 1
   2    0 0 1 0 0 1 0 1 1 0
   3    0 0 0 1 0 0 1 0 0 1
   4    0 0 0 0 1 0 0 1 0 0
   5    0 0 0 0 0 0 0 0 1 0

M0 =
        x a b x a b a a c a
   1    0 1 0 0 1 0 1 1 0 1
   2    0 0 1 0 0 1 0 0 0 0
   3    0 0 0 0 0 0 1 0 0 0
   4    0 0 0 0 0 0 0 1 0 0
   5    0 0 0 0 0 0 0 0 0 0

How much do we pay?





The running time is O(kn(1+m/w)).
Again, the method is practically efficient for
small m.
Still, only O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.
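A Python sketch, not part of the slides, of the agrep recurrence above, keeping one integer column per level l = 0..k:

def agrep(T, P, k):
    m = len(P)
    U = {}
    for i, x in enumerate(P):
        U[x] = U.get(x, 0) | (1 << i)
    M = [0] * (k + 1)                        # M[l] = current column of M^l
    occ = []
    for j, c in enumerate(T):
        prev = M[:]                          # columns j-1 of all levels
        for l in range(k + 1):
            col = ((prev[l] << 1) | 1) & U.get(c, 0)     # case 1: P[i] = T[j]
            if l > 0:
                col |= (prev[l - 1] << 1) | 1            # case 2: charge a mismatch at j
            M[l] = col
        if M[k] & (1 << (m - 1)):
            occ.append(j)                    # P ends at j with <= k mismatches (0-based)
    return occ

# agrep("xabxabaaca", "abaad", 1) -> [8], matching M1(5,9) = 1 in the Example M1 table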

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

Dictionary = {bzip, not, or, space},  S = "bzip or not bzip",  P = bot,  k = 2

[Figure: the tagged-Huffman codeword tree of the dictionary and the compressed text C(S), with the occurrences of the matching term marked.]
not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Delection: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
γ(x) = 0^{Length-1} followed by x in binary,
where x > 0 and Length = ⌊log2 x⌋ + 1
e.g., 9 is represented as <000, 1001>.

The γ-code for x takes 2·⌊log2 x⌋ + 1 bits
(i.e. a factor of 2 from optimal)

Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8

6

3

59

7
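A small Python sketch, not part of the slides, of γ-coding and decoding; the decoder reproduces the exercise above:

def gamma_encode(x):                 # x > 0
    b = bin(x)[2:]                   # x in binary: Length = floor(log2 x) + 1 bits
    return '0' * (len(b) - 1) + b    # (Length - 1) zeros, then x in binary

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == '0':        # count the leading zeros = Length - 1
            z += 1; i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out

# gamma_decode("0001000001100110000011101100111") -> [8, 6, 3, 59, 7]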

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How good is this approach with respect to Huffman?
Compression ratio ≤ 2 · H0(s) + 1
Key fact:
1 ≥ Σ_{i=1..x} p_i ≥ x · p_x,  hence  x ≤ 1/p_x

How good is it ?
Encode the integers via γ-coding:
|γ(i)| ≤ 2 · log i + 1
The cost of the encoding is (recall i ≤ 1/p_i):

Σ_{i=1..|S|} p_i · |γ(i)|  ≤  Σ_{i=1..|S|} p_i · [ 2 · log(1/p_i) + 1 ]  =  2 · H0(X) + 1

Not much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC (s = c = 128) encodes 128 + 128^2 = 16,512 words on at most 2 bytes
A (230,26)-dense code encodes 230 + 230·26 = 6,210 words on at most 2
bytes, hence more words fit on 1 byte, and thus it wins if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1^n 2^n 3^n … n^n  gives  Huff = O(n^2 log n), MTF = O(n log n) + n^2

Not much worse than Huffman
...but it may be far better
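A minimal Python sketch, not part of the slides, of Move-to-Front; the input string and the initial list below are just a made-up example:

def mtf_encode(seq, symbols):
    L = list(symbols)                # current MTF list
    out = []
    for s in seq:
        i = L.index(s)               # 1) output the position of s in L
        out.append(i)
        L.insert(0, L.pop(i))        # 2) move s to the front of L
    return out

# mtf_encode("bzzzaab", "abz") -> [1, 2, 0, 0, 2, 0, 2]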

MTF: how good is it ?
Encode the integers via γ-coding:
|γ(i)| ≤ 2 · log i + 1
Put S in front of the sequence and consider the cost of encoding:

O(|S| log |S|) + Σ_{x=1..|S|} Σ_{i≥2} |γ( p_i^x - p_{i-1}^x )|

where p_1^x < p_2^x < … are the positions of the n_x occurrences of symbol x.

By Jensen’s inequality:

≤ O(|S| log |S|) + Σ_{x=1..|S|} n_x · [ 2 · log(N/n_x) + 1 ]
= O(|S| log |S|) + N · [ 2 · H0(X) + 1 ]

hence  L_a[mtf] ≤ 2 · H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1^n 2^n 3^n … n^n  gives  Huff(X) = Θ(n^2 log n) > Rle(X) = Θ(n log n)

There is a memory
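A minimal Python run-length encoder, not part of the slides, reproducing the example above:

def rle(s):
    out = []
    for c in s:
        if out and out[-1][0] == c:
            out[-1] = (c, out[-1][1] + 1)    # extend the current run
        else:
            out.append((c, 1))               # start a new run
    return out

# rle("abbbaacccca") -> [('a',1), ('b',3), ('a',2), ('c',4), ('a',1)]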

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.  f(a) = .0, f(b) = .2, f(c) = .7,  where  f(i) = Σ_{j<i} p(j)

[Figure: the unit interval [0,1) partitioned into a = [0,.2), b = [.2,.7), c = [.7,1).]

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
[Figure: the interval shrinks as the message is coded: start [0,1); after b: [.2,.7); after a: [.2,.3); after c: [.27,.3).]
The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l_0 = 0,   l_i = l_{i-1} + s_{i-1} · f[c_i]
s_0 = 1,   s_i = s_{i-1} · p[c_i]

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is   s_n = Π_{i=1..n} p[c_i]

The interval for a message sequence will be called the
sequence interval
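A tiny Python sketch, not part of the slides, of the recurrences above; it computes the sequence interval [l_n, l_n + s_n):

def sequence_interval(msg, p, f):
    l, s = 0.0, 1.0
    for c in msg:
        l = l + s * f[c]             # l_i = l_{i-1} + s_{i-1} * f[c_i]
        s = s * p[c]                 # s_i = s_{i-1} * p[c_i]
    return l, s

p = {'a': .2, 'b': .5, 'c': .3}
f = {'a': .0, 'b': .2, 'c': .7}
# sequence_interval("bac", p, f) -> approximately (0.27, 0.03), i.e. the interval [.27, .30)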

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
[Figure: decoding .49 step by step: .49 falls in [.2,.7) hence b; within it, in [.3,.55) hence b; within that, in [.475,.55) hence c.]

The message is bbc.

Representing a real number
Binary fractional representation:
.75 = .11
1/3 = .0101…
11/16 = .1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11
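A small Python sketch, not part of the slides, of the digit-emitting algorithm above:

def binary_fraction(x, digits):      # x in [0, 1)
    out = []
    for _ in range(digits):
        x *= 2
        if x < 1:
            out.append('0')
        else:
            out.append('1'); x -= 1
    return '.' + ''.join(out)

# binary_fraction(0.75, 2) -> ".11";  binary_fraction(1/3, 4) -> ".0101"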

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
code    min      max      interval
.11     .110     .111     [.75, 1.0)
.101    .1010    .1011    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + ⌈log (1/s)⌉
= 1 + ⌈log Π_{i=1..n} (1/p_i)⌉
≤ 2 + Σ_{i=1..n} log (1/p_i)
= 2 + Σ_{k=1..|S|} n·p_k · log (1/p_k)
= 2 + n·H0  bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

[Figure: the Arithmetic ToolBox (ATB) as a state machine: given the current interval (L, s) and the distribution (p1, …, p|S|), coding the symbol c yields the new interval (L’, s’) inside [L, L+s).]

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
[Figure: PPM feeds the conditional distribution p[s | context], with s = c or esc, to the Arithmetic ToolBox, which maps the interval (L, s) to (L’, s’).]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
String = ACCBACCACBA B,   k = 2

Context (empty):   A = 4   B = 2   C = 5   $ = 3

Context (order 1):
  A:   C = 3   $ = 1
  B:   A = 2   $ = 1
  C:   A = 1   B = 2   C = 2   $ = 3

Context (order 2):
  AC:  B = 1   C = 2   $ = 2
  BA:  C = 1   $ = 1
  CA:  C = 1   $ = 1
  CB:  A = 2   $ = 1
  CC:  A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
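A Python sketch, not part of the slides, of LZ77 decoding from (d, len, c) triples; the character-by-character copy handles the overlapping case len > d:

def lz77_decode(triples):
    out = []
    for d, length, c in triples:
        start = len(out) - d
        for i in range(length):
            out.append(out[start + i])   # each copied char is available before it is needed
        out.append(c)
    return ''.join(out)

# decoding the windowed example above:
# lz77_decode([(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')]) -> "aacaacabcabaaac"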

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb
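A minimal Python LZ78 encoder, not part of the slides, emitting (id, c) pairs with id 0 for the empty phrase; it reproduces the coding example above:

def lz78_encode(s):
    dictionary = {}                   # phrase -> id
    out, cur = [], ''
    for c in s:
        if cur + c in dictionary:
            cur += c                  # keep extending the longest match S
        else:
            out.append((dictionary.get(cur, 0), c))
            dictionary[cur + c] = len(dictionary) + 1   # add Sc with the next id
            cur = ''
    if cur:
        out.append((dictionary[cur], ''))               # flush a pending match
    return out

# lz78_encode("aabaacabcabcb") -> [(0,'a'), (1,'b'), (1,'a'), (0,'c'), (2,'c'), (5,'b')]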

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
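A Python sketch, not part of the slides: BWT by sorting the rotations (quadratic, as in the construction discussed next) and its inversion via the LF mapping used by InvertBWT:

def bwt(T):                              # T ends with a unique smallest symbol, e.g. '#'
    rotations = sorted(T[i:] + T[:i] for i in range(len(T)))
    return ''.join(row[-1] for row in rotations)

def inverse_bwt(L):
    n = len(L)
    order = sorted(range(n), key=lambda i: (L[i], i))   # stable map from F rows to L rows
    LF = [0] * n
    for f_row, l_row in enumerate(order):
        LF[l_row] = f_row                # LF[i]: row of F holding the same occurrence as L[i]
    r, out = 0, []
    for _ in range(n):                   # L[r] precedes F[r] in T: rebuild T backwards
        out.append(L[r])
        r = LF[r]
    out.reverse()
    return ''.join(out[1:] + out[:1])    # rotate so that the sentinel ends the string

# bwt("mississippi#") -> "ipssm#pissii";  inverse_bwt("ipssm#pissii") -> "mississippi#"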

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n^2 log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: the probability that a node has x links is about 1/x^α, α ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in-degree(u) = k ]  is proportional to  1/k^α,   α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: the probability that a node has x links is about 1/x^α, α ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
[Figure: a plot of the Web-graph adjacency matrix (axes i, j).]

21 million pages, 150 million links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides an efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
           Emacs size   Emacs time
uncompr    27Mb         ---
gzip       8Mb          35 secs
zdelta     1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

[Figure: the pair of proxies sit on the two sides of the slow link; pages travel delta-encoded against a shared reference over the slow link, and in full over the fast link to the web.]

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: an example weighted graph GF over the files, plus the dummy node; edge weights are zdelta sizes, dummy-node weights are gzip sizes.]

           space   time
uncompr    30Mb    ---
tgz        20%     linear
THIS       8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
           space   time
uncompr    260Mb   ---
tgz        12%     2 mins
THIS       8%      16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
[Figure: the client sends the block hashes of f_old to the server; the server answers with the encoded f_new (copy references into f_old plus literals).]

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

           gcc size   emacs size
total      27288      27326
gzip       7563       8577
zdelta     227        1431
rsync      964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Figure: the suffix tree of T#; edge labels spell the suffixes and the leaves carry their starting positions 1..12.]

T# = mississippi#
     2 4 6 8 10

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
5

Θ(N^2) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
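A Python sketch, not part of the slides, of the indirected binary search on SA; positions are 0-based here, while the slides use 1-based positions:

def sa_build(T):
    # naive construction by sorting the suffixes (quadratic in the worst case)
    return sorted(range(len(T)), key=lambda i: T[i:])

def sa_search(T, SA, P):
    lo, hi = 0, len(SA)                  # binary search for the first suffix >= P
    while lo < hi:
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + len(P)] < P:
            lo = mid + 1
        else:
            hi = mid
    occ = []
    while lo < len(SA) and T[SA[lo]:SA[lo] + len(P)] == P:
        occ.append(SA[lo]); lo += 1      # Prop 1: matching suffixes are contiguous in SA
    return sorted(occ)

# T = "mississippi#";  sa_search(T, sa_build(T), "si") -> [3, 6] (positions 4 and 7 in the slides)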

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
• Search for Lcp[i,i+C-2] whose entries are ≥ L


Slide 197

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
(figure) Dictionary = {bzip, not, or, space};  S = "bzip or not bzip";  P = bzip, encoded as 1a 0b.
The compressed text C(S) is scanned codeword by codeword, comparing each tagged codeword against the code of P: the two occurrences of [bzip] are reported ("yes"), all other codewords are skipped.

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M is an m×n matrix (rows indexed by P, columns by T):

         c  a  l  i  f  o  r  n  i  a
         1  2  3  4  5  6  7  8  9  10
  f   1  0  0  0  0  1  0  0  0  0  0
  o   2  0  0  0  0  0  1  0  0  0  0
  r   3  0  0  0  0  0  0  1  0  0  0
How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the (j-1)-th one.
We define the m-length binary vector U(x) for each
character x in the alphabet: U(x) has a 1 exactly at the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, the j-th column is obtained by

  M(j) = BitShift( M(j-1) ) & U( T[j] )

For i > 1, entry M(i,j) = 1 iff
(1) the first i-1 characters of P match the i-1 characters of T
    ending at character j-1, i.e. M(i-1,j-1) = 1
(2) P[i] = T[j], i.e. the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) into the i-th position;
ANDing it with the i-th bit of U(T[j]) establishes whether both conditions hold.
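A minimal Python sketch of the Shift-And scan (my own illustration): a Python integer plays the role of a column of M, with bit i-1 standing for row i.

def shift_and(T, P):
    # Report the 0-based ending positions of the occurrences of P in T.
    m = len(P)
    U = {}                                   # U[c] has bit i set iff P[i] == c
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    goal = 1 << (m - 1)                      # bit of the last row of M
    col = 0                                  # column M(j-1), initially all zeros
    occ = []
    for j, c in enumerate(T):
        # BitShift: shift the previous column down by one and set the first bit to 1
        col = ((col << 1) | 1) & U.get(c, 0)
        if col & goal:                       # a 1 in row m: occurrence ending at j
            occ.append(j)
    return occ

# Example: shift_and("xabxabaaca", "abaac") -> [8]   (occurrence ending at T[8], starting at T[4])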

An example
T = xabxabaaca (n=10), P = abaac (m=5), so
  U(a) = (1,0,1,1,0), U(b) = (0,1,0,0,0), U(c) = (0,0,0,0,1), U(x) = (0,0,0,0,0)

j=1: T[1]=x   M(1) = BitShift(M(0)) & U(x) = (1,0,0,0,0) & (0,0,0,0,0) = (0,0,0,0,0)
j=2: T[2]=a   M(2) = BitShift(M(1)) & U(a) = (1,0,0,0,0) & (1,0,1,1,0) = (1,0,0,0,0)
j=3: T[3]=b   M(3) = BitShift(M(2)) & U(b) = (1,1,0,0,0) & (0,1,0,0,0) = (0,1,0,0,0)
...
j=9: T[9]=c   M(9) = BitShift(M(8)) & U(c) = (1,1,0,0,1) & (0,0,0,0,1) = (0,0,0,0,1)

After j=9 the matrix M is:

         x  a  b  x  a  b  a  a  c
         1  2  3  4  5  6  7  8  9
  a   1  0  1  0  0  1  0  1  1  0
  b   2  0  0  1  0  0  1  0  0  0
  a   3  0  0  0  0  0  0  1  0  0
  a   4  0  0  0  0  0  0  0  1  0
  c   5  0  0  0  0  0  0  0  0  1

The 1 in row m=5 of column 9 signals an occurrence of P ending at position 9 of T (i.e. starting at position 5).

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: Another solution
(figure) Same setting as before: Dictionary = {bzip, not, or, space};  S = "bzip or not bzip";  P = bzip = 1a 0b.
C(S) is scanned and P's codeword is compared at each codeword boundary; mismatching codewords are marked "no", the two occurrences of [bzip] are marked "yes".

Speed ≈ Compression ratio

Problem 2
Dictionary = {bzip, not, or, space};  S = "bzip or not bzip";  C(S) as before.
Given a pattern P, find all the occurrences in S of all terms containing P as a substring.
Example: P = o. The dictionary terms containing "o" are not = 1g 0g 0a and or = 1g 0a 0b, so both codewords must be searched in C(S) ("yes" marks their occurrences).

Speed ≈ Compression ratio? No! Why?
A scan of C(S) is needed for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
(figure: a text T over the alphabet {A,B,C,D}, with one occurrence of P1 and one occurrence of P2 highlighted)

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

S is the concatenation of the patterns in P
R is a bitmap of length m:
  R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of the Shift-And method searching for S:
  For any symbol c, U'(c) = U(c) AND R
    U'(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
  For any step j,
    compute M(j)
    then OR it with U'(T[j]). Why?
      It sets to 1 the first bit of each pattern that starts with T[j]
    Check if there are occurrences ending in j. How?

Problem 3
Dictionary = {bzip, not, or, space};  S = "bzip or not bzip";  C(S) as before.
Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing at most k mismatches.
Example: P = bot, k = 2.

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position 4; it also occurs with 4 mismatches starting at position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal: given k, find all the occurrences of P in T with up to k mismatches.
We define the matrix Ml to be an m by n binary matrix, such that:

  Ml(i,j) = 1 iff there are at most l mismatches between the first i characters of P and the i characters of T ending at character j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

(figure: P[1..i-1] aligned with T[..j-1] with at most l mismatches; the i-th characters match)

  BitShift( Ml(j-1) ) & U( T[j] )

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

(figure: P[1..i-1] aligned with T[..j-1] with at most l-1 mismatches; the i-th position is charged as a mismatch)

  BitShift( Ml-1(j-1) )

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

  Ml(j) = [ BitShift( Ml(j-1) ) & U( T[j] ) ]  OR  BitShift( Ml-1(j-1) )
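A minimal Python sketch of this recurrence (my own code, 0-based positions): it keeps the k+1 columns M0(j), ..., Mk(j) and reports the ending positions of the occurrences with at most k mismatches.

def shift_and_mismatches(T, P, k):
    # Ending positions (0-based) of occurrences of P in T with <= k mismatches.
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    goal = 1 << (m - 1)
    M = [0] * (k + 1)                  # M[l] is the current column of matrix Ml
    occ = []
    for j, c in enumerate(T):
        prev = M[:]                    # the columns for position j-1
        u = U.get(c, 0)
        M[0] = ((prev[0] << 1) | 1) & u
        for l in range(1, k + 1):
            # case 1: prefix with <= l mismatches and P[i] == T[j]
            # case 2: prefix with <= l-1 mismatches, P[i] charged as a mismatch
            M[l] = (((prev[l] << 1) | 1) & u) | ((prev[l - 1] << 1) | 1)
        if M[k] & goal:
            occ.append(j)
    return occ

# Example: shift_and_mismatches("xabxabaaca", "abaad", 1) -> [8]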

Example M1
T = xabxabaaca, P = abaad

M1 =
         x  a  b  x  a  b  a  a  c  a
         1  2  3  4  5  6  7  8  9  10
      1  1  1  1  1  1  1  1  1  1  1
      2  0  0  1  0  0  1  0  1  1  0
      3  0  0  0  1  0  0  1  0  0  1
      4  0  0  0  0  1  0  0  1  0  0
      5  0  0  0  0  0  0  0  0  1  0

M0 =
         x  a  b  x  a  b  a  a  c  a
         1  2  3  4  5  6  7  8  9  10
      1  0  1  0  0  1  0  1  1  0  1
      2  0  0  1  0  0  1  0  0  0  0
      3  0  0  0  0  0  0  1  0  0  0
      4  0  0  0  0  0  0  0  1  0  0
      5  0  0  0  0  0  0  0  0  0  0

The 1 in row 5 of M1 at column 9 reports the occurrence with one mismatch ending at position 9 (T[5,9] = abaac vs P = abaad).

How much do we pay?





The running time is O(k·n·(1+m/w)).
Again, the method is practically efficient for small m.
Only O(k) columns of M are needed at any given time;
hence, the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary = {bzip, not, or, space};  S = "bzip or not bzip";  C(S) as before.
Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing k mismatches.
Example: P = bot, k = 2. The dictionary terms matching within 2 mismatches include not = 1g 0g 0a, whose codeword is then searched in C(S) ("yes" marks its occurrence).

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = the minimum number of operations
needed to transform p into s via three ops:

  Insertion: insert a symbol in p
  Deletion: delete a symbol from p
  Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding

  g(x) = 0 0 ... 0 (Length-1 zeroes) followed by x in binary
  x > 0 and Length = ⌊log2 x⌋ + 1
  e.g., 9 is represented as <000, 1001>

g-code for x takes 2⌊log2 x⌋ + 1 bits
(i.e. a factor of 2 from optimal)

Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

Answer: 8, 6, 3, 59, 7
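A small Python sketch (mine, not the slides') of g-encoding and of the decoding used in the exercise above:

def gamma_encode(x):
    # g-code of x > 0: (Length-1) zeroes followed by x in binary.
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode_stream(bits):
    # Decode a concatenation of g-codes back into the integer sequence.
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":                    # count the leading zeroes
            z += 1
            i += 1
        out.append(int(bits[i:i + z + 1], 2))    # read z+1 bits of the binary value
        i += z + 1
    return out

# gamma_decode_stream("0001000001100110000011101100111") -> [8, 6, 3, 59, 7]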

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code g(i).

Recall that: |g(i)| ≤ 2·log2 i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2·H0(S) + 1
Key fact:
  1 ≥ Σ_{i=1..x} pi ≥ x·px   hence   x ≤ 1/px

How good is it ?
Encode the integers via g-coding: |g(i)| ≤ 2·log2 i + 1
The cost of the encoding is (recall i ≤ 1/pi):

  Σ_{i=1..|S|} pi · |g(i)|  ≤  Σ_{i=1..|S|} pi · [ 2·log2(1/pi) + 1 ]  =  2·H0(X) + 1

Not much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2

A new concept: Continuers vs Stoppers

The main idea is:
  Previously we used: s = c = 128
  s + c = 256 (we are playing with 8 bits)
  Thus s items are encoded with 1 byte
  And s·c with 2 bytes, s·c² with 3 bytes, ...

An example
  5000 distinct words
  ETDC encodes 128 + 128² = 16512 words on at most 2 bytes
  A (230,26)-dense code encodes 230 + 230·26 = 6210 words on at most 2 bytes,
  hence more words on 1 byte; thus it wins if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, assuming c = 256 - s.

  Brute-force approach
  Binary search: on real distributions there seems to be a unique minimum

  Ks = max codeword length
  Fsk = cumulative probability of the symbols whose codeword length is ≤ k

Experiments: (s,c)-DC is rather interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:
  It exploits temporal locality, and it is dynamic
  X = 1^n 2^n 3^n ... n^n   gives   Huff = O(n² log n), MTF = O(n log n) + n²

Not much worse than Huffman
...but it may be far better
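A minimal Python sketch of the MTF transform just described (my own illustration); it outputs 1-based positions, as in the two steps above.

def mtf_encode(seq, alphabet):
    # Transform a symbol sequence into a sequence of 1-based list positions.
    L = list(alphabet)
    out = []
    for s in seq:
        pos = L.index(s)            # position of s in the current list (0-based)
        out.append(pos + 1)
        L.pop(pos)                  # move s to the front of L
        L.insert(0, s)
    return out

# mtf_encode("aaabbbbccc", "abc") -> [1, 1, 1, 2, 1, 1, 1, 3, 1, 1]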

MTF: how good is it ?
Encode the integers via g-coding: |g(i)| ≤ 2·log2 i + 1
Put the alphabet S at the front of the sequence and consider the cost of encoding:

  O(|S| log |S|) + Σ_{x=1..|S|} Σ_{i=2..n_x} | g( p_i^x - p_{i-1}^x ) |

where p_1^x < p_2^x < ... < p_{n_x}^x are the positions of the occurrences of symbol x.

By Jensen's inequality:

  ≤ O(|S| log |S|) + Σ_{x=1..|S|} n_x · [ 2·log2(N/n_x) + 1 ]
  = O(|S| log |S|) + N·[ 2·H0(X) + 1 ]

Hence  La[mtf] ≤ 2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
  abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings, just the run lengths and one starting bit suffice.
Properties:
  It exploits spatial locality, and it is a dynamic code
  There is a memory
  X = 1^n 2^n 3^n ... n^n   gives
Huff(X) = Θ(n² log n) > Rle(X) = Θ(n log n)
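A tiny Python sketch of the RLE transform used above (illustrative only):

def rle_encode(s):
    # "abbbaacccca" -> [('a',1), ('b',3), ('a',2), ('c',4), ('a',1)]
    runs = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs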

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.  p(a) = .2,  p(b) = .5,  p(c) = .3

  f(i) = Σ_{j=1..i-1} p(j)     so   f(a) = .0,  f(b) = .2,  f(c) = .7

  a = [0, .2) ,   b = [.2, .7) ,   c = [.7, 1.0)

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac   (p(a) = .2, p(b) = .5, p(c) = .3)

  start:     [0, 1)
  after b:   [.2, .7)
  after a:   [.2, .3)
  after c:   [.27, .3)

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
  l_0 = 0      l_i = l_{i-1} + s_{i-1} · f(c_i)
  s_0 = 1      s_i = s_{i-1} · p(c_i)

f[c] is the cumulative prob. up to symbol c (not included)

Final interval size is

  s_n = Π_{i=1..n} p(c_i)

The interval for a message sequence will be called the
sequence interval
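A small Python sketch of the interval computation above (my own, floating-point, hence only illustrative; a real coder uses the integer version described later):

def sequence_interval(msg, p):
    # p: dict symbol -> probability. Returns (l, s): the sequence interval [l, l+s).
    symbols = sorted(p)                       # fix an order to define f()
    f, acc = {}, 0.0
    for c in symbols:                         # cumulative probability, symbol excluded
        f[c] = acc
        acc += p[c]
    l, s = 0.0, 1.0
    for c in msg:
        l = l + s * f[c]                      # l_i = l_{i-1} + s_{i-1} * f(c_i)
        s = s * p[c]                          # s_i = s_{i-1} * p(c_i)
    return l, s

# sequence_interval("bac", {"a": .2, "b": .5, "c": .3}) -> (0.27, 0.03), i.e. [.27, .30)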

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:

  .49 ∈ [.2, .7)                        ->  b
  within [.2, .7):  .49 ∈ [.3, .55)     ->  b
  within [.3, .55): .49 ∈ [.475, .55)   ->  c

The message is bbc.

Representing a real number
Binary fractional representation:
  .75 = .11        1/3 = .010101...        11/16 = .1011

Algorithm
  1. x = 2·x
  2. If x < 1 output 0
  3. else x = x - 1; output 1

So how about just using the shortest binary fractional
representation in the sequence interval?
  e.g.  [0,.33) = .01      [.33,.66) = .1      [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
  code    min      max      interval
  .11     .110     .111     [.75, 1.0)
  .101    .1010    .1011    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

  Sequence interval:  [.61, .79)
  Code interval of .101:  [.625, .75)  which is contained in [.61, .79)

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that -log s + 1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

  1 + ⌈log (1/s)⌉ = 1 + ⌈log Π_i (1/p_i)⌉
                  ≤ 2 + Σ_{i=1..n} log (1/p_i)
                  = 2 + Σ_{k=1..|S|} n·p_k · log (1/p_k)
                  = 2 + n·H0  bits

In practice nH0 + 0.02·n bits, because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




  Keep integers in range [0..R) where R = 2^k
  Use rounding to generate integer intervals
  Whenever the sequence interval falls into the top,
  bottom or middle half, expand the interval by a factor of 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
  Output 1 followed by m 0s
  m = 0
  Message interval is expanded by 2

If u < R/2 then (bottom half)
  Output 0 followed by m 1s
  m = 0
  Message interval is expanded by 2

If l ≥ R/4 and u < 3R/4 then (middle half)
  Increment m
  Message interval is expanded by 2

All other cases: just continue...

You find this at

Arithmetic ToolBox
As a state machine

(figure: the ATB as a state machine: given the current interval (L,s), the next symbol c, and the distribution (p1,...,p|S|), it produces the new interval (L',s'))

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts          String = ACCBACCACBA B        k=2

Context: empty      A = 4   B = 2   C = 5   $ = 3

Context: A          C = 3   $ = 1
Context: B          A = 2   $ = 1
Context: C          A = 1   B = 2   C = 2   $ = 3

Context: AC         B = 1   C = 2   $ = 2
Context: BA         C = 1   $ = 1
Context: CA         C = 1   $ = 1
Context: CB         A = 2   $ = 1
Context: CC         A = 1   B = 1   $ = 2
You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves
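A minimal Python sketch of the LZ77 parser described above (my own illustration, with an unbounded window and none of the gzip optimizations listed later); on the short text of the next slide it produces the same triples.

def lz77_parse(T):
    # Produce triples (d, len, c): copy `len` chars from distance d, then emit c.
    out, i = [], 0
    while i < len(T):
        best_d, best_len = 0, 0
        for d in range(1, i + 1):                    # candidate copy distances
            l = 0
            while i + l < len(T) - 1 and T[i + l] == T[i - d + l]:
                l += 1                               # overlap with the part being parsed is allowed
            if l > best_len:
                best_d, best_len = d, l
        out.append((best_d, best_len, T[i + best_len]))
        i += best_len + 1                            # advance by len + 1
    return out

# lz77_parse("aacaacabcabaaac")
#   -> [(0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), (3, 3, 'a'), (1, 2, 'c')]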

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
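A minimal Python sketch of LZW decoding (my own code), including the special SSc case mentioned above, where the received id is the dictionary entry the decoder has not built yet:

def lzw_decode(ids, alphabet):
    # Rebuild the text from LZW ids; `alphabet` fixes the initial dictionary.
    dic = {i: c for i, c in enumerate(alphabet)}     # e.g. {0:'a', 1:'b', 2:'c'}
    prev = dic[ids[0]]
    out = [prev]
    for code in ids[1:]:
        if code in dic:
            cur = dic[code]
        else:                                        # the SSc case: entry not yet known
            cur = prev + prev[0]
        dic[len(dic)] = prev + cur[0]                # what the encoder added one step earlier
        out.append(cur)
        prev = cur
    return "".join(out)

# With alphabet "abc" (ids 0,1,2 instead of the slides' 112,113,114):
# lzw_decode([0, 0, 1, 3, 2, 4, 8, 2, 1], "abc") -> "aabaacababacb"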

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F                  L
#  mississipp      i
i  #mississip      p
i  ppi#missis      s
i  ssippi#mis      s
i  ssissippi#      m
m  ississippi      #
p  i#mississi      p
p  pi#mississ      i
s  ippi#missi      s
s  issippi#mi      s
s  sippi#miss      i
s  sissippi#m      i

(1994)

A famous example

Much
longer...

A useful tool: the L → F mapping

(figure: the same sorted BWT matrix as above, with the middle of each row marked "unknown";
 F = # i i i i m p p s s s s,   L = i p s s m # p i s s i i)

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
(figure: the same sorted BWT matrix, middle columns unknown;
 F = # i i i i m p p s s s s,   L = i p s s m # p i s s i i)

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
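A compact Python sketch of the transform and of its inversion via the LF-mapping (my own code; a real implementation would obtain L from the suffix array, as shown in the next slide, instead of sorting all rotations):

def bwt(T):
    # T must end with a unique smallest character, e.g. '#'.
    rots = sorted(T[i:] + T[:i] for i in range(len(T)))   # sorted rotations
    return "".join(row[-1] for row in rots)                # last column L

def ibwt(L):
    # Invert the BWT, assuming the text ended with a unique smallest char (e.g. '#').
    n = len(L)
    order = sorted(range(n), key=lambda r: (L[r], r))      # stable sort maps L-ranks to F-rows
    LF = [0] * n
    for f, r in enumerate(order):
        LF[r] = f
    out = []
    r = 0                                                  # row 0 starts with the smallest char
    for _ in range(n):
        out.append(L[r])                                   # L[r] precedes F[r] in T
        r = LF[r]
    s = "".join(reversed(out))                             # this is the rotation '#' + T[:-1]
    return s[1:] + s[0]

# bwt("mississippi#")  -> "ipssm#pissii"
# ibwt("ipssm#pissii") -> "mississippi#"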

How to compute the BWT ?
SA = 12 11 8 5 2 1 10 9 7 4 6 3

  SA   BWT matrix row      L
  12   #mississipp         i
  11   i#mississip         p
   8   ippi#missis         s
   5   issippi#mis         s
   2   ississippi#         m
   1   mississippi         #
  10   pi#mississi         p
   9   ppi#mississ         i
   7   sippi#missi         s
   4   sissippi#mi         s
   6   ssippi#miss         i
   3   ssissippi#m         i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
  12   #
  11   i#
   8   ippi#
   5   issippi#
   2   ississippi#
   1   mississippi#
  10   pi#
   9   ppi#
   7   sippi#
   4   sissippi#
   6   ssippi#
   3   ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion pages are available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


The largest artifact ever conceived by humankind



Exploit the structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999 and WebBase crawl, 2001: the indegree follows a power law distribution

  Pr[ in-degree(u) = k ]  ∝  1 / k^a ,    a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/x^a, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
(figure: the adjacency matrix of a crawl with 21 million pages and 150 million links, plotted after URL-sorting; pages of the same host, e.g. Berkeley and Stanford, form visible blocks)

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:
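As an illustration of the gap encoding of S(x) above, a tiny Python sketch (my own; the WebGraph library then encodes these gaps and adds the copy-list machinery described next):

def gaps(x, successors):
    # Successor list of node x, sorted increasingly, turned into small gaps:
    # the first entry is s1 - x (possibly negative), the others are s_i - s_{i-1} - 1.
    out = [successors[0] - x]
    for prev, cur in zip(successors, successors[1:]):
        out.append(cur - prev - 1)
    return out

# gaps(15, [13, 15, 16, 17, 18, 19, 23, 24, 203]) -> [-2, 1, 0, 0, 0, 0, 3, 0, 178]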

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77-scheme provides an efficient, optimal solution:

  fknown is the "previously encoded text"; compress the concatenation fknown·fnew starting from fnew

zdelta is one of the best implementations
              Emacs size    Emacs time
  uncompr     27Mb          ---
  gzip        8Mb           35 secs
  zdelta      1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

(figure: Client ↔ client-side Proxy, slow link carrying only delta-encodings, server-side Proxy ↔ web over the fast link; both proxies hold the reference page, so only the delta of the requested page crosses the slow link)
Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

(figure: example of GF with a dummy node 0; edge weights are the zdelta sizes, e.g. 620, 2000, 220, 123, 20, ...)

              space    time
  uncompr     30Mb     ---
  tgz         20%      linear
  THIS        8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta executions. Nonetheless, this still takes n² time.

              space     time
  uncompr     260Mb     ---
  tgz         12%       2 mins
  THIS        8%        16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

              gcc size    emacs size
  total       27288       27326
  gzip        7563        8577
  zdelta      227         1431
  rsync       964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k · lg n · lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree

(figure: the suffix tree of T# = mississippi#, i.e. the compacted trie of all its suffixes; edges are labeled with substrings such as i, s, si, ssi, ppi#, pi#, #, mississippi#, and each of the 12 leaves stores the starting position of its suffix)

T# = mississippi#
     (positions 1..12)

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Θ(N²) space (if the suffixes of SUF(T) were stored explicitly)
SA     SUF(T)
12     #
11     i#
 8     ippi#
 5     issippi#
 2     ississippi#
 1     mississippi#
10     pi#
 9     ppi#
 7     sippi#
 4     sissippi#
 6     ssippi#
 3     ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

Improvements: O(p + log2 N) [Manber-Myers, ’90];  O(p + log2 |S|) [Cole et al, ’06]
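A compact Python sketch of the indirect binary search over SA (my own illustration, with 0-based positions; the slides use 1-based ones):

def sa_build(T):
    # O(N^2 log N) construction by explicit sorting, as in the
    # "elegant but inefficient" slide; fine for small examples.
    return sorted(range(len(T)), key=lambda i: T[i:])

def sa_range(T, SA, P):
    # Binary search for the contiguous range of suffixes having P as a prefix.
    m = len(P)
    lo, hi = 0, len(SA)
    while lo < hi:                            # leftmost suffix whose prefix >= P
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + m] < P:
            lo = mid + 1
        else:
            hi = mid
    left = lo
    lo, hi = left, len(SA)
    while lo < hi:                            # leftmost suffix whose prefix > P
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + m] <= P:
            lo = mid + 1
        else:
            hi = mid
    return [SA[r] for r in range(left, lo)]

# T = "mississippi#"
# SA = sa_build(T)          # [11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]   (0-based)
# sa_range(T, SA, "si")     # [6, 3], i.e. positions 7 and 4 in 1-based counting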

Locating the occurrences
SA = 12 11 8 5 2 1 10 9 7 4 6 3        T = mississippi#        P = si   occ = 2

The suffixes prefixed by si are contiguous in SA: the entries 7 (sippi...) and 4 (sissippi...).
Their range can be delimited by binary searching for the sentinel patterns si# and si$,
where # is smaller and $ larger than every alphabet symbol.

Suffix Array search
• O (p + log2 N + occ) time

Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

  SA  = 12 11  8  5  2  1 10  9  7  4  6  3
  Lcp =     0  1  1  4  0  0  1  0  2  1  3

T = mississippi#   (e.g. Lcp = 4 for the adjacent suffixes issippi# and ississippi#)

• How long is the common prefix between T[i,...] and T[j,...]?
  Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Is there a repeated substring of length ≥ L?
  Search for some Lcp[i] ≥ L.
• Is there a substring of length ≥ L occurring ≥ C times?
  Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.
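A small Python sketch (my own) that computes Lcp from SA by direct comparison of adjacent suffixes and answers the "repeated substring of length ≥ L" question; linear-time constructions of Lcp exist (e.g. Kasai's algorithm) but are not shown here.

def lcp_array(T, SA):
    # Lcp[i] = length of the longest common prefix of the suffixes SA[i] and SA[i+1].
    def lcp(a, b):
        k = 0
        while a + k < len(T) and b + k < len(T) and T[a + k] == T[b + k]:
            k += 1
        return k
    return [lcp(SA[i], SA[i + 1]) for i in range(len(SA) - 1)]

def has_repeat_of_length(T, SA, L):
    # Is there a substring of length >= L occurring at least twice in T?
    return any(v >= L for v in lcp_array(T, SA))

# T = "mississippi#", SA as computed before (0-based):
# lcp_array(T, [11,10,7,4,1,0,9,8,6,3,5,2]) -> [0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3]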


Slide 198

We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

[Matrix M for T = california and P = for: the only 1-entries are
M(1,5), M(2,6), M(3,7), i.e. “f”, “fo”, “for” ending at positions 5, 6, 7;
all other entries are 0.]

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size (e.g., 32 or 64 bits). We’ll assume
m ≤ w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the (j-1)-th one.
We define the m-length binary vector U(x) for each
character x in the alphabet: U(x) is set to 1 for the
positions in P where character x appears.
Example:
P = abaac
U(a) = (1,0,1,1,0)
U(b) = (0,1,0,0,0)
U(c) = (0,0,0,0,1)

How to construct M



Initialize column 0 of M to all zeros.
For j > 0, the j-th column is obtained by

M(j) = BitShift( M(j-1) ) & U(T[j])

For i > 1, entry M(i,j) = 1 iff
(1) the first i-1 characters of P match the i-1 characters of T
    ending at character j-1   ⇔   M(i-1,j-1) = 1
(2) P[i] = T[j]   ⇔   the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) into the i-th position;
ANDing this with the i-th bit of U(T[j]) establishes whether both hold.

An example (T = xabxabaaca, P = abaac)

j=1: M(1) = BitShift(M(0)) & U(T[1]) = (1,0,0,0,0) & U(x) = (0,0,0,0,0)
j=2: M(2) = BitShift(M(1)) & U(T[2]) = (1,0,0,0,0) & U(a) = (1,0,0,0,0)
j=3: M(3) = BitShift(M(2)) & U(T[3]) = (1,1,0,0,0) & U(b) = (0,1,0,0,0)
...
j=9: M(9) = BitShift(M(8)) & U(T[9]) = (1,1,0,0,1) & U(c) = (0,0,0,0,1)
     M(5,9) = 1, so an occurrence of P ends at position 9 (T[5,9] = abaac).

[The slides show the full 5×10 matrix M built column by column.]

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.
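A minimal bit-parallel sketch of the Shift-And scan above, assuming m ≤ w and using plain Python integers as the machine word; the function name is mine.

def shift_and(T, P):
    # Return the 1-based end positions of the exact occurrences of P in T.
    m = len(P)
    if m == 0:
        return []
    U = {}
    for i, c in enumerate(P):           # U[c] has bit i set iff P[i+1] == c
        U[c] = U.get(c, 0) | (1 << i)
    occ = []
    M = 0                               # column M(0) = all zeros
    for j, c in enumerate(T, start=1):
        # M(j) = BitShift(M(j-1)) & U(T[j]): shift up by one, set the first bit to 1
        M = ((M << 1) | 1) & U.get(c, 0)
        if M & (1 << (m - 1)):          # M(m,j) = 1  =>  occurrence ending at j
            occ.append(j)
    return occ

# shift_and("xabxabaaca", "abaac")  ->  [9]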

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
U(a) = (1,0,1,1,0)
U(b) = (1,1,0,0,0)
U(c) = (0,0,0,0,1)

What about ‘?’, ‘[^…]’ (not)?

Problem 1: Another solution
Dictionary

P = bzip = 1a 0b
S = “bzip or not bzip”

[Figure: the same dictionary {bzip, not, or, space} and compressed text C(S);
the encoded pattern is matched against C(S), with yes/no marks at the
candidate positions.]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P, find all the occurrences in S of all terms containing P as a substring.

P = o
S = “bzip or not bzip”

[Figure: dictionary {bzip, not, or, space} and compressed text C(S); the terms
containing “o” are encoded as  not = 1 g 0 g 0 a  and  or = 1 g 0 a 0 b, and
their codewords are searched in C(S).]

Speed ≈ Compression ratio? No! Why?
A scan of C(S) is needed for each term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: text T with the occurrences of patterns P1 and P2 highlighted.]
 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

S is the concatenation of the patterns in P.
R is a bitmap of length m:
  R[i] = 1 iff S[i] is the first symbol of a pattern.

Use a variant of the Shift-And method searching for S:
  For any symbol c, U’(c) = U(c) AND R
    U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern.
  For any step j,
    compute M(j)
    then M(j) OR U’(T[j]). Why?
      Set to 1 the first bit of each pattern that starts with T[j].
    Check if there are occurrences ending in j. How?
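A sketch of this multi-pattern variant, with Python integers as bit vectors; the bookkeeping names are mine, and occurrences are reported as (end position, pattern) pairs.

def multi_shift_and(T, patterns):
    S = "".join(patterns)                   # concatenation of the patterns
    U, R, E, ends = {}, 0, 0, {}
    pos = 0
    for p in patterns:
        R |= 1 << pos                       # R marks the first symbol of each pattern
        E |= 1 << (pos + len(p) - 1)        # E marks the last symbol of each pattern
        ends[pos + len(p) - 1] = p
        pos += len(p)
    for i, c in enumerate(S):
        U[c] = U.get(c, 0) | (1 << i)
    occ, M = [], 0
    for j, c in enumerate(T, start=1):
        Uc = U.get(c, 0)
        # usual Shift-And step, then restart every pattern whose first symbol is c
        M = (((M << 1) | 1) & Uc) | (Uc & R)
        hits = M & E                        # matched prefixes that are complete patterns
        while hits:
            b = hits & -hits
            occ.append((j, ends[b.bit_length() - 1]))
            hits ^= b
    return occ

# multi_shift_and("abracadabra", ["abra", "cad"])
#   ->  [(4, 'abra'), (7, 'cad'), (11, 'abra')]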

Problem 3
Dictionary

Given a pattern P, find all the occurrences in S of all terms containing P
as a substring, allowing at most k mismatches.

P = bot, k = 2
S = “bzip or not bzip”

[Figure: dictionary {bzip, not, or, space} and compressed text C(S).]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal: given k, find all the occurrences of P in T with up to k mismatches.
We define the matrix Ml to be an m by n binary matrix such that:

Ml(i,j) = 1 iff there are at most l mismatches between the first i
characters of P and the i characters of T ending at character j.

What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

BitShift( Ml(j-1) ) & U(T[j])

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

BitShift( Ml-1(j-1) )

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

Ml(j) = [ BitShift( Ml(j-1) ) & U(T[j]) ]  OR  BitShift( Ml-1(j-1) )
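A minimal sketch of the k-mismatch recurrence above, with Python integers as bit vectors; this is an illustrative helper of mine, not the original agrep code.

def shift_and_k_mismatch(T, P, k):
    # Return the 1-based end positions where P occurs in T with at most k mismatches.
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    last = 1 << (m - 1)
    M = [0] * (k + 1)                    # M[l] = current column of matrix M^l
    occ = []
    for j, c in enumerate(T, start=1):
        Uc = U.get(c, 0)
        prev = M[:]                      # columns at position j-1
        M[0] = ((prev[0] << 1) | 1) & Uc
        for l in range(1, k + 1):
            # case 1: extend an l-mismatch prefix with a matching character
            # case 2: extend an (l-1)-mismatch prefix with any character
            M[l] = (((prev[l] << 1) | 1) & Uc) | ((prev[l - 1] << 1) | 1)
        if M[k] & last:
            occ.append(j)
    return occ

# shift_and_k_mismatch("aatatccacaa", "atcgaa", 2)  ->  [9]
# i.e. the occurrence with 2 mismatches starting at position 4 (ending at 9)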

Example M1 (T = xabxabaaca, P = abaad)

[Figure: the 5×10 matrices M0 and M1, built column by column with the
recurrence above; in particular M1(5,9) = 1, since P = abaad matches
T[5,9] = abaac with one mismatch.]

How much do we pay?





The running time is O(kn(1+m/w)).
Again, the method is practically efficient for small m.
Still, only O(k) columns of M are needed at any given time; hence the
space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P, find all the occurrences in S of all terms containing P
as a substring, allowing k mismatches.

P = bot, k = 2
S = “bzip or not bzip”

[Figure: dictionary {bzip, not, or, space} and compressed text C(S); with
k = 2 mismatches the term  not = 1 g 0 g 0 a  matches P, and its codeword is
searched in C(S).]

Agrep: more sophisticated operations


The Shift-And method can solve other operations too.

The edit distance between two strings p and s is d(p,s) = the minimum
number of operations needed to transform p into s via three ops:
  Insertion: insert a symbol in p
  Deletion: delete a symbol from p
  Substitution: change a symbol of p into a different one

Example: d(ananas, banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding

g(x) = 0 0 ... 0 (Length-1 zeroes) followed by x in binary

x > 0 and Length = ⌊log2 x⌋ + 1
e.g., 9 represented as <000, 1001>.

g-code for x takes 2⌊log2 x⌋ + 1 bits
(i.e. a factor of 2 from optimal)

Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of g-coded integers, reconstruct the original sequence:

0001000001100110000011101100111

→  8, 6, 3, 59, 7
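A small sketch of g-encoding and g-decoding over a bit string; the helper names are mine.

def gamma_encode(x):
    # gamma code of x > 0: (Length-1) zeroes followed by x in binary
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    # decode a concatenation of gamma codes back into the integer sequence
    out, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":
            zeros += 1
            i += 1
        out.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return out

# gamma_decode("0001000001100110000011101100111")  ->  [8, 6, 3, 59, 7]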

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code g(i).

Recall that: |g(i)| ≤ 2·log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2·H0(s) + 1
Key fact:
1 ≥ Σi=1,...,x pi ≥ x·px   ⇒   x ≤ 1/px

How good is it?
Encode the integers via g-coding:
|g(i)| ≤ 2·log i + 1
The cost of the encoding is (recall i ≤ 1/pi):

Σi=1,...,|S| pi·|g(i)|  ≤  Σi=1,...,|S| pi·[2·log(1/pi) + 1]  =  2·H0(X) + 1

Not much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 128^2 = 16512 words on 2 bytes
A (230,26)-dense code encodes 230 + 230·26 = 6210 words on 2 bytes,
hence more words on 1 byte, and thus it is better if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploits temporal locality, and it is dynamic

X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n^2 log n) bits, MTF = O(n log n) + n^2 bits

Not much worse than Huffman
...but it may be far better
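A minimal Move-to-Front sketch over characters, emitting 0-based positions (0 = front of the list, as in the bzip example later); the function names are mine and this is not a full compressor.

def mtf_encode(s, alphabet):
    L = list(alphabet)
    out = []
    for c in s:
        i = L.index(c)          # 1) output the position of c in L
        out.append(i)
        L.pop(i)                # 2) move c to the front of L
        L.insert(0, c)
    return out

def mtf_decode(codes, alphabet):
    L = list(alphabet)
    out = []
    for i in codes:
        c = L.pop(i)
        out.append(c)
        L.insert(0, c)
    return "".join(out)

# mtf_encode("aabbb", "abcd")  ->  [0, 0, 1, 0, 0]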

MTF: how good is it?
Encode the integers via g-coding:  |g(i)| ≤ 2·log i + 1
Put the alphabet S at the front and consider the cost of encoding
(p_i^x = position of the i-th occurrence of symbol x):

O(|S| log |S|)  +  Σx=1,...,|S| Σi=2,...,nx |g( p_i^x - p_(i-1)^x )|

By Jensen’s inequality:

≤  O(|S| log |S|)  +  Σx=1,...,|S| nx·[ 2·log(N/nx) + 1 ]
=  O(|S| log |S|)  +  N·[ 2·H0(X) + 1 ]

⇒  La[mtf]  ≤  2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:

Exploits spatial locality, and it is a dynamic code
There is a memory

X = 1^n 2^n 3^n … n^n   ⇒   Huff(X) = Θ(n^2 log n)  >  Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive).

e.g.  p(a) = .2, p(b) = .5, p(c) = .3

f(i) = Σj=1,...,i-1 p(j)     ⇒     f(a) = .0, f(b) = .2, f(c) = .7

[Figure: the unit interval split into a = [0,.2), b = [.2,.7), c = [.7,1).]

The interval for a particular symbol will be called
the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac

[Figure: start from [0,1); after b the interval is [.2,.7); after a it is
[.2,.3); after c it is [.27,.3).]

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c1 c2 … cn with probabilities p[c], use the following:

l0 = 0      li = li-1 + si-1 · f[ci]
s0 = 1      si = si-1 · p[ci]

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

sn = Πi=1,...,n p[ci]

The interval for a message sequence will be called the sequence interval
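A small floating-point sketch of the interval computation above, just to follow the example; a real coder uses the integer version discussed later, and the function name is mine.

def sequence_interval(msg, p):
    # Return (l, s): the sequence interval [l, l+s) for msg under probabilities p.
    syms = sorted(p)            # assume a fixed symbol order, e.g. alphabetical
    f, acc = {}, 0.0
    for c in syms:              # cumulative probability of the symbols preceding c
        f[c] = acc
        acc += p[c]
    l, s = 0.0, 1.0
    for c in msg:
        l = l + s * f[c]        # l_i = l_{i-1} + s_{i-1} * f[c_i]
        s = s * p[c]            # s_i = s_{i-1} * p[c_i]
    return l, s

# sequence_interval("bac", {"a": .2, "b": .5, "c": .3})
#   ->  approximately (0.27, 0.03), i.e. the interval [.27, .3)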

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:

[Figure: .49 falls in b = [.2,.7); rescaling, .49 falls in b = [.3,.55);
rescaling again, .49 falls in c = [.475,.55).]

The message is bbc.

Representing a real number
Binary fractional representation:

.75 = .11        1/3 = .010101…        11/16 = .1011

Algorithm:
1. x = 2*x
2. If x < 1, output 0
3. else x = x - 1; output 1

So how about just using the shortest binary fractional representation in the
sequence interval?
e.g.  [0,.33) = .01      [.33,.66) = .1      [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions.

         min        max        interval
.11      .110...    .111...    [.75, 1.0)
.101     .1010...   .1011...   [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval is
contained in the sequence interval (a dyadic number).

[Figure: sequence interval [.61, .79); the code interval of .101, i.e.
[.625, .75), is contained in it.]

Can use L + s/2 truncated to 1 + ⌈log (1/s)⌉ bits

Bound on Arithmetic length

Note that -log s + 1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

1 + ⌈log (1/s)⌉  =  1 + ⌈log Πi=1,...,n (1/pi)⌉
                 ≤  2 + Σi=1,...,n log (1/pi)
                 =  2 + Σk=1,...,|S| n·pk·log (1/pk)
                 =  2 + n·H0   bits

nH0 + 0.02·n bits in practice, because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
  Output 1 followed by m 0s
  m = 0
  Message interval is expanded by 2

If u < R/2 then (bottom half)
  Output 0 followed by m 1s
  m = 0
  Message interval is expanded by 2

If l ≥ R/4 and u < 3R/4 then (middle half)
  Increment m
  Message interval is expanded by 2

All other cases: just continue...

You find this at

Arithmetic ToolBox
As a state machine

[Figure: the coder as a black box ATB mapping a state (L,s) and a symbol c,
drawn from the distribution (p1,...,pS), to the new state (L’,s’) with
L’ = L + s·f(c) and s’ = s·p(c).]

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox

[Figure: the same ATB state machine, now driven by PPM: the coded symbol s is
either a character c or an escape, with probability p[ s | context ], and the
state (L,s) is mapped to (L’,s’).]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts          String = ACCBACCACBA B,   k = 2

Context Empty:   A = 4   B = 2   C = 5   $ = 3

Context A:   C = 3   $ = 1
Context B:   A = 2   $ = 1
Context C:   A = 1   B = 2   C = 2   $ = 3

Context AC:  B = 1   C = 2   $ = 2
Context BA:  C = 1   $ = 1
Context CA:  C = 1   $ = 1
Context CB:  A = 2   $ = 1
Context CC:  A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
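A minimal decoder sketch for (d, len, c) triples, including the overlapping-copy case; this is illustrative, not gzip’s actual on-disk format, and the name is mine.

def lz77_decode(triples):
    # Decode a list of (d, length, c) LZ77 triples into a string.
    out = []
    for d, length, c in triples:
        start = len(out) - d
        for i in range(length):
            # copying char by char handles the overlap case (length > d)
            out.append(out[start + i])
        out.append(c)
    return "".join(out)

# lz77_decode([(0,0,'a'), (0,0,'b'), (0,0,'c'), (0,0,'d'), (2,9,'e')])
#   ->  "abcdcdcdcdcdce"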

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
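A compact LZW sketch, encoder and decoder, including the SSc corner case; the dictionary is seeded with just the characters of a small alphabet rather than the 256 ASCII entries, purely for illustration, and the names are mine.

def lzw_encode(s, alphabet):
    dic = {c: i for i, c in enumerate(alphabet)}       # initial single-char entries
    out, cur = [], ""
    for c in s:
        if cur + c in dic:
            cur += c                                   # extend the current match S
        else:
            out.append(dic[cur])                       # emit the id of S
            dic[cur + c] = len(dic)                    # add Sc to the dictionary
            cur = c
    if cur:
        out.append(dic[cur])
    return out

def lzw_decode(codes, alphabet):
    dic = {i: c for i, c in enumerate(alphabet)}
    prev = dic[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in dic:
            entry = dic[code]
        else:                                          # the SSc case: id not yet in dic
            entry = prev + prev[0]
        out.append(entry)
        dic[len(dic)] = prev + entry[0]                # decoder is one step behind the coder
        prev = entry
    return "".join(out)

# lzw_decode(lzw_encode("aabaacababacb", "abc"), "abc") == "aabaacababacb"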

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
We are given the text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L → F mapping

[Figure: the sorted rotation matrix of mississippi#, with first column
F = # i i i i m p p s s s s  and last column  L = i p s s m # p i s s i i.]

How do we map L’s chars onto F’s chars?
... we need to distinguish equal chars in F...

Take two equal chars of L and rotate their rows rightward:
they keep the same relative order !!

The BWT is invertible

[Figure: the same sorted matrix, with F = # i i i i m p p s s s s,
L = i p s s m # p i s s i i, and the text T unknown to the decoder.]

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
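A tiny end-to-end sketch: naive BWT construction by sorting all rotations, and the LF-based inversion described above; the names are mine.

def bwt(t):
    # t must end with a unique smallest sentinel, e.g. '#'. Returns the last column L.
    n = len(t)
    rot = sorted(t[i:] + t[:i] for i in range(n))       # sort all rotations
    return "".join(row[-1] for row in rot)

def ibwt(L):
    # Invert the BWT using the LF mapping (L[i] precedes F[i] in T).
    n = len(L)
    order = sorted(range(n), key=lambda i: (L[i], i))    # stable sort of L gives column F
    F = "".join(L[i] for i in order)
    LF = [0] * n                                         # LF maps a row of L to its row in F
    for f_row, l_row in enumerate(order):
        LF[l_row] = f_row
    out, r = [], 0                                       # row 0 starts with the sentinel '#'
    for _ in range(n):
        out.append(F[r])                                 # emits T[n], T[n-1], ..., T[1]
        r = LF[r]
    return "".join(reversed(out))

# bwt("mississippi#")   ->  "ipssm#pissii"
# ibwt("ipssm#pissii")  ->  "mississippi#"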

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n^2 log n) time in the worst case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by humans

Exploit the structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…

Physical network graph:
  V = Routers, E = communication links

The “cosine” graph (undirected, weighted):
  V = static web pages, E = semantic distance between pages

Query-Log graph (bipartite, weighted):
  V = queries and URLs, E = (q,u) if u is a result for q and has been
  clicked by some user who issued q

Social graph (undirected, unweighted):
  V = users, E = (x,y) if x knows y (facebook, address book, email, ...)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution

[Figure: in-degree distributions from the Altavista crawl (1999) and the
WebBase crawl (2001); the indegree follows a power law distribution.]

Pr[ in-degree(u) = k ]  ∝  1 / k^a ,   a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 million pages, 150 million links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:
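A tiny sketch of the gap transform above (function names and the example list are mine); here a negative first gap is kept as a signed value, whereas WebGraph folds negative values into non-negative codes as in the residual example later.

def encode_gaps(x, succ):
    # succ must be sorted; the first gap is relative to the source node x.
    gaps = [succ[0] - x]                        # may be negative
    gaps += [succ[i] - succ[i - 1] - 1 for i in range(1, len(succ))]
    return gaps

def decode_gaps(x, gaps):
    succ = [x + gaps[0]]
    for g in gaps[1:]:
        succ.append(succ[-1] + g + 1)
    return succ

# encode_gaps(15, [13, 15, 16, 17, 18, 19, 23, 24, 203])
#   ->  [-2, 1, 0, 0, 0, 0, 3, 0, 178]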

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77-scheme provides an efficient, optimal solution:

fknown is the “previously encoded text”: compress the concatenation fknown·fnew,
emitting output only from fnew onwards

zdelta is one of the best implementations
            Emacs size   Emacs time
uncompr     27Mb         ---
gzip        8Mb          35 secs
zdelta      1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: a small weighted graph GF over the files, with a dummy node connected
to all of them; zdelta sizes weight the file-to-file edges and gzip sizes weight
the dummy edges, and the min branching picks the cheapest reference for each file.]

            space   time
uncompr     30Mb    ---
tgz         20%     linear
THIS        8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta executions. Nonetheless, strictly n^2 estimation time.

            space   time
uncompr     260Mb   ---
tgz         12%     2 mins
THIS        8%      16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
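A toy sketch of the rsync idea on the server side: slide over f_new, look up the client's per-block hashes, and emit copy/literal instructions. Real rsync uses a 4-byte rolling hash plus MD5; here they are replaced by Python's hash purely for illustration, and all names are mine.

def rsync_delta(f_new, block_hashes, B):
    # Encode f_new as ('copy', block_index) / ('literal', text) items, given the
    # client's hashes of f_old's B-sized blocks (a dict: hash -> block index).
    out, lit, i = [], [], 0
    while i + B <= len(f_new):
        h = hash(f_new[i:i + B])             # stand-in for rolling weak hash + strong check
        if h in block_hashes:
            if lit:
                out.append(("literal", "".join(lit))); lit = []
            out.append(("copy", block_hashes[h]))
            i += B                           # jump past the matched block
        else:
            lit.append(f_new[i]); i += 1     # no match: keep one literal char, slide by one
    out.append(("literal", "".join(lit) + f_new[i:]))
    return out

def rsync_patch(f_old, delta, B):
    return "".join(f_old[idx * B:(idx + 1) * B] if kind == "copy" else data
                   for kind, data in delta)

# Usage sketch:
# B = 4
# f_old = "the quick brown fox jumps"
# hashes = {hash(f_old[i:i+B]): i // B for i in range(0, len(f_old) - B + 1, B)}
# delta = rsync_delta("the quick brown cat jumps", hashes, B)
# rsync_patch(f_old, delta, B) == "the quick brown cat jumps"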

Rsync: some experiments

          gcc size   emacs size
total     27288      27326
gzip      7563       8577
zdelta    227        1431
rsync     964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

log n/k levels

If the distance is k, then on each level ≤ k hashes do not find a match in the
other file. The communication complexity is O(k lg n lg(n/k)) bits.

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree

T# = mississippi#

[Figure: the suffix tree of T#, with the root branching on #, i, mississippi#,
p, s, internal edges labelled by substrings (e.g. ssi, si, ppi#, pi#, i#), and
the 12 leaves labelled by the starting positions of the suffixes.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Θ(N^2) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
⇒ In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
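A small sketch of the indirect binary search over SA, with a naive suffix-array construction; each comparison costs O(p) characters, and the names are mine.

def suffix_array(T):
    # Naive construction: sort the suffix start positions (1-based) lexicographically.
    return sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])

def sa_search(T, SA, P):
    # All positions where P occurs in T, via binary search on SA (O(p log N) char cmp).
    n = len(SA)
    lo, hi = 0, n                       # leftmost suffix >= P
    while lo < hi:
        mid = (lo + hi) // 2
        if T[SA[mid] - 1:] < P:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    lo, hi = first, n                   # leftmost suffix that does not start with P
    while lo < hi:
        mid = (lo + hi) // 2
        if T[SA[mid] - 1:].startswith(P):
            lo = mid + 1
        else:
            hi = mid
    return sorted(SA[first:lo])

# T = "mississippi#"
# SA = suffix_array(T)        # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
# sa_search(T, SA, "si")      # -> [4, 7]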

Locating the occurrences

[Figure: binary search for the interval of suffixes prefixed by P = si in the
suffix array of T = mississippi#; occ = 2, at positions 4 and 7.]

Suffix Array search
• O(p + log2 N + occ) time

Suffix Trays: O(p + log2 |S| + occ)      [Cole et al., ‘06]
String B-tree                            [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays             [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does it exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does it exist a substring of length ≥ L occurring ≥ C times ?
• Search for Lcp[i,i+C-2] whose entries are ≥ L


Slide 199

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M

We want to exploit bit-parallelism to compute the j-th column of M from the (j−1)-th one.
We define the m-length binary vector U(x) for each character x of the alphabet: U(x) has a 1 exactly in the positions where x appears in P.

Example: P = abaac
U(a) = (1,0,1,1,0)ᵀ    U(b) = (0,1,0,0,0)ᵀ    U(c) = (0,0,0,0,1)ᵀ

How to construct M

Initialize column 0 of M to all zeros.
For j > 0, the j-th column is obtained by

   M(j) = BitShift(M(j−1)) & U(T[j])

For i > 1, entry M(i,j) = 1 iff
(1) the first i−1 characters of P match the i−1 characters of T ending at character j−1   ⇔ M(i−1, j−1) = 1
(2) P[i] = T[j]   ⇔ the i-th bit of U(T[j]) = 1

BitShift moves bit M(i−1, j−1) into the i-th position;
AND-ing this with the i-th bit of U(T[j]) establishes whether both conditions hold.
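As a sanity check of the recurrence above, a minimal Python sketch of the Shift-And scan; the current column of M is kept as an integer bitmask (bit i−1 stores M(i,j)). Variable names are mine, not from the slides.

    def shift_and(T, P):
        """Return the end positions j (1-based) of the occurrences of P in T."""
        m = len(P)
        U = {}                                  # U[x]: bit i-1 set iff P[i] == x
        for i, x in enumerate(P):
            U[x] = U.get(x, 0) | (1 << i)
        col, occ = 0, []                        # col encodes the current column M(j)
        for j, x in enumerate(T, start=1):
            # BitShift(M(j-1)): shift by one and set the first bit to 1
            col = ((col << 1) | 1) & U.get(x, 0)
            if col & (1 << (m - 1)):            # M(m, j) = 1: a full match ends at j
                occ.append(j)
        return occ

    print(shift_and("xabxabaaca", "abaac"))     # -> [9], matching the j=9 example below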

An example, j=1        P = abaac,  T = xabxabaaca

M(1) = BitShift(M(0)) & U(T[1]) = (1,0,0,0,0)ᵀ & U(x) = (1,0,0,0,0)ᵀ & (0,0,0,0,0)ᵀ = (0,0,0,0,0)ᵀ

An example, j=2        P = abaac,  T = xabxabaaca

M(2) = BitShift(M(1)) & U(T[2]) = (1,0,0,0,0)ᵀ & U(a) = (1,0,0,0,0)ᵀ & (1,0,1,1,0)ᵀ = (1,0,0,0,0)ᵀ

An example, j=3        P = abaac,  T = xabxabaaca

M(3) = BitShift(M(2)) & U(T[3]) = (1,1,0,0,0)ᵀ & U(b) = (1,1,0,0,0)ᵀ & (0,1,0,0,0)ᵀ = (0,1,0,0,0)ᵀ

An example, j=9        P = abaac,  T = xabxabaaca

The matrix M after processing columns 1..9:

        j:  1  2  3  4  5  6  7  8  9
   i=1:     0  1  0  0  1  0  1  1  0
   i=2:     0  0  1  0  0  1  0  0  0
   i=3:     0  0  0  0  0  0  1  0  0
   i=4:     0  0  0  0  0  0  0  1  0
   i=5:     0  0  0  0  0  0  0  0  1

M(9) = BitShift(M(8)) & U(T[9]) = (1,1,0,0,1)ᵀ & U(c) = (1,1,0,0,1)ᵀ & (0,0,0,0,1)ᵀ = (0,0,0,0,1)ᵀ

M(5,9) = 1: an occurrence of P ends at position 9 of T.

Shift-And method: Complexity

If m ≤ w, any column and any vector U() fit in a memory word, so each step requires O(1) time.
If m > w, any column and any vector U() can be divided into ⌈m/w⌉ memory words, so each step requires O(m/w) time.
Overall: O(n(1 + m/w) + m) time.
Thus it is very fast when the pattern length is close to the word size, which is very often the case in practice. Recall that w = 64 bits in modern architectures.

Some simple extensions

We want to allow the pattern to contain special symbols, like the class of characters [a-f].

P = [a-b]baac
U(a) = (1,0,1,1,0)ᵀ    U(b) = (1,1,0,0,0)ᵀ    U(c) = (0,0,0,0,1)ᵀ

What about '?' and '[^…]' (negation)?

Problem 1: Another solution
Dictionary = { bzip, not, or } (+ space),   P = bzip = 1a 0b
S = "bzip or not bzip"
[figure: the codeword stream C(S) is scanned with the Shift-And automaton built on the codeword of P; the codewords of "or" and "not" are rejected ("no"), the two codewords of bzip are reported ("yes")]

Speed ≈ Compression ratio

Problem 2
Dictionary = { bzip, not, or } (+ space)
Given a pattern P, find all the occurrences in S of all terms containing P as a substring.
P = o,   S = "bzip or not bzip"
[figure: C(S) is scanned for the codewords of the dictionary terms that contain "o", i.e. not = 1g 0g 0a and or = 1g 0a 0b, both marked "yes"]

Speed ≈ Compression ratio? No! Why?
A scan of C(S) is needed for each term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1, P2, …, Pl} of total length m, we want to find all the occurrences of those patterns in the text T[1,n].
[figure: a text T with occurrences of P1 and P2 highlighted]

Naïve solution
   Use an (optimal) exact-matching algorithm to search for each pattern of P
   Complexity: O(nl + m) time, not good with many patterns

Optimal solution due to Aho and Corasick
   Complexity: O(n + l + m) time

A simple extension of Shift-And

S is the concatenation of the patterns in P.
R is a bitmap of length m, with R[i] = 1 iff S[i] is the first symbol of a pattern.

Use a variant of the Shift-And method searching for S (see the sketch below):

For any symbol c, U'(c) = U(c) AND R, so U'(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern.
For any step j,
   compute M(j)
   then OR it with U'(T[j]). Why? This sets to 1 the first bit of each pattern that starts with T[j].
   Check whether there are occurrences ending at j. How?

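One possible way to code this variant in Python; the mask F of final positions, used to answer the "How?" above, and the variable names are my additions, not part of the slides.

    def multi_shift_and(T, patterns):
        """Report (end position in T, pattern) for every occurrence of any pattern."""
        S = "".join(patterns)
        U, R, F, ends, pos = {}, 0, 0, {}, 0
        for P in patterns:
            R |= 1 << pos                        # first symbol of this pattern
            F |= 1 << (pos + len(P) - 1)         # last symbol of this pattern
            ends[pos + len(P) - 1] = P
            pos += len(P)
        for i, x in enumerate(S):
            U[x] = U.get(x, 0) | (1 << i)
        occ, col = [], 0
        for j, x in enumerate(T, start=1):
            # shift, restart every pattern beginning with x, then filter by U(x)
            col = ((col << 1) | R) & U.get(x, 0)
            hits = col & F                       # bits marking a complete pattern
            while hits:
                b = hits & -hits
                occ.append((j, ends[b.bit_length() - 1]))
                hits ^= b
        return occ

    print(multi_shift_and("abcacab", ["ab", "ca"]))
    # -> [(2, 'ab'), (4, 'ca'), (6, 'ca'), (7, 'ab')]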
Problem 3
Dictionary = { bzip, not, or } (+ space)
Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing at most k mismatches.
P = bot,  k = 2
S = "bzip or not bzip"
[figure: the codeword stream C(S) annotated with the dictionary codewords]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal: given k, find all the occurrences of P in T with up to k mismatches.
We define the matrix Ml to be an m by n binary matrix such that:

Ml(i,j) = 1 iff there are no more than l mismatches between the first i characters of P and the i characters of T ending at character j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1

The first i−1 characters of P match a substring of T ending at j−1 with at most l mismatches, and the next pair of characters in P and T are equal:

   BitShift( Ml(j−1) ) & U(T[j])

Computing Ml: case 2

The first i−1 characters of P match a substring of T ending at j−1 with at most l−1 mismatches (position j is then allowed to mismatch):

   BitShift( Ml−1(j−1) )

Computing Ml

We compute Ml for all l = 0, …, k. For each j we compute M0(j), M1(j), …, Mk(j).
For all l, initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a match iff case 1 or case 2 holds:

   Ml(j) = [ BitShift( Ml(j−1) ) & U(T[j]) ]  OR  BitShift( Ml−1(j−1) )
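A compact Python sketch of this k-mismatch recurrence, keeping one bitmask per error level l = 0..k; the reporting of end positions and the names are my additions.

    def agrep_mismatches(T, P, k):
        """End positions (1-based) of occurrences of P in T with at most k mismatches."""
        m = len(P)
        U = {}
        for i, x in enumerate(P):
            U[x] = U.get(x, 0) | (1 << i)
        M = [0] * (k + 1)                        # M[l] is the current column of M^l
        last = 1 << (m - 1)
        occ = []
        for j, x in enumerate(T, start=1):
            prev = M[:]                          # columns at step j-1
            M[0] = ((prev[0] << 1) | 1) & U.get(x, 0)
            for l in range(1, k + 1):
                M[l] = (((prev[l] << 1) | 1) & U.get(x, 0)) | ((prev[l - 1] << 1) | 1)
            if M[k] & last:
                occ.append(j)
        return occ

    print(agrep_mismatches("aatatccacaa", "atcgaa", 2))
    # -> [9]: the occurrence with 2 mismatches starting at position 4 of the slides' example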

Example M1          T = x a b x a b a a c a     P = abaad

M1 =    j: 1 2 3 4 5 6 7 8 9 10          M0 =    j: 1 2 3 4 5 6 7 8 9 10
   i=1:    1 1 1 1 1 1 1 1 1 1              i=1:    0 1 0 0 1 0 1 1 0 1
   i=2:    0 0 1 0 0 1 0 1 1 0              i=2:    0 0 1 0 0 1 0 0 0 0
   i=3:    0 0 0 1 0 0 1 0 0 1              i=3:    0 0 0 0 0 0 1 0 0 0
   i=4:    0 0 0 0 1 0 0 1 0 0              i=4:    0 0 0 0 0 0 0 1 0 0
   i=5:    0 0 0 0 0 0 0 0 1 0              i=5:    0 0 0 0 0 0 0 0 0 0

How much do we pay?

The running time is O(kn(1 + m/w)).
Again, the method is practically efficient for small m.
Only O(k) columns of M are needed at any given time; hence the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary = { bzip, not, or } (+ space)
Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing k mismatches.
P = bot,  k = 2
S = "bzip or not bzip"
[figure: scanning C(S), the codeword of not = 1g 0g 0a is reported ("yes"), since "not" matches P = bot within k mismatches]

Agrep: more sophisticated operations

The Shift-And method can solve other ops.

The edit distance between two strings p and s is d(p,s) = the minimum number of operations needed to transform p into s via three ops:
   Insertion: insert a symbol in p
   Deletion: delete a symbol from p
   Substitution: change a symbol of p into a different one

Example: d(ananas, banane) = 3

Search by regular expressions
   Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code (gamma code) for integer encoding

   γ(x) = 0 0 … 0 (Length − 1 zeroes) followed by x in binary,
   for x > 0, where Length = ⌊log2 x⌋ + 1.

e.g., 9 is represented as <000, 1001>.

The γ-code for x takes 2⌊log2 x⌋ + 1 bits (i.e., a factor of 2 from optimal).

Optimal for Pr(x) = 1/(2x²), and i.i.d. integers.

It is a prefix-free encoding…

Given the following sequence of γ-coded integers, reconstruct the original sequence:

   0001000001100110000011101100111

   → 8, 6, 3, 59, 7
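A small Python sketch of γ-encoding and decoding, which reproduces the exercise above; purely illustrative.

    def gamma_encode(x):
        """γ-code of integer x > 0: (Length-1) zeroes, then x in binary."""
        b = bin(x)[2:]
        return "0" * (len(b) - 1) + b

    def gamma_decode(bits):
        """Decode a concatenation of γ-codes into the list of integers."""
        out, i = [], 0
        while i < len(bits):
            z = 0
            while bits[i] == "0":           # count the leading zeroes
                z += 1
                i += 1
            out.append(int(bits[i:i + z + 1], 2))
            i += z + 1
        return out

    print(gamma_encode(9))                                        # 0001001
    print(gamma_decode("0001000001100110000011101100111"))        # [8, 6, 3, 59, 7]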

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).

Recall that |γ(i)| ≤ 2·log i + 1.
How good is this approach w.r.t. Huffman?
Compression ratio ≤ 2·H0(s) + 1
Key fact:   1 ≥ Σi=1,…,x pi ≥ x · px   ⟹   x ≤ 1/px

How good is it?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1.
The cost of the encoding is (recall i ≤ 1/pi):

   Σi=1,…,|S| pi · |γ(i)|  ≤  Σi=1,…,|S| pi · [ 2·log(1/pi) + 1 ]  =  2·H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + …

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte, s·c with 2 bytes, s·c² with 3 bytes, …

An example




5000 distinct words
ETDC encodes 128 + 128² = 16,512 words within 2 bytes
A (230,26)-dense code encodes 230 + 230·26 = 6,210 words within 2 bytes,
hence more words on 1 byte, and thus better if the distribution is skewed...
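A sketch of (s,c)-dense codeword assignment in Python (with s = c = 128 it reduces to ETDC); this is my reading of the scheme, shown only to make the byte counting above concrete.

    def sc_encode(rank, s=128, c=128):
        """(s,c)-dense codeword of a word of given rank (0 = most frequent), as byte values.
        Byte values < s are 'stoppers' (they end a codeword), values >= s are 'continuers'."""
        k, first = 1, 0                      # k-byte codewords start at rank 'first'
        while rank >= first + s * c ** (k - 1):
            first += s * c ** (k - 1)
            k += 1
        r = rank - first
        out = [r % s]                        # last byte: stopper in [0, s)
        r //= s
        for _ in range(k - 1):               # remaining bytes: continuers in [s, s+c)
            out.append(s + r % c)
            r //= c
        return list(reversed(out))

    print(sc_encode(5))        # [5]        1-byte codeword
    print(sc_encode(130))      # [128, 2]   2-byte codeword (one continuer, one stopper)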

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC is quite interesting…

Search is 6% faster than byte-aligned Huffword.

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

Properties:
   There is a memory: it exploits temporal locality, and it is dynamic.
   X = 1ⁿ 2ⁿ 3ⁿ … nⁿ  ⟹  Huff = O(n² log n) bits, MTF = O(n log n) + n² bits

Not much worse than Huffman... but it may be far better.
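A direct Python sketch of the MTF transform over a symbol list, exactly as described above (the O(log |Σ|)-time list maintenance discussed later is not attempted here).

    def mtf_encode(text, alphabet):
        """Output, for each symbol, its current position (1-based) in the MTF list."""
        L = list(alphabet)
        out = []
        for s in text:
            pos = L.index(s)            # 1) position of s in L
            out.append(pos + 1)
            L.pop(pos)                  # 2) move s to the front of L
            L.insert(0, s)
        return out

    print(mtf_encode("abbbaacccca", "abcd"))   # [1, 2, 1, 1, 2, 1, 3, 1, 1, 1, 2]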

MTF: how good is it?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1.
Put the alphabet S in front and consider the cost of encoding
(here pix denotes the position of the i-th occurrence of symbol x, and nx its number of occurrences):

   O(|S| log |S|)  +  Σx=1,…,|S|  Σi=2,…,nx  | γ( pix − pi−1x ) |

By Jensen's inequality:

   ≤ O(|S| log |S|)  +  Σx=1,…,|S|  nx · [ 2·log(N/nx) + 1 ]
   = O(|S| log |S|)  +  N · [ 2·H0(X) + 1 ]

Hence La[mtf] ≤ 2·H0(X) + O(1).

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:
   There is a memory: it exploits spatial locality, and it is a dynamic code.
   X = 1ⁿ 2ⁿ 3ⁿ … nⁿ  ⟹  Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.,  p(a) = .2,  p(b) = .5,  p(c) = .3

   f(i) = Σj=1,…,i−1 p(j)        f(a) = .0, f(b) = .2, f(c) = .7

[figure: the unit interval [0,1) split into a = [0,.2), b = [.2,.7), c = [.7,1)]

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7)).

Arithmetic Coding: Encoding Example
Coding the message sequence: bac

   start:            [0, 1)
   after b (p=.5):   [.2, .7)
   after a (p=.2):   [.2, .3)
   after c (p=.3):   [.27, .3)

The final sequence interval is [.27, .3).

Arithmetic Coding
To code a sequence of symbols c1 c2 … cn with probabilities p[c], use the following:

   l0 = 0,   li = li−1 + si−1 · f[ci]
   s0 = 1,   si = si−1 · p[ci]

f[c] is the cumulative probability up to symbol c (not included).
The final interval size is

   sn = Πi=1,…,n p[ci]

The interval for a message sequence will be called the sequence interval.
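The recurrences above, written out in Python with plain floats (so only suitable for short messages; the integer renormalization discussed later avoids the precision problem). Probabilities follow the running example.

    def sequence_interval(msg, p):
        """Return (l, s): the sequence interval [l, l+s) of msg under distribution p."""
        symbols = sorted(p)                  # fix an order to define cumulative probs
        f, acc = {}, 0.0
        for x in symbols:                    # f[x] = sum of p over symbols before x
            f[x] = acc
            acc += p[x]
        l, s = 0.0, 1.0
        for c in msg:
            l = l + s * f[c]
            s = s * p[c]
        return l, s

    p = {"a": 0.2, "b": 0.5, "c": 0.3}
    l, s = sequence_interval("bac", p)
    print(round(l, 4), round(l + s, 4))      # 0.27 0.3  ->  the interval [.27, .3)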

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:

   .49 ∈ [.2, .7)    → b    (refine [.2,.7):  a = [.2,.3),  b = [.3,.55),  c = [.55,.7))
   .49 ∈ [.3, .55)   → b    (refine [.3,.55): a = [.3,.35), b = [.35,.475), c = [.475,.55))
   .49 ∈ [.475, .55) → c

The message is bbc.

Representing a real number
Binary fractional representation:

   .75 = .11       1/3 = .010101…       11/16 = .1011

Algorithm:
   1. x = 2·x
   2. If x < 1, output 0
   3. else x = x − 1; output 1

So how about just using the shortest binary fractional representation lying in the sequence interval?
e.g.  [0,.33) → .01      [.33,.66) → .1      [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions:

   code    min      max      interval
   .11     .110     .111     [.75, 1.0)
   .101    .1010    .1011    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

   Sequence interval: [.61, .79)
   Code interval of .101: [.625, .75)  ⊆  [.61, .79)

Can use L + s/2 truncated to 1 + ⌈log2(1/s)⌉ bits.

Bound on Arithmetic length

Note that −log2 s + 1 = log2(2/s).

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

   1 + ⌈log2(1/s)⌉ = 1 + ⌈ log2 Πi=1,…,n (1/pi) ⌉
                   ≤ 2 + Σi=1,…,n log2(1/pi)
                   = 2 + Σk=1,…,|S| n·pk·log2(1/pk)
                   = 2 + n·H0   bits

In practice ≈ nH0 + 0.02·n bits, because of rounding.

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever the sequence interval falls into the top, bottom or middle half, expand the interval by a factor of 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)

If l ≥ R/2 (top half):
   Output 1 followed by m 0s; set m = 0; the message interval is expanded by 2.

If u < R/2 (bottom half):
   Output 0 followed by m 1s; set m = 0; the message interval is expanded by 2.

If l ≥ R/4 and u < 3R/4 (middle half):
   Increment m; the message interval is expanded by 2.

In all other cases, just continue…

You find this at

Arithmetic ToolBox
As a state machine:

[figure: the ATB maps the pair (L,s) plus the next symbol c and the distribution (p1,…,p|S|) into the refined pair (L',s')]

Therefore, even the distribution can change over time.

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.
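A toy Python sketch of the context/escape bookkeeping described above (counts only, no coder attached); the escape weight used here, one per distinct successor, is just one of the possible PPM heuristics, chosen for the example.

    from collections import defaultdict

    def ppm_counts(text, k):
        """counts[context][symbol] for all contexts of length 0..k seen in text."""
        counts = defaultdict(lambda: defaultdict(int))
        for i, c in enumerate(text):
            for order in range(0, k + 1):
                if i - order < 0:
                    continue
                counts[text[i - order:i]][c] += 1
        return counts

    def prob(counts, ctx, c):
        """P(c | ctx), escaping to shorter contexts when c is unseen."""
        while True:
            seen = counts.get(ctx, {})
            total = sum(seen.values())
            esc = len(seen)                      # escape weight: one per distinct successor
            if c in seen:
                return seen[c] / (total + esc)
            if ctx == "":
                return 1.0 / (esc + 1)           # crude fallback for a brand-new symbol
            ctx = ctx[1:]                        # shorten the context and retry

    cnt = ppm_counts("ACCBACCACBA", 2)
    print(dict(cnt["A"]))                        # successors of context "A": {'C': 3}
    print(round(prob(cnt, "AC", "C"), 3))        # 0.4 under this escape heuristic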

PPM + Arithmetic ToolBox

[figure: the ATB is driven with p[ s | context ], where s is either a plain symbol c or the escape esc, mapping (L,s) into (L',s')]

Encoder and Decoder must know the protocol for selecting the same conditional probability distribution (PPM-variant).

PPM: Example Contexts        String = ACCBACCACBA B,   k = 2

   Context (empty):   A = 4   B = 2   C = 5   $ = 3

   Context A:    C = 3   $ = 1
   Context B:    A = 2   $ = 1
   Context C:    A = 1   B = 2   C = 2   $ = 3

   Context AC:   B = 1   C = 2   $ = 2
   Context BA:   C = 1   $ = 1
   Context CA:   C = 1   $ = 1
   Context CB:   A = 2   $ = 1
   Context CC:   A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
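A naive Python sketch of LZ77 parsing with a sliding window and of the decoder's overlapping copy (the same loop as above); the window search is brute force, while real implementations such as gzip hash triplets as noted in the next slide.

    def lz77_parse(T, W=6):
        """Greedy LZ77 parsing of T with a window of W chars: list of (d, len, next char)."""
        i, out = 0, []
        while i < len(T):
            best_d, best_len = 0, 0
            for d in range(1, min(i, W) + 1):          # candidate copy distances
                l = 0
                while i + l < len(T) - 1 and T[i + l - d] == T[i + l]:
                    l += 1                              # copies may overlap the cursor
                if l > best_len:
                    best_d, best_len = d, l
            out.append((best_d, best_len, T[i + best_len]))
            i += best_len + 1
        return out

    def lz77_decode(triples):
        out = []
        for d, l, c in triples:
            for _ in range(l):
                out.append(out[-d])                     # overlapping copy, as in the slides
            out.append(c)
        return "".join(out)

    ph = lz77_parse("aacaacabcabaaac")
    print(ph)                  # [(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')]
    print(lz77_decode(ph))     # 'aacaacabcabaaac'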

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one step later
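A Python sketch of LZW with the "one step behind" decoder and the SSc special case handled as described; the dictionary starts from the 256 ASCII codes, so the numeric ids differ from the slides' toy numbering, while the text is the slides' encoding example.

    def lzw_encode(T):
        dic = {chr(i): i for i in range(256)}       # initial dictionary: single chars
        out, S = [], ""
        for c in T:
            if S + c in dic:
                S += c                              # extend the current match
            else:
                out.append(dic[S])                  # emit only the id of S
                dic[S + c] = len(dic)               # add Sc to the dictionary
                S = c
        out.append(dic[S])
        return out

    def lzw_decode(codes):
        dic = {i: chr(i) for i in range(256)}
        prev = dic[codes[0]]
        out = [prev]
        for code in codes[1:]:
            if code in dic:
                cur = dic[code]
            else:                                   # the SSc special case: id not known yet
                cur = prev + prev[0]
            out.append(cur)
            dic[len(dic)] = prev + cur[0]           # decoder stays one step behind
            prev = cur
        return "".join(out)

    codes = lzw_encode("aabaacababacb")
    print(lzw_decode(codes) == "aabaacababacb")     # True (the special case is exercised)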

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

(1994)

Sorted rows, with first column F and last column L:

   F                     L
   #   mississipp   i
   i   #mississip   p
   i   ppi#missis   s
   i   ssippi#mis   s
   i   ssissippi#   m
   m   ississippi   #
   p   i#mississi   p
   p   pi#mississ   i
   s   ippi#missi   s
   s   issippi#mi   s
   s   sippi#miss   i
   s   sissippi#m   i

T = mississippi#

A famous example

Much
longer...

A useful tool: the L → F mapping

[figure: the same F and L columns of the sorted rotations of mississippi#, with the middle characters marked "unknown"]

How do we map L's chars onto F's chars?
... we need to distinguish equal chars in F...

Take two equal chars of L; rotate their rows rightward by one position: they keep the same relative order !!

The BWT is invertible

[figure: F and L columns of the sorted rotations of mississippi#, middle "unknown"]

Two key properties:
1. The LF-array maps L's chars to F's chars
2. L[ i ] precedes F[ i ] in T

Reconstruct T backward:    T = .... i p p i #

InvertBWT(L)
   Compute LF[0, n-1];
   r = 0; i = n;
   while (i > 0) {
      T[i] = L[r];
      r = LF[r]; i--;
   }

How to compute the BWT ?

SA = [ 12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3 ]        L = i p s s m # p i s s i i

The rows of the BWT matrix, in sorted order, correspond to the suffixes of T, so the matrix is given implicitly by the Suffix Array SA.

We said that L[i] precedes F[i] in T.   E.g., L[3] = T[7].
Given SA and T, we have L[i] = T[SA[i] − 1].
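A tiny Python sketch tying these pieces together: build SA by (inefficiently) sorting the suffixes, derive L as T[SA[i]−1], and invert with the LF mapping. Fine for toy inputs like mississippi#, not for real use.

    def bwt(T):
        """T must end with a unique smallest sentinel, e.g. '#'."""
        SA = sorted(range(len(T)), key=lambda i: T[i:])      # the 'elegant but inefficient' way
        L = "".join(T[i - 1] for i in SA)                    # L[i] = T[SA[i] - 1] (cyclically)
        return L, SA

    def inverse_bwt(L):
        n = len(L)
        # LF[i]: row of the sorted matrix whose first char is this occurrence of L[i]
        order = sorted(range(n), key=lambda i: (L[i], i))    # stable: equal chars keep order
        LF = [0] * n
        for f_row, i in enumerate(order):
            LF[i] = f_row
        T, r = [], 0                                         # row 0 starts with the sentinel
        for _ in range(n):
            T.append(L[r])
            r = LF[r]
        T.reverse()                        # T read cyclically, starting at the sentinel
        return "".join(T[1:] + T[:1])      # rotate so the sentinel ends the string

    L, SA = bwt("mississippi#")
    print(L)                       # ipssm#pissii
    print(inverse_bwt(L))          # mississippi#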

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution

Altavista crawl (1999) and WebBase crawl (2001): the indegree follows a power law distribution

   Pr[ in-degree(u) = k ]  ∝  1 / kα,    α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph

[figure: adjacency-matrix dot-plot (axes i, j) of a crawl with 21 million pages and 150 million links, with URL-sorting; labeled regions: Berkeley, Stanford]

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)
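A minimal Python illustration of the gap (delta) encoding of one successor list, γ-coding the gaps: only the simplest ingredient of the WebGraph format sketched above (copy-lists, copy-blocks and intervals are not modeled). The zig-zag mapping for the possibly negative first gap is my choice for the "negative entries" case.

    def gamma(x):                      # γ-code of x >= 1
        b = bin(x)[2:]
        return "0" * (len(b) - 1) + b

    def zigzag(v):                     # map 0,-1,1,-2,... to 1,2,3,4,... so γ applies
        return 2 * v + 1 if v >= 0 else -2 * v

    def encode_successors(x, succ):
        """Gap-encode the sorted successor list of node x, as in the slides:
           S(x) = { s1 - x, s2 - s1 - 1, ..., sk - s(k-1) - 1 }."""
        gaps = [succ[0] - x] + [b - a - 1 for a, b in zip(succ, succ[1:])]
        bits = gamma(len(succ))                     # out-degree first
        bits += gamma(zigzag(gaps[0]))              # the first gap may be negative
        bits += "".join(gamma(g + 1) for g in gaps[1:])
        return bits

    print(encode_successors(4, [2, 5, 6, 9, 30]))   # gaps: -2, 2, 0, 2, 20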

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides and efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
           Emacs size   Emacs time
uncompr    27 Mb        ---
gzip       8 Mb         35 secs
zdelta     1.5 Mb       42 secs

Efficient Web Access
Dual proxy architecture: a pair of proxies located on each side of the slow link use a proprietary protocol to increase performance over this link.

[figure: Client ↔ (slow link, delta-encoding) ↔ Proxy ↔ (fast link) ↔ web; both sides hold the reference page, the request flows to the web, the delta-encoded page flows back]

Use zdelta to reduce traffic:
   Old version available at both proxies
   Restricted to pages already visited (30% hits), URL-prefix match
   Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[figure: a weighted graph over the files plus a dummy node 0; edge weights are zdelta sizes (e.g. 20, 123, 220, 620, 2000), dummy edges are gzip sizes, and the min branching picks the cheapest reference for each file]

           space    time
uncompr    30 Mb    ---
tgz        20%      linear
THIS       8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
           space     time
uncompr    260 Mb    ---
tgz        12%       2 mins
THIS       8%        16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
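A toy Python sketch of the rsync matching step: the client sends per-block hashes of f_old, the server slides over f_new and emits either block references or literal bytes. Plain Python hashes stand in for the rolling 4-byte hash + MD5 pair mentioned above.

    def rsync_server(f_new, block_hashes, B):
        """Encode f_new as a list of ('copy', block index) or ('lit', byte) items."""
        lookup = {h: idx for idx, h in enumerate(block_hashes)}
        out, i = [], 0
        while i < len(f_new):
            h = hash(f_new[i:i + B])              # stand-in for the rolling hash
            if len(f_new) - i >= B and h in lookup:
                out.append(("copy", lookup[h]))
                i += B
            else:
                out.append(("lit", f_new[i]))
                i += 1
        return out

    def rsync_client(encoded, f_old, B):
        parts = []
        for kind, v in encoded:
            parts.append(f_old[v * B:(v + 1) * B] if kind == "copy" else v)
        return "".join(parts)

    B = 4
    f_old = "the cat sat on the mat"
    f_new = "the cat sat on the hat!"
    hashes = [hash(f_old[i:i + B]) for i in range(0, len(f_old), B)]
    enc = rsync_server(f_new, hashes, B)
    print(rsync_client(enc, f_old, B) == f_new)   # True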

Rsync: some experiments

           gcc size    emacs size
total      27288       27326
gzip       7563        8577
zdelta     227         1431
rsync      964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends hashes (unlike the client in rsync); the client checks them.
The server deploys the common fref to compress the new ftar (rsync compresses just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
The communication complexity is O(k · lg n · lg(n/k)) bits.

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree

[figure: the suffix tree of T# = mississippi#, with edge labels such as #, i, s, p, si, ssi, i#, pi#, ppi#, mississippi#, and leaves storing the starting positions 1..12 of the suffixes]

T# = mississippi#
     1 2 3 4 5 6 7 8 9 10 11 12

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Θ(N²) space (if SUF(T) is stored explicitly)
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
⟹ In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
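The indirected binary search of the previous slides, as a short Python sketch on top of the toy SA built by sorting; it finds the SA range of suffixes prefixed by P, hence all occ occurrences, in O(p log2 N + occ) time.

    def suffix_array(T):
        return sorted(range(len(T)), key=lambda i: T[i:])      # fine for toy texts only

    def sa_search(T, SA, P):
        """Starting positions (1-based) of all occurrences of P, via binary search on SA."""
        n, p = len(SA), len(P)
        lo, hi = 0, n
        while lo < hi:                                          # leftmost suffix with prefix >= P
            mid = (lo + hi) // 2
            if T[SA[mid]:SA[mid] + p] < P:                      # O(p) char comparisons per step
                lo = mid + 1
            else:
                hi = mid
        first = lo
        lo, hi = first, n
        while lo < hi:                                          # rightmost such suffix, plus one
            mid = (lo + hi) // 2
            if T[SA[mid]:SA[mid] + p] <= P:
                lo = mid + 1
            else:
                hi = mid
        return sorted(SA[k] + 1 for k in range(first, lo))

    T = "mississippi#"
    SA = suffix_array(T)
    print([i + 1 for i in SA])          # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
    print(sa_search(T, SA, "si"))       # [4, 7]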

Locating the occurrences

T = mississippi#,   P = si

SA = 12 11 8 5 2 1 10 9 | 7 4 | 6 3
                           ↑
        the contiguous range of suffixes prefixed by si (sippi#, sissippi#)

occ = 2: the occurrences are at positions 4 and 7.

Suffix Array search
• O(p + log2 N + occ) time      (delimiters: # < S < $)

Suffix Trays: O(p + log2 |S| + occ)   [Cole et al., '06]
String B-tree   [Ferragina-Grossi, '95]
Self-adjusting Suffix Arrays   [Ciriani et al., '02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
e.g., lcp(issippi#, ississippi#) = 4
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does it exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does it exist a substring of length ≥ L occurring ≥ C times ?
• Search for Lcp[i,i+C-2] whose entries are ≥ L


T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, entry M(i,j) = 1 iff
(1) the first i-1 characters of P match the i-1 characters of T ending at character j-1, i.e. M(i-1,j-1) = 1
(2) P[i] = T[j], i.e. the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) into the i-th position;
AND-ing this with the i-th bit of U(T[j]) establishes whether both conditions hold.
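
A small Python sketch of the Shift-And scan just described, with column M(j) kept as an integer bitmask (least-significant bit = first character of P); the names shift_and and U are only illustrative. It reproduces both the california/for example and the xabxabaaca/abaac example below.

def shift_and(T, P):
    m = len(P)
    U = {}                                   # U[c] has bit i set iff P[i+1] == c
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M = 0
    occ = []
    for j, c in enumerate(T, start=1):
        M = ((M << 1) | 1) & U.get(c, 0)     # M(j) = BitShift(M(j-1)) & U(T[j])
        if M & (1 << (m - 1)):               # P ends at position j of T
            occ.append(j - m + 1)            # 1-based starting position
    return occ

print(shift_and("california", "for"))        # -> [5]
print(shift_and("xabxabaaca", "abaac"))      # -> [5]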

Worked example (T = xabxabaaca, P = abaac), computing the columns of M with the rule above:

j=1: M(1) = BitShift(M(0)) & U(x) = (1,0,0,0,0) & (0,0,0,0,0) = (0,0,0,0,0)
j=2: M(2) = BitShift(M(1)) & U(a) = (1,0,0,0,0) & (1,0,1,1,0) = (1,0,0,0,0)
j=3: M(3) = BitShift(M(2)) & U(b) = (1,1,0,0,0) & (0,1,0,0,0) = (0,1,0,0,0)
…
j=9: M(9) = BitShift(M(8)) & U(c) = (1,1,0,0,1) & (0,0,0,0,1) = (0,0,0,0,1)

M(5,9) = 1, so P occurs in T ending at position 9, i.e. starting at position 5.

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
U(a) = (1,0,1,1,0)    U(b) = (1,1,0,0,0)    U(c) = (0,0,0,0,1)

What about ‘?’, ‘[^…]’ (negation)?

Problem 1: Another solution
Dictionary: bzip, not, or (plus space)

P = bzip = 1a 0b

[Figure: the codeword of P is searched in C(S) for S = "bzip or not bzip" with the Shift-And method; codeword-aligned candidates are marked yes, misaligned ones no.]

Speed ≈ Compression ratio

Problem 2
Dictionary: bzip, not, or (plus space)

Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring.
Example: P = o, which occurs in the terms "or" and "not".

[Figure: the matching terms are first located in the dictionary (not = 1g 0g 0a, or = 1g 0a 0b), then each of their codewords is searched in C(S) for S = "bzip or not bzip".]

Speed ≈ Compression ratio? No! Why? One scan of C(S) is needed for each term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: a text T (A B C A C A B D D A B A) with the occurrences of P1 and P2 highlighted.]

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

S is the concatenation of the patterns in P.
R is a bitmap of length m: R[i] = 1 iff S[i] is the first symbol of a pattern.

Use a variant of the Shift-And method searching for S:
  For any symbol c, U’(c) = U(c) AND R, so U’(c)[i] = 1 iff S[i] = c and it is the first symbol of a pattern.
  For any step j:
    compute M(j);
    then OR it with U’(T[j]). Why? This sets to 1 the first bit of each pattern that starts with T[j].
    Check if there are occurrences ending at j. How? (A sketch follows below.)
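
Here is that sketch, in Python and under the assumption that the total pattern length fits in an integer bitmask. E (the bitmap of pattern-final positions) is an extra ingredient, not named on the slide, used to answer the final "How?".

def multi_shift_and(T, patterns):
    S = "".join(patterns)
    U = {}
    for i, c in enumerate(S):
        U[c] = U.get(c, 0) | (1 << i)
    R = E = 0
    ends = {}                                   # bit of a pattern's last char -> pattern index
    pos = 0
    for k, p in enumerate(patterns):
        R |= 1 << pos
        E |= 1 << (pos + len(p) - 1)
        ends[pos + len(p) - 1] = k
        pos += len(p)
    M = 0
    occ = []
    for j, c in enumerate(T, start=1):
        Uc = U.get(c, 0)
        M = (((M << 1) | 1) & Uc) | (Uc & R)    # normal step, then restart patterns beginning with c
        hits = M & E                            # occurrences ending at j
        while hits:
            b = (hits & -hits).bit_length() - 1
            k = ends[b]
            occ.append((k, j - len(patterns[k]) + 1))
            hits &= hits - 1
    return occ

print(multi_shift_and("abcacabddaba", ["ca", "ab"]))
# -> [(1, 1), (0, 3), (0, 5), (1, 6), (1, 10)]   (pattern index, starting position)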

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal: given k, find all the occurrences of P
in T with up to k mismatches.
We define the matrix Ml to be an m by n binary
matrix such that:

Ml(i,j) = 1 iff
there are no more than l mismatches between the
first i characters of P and the i characters of T
ending at character j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

[Figure: P[1..i-1] aligned with T ending at j-1 with at most l mismatches (marked *), followed by an equal pair P[i] = T[j].]

BitShift(Ml(j-1)) & U(T[j])

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

[Figure: P[1..i-1] aligned with T ending at j-1 with at most l-1 mismatches (marked *); position i may mismatch.]

BitShift(Ml-1(j-1))

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example (k = 1): T = xabxabaaca, P = abaad

M1 =       1 2 3 4 5 6 7 8 9 10
       1   1 1 1 1 1 1 1 1 1 1
       2   0 0 1 0 0 1 0 1 1 0
       3   0 0 0 1 0 0 1 0 0 1
       4   0 0 0 0 1 0 0 1 0 0
       5   0 0 0 0 0 0 0 0 1 0

M0 =       1 2 3 4 5 6 7 8 9 10
       1   0 1 0 0 1 0 1 1 0 1
       2   0 0 1 0 0 1 0 0 0 0
       3   0 0 0 0 0 0 1 0 0 0
       4   0 0 0 0 0 0 0 1 0 0
       5   0 0 0 0 0 0 0 0 0 0

M1(5,9) = 1: P occurs with at most 1 mismatch ending at position 9 (abaac vs. abaad).
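
A Python sketch of the k-mismatch recurrence above (one integer bitmask per level l); it reproduces the M1 example and the earlier aatatccacaa/atcgaa example. The function name is illustrative.

def shift_and_k_mismatches(T, P, k):
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M = [0] * (k + 1)
    occ = []
    for j, c in enumerate(T, start=1):
        Uc = U.get(c, 0)
        prev = M[:]                                         # columns j-1, for every l
        M[0] = ((prev[0] << 1) | 1) & Uc
        for l in range(1, k + 1):
            M[l] = (((prev[l] << 1) | 1) & Uc) | ((prev[l - 1] << 1) | 1)
        if M[k] & (1 << (m - 1)):                           # match ending at j with <= k mismatches
            occ.append(j - m + 1)
    return occ

print(shift_and_k_mismatches("xabxabaaca", "abaad", 1))     # -> [5]
print(shift_and_k_mismatches("aatatccacaa", "atcgaa", 2))   # -> [4]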

How much do we pay?

The running time is O(kn(1+m/w)).
Again, the method is practically efficient for small m.
Only O(k) columns of M are needed at any given time, hence the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary: bzip, not, or (plus space)

Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring, allowing k mismatches. Example: P = bot, k = 2.

[Figure: the terms within k mismatches of P are located in the dictionary (e.g. not = 1g 0g 0a), then their codewords are searched in C(S) for S = "bzip or not bzip".]

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol of p into a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding
For x > 0: (Length − 1) zeros, followed by x written in binary,
where Length = ⌊log2 x⌋ + 1.

  e.g., 9 is represented as <000,1001>.

The γ-code for x takes 2⌊log2 x⌋ + 1 bits (i.e., a factor of 2 from optimal).
It is optimal for Pr(x) = 1/(2x²), and i.i.d. integers.

It is a prefix-free encoding…


Given the following sequence of γ-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

⟹  8, 6, 3, 59, 7
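
A small Python sketch of γ-encoding/decoding that can be used to check the exercise above.

def gamma_encode(x):
    # gamma code of x > 0: (len-1) zeros, then x in binary (which starts with a 1)
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":          # unary part: number of leading zeros
            z += 1
            i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out

print(gamma_encode(9))                                          # -> 0001001
print(gamma_decode("0001000001100110000011101100111"))          # -> [8, 6, 3, 59, 7]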

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).

Recall that: |γ(i)| ≤ 2·log i + 1
How good is this approach wrt Huffman?  At most 2·H0(S) + 1 bits per symbol.
Key fact:
  1 ≥ Σi=1,…,x pi ≥ x·px   ⟹   x ≤ 1/px

How good is it ?
Encode the integers via γ-coding:  |γ(i)| ≤ 2·log i + 1
The cost of the encoding is (recall i ≤ 1/pi):

  Σi=1,…,|S| pi · |γ(i)|  ≤  Σi=1,…,|S| pi · [ 2·log(1/pi) + 1 ]  =  2·H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + …

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2

A new concept: Continuers vs Stoppers
  Previously we used: s = c = 128

The main idea is:
  s + c = 256 (we are playing with 8 bits)
  Thus s items are encoded with 1 byte
  And s·c with 2 bytes, s·c² with 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...
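
A sketch of one common way to realize an (s,c)-dense code in Python; ETDC is the special case s = c = 128. The byte-value convention (stoppers use values 0..s-1 in the last byte, continuers use values s..255) is an assumption of this sketch, and actual implementations may differ in details.

def sc_encode(rank, s=128):
    # rank = 0-based position of the word in the frequency-sorted vocabulary
    c = 256 - s
    out = bytearray([rank % s])      # last byte is the (tagged) stopper
    x = rank // s
    while x > 0:
        x -= 1
        out.append(s + x % c)        # continuer bytes
        x //= c
    out.reverse()
    return bytes(out)

def sc_decode(code, s=128):
    c = 256 - s
    x = 0
    for b in code[:-1]:
        x = x * c + (b - s) + 1
    return x * s + code[-1]

# ETDC (s = c = 128): ranks 0..127 take 1 byte, ranks 128..16511 take 2 bytes
print(len(sc_encode(127)), len(sc_encode(128)), len(sc_encode(16511)), len(sc_encode(16512)))
# -> 1 2 2 3
print(sc_decode(sc_encode(6209, s=230), s=230))   # (230,26)-DC: 2 bytes up to rank 6209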

Optimal (s,c)-dense codes
Find the optimal s, assuming c = 256 − s.

Brute-force approach, or binary search:
  on real distributions there seems to be a unique minimum.

(Ks = max codeword length; Fs,k = cumulative probability of the symbols whose codeword length is ≤ k)

Experiments: (s,c)-DC is quite interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory: the code of a symbol depends on the symbols seen before it.
Properties:
  It exploits temporal locality, and it is dynamic.
  X = 1^n 2^n 3^n … n^n  ⟹  Huff = O(n² log n) bits, MTF = O(n log n) + n² bits
  Not much worse than Huffman... but it may be far better.
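
A minimal MTF coder/decoder in Python, just to make the rule concrete; the initial list and example are illustrative.

def mtf_encode(text, alphabet):
    L = list(alphabet)
    out = []
    for s in text:
        i = L.index(s)       # output the current position of s ...
        out.append(i)
        L.pop(i)
        L.insert(0, s)       # ... and move s to the front of L
    return out

def mtf_decode(codes, alphabet):
    L = list(alphabet)
    out = []
    for i in codes:
        s = L.pop(i)
        out.append(s)
        L.insert(0, s)
    return "".join(out)

codes = mtf_encode("mississippi", "imps")
print(codes)                        # -> [1, 1, 3, 0, 1, 1, 0, 1, 3, 0, 1]
print(mtf_decode(codes, "imps"))    # -> mississippi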

MTF: how good is it ?
Encode the integers via γ-coding:  |γ(i)| ≤ 2·log i + 1
Put S in front of the sequence and consider the cost of encoding:

  O(|S| log |S|)  +  Σx=1,…,|S|  Σi=2,…,nx  |γ( pxi − pxi-1 )|

(pxi is the position in S of the i-th occurrence of symbol x; the MTF rank of that occurrence is at most the gap from the previous one.) By Jensen's inequality:

  ≤  O(|S| log |S|)  +  Σx=1,…,|S|  nx · [ 2·log(N/nx) + 1 ]
  =  O(|S| log |S|)  +  N · [ 2·H0(X) + 1 ]

Hence  La[mtf] ≤ 2·H0(X) + O(1) bits per symbol.

MTF: higher compression
To achieve higher compression we consider words (and separators) as the symbols to be encoded.
How to maintain the MTF-list efficiently:

  Search tree
    Leaves contain the symbols, ordered as in the MTF-list
    Nodes contain the size of their descending subtree

  Hash Table
    key is a symbol
    data is a pointer to the corresponding tree leaf

Each tree operation takes O(log |S|) time.
Total is O(n log |S|), where n = #symbols to be compressed.

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings ⟹ just the run lengths and one starting bit.
Properties:
  There is a memory: it exploits spatial locality, and it is a dynamic code.
  X = 1^n 2^n 3^n … n^n  ⟹  Huff(X) = Θ(n² log n) > Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.  p(a) = .2, p(b) = .5, p(c) = .3
      f(a) = .0, f(b) = .2, f(c) = .7        where  f(i) = Σj<i p(j)

  a ↦ [0.0, 0.2)     b ↦ [0.2, 0.7)     c ↦ [0.7, 1.0)

The interval for a particular symbol will be called
the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac

  b:  interval [0.2, 0.7)      size 0.5
  a:  interval [0.2, 0.3)      size 0.5·0.2 = 0.1
  c:  interval [0.27, 0.3)     size 0.1·0.3 = 0.03

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols ci with probabilities p[ci] use the following:

  l0 = 0      li = li-1 + si-1 · f[ci]
  s0 = 1      si = si-1 · p[ci]

f[c] is the cumulative probability up to symbol c (not included).
Final interval size is

  sn = Πi=1,…,n p[ci]

The interval for a message sequence will be called the sequence interval.
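
The recurrences translate directly into a few lines of Python; running them on the bac example reproduces the interval [.27, .3), up to floating-point rounding (real coders use the integer version discussed later).

def sequence_interval(msg, p, f):
    # l_i = l_{i-1} + s_{i-1} * f[c_i],   s_i = s_{i-1} * p[c_i]
    l, s = 0.0, 1.0
    for c in msg:
        l = l + s * f[c]
        s = s * p[c]
    return l, s

p = {"a": 0.2, "b": 0.5, "c": 0.3}
f = {"a": 0.0, "b": 0.2, "c": 0.7}     # cumulative probability up to (not including) c
l, s = sequence_interval("bac", p, f)
print(l, l + s)                         # -> 0.27 0.3 (up to rounding)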

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:

  .49 ∈ [.2, .7)    → b    subdivide [.2,.7):   a→[.2,.3),   b→[.3,.55),    c→[.55,.7)
  .49 ∈ [.3, .55)   → b    subdivide [.3,.55):  a→[.3,.35),  b→[.35,.475),  c→[.475,.55)
  .49 ∈ [.475,.55)  → c

The message is bbc.

Representing a real number
Binary fractional representation:
  .75 = .11       1/3 = .0101…       11/16 = .1011

Algorithm (emit the bits of x ∈ [0,1)):
  1. x = 2·x
  2. if x < 1, output 0
  3. else x = x − 1, output 1

So how about just using the shortest binary
fractional representation in the sequence interval?
  e.g.  [0,.33) → .01      [.33,.66) → .1      [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions.

  code    min      max      interval
  .11     .110     .111     [.75, 1.0)
  .101    .1010    .1011    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (a dyadic number).

[Figure: sequence interval [.61, .79); the code interval of .101, i.e. [.625, .75), is contained in it.]

Can use L + s/2 truncated to 1 + ⌈log(1/s)⌉ bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

  1 + ⌈log(1/s)⌉ = 1 + ⌈log Πi (1/pi)⌉
                 ≤ 2 + Σi=1,…,n log(1/pi)
                 = 2 + Σk=1,…,|S| n·pk·log(1/pk)
                 = 2 + n·H0   bits

In practice ≈ nH0 + 0.02·n bits, because of rounding.

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in the range [0..R) where R = 2^k
Use rounding to generate integer intervals
Whenever the sequence interval falls into the top,
bottom or middle half, expand the interval
by a factor of 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 (top half):
  output 1 followed by m 0s; set m = 0; the message interval is expanded by 2
If u < R/2 (bottom half):
  output 0 followed by m 1s; set m = 0; the message interval is expanded by 2
If l ≥ R/4 and u < 3R/4 (middle half):
  increment m; the message interval is expanded by 2
All other cases: just continue...

You find this at

Arithmetic ToolBox
As a state machine

[Figure: the ATB keeps the current interval (L,s); given the next symbol c and the distribution (p1,…,p|S|), it outputs the new interval (L’,s’).]

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
[Figure: the PPM model supplies p[s | context], with s = c or esc, to the Arithmetic ToolBox, which maps the interval (L,s) to (L’,s’).]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts                 String = ACCBACCACBA B,   k = 2

  Order 0               Order 1                      Order 2
  Context: (empty)      Context   Counts             Context   Counts
  A = 4                 A         C = 3   $ = 1      AC        B = 1   C = 2   $ = 2
  B = 2                 B         A = 2   $ = 1      BA        C = 1   $ = 1
  C = 5                 C         A = 1   B = 2      CA        C = 1   $ = 1
  $ = 3                           C = 2   $ = 3      CB        A = 2   $ = 1
                                                     CC        A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:
  Output the triple ⟨d, len, c⟩ where
    d = distance of the copied string wrt the current position
    len = length of the longest match
    c = next char in the text beyond the longest match
  Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
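
A Python sketch of the decoder, copying one character at a time so that overlapping copies (len > d) work exactly as in the example above; the triples fed to it below are the ones of the windowed LZ77 example.

def lz77_decode(triples):
    out = []
    for d, length, c in triples:
        start = len(out) - d
        for i in range(length):
            out.append(out[start + i])   # overlapping copy is fine: out grows as we read it
        out.append(c)
    return "".join(out)

print(lz77_decode([(0, 0, "a"), (1, 1, "c"), (3, 4, "b"), (3, 3, "a"), (1, 2, "c")]))
# -> aacaacabcabaaac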

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids
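
A compact LZ78 encoder sketch in Python; the trie is replaced here by a hash map from strings to ids, which is equivalent for the purpose of the illustration. It reproduces the coding example on the next slide.

def lz78_encode(text):
    dic = {"": 0}
    out = []
    S = ""
    for c in text:
        if S + c in dic:
            S += c                        # extend the current match
        else:
            out.append((dic[S], c))       # emit (id of longest match, next char)
            dic[S + c] = len(dic)         # add Sc to the dictionary
            S = ""
    if S:
        # a trailing complete match: emit it as (id of S minus its last char, last char);
        # the dictionary is prefix-closed, so that id exists
        out.append((dic[S[:-1]], S[-1]))
    return out

print(lz78_encode("aabaacabcabcb"))
# -> [(0,'a'), (1,'b'), (1,'a'), (0,'c'), (2,'c'), (5,'b')]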

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
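
A sketch of the LZW encoder in Python, using the slides' toy convention a=112, b=113, c=114 instead of a full 256-entry ASCII table; the tricky decoding case mentioned above is not shown here.

def lzw_encode(text, first_code=112):
    dic = {c: first_code + i for i, c in enumerate(sorted(set(text)))}
    next_id = 256
    out = []
    S = text[0]
    for c in text[1:]:
        if S + c in dic:
            S += c
        else:
            out.append(dic[S])
            dic[S + c] = next_id      # Sc is still added, even though c is not transmitted
            next_id += 1
            S = c
    out.append(dic[S])                # final flush
    return out

print(lzw_encode("aabaacababacb"))
# -> [112, 112, 113, 256, 114, 257, 261, 114, 113]
#    (the slide's trace stops just before the final flush of 'b')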

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us be given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows                                  (1994)

  F                        L
  #  mississipp            i
  i  #mississip            p
  i  ppi#missis            s
  i  ssippi#mis            s
  i  ssissippi#            m
  m  ississippi            #
  p  i#mississi            p
  p  pi#mississ            i
  s  ippi#missi            s
  s  issippi#mi            s
  s  sippi#miss            i
  s  sissippi#m            i

L is the BWT of T

A famous example

Much
longer...

A useful tool: the L → F mapping

[Figure: the same matrix as above, with the F column and the L column known and the middle of each row unknown.]

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
[Figure: again the matrix, with F known (the sorted characters of L), the rest of each row unknown, and L known.]

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
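
A small Python sketch of the naive BWT and of the LF-based inversion just described, assuming a unique, smallest end-marker '#'.

def bwt(T):
    # naive BWT: sort all rotations of T and output the last column L
    n = len(T)
    rows = sorted(range(n), key=lambda i: T[i:] + T[:i])
    return "".join(T[(i - 1) % n] for i in rows)

def ibwt(L):
    # the k-th occurrence of a char in L corresponds to its k-th occurrence in F (= sorted L)
    n = len(L)
    order = sorted(range(n), key=lambda i: (L[i], i))   # F[f] = L[order[f]]
    LF = [0] * n
    for f, l in enumerate(order):
        LF[l] = f
    r, out = 0, []                  # row 0 is the one starting with '#'
    for _ in range(n):
        out.append(L[r])            # L[r] precedes F[r] in T
        r = LF[r]
    out.reverse()                   # out is now '#' followed by T without '#'
    return "".join(out[1:]) + out[0]

print(bwt("mississippi#"))          # -> ipssm#pissii
print(ibwt(bwt("mississippi#")))    # -> mississippi#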

How to compute the BWT ?
  SA    rotation          L
  12    #mississippi      i
  11    i#mississipp      p
   8    ippi#mississ      s
   5    issippi#miss      s
   2    ississippi#m      m
   1    mississippi#      #
  10    pi#mississip      p
   9    ppi#mississi      i
   7    sippi#missis      s
   4    sissippi#mis      s
   6    ssippi#missi      i
   3    ssissippi#mi      i
We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
  SA    suffix
  12    #
  11    i#
   8    ippi#
   5    issippi#
   2    ississippi#
   1    mississippi#
  10    pi#
   9    ppi#
   7    sippi#
   4    sissippi#
   6    ssippi#
   3    ssissippi#

Elegant but inefficient.                Input: T = mississippi#

Obvious inefficiencies:
  • Θ(n² log n) time in the worst case
  • Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion pages available (Google 7/08)
5-40K per page ⟹ hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node one can reach any other node via an
undirected path.

Strongly connected components (SCC)

Set of nodes such that from any node one can reach any other node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


The largest artifact ever conceived by humans.

Exploit the structure of the Web for:









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…

  Physical network graph
    V = routers
    E = communication links

  The “cosine” graph (undirected, weighted)
    V = static web pages
    E = semantic distance between pages

  Query-Log graph (bipartite, weighted)
    V = queries and URLs
    E = (q,u) if u is a result for q, and has been clicked by some user who issued q

  Social graph (undirected, unweighted)
    V = users
    E = (x,y) if x knows y (facebook, address book, email, ...)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has a hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:

Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows a power law distribution:

  Pr[ in-degree(u) = k ]  ∝  1/k^α,    α ≈ 2.1

WebBase Crawl 2001

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has a hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph

[Figure: the adjacency matrix of a crawl of 21 million pages and 150 million links, with URLs sorted lexicographically (Berkeley, Stanford, …).]

URL-sorting
URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y's copy-list tells whether the corresponding successor of the reference x is also a
successor of y;
the reference index is chosen in [0,W] as the one that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77-scheme provides an efficient, optimal solution:

fknown is the “previously encoded text”: compress the concatenation fknown·fnew, starting from fnew

zdelta is one of the best implementations
            Emacs size    Emacs time
  uncompr   27Mb          ---
  gzip      8Mb           35 secs
  zdelta    1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

[Figure: the client sends a request (with a reference to its cached copy) over the slow link; the proxy fetches the page from the web over the fast link and returns it delta-encoded against the reference.]

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: a small weighted graph over the files plus a dummy node; edge weights are zdelta sizes, dummy edges carry gzip sizes; the min branching picks the best reference for each file.]

            space    time
  uncompr   30Mb     ---
  tgz       20%      linear
  THIS      8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly: n² edge calculations (zdelta executions)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: estimate appropriate edge weights for G’F, thus saving
zdelta executions. Nonetheless, strictly n² time.
            space    time
  uncompr   260Mb    ---
  tgz       12%      2 mins
  THIS      8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm

[Figure: the client sends the block hashes of f_old to the server; the server sends back f_new encoded as references to matching blocks plus literal bytes.]

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
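
A toy Python sketch of the block-matching idea, with a simple additive rolling checksum standing in for rsync's 4-byte rolling hash and a full block comparison standing in for the strong MD5 check; it is only meant to illustrate the mechanism, not rsync's actual protocol.

def weak_hash(block):
    # toy rolling checksum: sum of byte values
    return sum(block) % (1 << 16)

def rsync_encode(f_new, f_old, B):
    # greedy scan of f_new: emit ("copy", i) when a block of f_old matches, else a literal byte
    blocks = [f_old[i:i + B] for i in range(0, len(f_old) - B + 1, B)]
    table = {}
    for idx, blk in enumerate(blocks):
        table.setdefault(weak_hash(blk), []).append(idx)
    out, i, n = [], 0, len(f_new)
    h = weak_hash(f_new[:B]) if n >= B else None
    while i + B <= n:
        window = f_new[i:i + B]
        hit = next((idx for idx in table.get(h, []) if blocks[idx] == window), None)
        if hit is not None:
            out.append(("copy", hit))
            i += B
            h = weak_hash(f_new[i:i + B]) if i + B <= n else None
        else:
            out.append(("lit", window[0]))
            # roll the checksum forward by one byte instead of rehashing the block
            h = (h - f_new[i] + f_new[i + B]) % (1 << 16) if i + B < n else None
            i += 1
    out.extend(("lit", b) for b in f_new[i:])
    return out

print(rsync_encode(b"the quick red fox jumps", b"the quick brown fox jumps", 4))
# -> two copies, the literal bytes of "k red ", two more copies, then a literal "s"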

Rsync: some experiments

            gcc size    emacs size
  total     27288       27326
  gzip      7563        8577
  zdelta    227         1431
  rsync     964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends the hashes (unlike the client in rsync), and the client checks them.
The server deploys the common fref to compress the new ftar (rsync compresses just it).

A multi-round protocol

k blocks of n/k elements, log(n/k) levels.

If the distance is k, then on each level at most k hashes fail to find a match in the other file.
The communication complexity is O(k·lg n·lg(n/k)) bits.

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elements, log(n/k) levels.

If the distance is k, then on each level at most k hashes fail to find a match in the other file.
The communication complexity is O(k·lg n·lg(n/k)) bits.

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree

[Figure: the compacted suffix tree of T# = mississippi#; edges are labeled with substrings of T# (e.g. i, s, si, ssi, ppi#, pi#, i#, mississippi#, #) and each leaf stores the starting position (1..12) of its suffix.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

Storing SUF(T) explicitly takes Θ(N²) space, so we keep only the suffix pointers:

  SA    SUF(T)
  12    #
  11    i#
   8    ippi#
   5    issippi#
   2    ississippi#
   1    mississippi#
  10    pi#
   9    ppi#
   7    sippi#
   4    sissippi#
   6    ssippi#
   3    ssissippi#

T = mississippi#      (example query: P = si)

Suffix Array space:
  • SA: Θ(N log2 N) bits
  • Text T: N chars
  ⟹ In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step.

[Figure: a binary-search step on the SA of T = mississippi# for P = si; the probed suffix is smaller, so “P is larger” and the search moves right.]

Searching a pattern

[Figure: the next binary-search step; now “P is smaller”, so the search moves left, until the range of suffixes prefixed by si is isolated.]

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
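
A Python sketch of the indirect binary search on SA (built here with the naive, "elegant but inefficient" construction); it returns the whole contiguous range of occurrences. Function names are illustrative.

def suffix_array(T):
    # naive construction: sort the suffix pointers (1-based, as in the slides)
    return sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])

def sa_search(T, SA, P):
    # O(p) chars compared per step, O(p log n) overall; all suffixes prefixed by P
    # are contiguous in SA, so two binary searches delimit the occurrence range.
    def pref(i):
        return T[i - 1:i - 1 + len(P)]     # the |P|-long prefix of the suffix starting at i
    lo, hi = 0, len(SA)
    while lo < hi:                          # leftmost suffix whose prefix is >= P
        mid = (lo + hi) // 2
        if pref(SA[mid]) < P:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    lo, hi = first, len(SA)
    while lo < hi:                          # leftmost suffix whose prefix is > P
        mid = (lo + hi) // 2
        if pref(SA[mid]) <= P:
            lo = mid + 1
        else:
            hi = mid
    return SA[first:lo]

T = "mississippi#"
SA = suffix_array(T)
print(SA)                      # -> [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(sa_search(T, SA, "si"))  # -> [7, 4]   (occ = 2)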

Locating the occurrences

[Figure: in the SA of T = mississippi#, the suffixes prefixed by P = si form the contiguous range (sippi#, sissippi#); hence occ = 2 and the occurrences start at positions 7 and 4. The range can be found by binary searching for the two extremes si# and si$, where # is smaller and $ is larger than every character.]

Suffix Array search
  • O(p + log2 N + occ) time      (searching the extremes si# and si$, with # < Σ < $)

  Suffix Trays: O(p + log2 |S| + occ)      [Cole et al., ’06]
  String B-tree                            [Ferragina-Grossi, ’95]
  Self-adjusting Suffix Arrays             [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest common prefix between suffixes adjacent in SA

  SA:    12  11   8   5   2   1  10   9   7   4   6   3
  Lcp:        0   1   1   4   0   0   1   0   2   1   3

T = mississippi#     (e.g. Lcp = 4 for the adjacent suffixes issippi# and ississippi#)

• How long is the common prefix between T[i,...] and T[j,...] ?
  • Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  • Search for some Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  • Search for a run Lcp[i,i+C-2] whose entries are all ≥ L.


For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with i-th bit of U(T[j]) to estabilish if both are true

An example j=1
1 2 3 4 5 6 7 8 9 10

n
m

1

12345

1

0

P=abaac

2

0

3

0

4

0

5

0

T=xabxabaaca

0
 
0
U (x)  0
 
0
 
0

2

3

4

5

6

7

1 
 
0
BitShift ( M ( 0 )) & U (T [1])   0  &
 
0
 
0

8

9

0 0
   
0 0
0  0
   
0 0
   
0 0

1
0

An example: j=2
T = xabxabaaca,  P = abaac

  U(a) = (1,0,1,1,0)ᵀ
  M(2) = BitShift(M(1)) & U(T[2]) = (1,0,0,0,0)ᵀ & (1,0,1,1,0)ᵀ = (1,0,0,0,0)ᵀ

An example: j=3
T = xabxabaaca,  P = abaac

  U(b) = (0,1,0,0,0)ᵀ
  M(3) = BitShift(M(2)) & U(T[3]) = (1,1,0,0,0)ᵀ & (0,1,0,0,0)ᵀ = (0,1,0,0,0)ᵀ

An example: j=9
T = xabxabaaca,  P = abaac

         j:  1  2  3  4  5  6  7  8  9
  i=1:       0  1  0  0  1  0  1  1  0
  i=2:       0  0  1  0  0  1  0  0  0
  i=3:       0  0  0  0  0  0  1  0  0
  i=4:       0  0  0  0  0  0  0  1  0
  i=5:       0  0  0  0  0  0  0  0  1

  U(c) = (0,0,0,0,1)ᵀ
  M(9) = BitShift(M(8)) & U(T[9]) = (1,1,0,0,1)ᵀ & (0,0,0,0,1)ᵀ = (0,0,0,0,1)ᵀ

  M(5,9) = 1: an occurrence of P ends at position 9.

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: Another solution
Dictionary = {bzip, not, or, space};  P = bzip = 1a 0b

[Figure: the same encoded text C(S) of S = "bzip or not bzip", now scanned with the Shift-And automaton built on P's codeword; candidate codeword-aligned positions are marked yes/no.]

Speed ≈ Compression ratio

Problem 2
Dictionary = {bzip, not, or, space}

Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring.

Example: P = o.  The dictionary terms containing "o" are not (= 1g 0g 0a) and or (= 1g 0a 0b).
[Figure: C(S) for S = "bzip or not bzip" is scanned once per matching term; their occurrences are marked yes.]

Speed ≈ Compression ratio? No! Why?
A scan of C(S) is needed for each term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: a text T = A B C A C A B D D A B A, with one occurrence of P1 and one occurrence of P2 highlighted.]

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

  S is the concatenation of the patterns in P.
  R is a bitmap of length m:
    R[i] = 1 iff S[i] is the first symbol of a pattern.

  Use a variant of the Shift-And method searching for S:
    For any symbol c, U'(c) = U(c) AND R
      U'(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern.
    For any step j,
      compute M(j);
      then set M(j) = M(j) OR U'(T[j]). Why?
        This sets to 1 the first bit of each pattern that starts with T[j].
      Check if there are occurrences ending in j. How?

A sketch of this variant is given below.
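
A Python sketch of the variant, under the scheme just described; the helper names (S, R, U', ends) are mine, and occurrences are reported as (pattern, starting position).

def multi_shift_and(patterns, T):
    S = "".join(patterns)
    U, R, ends, pos = {}, 0, {}, 0
    for P in patterns:
        R |= 1 << pos                          # first symbol of this pattern
        ends[pos + len(P) - 1] = P             # last symbol of this pattern
        pos += len(P)
    for i, c in enumerate(S):
        U[c] = U.get(c, 0) | (1 << i)
    M, occ = 0, []
    for j, c in enumerate(T):
        M = ((M << 1) | 1) & U.get(c, 0)
        M |= U.get(c, 0) & R                   # restart every pattern whose first symbol is T[j]
        for e, P in ends.items():              # check occurrences ending at j
            if M & (1 << e):
                occ.append((P, j - len(P) + 1))
    return occ

print(multi_shift_and(["ab", "ba"], "abba"))   # -> [('ab', 0), ('ba', 2)]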

Problem 3
Dictionary = {bzip, not, or, space}

Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring, allowing at most k mismatches.

Example: P = bot, k = 2, S = "bzip or not bzip".
[Figure: the search is run directly over the encoded text C(S).]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position 4; it also occurs with 4 mismatches starting at position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal: given k, find all the occurrences of P in T with up to k mismatches.
We define the matrix Ml to be an m by n binary matrix such that:

  Ml(i,j) = 1 iff there are no more than l mismatches between the first i
  characters of P and the i characters of T ending at character j.

What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending at j-1, with at most l mismatches, and the next pair of characters in P and T are equal:

    BitShift( Ml(j-1) ) & U( T[j] )

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending at j-1, with at most l-1 mismatches (so one mismatch can be spent on T[j]):

    BitShift( Ml-1(j-1) )

Computing Ml


We compute Ml for all l = 0, …, k.
For each j compute M(j), M1(j), …, Mk(j).
For all l, initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a match iff case 1 or case 2 holds:

    Ml(j) = [ BitShift( Ml(j-1) ) & U( T[j] ) ]  OR  BitShift( Ml-1(j-1) )
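
A Python sketch of the recurrence above, keeping one integer column per error level l = 0..k (the variable names are mine); it reproduces the M1 example that follows.

def agrep_mismatch(P, T, k):
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M = [0] * (k + 1)                          # M[l] = current column of Ml
    occ = []
    for j, c in enumerate(T):
        prev = M[:]                            # the columns for position j-1
        M[0] = ((prev[0] << 1) | 1) & U.get(c, 0)
        for l in range(1, k + 1):
            ext_match = ((prev[l] << 1) | 1) & U.get(c, 0)   # case 1: T[j] matches P[i]
            ext_mism = (prev[l - 1] << 1) | 1                # case 2: spend one mismatch on T[j]
            M[l] = ext_match | ext_mism
        if M[k] & (1 << (m - 1)):
            occ.append(j - m + 1)              # occurrence with <= k mismatches ending at j
    return occ

print(agrep_mismatch("abaad", "xabxabaaca", 1))   # -> [4]: "abaac" matches P with 1 mismatch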

Example M1
T = xabxabaaca,  P = abaad

  M1 =     j:  1  2  3  4  5  6  7  8  9  10
      i=1:     1  1  1  1  1  1  1  1  1  1
      i=2:     0  0  1  0  0  1  0  1  1  0
      i=3:     0  0  0  1  0  0  1  0  0  1
      i=4:     0  0  0  0  1  0  0  1  0  0
      i=5:     0  0  0  0  0  0  0  0  1  0

  M0 =     j:  1  2  3  4  5  6  7  8  9  10
      i=1:     0  1  0  0  1  0  1  1  0  1
      i=2:     0  0  1  0  0  1  0  0  0  0
      i=3:     0  0  0  0  0  0  1  0  0  0
      i=4:     0  0  0  0  0  0  0  1  0  0
      i=5:     0  0  0  0  0  0  0  0  0  0

  M1(5,9) = 1: the substring abaac ending at position 9 matches P = abaad with one mismatch.

How much do we pay?





The running time is O( k·n·(1 + m/w) ).
Again, the method is practically efficient for small m.
Moreover, only O(k) columns of M are needed at any given time; hence the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary = {bzip, not, or, space}

Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring, allowing k mismatches.

Example: P = bot, k = 2, S = "bzip or not bzip"; the term not (= 1g 0g 0a) is reported.
[Figure: the search runs directly over the encoded text C(S); matches are marked yes.]

Agrep: more sophisticated operations


The Shift-And method can solve other ops.

The edit distance between two strings p and s is d(p,s) = the minimum number of operations needed to transform p into s via three ops:
  Insertion: insert a symbol into p
  Deletion: delete a symbol from p
  Substitution: change a symbol of p into a different one

Example: d(ananas, banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding

  γ(x) = 0 0 … 0 (Length-1 zeroes), followed by x in binary
  where x > 0 and Length = ⌊log2 x⌋ + 1
  e.g., 9 is represented as <000,1001>

  The γ-code for x takes 2·⌊log2 x⌋ + 1 bits (i.e., a factor of 2 from optimal)

  It is optimal for Pr(x) = 1/(2x²) and i.i.d. integers
  It is a prefix-free encoding…


Exercise: given the following sequence of γ-coded integers, reconstruct the original sequence:

  0001000001100110000011101100111

Answer: 8, 6, 3, 59, 7
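
A Python sketch of γ-encoding and decoding; run on the exercise above it reproduces both the bit string and the answer.

def gamma_encode(x):
    assert x > 0
    b = bin(x)[2:]                          # x in binary, len(b) = floor(log2 x) + 1
    return "0" * (len(b) - 1) + b           # Length-1 zeroes, then x in binary

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":               # count the leading zeroes = Length-1
            z += 1
            i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out

code = "".join(gamma_encode(x) for x in [8, 6, 3, 59, 7])
print(code)                                 # 0001000001100110000011101100111
print(gamma_decode(code))                   # [8, 6, 3, 59, 7]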

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).

Recall that: |γ(i)| ≤ 2·log i + 1
How good is this approach w.r.t. Huffman?
Compression ratio ≤ 2·H0(S) + 1
Key fact:
  1 ≥ Σ_{i=1,…,x} pi ≥ x·px   ⟹   x ≤ 1/px

How good is it?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1
The cost of the encoding is (recall i ≤ 1/pi):

  Σ_{i=1,…,|S|} pi · |γ(i)|  ≤  Σ_{i=1,…,|S|} pi · [ 2·log(1/pi) + 1 ]  =  2·H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + …

A better encoding


Byte-aligned and tagged Huffman
  128-ary Huffman tree
  The first bit of the first byte is tagged
  Configurations on 7 bits: just those of Huffman

End-tagged dense code
  The rank r is mapped to the r-th binary sequence on 7·k bits
  The first bit of the last byte is tagged

A better encoding
Surprising changes:
  It is a prefix code
  Better compression: it uses all the 7-bit configurations

(s,c)-dense codes
The distribution of words is skewed: 1/i^q, where 1 < q < 2

A new concept: Continuers vs Stoppers
  The main idea: byte values in [0,s) are stoppers (they end a codeword), the other c values are continuers (the codeword goes on). Previously we used s = c = 128.

  s + c = 256 (we are playing with 8 bits)
  Thus s items are encoded with 1 byte,
  s·c items with 2 bytes, s·c² items with 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 128² = 16512 words on at most 2 bytes
A (230,26)-dense code encodes 230 + 230·26 = 6210 words on at most 2 bytes,
hence it places more words on 1 byte; thus, if the distribution is skewed, it compresses better...
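
A Python sketch of (s,c)-dense encoding of a rank (the 0-based position of a word in the frequency-sorted dictionary), under the stopper/continuer convention described above; the function name and layout are mine.

def sc_encode(rank, s, c):
    assert s + c == 256
    out = [rank % s]                        # stopper byte, value in [0, s)
    rank //= s
    while rank > 0:
        rank -= 1
        out.append(s + (rank % c))          # continuer byte, value in [s, 256)
        rank //= c
    return bytes(reversed(out))

# With (s,c) = (230,26): ranks 0..229 take 1 byte, ranks 230..6209 take 2 bytes.
print(len(sc_encode(100, 230, 26)), len(sc_encode(3000, 230, 26)))    # 1 2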

Optimal (s,c)-dense codes
Find the optimal s, assuming c = 256 - s.


Brute-force approach.
Binary search: on real distributions, there seems to be a unique minimum.

  Ks  = max codeword length
  Fsk = cumulative probability of the symbols whose codeword length is ≤ k

Experiments: (s,c)-DC is much more interesting…

Search is 6% faster than byte-aligned Huffword.

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L
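
A Python sketch of the two steps above (positions are emitted 0-based, as in the bzip example later on).

def mtf_encode(text, alphabet):
    L = list(alphabet)
    out = []
    for s in text:
        p = L.index(s)                      # 1) output the position of s in L
        out.append(p)
        L.pop(p)                            # 2) move s to the front of L
        L.insert(0, s)
    return out

def mtf_decode(codes, alphabet):
    L = list(alphabet)
    out = []
    for p in codes:
        s = L[p]
        out.append(s)
        L.pop(p)
        L.insert(0, s)
    return "".join(out)

codes = mtf_encode("ipppssssssmmmii", "imps")
print(codes)                                # [0, 2, 0, 0, 3, 0, ...]: runs become runs of zeroes
print(mtf_decode(codes, "imps"))            # ipppssssssmmmii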

There is a memory: it exploits temporal locality, and it is dynamic.
Properties:

  X = 1^n 2^n 3^n … n^n  ⟹  Huff = O(n² log n),  MTF = O(n log n) + n²

Not much worse than Huffman ...but it may be far better.

MTF: how good is it?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1
Put S in front and consider the cost of encoding (p_x^i = position of the i-th occurrence of symbol x, nx = #occurrences of x):

  O(|S| log |S|)  +  Σ_{x=1,…,|S|}  Σ_{i=2,…,nx}  γ( p_x^i - p_x^{i-1} )

By Jensen's inequality:

  ≤  O(|S| log |S|)  +  Σ_{x=1,…,|S|}  nx · [ 2·log(N/nx) + 1 ]
  =  O(|S| log |S|)  +  N · [ 2·H0(X) + 1 ]

Hence  La[mtf]  ≤  2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and separators) as the symbols to be encoded.
How to maintain the MTF-list efficiently:

  Search tree
    Leaves contain the symbols, ordered as in the MTF-list
    Nodes contain the size of their descending subtree
  Hash Table
    key = a symbol
    data = a pointer to the corresponding tree leaf

Each tree operation takes O(log |S|) time.
Total is O(n log |S|), where n = #symbols to be compressed.

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:
  There is a memory: it exploits spatial locality, and it is a dynamic code.

  X = 1^n 2^n 3^n … n^n  ⟹  Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)
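
A short Python sketch of RLE as in the example above.

def rle(s):
    out, i = [], 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1
        out.append((s[i], j - i))           # (symbol, run length)
        i = j
    return out

print(rle("abbbaacccca"))                   # [('a',1), ('b',3), ('a',2), ('c',4), ('a',1)]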

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.  p(a) = .2,  p(b) = .5,  p(c) = .3

    f(i) = Σ_{j=1,…,i-1} p(j)        f(a) = .0,  f(b) = .2,  f(c) = .7

[Figure: the unit interval [0,1) split into a = [0,.2), b = [.2,.7), c = [.7,1.0)]

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac    (p(a)=.2, p(b)=.5, p(c)=.3)

  after b:  interval [.2, .7)
  after a:  interval [.2, .3)
  after c:  interval [.27, .3)

The final sequence interval is [.27, .3)

Arithmetic Coding
To code a sequence of symbols c1…cn with probabilities p[c], use the following:

    l0 = 0,   li = li-1 + si-1 · f[ci]
    s0 = 1,   si = si-1 · p[ci]

f[c] is the cumulative prob. up to symbol c (not included).
The final interval size is

    sn = Π_{i=1,…,n} p[ci]

The interval for a message sequence will be called the sequence interval.
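
A Python sketch of the recurrences above on real numbers (illustration only; the integer version with scaling is discussed later).

def sequence_interval(msg, p, f):
    l, s = 0.0, 1.0
    for c in msg:
        l = l + s * f[c]                    # li = li-1 + si-1 * f[ci]
        s = s * p[c]                        # si = si-1 * p[ci]
    return l, l + s                         # the sequence interval [l, l+s)

p = {'a': .2, 'b': .5, 'c': .3}
f = {'a': .0, 'b': .2, 'c': .7}
print(sequence_interval("bac", p, f))       # ~ (0.27, 0.3), the interval of the example above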

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore, specifying any number in the final interval uniquely determines the message.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:

  .49 ∈ [.2, .7)      → b
  .49 ∈ [.3, .55)     → b
  .49 ∈ [.475, .55)   → c

The message is bbc.

Representing a real number
Binary fractional representation:

    .75 = .11        1/3 = .0101…        11/16 = .1011

Algorithm (emit the bits of x ∈ [0,1)):
  1. x = 2·x
  2. If x < 1, output 0
  3. else x = x - 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
e.g.  [0,.33) → .01      [.33,.66) → .1      [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions.

    number    min       max       interval
    .11       .110…     .111…     [.75, 1.0)
    .101      .1010…    .1011…    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval is contained in the sequence interval (a dyadic number).

[Figure: sequence interval [.61, .79) containing the code interval of .101, i.e. [.625, .75)]

Can use L + s/2 truncated to 1 + ⌈log(1/s)⌉ bits.

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

    1 + ⌈log(1/s)⌉ = 1 + ⌈log Π_{i=1,…,n} (1/pi)⌉
                   ≤ 2 + Σ_{i=1,…,n} log(1/pi)
                   = 2 + Σ_{k=1,…,|S|} n·pk·log(1/pk)
                   = 2 + n·H0   bits

In practice ≈ n·H0 + 0.02·n bits, because of rounding.

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)

  If l ≥ R/2 (top half):
    Output 1 followed by m 0s;  m = 0;  the message interval is expanded by 2
  If u < R/2 (bottom half):
    Output 0 followed by m 1s;  m = 0;  the message interval is expanded by 2
  If l ≥ R/4 and u < 3R/4 (middle half):
    Increment m;  the message interval is expanded by 2
  In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine:

[Figure: the Arithmetic ToolBox (ATB) takes the current interval (L,s) and the next symbol c, drawn from the distribution (p1,…,p|S|), and produces the new interval (L',s').]

Therefore, even the distribution can change over time.

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
[Figure: the ATB is driven by p[s | context], where s is either a character c or the escape symbol esc; it maps the interval (L,s) to (L',s').]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts         String = ACCBACCACBA B,   k = 2

  Context Empty:   A = 4   B = 2   C = 5   $ = 3

  Context A:   C = 3   $ = 1
  Context B:   A = 2   $ = 1
  Context C:   A = 1   B = 2   C = 2   $ = 3

  Context AC:  B = 1   C = 2   $ = 2
  Context BA:  C = 1   $ = 1
  Context CA:  C = 1   $ = 1
  Context CB:  A = 2   $ = 1
  Context CC:  A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm's step:
  Output <d, len, c> where
    d   = distance of the copied string w.r.t. the current position
    len = length of the longest match
    c   = next char in the text beyond the longest match
  Advance by len + 1

A buffer "window" has fixed length and moves along the text.

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character
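
A Python sketch of the LZ77 parsing with a bounded window, following the step described above; run on the example it produces the same triples.

def lz77(T, W):
    i, out = 0, []
    while i < len(T):
        best_d = best_len = 0
        for s in range(max(0, i - W), i):               # candidate copy start inside the window
            l = 0
            while i + l < len(T) - 1 and T[s + l] == T[i + l]:
                l += 1                                   # the copy may overlap the cursor
            if l > best_len:
                best_d, best_len = i - s, l
        out.append((best_d, best_len, T[i + best_len]))  # <d, len, next char>
        i += best_len + 1
    return out

print(lz77("aacaacabcabaaac", 6))
# [(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')]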

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later
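
A Python sketch of LZW decoding, including the special handling of the SSc case mentioned above (the decoder is one step behind, so a code may refer to the entry currently being built); it is fed the codes of the encoding example, with the slides' convention a=112, b=113, c=114 and new entries starting at 256.

def lzw_decode(codes, base, next_code=256):
    dic = dict(base)                          # initial dictionary: code -> single char
    nxt = next_code
    prev = dic[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in dic:
            cur = dic[code]
        else:                                 # SSc case: the code is the entry being built
            cur = prev + prev[0]
        out.append(cur)
        dic[nxt] = prev + cur[0]              # the entry the encoder added one step earlier
        nxt += 1
        prev = cur
    return "".join(out)

codes = [112, 112, 113, 256, 114, 257, 261, 114, 113]
print(lzw_decode(codes, {112: 'a', 113: 'b', 114: 'c'}))   # aabaacababacb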

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us be given a text T = mississippi#.  Build all its cyclic rotations and sort them:

    all rotations         sorted rotations:   F  (middle)       L
    mississippi#                              #  mississipp     i
    ississippi#m                              i  #mississip     p
    ssissippi#mi                              i  ppi#missis     s
    sissippi#mis                              i  ssippi#mis     s
    issippi#miss                              i  ssissippi#     m
    ssippi#missi                              m  ississippi     #
    sippi#missis                              p  i#mississi     p
    ippi#mississ                              p  pi#mississ     i
    ppi#mississi                              s  ippi#missi     s
    pi#mississip                              s  issippi#mi     s
    i#mississipp                              s  sippi#miss     i
    #mississippi                              s  sissippi#m     i

    (Burrows-Wheeler, 1994)          L = BWT(T)

A famous example

Much
longer...

A useful tool: L
[Same sorted-rotation matrix as above, with first column F and last column L; T is unknown.]

How do we map L's chars onto F's chars?
… we need to distinguish equal chars in F …

Take two equal chars of L: rotating their rows rightward by one position shows that they keep the same relative order in F !!

The BWT is invertible
[Same sorted-rotation matrix, columns F and L; T is unknown.]
Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
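
A Python sketch of the transform (via sorted rotations, as in the picture) and of InvertBWT above, with LF computed by a stable sort of L; it assumes the end-marker is the smallest character.

def bwt(T):
    rot = sorted(T[i:] + T[:i] for i in range(len(T)))     # all rotations, sorted
    return "".join(r[-1] for r in rot)                      # last column L

def inverse_bwt(L):
    n = len(L)
    order = sorted(range(n), key=lambda r: (L[r], r))       # stable: equal chars keep their order
    LF = [0] * n
    for f_row, r in enumerate(order):
        LF[r] = f_row                                       # LF maps L[r] to its row in F
    T = [''] * n
    T[n - 1] = min(L)                                       # row 0 starts with the end-marker
    r = 0
    for i in range(n - 2, -1, -1):
        T[i] = L[r]                                         # L[r] precedes F[r] in T
        r = LF[r]
    return "".join(T)

L = bwt("mississippi#")
print(L)                                                    # ipssm#pissii
print(inverse_bwt(L))                                       # mississippi#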

How to compute the BWT ?
  SA     BWT matrix (sorted rotations)     L
  12     #mississipp                       i
  11     i#mississip                       p
   8     ippi#missis                       s
   5     issippi#mis                       s
   2     ississippi#                       m
   1     mississippi                       #
  10     pi#mississi                       p
   9     ppi#mississ                       i
   7     sippi#missi                       s
   4     sissippi#mi                       s
   6     ssippi#miss                       i
   3     ssissippi#m                       i

We said that: L[i] precedes F[i] in T.   Example: L[3] = T[7].
Given SA and T, we have L[i] = T[ SA[i] - 1 ].

How to construct SA from T ?
  SA     suffix
  12     #
  11     i#
   8     ippi#
   5     issippi#
   2     ississippi#
   1     mississippi#
  10     pi#
   9     ppi#
   7     sippi#
   4     sissippi#
   6     ssippi#
   3     ssissippi#

Input: T = mississippi#

Elegant but inefficient.
Obvious inefficiencies:
  • Θ(n² log n) time in the worst case
  • Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web's Characteristics

  Size
    1 trillion pages available (Google, 7/2008)
    5-40K per page => hundreds of terabytes
    Size grows every day!!

  Change
    8% new pages and 25% new links change weekly
    Life time of about 10 days

The Bow Tie

Some definitions

  Weakly connected components (WCC)
    Set of nodes such that from any node one can reach any other node via an undirected path.

  Strongly connected components (SCC)
    Set of nodes such that from any node one can reach any other node via a directed path.

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


The largest artifact ever conceived by humans

Exploit the structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…

  Physical network graph
    V = Routers
    E = communication links

  The "cosine" graph (undirected, weighted)
    V = static web pages
    E = semantic distance between pages

  Query-Log graph (bipartite, weighted)
    V = queries and URLs
    E = (q,u) if u is a result for q and has been clicked by some user who issued q

  Social graph (undirected, unweighted)
    V = users
    E = (x,y) if x knows y (facebook, address book, email, ...)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution

  Altavista crawl, 1999 and WebBase crawl, 2001
  Indegree follows a power law distribution:

      Pr[ in-degree(u) = k ]  ∝  1 / k^α ,    α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
[Figure: adjacency-matrix view (i vs j) of a crawl with 21 million pages and 150 million links; after URL-sorting the Berkeley and Stanford hosts stand out as clusters.]

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:
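
A Python sketch of the gap transformation of a successor list given above (locality keeps the gaps small); the special mapping for negative entries is not shown here.

def gaps(x, successors):
    # S(x) = { s1-x, s2-s1-1, ..., sk-s_{k-1}-1 }
    out = [successors[0] - x]
    for prev, s in zip(successors, successors[1:]):
        out.append(s - prev - 1)
    return out

print(gaps(15, [16, 17, 19, 23, 24]))        # [1, 0, 1, 3, 0]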

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
[Figure: a sender and a receiver connected by a network link; the receiver already has some knowledge about the data.]

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77-scheme provides an efficient, optimal solution:

  fknown is the "previously encoded text"; compress the concatenation fknown·fnew starting from fnew.

zdelta is one of the best implementations.
              Emacs size    Emacs time
  uncompr     27Mb          ---
  gzip        8Mb           35 secs
  zdelta      1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

[Figure: Client ↔ client-side proxy ↔ slow link (delta-encoding) ↔ server-side proxy ↔ fast link ↔ web; both proxies keep the reference page.]

Use zdelta to reduce traffic:
  The old version is available at both proxies
  Restricted to pages already visited (30% hits), URL-prefix match
  Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: a weighted graph GF over the files plus the dummy node; the min branching picks the cheapest reference for each file.]

              space    time
  uncompr     30Mb     ---
  tgz         20%      linear
  THIS        8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
              space     time
  uncompr     260Mb     ---
  tgz         12%       2 mins
  THIS        8%        16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
[Figure: the Client holds f_old and requests an update; the Server holds f_new and sends the update.]

  The client wants to update an out-dated file
  The server has the new file but does not know the old file
  Update without sending the entire f_new (exploiting similarity)
  rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch, since the server has both copies of the files.

The rsync algorithm
[Figure: the Client sends block hashes of f_old; the Server replies with the encoded file built from f_new.]

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

            gcc size    emacs size
  total     27288       27326
  gzip      7563        8577
  zdelta    227         1431
  rsync     964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

  k blocks of n/k elements
  log(n/k) levels

  If the distance is k, then on each level at most k hashes do not find a match in the other file.
  The communication complexity is O(k·lg n·lg(n/k)) bits.

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...



Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Figure: the suffix tree of T# = mississippi#; edge labels are substrings of T (e.g. ssi, ppi#, si, i#) and the 12 leaves store the starting positions of the suffixes.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

  SA     SUF(T)               (storing SUF(T) explicitly would take Θ(N²) space)
  12     #
  11     i#
   8     ippi#
   5     issippi#
   2     ississippi#
   1     mississippi#
  10     pi#
   9     ppi#
   7     sippi#
   4     sissippi#
   6     ssippi#
   3     ssissippi#

  T = mississippi#        P = si        (SA entries are suffix pointers)

Suffix Array space:
  • SA: Θ(N log2 N) bits
  • Text T: N chars
  In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison.

  SA = 12 11 8 5 2 1 10 9 7 4 6 3        T = mississippi#        P = si

  Compare P with the suffix starting at T[SA[mid]]: here P is larger, so the search moves right.
  2 accesses per step.

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison.

  SA = 12 11 8 5 2 1 10 9 7 4 6 3        T = mississippi#        P = si

  Here P is smaller than the compared suffix, so the search moves left.

Suffix Array search:
  • O(log2 N) binary-search steps
  • Each step takes O(p) char comparisons
  ⟹ overall, O(p log2 N) time

  Improvable to O(p + log2 N) [Manber-Myers, '90], and to alphabet-dependent bounds [Cole et al, '06].
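
A Python sketch of the "elegant but inefficient" construction and of the indirect binary search above (each comparison costs O(p)); on the running example it returns the SA of the slides and the two occurrences of P = si.

def suffix_array(T):
    return sorted(range(len(T)), key=lambda i: T[i:])      # Theta(n^2 log n) worst case: demo only

def sa_search(T, SA, P):
    lo, hi = 0, len(SA)
    while lo < hi:                                          # leftmost suffix with prefix >= P
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + len(P)] < P:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    lo, hi = first, len(SA)
    while lo < hi:                                          # leftmost suffix with prefix > P
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + len(P)] <= P:
            lo = mid + 1
        else:
            hi = mid
    return sorted(SA[first:lo])                             # starting positions of the occurrences

T = "mississippi#"
SA = suffix_array(T)
print([i + 1 for i in SA])                                  # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print([i + 1 for i in sa_search(T, SA, "si")])              # [4, 7]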

Locating the occurrences
  T = mississippi#,   P = si,   occ = 2

  Binary-search for the SA range of the suffixes prefixed by P (conceptually, between si# and si$,
  where # is smaller and $ larger than every character): it contains SA = 7 (sippi#) and SA = 4 (sissippi#).

Suffix Array search: O(p + log2 N + occ) time

  Suffix Trays: O(p + log2 |S| + occ)      [Cole et al., '06]
  String B-tree                            [Ferragina-Grossi, '95]
  Self-adjusting Suffix Arrays             [Ciriani et al., '02]

Text mining
Lcp[1,N-1] = length of the longest-common-prefix between suffixes adjacent in SA

  SA:   12  11   8   5   2   1  10   9   7   4   6   3
  Lcp:       0   1   1   4   0   0   1   0   2   1   3

  T = mississippi#
  e.g. the Lcp between issippi# and ississippi# is 4.

• How long is the common prefix between T[i,...] and T[j,...]?
  • It is the minimum of the subarray Lcp[h,k-1], where SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L?
  • Search for some Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times?
  • Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.


Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with i-th bit of U(T[j]) to estabilish if both are true

An example j=1
1 2 3 4 5 6 7 8 9 10

n
m

1

12345

1

0

P=abaac

2

0

3

0

4

0

5

0

T=xabxabaaca

0
 
0
U (x)  0
 
0
 
0

2

3

4

5

6

7

1 
 
0
BitShift ( M ( 0 )) & U (T [1])   0  &
 
0
 
0

8

9

0 0
   
0 0
0  0
   
0 0
   
0 0

1
0

An example j=2
1 2 3 4 5 6 7 8 9 10

n
m

1

2

12345

1

0

1

P=abaac

2

0

0

3

0

0

4

0

0

5

0

0

T=xabxabaaca

1 
 
0
U (a )   1 
 
1 
 
0

3

4

5

6

7

1 
 
0
BitShift ( M (1)) & U (T [ 2 ])   0  &
 
0
 
0

8

9

1  1 
   
0 0
1    0 
   
1   0 
   
0 0

1
0

An example j=3
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

12345

1

0

1

0

P=abaac

2

0

0

1

3

0

0

0

4

0

0

0

5

0

0

0

T=xabxabaaca

0
 
1 
U (b )   0 
 
0
 
0

4

5

6

7

1 
 
1 
BitShift ( M ( 2 )) & U (T [ 3 ])   0  &
 
0
 
0

8

9

0 0
   
1  1 
0  0
   
0 0
   
0 0

1
0

An example j=9
P = abaac,  T = xabxabaaca,  U(c) = (0 0 0 0 1)
M(9) = BitShift(M(8)) & U(T[9]) = (1 1 0 0 1) & (0 0 0 0 1) = (0 0 0 0 1)
M(5,9) = 1, so P occurs in T ending at position 9 (i.e., starting at position 5).

The matrix M computed so far (rows i = 1..5, columns j = 1..9):
        x a b x a b a a c
        1 2 3 4 5 6 7 8 9
  a  1  0 1 0 0 1 0 1 1 0
  b  2  0 0 1 0 0 1 0 0 0
  a  3  0 0 0 0 0 0 1 0 0
  a  4  0 0 0 0 0 0 0 1 0
  c  5  0 0 0 0 0 0 0 0 1

Shift-And method: Complexity








If m ≤ w, any column and vector U() fit in a memory word.
  → Any step requires O(1) time.
If m > w, any column and vector U() can be divided into m/w memory words.
  → Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when the pattern length is close to the word size.
  → Very often in practice. Recall that w = 64 bits in modern architectures.
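A compact C++ sketch of the whole method for m ≤ w (bit i-1 of a 64-bit word plays the role of row i of M; the strings are those of the running example):

// Shift-And exact matching, assuming m <= 64 so one column of M fits in a word.
#include <cstdint>
#include <iostream>
#include <string>
using namespace std;

int main() {
    string T = "xabxabaaca", P = "abaac";
    int n = T.size(), m = P.size();
    uint64_t U[256] = {0};                    // U[c] has bit i set iff P[i+1] == c
    for (int i = 0; i < m; i++) U[(unsigned char)P[i]] |= 1ULL << i;
    uint64_t M = 0;                           // column j-1 of the matrix M
    for (int j = 0; j < n; j++) {
        M = ((M << 1) | 1ULL) & U[(unsigned char)T[j]];   // BitShift(M(j-1)) & U(T[j])
        if (M & (1ULL << (m - 1)))            // M(m,j) = 1: occurrence ending at j
            cout << "occurrence ending at position " << j + 1 << "\n";
    }
}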

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
U(a) = (1 0 1 1 0)     U(b) = (1 1 0 0 0)     U(c) = (0 0 0 0 1)

What about '?', '[^…]' (not).

Problem 1: Another solution
P = bzip = 1a 0b,  S = "bzip or not bzip",  dictionary = {bzip, not, or, space}
[Figure: the compressed text C(S) is scanned with the Shift-And automaton for the pattern's codeword; the codeword-aligned occurrences of bzip are accepted ("yes"), the misaligned candidates are rejected ("no").]

Speed ≈ Compression ratio

Problem 2
Given a pattern P, find all the occurrences in S of all the terms containing P as a substring.
Example: P = o, dictionary = {bzip, not, or, space}, S = "bzip or not bzip"; the matching terms have codewords not = 1g 0g 0a and or = 1g 0a 0b.
[Figure: the compressed text C(S) is scanned once per matching codeword.]

Speed ≈ Compression ratio? No! Why?
A scan of C(S) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: a text T with the occurrences of patterns P1 and P2 highlighted]

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

S is the concatenation of the patterns in P
R is a bitmap of length m:
  R[i] = 1 iff S[i] is the first symbol of a pattern
Use a variant of the Shift-And method searching for S:
  For any symbol c, U'(c) = U(c) AND R
    U'(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
  For any step j,
    compute M(j) as usual
    then set M(j) = M(j) OR U'(T[j]). Why?
      It sets to 1 the first bit of each pattern that starts with T[j]
 Check if there are occurrences ending in j. How?
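A possible C++ sketch of this variant (the pattern set {aba, baa} and the text are illustrative; R marks the first symbol of each pattern, E its last symbol, and U'(c) = U(c) AND R as above):

// Multi-pattern Shift-And over the concatenation S of all patterns (|S| <= 64).
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>
using namespace std;

int main() {
    vector<string> pats = {"aba", "baa"};        // example pattern set P
    string T = "xabaabaa";
    string S; uint64_t R = 0, E = 0;             // R: pattern starts, E: pattern ends
    for (auto& p : pats) {
        R |= 1ULL << S.size();
        S += p;
        E |= 1ULL << (S.size() - 1);
    }
    uint64_t U[256] = {0};
    for (size_t i = 0; i < S.size(); i++) U[(unsigned char)S[i]] |= 1ULL << i;
    uint64_t M = 0;
    for (size_t j = 0; j < T.size(); j++) {
        uint64_t Uc = U[(unsigned char)T[j]];
        M = ((M << 1) & Uc) | (Uc & R);          // extend matches, restart at pattern starts
        if (M & E)                               // some pattern ends at position j
            cout << "some pattern occurrence ends at position " << j + 1 << "\n";
    }
}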

Problem 3
Given a pattern P, find all the occurrences in S of all the terms containing P as a substring, allowing at most k mismatches.
Example: P = bot, k = 2, dictionary = {bzip, not, or, space}, S = "bzip or not bzip".
[Figure: the compressed text C(S) with the candidate codewords to be verified.]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal: given k, find all the occurrences of P in T with up to k mismatches.
We define the matrix Ml to be an m by n binary matrix such that:

Ml(i,j) = 1 iff there are no more than l mismatches between the first i characters of P and the i characters of T ending at character j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M0(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

[Figure: P[1..i-1] aligned with T ending at j-1 with at most l mismatches (marked *), followed by P[i] = T[j]]

This case is captured by:  BitShift( Ml(j-1) ) & U(T[j])

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

[Figure: P[1..i-1] aligned with T ending at j-1 with at most l-1 mismatches; the pair P[i], T[j] may mismatch]

This case is captured by:  BitShift( Ml-1(j-1) )

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1
T = xabxabaaca,  P = abaad

M1 =
        x a b x a b a a c a
        1 2 3 4 5 6 7 8 9 10
     1  1 1 1 1 1 1 1 1 1 1
     2  0 0 1 0 0 1 0 1 1 0
     3  0 0 0 1 0 0 1 0 0 1
     4  0 0 0 0 1 0 0 1 0 0
     5  0 0 0 0 0 0 0 0 1 0

M0 =
        x a b x a b a a c a
        1 2 3 4 5 6 7 8 9 10
     1  0 1 0 0 1 0 1 1 0 1
     2  0 0 1 0 0 1 0 0 0 0
     3  0 0 0 0 0 0 1 0 0 0
     4  0 0 0 0 0 0 0 1 0 0
     5  0 0 0 0 0 0 0 0 0 0

How much do we pay?





The running time is O(kn(1+m/w)).
Again, the method is practically efficient for small m.
Still, only O(k) columns of M are needed at any given time. Hence, the space used by the algorithm is O(k) memory words.
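A small C++ sketch of the k-mismatch recurrence, keeping one word per level l = 0..k (m ≤ 64; P = abaad, T = xabxabaaca and k = 1 as in the example above):

// Shift-And with mismatches: Ml(j) = [BitShift(Ml(j-1)) & U(T[j])] OR BitShift(Ml-1(j-1)).
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>
using namespace std;

int main() {
    string T = "xabxabaaca", P = "abaad";
    int k = 1, n = T.size(), m = P.size();
    uint64_t U[256] = {0};
    for (int i = 0; i < m; i++) U[(unsigned char)P[i]] |= 1ULL << i;
    vector<uint64_t> M(k + 1, 0);                     // M[l] = column of the l-mismatch matrix
    for (int j = 0; j < n; j++) {
        uint64_t Uc = U[(unsigned char)T[j]], prev = 0;   // prev = old M[l-1]
        for (int l = 0; l <= k; l++) {
            uint64_t old = M[l];
            M[l] = (((old << 1) | 1ULL) & Uc)         // case 1: characters match
                 | (l ? ((prev << 1) | 1ULL) : 0);    // case 2: spend one mismatch
            prev = old;
            if (M[l] & (1ULL << (m - 1)))
                cout << "occurrence ending at " << j + 1
                     << " with <= " << l << " mismatches\n";
        }
    }
}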

Problem 3: Solution
Given a pattern P, find all the occurrences in S of all the terms containing P as a substring, allowing k mismatches.
Example: P = bot, k = 2; the matching term is not = 1g 0g 0a.
[Figure: the compressed text C(S) for S = "bzip or not bzip" scanned with the k-mismatch Shift-And automaton; the occurrence of not is reported ("yes").]

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol into p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding
γ(x) = 0^(Length-1) followed by x in binary,  where x > 0 and Length = ⌊log2 x⌋ + 1
e.g., 9 is represented as <000, 1001>.

The γ-code for x takes 2⌊log2 x⌋ + 1 bits
(i.e., a factor of 2 from optimal)

Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of γ-coded integers, reconstruct the original sequence:

0001000001100110000011101100111

Answer: 8, 6, 3, 59, 7
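A minimal C++ sketch of γ-encoding/decoding over a string of '0'/'1' characters (an illustrative bit string rather than a real bitstream):

// Gamma-code x > 0 as (Length-1) zeros followed by x in binary; decode a whole bit string.
#include <iostream>
#include <string>
#include <vector>
using namespace std;

string gamma_encode(unsigned x) {                 // x > 0
    string bin;
    for (unsigned v = x; v > 0; v >>= 1) bin = char('0' + (v & 1)) + bin;
    return string(bin.size() - 1, '0') + bin;     // e.g. 9 -> "000" + "1001"
}

vector<unsigned> gamma_decode(const string& bits) {
    vector<unsigned> out;
    size_t i = 0;
    while (i < bits.size()) {
        size_t len = 0;
        while (bits[i + len] == '0') len++;       // count leading zeros
        unsigned x = 0;
        for (size_t j = 0; j <= len; j++) x = 2 * x + (bits[i + len + j] - '0');
        out.push_back(x);
        i += 2 * len + 1;
    }
    return out;
}

int main() {
    for (unsigned x : gamma_decode("0001000001100110000011101100111"))
        cout << x << " ";                         // the exercise above: 8 6 3 59 7
    cout << "\n" << gamma_encode(9) << "\n";      // 0001001
}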

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).

Recall that: |γ(i)| ≤ 2 log i + 1
How good is this approach w.r.t. Huffman?
Compression ratio ≤ 2 H0(s) + 1
Key fact:
1 ≥ Σi=1,...,x pi ≥ x · px  ⇒  x ≤ 1/px

How good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2 log i + 1
The cost of the encoding is (recall i ≤ 1/pi):
Σi=1,...,|S| pi · |γ(i)|  ≤  Σi=1,...,|S| pi · [2 log(1/pi) + 1]  =  2 H0(X) + 1
Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 128² = 16512 words on at most 2 bytes
A (230,26)-dense code encodes 230 + 230*26 = 6210 words on at most 2 bytes, hence more words on 1 byte; thus, if the distribution is skewed, it compresses better...
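A possible C++ sketch of (s,c)-dense encoding of a 0-based rank r, under the convention that stopper bytes take values 0..s-1 and continuer bytes take values s..255 (ETDC is the special case s = c = 128):

// (s,c)-dense coding: a run of continuer bytes followed by one stopper byte.
// There are s 1-byte codewords, s*c 2-byte codewords, s*c^2 3-byte codewords, ...
#include <iostream>
#include <vector>
using namespace std;

vector<int> sc_encode(long long r, int s, int c) {    // r = 0-based rank of the word
    int k = 1; long long first = 0, cnt = s;
    while (r >= first + cnt) { first += cnt; cnt *= c; k++; }   // find codeword length k
    long long off = r - first;                        // offset among the k-byte codewords
    vector<int> code(k);
    code[k - 1] = off % s;                            // last byte: stopper in [0, s)
    long long q = off / s;
    for (int i = k - 2; i >= 0; i--) { code[i] = s + q % c; q /= c; }   // continuer bytes
    return code;
}

int main() {
    int s = 230, c = 26;                              // the (230,26)-dense code above
    for (long long r : {0LL, 229LL, 230LL, 6209LL, 6210LL}) {
        cout << "rank " << r << " ->";
        for (int b : sc_encode(r, s, c)) cout << " " << b;
        cout << "\n";                                 // 1 byte up to 229, 2 bytes up to 6209
    }
}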

Optimal (s,c)-dense codes
Find the optimal s, assuming c = 256 - s.

Brute-force approach

Binary search:
  On real distributions, there seems to be a unique minimum

Ks = max codeword length
Fs,k = cumulative probability of the symbols whose |codeword| ≤ k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:
  It exploits temporal locality, and it is dynamic
  X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n² log n) bits, MTF = O(n log n) + n² bits
  Not much worse than Huffman
  ...but it may be far better
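A minimal C++ sketch of the MTF transform (the initial list and the input string are illustrative; positions are output 0-based):

// Move-to-Front: output the current position of each symbol, then move it to the front.
#include <iostream>
#include <string>
using namespace std;

int main() {
    string L = "abcdr";                        // initial symbol list (illustrative)
    string S = "abracadabra";
    for (char ch : S)                          // make sure every symbol of S is in the list
        if (L.find(ch) == string::npos) L += ch;
    for (char ch : S) {
        size_t pos = L.find(ch);               // 1) output the position of ch in L
        cout << pos << " ";
        L.erase(L.begin() + pos);              // 2) move ch to the front of L
        L.insert(L.begin(), ch);
    }
    cout << "\n";
}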

MTF: how good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2 log i + 1
Put the alphabet S in front of the encoding and consider the cost:

O(|S| log |S|) + Σx=1..|S| Σi=2..nx |γ( pi^x - p(i-1)^x )|

By Jensen's inequality:

≤ O(|S| log |S|) + Σx=1..|S| nx · [ 2 log(N/nx) + 1 ]
= O(|S| log |S|) + N · [ 2 H0(X) + 1 ]

Hence La[mtf] ≤ 2 H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to maintain the MTF-list efficiently:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings → just the run lengths and one bit
Properties:
  It exploits spatial locality, and it is a dynamic code
  X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)
  There is a memory
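A tiny C++ sketch matching the example above:

// Run-Length Encoding: abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
#include <iostream>
#include <string>
using namespace std;

int main() {
    string S = "abbbaacccca";
    for (size_t i = 0; i < S.size(); ) {
        size_t j = i;
        while (j < S.size() && S[j] == S[i]) j++;   // extend the current run
        cout << "(" << S[i] << "," << j - i << ") ";
        i = j;
    }
    cout << "\n";
}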

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g., with p(a) = .2, p(b) = .5, p(c) = .3:
f(a) = .0, f(b) = .2, f(c) = .7,   where f(i) = Σj=1..i-1 p(j)
[Figure: the unit interval [0,1) partitioned into a = [0,.2), b = [.2,.7), c = [.7,1)]

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
Start with [0,1). After b: [.2,.7). After a: [.2,.3). After c: [.27,.3).
[Figure: successive refinement of the interval]

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l_0 = 0      l_i = l_{i-1} + s_{i-1} * f[c_i]
s_0 = 1      s_i = s_{i-1} * p[c_i]

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is   s_n = Π i=1..n p[c_i]

The interval for a message sequence will be called the sequence interval
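A toy C++ sketch of the interval computation with doubles (real coders use the integer version discussed below; the probabilities are those of the running example):

// Sequence-interval computation: l_i = l_{i-1} + s_{i-1}*f[c_i], s_i = s_{i-1}*p[c_i].
#include <iostream>
#include <map>
#include <string>
using namespace std;

int main() {
    map<char, double> p = {{'a', .2}, {'b', .5}, {'c', .3}};
    map<char, double> f = {{'a', .0}, {'b', .2}, {'c', .7}};   // cumulative, c excluded
    string msg = "bac";
    double l = 0.0, s = 1.0;
    for (char c : msg) {
        l = l + s * f[c];
        s = s * p[c];
        cout << c << ": interval [" << l << ", " << l + s << ")\n";
    }
    // final interval for "bac" is [0.27, 0.3), as in the encoding example above
}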

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore, specifying any number in the final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:
.49 ∈ [.2,.7) ⇒ b;  within [.2,.7): .49 ∈ [.3,.55) ⇒ b;  within [.3,.55): .49 ∈ [.475,.55) ⇒ c.
[Figure: successive interval refinements]

The message is bbc.

Representing a real number
Binary fractional representation:
.75 = .11      1/3 = .0101...      11/16 = .1011

Algorithm (to emit the bits of x ∈ [0,1)):
1. x = 2 * x
2. If x < 1 output 0
3. else x = x - 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
e.g. [0,.33) → .01     [.33,.66) → .1     [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions.

          min       max       interval
  .11     .110...   .111...   [.75, 1.0)
  .101    .1010...  .1011...  [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval is contained in the sequence interval (a dyadic number).
[Figure: sequence interval [.61, .79) contains the code interval [.625, .75) of the codeword .101]

Can use l + s/2 truncated to 1 + ⌈log2(1/s)⌉ bits

Bound on Arithmetic length

Note that ⌈-log2 s⌉ + 1 = ⌈log2(2/s)⌉

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most
1 + ⌈log2(1/s)⌉ = 1 + ⌈log2 Πi (1/pi)⌉
≤ 2 + Σi=1..n log2(1/pi)
= 2 + Σk=1..|S| n pk log2(1/pk)
= 2 + n H0 bits

≈ nH0 + 0.02 n bits in practice, because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R = 2^k
Use rounding to generate integer intervals
Whenever the sequence interval falls into the top, bottom or middle half, expand the interval by a factor of 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
  Output 1 followed by m 0s
  m = 0
  Message interval is expanded by 2

All other cases: just continue...

If u < R/2 then (bottom half)
  Output 0 followed by m 1s
  m = 0
  Message interval is expanded by 2

If l ≥ R/4 and u < 3R/4 then (middle half)
  Increment m
  Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine: the ToolBox maps the current interval (L,s) and the next symbol c, drawn with distribution (p1,...,p|S|), into the new interval (L',s').
[Figure: ATB state machine, (L,s) → (L',s')]

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
[Figure: the ATB is driven by p[s | context], where s = c or esc; (L,s) → (L',s')]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
String = ACCBACCACBA B,   k = 2

Context: Empty    Counts: A = 4, B = 2, C = 5, $ = 3

Context: A        Counts: C = 3, $ = 1
Context: B        Counts: A = 2, $ = 1
Context: C        Counts: A = 1, B = 2, C = 2, $ = 3

Context: AC       Counts: B = 1, C = 2, $ = 2
Context: BA       Counts: C = 1, $ = 1
Context: CA       Counts: C = 1, $ = 1
Context: CB       Counts: A = 2, $ = 1
Context: CC       Counts: A = 1, B = 1, $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Input: a a b a a c a b c a b c b

Output   Dict.
(0,a)    1 = a
(1,b)    2 = ab
(1,a)    3 = aa
(0,c)    4 = c
(2,c)    5 = abc
(5,b)    6 = abcb

LZ78: Decoding Example

Input    Output so far                  Dict.
(0,a)    a                              1 = a
(1,b)    a a b                          2 = ab
(1,a)    a a b a a                      3 = aa
(0,c)    a a b a a c                    4 = c
(2,c)    a a b a a c a b c              5 = abc
(5,b)    a a b a a c a b c a b c b      6 = abcb
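A compact C++ sketch of the LZ78 coding loop (dictionary kept as a map from phrase to id; it assumes the last phrase of the input is new, as in the example above):

// LZ78 coder: emit (id of longest dictionary match, next char) and add match+char.
#include <iostream>
#include <map>
#include <string>
using namespace std;

int main() {
    string T = "aabaacabcabcb";
    map<string, int> dict;                // phrase -> id (id 0 = empty phrase)
    int next_id = 1;
    size_t i = 0;
    while (i < T.size()) {
        string S;                         // longest match found so far
        int id = 0;
        while (i + S.size() < T.size() && dict.count(S + T[i + S.size()])) {
            S += T[i + S.size()];
            id = dict[S];
        }
        char c = T[i + S.size()];         // first char after the match (assumed to exist)
        cout << "(" << id << "," << c << ") ";
        dict[S + c] = next_id++;          // add the new phrase Sc
        i += S.size() + 1;
    }
    cout << "\n";                         // (0,a)(1,b)(1,a)(0,c)(2,c)(5,b)
}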

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Input: a a b a a c a b a b a c b   (a = 112, b = 113, c = 114)

Output   Dict.
112      256=aa
112      257=ab
113      258=ba
256      259=aac
114      260=ca
257      261=aba
261      262=abac
114      263=cb

LZW: Decoding Example
Input: 112 112 113 256 114 257 261 114
Output: a  a  b  aa  c  ab  aba  c   →   a a b a a c a b a b a c ...
New dictionary entries (the decoder is one step behind the coder): 256=aa, 257=ab, 258=ba, 259=aac, 260=ca, 261=aba, ...
Note: code 261 arrives before the decoder has inserted it in its dictionary; it is resolved as the previous match (ab) plus its first character, i.e. aba.

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform (1994)
We are given a text T = mississippi#
Form all its cyclic rotations:
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi
Sort the rows lexicographically; F = first column, L = last column:
F               L
# mississipp   i
i #mississip   p
i ppi#missis   s
i ssippi#mis   s
i ssissippi#   m
m ississippi   #
p i#mississi   p
p pi#mississ   i
s ippi#missi   s
s issippi#mi   s
s sippi#miss   i
s sissippi#m   i
L = ipssm#pissii is the BWT of T.

A famous example

Much
longer...

A useful tool: the L → F mapping
[Figure: the sorted BWT matrix again, with the F and L columns shown and the text in between unknown]

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
[Figure: the sorted BWT matrix with the F and L columns (the text in between is unknown)]

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}

How to compute the BWT ?
The rows of the BWT matrix, in sorted order, correspond to the suffixes of T in lexicographic order, i.e. to the suffix array of T:
SA = 12 11 8 5 2 1 10 9 7 4 6 3
L  =  i  p s s m # p i s s i i
We said that: L[i] precedes F[i] in T, e.g. L[3] = T[7].
Given SA and T, we have L[i] = T[SA[i]-1]  (and L[i] = # when SA[i] = 1)
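A short C++ sketch computing L from SA via L[i] = T[SA[i]-1] (the suffix array is hard-coded here; any suffix-array construction algorithm would do):

// BWT from the suffix array: L[i] = T[SA[i]-1], and L[i] = '#' when SA[i] = 1.
#include <iostream>
#include <string>
#include <vector>
using namespace std;

int main() {
    string T = "mississippi#";
    vector<int> SA = {12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3};  // 1-based positions
    string L;
    for (int sa : SA)
        L += (sa == 1) ? T.back() : T[sa - 2];   // T[SA[i]-1] with 1-based indexing
    cout << L << "\n";                           // ipssm#pissii
}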

How to construct SA from T ?
SA      suffix
12      #
11      i#
8       ippi#
5       issippi#
2       ississippi#
1       mississippi#
10      pi#
9       ppi#
7       sippi#
4       sissippi#
6       ssippi#
3       ssissippi#
Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation: L is locally homogeneous → L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus γ(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion pages are available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node one can reach any other node via an undirected path.

Strongly connected components (SCC)

Set of nodes such that from any node one can reach any other node via a directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


The largest artifact ever conceived by humans

Exploit the structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…

Physical network graph
  V = Routers
  E = communication links

The "cosine" graph (undirected, weighted)
  V = static web pages
  E = semantic distance between pages

Query-Log graph (bipartite, weighted)
  V = queries and URLs
  E = (q,u) if u is a result for q, and has been clicked by some user who issued q

Social graph (undirected, unweighted)
  V = users
  E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution
Indegree follows a power law distribution (Altavista crawl 1999, WebBase crawl 2001):
Pr[ in-degree(u) = k ]  ≈  1/k^a,   a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 million pages, 150 million links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y's copy-list tells whether the corresponding successor of the reference x is also a successor of y;
The reference index is chosen in [0,W] as the one giving the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77-scheme provides an efficient, optimal solution

fknown is the "previously encoded text": compress the concatenation fknown·fnew, starting from fnew

zdelta is one of the best implementations
           Emacs size   Emacs time
uncompr    27Mb         ---
gzip       8Mb          35 secs
zdelta     1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

[Figure: Client ↔ Proxy exchange requests and delta-encoded pages over the slow link, w.r.t. a reference page held on both sides; the Proxy fetches the full page from the web over the fast link]

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: example weighted graph GF with zdelta edge weights, plus a dummy node 0 whose edge weights are the gzip sizes; the min branching picks the cheapest reference for each file]

           space   time
uncompr    30Mb    ---
tgz        20%     linear
THIS       8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n² edge calculations (zdelta executions)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
space

time

uncompr

260Mb

---

tgz

12%

2 mins

THIS

8%

16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

          gcc size   emacs size
total     27288      27326
gzip      7563       8577
zdelta    227        1431
rsync     964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends the hashes (unlike the client in rsync), and the client checks them.
The server deploys the common fref to compress the new ftar (rsync compresses just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
T# = mississippi#
[Figure: the suffix tree of T#; edges are labeled with substrings (e.g. i, s, p, si, ssi, ppi#, pi#, i#, mississippi#) and the 12 leaves carry the starting positions of the corresponding suffixes]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Storing SUF(T) explicitly takes Θ(N²) space; store only the suffix pointers:

SA      SUF(T)
12      #
11      i#
8       ippi#
5       issippi#
2       ississippi#
1       mississippi#
10      pi#
9       ppi#
7       sippi#
4       sissippi#
6       ssippi#
3       ssissippi#

T = mississippi#,   P = si

Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
→ In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison
SA = 12 11 8 5 2 1 10 9 7 4 6 3,   T = mississippi#,   P = si
[Figure: a binary-search step; here P is larger than the middle suffix; 2 memory accesses per step]

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison
SA = 12 11 8 5 2 1 10 9 7 4 6 3,   T = mississippi#,   P = si
[Figure: the next binary-search step; here P is smaller than the middle suffix]

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
→ overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
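A possible C++ sketch of the indirect binary search (the suffix array is built naively by sorting suffix positions, which is the Θ(N² log N) approach noted below; the search itself does O(p log N) character comparisons):

// Naive suffix-array construction plus O(p log N) substring search via binary search.
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>
using namespace std;

int main() {
    string T = "mississippi#", P = "si";
    int N = T.size();
    vector<int> SA(N);
    for (int i = 0; i < N; i++) SA[i] = i;
    sort(SA.begin(), SA.end(), [&](int a, int b) {          // compare suffixes directly
        return T.compare(a, string::npos, T, b, string::npos) < 0;
    });
    // binary search for the contiguous range of suffixes having P as a prefix
    auto cmp = [&](int pos, const string& pat) {
        return T.compare(pos, pat.size(), pat) < 0;
    };
    auto lo = lower_bound(SA.begin(), SA.end(), P, cmp);
    auto hi = lo;
    while (hi != SA.end() && T.compare(*hi, P.size(), P) == 0) ++hi;
    for (auto it = lo; it != hi; ++it)
        cout << "occurrence at position " << *it + 1 << "\n";   // 1-based positions
}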

Locating the occurrences
T = mississippi#,   P = si,   occ = 2
[Figure: the contiguous range of SA entries whose suffixes are prefixed by P, i.e. sippi# and sissippi#, giving the occurrences at positions 7 and 4]

Suffix Array search
• O(p + log2 N + occ) time

Suffix Trays: O(p + log2 |S| + occ)   [Cole et al., '06]
String B-tree   [Ferragina-Grossi, '95]
Self-adjusting Suffix Arrays   [Ciriani et al., '02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

SA  = 12 11 8 5 2 1 10 9 7 4 6 3
Lcp =   0  1 1 4 0 0  1 0 2 1 3      (e.g. the Lcp between issippi# and ississippi# is 4)
T = mississippi#

• How long is the common prefix between T[i,...] and T[j,...] ?
  • It is the min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  • Search for some Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  • Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.


Slide 203

or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with i-th bit of U(T[j]) to estabilish if both are true

An example j=1
1 2 3 4 5 6 7 8 9 10

n
m

1

12345

1

0

P=abaac

2

0

3

0

4

0

5

0

T=xabxabaaca

0
 
0
U (x)  0
 
0
 
0

2

3

4

5

6

7

1 
 
0
BitShift ( M ( 0 )) & U (T [1])   0  &
 
0
 
0

8

9

0 0
   
0 0
0  0
   
0 0
   
0 0

1
0

An example j=2
1 2 3 4 5 6 7 8 9 10

n
m

1

2

12345

1

0

1

P=abaac

2

0

0

3

0

0

4

0

0

5

0

0

T=xabxabaaca

1 
 
0
U (a )   1 
 
1 
 
0

3

4

5

6

7

1 
 
0
BitShift ( M (1)) & U (T [ 2 ])   0  &
 
0
 
0

8

9

1  1 
   
0 0
1    0 
   
1   0 
   
0 0

1
0

An example j=3
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

12345

1

0

1

0

P=abaac

2

0

0

1

3

0

0

0

4

0

0

0

5

0

0

0

T=xabxabaaca

0
 
1 
U (b )   0 
 
0
 
0

4

5

6

7

1 
 
1 
BitShift ( M ( 2 )) & U (T [ 3 ])   0  &
 
0
 
0

8

9

0 0
   
1  1 
0  0
   
0 0
   
0 0

1
0

An example j=9
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

4

5

6

7

8

9

12345

1

0

1

0

0

1

0

1

1

0

P=abaac

2

0

0

1

0

0

1

0

0

0

3

0

0

0

0

0

0

1

0

0

4

0

0

0

0

0

0

0

1

0

5

0

0

0

0

0

0

0

0

1

T=xabxabaaca

0
 
0
U (c )   0 
 
0
 
1 

1 
 
1 
BitShift ( M ( 8 )) & U (T [ 9 ])   0  &
 
0
 
1 

0 0
   
0 0
0  0
   
0 0
   
1  1 

1
0

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.
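As a concrete illustration, here is a minimal Python sketch of the Shift-And scan, assuming m <= w so that a column of M fits in one machine word (a Python integer plays the role of the w-bit word; bit i-1 of the word stands for row i of M; the function and variable names are ours, not part of the original method):

def shift_and(T, P):
    # U[x]: bit i-1 is set iff P[i] == x
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    occs = []
    M = 0                                   # column M(0) = all zeros
    for j, c in enumerate(T):
        # M(j) = BitShift(M(j-1)) & U(T[j]); BitShift = shift + set first bit to 1
        M = ((M << 1) | 1) & U.get(c, 0)
        if M & (1 << (m - 1)):              # M(m,j) = 1: an occurrence ends at j
            occs.append(j - m + 2)          # 1-based starting position
    return occs

print(shift_and("xabxabaaca", "abaac"))     # -> [5]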

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac

U(a) = (1,0,1,1,0)      U(b) = (1,1,0,0,0)      U(c) = (0,0,0,0,1)

What about ‘?’, ‘[^…]’ (not)?

Problem 1: Another solution
Dictionary = { bzip, not, or }, plus space; S = “bzip or not bzip”, stored as its word-based compressed encoding C(S).
P = bzip is encoded by its codeword 1a 0b, and the search is performed by scanning C(S) directly against that codeword.

Speed ≈ Compression ratio

Problem 2
Dictionary = { bzip, not, or }, plus space; S = “bzip or not bzip”.
Given a pattern P, find all the occurrences in S of all terms containing P as a substring.
Example: P = o. The matching terms are not = 1g 0g 0a and or = 1g 0a 0b.

Speed ≈ Compression ratio? No! Why?
A scan of C(S) is needed for each term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T:  A B C A C A B D D A B A     (occurrences of P1 and P2 are marked inside T)
 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

 S is the concatenation of the patterns in P
 R is a bitmap of length m:
   R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of the Shift-And method searching for S:
 For any symbol c, U’(c) = U(c) AND R
   U’(c)[i] = 1 iff S[i] = c and it is the first symbol of a pattern
 For any step j,
   compute M(j)
   then M(j) OR U’(T[j]). Why?
   This sets to 1 the first bit of each pattern that starts with T[j]
 Check if there are occurrences ending in j. How?
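A possible Python sketch of this multi-pattern variant (the helper names and the way occurrences are reported are our own choices; occurrences end where a bit sits on the last position of some pattern):

def multi_shift_and(T, patterns):
    S = "".join(patterns)                      # concatenation of the patterns
    U, R, ends, pos = {}, 0, {}, 0
    for k, P in enumerate(patterns):
        R |= 1 << pos                          # R marks the first symbol of each pattern
        ends[pos + len(P) - 1] = k             # bit of each pattern's last symbol
        pos += len(P)
    for i, c in enumerate(S):
        U[c] = U.get(c, 0) | (1 << i)
    Uprime = {c: U[c] & R for c in U}          # U'(c) = U(c) AND R
    occs, M = [], 0
    for j, c in enumerate(T):
        M = (((M << 1) | 1) & U.get(c, 0)) | Uprime.get(c, 0)
        for e, k in ends.items():
            if M & (1 << e):                   # pattern k ends at text position j
                occs.append((k, j - len(patterns[k]) + 2))   # 1-based start
    return occs

print(multi_shift_and("abcacabddaba", ["ca", "ab"]))
# -> [(1, 1), (0, 3), (0, 5), (1, 6), (1, 10)]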

Problem 3
Dictionary = { bzip, not, or }, plus space; S = “bzip or not bzip”; P = bot, k = 2.
Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing at most k mismatches.

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal: given k, find all the occurrences of P
in T with up to k mismatches.
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
there are no more than l mismatches between the
first i characters of P and the i characters of T
ending at character j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

(P aligned with T ending at j; the first i-1 characters already match with at most l mismatches, and P[i] = T[j].)

Contribution:  BitShift( Ml(j-1) ) & U(T[j])

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

(P aligned with T ending at j; the first i-1 characters match T up to j-1 with at most l-1 mismatches, and position i absorbs the extra mismatch.)

Contribution:  BitShift( Ml-1(j-1) )

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1
T = xabxabaaca, P = abaad, k = 1

M1 =    T:  x  a  b  x  a  b  a  a  c  a
        j:  1  2  3  4  5  6  7  8  9  10
  a  (i=1)  1  1  1  1  1  1  1  1  1  1
  b  (i=2)  0  0  1  0  0  1  0  1  1  0
  a  (i=3)  0  0  0  1  0  0  1  0  0  1
  a  (i=4)  0  0  0  0  1  0  0  1  0  0
  d  (i=5)  0  0  0  0  0  0  0  0  1  0

M0 =    T:  x  a  b  x  a  b  a  a  c  a
        j:  1  2  3  4  5  6  7  8  9  10
  a  (i=1)  0  1  0  0  1  0  1  1  0  1
  b  (i=2)  0  0  1  0  0  1  0  0  0  0
  a  (i=3)  0  0  0  0  0  0  1  0  0  0
  a  (i=4)  0  0  0  0  0  0  0  1  0  0
  d  (i=5)  0  0  0  0  0  0  0  0  0  0

How much do we pay?





The running time is O(k n (1 + m/w)).
Again, the method is practically efficient for
small m.
Only O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.
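A small Python sketch of the k-mismatch recurrence above (our own naming; only the k+1 current columns are kept, as noted):

def agrep_mismatch(T, P, k):
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    occs = []
    M = [0] * (k + 1)                        # M[l] holds column Ml(j-1)
    for j, c in enumerate(T):
        newM = []
        prev_shift = 0                       # BitShift(Ml-1(j-1)); empty for l = 0
        for l in range(k + 1):
            shifted = (M[l] << 1) | 1        # BitShift(Ml(j-1))
            newM.append((shifted & U.get(c, 0)) | prev_shift)
            prev_shift = shifted             # reused at level l+1 as BitShift of the old Ml
        M = newM
        if M[k] & (1 << (m - 1)):            # occurrence with <= k mismatches ends at j
            occs.append(j - m + 2)           # 1-based starting position
    return occs

print(agrep_mismatch("xabxabaaca", "abaad", 1))   # -> [5]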

Problem 3: Solution
Dictionary = { bzip, not, or }, plus space; S = “bzip or not bzip”; P = bot, k = 2.
Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing k mismatches.
E.g. the term not = 1g 0g 0a matches bot within the allowed mismatches.

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding
γ(x) = 0^(Length-1) followed by x in binary, where x > 0 and Length = ⌊log2 x⌋ + 1.
e.g., 9 is represented as <000, 1001>.

The γ-code for x takes 2⌊log2 x⌋ + 1 bits
(i.e., a factor of 2 from optimal).

Optimal for Pr(x) = 1/(2x²), and i.i.d. integers.

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111   →   8, 6, 3, 59, 7
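For concreteness, a tiny Python sketch of γ-encoding/decoding (our own helper names), which reproduces the exercise above:

def gamma_encode(x):
    # gamma(x) = (Length-1) zeros followed by the binary representation of x
    assert x > 0
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        L = 0
        while bits[i] == "0":            # count the leading zeros
            L += 1
            i += 1
        out.append(int(bits[i:i + L + 1], 2))
        i += L + 1
    return out

print(gamma_encode(9))                                   # 0001001
print(gamma_decode("0001000001100110000011101100111"))   # [8, 6, 3, 59, 7]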

Analysis
Sort the pi in decreasing order, and encode si via
the variable-length code γ(i).

Recall that |γ(i)| ≤ 2·log i + 1.
How good is this approach w.r.t. Huffman?
Compression ratio ≤ 2·H0(s) + 1.
Key fact:
1 ≥ Σ_{i=1,…,x} pi ≥ x·px   ⟹   x ≤ 1/px

How good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1.
The cost of the encoding is (recall i ≤ 1/pi):

   Σ_{i=1,…,|S|} pi · |γ(i)|  ≤  Σ_{i=1,…,|S|} pi · [ 2·log(1/pi) + 1 ]  =  2·H0(X) + 1

Not much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 128² = 16512 words within 2 bytes.
A (230,26)-dense code encodes 230 + 230·26 = 6210 words within 2
bytes, hence more words on 1 byte — and thus it wins if the distribution is skewed...
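A quick way to check these counts is the following one-line sketch (the helper name is ours):

def words_within(num_bytes, s, c):
    # words encodable with at most num_bytes bytes by an (s,c)-dense code:
    # s on 1 byte, s*c on 2 bytes, s*c^2 on 3 bytes, ...
    return sum(s * c**k for k in range(num_bytes))

print(words_within(2, 128, 128))   # ETDC: 128 + 128*128 = 16512
print(words_within(2, 230, 26))    # (230,26)-DC: 230 + 230*26 = 6210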

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC is quite interesting in practice…
search is 6% faster than byte-aligned Huffword.

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory.
Properties:
 It exploits temporal locality, and it is a dynamic code.
 X = 1^n 2^n 3^n … n^n  ⟹  Huff = O(n² log n), MTF = O(n log n) + n²
Not much worse than Huffman
...but it may be far better
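A minimal Python sketch of the MTF encoder described above (0-based positions; names are ours):

def mtf_encode(text, alphabet):
    L = list(alphabet)                 # start with the list of symbols L
    out = []
    for s in text:
        i = L.index(s)                 # 1) output the position of s in L
        out.append(i)
        L.pop(i)
        L.insert(0, s)                 # 2) move s to the front of L
    return out

print(mtf_encode("abcddcbaaa", "abcd"))   # -> [0, 1, 2, 3, 0, 1, 2, 3, 0, 0]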

MTF: how good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1.
Put the alphabet S in front and consider the cost of encoding
(p_x^i is the position of the i-th occurrence of symbol x, n_x its number of occurrences):

   O(|S| log |S|)  +  Σ_{x=1,…,|S|}  Σ_{i=2,…,n_x}  |γ( p_x^i - p_x^{i-1} )|

By Jensen’s inequality:

   ≤  O(|S| log |S|)  +  Σ_{x=1,…,|S|}  n_x · [ 2·log(N/n_x) + 1 ]
   =  O(|S| log |S|)  +  N · [ 2·H0(X) + 1 ]

Hence  La[mtf] ≤ 2·H0(X) + O(1).

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code. There is a memory.
X = 1^n 2^n 3^n … n^n  ⟹  Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)
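An equally small RLE sketch reproducing the example above (names are ours):

def rle(s):
    # abbbaacccca -> [('a',1), ('b',3), ('a',2), ('c',4), ('a',1)]
    runs = []
    for c in s:
        if runs and runs[-1][0] == c:
            runs[-1][1] += 1           # extend the current run
        else:
            runs.append([c, 1])        # start a new run
    return [tuple(r) for r in runs]

print(rle("abbbaacccca"))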

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g. p(a) = .2, p(b) = .5, p(c) = .3:

   f(i) = Σ_{j < i} p(j)   ⟹   f(a) = .0,  f(b) = .2,  f(c) = .7

   a → [0, .2)     b → [.2, .7)     c → [.7, 1.0)

The interval for a particular symbol will be called
the symbol interval (e.g. for b it is [.2,.7)).

Arithmetic Coding: Encoding Example
Coding the message sequence: bac   (p(a) = .2, p(b) = .5, p(c) = .3)

   start:    [0, 1)
   after b:  [.2, .7)
   after a:  [.2, .3)
   after c:  [.27, .3)

The final sequence interval is [.27, .3).

Arithmetic Coding
To code a sequence of symbols c1…cn with probabilities p[c], use the following:

   l_0 = 0        l_i = l_{i-1} + s_{i-1} · f[c_i]
   s_0 = 1        s_i = s_{i-1} · p[c_i]

f[c] is the cumulative probability up to symbol c (not included).
The final interval size is  s_n = Π_{i=1,…,n} p[c_i].

The interval [l_n, l_n + s_n) for a message sequence will be called the
sequence interval.
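A sketch of the interval computation defined by these formulas, checked on the bac example (names are ours):

def sequence_interval(msg, p, f):
    # l_i = l_{i-1} + s_{i-1} * f[c_i];  s_i = s_{i-1} * p[c_i]
    l, s = 0.0, 1.0
    for c in msg:
        l, s = l + s * f[c], s * p[c]
    return l, l + s                      # the sequence interval [l, l+s)

p = {'a': .2, 'b': .5, 'c': .3}
f = {'a': .0, 'b': .2, 'c': .7}
print(sequence_interval("bac", p, f))    # approx. (0.27, 0.30), i.e. [.27, .3)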

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:

   .49 ∈ [.2, .7)    ⟹  b,  then split [.2, .7) among the symbols
   .49 ∈ [.3, .55)   ⟹  b,  then split [.3, .55)
   .49 ∈ [.475, .55) ⟹  c

The message is bbc.

Representing a real number
Binary fractional representation:
   .75 = .11        1/3 = .0101…        11/16 = .1011

Algorithm:
1. x = 2 * x
2. If x < 1, output 0
3. else x = x - 1; output 1

So how about just using the shortest binary fractional representation
in the sequence interval?
e.g.  [0, .33) → .01     [.33, .66) → .1     [.66, 1) → .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions.

   code    min      max       interval
   .11     .110     .111…     [.75, 1.0)
   .101    .1010    .1011…    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (a dyadic number).

   Sequence interval: [.61, .79)
   Code interval of .101: [.625, .75)  ⊆  [.61, .79)

Can use L + s/2 truncated to 1 + ⌈log (1/s)⌉ bits.

Bound on Arithmetic length

Note that -log s + 1 = log (2/s).

Theorem (Bound on Length): For a text of length n, the
Arithmetic encoder generates at most

   1 + ⌈log (1/s)⌉ = 1 + ⌈log Π_{i=1,…,n} (1/p_i)⌉
                   ≤ 2 + Σ_{i=1,…,n} log (1/p_i)
                   = 2 + Σ_{k=1,…,|S|} n·p_k · log (1/p_k)
                   = 2 + n·H0   bits

In practice ≈ nH0 + 0.02·n bits, because of rounding.

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 (top half):
   Output 1 followed by m 0s; m = 0; the message interval is expanded by 2.
If u < R/2 (bottom half):
   Output 0 followed by m 1s; m = 0; the message interval is expanded by 2.
If l ≥ R/4 and u < 3R/4 (middle half):
   Increment m; the message interval is expanded by 2.
All other cases: just continue...

You find this at

Arithmetic ToolBox
As a state machine: given the current interval (L, s), the next symbol c and its
distribution (p1, …, p|S|), the ATB produces the new interval (L', s').

Therefore, even the distribution can change over time.

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.
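As a toy illustration (not the full PPM coder), the following sketch builds the context-count tables for the string used in the example below, assuming the escape '$' receives one count per distinct symbol seen in a context (names are ours):

from collections import defaultdict

def ppm_counts(s, k):
    # counts[context][x] = how many times symbol x followed `context` in s
    counts = defaultdict(lambda: defaultdict(int))
    for j in range(len(s)):
        for order in range(0, k + 1):
            if j - order >= 0:
                counts[s[j - order:j]][s[j]] += 1
    for ctx in counts:                      # escape count = #distinct symbols seen
        counts[ctx]['$'] = len(counts[ctx])
    return counts

c = ppm_counts("ACCBACCACBA", 2)
print(dict(c['']))      # {'A': 4, 'C': 5, 'B': 2, '$': 3}
print(dict(c['AC']))    # {'C': 2, 'B': 1, '$': 2}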

PPM + Arithmetic ToolBox
At each step the ATB is driven with p[ s | context ], where the coded symbol s is
either a plain character c or the escape esc; it maps (L, s) to the new interval (L', s').

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant).

PPM: Example Contexts      String = ACCBACCACBA B,   k = 2

Context Empty:   A = 4   B = 2   C = 5   $ = 3

Context A:   C = 3   $ = 1
Context B:   A = 2   $ = 1
Context C:   A = 1   B = 2   C = 2   $ = 3

Context AC:  B = 1   C = 2   $ = 2
Context BA:  C = 1   $ = 1
Context CA:  C = 1   $ = 1
Context CB:  A = 2   $ = 1
Context CC:  A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character
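A didactic (quadratic-time) sketch of the windowed LZ77 parser; real implementations find the longest match via hashing, but this reproduces the example above (names are ours):

def lz77(text, W):
    # returns triples (d, len, c): copy `len` chars from distance d back, then emit c
    out, i = [], 0
    while i < len(text):
        best_d, best_len = 0, 0
        for j in range(max(0, i - W), i):        # candidate copy starts inside the window
            l = 0
            while i + l < len(text) - 1 and text[j + l] == text[i + l]:
                l += 1                           # the copy may overlap the part being encoded
            if l > best_len:
                best_d, best_len = i - j, l
        out.append((best_d, best_len, text[i + best_len]))
        i += best_len + 1                        # advance by len + 1
    return out

print(lz77("aacaacabcabaaac", 6))
# -> [(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')]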

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb
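A compact Python sketch of the LZ78 coding loop, reproducing the coding example above (names are ours):

def lz78_encode(text):
    d, out, w = {"": 0}, [], ""
    for c in text:
        if w + c in d:
            w += c                          # extend the current match S
        else:
            out.append((d[w], c))           # output (id of S, next char c)
            d[w + c] = len(d)               # add Sc to the dictionary
            w = ""
    if w:                                   # input ended in the middle of a phrase
        out.append((d[w[:-1]], w[-1]))
    return out

print(lz78_encode("aabaacabcabcb"))
# -> [(0,'a'), (1,'b'), (1,'a'), (0,'c'), (2,'c'), (5,'b')]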

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
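A sketch of LZW with the one-step-behind decoder, including the special case just mentioned (names are ours; the dictionary is seeded with the 256 byte values rather than the slide's numbering):

def lzw_encode(text):
    d = {chr(i): i for i in range(256)}     # dictionary initialized with single chars
    out, w = [], ""
    for c in text:
        if w + c in d:
            w += c
        else:
            out.append(d[w])                # emit the id of S only (no extra char)
            d[w + c] = len(d)               # ...but still add Sc to the dictionary
            w = c
    if w:
        out.append(d[w])
    return out

def lzw_decode(codes):
    d = {i: chr(i) for i in range(256)}
    out = prev = d[codes[0]]
    for code in codes[1:]:
        # special case: the code may refer to the entry the decoder has not built yet
        cur = d[code] if code in d else prev + prev[0]
        d[len(d)] = prev + cur[0]           # the decoder is one step behind the coder
        out += cur
        prev = cur
    return out

s = "abababab"                              # triggers the special case
print(lzw_decode(lzw_encode(s)) == s)       # True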

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows                                  (1994)

   F                   L
   #  mississipp   i
   i  #mississip   p
   i  ppi#missis   s
   i  ssippi#mis   s
   i  ssissippi#   m
   m  ississippi   #
   p  i#mississi   p
   p  pi#mississ   i
   s  ippi#missi   s
   s  issippi#mi   s
   s  sippi#miss   i
   s  sissippi#m   i

L is the last column of the sorted matrix of the rotations of T.

A famous example

Much
longer...

A useful tool: the L → F mapping
(the text T is unknown to the decoder; only L is given)

How do we map L’s chars onto F’s chars?
... We need to distinguish equal chars in F...

Take two equal chars of L and rotate their rows rightward by one position:
they appear in F in the same relative order !!

The BWT is invertible
(Same sorted matrix as above:  F = # i i i i m p p s s s s,   L = i p s s m # p i s s i i.)

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}

How to compute the BWT ?

   SA    BWT matrix        L
   12    #mississippi      i
   11    i#mississipp      p
    8    ippi#mississ      s
    5    issippi#miss      s
    2    ississippi#m      m
    1    mississippi#      #
   10    pi#mississip      p
    9    ppi#mississi      i
    7    sippi#missis      s
    4    sissippi#mis      s
    6    ssippi#missi      i
    3    ssissippi#mi      i

We said that: L[i] precedes F[i] in T.    E.g. L[3] = T[7].
Given SA and T, we have L[i] = T[SA[i]-1]
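A compact Python sketch tying the pieces together: L is built by sorting the rotations (inefficient, as discussed next), and inverted via the LF-mapping; names are ours, and the text is assumed to end with a unique smallest char '#':

def bwt(T):
    n = len(T)
    starts = sorted(range(n), key=lambda i: T[i:] + T[:i])    # sorted rotations
    return "".join(T[(i - 1) % n] for i in starts)            # L[i] = T[SA[i]-1]

def ibwt(L):
    n = len(L)
    order = sorted(range(n), key=lambda i: (L[i], i))   # stable sort: same relative order
    LF = [0] * n
    for f, l in enumerate(order):
        LF[l] = f                                       # LF maps L's chars onto F's chars
    out, r = [], L.index('#')                           # row whose rotation is T itself
    for _ in range(n):
        out.append(L[r])                                # L[r] precedes F[r] in T
        r = LF[r]
    return "".join(reversed(out))

print(bwt("mississippi#"))               # ipssm#pissii
print(ibwt(bwt("mississippi#")))         # mississippi#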

How to construct SA from T ?
   SA
   12   #
   11   i#
    8   ippi#
    5   issippi#
    2   ississippi#
    1   mississippi#
   10   pi#
    9   ppi#
    7   sippi#
    4   sissippi#
    6   ssippi#
    3   ssissippi#

Elegant but inefficient.
Input: T = mississippi#
Obvious inefficiencies:
• Θ(n² log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pr that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution

Altavista crawl (1999) and WebBase crawl (2001): the indegree follows a power-law distribution

   Pr[ in-degree(u) = k ]  ∝  1/k^α ,    α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pr that a node has x links is 1/x^α, α ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
(Figure: adjacency matrix of a crawl with 21 million pages and 150 million links, under URL-sorting; the Berkeley and Stanford hosts are highlighted.)

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77-scheme provides an efficient, optimal solution:
fknown is the “previously encoded text”; compress the concatenation fknown·fnew starting from fnew.

zdelta is one of the best implementations
            Emacs size    Emacs time
uncompr     27Mb          ---
gzip        8Mb           35 secs
zdelta      1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

(Figure: weighted graph GF over the files plus the dummy node 0; edge weights are zdelta sizes, dummy edges are gzip sizes; the min branching selects the cheapest reference for each file.)

            space    time
uncompr     30Mb     ---
tgz         20%      linear
THIS        8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: estimate appropriate edge weights for G’F, thus saving
zdelta executions. Nonetheless, strictly n² time.

            space    time
uncompr     260Mb    ---
tgz         12%      2 mins
THIS        8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

            gcc size    emacs size
total       27288       27326
gzip        7563        8577
zdelta      227         1431
rsync       964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends hashes (unlike the client in rsync), and the client checks them.
The server deploys the common fref to compress the new ftar (rsync compresses just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits.

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
(Figure: the compacted trie of all the suffixes of T# = mississippi#; edges carry substring labels such as #, i, i#, p, pi#, ppi#, s, si, ssi, ssippi#, mississippi#, and each leaf stores the starting position, 1..12, of its suffix.)

T# = mississippi#   (positions 1..12)

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

SUF(T) would take Θ(N²) space; the suffix array keeps only the suffix pointers.

   SA    SUF(T)
   12    #
   11    i#
    8    ippi#
    5    issippi#
    2    ississippi#
    1    mississippi#
   10    pi#
    9    ppi#
    7    sippi#
    4    sissippi#
    6    ssippi#
    3    ssissippi#

T = mississippi#      P = si

Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
⟹ In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step.
T = mississippi#, P = si: compare P against the suffix starting at SA[mid];
if P is larger, recurse on the right half; if P is smaller, recurse on the left half.

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
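A Python sketch of this indirect binary search over SA (the construction shown is the didactic O(n² log n) one; names are ours):

def sa_build(T):
    return sorted(range(len(T)), key=lambda i: T[i:])

def sa_search(T, SA, P):
    lo, hi = 0, len(SA)
    while lo < hi:                                   # leftmost suffix >= P
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + len(P)] < P:          # O(p) chars compared per step
            lo = mid + 1
        else:
            hi = mid
    res = []
    while lo < len(SA) and T[SA[lo]:SA[lo] + len(P)] == P:
        res.append(SA[lo] + 1)                       # 1-based starting positions
        lo += 1
    return res

T = "mississippi#"
print(sorted(sa_search(T, sa_build(T), "si")))       # -> [4, 7]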

Locating the occurrences
T = mississippi#, P = si: the suffixes prefixed by si (between si# and si$) occupy a
contiguous range of SA, containing the entries 7 (sippi…) and 4 (sissippi…), hence occ = 2.

Suffix Array search
• O(p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA.

For T = mississippi#:  SA = 12 11 8 5 2 1 10 9 7 4 6 3,  Lcp = 0 0 1 4 0 0 1 0 2 1 3
(e.g., the adjacent suffixes issippi# and ississippi# share a prefix of length 4).

• How long is the common prefix between T[i,...] and T[j,...] ?
  It is the min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Is there a repeated substring of length ≥ L ?
  Search for some Lcp[i] ≥ L.
• Is there a substring of length ≥ L occurring ≥ C times ?
  Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.


Slide 204

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with i-th bit of U(T[j]) to estabilish if both are true

An example j=1
1 2 3 4 5 6 7 8 9 10

n
m

1

12345

1

0

P=abaac

2

0

3

0

4

0

5

0

T=xabxabaaca

0
 
0
U (x)  0
 
0
 
0

2

3

4

5

6

7

1 
 
0
BitShift ( M ( 0 )) & U (T [1])   0  &
 
0
 
0

8

9

0 0
   
0 0
0  0
   
0 0
   
0 0

1
0

An example j=2
1 2 3 4 5 6 7 8 9 10

n
m

1

2

12345

1

0

1

P=abaac

2

0

0

3

0

0

4

0

0

5

0

0

T=xabxabaaca

1 
 
0
U (a )   1 
 
1 
 
0

3

4

5

6

7

1 
 
0
BitShift ( M (1)) & U (T [ 2 ])   0  &
 
0
 
0

8

9

1  1 
   
0 0
1    0 
   
1   0 
   
0 0

1
0

An example j=3
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

12345

1

0

1

0

P=abaac

2

0

0

1

3

0

0

0

4

0

0

0

5

0

0

0

T=xabxabaaca

0
 
1 
U (b )   0 
 
0
 
0

4

5

6

7

1 
 
1 
BitShift ( M ( 2 )) & U (T [ 3 ])   0  &
 
0
 
0

8

9

0 0
   
1  1 
0  0
   
0 0
   
0 0

1
0

An example j=9
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

4

5

6

7

8

9

12345

1

0

1

0

0

1

0

1

1

0

P=abaac

2

0

0

1

0

0

1

0

0

0

3

0

0

0

0

0

0

1

0

0

4

0

0

0

0

0

0

0

1

0

5

0

0

0

0

0

0

0

0

1

T=xabxabaaca

0
 
0
U (c )   0 
 
0
 
1 

1 
 
1 
BitShift ( M ( 8 )) & U (T [ 9 ])   0  &
 
0
 
1 

0 0
   
0 0
0  0
   
0 0
   
1  1 

1
0

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary = {bzip, not, or}. Given a pattern P, find all the occurrences in S of all terms containing P as a substring.
P = o,  S = "bzip or not bzip"
[Figure: the compressed text C(S); the terms containing P are "not" = 1g 0g 0a and "or" = 1g 0a 0b, and their occurrences are marked.]

Speed ≈ Compression ratio? No! Why?
A scan of C(S) is needed for each term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: a text T with the occurrences of P1 and P2 marked.]

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

S is the concatenation of the patterns in P.
R is a bitmap of length m, with R[i] = 1 iff S[i] is the first symbol of a pattern.
Use a variant of the Shift-And method searching for S:
- For any symbol c, U'(c) = U(c) AND R
  - U'(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
- For any step j,
  - compute M(j)
  - then OR it with U'(T[j]). Why? To set to 1 the first bit of each pattern that starts with T[j]
  - Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position 4; it also occurs with 4 mismatches starting at position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal: given k, find all the occurrences of P in T with up to k mismatches.
We define the matrix Ml to be an m by n binary matrix, such that:

Ml(i,j) = 1 iff
there are no more than l mismatches when the first i characters of P are aligned with the i characters of T ending at character j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

[Alignment figure: the first i-1 characters of P matched against T up to position j-1, and P[i] aligned with T[j].]

Contribution:  BitShift( M^l(j-1) ) & U(T[j])

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

[Alignment figure: the first i-1 characters of P matched against T up to position j-1, with one extra mismatch allowed on P[i].]

Contribution:  BitShift( M^(l-1)(j-1) )

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))
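A hedged C sketch of this k-mismatch recurrence, again assuming m ≤ w so that every column M^l(j) is one machine word; KMAX and the function name agrep_mismatch are my own choices.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define KMAX 8   /* assumption: at most KMAX mismatches are allowed */

/* Reports positions where P occurs in T with at most k mismatches (m <= 64). */
void agrep_mismatch(const char *P, const char *T, int k) {
    size_t m = strlen(P), n = strlen(T);
    uint64_t U[256] = {0};
    for (size_t i = 0; i < m; i++)
        U[(unsigned char)P[i]] |= 1ULL << i;

    uint64_t M[KMAX + 1] = {0};        /* M[l] holds column M^l(j-1) */
    uint64_t last = 1ULL << (m - 1);
    for (size_t j = 0; j < n; j++) {
        uint64_t prev_below = 0;       /* M^(l-1)(j-1): zero for l = 0 */
        for (int l = 0; l <= k; l++) {
            uint64_t old = M[l];
            /* M^l(j) = [BitShift(M^l(j-1)) & U(T[j])] OR BitShift(M^(l-1)(j-1)) */
            M[l] = (((old << 1) | 1ULL) & U[(unsigned char)T[j]])
                 |  ((prev_below << 1) | (l > 0 ? 1ULL : 0));
            prev_below = old;
        }
        if (M[k] & last)
            printf("match with <= %d mismatches ending at %zu\n", k, j + 1);
    }
}

int main(void) {
    agrep_mismatch("abaad", "xabxabaaca", 1);  /* ends at position 9, as in Example M1 */
    return 0;
}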

Example M1
T = xabxabaaca (j = 1..10), P = abaad

M1 =
      j:  1 2 3 4 5 6 7 8 9 10
  i=1     1 1 1 1 1 1 1 1 1 1
  i=2     0 0 1 0 0 1 0 1 1 0
  i=3     0 0 0 1 0 0 1 0 0 1
  i=4     0 0 0 0 1 0 0 1 0 0
  i=5     0 0 0 0 0 0 0 0 1 0

M0 =
      j:  1 2 3 4 5 6 7 8 9 10
  i=1     0 1 0 0 1 0 1 1 0 1
  i=2     0 0 1 0 0 1 0 0 0 0
  i=3     0 0 0 0 0 0 1 0 0 0
  i=4     0 0 0 0 0 0 0 1 0 0
  i=5     0 0 0 0 0 0 0 0 0 0

M1(5,9) = 1: P occurs ending at position 9 of T with at most 1 mismatch.

How much do we pay?





The running time is O(kn(1 + m/w)).
Again, the method is practically efficient for small m.
Only O(k) columns of M are needed at any given time; hence the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary = {bzip, not, or}. Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing k mismatches.
P = bot, k = 2,  S = "bzip or not bzip"
[Figure: the compressed text C(S); the term within distance k of P is "not" = 1g 0g 0a, and its occurrences are marked.]

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding

γ(x) = 0^(Length-1) followed by the binary representation of x,
where x > 0 and Length = ⌊log2 x⌋ + 1.
e.g., 9 is represented as <000, 1001>.

The γ-code for x takes 2⌊log2 x⌋ + 1 bits (i.e., a factor of 2 from optimal).

Optimal for Pr(x) = 1/(2x²), and i.i.d. integers.

It is a prefix-free encoding…


Given the following sequence of γ-coded integers, reconstruct the original sequence:

0001000001100110000011101100111

Answer: 8, 6, 3, 59, 7
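As a sanity check, a small self-contained C sketch (mine) that γ-encodes and γ-decodes a sequence of integers, reproducing the exercise above; bits are written as '0'/'1' characters for readability.

#include <stdio.h>
#include <string.h>

/* Append the gamma-code of x > 0 to out: (Length-1) zeros, then x in binary. */
void gamma_encode(unsigned x, char *out) {
    int len = 0;
    for (unsigned t = x; t; t >>= 1) len++;          /* Length = floor(log2 x)+1 */
    for (int i = 0; i < len - 1; i++) strcat(out, "0");
    for (int i = len - 1; i >= 0; i--)
        strcat(out, (x >> i) & 1 ? "1" : "0");
}

/* Decode a sequence of gamma-codes and print the integers. */
void gamma_decode(const char *bits) {
    size_t i = 0, n = strlen(bits);
    while (i < n) {
        int len = 1;
        while (bits[i] == '0') { len++; i++; }        /* count the leading zeros */
        unsigned x = 0;
        for (int k = 0; k < len; k++) x = (x << 1) | (unsigned)(bits[i++] - '0');
        printf("%u ", x);
    }
    printf("\n");
}

int main(void) {
    char buf[256] = "";
    unsigned seq[] = {8, 6, 3, 59, 7};
    for (int i = 0; i < 5; i++) gamma_encode(seq[i], buf);
    printf("%s\n", buf);              /* 0001000001100110000011101100111 */
    gamma_decode(buf);                /* 8 6 3 59 7 */
    return 0;
}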

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).

Recall that |γ(i)| ≤ 2·log i + 1.
How good is this approach w.r.t. Huffman?
Compression ratio ≤ 2·H0(S) + 1
Key fact:
  1 ≥ Σ_{i=1..x} pi ≥ x·px   ⇒   x ≤ 1/px

How good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1.
The cost of the encoding is (recall that i ≤ 1/pi):

  Σ_{i=1..|S|} pi · |γ(i)|  ≤  Σ_{i=1..|S|} pi · [ 2·log(1/pi) + 1 ]  =  2·H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2.

A new concept: Continuers vs Stoppers.
The main idea is:
- Previously we used s = c = 128
- s + c = 256 (we are playing with 8 bits)
- Thus s items are encoded with 1 byte
- And s·c with 2 bytes, s·c² with 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 128² = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230·26 = 6210 words on 2 bytes, hence more on 1 byte, and thus it wins if the distribution is skewed...
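A rough C sketch of one consistent (s,c)-dense byte layout (stoppers are byte values < s, continuers are values ≥ s); it is my own illustration of the counting above, not the reference implementation of ETDC/(s,c)-DC.

#include <stdio.h>

/* (s,c)-dense coding sketch: bytes < s are "stoppers" (they end a codeword),
   bytes >= s are "continuers"; in the byte-oriented schemes s + c = 256. */
int sc_encode(unsigned r, int s, int c, unsigned char *out) {
    /* find the codeword length k: there are s * c^(k-1) codewords of length k */
    unsigned long base = 0, count = (unsigned long)s;
    int k = 1;
    while (r >= base + count) { base += count; count *= (unsigned long)c; k++; }
    unsigned long off = r - base;
    /* last byte: a stopper in [0,s); the other k-1 bytes: continuer digits in base c */
    out[k - 1] = (unsigned char)(off % (unsigned long)s);
    off /= (unsigned long)s;
    for (int i = k - 2; i >= 0; i--) {
        out[i] = (unsigned char)(s + (int)(off % (unsigned long)c));
        off /= (unsigned long)c;
    }
    return k;   /* codeword length in bytes */
}

int main(void) {
    unsigned char cw[8];
    /* with (s,c) = (230,26): ranks 0..229 take 1 byte, 230..6209 take 2 bytes */
    printf("rank 5000 -> %d bytes\n", sc_encode(5000, 230, 26, cw));   /* 2 */
    /* with the ETDC-like split (128,128): rank 5000 also needs 2 bytes */
    printf("rank 5000 -> %d bytes\n", sc_encode(5000, 128, 128, cw));  /* 2 */
    return 0;
}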

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:
- It exploits temporal locality, and it is dynamic
- X = 1ⁿ2ⁿ3ⁿ…nⁿ  ⇒  Huff = O(n² log n), MTF = O(n log n) + n²

Not much worse than Huffman
...but it may be far better
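A short C sketch of the MTF transform over bytes (my own minimal version, with the list kept as a plain array).

#include <stdio.h>
#include <string.h>

/* Move-to-Front: output, for each input symbol, its current position in the
   list L, then move it to the front.  Positions are 0-based here. */
void mtf_encode(const unsigned char *in, size_t n, int *out) {
    unsigned char L[256];
    for (int i = 0; i < 256; i++) L[i] = (unsigned char)i;   /* L = [0,1,...,255] */
    for (size_t k = 0; k < n; k++) {
        int pos = 0;
        while (L[pos] != in[k]) pos++;        /* position of the symbol in L  */
        out[k] = pos;
        memmove(L + 1, L, (size_t)pos);       /* shift the prefix down ...    */
        L[0] = in[k];                         /* ... and move s to the front  */
    }
}

int main(void) {
    const char *s = "aaabbbbba";
    int out[16];
    mtf_encode((const unsigned char *)s, strlen(s), out);
    for (size_t i = 0; i < strlen(s); i++) printf("%d ", out[i]);
    printf("\n");   /* runs turn into small positions: 97 0 0 98 0 0 0 0 1 */
    return 0;
}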

MTF: how good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1.
Writing the initial list S up front and summing the cost of encoding the gaps between consecutive occurrences p_1^x < p_2^x < … of every symbol x, the cost is:

  O(|S| log |S|) + Σ_{x=1..|S|} Σ_{i=2..n_x} |γ( p_i^x − p_{i-1}^x )|

By Jensen's inequality this is

  ≤ O(|S| log |S|) + Σ_{x=1..|S|} n_x · [ 2·log(N/n_x) + 1 ]
  = O(|S| log |S|) + N·[ 2·H0(X) + 1 ]

so  La[mtf] ≤ 2·H0(X) + O(1).

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings ⇒ just the run lengths and one bit
Properties:
- It exploits spatial locality, and it is a dynamic code
- There is a memory
- X = 1ⁿ2ⁿ3ⁿ…nⁿ  ⇒  Huff(X) = Θ(n² log n) > Rle(X) = Θ(n log n)
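A tiny C sketch of RLE producing exactly the (symbol, run-length) pairs of the example above (mine).

#include <stdio.h>
#include <string.h>

/* Run-Length Encoding: emit (symbol, run length) pairs. */
void rle(const char *s) {
    size_t n = strlen(s);
    for (size_t i = 0; i < n; ) {
        size_t j = i;
        while (j < n && s[j] == s[i]) j++;       /* extend the current run */
        printf("(%c,%zu)", s[i], j - i);
        i = j;
    }
    printf("\n");
}

int main(void) {
    rle("abbbaacccca");    /* (a,1)(b,3)(a,2)(c,4)(a,1) */
    return 0;
}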

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.  p(a) = .2, p(b) = .5, p(c) = .3

  f(i) = Σ_{j=1..i-1} p(j)      f(a) = .0, f(b) = .2, f(c) = .7

[Figure: the unit interval [0,1) partitioned into a = [0,.2), b = [.2,.7), c = [.7,1).]

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
  Start:    [0.0, 1.0)
  After b:  [0.2, 0.7)
  After a:  [0.2, 0.3)
  After c:  [0.27, 0.3)

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c1…cn with probabilities p[ci], use the following:

  l0 = 0,  li = li-1 + si-1 · f[ci]
  s0 = 1,  si = si-1 · p[ci]

f[c] is the cumulative probability up to symbol c (not included).
The final interval size is

  sn = Π_{i=1..n} p[ci]

The interval for a message sequence will be called the sequence interval
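The recurrence for (li, si) can be run with plain floating point just to watch the sequence interval shrink; below is a sketch of mine with the probabilities of the running example (a real coder uses the integer version discussed later).

#include <stdio.h>
#include <string.h>

/* Symbols a,b,c with p = .2,.5,.3 and cumulative f = .0,.2,.7 (as in the slides). */
static double p_of(char c) { return c == 'a' ? 0.2 : c == 'b' ? 0.5 : 0.3; }
static double f_of(char c) { return c == 'a' ? 0.0 : c == 'b' ? 0.2 : 0.7; }

/* Compute the sequence interval [l, l+s) of a message:
   l_i = l_{i-1} + s_{i-1} * f(c_i),   s_i = s_{i-1} * p(c_i). */
void sequence_interval(const char *msg) {
    double l = 0.0, s = 1.0;
    for (size_t i = 0; i < strlen(msg); i++) {
        l = l + s * f_of(msg[i]);
        s = s * p_of(msg[i]);
        printf("after '%c': [%.4f, %.4f)\n", msg[i], l, l + s);
    }
}

int main(void) {
    sequence_interval("bac");   /* ends with [0.2700, 0.3000) */
    return 0;
}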

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore, specifying any number in the final interval uniquely determines the message.
Decoding is similar to encoding, but at each step we need to determine what the next message symbol is and then reduce the interval.

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
  0.49 ∈ [0.2, 0.7)    ⇒ first symbol b;  b's interval splits into a = [.2,.3), b = [.3,.55), c = [.55,.7)
  0.49 ∈ [0.3, 0.55)   ⇒ second symbol b; its interval splits into a = [.3,.35), b = [.35,.475), c = [.475,.55)
  0.49 ∈ [0.475, 0.55) ⇒ third symbol c

The message is bbc.

Representing a real number
Binary fractional representation:
  .75   = .11
  1/3   = .010101…
  11/16 = .1011

Algorithm
  1. x = 2*x
  2. If x < 1 output 0
  3. else x = x - 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
e.g.  [0,.33) → .01     [.33,.66) → .1     [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions.

  number   min     max     interval
  .11      .110    .111    [.75, 1.0)
  .101     .1010   .1011   [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

  Sequence interval: [.61, .79)
  Code interval (.101): [.625, .75)

Can use l + s/2 truncated to 1 + ⌈log(1/s)⌉ bits

Bound on Arithmetic length

Note that −log s + 1 = log(2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

  1 + ⌈log (1/s)⌉ = 1 + ⌈log Π_i (1/pi)⌉
                  ≤ 2 + Σ_{i=1..n} log (1/pi)
                  = 2 + Σ_{k=1..|S|} n·pk · log (1/pk)
                  = 2 + n·H0   bits

In practice it is nH0 + 0.02·n bits, because of rounding.

Integer Arithmetic Coding
Problem: operations on arbitrary-precision real numbers are expensive.
Key ideas of the integer version:
- Keep integers in the range [0..R), where R = 2^k
- Use rounding to generate integer intervals
- Whenever the sequence interval falls into the top, bottom or middle half, expand the interval by a factor of 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
- If l ≥ R/2 (top half): output 1 followed by m 0s; set m = 0; the message interval is expanded by 2
- If u < R/2 (bottom half): output 0 followed by m 1s; set m = 0; the message interval is expanded by 2
- If l ≥ R/4 and u < 3R/4 (middle half): increment m; the message interval is expanded by 2
- In all other cases, just continue...
You find this at

Arithmetic ToolBox
As a state machine: given the current interval (L,s), a symbol c and a distribution (p1,....,p|S|), the ATB returns the new interval (L',s').
Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
[Figure: the ATB driven by p[ s | context ], where s = c or esc; it maps (L,s) to (L',s').]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
String = ACCBACCACBA B,   k = 2

  Context Empty:  A = 4   B = 2   C = 5   $ = 3

  Context A:      C = 3   $ = 1
  Context B:      A = 2   $ = 1
  Context C:      A = 1   B = 2   C = 2   $ = 3

  Context AC:     B = 1   C = 2   $ = 2
  Context BA:     C = 1   $ = 1
  Context CA:     C = 1   $ = 1
  Context CB:     A = 2   $ = 1
  Context CC:     A = 1   B = 1   $ = 2
You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c    →  (0,0,a)
a a c a a c a b c a b a a a c    →  (1,1,c)
a a c a a c a b c a b a a a c    →  (3,4,b)
a a c a a c a b c a b a a a c    →  (3,3,a)
a a c a a c a b c a b a a a c    →  (1,2,c)

Window size = 6: at each step, output the longest match within W plus the next character.

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
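The overlap case makes the decoder essentially the loop shown above; here is a self-contained C sketch (mine) that decodes a list of (d, len, c) triples and reproduces the slide's example.

#include <stdio.h>

typedef struct { int d, len; char c; } Triple;   /* (distance, length, next char) */

/* LZ77 decoding: copy len chars from d positions back (possibly overlapping
   the part being written), then append the explicit character c. */
int lz77_decode(const Triple *in, int ntriples, char *out) {
    int cursor = 0;
    for (int t = 0; t < ntriples; t++) {
        for (int i = 0; i < in[t].len; i++, cursor++)
            out[cursor] = out[cursor - in[t].d];   /* works even when len > d */
        out[cursor++] = in[t].c;
    }
    out[cursor] = '\0';
    return cursor;
}

int main(void) {
    /* "seen = abcd, next codeword is (2,9,e)" from the slide */
    Triple in[] = { {0,0,'a'}, {0,0,'b'}, {0,0,'c'}, {0,0,'d'}, {2,9,'e'} };
    char out[64];
    lz77_decode(in, 5, out);
    printf("%s\n", out);     /* abcdcdcdcdcdce */
    return 0;
}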

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids
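A compact C sketch of this coding loop (mine); the dictionary trie is kept as a flat table of (parent id, character) pairs, which is enough for the small example that follows.

#include <stdio.h>
#include <string.h>

#define MAXD 1024

/* LZ78: find the longest dictionary match S, output (id of S, next char),
   then insert S + next char with a fresh id.  Id 0 is the empty string. */
void lz78_encode(const char *T) {
    int parent[MAXD]; char sym[MAXD]; int size = 1;   /* entry 0 = empty string */
    size_t n = strlen(T), i = 0;
    while (i < n) {
        int cur = 0;
        /* walk down the trie as long as (cur, T[i]) is a dictionary entry */
        for (;;) {
            int next = 0;
            for (int e = 1; e < size; e++)
                if (parent[e] == cur && sym[e] == T[i]) { next = e; break; }
            if (next == 0 || i + 1 >= n) break;   /* keep one char to emit explicitly */
            cur = next; i++;
        }
        printf("(%d,%c)  -> new entry %d\n", cur, T[i], size);
        parent[size] = cur; sym[size] = T[i]; size++;
        i++;
    }
}

int main(void) {
    lz78_encode("aabaacabcabcb");
    /* outputs (0,a) (1,b) (1,a) (0,c) (2,c) (5,b), as in the coding example below */
    return 0;
}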

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
We are given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
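The LF-array of the pseudocode can be filled by counting: equal characters keep their relative order between L and F, so the k-th occurrence of c in L maps to the k-th occurrence of c in F, and F is just L sorted. A C sketch of mine along these lines; it assumes the text ends with a unique smallest marker '#'.

#include <stdio.h>
#include <string.h>

#define MAXN 1024

/* Invert the BWT.  L is the last column; since '#' is the smallest character,
   the rotation starting with '#' is row 0 of the sorted matrix. */
void invert_bwt(const char *L, char *T) {
    int n = (int)strlen(L);
    int cnt[256] = {0}, first[256], seen[256] = {0}, LF[MAXN];

    for (int i = 0; i < n; i++) cnt[(unsigned char)L[i]]++;
    /* first[c] = position in F of the first occurrence of c (F = sorted L) */
    for (int c = 0, sum = 0; c < 256; c++) { first[c] = sum; sum += cnt[c]; }
    /* LF[i] = row of F holding the same text character as L[i] */
    for (int i = 0; i < n; i++) {
        unsigned char c = (unsigned char)L[i];
        LF[i] = first[c] + seen[c]++;
    }
    /* reconstruct T backward: L[r] precedes F[r] in T */
    int r = 0;
    T[n - 1] = '#';                              /* the text ends with the marker */
    for (int i = n - 2; i >= 0; i--) { T[i] = L[r]; r = LF[r]; }
    T[n] = '\0';
}

int main(void) {
    char T[MAXN];
    invert_bwt("ipssm#pissii", T);   /* L of mississippi# from the slides */
    printf("%s\n", T);               /* mississippi# */
    return 0;
}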

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)
- Set of nodes such that from any node one can reach any other node via an undirected path.

Strongly connected components (SCC)
- Set of nodes such that from any node one can reach any other node via a directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


The largest artifact ever conceived by humankind



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


- Physical network graph: V = routers, E = communication links
- The "cosine" graph (undirected, weighted): V = static web pages, E = semantic distance between pages
- Query-Log graph (bipartite, weighted): V = queries and URLs, E = (q,u) if u is a result for q and has been clicked by some user who issued q
- Social graph (undirected, unweighted): V = users, E = (x,y) if x knows y (facebook, address book, email, ...)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution

Altavista crawl 1999 and WebBase crawl 2001: the indegree follows a power-law distribution

  Pr[ in-degree(u) = k ]  ∝  1/k^α,   α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 million pages, 150 million links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77 scheme provides an efficient, optimal solution:
fknown is the "previously encoded text"; compress the concatenation fknown·fnew, starting from fnew.

zdelta is one of the best implementations
            Emacs size   Emacs time
  uncompr   27Mb         ---
  gzip      8Mb          35 secs
  zdelta    1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: a small weighted graph GF with a dummy node; the min branching selects, for each file, the cheapest way to encode it (via zdelta from another file, or via gzip from the dummy node).]

            space   time
  uncompr   30Mb    ---
  tgz       20%     linear
  THIS      8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
            space    time
  uncompr   260Mb    ---
  tgz       12%      2 mins
  THIS      8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

            gcc size   emacs size
  total     27288      27326
  gzip      7563       8577
  zdelta    227        1431
  rsync     964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends hashes (unlike the client in rsync), and the client checks them.
The server deploys the common fref to compress the new ftar (rsync compresses just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Figure: the suffix tree of T# = mississippi#. Edge labels include #, i, s, p, si, ssi, ppi#, pi#, i#, mississippi#; each leaf stores the starting position (1..12) of its suffix.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
5

Θ(N²) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Θ(N log₂ N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
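The indirected binary search of the previous slides, as a C sketch of mine: two binary searches delimit the SA range of suffixes prefixed by P, so all occurrences come out as a contiguous range.

#include <stdio.h>
#include <string.h>

/* Compare P against the suffix T[pos..]: O(p) per comparison. */
static int cmp_suffix(const char *T, int pos, const char *P) {
    return strncmp(T + pos, P, strlen(P));
}

/* Return in [*lo,*hi) the SA range of suffixes having P as a prefix. */
void sa_search(const char *T, const int *SA, int n, const char *P, int *lo, int *hi) {
    int l = 0, r = n;                        /* first suffix >= P */
    while (l < r) {
        int m = (l + r) / 2;
        if (cmp_suffix(T, SA[m], P) < 0) l = m + 1; else r = m;
    }
    *lo = l;
    r = n;                                   /* first suffix whose prefix is > P */
    while (l < r) {
        int m = (l + r) / 2;
        if (cmp_suffix(T, SA[m], P) <= 0) l = m + 1; else r = m;
    }
    *hi = l;
}

int main(void) {
    const char *T = "mississippi#";
    /* suffix array of T, 0-based (the slides use 1-based positions) */
    int SA[] = {11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2};
    int lo, hi;
    sa_search(T, SA, 12, "si", &lo, &hi);
    printf("occ = %d at positions:", hi - lo);
    for (int i = lo; i < hi; i++) printf(" %d", SA[i] + 1);   /* 7 and 4 */
    printf("\n");
    return 0;
}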

Locating the occurrences
SA for T = mississippi#, P = si: binary search for the range of suffixes having P as a prefix (conceptually, search for si# and for si$, with # smaller and $ larger than every symbol of S). Here occ = 2: the range contains sippi... and sissippi..., i.e. positions 7 and 4.

Suffix Array search
• O(p + log2 N + occ) time

Suffix Trays: O(p + log2 |S| + occ)   [Cole et al., '06]
String B-tree   [Ferragina-Grossi, '95]
Self-adjusting Suffix Arrays   [Ciriani et al., '02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#   (e.g. Lcp = 4 between the adjacent suffixes issippi# and ississippi#)
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does it exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does it exist a substring of length ≥ L occurring ≥ C times ?
• Search for Lcp[i,i+C-2] whose entries are ≥ L
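The last question reduces to finding C-1 consecutive Lcp entries all ≥ L; a small C sketch of mine over the arrays of this slide.

#include <stdio.h>

/* Is there a substring of length >= L occurring >= C times?
   Equivalent to: some window Lcp[i .. i+C-2] has all entries >= L. */
int exists_repeat(const int *Lcp, int m, int L, int C) {
    int run = 0;                      /* current run of entries >= L */
    for (int i = 0; i < m; i++) {
        run = (Lcp[i] >= L) ? run + 1 : 0;
        if (run >= C - 1) return 1;
    }
    return C <= 1;                    /* every substring occurs >= 1 time */
}

int main(void) {
    /* Lcp array of mississippi# from the slide (adjacent suffixes in SA) */
    int Lcp[] = {0, 0, 1, 4, 0, 0, 1, 0, 2, 1, 3};
    printf("%d\n", exists_repeat(Lcp, 11, 3, 2));   /* 1: e.g. "issi" occurs twice */
    printf("%d\n", exists_repeat(Lcp, 11, 3, 3));   /* 0 */
    return 0;
}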


Slide 205

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with i-th bit of U(T[j]) to estabilish if both are true

An example j=1
1 2 3 4 5 6 7 8 9 10

n
m

1

12345

1

0

P=abaac

2

0

3

0

4

0

5

0

T=xabxabaaca

0
 
0
U (x)  0
 
0
 
0

2

3

4

5

6

7

1 
 
0
BitShift ( M ( 0 )) & U (T [1])   0  &
 
0
 
0

8

9

0 0
   
0 0
0  0
   
0 0
   
0 0

1
0

An example j=2
1 2 3 4 5 6 7 8 9 10

n
m

1

2

12345

1

0

1

P=abaac

2

0

0

3

0

0

4

0

0

5

0

0

T=xabxabaaca

1 
 
0
U (a )   1 
 
1 
 
0

3

4

5

6

7

1 
 
0
BitShift ( M (1)) & U (T [ 2 ])   0  &
 
0
 
0

8

9

1  1 
   
0 0
1    0 
   
1   0 
   
0 0

1
0

An example j=3
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

12345

1

0

1

0

P=abaac

2

0

0

1

3

0

0

0

4

0

0

0

5

0

0

0

T=xabxabaaca

0
 
1 
U (b )   0 
 
0
 
0

4

5

6

7

1 
 
1 
BitShift ( M ( 2 )) & U (T [ 3 ])   0  &
 
0
 
0

8

9

0 0
   
1  1 
0  0
   
0 0
   
0 0

1
0

An example j=9
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

4

5

6

7

8

9

12345

1

0

1

0

0

1

0

1

1

0

P=abaac

2

0

0

1

0

0

1

0

0

0

3

0

0

0

0

0

0

1

0

0

4

0

0

0

0

0

0

0

1

0

5

0

0

0

0

0

0

0

0

1

T=xabxabaaca

0
 
0
U (c )   0 
 
0
 
1 

1 
 
1 
BitShift ( M ( 8 )) & U (T [ 9 ])   0  &
 
0
 
1 

0 0
   
0 0
0  0
   
0 0
   
1  1 

1
0

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

 S is the concatenation of the patterns in P
 R is a bitmap of length m
  R[i] = 1 iff S[i] is the first symbol of a pattern
 Use a variant of the Shift-And method searching for S
  For any symbol c, U’(c) = U(c) AND R
   U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
 For any step j,
  compute M(j)
  then M(j) OR U’(T[j]). Why?
   It sets to 1 the first bit of each pattern that starts with T[j]
  Check if there are occurrences ending in j. How? (a sketch follows)
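A sketch of this multi-pattern variant (the concatenation S, the start mask R and an end mask for reporting are assumptions of this sketch; names are illustrative):

def multi_shift_and(T, patterns):
    S = "".join(patterns)
    U, start_mask, end_mask, starts, pos = {}, 0, 0, [], 0
    for P in patterns:
        starts.append(pos)
        start_mask |= 1 << pos                       # R: first symbol of each pattern
        end_mask |= 1 << (pos + len(P) - 1)          # last symbol of each pattern
        pos += len(P)
    for i, ch in enumerate(S):
        U[ch] = U.get(ch, 0) | (1 << i)
    Uprime = {c: v & start_mask for c, v in U.items()}   # U'(c) = U(c) AND R

    occ, M = [], 0
    for j, ch in enumerate(T, 1):
        M = ((M << 1) & U.get(ch, 0)) | Uprime.get(ch, 0)
        hits = M & end_mask
        if hits:
            for k, P in enumerate(patterns):         # which pattern ends at j?
                if hits & (1 << (starts[k] + len(P) - 1)):
                    occ.append((k, j - len(P) + 1))
    return occ

print(multi_shift_and("xabxabaaca", ["abaac", "ba"]))   # [(1, 6), (0, 5)]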

Problem 3
Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing at most k mismatches.
Dictionary = {bzip, not, or, ...};  S = “bzip or not bzip”;  P = bot, k = 2
[Figure: the dictionary trie with tagged codewords and the compressed text C(S).]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position 4; it also occurs with 4 mismatches starting at position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal: given k, find all the occurrences of P in T with up to k mismatches.
We define the matrix Ml to be an m by n binary matrix such that:

Ml(i,j) = 1 iff there are no more than l mismatches between the first i characters of P and the i characters of T ending at character j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1

The first i-1 characters of P match a substring of T ending at j-1, with at most l mismatches, and the next pair of characters in P and T are equal.

This case contributes:  BitShift(Ml(j-1)) & U(T[j])

Computing Ml: case 2

The first i-1 characters of P match a substring of T ending at j-1, with at most l-1 mismatches (so a mismatch is allowed on the i-th pair).

This case contributes:  BitShift(Ml-1(j-1))

Computing Ml

We compute Ml for all l = 0, …, k.
For each j compute M(j), M1(j), …, Mk(j).
For all l initialize Ml(0) to the zero vector.

In order to compute Ml(j), we observe that there is a match iff case 1 or case 2 holds:

Ml(j) = [ BitShift(Ml(j-1)) & U(T[j]) ]  OR  BitShift(Ml-1(j-1))

Example M1
T = xabxabaaca, P = abaad

M1 =
       1 2 3 4 5 6 7 8 9 10
  1:   1 1 1 1 1 1 1 1 1 1
  2:   0 0 1 0 0 1 0 1 1 0
  3:   0 0 0 1 0 0 1 0 0 1
  4:   0 0 0 0 1 0 0 1 0 0
  5:   0 0 0 0 0 0 0 0 1 0

M0 =
       1 2 3 4 5 6 7 8 9 10
  1:   0 1 0 0 1 0 1 1 0 1
  2:   0 0 1 0 0 1 0 0 0 0
  3:   0 0 0 0 0 0 1 0 0 0
  4:   0 0 0 0 0 0 0 1 0 0
  5:   0 0 0 0 0 0 0 0 0 0

M1(5,9) = 1: P = abaad occurs ending at position 9 with at most 1 mismatch.

How much do we pay?

 The running time is O(kn(1+m/w)).
 Again, the method is practically efficient for small m.
 Only O(k) columns of M are needed at any given time; hence, the space used by the algorithm is O(k) memory words (when m ≤ w).
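A small sketch of the k-mismatch Shift-And above (bit-parallel, assuming m ≤ w; function and variable names are illustrative):

def agrep_mismatch(T, P, k):
    m = len(P)
    U = {}
    for i, ch in enumerate(P):
        U[ch] = U.get(ch, 0) | (1 << i)
    last = 1 << (m - 1)
    M = [0] * (k + 1)                     # current columns M0(j), ..., Mk(j)
    occ = []
    for j, ch in enumerate(T, 1):
        prev, u = M[:], U.get(ch, 0)
        M[0] = ((prev[0] << 1) | 1) & u   # exact case
        for l in range(1, k + 1):
            # case 1: <= l mismatches so far and P[i] = T[j]
            # case 2: <= l-1 mismatches on the first i-1 chars, mismatch here
            M[l] = (((prev[l] << 1) | 1) & u) | ((prev[l - 1] << 1) | 1)
        if M[k] & last:
            occ.append(j - m + 1)         # occurrence with <= k mismatches ends at j
    return occ

print(agrep_mismatch("xabxabaaca", "abaad", 1))   # -> [5], as in Example M1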

Problem 3: Solution
Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing k mismatches.
Dictionary = {bzip, not, or, ...};  S = “bzip or not bzip”;  P = bot, k = 2
A matching term:  not = 1g 0g 0a
[Figure: the dictionary trie with tagged codewords and the compressed text C(S); the occurrences of the matching codeword are marked yes.]

Agrep: more sophisticated operations

The Shift-And method can solve other ops.

The edit distance between two strings p and s is d(p,s) = minimum number of operations needed to transform p into s via three ops:
 Insertion: insert a symbol in p
 Deletion: delete a symbol from p
 Substitution: change a symbol in p with a different one

Example: d(ananas, banane) = 3
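For reference, the classic dynamic-programming computation of this distance (a sketch; the bit-parallel machinery above can be adapted to it):

def edit_distance(p, s):
    # D[i][j] = edit distance between p[:i] and s[:j]
    m, n = len(p), len(s)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i                       # i deletions
    for j in range(n + 1):
        D[0][j] = j                       # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if p[i - 1] == s[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,         # deletion
                          D[i][j - 1] + 1,         # insertion
                          D[i - 1][j - 1] + cost)  # substitution / match
    return D[m][n]

print(edit_distance("ananas", "banane"))   # -> 3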

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding
γ(x) = 0^(Length-1) followed by x in binary,   where x > 0 and Length = ⌊log2 x⌋ + 1
e.g., 9 is represented as <000, 1001>.

The γ-code for x takes 2⌊log2 x⌋ + 1 bits   (i.e. a factor of 2 from optimal)
Optimal for Pr(x) = 1/(2x²), and i.i.d. integers
It is a prefix-free encoding…

Given the following sequence of γ-coded integers, reconstruct the original sequence:
0001000001100110000011101100111
→ 8, 6, 3, 59, 7
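A tiny sketch of γ encoding and decoding (names are illustrative):

def gamma_encode(x):
    # x > 0: (Length-1) zeros, then x in binary (which starts with a 1)
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":             # count leading zeros = Length - 1
            z += 1
            i += 1
        out.append(int(bits[i:i + z + 1], 2))   # next z+1 bits hold the value
        i += z + 1
    return out

print(gamma_encode(9))                                     # 0001001
print(gamma_decode("0001000001100110000011101100111"))     # [8, 6, 3, 59, 7]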

Analysis
Sort pi in decreasing order, and encode si via the variable-length code γ(i).

Recall that: |γ(i)| ≤ 2 log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2 H0(S) + 1
Key fact:
1 ≥ Σi=1,...,x pi ≥ x · px   ⟹   x ≤ 1/px

How good is it ?
Encode the integers via γ-coding:  |γ(i)| ≤ 2 log i + 1
The cost of the encoding is (recall i ≤ 1/pi):

Σi=1,...,|S| pi · |γ(i)|  ≤  Σi=1,...,|S| pi · [2 log(1/pi) + 1]  =  2 H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example

 5000 distinct words
 ETDC encodes 128 + 128² = 16,512 words on at most 2 bytes
 A (230,26)-dense code encodes 230 + 230·26 = 6,210 words on at most 2 bytes, but more of them (230 vs 128) on 1 byte; thus, if the distribution is skewed, it pays off.
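A sketch of the (s,c)-dense encoder implied above (byte values 0..s-1 act as stoppers, values s..255 as continuers; the 0-based rank convention and the function name are assumptions):

def sc_dense_encode(rank, s, c):
    assert s + c == 256
    # s words fit in 1 byte, s*c in 2 bytes, s*c^2 in 3 bytes, ...
    k, first = 1, 0
    while rank >= first + s * c ** (k - 1):
        first += s * c ** (k - 1)
        k += 1
    offset = rank - first
    out = [offset % s]                    # last byte: a stopper
    offset //= s
    for _ in range(k - 1):                # preceding bytes: continuers (base-c digits)
        out.append(s + offset % c)
        offset //= c
    return bytes(reversed(out))

# With s = c = 128 this essentially reproduces the End-Tagged Dense Code.
print(list(sc_dense_encode(0, 230, 26)))     # [0]        first 1-byte codeword
print(list(sc_dense_encode(230, 230, 26)))   # [230, 0]   first 2-byte codeword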

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC is quite interesting…

Search is 6% faster than byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:
 Exploits temporal locality, and it is dynamic
 X = 1^n 2^n 3^n … n^n  ⟹  Huff(X) = Θ(n² log n), MTF(X) = O(n log n) + n² bits
 Not much worse than Huffman... but it may be far better
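A minimal sketch of the MTF transform just described (the list L starts as the alphabet; 1-based positions are one possible output convention):

def mtf_encode(text, alphabet):
    L, out = list(alphabet), []
    for ch in text:
        pos = L.index(ch)
        out.append(pos + 1)               # output the position of ch in L
        L.pop(pos)
        L.insert(0, ch)                   # move ch to the front
    return out

def mtf_decode(codes, alphabet):
    L, out = list(alphabet), []
    for pos in codes:
        ch = L.pop(pos - 1)
        out.append(ch)
        L.insert(0, ch)
    return "".join(out)

codes = mtf_encode("aabbbba", "abcd")
print(codes)                              # [1, 1, 2, 1, 1, 1, 2]
print(mtf_decode(codes, "abcd"))          # aabbbba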

MTF: how good is it ?
Encode the integers via γ-coding:  |γ(i)| ≤ 2 log i + 1
Put the alphabet S in front and consider the cost of encoding:

O(|S| log |S|) + Σx=1,...,|S| Σi=2,...,nx |γ(pxi − pxi−1)|

where px1 < px2 < … are the positions of the occurrences of symbol x, and nx is their number.

By Jensen's inequality:
≤ O(|S| log |S|) + Σx=1,...,|S| nx · [2 log(N/nx) + 1]
= O(|S| log |S|) + N · [2 H0(X) + 1]

Hence La[mtf] ≤ 2 H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca ⟹ (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings, the run lengths plus one initial bit suffice.
Properties:
 There is a memory
 Exploits spatial locality, and it is a dynamic code
 X = 1^n 2^n 3^n … n^n  ⟹  Huff(X) = Θ(n² log n) > Rle(X) = Θ(n log n)
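A one-function sketch of the run collapsing above:

def rle_encode(s):
    out, i = [], 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:   # extend the current run
            j += 1
        out.append((s[i], j - i))
        i = j
    return out

print(rle_encode("abbbaacccca"))   # [('a',1), ('b',3), ('a',2), ('c',4), ('a',1)]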

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol an interval in [0,1), of width equal to its probability, starting at the cumulative probability

f(i) = Σj=1,...,i-1 p(j)

e.g. p(a) = .2, p(b) = .5, p(c) = .3  ⟹  f(a) = .0, f(b) = .2, f(c) = .7
so  a → [0, .2),  b → [.2, .7),  c → [.7, 1.0)

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
start: [0, 1)
b:     [.2, .7)        (width .5)
a:     [.2, .3)        (width .5 × .2 = .1)
c:     [.27, .3)       (width .1 × .3 = .03)
The final sequence interval is [.27, .3)

Arithmetic Coding
To code a sequence of symbols c1 c2 … cn with probabilities p[c], use the following:

l0 = 0,   li = li-1 + si-1 · f[ci]
s0 = 1,   si = si-1 · p[ci]

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is  sn = Πi=1,...,n p[ci]

The interval for a message sequence will be called the sequence interval
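A small sketch of these recurrences, computed with exact fractions to sidestep rounding (names and the probability format are assumptions):

from fractions import Fraction

def sequence_interval(msg, p):
    # l_i = l_{i-1} + s_{i-1} * f[c_i],   s_i = s_{i-1} * p[c_i]
    symbols = sorted(p)                       # fix an order to define f[]
    f, acc = {}, Fraction(0)
    for c in symbols:
        f[c] = acc
        acc += Fraction(p[c])
    l, s = Fraction(0), Fraction(1)
    for c in msg:
        l = l + s * f[c]
        s = s * Fraction(p[c])
    return l, s

l, s = sequence_interval("bac", {"a": "0.2", "b": "0.5", "c": "0.3"})
print(float(l), float(l + s))                 # 0.27 0.3  -> the interval [.27, .3)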

Uniquely defining an interval
Important property: the intervals for distinct messages of length n never overlap.
Therefore, specifying any number in the final interval uniquely determines the msg.
Decoding is similar to encoding, but at each step we need to determine what the message value is and then reduce the interval.

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:
[0, 1):      a → [0,.2),   b → [.2,.7),    c → [.7,1)        .49 ∈ [.2,.7)     ⟹ b
[.2, .7):    a → [.2,.3),  b → [.3,.55),   c → [.55,.7)      .49 ∈ [.3,.55)    ⟹ b
[.3, .55):   a → [.3,.35), b → [.35,.475), c → [.475,.55)    .49 ∈ [.475,.55)  ⟹ c
The message is bbc.

Representing a real number
Binary fractional representation:
.75 = .11      1/3 = .010101…      11/16 = .1011
Algorithm
1. x = 2·x
2. If x < 1 output 0
3. else x = x - 1; output 1

So how about just using the shortest binary fractional representation lying in the sequence interval?
e.g. [0,.33) → .01     [.33,.66) → .1     [.66,1) → .11
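The doubling rule above, as a tiny sketch (max_bits is an arbitrary cut-off for non-terminating expansions):

def to_binary_fraction(x, max_bits=16):
    bits = []
    for _ in range(max_bits):
        x *= 2
        if x < 1:
            bits.append("0")
        else:
            bits.append("1")
            x -= 1
        if x == 0:
            break
    return "." + "".join(bits)

print(to_binary_fraction(0.75))       # .11
print(to_binary_fraction(11 / 16))    # .1011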

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions.

  code    min      max      interval
  .11     .110…    .111…    [.75, 1.0)
  .101    .1010…   .1011…   [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval is contained in the sequence interval (a dyadic number).

e.g. sequence interval [.61, .79): the code .101, with code interval [.625, .75), is contained in it.

Can use L + s/2 truncated to 1 + ⌈log (1/s)⌉ bits

Bound on Arithmetic length
Note that ⌈-log s⌉ + 1 = ⌈log (2/s)⌉

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most
1 + ⌈log (1/s)⌉ = 1 + ⌈log Πi (1/pi)⌉
              ≤ 2 + Σi=1,...,n log (1/pi)
              = 2 + Σk=1,...,|S| n·pk · log (1/pk)
              = 2 + n·H0  bits

In practice it takes nH0 + 0.02·n bits, because of rounding.

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
 Output 1 followed by m 0s
 m = 0
 Message interval is expanded by 2

If u < R/2 then (bottom half)
 Output 0 followed by m 1s
 m = 0
 Message interval is expanded by 2

If l ≥ R/4 and u < 3R/4 then (middle half)
 Increment m
 Message interval is expanded by 2

In all other cases, just continue...
You find this at

Arithmetic ToolBox
As a state machine: the ATB keeps the current interval (L,s); given the next symbol c and the distribution (p1,....,pS), it produces the new interval (L',s'):

ATB:  (L,s), c, (p1,....,pS)  →  (L',s')

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
The ATB is driven by the conditional distribution p[s | context], where s = c or esc:

ATB:  (L,s), p[s|context]  →  (L',s')

Encoder and Decoder must know the protocol for selecting the same conditional probability distribution (PPM-variant)

PPM: Example Contexts      (k = 2)
String = ACCBACCACBA B

Context Empty:  A = 4   B = 2   C = 5   $ = 3
Context A:      C = 3   $ = 1
Context B:      A = 2   $ = 1
Context C:      A = 1   B = 2   C = 2   $ = 3
Context AC:     B = 1   C = 2   $ = 2
Context BA:     C = 1   $ = 1
Context CA:     C = 1   $ = 1
Context CB:     A = 2   $ = 1
Context CC:     A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character
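A minimal greedy parser matching the windowed example above (window size, variable names and the reservation of one literal char per triple are conventions of this sketch):

def lz77_parse(text, window=6):
    out, cur, n = [], 0, len(text)
    while cur < n:
        best_len, best_dist = 0, 0
        start = max(0, cur - window)
        for d in range(1, cur - start + 1):          # candidate copy distances
            l = 0
            # the copy may overlap the part still to be encoded (len > d)
            while cur + l < n - 1 and text[cur + l - d] == text[cur + l]:
                l += 1
            if l > best_len:
                best_len, best_dist = l, d
        out.append((best_dist, best_len, text[cur + best_len]))
        cur += best_len + 1                          # advance by len + 1
    return out

print(lz77_parse("aacaacabcabaaac"))
# -> [(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')], as in the example above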

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb
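A compact sketch of LZ78 coding and decoding as described above (the dictionary is kept as a Python dict; id 0 is the empty phrase):

def lz78_encode(text):
    dic, out, w = {"": 0}, [], ""
    for ch in text:
        if w + ch in dic:
            w += ch                        # keep extending the longest match S
        else:
            out.append((dic[w], ch))       # output (id of S, next char c)
            dic[w + ch] = len(dic)         # add Sc to the dictionary
            w = ""
    if w:                                  # leftover phrase with no following char
        out.append((dic[w[:-1]], w[-1]))
    return out

def lz78_decode(pairs):
    phrases, out = [""], []
    for idx, ch in pairs:
        s = phrases[idx] + ch
        phrases.append(s)                  # rebuild the same dictionary
        out.append(s)
    return "".join(out)

code = lz78_encode("aabaacabcabcb")
print(code)                 # [(0,'a'), (1,'b'), (1,'a'), (0,'c'), (2,'c'), (5,'b')]
print(lz78_decode(code))    # aabaacabcabcb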

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later
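A compact LZW sketch, including the decoder's handling of the special case just seen (the 256-entry byte initialization is an assumption; the slides number symbols differently, e.g. a = 112):

def lzw_encode(text):
    dic, w, out = {chr(i): i for i in range(256)}, "", []
    for ch in text:
        if w + ch in dic:
            w += ch
        else:
            out.append(dic[w])
            dic[w + ch] = len(dic)         # add Sc, but emit only the id of S
            w = ch
    if w:
        out.append(dic[w])
    return out

def lzw_decode(codes):
    dic = {i: chr(i) for i in range(256)}
    w, out = dic[codes[0]], [dic[codes[0]]]
    for k in codes[1:]:
        if k in dic:
            s = dic[k]
        else:                              # code not yet known: the special case
            s = w + w[0]
        out.append(s)
        dic[len(dic)] = w + s[0]           # decoder is one step behind the coder
        w = s
    return "".join(out)

msg = "aabaacababacb"
print(lzw_decode(lzw_encode(msg)) == msg)  # True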

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows                                          (1994)

F                    L
#  mississipp   i
i  #mississip   p
i  ppi#missis   s
i  ssippi#mis   s
i  ssissippi#   m
m  ississippi   #
p  i#mississi   p
p  pi#mississ   i
s  ippi#missi   s
s  issippi#mi   s
s  sippi#miss   i
s  sissippi#m   i

The row m·ississippi·# is the original text T; the last column L is the output of the BWT.

A famous example

Much
longer...

A useful tool: L → F mapping
(Same sorted matrix as above: F and L are known, the middle of each row is unknown.)

How do we map L's chars onto F's chars?
... we need to distinguish equal chars in F...

Take two equal chars of L and rotate their rows rightward by one position: they appear in F in the same relative order !!

The BWT is invertible
(F and L are known, the middle of each row is unknown.)

Two key properties:
1. The LF-array maps L's chars to F's chars
2. L[ i ] precedes F[ i ] in T

Reconstruct T backward:
T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
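A small sketch of the inversion above: LF is obtained by stably sorting the positions of L, then T is rebuilt backward (the end-marker '#' is assumed to sort first):

def bwt_inverse(L, end="#"):
    n = len(L)
    # the i-th occurrence of a char in L corresponds to the i-th occurrence in F = sorted(L)
    order = sorted(range(n), key=lambda i: (L[i], i))   # stable: ties keep L-order
    LF = [0] * n
    for f_pos, l_pos in enumerate(order):
        LF[l_pos] = f_pos
    # row 0 of the sorted matrix starts with the end-marker; walk T backward
    out, r = [], 0
    for _ in range(n - 1):
        out.append(L[r])
        r = LF[r]
    return "".join(reversed(out)) + end

print(bwt_inverse("ipssm#pissii"))   # mississippi#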

How to compute the BWT ?
Build the suffix array SA of T = mississippi#:
SA = 12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3
(the rows of the BWT matrix, in sorted order, are the rotations starting at the suffixes listed by SA)
L = i p s s m # p i s s i i

We said that: L[i] precedes F[i] in T
Given SA and T, we have L[i] = T[SA[i]-1]   (reading T[0] as the last character #)
e.g. L[3] = T[SA[3]-1] = T[7]

How to construct SA from T ?
Input: T = mississippi#
SA = 12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3 lists the starting positions of the suffixes in lexicographic order:
#, i#, ippi#, issippi#, ississippi#, mississippi#, pi#, ppi#, sippi#, sissippi#, ssippi#, ssissippi#

Elegant but inefficient: sort the suffixes directly.
Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics

Size
 1 trillion pages reported available (Google, 7/2008)
 5-40K per page => hundreds of terabytes
 Size grows every day!!

Change
 8% new pages, 25% new links change weekly
 Life time of about 10 days

The Bow Tie

Some definitions

 Weakly connected components (WCC)
  Set of nodes such that from any node one can reach any other node via an undirected path.

 Strongly connected components (SCC)
  Set of nodes such that from any node one can reach any other node via a directed path.

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?

 The Web is the largest artifact ever conceived by humans
 Exploit the structure of the Web for
  Crawl strategies
  Search
  Spam detection
  Discovering communities on the web
  Classification/organization
 Predict the evolution of the Web
  Sociological understanding

Many other large graphs…

 Physical network graph
  V = Routers, E = communication links

 The “cosine” graph (undirected, weighted)
  V = static web pages, E = semantic distance between pages

 Query-Log graph (bipartite, weighted)
  V = queries and URLs, E = (q,u) if u is a result for q and has been clicked by some user who issued q

 Social graph (undirected, unweighted)
  V = users, E = (x,y) if x knows y (facebook, address book, email, ..)

Definition
Directed graph G = (V,E)
 V = URLs, E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN & no OUT)

Three key properties:
 Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution
Altavista crawl, 1999; WebBase crawl, 2001
Indegree follows a power law distribution:

Pr[ in-degree(u) = k ]  ≈  1 / k^a,   a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)
 V = URLs, E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN, no OUT)

Three key properties:
 Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1
 Locality: usually most of the hyperlinks point to other URLs on the same host (about 80%).
 Similarity: pages close in lexicographic order tend to share many outgoing lists

A Picture of the Web Graph
[Scatter plot of the link matrix: axes i and j index URL-sorted pages; 21 million pages, 150 million links. Pages from the same host (e.g. Berkeley, Stanford) cluster into dense blocks.]

URL-sorting
URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



What if the sender has never seen the data at the receiver?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77-scheme provides an efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
            Emacs size    Emacs time
uncompr     27Mb          ---
gzip        8Mb           35 secs
zdelta      1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: example of the weighted graph GF with the dummy node 0; edge weights are zdelta/gzip sizes, and the min branching picks the cheapest reference for each file.]

            space    time
uncompr     30Mb     ---
tgz         20%      linear
THIS        8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F, thus saving zdelta executions. Nonetheless, strictly n² time.

            space    time
uncompr     260Mb    ---
tgz         12%      2 mins
THIS        8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
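A much simplified, rsync-flavoured sketch of the idea (fixed-size blocks of f_old are hashed; the sender scans f_new and emits block references or literal bytes; here the weak hash is recomputed instead of rolled and MD5 stands in for the strong check, so this is not the real rsync protocol):

import hashlib

def weak_hash(chunk):
    # Adler-like checksum; rsync rolls it in O(1) per shifted byte,
    # here it is simply recomputed to keep the sketch short.
    a = sum(chunk) % 65521
    b = sum((len(chunk) - i) * c for i, c in enumerate(chunk)) % 65521
    return (b << 16) | a

def block_hashes(old, bs):
    # receiver side: weak + strong hash for every block of f_old
    table = {}
    for off in range(0, len(old), bs):
        blk = old[off:off + bs]
        table.setdefault(weak_hash(blk), {})[hashlib.md5(blk).hexdigest()] = off
    return table

def delta(new, table, bs):
    # sender side: slide over f_new, emit block references or literal bytes
    out, i = [], 0
    while i < len(new):
        win = new[i:i + bs]
        off = table.get(weak_hash(win), {}).get(hashlib.md5(win).hexdigest())
        if off is not None and len(win) == bs:
            out.append(("copy", off)); i += bs
        else:
            out.append(("lit", new[i])); i += 1
    return out

old = b"the quick brown fox jumps over the lazy dog"
new = b"the quick brown cat jumps over the lazy dog!"
print(delta(new, block_hashes(old, 8), 8))   # copies for unchanged blocks, literals around the edit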

Rsync: some experiments

           gcc size    emacs size
total      27288       27326
gzip       7563        8577
zdelta     227         1431
rsync      964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends the hashes (unlike rsync, where the client sends them), and the client checks them.
The server deploys the common fref to compress the new ftar (rsync compresses it without deploying fref).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits.

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff
P is a prefix of the i-th suffix of T (i.e. T[i,N])

Occurrences of P in T = all suffixes of T having P as a prefix
e.g. P = si, T = mississippi  ⟹  occurrences at positions 4, 7

SUF(T) = sorted set of suffixes of T

Reduction: from substring search to prefix search

The Suffix Tree
T# = mississippi#   (positions 1..12)
[Figure: the suffix tree of T#; edges are labelled with substrings (e.g. #, i, si, ssi, ppi#, pi#, i#, mississippi#, …) and the 12 leaves store the starting positions of the suffixes.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

Storing SUF(T) explicitly takes Θ(N²) space; store instead the suffix pointers:
T = mississippi#
SA = 12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3
SUF(T) = #, i#, ippi#, issippi#, ississippi#, mississippi#, pi#, ppi#, sippi#, sissippi#, ssippi#, ssissippi#

Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
⟹ In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step.
T = mississippi#, P = si: compare P with the suffix pointed to by the middle SA entry; if P is larger, recurse on the right half; if P is smaller, recurse on the left half.

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
⟹ overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
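A small sketch of the indirect binary search above (the naive O(n² log n) SA construction is only for the toy example):

def build_sa(T):
    return sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])   # 1-based positions

def search(T, SA, P):
    def suffix(k):
        return T[SA[k] - 1:]
    lo, hi = 0, len(SA)
    while lo < hi:                          # first suffix >= P
        mid = (lo + hi) // 2
        if suffix(mid) < P:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    lo, hi = first, len(SA)
    while lo < hi:                          # first suffix not prefixed by P
        mid = (lo + hi) // 2
        if suffix(mid).startswith(P):
            lo = mid + 1
        else:
            hi = mid
    return sorted(SA[first:lo])             # starting positions of the occurrences

T = "mississippi#"
SA = build_sa(T)
print(SA)                    # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(search(T, SA, "si"))   # [4, 7]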

Locating the occurrences
T = mississippi#, P = si
Binary search for the boundaries of the range of suffixes prefixed by si (conceptually, between si# and si$): the range contains sippi# and sissippi#, so occ = 2 and the occurrences start at positions 4 and 7.

Suffix Array search
• O(p + log2 N + occ) time

Suffix Trays: O(p + log2 |S| + occ)   [Cole et al., ‘06]
String B-tree   [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays   [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

T = mississippi#
SA  = 12 11  8  5  2  1 10  9  7  4  6  3
Lcp =     0  1  1  4  0  0  1  0  2  1  3
e.g. the Lcp between issippi# and ississippi# is 4.

• How long is the common prefix between T[i,...] and T[j,...]?
  → the min of the subarray Lcp[h,k-1] such that SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L?
  → search for an entry Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times?
  → search for a window Lcp[i,i+C-2] whose entries are all ≥ L.


Slide 206

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with i-th bit of U(T[j]) to estabilish if both are true

An example j=1
1 2 3 4 5 6 7 8 9 10

n
m

1

12345

1

0

P=abaac

2

0

3

0

4

0

5

0

T=xabxabaaca

0
 
0
U (x)  0
 
0
 
0

2

3

4

5

6

7

1 
 
0
BitShift ( M ( 0 )) & U (T [1])   0  &
 
0
 
0

8

9

0 0
   
0 0
0  0
   
0 0
   
0 0

1
0

An example j=2
1 2 3 4 5 6 7 8 9 10

n
m

1

2

12345

1

0

1

P=abaac

2

0

0

3

0

0

4

0

0

5

0

0

T=xabxabaaca

1 
 
0
U (a )   1 
 
1 
 
0

3

4

5

6

7

1 
 
0
BitShift ( M (1)) & U (T [ 2 ])   0  &
 
0
 
0

8

9

1  1 
   
0 0
1    0 
   
1   0 
   
0 0

1
0

An example j=3
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

12345

1

0

1

0

P=abaac

2

0

0

1

3

0

0

0

4

0

0

0

5

0

0

0

T=xabxabaaca

0
 
1 
U (b )   0 
 
0
 
0

4

5

6

7

1 
 
1 
BitShift ( M ( 2 )) & U (T [ 3 ])   0  &
 
0
 
0

8

9

0 0
   
1  1 
0  0
   
0 0
   
0 0

1
0

An example j=9
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

4

5

6

7

8

9

12345

1

0

1

0

0

1

0

1

1

0

P=abaac

2

0

0

1

0

0

1

0

0

0

3

0

0

0

0

0

0

1

0

0

4

0

0

0

0

0

0

0

1

0

5

0

0

0

0

0

0

0

0

1

T=xabxabaaca

0
 
0
U (c )   0 
 
0
 
1 

1 
 
1 
BitShift ( M ( 8 )) & U (T [ 9 ])   0  &
 
0
 
1 

0 0
   
0 0
0  0
   
0 0
   
1  1 

1
0

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring, allowing at most k mismatches.

Dictionary = { bzip, not, or };  P = bot, k = 2;  S = “bzip or not bzip”

[figure: C(S) scanned with the approximate pattern; the codewords within 2 mismatches of “bot” are marked]

Agrep: Shift-And method with errors

We extend the Shift-And method for finding inexact occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position 4; it also occurs with 4 mismatches starting at position 2.

aatatccacaa            aatatccacaa
   atcgaa                atcgaa
(2 mismatches, start 4)  (4 mismatches, start 2)

Agrep

Our current goal: given k, find all the occurrences of P in T with up to k mismatches.
We define the matrix M^l to be an m-by-n binary matrix such that:

M^l(i,j) = 1  iff  there are at most l mismatches between the first i characters of P and the i characters of T ending at position j.

What is M^0?
How does M^k solve the k-mismatch problem?

Computing M^k

We compute M^l for all l = 0, …, k.
For each j compute M^0(j), M^1(j), …, M^k(j).
For all l initialize M^l(0) to the zero vector.
In order to compute M^l(j), we observe that there is a match iff one of the two cases below holds.

Computing M^l: case 1

The first i-1 characters of P match a substring of T ending at j-1, with at most l mismatches, and the next pair of characters in P and T are equal.

[figure: P[1..i-1] aligned against a substring of T ending at j-1, with the mismatching positions starred]

This case is captured by  BitShift(M^l(j-1)) & U(T[j]).

Computing M^l: case 2

The first i-1 characters of P match a substring of T ending at j-1, with at most l-1 mismatches.

[figure: P[1..i-1] aligned against a substring of T ending at j-1, with the mismatching positions starred]

This case is captured by  BitShift(M^(l-1)(j-1)).

Computing M^l

We compute M^l for all l = 0, …, k.
For each j compute M^0(j), M^1(j), …, M^k(j).
For all l initialize M^l(0) to the zero vector.
In order to compute M^l(j), we observe that there is a match iff

M^l(j) = [ BitShift(M^l(j-1)) & U(T[j]) ]  OR  BitShift(M^(l-1)(j-1))
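A minimal C sketch of this recurrence for the k-mismatch problem (single-word case m ≤ w); the function name agrep_mismatch and the fixed bound KMAX are our own choices, not taken from the slides.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define KMAX 8   /* illustrative bound on the number of allowed mismatches */

/* Report positions where P occurs in T with at most k mismatches (|P| <= 64). */
void agrep_mismatch(const char *T, const char *P, int k) {
    size_t n = strlen(T), m = strlen(P);
    if (k > KMAX) k = KMAX;
    uint64_t U[256] = {0};
    for (size_t i = 0; i < m; i++)
        U[(unsigned char)P[i]] |= (uint64_t)1 << i;

    uint64_t M[KMAX + 1] = {0};                /* M[l] holds the column M^l(j) */
    for (size_t j = 0; j < n; j++) {
        uint64_t prev_old = M[0];              /* M^(l-1)(j-1) for l-1 = 0     */
        /* exact level: the usual Shift-And step */
        M[0] = ((M[0] << 1) | 1) & U[(unsigned char)T[j]];
        for (int l = 1; l <= k; l++) {
            uint64_t cur_old = M[l];
            /* case 1: extend with a matching char; case 2: spend one mismatch */
            M[l] = (((cur_old << 1) | 1) & U[(unsigned char)T[j]])
                 | ((prev_old << 1) | 1);
            prev_old = cur_old;
        }
        if (M[k] & ((uint64_t)1 << (m - 1)))
            printf("occurrence with <=%d mismatches ending at %zu\n", k, j + 1);
    }
}

int main(void) {
    agrep_mismatch("aatatccacaa", "atcgaa", 2);   /* ending at 9 (start 4) */
    return 0;
}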

Example M^1
T = xabxabaaca,  P = abaad

M^1 (rows i = 1..5, columns j = 1..10):
  j:  1 2 3 4 5 6 7 8 9 10
  1:  1 1 1 1 1 1 1 1 1 1
  2:  0 0 1 0 0 1 0 1 1 0
  3:  0 0 0 1 0 0 1 0 0 1
  4:  0 0 0 0 1 0 0 1 0 0
  5:  0 0 0 0 0 0 0 0 1 0

M^0 (rows i = 1..5, columns j = 1..10):
  j:  1 2 3 4 5 6 7 8 9 10
  1:  0 1 0 0 1 0 1 1 0 1
  2:  0 0 1 0 0 1 0 0 0 0
  3:  0 0 0 0 0 0 1 0 0 0
  4:  0 0 0 0 0 0 0 1 0 0
  5:  0 0 0 0 0 0 0 0 0 0

Note: M^1(5,9) = 1, so P occurs with at most one mismatch ending at position 9 (the text contains abaac there, one substitution away from abaad), while M^0 has no 1 in its last row.

How much do we pay?

The running time is O(kn(1 + m/w)).
Again, the method is practically efficient for small m.
Still, only O(k) columns of M are needed at any given time; hence the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring, allowing k mismatches.

Dictionary = { bzip, not, or };  P = bot, k = 2;  S = “bzip or not bzip”
not = 1 g 0 g 0 a

[figure: C(S) scanned with the Shift-And-with-errors automaton; the codeword of “not” (within 2 mismatches of “bot”) is marked]

Agrep: more sophisticated operations

The Shift-And method can solve other ops.

The edit distance between two strings p and s is d(p,s) = the minimum number of operations needed to transform p into s via three ops:
 Insertion: insert a symbol in p
 Deletion: delete a symbol from p
 Substitution: change a symbol in p with a different one

Example: d(ananas, banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding
γ(x) = (Length − 1) zeros, followed by x written in binary,
where x > 0 and Length = ⌊log2 x⌋ + 1.
e.g., 9 is represented as <000,1001>.

The γ-code for x takes 2⌊log2 x⌋ + 1 bits  (i.e., a factor of 2 from optimal).
It is optimal for Pr(x) = 1/(2x²) and i.i.d. integers.
It is a prefix-free encoding…


Given the following sequence of γ-coded integers, reconstruct the original sequence:

0001000001100110000011101100111

Answer: 8, 6, 3, 59, 7
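A small C sketch of γ-encoding/decoding over a bit string, just to make the exercise mechanical; the representation of the bit string as a char array of '0'/'1' is our own simplification.

#include <stdio.h>
#include <string.h>

/* Append the gamma code of x (> 0) to the '0'/'1' string out. */
void gamma_encode(unsigned x, char *out) {
    char bin[32]; int len = 0;
    for (unsigned v = x; v > 0; v >>= 1) bin[len++] = '0' + (v & 1);
    for (int i = 0; i < len - 1; i++) strcat(out, "0");     /* Length-1 zeros */
    for (int i = len - 1; i >= 0; i--) {                    /* x in binary    */
        char b[2] = { bin[i], 0 };
        strcat(out, b);
    }
}

/* Decode a concatenation of gamma codes and print the integers. */
void gamma_decode(const char *bits) {
    size_t i = 0, n = strlen(bits);
    while (i < n) {
        int zeros = 0;
        while (bits[i] == '0') { zeros++; i++; }            /* count leading 0s */
        unsigned x = 0;
        for (int j = 0; j <= zeros; j++) x = (x << 1) | (bits[i++] - '0');
        printf("%u ", x);
    }
    printf("\n");
}

int main(void) {
    gamma_decode("0001000001100110000011101100111");        /* 8 6 3 59 7 */
    char buf[128] = "";
    gamma_encode(9, buf);
    printf("%s\n", buf);                                     /* 0001001    */
    return 0;
}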

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).

Recall that: |γ(i)| ≤ 2 log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2 H0(s) + 1
Key fact:
1 ≥ Σ_{i=1,...,x} pi ≥ x · px   ⇒   x ≤ 1/px

How good is it?
Encode the integers via γ-coding: |γ(i)| ≤ 2 log i + 1.
The cost of the encoding is (recall i ≤ 1/pi):

Σ_{i=1,...,|Σ|} pi · |γ(i)|  ≤  Σ_{i=1,...,|Σ|} pi · [ 2 log(1/pi) + 1 ]  =  2 H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + …

A better encoding

Byte-aligned and tagged Huffman
 128-ary Huffman tree
 First bit of the first byte is tagged
 Configurations on 7 bits: just those of Huffman

End-tagged dense code
 The rank r is mapped to the r-th binary sequence on 7·k bits
 First bit of the last byte is tagged

A better encoding: surprising changes
 It is a prefix code
 Better compression: it uses all 7-bit configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 256 − s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move-to-Front Coding
Transforms a char sequence into an integer sequence, which can then be var-length coded.

 Start with the list of symbols L = [a,b,c,d,…]
 For each input symbol s
   1) output the position of s in L
   2) move s to the front of L

There is a memory.
Properties:
 It exploits temporal locality, and it is dynamic.
 X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n² log n),  MTF = O(n log n) + n²

Not much worse than Huffman… but it may be far better. A minimal sketch follows.
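A minimal MTF sketch in C over the byte alphabet; the list is kept in a plain array (the tree/hash organization of the next slides is a separate optimization), and the test string is our own.

#include <stdio.h>
#include <string.h>

/* Move-to-Front: write in out[i] the position (0-based) of in[i] in the
   current list, then move that symbol to the front of the list. */
void mtf_encode(const unsigned char *in, size_t n, unsigned char *out) {
    unsigned char list[256];
    for (int i = 0; i < 256; i++) list[i] = (unsigned char)i;
    for (size_t i = 0; i < n; i++) {
        int pos = 0;
        while (list[pos] != in[i]) pos++;        /* position in the list  */
        out[i] = (unsigned char)pos;
        memmove(list + 1, list, pos);            /* shift the prefix down */
        list[0] = in[i];                         /* move symbol to front  */
    }
}

int main(void) {
    const char *s = "aaabbbbaaa";
    unsigned char out[32];
    mtf_encode((const unsigned char *)s, strlen(s), out);
    for (size_t i = 0; i < strlen(s); i++) printf("%d ", out[i]);
    printf("\n");   /* 97 0 0 98 0 0 0 1 0 0 : runs become runs of zeros */
    return 0;
}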

MTF: how good is it?
Encode the integers via γ-coding: |γ(i)| ≤ 2 log i + 1.
Put the alphabet Σ at the front of the list and consider the cost of encoding:

cost ≤ O(|Σ| log |Σ|) + Σ_{x=1,...,|Σ|} Σ_{i=2,...,n_x} |γ( p_x^i − p_x^{i−1} )|

where p_x^i is the position of the i-th occurrence of symbol x. By Jensen’s inequality:

 ≤ O(|Σ| log |Σ|) + Σ_{x=1,...,|Σ|} n_x · [ 2 log(N/n_x) + 1 ]  =  O(|Σ| log |Σ|) + N · [ 2 H0(X) + 1 ]

Hence  La[mtf] ≤ 2 H0(X) + O(1).

MTF: higher compression
To achieve higher compression we consider words (and separators) as symbols to be encoded.
How to keep the MTF-list efficiently:

 Search tree
   Leaves contain the symbols, ordered as in the MTF-list
   Nodes contain the size of their descending subtree
 Hash table
   key is a symbol
   data is a pointer to the corresponding tree leaf

Each tree operation takes O(log |Σ|) time.
Total is O(n log |Σ|), where n = #symbols to be compressed.

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca ⇒ (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings ⇒ just the run lengths and one starting bit.
Properties:
 It exploits spatial locality, and it is a dynamic code. There is a memory.
 X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)
A minimal sketch follows.
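A minimal run-length encoder in C, matching the (symbol, run-length) form on the slide.

#include <stdio.h>
#include <string.h>

/* Print the run-length encoding of s as (symbol,length) pairs. */
void rle(const char *s) {
    size_t n = strlen(s);
    for (size_t i = 0; i < n; ) {
        size_t j = i;
        while (j < n && s[j] == s[i]) j++;     /* extend the current run */
        printf("(%c,%zu)", s[i], j - i);
        i = j;
    }
    printf("\n");
}

int main(void) {
    rle("abbbaacccca");    /* (a,1)(b,3)(a,2)(c,4)(a,1) */
    return 0;
}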

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive).

e.g.  p(a) = .2, p(b) = .5, p(c) = .3
f(i) = Σ_{j=1,...,i-1} p(j)
f(a) = .0, f(b) = .2, f(c) = .7
so  a → [0.0, 0.2),  b → [0.2, 0.7),  c → [0.7, 1.0)

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7)).

Arithmetic Coding: Encoding Example
Coding the message sequence: bac  (with p(a) = .2, p(b) = .5, p(c) = .3)

start:    [0.0, 1.0)
after b:  [0.2, 0.7)
after a:  [0.2, 0.3)
after c:  [0.27, 0.3)

The final sequence interval is [.27, .3).

Arithmetic Coding
To code a sequence of symbols c_1 c_2 … c_n with probabilities p[c], use the following:

l_0 = 0     l_i = l_{i-1} + s_{i-1} · f[c_i]
s_0 = 1     s_i = s_{i-1} · p[c_i]

f[c] is the cumulative probability up to symbol c (not included).
The final interval size is  s_n = Π_{i=1,...,n} p[c_i].

The interval for a message sequence will be called the sequence interval.
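A tiny C sketch of the sequence-interval recurrence above, with the running example's probabilities hard-coded; real coders use the integer/scaling version sketched later, so this is only the real-valued illustration.

#include <stdio.h>
#include <string.h>

/* Running example: alphabet {a,b,c} with p(a)=.2, p(b)=.5, p(c)=.3. */
static double p_of(char c) { return c == 'a' ? 0.2 : c == 'b' ? 0.5 : 0.3; }
static double f_of(char c) { return c == 'a' ? 0.0 : c == 'b' ? 0.2 : 0.7; }

int main(void) {
    const char *msg = "bac";
    double l = 0.0, s = 1.0;                  /* l_0 = 0, s_0 = 1             */
    for (size_t i = 0; i < strlen(msg); i++) {
        l = l + s * f_of(msg[i]);             /* l_i = l_{i-1} + s_{i-1}*f[c] */
        s = s * p_of(msg[i]);                 /* s_i = s_{i-1} * p[c]         */
        printf("after %c: [%.4f, %.4f)\n", msg[i], l, l + s);
    }
    return 0;                                 /* final interval [0.27, 0.30)  */
}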

Uniquely defining an interval
Important property: the intervals for distinct messages of length n never overlap.
Therefore, specifying any number in the final interval uniquely determines the message.
Decoding is similar to encoding, but at each step we need to determine what the message value is and then reduce the interval.

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3 (p(a) = .2, p(b) = .5, p(c) = .3):

[0.0, 1.0):    .49 ∈ [0.2, 0.7)    → b
[0.2, 0.7):    .49 ∈ [0.3, 0.55)   → b
[0.3, 0.55):   .49 ∈ [0.475, 0.55) → c

The message is bbc.

Representing a real number
Binary fractional representation:
.75 = .11      1/3 = .0101…      11/16 = .1011

Algorithm
 1. x = 2·x
 2. If x < 1, output 0
 3. else x = x − 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
e.g.  [0, .33) → .01      [.33, .66) → .1      [.66, 1) → .11

Representing a code interval
We can view binary fractional numbers as intervals by considering all completions.

  code   min      max      interval
  .11    .110…    .111…    [.75, 1.0)
  .101   .1010…   .1011…   [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval is contained in the sequence interval (a dyadic number).

e.g. sequence interval [.61, .79); the code interval of .101 is [.625, .75) ⊆ [.61, .79)

Can use L + s/2 truncated to 1 + ⌈log(1/s)⌉ bits.

Bound on Arithmetic length

Note that −log s + 1 = log(2/s).

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most
1 + ⌈log(1/s)⌉ = 1 + ⌈log Π_i (1/p_i)⌉
             ≤ 2 + Σ_{j=1,...,n} log(1/p_j)
             = 2 + Σ_{k=1,...,|Σ|} n·p_k·log(1/p_k)
             = 2 + n·H0  bits

In practice ≈ n·H0 + 0.02·n bits, because of rounding.

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 (top half):
  Output 1 followed by m 0s; m = 0
  Message interval is expanded by 2

If u < R/2 (bottom half):
  Output 0 followed by m 1s; m = 0
  Message interval is expanded by 2

If l ≥ R/4 and u < 3R/4 (middle half):
  Increment m
  Message interval is expanded by 2

All other cases: just continue…

You find this at

Arithmetic ToolBox
As a state machine: the ATB takes the current interval (L,s) and a symbol c with distribution (p_1, …, p_|Σ|), and produces the new interval (L’,s’).

Therefore, even the distribution can change over time.

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
At each step the ATB is driven by the conditional distribution p[ s | context ], where s is either a real character c or the escape symbol esc; the interval (L,s) is mapped to (L’,s’) as before.

Encoder and Decoder must know the protocol for selecting the same conditional probability distribution (PPM-variant).

PPM: Example Contexts   (k = 2)
String = ACCBACCACBA B

Context      Counts
(empty)      A = 4   B = 2   C = 5   $ = 3

Context      Counts
A            C = 3   $ = 1
B            A = 2   $ = 1
C            A = 1   B = 2   C = 2   $ = 3

Context      Counts
AC           B = 1   C = 2   $ = 2
BA           C = 1   $ = 1
CA           C = 1   $ = 1
CB           A = 2   $ = 1
CC           A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
[figure: the text with a cursor; the dictionary is the set of all substrings starting in the already-scanned part; the triple emitted here is <2,3,c>]

Algorithm’s step:
 Output <d, len, c>
   d   = distance of the copied string wrt the current position
   len = length of the longest match
   c   = next char in the text beyond the longest match
 Advance by len + 1

A buffer “window” of fixed length slides over the text.

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character
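A naive C sketch of the windowed LZ77 step (linear scan for the longest match inside the window); the variable names and the O(n·W) search are our own simplifications of the scheme described above.

#include <stdio.h>
#include <string.h>

#define W 6   /* window size, as in the example above */

/* Emit the LZ77 parsing of s as triples (d,len,c). */
void lz77(const char *s) {
    size_t n = strlen(s), cur = 0;
    while (cur < n) {
        size_t best_len = 0, best_d = 0;
        size_t start = cur > W ? cur - W : 0;
        for (size_t i = start; i < cur; i++) {        /* candidate copy sources */
            size_t len = 0;
            while (cur + len + 1 < n && s[i + len] == s[cur + len]) len++;
            if (len > best_len) { best_len = len; best_d = cur - i; }
        }
        printf("(%zu,%zu,%c) ", best_d, best_len, s[cur + best_len]);
        cur += best_len + 1;                          /* advance by len + 1     */
    }
    printf("\n");
}

int main(void) {
    lz77("aacaacabcabaaac");   /* (0,0,a)(1,1,c)(3,4,b)(3,3,a)(1,2,c) */
    return 0;
}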

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
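An LZW encoder sketch in C; the dictionary is a flat array searched linearly and starts with the single-byte codes, so the emitted ids differ from the illustrative ones on the slide (a = 112), but the parsing structure is the same.

#include <stdio.h>
#include <string.h>

#define MAXD 512
#define MAXS 64

static char dict[MAXD][MAXS];
static int  dsize = 0;

static int find(const char *w) {                 /* index of w in dict, or -1 */
    for (int i = 0; i < dsize; i++)
        if (strcmp(dict[i], w) == 0) return i;
    return -1;
}

/* LZW: output the id of the longest dictionary match S, then add S+c to the dict. */
void lzw_encode(const char *s) {
    for (int c = 1; c < 256; c++) {              /* single-char initial entries */
        dict[dsize][0] = (char)c; dict[dsize][1] = 0; dsize++;
    }
    char w[MAXS] = "";
    for (size_t i = 0; s[i]; i++) {
        char wc[MAXS];
        snprintf(wc, sizeof wc, "%s%c", w, s[i]);
        if (find(wc) >= 0) {                     /* keep extending the match   */
            strcpy(w, wc);
        } else {
            printf("%d ", find(w));              /* emit the id of S           */
            if (dsize < MAXD) strcpy(dict[dsize++], wc);   /* add S+c          */
            w[0] = s[i]; w[1] = 0;               /* restart the match from c   */
        }
    }
    if (w[0]) printf("%d", find(w));             /* flush the last match       */
    printf("\n");
}

int main(void) {
    lzw_encode("aabaacababacb");   /* the example string of the slides */
    return 0;
}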

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform   (1994)
Let us be given a text T = mississippi#

Rotations of T:        Sorted rows:
mississippi#           #mississippi
ississippi#m           i#mississipp
ssissippi#mi           ippi#mississ
sissippi#mis           issippi#miss
issippi#miss           ississippi#m
ssippi#missi           mississippi#
sippi#missis           pi#mississip
ippi#mississ           ppi#mississi
ppi#mississi           sippi#missis
pi#mississip           sissippi#mis
i#mississipp           ssippi#missi
#mississippi           ssissippi#mi

F = first column = #iiiimppssss
L = last column  = ipssm#pissii   (this is the BWT of T)

A famous example

Much
longer...

A useful tool: L → F mapping

[figure: the sorted BWT matrix with columns F and L; the middle of each row is unknown during inversion]

How do we map L’s chars onto F’s chars?
… we need to distinguish equal chars in F…

Take two equal chars of L:
 rotate their rows rightward by one position;
 they keep the same relative order !!

The BWT is invertible

[figure: the sorted matrix with columns F = #iiiimppssss and L = ipssm#pissii; the middle is unknown]

Two key properties:
1. The LF-array maps L’s chars to F’s chars
2. L[i] precedes F[i] in T

Reconstruct T backward:  T = …. i p p i #

InvertBWT(L)
  Compute LF[0,n-1];
  r = 0; i = n;
  while (i > 0) {
    T[i] = L[r];
    r = LF[r]; i--;
  }
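A runnable C sketch of InvertBWT; LF is computed by the standard counting argument (the i-th occurrence of a char in L maps to its i-th occurrence in F), which is consistent with the two properties above even though the slides do not spell out this construction.

#include <stdio.h>
#include <string.h>

/* Invert the BWT string L (with '#' as terminator) into T. */
void invert_bwt(const char *L, char *T) {
    int n = (int)strlen(L);
    int count[256] = {0}, first[256] = {0};
    int LF[4096];                                  /* sized for this sketch    */

    for (int i = 0; i < n; i++) count[(unsigned char)L[i]]++;
    for (int c = 1; c < 256; c++)                  /* first[c] = position of   */
        first[c] = first[c - 1] + count[c - 1];    /* the first c in column F  */

    int seen[256] = {0};
    for (int i = 0; i < n; i++) {                  /* i-th occurrence in L ... */
        unsigned char c = (unsigned char)L[i];     /* ... = i-th occurrence in F */
        LF[i] = first[c] + seen[c]++;
    }

    int r = 0;                    /* row 0 of the sorted matrix starts with '#' */
    T[n - 1] = '#';               /* the terminator ends T                      */
    for (int i = n - 2; i >= 0; i--) {             /* fill T backward           */
        T[i] = L[r];
        r = LF[r];
    }
    T[n] = '\0';
}

int main(void) {
    char T[64];
    invert_bwt("ipssm#pissii", T);
    printf("%s\n", T);                             /* mississippi#              */
    return 0;
}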

How to compute the BWT ?

  SA    sorted rotation    L
  12    #mississippi       i
  11    i#mississipp       p
   8    ippi#mississ       s
   5    issippi#miss       s
   2    ississippi#m       m
   1    mississippi#       #
  10    pi#mississip       p
   9    ppi#mississi       i
   7    sippi#missis       s
   4    sissippi#mis       s
   6    ssippi#missi       i
   3    ssissippi#mi       i

We said that: L[i] precedes F[i] in T.
Given SA and T, we have L[i] = T[SA[i] − 1]   (e.g. L[3] = T[SA[3] − 1] = T[7]).

How to construct SA from T ?
Input: T = mississippi#

  SA    suffix
  12    #
  11    i#
   8    ippi#
   5    issippi#
   2    ississippi#
   1    mississippi#
  10    pi#
   9    ppi#
   7    sippi#
   4    sissippi#
   6    ssippi#
   3    ssissippi#

Elegant but inefficient.
Obvious inefficiencies:
 • Θ(n² log n) time in the worst case
 • Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |Σ|+1 symbols...
... plus γ(16), plus the original MTF-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions

Weakly connected components (WCC)
 Set of nodes such that from any node one can reach any other node via an undirected path.

Strongly connected components (SCC)
 Set of nodes such that from any node one can reach any other node via a directed path.

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…

Physical network graph
 V = routers
 E = communication links

The “cosine” graph (undirected, weighted)
 V = static web pages
 E = semantic distance between pages

Query-Log graph (bipartite, weighted)
 V = queries and URLs
 E = (q,u) if u is a result for q and has been clicked by some user who issued q

Social graph (undirected, unweighted)
 V = users
 E = (x,y) if x knows y (facebook, address book, email, ..)

Definition
Directed graph G = (V,E):
 V = URLs, E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN & no OUT)

Three key properties:
 Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution

Altavista crawl (1999) and WebBase crawl (2001): the indegree follows a power-law distribution

Pr[ in-degree(u) = k ]  ∝  1/k^α,   α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E):
 V = URLs, E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN, no OUT)

Three key properties:
 Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1
 Locality: usually most of the hyperlinks point to other URLs on the same host (about 80%).
 Similarity: pages close in lexicographic order tend to share many outgoing lists.

A Picture of the Web Graph

[figure: adjacency matrix of a crawl with 21 million pages and 150 million links; after URL-sorting the links cluster near the diagonal (e.g. the Berkeley and Stanford hosts)]

URL compression + Delta encoding

The library WebGraph
From the uncompressed adjacency list to an adjacency list with compressed gaps (locality):
Successor list S(x) = {s1 − x, s2 − s1 − 1, ..., sk − s(k-1) − 1}

For negative entries:
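A small C sketch of this gap transformation; the mapping of a possibly negative gap x to a non-negative integer, v(x) = 2x for x ≥ 0 and 2|x| − 1 for x < 0, is the one used in the interval/residual examples later in this slide set, and the successor list below is only illustrative.

#include <stdio.h>

/* Map a possibly-negative gap: 2x if x >= 0, 2|x|-1 if x < 0. */
static unsigned v(long x) { return x >= 0 ? (unsigned)(2 * x) : (unsigned)(-2 * x - 1); }

/* Print the gap-encoded successor list of node x: s1-x, then s_i - s_{i-1} - 1. */
void encode_successors(long x, const long *s, int k) {
    printf("%u ", v(s[0] - x));                         /* first gap may be negative */
    for (int i = 1; i < k; i++)
        printf("%lu ", (unsigned long)(s[i] - s[i - 1] - 1));   /* always >= 0       */
    printf("\n");
}

int main(void) {
    long succ[] = {13, 15, 16, 17, 18, 19, 23, 24, 203};  /* illustrative successors of node 15 */
    encode_successors(15, succ, 9);                        /* first value 3 = |13-15|*2-1        */
    return 0;
}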

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Consecutive runs among the extra-nodes are coded as intervals.

 Intervals: use their left extreme and length
 Interval length: decremented by Lmin = 2
 Residuals: differences between residuals, or wrt the source

Examples:
 0    = (15-15)·2        (positive)
 2    = (23-19)-2        (jump ≥ 2)
 600  = (316-16)·2
 3    = |13-15|·2-1      (negative)
 3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



What if the sender has never seen the data at the receiver?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression   (one-to-one)

Problem: we have two files fknown and fnew and the goal is to compute a file fd of minimum size such that fnew can be derived from fknown and fd.

 Assume that block moves and copies are allowed
 Find an optimal covering set of fnew based on fknown
 The LZ77 scheme provides an efficient, optimal solution:
   fknown is the “previously encoded text”; compress the concatenation fknown·fnew starting from fnew

zdelta is one of the best implementations.
           Emacs size   Emacs time
 uncompr   27 MB        ---
 gzip      8 MB         35 secs
 zdelta    1.5 MB       42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: we wish to compress a group of files F
 Useful on a dynamic collection of web pages, back-ups, …
 Apply pairwise zdelta: find for each f ∈ F a good reference

Reduction to the Min Branching problem on DAGs
 Build a weighted graph G_F: nodes = files, weights = zdelta-size
 Insert a dummy node connected to all, whose edge weights are the gzip-coding sizes
 Compute the min branching = directed spanning tree of min total cost, covering G’s nodes

[figure: a small weighted graph with a dummy node 0 and files 1,2,3,5; edge weights such as 20, 123, 220, 620, 2000 indicate delta/gzip sizes]

           space   time
 uncompr   30 MB   ---
 tgz       20%     linear
 THIS      8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
           space    time
 uncompr   260 MB   ---
 tgz       12%      2 mins
 THIS      8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm

[figure: the client sends the block hashes of f_old; the server replies with an encoded file built from matching blocks of f_new and literal data]

The rsync algorithm   (contd)

 simple, widely used, single roundtrip
 optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
 choice of block size problematic (default: max{700, √n} bytes)
 not good in theory: granularity of changes may disrupt use of blocks

A sketch of a rolling checksum is given below.
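A minimal rolling-checksum sketch in C in the spirit of rsync's weak hash (an Adler-style pair of sums); the exact constants and the pairing with a strong hash are omitted, so this is only illustrative.

#include <stdio.h>
#include <string.h>

#define MOD 65536u

/* Weak rolling checksum of a window: a = sum of bytes, b = sum of prefix sums. */
typedef struct { unsigned a, b; } Roll;

static Roll roll_init(const unsigned char *s, size_t len) {
    Roll r = {0, 0};
    for (size_t k = 0; k < len; k++) { r.a = (r.a + s[k]) % MOD; r.b = (r.b + r.a) % MOD; }
    return r;
}

/* Slide the window one byte to the right: drop 'out', add 'in'. */
static Roll roll_step(Roll r, unsigned char out, unsigned char in, size_t len) {
    r.a = (r.a + MOD - out + in) % MOD;
    r.b = (r.b + MOD - (unsigned)((len * out) % MOD) + r.a) % MOD;
    return r;
}

int main(void) {
    const unsigned char *s = (const unsigned char *)"abcdefgh";
    size_t len = 4;
    Roll r = roll_init(s, len);                        /* checksum of "abcd"       */
    for (size_t i = 1; i + len <= strlen((const char *)s); i++) {
        r = roll_step(r, s[i - 1], s[i + len - 1], len);
        Roll check = roll_init(s + i, len);            /* recompute from scratch   */
        printf("window %zu: rolled (%u,%u) direct (%u,%u)\n", i, r.a, r.b, check.a, check.b);
    }
    return 0;
}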

Rsync: some experiments

          gcc size   emacs size
 total    27288      27326
 gzip      7563       8577
 zdelta     227       1431
 rsync      964       4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends the hashes (unlike the client in rsync), and the client checks them.
The server deploys the common fref to compress the new ftar (rsync compresses just it).

A multi-round protocol
 k blocks of n/k elements, log(n/k) levels
 If the distance is k, then on each level at most k hashes do not find a match in the other file.
 The communication complexity is O(k lg n lg(n/k)) bits.

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff
P is a prefix of the i-th suffix of T (i.e. T[i,N]).

Occurrences of P in T = all suffixes of T having P as a prefix.
Example: P = si, T = mississippi → occurrences at positions 4, 7.

SUF(T) = sorted set of suffixes of T.

Reduction: from substring search to prefix search.

The Suffix Tree
T# = mississippi#   (positions 1..12)

[figure: the suffix tree of T#; edges are labelled with substrings (e.g. “ssi”, “ppi#”, “mississippi#”) and each leaf stores the starting position of its suffix]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

Storing SUF(T) explicitly takes Θ(N²) space; the suffix array stores only the suffix pointers.

T = mississippi#                 P = si

  SA    SUF(T)
  12    #
  11    i#
   8    ippi#
   5    issippi#
   2    ississippi#
   1    mississippi#
  10    pi#
   9    ppi#
   7    sippi#
   4    sissippi#
   6    ssippi#
   3    ssissippi#

Suffix Array
 • SA: Θ(N log2 N) bits
 • Text T: N chars
 → In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step.

P = si,  T = mississippi#
Compare P against the suffix pointed to by the middle entry of SA:
 if P is larger, recurse on the right half;
 if P is smaller, recurse on the left half.

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
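A C sketch of the O(p log n) binary search over the suffix array of mississippi#; the SA is hard-coded from the table above rather than built by a construction algorithm.

#include <stdio.h>
#include <string.h>

static const char *T = "mississippi#";
static const int SA[12] = {12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3};  /* 1-based positions */

/* Compare P against the suffix starting at 1-based position pos. */
static int cmp_suffix(const char *P, int pos) {
    return strncmp(P, T + pos - 1, strlen(P));
}

/* Return the SA index of the first suffix having P as a prefix, or -1. */
int sa_search(const char *P) {
    int lo = 0, hi = 11, ans = -1;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        int c = cmp_suffix(P, SA[mid]);
        if (c <= 0) {                    /* P <= suffix: go left, remember hit */
            if (c == 0) ans = mid;
            hi = mid - 1;
        } else {
            lo = mid + 1;                /* P is larger: go right              */
        }
    }
    return ans;
}

int main(void) {
    int i = sa_search("si");
    /* the occurrences are the contiguous SA entries whose suffix starts with P */
    while (i >= 0 && i < 12 && cmp_suffix("si", SA[i]) == 0)
        printf("occurrence at position %d\n", SA[i++]);   /* prints 7 and 4 */
    return 0;
}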

Locating the occurrences
After the binary search, the occurrences of P are the contiguous SA entries whose suffixes have P as a prefix: for P = si in T = mississippi# the range contains positions 7 and 4, so occ = 2 (the range is delimited by binary-searching si# and si$, using # < any symbol of Σ < $).

Suffix Array search
 • O(p + log2 N + occ) time

Suffix Trays: O(p + log2 |Σ| + occ)   [Cole et al., ’06]
String B-tree   [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays   [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA.

T = mississippi#
SA  = 12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3
Lcp =  0,  0, 1, 4, 0, 0,  1, 0, 2, 1, 3
(e.g. the entry 4 records the prefix shared by issippi# and ississippi#)

• How long is the common prefix between T[i,…] and T[j,…]?
  It is the min of the subarray Lcp[h,k-1] such that SA[h] = i and SA[k] = j.
• Does there exist a repeated substring of length ≥ L?
  Search for some Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times?
  Search for a window Lcp[i, i+C-2] whose entries are all ≥ L.


Slide 207

An example j=2
1 2 3 4 5 6 7 8 9 10

n
m

1

2

12345

1

0

1

P=abaac

2

0

0

3

0

0

4

0

0

5

0

0

T=xabxabaaca

1 
 
0
U (a )   1 
 
1 
 
0

3

4

5

6

7

1 
 
0
BitShift ( M (1)) & U (T [ 2 ])   0  &
 
0
 
0

8

9

1  1 
   
0 0
1    0 
   
1   0 
   
0 0

1
0

An example j=3
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

12345

1

0

1

0

P=abaac

2

0

0

1

3

0

0

0

4

0

0

0

5

0

0

0

T=xabxabaaca

0
 
1 
U (b )   0 
 
0
 
0

4

5

6

7

1 
 
1 
BitShift ( M ( 2 )) & U (T [ 3 ])   0  &
 
0
 
0

8

9

0 0
   
1  1 
0  0
   
0 0
   
0 0

1
0

An example j=9
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

4

5

6

7

8

9

12345

1

0

1

0

0

1

0

1

1

0

P=abaac

2

0

0

1

0

0

1

0

0

0

3

0

0

0

0

0

0

1

0

0

4

0

0

0

0

0

0

0

1

0

5

0

0

0

0

0

0

0

0

1

T=xabxabaaca

0
 
0
U (c )   0 
 
0
 
1 

1 
 
1 
BitShift ( M ( 8 )) & U (T [ 9 ])   0  &
 
0
 
1 

0 0
   
0 0
0  0
   
0 0
   
1  1 

1
0

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep

 Our current goal: given k, find all the occurrences of P in T with up to k mismatches.
 We define the matrix Ml to be an m by n binary matrix, such that:

   Ml(i,j) = 1 iff there are at most l mismatches between the first i characters of P and the i characters of T ending at position j.

 What is M0?
 How does Mk solve the k-mismatch problem?

Computing Mk

 We compute Ml for all l = 0, …, k.
 For each j compute M0(j), M1(j), …, Mk(j).
 For all l, initialize Ml(0) to the zero vector.
 In order to compute Ml(j), we observe that there is a match iff one of the two cases below holds.

Computing Ml: case 1

 The first i-1 characters of P match a substring of T ending at j-1, with at most l mismatches, and the next pair of characters in P and T are equal.

   BitShift( Ml(j-1) ) & U(T[j])

Computing Ml: case 2

 The first i-1 characters of P match a substring of T ending at j-1, with at most l-1 mismatches.

   BitShift( Ml-1(j-1) )

Computing Ml

 We compute Ml for all l = 0, …, k.
 For each j compute M0(j), M1(j), …, Mk(j).
 For all l, initialize Ml(0) to the zero vector.
 In order to compute Ml(j), we observe that there is a match iff

   Ml(j) = [ BitShift( Ml(j-1) ) & U(T[j]) ]  OR  BitShift( Ml-1(j-1) )
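
A direct transcription of this recurrence into Python might look as follows (a sketch, not the course's reference code): each Ml(j) is kept as one integer bitmap, and only the current and previous columns are stored.

   def agrep_mismatches(T, P, k):
       """Shift-And with up to k mismatches: returns the end positions (1-based)
       of all substrings of T matching P with at most k mismatches."""
       m = len(P)
       U = {}
       for i, c in enumerate(P):
           U[c] = U.get(c, 0) | (1 << i)
       occ_bit = 1 << (m - 1)
       prev = [0] * (k + 1)           # prev[l] = M^l(j-1)
       out = []
       for j, c in enumerate(T, start=1):
           Uc = U.get(c, 0)
           cur = [0] * (k + 1)
           cur[0] = ((prev[0] << 1) | 1) & Uc        # l = 0: exact Shift-And
           for l in range(1, k + 1):
               # case 1: prefix with <= l mismatches and current chars equal
               # case 2: prefix with <= l-1 mismatches (current char may mismatch)
               cur[l] = (((prev[l] << 1) | 1) & Uc) | ((prev[l - 1] << 1) | 1)
           if cur[k] & occ_bit:
               out.append(j)
           prev = cur
       return out

   print(agrep_mismatches("aatatccacaa", "atcgaa", 2))   # -> [9]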

Example M1
T = x a b x a b a a c a   (positions 1..10)
P = a b a a d

M1 =
   i=1:  1 1 1 1 1 1 1 1 1 1
   i=2:  0 0 1 0 0 1 0 1 1 0
   i=3:  0 0 0 1 0 0 1 0 0 1
   i=4:  0 0 0 0 1 0 0 1 0 0
   i=5:  0 0 0 0 0 0 0 0 1 0

M0 =
   i=1:  0 1 0 0 1 0 1 1 0 1
   i=2:  0 0 1 0 0 1 0 0 0 0
   i=3:  0 0 0 0 0 0 1 0 0 0
   i=4:  0 0 0 0 0 0 0 1 0 0
   i=5:  0 0 0 0 0 0 0 0 0 0

M1(5,9) = 1: P occurs ending at position 9 of T with at most 1 mismatch.

How much do we pay?

 The running time is O(kn(1 + m/w)).
 Again, the method is practically efficient for small m.
 Still, only O(k) columns of M are needed at any given time. Hence the space used by the algorithm is O(k) memory words.

Problem 3: Solution

Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing k mismatches.

P = bot, k = 2

[Figure: the tagged-Huffman compression C(S) of S = “bzip or not bzip”, dictionary {bzip, not, or, space}; the codeword of the matching term is searched in C(S).]

   not = 1 g 0 g 0 a

Agrep: more sophisticated operations

 The Shift-And method can solve other ops.

 The edit distance between two strings p and s is d(p,s) = the minimum number of operations needed to transform p into s via three ops:
   Insertion: insert a symbol in p
   Deletion: delete a symbol from p
   Substitution: change a symbol in p with a different one

 Example: d(ananas,banane) = 3
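
For reference, the textbook dynamic-programming formulation of this distance (not the bit-parallel Agrep version) can be sketched as:

   def edit_distance(p, s):
       """Classic O(|p|*|s|) dynamic program for insert/delete/substitute distance."""
       m, n = len(p), len(s)
       D = [[0] * (n + 1) for _ in range(m + 1)]   # D[i][j] = d(p[:i], s[:j])
       for i in range(m + 1):
           D[i][0] = i                 # delete all of p[:i]
       for j in range(n + 1):
           D[0][j] = j                 # insert all of s[:j]
       for i in range(1, m + 1):
           for j in range(1, n + 1):
               cost = 0 if p[i - 1] == s[j - 1] else 1
               D[i][j] = min(D[i - 1][j] + 1,          # deletion
                             D[i][j - 1] + 1,          # insertion
                             D[i - 1][j - 1] + cost)   # substitution / match
       return D[m][n]

   print(edit_distance("ananas", "banane"))   # -> 3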

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding

   0 0 0 ........ 0  x in binary
   └─ Length-1 ──┘

 x > 0 and Length = ⌊log2 x⌋ + 1
 e.g., 9 is represented as <000,1001>.

 The γ-code for x takes 2⌊log2 x⌋ + 1 bits
   (i.e., a factor of 2 from optimal)

 Optimal for Pr(x) ≈ 1/(2x2), i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of γ-coded integers, reconstruct the original sequence:

   0001000001100110000011101100111

Answer: 8, 6, 3, 59, 7
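
A small sketch of γ-encoding and decoding over a bit string (my own illustration of the scheme above):

   def gamma_encode(x):
       """gamma code: (Length-1) zeros followed by the binary representation of x (x > 0)."""
       b = bin(x)[2:]                 # binary of x, starts with '1'
       return "0" * (len(b) - 1) + b

   def gamma_decode(bits):
       """Decode a concatenation of gamma codes into the list of integers."""
       out, i = [], 0
       while i < len(bits):
           z = 0
           while bits[i] == "0":      # count the unary prefix of zeros
               z += 1
               i += 1
           out.append(int(bits[i:i + z + 1], 2))   # then read z+1 bits of binary
           i += z + 1
       return out

   print(gamma_encode(9))                                    # -> 0001001
   print(gamma_decode("0001000001100110000011101100111"))    # -> [8, 6, 3, 59, 7]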

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).

 Recall that |γ(i)| ≤ 2 * log i + 1
 How good is this approach wrt Huffman?
 Compression ratio ≤ 2 * H0(s) + 1
 Key fact:
   1 ≥ Σi=1,...,x pi ≥ x * px   ⇒   x ≤ 1/px

How good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):

   Σi=1,...,|S| pi * |γ(i)|  ≤  Σi=1,...,|S| pi * [ 2 * log (1/pi) + 1 ]  =  2 * H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes

 It is a prefix code
 Better compression: it uses all the 7-bit configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2

 A new concept: Continuers vs Stoppers
   The main idea is: byte values are split into s stopper values (a codeword ends there) and c continuer values (the codeword goes on).

 Previously we used: s = c = 128

 s + c = 256 (we are playing with 8 bits)
 Thus s items are encoded with 1 byte
 s*c with 2 bytes, s*c2 with 3 bytes, ...

An example

 5000 distinct words
 ETDC encodes 128 + 1282 = 16512 words within 2 bytes
 A (230,26)-dense code encodes 230 + 230*26 = 6210 words within 2 bytes, hence more of them on 1 byte; thus, if the distribution is skewed, it compresses better...

Optimal (s,c)-dense codes
Find the optimal s, assuming c = 256 - s.

 Brute-force approach
 Binary search:
   On real distributions, it seems there is one unique minimum

 Ks = max codeword length
 Fsk = cumulative probability of the symbols whose |cw| ≤ k

Experiments: (s,c)-DC is quite interesting…

 Search is 6% faster than on byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer sequence, that can then be var-length coded

 Start with the list of symbols L = [a,b,c,d,…]
 For each input symbol s
   1) output the position of s in L
   2) move s to the front of L

There is a memory

Properties:
 It exploits temporal locality, and it is dynamic
 X = 1n 2n 3n … nn   ⇒   Huff = O(n2 log n),  MTF = O(n log n) + n2

 Not much worse than Huffman
 ...but it may be far better
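
A minimal MTF encoder/decoder in Python (an illustration of the two steps above, not the course's code; positions are 0-based here, and are usually shifted by one before γ-coding):

   def mtf_encode(text, alphabet):
       """Output the 0-based position of each symbol in L, then move it to the front."""
       L = list(alphabet)
       out = []
       for s in text:
           pos = L.index(s)
           out.append(pos)
           L.insert(0, L.pop(pos))      # move s to the front of L
       return out

   def mtf_decode(codes, alphabet):
       L = list(alphabet)
       out = []
       for pos in codes:
           s = L[pos]
           out.append(s)
           L.insert(0, L.pop(pos))
       return "".join(out)

   codes = mtf_encode("abbbaa", "abcd")
   print(codes)                          # -> [0, 1, 0, 0, 1, 0]
   print(mtf_decode(codes, "abcd"))      # -> abbbaa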

MTF: how good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2 * log i + 1
Put S in front and consider the cost of encoding (pxi = position of the i-th occurrence of symbol x):

   O(|S| log |S|)  +  Σx=1,...,|S|  Σi=2,...,nx  |γ( pxi - pxi-1 )|

By Jensen’s inequality:

   ≤  O(|S| log |S|)  +  Σx=1,...,|S|  nx * [ 2 * log (N / nx) + 1 ]
   =  O(|S| log |S|)  +  N * [ 2 * H0(X) + 1 ]

   La[mtf]  ≤  2 * H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and separators) as the symbols to be encoded.
How to keep the MTF-list efficiently:

 Search tree
   Leaves contain the symbols, ordered as in the MTF-list
   Nodes contain the size of their descending subtree
 Hash Table
   key is a symbol
   data is a pointer to the corresponding tree leaf

 Each tree operation takes O(log |S|) time
 Total is O(n log |S|), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
   abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings ⇒ just the run lengths and one starting bit
Properties:
 It exploits spatial locality, and it is a dynamic code
 There is a memory
 X = 1n 2n 3n … nn  ⇒  Huff(X) = Θ(n2 log n)  >  Rle(X) = Θ(n log n)
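
A tiny run-length encoder matching the example above (a sketch):

   from itertools import groupby

   def rle_encode(s):
       """Collapse each maximal run of equal symbols into a (symbol, run-length) pair."""
       return [(ch, len(list(run))) for ch, run in groupby(s)]

   print(rle_encode("abbbaacccca"))
   # -> [('a', 1), ('b', 3), ('a', 2), ('c', 4), ('a', 1)]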

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive).

e.g.  p(a) = .2,  p(b) = .5,  p(c) = .3

   f(i) = Σj=1,...,i-1 p(j)        f(a) = .0,  f(b) = .2,  f(c) = .7

   a = .2   [0.0, 0.2)
   b = .5   [0.2, 0.7)
   c = .3   [0.7, 1.0)

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac

   start:     [0.0, 1.0)
   after b:   [0.2, 0.7)     (b’s symbol interval)
   after a:   [0.2, 0.3)     (a’s sub-interval of [0.2, 0.7))
   after c:   [0.27, 0.3)    (c’s sub-interval of [0.2, 0.3))

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols ci with probabilities p[c] use the following:

   l0 = 0      li = li-1 + si-1 * f(ci)
   s0 = 1      si = si-1 * p(ci)

f[c] is the cumulative prob. up to symbol c (not included).
Final interval size is

   sn = Πi=1,...,n p(ci)

The interval for a message sequence will be called the sequence interval.
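
These two recurrences can be turned into a few lines of Python to compute the sequence interval (a real-valued sketch of my own; production coders use the integer version discussed later):

   def sequence_interval(msg, p):
       """Return (l, s): the sequence interval [l, l+s) for msg under probabilities p."""
       # cumulative probability f[c] = sum of p over the symbols preceding c (fixed order)
       f, acc = {}, 0.0
       for c in sorted(p):
           f[c] = acc
           acc += p[c]
       l, s = 0.0, 1.0
       for c in msg:
           l = l + s * f[c]      # l_i = l_{i-1} + s_{i-1} * f(c_i)
           s = s * p[c]          # s_i = s_{i-1} * p(c_i)
       return l, s

   l, s = sequence_interval("bac", {"a": 0.2, "b": 0.5, "c": 0.3})
   print(l, l + s)               # -> 0.27 0.3   (up to floating-point rounding)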

Uniquely defining an interval
Important property: the intervals for distinct messages of length n will never overlap.
Therefore, specifying any number within the final interval uniquely determines the msg.
Decoding is similar to encoding, but on each step we need to determine what the message value is and then reduce the interval.

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:

   .49 ∈ [0.2, 0.7)      the interval of b
   .49 ∈ [0.3, 0.55)     the interval of b inside [0.2, 0.7)
   .49 ∈ [0.475, 0.55)   the interval of c inside [0.3, 0.55)

The message is bbc.

Representing a real number
Binary fractional representation:

   .75 = .11      1/3 = .010101...      11/16 = .1011

Algorithm
   1. x = 2 * x
   2. If x < 1 output 0
   3. else x = x - 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
e.g. [0,.33) = .01     [.33,.66) = .1     [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions.

   code    min      max      interval
   .11     .110     .111     [.75, 1.0)
   .101    .1010    .1011    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval is contained in the sequence interval (a dyadic number).

   Sequence interval [.61, .79)   ⊇   Code interval (.101) = [.625, .75)

Can use l + s/2 truncated to 1 + ⌈log2 (1/s)⌉ bits

Bound on Arithmetic length

Note that -log2 s + 1 = log2 (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

   1 + ⌈log2 (1/s)⌉
   = 1 + ⌈log2 Πi (1/pi)⌉
   ≤ 2 + Σi=1,...,n log2 (1/pi)
   = 2 + Σk=1,...,|S| n pk log2 (1/pk)
   = 2 + n H0   bits

In practice nH0 + 0.02 n bits, because of rounding.

Integer Arithmetic Coding
The problem is that operations on arbitrary-precision real numbers are expensive.
Key ideas of the integer version:
 Keep integers in the range [0..R) where R = 2k
 Use rounding to generate the integer interval
 Whenever the sequence interval falls into the top, bottom or middle half, expand the interval by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
   Output 1 followed by m 0s
   m = 0
   Message interval is expanded by 2

If u < R/2 then (bottom half)
   Output 0 followed by m 1s
   m = 0
   Message interval is expanded by 2

If l ≥ R/4 and u < 3R/4 then (middle half)
   Increment m
   Message interval is expanded by 2

All other cases: just continue...

You find this at

Arithmetic ToolBox
As a state machine:

   (L,s)  --[ c, (p1,....,p|S|) ]-->  ATB  -->  (L’,s’)

[Figure: the ATB maps the current interval (L,s) and the next symbol c, with its probability distribution, to the new interval (L’,s’).]

Therefore, even the distribution can change over time.

K-th order models: PPM
Use the previous k characters as the context.
 Makes use of conditional probabilities
 This is the changing distribution

Base probabilities on counts:
e.g. if th has been seen 12 times, followed by e 7 times, then the conditional probability is p(e|th) = 7/12.

Need to keep k small so that the dictionary does not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox

   (L,s)  --[ s = c or esc,  p[ s | context ] ]-->  ATB  -->  (L’,s’)

Encoder and Decoder must know the protocol for selecting the same conditional probability distribution (PPM-variant).

PPM: Example Contexts          String = ACCBACCACBA B        k = 2

   Context: Empty        Context (order 1)            Context (order 2)
     A = 4                 A:  C = 3   $ = 1            AC:  B = 1  C = 2  $ = 2
     B = 2                 B:  A = 2   $ = 1            BA:  C = 1  $ = 1
     C = 5                 C:  A = 1   B = 2             CA:  C = 1  $ = 1
     $ = 3                     C = 2   $ = 3             CB:  A = 2  $ = 1
                                                         CC:  A = 1  B = 1  $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) for n → ∞ !!

LZ77

   a a c a a c a b c a b a b a c
   Dictionary (all substrings starting before the Cursor)        Output: <2,3,c>

Algorithm’s step:
 Output <d, len, c>
   d = distance of the copied string wrt the current position
   len = length of the longest match
   c = next char in the text beyond the longest match
 Advance by len + 1

A buffer “window” has fixed length and moves.

Example: LZ77 with window
Window size = 6

   a a c a a c a b c a b a a a c      (0,0,a)
   a a c a a c a b c a b a a a c      (1,1,c)
   a a c a a c a b c a b a a a c      (3,4,b)
   a a c a a c a b c a b a a a c      (3,3,a)
   a a c a a c a b c a b a a a c      (1,2,c)

Each triple is (distance of the longest match within W, its length, next character).

LZ77 Decoding
The decoder keeps the same dictionary window as the encoder.
 It finds the substring and inserts a copy of it.

What if len > d ? (overlap with the text to be compressed)
 E.g. seen = abcd, next codeword is (2,9,e)
 Simply copy starting at the cursor:

   for (i = 0; i < len; i++)
       out[cursor+i] = out[cursor-d+i];

 Output is correct: abcdcdcdcdcdce
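
The whole decoder is essentially this copy loop plus appending the extra character; a Python sketch (illustrative only):

   def lz77_decode(triples):
       """Decode a list of (d, length, c) triples; overlapping copies work naturally."""
       out = []
       for d, length, c in triples:
           start = len(out) - d               # cursor - d
           for i in range(length):
               out.append(out[start + i])     # may read chars written by this same copy
           out.append(c)
       return "".join(out)

   print(lz77_decode([(0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), (3, 3, 'a'), (1, 2, 'c')]))
   # -> aacaacabcabaaac   (the windowed example above)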

LZ77 Optimizations used by gzip
LZSS: output one of the following formats
   (0, position, length)   or   (1, char)

 Typically the second format is used if length < 3.
 Special greedy: possibly use a shorter match so that the next match is better
 Hash table to speed up the search for triplets
 Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
                                    Output    Dict.
   a a b a a c a b c a b c b        (0,a)     1 = a
   a a b a a c a b c a b c b        (1,b)     2 = ab
   a a b a a c a b c a b c b        (1,a)     3 = aa
   a a b a a c a b c a b c b        (0,c)     4 = c
   a a b a a c a b c a b c b        (2,c)     5 = abc
   a a b a a c a b c a b c b        (5,b)     6 = abcb
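
The coding loop described above fits in a few lines of Python (a sketch using a plain dict instead of a trie):

   def lz78_encode(text):
       """LZ78: output (id, next-char) pairs; id 0 denotes the empty phrase."""
       dictionary = {}                # phrase -> id
       out = []
       phrase = ""
       for c in text:
           if phrase + c in dictionary:
               phrase += c                                 # keep extending the match
           else:
               out.append((dictionary.get(phrase, 0), c))  # longest match + next char
               dictionary[phrase + c] = len(dictionary) + 1
               phrase = ""
       if phrase:                                          # leftover match at end of text
           out.append((dictionary[phrase], ""))
       return out

   print(lz78_encode("aabaacabcabcb"))
   # -> [(0,'a'), (1,'b'), (1,'a'), (0,'c'), (2,'c'), (5,'b')]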

LZ78: Decoding Example
   Input    Output so far                  Dict.
   (0,a)    a                              1 = a
   (1,b)    a a b                          2 = ab
   (1,a)    a a b a a                      3 = aa
   (0,c)    a a b a a c                    4 = c
   (2,c)    a a b a a c a b c              5 = abc
   (5,b)    a a b a a c a b c a b c b      6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
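
The decoder's "one step behind" behaviour, including that special case, can be sketched as follows (illustrative Python; the initial code table a=112, b=113, c=114 and the counter starting at 256 follow the slides' example, and are assumptions of this sketch):

   def lzw_decode(codes, dictionary, next_id):
       """LZW decoder: 'dictionary' maps the initial codes to strings; new entries get
       ids next_id, next_id+1, ... The special case is a code not yet in the dictionary."""
       dictionary = dict(dictionary)
       prev = dictionary[codes[0]]
       out = [prev]
       for code in codes[1:]:
           entry = dictionary.get(code, prev + prev[0])   # unknown code => S + S[0]
           out.append(entry)
           dictionary[next_id] = prev + entry[0]          # what the coder added one step before
           next_id += 1
           prev = entry
       return "".join(out)

   print(lzw_decode([112, 112, 113, 256, 114, 257, 261, 114],
                    {112: 'a', 113: 'b', 114: 'c'}, 256))
   # -> aabaacababac   (code 261 arrives before it is defined: the special case)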

LZW: Encoding Example
                                    Output    Dict.
   a a b a a c a b a b a c b        112       256 = aa
   a a b a a c a b a b a c b        112       257 = ab
   a a b a a c a b a b a c b        113       258 = ba
   a a b a a c a b a b a c b        256       259 = aac
   a a b a a c a b a b a c b        114       260 = ca
   a a b a a c a b a b a c b        257       261 = aba
   a a b a a c a b a b a c b        261       262 = abac
   a a b a a c a b a b a c b        114       263 = cb

LZW: Decoding Example
   Input     Output so far                   Dict.
   112       a
   112       a a                             256 = aa
   113       a a b                           257 = ab
   256       a a b a a                       258 = ba
   114       a a b a a c                     259 = aac
   257       a a b a a c a b                 260 = ca
   261  ?    a a b a a c a b a b a           261 = aba   (added one step later)
   114       a a b a a c a b a b a c         262 = abac

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform        (1994)
Let us be given a text T = mississippi#

   All rotations of T:        Sorted rows:   F                 L
   mississippi#                              #  mississipp    i
   ississippi#m                              i  #mississip    p
   ssissippi#mi                              i  ppi#missis    s
   sissippi#mis                              i  ssippi#mis    s
   issippi#miss                              i  ssissippi#    m
   ssippi#missi                              m  ississippi    #
   sippi#missis                              p  i#mississi    p
   ippi#mississ                              p  pi#mississ    i
   ppi#mississi                              s  ippi#missi    s
   pi#mississip                              s  issippi#mi    s
   i#mississipp                              s  sippi#miss    i
   #mississippi                              s  sissippi#m    i

L is the output of the transform.

A famous example: on a much longer text the effect is even more striking...

A useful tool: L → F mapping

[Figure: the sorted BWT matrix again, with the F and L columns highlighted; the text in between is unknown to the decoder.]

How do we map L’s chars onto F’s chars ?
... we need to distinguish equal chars in F...

 Take two equal chars of L
 Rotate their rows rightward
 They keep the same relative order !!

The BWT is invertible

[Figure: the sorted BWT matrix with the F and L columns; the rows in between are unknown.]

Two key properties:
 1. The LF-array maps L’s chars to F’s chars
 2. L[ i ] precedes F[ i ] in T

Reconstruct T backward:

   T = .... i ppi #

   InvertBWT(L)
     Compute LF[0,n-1];
     r = 0; i = n;
     while (i > 0) {
       T[i] = L[r];
       r = LF[r]; i--;
     }

How to compute the BWT ?

   SA      BWT matrix       L
   12      #mississipp      i
   11      i#mississip      p
    8      ippi#missis      s
    5      issippi#mis      s
    2      ississippi#      m
    1      mississippi      #
   10      pi#mississi      p
    9      ppi#mississ      i
    7      sippi#missi      s
    4      sissippi#mi      s
    6      ssippi#miss      i
    3      ssissippi#m      i

We said that: L[i] precedes F[i] in T
   e.g. L[3] = T[ 7 ]

Given SA and T, we have L[i] = T[SA[i]-1]
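
For small inputs the transform can be computed exactly as in the picture, by sorting the rotations or the suffixes (a didactic Python sketch; real implementations build SA with specialized algorithms):

   def bwt(T):
       """BWT by explicit rotation sorting (Theta(n^2 log n) in the worst case)."""
       n = len(T)
       rotations = sorted(T[i:] + T[:i] for i in range(n))
       return "".join(row[-1] for row in rotations)          # last column L

   def bwt_via_sa(T):
       """Equivalent, via the suffix array (T must end with a unique smallest char, e.g. '#')."""
       SA = sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])   # 1-based positions
       return "".join(T[i - 2] for i in SA)                  # L[i] = T[SA[i]-1]

   print(bwt("mississippi#"))          # -> ipssm#pissii
   print(bwt_via_sa("mississippi#"))   # -> ipssm#pissii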

How to construct SA from T ?
Input: T = mississippi#

   SA     suffix
   12     #
   11     i#
    8     ippi#
    5     issippi#
    2     ississippi#
    1     mississippi#
   10     pi#
    9     ppi#
    7     sippi#
    4     sissippi#
    6     ssippi#
    3     ssissippi#

Elegant but inefficient.
Obvious inefficiencies:
 • Θ(n2 log n) time in the worst case
 • Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:
 L is locally homogeneous
   ⇒ L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L
 Run-Length coding
 Statistical coder

 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii        # at 16

Mtf-list = [i,m,p,s]
Mtf = 020030000030030200300300000100000
Mtf = 030040000040040300400400000200000       Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210                   (alphabet of |S|+1 symbols)

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus γ(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics

 Size
   1 trillion pages available (Google 7/08)
   5-40K per page => hundreds of terabytes
   Size grows every day!!

 Change
   8% new pages, 25% new links change weekly
   Life time of about 10 days

The Bow Tie

Some definitions
 Weakly connected components (WCC)
   Set of nodes such that from any node one can go to any other node via an undirected path.
 Strongly connected components (SCC)
   Set of nodes such that from any node one can go to any other node via a directed path.

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?

 It is the largest artifact ever conceived by humans
 Exploit the structure of the Web for
   Crawl strategies
   Search
   Spam detection
   Discovering communities on the web
   Classification/organization
 Predict the evolution of the Web
   Sociological understanding

Many other large graphs…

 Physical network graph
   V = Routers
   E = communication links

 The “cosine” graph (undirected, weighted)
   V = static web pages
   E = semantic distance between pages

 Query-Log graph (bipartite, weighted)
   V = queries and URLs
   E = (q,u) if u is a result for q and has been clicked by some user who issued q

 Social graph (undirected, unweighted)
   V = users
   E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)
 V = URLs, E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN & no OUT)

Three key properties:
 Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution

[Figure: in-degree distributions from the Altavista crawl (1999) and the WebBase crawl (2001); the indegree follows a power-law distribution]

   Pr[ in-degree(u) = k ]  ∝  1 / ka        a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)
 V = URLs, E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN, no OUT)

Three key properties:
 Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1
 Locality: usually most of the hyperlinks point to other URLs on the same host (about 80%).
 Similarity: pages close in lexicographic order tend to share many outgoing lists.

A Picture of the Web Graph

[Figure: adjacency-matrix plot (i,j) of a crawl with 21 millions of pages and 150 millions of links.]

URL-sorting
[Figure: URLs sorted lexicographically (Berkeley, Stanford, …): URL compression + Delta encoding.]

The library WebGraph

[Figure: uncompressed adjacency list vs. adjacency list with compressed gaps (exploiting locality).]

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries: ...
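
A sketch of the gap transformation of a successor list (the example list is made up for illustration; the real WebGraph library also applies variable-length instantaneous codes to these gaps):

   def gaps(x, successors):
       """Turn the sorted successor list of node x into small gaps: the first entry
       is s1 - x (possibly negative), the others are consecutive differences minus 1."""
       s = sorted(successors)
       out = [s[0] - x]
       out += [s[i] - s[i - 1] - 1 for i in range(1, len(s))]
       return out

   print(gaps(15, [13, 15, 16, 17, 18, 19, 23, 24, 203]))
   # -> [-2, 1, 0, 0, 0, 0, 3, 0, 178]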

Copy-lists
Reference chains, possibly limited

[Figure: uncompressed adjacency list vs. adjacency list with copy-lists (exploiting similarity).]

 Each bit of y’s copy-list tells whether the corresponding successor of the reference x is also a successor of y;
 The reference index is chosen in [0,W] so as to give the best compression.

Copy-blocks = RLE(Copy-list)

[Figure: adjacency list with copy-lists vs. adjacency list with copy-blocks (RLE on the bit sequences).]

 The first copy-block is 0 if the copy-list starts with 0;
 The last block is omitted (we know the length…);
 The length is decremented by one for all blocks.

This is a Java and C++ lib (≈3 bits/edge)

Extra-nodes: Compressing Intervals

[Figure: adjacency list with copy-blocks; consecutivity in the extra-nodes.]

 Intervals: use their left extreme and length
   Interval length: decremented by Lmin = 2
 Residuals: differences between residuals, or wrt the source

   0 = (15-15)*2         (positive)
   2 = (23-19)-2         (jump >= 2)
   600 = (316-16)*2
   3 = |13-15|*2-1       (negative)
   3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background

[Figure: a sender transmits data over the network to a receiver; the receiver already has some knowledge about the data.]

 network links are getting faster and faster, but
 many clients are still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques

 caching: “avoid sending the same object again”
   done on the basis of whole objects
   only works if objects are completely unchanged
   What about objects that are slightly changed?

 compression: “remove redundancy in transmitted data”
   avoid repeated substrings in data
   can be extended to the history of past transmissions (overhead)
   What if the sender has never seen the data at the receiver ?

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization

 Delta compression   [diff, zdelta, REBL,…]
   Compress file f deploying file f’
   Compress a group of files
   Speed up web access by sending the differences between the requested page and the ones available in cache

 File synchronization   [rsync, zsync]
   Client updates an old file fold with the fnew available on a server
   Mirroring, Shared Crawling, Content Distribution Networks

 Set reconciliation
   Client updates a structured old file fold with the fnew available on a server
   Update of contacts or appointments, intersecting inverted lists in a P2P search engine

Z-delta compression   (one-to-one)

Problem: we have two files fknown and fnew and the goal is to compute a file fd of minimum size, such that fnew can be derived from fknown and fd.

 Assume that block moves and copies are allowed
 Find an optimal covering set of fnew based on fknown
 The LZ77-scheme provides an efficient, optimal solution
   fknown is the “previously encoded text”: compress the concatenation fknown fnew, starting from fnew
 zdelta is one of the best implementations

             Emacs size    Emacs time
   uncompr   27Mb          ---
   gzip      8Mb           35 secs
   zdelta    1.5Mb         42 secs

Efficient Web Access
Dual-proxy architecture: a pair of proxies located on each side of the slow link use a proprietary protocol to increase performance over this link.

[Figure: Client <-> proxy over the slow link (requests and delta-encoded pages wrt a shared reference); proxy <-> web over the fast link (requests and full pages).]

Use zdelta to reduce traffic:
 The old version is available at both proxies
 Restricted to pages already visited (30% hits), URL-prefix match
 Small cache

Cluster-based delta compression
Problem: we wish to compress a group of files F
 Useful on a dynamic collection of web pages, back-ups, …

 Apply pairwise zdelta: find for each f ∈ F a good reference
 Reduction to the Min Branching problem on DAGs
   Build a weighted graph GF: nodes = files, weights = zdelta-size
   Insert a dummy node connected to all, whose edge weights are the gzip-coding sizes
   Compute the min branching = directed spanning tree of min total cost, covering G’s nodes

[Figure: a small weighted graph with a dummy node 0 and files 1, 2, 3, 5; edge weights such as 620, 2000, 220, 123, 20.]

             space    time
   uncompr   30Mb     ---
   tgz       20%      linear
   THIS      8%       quadratic

Improvement   (what about many-to-one compression of a group of files?)

Problem: constructing G is very costly: n2 edge calculations (zdelta executions)
 We wish to exploit some pruning approach

 Collection analysis: cluster the files that appear similar and are thus good candidates for zdelta-compression. Build a sparse weighted graph G’F containing only the edges between those pairs of files.
 Assign weights: estimate appropriate edge weights for G’F, thus saving zdelta executions. Nonetheless, strictly n2 time.

             space    time
   uncompr   260Mb    ---
   tgz       12%      2 mins
   THIS      8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem

[Figure: the Client holds f_old and sends a request; the Server holds f_new and sends back an update.]

 the client wants to update an out-dated file
 the server has the new file but does not know the old file
 update without sending the entire f_new (using similarity)
 rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch, since the server has both copies of the files.

The rsync algorithm

[Figure: the Client (holding f_old) sends block hashes to the Server (holding f_new), which answers with the encoded file.]

The rsync algorithm (contd)
 simple, widely used, single roundtrip
 optimizations: 4-byte rolling hash + 2-byte MD5, gzip for the literals
 choice of the block size is problematic (default: max{700, √n} bytes)
 not good in theory: the granularity of the changes may disrupt the use of blocks

Rsync: some experiments

             gcc size    emacs size
   total     27288       27326
   gzip      7563        8577
   zdelta    227         1431
   rsync     964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

 The Server sends the hashes (unlike the client in rsync), and the client checks them.
 The Server deploys the common fref to compress the new ftar (rsync compresses just it).

A multi-round protocol
 k blocks of n/k elements
 log n/k levels
 If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
 The communication complexity is O(k lg n lg(n/k)) bits.

Next lecture

Set reconciliation
Problem: given two sets SA and SB of integer values, located on two machines A and B, determine the difference between the two sets at one or both of the machines.

Requirements: the cost should be proportional to the size k of the difference, where k may or may not be known in advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...



Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
   iff
P is a prefix of the i-th suffix of T (i.e. T[i,N])

[Figure: P aligned against T at position i, covering a prefix of T[i,N].]

Occurrences of P in T = all suffixes of T having P as a prefix

   P = si     T = mississippi     occurrences at 4, 7

SUF(T) = sorted set of suffixes of T

Reduction: from substring search to prefix search

The Suffix Tree

T# = mississippi#
     1 2 3 4 5 6 7 8 9 10 11 12

[Figure: the suffix tree of T#, with edge labels such as #, i, s, p, si, ssi, ppi#, pi#, i#, mississippi#, and leaves labelled with the starting positions 1..12 of the corresponding suffixes.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

T = mississippi#

   SA     SUF(T)
   12     #
   11     i#
    8     ippi#
    5     issippi#
    2     ississippi#
    1     mississippi#
   10     pi#
    9     ppi#
    7     sippi#
    4     sissippi#
    6     ssippi#
    3     ssissippi#

Storing SUF(T) explicitly would take Θ(N2) space; the SA keeps only the suffix pointers.

Suffix Array
 • SA: Θ(N log2 N) bits
 • Text T: N chars
  In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison

   P = si,   T = mississippi#
   SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]

Compare P against the suffix pointed to by the middle entry of SA:
if P is larger, recurse on the right half; if P is smaller, on the left half.
2 accesses per step.

Suffix Array search
 • O(log2 N) binary-search steps
 • Each step takes O(p) char cmp
  overall, O(p log2 N) time
   (improvable: O(p + log2 N) [Manber-Myers, ’90], O(p + log2 |S|) [Cole et al., ’06])
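
The plain O(p log2 N) search can be sketched in a few lines of Python (an illustration, not the optimized Manber-Myers version):

   def suffix_array(T):
       """Toy construction: sort the suffix start positions (0-based)."""
       return sorted(range(len(T)), key=lambda i: T[i:])

   def sa_search(T, SA, P):
       """Binary search on SA: return the 1-based starting positions of P in T."""
       lo, hi = 0, len(SA)
       while lo < hi:                    # first suffix that is >= P
           mid = (lo + hi) // 2
           if T[SA[mid]:] < P:
               lo = mid + 1
           else:
               hi = mid
       start, hi = lo, len(SA)
       while lo < hi:                    # first suffix whose |P|-prefix is > P
           mid = (lo + hi) // 2
           if T[SA[mid]:SA[mid] + len(P)] <= P:
               lo = mid + 1
           else:
               hi = mid
       return sorted(SA[i] + 1 for i in range(start, lo))

   T = "mississippi#"
   SA = suffix_array(T)                  # [11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]
   print(sa_search(T, SA, "si"))         # -> [4, 7]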

Locating the occurrences

[Figure: binary search on SA for the range of suffixes prefixed by P = si, i.e. between si# and si$ (assuming # < every char of S < $); the range contains sippi and sissippi, hence occ = 2 occurrences, at positions 4 and 7.]

Suffix Array search
 • O(p + log2 N + occ) time

 Suffix Trays: O(p + log2 |S| + occ)   [Cole et al., ’06]
 String B-tree   [Ferragina-Grossi, ’95]
 Self-adjusting Suffix Arrays   [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

   T = mississippi#
   Lcp = 0 0 1 4 0 0 1 0 2 1 3
   SA  = 12 11 8 5 2 1 10 9 7 4 6 3

   e.g. issippi vs ississippi share the prefix issi of length 4.

 • How long is the common prefix between T[i,...] and T[j,...] ?
   • Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
 • Does there exist a repeated substring of length ≥ L ?
   • Search for an Lcp[i] ≥ L.
 • Does there exist a substring of length ≥ L occurring ≥ C times ?
   • Search for a subarray Lcp[i,i+C-2] whose entries are all ≥ L.

Slide 208

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with i-th bit of U(T[j]) to estabilish if both are true

An example j=1
1 2 3 4 5 6 7 8 9 10

n
m

1

12345

1

0

P=abaac

2

0

3

0

4

0

5

0

T=xabxabaaca

0
 
0
U (x)  0
 
0
 
0

2

3

4

5

6

7

1 
 
0
BitShift ( M ( 0 )) & U (T [1])   0  &
 
0
 
0

8

9

0 0
   
0 0
0  0
   
0 0
   
0 0

1
0

An example j=2
1 2 3 4 5 6 7 8 9 10

n
m

1

2

12345

1

0

1

P=abaac

2

0

0

3

0

0

4

0

0

5

0

0

T=xabxabaaca

1 
 
0
U (a )   1 
 
1 
 
0

3

4

5

6

7

1 
 
0
BitShift ( M (1)) & U (T [ 2 ])   0  &
 
0
 
0

8

9

1  1 
   
0 0
1    0 
   
1   0 
   
0 0

1
0

An example j=3
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

12345

1

0

1

0

P=abaac

2

0

0

1

3

0

0

0

4

0

0

0

5

0

0

0

T=xabxabaaca

0
 
1 
U (b )   0 
 
0
 
0

4

5

6

7

1 
 
1 
BitShift ( M ( 2 )) & U (T [ 3 ])   0  &
 
0
 
0

8

9

0 0
   
1  1 
0  0
   
0 0
   
0 0

1
0

An example j=9
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

4

5

6

7

8

9

12345

1

0

1

0

0

1

0

1

1

0

P=abaac

2

0

0

1

0

0

1

0

0

0

3

0

0

0

0

0

0

1

0

0

4

0

0

0

0

0

0

0

1

0

5

0

0

0

0

0

0

0

0

1

T=xabxabaaca

0
 
0
U (c )   0 
 
0
 
1 

1 
 
1 
BitShift ( M ( 8 )) & U (T [ 9 ])   0  &
 
0
 
1 

0 0
   
0 0
0  0
   
0 0
   
1  1 

1
0

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1

The first i-1 characters of P match a substring of T ending at j-1, with at most l mismatches, and the next pair of characters in P and T are equal.
This case is captured by   BitShift( Ml(j-1) ) & U(T[j])

Computing Ml: case 2

The first i-1 characters of P match a substring of T ending at j-1, with at most l-1 mismatches; position j may then be a mismatch.
This case is captured by   BitShift( Ml-1(j-1) )

Computing Ml

We compute Ml for all l = 0, …, k. For each j compute M(j), M1(j), …, Mk(j).
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a match iff

   Ml(j) = [ BitShift( Ml(j-1) ) & U(T(j)) ]  OR  BitShift( Ml-1(j-1) )

Example M1
T = xabxabaaca,  P = abaad

M1 =    j:  1  2  3  4  5  6  7  8  9  10
  i=1:      1  1  1  1  1  1  1  1  1  1
  i=2:      0  0  1  0  0  1  0  1  1  0
  i=3:      0  0  0  1  0  0  1  0  0  1
  i=4:      0  0  0  0  1  0  0  1  0  0
  i=5:      0  0  0  0  0  0  0  0  1  0

M0 =    j:  1  2  3  4  5  6  7  8  9  10
  i=1:      0  1  0  0  1  0  1  1  0  1
  i=2:      0  0  1  0  0  1  0  0  0  0
  i=3:      0  0  0  0  0  0  1  0  0  0
  i=4:      0  0  0  0  0  0  0  1  0  0
  i=5:      0  0  0  0  0  0  0  0  0  0

How much do we pay?

The running time is O(kn(1+m/w)).
Again, the method is practically efficient for small m.
Still, only O(k) columns of M are needed at any given time. Hence, the space used by the algorithm is O(k) memory words.
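A small Python sketch (not in the original slides) of the bit-parallel k-mismatch search described above; M[l] holds the column Ml(j), and an occurrence with at most l mismatches is reported when the last bit of M[l] is set. The function name is made up for illustration.

def shift_and_k_mismatches(text, pattern, k):
    m = len(pattern)
    U = {}
    for i, c in enumerate(pattern):
        U[c] = U.get(c, 0) | (1 << i)
    last = 1 << (m - 1)
    M, occ = [0] * (k + 1), []
    for j, c in enumerate(text):
        prev = M[:]                                   # the columns Ml(j-1)
        M[0] = ((prev[0] << 1) | 1) & U.get(c, 0)
        for l in range(1, k + 1):
            M[l] = (((prev[l] << 1) | 1) & U.get(c, 0)) | ((prev[l - 1] << 1) | 1)
        for l in range(k + 1):
            if M[l] & last:
                occ.append((j - m + 1, l))            # start position, smallest l
                break
    return occ

# With T = "xabxabaaca", P = "abaad", k = 1 it reports (4, 1):
# "abaac" starting at 0-based position 4 matches P with one mismatch.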

Problem 3: Solution
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring, allowing k mismatches.
Dictionary = { bzip, not, or };  S = “bzip or not bzip”;  P = bot, k = 2
Matching term:  not = 1g 0g 0a
[figure: the compressed text C(S) is scanned, marking yes/no for each codeword]

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Delection: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
   g(x) = 0^(Length-1) followed by x in binary,  with x > 0 and Length = log2 x + 1
   e.g., 9 is represented as <000,1001>.
   g-code for x takes 2 log2 x + 1 bits   (ie. factor of 2 from optimal)



Optimal for Pr(x) = 1/2x2, and i.i.d integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8

6

3

59

7
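A small Python sketch (not in the original slides) of g-encoding and of the decoder for a stream of concatenated g-codes; it can be used to check the exercise above. The function names are made up for illustration.

def gamma_encode(x):
    # (Length-1) zeroes followed by x in binary, x > 0.
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode_stream(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":          # count the leading zeroes = Length - 1
            z += 1; i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out

# gamma_decode_stream("0001000001100110000011101100111") -> [8, 6, 3, 59, 7]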

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How much good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
1 ≥Si=1,...,x pi ≥ x * px  x ≤ 1/px

How good is it ?
Encode the integers via g-coding:  |g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):

   Σi=1,...,|S|  pi * |g(i)|  ≤  Σi=1,...,|S|  pi * [ 2 * log (1/pi) + 1 ]  =  2 * H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory. Properties:
   It exploits temporal locality, and it is dynamic
   X = 1^n 2^n 3^n … n^n  →  Huff = O(n^2 log n),  MTF = O(n log n) + n^2
Not much worse than Huffman ...but it may be far better
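A small Python sketch (not in the original slides) of the MTF encoder just described; the list-based implementation is the naive O(|S|) per symbol one, not the balanced-tree version discussed later.

def mtf_encode(s, alphabet):
    # Output the current 0-based position of each symbol, then move it to front.
    L = list(alphabet)
    out = []
    for c in s:
        i = L.index(c)
        out.append(i)
        L.pop(i)
        L.insert(0, c)
    return out

# mtf_encode("ipppssssssmmmii", "imps") -> [0,2,0,0,3,0,0,0,0,0,3,0,0,3,0]
# which reproduces the start of the Bzip example shown later (020030000030030...).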

MTF: how good is it ?
Encode the integers via g-coding:  |g(i)| ≤ 2 * log i + 1
Put the alphabet S in front of the sequence and consider the cost of encoding:

   O(|S| log |S|)  +  Σx=1,...,|S|  Σi=2,...,nx  g( p_i^x - p_{i-1}^x )

By Jensen’s inequality this is:

   ≤  O(|S| log |S|)  +  Σx=1,...,|S|  nx * [ 2 * log (N/nx) + 1 ]
   =  O(|S| log |S|)  +  N * [ 2 * H0(X) + 1 ]

Hence  La[mtf]  ≤  2 * H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1n 2n 3n… nn 

There is a memory

Huff(X) = Q(n2 log n) > Rle(X) = Q(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive), where

   f(i) = Σj=1,...,i-1 p(j)

e.g.  p(a) = .2, p(b) = .5, p(c) = .3   →   f(a) = .0, f(b) = .2, f(c) = .7
so a → [0,.2), b → [.2,.7), c → [.7,1.0)

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
   b:  [0,1)    →  [.2,.7)
   a:  [.2,.7)  →  [.2,.3)
   c:  [.2,.3)  →  [.27,.3)
The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities p[c] use the following:

   l_0 = 0        l_i = l_{i-1} + s_{i-1} * f[c_i]
   s_0 = 1        s_i = s_{i-1} * p[c_i]

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is   s_n = Πi=1,...,n p[c_i]

The interval for a message sequence will be called the sequence interval
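A small Python sketch (not in the original slides) that applies the recurrences above to compute the sequence interval; the function name is made up for illustration, and floating point is used only because the example is tiny.

def sequence_interval(msg, p, f):
    l, s = 0.0, 1.0
    for c in msg:
        l = l + s * f[c]        # l_i = l_{i-1} + s_{i-1} * f[c_i]
        s = s * p[c]            # s_i = s_{i-1} * p[c_i]
    return l, s

p = {'a': .2, 'b': .5, 'c': .3}
f = {'a': .0, 'b': .2, 'c': .7}
# sequence_interval("bac", p, f) -> approximately (0.27, 0.03), i.e. [.27, .3)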

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:
   .49 ∈ [.2,.7)    →  b;  new interval [.2,.7)
   .49 ∈ [.3,.55)   →  b;  new interval [.3,.55)
   .49 ∈ [.475,.55) →  c
The message is bbc.

Representing a real number
Binary fractional representation:
   .75 = .11      1/3 = .0101…      11/16 = .1011
Algorithm:
   1. x = 2 * x
   2. if x < 1 output 0
   3. else x = x - 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
   e.g. [0,.33) → .01      [.33,.66) → .1      [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions.

   code    min      max      interval
   .11     .110     .111     [.75, 1.0)
   .101    .1010    .1011    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most
   1 + log (1/s)  =  1 + log Πi (1/pi)
   ≤  2 + Σj=1,...,n log (1/pj)
   =  2 + Σk=1,...,|S| n pk log (1/pk)
   =  2 + n H0   bits
In practice nH0 + 0.02 n bits, because of rounding.

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine: ATB maps the current interval (L,s) and the next symbol c, with its distribution (p1,....,pS), to the new interval (L’,s’) = (L + s*f[c], s*p[c]).
Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts      String = ACCBACCACBA B,   k = 2

Context Empty:   A = 4   B = 2   C = 5   $ = 3
Context A:   C = 3   $ = 1
Context B:   A = 2   $ = 1
Context C:   A = 1   B = 2   C = 2   $ = 3
Context AC:  B = 1   C = 2   $ = 2
Context BA:  C = 1   $ = 1
Context CA:  C = 1   $ = 1
Context CB:  A = 2   $ = 1
Context CC:  A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
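A small Python sketch (not in the original slides) of the LZ77 decoder just described; copying character by character makes the overlapping case (len > d) work automatically. The function name is made up for illustration.

def lz77_decode(triples):
    # triples: list of (d, length, next_char)
    out = []
    for d, length, c in triples:
        start = len(out) - d
        for i in range(length):
            out.append(out[start + i])   # may read chars written in this same loop
        out.append(c)
    return "".join(out)

# Example from the slides (ignoring the window):
# lz77_decode([(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')]) == "aacaacabcabaaac"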

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
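A small Python sketch (not in the original slides) of LZW; as an assumption, the dictionary is seeded only with the symbols actually occurring in the input instead of the 256 ASCII codes, and the decoder handles the SSc special case explicitly.

def lzw_encode(s):
    dic = {c: i for i, c in enumerate(sorted(set(s)))}
    out, w = [], ""
    for c in s:
        if w + c in dic:
            w += c
        else:
            out.append(dic[w])
            dic[w + c] = len(dic)       # add Sc, but do not emit c
            w = c
    out.append(dic[w])
    return out, sorted(set(s))

def lzw_decode(codes, alphabet):
    dic = {i: c for i, c in enumerate(alphabet)}
    w = dic[codes[0]]
    out = [w]
    for k in codes[1:]:
        entry = dic[k] if k in dic else w + w[0]   # the special SSc case
        out.append(entry)
        dic[len(dic)] = w + entry[0]
        w = entry
    return "".join(out)

# codes, alpha = lzw_encode("aabaacabababacb"); lzw_decode(codes, alpha) gives back the input.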

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
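A small Python sketch (not in the original slides) of the forward transform (naively, by sorting all rotations) and of the LF-based inversion just shown; the function names are made up for illustration, and a sentinel smaller than every other character is assumed.

def bwt(t):
    rots = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(r[-1] for r in rots)

def ibwt(L):
    n = len(L)
    order = sorted(range(n), key=lambda i: (L[i], i))   # stable: equal chars keep order
    LF = [0] * n
    for f_pos, i in enumerate(order):
        LF[i] = f_pos                                   # LF-array maps L's chars to F's chars
    out, r = [], 0                                      # row 0 starts with the sentinel
    for _ in range(n):
        out.append(L[r])
        r = LF[r]
    s = "".join(reversed(out))          # this is T rotated so the sentinel comes first
    return s[1:] + s[0]                 # put the sentinel back at the end

# bwt("mississippi#") == "ipssm#pissii"  and  ibwt("ipssm#pissii") == "mississippi#"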

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Q(n2 log n) time in the worst-case
• Q(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in - degree ( u )  k ] 

a  2.1

1

k

a

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides and efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
           Emacs size   Emacs time
uncompr    27Mb         ---
gzip       8Mb          35 secs
zdelta     1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

0

620

2

20

123

2000

20
220

3

1
5

           space   time
uncompr    30Mb    ---
tgz        20%     linear
THIS       8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
           space   time
uncompr    260Mb   ---
tgz        12%     2 mins
THIS       8%      16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

          gcc size   emacs size
total     27288      27326
gzip      7563       8577
zdelta    227        1431
rsync     964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
#

0

s

i

1
ssi

ppi#

4

#

1

p

i

3

2

1

ppi#

6

pi#

7

8
5

T# = mississippi#
2 4 6 8 10

si

ppi#

i#

ppi#

11

mississippi#

12

2

1 10

9

4

3

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
5

Q(N2) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Q(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
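A small Python sketch (not in the original slides) of the indirect binary search over the suffix array: two searches delimit the contiguous range of suffixes having P as a prefix, with O(p log n) character comparisons. The function name and the 0-based convention are illustrative.

def sa_range(text, sa, p):
    n = len(sa)
    lo, hi = 0, n
    while lo < hi:                                        # first suffix with prefix >= p
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(p)] < p: lo = mid + 1
        else: hi = mid
    left, hi = lo, n
    while lo < hi:                                        # first suffix with prefix > p
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(p)] <= p: lo = mid + 1
        else: hi = mid
    return left, lo                                       # occurrences start at sa[left:lo]

T = "mississippi#"
sa = [11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]               # 0-based version of the SA above
# sa_range(T, sa, "si") -> (8, 10): the occurrences start at sa[8]=6 and sa[9]=3,
# i.e. positions 7 and 4 in 1-based counting, as in the slides.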

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does it exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does it exist a substring of length ≥ L occurring ≥ C times ?
• Search for Lcp[i,i+C-2] whose entries are ≥ L


Slide 209

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

        4K    8K    16K   32K    128K   256K   512K   1M
n^3     22s   3m    26m   3.5h   28h    --     --     --
n^2     0     0     0     1s     26s    106s   7m     28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT
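A small Python sketch (not in the original slides) of the one-pass algorithm above (reset the running sum when it becomes non-positive, keep the best sum seen); the function name is made up for illustration.

def max_subarray_sum(A):
    best, cur = 0, 0
    for x in A:
        cur += x
        if cur <= 0:            # the optimum cannot start inside a negative prefix
            cur = 0
        best = max(best, cur)
    return best

# max_subarray_sum([2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]) == 12   (window 6 1 -2 4 3)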

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02    m = (i+j)/2;              // Divide
03    Merge-Sort(A,i,m);        // Conquer
04    Merge-Sort(A,m+1,j);
05    Merge(A,i,m,j)            // Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:
   n = 10^9 tuples  →  few Gbs
   Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms
Analysis of mergesort on disk:
   It is an indirect sort: Θ(n log2 n) random I/Os
   [5ms] * n log2 n  ≈  1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree
[figure: log2 N levels of pairwise merging over sorted runs]
If the run-size is larger than B (i.e. after the first step!!), fetching all of it in memory for merging does not help.
How do we deploy the disk/memory features?
With internal memory M: N/M runs, each sorted in internal memory (no I/Os);
the I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort

Number of passes = log_{M/B} #runs  ≈  log_{M/B} (N/M)
Optimal cost = Θ( (N/B) log_{M/B} (N/M) ) I/Os

In practice:
   M/B ≈ 1000  →  #passes = log_{M/B} (N/M) ≈ 1
   One multiway merge → 2 passes = few mins   (tuning depends on disk features)
   Large fan-out (M/B) decreases #passes
   Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;
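A small Python sketch (not in the original slides) of the two-variable streaming algorithm above, with equivalent bookkeeping; the returned item is the majority element whenever some item occurs more than N/2 times.

def majority_candidate(stream):
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1
        elif X == s:
            C += 1
        else:
            C -= 1
    return X

# majority_candidate("bacccdcbaaaccbccc") -> 'c'   (c occurs 9 times out of 17)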

Proof
(The algorithm has problems only if the most frequent item occurs ≤ N/2 times.)
If X ≠ y at the end, then every one of y’s occurrences has a distinct “negative” mate; hence the number of these mates is ≥ #occ(y). As a result 2 * #occ(y) ≤ N, contradicting #occ(y) > N/2.

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 10^9 chars,  size = 6Gb
n = 10^6 documents
TotT = 10^9 term occurrences (avg term length is 6 chars)
t = 5 * 10^5 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix      (t = 500K terms, n = 1 million docs)

             Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony             1                 1             0          0       0        1
Brutus             1                 1             0          1       0        0
Caesar             1                 1             0          1       1        1
Calpurnia          0                 1             0          0       0        0
Cleopatra          1                 0             0          0       0        0
mercy              1                 0             1          1       1        1
worser             1                 0             1          1       1        0

1 if play contains word, 0 otherwise.   Space is 500Gb !

Solution 2: Inverted index
Each of the terms Brutus, Calpurnia, Caesar is mapped to the sorted list of documents containing it (posting lists such as  2 4 8 16 32 64 128,   1 2 3 5 8 13 21 34,   13 16).
We can still do better: i.e. 30-50% of the original text
1. Typically use about 12 bytes
2. We have 10^9 total terms  →  at least 12Gb space
3. Compressing the 6Gb of documents gets 1.5Gb of data
Better index, but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

   Σi=1,...,n-1 2^i  =  2^n - 2  <  2^n

We need to talk about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
   La(C) = Σs∈S p(s) * L[s]

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H(S) ≤ La(C)
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La(C) ≤ H(S) + 1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5
Merge a and b into a node of weight .3; merge it with c into .5; merge with d into 1.
   a = 000, b = 001, c = 01, d = 1
There are 2^(n-1) “equivalent” Huffman trees

What about ties (and thus, tree depth) ?
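A small Python sketch (not in the original slides) of the Huffman construction with a heap; ties are broken arbitrarily here, so the actual 0/1 labels may differ from the slide, while the codeword lengths (and the average length) stay the same.

import heapq

def huffman_codes(probs):
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)        # two least-probable nodes
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

# huffman_codes({'a': .1, 'b': .2, 'c': .2, 'd': .5})
# -> codeword lengths 3, 3, 2, 1 for a, b, c, d, one of the equivalent Huffman trees.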

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to the symbol to be encoded.
Decoding: Start at the root and take the branch for each bit received. When at a leaf, output its symbol and return to the root.
   abc...  →  000 001 01 ...  =  00000101...
   101001...  →  d c b

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
[figure: word-based Huffman with fan-out 128 over the words of T = “bzip or not bzip”; each codeword is a sequence of bytes carrying 7 bits of Huffman code, with 1 tag bit marking the first byte (byte-aligned, tagged codewords)]

CGrep and other ideas...
P = bzip = 1a 0b;  T = “bzip or not bzip”
[figure: GREP is run directly on the compressed text C(T), matching the tagged codeword of P against each byte-aligned codeword (yes/no)]
Speed ≈ Compression ratio

You find this under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary = { bzip, not, or };  P = bzip = 1a 0b;  S = “bzip or not bzip”
[figure: the compressed text C(S) is scanned codeword by codeword, marking yes/no for matches of P’s codeword]
Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.
Let s be a string of length m:

   H(s) = Σi=1,...,m 2^(m-i) * s[i]

   P = 0101:   H(P) = 2^3*0 + 2^2*1 + 2^1*0 + 2^0*1 = 5
   s = s’ if and only if H(s) = H(s’)

Definition: let Tr denote the m-length substring of T starting at position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons

We can compute H(Tr) from H(Tr-1):

   H(Tr) = 2 * H(Tr-1) - 2^m * T[r-1] + T[r+m-1]

T = 10110101:   T1 = 1011,  T2 = 0110
   H(T1) = H(1011) = 11
   H(T2) = H(0110) = 2*11 - 2^4*1 + 0 = 22 - 16 = 6
Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P = 101111,  q = 7
H(P) = 47;   Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally:
   1*2 + 0 (mod 7) = 2
   2*2 + 1 (mod 7) = 5
   5*2 + 1 (mod 7) = 4
   4*2 + 1 (mod 7) = 2
   2*2 + 1 (mod 7) = 5  =  Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1):
   2^m (mod q) = 2 * ( 2^(m-1) (mod q) ) (mod q)
Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
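A small Python sketch (not in the original slides) of Karp-Rabin fingerprint matching; as assumptions, characters are treated as base-256 digits and the modulus is a fixed large prime rather than a random prime ≤ I, and every fingerprint hit is verified so the output is always correct.

def karp_rabin(T, P, q=2**31 - 1):
    n, m = len(T), len(P)
    b = 256
    hP = hT = 0
    for i in range(m):
        hP = (hP * b + ord(P[i])) % q
        hT = (hT * b + ord(T[i])) % q
    bm = pow(b, m - 1, q)                      # b^(m-1) mod q, to drop the leading char
    occ = []
    for r in range(n - m + 1):
        if hP == hT and T[r:r + m] == P:       # verify, so false matches are filtered out
            occ.append(r)
        if r + m < n:
            hT = ((hT - ord(T[r]) * bm) * b + ord(T[r + m])) % q
    return occ

# karp_rabin("10110101", "0101") -> [4]   (0-based; 1-based position 5, as in the slides)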

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for
   M has m = 3 rows (one per prefix of P) and n = 10 columns (one per position of T).
   Its only 1-entries are M(1,5), M(2,6), M(3,7): the full match of “for” ends at position j = 7.
How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is the bit-wise AND between A and B.
BitShift(A) is the value derived by shifting A’s bits down by one and setting the first bit to 1, i.e. BitShift( (b1, b2, …, bm) ) = (1, b1, …, b(m-1)).


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M

Initialize column 0 of M to all zeros.
For j > 0, the j-th column is obtained by

   M(j) = BitShift( M(j-1) ) & U(T[j])

For i > 1, entry M(i,j) = 1 iff
   (1) the first i-1 characters of P match the i-1 characters of T ending at character j-1   ⇔   M(i-1,j-1) = 1
   (2) P[i] = T[j]   ⇔   the i-th bit of U(T[j]) = 1
BitShift moves bit M(i-1,j-1) into the i-th position; AND-ing with the i-th bit of U(T[j]) establishes whether both conditions hold.
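A small Python sketch (not in the original slides) of the Shift-And exact matcher just described, with the current column of M kept as an integer; the function name is made up for illustration, and Python integers remove the m ≤ w restriction at the price of speed.

def shift_and(text, pattern):
    m = len(pattern)
    U = {}
    for i, c in enumerate(pattern):
        U[c] = U.get(c, 0) | (1 << i)
    last, M, occ = 1 << (m - 1), 0, []
    for j, c in enumerate(text):
        M = ((M << 1) | 1) & U.get(c, 0)    # M(j) = BitShift(M(j-1)) & U(T[j])
        if M & last:
            occ.append(j - m + 1)           # an occurrence ends at position j
    return occ

# shift_and("xabxabaaca", "abaac") -> [4]  (0-based; the match ends at 1-based position 9)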

An example, j=1:   T = xabxabaaca,  P = abaac,  U(x) = (0,0,0,0,0)
   M(1) = BitShift( M(0) ) & U(T[1]) = (1,0,0,0,0) & (0,0,0,0,0) = (0,0,0,0,0)

An example, j=2:   T[2] = a,  U(a) = (1,0,1,1,0)
   M(2) = BitShift( M(1) ) & U(a) = (1,0,0,0,0) & (1,0,1,1,0) = (1,0,0,0,0)

An example, j=3:   T[3] = b,  U(b) = (0,1,0,0,0)
   M(3) = BitShift( M(2) ) & U(b) = (1,1,0,0,0) & (0,1,0,0,0) = (0,1,0,0,0)

An example, j=9:   T[9] = c,  U(c) = (0,0,0,0,1)
   M(8) = (1,0,0,1,0)
   M(9) = BitShift( M(8) ) & U(c) = (1,1,0,0,1) & (0,0,0,0,1) = (0,0,0,0,1)
The 1 in the last row of column 9 signals an occurrence of P = abaac ending at position 9.

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M ( j - 1))  U (T [ j ])
l

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M

l -1

( j - 1))

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1
T = xabxabaaca,  P = abaad

M1 =      1 2 3 4 5 6 7 8 9 10
      1:  1 1 1 1 1 1 1 1 1 1
      2:  0 0 1 0 0 1 0 1 1 0
      3:  0 0 0 1 0 0 1 0 0 1
      4:  0 0 0 0 1 0 0 1 0 0
      5:  0 0 0 0 0 0 0 0 1 0

M0 =      1 2 3 4 5 6 7 8 9 10
      1:  0 1 0 0 1 0 1 1 0 1
      2:  0 0 1 0 0 1 0 0 0 0
      3:  0 0 0 0 0 0 1 0 0 0
      4:  0 0 0 0 0 0 0 1 0 0
      5:  0 0 0 0 0 0 0 0 0 0

How much do we pay?





The running time is O(kn(1 + m/w)).
Again, the method is practically efficient for small m.
Only O(k) columns of M are needed at any given time;
hence the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

[Diagram: dictionary {bzip, not, or}, S = “bzip or not bzip”, P = bot with k = 2; the Agrep automaton is run over the codewords in C(S), e.g. not = 1g 0g 0a.]

Agrep: more sophisticated operations


The Shift-And method can solve other ops.

The edit distance between two strings p and s is
d(p,s) = minimum number of operations needed to
transform p into s via three ops:
 Insertion: insert a symbol in p
 Deletion: delete a symbol from p
 Substitution: change a symbol of p into a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding (the “g” of these slides is the greek letter gamma)
 γ(x) = (Length−1) zeroes, followed by x written in binary,
   where x > 0 and Length = ⌊log2 x⌋ + 1
 e.g., 9 is represented as <000, 1001>
 the γ-code for x takes 2⌊log2 x⌋ + 1 bits
   (i.e. a factor of 2 from the optimal ⌈log2 x⌉)
 Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

Answer: 8, 6, 3, 59, 7
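A tiny Python sketch of γ-encoding and decoding (our illustration; function names are assumptions), which can be checked against the exercise above:

def gamma_encode(x):
    b = bin(x)[2:]                     # x > 0: binary representation of x
    return "0" * (len(b) - 1) + b      # (Length-1) zeroes, then x in binary

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":          # count the zeroes = Length-1
            z += 1; i += 1
        out.append(int(bits[i:i + z + 1], 2))   # read Length bits of binary
        i += z + 1
    return out

# gamma_decode("0001000001100110000011101100111") -> [8, 6, 3, 59, 7]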

Analysis
Sort the p_i in decreasing order, and encode symbol s_i via
the variable-length code γ(i).

Recall that |γ(i)| ≤ 2·log i + 1.
How good is this approach wrt Huffman?
Compression ratio ≤ 2·H0(s) + 1
Key fact:  1 ≥ Σ_{i=1..x} p_i ≥ x·p_x   ⟹   x ≤ 1/p_x

How good is it ?
Encode the integers via γ-coding:  |γ(i)| ≤ 2·log i + 1
The cost of the encoding is (recall i ≤ 1/p_i):

  Σ_{i=1..|S|} p_i · |γ(i)|  ≤  Σ_{i=1..|S|} p_i · [ 2·log(1/p_i) + 1 ]  =  2·H0(X) + 1

Not much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding
 Byte-aligned and tagged Huffman
   128-ary Huffman tree
   The first bit of the first byte is tagged
   Configurations on 7 bits: just those of Huffman
 End-tagged dense code (ETDC)
   The rank r is mapped to the r-th binary sequence on 7·k bits
   The first bit of the last byte is tagged

A better encoding — surprising changes
 It is a prefix-code
 Better compression: it uses all 7-bit configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2

A new concept: Continuers vs Stoppers
 Previously we used: s = c = 128
The main idea is:
 s + c = 256 (we are playing with 8 bits)
 Thus s items are encoded with 1 byte
 And s·c with 2 bytes, s·c² on 3 bytes, ...

An example
 5000 distinct words
 ETDC encodes 128 words on 1 byte and 128² on 2 bytes: 16512 words within 2 bytes
 A (230,26)-dense code encodes 230 + 230·26 = 6210 words within 2 bytes,
   hence more words (230) on 1 byte; if the distribution is skewed this pays off...
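A small Python sketch of one possible (s,c)-dense mapping from a word’s rank to its byte codeword (our own illustration of the stopper/continuer idea; the exact byte layout is an assumption, with ETDC as the case s = c = 128):

def sc_encode(rank, s):
    # byte values < s are "stoppers" (they end a codeword),
    # byte values >= s are "continuers" (there are c = 256 - s of them)
    c = 256 - s
    out = [rank % s]             # last byte: stopper
    x = rank // s
    while x > 0:
        x -= 1
        out.append(s + x % c)    # continuer bytes
        x //= c
    return bytes(reversed(out))

# With s = c = 128: ranks 0..127 take 1 byte, the next 128*128 ranks take 2 bytes, ...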

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.
 Brute-force approach
 Binary search: on real distributions there seems to be a unique minimum
 Ks = max codeword length
 Fsk = cumulative probability of the symbols whose |cw| <= k

Experiments: (s,c)-DC is quite interesting…
 Search is 6% faster than byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?

 Move-to-Front (MTF):
   As a freq-sorting approximator
   As a caching strategy
   As a compressor
 Run-Length-Encoding (RLE):
   FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:
 Exploits temporal locality, and it is dynamic
 X = 1^n 2^n 3^n … n^n  ⟹  Huff = O(n² log n),  MTF = O(n log n) + n²

Not much worse than Huffman
...but it may be far better
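A minimal Python sketch of the MTF transform described above (our illustration; positions are emitted 1-based, as in the slides):

def mtf_encode(text, alphabet):
    L = list(alphabet)                 # start with the list of symbols L = [a,b,c,d,...]
    out = []
    for s in text:
        i = L.index(s)                 # 1) position of s in the current list
        out.append(i + 1)
        L.pop(i); L.insert(0, s)       # 2) move s to the front of L
    return out

# mtf_encode("aabbbba", "ab") -> [1, 1, 2, 1, 1, 1, 2]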

MTF: how good is it ?
Encode the output integers via γ-coding:  |γ(i)| ≤ 2·log i + 1
Put the alphabet S in front of the sequence and consider the cost of encoding:

  O(|S| log |S|)  +  Σ_{x=1..|S|}  Σ_{i=2..n_x}  |γ( p_x^i − p_x^{i−1} )|

where p_x^1 < p_x^2 < … are the positions of symbol x and n_x its frequency
(the MTF position of x is at most the gap from its previous occurrence).
By Jensen’s inequality this is

  ≤  O(|S| log |S|)  +  Σ_{x=1..|S|}  n_x · [ 2·log(N/n_x) + 1 ]
  =  O(|S| log |S|)  +  N · [ 2·H0(X) + 1 ]

hence   La[mtf]  ≤  2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep the MTF-list efficiently:
 Search tree
   Leaves contain the symbols, ordered as in the MTF-list
   Nodes contain the size of their descending subtree
 Hash Table
   key is a symbol
   data is a pointer to the corresponding tree leaf

 Each tree operation takes O(log |S|) time
 Total is O(n log |S|), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just the run lengths and one starting bit

There is a memory
Properties:
 Exploits spatial locality, and it is a dynamic code
 X = 1^n 2^n 3^n … n^n  ⟹  Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)
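A minimal Python sketch of RLE as used above (our illustration):

def rle_encode(s):
    out = []
    for ch in s:
        if out and out[-1][0] == ch:
            out[-1][1] += 1            # extend the current run
        else:
            out.append([ch, 1])        # start a new run
    return [(c, n) for c, n in out]

# rle_encode("abbbaacccca") -> [('a',1), ('b',3), ('a',2), ('c',4), ('a',1)]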

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.  p(a) = .2, p(b) = .5, p(c) = .3

  f(i) = Σ_{j<i} p(j)      f(a) = .0,  f(b) = .2,  f(c) = .7

so  a → [0, .2),   b → [.2, .7),   c → [.7, 1.0)

The interval for a particular symbol will be called
the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac

  start              [0.0, 1.0)
  after b  (.5)      [0.2, 0.7)
  after a  (.2)      [0.2, 0.3)
  after c  (.3)      [0.27, 0.3)

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c_1 … c_n with probabilities p[c], use:

  l_0 = 0      l_i = l_{i−1} + s_{i−1} · f[c_i]
  s_0 = 1      s_i = s_{i−1} · p[c_i]

f[c] is the cumulative prob. up to symbol c (not included).
The final interval size is

  s_n = Π_{i=1..n} p[c_i]

The interval for a message sequence will be called the
sequence interval
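A small Python sketch computing the sequence interval with exact fractions (our illustration; real coders use the integer/scaling version sketched later):

from fractions import Fraction

def sequence_interval(msg, p):
    # p: dict symbol -> probability (as a string); f[c] = cumulative prob. of the symbols listed before c
    f, acc = {}, Fraction(0)
    for c in p:
        f[c] = acc
        acc += Fraction(p[c])
    l, s = Fraction(0), Fraction(1)
    for c in msg:
        l = l + s * f[c]            # l_i = l_{i-1} + s_{i-1} * f[c_i]
        s = s * Fraction(p[c])      # s_i = s_{i-1} * p[c_i]
    return l, l + s                 # the sequence interval [l, l+s)

# sequence_interval("bac", {"a": "0.2", "b": "0.5", "c": "0.3"}) -> (Fraction(27,100), Fraction(3,10))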

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:

  .49 ∈ [.2, .7)     →  b,  restrict to [.2, .7)
  .49 ∈ [.3, .55)    →  b,  restrict to [.3, .55)
  .49 ∈ [.475, .55)  →  c

The message is bbc.

Representing a real number
Binary fractional representation:
  .75 = .11      1/3 = .0101…      11/16 = .1011

Algorithm
  1. x = 2·x
  2. If x < 1 output 0
  3. else x = x − 1; output 1

So how about just using the shortest binary fractional
representation in the sequence interval?
e.g.  [0,.33) = .01      [.33,.66) = .1      [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions.

  code   min     max     interval
  .11    .110    .111    [.75, 1.0)
  .101   .1010   .1011   [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length
Note that −log s + 1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most
  1 + ⌈log (1/s)⌉ = 1 + ⌈log Π_i (1/p_i)⌉
                  ≤ 2 + Σ_{i=1..n} log (1/p_i)
                  = 2 + Σ_{k=1..|S|} n·p_k·log (1/p_k)
                  = 2 + n·H0   bits

In practice ≈ n·H0 + 0.02·n bits, because of rounding.

Integer Arithmetic Coding
Problem: operations on arbitrary-precision real numbers are expensive.
Key ideas of the integer version:
 Keep integers in range [0..R) where R = 2^k
 Use rounding to generate the integer interval
 Whenever the sequence interval falls into the top, bottom or
   middle half, expand the interval by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts     (String = ACCBACCACBA B,  k = 2)

  Order-0 (empty context):   A=4   B=2   C=5   $=3
  Order-1:   A: C=3 $=1      B: A=2 $=1      C: A=1 B=2 C=2 $=3
  Order-2:   AC: B=1 C=2 $=2    BA: C=1 $=1    CA: C=1 $=1    CB: A=2 $=1    CC: A=1 B=1 $=2
You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves
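A minimal Python sketch of the LZ77 parsing step just described (our illustration; it emits (d, len, c) triples and restricts copy sources to a window of the last W positions):

def lz77_encode(text, W):
    out, i = [], 0
    while i < len(text):
        best_d, best_len = 0, 0
        for j in range(max(0, i - W), i):            # candidate copy sources in the window
            l = 0
            while i + l < len(text) - 1 and text[j + l] == text[i + l]:
                l += 1                               # the copy may overlap the cursor
            if l > best_len:
                best_d, best_len = i - j, l
        out.append((best_d, best_len, text[i + best_len]))
        i += best_len + 1                            # advance by len + 1
    return out

# lz77_encode("aacaacabcabaaac", 6) -> [(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')]
# (the same triples produced in the windowed example that follows)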

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids
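A compact Python sketch of LZ78 coding and decoding as described above (our illustration; a plain dict stands in for the trie):

def lz78_encode(text):
    D, out, i = {}, [], 0                    # D maps phrase -> id; id 0 = empty phrase
    while i < len(text):
        S, sid = "", 0
        while i < len(text) and S + text[i] in D:
            S += text[i]; sid = D[S]; i += 1     # longest match S already in the dictionary
        if i == len(text):                        # text ended inside a known phrase
            out.append((sid, ""))
            break
        c = text[i]; i += 1
        out.append((sid, c))                      # output (id of S, next char c)
        D[S + c] = len(D) + 1                     # add Sc to the dictionary
    return out

def lz78_decode(pairs):
    D, out = {0: ""}, []
    for sid, c in pairs:
        phrase = D[sid] + c
        out.append(phrase)
        D[len(D)] = phrase                        # rebuild the same dictionary
    return "".join(out)

# lz78_encode("aabaacabcabcb") -> [(0,'a'), (1,'b'), (1,'a'), (0,'c'), (2,'c'), (5,'b')]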

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
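A small Python sketch of LZW (our illustration, with a 256-entry ASCII start dictionary); the branch for a code not yet in the decoder’s dictionary handles the special case mentioned above:

def lzw_encode(text):
    D = {chr(i): i for i in range(256)}
    out, S = [], ""
    for c in text:
        if S + c in D:
            S += c
        else:
            out.append(D[S]); D[S + c] = len(D); S = c
    if S: out.append(D[S])
    return out

def lzw_decode(codes):
    D = {i: chr(i) for i in range(256)}
    prev = D[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in D:
            cur = D[code]
        else:                                # code not yet known: the decoder is one step behind
            cur = prev + prev[0]
        out.append(cur)
        D[len(D)] = prev + cur[0]            # the entry the encoder just created
        prev = cur
    return "".join(out)

# lzw_decode(lzw_encode("aabaacabababac")) == "aabaacabababac"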

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]
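A small Python sketch tying the pieces together (our illustration): build SA naively — the slides themselves call this elegant but inefficient — derive L via L[i] = T[SA[i]−1], and invert L with the LF mapping:

def bwt_via_sa(T):                                   # T must end with a unique smallest char '#'
    SA = sorted(range(len(T)), key=lambda i: T[i:])  # suffix array, O(n^2 log n) worst case
    L = "".join(T[i - 1] for i in SA)                # L[i] = T[SA[i]-1] (cyclically, via index -1)
    return SA, L

def inverse_bwt(L):
    n = len(L)
    order = sorted(range(n), key=lambda r: (L[r], r))   # stable sort of L gives column F
    LF = [0] * n
    for f_row, r in enumerate(order):
        LF[r] = f_row                                    # LF maps an L-char to its row in F
    r, rev = 0, []                                       # row 0 is the rotation starting with '#'
    for _ in range(n - 1):
        rev.append(L[r])                                 # L[r] precedes F[r] in T
        r = LF[r]
    return "".join(reversed(rev)) + "#"                  # re-append the sentinel at the end

# SA, L = bwt_via_sa("mississippi#")   ->  L == "ipssm#pissii"
# inverse_bwt(L)                       ->  "mississippi#"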

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution
Altavista crawl, 1999; WebBase crawl, 2001.
Indegree follows a power-law distribution:

  Pr[ in-degree(u) = k ]  ∝  1/k^α,   α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1 − x,  s2 − s1 − 1,  ...,  sk − s(k−1) − 1}

For negative entries:
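A tiny Python sketch of this gap transformation (our illustration); only the first gap may be negative, and we remap it with the same ·2 / ·2−1 rule used for residuals in the extra-nodes example further below:

def succ_gaps(x, succ):                   # succ = sorted successor list of node x
    first = succ[0] - x                   # may be negative
    v = 2 * first if first >= 0 else 2 * (-first) - 1
    return [v] + [succ[i] - succ[i - 1] - 1 for i in range(1, len(succ))]
    # each resulting gap can then be gamma/delta-coded

# e.g. succ_gaps(15, [13, 15, 16, 17, 18, 19, 23, 24, 203]) -> [3, 1, 0, 0, 0, 0, 3, 0, 178]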

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Interval length: decremented by Lmin = 2
Residuals: differences between residuals, or wrt the source; e.g.
  0 = (15−15)·2 (positive)
  2 = (23−19)−2 (jump ≥ 2)
  600 = (316−16)·2
  3 = |13−15|·2−1 (negative)
  3018 = 3041−22−1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides an efficient, optimal solution
 fknown is the “previously encoded text”: compress the concatenation fknown·fnew, starting the output from fnew

zdelta is one of the best implementations
            Emacs size    Emacs time
  uncompr   27Mb          ---
  gzip      8Mb           35 secs
  zdelta    1.5Mb         42 secs
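Not zdelta itself, but the same idea can be tried in a few lines with zlib’s preset-dictionary support (a sketch under that assumption; fknown plays the role of the already-seen text):

import zlib

def delta_compress(f_known: bytes, f_new: bytes) -> bytes:
    co = zlib.compressobj(level=9, zdict=f_known)    # fknown acts as the preset dictionary
    return co.compress(f_new) + co.flush()

def delta_decompress(f_known: bytes, f_d: bytes) -> bytes:
    do = zlib.decompressobj(zdict=f_known)
    return do.decompress(f_d) + do.flush()

# fd = delta_compress(old_version, new_version)
# assert delta_decompress(old_version, fd) == new_version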

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: weighted graph over the files plus a dummy node 0 connected to all of them with gzip-costs; edge weights e.g. 620, 2000, 220, 123, 20, ...; the min branching picks the cheapest reference for each file.]

             space    time
  uncompr    30Mb     ---
  tgz        20%      linear
  THIS       8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: estimate appropriate edge weights for G’F, thus saving
zdelta executions. Nonetheless, strictly n² time.

             space    time
  uncompr    260Mb    ---
  tgz        12%      2 mins
  THIS       8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
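A toy Python sketch of the block-matching idea (our illustration): the side holding f_old sends one hash per block; the side holding f_new emits block references or literal bytes. Real rsync uses a cheap rolling hash to slide the window plus a strong checksum to confirm; this sketch uses a single MD5 per window instead.

import hashlib

def block_table(f_old: bytes, B: int):
    return {hashlib.md5(f_old[i:i + B]).digest(): i
            for i in range(0, len(f_old), B)}

def rsync_encode(f_new: bytes, table, B: int):
    ops, i = [], 0
    while i < len(f_new):
        h = hashlib.md5(f_new[i:i + B]).digest()
        if len(f_new) - i >= B and h in table:
            ops.append(("copy", table[h]))      # reference to a block of f_old
            i += B
        else:
            ops.append(("lit", f_new[i:i + 1])) # literal byte
            i += 1
    return ops

def rsync_decode(ops, f_old: bytes, B: int):
    return b"".join(f_old[v:v + B] if op == "copy" else v for op, v in ops)

# table = block_table(f_old, 700)
# assert rsync_decode(rsync_encode(f_new, table, 700), f_old, 700) == f_new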

Rsync: some experiments

            gcc size    emacs size
  total     27288       27326
  gzip      7563        8577
  zdelta    227         1431
  rsync     964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync
 The server sends the hashes (unlike rsync, where the client does), and the client checks them
 The server deploys the common fref to compress the new ftar (rsync compresses just ftar)

A multi-round protocol
 k blocks of n/k elements, log(n/k) levels
 If the distance is k, then on each level at most k hashes fail to find a match in the other file.
 The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
T# = mississippi#
[Figure: suffix tree of T#; edges labelled with substrings (e.g. “ssi”, “ppi#”, “mississippi#”), leaves labelled with the starting positions 1..12 of the corresponding suffixes.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.

T = mississippi#        (Θ(N²) space if SUF(T) is stored explicitly)

  SA    SUF(T)
  12    #
  11    i#
   8    ippi#
   5    issippi#
   2    ississippi#
   1    mississippi#
  10    pi#
   9    ppi#
   7    sippi#
   4    sissippi#
   6    ssippi#
   3    ssissippi#

Each SA entry is a suffix pointer (e.g. P = si selects the block starting at sippi#).

Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison

  SA = 12 11 8 5 2 1 10 9 7 4 6 3      T = mississippi#      P = si
  [At each binary-search step the probed suffix is compared with P
   (2 accesses per step); P turns out larger or smaller, and the search
   continues in the corresponding half.]

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time
  (improvable: [Manber-Myers, ’90], [Cole et al., ’06])
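A short Python sketch of the indirect binary search on SA (our illustration), returning the SA range of suffixes prefixed by P:

def sa_range(T, SA, P):
    # return (l, r) such that SA[l:r] lists all suffixes having P as a prefix
    def first(greater):
        lo, hi = 0, len(SA)
        while lo < hi:
            mid = (lo + hi) // 2
            s = T[SA[mid]:SA[mid] + len(P)]    # O(p) chars compared per step
            if s < P or (greater and s == P):
                lo = mid + 1
            else:
                hi = mid
        return lo
    return first(False), first(True)

def occurrences(T, SA, P):
    l, r = sa_range(T, SA, P)
    return sorted(SA[l:r])

# T = "mississippi#"; SA = sorted(range(len(T)), key=lambda i: T[i:])
# occurrences(T, SA, "si") -> [3, 6]   (the 1-based positions 4 and 7 of the slides)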

Locating the occurrences
T = mississippi#,  P = si  →  occ = 2

[The two binary searches delimit where si# and si$ would fall in SA: the block
contains the suffixes sippi# and sissippi#, i.e. the occurrences at positions 4 and 7.]

Suffix Array search
• O(p + log2 N + occ) time

Suffix Trays: O(p + log2 |S| + occ)   [Cole et al., ’06]
String B-tree   [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays   [Ciriani et al., ’02]

Text mining
Lcp[1,N−1] = longest-common-prefix between suffixes adjacent in SA

  SA  = 12 11 8 5 2 1 10 9 7 4 6 3
  Lcp =  0  0 1 4 0 0  1 0 2 1 3        T = mississippi#

(e.g. issippi# and ississippi# share a prefix of length 4)

• How long is the common prefix between T[i,...] and T[j,...] ?
  • Min of the subarray Lcp[h,k−1] s.t. SA[h]=i and SA[k]=j.
• Is there a repeated substring of length ≥ L ?
  • Search for some Lcp[i] ≥ L.
• Is there a substring of length ≥ L occurring ≥ C times ?
  • Search for a window Lcp[i,i+C−2] whose entries are all ≥ L.
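A tiny Python sketch of these Lcp-based queries (our illustration; Lcp[i] is taken as the lcp of the suffixes in SA[i] and SA[i+1]):

def lcp(a, b):
    k = 0
    while k < min(len(a), len(b)) and a[k] == b[k]:
        k += 1
    return k

def build_lcp(T, SA):
    return [lcp(T[SA[i]:], T[SA[i + 1]:]) for i in range(len(SA) - 1)]

def has_repeat_of_length(Lcp, L):
    return any(v >= L for v in Lcp)            # repeated substring of length >= L ?

def has_substring_with_C_occ(Lcp, L, C):
    # meaningful for C >= 2: a window of C-1 consecutive Lcp entries, all >= L
    w = C - 1
    return any(all(v >= L for v in Lcp[i:i + w]) for i in range(len(Lcp) - w + 1))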


The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
  C · p · f/(1+f)
This is at least 10⁴ · f/(1+f).
If we fetch B ≈ 4Kb in time C, and the algorithm uses all of them:
  (1/B) · (p · f/(1+f) · C)  ≈  30 · f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
[Figure: memory hierarchy CPU/registers – L1 – L2 cache (few Mbs, some nanosecs) – RAM (few Gbs, tens of nanosecs) – HD (few Tbs, few millisecs, B = 32K page) – net (many Tbs, even secs, packets).]

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

  n         4K   8K   16K   32K    128K   256K   512K   1M
  n³ algo   22s  3m   26m   3.5h   28h    --     --     --
  n² algo   0    0    0     1s     26s    106s   7m     28m

An optimal solution
We assume every prefix-subsum ≠ 0.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7      (optimum = 6 1 -2 4 3, sum 12)

Algorithm
 sum = 0; max = -1;
 For i = 1,...,n do
   If (sum + A[i] ≤ 0) sum = 0;
   else sum += A[i]; max = MAX{max, sum};

Note:
 • sum < 0 just before OPT starts;
 • sum > 0 within OPT
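The same one-pass algorithm in runnable Python (our transcription of the pseudocode above):

def max_subarray_sum(A):
    best, s = -1, 0                 # as in the slides: max = -1, sum = 0
    for x in A:
        if s + x <= 0:
            s = 0                   # the optimum cannot start before this point
        else:
            s += x
            best = max(best, s)
    return best

# max_subarray_sum([2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]) -> 12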

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 10⁹ random I/Os = 10⁹ × 5ms ≈ 2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01  if (i < j) then
02    m = (i+j)/2;               // Divide
03    Merge-Sort(A,i,m);         // Conquer
04    Merge-Sort(A,m+1,j);
05    Merge(A,i,m,j)             // Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:
 n = 10⁹ tuples  few Gbs
 Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:
 It is an indirect sort: Θ(n log2 n) random I/Os
 [5ms] × n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

(log2 N levels)

If the run-size is larger than B (i.e. after the first step!!),
fetching all of it in memory for merging does not help.

[Figure: recursion tree of binary merge-sort on a 16-element array.]

How do we deploy the disk/mem features ?

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge.
Sort N items with main-memory M and disk-pages B:
 Pass 1: produce N/M sorted runs.
 Pass i: merge X = M/B runs at a time  ⟹  log_{M/B}(N/M) merge passes

[Figure: X = M/B input buffers of B items each, plus one output buffer, between two disks.]

Multiway Merging
[Figure: one buffer Bf1..Bfx per run (X = M/B runs) plus an output buffer Bfo;
at each step emit min(Bf1[p1], Bf2[p2], …, Bfx[pX]) into Bfo, fetch a new page
of run i when pi = B, and flush Bfo to the merged output run when it is full,
until EOF.]
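A compact Python sketch of the multiway merging step (our illustration; heapq replaces the explicit min over the X buffer heads, and iterators stand in for the disk runs):

import heapq

def multiway_merge(runs, write):
    # runs: iterators over already-sorted runs; write: callback collecting the merged output
    heads = []
    for r, it in enumerate(runs):
        first = next(it, None)
        if first is not None:
            heapq.heappush(heads, (first, r, it))   # one head element per run
    while heads:
        x, r, it = heapq.heappop(heads)             # min over the current heads
        write(x)
        nxt = next(it, None)
        if nxt is not None:
            heapq.heappush(heads, (nxt, r, it))

# out = []; multiway_merge([iter([1,3,5]), iter([2,4]), iter([0,6])], out.append)
# out == [0, 1, 2, 3, 4, 5, 6]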

Cost of Multi-way Merge-Sort


Number of passes = log_{M/B}(#runs)  ≈  log_{M/B}(N/M)
 Optimal cost = Θ((N/B) · log_{M/B}(N/M)) I/Os

In practice
 M/B ≈ 1000  ⟹  #passes = log_{M/B}(N/M) ≈ 1
 One multiway merge  ⟹  2 passes = few mins
 (Tuning depends on disk features)

 A large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Can compression help?
 Goal: enlarge M and reduce N
 #passes = O(log_{M/B}(N/M))
 Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (alphabet S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space (i.e., assuming the mode occurs > N/2 times).

A = b a c c c d c b a a a c c b c c c

Algorithm
 Use a pair of variables <X,C>
 For each item s of the stream:
   if (X == s) then C++
   else { C--; if (C == 0) { X = s; C = 1; } }
 Return X;
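The same majority-vote scan in runnable Python (our transcription; initialization is handled by treating an exhausted counter as a candidate switch):

def majority_candidate(stream):
    X, C = None, 0
    for s in stream:
        if s == X:
            C += 1
        else:
            C -= 1
            if C <= 0:            # counter exhausted: switch candidate
                X, C = s, 1
    return X                      # the true mode, provided it occurs > N/2 times

# majority_candidate("bacccdcbaaaccbccc") -> 'c'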

Proof
(There can be problems only if the mode occurs ≤ N/2 times.)
If the algorithm ended with X ≠ y, then every one of y’s occurrences was
cancelled by a distinct “negative” mate, hence the mates are ≥ #occ(y).
But then N ≥ 2 · #occ(y) > N: a contradiction.

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 × 10⁹ chars, size = 6Gb
n = 10⁶ documents
TotT = 10⁹ terms (avg term length is 6 chars)
t = 5 × 10⁵ distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix     (t = 500K terms × n = 1 million docs)

              Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
  Antony             1                1             0          0       0        1
  Brutus             1                1             0          1       0        0
  Caesar             1                1             0          1       1        1
  Calpurnia          0                1             0          0       0        0
  Cleopatra          1                0             0          0       0        0
  mercy              1                0             1          1       1        1
  worser             1                0             1          1       1        0

(1 if the play contains the word, 0 otherwise)
Space is 500Gb !

Solution 2: Inverted index

  Brutus    →  2  4  8  16  32  64  128
  Calpurnia →  1  2  3  5  8  13  21  34
  Caesar    →  13  16

We can still do better: i.e. 30-50% of the original text.
1. Typically we use about 12 bytes per posting
2. We have 10⁹ total terms  at least 12Gb space
3. Compressing the 6Gb of documents gets 1.5Gb of data
Better index, but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO: they are 2^n, but the shorter compressed messages are fewer…

  Σ_{i=1..n−1} 2^i  =  2^n − 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the self-information of s is:

  i(s) = log2 (1/p(s)) = − log2 p(s)

Lower probability  higher information.

Entropy is the weighted average of i(s):

  H(S) = Σ_{s∈S} p(s) · log2 (1/p(s))    bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie:
[Figure: binary trie with leaves a = 0, b = 100, c = 101, d = 11.]

Average Length
For a code C with codeword length L[s], the
average length is defined as
  La(C) = Σ_{s∈S} p(s) · L[s]

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

  H(S) ≤ La(C)

Theorem (upper bound, Shannon). For any probability distribution,
there exists a prefix code C such that

  La(C) ≤ H(S) + 1

(the Shannon code takes ⌈log2 1/p⌉ bits per symbol)

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

[Huffman tree: merge a(.1)+b(.2) → (.3);  (.3)+c(.2) → (.5);  (.5)+d(.5) → (1).]

a = 000, b = 001, c = 01, d = 1
There are 2^{n−1} “equivalent” Huffman trees.

What about ties (and thus, tree depth) ?
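A compact Python sketch that builds a Huffman code with a heap (our illustration); on the probabilities above it reproduces codeword lengths 3, 3, 2, 1:

import heapq
from itertools import count

def huffman_code(probs):
    tick = count()                                     # tie-breaker so tuples never compare dicts
    heap = [(p, next(tick), {s: ""}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)                # the two least probable trees
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, next(tick), merged))
    return heap[0][2]

# huffman_code({"a": .1, "b": .2, "c": .2, "d": .5})
#   -> codeword lengths 3,3,2,1, as in the slide (a=000, b=001, c=01, d=1, up to tie-breaking)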

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
  Encoding:  abc…  →  000 001 01 …  =  00000101…
  Decoding:  101001…  →  1 | 01 | 001  =  d c b

[Figure: the codeword tree of the running example, followed root-to-leaf for encoding and bit-by-bit for decoding.]

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. Its self-information is
  −log2(.999) ≈ .00144
If we were to send 1000 such symbols we might hope to use 1000 × .00144 ≈ 1.44 bits.
Using Huffman, we take at least 1 bit per symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra bits per symbol
 but a larger model has to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:
 the model takes |S|^k · (k · log |S|) + h² bits  (where h might be |S|)
 and H0(S^L) ≤ L · Hk(S) + O(k · log |S|), for each k ≤ L

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the Huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged

[Diagram: the byte-aligned, tagged codeword of the word “or” (tag bit + 7-bit configurations), and the compressed text C(T) for T = “bzip or not bzip”, where the dictionary words bzip, or, not and the space receive codewords such as 1a 0b, 1g 0a 0b, 1g 0g 0a.]

CGrep and other ideas...
P = bzip = 1a 0b,  T = “bzip or not bzip”

[Diagram: GREP is run directly on C(T); each candidate byte-aligned codeword is checked against 1a 0b and marked yes/no, finding the two occurrences of bzip.]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary = {bzip, not, or},  P = bzip = 1a 0b,  S = “bzip or not bzip”

[Diagram: the codeword 1a 0b is searched directly in C(S); candidate positions are marked yes/no.]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
[Figure: text T (A B C A B D A B …) with the occurrences of pattern P = A B highlighted.]

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m:

  H(s) = Σ_{i=1..m} 2^{m−i} · s[i]

e.g.  P = 0101:  H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5

s = s’ if and only if H(s) = H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr−1):

  H(Tr) = 2·H(Tr−1) − 2^m·T[r−1] + T[r+m−1]

T = 10110101,  T1 = 1011,  T2 = 0110
  H(T1) = H(1011) = 11
  H(T2) = 2·11 − 2⁴·1 + 0 = 22 − 16 + 0 = 6 = H(0110)

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
  1·2 + 0 ≡ 2 (mod 7)
  2·2 + 1 ≡ 5 (mod 7)
  5·2 + 1 ≡ 4 (mod 7)
  4·2 + 1 ≡ 2 (mod 7)
  2·2 + 1 ≡ 5 (mod 7)   →   Hq(P) = 5

We can still compute Hq(Tr) from Hq(Tr−1):
  2^m (mod q) = 2·(2^{m−1} (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
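A compact Python sketch of the Karp-Rabin scan over a binary text (our illustration; the explicit check of probable matches makes it the deterministic variant):

def karp_rabin(T, P, q):
    n, m = len(T), len(P)
    if m > n: return []
    pow_m = pow(2, m, q)                        # 2^m mod q (the slides compute it incrementally)
    hp = ht = 0
    for i in range(m):                          # fingerprints of P and of the first window T_1
        hp = (2 * hp + int(P[i])) % q
        ht = (2 * ht + int(T[i])) % q
    occ = []
    for r in range(n - m + 1):
        if ht == hp and T[r:r + m] == P:        # verify, so no false matches are reported
            occ.append(r)
        if r + m < n:                           # slide: drop T[r], append T[r+m]
            ht = (2 * ht - pow_m * int(T[r]) + int(T[r + m])) % q
    return occ

# karp_rabin("10110101", "0101", q=7) -> [4]   (0-based; position 5 in the slides)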

Problem 1: Solution
Dictionary = {bzip, not, or},  P = bzip = 1a 0b,  S = “bzip or not bzip”

[Diagram: the fingerprint scan is run over C(S); candidate byte-aligned positions are marked yes/no, and both occurrences of bzip are found.]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

  M (m = 3 rows, n = 10 columns),  T = c a l i f o r n i a:

         c a l i f o r n i a
    f:   0 0 0 0 1 0 0 0 0 0
    o:   0 0 0 0 0 1 0 0 0 0
    r:   0 0 0 0 0 0 1 0 0 0

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit bit-parallelism to compute the j-th column of M from the (j−1)-th one.
We define the m-length binary vector U(x) for each character x of the alphabet:
U(x)[i] = 1 iff P[i] = x.

Example: P = abaac
  U(a) = (1,0,1,1,0)ᵀ    U(b) = (0,1,0,0,0)ᵀ    U(c) = (0,0,0,0,1)ᵀ

How to construct M



Initialize column 0 of M to all zeros.
For j > 0, the j-th column is obtained as

  M(j) = BitShift( M(j−1) ) & U(T[j])

For i > 1, entry M(i,j) = 1 iff
 (1) the first i−1 characters of P match the i−1 characters of T ending at j−1  ⇔  M(i−1, j−1) = 1
 (2) P[i] = T[j]  ⇔  the i-th bit of U(T[j]) = 1

BitShift moves bit M(i−1, j−1) into the i-th position;
ANDing with the i-th bit of U(T[j]) establishes whether both conditions hold.

An example (P = abaac, T = xabxabaaca)

  j = 1:  M(1) = BitShift(M(0)) & U(T[1]) = (1,0,0,0,0)ᵀ & U(x) = (0,0,0,0,0)ᵀ
  j = 2:  M(2) = BitShift(M(1)) & U(a)    = (1,0,0,0,0)ᵀ & (1,0,1,1,0)ᵀ = (1,0,0,0,0)ᵀ
  j = 3:  M(3) = BitShift(M(2)) & U(b)    = (1,1,0,0,0)ᵀ & (0,1,0,0,0)ᵀ = (0,1,0,0,0)ᵀ
  …
  j = 9:  M(9) = BitShift(M(8)) & U(c)    = (1,1,0,0,1)ᵀ & (0,0,0,0,1)ᵀ = (0,0,0,0,1)ᵀ
          the 5th bit is set: an occurrence of P = abaac ends at position 9.

Shift-And method: Complexity








If m ≤ w, any column and any vector U() fit in a memory word:
  each step requires O(1) time.
If m > w, any column and vector U() is split across ⌈m/w⌉ memory words:
  each step requires O(m/w) time.
Overall O(n(1 + m/w) + m) time.
Thus it is very fast when the pattern length is close to the word size
(very often in practice: recall that w = 64 bits on modern architectures).

Some simple extensions


We want to allow the pattern to contain special symbols, like the character class [a-f].

Example: P = [a-b]baac
  U(a) = (1,0,1,1,0)ᵀ    U(b) = (1,1,0,0,0)ᵀ    U(c) = (0,0,0,0,1)ᵀ

What about ‘?’ and ‘[^…]’ (negation)?

Problem 1: Another solution
Dictionary = {bzip, not, or},  P = bzip = 1a 0b,  S = “bzip or not bzip”

[Diagram: the Shift-And automaton is run over the byte-aligned codewords of C(S); candidate positions are marked yes/no, finding the two occurrences of bzip.]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M ( j - 1))  U (T [ j ])
l

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M

l -1

( j - 1))

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1
1 2 3 4 5 6 7 8 910

M1=

T=xabxabaaca
P=
abaad

1

2

3

4

5

6

7

8

9

1
0

1

1

1

1

1

1

1

1

1

1

1

2

0

0

1

0

0

1

0

1

1

0

3

0

0

0

1

0

0

1

0

0

1

4

0

0

0

0

1

0

0

1

0

0

5

0

0

0

0

0

0

0

0

1

0

M0=

1

2

3

4

5

6

7

8

9

1
0

1

0

1

0

0

1

0

1

1

0

1

2

0

0

1

0

0

1

0

0

0

0

3

0

0

0

0

0

0

1

0

0

0

4

0

0

0

0

0

0

0

1

0

0

5

0

0

0

0

0

0

0

0

0

0

How much do we pay?





The running time is O(kn(1+m/w)
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing k mismatches

[Figure: the same compressed text C(S) for S = “bzip or not bzip”; agrep is run over the codewords with P = bot and k = 2, and the matching terms are marked (e.g. not = 1g 0g 0a)]

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3
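For completeness, a small C++ sketch of the classical dynamic-programming computation of d(p,s) (the textbook DP, not the bit-parallel variant mentioned here):

#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>
using namespace std;

int edit_distance(const string& p, const string& s) {
    int n = p.size(), m = s.size();
    vector<vector<int>> D(n + 1, vector<int>(m + 1));
    for (int i = 0; i <= n; i++) D[i][0] = i;                 // i deletions
    for (int j = 0; j <= m; j++) D[0][j] = j;                 // j insertions
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= m; j++)
            D[i][j] = min({D[i - 1][j] + 1,                   // delete p[i-1]
                           D[i][j - 1] + 1,                   // insert s[j-1]
                           D[i - 1][j - 1] + (p[i - 1] != s[j - 1])});  // substitution
    return D[n][m];
}

int main() { printf("%d\n", edit_distance("ananas", "banane")); }   // prints 3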

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding

  γ(x) = 0 0 … 0  followed by x in binary    (Length−1 zeros)
  x > 0 and Length = ⌊log2 x⌋ + 1
  e.g., 9 is represented as <000,1001>

 γ-code for x takes 2⌊log2 x⌋ + 1 bits  (i.e. a factor of 2 from optimal)
 Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of γ-coded integers, reconstruct the original sequence:

  0001000001100110000011101100111

  → 8, 6, 3, 59, 7
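A small C++ sketch of γ-encoding and γ-decoding over a string of '0'/'1' characters (bit packing omitted for clarity); it decodes the exercise above:

#include <cstdio>
#include <string>
#include <vector>
using namespace std;

string gamma_encode(unsigned x) {                    // x > 0
    string bin;
    for (unsigned y = x; y > 0; y >>= 1) bin = char('0' + (y & 1)) + bin;
    return string(bin.size() - 1, '0') + bin;        // Length-1 zeros, then x in binary
}

vector<unsigned> gamma_decode(const string& bits) {
    vector<unsigned> out;
    size_t i = 0;
    while (i < bits.size()) {
        size_t z = 0;
        while (bits[i + z] == '0') z++;              // count the leading zeros
        unsigned x = 0;
        for (size_t j = 0; j <= z; j++) x = (x << 1) | (bits[i + z + j] - '0');
        out.push_back(x);
        i += 2 * z + 1;
    }
    return out;
}

int main() {
    printf("gamma(9) = %s\n", gamma_encode(9).c_str());       // 0001001, as on the slide
    for (unsigned x : gamma_decode("0001000001100110000011101100111")) printf("%u ", x);
    printf("\n");                                             // 8 6 3 59 7
}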

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).

 Recall that: |γ(i)| ≤ 2·log i + 1
 How good is this approach wrt Huffman?  Compression ratio ≤ 2·H0(s) + 1
 Key fact:  1 ≥ Σi=1,…,x pi ≥ x·px  ⟹  x ≤ 1/px

How good is it ?
Encode the integers via γ-coding:  |γ(i)| ≤ 2·log i + 1
The cost of the encoding is (recall i ≤ 1/pi):

  Σi=1,…,|S| pi · |γ(i)|  ≤  Σi=1,…,|S| pi · [ 2·log(1/pi) + 1 ]  =  2·H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2

 A new concept: Continuers vs Stoppers
   Previously we used: s = c = 128
 The main idea is:
   s + c = 256 (we are playing with 8 bits)
   Thus s items are encoded with 1 byte
   And s·c with 2 bytes, s·c² on 3 bytes, ...

An example

 5000 distinct words
 ETDC encodes 128 + 128² = 16512 words on at most 2 bytes
 A (230,26)-dense code encodes 230 + 230·26 = 6210 words on at most 2 bytes, hence more words on 1 byte, which pays off if the distribution is skewed...
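A minimal C++ sketch of the stopper/continuer encoding of a word rank, under the convention that byte values in [0,s) are stoppers and values in [s,256) are continuers (the exact byte layout of the published ETDC/SCDC codes may differ in details):

#include <cstdio>
#include <vector>
using namespace std;

vector<unsigned char> sc_encode(unsigned long r, unsigned s, unsigned c) {
    vector<unsigned char> bytes;
    bytes.push_back(r % s);                  // last byte: a stopper, value in [0,s)
    r /= s;
    while (r > 0) {                          // remaining bytes: continuers in [s,256)
        r -= 1;
        bytes.insert(bytes.begin(), s + r % c);
        r /= c;
    }
    return bytes;
}

int main() {
    // (s,c) = (230,26): ranks 0..229 take 1 byte, ranks 230..6209 take 2 bytes, ...
    for (unsigned r : {0u, 229u, 230u, 6209u}) {
        printf("rank %u ->", r);
        for (unsigned char b : sc_encode(r, 230, 26)) printf(" %u", b);
        printf("\n");
    }
}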

Optimal (s,c)-dense codes
Find the optimal s, assuming c = 256 − s.

 Brute-force approach
 Binary search: on real distributions there seems to be one unique minimum
   (K_s = max codeword length; F_s^k = cumulative probability of the symbols whose codeword length is ≤ k)

Experiments: (s,c)-DC is quite interesting…
 Search is 6% faster than byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:
 It exploits temporal locality, and it is dynamic
 X = 1^n 2^n 3^n … n^n  ⟹  Huff = O(n² log n),  MTF = O(n log n) + n²
 Not much worse than Huffman... but it may be far better
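A minimal C++ sketch of the MTF coder described above (the initial list and the input are only illustrative; the slide's bzip example handles ‘#’ separately, here it would simply be one more symbol in the list):

#include <algorithm>
#include <cstdio>
#include <iterator>
#include <list>
#include <string>
#include <vector>
using namespace std;

vector<int> mtf_encode(const string& text, list<unsigned char>& L) {
    vector<int> out;
    for (unsigned char s : text) {
        auto it = find(L.begin(), L.end(), s);        // position of s in L
        out.push_back(distance(L.begin(), it));       // 1) output the position of s in L
        L.erase(it);
        L.push_front(s);                              // 2) move s to the front of L
    }
    return out;
}

int main() {
    list<unsigned char> L = {'i', 'm', 'p', 's'};     // initial symbol list
    for (int x : mtf_encode("ipppssssss", L)) printf("%d", x);   // 0200300000
    printf("\n");
}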

MTF: how good is it ?
Encode the integers via γ-coding:  |γ(i)| ≤ 2·log i + 1
Put S in the front (so every symbol is charged its first, full-cost emission) and consider the cost of encoding:

  O(|S| log |S|)  +  Σx=1,…,|S|  Σi=2,…,nx  |γ( p_i^x − p_{i−1}^x )|

where p_i^x is the position of the i-th occurrence of symbol x. By Jensen’s inequality:

  ≤  O(|S| log |S|)  +  Σx=1,…,|S|  nx · [ 2·log(N/nx) + 1 ]
  =  O(|S| log |S|)  +  N · [ 2·H0(X) + 1 ]

  ⟹  La[mtf]  ≤  2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep the MTF-list efficiently:
 Search tree
   Leaves contain the symbols, ordered as in the MTF-list
   Nodes contain the size of their descending subtree
 Hash table
   key is a symbol
   data is a pointer to the corresponding tree leaf

 Each tree operation takes O(log |S|) time
 Total is O(n log |S|), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings → just the run lengths and one bit
Properties:
 It exploits spatial locality, and it is a dynamic code
 There is a memory
 X = 1^n 2^n 3^n … n^n  ⟹  Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)
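A minimal C++ sketch of RLE on the slide's example:

#include <cstdio>
#include <string>
#include <utility>
#include <vector>
using namespace std;

vector<pair<char, int>> rle(const string& s) {
    vector<pair<char, int>> runs;
    for (char c : s)
        if (!runs.empty() && runs.back().first == c) runs.back().second++;
        else runs.push_back({c, 1});
    return runs;
}

int main() {
    for (auto [c, len] : rle("abbbaacccca"))          // (a,1)(b,3)(a,2)(c,4)(a,1)
        printf("(%c,%d)", c, len);
    printf("\n");
}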

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive):

  f(i) = Σj=1,…,i−1 p(j)

e.g.,  p(a) = .2, p(b) = .5, p(c) = .3   ⟹   f(a) = .0, f(b) = .2, f(c) = .7
so  a → [0,.2),  b → [.2,.7),  c → [.7,1.0)

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
  start:    [0, 1.0)
  after b:  [.2, .7)     (l = 0 + 1·.2,    s = 1·.5  = .5)
  after a:  [.2, .3)     (l = .2 + .5·.0,  s = .5·.2 = .1)
  after c:  [.27, .3)    (l = .2 + .1·.7,  s = .1·.3 = .03)
The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
  l0 = 0      li = li−1 + si−1 · f[ci]
  s0 = 1      si = si−1 · p[ci]

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

  sn = Πi=1,…,n p[ci]

The interval for a message sequence will be called the
sequence interval
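A minimal C++ sketch of the sequence-interval recurrence above, using floating point only for illustration (a real coder uses the integer version discussed later):

#include <cstdio>
#include <map>
#include <string>
using namespace std;

int main() {
    map<char, double> p = {{'a', .2}, {'b', .5}, {'c', .3}};
    map<char, double> f;                        // cumulative prob. up to c (not included)
    double cum = 0;
    for (auto& [c, pc] : p) { f[c] = cum; cum += pc; }

    double l = 0, s = 1;                        // l_0 = 0, s_0 = 1
    for (char c : string("bac")) {
        l = l + s * f[c];                       // l_i = l_{i-1} + s_{i-1} * f[c_i]
        s = s * p[c];                           // s_i = s_{i-1} * p[c_i]
    }
    printf("sequence interval = [%.4f, %.4f)\n", l, l + s);   // [0.2700, 0.3000)
}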

Uniquely defining an interval
Important property: the intervals for distinct messages of length n will never overlap.
Therefore, specifying any number in the final interval uniquely determines the message.
Decoding is similar to encoding, but at each step we need to determine what the message value is and then reduce the interval.

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
  .49 ∈ [.2, .7)                                              → b
  within [.2,.7):  a = [.2,.3),  b = [.3,.55),  c = [.55,.7);     .49 ∈ [.3,.55)   → b
  within [.3,.55): a = [.3,.35), b = [.35,.475), c = [.475,.55);  .49 ∈ [.475,.55) → c

The message is bbc.

Representing a real number
Binary fractional representation:
  .75   = .11
  1/3   = .0101…   (periodic)
  11/16 = .1011

Algorithm (to emit the bits of x ∈ [0,1)):
  1. x = 2·x
  2. if x < 1 output 0
  3. else x = x − 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
e.g.  [0,.33) → .01     [.33,.66) → .1     [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
          min      max      interval
  .11     .110…    .111…    [.75, 1.0)
  .101    .1010…   .1011…   [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Example: sequence interval [.61, .79); the code interval of .101 is [.625, .75), which is contained in it.

Can use L + s/2 truncated to 1 + ⌈log2(1/s)⌉ bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

  1 + ⌈log2(1/s)⌉  =  1 + ⌈log2 Πi (1/pi)⌉
                   ≤  2 + Σi=1,…,n log2(1/pi)
                   =  2 + Σk=1,…,|S| n·pk·log2(1/pk)
                   =  2 + n·H0   bits

In practice ≈ nH0 + 0.02·n bits, because of rounding.

Integer Arithmetic Coding
The problem is that operations on arbitrary-precision real numbers are expensive.
Key ideas of the integer version:
 Keep integers in the range [0..R) where R = 2^k
 Use rounding to generate an integer interval
 Whenever the sequence interval falls into the top, bottom or middle half, expand the interval by a factor of 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
 If l ≥ R/2 (top half):
   output 1 followed by m 0s;  m = 0;  the message interval is expanded by 2
 If u < R/2 (bottom half):
   output 0 followed by m 1s;  m = 0;  the message interval is expanded by 2
 If l ≥ R/4 and u < 3R/4 (middle half):
   increment m;  the message interval is expanded by 2
 In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
[Figure: the ATB state machine driven by PPM: at each step the symbol s = c or esc is coded with probability p[s | context], mapping the interval (L,s) to (L’,s’)]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts        String = ACCBACCACBA B,   k = 2

Order 0 (empty context):  A = 4,  B = 2,  C = 5,  $ = 3

Order 1:   A:  C = 3, $ = 1
           B:  A = 2, $ = 1
           C:  A = 1, B = 2, C = 2, $ = 3

Order 2:   AC: B = 1, C = 2, $ = 2
           BA: C = 1, $ = 1
           CA: C = 1, $ = 1
           CB: A = 2, $ = 1
           CC: A = 1, B = 1, $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:
 Output the triple <d, len, c> where
   d = distance of the copied string wrt the current position
   len = length of the longest match
   c = next char in the text beyond the longest match
 Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code
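A minimal C++ sketch of LZ77 parsing with a sliding window (a quadratic-time match search, for clarity; gzip uses hashing and the LZSS format instead). On the windowed example of the previous slide it emits exactly (0,0,a), (1,1,c), (3,4,b), (3,3,a), (1,2,c):

#include <algorithm>
#include <cstdio>
#include <string>
using namespace std;

int main() {
    string T = "aacaacabcabaaac";
    int W = 6;                                       // window size
    for (int cur = 0; cur < (int)T.size();) {
        int best_len = 0, best_d = 0;
        for (int i = max(0, cur - W); i < cur; i++) {   // candidate start positions in the window
            int len = 0;                             // the copy may overlap the part being coded
            while (cur + len < (int)T.size() - 1 && T[i + len] == T[cur + len]) len++;
            if (len > best_len) { best_len = len; best_d = cur - i; }
        }
        printf("(%d,%d,%c)\n", best_d, best_len, T[cur + best_len]);
        cur += best_len + 1;                         // advance by len + 1
    }
}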

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later
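A minimal C++ sketch of the LZW encoder (using the slide's toy initial codes a=112, b=113, c=114; the decoder and the special SSc case are omitted):

#include <cstdio>
#include <map>
#include <string>
using namespace std;

int main() {
    string T = "aabaacababacb";
    map<string, int> dict = {{"a", 112}, {"b", 113}, {"c", 114}};   // initial dictionary
    int next_code = 256;
    string S;                                        // current longest match
    for (char c : T) {
        if (dict.count(S + c)) S += c;               // extend the match
        else {
            printf("%d ", dict[S]);                  // output the id of S (no extra char)
            dict[S + c] = next_code++;               // still add Sc to the dictionary
            S = string(1, c);                        // restart from c
        }
    }
    if (!S.empty()) printf("%d", dict[S]);           // flush the last match
    printf("\n");   // 112 112 113 256 114 257 261 114 113
}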

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
We are given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows   (1994)

F                 L
#  mississipp   i
i  #mississip   p
i  ppi#missis   s
i  ssippi#mis   s
i  ssissippi#   m
m  ississippi   #
p  i#mississi   p
p  pi#mississ   i
s  ippi#missi   s
s  issippi#mi   s
s  sippi#miss   i
s  sissippi#m   i

(the row whose rotation is mississippi# is the original text T)

A famous example

Much
longer...

A useful tool: the L → F mapping

[Same sorted-rotation matrix as above, with first column F and last column L]

How do we map L's chars onto F's chars ?
... we need to distinguish equal chars in F...

Take two equal chars of L: rotating their rows rightward by one position shows that they keep the same relative order !!

The BWT is invertible
[Same sorted-rotation matrix as above, with first column F and last column L]
Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}

How to compute the BWT ?
SA    BWT matrix      L
12    #mississipp     i
11    i#mississip     p
 8    ippi#missis     s
 5    issippi#mis     s
 2    ississippi#     m
 1    mississippi     #
10    pi#mississi     p
 9    ppi#mississ     i
 7    sippi#missi     s
 4    sissippi#mi     s
 6    ssippi#miss     i
 3    ssissippi#m     i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]
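A minimal C++ sketch putting the pieces together: it builds SA by plain sorting (the “elegant but inefficient” way), derives L via L[i] = T[SA[i]−1], and inverts the BWT with the LF mapping; here the backward reconstruction starts from the row holding T itself, which sidesteps the sentinel rotation.

#include <algorithm>
#include <cstdio>
#include <numeric>
#include <string>
#include <vector>
using namespace std;

int main() {
    string T = "mississippi#";                       // '#': unique, smallest end marker
    int n = T.size();

    vector<int> SA(n);                               // suffix array by direct sorting
    iota(SA.begin(), SA.end(), 0);
    sort(SA.begin(), SA.end(), [&](int a, int b) { return T.substr(a) < T.substr(b); });

    string L(n, ' ');                                // L[i] = char preceding the i-th suffix (cyclically)
    for (int i = 0; i < n; i++) L[i] = T[(SA[i] + n - 1) % n];
    printf("L = %s\n", L.c_str());                   // ipssm#pissii

    // LF mapping: the k-th occurrence of c in L corresponds to the k-th c in F = sorted(L).
    vector<int> cnt(256, 0), Cstart(256, 0), LF(n), seen(256, 0);
    for (unsigned char c : L) cnt[c]++;
    for (int c = 1; c < 256; c++) Cstart[c] = Cstart[c - 1] + cnt[c - 1];
    for (int i = 0; i < n; i++) { unsigned char c = L[i]; LF[i] = Cstart[c] + seen[c]++; }

    // Reconstruct T backward, starting from the row that holds T itself (SA[r] == 0).
    int r = find(SA.begin(), SA.end(), 0) - SA.begin();
    string rec(n, ' ');
    for (int i = n - 1; i >= 0; i--) { rec[i] = L[r]; r = LF[r]; }
    printf("T = %s\n", rec.c_str());                 // mississippi#
}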

How to construct SA from T ?
SA
12   #
11   i#
 8   ippi#
 5   issippi#
 2   ississippi#
 1   mississippi#
10   pi#
 9   ppi#
 7   sippi#
 4   sissippi#
 6   ssippi#
 3   ssissippi#
Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous  →  L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


 Size
   1 trillion pages reachable (Google, 7/2008)
   5–40K per page => hundreds of terabytes
   size grows every day!!
 Change
   8% new pages and 25% new links each week
   average page lifetime of about 10 days

The Bow Tie

Some definitions

 Weakly connected components (WCC)
   Set of nodes such that from any node one can reach any other node via an undirected path.
 Strongly connected components (SCC)
   Set of nodes such that from any node one can reach any other node via a directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?

 The largest artifact ever conceived by humans
 Exploit the structure of the Web for
   crawl strategies
   search
   spam detection
   discovering communities on the web
   classification/organization
 Predict the evolution of the Web
   sociological understanding

Many other large graphs…

 Physical network graph
   V = routers
   E = communication links
 The “cosine” graph (undirected, weighted)
   V = static web pages
   E = semantic distance between pages
 Query-Log graph (bipartite, weighted)
   V = queries and URLs
   E = (q,u) if u is a result for q and has been clicked by some user who issued q
 Social graph (undirected, unweighted)
   V = users
   E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)
 V = URLs, E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN & no OUT)
Three key properties:
 Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution

Altavista crawl (1999) and WebBase crawl (2001): the indegree follows a power-law distribution

  Pr[ in-degree(u) = k ]  ∝  1/k^α,    α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)
 V = URLs, E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN, no OUT)

Three key properties:
 Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1
 Locality: usually most of the hyperlinks point to other URLs on the same host (about 80%).
 Similarity: pages close to each other in lexicographic order tend to share many outgoing lists

A Picture of the Web Graph

[Figure: dot-plot of the adjacency matrix (coordinates i, j)]

21 million pages, 150 million links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

Adjacency list with copy lists (similarity)

Each bit of y informs whether the corresponding successor of y is also a successor of the reference x;
the reference index is chosen in [0,W] so as to give the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1
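A simplified C++ sketch of the gap encoding of one successor list (only the “compressed gaps” step of the scheme above, with the signed first gap mapped to a natural number; no copy-lists or intervals; the node and its successors are illustrative, and the small values would then be fed to a γ/ζ coder):

#include <cstdio>
#include <vector>
using namespace std;

// Map a possibly negative value v to a natural: v >= 0 -> 2v, v < 0 -> 2|v|-1.
unsigned long to_nat(long v) { return v >= 0 ? 2UL * v : 2UL * (-v) - 1; }

vector<unsigned long> encode_successors(long x, const vector<long>& succ) {
    vector<unsigned long> out;
    if (succ.empty()) return out;
    out.push_back(to_nat(succ[0] - x));              // first gap is wrt the source, may be negative
    for (size_t i = 1; i < succ.size(); i++)
        out.push_back(succ[i] - succ[i - 1] - 1);    // remaining gaps are >= 0 (locality => small)
    return out;
}

int main() {
    // Node 15 with successors 13, 15, 16, 17, 18, 19, 23, 24, 203 (hypothetical list).
    for (unsigned long g : encode_successors(15, {13, 15, 16, 17, 18, 19, 23, 24, 203}))
        printf("%lu ", g);                           // 3 1 0 0 0 0 3 0 178
    printf("\n");
}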

Algoritmi per IR

Compression of file collections

Background

[Figure: a sender transmits data to a receiver; the receiver already holds some knowledge about the data]

 network links are getting faster and faster but
 many clients are still connected by fairly slow links (mobile?)
 people wish to send more and more data

How can we make this transparent to the user?

Two standard techniques

 caching: “avoid sending the same object again”
   done on the basis of whole objects
   only works if objects are completely unchanged
   How about objects that are only slightly changed?
 compression: “remove redundancy in transmitted data”
   avoid repeated substrings in the data
   can be extended to the history of past transmissions (overhead)
   What if the sender has never seen the data at the receiver ?

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization

 Delta compression [diff, zdelta, REBL,…]
   Compress file f deploying file f’
   Compress a group of files
   e.g. speed up web access by sending the differences between the requested page and the ones available in cache
 File synchronization [rsync, zsync]
   Client updates an old file f_old with the f_new available on a server
   e.g. mirroring, shared crawling, content distribution networks
 Set reconciliation
   Client updates a structured old file f_old with the f_new available on a server
   e.g. update of contacts or appointments, intersecting inverted lists in a P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77-scheme provides an efficient, optimal solution:
 fknown is the “previously encoded text”; compress the concatenation fknown·fnew starting from fnew

zdelta is one of the best implementations
           Emacs size   Emacs time
uncompr    27Mb         ---
gzip       8Mb          35 secs
zdelta     1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

[Figure: Client ↔ Proxy across the slow link (requests and delta-encoded pages wrt a reference), Proxy ↔ web across the fast link (plain request and page)]

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: example of the weighted graph G_F on a few files plus the dummy node, with zdelta/gzip sizes (e.g. 20, 123, 220, 620, 2000) as edge weights, and the min branching highlighted]

           space   time
uncompr    30Mb    ---
tgz        20%     linear
THIS       8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
           space    time
uncompr    260Mb    ---
tgz        12%      2 mins
THIS       8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
[Figure: the Client holds f_old and sends a request to the Server, which holds f_new and sends back an update]

 client wants to update an out-dated file
 server has the new file but does not know the old file
 update without sending the entire f_new (using similarity)
 rsync: file-synch tool, distributed with Linux

Delta compression is a sort of local synch, since the server has both copies of the files

The rsync algorithm
[Figure: the Client sends the block hashes of f_old to the Server, which holds f_new and returns the encoded file]
The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
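A minimal C++ sketch of a 4-byte rolling checksum in the spirit of rsync's weak hash (not rsync's exact formula): the point is that sliding the window by one byte updates the hash in O(1).

#include <cstdint>
#include <cstdio>
#include <string>
using namespace std;

struct Rolling {
    uint32_t a = 0, b = 0;            // a = sum of window bytes, b = sum of prefix sums
    size_t len = 0;
    void init(const string& s, size_t off, size_t n) {
        a = b = 0; len = n;
        for (size_t i = 0; i < n; i++) { a += (unsigned char)s[off + i]; b += a; }
    }
    // Slide the window one byte to the right: drop 'out', append 'in' (O(1) per step).
    void roll(unsigned char out, unsigned char in) {
        a = a - out + in;
        b = b - (uint32_t)len * out + a;
    }
    uint32_t digest() const { return (b << 16) | (a & 0xffff); }
};

int main() {
    string s = "the quick brown fox jumps over the lazy dog";
    size_t B = 8;
    Rolling r, check;
    r.init(s, 0, B);
    for (size_t i = 1; i + B <= s.size(); i++) {
        r.roll(s[i - 1], s[i - 1 + B]);
        check.init(s, i, B);                          // recompute from scratch to verify
        if (r.digest() != check.digest()) printf("mismatch at %zu\n", i);
    }
    printf("rolling hash of the last window = %08x\n", (unsigned)r.digest());
}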

Rsync: some experiments

          gcc size   emacs size
total     27288      27326
gzip       7563       8577
zdelta      227       1431
rsync       964       4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends the hashes (unlike the client in rsync), and the client checks them.
The server deploys the common fref to compress the new ftar (rsync, instead, just compresses it).

A multi-round protocol
 k blocks of n/k elements
 log(n/k) levels
 If the distance is k, then on each level at most k hashes do not find a match in the other file.
 The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

[Figure: P aligned at position i of T, i.e. P is a prefix of T[i,N]]

Occurrences of P in T = all suffixes of T having P as a prefix

Example: P = si, T = mississippi  →  occurrences at positions 4, 7

SUF(T) = sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Figure: the suffix tree of T# = mississippi#; edge labels are substrings (e.g. ssi, si, ppi#, i#) and the 12 leaves store the starting positions 1…12 of the corresponding suffixes]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Explicitly storing SUF(T) takes Θ(N²) space; the Suffix Array keeps only the suffix pointers:

SA    SUF(T)                 T = mississippi#,   P = si
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#

Suffix Array space:
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison

[Figure: binary search over the SA of T = mississippi# for P = si; comparing P with the middle suffix, P is larger, so the search continues in the right half; 2 accesses per step]

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison

[Figure: the next step of the same binary search for P = si; this time P is smaller, so the search continues in the left half]

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]

Locating the occurrences
[Figure: SA of T = mississippi#; the suffixes prefixed by P = si form a contiguous range (sippi#, sissippi#), i.e. text positions 7 and 4, delimited by the binary searches for si# and si$; occ = 2]

Suffix Array search
• O (p + log2 N + occ) time

(with the convention # < Σ < $)
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]
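A minimal C++ sketch of the O(p log N) suffix-array search (the SA is built naively here just to keep the example self-contained):

#include <algorithm>
#include <cstdio>
#include <numeric>
#include <string>
#include <vector>
using namespace std;

int main() {
    string T = "mississippi#", P = "si";
    int n = T.size();
    vector<int> SA(n);
    iota(SA.begin(), SA.end(), 0);
    sort(SA.begin(), SA.end(), [&](int a, int b) { return T.substr(a) < T.substr(b); });

    // Indirect binary searches: each comparison touches at most |P| chars of a suffix.
    auto cmp_low = [&](int s, const string& p) { return T.compare(s, p.size(), p) < 0; };
    auto cmp_up  = [&](const string& p, int s) { return T.compare(s, p.size(), p) > 0; };
    auto lo = lower_bound(SA.begin(), SA.end(), P, cmp_low);
    auto hi = upper_bound(SA.begin(), SA.end(), P, cmp_up);

    printf("occ = %ld, at positions:", (long)(hi - lo));
    for (auto it = lo; it != hi; ++it) printf(" %d", *it + 1);    // 1-based: 7 and 4
    printf("\n");
}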

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp   SA                         (Lcp[i] = lcp between SA[i] and the previous suffix)
 -    12   #
 0    11   i#
 1     8   ippi#
 1     5   issippi#
 4     2   ississippi#
 0     1   mississippi#
 0    10   pi#
 1     9   ppi#
 0     7   sippi#
 2     4   sissippi#
 1     6   ssippi#
 3     3   ssissippi#

T = mississippi#   (e.g., Lcp = 4 for the adjacent suffixes issippi# and ississippi#)
• How long is the common prefix between T[i,...] and T[j,...] ?
  • It is the min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  • Search for some Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  • Search for a window Lcp[i,i+C-2] whose entries are all ≥ L
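A minimal C++ sketch of these Lcp-based queries (SA and Lcp built naively, just to make the example runnable):

#include <algorithm>
#include <cstdio>
#include <numeric>
#include <string>
#include <vector>
using namespace std;

int main() {
    string T = "mississippi#";
    int n = T.size();
    vector<int> SA(n);
    iota(SA.begin(), SA.end(), 0);
    sort(SA.begin(), SA.end(), [&](int a, int b) { return T.substr(a) < T.substr(b); });

    vector<int> Lcp(n - 1);
    for (int i = 0; i + 1 < n; i++) {                // Lcp[i] = lcp(SA[i], SA[i+1])
        int a = SA[i], b = SA[i + 1], l = 0;
        while (T[a + l] == T[b + l]) l++;            // the unique '#' stops the scan
        Lcp[i] = l;
    }

    // Is there a repeated substring of length >= L?  Search for some Lcp[i] >= L.
    int L = 4;
    bool repeated = any_of(Lcp.begin(), Lcp.end(), [&](int x) { return x >= L; });
    printf("repeated substring of length >= %d: %s\n", L, repeated ? "yes" : "no");

    // Is there a substring of length >= L occurring >= C times?
    // Search for a window Lcp[i, i+C-2] whose entries are all >= L.
    int C = 3; L = 1;
    bool found = false;
    for (int i = 0; i + C - 2 < (int)Lcp.size() && !found; i++) {
        bool ok = true;
        for (int j = i; j <= i + C - 2; j++) ok = ok && Lcp[j] >= L;
        found = ok;
    }
    printf("substring of length >= %d occurring >= %d times: %s\n", L, C, found ? "yes" : "no");
}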


Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M ( j - 1))  U (T [ j ])
l

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M

l -1

( j - 1))

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1
1 2 3 4 5 6 7 8 910

M1=

T=xabxabaaca
P=
abaad

1

2

3

4

5

6

7

8

9

1
0

1

1

1

1

1

1

1

1

1

1

1

2

0

0

1

0

0

1

0

1

1

0

3

0

0

0

1

0

0

1

0

0

1

4

0

0

0

0

1

0

0

1

0

0

5

0

0

0

0

0

0

0

0

1

0

M0=

1

2

3

4

5

6

7

8

9

1
0

1

0

1

0

0

1

0

1

1

0

1

2

0

0

1

0

0

1

0

0

0

0

3

0

0

0

0

0

0

1

0

0

0

4

0

0

0

0

0

0

0

1

0

0

5

0

0

0

0

0

0

0

0

0

0

How much do we pay?





The running time is O(kn(1+m/w)
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Delection: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding
γ(x) = (Length−1 zeroes) followed by x written in binary,
for x > 0 and Length = ⌊log2 x⌋ + 1.
e.g., 9 is represented as <000,1001>.

The γ-code for x takes 2⌊log2 x⌋ + 1 bits

(i.e. a factor of 2 from optimal)



Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of γ-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8

6

3

59

7
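A minimal Python sketch (not from the slides) of γ encoding/decoding; decoding the bit string above gives back 8, 6, 3, 59, 7.

def gamma_encode(x):                      # x > 0
    b = bin(x)[2:]                        # binary representation of x
    return "0" * (len(b) - 1) + b         # (Length-1) zeroes, then x in binary

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":             # count the leading zeroes = Length-1
            z += 1; i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out

print(gamma_decode("0001000001100110000011101100111"))   # [8, 6, 3, 59, 7]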

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).

Recall that: |γ(i)| ≤ 2·log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2·H0(s) + 1
Key fact:   1 ≥ Σi=1,...,x pi ≥ x·px   ⟹   x ≤ 1/px

How good is it ?
Encode the integers via γ-coding:  |γ(i)| ≤ 2·log i + 1
The cost of the encoding is (recall that i ≤ 1/pi):

   Σi=1,...,|S| pi · |γ(i)|   ≤   Σi=1,...,|S| pi · [ 2·log(1/pi) + 1 ]   =   2·H0(X) + 1

Not much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
and s·c with 2 bytes, s·c² on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 128² = 16512 words on at most 2 bytes
(230,26)-dense code encodes 230 + 230·26 = 6210 words on at most 2
bytes, hence more words on 1 byte, which pays off if the distribution is skewed...
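A minimal Python sketch (not from the slides) that counts how many words an (s,c)-dense code fits in b bytes, reproducing the two totals above.

def sc_words(s, c, b):
    return c ** (b - 1) * s          # b-1 continuer bytes followed by one stopper byte

print(sc_words(128, 128, 1), sc_words(128, 128, 2))   # ETDC: 128 and 16384 -> 16512 on <= 2 bytes
print(sc_words(230, 26, 1), sc_words(230, 26, 2))     # (230,26): 230 and 5980 -> 6210 on <= 2 bytes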

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploits temporal locality, and it is dynamic

X = 1^n 2^n 3^n … n^n  ⟹  Huff = O(n² log n) bits,  MTF = O(n log n) + n² bits

Not much worse than Huffman
...but it may be far better
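A minimal Python sketch (not from the slides) of Move-to-Front coding; positions are 0-based here.

def mtf_encode(text, alphabet):
    L = list(alphabet)                 # current symbol list
    out = []
    for s in text:
        i = L.index(s)                 # 1) output the position of s in L
        out.append(i)
        L.pop(i); L.insert(0, s)       # 2) move s to the front of L
    return out

print(mtf_encode("aaabbbccc", "abc"))  # [0, 0, 0, 1, 0, 0, 2, 0, 0]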

MTF: how good is it ?
Encode the integers via γ-coding:  |γ(i)| ≤ 2·log i + 1
Put S in the front and consider the cost of encoding (nx = number of occurrences of symbol x, px,i = position of its i-th occurrence):

   O(|S| log |S|)  +  Σx=1,...,|S| Σi=2,...,nx | γ( px,i − px,i-1 ) |

By Jensen’s inequality:

   ≤  O(|S| log |S|)  +  Σx=1,...,|S| nx · [ 2·log(N/nx) + 1 ]
   =  O(|S| log |S|)  +  N · [ 2·H0(X) + 1 ]

Hence   La[mtf]  ≤  2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings, just the run lengths and one starting bit suffice.
Properties:

Exploits spatial locality, and it is a dynamic code (there is a memory)

X = 1^n 2^n 3^n … n^n  ⟹  Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)
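A minimal Python sketch (not from the slides) of RLE on the example string.

def rle(s):
    out, i = [], 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1
        out.append((s[i], j - i))      # (symbol, run length)
        i = j
    return out

print(rle("abbbaacccca"))              # [('a',1), ('b',3), ('a',2), ('c',4), ('a',1)]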

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive), according to the cumulative probabilities

   f(i) = Σj=1,...,i-1 p(j)

e.g.  p(a) = .2, p(b) = .5, p(c) = .3  ⟹  f(a) = .0, f(b) = .2, f(c) = .7
      a → [0, .2),   b → [.2, .7),   c → [.7, 1.0)

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac   (p(a) = .2, p(b) = .5, p(c) = .3)

   start:     [0, 1)
   after b:   [.2, .7)     (a → [.2,.3), b → [.3,.55), c → [.55,.7))
   after a:   [.2, .3)     (a → [.2,.22), b → [.22,.27), c → [.27,.3))
   after c:   [.27, .3)

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c1…cn with probabilities p[c], use the following:

   l0 = 0,   s0 = 1
   li = li-1 + si-1 · f[ci]
   si = si-1 · p[ci]

f[c] is the cumulative prob. up to symbol c (not included).
The final interval size is

   sn = Πi=1,...,n p[ci]

The interval [ln, ln+sn) for a message sequence will be called the sequence interval.
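A minimal Python sketch (not from the slides) of the li/si recurrence above, reproducing the interval [.27,.3) for the message bac.

p = {"a": 0.2, "b": 0.5, "c": 0.3}
f = {"a": 0.0, "b": 0.2, "c": 0.7}     # cumulative probability up to (not including) the symbol

def sequence_interval(msg):
    l, s = 0.0, 1.0                    # l0 = 0, s0 = 1
    for c in msg:
        l = l + s * f[c]               # li = li-1 + si-1 * f[ci]
        s = s * p[c]                   # si = si-1 * p[ci]
    return l, l + s

print(sequence_interval("bac"))        # (0.27, 0.3) up to floating-point rounding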

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore, specifying any number in the final interval uniquely determines the message.
Decoding is similar to encoding: at each step we determine the next message symbol and then reduce the interval accordingly.

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3 (p(a) = .2, p(b) = .5, p(c) = .3):

   [0, 1):      .49 ∈ [.2, .7)      → b
   [.2, .7):    .49 ∈ [.3, .55)     → b
   [.3, .55):   .49 ∈ [.475, .55)   → c

The message is bbc.

Representing a real number
Binary fractional representation:

   .75   = .11
   1/3   = .010101…
   11/16 = .1011

Algorithm
   1. x = 2·x
   2. if x < 1, output 0
   3. else x = x − 1; output 1

So how about just using the shortest binary fractional representation lying in the sequence interval?
e.g.  [0,.33) → .01     [.33,.66) → .1     [.66,1) → .11
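A minimal Python sketch (not from the slides) of the doubling algorithm above; the number of emitted bits is a parameter.

def binary_fraction(x, nbits):          # 0 <= x < 1
    bits = ""
    for _ in range(nbits):
        x *= 2                          # 1. x = 2 * x
        if x < 1:
            bits += "0"                 # 2. if x < 1 output 0
        else:
            x -= 1; bits += "1"         # 3. else x = x - 1; output 1
    return bits

print(binary_fraction(0.75, 4), binary_fraction(1/3, 6), binary_fraction(11/16, 4))
# 1100  010101  1011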

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions.

   number   min      max      interval
   .11      .110…    .111…    [.75, 1.0)
   .101     .1010…   .1011…   [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval: [.61, .79)
Code Interval (.101): [.625, .75)

Can use L + s/2 truncated to 1 + ⌈log2 (1/s)⌉ bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

   1 + ⌈log (1/s)⌉  =  1 + ⌈log Πi=1,n (1/pi)⌉
                     ≤  2 + Σi=1,n log (1/pi)
                     =  2 + Σk=1,|S| n·pk·log (1/pk)
                     =  2 + n·H0   bits

In practice it is nH0 + 0.02·n bits, because of rounding.

Integer Arithmetic Coding
The problem is that operations on arbitrary-precision real numbers are expensive.
Key ideas of the integer version:

 Keep integers in range [0..R) where R = 2^k
 Use rounding to generate integer intervals
 Whenever the sequence interval falls into the top, bottom or middle half, expand the interval by a factor of 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 (top half):
 Output 1 followed by m 0s; set m = 0; the message interval is expanded by 2
If u < R/2 (bottom half):
 Output 0 followed by m 1s; set m = 0; the message interval is expanded by 2
If l ≥ R/4 and u < 3R/4 (middle half):
 Increment m; the message interval is expanded by 2
In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine: the ATB maps the current interval (L,s) and a symbol c, drawn from the distribution (p1,....,pS), to the new interval (L’,s’).

[Figure: the interval [L, L+s) is narrowed to [L’, L’+s’) by symbol c.]

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
The ATB is driven by p[ s | context ], where s = c or esc.

[Figure: the interval (L,s) is mapped to (L’,s’) by the ATB using the conditional distribution.]

Encoder and Decoder must know the protocol for selecting the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
String = ACCBACCACBA B      k = 2

Context: Empty       Context (order 1)       Context (order 2)
  A = 4                A:  C = 3, $ = 1        AC:  B = 1, C = 2, $ = 2
  B = 2                B:  A = 2, $ = 1        BA:  C = 1, $ = 1
  C = 5                C:  A = 1, B = 2,       CA:  C = 1, $ = 1
  $ = 3                    C = 2, $ = 3        CB:  A = 2, $ = 1
                                               CC:  A = 1, B = 1, $ = 2

You find this at: compression.ru/ds/
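A minimal Python sketch (not from the slides) that rebuilds the count tables above from the string ACCBACCACBA with k = 2.

from collections import defaultdict

def ppm_counts(s, k):
    tables = [defaultdict(lambda: defaultdict(int)) for _ in range(k + 1)]
    for i, c in enumerate(s):
        for order in range(k + 1):
            if i >= order:
                ctx = s[i - order:i]               # the previous `order` characters
                tables[order][ctx][c] += 1
    return tables

tables = ppm_counts("ACCBACCACBA", 2)
print(dict(tables[0][""]))        # {'A': 4, 'C': 5, 'B': 2}
print(dict(tables[1]["C"]))       # {'C': 2, 'B': 2, 'A': 1}
print(dict(tables[2]["AC"]))      # {'C': 2, 'B': 1}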

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:
 Output the triple ⟨d, len, c⟩ where
   d   = distance of the copied string wrt the current position
   len = length of the longest match
   c   = next char in the text beyond the longest match
 Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if len > d? (the copy overlaps the part of the text still to be written)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
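A minimal Python sketch (not from the slides) of LZ77 decoding; the byte-by-byte copy handles the overlapping case exactly as the loop above, and the triples below are those of the windowed example.

def lz77_decode(triples):
    out = []
    for d, length, c in triples:
        start = len(out) - d
        for i in range(length):            # byte-by-byte copy allows overlap (len > d)
            out.append(out[start + i])
        out.append(c)
    return "".join(out)

print(lz77_decode([(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')]))
# aacaacabcabaaac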

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
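A minimal Python sketch (not from the slides) of LZW encoding; the initial dictionary is restricted to the three symbols of the example, using the slide’s convention a = 112, b = 113, c = 114.

def lzw_encode(text, first_codes):
    dic = dict(first_codes)               # e.g. {'a': 112, 'b': 113, 'c': 114}
    next_code = 256
    out, S = [], ""
    for c in text:
        if S + c in dic:
            S = S + c                      # extend the current match
        else:
            out.append(dic[S])             # emit the code of the longest match S
            dic[S + c] = next_code         # add Sc to the dictionary (c not sent)
            next_code += 1
            S = c
    out.append(dic[S])
    return out

print(lzw_encode("aabaacababacb", {"a": 112, "b": 113, "c": 114}))
# [112, 112, 113, 256, 114, 257, 261, 114, 113]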

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform (1994)
Let us be given a text T = mississippi#

All rotations of T:      Sorted rows (F = first column, L = last column):
mississippi#             # mississipp i
ississippi#m             i #mississip p
ssissippi#mi             i ppi#missis s
sissippi#mis             i ssippi#mis s
issippi#miss             i ssissippi# m
ssippi#missi             m ississippi #
sippi#missis             p i#mississi p
ippi#mississ             p pi#mississ i
ppi#mississi             s ippi#missi s
pi#mississip             s issippi#mi s
i#mississipp             s sippi#miss i
#mississippi             s sissippi#m i

A famous example

Much
longer...

A useful tool: the L → F mapping
[Figure: the sorted BWT matrix, with F = # i i i i m p p s s s s and L = i p s s m # p i s s i i; T is unknown to the decoder.]

How do we map L’s chars onto F’s chars ?
... we need to distinguish equal chars in F...

Take two equal chars of L and rotate their rows rightward by one position: the two rows keep the same relative order, so equal chars appear in F in the same relative order as in L.

The BWT is invertible
[Figure: the sorted BWT matrix with F = # i i i i m p p s s s s and L = i p s s m # p i s s i i.]

Two key properties:
1. The LF-array maps L’s chars to F’s chars
2. L[ i ] precedes F[ i ] in T

Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
  Compute LF[0,n-1];
  r = 0; i = n;
  while (i > 0) {
    T[i] = L[r];       // L[r] precedes F[r] in T
    r = LF[r]; i--;    // move to the row whose first char is L[r]
  }
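A minimal Python sketch (not from the slides) of the BWT and its inversion; it uses the same LF idea, but starts the backward walk from the row holding the end-marker, which is a small variant of the pseudocode above.

def bwt(T):                        # T ends with a unique smallest char, e.g. '#'
    n = len(T)
    rows = sorted(range(n), key=lambda i: T[i:] + T[:i])
    return "".join(T[(i - 1) % n] for i in rows)          # last column L

def ibwt(L, end="#"):
    n = len(L)
    order = sorted(range(n), key=lambda r: (L[r], r))     # F is L sorted; stable among equal chars
    LF = [0] * n
    for f_row, r in enumerate(order):
        LF[r] = f_row                                     # LF[r] = row of F holding the char L[r]
    out, r = [], L.index(end)                             # start from the row equal to T itself
    for _ in range(n):
        out.append(L[r])                                  # collect T backward
        r = LF[r]
    return "".join(reversed(out))

L = bwt("mississippi#")
print(L)          # ipssm#pissii
print(ibwt(L))    # mississippi#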

How to compute the BWT ?
The rows of the BWT matrix, once sorted, correspond to the suffixes of T in lexicographic order, so it suffices to build the suffix array SA of T:

   SA = 12 11 8 5 2 1 10 9 7 4 6 3        L = i p s s m # p i s s i i

We said that: L[i] precedes F[i] in T. For example, L[3] = T[7].
Given SA and T, we have L[i] = T[SA[i]-1].

How to construct SA from T ?
Input: T = mississippi#

   SA
   12   #
   11   i#
    8   ippi#
    5   issippi#
    2   ississippi#
    1   mississippi#
   10   pi#
    9   ppi#
    7   sippi#
    4   sissippi#
    6   ssippi#
    3   ssissippi#

Elegant but inefficient. Obvious inefficiencies:
 • Θ(n² log n) time in the worst-case
 • Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node one can reach any other node via an undirected path.

Strongly connected components (SCC)

Set of nodes such that from any node one can reach any other node via a directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by humans

Exploit the structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…

 Physical network graph
   V = Routers
   E = communication links

 The “cosine” graph (undirected, weighted)
   V = static web pages
   E = semantic distance between pages

 Query-Log graph (bipartite, weighted)
   V = queries and URLs
   E = (q,u) if u is a result for q, and has been clicked by some user who issued q

 Social graph (undirected, unweighted)
   V = users
   E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)

 V = URLs, E = (u,v) if u has a hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:

Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution

Altavista crawl, 1999 — WebBase crawl, 2001: the indegree follows a power-law distribution

   Pr[ in-degree(u) = k ]  ∝  1 / k^α ,    α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)

 V = URLs, E = (u,v) if u has a hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:

 Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1

 Locality: usually most of the hyperlinks point to other URLs on the same host (about 80%).

 Similarity: pages close in lexicographic order tend to share many outgoing lists

A Picture of the Web Graph

[Figure: the (i,j) adjacency-matrix plot of 21 million pages and 150 million links, after URL-sorting (Berkeley, Stanford).]

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:
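A minimal Python sketch (not from the slides) of this gap transformation; the successor list below is illustrative, and the (possibly negative) first entry is mapped with the 2v / 2|v|−1 convention used later for residuals.

def to_nat(v):                      # positives -> 2v, negatives -> 2|v|-1
    return 2 * v if v >= 0 else 2 * (-v) - 1

def gaps(x, succ):                  # succ = sorted successor list S(x)
    g = [to_nat(succ[0] - x)]       # first entry is relative to x, may be negative
    g += [succ[i] - succ[i - 1] - 1 for i in range(1, len(succ))]
    return g

print(gaps(15, [13, 15, 16, 17, 18, 19, 23, 24, 203]))
# [3, 1, 0, 0, 0, 0, 3, 0, 178]   (small values, then compressed with a universal code)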

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides an efficient, optimal solution

 fknown is the “previously encoded text”: compress the concatenation fknown·fnew, starting from fnew

zdelta is one of the best implementations
           Emacs size   Emacs time
uncompr    27Mb         ---
gzip       8Mb          35 secs
zdelta     1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: the weighted graph GF with a dummy node 0 connected to all files; edge weights are zdelta sizes (620, 2000, 220, 123, 20, …), and the min branching selects the cheapest reference for each file.]

           space   time
uncompr    30Mb    ---
tgz        20%     linear
THIS       8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly: n² edge calculations (zdelta executions)

We wish to exploit some pruning approach

 Collection analysis: Cluster the files that appear similar and thus are good candidates for zdelta-compression. Build a sparse weighted graph G’F containing only edges between those pairs of files.

 Assign weights: Estimate appropriate edge weights for G’F, thus saving zdelta executions. Nonetheless, strictly n² time.
           space    time
uncompr    260Mb    ---
tgz        12%      2 mins
THIS       8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

           gcc size   emacs size
total      27288      27326
gzip        7563       8577
zdelta       227       1431
rsync        964       4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends the hashes (unlike the client in rsync), and the client checks them.
The server deploys the common fref to compress the new ftar (rsync compresses just it).

A multi-round protocol

k blocks of n/k elems

log n/k levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
T# = mississippi#
     1 2 3 4 5 6 7 8 9 10 11 12

[Figure: the suffix tree of T#; internal edges are labeled with substrings (#, i, i#, s, si, ssi, p, pi#, ppi#, mississippi#, …) and the 12 leaves store the starting positions 1..12 of the suffixes.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

T = mississippi#          (storing SUF(T) explicitly would take Θ(N²) space)

   SA      SUF(T)
   12      #
   11      i#
    8      ippi#
    5      issippi#
    2      ississippi#
    1      mississippi#
   10      pi#
    9      ppi#
    7      sippi#
    4      sissippi#
    6      ssippi#
    3      ssissippi#

SA stores the suffix pointers.
Suffix Array space:
 • SA: Θ(N log2 N) bits
 • Text T: N chars
 ⟹ In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step.

T = mississippi#      SA = 12 11 8 5 2 1 10 9 7 4 6 3      P = si

At each step P is compared with the suffix T[SA[mid], N]: if P is larger we recurse on the right half of SA, if P is smaller on the left half.

Suffix Array search
 • O(log2 N) binary-search steps
 • Each step takes O(p) char cmp
 ⟹ overall, O(p log2 N) time

 + [Manber-Myers, ’90]       |S| [Cole et al, ’06]
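A minimal Python sketch (not from the slides) of building the suffix array by plain sorting and binary-searching it; for brevity the comparison keys are materialized, whereas a real implementation compares P against suffixes in place at O(p) cost per step.

import bisect

def sa_build(T):
    return sorted(range(len(T)), key=lambda i: T[i:])

def sa_search(T, SA, P):
    keys = [T[i:i + len(P)] for i in SA]            # only to keep the sketch short
    lo = bisect.bisect_left(keys, P)
    hi = bisect.bisect_right(keys, P)
    return sorted(SA[r] + 1 for r in range(lo, hi)) # 1-based text positions

T = "mississippi#"
SA = sa_build(T)
print([i + 1 for i in SA])              # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(sa_search(T, SA, "si"))           # [4, 7]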

Locating the occurrences
T = mississippi#, P = si: the SA range of the suffixes prefixed by si is delimited by binary-searching for si# and si$ (where # is smaller and $ larger than every character); here occ = 2, at text positions 7 (sippi…) and 4 (sissippi…).

Suffix Array search
 • O(p + log2 N + occ) time

Suffix Trays: O(p + log2 |S| + occ)   [Cole et al., ‘06]
String B-tree   [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays   [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

T = mississippi#
Lcp = 0 0 1 4 0 0 1 0 2 1 3
SA  = 12 11 8 5 2 1 10 9 7 4 6 3

(e.g. the adjacent suffixes issippi… and ississippi…, at positions 5 and 2, share a prefix of length 4)

• How long is the common prefix between T[i,...] and T[j,...] ?
  • It is the min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  • Search for some Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  • Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.
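A minimal Python sketch (not from the slides) of the Lcp-based tests above; SA is taken 0-based here.

def lcp_array(T, SA):                         # Lcp[h] = lcp of suffixes SA[h] and SA[h+1]
    def lcp(a, b):
        k = 0
        while a + k < len(T) and b + k < len(T) and T[a + k] == T[b + k]:
            k += 1
        return k
    return [lcp(SA[h], SA[h + 1]) for h in range(len(SA) - 1)]

def has_repeat(Lcp, L):                       # repeated substring of length >= L ?
    return any(v >= L for v in Lcp)

def has_frequent(Lcp, L, C):                  # substring of length >= L occurring >= C times ?
    w = C - 1                                 # C occurrences = C-1 adjacent Lcp entries >= L
    return any(all(v >= L for v in Lcp[i:i + w]) for i in range(len(Lcp) - w + 1))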


For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M ( j - 1))  U (T [ j ])
l

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M

l -1

( j - 1))

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1
1 2 3 4 5 6 7 8 910

M1=

T=xabxabaaca
P=
abaad

1

2

3

4

5

6

7

8

9

1
0

1

1

1

1

1

1

1

1

1

1

1

2

0

0

1

0

0

1

0

1

1

0

3

0

0

0

1

0

0

1

0

0

1

4

0

0

0

0

1

0

0

1

0

0

5

0

0

0

0

0

0

0

0

1

0

M0=

1

2

3

4

5

6

7

8

9

1
0

1

0

1

0

0

1

0

1

1

0

1

2

0

0

1

0

0

1

0

0

0

0

3

0

0

0

0

0

0

1

0

0

0

4

0

0

0

0

0

0

0

1

0

0

5

0

0

0

0

0

0

0

0

0

0

How much do we pay?





The running time is O(kn(1+m/w)
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Delection: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
0 000 ...........0 x in b in ary
Length-1


x > 0 and Length = log2 x +1
e.g., 9 represented as <000,1001>.



g-code for x takes 2 log2 x +1 bits

(ie. factor of 2 from optimal)



Optimal for Pr(x) = 1/2x2, and i.i.d integers

It is a prefix-free encoding…


Given the following sequence of γ-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

Answer: 8, 6, 3, 59, 7
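A small Python sketch (ours, not from the slides) of the γ-encoder/decoder; it reproduces the exercise above.

def gamma_encode(x):
    """gamma-code of x > 0: (Length-1) zeros followed by x in binary."""
    assert x > 0
    b = bin(x)[2:]                     # |b| = floor(log2 x) + 1
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    """Decode a concatenation of gamma-codes back into the integer sequence."""
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":          # count the leading zeros = Length - 1
            z += 1
            i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out

print(gamma_encode(9))                                   # 0001001
print(gamma_decode("0001000001100110000011101100111"))   # [8, 6, 3, 59, 7]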

Analysis
Sort the pi in decreasing order, and encode si via
the variable-length code γ(i).

Recall that: |γ(i)| ≤ 2·log i + 1
How good is this approach w.r.t. Huffman?
Compression ratio ≤ 2·H0(s) + 1
Key fact:
1 ≥ Σi=1,...,x pi ≥ x·px   ⇒   x ≤ 1/px

How good is it ?
Encode the integers via γ-coding:
|γ(i)| ≤ 2·log i + 1
The cost of the encoding is (recall i ≤ 1/pi):

Σi=1,...,|S| pi · |γ(i)|  ≤  Σi=1,...,|S| pi · [ 2·log(1/pi) + 1 ]  =  2·H0(X) + 1

Not much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2

A new concept: Continuers vs Stoppers
The main idea is:
  Previously we used: s = c = 128
  s + c = 256 (we are playing with 8 bits)
  Thus s items are encoded with 1 byte,
  s·c with 2 bytes, s·c² on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 128² = 16512 words on at most 2 bytes
A (230,26)-dense code encodes 230 + 230·26 = 6210 words on at most 2
bytes, hence more words on 1 byte, which pays off if the distribution is skewed...
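The following Python sketch shows one possible (s,c)-dense codeword assignment consistent with the counts above; the exact byte layout of the published (s,c)-DC may differ, so take sc_dense_encode / sc_dense_decode as illustrative only.

def sc_dense_encode(rank, s, c):
    """One possible (s,c)-dense codeword for a word of 0-based rank:
    the last byte is a stopper in [0,s), the others are continuers in [s,s+c)."""
    out = [rank % s]
    q = rank // s
    while q > 0:
        q -= 1
        out.append(s + (q % c))
        q //= c
    return list(reversed(out))

def sc_dense_decode(code, s, c):
    """Inverse mapping: codeword (list of byte values) -> 0-based rank."""
    q = 0
    for b in code[:-1]:
        q = q * c + (b - s) + 1
    return q * s + code[-1]

# sanity checks reproducing the counts above
assert all(sc_dense_decode(sc_dense_encode(r, 230, 26), 230, 26) == r for r in range(7000))
assert len(sc_dense_encode(6209, 230, 26)) == 2      # 230 + 230*26 = 6210 words on <= 2 bytes
assert len(sc_dense_encode(16511, 128, 128)) == 2    # ETDC: 128 + 128^2 = 16512 words on <= 2 bytes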

Optimal (s,c)-dense codes
Find the optimal s, assuming c = 256 - s.

Brute-force approach

Binary search:
on real distributions, there seems to be one unique minimum

Ks = max codeword length
Fs,k = cumulative probability of the symbols whose |cw| ≤ k

Experiments: (s,c)-DC is rather interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

It exploits temporal locality, and it is dynamic

X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n² log n) bits, MTF = O(n log n) + n² bits

Not much worse than Huffman
...but it may be far better
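A minimal MTF encoder/decoder in Python (our own illustration); positions are emitted 0-based here, whereas the slide speaks of "the position of s in L".

def mtf_encode(text, alphabet):
    """Move-to-Front: emit the current (0-based) position of each symbol,
    then move the symbol to the front of the list."""
    L = list(alphabet)
    out = []
    for ch in text:
        pos = L.index(ch)
        out.append(pos)
        L.insert(0, L.pop(pos))
    return out

def mtf_decode(codes, alphabet):
    L = list(alphabet)
    out = []
    for pos in codes:
        ch = L[pos]
        out.append(ch)
        L.insert(0, L.pop(pos))
    return "".join(out)

codes = mtf_encode("abbbaacccca", "abcd")
print(codes)                         # [0, 1, 0, 0, 1, 0, 2, 0, 0, 0, 1]
print(mtf_decode(codes, "abcd"))     # abbbaacccca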

MTF: how good is it ?
Encode the integers via γ-coding:
|γ(i)| ≤ 2·log i + 1
Put the alphabet S at the front and consider the cost of encoding
(p_x^i = position of the i-th occurrence of symbol x, n_x = #occurrences of x):

O(S log S) + Σx=1..S Σi=2..nx |γ( p_x^i - p_x^(i-1) )|

By Jensen's inequality:

≤ O(S log S) + Σx=1..S nx · [ 2·log(N/nx) + 1 ]
= O(S log S) + N · [ 2·H0(X) + 1 ]

Hence  La[mtf] ≤ 2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as the symbols to be encoded.
How to maintain the MTF-list efficiently:

Search tree
  Leaves contain the symbols, ordered as in the MTF-list
  Nodes contain the size of their descending subtree
Hash Table
  key is a symbol
  data is a pointer to the corresponding tree leaf

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings, just the run lengths and one bit (the first one) suffice
Properties:

It exploits spatial locality, and it is a dynamic code
There is a memory

X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = Θ(n² log n) > Rle(X) = Θ(n log n)
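A one-function RLE sketch (ours) reproducing the example above.

def rle_encode(s):
    """Run-Length Encoding: (symbol, run length) pairs."""
    out = []
    for ch in s:
        if out and out[-1][0] == ch:
            out[-1] = (ch, out[-1][1] + 1)
        else:
            out.append((ch, 1))
    return out

print(rle_encode("abbbaacccca"))   # [('a', 1), ('b', 3), ('a', 2), ('c', 4), ('a', 1)]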

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.  p(a) = .2, p(b) = .5, p(c) = .3
      f(a) = .0, f(b) = .2, f(c) = .7

f(i) = Σj<i p(j)     (cumulative probability)

[figure: the unit interval [0,1) split into a = [0,.2), b = [.2,.7), c = [.7,1)]

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
[figure: successive interval refinement]
start:        [0, 1)
after 'b':    [.2, .7)     (l = 0 + 1·f(b) = .2,   s = 1·p(b) = .5)
after 'a':    [.2, .3)     (l = .2 + .5·f(a) = .2, s = .5·p(a) = .1)
after 'c':    [.27, .3)    (l = .2 + .1·f(c) = .27, s = .1·p(c) = .03)

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l0 = 0        li = li-1 + si-1 · f[ci]
s0 = 1        si = si-1 · p[ci]

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

sn = Πi=1..n p[ci]

The interval for a message sequence will be called the
sequence interval
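A small Python sketch (ours) of the interval computation defined above, using the slide's distribution p(a)=.2, p(b)=.5, p(c)=.3.

def sequence_interval(msg, p, f):
    """Compute the sequence interval [l, l+s) for msg, given per-symbol
    probabilities p[c] and cumulative probabilities f[c]."""
    l, s = 0.0, 1.0
    for c in msg:
        l = l + s * f[c]
        s = s * p[c]
    return l, l + s

p = {"a": 0.2, "b": 0.5, "c": 0.3}
f = {"a": 0.0, "b": 0.2, "c": 0.7}
print(sequence_interval("bac", p, f))   # approximately (0.27, 0.3), the final interval above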

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore, specifying any number in the
final interval uniquely determines the message.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
[figure: decoding by repeatedly locating .49 and rescaling]
.49 ∈ [.2, .7) = interval of b   →  output b, rescale: (.49 - .2)/.5 = .58
.58 ∈ [.2, .7) = interval of b   →  output b, rescale: (.58 - .2)/.5 = .76
.76 ∈ [.7, 1)  = interval of c   →  output c

The message is bbc.
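The corresponding decoding step, again only as an illustration of the rescaling idea (real numbers, no integer arithmetic):

def decode_number(z, n, p, f):
    """Decode a number z in [0,1) into a message of known length n by repeatedly
    locating the symbol interval containing z and rescaling."""
    syms = sorted(p, key=lambda c: f[c])      # symbols by increasing cumulative value
    msg = []
    for _ in range(n):
        for c in reversed(syms):              # pick the symbol with the largest f[c] <= z
            if z >= f[c]:
                msg.append(c)
                z = (z - f[c]) / p[c]         # rescale into [0,1)
                break
    return "".join(msg)

p = {"a": 0.2, "b": 0.5, "c": 0.3}
f = {"a": 0.0, "b": 0.2, "c": 0.7}
print(decode_number(0.49, 3, p, f))           # bbc, as above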

Representing a real number
Binary fractional representation:
.75 = .11        1/3 = .010101...        11/16 = .1011

Algorithm (emit the bits of x ∈ [0,1)):
1.  x = 2·x
2.  if x < 1 output 0
3.  else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
number    min        max        interval
.11       .110...    .111...    [.75, 1.0)
.101      .1010...   .1011...   [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

[figure: sequence interval [.61, .79) containing the code interval of .101 = [.625, .75)]

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + ⌈log (1/s)⌉ =
= 1 + ⌈log Πi (1/pi)⌉
≤ 2 + Σi=1,n log (1/pi)
= 2 + Σk=1,|S| n·pk·log (1/pk)
= 2 + n·H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
The problem is that operations on arbitrary-precision
real numbers are expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
Output 1 followed by m 0s
m = 0
Message interval is expanded by 2

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m = 0
Message interval is expanded by 2

If l ≥ R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

In all other cases,
just continue...

You find this at

Arithmetic ToolBox
As a state machine

[figure: the Arithmetic ToolBox as a state machine: from state (L,s), on symbol c with distribution (p1,...,p|S|), the ATB moves to the state (L',s') of the narrowed interval]

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).
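A tiny Python sketch (ours) of the context statistics a k-th order model relies on — not PPM itself, since it has no escape mechanism:

from collections import defaultdict

def context_counts(text, k):
    """Count, for every length-k context, how often each next character follows it.
    Conditional probabilities are then ratios of these counts (the p(e|th) above)."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(k, len(text)):
        counts[text[i - k:i]][text[i]] += 1
    return counts

c = context_counts("ACCBACCACBA", 2)
print(dict(c["AC"]))      # {'C': 2, 'B': 1}, i.e. p(C|AC) = 2/3, as in the example table below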

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
[figure: PPM feeds the ATB with the conditional distribution p[s|context], where s is either a plain symbol c or the escape esc; state (L,s) becomes (L',s')]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
String = ACCBACCACBA B        k = 2

Context      Counts
(empty)      A = 4   B = 2   C = 5   $ = 3

A            C = 3   $ = 1
B            A = 2   $ = 1
C            A = 1   B = 2   C = 2   $ = 3

AC           B = 1   C = 2   $ = 2
BA           C = 1   $ = 1
CA           C = 1   $ = 1
CB           A = 2   $ = 1
CC           A = 1   B = 1   $ = 2
You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves
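A naive Python sketch of this scheme (ours; real implementations use hashing rather than the quadratic scan below). It reproduces the windowed example on the next slide.

def lz77_compress(text, window):
    """Naive LZ77 with a sliding window: emit (distance, length, next char) triples."""
    out, i, n = [], 0, len(text)
    while i < n:
        best_d, best_len = 0, 0
        lo = max(0, i - window)
        for start in range(lo, i):                     # candidate copy sources in the window
            l = 0
            while i + l < n - 1 and text[start + l] == text[i + l]:
                l += 1                                 # overlap with the cursor is allowed
            if l > best_len:
                best_d, best_len = i - start, l
        out.append((best_d, best_len, text[i + best_len]))
        i += best_len + 1
    return out

print(lz77_compress("aacaacabcabaaac", 6))
# [(0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), (3, 3, 'a'), (1, 2, 'c')]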

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table (indexed on triplets) to speed up the search for matches
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
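A minimal LZW encoder in Python (ours); the dictionary is initialised only with a, b, c and the toy ids of the slides (a = 112, b = 113, c = 114) instead of the full 256 ASCII entries. It reproduces the encoding example that follows.

def lzw_encode(text):
    """LZW: output dictionary ids only; each step adds S+c for the longest match S."""
    dic = {"a": 112, "b": 113, "c": 114}   # toy initial dictionary, as on the slides
    next_id = 256
    out, S = [], ""
    for ch in text:
        if S + ch in dic:
            S += ch
        else:
            out.append(dic[S])
            dic[S + ch] = next_id
            next_id += 1
            S = ch
    out.append(dic[S])
    return out

print(lzw_encode("aabaacababacb"))
# [112, 112, 113, 256, 114, 257, 261, 114, 113]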

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us be given the text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows            (1994)

  #mississippi
  i#mississipp
  ippi#mississ
  issippi#miss
  ississippi#m
  mississippi#
  pi#mississip
  ppi#mississi
  sippi#missis
  sissippi#mis
  ssippi#missi
  ssissippi#mi

F (first column) = # i i i i m p p s s s s
L (last column)  = i p s s m # p i s s i i     ← a permutation of T's chars

A famous example

Much
longer...

A useful tool:  L → F mapping

[figure: the same sorted matrix, with F = # i i i i m p p s s s s and L = i p s s m # p i s s i i]

How do we map L's onto F's chars ?
... Need to distinguish equal chars in F...

Take two equal L's chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
[figure: the same F/L columns as above (F = # i i i i m p p s s s s, L = i p s s m # p i s s i i)]

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
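A compact Python sketch (ours) of both directions: the forward BWT via the quadratic rotation sorting, and the inversion via the LF-mapping just described. It assumes the text ends with the sentinel '#', the smallest character.

def bwt(T):
    """BWT by sorting all rotations of T (assumes T ends with the sentinel '#')."""
    n = len(T)
    rotations = sorted(T[i:] + T[:i] for i in range(n))
    return "".join(row[-1] for row in rotations)

def ibwt(L):
    """Invert the BWT via the LF-mapping: equal characters keep their relative
    order between L and F = sorted(L), and L[r] precedes F[r] in T."""
    n = len(L)
    order = sorted(range(n), key=lambda i: (L[i], i))
    LF = [0] * n
    for f_pos, l_pos in enumerate(order):
        LF[l_pos] = f_pos
    out, r = [], 0                 # row 0 is the rotation starting at '#'
    for _ in range(n):
        out.append(L[r])
        r = LF[r]
    s = "".join(reversed(out))     # the rotation of T that starts with '#'
    return s[1:] + s[0]            # move the sentinel back to the end

L = bwt("mississippi#")
print(L)          # ipssm#pissii
print(ibwt(L))    # mississippi#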

How to compute the BWT ?
SA        BWT matrix (sorted rotations)      L
12        #mississippi                       i
11        i#mississipp                       p
 8        ippi#mississ                       s
 5        issippi#miss                       s
 2        ississippi#m                       m
 1        mississippi#                       #
10        pi#mississip                       p
 9        ppi#mississi                       i
 7        sippi#missis                       s
 4        sissippi#mis                       s
 6        ssippi#missi                       i
 3        ssissippi#mi                       i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12  #
11  i#
 8  ippi#
 5  issippi#
 2  ississippi#
 1  mississippi#
10  pi#
 9  ppi#
 7  sippi#
 4  sissippi#
 6  ssippi#
 3  ssissippi#

Input: T = mississippi#

Elegant but inefficient.
Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion pages available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)

Set of nodes such that from any node one can reach any other node via an
undirected path.

Strongly connected components (SCC)

Set of nodes such that from any node one can reach any other node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by humans


Exploit the structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…

Physical network graph
  V = Routers
  E = communication links

The “cosine” graph (undirected, weighted)
  V = static web pages
  E = semantic distance between pages

Query-Log graph (bipartite, weighted)
  V = queries and URLs
  E = (q,u) if u is a result for q, and has been clicked by some
      user who issued q

Social graph (undirected, unweighted)
  V = users
  E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pr that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution

Altavista crawl, 1999 — WebBase crawl, 2001
Indegree follows a power-law distribution:

Pr[ in-degree(u) = k ]  ∝  1/k^α,     α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pr that a node has x links is 1/x^α, α ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 million pages, 150 million links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit in the copy-list of y tells whether the corresponding successor of the
reference x is also a successor of y;
The reference index is chosen in [0,W] so as to give the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: encoded as differences between consecutive residuals,
or with respect to the source node

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



What if the sender has never seen the data at the receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77-scheme provides an efficient, optimal solution




fknown is the “previously encoded text”: compress the concatenation fknown·fnew starting from fnew

zdelta is one of the best implementations
           Emacs size    Emacs time
uncompr    27Mb          ---
gzip       8Mb           35 secs
zdelta     1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

[figure: Client ↔ (slow link, delta-encoding) ↔ Proxy ↔ (fast link) ↔ web; both sides keep a reference page and exchange requests/pages]

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[figure: weighted graph GF over files 1,2,3,5 plus the dummy node 0; edge weights are zdelta sizes (e.g. 20, 123, 220, 620, 2000), and the min branching picks the cheapest reference for each file]

           space    time
uncompr    30Mb     ---
tgz        20%      linear
THIS       8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly: n² edge calculations (zdelta executions)

We wish to exploit some pruning approach

Collection analysis: cluster the files that appear similar and thus are
good candidates for zdelta-compression. Build a sparse weighted graph
G'F containing only edges between those pairs of files.

Assign weights: estimate appropriate edge weights for G'F, thus saving
zdelta executions. Nonetheless, still n² time.
           space    time
uncompr    260Mb    ---
tgz        12%      2 mins
THIS       8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
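As a toy illustration only (this is not rsync: no rolling checksum, no strong-hash verification), block hashing on the old file plus a scan of the new file emitting block references or literals can be sketched as follows; the names block_hashes and delta are ours.

def block_hashes(f_old, B):
    """Client side: hash of every non-overlapping block of size B of f_old."""
    return {hash(f_old[i:i + B]): i for i in range(0, len(f_old) - B + 1, B)}

def delta(f_new, hashes, B):
    """Server side: scan f_new and emit ('copy', offset-in-f_old) for a known block,
    or ('lit', char) otherwise. A real tool uses a rolling checksum so each
    one-character shift costs O(1), and a strong hash to rule out collisions."""
    out, j = [], 0
    while j < len(f_new):
        h = hash(f_new[j:j + B]) if j + B <= len(f_new) else None
        if h in hashes:
            out.append(("copy", hashes[h]))
            j += B
        else:
            out.append(("lit", f_new[j]))
            j += 1
    return out

f_old = "the quick brown fox jumps over the lazy dog"
f_new = "the quick brown cat jumps over the lazy dog"
print(delta(f_new, block_hashes(f_old, 8), 8))   # mostly 'copy' ops, a few literals around the change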

Rsync: some experiments

           gcc size    emacs size
total      27288       27326
gzip       7563        8577
zdelta     227         1431
rsync      964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends the hashes (unlike the client in rsync), and the client checks them.
The server deploys the common fref to compress the new ftar (rsync compresses just it).

A multi-round protocol

k blocks of n/k elements, log(n/k) levels.
If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits.

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

[figure: P occurring at position i of T, as a prefix of the suffix T[i,N]]

Occurrences of P in T = All suffixes of T having P as a prefix

Example: P = si, T = mississippi  →  occurrences at positions 4 and 7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[figure: suffix tree of T# = mississippi#; edges are labelled with substrings
(e.g. i, s, si, ssi, p, #, i#, pi#, ppi#, mississippi#) and the 12 leaves store
the starting positions 1..12 of the corresponding suffixes]

T# = mississippi#
     1 2 3 4 5 6 7 8 9 10 11 12

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Storing SUF(T) explicitly takes Θ(N²) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
⇒ In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirect binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
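A short Python sketch (ours) of the “elegant but inefficient” construction plus the indirect binary search described above:

def suffix_array(T):
    """Suffix array by direct sorting of the suffixes (the simple Θ(N² log N) way)."""
    return sorted(range(len(T)), key=lambda i: T[i:])

def search(T, SA, P):
    """Binary search for the leftmost suffix >= P, then collect the contiguous
    block of suffixes having P as a prefix; returns 1-based positions."""
    lo, hi = 0, len(SA)
    while lo < hi:
        mid = (lo + hi) // 2
        if T[SA[mid]:] < P:
            lo = mid + 1
        else:
            hi = mid
    occ = []
    while lo < len(SA) and T[SA[lo]:SA[lo] + len(P)] == P:
        occ.append(SA[lo] + 1)
        lo += 1
    return sorted(occ)

T = "mississippi#"
SA = suffix_array(T)
print([i + 1 for i in SA])   # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(search(T, SA, "si"))   # [4, 7]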

Locating the occurrences
[figure: binary search on SA for T = mississippi# locates the contiguous range of
suffixes prefixed by "si" (sippi#, sissippi#), i.e. occ = 2 occurrences, at positions 4 and 7]
Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp:   0  0  1  4  0  0  1  0  2  1  3
SA :  12 11  8  5  2  1 10  9  7  4  6  3

T = mississippi#
(e.g. the suffixes issippi# and ississippi#, adjacent in SA, share a prefix of length 4)
• How long is the common prefix between T[i,...] and T[j,...] ?
  • Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  • Search for an entry Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  • Search for a window Lcp[i,i+C-2] whose entries are all ≥ L


Slide 213

P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M ( j - 1))  U (T [ j ])
l

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M

l -1

( j - 1))

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1
1 2 3 4 5 6 7 8 910

M1=

T=xabxabaaca
P=
abaad

1

2

3

4

5

6

7

8

9

1
0

1

1

1

1

1

1

1

1

1

1

1

2

0

0

1

0

0

1

0

1

1

0

3

0

0

0

1

0

0

1

0

0

1

4

0

0

0

0

1

0

0

1

0

0

5

0

0

0

0

0

0

0

0

1

0

M0=

1

2

3

4

5

6

7

8

9

1
0

1

0

1

0

0

1

0

1

1

0

1

2

0

0

1

0

0

1

0

0

0

0

3

0

0

0

0

0

0

1

0

0

0

4

0

0

0

0

0

0

0

1

0

0

5

0

0

0

0

0

0

0

0

0

0

How much do we pay?





The running time is O(kn(1+m/w)
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Delection: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
0 000 ...........0 x in b in ary
Length-1


x > 0 and Length = log2 x +1
e.g., 9 represented as <000,1001>.



g-code for x takes 2 log2 x +1 bits

(ie. factor of 2 from optimal)



Optimal for Pr(x) = 1/2x2, and i.i.d integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8

6

3

59

7

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How much good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
1 ≥Si=1,...,x pi ≥ x * px  x ≤ 1/px

How good is it ?
Encode the integers via g-coding:
|g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/p_i):

Σ_{i=1,...,|S|} p_i * |g(i)|  ≤  Σ_{i=1,...,|S|} p_i * [ 2 * log (1/p_i) + 1 ]  =  2 * H0(X) + 1

Not much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 128^2 = 16512 words on at most 2 bytes
A (230,26)-dense code encodes 230 + 230*26 = 6210 words on at most 2 bytes, hence more words on 1 byte; thus, if the distribution is skewed, it compresses better...
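A possible Python sketch of (s,c)-dense encoding of a rank r; the concrete byte convention (stoppers are the values [0,s), continuers the values [s,256)) is an assumption made here, since the slides do not fix the layout:

    def sc_dense_encode(rank, s):
        # rank is 0-based; s stopper values, c = 256 - s continuer values
        c = 256 - s
        base, k, block = 0, 1, s          # block = number of ranks encodable with k bytes
        while rank >= base + block:
            base += block
            block *= c
            k += 1
        offset = rank - base
        stopper = offset % s              # last byte, a stopper in [0, s)
        offset //= s
        out = [stopper]
        for _ in range(k - 1):            # continuer bytes, in [s, 256)
            out.append(s + offset % c)
            offset //= c
        return bytes(reversed(out))

With s = c = 128 this is ETDC and ranks 0..16511 fit in at most 2 bytes; with (s,c) = (230,26) only 6210 ranks fit in 2 bytes, but many more fit in a single byte, as in the example above.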

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 256 - s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploits temporal locality, and it is dynamic

X = 1^n 2^n 3^n … n^n  ⟹  Huff = O(n^2 log n) bits,  MTF = O(n log n) + n^2 bits

Not much worse than Huffman
...but it may be far better
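A direct Python sketch of the MTF transform just described (1-based positions, as in the slides):

    def mtf_encode(text, alphabet):
        L = list(alphabet)                # start with the list of symbols L = [a,b,c,d,...]
        out = []
        for s in text:
            pos = L.index(s) + 1          # 1) output the position of s in L
            out.append(pos)
            L.pop(pos - 1)                # 2) move s to the front of L
            L.insert(0, s)
        return out

    # mtf_encode("abba", "abcd") == [1, 2, 1, 2]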

MTF: how good is it ?
Encode the integers via g-coding:
|g(i)| ≤ 2 * log i + 1
Put S at the front and consider the cost of encoding:

cost  =  O(|S| log |S|)  +  Σ_{x=1..|S|}  Σ_{i=2..n_x}  |g( p_i^x - p_{i-1}^x )|

(here p_i^x denotes the position of the i-th occurrence of symbol x, and n_x its total number of occurrences)

By Jensen’s inequality:

cost  ≤  O(|S| log |S|)  +  Σ_{x=1..|S|} n_x * [ 2 * log (N / n_x) + 1 ]
      =  O(|S| log |S|)  +  N * [ 2 * H0(X) + 1 ]

⟹  La[mtf]  ≤  2 * H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploits spatial locality, and it is a dynamic code
X = 1^n 2^n 3^n … n^n  ⟹  Huff(X) = Θ(n^2 log n)  >  Rle(X) = Θ(n log n)

There is a memory
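A one-scan Python sketch of RLE matching the example above:

    def rle_encode(s):
        out = []
        for ch in s:
            if out and out[-1][0] == ch:
                out[-1] = (ch, out[-1][1] + 1)   # extend the current run
            else:
                out.append((ch, 1))              # start a new run
        return out

    # rle_encode("abbbaacccca") == [('a',1), ('b',3), ('a',2), ('c',4), ('a',1)]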

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.  p(a) = .2,  p(b) = .5,  p(c) = .3

f(i) = Σ_{j<i} p(j)   ⟹   f(a) = .0,  f(b) = .2,  f(c) = .7

[figure: the unit interval [0,1) partitioned into a = [0,.2), b = [.2,.7), c = [.7,1)]

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
[figure: coding bac — start from [0,1); after b the interval is [.2,.7); after a it is [.2,.3); after c it is [.27,.3)]

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l_0 = 0,    l_i = l_{i-1} + s_{i-1} * f[c_i]
s_0 = 1,    s_i = s_{i-1} * p[c_i]

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is   s_n = Π_{i=1..n} p[c_i]

The interval for a message sequence will be called the
sequence interval
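A tiny Python sketch of this interval computation, using plain floats just to mirror the formulas (a real coder uses the integer/scaling version discussed later):

    def sequence_interval(msg, p, f):
        # p[c] = probability of c, f[c] = cumulative probability up to c (excluded)
        l, s = 0.0, 1.0
        for c in msg:
            l = l + s * f[c]
            s = s * p[c]
        return l, l + s                   # the sequence interval [l, l+s)

    p = {'a': .2, 'b': .5, 'c': .3}
    f = {'a': .0, 'b': .2, 'c': .7}
    # sequence_interval("bac", p, f) -> approximately (0.27, 0.30), as in the encoding example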

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
[figure: decoding .49 with the same distribution — .49 ∈ [.2,.7) ⟹ b; within [.2,.7), .49 ∈ [.3,.55) ⟹ b; within [.3,.55), .49 ∈ [.475,.55) ⟹ c]

The message is bbc.

Representing a real number
Binary fractional representation:
Binary fractional representation:
.75 = .11        1/3 = .010101...        11/16 = .1011

Algorithm
1. x = 2 * x
2. If x < 1, output 0
3. else x = x - 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval.
e.g. [0,.33) = .01      [.33,.66) = .1      [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
code    min        max        interval
.11     .110...    .111...    [.75, 1.0)
.101    .1010...   .1011...   [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval ≈ [.61, .79)
Code Interval (.101) = [.625, .75), contained in it

Can use L + s/2 truncated to 1 + ⌈log2 (1/s)⌉ bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

[figure: the Arithmetic ToolBox as a state machine — given the current interval (L,s), the next symbol c and the distribution (p1,...,p|S|), ATB maps (L,s) to the new interval (L',s')]

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
[figure: PPM drives the ATB with the conditional distribution p[ s | context ], where s is either a character c or esc; the interval (L,s) is mapped to (L',s')]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
String = ACCBACCACBA B      k = 2

Order-0 (empty context):   A = 4    B = 2    C = 5    $ = 3

Order-1 contexts:
  A:   C = 3    $ = 1
  B:   A = 2    $ = 1
  C:   A = 1    B = 2    C = 2    $ = 3

Order-2 contexts:
  AC:  B = 1    C = 2    $ = 2
  BA:  C = 1    $ = 1
  CA:  C = 1    $ = 1
  CB:  A = 2    $ = 1
  CC:  A = 1    B = 1    $ = 2
You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
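A small Python sketch of the decoder step above, showing that the char-by-char copy handles the overlapping case (len > d) correctly:

    def lz77_decode(triples):
        out = []
        for d, length, c in triples:
            start = len(out) - d
            for i in range(length):          # copy char by char: works even if length > d
                out.append(out[start + i])
            out.append(c)
        return "".join(out)

    # lz77_decode([(0,0,'a'), (0,0,'b'), (0,0,'c'), (0,0,'d'), (2,9,'e')]) == "abcdcdcdcdcdce"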

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
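A compact Python sketch of the LZW coder as described; the dict-based dictionary and the 256 single-character initial entries are an illustrative choice:

    def lzw_encode(text):
        dic = {chr(i): i for i in range(256)}    # dictionary initialized with single chars
        next_id = 256
        out, S = [], ""
        for c in text:
            if S + c in dic:
                S = S + c                        # extend the current match
            else:
                out.append(dic[S])               # emit the id of the longest match S
                dic[S + c] = next_id             # add Sc to the dictionary
                next_id += 1
                S = c
        if S:
            out.append(dic[S])
        return out

The decoder rebuilds the same dictionary one step behind; the SSc special case mentioned above is resolved by completing the pending entry with the first character of the phrase being decoded.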

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
We are given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F                   L
#   mississipp   i
i   #mississip   p
i   ppi#missis   s
i   ssippi#mis   s
i   ssissippi#   m
m   ississippi   #
p   i#mississi   p
p   pi#mississ   i
s   ippi#missi   s
s   issippi#mi   s
s   sippi#miss   i
s   sissippi#m   i

(1994)

A famous example

Much
longer...

A useful tool: L → F mapping
[figure: the same sorted-rotation matrix, with the F column (#iiiimppssss) and the L column (ipssm#pissii) highlighted]

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
[figure: the sorted-rotation matrix once more, with F = #iiiimppssss and L = ipssm#pissii]

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
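A quadratic-space, didactic Python sketch of the transform and of the LF-based inversion sketched above (the terminator handling at the end is my own convention):

    def bwt(T):
        # T is assumed to end with a unique, lexicographically smallest terminator '#'
        n = len(T)
        rotations = sorted(T[i:] + T[:i] for i in range(n))
        return "".join(row[-1] for row in rotations)

    def ibwt(L):
        n = len(L)
        # Stably sorting the positions of L by character yields column F and,
        # at the same time, the LF mapping: L[i] corresponds to F[LF[i]].
        order = sorted(range(n), key=lambda i: L[i])
        LF = [0] * n
        for f_pos, l_pos in enumerate(order):
            LF[l_pos] = f_pos
        # Walk the LF chain from row 0 (the rotation starting with '#'),
        # collecting T backward, then rotate the terminator back to the end.
        r, out = 0, []
        for _ in range(n):
            out.append(L[r])
            r = LF[r]
        s = "".join(reversed(out))      # '#' followed by T without its terminator
        return s[1:] + s[0]

    # bwt("mississippi#") == "ipssm#pissii"  and  ibwt("ipssm#pissii") == "mississippi#"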

How to compute the BWT ?
SA = 12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3

[figure: the BWT matrix (sorted rotations of T) with SA on its left; its last column is L = ipssm#pissii]

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n^2 log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in-degree(u) = k ]  ∝  1 / k^a ,     a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1 - x, s2 - s1 - 1, ..., sk - s(k-1) - 1}

For negative entries (only the first gap s1 - x can be negative), signed values are remapped to non-negative integers.
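A tiny Python sketch of this gap transformation of one adjacency list; folding the sign of the (possibly negative) first gap into the parity of the output is an assumption made here for concreteness:

    def gaps(x, successors):
        out, prev = [], None
        for i, s in enumerate(sorted(successors)):
            if i == 0:
                g = s - x                                        # may be negative
                out.append(2 * g if g >= 0 else 2 * (-g) - 1)    # fold sign into parity
            else:
                out.append(s - prev - 1)                         # later gaps are >= 0
            prev = s
        return out

    # gaps(17, [13, 15, 16, 17, 18, 19, 23, 24, 203]) == [7, 1, 0, 0, 0, 0, 3, 0, 178]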

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides and efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
            Emacs size   Emacs time
uncompr     27Mb         ---
gzip        8Mb          35 secs
zdelta      1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[figure: a small weighted graph over files 1, 2, 3, 5 plus a dummy node 0; edge weights are zdelta sizes (e.g. 20, 123, 220, 620, 2000) and the min branching selects the cheapest reference for each file]

            space   time
uncompr     30Mb    ---
tgz         20%     linear
THIS        8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving zdelta executions. Nonetheless, strictly n^2 time.

            space    time
uncompr     260Mb    ---
tgz         12%      2 mins
THIS        8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
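A highly simplified Python sketch of the block-matching idea (real rsync couples a 4-byte rolling hash with a strong MD5-like hash and slides the window one byte at a time; here a single hash and a fixed block size B are used just to illustrate the exchange):

    def block_hashes(f_old, B):
        # the client hashes the non-overlapping blocks of its old file
        return {hash(f_old[i:i + B]): i for i in range(0, len(f_old), B)}

    def encode_new_file(f_new, hashes, B):
        out, j = [], 0
        while j < len(f_new):
            h = hash(f_new[j:j + B])
            if len(f_new) - j >= B and h in hashes:
                out.append(("copy", hashes[h]))   # reference to a block of f_old
                j += B
            else:
                out.append(("lit", f_new[j]))     # literal character
                j += 1
        return out

The receiver rebuilds f_new by materializing each "copy" from f_old and appending the literals.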

Rsync: some experiments

          gcc size   emacs size
total     27288      27326
gzip      7563       8577
zdelta    227        1431
rsync     964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[figure: the suffix tree of T# = mississippi#, with edge labels such as i, si, ssi, ppi#, pi#, i#, mississippi#, and leaves storing the starting positions 1..12 of the suffixes]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Θ(N^2) space (if the suffixes were stored explicitly)

SA      SUF(T)
12      #
11      i#
8       ippi#
5       issippi#
2       ississippi#
1       mississippi#
10      pi#
9       ppi#
7       sippi#
4       sissippi#
6       ssippi#
3       ssissippi#

T = mississippi#

P = si
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
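A plain Python sketch of the indirect binary search over SA (1-based SA entries as in the slides; for brevity each comparison slices a whole suffix, while the O(p) bound comes from comparing at most p characters per step):

    def sa_search(T, SA, P):
        def suffix(i):
            return T[SA[i] - 1:]
        lo, hi = 0, len(SA)
        while lo < hi:                      # leftmost suffix >= P
            mid = (lo + hi) // 2
            if suffix(mid) < P:
                lo = mid + 1
            else:
                hi = mid
        left, hi = lo, len(SA)
        while lo < hi:                      # leftmost suffix that is not prefixed by P
            mid = (lo + hi) // 2
            if suffix(mid).startswith(P) or suffix(mid) < P:
                lo = mid + 1
            else:
                hi = mid
        return [SA[i] for i in range(left, lo)]   # starting positions of the occurrences

    # With T = "mississippi#" and its SA, sa_search(T, SA, "si") returns [7, 4].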

Locating the occurrences
occ = 2

T = mississippi#

[figure: two binary searches (for si# and si$) delimit the contiguous range of SA containing the suffixes prefixed by P = si, namely sippi# (position 7) and sissippi# (position 4)]

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

SA:   12  11   8   5   2   1  10   9   7   4   6   3
Lcp:      0   1   1   4   0   0   1   0   2   1   3

T = mississippi#
(e.g., the Lcp between issippi# and ississippi# is 4)
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does it exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does it exist a substring of length ≥ L occurring ≥ C times ?
• Search for Lcp[i,i+C-2] whose entries are ≥ L


Slide 214

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 10^4 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and the algorithm uses all of them:
(1/B) * (p * f/(1+f) * C)  ≈  30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
        4K    8K    16K   32K    128K   256K   512K   1M
n^3     22s   3m    26m   3.5h   28h    --     --     --
n^2     0     0     0     1s     26s    106s   7m     28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm
   sum = 0; max = -1;
   for i = 1,...,n do
      if (sum + A[i] ≤ 0) then sum = 0;
      else { sum += A[i]; max = MAX{max, sum}; }

Note:
• Sum < 0 when OPT starts;
• Sum > 0 within OPT
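A direct Python transcription of this scan (initializing best to -infinity is my own choice, so the function is also defined when all entries are negative, a case excluded by the slide's assumption):

    def max_subarray_sum(A):
        best, s = float("-inf"), 0
        for x in A:
            if s + x <= 0:
                s = 0                      # restart: this prefix cannot start the optimum
            else:
                s += x
                best = max(best, s)
        return best

    # max_subarray_sum([2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]) == 12   (the subarray 6 1 -2 4 3)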

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02     m = (i+j)/2;               // Divide
03     Merge-Sort(A,i,m);         // Conquer
04     Merge-Sort(A,m+1,j);
05     Merge(A,i,m,j)             // Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
[figure: the recursion tree of binary Merge-Sort on a sample input; each level pairs up sorted runs and merges them]

How do we deploy the disk/mem features ?
N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
[figure: multiway merging — X = M/B input buffers Bf1..Bfx, one per run, each with a pointer p1..pX to its current page; repeatedly take min(Bf1[p1], Bf2[p2], …, Bfx[pX]), append it to the output buffer Bfo, flush Bfo when full, and refill Bfi from disk when pi = B; the output is the merged run, until EOF]

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...
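One standard way to implement this pairing argument is the Boyer-Moore majority-vote scan; the variable names match the slide's X and C:

    def majority_candidate(stream):
        X, C = None, 0
        for s in stream:
            if C == 0:
                X, C = s, 1           # adopt a new candidate
            elif X == s:
                C += 1                # one more supporter of X
            else:
                C -= 1                # pair s with one occurrence of X
        return X                      # equals the mode, provided it occurs > N/2 times

    # majority_candidate("bacccdcbaaaccbccc") == 'c'   (the stream A of the slide)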

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million documents,  t = 500K terms

             Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony              1                1             0          0       0        1
Brutus              1                1             0          1       0        0
Caesar              1                1             0          1       1        1
Calpurnia           0                1             0          0       0        0
Cleopatra           1                0             0          0       0        0
mercy               1                0             1          1       1        1
worser              1                0             1          1       1        0

1 if play contains word, 0 otherwise

Space is 500Gb !

Solution 2: Inverted index
Brutus     →  2, 4, 8, 16, 32, 64, 128
Caesar     →  1, 2, 3, 5, 8, 13, 21, 34
Calpurnia  →  13, 16

We can still do better: i.e. 30-50% of the original text

1. Typically use about 12 bytes
2. We have 10^9 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

Σ_{i=1,...,n-1} 2^i  =  2^n - 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie:

[figure: the binary trie of the code — a at path 0, b at path 100, c at path 101, d at path 11]

Average Length
For a code C with codeword length L[s], the
average length is defined as
La(C) = Σ_{s∈S} p(s) * L[s]

We say that a prefix code C is optimal if for all prefix codes C’, La(C) ≤ La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

[figure: the Huffman tree built by merging a(.1) + b(.2) = (.3), then (.3) + c(.2) = (.5), then (.5) + d(.5) = (1)]

a=000, b=001, c=01, d=1
There are 2^(n-1) “equivalent” Huffman trees

What about ties (and thus, tree depth) ?
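A compact Python sketch of the greedy construction (heap-based; the insertion counter only makes the heap comparisons well-defined and fixes one of the equivalent trees):

    import heapq

    def huffman_code(probs):
        # probs: dict symbol -> probability; returns dict symbol -> codeword (a bit string)
        heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(sorted(probs.items()))]
        heapq.heapify(heap)
        counter = len(heap)
        while len(heap) > 1:
            p1, _, c1 = heapq.heappop(heap)          # the two least probable trees
            p2, _, c2 = heapq.heappop(heap)
            merged = {s: "0" + w for s, w in c1.items()}
            merged.update({s: "1" + w for s, w in c2.items()})
            heapq.heappush(heap, (p1 + p2, counter, merged))
            counter += 1
        return heap[0][2]

    # huffman_code({'a': .1, 'b': .2, 'c': .2, 'd': .5}) yields codeword lengths 3,3,2,1,
    # i.e. one of the 2^(n-1) trees equivalent to a=000, b=001, c=01, d=1.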

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
abc...  00000101                  101001...  dcb

[figure: the codeword tree of a=000, b=001, c=01, d=1, traversed root-to-leaf for encoding and bit-by-bit for decoding]

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
[figure: a tagged, byte-aligned word-based Huffman encoding of T = “bzip or not bzip”: each codeword is a sequence of 7-bit configurations packed into bytes, and a tag bit marks the byte where a codeword starts]

CGrep and other ideas...
P = bzip = 1a 0b
T = “bzip or not bzip”

[figure: GREP runs directly on the compressed text C(T): the byte-aligned codeword of P is matched against C(T), candidate alignments are marked yes/no, and the two occurrences of “bzip” are found]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary = {bzip, not, or}
P = bzip = 1a 0b
S = “bzip or not bzip”

[figure: the tagged codewords of the dictionary words; the codeword of P is searched directly in C(S), the two occurrences of “bzip” are accepted (yes) and the other alignments rejected (no)]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H(s) = Σ_{i=1..m} 2^(m-i) * s[i]

P = 0101
H(P) = 2^3*0 + 2^2*1 + 2^1*0 + 2^0*1 = 5

s = s’ if and only if H(s) = H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
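A small Python sketch of the fingerprint scan over a binary text (here q is fixed and every fingerprint hit is verified, which corresponds to the deterministic variant; the randomized one would pick q at random among primes ≤ I):

    def karp_rabin(T, P, q=2**31 - 1):
        n, m = len(T), len(P)
        if m > n:
            return []
        top = pow(2, m, q)                       # 2^m mod q, used to drop the leading bit
        hp = ht = 0
        for i in range(m):
            hp = (2 * hp + int(P[i])) % q
            ht = (2 * ht + int(T[i])) % q
        occ = []
        for r in range(n - m + 1):
            if hp == ht and T[r:r + m] == P:     # verify to rule out false matches
                occ.append(r + 1)                # 1-based position, as in the slides
            if r + m < n:
                ht = (2 * ht - top * int(T[r]) + int(T[r + m])) % q
        return occ

    # karp_rabin("10110101", "0101") == [5], the occurrence with H(T5) = H(P) = 5.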

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M is an m x n matrix.

      T:   c  a  l  i  f  o  r  n  i  a
      j:   1  2  3  4  5  6  7  8  9  10
f (i=1)    0  0  0  0  1  0  0  0  0  0
o (i=2)    0  0  0  0  0  1  0  0  0  0
r (i=3)    0  0  0  0  0  0  1  0  0  0
How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac

U(a) = (1,0,1,1,0)      U(b) = (0,1,0,0,0)      U(c) = (0,0,0,0,1)

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with the i-th bit of U(T[j]) to establish if both are true
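A direct Python sketch of this exact-matching bit-parallel scan, with each column of M packed into one integer (bit i-1 stands for row i):

    def shift_and(T, P):
        m = len(P)
        U = {}
        for i, c in enumerate(P):
            U[c] = U.get(c, 0) | (1 << i)        # U(c): the positions of c in P
        M, occ = 0, []
        for j, c in enumerate(T):
            # BitShift(M(j-1)): shift down by one and set the first bit to 1
            M = ((M << 1) | 1) & U.get(c, 0)
            if M & (1 << (m - 1)):               # last row set: P ends at position j+1
                occ.append(j - m + 2)            # 1-based starting position
        return occ

    # shift_and("xabxabaaca", "abaac") == [5], the occurrence found at column j = 9 below.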

An example j=1
T = xabxabaaca,  P = abaac

U(x) = (0,0,0,0,0)
M(1) = BitShift(M(0)) & U(T[1]) = (1,0,0,0,0) & (0,0,0,0,0) = (0,0,0,0,0)
An example j=2
M(2) = BitShift(M(1)) & U(T[2]) = (1,0,0,0,0) & U(a) = (1,0,0,0,0)
An example j=3
M(3) = BitShift(M(2)) & U(T[3]) = (1,1,0,0,0) & U(b) = (0,1,0,0,0)
An example j=9
T = xabxabaaca,  P = abaac

         j:  1  2  3  4  5  6  7  8  9
  i=1 (a)    0  1  0  0  1  0  1  1  0
  i=2 (b)    0  0  1  0  0  1  0  0  0
  i=3 (a)    0  0  0  0  0  0  1  0  0
  i=4 (a)    0  0  0  0  0  0  0  1  0
  i=5 (c)    0  0  0  0  0  0  0  0  1

M(9) = BitShift(M(8)) & U(T[9]) = (1,1,0,0,1) & U(c) = (0,0,0,0,1)  ⟹  an occurrence of P ends at position 9

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

[figure: the first i-1 characters of P aligned under T[...j-1], with the (at most l) mismatching positions marked *]

BitShift( M_l(j-1) ) & U(T[j])

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

[figure: the first i-1 characters of P aligned under T[...j-1], with the (at most l-1) mismatching positions marked *]

BitShift( M_{l-1}(j-1) )

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M_l(j)  =  [ BitShift( M_l(j-1) ) & U(T[j]) ]  |  BitShift( M_{l-1}(j-1) )

Example M1
T = xabxabaaca
P = abaad

M1 (at most 1 mismatch):
         j:  1  2  3  4  5  6  7  8  9  10
  i=1        1  1  1  1  1  1  1  1  1  1
  i=2        0  0  1  0  0  1  0  1  1  0
  i=3        0  0  0  1  0  0  1  0  0  1
  i=4        0  0  0  0  1  0  0  1  0  0
  i=5        0  0  0  0  0  0  0  0  1  0

M0 (exact matching):
         j:  1  2  3  4  5  6  7  8  9  10
  i=1        0  1  0  0  1  0  1  1  0  1
  i=2        0  0  1  0  0  1  0  0  0  0
  i=3        0  0  0  0  0  0  1  0  0  0
  i=4        0  0  0  0  0  0  0  1  0  0
  i=5        0  0  0  0  0  0  0  0  0  0
How much do we pay?





The running time is O(k n (1 + m/w)).
Again, the method is practically efficient for small m.
Still, only O(k) columns of M are needed at any given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary = {bzip, not, or}

Given a pattern P find all the occurrences in S of all terms containing P as substring allowing k mismatches

S = “bzip or not bzip”,   P = bot,   k = 2

[figure: the k-mismatch Shift-And is run over the compressed text C(S); e.g. the term not (codeword 1g 0g 0a) matches P within 2 mismatches, so its occurrences in S are reported (yes)]

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
0 000 ...........0 x in b in ary
Length-1


x > 0 and Length = log2 x +1
e.g., 9 represented as <000,1001>.



g-code for x takes 2 log2 x +1 bits

(ie. factor of 2 from optimal)



Optimal for Pr(x) = 1/2x2, and i.i.d integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8

6

3

59

7

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How much good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
1 ≥Si=1,...,x pi ≥ x * px  x ≤ 1/px

How good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):



p i  | g ( i ) |

i  1 ,..., S

This is:



i  1 ,.., S

p i  [ 2 * log

1

 1]

pi

 2 * H 0 ( X ) 1
No much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2

A new concept: Continuers vs Stoppers

The main idea is:
• Previously we used: s = c = 128
• s + c = 256 (we are playing with 8 bits)
• Thus s items are encoded with 1 byte
• And s*c with 2 bytes, s*c^2 with 3 bytes, ...

An example
• 5000 distinct words
• ETDC encodes 128 + 128^2 = 16512 words on at most 2 bytes
• A (230,26)-dense code encodes 230 + 230*26 = 6210 words on at most 2 bytes, hence more words on 1 byte; thus, if the distribution is skewed, it compresses better...
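
A rough Python sketch (mine) of an (s,c)-dense encoder: byte values 0..s-1 act as stoppers and the remaining c = 256-s as continuers. The exact byte layout of published (s,c)-DC implementations may differ; this only illustrates how the codeword lengths grow as s, s*c, s*c^2, ...

def sc_dense_encode(rank, s):
    # rank >= 1 (ranks assigned by decreasing frequency).
    # Byte values 0..s-1 are "stoppers", s..255 are "continuers" (c = 256 - s).
    # The last byte of a codeword is the only stopper.
    c = 256 - s
    r = rank - 1
    out = []
    while r >= s:
        r -= s
        out.append(s + (r % c))          # emit a continuer
        r //= c
    out.append(r)                        # final stopper ends the codeword
    return bytes(out)

# With s = c = 128 (an ETDC-like setting): ranks 1..128 take 1 byte,
# ranks 129..16512 take 2 bytes.  With (s,c) = (230,26): ranks 1..230 take
# 1 byte and up to 230 + 230*26 = 6210 take 2 bytes, as in the slide.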

Optimal (s,c)-dense codes
Find the optimal s, assuming c = 256 - s.
• Brute-force approach
• Binary search: on real distributions, it seems there is one unique minimum
  (Ks = max codeword length; Fsk = cumulative prob. of the symbols whose |cw| <= k)

Experiments: (s,c)-DC is quite interesting…
Search is 6% faster than byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:
• Exploits temporal locality, and it is dynamic
• X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n^2 log n),  MTF = O(n log n) + n^2

Not much worse than Huffman ...but it may be far better
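
A tiny Python sketch (mine) of MTF with a plain Python list; a practical implementation would maintain the list with the search-tree + hash-table organization described a couple of slides below.

def mtf_encode(text, alphabet):
    L = list(alphabet)                   # current MTF list
    out = []
    for s in text:
        pos = L.index(s)                 # 1) output the position of s in L (0-based here)
        out.append(pos)
        L.insert(0, L.pop(pos))          # 2) move s to the front of L
    return out

def mtf_decode(codes, alphabet):
    L = list(alphabet)
    out = []
    for pos in codes:
        s = L[pos]
        out.append(s)
        L.insert(0, L.pop(pos))
    return "".join(out)

# mtf_encode("aabbbc", "abc") == [0, 0, 1, 0, 0, 2]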

MTF: how good is it ?
Encode the integers via g-coding:  |g(i)| ≤ 2 * log i + 1
Put the alphabet S at the front and consider the cost of encoding:

  O(|S| log |S|)  +  Σ_(x=1,...,|S|)  Σ_(i=2,...,nx)  g( p_i^x - p_(i-1)^x )

where nx is the number of occurrences of symbol x and p_i^x is the position of its i-th occurrence.

By Jensen's inequality:

  ≤  O(|S| log |S|)  +  Σ_(x=1,...,|S|)  nx * [ 2 * log (N / nx) + 1 ]
  ≤  O(|S| log |S|)  +  N * [ 2 * H0(X) + 1 ]

Hence  La[mtf]  ≤  2 * H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and separators) as the symbols to be encoded.
How to maintain the MTF-list efficiently:
• Search tree
  - Leaves contain the symbols, ordered as in the MTF-list
  - Nodes contain the size of their descending subtree
• Hash Table
  - key is a symbol
  - data is a pointer to the corresponding tree leaf

Each tree operation takes O(log |S|) time.
Total is O(n log |S|), where n = #symbols to be compressed.

Run Length Encoding (RLE)
If spatial locality is very high, then
  abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings ⇒ just the run lengths and one starting bit.
Properties:
• Exploits spatial locality, and it is a dynamic code
• There is a memory
• X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = Θ(n^2 log n)  >  Rle(X) = Θ(n log n)
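
A one-screen Python sketch (mine) of the run-length idea above.

def rle_encode(s):
    # abbbaacccca -> [('a',1), ('b',3), ('a',2), ('c',4), ('a',1)]
    runs = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs

# For a binary string it suffices to store the first bit and the run lengths:
# "0001101" -> first bit '0', lengths [3, 2, 1, 1]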

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive).

e.g.   a = .2  →  [0.0, 0.2)      b = .5  →  [0.2, 0.7)      c = .3  →  [0.7, 1.0)

  f(i) = Σ_(j=1,...,i-1) p(j)        so  f(a) = .0,  f(b) = .2,  f(c) = .7

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7)).

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
• start with [0, 1)
• symbol b = .5  →  restrict to [0.2, 0.7)
• symbol a = .2  →  restrict to [0.2, 0.3)
• symbol c = .3  →  restrict to [0.27, 0.3)

The final sequence interval is [.27, .3)

Arithmetic Coding
To code a sequence of symbols c_1 ... c_n with probabilities p[c], use the following:

  l_0 = 0      l_i = l_(i-1) + s_(i-1) * f[c_i]
  s_0 = 1      s_i = s_(i-1) * p[c_i]

f[c] is the cumulative prob. up to symbol c (not included).
The final interval size is

  s_n = ∏_(i=1,...,n) p[c_i]

The interval for a message sequence will be called the sequence interval.
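
A short Python sketch (mine) of the recurrences above, using exact fractions to avoid rounding; with the probabilities of the running example it reproduces the sequence interval [.27, .3) for "bac".

from fractions import Fraction

def sequence_interval(msg, p, f):
    # l_i = l_(i-1) + s_(i-1) * f[c_i],   s_i = s_(i-1) * p[c_i]
    l, s = Fraction(0), Fraction(1)
    for c in msg:
        l = l + s * f[c]
        s = s * p[c]
    return l, l + s                      # the sequence interval [l, l+s)

p = {"a": Fraction(2, 10), "b": Fraction(5, 10), "c": Fraction(3, 10)}
f = {"a": Fraction(0),     "b": Fraction(2, 10), "c": Fraction(7, 10)}
# sequence_interval("bac", p, f) == (Fraction(27, 100), Fraction(3, 10))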

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore, specifying any number in the final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:
• .49 lies in the symbol interval of b, [0.2, 0.7)  →  first symbol is b
• within [0.2, 0.7): a → [.2,.3), b → [.3,.55), c → [.55,.7); .49 lies in [.3,.55)  →  second symbol is b
• within [.3,.55): a → [.3,.35), b → [.35,.475), c → [.475,.55); .49 lies in [.475,.55)  →  third symbol is c

The message is bbc.

Representing a real number
Binary fractional representation:
  .75 = .11        1/3 = .0101...        11/16 = .1011

Algorithm (emit the binary expansion of x in [0,1)):
1. x = 2 * x
2. If x < 1, output 0
3. else x = x - 1, output 1 (and repeat)

So how about just using the shortest binary fractional representation in the sequence interval?
e.g.   [0, .33) → .01       [.33, .66) → .1       [.66, 1) → .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions.

  number    min      max      interval
  .11       .110     .111     [.75, 1.0)
  .101      .1010    .1011    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval is contained in the sequence interval (a dyadic number).

  e.g.  Sequence interval [.61, .79)  contains  the code interval of .101, i.e. [.625, .75)

Can use L + s/2, truncated to 1 + ⌈log (1/s)⌉ bits.

Bound on Arithmetic length
Note that  -log s + 1  =  log (2/s)

Bound on Length
Theorem. For a text of length n, the Arithmetic encoder generates at most

  1 + ⌈log (1/s)⌉  =  1 + ⌈log ∏_i (1/p_i)⌉
                    ≤  2 + Σ_(i=1,...,n) log (1/p_i)
                    =  2 + Σ_(k=1,...,|S|) n p_k log (1/p_k)
                    =  2 + n H0   bits

In practice it is nH0 + 0.02 n bits, because of rounding.

Integer Arithmetic Coding
Problem: operations on arbitrary-precision real numbers are expensive.
Key ideas of the integer version:
• Keep integers in the range [0..R) where R = 2^k
• Use rounding to generate integer intervals
• Whenever the sequence interval falls into the top, bottom or middle half, expand the interval by a factor of 2

Integer Arithmetic is an approximation.

Integer Arithmetic (scaling)
If l ≥ R/2 (top half):
  output 1 followed by m 0s;  m = 0;  the message interval is expanded by 2
If u < R/2 (bottom half):
  output 0 followed by m 1s;  m = 0;  the message interval is expanded by 2
If l ≥ R/4 and u < 3R/4 (middle half):
  increment m;  the message interval is expanded by 2
In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine: the ATB takes the current interval (L,s) and the next symbol c, drawn from the distribution (p1,....,pS), and returns the new interval (L',s').

Therefore, even the distribution can change over time.

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).
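
A few lines of Python (mine) showing how such count-based conditional probabilities can be collected for a context of length k; the sample string is only illustrative.

from collections import defaultdict

def build_counts(text, k):
    # counts[context][c] = how many times char c followed `context` in text
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(k, len(text)):
        counts[text[i-k:i]][text[i]] += 1
    return counts

counts = build_counts("the thermometer thesis", 2)     # illustrative text
# p(e | "th") = counts["th"]["e"] / sum(counts["th"].values())
# PPM additionally reserves some probability for an escape symbol (not shown).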

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
The ATB is fed with p[ s | context ], where s = c or esc: it maps (L,s) to (L',s') exactly as before.

Encoder and Decoder must know the protocol for selecting the same conditional probability distribution (PPM-variant).

PPM: Example Contexts            String = ACCBACCACBA B,   k = 2

Order 0 (empty context):   A = 4   B = 2   C = 5   $ = 3

Order 1:
  Context A:   C = 3   $ = 1
  Context B:   A = 2   $ = 1
  Context C:   A = 1   B = 2   C = 2   $ = 3

Order 2:
  Context AC:  B = 1   C = 2   $ = 2
  Context BA:  C = 1   $ = 1
  Context CA:  C = 1   $ = 1
  Context CB:  A = 2   $ = 1
  Context CC:  A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves
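
A deliberately naive Python sketch (mine, quadratic-time) of the LZ77 step just described; on the window example of the next slide it produces exactly (0,0,a) (1,1,c) (3,4,b) (3,3,a) (1,2,c).

def lz77_encode(T, W):
    # Output a list of triples (d, len, next_char); W = window size.
    out, i, n = [], 0, len(T)
    while i < n:
        best_d, best_len = 0, 0
        for j in range(max(0, i - W), i):          # candidate copy sources in the window
            d, l = i - j, 0
            while i + l < n - 1 and T[j + l] == T[i + l]:
                l += 1                             # overlaps (l > d) are allowed
            if l > best_len:
                best_d, best_len = d, l
        out.append((best_d, best_len, T[i + best_len]))
        i += best_len + 1                          # advance by len + 1
    return out

# lz77_encode("aacaacabcabaaac", 6) ==
#   [(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')]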

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb
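
A compact Python sketch (mine) of the LZ78 coding loop, with the dictionary kept as a map (parent id, char) -> id, i.e. a trie; on the string of the example above it outputs exactly (0,a)(1,b)(1,a)(0,c)(2,c)(5,b).

def lz78_encode(T):
    # dictionary: (parent_id, char) -> id ;  id 0 is the empty string
    dct, next_id = {}, 1
    out, cur = [], 0                     # cur = id of the current match S
    for i, c in enumerate(T):
        if (cur, c) in dct:              # extend the match inside the dictionary
            cur = dct[(cur, c)]
            if i == len(T) - 1:          # text ended in the middle of a match
                out.append((cur, ""))
        else:
            out.append((cur, c))         # output (id of S, next char c)
            dct[(cur, c)] = next_id      # add Sc to the dictionary
            next_id += 1
            cur = 0
    return out

# lz78_encode("aabaacabcabcb") ==
#   [(0,'a'), (1,'b'), (1,'a'), (0,'c'), (2,'c'), (5,'b')]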

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
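
A Python sketch (mine) of LZW; unlike the slides' setting it is initialized with an explicit small alphabet rather than the 256 ASCII codes, and the decoder shows how the special case (a code used before the decoder has inserted it) is resolved.

def lzw_encode(T, alphabet):
    dct = {ch: i for i, ch in enumerate(alphabet)}   # initial dictionary
    nxt, out, w = len(dct), [], ""
    for c in T:
        if w + c in dct:
            w += c                       # keep extending the current match S
        else:
            out.append(dct[w])           # emit the code of S (no extra char)
            dct[w + c] = nxt; nxt += 1   # still add Sc to the dictionary
            w = c
    out.append(dct[w])
    return out

def lzw_decode(codes, alphabet):
    dct = {i: ch for i, ch in enumerate(alphabet)}
    nxt = len(dct)
    prev = dct[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in dct:
            cur = dct[code]
        else:                            # special case: this code was created in the
            cur = prev + prev[0]         # previous encoder step, so it equals S + S[0]
        out.append(cur)
        dct[nxt] = prev + cur[0]; nxt += 1   # the decoder is one step behind
        prev = cur
    return "".join(out)

# codes = lzw_encode("aabaacababacb", "abc"); lzw_decode(codes, "abc") round-trips.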

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input    Output so far                 Dict
112      a
112      a a                           256 = aa
113      a a b                         257 = ab
256      a a b a a                     258 = ba
114      a a b a a c                   259 = aac
257      a a b a a c a b               260 = ca
261      a a b a a c a b ?             261 is not in the dictionary yet!
         a a b a a c a b a b a         261 = aba, resolved one step later
114      a a b a a c a b a b a c       262 = abac

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform     (1994)
Let us be given a text T = mississippi#. Its rotations are:

mississippi#   ississippi#m   ssissippi#mi   sissippi#mis
issippi#miss   ssippi#missi   sippi#missis   ippi#mississ
ppi#mississi   pi#mississip   i#mississipp   #mississippi

Sort the rows:

F               L
#  mississipp  i
i  #mississip  p
i  ppi#missis  s
i  ssippi#mis  s
i  ssissippi#  m
m  ississippi  #
p  i#mississi  p
p  pi#mississ  i
s  ippi#missi  s
s  issippi#mi  s
s  sippi#miss  i
s  sissippi#m  i

A famous example

Much
longer...

A useful tool: the L → F mapping
[Figure: the same sorted-rotation matrix, with the F and L columns shown and the middle "unknown".]

How do we map L's chars onto F's chars?
... We need to distinguish equal chars in F...

Take two equal chars of L and rotate their rows rightward by one position: they appear in F in the same relative order !!

The BWT is invertible
[Figure: the matrix again; F and L are known, the middle is unknown.]

Two key properties:
1. The LF-array maps L's chars to F's chars
2. L[ i ] precedes F[ i ] in T

Reconstruct T backward:
T = .... i p p i #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
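
A few lines of Python (mine) computing the BWT by sorting rotations and inverting it with the LF-mapping, along the lines of the pseudocode above; T is assumed to end with a unique smallest character '#'.

def bwt(T):
    rot = sorted(T[i:] + T[:i] for i in range(len(T)))
    return "".join(r[-1] for r in rot)            # the L column

def ibwt(L):
    n = len(L)
    order = sorted(range(n), key=lambda r: L[r])  # stable: i-th 'c' of L -> i-th 'c' of F
    LF = [0] * n
    for f, r in enumerate(order):
        LF[r] = f
    out, r = [], 0                                # row 0 is the rotation starting with '#'
    for _ in range(n - 1):                        # rebuild T backwards, '#' excluded
        out.append(L[r])
        r = LF[r]
    return "".join(reversed(out)) + "#"

# bwt("mississippi#") == "ipssm#pissii"
# ibwt("ipssm#pissii") == "mississippi#"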

How to compute the BWT ?
Via the Suffix Array of T:

  SA = 12 11 8 5 2 1 10 9 7 4 6 3        L = i p s s m # p i s s i i

(the rows of the BWT matrix, in sorted order, are the rotations starting at positions SA[1..12])

We said that: L[i] precedes F[i] in T.  For example, L[3] = T[7].
Given SA and T, we have  L[i] = T[ SA[i] - 1 ].

How to construct SA from T ?
Input: T = mississippi#

SA:  12  #
     11  i#
      8  ippi#
      5  issippi#
      2  ississippi#
      1  mississippi#
     10  pi#
      9  ppi#
      7  sippi#
      4  sissippi#
      6  ssippi#
      3  ssissippi#

Elegant but inefficient. Obvious inefficiencies:
• Θ(n^2 log n) time in the worst case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion pages are available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions
• Weakly connected components (WCC)
  - Set of nodes such that from any node you can go to any other node via an undirected path.
• Strongly connected components (SCC)
  - Set of nodes such that from any node you can go to any other node via a directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?
• Largest artifact ever conceived by humans
• Exploit the structure of the Web for
  - Crawl strategies
  - Search
  - Spam detection
  - Discovering communities on the web
  - Classification/organization
• Predict the evolution of the Web
• Sociological understanding

Many other large graphs…
• Physical network graph
  - V = Routers, E = communication links
• The “cosine” graph (undirected, weighted)
  - V = static web pages, E = semantic distance between pages
• Query-Log graph (bipartite, weighted)
  - V = queries and URLs, E = (q,u) if u is a result for q and has been clicked by some user who issued q
• Social graph (undirected, unweighted)
  - V = users, E = (x,y) if x knows y (facebook, address book, email, ..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution
Altavista crawl, 1999 / WebBase crawl, 2001
The indegree follows a power-law distribution:

  Pr[ in-degree(u) = k ]  ∝  1 / k^a ,    a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/x^a, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
[Figure: adjacency-matrix plot of the links, with pages i and j on the two axes.]
21 million pages, 150 million links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:
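
A tiny Python sketch (mine) of the gap transformation: the first successor is stored as s1-x (possibly negative), the others as gaps minus 1, and negative values are folded into non-negative ones with the same v -> 2v / 2|v|-1 mapping used in the interval examples a couple of slides below (0 = (15-15)*2, 3 = |13-15|*2-1). The node/successor values are only illustrative; WebGraph then compresses these small integers with suitable instantaneous codes.

def gaps(x, succ):
    # succ = sorted successor list of node x
    out = [succ[0] - x]                  # may be negative: locality keeps it small
    out += [succ[i] - succ[i-1] - 1 for i in range(1, len(succ))]
    return out

def to_unsigned(v):
    # fold a signed value into a non-negative one (used for the first entry)
    return 2 * v if v >= 0 else 2 * (-v) - 1

# gaps(15, [13, 15, 16, 17, 18, 19, 23, 24, 203]) ==
#   [-2, 1, 0, 0, 0, 0, 3, 0, 178]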

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit in the copy-list of y tells whether the corresponding successor of the reference x is also a successor of y;
the reference is chosen in a window [0,W] so as to give the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques
• Common knowledge between sender & receiver
  - Unstructured file: delta compression
• “Partial” knowledge
  - Unstructured files: file synchronization
  - Record-based data: set reconciliation

Formalization
• Delta compression   [diff, zdelta, REBL, …]
  - Compress file f deploying file f'
  - Compress a group of files
  - Speed-up web access by sending differences between the requested page and the ones available in cache
• File synchronization   [rsync, zsync]
  - Client updates old file f_old with f_new available on a server
  - Mirroring, Shared Crawling, Content Distribution Networks
• Set reconciliation
  - Client updates a structured old file f_old with f_new available on a server
  - Update of contacts or appointments, intersect inverted lists in P2P search engines

Z-delta compression   (one-to-one)
Problem: We have two files f_known and f_new, and the goal is to compute a file f_d of minimum size such that f_new can be derived from f_known and f_d.
• Assume that block moves and copies are allowed
• Find an optimal covering set of f_new based on f_known
• The LZ77-scheme provides an efficient, optimal solution
  - f_known plays the role of the "previously encoded text": compress the concatenation f_known·f_new, starting the encoding from f_new
• zdelta is one of the best implementations

            Emacs size    Emacs time
uncompr     27Mb          ---
gzip        8Mb           35 secs
zdelta      1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: a pair of proxies located on each side of the slow link uses a proprietary protocol to increase performance over this link.

  Client  ⇄  [slow link, delta-encoded pages]  ⇄  Proxy  ⇄  [fast link]  ⇄  Web

Use zdelta to reduce traffic:
• Old version available at both proxies
• Restricted to pages already visited (30% hits), URL-prefix match
• Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F.
• Useful on a dynamic collection of web pages, back-ups, …
• Apply pairwise zdelta: find for each f ∈ F a good reference
• Reduction to the Min Branching problem on DAGs
  - Build a weighted graph G_F: nodes = files, edge weights = zdelta-sizes
  - Insert a dummy node connected to all, whose edge weights are the gzip-coded sizes
  - Compute the min branching = directed spanning tree of minimum total cost, covering G's nodes

[Figure: small example graph with dummy node 0 and files 1, 2, 3, 5; edge weights such as 20, 123, 220, 620, 2000 are the candidate delta sizes.]

            space    time
uncompr     30Mb     ---
tgz         20%      linear
THIS        8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly: n^2 edge calculations (zdelta executions).
We wish to exploit some pruning approach:
• Collection analysis: cluster the files that appear similar and are thus good candidates for zdelta-compression. Build a sparse weighted graph G'_F containing only the edges between those pairs of files.
• Assign weights: estimate appropriate edge weights for G'_F, thus saving zdelta executions. Nonetheless, it still takes n^2 time.

            space    time
uncompr     260Mb    ---
tgz         12%      2 mins
THIS        8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
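
A much-simplified Python sketch (mine, not the real rsync protocol or its formats): the client sends one strong hash per block of f_old, and the server scans f_new emitting block references or literal bytes. Real rsync couples a cheap rolling hash with MD5 so that every offset can be tested at low cost.

import hashlib

def block_hashes(f_old, B):
    # one hash per B-byte block of the client's old file
    return {hashlib.md5(f_old[i:i+B]).hexdigest(): i // B
            for i in range(0, len(f_old), B)}

def rsync_encode(f_new, hashes, B):
    # Output: list of ('copy', block_id) or ('lit', bytes) instructions.
    out, i, lit = [], 0, b""
    while i + B <= len(f_new):
        h = hashlib.md5(f_new[i:i+B]).hexdigest()
        if h in hashes:
            if lit:
                out.append(("lit", lit)); lit = b""
            out.append(("copy", hashes[h]))
            i += B
        else:
            lit += f_new[i:i+1]          # no block match here: emit a literal byte
            i += 1
    lit += f_new[i:]
    if lit:
        out.append(("lit", lit))
    return out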

Rsync: some experiments

           gcc size    emacs size
total      27288       27326
gzip       7563        8577
zdelta     227         1431
rsync      964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync
Server sends hashes (unlike the client in rsync), and the client checks them.
Server deploys the common f_ref to compress the new f_tar (rsync, instead, compresses just f_tar).

A multi-round protocol
• k blocks of n/k elements
• log(n/k) levels
If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits.

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

[Figure: P aligned against T at position i, i.e. against the suffix T[i,N].]

Occurrences of P in T = all the suffixes of T having P as a prefix.

Example: P = si, T = mississippi  →  occurrences at positions 4 and 7.

SUF(T) = sorted set of suffixes of T

Reduction: from substring search to prefix search.

The Suffix Tree
T# = mississippi#
[Figure: the suffix tree of T#, i.e. the compacted trie of all suffixes of T#; edge labels are substrings of T# (e.g. "ssi", "ppi#", "i#", "mississippi#"), internal nodes branch on distinct next characters, and each leaf stores the starting position (1..12) of its suffix.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.

SUF(T) takes Θ(N^2) space if stored explicitly; store only the suffix pointers:

SA        SUF(T)
12        #
11        i#
 8        ippi#
 5        issippi#
 2        ississippi#
 1        mississippi#
10        pi#
 9        ppi#
 7        sippi#
 4        sissippi#
 6        ssippi#
 3        ssissippi#

T = mississippi#      (P = si)

Suffix Array space:
• SA: Θ(N log2 N) bits
• Text T: N chars
⇒ In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step.

Example: P = si, T = mississippi#. At each step compare P with the suffix pointed to by the middle entry of SA: if P is larger, recurse on the right half; if P is smaller, recurse on the left half.

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
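
A direct Python transcription (mine) of the indirect binary search just described: O(p) characters compared per step, O(log N) steps; build_sa uses the "elegant but inefficient" construction seen earlier.

def build_sa(T):
    return sorted(range(len(T)), key=lambda i: T[i:])   # elegant but inefficient

def sa_search(T, sa, P):
    # Two binary searches delimit the SA range whose suffixes have P as a prefix.
    def lower(strict):
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            pref = T[sa[mid]: sa[mid] + len(P)]          # O(p) chars per step
            if pref < P or (strict and pref == P):
                lo = mid + 1
            else:
                hi = mid
        return lo
    l, r = lower(False), lower(True)
    return sorted(sa[l:r])                               # starting positions of the occ occurrences

# T = "mississippi#"; sa = build_sa(T)
# sa_search(T, sa, "si") == [3, 6]   (i.e. positions 4 and 7 counting from 1)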

Locating the occurrences
T = mississippi#,  P = si. The occurrences form a contiguous range of SA, here the entries pointing to 7 (sippi#) and 4 (sissippi#), so occ = 2. The range can be delimited by binary-searching the two patterns si# and si$, where # < S < $.

Suffix Array search
• O(p + log2 N + occ) time
• Suffix Trays: O(p + log2 |S| + occ)   [Cole et al., '06]
• String B-tree   [Ferragina-Grossi, '95]
• Self-adjusting Suffix Arrays   [Ciriani et al., '02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

  Lcp:   0  0  1  4  0  0  1  0  2  1  3
  SA :  12 11  8  5  2  1 10  9  7  4  6  3

T = mississippi#   (e.g. the suffixes issippi# and ississippi# share a prefix of length 4)

• How long is the common prefix between T[i,...] and T[j,...] ?
  Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  Search for some Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.
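
A short Python sketch (mine) of the last question above: slide a window of C-1 consecutive Lcp entries and check whether all of them are ≥ L.

def exists_repeat(lcp, L, C):
    # lcp[i] = longest common prefix of the i-th and (i+1)-th suffix in SA order
    w = C - 1                            # C suffixes share a prefix iff C-1 adjacent lcp's do
    for i in range(len(lcp) - w + 1):
        if min(lcp[i:i + w]) >= L:
            return True
    return False

# With the Lcp array of mississippi#, exists_repeat(lcp, 3, 2) is True
# (a substring of length >= 3 repeats), while exists_repeat(lcp, 5, 2) is False.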


Slide 215

Example M1
1 2 3 4 5 6 7 8 910

M1=

T=xabxabaaca
P=
abaad

1

2

3

4

5

6

7

8

9

1
0

1

1

1

1

1

1

1

1

1

1

1

2

0

0

1

0

0

1

0

1

1

0

3

0

0

0

1

0

0

1

0

0

1

4

0

0

0

0

1

0

0

1

0

0

5

0

0

0

0

0

0

0

0

1

0

M0=

1

2

3

4

5

6

7

8

9

1
0

1

0

1

0

0

1

0

1

1

0

1

2

0

0

1

0

0

1

0

0

0

0

3

0

0

0

0

0

0

1

0

0

0

4

0

0

0

0

0

0

0

1

0

0

5

0

0

0

0

0

0

0

0

0

0

How much do we pay?





The running time is O(kn(1+m/w)
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Delection: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
0 000 ...........0 x in b in ary
Length-1


x > 0 and Length = log2 x +1
e.g., 9 represented as <000,1001>.



g-code for x takes 2 log2 x +1 bits

(ie. factor of 2 from optimal)



Optimal for Pr(x) = 1/2x2, and i.i.d integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8

6

3

59

7

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How much good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
1 ≥Si=1,...,x pi ≥ x * px  x ≤ 1/px

How good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):



p i  | g ( i ) |

i  1 ,..., S

This is:



i  1 ,.., S

p i  [ 2 * log

1

 1]

pi

 2 * H 0 ( X ) 1
No much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?

 Move-to-Front (MTF):
   As a freq-sorting approximator
   As a caching strategy
   As a compressor

 Run-Length-Encoding (RLE):
   FAX compression

Move to Front Coding
Transforms a char sequence into an integer sequence, which can then be var-length coded.

 Start with the list of symbols L = [a,b,c,d,…]
 For each input symbol s:
  1) output the position of s in L
  2) move s to the front of L

There is a memory. Properties:
 It exploits temporal locality, and it is dynamic
 X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n^2 log n),  MTF = O(n log n) + n^2

Not much worse than Huffman ...but it may be far better (a small sketch follows)
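A minimal Python sketch of the MTF coder just described; the output positions could
then be g-coded (see the gamma sketch earlier). Names are illustrative:

  def mtf_encode(text: str, alphabet: list) -> list:
      L = list(alphabet)                 # the MTF list, e.g. ['a','b','c','d',...]
      out = []
      for ch in text:
          pos = L.index(ch)              # 1) output the position of ch in L
          out.append(pos)
          L.insert(0, L.pop(pos))        # 2) move ch to the front of L
      return out

  def mtf_decode(codes: list, alphabet: list) -> str:
      L = list(alphabet)
      out = []
      for pos in codes:
          ch = L[pos]
          out.append(ch)
          L.insert(0, L.pop(pos))
      return "".join(out)

  # temporal locality pays off: repeated symbols are encoded by small integers
  codes = mtf_encode("aaabbbbccc", ["a", "b", "c", "d"])
  assert codes == [0, 0, 0, 1, 0, 0, 0, 2, 0, 0]
  assert mtf_decode(codes, ["a", "b", "c", "d"]) == "aaabbbbccc"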

MTF: how good is it ?
Encode the integers via g-coding:  |g(i)| ≤ 2 * log i + 1
Put S in front of the list and consider the cost of encoding (nx = #occurrences of
symbol x, and p_i^x = position of its i-th occurrence):

  O(|S| log |S|)  +  Σx=1,...,|S|  Σi=2,...,nx  | g( p_i^x - p_{i-1}^x ) |

By Jensen’s inequality this is:

  ≤  O(|S| log |S|)  +  Σx=1,...,|S|  nx * [ 2 * log (N / nx) + 1 ]
  =  O(|S| log |S|)  +  N * [ 2 * H0(X) + 1 ]

Hence   La[mtf]  ≤  2 * H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and separators) as the symbols to be encoded.
How to maintain the MTF-list efficiently:

 Search tree
   Leaves contain the symbols, ordered as in the MTF-list
   Nodes contain the size of their descending subtree

 Hash Table
   key is a symbol
   data is a pointer to the corresponding tree leaf

Each tree operation takes O(log |S|) time
Total is O(n log |S|), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
  abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In the case of binary strings ⇒ just the run lengths and one bit

There is a memory. Properties:
 It exploits spatial locality, and it is a dynamic code
 X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = Θ(n^2 log n)  >  Rle(X) = Θ(n log n)

(a small sketch follows)
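A tiny Python sketch of RLE as used above; the slide's example is checked at the end:

  def rle_encode(s: str) -> list:
      runs, i = [], 0
      while i < len(s):
          j = i
          while j < len(s) and s[j] == s[i]:
              j += 1
          runs.append((s[i], j - i))     # (symbol, run length)
          i = j
      return runs

  assert rle_encode("abbbaacccca") == [("a", 1), ("b", 3), ("a", 2), ("c", 4), ("a", 1)]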

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g. with p(a) = .2, p(b) = .5, p(c) = .3:

  f(i) = Σj=1,...,i-1 p(j)    so    f(a) = .0,  f(b) = .2,  f(c) = .7

[Figure: the unit interval [0,1) split as  a = [0,.2),  b = [.2,.7),  c = [.7,1.0)]

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
[Figure: nested intervals while coding "bac":
  start with [0,1);  symbol b → [.2,.7);  then a → [.2,.3);  then c → [.27,.3)]

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l 0  0 l i  l i -1  s i -1 * f c i 

s0  1

s i  s i -1 * p c i 

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

n

sn 

 p c 
i

i 1

The interval for a message sequence will be called the
sequence interval
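A small Python sketch of the (li, si) recurrences above, on the slides' distribution
p(a)=.2, p(b)=.5, p(c)=.3; real coders use the integer/scaling version discussed later,
this is only the real-number formulation:

  p = {"a": 0.2, "b": 0.5, "c": 0.3}      # symbol probabilities (slides' example)
  f = {"a": 0.0, "b": 0.2, "c": 0.7}      # cumulative prob. up to the symbol (excluded)

  def sequence_interval(msg):
      l, s = 0.0, 1.0                     # l0 = 0, s0 = 1
      for ch in msg:
          l = l + s * f[ch]               # li = l(i-1) + s(i-1) * f[ci]
          s = s * p[ch]                   # si = s(i-1) * p[ci]
      return l, s

  l, s = sequence_interval("bac")
  print(l, l + s)                         # ~0.27 0.30, i.e. the sequence interval [.27, .3)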

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore, specifying any number in the final interval uniquely determines the message.
Decoding is similar to encoding, but at each step we need to determine what the next
message symbol is and then reduce the interval accordingly.

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
[Figure: .49 falls in b = [.2,.7), so the 1st symbol is b; rescaling, (.49-.2)/.5 = .58
 falls again in b; rescaling once more, (.58-.2)/.5 = .76 falls in c = [.7,1).]

The message is bbc.

Representing a real number
Binary fractional representation:
  .75 = .11      1/3 = .010101...      11/16 = .1011

Algorithm (emitting the bits of x in [0,1)):
  1. x = 2 * x
  2. If x < 1 output 0
  3. else x = x - 1; output 1

So how about just using the shortest binary fractional representation in the
sequence interval?
  e.g.  [0,.33) → .01      [.33,.66) → .1      [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
  code     min        max        interval
  .11      .110...    .111...    [.75, 1.0)
  .101     .1010...   .1011...   [.625, .75)
We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

[Figure: a sequence interval [.61, .79) containing the code interval of .101, i.e. [.625, .75)]

Can use L + s/2 truncated to 1 + ⌈log (1/s)⌉ bits

Bound on Arithmetic length

Note that  -log s + 1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

  1 + ⌈log (1/s)⌉ = 1 + ⌈log ∏i (1/pi)⌉
                  ≤ 2 + Σi=1,...,n log (1/pi)
                  = 2 + Σk=1,...,|S| n pk log (1/pk)
                  = 2 + n H0   bits

In practice nH0 + 0.02 n bits, because of rounding.

Integer Arithmetic Coding
The problem is that operations on arbitrary-precision real numbers are expensive.
Key ideas of the integer version:
 Keep integers in the range [0, R) where R = 2^k
 Use rounding to generate integer intervals
 Whenever the sequence interval falls into the top, bottom or middle half, expand
  the interval by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
 If l ≥ R/2 (top half):
   Output 1 followed by m 0s;  m = 0;  the message interval is expanded by 2
 If u < R/2 (bottom half):
   Output 0 followed by m 1s;  m = 0;  the message interval is expanded by 2
 If l ≥ R/4 and u < 3R/4 (middle half):
   Increment m;  the message interval is expanded by 2
 In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine

[Figure: the ATB as a state machine: given the current interval (L,s), the distribution
 (p1,....,p|S|) and the next symbol c, it produces the new interval (L',s').]

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
[Figure: PPM feeds the ATB with p[ s | context ], where s is either the next char c or
 the escape symbol esc; the ATB maps the current interval (L,s) to (L',s').]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
String = ACCBACCACBA (next symbol: B),   k = 2

  Context      Counts
  (empty)      A = 4   B = 2   C = 5   $ = 3

  A            C = 3   $ = 1
  B            A = 2   $ = 1
  C            A = 1   B = 2   C = 2   $ = 3

  AC           B = 1   C = 2   $ = 2
  BA           C = 1   $ = 1
  CA           C = 1   $ = 1
  CB           A = 2   $ = 1
  CC           A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/
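A small Python sketch that rebuilds the context-count tables above for k = 2; taking the
escape count $ as the number of distinct successors of a context matches the numbers on
the slide, but PPM variants differ on this point:

  from collections import defaultdict

  def context_counts(text: str, k: int):
      tables = {j: defaultdict(lambda: defaultdict(int)) for j in range(k + 1)}
      for i, ch in enumerate(text):
          for j in range(k + 1):              # contexts of length 0..k
              if i >= j:
                  ctx = text[i - j:i]
                  tables[j][ctx][ch] += 1
      return tables

  tables = context_counts("ACCBACCACBA", 2)
  assert dict(tables[0][""]) == {"A": 4, "C": 5, "B": 2}
  assert dict(tables[1]["C"]) == {"C": 2, "B": 2, "A": 1}
  assert dict(tables[2]["CC"]) == {"B": 1, "A": 1}
  # escape counts: len(tables[1]["C"]) == 3, len(tables[2]["CC"]) == 2, ...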

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) as n → ∞ !!

LZ77
  a a c a a c a b c a b a b a c
[Figure: the dictionary is the set of all substrings starting before the cursor;
 at the cursor the emitted triple is <2,3,c>]

Algorithm’s step:
 Output <d, len, c> where
   d = distance of the copied string wrt the current position
   len = length of the longest match
   c = next char in the text beyond the longest match
 Advance by len + 1

A buffer “window” of fixed length slides over the text
(a small encoder sketch follows)
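A naive Python sketch of the LZ77 step just described (windowed, quadratic search for
the longest match); it reproduces the windowed example on the next slide. To keep the
code short, matches are capped so that a “next char” always exists:

  def lz77_encode(T: str, W: int = 6):
      out, i = [], 0
      while i < len(T):
          best_d, best_len = 0, 0
          for j in range(max(0, i - W), i):          # candidate starts within the window
              l = 0
              while i + l < len(T) - 1 and T[j + l] == T[i + l]:
                  l += 1                             # the match may overlap the cursor
              if l > best_len:
                  best_d, best_len = i - j, l
          out.append((best_d, best_len, T[i + best_len]))
          i += best_len + 1                          # advance by len + 1
      return out

  # the slide's example with window size 6:
  assert lz77_encode("aacaacabcabaaac", 6) == [
      (0, 0, "a"), (1, 1, "c"), (3, 4, "b"), (3, 3, "a"), (1, 2, "c")]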

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if len > d ? (the copy overlaps the text still to be written)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash table to speed up the searches on triplets
Triplets are then coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send the extra character c, but still add Sc to the dictionary.
Dictionary:
 initialized with 256 ascii entries (in the examples below, a = 112)
The decoder is one step behind the coder, since it does not know c
 There is an issue for strings of the form SSc where S[0] = c; these are handled
 specially!!!  (a small encoder sketch follows)
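A Python sketch of the LZW encoder; to keep it short the initial dictionary contains
only the symbols actually occurring in the input and, following the slides' examples,
a/b/c get codes 112/113/114 while new entries start at 256:

  def lzw_encode(T: str, first_code: int = 112):
      codes = {ch: first_code + i for i, ch in enumerate(sorted(set(T)))}
      next_code = 256                       # new entries start at 256, as in the slides
      out, w = [], ""
      for ch in T:
          if w + ch in codes:
              w = w + ch                    # extend the current match S
          else:
              out.append(codes[w])          # emit the code of the longest match S
              codes[w + ch] = next_code     # add Sc to the dictionary (c is NOT emitted)
              next_code += 1
              w = ch
      out.append(codes[w])                  # flush the last pending match
      return out

  # the slides' example text:
  assert lzw_encode("aabaacababacb") == [112, 112, 113, 256, 114, 257, 261, 114, 113]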

LZW: Encoding Example      T = a a b a a c a b a b a c b

  Output    Dict.
  112       256 = aa
  112       257 = ab
  113       258 = ba
  256       259 = aac
  114       260 = ca
  257       261 = aba
  261       262 = abac
  114       263 = cb

LZW: Decoding Example

  Input     Output so far             Dict.
  112       a
  112       a a                       256 = aa
  113       a a b                     257 = ab
  256       a a b a a                 258 = ba
  114       a a b a a c               259 = aac
  257       a a b a a c a b ?         260 = ca
  261       a a b a a c a b a b a     261 = aba   (the decoder is one step behind:
  114       ...                                    261 is resolved via the special case)

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us be given a text T = mississippi#  (1994).
Consider all its cyclic rotations:

  mississippi#   ississippi#m   ssissippi#mi   sissippi#mis
  issippi#miss   ssippi#missi   sippi#missis   ippi#mississ
  ppi#mississi   pi#mississip   i#mississipp   #mississippi

Sort the rows; F is the first column, L the last one:

  F                 L
  #  mississipp   i
  i  #mississip   p
  i  ppi#missis   s
  i  ssippi#mis   s
  i  ssissippi#   m
  m  ississippi   #
  p  i#mississi   p
  p  pi#mississ   i
  s  ippi#missi   s
  s  issippi#mi   s
  s  sippi#miss   i
  s  sissippi#m   i

  L = BWT(T) = i p s s m # p i s s i i

A famous example; real texts are much longer...

A useful tool: the L → F mapping

[Same sorted BWT matrix as above, with the F and L columns shown]

How do we map L’s chars onto F’s chars ?
... we need to distinguish equal chars in F...

 Take two equal chars of L
 Rotate their rows rightward by one position
 They keep the same relative order in F !!

The BWT is invertible
[Same sorted BWT matrix as above:  F = # i i i i m p p s s s s,   L = i p s s m # p i s s i i]

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
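A Python sketch of the forward BWT (by sorting the rotations) and of the inversion via
the LF-mapping described above; it assumes that T ends with a unique, lexicographically
smallest sentinel (# here), so that row 0 of the sorted matrix is the one starting with it:

  def bwt(T: str) -> str:
      rotations = sorted(T[i:] + T[:i] for i in range(len(T)))
      return "".join(row[-1] for row in rotations)     # L = last column

  def inverse_bwt(L: str) -> str:
      F = sorted(L)
      first_pos, count, LF = {}, {}, []
      for i, ch in enumerate(F):
          first_pos.setdefault(ch, i)
      for ch in L:                       # stable mapping: equal chars keep their order
          LF.append(first_pos[ch] + count.get(ch, 0))
          count[ch] = count.get(ch, 0) + 1
      chars, r = [], 0
      for _ in range(len(L)):            # walk T backwards via the LF-mapping
          chars.append(L[r])
          r = LF[r]
      chars.reverse()                    # now chars = '#' + T without its final '#'
      return "".join(chars[1:] + chars[:1])

  L = bwt("mississippi#")
  assert L == "ipssm#pissii"
  assert inverse_bwt(L) == "mississippi#"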

How to compute the BWT ?
SA and the BWT matrix:

  SA    sorted rotation    L
  12    #mississippi       i
  11    i#mississipp       p
   8    ippi#mississ       s
   5    issippi#miss       s
   2    ississippi#m       m
   1    mississippi#       #
  10    pi#mississip       p
   9    ppi#mississi       i
   7    sippi#missis       s
   4    sissippi#mis       s
   6    ssippi#missi       i
   3    ssissippi#mi       i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
  SA    suffix
  12    #
  11    i#
   8    ippi#
   5    issippi#
   2    ississippi#
   1    mississippi#
  10    pi#
   9    ppi#
   7    sippi#
   4    sissippi#
   6    ssippi#
   3    ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n^2 log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics

 Size
   1 trillion pages available (Google, 7/2008)
   5-40K per page => hundreds of terabytes
   Size grows every day!!

 Change
   8% new pages, 25% new links change weekly
   Life time of about 10 days

The Bow Tie

Some definitions

 Weakly connected components (WCC)
   Set of nodes such that from any node one can reach any other node via an undirected path.

 Strongly connected components (SCC)
   Set of nodes such that from any node one can reach any other node via a directed path.

Observing the Web Graph

 We do not know which percentage of it we know
 The only way to discover the graph structure of the web as hypertext is via large-scale crawls
 Warning: the picture might be distorted by
   size limitations of the crawl
   crawling rules
   perturbations of the "natural" process of birth and death of nodes and links

Why is it interesting?

 It is the largest artifact ever conceived by humans
 Exploit the structure of the Web for
   Crawl strategies
   Search
   Spam detection
   Discovering communities on the web
   Classification/organization
 Predict the evolution of the Web
   Sociological understanding

Many other large graphs…

 Physical network graph
   V = Routers
   E = communication links

 The “cosine” graph (undirected, weighted)
   V = static web pages
   E = semantic distance between pages

 Query-Log graph (bipartite, weighted)
   V = queries and URLs
   E = (q,u) if u is a result for q, and has been clicked by some user who issued q

 Social graph (undirected, unweighted)
   V = users
   E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)
 V = URLs,  E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN & no OUT)
Three key properties:
 Skewed distribution: the probability that a node has x links is ∝ 1/x^a, with a ≈ 2.1

The In-degree distribution
 Altavista crawl (1999), WebBase crawl (2001): the indegree follows a power-law distribution

   Pr[ in-degree(u) = k ]  ∝  1 / k^a ,    a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)
 V = URLs,  E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN, no OUT)

Three key properties:
 Skewed distribution: the probability that a node has x links is ∝ 1/x^a, with a ≈ 2.1
 Locality: usually most of the hyperlinks point to other URLs on the same host (about 80%).
 Similarity: pages close in lexicographic order tend to share many outgoing lists

A Picture of the Web Graph

[Figure: adjacency matrix (i,j) of a crawl with 21 million pages and 150 million links,
 with URLs sorted; the Berkeley and Stanford hosts are marked.]

URL-sorting → URL compression + Delta encoding

The library WebGraph

 From the uncompressed adjacency list to an adjacency list with compressed gaps (locality):
   Successor list S(x) = { s1 - x, s2 - s1 - 1, ..., sk - s(k-1) - 1 }
   For negative entries (only the first gap s1 - x can be negative): map them to
    non-negative integers, as in the residual examples of the next slide
    (positive x → 2x, negative x → 2|x| - 1)
   (a small gap-encoding sketch follows)

 Adjacency list with copy-lists (similarity), with reference chains possibly limited:
   Each bit of y’s copy-list tells whether the corresponding successor of the reference
    x is also a successor of y
   The reference index is chosen in [0,W] so as to give the best compression

 Copy-blocks = RLE(Copy-list), i.e. adjacency list with copy blocks (RLE on bit sequences):
   The first copy-block is 0 if the copy-list starts with 0
   The last block is omitted (we know the length…)
   The length is decremented by one for all blocks

This is a Java and C++ lib (≈3 bits/edge)
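A small Python sketch of the gap transformation of a successor list, S(x) = {s1-x,
s2-s1-1, …}; the successor list used below is only illustrative, and the remapping of
the (possibly negative) first gap is left out:

  def gaps(x: int, successors: list) -> list:
      out, prev = [], None
      for s in sorted(successors):
          out.append(s - x if prev is None else s - prev - 1)
          prev = s
      return out

  def ungaps(x: int, gap_list: list) -> list:
      out, prev = [], None
      for g in gap_list:
          s = (x + g) if prev is None else (prev + 1 + g)
          out.append(s)
          prev = s
      return out

  succ = [13, 15, 16, 17, 18, 19, 23, 24, 203]      # illustrative successor list of node 15
  assert ungaps(15, gaps(15, succ)) == succ
  print(gaps(15, succ))                             # [-2, 1, 0, 0, 0, 0, 3, 0, 178]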

Extra-nodes: Compressing Intervals
Adjacency list with copy blocks → exploit the consecutivity of the extra-nodes.

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background

[Figure: a sender transmits data to a receiver over a network link; the receiver may
 already hold some knowledge about the data.]

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques

 caching: “avoid sending the same object again”
   done on the basis of whole objects
   only works if objects are completely unchanged
   What about objects that are slightly changed?

 compression: “remove redundancy in transmitted data”
   avoid repeated substrings in the data
   can be extended to the history of past transmissions (at some overhead)
   What if the sender has never seen the data at the receiver ?

Types of Techniques

 Common knowledge between sender & receiver
   Unstructured file: delta compression   [diff, zdelta, REBL,…]

 “partial” knowledge
   Unstructured files: file synchronization   [rsync, zsync]
   Record-based data: set reconciliation

Formalization

 Delta compression
   Compress file f deploying file f’;  compress a group of files
   Speed-up web access by sending differences between the requested page and the
    ones available in cache

 File synchronization
   Client updates an old file f_old with the f_new available on a server
   Mirroring, Shared Crawling, Content Distribution Networks

 Set reconciliation
   Client updates a structured old file f_old with the f_new available on a server
   Update of contacts or appointments, intersect inverted lists in a P2P search engine

Z-delta compression   (one-to-one)

Problem: We have two files f_known and f_new, and the goal is to compute a file f_d of
minimum size such that f_new can be derived from f_known and f_d.

 Assume that block moves and copies are allowed
 Find an optimal covering set of f_new based on f_known
 The LZ77-scheme provides an efficient, optimal solution
   f_known is the “previously encoded text”: compress the concatenation f_known f_new,
    starting from f_new
 zdelta is one of the best implementations

            Emacs size    Emacs time
  uncompr   27Mb          ---
  gzip      8Mb           35 secs
  zdelta    1.5Mb         42 secs

Efficient Web Access
Dual-proxy architecture: a pair of proxies located on the two sides of the slow link
use a proprietary protocol to increase performance over this link.

[Figure: Client ↔ client proxy ↔ (slow link, delta-encoding) ↔ server proxy ↔ (fast link) ↔ web;
 the requested page is delta-encoded against a reference page available at both proxies.]

Use zdelta to reduce traffic:
 The old version is available at both proxies
 Restricted to pages already visited (30% hits), URL-prefix match
 Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F
 Useful on a dynamic collection of web pages, back-ups, …
 Apply pairwise zdelta: find for each f  F a good reference

Reduction to the Min-Branching problem on DAGs:
 Build a weighted graph G_F: nodes = files, weights = zdelta-sizes
 Insert a dummy node connected to all files, whose edge weights are the gzip-coding sizes
 Compute the min branching = directed spanning tree of minimum total cost covering G’s nodes

[Figure: a small weighted graph with a dummy node 0 and files 1, 2, 3, 5; edge weights
 such as 20, 123, 220, 620, 2000 are the (z)delta sizes.]

            space    time
  uncompr   30Mb     ---
  tgz       20%      linear
  THIS      8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly: n^2 edge calculations (zdelta executions).
We wish to exploit some pruning approach:

 Collection analysis: cluster the files that appear similar and are thus good candidates
  for zdelta-compression; build a sparse weighted graph G’_F containing only the edges
  between those pairs of files

 Assign weights: estimate appropriate edge weights for G’_F, thus saving zdelta
  executions. Nonetheless, this is still n^2 time.

            space     time
  uncompr   260Mb     ---
  tgz       12%       2 mins
  THIS      8%        16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
[Figure: the Client holds f_old and sends a request to the Server, which holds f_new and
 sends back an update.]

 the client wants to update an out-dated file
 the server has the new file but does not know the old file
 update without sending the entire f_new (using the similarity between the two files)
 rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch, since the server has both copies of the files.

The rsync algorithm
[Figure: the Client splits f_old into blocks and sends their hashes to the Server; the
 Server uses them to send back an encoded version of f_new.]

The rsync algorithm (contd)
 simple, widely used, single roundtrip
 optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
 choice of block size problematic (default: max{700, √n} bytes)
 not good in theory: the granularity of changes may disrupt the use of blocks
 (a small rolling-hash sketch follows)
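A Python sketch of a weak rolling block hash in the spirit of rsync; the real tool pairs
an Adler-32-like rolling checksum with a strong MD5/MD4 hash per block, while here the
plain byte-sum, the 16-bit modulus and the sample strings are only illustrative:

  def block_hashes(data: bytes, B: int):
      """Weak hash of every non-overlapping block of size B (what the receiver sends)."""
      return [sum(data[i:i + B]) % (1 << 16) for i in range(0, len(data), B)]

  def rolling_scan(data: bytes, B: int, wanted: set):
      """Slide a window of size B over data, updating the hash in O(1) per shift."""
      if len(data) < B:
          return []
      h = sum(data[:B]) % (1 << 16)
      hits = [0] if h in wanted else []
      for i in range(1, len(data) - B + 1):
          h = (h - data[i - 1] + data[i + B - 1]) % (1 << 16)   # drop left byte, add right byte
          if h in wanted:
              hits.append(i)            # candidate match, to be confirmed by a strong hash
      return hits

  old = b"the quick brown fox jumps over the lazy dog"
  new = b"the quick brown fox happily jumps over the lazy dog"
  wanted = set(block_hashes(old, 8))
  print(rolling_scan(new, 8, wanted))   # offsets in `new` whose window hashes match an old block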

Rsync: some experiments

            gcc size    emacs size
  total     27288       27326
  gzip      7563        8577
  zdelta    227         1431
  rsync     964         4452
Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

 The Server sends the hashes (unlike the client in rsync), and the client checks them
 The Server deploys the common f_ref to compress the new f_tar (rsync compresses just f_tar)

A multi-round protocol
 k blocks of n/k elements each,  log(n/k) levels
 If the distance is k, then on each level at most k hashes do not find a match in the other file.
 The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

[Figure: P is a prefix of the i-th suffix T[i,N]]

Occurrences of P in T = all suffixes of T having P as a prefix

Example: P = si, T = mississippi → occurrences at positions 4, 7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Figure: the suffix tree of T# = mississippi#. Edges are labeled with substrings
 (e.g. #, i, p, s, si, ssi, i#, pi#, ppi#, mississippi#, ...) and the 12 leaves are
 labeled with the starting positions 1..12 of the corresponding suffixes.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Storing SUF(T) explicitly takes Θ(N^2) space; the Suffix Array SA stores only the
starting positions (suffix pointers):

  SA    SUF(T)
  12    #
  11    i#
   8    ippi#
   5    issippi#
   2    ississippi#
   1    mississippi#
  10    pi#
   9    ppi#
   7    sippi#
   4    sissippi#
   6    ssippi#
   3    ssissippi#

T = mississippi#      (P = si prefixes the contiguous range of suffixes 4, 7)

Suffix Array space:
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 random accesses per step.

[Figure: two binary-search steps on the SA of T = mississippi# for P = si; at each step
 P is compared with the suffix pointed to by the middle SA entry, and the search
 continues in the half where P is larger (resp. smaller).]

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
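A Python sketch of the indirect binary search above, using the simple (and, as noted
earlier, inefficient) construction of SA by sorting the suffixes; positions are 1-based
as in the slides, and function names are illustrative:

  def build_sa(T: str):
      # positions are 1-based; suffix i is T[i, N]
      return sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])

  def sa_range(T: str, SA: list, P: str):
      # each comparison looks at the first |P| chars of the suffix: O(p) per step
      def pref(i):
          return T[i - 1:i - 1 + len(P)]
      lo, hi = 0, len(SA)
      while lo < hi:                       # leftmost suffix whose prefix is >= P
          mid = (lo + hi) // 2
          if pref(SA[mid]) < P:
              lo = mid + 1
          else:
              hi = mid
      left = lo
      lo, hi = left, len(SA)
      while lo < hi:                       # leftmost suffix whose prefix is > P
          mid = (lo + hi) // 2
          if pref(SA[mid]) <= P:
              lo = mid + 1
          else:
              hi = mid
      return left, lo                      # the occurrences are SA[left:lo]

  T = "mississippi#"
  SA = build_sa(T)
  assert SA == [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
  l, r = sa_range(T, SA, "si")
  assert sorted(SA[l:r]) == [4, 7]         # P = si occurs at positions 4 and 7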

Locating the occurrences
[Figure: the occ = 2 occurrences of P = si in T = mississippi# are the contiguous SA
 entries pointing to positions 4 and 7; the range is delimited by binary-searching for
 P# and P$, where # < Σ < $.]

Suffix Array search
• O(p + log2 N + occ) time
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

  Lcp = [ 0 0 1 4 0 0 1 0 2 1 3 ]
  SA  = [ 12 11 8 5 2 1 10 9 7 4 6 3 ]      T = mississippi#

[Figure: e.g. the adjacent suffixes issippi# (position 5) and ississippi# (position 2)
 share the prefix issi, of length 4.]

• How long is the common prefix between T[i,...] and T[j,...] ?
  • It is the min of the subarray Lcp[h,k-1], where SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  • Search for an entry Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  • Search for a window Lcp[i,i+C-2] whose entries are all ≥ L


Slide 216

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with i-th bit of U(T[j]) to estabilish if both are true

An example j=1
1 2 3 4 5 6 7 8 9 10

n
m

1

12345

1

0

P=abaac

2

0

3

0

4

0

5

0

T=xabxabaaca

0
 
0
U (x)  0
 
0
 
0

2

3

4

5

6

7

1 
 
0
BitShift ( M ( 0 )) & U (T [1])   0  &
 
0
 
0

8

9

0 0
   
0 0
0  0
   
0 0
   
0 0

1
0

An example j=2
1 2 3 4 5 6 7 8 9 10

n
m

1

2

12345

1

0

1

P=abaac

2

0

0

3

0

0

4

0

0

5

0

0

T=xabxabaaca

1 
 
0
U (a )   1 
 
1 
 
0

3

4

5

6

7

1 
 
0
BitShift ( M (1)) & U (T [ 2 ])   0  &
 
0
 
0

8

9

1  1 
   
0 0
1    0 
   
1   0 
   
0 0

1
0

An example j=3
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

12345

1

0

1

0

P=abaac

2

0

0

1

3

0

0

0

4

0

0

0

5

0

0

0

T=xabxabaaca

0
 
1 
U (b )   0 
 
0
 
0

4

5

6

7

1 
 
1 
BitShift ( M ( 2 )) & U (T [ 3 ])   0  &
 
0
 
0

8

9

0 0
   
1  1 
0  0
   
0 0
   
0 0

1
0

An example j=9
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

4

5

6

7

8

9

12345

1

0

1

0

0

1

0

1

1

0

P=abaac

2

0

0

1

0

0

1

0

0

0

3

0

0

0

0

0

0

1

0

0

4

0

0

0

0

0

0

0

1

0

5

0

0

0

0

0

0

0

0

1

T=xabxabaaca

0
 
0
U (c )   0 
 
0
 
1 

1 
 
1 
BitShift ( M ( 8 )) & U (T [ 9 ])   0  &
 
0
 
1 

0 0
   
0 0
0  0
   
0 0
   
1  1 

1
0

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M ( j - 1))  U (T [ j ])
l

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M

l -1

( j - 1))

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1
1 2 3 4 5 6 7 8 910

M1=

T=xabxabaaca
P=
abaad

1

2

3

4

5

6

7

8

9

1
0

1

1

1

1

1

1

1

1

1

1

1

2

0

0

1

0

0

1

0

1

1

0

3

0

0

0

1

0

0

1

0

0

1

4

0

0

0

0

1

0

0

1

0

0

5

0

0

0

0

0

0

0

0

1

0

M0=

1

2

3

4

5

6

7

8

9

1
0

1

0

1

0

0

1

0

1

1

0

1

2

0

0

1

0

0

1

0

0

0

0

3

0

0

0

0

0

0

1

0

0

0

4

0

0

0

0

0

0

0

1

0

0

5

0

0

0

0

0

0

0

0

0

0

How much do we pay?





The running time is O(kn(1+m/w)
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Delection: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
0 000 ...........0 x in b in ary
Length-1


x > 0 and Length = log2 x +1
e.g., 9 represented as <000,1001>.



g-code for x takes 2 log2 x +1 bits

(ie. factor of 2 from optimal)



Optimal for Pr(x) = 1/2x2, and i.i.d integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8

6

3

59

7

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How much good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
1 ≥Si=1,...,x pi ≥ x * px  x ≤ 1/px

How good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):



p i  | g ( i ) |

i  1 ,..., S

This is:



i  1 ,.., S

p i  [ 2 * log

1

 1]

pi

 2 * H 0 ( X ) 1
No much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1n 2n 3n… nn  Huff = O(n2 log n), MTF = O(n log n) + n2

No much worse than Huffman
...but it may be far better

MTF: how good is it ?
Encode the integers via g-coding: |g(i)| ≤ 2 * log i + 1
Put S in the front and consider the cost of encoding:
  O(|S| log |S|) + Σ_{x=1,..,|S|} Σ_{i≥2} | g( p_x^i - p_x^(i-1) ) |
By Jensen’s:
  ≤ O(|S| log |S|) + Σ_{x=1,..,|S|} n_x * [ 2 * log(N/n_x) + 1 ]
  = O(|S| log |S|) + N * [ 2 * H0(X) + 1 ]
Hence La[mtf] ≤ 2 * H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploits spatial locality, and it is a dynamic code
There is a memory
X = 1^n 2^n 3^n … n^n  ⟹  Huff(X) = Θ(n^2 log n) > Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive):
  f(i) = Σ_{j=1,i-1} p(j)
e.g. p(a)=.2, p(b)=.5, p(c)=.3  ⟹  f(a) = .0, f(b) = .2, f(c) = .7
so the intervals are a = [0,.2), b = [.2,.7), c = [.7,1.0)

The interval for a particular symbol will be called
the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac  (a=.2, b=.5, c=.3)
  start:   [0, 1)
  after b: [.2, .7)
  after a: [.2, .3)
  after c: [.27, .3)
The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c_1...c_n with probabilities p[c] use the following:
  l_0 = 0,  l_i = l_{i-1} + s_{i-1} * f[c_i]
  s_0 = 1,  s_i = s_{i-1} * p[c_i]
f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is
  s_n = Π_{i=1,n} p[c_i]
The interval for a message sequence will be called the
sequence interval
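A minimal sketch of the interval arithmetic above (a real coder uses the integer version; this only mirrors the formulas):

def sequence_interval(msg, p, f):
    l, s = 0.0, 1.0
    for c in msg:
        l = l + s * f[c]   # l_i = l_{i-1} + s_{i-1} * f[c_i]
        s = s * p[c]       # s_i = s_{i-1} * p[c_i]
    return l, l + s

p = {'a': .2, 'b': .5, 'c': .3}
f = {'a': .0, 'b': .2, 'c': .7}
print(sequence_interval("bac", p, f))   # -> approximately (0.27, 0.3), as in the example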

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3 (a=.2, b=.5, c=.3):
  [0,1):     .49 falls in the b-interval [.2,.7)    ⟹ output b
  [.2,.7):   .49 falls in the b-interval [.3,.55)   ⟹ output b
  [.3,.55):  .49 falls in the c-interval [.475,.55) ⟹ output c
The message is bbc.

Representing a real number
Binary fractional representation:
  .75 = .11      1/3 = .010101...      11/16 = .1011

Algorithm:
  1. x = 2 * x
  2. If x < 1 output 0
  3. else x = x - 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
e.g.  [0,.33) = .01     [.33,.66) = .1     [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions.

  code    min      max      interval
  .11     .110     .111     [.75, 1.0)
  .101    .1010    .1011    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most
  1 + ⌈ log (1/s_n) ⌉ = 1 + ⌈ log Π_{i=1,n} (1/p_i) ⌉
  ≤ 2 + Σ_{i=1,n} log (1/p_i)
  = 2 + Σ_{k=1,|S|} n p_k log (1/p_k)
  = 2 + n H0  bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine: the ATB receives the current interval (L,s), the next symbol c and the distribution (p1,....,pS), and returns the new interval (L’,s’).

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).
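A minimal sketch (illustrative, not the PPM implementation) of a k-th order count table from which such conditional probabilities are derived:

from collections import defaultdict

def build_counts(text, k):
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(k, len(text)):
        counts[text[i - k:i]][text[i]] += 1   # (context of length k) -> next char
    return counts

counts = build_counts("the other three brothers", 2)
ctx = counts["th"]
print(dict(ctx), ctx['e'] / sum(ctx.values()))   # p(e|th) from the observed counts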

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
At each step the ATB maps the current interval (L,s) to (L’,s’) using p[ s | context ], where s = c or esc.

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts     String = ACCBACCACBA B,   k=2

Context   Counts
empty     A=4  B=2  C=5  $=3
A         C=3  $=1
B         A=2  $=1
C         A=1  B=2  C=2  $=3
AC        B=1  C=2  $=2
BA        C=1  $=1
CA        C=1  $=1
CB        A=2  $=1
CC        A=1  B=1  $=2
You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
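A minimal sketch of LZ77 decoding with the copy-from-the-cursor rule above, which handles the overlapping case len > d transparently:

def lz77_decode(triples):
    out = []
    for d, length, c in triples:
        start = len(out) - d
        for i in range(length):          # copy char by char: works even if length > d
            out.append(out[start + i])
        out.append(c)
    return ''.join(out)

# seen = abcd, then the codeword (2,9,e) of the example:
print(lz77_decode([(0, 0, 'a'), (0, 0, 'b'), (0, 0, 'c'), (0, 0, 'd'), (2, 9, 'e')]))
# -> abcdcdcdcdcdce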

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
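A minimal sketch of LZW decoding including that special case; for brevity the initial dictionary here is just {a,b,c}, not the 256 ascii entries:

def lzw_decode(ids, alphabet="abc"):
    dic = {i: ch for i, ch in enumerate(alphabet)}
    prev = dic[ids[0]]
    out = [prev]
    for code in ids[1:]:
        if code in dic:
            cur = dic[code]
        else:                          # code just created by the encoder: the SSc case
            cur = prev + prev[0]
        dic[len(dic)] = prev + cur[0]  # add Sc, one step behind the coder
        out.append(cur)
        prev = cur
    return ''.join(out)

print(lzw_decode([0, 0, 1, 3, 2]))     # -> "aabaac" (a, a, b, aa, c)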

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
We are given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
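A minimal sketch of InvertBWT above: build the LF-mapping by stably sorting the characters of L, then walk backward; the final rotation just puts the end-marker back at the end.

def invert_bwt(L):
    # LF[i] = position in the first column F of the character L[i]
    # (equal characters keep their relative order)
    order = sorted(range(len(L)), key=lambda i: (L[i], i))
    LF = [0] * len(L)
    for f_pos, l_pos in enumerate(order):
        LF[l_pos] = f_pos
    chars = []
    r = 0                              # row 0 of the sorted matrix starts with '#'
    for _ in range(len(L)):
        chars.append(L[r])             # L[r] precedes F[r] in T
        r = LF[r]
    t = ''.join(reversed(chars))       # T rotated so that '#' comes first
    return t[1:] + t[0]

print(invert_bwt("ipssm#pissii"))      # -> mississippi#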

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n^2 log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution

Altavista crawl 1999 and WebBase crawl 2001: the indegree follows a power-law distribution
  Pr[ in-degree(u) = k ] ∝ 1/k^a,   a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)
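A minimal sketch of the gap encoding of successor lists shown above (S(x) = {s1-x, s2-s1-1, ...}); a negative first gap would need the special mapping mentioned above, and the actual WebGraph library then codes these small values with variable-length integer codes:

def gaps(x, successors):
    # assumes successors[0] >= x; otherwise the negative-entry mapping is needed
    out = [successors[0] - x]
    out += [successors[i] - successors[i - 1] - 1 for i in range(1, len(successors))]
    return out

def ungap(x, encoded):
    succ = [x + encoded[0]]
    for g in encoded[1:]:
        succ.append(succ[-1] + g + 1)
    return succ

S = [15, 16, 17, 22, 23, 24, 315, 316, 317]
print(gaps(13, S))                    # locality makes most gaps tiny (many zeros)
print(ungap(13, gaps(13, S)) == S)    # True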

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides an efficient, optimal solution




fknown is the “previously encoded text”: compress the concatenation fknown · fnew, starting from fnew

zdelta is one of the best implementations
           Emacs size   Emacs time
uncompr    27Mb         ---
gzip       8Mb          35 secs
zdelta     1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

(figure: weighted graph over the files, with zdelta sizes as edge weights and gzip sizes on the edges leaving the dummy node)

           space   time
uncompr    30Mb    ---
tgz        20%     linear
THIS       8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta executions. Nonetheless, it still takes n^2 time.

           space    time
uncompr    260Mb    ---
tgz        12%      2 mins
THIS       8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
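A much simplified sketch of the block-matching idea (one MD5 per fixed-size block, no rolling hash and no 4-byte/2-byte hash pair), only to illustrate how unchanged blocks of f_old are reused; it is not the actual rsync protocol:

import hashlib

def block_hashes(data, B):
    return {hashlib.md5(data[i:i + B]).hexdigest(): i // B
            for i in range(0, len(data), B)}

def delta(new, hashes, B):
    out, i = [], 0
    while i < len(new):
        h = hashlib.md5(new[i:i + B]).hexdigest()
        if h in hashes:
            out.append(('block', hashes[h]))   # reuse a block the client already has
            i += B
        else:
            out.append(('lit', new[i:i + 1]))  # otherwise ship a literal
            i += 1
    return out

old = b"the quick brown fox jumps over the lazy dog"
new = b"the quick brown cat jumps over the lazy dog"
print(delta(new, block_hashes(old, 8), 8))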

Rsync: some experiments

gcc size
total
27288
gzip
7563
zdelta
227
rsync
964

emacs size
27326
8577
1431
4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
T# = mississippi#
(figure: the suffix tree of T#, with edge labels such as #, i, ssi, si, ppi#, pi#, i#, mississippi#, and leaves storing the starting positions 1..12 of the suffixes)

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
5

Q(N2) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Q(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
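A minimal sketch of the indirect binary search on SA described above (naive Θ(N^2 log N) construction, as in the slides; occurrences are then collected by a forward scan of the contiguous range):

def suffix_array(T):
    return sorted(range(len(T)), key=lambda i: T[i:])   # naive construction

def search(T, SA, P):
    lo, hi = 0, len(SA)
    while lo < hi:                              # leftmost suffix >= P
        mid = (lo + hi) // 2
        if T[SA[mid]:] < P:
            lo = mid + 1
        else:
            hi = mid
    occ = []
    while lo < len(SA) and T[SA[lo]:SA[lo] + len(P)] == P:
        occ.append(SA[lo] + 1)                  # report 1-based positions
        lo += 1
    return occ

T = "mississippi#"
print(search(T, suffix_array(T), "si"))         # -> [7, 4], i.e. positions 4 and 7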

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
  • Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  • Search for an Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  • Search for a window Lcp[i,i+C-2] whose entries are all ≥ L


Slide 217

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 10^4 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and the algorithm uses all of it:
(1/B) * (p * f/(1+f) * C) ≈ 30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
        4K    8K    16K    32K    128K   256K   512K   1M
n^3     22s   3m    26m    3.5h   28h    --     --     --
n^2     0     0     0      1s     26s    106s   7m     28m
An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm

sum = 0; max = -1;
For i = 1,...,n do
  If (sum + A[i] ≤ 0) then sum = 0;
  else { sum += A[i]; max = MAX{max, sum}; }

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT
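A minimal sketch of the scan above, with the same reset rule, checked on the slide's example array:

def max_subarray_sum(A):
    best, s = -1, 0
    for x in A:
        if s + x <= 0:       # reset: the optimum never starts inside a negative prefix
            s = 0
        else:
            s += x
            best = max(best, s)
    return best

print(max_subarray_sum([2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]))   # -> 12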

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02   m = (i+j)/2;            // Divide
03   Merge-Sort(A,i,m);      // Conquer
04   Merge-Sort(A,m+1,j);
05   Merge(A,i,m,j)          // Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N levels

If the run-size is larger than B (i.e. after the first step!!),
fetching all of it in memory for merging does not help.
How do we deploy the disk/mem features?
(figure: mergesort recursion tree on a small set of keys, runs merged pairwise level by level)

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!
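A minimal in-memory sketch of the X-way merging step (a heap over the current heads of the runs; on disk each run would be read B items at a time into its buffer):

import heapq

def multiway_merge(runs):
    heads = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heads)
    out = []
    while heads:
        val, i, j = heapq.heappop(heads)        # min(Bf1[p1], ..., Bfx[pX])
        out.append(val)
        if j + 1 < len(runs[i]):
            heapq.heappush(heads, (runs[i][j + 1], i, j + 1))
    return out

print(multiway_merge([[1, 2, 5, 10], [2, 7, 9, 13], [3, 4, 8, 15]]))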

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm

Use a pair of variables X (candidate) and C (counter)
For each item s of the stream,
  if (X==s) then C++
  else { C--; if (C==0) { X=s; C=1; } }
Return X;
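A minimal sketch of the two-variable scan above (candidate X, counter C), checked on the example stream:

def majority_candidate(stream):
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1
        elif X == s:
            C += 1
        else:
            C -= 1
    return X        # the mode, provided it occurs > N/2 times

print(majority_candidate("bacccdcbaaaccbccc"))   # -> 'c' for the example stream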

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix    (t = 500K terms, n = 1 million documents)

             Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony              1               1             0          0        0        1
Brutus              1               1             0          1        0        0
Caesar              1               1             0          1        1        1
Calpurnia           0               1             0          0        0        0
Cleopatra           1               0             0          0        0        0
mercy               1               0             1          1        1        1
worser              1               0             1          1        1        0

1 if play contains word, 0 otherwise
Space is 500Gb !

Solution 2: Inverted index
  Brutus    → 2 4 8 16 32 64 128
  Calpurnia → 1 2 3 5 8 13 21 34
  Caesar    → 13 16

We can still do better: i.e. 30÷50% of the original text

1. Typically use about 12 bytes per posting
2. We have 10^9 total terms ⟹ at least 12Gb space
3. Compressing the 6Gb of documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2^n but we have fewer compressed messages:
  Σ_{i=1,...,n-1} 2^i = 2^n - 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
  i(s) = log2 (1/p(s)) = - log2 p(s)

Lower probability ⟹ higher information

Entropy is the weighted average of i(s):
  H(S) = Σ_{s∈S} p(s) * log2 (1/p(s))   bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
  La(C) = Σ_{s∈S} p(s) * L[s]

We say that a prefix code C is optimal if for
all prefix codes C’, La(C) ≤ La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have
  H(S) ≤ La(C)
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that
  La(C) ≤ H(S) + 1

Shannon code takes ⌈log 1/p⌉ bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5
  merge a(.1)+b(.2) → (.3);   merge (.3)+c(.2) → (.5);   merge (.5)+d(.5) → (1)
  a=000, b=001, c=01, d=1
There are 2^(n-1) “equivalent” Huffman trees

What about ties (and thus, tree depth) ?
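A minimal sketch of Huffman construction by repeatedly merging the two least probable items; it reproduces the codeword lengths of the running example (ties broken by insertion order, so other equivalent trees are possible):

import heapq
from itertools import count

def huffman_lengths(probs):
    tiebreak = count()
    heap = [(p, next(tiebreak), {s: 0}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, d1 = heapq.heappop(heap)
        p2, _, d2 = heapq.heappop(heap)
        merged = {s: l + 1 for s, l in {**d1, **d2}.items()}   # one level deeper
        heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
    return heap[0][2]

print(huffman_lengths({'a': .1, 'b': .2, 'c': .2, 'd': .5}))
# -> {'a': 3, 'b': 3, 'c': 2, 'd': 1}, matching a=000, b=001, c=01, d=1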

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m
  H(s) = Σ_{i=1,m} 2^(m-i) * s[i]

P=0101
H(P) = 2^3*0 + 2^2*1 + 2^1*0 + 2^0*1 = 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111, q=7
H(P) = 47,  Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
  1*2 (mod 7) + 0 = 2
  2*2 (mod 7) + 1 = 5
  5*2 (mod 7) + 1 = 4
  4*2 (mod 7) + 1 = 2
  2*2 (mod 7) + 1 = 5
  5 (mod 7) = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1):
  2^m (mod q) = 2 * (2^(m-1) (mod q)) (mod q)

Intermediate values are also small! (< 2q)
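A minimal sketch of the fingerprint scan with the incremental update above (binary alphabet, fixed q; a real implementation either verifies probable matches or picks a random prime q large enough):

def karp_rabin(T, P, q):
    m = len(P)
    pow_m = pow(2, m, q)
    hp = ht = 0
    for i in range(m):
        hp = (2 * hp + int(P[i])) % q          # Hq(P), computed incrementally
        ht = (2 * ht + int(T[i])) % q          # Hq(T1)
    hits = []
    for r in range(len(T) - m + 1):
        if ht == hp:
            hits.append(r + 1)                 # probable match at position r+1
        if r + m < len(T):
            # Hq(Tr+1) = 2*Hq(Tr) - 2^m*T[r] + T[r+m]  (all mod q)
            ht = (2 * ht - pow_m * int(T[r]) + int(T[r + m])) % q
    return hits

print(karp_rabin("10110101", "0101", 7))       # -> [5], as in the example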

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for
M is an m x n binary matrix. Here the only 1-entries are
  M(1,5): P[1]=f matches T[5]
  M(2,6): P[1..2]=fo matches T[5..6]
  M(3,7): P[1..3]=for matches T[5..7]
A 1 in the last row (i = m) at column j signals an occurrence of P ending at position j of T.

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M

We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one.
We define the m-length binary vector U(x) for each character x in the alphabet: U(x) is set to 1 at the positions in P where character x appears.
Example: P = abaac
  U(a) = (1,0,1,1,0)   U(b) = (0,1,0,0,0)   U(c) = (0,0,0,0,1)

How to construct M

Initialize column 0 of M to all zeros.
For j > 0, the j-th column is obtained by
  M(j) = BitShift( M(j-1) ) & U(T[j])
For i > 1, entry M(i,j) = 1 iff
  (1) the first i-1 characters of P match the i-1 characters of T ending at character j-1  ⇔ M(i-1,j-1) = 1
  (2) P[i] = T[j]  ⇔ the i-th bit of U(T[j]) = 1
BitShift moves bit M(i-1,j-1) into the i-th position; AND-ing it with the i-th bit of U(T[j]) establishes whether both conditions hold.
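A minimal sketch of Shift-And with one machine word per column, using Python integers as bit vectors (bit i-1 of the word = row i of the column):

def shift_and(T, P):
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)          # U(c): 1 at the positions of c in P
    M, occ = 0, []
    for j, c in enumerate(T):
        M = ((M << 1) | 1) & U.get(c, 0)       # M(j) = BitShift(M(j-1)) & U(T[j])
        if M & (1 << (m - 1)):                 # row m set => an occurrence ends at j
            occ.append(j - m + 2)              # 1-based starting position
    return occ

print(shift_and("xabxabaaca", "abaac"))        # -> [5], P occurs at position 5 of T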

An example (P = abaac, T = xabxabaaca)
  j=1: M(1) = BitShift(M(0)) & U(x) = (1,0,0,0,0) & (0,0,0,0,0) = (0,0,0,0,0)
  j=2: M(2) = BitShift(M(1)) & U(a) = (1,0,0,0,0) & (1,0,1,1,0) = (1,0,0,0,0)
  j=3: M(3) = BitShift(M(2)) & U(b) = (1,1,0,0,0) & (0,1,0,0,0) = (0,1,0,0,0)
  ...
  j=9: M(9) = BitShift(M(8)) & U(c) = (1,1,0,0,1) & (0,0,0,0,1) = (0,0,0,0,1)
The 1 in the last row of column 9 signals an occurrence of P ending at position 9 of T.

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1
The first i-1 characters of P match a substring of T ending at j-1, with at most l mismatches, and the next pair of characters in P and T are equal:
  BitShift( M^l(j-1) ) & U(T[j])

Computing Ml: case 2
The first i-1 characters of P match a substring of T ending at j-1, with at most l-1 mismatches:
  BitShift( M^(l-1)(j-1) )

Computing Ml
We compute M^l for all l = 0,…,k; for each j compute M^0(j), M^1(j), …, M^k(j); for all l initialize M^l(0) to the zero vector.
In order to compute M^l(j), we combine the two cases:
  M^l(j) = [ BitShift( M^l(j-1) ) & U(T[j]) ]  OR  BitShift( M^(l-1)(j-1) )

Example M1
T = xabxabaaca, P = abaad, k=1
M^0 is the exact-match matrix of the previous example (the last character d never matches, so its last row is all zeros).
M^1 allows one mismatch: its last row has a 1 in column 9, since abaad matches T[5..9] = abaac with a single mismatch.
(figure: the full 5 x 10 matrices M^1 and M^0 are omitted)

How much do we pay?





The running time is O(kn(1+m/w)
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Delection: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
0 000 ...........0 x in b in ary
Length-1


x > 0 and Length = log2 x +1
e.g., 9 represented as <000,1001>.



g-code for x takes 2 log2 x +1 bits

(ie. factor of 2 from optimal)



Optimal for Pr(x) = 1/2x2, and i.i.d integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8

6

3

59

7

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How much good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
1 ≥Si=1,...,x pi ≥ x * px  x ≤ 1/px

How good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):



p i  | g ( i ) |

i  1 ,..., S

This is:



i  1 ,.., S

p i  [ 2 * log

1

 1]

pi

 2 * H 0 ( X ) 1
No much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1n 2n 3n… nn  Huff = O(n2 log n), MTF = O(n log n) + n2

No much worse than Huffman
...but it may be far better

MTF: how good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
Put S in the front and consider the cost of encoding:
S

O ( S log S ) 



nx

g(p - p )

x 1 i  2

x

x

i

i -1

S

By Jensen’s:

 O ( S log S ) 

n
x 1

x

[ 2 * log

N

 1]

nx

 O ( S log S )  N * [ 2 * H 0 ( X )  1]
L a [ mtf ]  2 * H 0 ( X )  O (1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as the symbols to be encoded.
How to maintain the MTF-list efficiently:

 Search tree
   • Leaves contain the symbols, ordered as in the MTF-list
   • Nodes contain the size of their descending subtree

 Hash table
   • key is a symbol
   • data is a pointer to the corresponding tree leaf

Each tree operation takes O(log |S|) time
Total is O(n log |S|), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In the case of binary strings → just the run lengths and one initial bit
Properties:
 It exploits spatial locality, and it is a dynamic code
 There is a memory
 X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = Θ(n² log n) > Rle(X) = Θ(n log n)
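
A minimal sketch of RLE matching the example above (illustrative, not from the slides); for binary strings one would keep only the first bit and the run lengths.

#include <string>
#include <utility>
#include <vector>

std::vector<std::pair<char, size_t>> rle(const std::string& s) {
    std::vector<std::pair<char, size_t>> runs;
    for (size_t i = 0; i < s.size(); ) {
        size_t j = i;
        while (j < s.size() && s[j] == s[i]) ++j;   // extend the current run
        runs.push_back({s[i], j - i});              // (symbol, run length)
        i = j;
    }
    return runs;
}
// rle("abbbaacccca") = (a,1)(b,3)(a,2)(c,4)(a,1)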

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive):

    f(i) = Σ_{j=1,...,i−1} p(j)

e.g. with p(a) = .2, p(b) = .5, p(c) = .3:
f(a) = .0, f(b) = .2, f(c) = .7
so a gets [0,.2), b gets [.2,.7), c gets [.7,1.0).

The interval for a particular symbol will be called
the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac

    start:  [0, 1.0)
    b:      [.2, .7)      (b's sub-interval of [0,1))
    a:      [.2, .3)      (a's sub-interval of [.2,.7))
    c:      [.27, .3)     (c's sub-interval of [.2,.3))

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c_1 c_2 ... c_n with probabilities
p[c], use the following:

    l_0 = 0      l_i = l_{i-1} + s_{i-1} · f[c_i]
    s_0 = 1      s_i = s_{i-1} · p[c_i]

f[c] is the cumulative prob. up to symbol c (not included).
The final interval size is

    s_n = Π_{i=1,...,n} p[c_i]

The interval for a message sequence will be called the
sequence interval
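
A minimal sketch (real-valued, ignoring precision issues) of the interval update above, run on the example sequence "bac"; the symbols and probabilities are the ones of the running example.

#include <cstdio>
#include <map>

int main() {
    std::map<char, double> p = {{'a', 0.2}, {'b', 0.5}, {'c', 0.3}};
    std::map<char, double> f = {{'a', 0.0}, {'b', 0.2}, {'c', 0.7}}; // cumulative prob. below the symbol
    const char* msg = "bac";
    double l = 0.0, s = 1.0;                     // l_0 = 0, s_0 = 1
    for (const char* q = msg; *q; ++q) {
        l = l + s * f[*q];                       // shift by the cumulative probability
        s = s * p[*q];                           // shrink by the symbol probability
    }
    std::printf("sequence interval = [%g, %g)\n", l, l + s);   // prints [0.27, 0.3)
    return 0;
}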

Uniquely defining an interval
Important property: the intervals for distinct
messages of length n will never overlap.
Therefore, specifying any number in the
final interval uniquely determines the message.
Decoding is similar to encoding, but at each
step we need to determine what the message
symbol is and then reduce the interval.

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:

    .49 ∈ [.2, .7)     → b    (split of [0,1):     a [0,.2), b [.2,.7), c [.7,1))
    .49 ∈ [.3, .55)    → b    (split of [.2,.7):   a [.2,.3), b [.3,.55), c [.55,.7))
    .49 ∈ [.475, .55)  → c    (split of [.3,.55):  a [.3,.35), b [.35,.475), c [.475,.55))

The message is bbc.

Representing a real number
Binary fractional representation:

    .75   = .11
    1/3   = .010101...
    11/16 = .1011

Algorithm:
  1. x = 2·x
  2. If x < 1 output 0
  3. else x = x − 1; output 1

So how about just using the shortest binary
fractional representation within the sequence
interval?
e.g. [0,.33) → .01    [.33,.66) → .1    [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.

    code    min     max     interval
    .11     .110    .111    [.75, 1.0)
    .101    .1010   .1011   [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (a dyadic number).

e.g. sequence interval [.61, .79)  ⊇  code interval [.625, .75) of .101

Can use L + s/2 truncated to ⌈1 + log (1/s)⌉ bits

Bound on Arithmetic length

Note that −log s + 1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most

    1 + ⌈log (1/s)⌉
    = 1 + ⌈log Π_i (1/p_i)⌉
    ≤ 2 + Σ_{i=1,...,n} log (1/p_i)
    = 2 + Σ_{k=1,...,|S|} n·p_k · log (1/p_k)
    = 2 + n·H_0   bits

In practice ≈ n·H_0 + 0.02·n bits,
because of rounding.

Integer Arithmetic Coding
The problem is that operations on arbitrary-
precision real numbers are expensive.
Key ideas of the integer version:

 Keep integers in range [0..R) where R = 2^k
 Use rounding to generate the integer interval
 Whenever the sequence interval falls into the top,
 bottom or middle half, expand the interval
 by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
  Output 1 followed by m 0s; m = 0
  Message interval is expanded by 2

If u < R/2 then (bottom half)
  Output 0 followed by m 1s; m = 0
  Message interval is expanded by 2

If l ≥ R/4 and u < 3R/4 then (middle half)
  Increment m
  Message interval is expanded by 2

In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine:

[Figure: the ATB takes the current interval (L,s), the distribution (p_1,....,p_|S|)
and the next symbol c, and returns the narrowed interval (L',s').]

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.
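
A minimal sketch (illustrative only, not a full PPM implementation and not the slides' variant) of keeping per-context counts and falling back through escapes to shorter contexts; the escape weighting here is just one of the possible heuristics mentioned above.

#include <map>
#include <string>

struct ContextModel {
    std::map<std::string, std::map<char, int>> counts;   // context -> symbol -> count

    void update(const std::string& ctx, char c) { counts[ctx][c]++; }

    // Estimated probability of c after ctx; on a novel symbol, "pay" for an
    // escape and retry with a context shorter by one, down to the empty one.
    double prob(std::string ctx, char c) const {
        double scale = 1.0;
        while (true) {
            auto it = counts.find(ctx);
            if (it != counts.end()) {
                int total = 0, esc = 1;                   // one count reserved for the escape
                for (auto& kv : it->second) total += kv.second;
                auto jt = it->second.find(c);
                if (jt != it->second.end())
                    return scale * double(jt->second) / (total + esc);
                scale *= double(esc) / (total + esc);     // escape to a shorter context
            }
            if (ctx.empty()) return scale / 256.0;        // order(-1): uniform fallback
            ctx.erase(ctx.begin());                       // drop the oldest context char
        }
    }
};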

PPM + Arithmetic ToolBox

[Figure: at each step the PPM model feeds the ATB with p[ s | context ], where
s is either the next char c or the escape; the ATB maps (L,s) to (L',s').]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
String = ACCBACCACBA B      k = 2

    Context      Counts
    Empty        A = 4   B = 2   C = 5   $ = 3

    A            C = 3   $ = 1
    B            A = 2   $ = 1
    C            A = 1   B = 2   C = 2   $ = 3

    AC           B = 1   C = 2   $ = 2
    BA           C = 1   $ = 1
    CA           C = 1   $ = 1
    CB           A = 2   $ = 1
    CC           A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!

LZ77
Text:  a a c a a c a b c a b a b a c
The dictionary is the text before the Cursor (all substrings starting there);
a step may output e.g. <2,3,c>.

Algorithm's step:
 Output <d, len, c> where
   • d   = distance of the copied string w.r.t. the current position
   • len = length of the longest match
   • c   = next char in the text beyond the longest match
 Advance by len + 1

A buffer “window” has fixed length and moves
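
A minimal sketch of the LZ77 step above (illustrative code, quadratic match search, no window bound; the example below additionally restricts the dictionary to a window of 6 characters, so its distances are window-limited).

#include <string>
#include <tuple>
#include <vector>

std::vector<std::tuple<int,int,char>> lz77(const std::string& T) {
    std::vector<std::tuple<int,int,char>> out;
    size_t i = 0;                                    // the cursor
    while (i < T.size()) {
        int best_len = 0, best_d = 0;
        for (size_t j = 0; j < i; ++j) {             // candidate copy source
            size_t len = 0;                          // overlap with the cursor is allowed
            while (i + len + 1 < T.size() && T[j + len] == T[i + len]) ++len;
            if ((int)len > best_len) { best_len = (int)len; best_d = (int)(i - j); }
        }
        out.emplace_back(best_d, best_len, T[i + best_len]);  // <d, len, next char>
        i += best_len + 1;                           // advance by len + 1
    }
    return out;
}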

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6; each triple gives the longest match within the window W
and the next character.

LZ77 Decoding
The decoder keeps the same dictionary window as the encoder:
it finds the referenced substring and inserts a copy of it.

What if len > d ? (overlap with the text still to be decoded)
 E.g. seen = abcd, next codeword is (2,9,e)
 Simply copy left-to-right starting at the cursor:
   for (i = 0; i < len; i++)
     out[cursor+i] = out[cursor-d+i];   // the source may overlap the destination
 Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash table to speed up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb
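
A minimal sketch of the LZ78 coding loop (illustrative; a real implementation stores the dictionary as a trie, as noted above).

#include <map>
#include <string>
#include <utility>
#include <vector>

std::vector<std::pair<int,char>> lz78(const std::string& T) {
    std::map<std::string, int> dict;                  // phrase -> id
    std::vector<std::pair<int,char>> out;
    int next_id = 1;
    size_t i = 0;
    while (i < T.size()) {
        std::string S;                                // longest dictionary match
        int id = 0;                                   // id 0 = empty phrase
        while (i < T.size() && dict.count(S + T[i])) { S += T[i]; id = dict[S]; ++i; }
        char c = (i < T.size()) ? T[i++] : '\0';      // char right after the match
        out.push_back({id, c});
        dict[S + c] = next_id++;                      // add the phrase S·c
    }
    return out;
}
// On "aabaacabcabcb" this produces (0,a)(1,b)(1,a)(0,c)(2,c)(5,b),
// matching the coding example above.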

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
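
A minimal sketch of LZW decoding showing how the special case is handled when the received code is not yet in the dictionary (illustrative; here the dictionary starts with the 256 ASCII codes and new ids start at 256, whereas the slides' example uses ad-hoc ids such as a = 112).

#include <map>
#include <string>
#include <vector>

std::string lzw_decode(const std::vector<int>& codes) {
    std::map<int, std::string> dict;
    for (int c = 0; c < 256; ++c) dict[c] = std::string(1, char(c));
    int next_id = 256;
    std::string out, prev;
    for (int code : codes) {
        std::string cur;
        if (dict.count(code)) cur = dict[code];
        else                  cur = prev + prev[0];    // code not yet known: must be prev + prev[0]
        out += cur;
        if (!prev.empty()) dict[next_id++] = prev + cur[0];  // decoder is one step behind
        prev = cur;
    }
    return out;
}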

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
    Input   Output so far              Dict
    112     a
    112     a a                        256 = aa
    113     a a b                      257 = ab
    256     a a b a a                  258 = ba
    114     a a b a a c                259 = aac
    257     a a b a a c a b            260 = ca
    261     a a b a a c a b a b a      261 = aba
    114     a a b a a c a b a b a c    262 = abac

When code 261 arrives it is not yet in the dictionary (the decoder adds each
entry one step later than the coder): it must be the previous string "ab"
followed by its own first char, i.e. "aba".

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us be given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows (1994):

    F                L
    # mississipp  i
    i #mississip  p
    i ppi#missis  s
    i ssippi#mis  s
    i ssissippi#  m
    m ississippi  #
    p i#mississi  p
    p pi#mississ  i
    s ippi#missi  s
    s issippi#mi  s
    s sippi#miss  i
    s sissippi#m  i

F = first column, L = last column, T = the original text.

A famous example (a much longer text).

A useful tool: the L → F mapping
(the same sorted matrix as above, with L known and T unknown)

 How do we map L's chars onto F's chars ?
 ... we need to distinguish equal chars in F...
 Take two equal chars of L and rotate their rows rightward by one position:
 they appear in F in the same relative order !!

The BWT is invertible
(L is known, T is unknown)

Two key properties:
 1. The LF-array maps L's chars to F's chars
 2. L[ i ] precedes F[ i ] in T

Reconstruct T backward:

    T = .... i p p i #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
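
A minimal sketch (illustrative, not the slides' code) of one standard way to build LF by counting and then run the same backward walk as the InvertBWT pseudo-code above; here first_row is the row of the sorted matrix equal to T itself.

#include <string>
#include <vector>

std::string invert_bwt(const std::string& L, int first_row) {
    int n = (int)L.size();
    std::vector<int> count(256, 0);
    for (char c : L) count[(unsigned char)c]++;
    std::vector<int> F(256, 0);                      // F[c] = first row starting with char c
    for (int c = 1; c < 256; ++c) F[c] = F[c - 1] + count[c - 1];
    std::vector<int> LF(n), seen(256, 0);
    for (int i = 0; i < n; ++i) {                    // equal chars keep their relative order
        unsigned char c = (unsigned char)L[i];
        LF[i] = F[c] + seen[c]++;
    }
    std::string T(n, ' ');
    int r = first_row;
    for (int i = n - 1; i >= 0; --i) {               // reconstruct T backward
        T[i] = L[r];
        r = LF[r];
    }
    return T;
}
// invert_bwt("ipssm#pissii", 5) returns "mississippi#"
// (row 5, 0-based, of the sorted matrix above is T itself).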

How to compute the BWT ?

    SA    sorted rotation (BWT matrix)    L
    12    #mississipp                     i
    11    i#mississip                     p
     8    ippi#missis                     s
     5    issippi#mis                     s
     2    ississippi#                     m
     1    mississippi                     #
    10    pi#mississi                     p
     9    ppi#mississ                     i
     7    sippi#missi                     s
     4    sissippi#mi                     s
     6    ssippi#miss                     i
     3    ssissippi#m                     i

We said that: L[i] precedes F[i] in T
e.g. L[3] = T[ 7 ], since the 3rd row starts at position SA[3] = 8
Given SA and T, we have L[i] = T[SA[i]-1]
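
A minimal sketch of exactly this relation: once SA is known, one linear scan yields L (SA is 1-based here, as in the table above).

#include <string>
#include <vector>

std::string bwt_from_sa(const std::string& T, const std::vector<int>& SA) {
    std::string L(T.size(), ' ');
    for (size_t i = 0; i < SA.size(); ++i) {
        int j = SA[i];                               // start of the i-th smallest suffix (1-based)
        L[i] = (j == 1) ? T.back() : T[j - 2];       // preceding char, wrapping around when j = 1
    }
    return L;
}
// With T = "mississippi#" and SA = {12,11,8,5,2,1,10,9,7,4,6,3} this yields "ipssm#pissii".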

How to construct SA from T ?
Input: T = mississippi#

    SA   suffix
    12   #
    11   i#
     8   ippi#
     5   issippi#
     2   ississippi#
     1   mississippi#
    10   pi#
     9   ppi#
     7   sippi#
     4   sissippi#
     6   ssippi#
     3   ssissippi#

Elegant but inefficient. Obvious inefficiencies:
 • Θ(n² log n) time in the worst case
 • Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:
 L is locally homogeneous  →  L is highly compressible

Algorithm Bzip :
  Move-to-Front coding of L
  Run-Length coding
  Statistical coder
 Bzip vs. Gzip: 20% vs. 33% compression ratio, but Bzip is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus γ(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics

 Size
   • 1 trillion pages available (Google, 7/08)
   • 5-40K per page => hundreds of terabytes
   • Size grows every day!!

 Change
   • 8% new pages, 25% new links change weekly
   • Life time of about 10 days

The Bow Tie

Some definitions

 Weakly connected components (WCC)
   • Set of nodes such that from any node you can reach any other node via an
     undirected path.

 Strongly connected components (SCC)
   • Set of nodes such that from any node you can reach any other node via a
     directed path.

Observing the Web Graph

 We do not know which fraction of it we actually know
 The only way to discover the graph structure of the
 web as hypertext is via large-scale crawls
 Warning: the picture might be distorted by
   • Size limitation of the crawl
   • Crawling rules
   • Perturbations of the "natural" process of birth and
     death of nodes and links

Why is it interesting?

 It is the largest artifact ever conceived by humans
 Exploit the structure of the Web for
   • Crawl strategies
   • Search
   • Spam detection
   • Discovering communities on the web
   • Classification/organization
 Predict the evolution of the Web
   • Sociological understanding

Many other large graphs…

 Physical network graph
   • V = routers
   • E = communication links

 The “cosine” graph (undirected, weighted)
   • V = static web pages
   • E = semantic distance between pages

 Query-Log graph (bipartite, weighted)
   • V = queries and URLs
   • E = (q,u) if u is a result for q, and has been clicked by some
     user who issued q

 Social graph (undirected, unweighted)
   • V = users
   • E = (x,y) if x knows y (facebook, address book, email, ..)

Definition
Directed graph G = (V,E)
 V = URLs, E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN & no OUT)

Three key properties:
 Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution
Altavista crawl (1999) and WebBase crawl (2001):
the indegree follows a power-law distribution

    Pr[ in-degree(u) = k ]  ∝  1/k^α,   α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)
 V = URLs, E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN, no OUT)

Three key properties:
 Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1
 Locality: usually most of the hyperlinks point to other URLs on
  the same host (about 80%).
 Similarity: pages close in lexicographic order tend to share
  many outgoing lists

A Picture of the Web Graph
[Figure: adjacency-matrix plot (axes i, j) of a crawl with 21 million pages and 150 million links]

URL-sorting (hosts such as Berkeley, Stanford end up adjacent):
URL compression + Delta encoding

The library WebGraph
From the uncompressed adjacency list to an
adjacency list with compressed gaps (exploiting locality):

    Successor list S(x) = { s_1 − x, s_2 − s_1 − 1, ..., s_k − s_{k-1} − 1 }

For negative entries:
Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background

[Figure: a sender transmits data to a receiver; the receiver may already hold
some knowledge about that data.]

 network links are getting faster and faster, but
 many clients are still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques

 caching: “avoid sending the same object again”
   • done on the basis of whole objects
   • only works if objects are completely unchanged
   • What about objects that are only slightly changed?

 compression: “remove redundancy in transmitted data”
   • avoid repeated substrings in the data
   • can be extended to the history of past transmissions (overhead)
   • What if the sender has never seen the data at the receiver ?

Types of Techniques

 Common knowledge between sender & receiver
   • Unstructured file: delta compression

 “Partial” knowledge
   • Unstructured files: file synchronization
   • Record-based data: set reconciliation

Formalization

 Delta compression   [diff, zdelta, REBL,…]
   • Compress file f deploying file f’
   • Compress a group of files
   • Speed up web access by sending differences between the requested
     page and the ones available in cache

 File synchronization   [rsync, zsync]
   • Client updates an old file f_old with f_new available on a server
   • Mirroring, Shared Crawling, Content Distribution Networks

 Set reconciliation
   • Client updates a structured old file f_old with f_new available on a server
   • Update of contacts or appointments, intersecting inverted lists in a P2P search engine

Z-delta compression   (one-to-one)

Problem: We have two files f_known and f_new and the goal is
to compute a file f_d of minimum size such that f_new can
be derived from f_known and f_d

 Assume that block moves and copies are allowed
 Find an optimal covering set of f_new based on f_known
 The LZ77-scheme provides an efficient, optimal solution
   • f_known plays the role of the “previously encoded text”: compress the
     concatenation f_known·f_new, emitting output only from f_new onwards
 zdelta is one of the best implementations
    Emacs       size      time
    uncompr     27Mb      ---
    gzip        8Mb       35 secs
    zdelta      1.5Mb     42 secs

Efficient Web Access
Dual-proxy architecture: a pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link.

[Figure: Client ↔ client-side proxy ↔ (slow link, delta-encoding) ↔ server-side
proxy ↔ (fast link) ↔ web; requests and reference pages flow between the proxies.]

Use zdelta to reduce traffic:
 Old version available at both proxies
 Restricted to pages already visited (30% hits), URL-prefix match
 Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F
 Useful on a dynamic collection of web pages, back-ups, …
 Apply pairwise zdelta: find for each f ∈ F a good reference

 Reduction to the Min Branching problem on DAGs
   • Build a weighted graph G_F: nodes = files, edge weights = zdelta-sizes
   • Insert a dummy node connected to all, whose edge weights are the gzip-coding sizes
   • Compute the min branching = directed spanning tree of min total cost,
     covering G’s nodes.

[Figure: a small example graph over the files, with zdelta sizes as edge weights
(values such as 20, 123, 220, 620, 2000) and a dummy node 0; the min branching
picks the cheapest reference for each file.]

            space     time
uncompr     30Mb      ---
tgz         20%       linear
THIS        8%        quadratic

Improvement: what about many-to-one compression (of a group of files)?

Problem: Constructing G is very costly: n² edge calculations (zdelta executions)
 We wish to exploit some pruning approach
   • Collection analysis: cluster the files that appear similar and are thus
     good candidates for zdelta-compression; build a sparse weighted graph
     G’_F containing only the edges between those pairs of files
   • Assign weights: estimate appropriate edge weights for G’_F, thus saving
     zdelta executions. Nonetheless, strictly n² time

            space     time
uncompr     260Mb     ---
tgz         12%       2 mins
THIS        8%        16 mins

Algoritmi per IR

File Synchronization

File synch: The problem

[Figure: the Client holds f_old and sends a request to the Server, which holds
f_new and sends back an update.]

 client wants to update an out-dated file
 server has the new file but does not know the old file
 update without sending the entire f_new (using similarity)
 rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch,
since the server has both copies of the files.

The rsync algorithm

[Figure: the Client sends the block hashes of f_old; the Server replies with
f_new encoded with respect to those blocks.]

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
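
A minimal sketch of an Adler-style rolling weak checksum over blocks of length B (illustrative, not rsync's exact code): sliding the window by one byte updates the sum in O(1) time.

#include <cstdint>
#include <string>

struct Rolling {
    uint32_t a = 0, b = 0;   // a = sum of bytes, b = sum of (B - i) * byte, both mod 2^16
    uint32_t B = 0;

    void init(const std::string& s, size_t start, uint32_t block) {
        a = b = 0; B = block;
        for (uint32_t i = 0; i < B; ++i) {
            uint8_t x = (uint8_t)s[start + i];
            a = (a + x) & 0xFFFF;
            b = (b + (B - i) * x) & 0xFFFF;
        }
    }
    void roll(uint8_t out, uint8_t in) {   // drop byte 'out', append byte 'in'
        a = (a - out + in) & 0xFFFF;
        b = (b - B * out + a) & 0xFFFF;
    }
    uint32_t value() const { return (b << 16) | a; }
};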

Rsync: some experiments

              gcc      emacs
    total     27288    27326
    gzip      7563     8577
    zdelta    227      1431
    rsync     964      4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

 The server sends the hashes (unlike the client in rsync), and the clients check them
 The server deploys the common f_ref to compress the new f_tar (rsync just compresses it)

A multi-round protocol

 k blocks of n/k elements
 log(n/k) levels
 If the distance is k, then on each level at most k hashes do not find a match in the other file.
 The communication complexity is O(k · lg n · lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff
P is a prefix of the i-th suffix of T (i.e. T[i,N])

Occurrences of P in T = all suffixes of T having P as a prefix

e.g. P = si, T = mississippi  →  occurrences at positions 4, 7

SUF(T) = sorted set of suffixes of T

Reduction: from substring search to prefix search

The Suffix Tree

[Figure: the suffix tree of T# = mississippi# (positions 1..12). Edges are labeled
with substrings such as i, s, p, si, ssi, i#, pi#, ppi#, mississippi#, #; the 12
leaves store the starting positions of the corresponding suffixes.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

Storing SUF(T) explicitly takes Θ(N²) space; store suffix pointers instead:

    SA   SUF(T)              T = mississippi#
    12   #
    11   i#
     8   ippi#
     5   issippi#
     2   ississippi#
     1   mississippi#
    10   pi#
     9   ppi#
     7   sippi#              P = si
     4   sissippi#
     6   ssippi#
     3   ssissippi#

Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison,
2 accesses per step (one to SA, one to T).

e.g. T = mississippi#, P = si: at each step compare P with the suffix pointed
to by the middle SA entry, and move right if P is larger, left if P is smaller.

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char comparisons
 overall, O(p log2 N) time

Improvable: to O(p + log2 N) [Manber-Myers, ’90], and w.r.t. the alphabet size |S| [Cole et al, ’06]
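
A minimal sketch of this indirect binary search (illustrative; SA is 0-based here, and the final scan that collects the range costs O(occ) extra, as in the bound on the next slide).

#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// SA holds the starting positions (0-based) of the sorted suffixes of T.
std::pair<int,int> sa_range(const std::string& T, const std::vector<int>& SA,
                            const std::string& P) {
    auto less_than_P = [&](int suf, const std::string& pat) {
        return T.compare(suf, pat.size(), pat) < 0;    // compare at most |P| chars
    };
    auto lo = std::lower_bound(SA.begin(), SA.end(), P, less_than_P);
    auto hi = lo;
    while (hi != SA.end() && T.compare(*hi, P.size(), P) == 0) ++hi;  // all suffixes prefixed by P
    return { int(lo - SA.begin()), int(hi - SA.begin()) };            // SA[lo,hi) = occurrences
}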

Locating the occurrences
The binary search actually identifies the whole SA range of suffixes prefixed by P:
for P = si in T = mississippi# the range contains sippi# and sissippi#,
i.e. positions 4 and 7, so occ = 2.
(The two boundaries can be found by searching for si extended with a char
smaller than any other, e.g. si#..., and with one larger, e.g. si$..., since # < S < $.)

Suffix Array search
• O (p + log2 N + occ) time

Suffix Trays: O (p + log2 |S| + occ)    [Cole et al., ‘06]
String B-tree                           [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays            [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest common prefix between suffixes adjacent in SA

For T = mississippi# (SA = 12,11,8,5,2,1,10,9,7,4,6,3) the Lcp values between
consecutive suffixes are 0,1,1,4,0,0,1,0,2,1,3; e.g. issippi# and ississippi#
(the suffixes starting at 5 and 2, adjacent in SA) share a prefix of length 4.

• How long is the common prefix between T[i,...] and T[j,...] ?
  → the min of the subarray Lcp[h,k-1] such that SA[h] = i and SA[k] = j.
• Does there exist a repeated substring of length ≥ L ?
  → search for an entry Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  → search for a run Lcp[i,i+C-2] whose entries are all ≥ L.


C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with i-th bit of U(T[j]) to estabilish if both are true

An example j=1
1 2 3 4 5 6 7 8 9 10

n
m

1

12345

1

0

P=abaac

2

0

3

0

4

0

5

0

T=xabxabaaca

0
 
0
U (x)  0
 
0
 
0

2

3

4

5

6

7

1 
 
0
BitShift ( M ( 0 )) & U (T [1])   0  &
 
0
 
0

8

9

0 0
   
0 0
0  0
   
0 0
   
0 0

1
0

An example j=2
1 2 3 4 5 6 7 8 9 10

n
m

1

2

12345

1

0

1

P=abaac

2

0

0

3

0

0

4

0

0

5

0

0

T=xabxabaaca

1 
 
0
U (a )   1 
 
1 
 
0

3

4

5

6

7

1 
 
0
BitShift ( M (1)) & U (T [ 2 ])   0  &
 
0
 
0

8

9

1  1 
   
0 0
1    0 
   
1   0 
   
0 0

1
0

An example j=3
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

12345

1

0

1

0

P=abaac

2

0

0

1

3

0

0

0

4

0

0

0

5

0

0

0

T=xabxabaaca

0
 
1 
U (b )   0 
 
0
 
0

4

5

6

7

1 
 
1 
BitShift ( M ( 2 )) & U (T [ 3 ])   0  &
 
0
 
0

8

9

0 0
   
1  1 
0  0
   
0 0
   
0 0

1
0

An example j=9
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

4

5

6

7

8

9

12345

1

0

1

0

0

1

0

1

1

0

P=abaac

2

0

0

1

0

0

1

0

0

0

3

0

0

0

0

0

0

1

0

0

4

0

0

0

0

0

0

0

1

0

5

0

0

0

0

0

0

0

0

1

T=xabxabaaca

0
 
0
U (c )   0 
 
0
 
1 

1 
 
1 
BitShift ( M ( 8 )) & U (T [ 9 ])   0  &
 
0
 
1 

0 0
   
0 0
0  0
   
0 0
   
1  1 

1
0

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M ( j - 1))  U (T [ j ])
l

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M

l -1

( j - 1))

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1
1 2 3 4 5 6 7 8 910

M1=

T=xabxabaaca
P=
abaad

1

2

3

4

5

6

7

8

9

1
0

1

1

1

1

1

1

1

1

1

1

1

2

0

0

1

0

0

1

0

1

1

0

3

0

0

0

1

0

0

1

0

0

1

4

0

0

0

0

1

0

0

1

0

0

5

0

0

0

0

0

0

0

0

1

0

M0=

1

2

3

4

5

6

7

8

9

1
0

1

0

1

0

0

1

0

1

1

0

1

2

0

0

1

0

0

1

0

0

0

0

3

0

0

0

0

0

0

1

0

0

0

4

0

0

0

0

0

0

0

1

0

0

5

0

0

0

0

0

0

0

0

0

0

How much do we pay?





The running time is O(kn(1+m/w)
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Delection: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
0 000 ...........0 x in b in ary
Length-1


x > 0 and Length = log2 x +1
e.g., 9 represented as <000,1001>.



g-code for x takes 2 log2 x +1 bits

(ie. factor of 2 from optimal)



Optimal for Pr(x) = 1/2x2, and i.i.d integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8

6

3

59

7

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How much good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
1 ≥Si=1,...,x pi ≥ x * px  x ≤ 1/px

How good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):



p i  | g ( i ) |

i  1 ,..., S

This is:



i  1 ,.., S

p i  [ 2 * log

1

 1]

pi

 2 * H 0 ( X ) 1
No much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1n 2n 3n… nn  Huff = O(n2 log n), MTF = O(n log n) + n2

No much worse than Huffman
...but it may be far better

MTF: how good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
Put S in the front and consider the cost of encoding:
S

O ( S log S ) 



nx

g(p - p )

x 1 i  2

x

x

i

i -1

S

By Jensen’s:

 O ( S log S ) 

n
x 1

x

[ 2 * log

N

 1]

nx

 O ( S log S )  N * [ 2 * H 0 ( X )  1]
L a [ mtf ]  2 * H 0 ( X )  O (1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1n 2n 3n… nn 

There is a memory

Huff(X) = Q(n2 log n) > Rle(X) = Q(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.  p(a) = .2, p(b) = .5, p(c) = .3

f(i) = ∑j=1,...,i-1 p(j)

f(a) = .0, f(b) = .2, f(c) = .7

[Figure: the interval [0,1) partitioned as a = [0,.2), b = [.2,.7), c = [.7,1.0)]

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
[Figure: successive refinement of the interval]
  start:   [0, 1)
  after b: [.2, .7)
  after a: [.2, .3)
  after c: [.27, .3)

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l0 = 0      li = li-1 + si-1 * f[ci]
s0 = 1      si = si-1 * p[ci]

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

sn = ∏i=1,...,n p[ci]

The interval for a message sequence will be called the
sequence interval
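
A minimal real-valued sketch of the recurrence above; it keeps the pair (l, s). Floating point is used only for illustration — the integer version discussed later avoids precision issues.

def sequence_interval(msg, p, f):
    # p[c] = probability of symbol c; f[c] = cumulative probability up to c (excluded)
    l, s = 0.0, 1.0
    for c in msg:
        l = l + s * f[c]    # l_i = l_{i-1} + s_{i-1} * f[c_i]
        s = s * p[c]        # s_i = s_{i-1} * p[c_i]
    return l, s             # the sequence interval is [l, l + s)

p = {'a': .2, 'b': .5, 'c': .3}
f = {'a': .0, 'b': .2, 'c': .7}
# sequence_interval("bac", p, f) ≈ (0.27, 0.03), i.e. the interval [.27, .3) of the example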

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
[Figure: successive refinement while decoding .49]
  .49 ∈ [.2, .7)    ⇒ b
  .49 ∈ [.3, .55)   ⇒ b
  .49 ∈ [.475, .55) ⇒ c

The message is bbc.
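
Symmetrically, a sketch of the decoding loop used in the example (the message length n is known, as assumed above; again floating point is only for illustration):

def decode(x, n, p, f):
    # x is any number in the final sequence interval
    syms = sorted(p, key=lambda c: f[c])          # symbols ordered by cumulative f
    out, l, s = [], 0.0, 1.0
    for _ in range(n):
        t = (x - l) / s                           # where x falls inside the current interval
        c = [c for c in syms if f[c] <= t][-1]    # the symbol interval containing t
        out.append(c)
        l, s = l + s * f[c], s * p[c]             # reduce the interval, as in encoding
    return "".join(out)

p = {'a': .2, 'b': .5, 'c': .3}
f = {'a': .0, 'b': .2, 'c': .7}
# decode(.49, 3, p, f) == "bbc"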

Representing a real number
Binary fractional representation:
.75 = .11        1/3 = .0101…        11/16 = .1011

Algorithm
1. x = 2 * x
2. If x < 1 output 0
3. else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11
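
The three-step algorithm above, as a small sketch producing the first k bits of the binary fractional expansion of x ∈ [0,1):

def binary_fraction(x, k):
    bits = []
    for _ in range(k):
        x = 2 * x
        if x < 1:
            bits.append("0")
        else:
            x = x - 1
            bits.append("1")
    return "." + "".join(bits)

# binary_fraction(0.75, 2) == ".11"     binary_fraction(11/16, 4) == ".1011"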

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
code    min      max      interval
.11     .110     .111     [.75, 1.0)
.101    .1010    .1011    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

[Figure: sequence interval [.61, .79) containing the code interval of .101 = [.625, .75)]

Can use L + s/2 truncated to 1 + ⌈log (1/s)⌉ bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + ⌈log (1/s)⌉ =
= 1 + ⌈log ∏i=1,n (1/pi)⌉
≤ 2 + ∑i=1,n log (1/pi)
= 2 + ∑k=1,|S| n pk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
  Output 1 followed by m 0s
  m = 0
  Message interval is expanded by 2

If u < R/2 then (bottom half)
  Output 0 followed by m 1s
  m = 0
  Message interval is expanded by 2

If l ≥ R/4 and u < 3R/4 then (middle half)
  Increment m
  Message interval is expanded by 2

All other cases: just continue...
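
A sketch of just this renormalization step (a hypothetical helper, not a complete coder: the rounding of the interval update is omitted). Given the current integer interval [l, u) inside [0, R) and the pending-bit counter m, it applies the three cases above until none fits.

def rescale(l, u, m, R, out):
    # out collects the emitted bits as strings
    while True:
        if u < R // 2:                        # bottom half: output 0, then m 1s
            out.append("0" + "1" * m); m = 0
        elif l >= R // 2:                     # top half: output 1, then m 0s
            out.append("1" + "0" * m); m = 0
            l -= R // 2; u -= R // 2
        elif l >= R // 4 and u < 3 * R // 4:  # middle half: defer the bit
            m += 1
            l -= R // 4; u -= R // 4
        else:
            return l, u, m                    # all other cases: just continue
        l, u = 2 * l, 2 * u                   # interval expanded by a factor 2

The half-open convention for u and the exact tie handling vary among implementations; this only mirrors the three cases listed on the slide.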

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.
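
A very small sketch of the escape mechanism just described (not a full PPM implementation): counts are kept for every context of size ≤ k, and the encoder escapes to shorter contexts until it finds one that has already been followed by the next character. How the escape probability is set is exactly the PPM-variant heuristic mentioned above, so it is not modelled here.

from collections import defaultdict

class PPMCounts:
    def __init__(self, k):
        self.k = k
        self.counts = defaultdict(lambda: defaultdict(int))  # context -> char -> count

    def update(self, text, i):
        # record that text[i] followed each of its contexts of size 0..k
        for h in range(self.k + 1):
            if i - h >= 0:
                self.counts[text[i - h:i]][text[i]] += 1

    def contexts_tried(self, text, i):
        # contexts the encoder would try, longest first; one escape per unseen pairing
        tried = []
        for h in range(min(self.k, i), -1, -1):
            ctx = text[i - h:i]
            tried.append(ctx)
            if self.counts[ctx][text[i]] > 0:
                break
        return tried

m = PPMCounts(k=2)
s = "ACCBACCACBA"
for i in range(len(s)):
    m.update(s, i)
# m.counts[""] == {'A': 4, 'C': 5, 'B': 2} and m.counts["AC"] == {'C': 2, 'B': 1},
# matching the count tables of the next slide (the $ escape counts depend on the variant).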

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
String = ACCBACCACBA B,   k = 2

Context   Counts
Empty     A = 4   B = 2   C = 5   $ = 3

Context   Counts
A         C = 3   $ = 1
B         A = 2   $ = 1
C         A = 1   B = 2   C = 2   $ = 3

Context   Counts
AC        B = 1   C = 2   $ = 2
BA        C = 1   $ = 1
CA        C = 1   $ = 1
CB        A = 2   $ = 1
CC        A = 1   B = 1   $ = 2
You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character
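
A minimal (quadratic, purely illustrative) Python sketch of the LZ77 step above with a window of W characters; real implementations such as gzip speed up the match search with the hash table mentioned below. On "aacaacabcabaaac" with W = 6 it reproduces the triples of the example.

def lz77_encode(T, W):
    out, i, n = [], 0, len(T)
    while i < n:
        best_d, best_len = 0, 0
        for j in range(max(0, i - W), i):                   # candidate starts in the window
            l = 0
            while i + l < n - 1 and T[j + l] == T[i + l]:   # copies may overlap the cursor
                l += 1
            if l > best_len:
                best_d, best_len = i - j, l
        out.append((best_d, best_len, T[i + best_len]))     # (distance, length, next char)
        i += best_len + 1                                   # advance by len + 1
    return out

# lz77_encode("aacaacabcabaaac", 6) == [(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')]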

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
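
A minimal LZW encoder sketch. To stay close to the example on the next slide it numbers the initial dictionary a = 112, b = 113, c = 114 (the slide's convention, not real ASCII codes); the decoder is not shown — it runs one step behind and must handle the SSc corner case mentioned above.

def lzw_encode(T, first_codes=None):
    D = dict(first_codes or {'a': 112, 'b': 113, 'c': 114})  # dictionary: string -> id
    next_id = 256                                            # new entries start at 256
    out, S = [], ""
    for ch in T:
        if S + ch in D:
            S = S + ch            # extend the current match
        else:
            out.append(D[S])      # emit the id of the longest match S (no extra char is sent)
            D[S + ch] = next_id   # still add Sc to the dictionary
            next_id += 1
            S = ch
    out.append(D[S])
    return out

# On "aabaacababacb" (the encoding example) this emits 112 112 113 256 114 257 261 114,
# plus a final 113 for the trailing b, while building 256=aa, 257=ab, ..., 263=cb.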

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
We are given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
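
A runnable Python sketch of the same backward reconstruction (a hypothetical inverse_bwt, not the course's code). The LF mapping is obtained by stably sorting the characters of L, which mirrors property 1; property 2 then lets us rebuild T backwards. It assumes '#' is the unique, lexicographically smallest end-of-text symbol.

def inverse_bwt(L):
    n = len(L)
    # LF[r] = row of the matrix whose first character is the same occurrence as L[r]
    order = sorted(range(n), key=lambda r: (L[r], r))   # stable: equal chars keep their order
    LF = [0] * n
    for f, r in enumerate(order):
        LF[r] = f
    # row 0 starts with '#', so its last character L[0] precedes '#' in T
    T = ['#'] * n
    r = 0
    for i in range(n - 2, -1, -1):
        T[i] = L[r]
        r = LF[r]
    return "".join(T)

# inverse_bwt("ipssm#pissii") == "mississippi#"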

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion pages are available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node one can reach any other node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node one can reach any other node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by humans



Exploit the structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…

 Physical network graph
   V = Routers
   E = communication links

 The “cosine” graph (undirected, weighted)
   V = static web pages
   E = semantic distance between pages

 Query-Log graph (bipartite, weighted)
   V = queries and URLs
   E = (q,u) if u is a result for q, and has been clicked by some user who issued q

 Social graph (undirected, unweighted)
   V = users
   E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has a hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999 — WebBase crawl, 2001
Indegree follows a power law distribution:

Pr[ in-degree(u) = k ]  ∝  1/k^a,    a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has a hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/x^a, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 million pages, 150 million links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:
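
A simplified sketch of this gap transform for the successor list of node x (not the actual WebGraph encoder). The only possibly negative value is the first gap s1 − x; here it is folded into a non-negative integer with the same v(·) mapping used later in the extra-nodes example (2t for t ≥ 0, 2|t|−1 for t < 0). The successor list below is just an illustration.

def fold(t):
    # map a possibly negative integer to a non-negative one
    return 2 * t if t >= 0 else 2 * abs(t) - 1

def gap_encode(x, successors):
    s = sorted(successors)
    gaps = [fold(s[0] - x)]                                   # first gap is relative to x
    gaps += [s[i] - s[i - 1] - 1 for i in range(1, len(s))]   # later gaps are >= 0
    return gaps

# gap_encode(15, [13, 15, 16, 17, 18, 19, 23, 24, 203]) == [3, 1, 0, 0, 0, 0, 3, 0, 178]
# Thanks to locality most gaps are tiny, which is what makes them cheap to code.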

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



What if the sender has never seen the data at the receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides an efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
          Emacs size    Emacs time
uncompr   27Mb          ---
gzip      8Mb           35 secs
zdelta    1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f ∈ F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: an example weighted graph GF — files are nodes, a dummy node is connected to all of
them, edge weights are zdelta sizes (gzip sizes for the dummy); the min branching picks, for
every file, its cheapest reference]

          space     time
uncompr   30Mb      ---
tgz       20%       linear
THIS      8%        quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n² edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n² time

          space     time
uncompr   260Mb     ---
tgz       12%       2 mins
THIS      8%        16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
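
A rough structural sketch of the roles described above (not rsync's actual code): f_old is cut into fixed-size blocks whose hashes are sent to the other party, which then emits either a block reference or a literal byte. Real rsync uses a 4-byte rolling checksum plus MD5 (so it can test every offset in O(1) and verify matches) and gzips the literals; Python's built-in hash below only stands in for that checksum.

BLOCK = 700   # block size in bytes (rsync's default is roughly max(700, sqrt(n)))

def block_hashes(f_old):
    # one hash per non-overlapping block of f_old
    return {hash(f_old[i:i + BLOCK]): i // BLOCK
            for i in range(0, len(f_old), BLOCK)}

def encode_new(f_new, hashes):
    # ('B', k) = "copy block k of f_old", ('L', byte) = literal
    out, i = [], 0
    while i < len(f_new):
        if len(f_new) - i >= BLOCK and hash(f_new[i:i + BLOCK]) in hashes:
            out.append(('B', hashes[hash(f_new[i:i + BLOCK])]))
            i += BLOCK
        else:
            out.append(('L', f_new[i]))
            i += 1
    return out

# The receiver rebuilds f_new by replacing ('B', k) with f_old[k*BLOCK:(k+1)*BLOCK].
# A hash collision would be caught by the strong per-block checksum in the real protocol.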

Rsync: some experiments

          gcc size    emacs size
total     27288       27326
gzip      7563        8577
zdelta    227         1431
rsync     964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), the client checks them
Server deploys the common fref to compress the new ftar (rsync just compresses it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Figure: suffix tree of T# = mississippi# — the root branches on #, i, m, p, s; internal edges
carry labels such as "ssi", "si", "i#", "pi#", "ppi#", "mississippi#", and the 12 leaves store
the starting positions 1..12 of the corresponding suffixes]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
5

Θ(N²) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirect binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time
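
A small Python sketch of this indirect binary search (a hypothetical helper, not the course's code). Each comparison looks only at the length-p prefix of a suffix, so every one of the O(log2 N) steps costs O(p), and the occ occurrences come out as a contiguous range of SA.

def sa_range(T, SA, P):
    # SA holds 0-based starting positions, lexicographically sorted
    p = len(P)
    def first(pred):                     # first row r with pred(r) true (pred is monotone)
        lo, hi = 0, len(SA)
        while lo < hi:
            mid = (lo + hi) // 2
            if pred(mid):
                hi = mid
            else:
                lo = mid + 1
        return lo
    lo = first(lambda r: T[SA[r]:SA[r] + p] >= P)   # each test costs O(p)
    hi = first(lambda r: T[SA[r]:SA[r] + p] >  P)
    return lo, hi                                   # the occ = hi - lo occurrences are SA[lo:hi]

# T = "mississippi#", SA0 = [11,10,7,4,1,0,9,8,6,3,5,2] (the slides' SA, made 0-based):
# sa_range(T, SA0, "si") == (8, 10), i.e. the two occurrences at positions 7 and 4 (1-based).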

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
• Search for a window Lcp[i,i+C-2] whose entries are all ≥ L


Slide 219

The cost of the encoding is (recall i ≤ 1/pi):



p i  | g ( i ) |

i  1 ,..., S

This is:



i  1 ,.., S

p i  [ 2 * log

1

 1]

pi

 2 * H 0 ( X ) 1
No much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1n 2n 3n… nn  Huff = O(n2 log n), MTF = O(n log n) + n2

No much worse than Huffman
...but it may be far better

MTF: how good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
Put S in the front and consider the cost of encoding:
S

O ( S log S ) 



nx

g(p - p )

x 1 i  2

x

x

i

i -1

S

By Jensen’s:

 O ( S log S ) 

n
x 1

x

[ 2 * log

N

 1]

nx

 O ( S log S )  N * [ 2 * H 0 ( X )  1]
L a [ mtf ]  2 * H 0 ( X )  O (1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1n 2n 3n… nn 

There is a memory

Huff(X) = Q(n2 log n) > Rle(X) = Q(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.
1.0

c = .3

0.7

i -1

f (i ) 



p( j)

j 1

b = .5
0.2
0.0

f(a) = .0, f(b) = .2, f(c) = .7
a = .2

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
1.0

0.7
c = .3

0.7

c = .3
0.55

0.0

a = .2

0.3
0.2

c = .3

0.27
b = .5

b = .5
0.2

0.3

a = .2

b = .5
0.22

a = .2

0.2

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l_0 = 0        l_i = l_{i-1} + s_{i-1} · f[c_i]
s_0 = 1        s_i = s_{i-1} · p[c_i]

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is  s_n = ∏_{i=1,…,n} p[c_i]

The interval for a message sequence will be called the
sequence interval
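A minimal floating-point sketch of the (l, s) recurrence above, run on the slides' example; real coders use the integer version discussed later, so this is only an illustration (symbol probabilities and names are the example's, the rest is assumed):

#include <iostream>
#include <map>
#include <string>

int main() {
    // Probabilities and cumulative f[] from the running example (a = .2, b = .5, c = .3).
    std::map<char,double> p{{'a',0.2},{'b',0.5},{'c',0.3}};
    std::map<char,double> f{{'a',0.0},{'b',0.2},{'c',0.7}};

    std::string msg = "bac";
    double l = 0.0, s = 1.0;             // l_0 = 0, s_0 = 1
    for (char c : msg) {
        l = l + s * f[c];                // l_i = l_{i-1} + s_{i-1} * f[c_i]
        s = s * p[c];                    // s_i = s_{i-1} * p[c_i]
    }
    std::cout << "sequence interval = [" << l << ", " << l + s << ")\n";   // [0.27, 0.3)
}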

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:

.49 ∈ [.2, .7)     = interval of b                      → 1st symbol is b
.49 ∈ [.3, .55)    = sub-interval of b inside [.2,.7)   → 2nd symbol is b
.49 ∈ [.475, .55)  = sub-interval of c inside [.3,.55)  → 3rd symbol is c

The message is bbc.

Representing a real number
Binary fractional representation:
.75 = .11        1/3 = .010101…        11/16 = .1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
m in

m ax

interval

.11

.110

.111

[.75 ,1.0 )

.101

.1010

.1011

[.625 , .75 )

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + ⌈log (1/s)⌉ =
= 1 + ⌈ log ∏_{i=1,…,n} (1/p_i) ⌉
≤ 2 + ∑_{i=1,…,n} log (1/p_i)
= 2 + ∑_{k=1,…,|S|} n·p_k · log (1/p_k)
= 2 + n·H0  bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
  Output 1 followed by m 0s; set m = 0
  Message interval is expanded by 2

If u < R/2 then (bottom half)
  Output 0 followed by m 1s; set m = 0
  Message interval is expanded by 2

If l ≥ R/4 and u < 3R/4 then (middle half)
  Increment m
  Message interval is expanded by 2

In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts    (String = ACCBACCACBA, next symbol B;  k = 2)

Order 0 – Context "" (empty):   A = 4,  B = 2,  C = 5,  $ = 3
Order 1 – Context A:   C = 3, $ = 1
          Context B:   A = 2, $ = 1
          Context C:   A = 1, B = 2, C = 2, $ = 3
Order 2 – Context AC:  B = 1, C = 2, $ = 2
          Context BA:  C = 1, $ = 1
          Context CA:  C = 1, $ = 1
          Context CB:  A = 2, $ = 1
          Context CC:  A = 1, B = 1, $ = 2

($ = escape symbol)

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves
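A naive quadratic C++ sketch of this parsing step (triple = distance, length, next char; the window size, names and the driver are assumptions of this sketch, not an optimized implementation):

#include <algorithm>
#include <iostream>
#include <string>
#include <tuple>
#include <vector>

// Naive LZ77 parser: at each step emit (d, len, c) = (backward distance of the
// longest match inside the window, its length, the next char), then advance by len+1.
std::vector<std::tuple<int,int,char>> lz77(const std::string& T, int W) {
    std::vector<std::tuple<int,int,char>> out;
    int n = T.size(), i = 0;
    while (i < n) {
        int best_len = 0, best_d = 0;
        for (int j = std::max(0, i - W); j < i; ++j) {      // candidate positions in the window
            int len = 0;
            while (i + len < n - 1 && T[j + len] == T[i + len]) ++len;   // the copy may overlap the cursor
            if (len > best_len) { best_len = len; best_d = i - j; }
        }
        out.push_back({best_d, best_len, T[i + best_len]});
        i += best_len + 1;
    }
    return out;
}

int main() {
    for (auto [d, len, c] : lz77("aacaacabcabaaac", 6))
        std::cout << "(" << d << "," << len << "," << c << ") ";
    std::cout << '\n';   // (0,0,a) (1,1,c) (3,4,b) (3,3,a) (1,2,c), as in the window example below
}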

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids
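A minimal C++ sketch of this coding loop; for brevity the dictionary is kept as a map from phrase to id rather than a trie, which changes the running time but not the output (names and the driver are assumptions of this sketch):

#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

// LZ78 parsing: find the longest dictionary phrase S matching the input,
// output (id(S), next char c), and insert the new phrase S·c with a fresh id.
std::vector<std::pair<int,char>> lz78(const std::string& T) {
    std::map<std::string,int> dict;                 // phrase -> id (0 = empty phrase)
    std::vector<std::pair<int,char>> out;
    int next_id = 1;
    size_t i = 0;
    while (i < T.size()) {
        std::string S;                               // longest match found so far
        int id = 0;
        while (i < T.size() && dict.count(S + T[i])) { S += T[i]; id = dict[S]; ++i; }
        char c = (i < T.size()) ? T[i] : '\0';       // next char after the match (if any)
        out.push_back({id, c});
        dict[S + c] = next_id++;                     // add the substring S·c to the dictionary
        ++i;
    }
    return out;
}

int main() {
    for (auto [id, c] : lz78("aabaacabcabcb"))
        std::cout << "(" << id << "," << c << ") ";
    std::cout << '\n';   // (0,a) (1,b) (1,a) (0,c) (2,c) (5,b), as in the coding example below
}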

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
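A minimal C++ sketch of LZW decoding that handles the special case just mentioned: when the incoming code is not yet in the decoder's dictionary, the decoded phrase must be prev + prev[0]. The initial dictionary below follows the slides' convention a = 112, b = 113, c = 114 (an assumption of this sketch; a real LZW starts from all 256 byte codes).

#include <iostream>
#include <map>
#include <string>
#include <vector>

// LZW decoding: the decoder rebuilds the dictionary one step behind the encoder.
std::string lzw_decode(const std::vector<int>& codes, std::map<int,std::string> dict) {
    int next_code = 256;
    std::string out, prev;
    for (int code : codes) {
        std::string cur;
        if (dict.count(code)) cur = dict[code];
        else                  cur = prev + prev[0];            // special case: code just created by the encoder
        out += cur;
        if (!prev.empty()) dict[next_code++] = prev + cur[0];  // the entry the encoder added one step earlier
        prev = cur;
    }
    return out;
}

int main() {
    std::map<int,std::string> dict{{112,"a"},{113,"b"},{114,"c"}};
    std::vector<int> codes{112, 112, 113, 256, 114, 257, 261, 114};   // prefix of the encoding example below
    std::cout << lzw_decode(codes, dict) << '\n';                     // aabaacababac
}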

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
We are given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
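A minimal C++ sketch of InvertBWT: LF is obtained by a stable counting sort of L (equal characters keep their relative order, as stated above), then T is rebuilt backwards. It assumes '#' is the unique, smallest end-marker; names and the driver are assumptions of this sketch.

#include <iostream>
#include <string>
#include <vector>

// Invert the BWT. LF[i] = position in F of the character occurrence L[i].
std::string invert_bwt(const std::string& L) {
    int n = L.size();
    std::vector<int> count(256, 0), start(256, 0);
    for (unsigned char c : L) count[c]++;
    for (int c = 1; c < 256; ++c) start[c] = start[c-1] + count[c-1];   // start of each char block in F
    std::vector<int> LF(n), seen(256, 0);
    for (int i = 0; i < n; ++i) { unsigned char c = L[i]; LF[i] = start[c] + seen[c]++; }

    // Row 0 of the sorted matrix starts with '#', so T ends with '#';
    // walking the LF chain from row 0 yields the preceding characters of T, right to left.
    std::string T(n, '#');
    int r = 0;
    for (int i = n - 2; i >= 0; --i) { T[i] = L[r]; r = LF[r]; }
    return T;
}

int main() {
    std::cout << invert_bwt("ipssm#pissii") << '\n';   // mississippi#
}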

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]
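A small C++ sketch of this relation with 0-based indices (so L[i] = T[(SA[i]+n-1) mod n], wrapping for the suffix starting at position 0). The suffix array is built here by the naive comparison-based sort discussed on the next slide, only to keep the snippet self-contained.

#include <algorithm>
#include <iostream>
#include <numeric>
#include <string>
#include <vector>

int main() {
    std::string T = "mississippi#";
    int n = T.size();

    // Naive suffix array: sort suffix starting positions by direct string comparison.
    std::vector<int> SA(n);
    std::iota(SA.begin(), SA.end(), 0);
    std::sort(SA.begin(), SA.end(),
              [&](int a, int b) { return T.compare(a, n - a, T, b, n - b) < 0; });

    // L[i] = character preceding the i-th smallest suffix.
    std::string L(n, ' ');
    for (int i = 0; i < n; ++i) L[i] = T[(SA[i] + n - 1) % n];
    std::cout << L << '\n';   // ipssm#pissii
}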

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: the probability that a node has x links is ∝ 1/x^α, α ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in-degree(u) = k ]  ∝  1 / k^α,   with α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: the probability that a node has x links is ∝ 1/x^α, α ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:
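A small C++ sketch of this gap transformation: only the first gap (taken with respect to the source node x) can be negative, and one way to handle it is the mapping v ≥ 0 → 2v, v < 0 → 2|v|−1, the same mapping shown later for residuals. Names, the driver and the exact mapping choice are assumptions of this sketch.

#include <cstdint>
#include <iostream>
#include <vector>

// Gap-encode the (sorted) successor list of node x:
//   S(x) = { s1 - x, s2 - s1 - 1, ..., sk - s_{k-1} - 1 }
std::vector<uint64_t> gap_encode(uint64_t x, const std::vector<uint64_t>& succ) {
    std::vector<uint64_t> out;
    if (succ.empty()) return out;
    int64_t first = (int64_t)succ[0] - (int64_t)x;
    out.push_back(first >= 0 ? 2 * (uint64_t)first : 2 * (uint64_t)(-first) - 1);   // only entry that may be negative
    for (size_t i = 1; i < succ.size(); ++i)
        out.push_back(succ[i] - succ[i-1] - 1);     // later gaps are >= 0 since the list is sorted
    return out;
}

int main() {
    // e.g. node 15 with successors {13, 15, 16, 17, 20}: gaps are -2, 1, 0, 0, 2
    for (uint64_t g : gap_encode(15, {13, 15, 16, 17, 20})) std::cout << g << ' ';
    std::cout << '\n';   // 3 1 0 0 2   (3 = |13-15|*2 - 1, as in the residual example later)
}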

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides an efficient, optimal solution



fknown is the "previously encoded text": compress the concatenation fknown·fnew, starting the encoding from fnew

zdelta is one of the best implementations
            Emacs size    Emacs time
uncompr     27Mb          ---
gzip        8Mb           35 secs
zdelta      1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

(figure: example graph GF on a handful of files plus the dummy node; edge weights are the zdelta sizes, dummy edges the gzip sizes; the min branching picks the cheapest reference for each file)

            space    time
uncompr     30Mb     ---
tgz         20%      linear
THIS        8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
            space    time
uncompr     260Mb    ---
tgz         12%      2 mins
THIS        8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

          gcc size    emacs size
total     27288       27326
gzip      7563        8577
zdelta    227         1431
rsync     964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
(figure: the suffix tree of T# = mississippi#, with edge labels such as #, i, s, si, ssi, ppi#, pi#, i#, mississippi#, and leaves storing the starting positions 1..12 of the corresponding suffixes)

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
5

Q(N2) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Q(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
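A minimal C++ sketch of the indirect binary search just described: two binary searches on SA delimit the contiguous range of suffixes having P as a prefix. The naive SA construction is used only to keep the snippet self-contained; names and the driver are assumptions of this sketch.

#include <algorithm>
#include <iostream>
#include <numeric>
#include <string>
#include <vector>

int main() {
    std::string T = "mississippi#", P = "si";
    int n = T.size();
    std::vector<int> SA(n);
    std::iota(SA.begin(), SA.end(), 0);
    std::sort(SA.begin(), SA.end(),
              [&](int a, int b) { return T.compare(a, n - a, T, b, n - b) < 0; });

    // Each comparison looks only at the first |P| characters of the suffix: O(p) per step.
    auto less_than_P = [&](int s, const std::string& p) { return T.compare(s, p.size(), p) < 0; };
    auto P_less_than = [&](const std::string& p, int s) { return T.compare(s, p.size(), p) > 0; };

    auto lo = std::lower_bound(SA.begin(), SA.end(), P, less_than_P);
    auto hi = std::upper_bound(SA.begin(), SA.end(), P, P_less_than);

    for (auto it = lo; it != hi; ++it)
        std::cout << "occurrence at position " << *it + 1 << '\n';   // positions 7 and 4 (1-based)
}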

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does it exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does it exist a substring of length ≥ L occurring ≥ C times ?
• Search for Lcp[i,i+C-2] whose entries are ≥ L


Slide 220


Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
        4K    8K    16K    32K    128K    256K    512K    1M
n³      22s   3m    26m    3.5h   28h     --      --      --
n²      0     0     0      1s     26s     106s    7m      28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT
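A compact C++ version of the one-pass scan above, following the slide's rule (reset the running sum when it drops to ≤ 0, otherwise extend it and track the best value); the driver array is the example from the previous slide.

#include <algorithm>
#include <iostream>
#include <vector>

int main() {
    std::vector<int> A = {2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7};
    int sum = 0, best = A[0];
    for (int x : A) {
        if (sum + x <= 0) sum = 0;                       // the optimum cannot start inside a <= 0 prefix
        else { sum += x; best = std::max(best, sum); }   // extend and remember the maximum
    }
    std::cout << "max subarray sum = " << best << '\n';  // 12, from the subarray 6 1 -2 4 3
}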

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree
(figure: recursion tree of depth log₂ N over an example array of numbers; each level merges pairs of sorted runs)

If the run-size is larger than B (i.e. after the first step!!), fetching all of it in memory for merging does not help.

How do we deploy the disk/mem features?

N/M runs, each sorted in internal memory (no I/Os)
— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
(figure: X = M/B input runs, each read through its own buffer Bf1..BfX of B items with pointers p1..pX, plus one output buffer Bfo; repeatedly select min(Bf1[p1], Bf2[p2], …, BfX[pX]), fetch the next page of a run when its pointer reaches B, and flush Bfo to the merged output run when it is full, until EOF)

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Can compression help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;
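The update rule above can be written as the classic majority-vote scan with one candidate and one counter; a minimal C++ sketch on the stream of the previous slide (names and the driver are assumptions of this sketch):

#include <iostream>
#include <string>

int main() {
    std::string stream = "bacccdcbaaaccbccc";      // the example stream A
    char X = 0; int C = 0;
    for (char s : stream) {
        if (C == 0)      { X = s; C = 1; }         // adopt a new candidate
        else if (X == s) { ++C; }                  // same as the candidate: reinforce
        else             { --C; }                  // different: one occurrence cancels one vote
    }
    std::cout << "candidate = " << X << '\n';      // 'c', which occurs > N/2 times here
}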

Proof
If the algorithm ends with X ≠ y, then every one of y's occurrences has a distinct "negative" mate, hence the mates are ≥ #occ(y) and so N ≥ 2·#occ(y). But #occ(y) > N/2, i.e. 2·#occ(y) > N: contradiction.
(Problems arise only if the most frequent item occurs ≤ N/2 times.)

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix     (n = 1 million docs, t = 500K terms)

             Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony             1                1              0          0       0        1
Brutus             1                1              0          1       0        0
Caesar             1                1              0          1       1        1
Calpurnia          0                1              0          0       0        0
Cleopatra          1                0              0          0       0        0
mercy              1                0              1          1       1        1
worser             1                0              1          1       1        0

1 if the play contains the word, 0 otherwise
Space is 500Gb !

Solution 2: Inverted index

Brutus     →  2  4  8  16  32  64  128
Calpurnia  →  1  2  3  5  8  13  21  34
Caesar     →  13  16

We can still do better: i.e. 30-50% of the original text

1. Typically use about 12 bytes per posting
2. We have 10^9 total terms → at least 12Gb space
3. Compressing the 6Gb documents gets 1.5Gb data
Better index, but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

∑_{i=1,…,n-1} 2^i  =  2^n − 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i(s) = log₂ (1/p(s)) = −log₂ p(s)

Lower probability → higher information

Entropy is the weighted average of i(s):

H(S) = ∑_{s∈S} p(s) · log₂ (1/p(s))   bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La(C) = ∑_{s∈S} p(s) · L[s]

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H(S) ≤ La(C)

Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La(C) ≤ H(S) + 1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?
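A minimal C++ sketch of the greedy construction: repeatedly merge the two least probable trees with a priority queue, then read the codewords off the tree. Ties are broken arbitrarily, so the snippet may print a different (but equally optimal) code than the slide; all names are assumptions of this sketch.

#include <functional>
#include <iostream>
#include <map>
#include <queue>
#include <string>
#include <vector>

struct Node { double p; char sym; int left, right; };     // left/right = -1 for leaves

std::map<char, std::string> huffman(const std::vector<std::pair<char,double>>& probs) {
    std::vector<Node> t;
    using Item = std::pair<double,int>;                    // (probability, node index)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> pq;
    for (auto [c, p] : probs) { t.push_back({p, c, -1, -1}); pq.push({p, (int)t.size()-1}); }
    while (pq.size() > 1) {                                // merge the two smallest trees
        auto [p1, i1] = pq.top(); pq.pop();
        auto [p2, i2] = pq.top(); pq.pop();
        t.push_back({p1 + p2, 0, i1, i2});
        pq.push({p1 + p2, (int)t.size()-1});
    }
    std::map<char, std::string> code;
    std::function<void(int, std::string)> walk = [&](int i, std::string c) {
        if (t[i].left < 0) { code[t[i].sym] = c.empty() ? "0" : c; return; }
        walk(t[i].left, c + "0"); walk(t[i].right, c + "1");
    };
    walk(pq.top().second, "");
    return code;
}

int main() {
    for (auto [sym, cw] : huffman({{'a',.1},{'b',.2},{'c',.2},{'d',.5}}))
        std::cout << sym << " = " << cw << '\n';   // codeword lengths: a,b -> 3 bits, c -> 2, d -> 1
}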

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
−log₂(.999) ≈ .00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H(s) = ∑_{i=1,…,m} 2^{m−i} · s[i]

P = 0101
H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5
s = s' if and only if H(s) = H(s')

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H(T_r) = 2·H(T_{r-1}) − 2^m · T(r−1) + T(r+m−1)

T = 10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H(T1) = H(1011) = 11
H(T2) = H(0110) = 2·11 − 2⁴·1 + 0 = 22 − 16 + 0 = 6 = H(0110)

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P = 101111,  q = 7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally (Horner's rule, reducing mod 7 at each step):
(1·2 mod 7) + 0 = 2
(2·2 mod 7) + 1 = 5
(5·2 mod 7) + 1 = 4
(4·2 mod 7) + 1 = 2
(2·2 mod 7) + 1 = 5
5 (mod 7) = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1), using
2^m (mod q) = 2·(2^{m-1} (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
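A minimal C++ sketch of the fingerprint scan on the binary example used above: Hq(P) is compared with Hq(Tr), which is rolled in O(1) time per position; candidate matches are then verified character by character (the deterministic variant). The prime q, names and the driver are assumptions of this sketch.

#include <iostream>
#include <string>

int main() {
    std::string T = "10110101", P = "0101";
    long long q = 7;                                   // a small prime, just for illustration
    int n = T.size(), m = P.size();

    long long hP = 0, hT = 0, pow2m = 1;               // pow2m = 2^m mod q
    for (int i = 0; i < m; ++i) {
        hP = (2 * hP + (P[i] - '0')) % q;
        hT = (2 * hT + (T[i] - '0')) % q;
        pow2m = (2 * pow2m) % q;
    }
    for (int r = 0; r + m <= n; ++r) {
        if (hP == hT && T.compare(r, m, P) == 0)       // fingerprints equal, then verify
            std::cout << "occurrence at position " << r + 1 << '\n';   // position 5
        if (r + m < n)                                  // roll the window: drop T[r], append T[r+m]
            hT = ((2 * hT - pow2m * (T[r] - '0') % q + (T[r + m] - '0')) % q + q) % q;
    }
}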

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

(matrix M for T = california, P = for: the m×n 0/1 matrix has M(1,5) = 1 since P[1] = f = T[5], M(2,6) = 1 since P[1..2] = fo = T[5..6], and M(3,7) = 1 since P[1..3] = for = T[5..7]; all other entries are 0. A 1 in the last row marks an occurrence of P ending at that text position.)

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the (j-1)-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with the i-th bit of U(T[j]) to establish if both are true

An example  (T = xabxabaaca, P = abaac)

j=1: T[1]=x, U(x) = (0,0,0,0,0)ᵀ
     M(1) = BitShift(M(0)) & U(x) = (1,0,0,0,0)ᵀ & (0,0,0,0,0)ᵀ = (0,0,0,0,0)ᵀ
j=2: T[2]=a
     M(2) = BitShift(M(1)) & U(a) = (1,0,0,0,0)ᵀ & (1,0,1,1,0)ᵀ = (1,0,0,0,0)ᵀ
j=3: T[3]=b
     M(3) = BitShift(M(2)) & U(b) = (1,1,0,0,0)ᵀ & (0,1,0,0,0)ᵀ = (0,1,0,0,0)ᵀ
...
j=9: T[9]=c
     M(9) = BitShift(M(8)) & U(c) = (1,1,0,0,1)ᵀ & (0,0,0,0,1)ᵀ = (0,0,0,0,1)ᵀ
     The 1 in row 5 (= m) signals an occurrence of P = abaac ending at position 9.

The full matrix after step j=9 (columns 1..9):
row 1:  0 1 0 0 1 0 1 1 0
row 2:  0 0 1 0 0 1 0 0 0
row 3:  0 0 0 0 0 0 1 0 0
row 4:  0 0 0 0 0 0 0 1 0
row 5:  0 0 0 0 0 0 0 0 1

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.
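A minimal C++ sketch of Shift-And for m ≤ w = 64, where a whole column of M fits in one machine word (bit i−1 set ⇔ P[1..i] matches the text ending at the current position); names and the driver are assumptions of this sketch.

#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::string T = "xabxabaaca", P = "abaac";
    int m = P.size();

    std::vector<uint64_t> U(256, 0);                       // U[c] has bit i-1 set iff P[i] = c
    for (int i = 0; i < m; ++i) U[(unsigned char)P[i]] |= 1ULL << i;

    uint64_t M = 0;
    for (int j = 0; j < (int)T.size(); ++j) {
        M = ((M << 1) | 1ULL) & U[(unsigned char)T[j]];    // BitShift(M) & U(T[j]), O(1) per char
        if (M & (1ULL << (m - 1)))                         // bit m set: a full match ends here
            std::cout << "occurrence ending at position " << j + 1 << '\n';   // position 9
    }
}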

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M ( j - 1))  U (T [ j ])
l

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M

l -1

( j - 1))

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example  (T = xabxabaaca, P = abaad, k = 1)

M0 =           1 2 3 4 5 6 7 8 9 10
      row 1:   0 1 0 0 1 0 1 1 0 1
      row 2:   0 0 1 0 0 1 0 0 0 0
      row 3:   0 0 0 0 0 0 1 0 0 0
      row 4:   0 0 0 0 0 0 0 1 0 0
      row 5:   0 0 0 0 0 0 0 0 0 0

M1 =           1 2 3 4 5 6 7 8 9 10
      row 1:   1 1 1 1 1 1 1 1 1 1
      row 2:   0 0 1 0 0 1 0 1 1 0
      row 3:   0 0 0 1 0 0 1 0 0 1
      row 4:   0 0 0 0 1 0 0 1 0 0
      row 5:   0 0 0 0 0 0 0 0 1 0

The 1 in M1(5,9) says that P = abaad occurs ending at position 9 of T with at most one mismatch.

How much do we pay?





The running time is O(k·n·(1+m/w))
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Delection: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding

γ(x) = 000…0 (Length−1 zeros) followed by x in binary,   where x > 0 and Length = ⌊log2 x⌋ + 1

e.g., 9 is represented as <000,1001>.

γ-code for x takes 2⌊log2 x⌋ + 1 bits   (i.e. a factor of 2 from optimal)

Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8

6

3

59

7
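A minimal C++ sketch of γ-encoding and γ-decoding; it reproduces the example and the exercise above (no error handling for malformed bit strings; names and the driver are assumptions of this sketch).

#include <iostream>
#include <string>
#include <vector>

// Gamma code: for x > 0 write (Length-1) zeros followed by the Length bits of x.
std::string gamma_encode(unsigned x) {
    std::string bin;
    for (unsigned y = x; y > 0; y >>= 1) bin = char('0' + (y & 1)) + bin;   // binary of x
    return std::string(bin.size() - 1, '0') + bin;                          // unary length prefix
}

// Decode a concatenation of gamma codes.
std::vector<unsigned> gamma_decode(const std::string& bits) {
    std::vector<unsigned> out;
    size_t i = 0;
    while (i < bits.size()) {
        size_t zeros = 0;
        while (i < bits.size() && bits[i] == '0') { ++zeros; ++i; }   // count the zeros: Length-1
        unsigned x = 0;
        for (size_t k = 0; k <= zeros; ++k, ++i) x = (x << 1) | (bits[i] - '0');
        out.push_back(x);
    }
    return out;
}

int main() {
    std::cout << gamma_encode(9) << '\n';                          // 0001001
    for (unsigned x : gamma_decode("0001000001100110000011101100111"))
        std::cout << x << ' ';                                     // 8 6 3 59 7
    std::cout << '\n';
}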

Analysis
Sort the p_i in decreasing order, and encode s_i via the variable-length code γ(i).

Recall that: |γ(i)| ≤ 2·log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2·H0(s) + 1
Key fact:  1 ≥ ∑_{i=1,…,x} p_i ≥ x·p_x   ⇒   x ≤ 1/p_x

How good is it ?
Encode the integers via γ-coding:  |γ(i)| ≤ 2·log i + 1
The cost of the encoding is (recall i ≤ 1/p_i):

∑_{i=1,…,|S|} p_i · |γ(i)|  ≤  ∑_{i=1,…,|S|} p_i · [ 2·log(1/p_i) + 1 ]  =  2·H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1n 2n 3n… nn  Huff = O(n2 log n), MTF = O(n log n) + n2

No much worse than Huffman
...but it may be far better

MTF: how good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
Put S in the front and consider the cost of encoding:
S

O ( S log S ) 



nx

g(p - p )

x 1 i  2

x

x

i

i -1

S

By Jensen’s:

 O ( S log S ) 

n
x 1

x

[ 2 * log

N

 1]

nx

 O ( S log S )  N * [ 2 * H 0 ( X )  1]
L a [ mtf ]  2 * H 0 ( X )  O (1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1n 2n 3n… nn 

There is a memory

Huff(X) = Q(n2 log n) > Rle(X) = Q(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.
1.0

c = .3

0.7

i -1

f (i ) 



p( j)

j 1

b = .5
0.2
0.0

f(a) = .0, f(b) = .2, f(c) = .7
a = .2

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
1.0

0.7
c = .3

0.7

c = .3
0.55

0.0

a = .2

0.3
0.2

c = .3

0.27
b = .5

b = .5
0.2

0.3

a = .2

b = .5
0.22

a = .2

0.2

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l 0  0 l i  l i -1  s i -1 * f c i 

s0  1

s i  s i -1 * p c i 

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

n

sn 

 p c 
i

i 1

The interval for a message sequence will be called the
sequence interval

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:

  [0.0, 1.0):   .49 falls in b’s interval      [0.2, 0.7)    ⇒  b
  [0.2, 0.7):   .49 falls in b’s sub-interval  [0.3, 0.55)   ⇒  b
  [0.3, 0.55):  .49 falls in c’s sub-interval  [0.475, 0.55) ⇒  c

The message is bbc.

Representing a real number
Binary fractional representation:

  .75   = .11
  1/3   = .0101…   (periodic)
  11/16 = .1011

Algorithm (emit the binary expansion of x in [0,1)):
  1. x = 2 * x
  2. If x < 1 output 0
  3. else x = x - 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
e.g.  [0,.33) ⇒ .01     [.33,.66) ⇒ .1     [.66,1) ⇒ .11
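The three-step loop above, written out as a small sketch (illustrative only):

    def binary_fraction(x, bits):
        # Emit the first `bits` binary digits of x in [0,1)
        out = []
        for _ in range(bits):
            x = 2 * x                  # step 1
            if x < 1:
                out.append("0")        # step 2
            else:
                x -= 1
                out.append("1")        # step 3
        return "." + "".join(out)

    print(binary_fraction(0.75, 2))    # .11
    print(binary_fraction(11/16, 4))   # .1011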

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions.

  code    min       max       interval
  .11     .110…     .111…     [.75, 1.0)
  .101    .1010…    .1011…    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval is contained in the sequence interval (a dyadic number).

  e.g. sequence interval [.61, .79)  ⊇  code interval of .101 = [.625, .75)

Can use L + s/2 truncated to 1 + ⌈log (1/s)⌉ bits

Bound on Arithmetic length

Note that -log s + 1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

  1 + ⌈log (1/s)⌉
  = 1 + ⌈log Π_{i=1..n} (1/p_i)⌉
  ≤ 2 + Σ_{i=1..n} log (1/p_i)
  = 2 + Σ_{k=1..|S|} n p_k log (1/p_k)
  = 2 + n H_0   bits

nH_0 + 0.02 n bits in practice, because of rounding

Integer Arithmetic Coding
Problem: operations on arbitrary-precision real numbers are expensive.
Key ideas of the integer version:

 Keep integers in range [0..R) where R = 2^k
 Use rounding to generate the integer interval
 Whenever the sequence interval falls into the top, bottom or middle half, expand the interval by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
 If l ≥ R/2 (top half): output 1 followed by m 0s; set m=0; the message interval is expanded by 2
 If u < R/2 (bottom half): output 0 followed by m 1s; set m=0; the message interval is expanded by 2
 If l ≥ R/4 and u < 3R/4 (middle half): increment m; the message interval is expanded by 2
 In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine: the ATB takes the current interval (L,s), a symbol c, and the distribution (p_1,...,p_|S|), and produces the new interval (L',s') inside [L, L+s).

Therefore, even the distribution can change over time

K-th order models: PPM
Use the previous k characters as the context.
 Makes use of conditional probabilities
 This is the changing distribution

Base probabilities on counts:
e.g. if th has been seen 12 times, followed by e 7 times, then the conditional probability p(e|th) = 7/12.

Need to keep k small so that the dictionary does not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen the context followed by the character before?
 Cannot code 0 probabilities!

The key idea of PPM is to reduce the context size if the previous match has not been seen:
 If the character has not been seen before with the current context of size 3, send an escape-msg and then try the context of size 2, and then again an escape-msg and the context of size 1, ….
 Keep statistics for each context size < k

The escape is a special character with some probability.
 Different variants of PPM use different heuristics for this probability (a small context-count sketch follows below).
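A minimal sketch of the context statistics PPM keeps (counts per context, plus an escape entry per context). The escape probability here uses the common “escape count = number of distinct successors” heuristic, which is only one of the PPM variants mentioned above:

    from collections import defaultdict

    def ppm_counts(text, k):
        # For every context length 0..k, count the symbols that follow each context
        stats = {order: defaultdict(lambda: defaultdict(int)) for order in range(k + 1)}
        for i, c in enumerate(text):
            for order in range(k + 1):
                if i >= order:
                    ctx = text[i - order:i]
                    stats[order][ctx][c] += 1
        return stats

    stats = ppm_counts("ACCBACCACBA", 2)
    ctx = stats[2]["AC"]                   # counts after context "AC": {'C': 2, 'B': 1}
    esc = len(ctx)                         # escape count = #distinct successors (assumed heuristic)
    total = sum(ctx.values()) + esc
    print(dict(ctx), {s: c / total for s, c in ctx.items()})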

PPM + Arithmetic ToolBox
The ATB is driven by p[ s | context ], where s = c or esc: at each step the state (L,s) plus the coded symbol yields the new state (L',s').

Encoder and Decoder must know the protocol for selecting the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
String = ACCBACCACBA B      k = 2

  Context   Counts
  Empty     A = 4   B = 2   C = 5   $ = 3

  Context   Counts
  A         C = 3   $ = 1
  B         A = 2   $ = 1
  C         A = 1   B = 2   C = 2   $ = 3

  Context   Counts
  AC        B = 1   C = 2   $ = 2
  BA        C = 1   $ = 1
  CA        C = 1   $ = 1
  CB        A = 2   $ = 1
  CC        A = 1   B = 1   $ = 2
You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit frequency estimation

LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) for n → ∞ !!

LZ77
Dictionary = all substrings starting in the already-seen text; Cursor = current position
(the slide figure shows  a a c a a c a b c a b a b a c  with an example output <2,3,c>)

Algorithm’s step:
 Output <d, len, c> where
   d   = distance of the copied string wrt the current position
   len = length of the longest match
   c   = next char in the text beyond the longest match
 Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window  (window size = 6; take the longest match within W, then the next character)

  T = a a c a a c a b c a b a a a c

  step 1:  (0,0,a)
  step 2:  (1,1,c)
  step 3:  (3,4,b)
  step 4:  (3,3,a)
  step 5:  (1,2,c)
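A minimal LZ77 encoder sketch with a sliding window, reproducing the trace above (illustrative code; real implementations such as gzip use hash tables and the LZSS variant described below):

    def lz77_encode(text, window=6):
        out, i, n = [], 0, len(text)
        while i < n:
            best_d, best_len = 0, 0
            lo = max(0, i - window)
            for start in range(lo, i):             # candidate match starts in the window
                length = 0
                # the copy may overlap the cursor (it can run into itself)
                while i + length < n - 1 and text[start + length] == text[i + length]:
                    length += 1
                if length > best_len:
                    best_d, best_len = i - start, length
            nxt = text[i + best_len]               # next char beyond the longest match
            out.append((best_d, best_len, nxt))
            i += best_len + 1
        return out

    print(lz77_encode("aacaacabcabaaac"))
    # [(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')]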

LZ77 Decoding
Decoder keeps the same dictionary window as the encoder.
 Finds the substring and inserts a copy of it
 What if len > d? (the copy overlaps the text still to be written)
   E.g. seen = abcd, next codeword is (2,9,e)
   Simply copy starting at the cursor:
     for (i = 0; i < len; i++)
       out[cursor+i] = out[cursor-d+i]
   Output is correct: abcdcdcdcdcdce
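The same overlap-friendly copy, as a short sketch of the whole decoder (illustrative, matching the triples produced by the encoder sketch above):

    def lz77_decode(triples):
        out = []
        for d, length, c in triples:
            cursor = len(out)
            for i in range(length):
                out.append(out[cursor - d + i])   # works even when length > d (overlap)
            out.append(c)
        return "".join(out)

    print(lz77_decode([(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')]))  # aacaacabcabaaac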

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
  (0, position, length)   or   (1, char)
 Typically uses the second format if length < 3.
 Special greedy: possibly use a shorter match so that the next match is better
 Hash table to speed up the search for matching triplets
 Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
T = a a b a a c a b c a b c b

  Output   Dict.
  (0,a)    1 = a
  (1,b)    2 = ab
  (1,a)    3 = aa
  (0,c)    4 = c
  (2,c)    5 = abc
  (5,b)    6 = abcb

LZ78: Decoding Example

  Input    Decoded so far          Dict.
  (0,a)    a                       1 = a
  (1,b)    a ab                    2 = ab
  (1,a)    a ab aa                 3 = aa
  (0,c)    a ab aa c               4 = c
  (2,c)    a ab aa c abc           5 = abc
  (5,b)    a ab aa c abc abcb      6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send the extra character c, but still add Sc to the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it does not know c
 There is an issue for strings of the form SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
T = a a b a a c a b a b a c b      (a = 112, b = 113, c = 114)

  Output   Dict.
  112      256 = aa
  112      257 = ab
  113      258 = ba
  256      259 = aac
  114      260 = ca
  257      261 = aba
  261      262 = abac
  114      263 = cb

LZW: Decoding Example

  Input    Decoded so far             Dict.
  112      a
  112      a a                        256 = aa
  113      a a b                      257 = ab
  256      a a b a a                  258 = ba
  114      a a b a a c                259 = aac
  257      a a b a a c a b            260 = ca
  261      a a b a a c a b ?          261 = aba   (261 is not yet in the decoder’s dictionary:
                                                   it is resolved one step later)
  114      …
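A compact LZW encoder sketch using a Python dict; for brevity it is initialized only with the three symbols of the example (a=112, b=113, c=114) instead of the full 256-entry ASCII table:

    def lzw_encode(text, alphabet="abc"):
        # Dictionary maps strings to codes; a real coder starts from 256 ASCII entries
        codes = {ch: i for i, ch in enumerate(alphabet, start=112)}
        next_code = 256
        out, S = [], ""
        for c in text:
            if S + c in codes:
                S = S + c                     # extend the current match
            else:
                out.append(codes[S])          # emit the code of the longest match S
                codes[S + c] = next_code      # add Sc to the dictionary (c is NOT sent)
                next_code += 1
                S = c
        if S:
            out.append(codes[S])
        return out

    print(lzw_encode("aabaacababacb"))
    # [112, 112, 113, 256, 114, 257, 261, 114, 113]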

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a certain size (used in GIF)
 Throw the dictionary away when it is no longer effective at compressing (used in compress)
 Throw the least-recently-used (LRU) entry away when it reaches a certain size (used in BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us be given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows (1994):

  #mississippi
  i#mississipp
  ippi#mississ
  issippi#miss
  ississippi#m
  mississippi#
  pi#mississip
  ppi#mississi
  sippi#missis
  sissippi#mis
  ssippi#missi
  ssissippi#mi

F = first column = #iiiimppssss,  L = last column = ipssm#pissii  (the BWT of T)

A famous example; real texts are much longer...

A useful tool: the L → F mapping
(same BWT matrix as above, with T unknown)

How do we map L’s chars onto F’s chars ?
... Need to distinguish equal chars in F...
 Take two equal chars of L
 Rotate their rows rightward by one position
 They keep the same relative order !!

The BWT is invertible
(same BWT matrix as above, with T unknown)

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
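A runnable version of the InvertBWT pseudocode above, as a sketch: the LF-array is built by stably sorting the positions of L (equal chars keep their relative order, which is exactly property 1), and indices are 0-based:

    def invert_bwt(L):
        n = len(L)
        # LF-array: a stable sort of L's positions gives the F column;
        # equal chars keep their relative order (the key BWT property)
        order = sorted(range(n), key=lambda r: L[r])
        LF = [0] * n
        for f_pos, r in enumerate(order):
            LF[r] = f_pos
        # Walk backward through T, starting from row 0
        # (the row beginning with '#', assuming '#' is the smallest character)
        out, r = [], 0
        for _ in range(n):
            out.append(L[r])
            r = LF[r]
        s = "".join(reversed(out))     # this is the rotation of T starting at '#'
        return s[1:] + s[0]            # move the terminator back to the end

    print(invert_bwt("ipssm#pissii"))  # mississippi#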

How to compute the BWT ?

  SA    BWT matrix      L
  12    #mississipp     i
  11    i#mississip     p
   8    ippi#missis     s
   5    issippi#mis     s
   2    ississippi#     m
   1    mississippi     #
  10    pi#mississi     p
   9    ppi#mississ     i
   7    sippi#missi     s
   4    sissippi#mi     s
   6    ssippi#miss     i
   3    ssissippi#m     i

We said that L[i] precedes F[i] in T, e.g. L[3] = T[7]
Given SA and T, we have L[i] = T[SA[i]-1]
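The relation L[i] = T[SA[i]-1] as a two-line sketch (0-based indices; the naive suffix sort below is exactly the “elegant but inefficient” construction discussed next):

    def bwt_from_sa(T):
        # Naive suffix-array construction: sort all suffixes (inefficient but simple)
        SA = sorted(range(len(T)), key=lambda i: T[i:])
        # L[i] = character cyclically preceding the i-th smallest suffix
        return "".join(T[i - 1] for i in SA)

    print(bwt_from_sa("mississippi#"))   # ipssm#pissii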

How to construct SA from T ?

  SA
  12   #
  11   i#
   8   ippi#
   5   issippi#
   2   ississippi#
   1   mississippi
  10   pi#
   9   ppi#
   7   sippi#
   4   sissippi#
   6   ssippi#
   3   ssissippi#

Input: T = mississippi#

Elegant but inefficient. Obvious inefficiencies:
 Θ(n^2 log n) time in the worst case
 Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:
 L is locally homogeneous  ⇒  L is highly compressible

Algorithm Bzip:
 Move-to-Front coding of L
 Run-Length coding
 Statistical coder

 Bzip vs. Gzip: 20% vs. 33%, but Bzip is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii       (# at position 16)

Mtf-list = [i,m,p,s]
Mtf  = 020030000030030200300300000100000
Mtf’ = 030040000040040300400400000200000     (Bin(6)=110, Wheeler’s code)
RLE0 = 03141041403141410210                  (alphabet of size |S|+1)

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics
 Size
   1 trillion pages available (Google, 7/2008)
   5-40K per page ⇒ hundreds of terabytes
   Size grows every day!!
 Change
   8% new pages, 25% new links change weekly
   Life time of about 10 days

The Bow Tie

Some definitions
 Weakly connected components (WCC)
   Set of nodes such that from any node one can reach any other node via an undirected path.
 Strongly connected components (SCC)
   Set of nodes such that from any node one can reach any other node via a directed path.

Observing the Web Graph
 We do not know which percentage of it we know
 The only way to discover the graph structure of the web as hypertext is via large scale crawls
 Warning: the picture might be distorted by
   Size limitation of the crawl
   Crawling rules
   Perturbations of the "natural" process of birth and death of nodes and links

Why is it interesting?
 Largest artifact ever conceived by humans
 Exploit the structure of the Web for
   Crawl strategies
   Search
   Spam detection
   Discovering communities on the web
   Classification/organization
 Predict the evolution of the Web
   Sociological understanding

Many other large graphs…
 Physical network graph
   V = Routers
   E = communication links
 The “cosine” graph (undirected, weighted)
   V = static web pages
   E = semantic distance between pages
 Query-Log graph (bipartite, weighted)
   V = queries and URLs
   E = (q,u) if u is a result for q and has been clicked by some user who issued q
 Social graph (undirected, unweighted)
   V = users
   E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)
 V = URLs, E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN & no OUT)

Three key properties:
 Skewed distribution: Pr that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution
Altavista crawl (1999) and WebBase crawl (2001): indegree follows a power law distribution

  Pr[ in-degree(u) = k ]  ∝  1 / k^a ,   a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)
 V = URLs, E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN, no OUT)

Three key properties:
 Skewed distribution: Pr that a node has x links is 1/x^a, a ≈ 2.1
 Locality: usually most of the hyperlinks point to other URLs on the same host (about 80%).
 Similarity: pages close in lexicographic order tend to share many outgoing lists

A Picture of the Web Graph: 21 million pages, 150 million links

URL-sorting
(the slide figure shows URLs from Berkeley and Stanford grouping together after sorting)
URL compression + Delta encoding

The library WebGraph
Uncompressed adjacency list  ⇒  adjacency list with compressed gaps (locality):

  Successor list S(x) = { s1 - x, s2 - s1 - 1, ..., sk - s(k-1) - 1 }
  For negative entries:
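A tiny sketch of the gap transformation above; the successor list used here is made up purely for illustration, and to_natural is the ×2 / ×2−1 mapping for possibly-negative entries shown in the residual example below:

    def gap_encode(x, successors):
        # S(x) = { s1 - x, s2 - s1 - 1, ..., sk - s(k-1) - 1 }
        gaps = [successors[0] - x]
        for prev, cur in zip(successors, successors[1:]):
            gaps.append(cur - prev - 1)
        return gaps

    def to_natural(v):
        # Map a possibly-negative gap to a natural number (*2 if >=0, *2-1 on |v| if negative)
        return 2 * v if v >= 0 else 2 * (-v) - 1

    print(gap_encode(15, [13, 15, 16, 17, 18, 19, 23, 24, 203]))
    # [-2, 1, 0, 0, 0, 0, 3, 0, 178]
    print(to_natural(-2))   # 3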

Copy-lists
Reference chains, possibly limited in length

Uncompressed adjacency list  ⇒  adjacency list with copy lists (similarity):
 Each bit of y’s copy-list refers to one successor of the reference x, and tells whether it is also a successor of y;
 The reference index is chosen in [0,W] so as to give the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with copy lists  ⇒  adjacency list with copy blocks (RLE on the bit sequences):
 The first copy-block is 0 if the copy-list starts with 0;
 The last block is omitted (we know the length…);
 The length is decremented by one for all blocks

This is a Java and C++ lib (≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with copy blocks  ⇒  consecutivity in the extra-nodes is encoded as intervals:
 Intervals: use their left extreme and length
 Interval length: decremented by Lmin = 2
 Residuals: differences between consecutive residuals, or wrt the source

Example (source node 15):
  0    = (15-15)*2        (positive)
  2    = (23-19)-2        (jump >= 2)
  600  = (316-16)*2
  3    = |13-15|*2-1      (negative)
  3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
(setting: a sender transmits data to a receiver, which may already have some knowledge about the data)

 network links are getting faster and faster but
 many clients are still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques
 caching: “avoid sending the same object again”
   done on the basis of objects
   only works if objects are completely unchanged
   How about objects that are slightly changed?
 compression: “remove redundancy in transmitted data”
   avoid repeated substrings in data
   can be extended to the history of past transmissions (overhead)
   What if the sender has never seen the data at the receiver ?

Types of Techniques
 Common knowledge between sender & receiver
   Unstructured file: delta compression
 “Partial” knowledge
   Unstructured files: file synchronization
   Record-based data: set reconciliation

Formalization
 Delta compression  [diff, zdelta, REBL,…]
   Compress file f deploying file f’
   Compress a group of files
   Speed up web access by sending differences between the requested page and the ones available in cache
 File synchronization  [rsync, zsync]
   Client updates old file f_old with f_new available on a server
   Mirroring, Shared Crawling, Content Distribution Networks
 Set reconciliation
   Client updates structured old file f_old with f_new available on a server
   Update of contacts or appointments, intersect inverted lists in a P2P search engine

Z-delta compression   (one-to-one)
Problem: We have two files f_known and f_new, and the goal is to compute a file f_d of minimum size such that f_new can be derived from f_known and f_d
 Assume that block moves and copies are allowed
 Find an optimal covering set of f_new based on f_known
 The LZ77-scheme provides an efficient, optimal solution
   f_known is the “previously encoded text”: compress the concatenation f_known·f_new, starting from f_new
 zdelta is one of the best implementations

            Emacs size    Emacs time
  uncompr   27Mb          ---
  gzip      8Mb           35 secs
  zdelta    1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: a pair of proxies located on each side of the slow link uses a proprietary protocol to increase performance over this link
(figure: Client ↔ client-side proxy — slow link, delta-encoding — server-side proxy ↔ web, fast link; the reference page is kept on both sides)

Use zdelta to reduce traffic:
 Old version available at both proxies
 Restricted to pages already visited (30% hits), URL-prefix match
 Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F
 Useful on a dynamic collection of web pages, back-ups, …
 Apply pairwise zdelta: find for each f ∈ F a good reference
 Reduction to the Min Branching problem on DAGs
   Build a weighted graph G_F: nodes = files, edge weights = zdelta-sizes
   Insert a dummy node connected to all files, whose edge weights are the gzip-coding sizes
   Compute the min branching = directed spanning tree of min total cost, covering G’s nodes
(figure: a small example graph with the dummy node 0 and edge weights such as 20, 123, 220, 620, 2000)

            space    time
  uncompr   30Mb     ---
  tgz       20%      linear
  THIS      8%       quadratic

Improvement
What about many-to-one compression?   (group of files)

Problem: Constructing G is very costly: n^2 edge calculations (zdelta executions)
 We wish to exploit some pruning approach
 Collection analysis: Cluster the files that appear similar and are thus good candidates for zdelta-compression. Build a sparse weighted graph G’_F containing only the edges between those pairs of files
 Assign weights: Estimate appropriate edge weights for G’_F, thus saving zdelta executions. Nonetheless, strictly n^2 time

            space    time
  uncompr   260Mb    ---
  tgz       12%      2 mins
  THIS      8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
(figure: the Client holds f_old, the Server holds f_new; the client sends a request, the server sends an update)

 client wants to update an out-dated file
 server has the new file but does not know the old file
 update without sending the entire f_new (using similarity)
 rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch, since the server has both copies of the files

The rsync algorithm
(figure: the Client sends the hashes of f_old’s blocks; the Server replies with the encoded file built from f_new)

The rsync algorithm (contd)
 simple, widely used, single roundtrip
 optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals (a rolling-hash sketch follows below)
 choice of block size is problematic (default: max{700, √n} bytes)
 not good in theory: the granularity of changes may disrupt the use of blocks
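A sketch of the rolling-hash idea behind rsync’s weak checksum: the hash of a sliding window is updated in O(1) per shifted position, so one side can slide its file over the other side’s block hashes. This is a generic polynomial rolling hash for illustration, not rsync’s actual Adler-style checksum:

    def rolling_hashes(data, block, base=256, mod=(1 << 31) - 1):
        # Hash of every window of length `block`, updated in O(1) per position
        if len(data) < block:
            return []
        h = 0
        for b in data[:block]:
            h = (h * base + b) % mod
        hashes = [h]
        top = pow(base, block - 1, mod)
        for i in range(block, len(data)):
            h = (h - data[i - block] * top) % mod   # remove the leftmost byte
            h = (h * base + data[i]) % mod          # append the new byte
            hashes.append(h)
        return hashes

    data = b"the quick brown fox jumps over the lazy dog"
    print(rolling_hashes(data, 4)[:3])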

Rsync: some experiments

            gcc size    emacs size
  total     27288       27326
  gzip      7563        8577
  zdelta    227         1431
  rsync     964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync
 The Server sends the hashes (unlike rsync, where the client sends them); the client checks them
 The Server deploys the common f_ref to compress the new f_tar (rsync, instead, just compresses it)

A multi-round protocol
 k blocks of n/k elements, log(n/k) levels
 If the distance is k, then on each level at most k hashes do not find a match in the other file.
 The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located on two machines A and B, determine the difference between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size k of the difference, where k may or may not be known in advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]
 Not perfectly true but...
 Recurring minimum for improving the estimate + 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
   » Inverted files, Signature files, Bitmaps.
 Full-text indexes, no constraint on text and queries !
   » Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?
 Trie !!
 Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff P is a prefix of the i-th suffix of T (i.e. T[i,N])

Occurrences of P in T = all suffixes of T having P as a prefix
  e.g.  P = si,  T = mississippi  ⇒  occurrences at 4, 7

SUF(T) = sorted set of suffixes of T

Reduction: from substring search to prefix search (over SUF(T))

The Suffix Tree
T# = mississippi#
(figure: the suffix tree of T#, with edges labeled by substrings such as i#, ssi, si, ppi#, pi#, mississippi#, and leaves labeled by the starting positions 1..12 of the corresponding suffixes)

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

Storing SUF(T) explicitly takes Θ(N^2) space; store only the suffix pointers:

  SA    SUF(T)
  12    #
  11    i#
   8    ippi#
   5    issippi#
   2    ississippi#
   1    mississippi#
  10    pi#
   9    ppi#
   7    sippi#
   4    sissippi#
   6    ssippi#
   3    ssissippi#

  T = mississippi#      (e.g. P = si prefixes the two contiguous rows holding 7 and 4)

Suffix Array
 SA: Θ(N log2 N) bits
 Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step
  e.g. P = si on T = mississippi#: compare P with the suffix in the middle of SA,
  move right if P is larger, move left if P is smaller

Suffix Array search
 O(log2 N) binary-search steps
 Each step takes O(p) char comparisons
 ⇒ overall, O(p log2 N) time

 O(p + log2 N)  [Manber-Myers, ’90]
 O(p + log2 |S|)  [Cole et al, ’06]
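A minimal sketch of the indirect binary search on SA, returning the range of suffixes prefixed by P (0-based indices; each comparison costs O(p) as above, so this is the plain O(p log N) method, not the improved bounds just cited):

    def sa_range(T, SA, P):
        # Leftmost/rightmost SA positions whose suffix has P as a prefix
        def lower(pat):
            lo, hi = 0, len(SA)
            while lo < hi:
                mid = (lo + hi) // 2
                if T[SA[mid]:SA[mid] + len(pat)] < pat:
                    lo = mid + 1
                else:
                    hi = mid
            return lo
        lo = lower(P)
        hi = lower(P + chr(0x10FFFF))    # any suffix prefixed by P is < P + "max char"
        return lo, hi

    T = "mississippi#"
    SA = sorted(range(len(T)), key=lambda i: T[i:])
    lo, hi = sa_range(T, SA, "si")
    print([SA[i] + 1 for i in range(lo, hi)])   # [7, 4]  (1-based starting positions)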

Locating the occurrences
Find the SA range of P by binary-searching for P# and P$ (where # < every char of S < $):
e.g. P = si on T = mississippi#  ⇒  occ = 2, the range contains the SA entries 7 and 4

Suffix Array search
 O(p + log2 N + occ) time

Suffix Trays: O(p + log2 |S| + occ)   [Cole et al., ’06]
String B-tree   [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays   [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

  SA:   12  11  8  5  2  1  10  9  7  4  6  3
  Lcp:       0  0  1  4  0   0  1  0  2  1  3

  e.g. the adjacent suffixes issippi# and ississippi# share a prefix of length 4   (T = mississippi#)

• How long is the common prefix between T[i,...] and T[j,...] ?
  • Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Is there a repeated substring of length ≥ L ?
  • Search for Lcp[i] ≥ L
• Is there a substring of length ≥ L occurring ≥ C times ?
  • Search for a window Lcp[i,i+C-2] whose entries are all ≥ L

Slide 221

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with i-th bit of U(T[j]) to estabilish if both are true

An example j=1
1 2 3 4 5 6 7 8 9 10

n
m

1

12345

1

0

P=abaac

2

0

3

0

4

0

5

0

T=xabxabaaca

0
 
0
U (x)  0
 
0
 
0

2

3

4

5

6

7

1 
 
0
BitShift ( M ( 0 )) & U (T [1])   0  &
 
0
 
0

8

9

0 0
   
0 0
0  0
   
0 0
   
0 0

1
0

An example j=2
1 2 3 4 5 6 7 8 9 10

n
m

1

2

12345

1

0

1

P=abaac

2

0

0

3

0

0

4

0

0

5

0

0

T=xabxabaaca

1 
 
0
U (a )   1 
 
1 
 
0

3

4

5

6

7

1 
 
0
BitShift ( M (1)) & U (T [ 2 ])   0  &
 
0
 
0

8

9

1  1 
   
0 0
1    0 
   
1   0 
   
0 0

1
0

An example j=3
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

12345

1

0

1

0

P=abaac

2

0

0

1

3

0

0

0

4

0

0

0

5

0

0

0

T=xabxabaaca

0
 
1 
U (b )   0 
 
0
 
0

4

5

6

7

1 
 
1 
BitShift ( M ( 2 )) & U (T [ 3 ])   0  &
 
0
 
0

8

9

0 0
   
1  1 
0  0
   
0 0
   
0 0

1
0

An example j=9
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

4

5

6

7

8

9

12345

1

0

1

0

0

1

0

1

1

0

P=abaac

2

0

0

1

0

0

1

0

0

0

3

0

0

0

0

0

0

1

0

0

4

0

0

0

0

0

0

0

1

0

5

0

0

0

0

0

0

0

0

1

T=xabxabaaca

0
 
0
U (c )   0 
 
0
 
1 

1 
 
1 
BitShift ( M ( 8 )) & U (T [ 9 ])   0  &
 
0
 
1 

0 0
   
0 0
0  0
   
0 0
   
1  1 

1
0

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M ( j - 1))  U (T [ j ])
l

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M

l -1

( j - 1))

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1
1 2 3 4 5 6 7 8 910

M1=

T=xabxabaaca
P=
abaad

1

2

3

4

5

6

7

8

9

1
0

1

1

1

1

1

1

1

1

1

1

1

2

0

0

1

0

0

1

0

1

1

0

3

0

0

0

1

0

0

1

0

0

1

4

0

0

0

0

1

0

0

1

0

0

5

0

0

0

0

0

0

0

0

1

0

M0=

1

2

3

4

5

6

7

8

9

1
0

1

0

1

0

0

1

0

1

1

0

1

2

0

0

1

0

0

1

0

0

0

0

3

0

0

0

0

0

0

1

0

0

0

4

0

0

0

0

0

0

0

1

0

0

5

0

0

0

0

0

0

0

0

0

0

How much do we pay?





The running time is O(kn(1+m/w)
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Delection: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
0 000 ...........0 x in b in ary
Length-1


x > 0 and Length = log2 x +1
e.g., 9 represented as <000,1001>.



g-code for x takes 2 log2 x +1 bits

(ie. factor of 2 from optimal)



Optimal for Pr(x) = 1/2x2, and i.i.d integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8

6

3

59

7

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How much good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
1 ≥Si=1,...,x pi ≥ x * px  x ≤ 1/px

How good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):



p i  | g ( i ) |

i  1 ,..., S

This is:



i  1 ,.., S

p i  [ 2 * log

1

 1]

pi

 2 * H 0 ( X ) 1
No much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1n 2n 3n… nn  Huff = O(n2 log n), MTF = O(n log n) + n2

No much worse than Huffman
...but it may be far better

MTF: how good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
Put S in the front and consider the cost of encoding:
S

O ( S log S ) 



nx

g(p - p )

x 1 i  2

x

x

i

i -1

S

By Jensen’s:

 O ( S log S ) 

n
x 1

x

[ 2 * log

N

 1]

nx

 O ( S log S )  N * [ 2 * H 0 ( X )  1]
L a [ mtf ]  2 * H 0 ( X )  O (1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1n 2n 3n… nn 

There is a memory

Huff(X) = Q(n2 log n) > Rle(X) = Q(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.
1.0

c = .3

0.7

i -1

f (i ) 



p( j)

j 1

b = .5
0.2
0.0

f(a) = .0, f(b) = .2, f(c) = .7
a = .2

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
1.0

0.7
c = .3

0.7

c = .3
0.55

0.0

a = .2

0.3
0.2

c = .3

0.27
b = .5

b = .5
0.2

0.3

a = .2

b = .5
0.22

a = .2

0.2

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l 0  0 l i  l i -1  s i -1 * f c i 

s0  1

s i  s i -1 * p c i 

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

n

sn 

 p c 
i

i 1

The interval for a message sequence will be called the
sequence interval

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
1.0

0.7
c = .3

0.7
0.49

b = .5

c = .3

0.55
0.49

0.2

0.0

0.55

b = .5

0.3
a = .2

The message is bbc.

0.2

0.49
0.475

c = .3

b = .5
0.35

a = .2

0.3

a = .2

Representing a real number
Binary fractional representation:
. 75

 . 11

1/ 3

 . 01 01

11 / 16

 . 1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
m in

m ax

interval

.11

.110

.111

[.75 ,1.0 )

.101

.1010

.1011

[.625 , .75 )

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Example: sequence interval [.61, .79); the number .101 has code interval [.625, .75) ⊆ [.61, .79), so .101 is a valid code for the sequence.

Can use L + s/2 truncated to 1 + ⌈log (1/s)⌉ bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
   1 + ⌈log (1/s)⌉ =
   = 1 + ⌈log Π_{i=1..n} (1/p_i)⌉
   ≤ 2 + Σ_{i=1..n} log (1/p_i)
   = 2 + Σ_{k=1..|S|} n p_k log (1/p_k)
   = 2 + n H_0   bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key ideas of the integer version:
 - Keep integers in the range [0..R) where R = 2^k
 - Use rounding to generate integer intervals
 - Whenever the sequence interval falls into the top, bottom or middle half, expand the interval by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 (top half):
   Output 1 followed by m 0s;  m = 0;  the message interval is expanded by 2
If u < R/2 (bottom half):
   Output 0 followed by m 1s;  m = 0;  the message interval is expanded by 2
If l ≥ R/4 and u < 3R/4 (middle half):
   Increment m;  the message interval is expanded by 2
In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine

[Diagram: the ATB takes the current interval (L, s), a symbol c and the distribution (p1, ..., p|S|), and produces the new interval (L', s') with L' = L + s * f[c] and s' = s * p[c].]

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
[Diagram: the ATB is driven by p[s | context], where the symbol s is either the next character c or the escape esc; as before it maps the interval (L, s) to (L', s').]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
String = ACCBACCACBA (next symbol to code: B),  k = 2

Context: Empty       Counts:  A = 4,  B = 2,  C = 5,  $ = 3

Order-1 contexts:
  A :  C = 3,  $ = 1
  B :  A = 2,  $ = 1
  C :  A = 1,  B = 2,  C = 2,  $ = 3

Order-2 contexts:
  AC:  B = 1,  C = 2,  $ = 2
  BA:  C = 1,  $ = 1
  CA:  C = 1,  $ = 1
  CB:  A = 2,  $ = 1
  CC:  A = 1,  B = 1,  $ = 2
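The tables above can be recomputed mechanically. Below is a hedged sketch (mine) that, for every context of length 0..k, counts the characters following it and sets the escape count $ to the number of distinct followers; called on "ACCBACCACBA" with k = 2 it reproduces the counts listed above.

    from collections import defaultdict

    def ppm_counts(s, k):
        counts = defaultdict(lambda: defaultdict(int))
        for i in range(len(s)):
            for order in range(0, k + 1):
                if i - order < 0:
                    continue
                ctx = s[i - order:i]              # the previous `order` characters
                counts[ctx][s[i]] += 1            # s[i] follows context ctx
        for ctx in counts:                        # escape count = #distinct followers
            counts[ctx]['$'] = len(counts[ctx])
        return counts

    # ppm_counts("ACCBACCACBA", 2)['']   -> A: 4, C: 5, B: 2, $: 3
    # ppm_counts("ACCBACCACBA", 2)['AC'] -> C: 2, B: 1, $: 2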

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm's step:
 - Output the triple <d, len, c> where
     d   = distance of the copied string wrt the current position
     len = length of the longest match
     c   = next char in the text beyond the longest match
 - Advance by len + 1

A buffer "window" of fixed length keeps the dictionary and slides along the text (a code sketch follows the window example below).

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6. At each step the longest match within W and the next character determine the output triple.
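For concreteness, a naive sketch (mine, not the slides' code) of LZ77 with a sliding window of w characters; on the text of the example above it produces exactly the five triples shown.

    def lz77_encode(text, w=6):
        out, i = [], 0
        while i < len(text):
            best_d, best_len = 0, 0
            for j in range(max(0, i - w), i):      # candidate copy positions in the window
                l = 0
                while i + l < len(text) - 1 and text[j + l] == text[i + l]:
                    l += 1                         # the match may run past the cursor
                if l > best_len:
                    best_d, best_len = i - j, l
            out.append((best_d, best_len, text[i + best_len]))   # <d, len, next char>
            i += best_len + 1                      # advance by len + 1
        return out

    # lz77_encode("aacaacabcabaaac") == [(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')]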

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if len > d? (overlap with the text still to be decompressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor:
for (i = 0; i < len; i++)
   out[cursor + i] = out[cursor - d + i];   // when len > d this re-reads chars it has just written



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically the second format is used if length < 3.
Special greedy parsing: possibly use a shorter match so that the next match is better.
A hash table speeds up the search for matching triplets.
The triples are then coded with Huffman codes.

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later
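A compact sketch (mine; it assumes a small explicit alphabet instead of the 256 ASCII entries) of the LZW coder: it outputs only dictionary ids and silently adds S·c to the dictionary, as described above.

    def lzw_encode(text, alphabet):
        dic = {ch: i for i, ch in enumerate(alphabet)}   # initial one-char entries
        out, S = [], ""
        for c in text:
            if S + c in dic:
                S = S + c                                # extend the current match
            else:
                out.append(dic[S])                       # emit the id of the longest match S
                dic[S + c] = len(dic)                    # add Sc to the dictionary
                S = c
        if S:
            out.append(dic[S])                           # flush the last match
        return out

    # On "aabaacababacb" it emits the ids of a, a, b, aa, c, ab, aba, c, b while adding
    # aa, ab, ba, aac, ca, aba, abac, cb to the dictionary, as in the example above.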

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us be given a text T = mississippi#  (a famous example; real texts are much longer). Form all cyclic rotations of T:

mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows (Burrows-Wheeler, 1994). The first column is F, the last column is L = BWT(T):

F               L
#  mississipp   i
i  #mississip   p
i  ppi#missis   s
i  ssippi#mis   s
i  ssissippi#   m
m  ississippi   #
p  i#mississi   p
p  pi#mississ   i
s  ippi#missi   s
s  issippi#mi   s
s  sippi#miss   i
s  sissippi#m   i

A useful tool: the L → F mapping
[Same BWT matrix as above: F and L are known, the rest of each row (i.e. the text) is unknown.]

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
[Same BWT matrix: F (the sorted chars of T) and L are known, the rotations themselves are unknown.]
Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
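A small Python sketch (mine, not the slides' code) of both directions: the forward transform via sorted rotations and the backward reconstruction via the LF-mapping sketched above. It assumes the text ends with a unique smallest sentinel '#'.

    def bwt(T):
        rot = sorted(T[i:] + T[:i] for i in range(len(T)))   # all rotations, sorted
        return ''.join(r[-1] for r in rot)                    # last column L

    def ibwt(L):
        F = sorted(L)
        rank, seen = [], {}
        for ch in L:                        # rank of L[i] among equal chars in L[0..i]
            rank.append(seen.get(ch, 0))
            seen[ch] = seen.get(ch, 0) + 1
        first = {}
        for i, ch in enumerate(F):          # first row of each char in F
            first.setdefault(ch, i)
        LF = [first[ch] + rank[i] for i, ch in enumerate(L)]
        r, out = 0, []                      # row 0 is the rotation starting with '#'
        for _ in range(len(L)):
            out.append(L[r])                # L[r] precedes F[r] in T
            r = LF[r]
        s = ''.join(reversed(out))          # this equals '#' + T without its final '#'
        return s[1:] + s[0]

    # bwt("mississippi#") == "ipssm#pissii"   and   ibwt("ipssm#pissii") == "mississippi#"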

How to compute the BWT ?
SA = 12 11 8 5 2 1 10 9 7 4 6 3      (the suffix array of T = mississippi#)
[BWT matrix: row i, in sorted order, is the rotation starting at position SA[i]; its last column is L = i p s s m # p i s s i i.]
We said that L[i] precedes F[i] in T.
Given SA and T, we have  L[i] = T[SA[i] − 1]     (e.g. L[3] = T[SA[3] − 1] = T[7])

How to construct SA from T ?
SA    sorted suffixes
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#

Input: T = mississippi#
Elegant but inefficient. Obvious inefficiencies:
 • Θ(n² log n) time in the worst case
 • Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation: L is locally homogeneous ⇒ L is highly compressible

Algorithm Bzip:
 • Move-to-Front coding of L
 • Run-Length coding
 • Statistical coder
Bzip vs. Gzip: 20% vs. 33% compression ratio, but Bzip is slower in (de)compression!

An encoding example
T   = mississippimississippimississippi
L   = ipppssssssmmmii#pppiiissssssiiiiii      (# at position 16)
MTF-list = [i,m,p,s]
Mtf = 020030000030030200300300000100000
Mtf = 030040000040040300400400000200000      (Bin(6) = 110, Wheeler's code)
RLE0 = 03141041403141410210                  (alphabet of |S|+1 symbols)
Bzip2-output = Arithmetic/Huffman on the |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size
 • 1 trillion pages available (Google, 7/2008)
 • 5-40K per page ⇒ hundreds of terabytes
 • size grows every day!!
Change
 • 8% new pages and 25% new links change weekly
 • life time of a page is about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)
 • Set of nodes such that any node can reach any other node via an undirected path.
Strongly connected components (SCC)
 • Set of nodes such that any node can reach any other node via a directed path.

Observing Web Graph






We do not know which percentage of it we know.
The only way to discover the graph structure of the web as hypertext is via large scale crawls.
Warning: the picture might be distorted by
 • size limitations of the crawl
 • crawling rules
 • perturbations of the "natural" process of birth and death of nodes and links

Why is it interesting?


Largest artifact ever conceived by humans.
Exploit the structure of the Web for:
 • crawl strategies
 • search
 • spam detection
 • discovering communities on the web
 • classification/organization
Predict the evolution of the Web:
 • sociological understanding

Many other large graphs…
 • Physical network graph: V = routers, E = communication links
 • The “cosine” graph (undirected, weighted): V = static web pages, E = semantic distance between pages
 • Query-Log graph (bipartite, weighted): V = queries and URLs, E = (q,u) if u is a result for q and has been clicked by some user who issued q
 • Social graph (undirected, unweighted): V = users, E = (x,y) if x knows y (facebook, address book, email, ..)

Definition
Directed graph G = (V,E)
 • V = URLs, E = (u,v) if u has a hyperlink to v
 • isolated URLs are ignored (no IN & no OUT)
Three key properties; the first is a skewed distribution: the probability that a node has x links is ∝ 1/x^a, with a ≈ 2.1.

The In-degree distribution
Altavista crawl (1999) and WebBase crawl (2001): the indegree follows a power-law distribution,
    Pr[in-degree(u) = k]  ∝  1/k^a,    a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)
 • V = URLs, E = (u,v) if u has a hyperlink to v
 • isolated URLs are ignored (no IN, no OUT)
Three key properties:
 • Skewed distribution: the probability that a node has x links is ∝ 1/x^a, a ≈ 2.1
 • Locality: usually most of the hyperlinks point to other URLs on the same host (about 80%)
 • Similarity: pages close in lexicographic order tend to share many outgoing lists

A Picture of the Web Graph
[Figure: adjacency matrix (i,j) of a crawl with 21 million pages and 150 million links; after URL-sorting the links cluster in blocks along the diagonal (e.g. the Berkeley and Stanford hosts).]

URL compression + Delta encoding

The WebGraph library
The uncompressed adjacency list is turned into an adjacency list with compressed gaps (exploiting locality):
  Successor list S(x) = {s1, s2, ..., sk} is encoded as the gaps {s1 − x, s2 − s1 − 1, ..., sk − s(k−1) − 1}
  For negative entries:

Copy-lists
(reference chains, possibly limited)
The uncompressed adjacency list is turned into an adjacency list with copy lists (exploiting similarity):
 • each bit in the copy-list of y tells whether the corresponding successor of the reference x is also a successor of y;
 • the reference index is chosen in [0, W] so as to give the best compression.

Copy-blocks = RLE(Copy-list)
The adjacency list with copy lists becomes an adjacency list with copy blocks (RLE on the bit sequences):
 • the first copy block is 0 if the copy list starts with 0;
 • the last block is omitted (we know the total length…);
 • the length is decremented by one for all blocks.
This is a Java and C++ library, achieving ≈ 3 bits/edge.

Extra-nodes: Compressing Intervals
The adjacency list with copy blocks is completed by encoding the extra-nodes, exploiting their consecutivity:
 • Intervals: represented by their left extreme and length
 • Interval length: decremented by Lmin = 2
 • Residuals: differences between consecutive residuals, or w.r.t. the source
Examples:
  0    = (15 − 15) * 2        (positive)
  2    = (23 − 19) − 2        (jump ≥ 2)
  600  = (316 − 16) * 2
  3    = |13 − 15| * 2 − 1    (negative)
  3018 = 3041 − 22 − 1
(a small code sketch of the gap and copy-list encodings follows below)
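To make the scheme concrete, here is a small sketch (my own simplification with made-up node numbers, not the WebGraph code) of the gap encoding of a successor list and of a copy-list with respect to a reference node:

    def gaps(x, succ):
        # successor list of node x -> {s1 - x, s2 - s1 - 1, ..., sk - s(k-1) - 1}
        out, prev = [], None
        for s in succ:
            out.append(s - x if prev is None else s - prev - 1)
            prev = s
        return out

    def copy_list(succ, ref_succ):
        # one bit per successor of the reference: is it also a successor of this node?
        sx, rx = set(succ), set(ref_succ)
        bits = [1 if s in sx else 0 for s in ref_succ]
        extra = [s for s in succ if s not in rx]   # residual successors, to be gap-encoded
        return bits, extra

    # gaps(17, [16, 18, 19, 23]) == [-1, 1, 0, 3]          (negative entries get a special code)
    # copy_list([16, 18, 19, 23], [13, 16, 18, 23]) == ([0, 1, 1, 1], [19])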

Algoritmi per IR

Compression of file collections

Background
[Figure: a sender transmits data to a receiver; the receiver may already hold some knowledge about the data.]
 • network links are getting faster and faster, but
 • many clients are still connected by fairly slow links (mobile?)
 • people wish to send more and more data
How can we make this transparent to the user?

Two standard techniques
 • caching: “avoid sending the same object again”
    - done on the basis of whole objects
    - only works if objects are completely unchanged
    - What about objects that are slightly changed?
 • compression: “remove redundancy in transmitted data”
    - avoid repeated substrings in the data
    - can be extended to the history of past transmissions (at some overhead)
    - What if the sender has never seen the data at the receiver?

Types of Techniques
 • Common knowledge between sender & receiver
    - Unstructured file: delta compression   [diff, zdelta, REBL,…]
 • “Partial” knowledge
    - Unstructured files: file synchronization   [rsync, zsync]
    - Record-based data: set reconciliation

Formalization
 • Delta compression: compress file f deploying file f', or compress a group of files. E.g. speed up web access by sending the differences between the requested page and the ones available in cache.
 • File synchronization: the client updates its old file f_old with the f_new available on a server. E.g. mirroring, shared crawling, content distribution networks.
 • Set reconciliation: the client updates a structured old file f_old with the f_new available on a server. E.g. update of contacts or appointments, intersect IL in a P2P search engine.

Z-delta compression (one-to-one)
Problem: we have two files f_known and f_new, and the goal is to compute a file f_d of minimum size such that f_new can be derived from f_known and f_d.
 • Assume that block moves and copies are allowed
 • Find an optimal covering set of f_new based on f_known
 • The LZ77 scheme provides an efficient, optimal solution: f_known is the “previously encoded text”; compress the concatenation f_known·f_new starting from f_new
 • zdelta is one of the best implementations
            Emacs size   Emacs time
uncompr     27Mb         ---
gzip         8Mb         35 secs
zdelta     1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: a pair of proxies located on the two sides of the slow link use a proprietary protocol to increase performance over this link.
[Figure: Client ↔ client-side proxy — slow link, delta-encoding — server-side proxy ↔ web (fast link); both proxies keep a reference copy, the request travels to the web and only the delta of the page crosses the slow link.]
Use zdelta to reduce traffic:
 • the old version is available at both proxies
 • restricted to pages already visited (30% hits), URL-prefix match
 • small cache

Cluster-based delta compression
Problem: we wish to compress a group of files F
 • Useful on a dynamic collection of web pages, back-ups, …
 • Apply pairwise zdelta: find for each f ∈ F a good reference
 • Reduction to the Min Branching problem on DAGs:
    - build a weighted graph G_F: nodes = files, edge weights = zdelta-sizes
    - insert a dummy node connected to all files, with weights = gzip-coded sizes
    - compute the min branching = directed spanning tree of minimum total cost, covering G's nodes
[Figure: example with dummy node 0 and files 1, 2, 3, 5; edge weights such as 620, 2000, 220, 123, 20.]
            space    time
uncompr     30Mb     ---
tgz         20%      linear
THIS        8%       quadratic

Improvement    (what about many-to-one compression of a group of files?)
Problem: constructing G is very costly: n² edge calculations (zdelta executions).
 • We wish to exploit some pruning approach:
    - Collection analysis: cluster the files that appear similar and are thus good candidates for zdelta-compression; build a sparse weighted graph G'_F containing only the edges between those pairs of files.
    - Assign weights: estimate appropriate edge weights for G'_F, thus saving zdelta executions. Nonetheless, strictly n² time.
            space    time
uncompr     260Mb    ---
tgz         12%      2 mins
THIS        8%       16 mins

Algoritmi per IR

File Synchronization

File synch: the problem
[Figure: the Client holds f_old, the Server holds f_new; the client requests an update.]
 • the client wants to update an out-dated file
 • the server has the new file but does not know the old file
 • update without sending the entire f_new (using similarity)
 • rsync: file synch tool, distributed with Linux
Delta compression is a sort of local synch, since the server has both copies of the files.

The rsync algorithm
[Figure: the Client sends the hashes of the blocks of f_old; the Server replies with an encoded file built from literals and references to matching blocks.]
 • simple, widely used, single roundtrip
 • optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals (a rolling-checksum sketch follows below)
 • choice of block size is problematic (default: max{700, √n} bytes)
 • not good in theory: the granularity of changes may disrupt the use of blocks
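A tiny sketch (mine) of a rolling weak checksum in the spirit of rsync's 4-byte rolling hash; the exact constants and formula of rsync are not reproduced here, only the O(1) sliding update that makes scanning every text position affordable.

    M = 1 << 16

    def weak_checksum(block):
        a = sum(block) % M
        b = sum((len(block) - i) * x for i, x in enumerate(block)) % M
        return a, b                                  # the (a, b) pair fits in 4 bytes

    def roll(a, b, out_byte, in_byte, blocksize):
        # slide the window one byte to the right in O(1) time
        a = (a - out_byte + in_byte) % M
        b = (b - blocksize * out_byte + a) % M
        return a, b

    # weak_checksum(b"bcd") == roll(*weak_checksum(b"abc"), ord('a'), ord('d'), 3)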

Rsync: some experiments
Compressed size in KB (slightly outdated numbers):
            gcc       emacs
total       27288     27326
gzip         7563      8577
zdelta        227      1431
rsync         964      4452
Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync
 • The server sends the hashes (unlike the client in rsync); the client checks them.
 • The server deploys the common f_ref to compress the new f_tar (rsync compresses just f_tar).

A multi-round protocol
 • k blocks of n/k elements each, log(n/k) levels.
 • If the distance is k, then on each level at most k hashes do not find a match in the other file.
 • The communication complexity is O(k lg n lg(n/k)) bits.

Next lecture

Set reconciliation
Problem: given two sets S_A and S_B of integer values located on two machines A and B, determine the difference between the two sets at one or both of the machines.
Requirements: the cost should be proportional to the size k of the difference, where k may or may not be known in advance to the parties.
Note:
 • set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 • Word-based indexes, where a notion of “word” must be devised!
   » Inverted files, Signature files, Bitmaps.
 • Full-text indexes, with no constraint on texts and queries!
   » Suffix Array, Suffix tree, String B-tree, ...

How do we solve Prefix Search?  A trie, or an array of string pointers.
What about Substring Search?

Basic notation and facts
Pattern P occurs at position i of T
iff
P is a prefix of the i-th suffix of T (i.e. of T[i,N]).
Occurrences of P in T = all suffixes of T having P as a prefix.
Example: P = si, T = mississippi  →  occurrences at positions 4 and 7.
SUF(T) = sorted set of suffixes of T.

Reduction: from substring search to prefix search (over the suffixes of T).

The Suffix Tree
[Figure: suffix tree of T# = mississippi#. Internal edges carry labels such as i, s, p, si, ssi, i#, pi#, ppi#, mississippi#; the 12 leaves are labelled with the starting positions 1..12 of the corresponding suffixes.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.
Storing SUF(T) explicitly would take Θ(N²) space; the suffix array keeps only the starting positions (suffix pointers):

SA     SUF(T)
12     #
11     i#
 8     ippi#
 5     issippi#
 2     ississippi#
 1     mississippi#
10     pi#
 9     ppi#
 7     sippi#
 4     sissippi#
 6     ssippi#
 3     ssissippi#

T = mississippi#
Suffix Array space:
 • SA: Θ(N log2 N) bits
 • text T: N chars
 ⇒ in practice, a total of ≈ 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step.
Example: P = si, T = mississippi#. Compare P with the suffix starting at SA[mid]: if P is larger, recurse on the right half; if P is smaller, recurse on the left half (a sketch of this search follows below).

Suffix Array search:
 • O(log2 N) binary-search steps
 • each step takes O(p) char comparisons
 ⇒ overall, O(p log2 N) time
 • improvable to O(p + log2 N) [Manber-Myers, ’90]; the log N factor can be reduced to log |S| [Cole et al., ’06]
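A minimal sketch (mine) of the suffix-array construction by plain sorting and of the indirect binary search described above; it returns the SA range of the suffixes prefixed by P.

    def suffix_array(T):
        return sorted(range(len(T)), key=lambda i: T[i:])    # simple but Θ(n^2 log n)-ish

    def sa_range(T, SA, P):
        lo, hi = 0, len(SA)
        while lo < hi:                                       # leftmost suffix >= P
            mid = (lo + hi) // 2
            if T[SA[mid]:SA[mid] + len(P)] < P:
                lo = mid + 1
            else:
                hi = mid
        left, hi = lo, len(SA)
        while lo < hi:                                       # leftmost suffix not prefixed by P
            mid = (lo + hi) // 2
            if T[SA[mid]:SA[mid] + len(P)] <= P:
                lo = mid + 1
            else:
                hi = mid
        return left, lo                                      # occurrences are SA[left:lo]

    T = "mississippi#"
    SA = suffix_array(T)            # [11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2] (0-based)
    l, r = sa_range(T, SA, "si")
    # SA[l:r] == [6, 3]: 0-based positions of the occurrences (the slides use 1-based 7 and 4)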

Locating the occurrences
Once the binary search has located the SA range of the suffixes prefixed by P, all the occurrences are contiguous in SA. For P = si and T = mississippi# the range contains the two entries 7 (sippi...) and 4 (sissippi...), hence occ = 2.
Conceptually, the range can be delimited by searching for si# and si$, where # is smaller and $ is larger than every symbol of S.

Suffix Array search:  O(p + log2 N + occ) time
Suffix Trays:  O(p + log2 |S| + occ)   [Cole et al., ‘06]
String B-tree   [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays   [Ciriani et al., ’02]

Text mining
Lcp[1, N−1] = longest common prefix between suffixes adjacent in SA.
For T = mississippi#, SA = 12 11 8 5 2 1 10 9 7 4 6 3 and the Lcp values are 0 0 1 4 0 0 1 0 2 1 3 (e.g. the adjacent suffixes issippi# and ississippi# share a prefix of length 4). A sketch computing Lcp follows below.
 • How long is the common prefix between T[i,...] and T[j,...]? It is the minimum of the subarray Lcp[h, k−1] such that SA[h] = i and SA[k] = j.
 • Does there exist a repeated substring of length ≥ L? Search for an entry Lcp[i] ≥ L.
 • Does there exist a substring of length ≥ L occurring ≥ C times? Search for a window Lcp[i, i+C−2] whose entries are all ≥ L.
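For completeness, a straightforward sketch (mine; quadratic in the worst case, unlike Kasai's linear-time algorithm) computing the Lcp array by comparing suffixes adjacent in SA:

    def lcp_array(T, SA):
        # lcp[i] = length of the longest common prefix of the suffixes SA[i-1] and SA[i]
        lcp = [0] * len(SA)
        for i in range(1, len(SA)):
            a, b, l = SA[i - 1], SA[i], 0
            while a + l < len(T) and b + l < len(T) and T[a + l] == T[b + l]:
                l += 1
            lcp[i] = l
        return lcp

    # "Is there a repeated substring of length >= L?"  <=>  any(v >= L for v in lcp_array(T, SA))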


Slide 222

No much worse than Huffman
...but it may be far better

MTF: how good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
Put S in the front and consider the cost of encoding:
S

O ( S log S ) 



nx

g(p - p )

x 1 i  2

x

x

i

i -1

S

By Jensen’s:

 O ( S log S ) 

n
x 1

x

[ 2 * log

N

 1]

nx

 O ( S log S )  N * [ 2 * H 0 ( X )  1]
L a [ mtf ]  2 * H 0 ( X )  O (1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1n 2n 3n… nn 

There is a memory

Huff(X) = Q(n2 log n) > Rle(X) = Q(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.
1.0

c = .3

0.7

i -1

f (i ) 



p( j)

j 1

b = .5
0.2
0.0

f(a) = .0, f(b) = .2, f(c) = .7
a = .2

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
1.0

0.7
c = .3

0.7

c = .3
0.55

0.0

a = .2

0.3
0.2

c = .3

0.27
b = .5

b = .5
0.2

0.3

a = .2

b = .5
0.22

a = .2

0.2

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l 0  0 l i  l i -1  s i -1 * f c i 

s0  1

s i  s i -1 * p c i 

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

n

sn 

 p c 
i

i 1

The interval for a message sequence will be called the
sequence interval

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
1.0

0.7
c = .3

0.7
0.49

b = .5

c = .3

0.55
0.49

0.2

0.0

0.55

b = .5

0.3
a = .2

The message is bbc.

0.2

0.49
0.475

c = .3

b = .5
0.35

a = .2

0.3

a = .2

Representing a real number
Binary fractional representation:
. 75

 . 11

1/ 3

 . 01 01

11 / 16

 . 1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
m in

m ax

interval

.11

.110

.111

[.75 ,1.0 )

.101

.1010

.1011

[.625 , .75 )

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem: operations on arbitrary-precision real numbers are expensive.
Key ideas of the integer version:
 Keep integers in range [0..R) where R = 2^k
 Use rounding to generate integer intervals
 Whenever the sequence interval falls into the top,
  bottom or middle half, expand the interval by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
 If l ≥ R/2 then (top half)
  Output 1 followed by m 0s; m = 0
  Message interval is expanded by 2
 If u < R/2 then (bottom half)
  Output 0 followed by m 1s; m = 0
  Message interval is expanded by 2
 If l ≥ R/4 and u < 3R/4 then (middle half)
  Increment m
  Message interval is expanded by 2
 All other cases: just continue...
You find this at
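A rough sketch of the scaling step, assuming l and u are the integer endpoints of the current interval in [0,R), m counts the pending "middle half" expansions, and emit_bit is a hypothetical output routine (this is the standard renormalization, not code from the slides):

def rescale(l, u, m, R, emit_bit):
    # Expand the interval [l, u) while it lies entirely in the bottom, top or middle half of [0, R)
    while True:
        if u < R // 2:                         # bottom half: output 0 followed by m 1s
            emit_bit(0)
            for _ in range(m): emit_bit(1)
            m = 0
        elif l >= R // 2:                      # top half: output 1 followed by m 0s
            emit_bit(1)
            for _ in range(m): emit_bit(0)
            m = 0
            l -= R // 2; u -= R // 2
        elif l >= R // 4 and u < 3 * R // 4:   # middle half: just remember it in m
            m += 1
            l -= R // 4; u -= R // 4
        else:
            return l, u, m
        l *= 2; u *= 2                         # the message interval is expanded by 2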

Arithmetic ToolBox
As a state machine: the ATB receives the current interval (L,s), that is [L, L+s),
a symbol c, and the distribution (p1,....,pS), and returns the new interval (L',s').

 (L,s), c, (p1,....,pS)   --ATB-->   (L',s')

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.
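A toy sketch of the counting side of PPM, keeping statistics for every context size up to k; the escape handling (shrink the context when cond_prob returns None and code an escape) is the mechanism described in prose above, and the function names are ours:

from collections import defaultdict

def ppm_counts(text, k):
    # counts[context][symbol] = number of times `symbol` followed `context`,
    # for every context length 0..k
    counts = defaultdict(lambda: defaultdict(int))
    for i, c in enumerate(text):
        for order in range(0, k + 1):
            if i - order >= 0:
                counts[text[i - order:i]][c] += 1
    return counts

def cond_prob(counts, context, symbol):
    # count-based estimate, e.g. p('e' | 'th') = count(th->e) / count(th->*) = 7/12 in the slide's example
    seen = counts.get(context, {})
    total = sum(seen.values())
    return seen.get(symbol, 0) / total if total else None   # None: context never seen -> escape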

PPM + Arithmetic ToolBox
The PPM model drives the ToolBox: at each step the ATB receives the current
interval (L,s), the next symbol (c or esc), and the conditional distribution
p[ s | context ], and it returns the new interval (L',s').

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
String = ACCBACCACBA B        k = 2

Context    Counts
Empty      A = 4   B = 2   C = 5   $ = 3

Context    Counts
A          C = 3   $ = 1
B          A = 2   $ = 1
C          A = 1   B = 2   C = 2   $ = 3

Context    Counts
AC         B = 1   C = 2   $ = 2
BA         C = 1   $ = 1
CA         C = 1   $ = 1
CB         A = 2   $ = 1
CC         A = 1   B = 1   $ = 2
You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves
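A naive LZ77 sketch with a fixed-size window (W is a hypothetical parameter, quadratic search, no optimizations); it emits the same (distance, length, next-char) triples as the example that follows:

def lz77_encode(text, W=6):
    out = []
    cur = 0
    while cur < len(text):
        best_d, best_len = 0, 0
        for start in range(max(0, cur - W), cur):            # candidate copy sources in the window
            l = 0
            while cur + l < len(text) - 1 and text[start + l] == text[cur + l]:
                l += 1                                        # overlap beyond the cursor is allowed
            if l > best_len:
                best_d, best_len = cur - start, l
        nxt = text[cur + best_len]                            # next char beyond the longest match
        out.append((best_d, best_len, nxt))
        cur += best_len + 1                                   # advance by len + 1
    return out

# lz77_encode("aacaacabcabaaac") -> [(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')]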

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if len > d ? (the copy overlaps the text still to be compressed)
 E.g. seen = abcd, next codeword is (2,9,e)
 Simply copy starting at the cursor:
  for (i = 0; i < len; i++)
    out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
 (0, position, length) or (1, char)
 Typically uses the second format if length < 3.
Special greedy: possibly use a shorter match so
that the next match is better
Hash table to speed up the search for triplets
Triples are coded with Huffman's code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids
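A compact LZ78 sketch, using a dict keyed by strings instead of an explicit trie; it produces the same (id, char) output as the coding example that follows:

def lz78_encode(text):
    dic = {"": 0}                  # id 0 = empty string
    out = []
    S = ""
    for c in text:
        if S + c in dic:           # extend the current longest match
            S += c
        else:
            out.append((dic[S], c))
            dic[S + c] = len(dic)  # add Sc to the dictionary with the next id
            S = ""
    if S:                          # flush a pending match with no following char
        out.append((dic[S], ""))
    return out

# lz78_encode("aabaacabcabcb") -> [(0,'a'), (1,'b'), (1,'a'), (0,'c'), (2,'c'), (5,'b')]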

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
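A minimal LZW encoder sketch. It initializes the dictionary with the 256 ASCII entries and numbers new entries from 256 upwards; note that the slide's example assumes a = 112, b = 113, c = 114 (not real ASCII), so the literal codes below differ while the structure is the same:

def lzw_encode(text):
    dic = {chr(i): i for i in range(256)}    # dictionary initialized with the 256 ASCII entries
    out = []
    S = text[0]
    for c in text[1:]:
        if S + c in dic:
            S += c                           # extend the longest match
        else:
            out.append(dic[S])               # emit only the id: the extra char c is NOT sent
            dic[S + c] = len(dic)            # ...but Sc is still added to the dictionary
            S = c
    out.append(dic[S])
    return out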

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input   Output (text so far)        Dict
112     a
112     a a                         256 = aa
113     a a b                       257 = ab
256     a a b a a                   258 = ba
114     a a b a a c                 259 = aac
257     a a b a a c a b             260 = ca
261     ?                           261 is not yet in the dictionary: the decoder is one step behind
261     a a b a a c a b a b a       261 = aba  (resolved one step later: previous match "ab" + its first char)
114     a a b a a c a b a b a c     262 = abac

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a certain size (used in GIF)
 Throw the dictionary away when it is no longer effective at compressing (e.g. compress)
 Throw the least-recently-used (LRU) entry away when it reaches a certain size
  (used in BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform                                    (1994)
Let us be given a text T = mississippi#          (a famous example; real texts are much longer...)

All rotations of T:        Sort the rows         F              L
mississippi#                                     # mississipp   i
ississippi#m                                     i #mississip   p
ssissippi#mi                                     i ppi#missis   s
sissippi#mis                                     i ssippi#mis   s
issippi#miss                                     i ssissippi#   m
ssippi#missi                                     m ississippi   #
sippi#missis                                     p i#mississi   p
ippi#mississ                                     p pi#mississ   i
ppi#mississi                                     s ippi#missi   s
pi#mississip                                     s issippi#mi   s
i#mississipp                                     s sippi#miss   i
#mississippi                                     s sissippi#m   i

A useful tool: L → F mapping
 (same BWT matrix as above: F = first column, L = last column; the rotations in between are unknown to the decoder)

How do we map L's chars onto F's chars ?
... Need to distinguish equal chars in F...

Take two equal chars of L
Rotate rightward their rows
 Same relative order !!

The BWT is invertible
 (figure: the BWT matrix of mississippi# again; the decoder knows only F = #iiiimppssss and L = ipssm#pissii)
Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
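A sketch of the inversion in Python, with LF computed by the usual counting argument (number of smaller chars plus the rank among equal chars); it follows the two properties above and assumes the sentinel is the unique smallest character, it is not code from the slides:

def bwt_invert(L, sentinel='#'):
    n = len(L)
    smaller = {c: sum(1 for x in L if x < c) for c in set(L)}
    seen, LF = {}, []
    for c in L:
        LF.append(smaller[c] + seen.get(c, 0))   # property 1: equal chars keep their relative order
        seen[c] = seen.get(c, 0) + 1
    # reconstruct T backward: row 0 starts with the sentinel, and L[r] precedes F[r] in T (property 2)
    T = [sentinel] * n
    r = 0
    for i in range(n - 2, -1, -1):
        T[i] = L[r]
        r = LF[r]
    return ''.join(T)

# bwt_invert("ipssm#pissii") -> "mississippi#"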

How to compute the BWT ?
SA      sorted rotations              L (last char)
12      #mississipp | i
11      i#mississip | p
 8      ippi#missis | s
 5      issippi#mis | s
 2      ississippi# | m
 1      mississippi | #
10      pi#mississi | p
 9      ppi#mississ | i
 7      sippi#missi | s
 4      sissippi#mi | s
 6      ssippi#miss | i
 3      ssissippi#m | i

We said that: L[i] precedes F[i] in T
Given SA and T, we have L[i] = T[SA[i]-1]      (e.g. L[3] = T[SA[3]-1] = T[7])

How to construct SA from T ?
Input: T = mississippi#

SA      sorted suffixes
12      #
11      i#
 8      ippi#
 5      issippi#
 2      ississippi#
 1      mississippi#
10      pi#
 9      ppi#
 7      sippi#
 4      sissippi#
 6      ssippi#
 3      ssissippi#

Elegant but inefficient. Obvious inefficiencies:
 • Θ(n^2 log n) time in the worst-case
 • Θ(n log n) cache misses or I/O faults
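The elegant-but-inefficient construction as a short sketch: it sorts the explicit suffixes, hence the Θ(n^2 log n) worst case noted above (positions are 1-indexed as on the slides):

def suffix_array(T):
    # sort suffix starting positions by comparing the suffixes themselves
    return sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])

def bwt(T):
    SA = suffix_array(T)
    # L[i] = T[SA[i]-1]; when SA[i] = 1 the preceding char wraps around to the last char of T
    return ''.join(T[i - 2] if i > 1 else T[-1] for i in SA)

# suffix_array("mississippi#") -> [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
# bwt("mississippi#")          -> "ipssm#pissii"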

Many algorithms, now...

Compressing L seems promising...
Key observation:
 L is locally homogeneous
  L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L
 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii         # at 16
Mtf-list = [i,m,p,s]
Mtf  = 020030000030030200300300000100000
Mtf  = 030040000040040300400400000200000       Bin(6)=110, Wheeler's code
RLE0 = 03141041403141410210                    Alphabet |S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size
 1 trillion pages available (Google 7/08)
 5-40K per page => hundreds of terabytes
 Size grows every day!!

Change
 8% new pages, 25% new links change weekly
 Life time of about 10 days

The Bow Tie

Some definitions
 Weakly connected components (WCC)
  Set of nodes such that from any node one can reach any other node via an undirected path.
 Strongly connected components (SCC)
  Set of nodes such that from any node one can reach any other node via a directed path.

Observing Web Graph
 We do not know which percentage of it we know
 The only way to discover the graph structure of the
web as hypertext is via large scale crawls
 Warning: the picture might be distorted by
  Size limitation of the crawl
  Crawling rules
  Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?
 Largest artifact ever conceived by humans
 Exploit the structure of the Web for
  Crawl strategies
  Search
  Spam detection
  Discovering communities on the web
  Classification/organization
 Predict the evolution of the Web
 Sociological understanding

Many other large graphs…
 Physical network graph
  V = Routers
  E = communication links
 The "cosine" graph (undirected, weighted)
  V = static web pages
  E = semantic distance between pages
 Query-Log graph (bipartite, weighted)
  V = queries and URLs
  E = (q,u) if u is a result for q, and has been clicked by some user who issued q
 Social graph (undirected, unweighted)
  V = users
  E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)
 V = URLs, E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN & no OUT)
Three key properties:
 Skewed distribution: Pb that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution
 Altavista crawl (1999) and WebBase crawl (2001): the indegree follows a power law distribution

  Pr[ in-degree(u) = k ]  ∝  1/k^a ,    a ≈ 2.1

A Picture of the Web Graph
 (figure only)

Definition
Directed graph G = (V,E)
 V = URLs, E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN, no OUT)
Three key properties:
 Skewed distribution: Pb that a node has x links is 1/x^a, a ≈ 2.1
 Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).
 Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
 21 millions of pages, 150 millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed adjacency list  vs  adjacency list with compressed gaps   (exploits locality)

Successor list S(x) = {s1 - x, s2 - s1 - 1, ..., sk - sk-1 - 1}

For negative entries:
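A tiny sketch of the gap transformation above (the treatment of the negative first entry, truncated on the slide, is left out; the successor list in the comment is a hypothetical example):

def gaps(x, successors):
    # successors must be sorted; the first gap is relative to the node id x, the others to the previous successor
    s = sorted(successors)
    return [s[0] - x] + [s[i] - s[i - 1] - 1 for i in range(1, len(s))]

# gaps(15, [13, 15, 16, 17, 18, 19, 23, 24, 203]) -> [-2, 1, 0, 0, 0, 0, 3, 0, 178]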

Copy-lists
Reference chains, possibly limited in length

Uncompressed adjacency list  vs  adjacency list with copy lists   (exploits similarity)
 Each bit of y's copy-list tells whether the corresponding successor of the reference x
is also a successor of y;
 The reference index is chosen in [0,W] so as to give the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with copy lists  vs  adjacency list with copy blocks   (RLE on bit sequences)
 The first copy block is 0 if the copy list starts with 0;
 The last block is omitted (we know the length…);
 The length is decremented by one for all blocks

This is a Java and C++ lib    (≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with copy blocks  →  exploit consecutivity in the extra-nodes
 Intervals: use their left extreme and length
 Interval length: decremented by Lmin = 2
 Residuals: differences between consecutive residuals, or wrt the source

Example of encoded values:
 0 = (15-15)*2 (positive)
 2 = (23-19)-2 (jump >= 2)
 600 = (316-16)*2
 3 = |13-15|*2-1 (negative)
 3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
 (setting: a sender transmits data to a receiver; the receiver may already hold some knowledge about the data)
 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques
 caching: "avoid sending the same object again"
  done on the basis of objects
  only works if objects completely unchanged
  How about objects that are slightly changed?
 compression: "remove redundancy in transmitted data"
  avoid repeated substrings in data
  can be extended to history of past transmissions (overhead)
  How if the sender has never seen data at receiver ?

Types of Techniques
 Common knowledge between sender & receiver
  Unstructured file: delta compression
 "partial" knowledge
  Unstructured files: file synchronization
  Record-based data: set reconciliation

Formalization
 Delta compression    [diff, zdelta, REBL,…]
  Compress file f deploying file f'
  Compress a group of files
  Speed-up web access by sending differences between the requested
page and the ones available in cache
 File synchronization    [rsync, zsync]
  Client updates old file f_old with f_new available on a server
  Mirroring, Shared Crawling, Content Distr. Net
 Set reconciliation
  Client updates structured old file f_old with f_new available on a server
  Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression    (one-to-one)

Problem: We have two files f_known and f_new and the goal is
to compute a file f_d of minimum size such that f_new can
be derived from f_known and f_d
 Assume that block moves and copies are allowed
 Find an optimal covering set of f_new based on f_known
 The LZ77-scheme provides an efficient, optimal solution
  f_known is "previously encoded text": compress the concatenation f_known f_new,
emitting output only from f_new onwards
 zdelta is one of the best implementations
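Not zdelta itself, but the same idea can be sketched with zlib's preset dictionary, which lets the encoder copy substrings from f_known while compressing f_new (file names are hypothetical, and zlib's window limits how much of f_known is actually usable):

import zlib

def delta_compress(f_known: bytes, f_new: bytes) -> bytes:
    # f_known acts as "previously encoded text": LZ77 copies may refer into it
    comp = zlib.compressobj(zdict=f_known)
    return comp.compress(f_new) + comp.flush()

def delta_decompress(f_known: bytes, f_delta: bytes) -> bytes:
    decomp = zlib.decompressobj(zdict=f_known)
    return decomp.decompress(f_delta) + decomp.flush()

# old = open("emacs-old.tar", "rb").read(); new = open("emacs-new.tar", "rb").read()
# delta = delta_compress(old, new)      # typically far smaller than compressing new alone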
           Emacs size   Emacs time
uncompr    27Mb         ---
gzip       8Mb          35 secs
zdelta     1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: a pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

 Client <--slow link--> proxy pair <--fast link--> web
 (the reference page is cached at both proxies; only the delta-encoding travels over the slow link)

Use zdelta to reduce traffic:
 Old version available at both proxies
 Restricted to pages already visited (30% hits), URL-prefix match
 Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F
 Useful on a dynamic collection of web pages, back-ups, …
 Apply pairwise zdelta: find for each f ∈ F a good reference
 Reduction to the Min Branching problem on DAGs
  Build a weighted graph G_F: nodes = files, edge weights = zdelta-size
  Insert a dummy node connected to all nodes, whose edge weights are the gzip-coding sizes
  Compute the min branching = directed spanning tree of min total cost, covering G's nodes

 (figure: an example graph on a few files plus the dummy node 0, with edge weights such as 20, 123, 220, 620, 2000)

           space    time
uncompr    30Mb     ---
tgz        20%      linear
THIS       8%       quadratic
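A sketch of the reduction, assuming networkx is available and the pairwise costs have already been computed (zdelta_size and gzip_size are hypothetical callables, not part of any library):

import networkx as nx

def best_references(files, zdelta_size, gzip_size):
    # Node 0 is the dummy "compress from scratch" node;
    # an edge (i, j) means "encode file j using file i as reference".
    G = nx.DiGraph()
    for j, fj in enumerate(files, start=1):
        G.add_edge(0, j, weight=gzip_size(fj))               # dummy node: plain gzip cost
        for i, fi in enumerate(files, start=1):
            if i != j:
                G.add_edge(i, j, weight=zdelta_size(fi, fj))
    branching = nx.minimum_spanning_arborescence(G)           # directed spanning tree of min total cost
    return [(u, v) for u, v in branching.edges]               # (reference, target) pairs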

Improvement
What about many-to-one compression?    (group of files)

Problem: Constructing G is very costly: n^2 edge calculations (zdelta executions)
 We wish to exploit some pruning approach
  Collection analysis: Cluster the files that appear similar and thus are good
candidates for zdelta-compression. Build a sparse weighted graph G'_F
containing only edges between those pairs of files
  Assign weights: Estimate appropriate edge weights for G'_F, thus saving
zdelta executions. Nonetheless, still n^2 time

           space    time
uncompr    260Mb    ---
tgz        12%      2 mins
THIS       8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
 (setting: the Client holds f_old, the Server holds f_new; the client sends a request and receives an update)
 client wants to update an out-dated file
 server has new file but does not know the old file
 update without sending entire f_new (using similarity)
 rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch,
since the server has both copies of the files

The rsync algorithm
 (figure: the Client sends block hashes of f_old; the Server replies with the encoded file built from f_new)

The rsync algorithm (contd)
 simple, widely used, single roundtrip
 optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
 choice of block size problematic (default: max{700, √n} bytes)
 not good in theory: granularity of changes may disrupt use of blocks
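A sketch of the rolling-hash matching that makes the single roundtrip cheap; the checksum below is an Adler-style sum rather than rsync's exact one, and the strong-hash check (MD5 on the slide) is omitted:

def weak_hash(block):
    # Adler-style checksum of a block of bytes: a plain sum and a position-weighted sum
    a = sum(block) % 65536
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % 65536
    return a, b

def roll(a, b, out_byte, in_byte, blocklen):
    # slide the window one byte to the right in O(1): remove out_byte, add in_byte
    a = (a - out_byte + in_byte) % 65536
    b = (b - blocklen * out_byte + a) % 65536
    return a, b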

Rsync: some experiments

           gcc size   emacs size
total      27288      27326
gzip       7563       8577
zdelta     227        1431
rsync      964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync
 Server sends hashes (unlike the client in rsync), the client checks them
 Server deploys the common f_ref to compress the new f_tar (rsync compresses just it).

A multi-round protocol
 k blocks of n/k elems, log n/k levels
 If the distance is k, then on each level at most k hashes do not find a match in the other file.
 The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff
P is a prefix of the i-th suffix of T (ie. T[i,N])

Occurrences of P in T = All suffixes of T having P as a prefix

 P = si      T = mississippi      occurrences at positions 4,7

SUF(T) = Sorted set of suffixes of T

Reduction: from substring search to prefix search

The Suffix Tree
 (figure: the suffix tree of T# = mississippi#, positions 1..12; edges carry labels such as
  #, i, p, s, si, ssi, i#, pi#, ppi#, mississippi#, and each leaf stores the starting position of its suffix)

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. The starting position of that range is the lexicographic position of P.

Storing SUF(T) explicitly takes Θ(N^2) space; the suffix array stores only the starting positions (suffix pointers):

T = mississippi#          SA      SUF(T)
                          12      #
                          11      i#
                           8      ippi#
                           5      issippi#
                           2      ississippi#
                           1      mississippi#
                          10      pi#
                           9      ppi#
                           7      sippi#
                           4      sissippi#
                           6      ssippi#
                           3      ssissippi#

Suffix Array space:
 • SA: Θ(N log2 N) bits
 • Text T: N chars
  In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step
 (figure: binary search over SA = 12 11 8 5 2 1 10 9 7 4 6 3 for P = si on T = mississippi#;
  when P is larger than the probed suffix move right, when P is smaller move left)

Suffix Array search
 • O(log2 N) binary-search steps
 • Each step takes O(p) char cmp
  overall, O(p log2 N) time
 Improvements: O(p + log2 N) [Manber-Myers, '90];  O(p + log2 |S|) [Cole et al, '06]
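A direct sketch of the O(p log2 N) search: two binary searches locate the contiguous range of suffixes prefixed by P (Prop 1 and 2); it materializes suffixes on the fly, so it is only meant for small texts, and it reuses the suffix_array sketch given earlier:

def sa_search(T, SA, P):
    # all occurrences of P in T (1-indexed), by binary search on the sorted suffixes
    n = len(SA)
    def suf(k):
        return T[SA[k] - 1:]
    lo, hi = 0, n                      # leftmost suffix >= P
    while lo < hi:
        mid = (lo + hi) // 2
        if suf(mid) < P: lo = mid + 1
        else: hi = mid
    left = lo
    lo, hi = left, n                   # leftmost suffix whose first |P| chars exceed P
    while lo < hi:
        mid = (lo + hi) // 2
        if suf(mid)[:len(P)] <= P: lo = mid + 1
        else: hi = mid
    return sorted(SA[left:lo])

# sa_search("mississippi#", suffix_array("mississippi#"), "si") -> [4, 7]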

Locating the occurrences
 (figure: the occurrences of P = si are the contiguous SA entries whose suffixes start with si,
  namely the entries 7 (sippi…) and 4 (sissippi…), so occ = 2)

Suffix Array search
 • O(p + log2 N + occ) time

 Suffix Trays: O(p + log2 |S| + occ)   [Cole et al., '06]
 String B-tree                         [Ferragina-Grossi, '95]
 Self-adjusting Suffix Arrays          [Ciriani et al., '02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

 Lcp = 0 0 1 4 0 0 1 0 2 1 3      SA = 12 11 8 5 2 1 10 9 7 4 6 3      T = mississippi#
 (e.g. the entry 4 comes from the adjacent suffixes issippi# and ississippi#)

• How long is the common prefix between T[i,...] and T[j,...] ?
  • Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  • Search for some Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  • Search for a window Lcp[i,i+C-2] whose entries are all ≥ L


Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1n 2n 3n… nn 

There is a memory

Huff(X) = Q(n2 log n) > Rle(X) = Q(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.
1.0

c = .3

0.7

i -1

f (i ) 



p( j)

j 1

b = .5
0.2
0.0

f(a) = .0, f(b) = .2, f(c) = .7
a = .2

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
1.0

0.7
c = .3

0.7

c = .3
0.55

0.0

a = .2

0.3
0.2

c = .3

0.27
b = .5

b = .5
0.2

0.3

a = .2

b = .5
0.22

a = .2

0.2

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l 0  0 l i  l i -1  s i -1 * f c i 

s0  1

s i  s i -1 * p c i 

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

n

sn 

 p c 
i

i 1

The interval for a message sequence will be called the
sequence interval

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
1.0

0.7
c = .3

0.7
0.49

b = .5

c = .3

0.55
0.49

0.2

0.0

0.55

b = .5

0.3
a = .2

The message is bbc.

0.2

0.49
0.475

c = .3

b = .5
0.35

a = .2

0.3

a = .2

Representing a real number
Binary fractional representation:
. 75

 . 11

1/ 3

 . 01 01

11 / 16

 . 1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
m in

m ax

interval

.11

.110

.111

[.75 ,1.0 )

.101

.1010

.1011

[.625 , .75 )

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use the previous k characters as the context.
 Makes use of conditional probabilities
 This is the changing distribution
Base the probabilities on counts:
  e.g. if th has been seen 12 times, followed by e 7 times, then the conditional probability is p(e|th) = 7/12.
Need to keep k small so that the dictionary does not get too large (typically less than 8).

PPM: Partial Matching
Problem: what do we do if we have not seen the context followed by the character before?
 Cannot code 0 probabilities!
The key idea of PPM is to reduce the context size if the previous match has not been seen.
 If the character has not been seen before with the current context of size 3, send an escape-msg and then try the context of size 2, then again an escape-msg and the context of size 1, ...
Keep statistics for each context size < k.
The escape is a special character with some probability.
 Different variants of PPM use different heuristics for this probability.

PPM + Arithmetic ToolBox
[Figure: PPM feeding the Arithmetic ToolBox: at each step the symbol s (either the char c or esc) is coded by the ATB with probability p[s | context], mapping the interval (L,s) to (L',s').]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
String = ACCBACCACBA, next symbol to encode: B        k = 2

Context        Counts
Empty          A = 4,  B = 2,  C = 5,  $ = 3
A              C = 3,  $ = 1
B              A = 2,  $ = 1
C              A = 1,  B = 2,  C = 2,  $ = 3
AC             B = 1,  C = 2,  $ = 2
BA             C = 1,  $ = 1
CA             C = 1,  $ = 1
CB             A = 2,  $ = 1
CC             A = 1,  B = 1,  $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm's step:
 Output the triple <d, len, c> where
   d   = distance of the copied string wrt the current position
   len = length of the longest match
   c   = next char in the text beyond the longest match
 Advance by len + 1

A buffer "window" has fixed length and moves over the text (a small sketch follows).
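A minimal Python sketch of this coding loop (illustrative: for simplicity the window is unbounded, which on the example string below happens to produce the same triples as the window-6 example that follows):

  def lz77_encode(t):
      # emit triples (d, len, c): copy len chars from distance d back, then the literal c
      i, out = 0, []
      while i < len(t):
          best_d, best_len = 0, 0
          for d in range(1, i + 1):                      # candidate copy distances
              l = 0
              while i + l < len(t) - 1 and t[i + l] == t[i - d + l]:
                  l += 1                                 # overlap with the text being coded is allowed
              if l > best_len:
                  best_d, best_len = d, l
          out.append((best_d, best_len, t[i + best_len]))
          i += best_len + 1                              # advance by len + 1
      return out

  print(lz77_encode("aacaacabcabaaac"))
  # -> [(0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), (3, 3, 'a'), (1, 2, 'c')]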

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids
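A minimal Python sketch of this loop (illustrative; the dictionary is kept as a plain dict from strings to ids instead of an explicit trie):

  def lz78_encode(t):
      dict_, next_id = {}, 1
      i, out = 0, []
      while i < len(t):
          s = ""
          while i < len(t) and s + t[i] in dict_:        # longest match S in the dictionary
              s += t[i]; i += 1
          c = t[i] if i < len(t) else ""                 # next character after the match
          out.append((dict_.get(s, 0), c))               # output (id of S, c)
          dict_[s + c] = next_id; next_id += 1           # add Sc to the dictionary
          i += 1
      return out

  print(lz78_encode("aabaacabcabcb"))
  # -> [(0,'a'), (1,'b'), (1,'a'), (0,'c'), (2,'c'), (5,'b')]  as in the example below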

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
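A minimal LZW encoder sketch (illustrative; the dictionary is seeded only with the three example chars, using the slide's ids a = 112, b = 113, c = 114, instead of all 256 ASCII entries):

  def lzw_encode(t):
      dict_ = {'a': 112, 'b': 113, 'c': 114}     # in practice: all 256 byte values
      next_id, i, out = 256, 0, []
      while i < len(t):
          s = t[i]
          while i + len(s) < len(t) and s + t[i + len(s)] in dict_:
              s += t[i + len(s)]                 # longest match S already in the dictionary
          out.append(dict_[s])                   # emit only the id, not the extra char
          if i + len(s) < len(t):
              dict_[s + t[i + len(s)]] = next_id # still add Sc to the dictionary
              next_id += 1
          i += len(s)
      return out

  print(lzw_encode("aabaacababacb"))
  # -> [112, 112, 113, 256, 114, 257, 261, 114, 113]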

LZW: Encoding Example
a a b a a c a b a b a c b

  Output       Dict.
  112 (a)      256 = aa
  112 (a)      257 = ab
  113 (b)      258 = ba
  256 (aa)     259 = aac
  114 (c)      260 = ca
  257 (ab)     261 = aba
  261 (aba)    262 = abac
  114 (c)      263 = cb

LZW: Decoding Example
  Input    Output                       Dict.
  112      a
  112      a a                          256 = aa
  113      a a b                        257 = ab
  256      a a b a a                    258 = ba
  114      a a b a a c                  259 = aac
  257      a a b a a c a b              260 = ca
  261      a a b a a c a b ?            261 = aba   (261 is not yet in the dictionary: the decoder is one step behind and resolves it as 'ab' + 'a')
  114      a a b a a c a b a b a c      262 = abac

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F                  L
#  mississipp  i
i  #mississip  p
i  ppi#missis  s
i  ssippi#mis  s
i  ssissippi#  m
m  ississippi  #
p  i#mississi  p
p  pi#mississ  i
s  ippi#missi  s
s  issippi#mi  s
s  sippi#miss  i
s  sissippi#m  i

T  →  L = BWT(T)     (Burrows-Wheeler, 1994)

A famous example

Much
longer...

A useful tool: the L → F mapping
[Figure: the same sorted-rotation matrix, with F = first column and L = last column; the text T is unknown to the decoder.]

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
[Figure: the sorted-rotation matrix again, with the F and L columns highlighted.]
Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}

How to compute the BWT ?
SA = 12 11 8 5 2 1 10 9 7 4 6 3
[Figure: the BWT matrix, i.e. the sorted rotations of T; its last column is L = i p s s m # p i s s i i.]
We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]
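A one-line sketch of this relation (illustrative; SA holds 1-based suffix starts as in the slide):

  def bwt_from_sa(t, sa):
      # L[i] = T[SA[i]-1]; when SA[i] = 1 the preceding char wraps around to the last one
      return "".join(t[s - 2] if s > 1 else t[-1] for s in sa)

  T  = "mississippi#"
  SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
  print(bwt_from_sa(T, SA))    # -> ipssm#pissii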

How to construct SA from T ?
SA      sorted suffixes
12      #
11      i#
8       ippi#
5       issippi#
2       ississippi#
1       mississippi#
10      pi#
9       ppi#
7       sippi#
4       sissippi#
6       ssippi#
3       ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Q(n2 log n) time in the worst-case
• Q(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/x^a,  a ≈ 2.1

The In-degree distribution
[Figure: in-degree distributions measured on the Altavista crawl (1999) and the WebBase crawl (2001); the in-degree follows a power-law distribution.]

  Pr[ in-degree(u) = k ]  ∝  1 / k^a ,    a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/x^a,  a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides and efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
             Emacs size    Emacs time
uncompr      27Mb          ---
gzip         8Mb           35 secs
zdelta       1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: example weighted graph over the files, with a dummy node 0 connected to all of them; edge weights are the zdelta sizes (gzip sizes on the dummy edges).]

             space    time
uncompr      30Mb     ---
tgz          20%      linear
THIS         8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
             space    time
uncompr      260Mb    ---
tgz          12%      2 mins
THIS         8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
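A toy sketch of the server side of this idea (illustrative only: real rsync uses a 4-byte rolling checksum plus an MD5 check per block, here Python's built-in hash stands in for both, and unmatched data is sent as single literals):

  def rsync_encode(f_new, block_hashes, B):
      # block_hashes: hash of each B-sized block of f_old -> block index (computed by the client)
      out, i = [], 0
      while i < len(f_new):
          if len(f_new) - i >= B and hash(f_new[i:i + B]) in block_hashes:
              out.append(("copy", block_hashes[hash(f_new[i:i + B])]))   # reuse a client block
              i += B
          else:
              out.append(("literal", f_new[i]))                          # send one literal char
              i += 1
      return out

  f_old, f_new, B = "the quick brown fox", "the quick red fox", 4
  hashes = {hash(f_old[j:j + B]): j // B for j in range(0, len(f_old), B)}
  print(rsync_encode(f_new, hashes, B))    # two block copies, then literals for "k red fox"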

Rsync: some experiments

          gcc     emacs
total     27288   27326
gzip      7563    8577
zdelta    227     1431
rsync     964     4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Figure: the suffix tree of T# = mississippi#, with edge labels (e.g. ssi, ppi#, si, mississippi#) and leaves labelled by the starting positions 1..12 of the corresponding suffixes.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

T = mississippi#

SA      SUF(T)
12      #
11      i#
8       ippi#
5       issippi#
2       ississippi#
1       mississippi#
10      pi#
9       ppi#
7       sippi#
4       sissippi#
6       ssippi#
3       ssissippi#

Storing SUF(T) explicitly would take Θ(N²) space; the suffix array stores only the suffix pointers.
• SA: Θ(N log2 N) bits
• Text T: N chars
→ In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
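A minimal sketch of the indirected binary search (illustrative; each comparison costs O(p) chars, the suffix slicing is only for readability):

  def sa_range(t, sa, p):
      # return [lo, hi): the SA positions whose suffixes have P as a prefix (SA is 1-based)
      def first(bound):                            # first index whose suffix is >= bound
          lo, hi = 0, len(sa)
          while lo < hi:
              mid = (lo + hi) // 2
              if t[sa[mid] - 1:] < bound:
                  lo = mid + 1
              else:
                  hi = mid
          return lo
      return first(p), first(p + chr(0x10FFFF))    # every suffix starting with P is < P + max_char

  T  = "mississippi#"
  SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
  lo, hi = sa_range(T, SA, "si")
  print([SA[i] for i in range(lo, hi)])            # -> [7, 4]: the occurrences of "si"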

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp = 0 0 1 4 0 0 1 0 2 1 3
SA  = 12 11 8 5 2 1 10 9 7 4 6 3
T = mississippi#
(e.g. the adjacent suffixes issippi# and ississippi# share a prefix of length 4)
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does it exist a repeated substring of length ≥ L ?
• Search for Lcp[i] ≥ L
• Does it exist a substring of length ≥ L occurring ≥ C times ?
• Search for Lcp[i,i+C-2] whose entries are ≥ L


Slide 224

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

        4K    8K    16K   32K    128K   256K   512K   1M
  n^3   22s   3m    26m   3.5h   28h    --     --     --
  n^2   0     0     0     1s     26s    106s   7m     28m

An optimal solution
We assume every subsum ≠ 0.

[Figure: A split into a prefix of negative sum (< 0) followed by the optimum window (> 0).]

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm
  sum = 0; max = -1;
  for i = 1,...,n do
    if (sum + A[i] ≤ 0) sum = 0;
    else sum += A[i]; max = MAX{max, sum};

Note:
• sum < 0 when OPT starts;
• sum > 0 within OPT
(A runnable sketch follows.)
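The scan above, as a runnable sketch (illustrative; names are ours):

  def max_subarray_sum(A):
      best, run = A[0], 0          # run = sum of the current candidate window
      for x in A:
          if run + x <= 0:
              run = 0              # a window never starts with a non-positive prefix
          else:
              run += x
              best = max(best, run)
      return best

  A = [2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]
  print(max_subarray_sum(A))       # -> 12  (the window 6 1 -2 4 3)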

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF
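A minimal sketch of the merging phase (illustrative; in-memory lists stand in for the X runs and their buffers, and a heap plays the role of the min-selection):

  import heapq

  def multiway_merge(runs):
      # always emit the minimum among the current front elements of the X runs
      heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
      heapq.heapify(heap)
      out = []
      while heap:
          val, i, j = heapq.heappop(heap)
          out.append(val)                      # would be flushed to the output buffer Bf_o
          if j + 1 < len(runs[i]):
              heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
      return out

  print(multiway_merge([[1, 2, 5, 10], [2, 7, 9, 13], [3, 4, 11, 12]]))
  # -> [1, 2, 2, 3, 4, 5, 7, 9, 10, 11, 12, 13]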

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm
 Use a pair of variables <X, C>
 For each item s of the stream:
    if (X == s) then C++
    else { C--; if (C == 0) { X = s; C = 1; } }
 Return X;
(A runnable sketch follows.)
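The same scan as a runnable sketch (illustrative; equivalent to the pseudocode above, and correct whenever some item occurs more than N/2 times):

  def majority_candidate(stream):
      X, C = None, 0
      for s in stream:
          if X == s:
              C += 1
          else:
              C -= 1
              if C <= 0:            # counter exhausted: adopt the new item
                  X, C = s, 1
      return X

  A = list("bacccdcbaaaccbccc")     # the stream of the slide
  print(majority_candidate(A))      # -> 'c'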

Proof
Problems can arise only if #occ(y) ≤ N/2.
If X ≠ y at the end, then every one of y's occurrences has a "negative" mate, hence the mates are at least #occ(y) and N ≥ 2·#occ(y).
But #occ(y) > N/2 would give 2·#occ(y) > N, a contradiction.

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million documents,  t = 500K terms

             Antony &   Julius    The       Hamlet   Othello   Macbeth
             Cleopatra  Caesar    Tempest
Antony          1          1         0         0        0         1
Brutus          1          1         0         1        0         0
Caesar          1          1         0         1        1         1
Calpurnia       0          1         0         0        0         0
Cleopatra       1          0         0         0        0         0
mercy           1          0         1         1        1         1
worser          1          0         1         1        1         0

(1 if the play contains the word, 0 otherwise)
Space is 500Gb !

Solution 2: Inverted index

  Brutus     →  2  4  8  16  32  64  128
  Calpurnia  →  2  3  5  8  13  21  34
  Caesar     →  13  16

We can still do better: i.e. 30÷50% of the original text

1. Typically about 12 bytes per posting
2. We have 10^9 total terms  →  at least 12Gb space
3. Compressing the 6Gb documents gets 1.5Gb data
Better index, but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2^n but we have fewer compressed messages:

  ∑_{i=1,...,n-1} 2^i  =  2^n - 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the self-information of s is:

  i(s) = log2 (1 / p(s)) = - log2 p(s)

Lower probability → higher information

Entropy is the weighted average of i(s):

  H(S) = ∑_{s ∈ S} p(s) · log2 (1 / p(s))   bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword lengths L[s], the average length is defined as

  La(C) = ∑_{s ∈ S} p(s) · L[s]

We say that a prefix code C is optimal if for all prefix codes C', La(C) ≤ La(C')

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any uniquely decodable code C, we have

  H(S) ≤ La(C)

Theorem (upper bound, Shannon). For any probability distribution, there exists a prefix code C such that

  La(C) ≤ H(S) + 1

(The Shannon code takes ⌈log2 1/p⌉ bits per symbol.)

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

[Figure: Huffman tree built bottom-up: a(.1) and b(.2) merge into (.3); (.3) and c(.2) merge into (.5); (.5) and d(.5) merge into (1).]

a = 000,  b = 001,  c = 01,  d = 1
There are 2^(n-1) "equivalent" Huffman trees

What about ties (and thus, tree depth) ?
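A minimal construction sketch on the running example (illustrative; ties are broken by insertion order here, which is exactly the point of the question above — different tie-breaking gives different, equally optimal trees):

  import heapq

  def huffman_codes(probs):
      # heap items: (probability, tiebreak, subtree); a subtree is a symbol or a pair
      heap = [(p, i, sym) for i, (sym, p) in enumerate(probs.items())]
      heapq.heapify(heap)
      next_id = len(heap)
      while len(heap) > 1:
          p1, _, t1 = heapq.heappop(heap)      # merge the two least probable nodes
          p2, _, t2 = heapq.heappop(heap)
          heapq.heappush(heap, (p1 + p2, next_id, (t1, t2)))
          next_id += 1
      codes = {}
      def walk(t, code):
          if isinstance(t, tuple):
              walk(t[0], code + "0"); walk(t[1], code + "1")
          else:
              codes[t] = code
      walk(heap[0][2], "")
      return codes

  print(huffman_codes({'a': .1, 'b': .2, 'c': .2, 'd': .5}))
  # -> e.g. {'d': '0', 'c': '10', 'a': '110', 'b': '111'}: lengths 1, 2, 3, 3 as in the slide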

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
  abc...  →  00000101          101001...  →  dcb...
[Figure: the same Huffman tree, traversed root-to-leaf for encoding and bit-by-bit for decoding.]

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. Its self-information is
  -log2(.999) ≈ .00144 bits
If we were to send 1000 such symbols we might hope to use 1000 · .00144 ≈ 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.
Let s be a string of length m:

  H(s) = ∑_{i=1,...,m} 2^(m-i) · s[i]

Example: P = 0101
  H(P) = 2^3·0 + 2^2·1 + 2^1·0 + 2^0·1 = 5

s = s'  if and only if  H(s) = H(s')

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1):

  H(Tr) = 2·H(Tr-1) - 2^m · T[r-1] + T[r+m-1]

T = 10110101
T1 = 1011,  T2 = 0110
H(T1) = H(1011) = 11
H(T2) = H(0110) = 2·11 - 2^4·1 + 0 = 22 - 16 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P = 101111, q = 7
H(P) = 47,  Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally (Horner's rule, mod q):
  (1·2 + 0) mod 7 = 2
  (2·2 + 1) mod 7 = 5
  (5·2 + 1) mod 7 = 4
  (4·2 + 1) mod 7 = 2
  (2·2 + 1) mod 7 = 5  =  Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1):
  2^m (mod q) = 2 · (2^(m-1) (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
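A minimal sketch of the scan (illustrative; it works on binary strings as in the slides, uses a fixed q instead of a random prime, and verifies every fingerprint hit, so it behaves like the deterministic variant):

  def karp_rabin(t, p, q=101):
      m = len(p)
      H = lambda s: sum(int(b) << (len(s) - 1 - i) for i, b in enumerate(s)) % q
      hp, ht = H(p), H(t[:m])
      top = pow(2, m - 1, q)                       # 2^(m-1) mod q, to drop the leading bit
      occ = []
      for r in range(len(t) - m + 1):
          if r > 0:                                # roll: drop t[r-1], append t[r+m-1]
              ht = (2 * (ht - int(t[r - 1]) * top) + int(t[r + m - 1])) % q
          if ht == hp and t[r:r + m] == p:         # verify, to rule out false matches
              occ.append(r + 1)                    # 1-based positions
      return occ

  print(karp_rabin("10110101", "0101"))            # -> [5]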

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one.
We define the m-length binary vector U(x) for each character x in the alphabet: U(x) has a 1 at the positions in P where character x appears.
Example: P = abaac
  U(a) = (1,0,1,1,0)
  U(b) = (0,1,0,0,0)
  U(c) = (0,0,0,0,1)

How to construct M
 Initialize column 0 of M to all zeros
 For j > 0, the j-th column is obtained as

    M(j) = BitShift( M(j-1) ) & U( T[j] )

For i > 1, entry M(i,j) = 1 iff
  (1) the first i-1 characters of P match the i-1 characters of T ending at position j-1   ⇔  M(i-1, j-1) = 1
  (2) P[i] = T[j]   ⇔  the i-th bit of U(T[j]) is 1
BitShift moves bit M(i-1,j-1) into the i-th position; AND-ing it with the i-th bit of U(T[j]) establishes whether both conditions hold. (A small sketch follows.)
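A minimal Shift-And sketch (illustrative; bit i of a Python integer plays the role of row i+1 of the column M(j)):

  def shift_and(t, p):
      U = {}
      for i, c in enumerate(p):                # U(x): bit i set iff P[i+1] = x
          U[c] = U.get(c, 0) | (1 << i)
      M, occ = 0, []
      for j, c in enumerate(t):
          M = ((M << 1) | 1) & U.get(c, 0)     # BitShift(M(j-1)) & U(T[j])
          if M & (1 << (len(p) - 1)):          # last row set: an occurrence ends at j
              occ.append(j - len(p) + 2)       # 1-based starting position
      return occ

  print(shift_and("xabxabaaca", "abaac"))      # -> [5]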

[Worked examples for j = 1, 2, 3 and j = 9 over T = xabxabaaca and P = abaac, computing M(j) = BitShift(M(j-1)) & U(T[j]) column by column; at j = 9 the last row of M becomes 1, i.e. an occurrence of P ends at position 9.]

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

  BitShift( M^l(j-1) ) & U( T[j] )

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

  BitShift( M^(l-1)(j-1) )

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute M^l(j), we observe that there is a match iff

  M^l(j) = [ BitShift( M^l(j-1) ) & U(T[j]) ]  OR  BitShift( M^(l-1)(j-1) )

Example M1
[Worked example: the matrices M^0 and M^1 for T = xabxabaaca and P = abaad; M^1(5,9) = 1, i.e. P occurs ending at position 9 with at most 1 mismatch.]

How much do we pay?





The running time is O(k·n·(1 + m/w)).
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Delection: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding

  γ(x) = 00...0 (Length-1 zeros) followed by x in binary

 x > 0 and Length = ⌊log2 x⌋ + 1
 e.g., 9 is represented as <000, 1001>.
 γ-code for x takes 2⌊log2 x⌋ + 1 bits (i.e. a factor of 2 from optimal)
 Optimal for Pr(x) = 1/(2x²), and i.i.d. integers
It is a prefix-free encoding…


Given the following sequence of γ-coded integers, reconstruct the original sequence:

  0001000001100110000011101100111
= 0001000 | 00110 | 011 | 00000111011 | 00111
= 8, 6, 3, 59, 7
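A minimal γ-coding sketch (illustrative):

  def gamma_encode(x):                  # x > 0
      b = bin(x)[2:]                    # x in binary, Length = len(b)
      return "0" * (len(b) - 1) + b

  def gamma_decode(bits):
      out, i = [], 0
      while i < len(bits):
          z = 0
          while bits[i + z] == "0":     # count the leading zeros = Length - 1
              z += 1
          out.append(int(bits[i + z:i + 2 * z + 1], 2))
          i += 2 * z + 1
      return out

  code = "".join(gamma_encode(x) for x in [8, 6, 3, 59, 7])
  print(code)                           # -> 0001000001100110000011101100111
  print(gamma_decode(code))             # -> [8, 6, 3, 59, 7]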

Analysis
Sort the p_i in decreasing order, and encode s_i via the variable-length code γ(i).
Recall that |γ(i)| ≤ 2·log i + 1.
How good is this approach wrt Huffman?  Compression ratio ≤ 2·H_0(s) + 1.
Key fact:  1 ≥ ∑_{j=1,...,i} p_j ≥ i·p_i   ⇒   i ≤ 1/p_i

How good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1
The cost of the encoding is (recall i ≤ 1/p_i):

  ∑_{i=1,...,|S|} p_i · |γ(i)|  ≤  ∑_{i=1,...,|S|} p_i · [ 2·log(1/p_i) + 1 ]  =  2·H_0(X) + 1

Not much worse than Huffman, and improvable to H_0(X) + 2 + ...

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:
 Exploits temporal locality, and it is dynamic
 X = 1^n 2^n 3^n ... n^n  ⇒  Huff = O(n² log n),  MTF = O(n log n) + n²

Not much worse than Huffman ... but it may be far better
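A minimal MTF sketch (illustrative):

  def mtf_encode(s, alphabet):
      L = list(alphabet)                # the symbol list, e.g. [a, b, c, d, ...]
      out = []
      for c in s:
          i = L.index(c)
          out.append(i)                 # output the current position of c in L
          L.pop(i); L.insert(0, c)      # move c to the front of L
      return out

  print(mtf_encode("aabbbba", "ab"))    # -> [0, 0, 1, 0, 0, 0, 1]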

MTF: how good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1
Put S in front and consider the cost of encoding:

  O(|S| log |S|)  +  ∑_{x=1,...,|S|}  ∑_{i=2,...,n_x}  |γ( p_x^i - p_x^{i-1} )|

By Jensen's inequality:

  ≤  O(|S| log |S|)  +  ∑_{x=1,...,|S|}  n_x · [ 2·log(N/n_x) + 1 ]
  =  O(|S| log |S|)  +  N · [ 2·H_0(X) + 1 ]

  La[mtf]  ≤  2·H_0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings → just the run lengths and one bit
Properties:
 Exploits spatial locality, and it is a dynamic code
 There is a memory
 X = 1^n 2^n 3^n ... n^n  ⇒  Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.
1.0

c = .3

0.7

i -1

f (i ) 



p( j)

j 1

b = .5
0.2
0.0

f(a) = .0, f(b) = .2, f(c) = .7
a = .2

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
1.0

0.7
c = .3

0.7

c = .3
0.55

0.0

a = .2

0.3
0.2

c = .3

0.27
b = .5

b = .5
0.2

0.3

a = .2

b = .5
0.22

a = .2

0.2

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l 0  0 l i  l i -1  s i -1 * f c i 

s0  1

s i  s i -1 * p c i 

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

n

sn 

 p c 
i

i 1

The interval for a message sequence will be called the
sequence interval

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
1.0

0.7
c = .3

0.7
0.49

b = .5

c = .3

0.55
0.49

0.2

0.0

0.55

b = .5

0.3
a = .2

The message is bbc.

0.2

0.49
0.475

c = .3

b = .5
0.35

a = .2

0.3

a = .2

Representing a real number
Binary fractional representation:
. 75

 . 11

1/ 3

 . 01 01

11 / 16

 . 1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
m in

m ax

interval

.11

.110

.111

[.75 ,1.0 )

.101

.1010

.1011

[.625 , .75 )

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.
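A toy order-k estimator with escape fallback, to make the mechanism concrete. This is only a sketch; the escape mass below follows the "one count per distinct symbol" idea visible in the $ counts of the example-contexts slide below, and real PPM variants differ precisely in this choice.

from collections import defaultdict

class ToyPPM:
    def __init__(self, k):
        self.k = k
        self.counts = defaultdict(lambda: defaultdict(int))   # context -> char -> count

    def update(self, text, i):
        # Update the counts of every context of size 0..k preceding position i.
        for order in range(self.k + 1):
            if i >= order:
                self.counts[text[i - order:i]][text[i]] += 1

    def predict(self, history, c):
        # Fall back to shorter contexts (conceptually emitting an escape)
        # until the char has been seen; return (order used, probability).
        for order in range(min(self.k, len(history)), -1, -1):
            seen = self.counts[history[len(history) - order:]]
            if c in seen:
                total = sum(seen.values()) + len(seen)   # len(seen) = escape mass
                return order, seen[c] / total
        return -1, None                                  # never seen anywhere

model = ToyPPM(k=2)
text = "ACCBACCACBA"
for i in range(len(text)):
    model.update(text, i)
print(model.predict("CB", "A"))   # (2, 2/3): context CB saw A twice, plus one escape
print(model.predict("BA", "B"))   # (0, 2/14): two escapes down to the empty context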

PPM + Arithmetic ToolBox
[Diagram: at each step the PPM model supplies p[ s | context ] for s = c or esc, and the
 ATB maps the current interval (L, s) to the new interval (L', s') accordingly.]

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts            String = ACCBACCACBA B        k = 2

Context    Counts
Empty      A = 4    B = 2    C = 5    $ = 3

Context    Counts
A          C = 3    $ = 1
B          A = 2    $ = 1
C          A = 1    B = 2    C = 2    $ = 3

Context    Counts
AC         B = 1    C = 2    $ = 2
BA         C = 1    $ = 1
CA         C = 1    $ = 1
CB         A = 2    $ = 1
CC         A = 1    B = 1    $ = 2
You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!

LZ77
[Figure: text  a a c a a c a b c a b a b a c  with a cursor; the dictionary consists of
 all substrings starting before the cursor; the triple emitted here is <2,3,c>.]

Algorithm’s step:
  Output the triple <d, len, c>, where
    d   = distance of the copied string wrt the current position
    len = length of the longest match
    c   = next char in the text beyond the longest match
  Advance by len + 1

A buffer “window” has fixed length and moves
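A brute-force sketch of this scheme with a sliding window (the parameter W and the variable names are ours); it reproduces the triples of the example that follows:

def lz77_encode(text, W=6):
    i, out = 0, []
    while i < len(text):
        best_d, best_len = 0, 0
        for d in range(1, min(i, W) + 1):              # candidate copy distance
            length = 0
            while (i + length < len(text) - 1 and      # keep one char as next_char
                   text[i - d + length] == text[i + length]):
                length += 1                            # may run past i: overlap is fine
            if length > best_len:
                best_d, best_len = d, length
        out.append((best_d, best_len, text[i + best_len]))
        i += best_len + 1                              # advance by len + 1
    return out

print(lz77_encode("aacaacabcabaaac"))
# [(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')]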

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6. Each triple is (distance of the longest match within W, match length, next character).

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
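The same copy loop in a runnable decoder sketch; copying one character at a time is exactly what makes the overlapping case work:

def lz77_decode(triples):
    out = []
    for d, length, c in triples:
        start = len(out) - d
        for i in range(length):
            out.append(out[start + i])   # may read chars appended in this same loop
        out.append(c)
    return "".join(out)

print(lz77_decode([(0, 0, "a"), (0, 0, "b"), (0, 0, "c"), (0, 0, "d"), (2, 9, "e")]))
# abcdcdcdcdcdce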

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use a shorter match so that the next match is better
Hash table to speed up the search for matches (indexed by character triplets)
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids
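A minimal LZ78 encoder sketch along these lines (the handling of an input ending in the middle of a match is our own assumption, since the slides do not show it); on the string of the next slide it produces exactly the listed pairs:

def lz78_encode(text):
    dictionary, out, i = {}, [], 0        # dictionary: string -> id (ids start at 1)
    while i < len(text):
        match_id, s = 0, ""
        while i < len(text) and s + text[i] in dictionary:
            s += text[i]
            match_id = dictionary[s]
            i += 1
        if i == len(text):                # input ended inside a match (assumption)
            out.append((match_id, ""))
            break
        out.append((match_id, text[i]))   # emit (id of longest match, next char)
        dictionary[s + text[i]] = len(dictionary) + 1
        i += 1
    return out

print(lz78_encode("aabaacabcabcb"))
# [(0,'a'), (1,'b'), (1,'a'), (0,'c'), (2,'c'), (5,'b')]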

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
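A decoder sketch that makes the special case explicit; `base` stands for the initial single-char entries (here the slides' toy numbering a = 112, b = 113, c = 114) and new entries start at id 256. The code list below is the encoding example of the next slides, plus one final code for the last symbol:

def lzw_decode(codes, base):
    dictionary = dict(base)               # id -> string
    next_id = 256                         # new entries start after the ascii range
    prev = dictionary[codes[0]]
    out = [prev]
    for code in codes[1:]:
        # the decoder is one step behind: the entry for `code` may not exist yet;
        # this happens exactly for inputs of the form S S c with c = S[0]
        entry = dictionary[code] if code in dictionary else prev + prev[0]
        out.append(entry)
        dictionary[next_id] = prev + entry[0]   # complete the pending entry
        next_id += 1
        prev = entry
    return "".join(out)

codes = [112, 112, 113, 256, 114, 257, 261, 114, 113]
print(lzw_decode(codes, {112: "a", 113: "b", 114: "c"}))   # aabaacababacb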

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
We are given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows    (Burrows-Wheeler, 1994):

F                  L
#  mississipp      i
i  #mississip      p
i  ppi#missis      s
i  ssippi#mis      s
i  ssissippi#      m
m  ississippi      #     ← this row is T
p  i#mississi      p
p  pi#mississ      i
s  ippi#missi      s
s  issippi#mi      s
s  sippi#miss      i
s  sissippi#m      i

F = first column of the sorted matrix, L = last column (the BWT output).

A famous example (a much longer one...)

A useful tool: the L → F mapping

[Same sorted matrix as above: F = first column, L = last column; the middle of each row
 is unknown to the decoder, which only receives L.]

How do we map L’s chars onto F’s chars ?
... we need to distinguish equal chars in F...

Take two equal chars in L and rotate their rows rightward by one position:
they end up in F in the same relative order !!

The BWT is invertible
[Same sorted matrix: F = # i i i i m p p s s s s (known, since it is the sorted L),
 L = i p s s m # p i s s i i (received), the middle columns are unknown.]

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
  Compute LF[0,n-1];
  r = 0; i = n;
  while (i > 0) {
    T[i] = L[r];
    r = LF[r]; i--;
  }
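A self-contained sketch of both directions, assuming the text ends with a unique sentinel '#' that is smaller than every other character (so the sentinel row is row 0 of the sorted matrix):

def bwt(T):
    rotations = sorted(T[i:] + T[:i] for i in range(len(T)))
    return "".join(row[-1] for row in rotations)         # L = last column

def inverse_bwt(L):
    F = sorted(L)
    first, count, LF = {}, {}, []
    for i, c in enumerate(F):                            # first position of c in F
        first.setdefault(c, i)
    for c in L:                                          # LF: k-th c in L -> k-th c in F
        LF.append(first[c] + count.get(c, 0))
        count[c] = count.get(c, 0) + 1
    out, r = [], 0                                       # row 0 starts with '#'
    for _ in range(len(L)):                              # walk T backwards, as in InvertBWT
        out.append(L[r])
        r = LF[r]
    out.reverse()                                        # now out = '#' + T without its sentinel
    return "".join(out[1:]) + out[0]

print(bwt("mississippi#"))                 # ipssm#pissii
print(inverse_bwt(bwt("mississippi#")))    # mississippi#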

How to compute the BWT ?
SA     BWT matrix (sorted rotations)     L
12     #mississippi                      i
11     i#mississipp                      p
 8     ippi#mississ                      s
 5     issippi#miss                      s
 2     ississippi#m                      m
 1     mississippi#                      #
10     pi#mississip                      p
 9     ppi#mississi                      i
 7     sippi#missis                      s
 4     sissippi#mis                      s
 6     ssippi#missi                      i
 3     ssissippi#mi                      i
We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12   #
11   i#
 8   ippi#
 5   issippi#
 2   ississippi#
 1   mississippi#
10   pi#
 9   ppi#
 7   sippi#
 4   sissippi#
 6   ssippi#
 3   ssissippi#

Input: T = mississippi#

Elegant but inefficient (sort the suffixes by direct comparison).
Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:
  L is locally homogeneous  →  L is highly compressible

Algorithm Bzip:
  Move-to-Front coding of L
  Run-Length coding
  Statistical coder
 Bzip vs. Gzip: 20% vs. 33% compression ratio, but Bzip is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii        (# at position 16)

MTF-list = [i,m,p,s]
Mtf  = 020030000030030200300300000100000
Mtf’ = 030040000040040300400400000200000      Bin(6)=110, Wheeler’s code
RLE0 = 03141041403141410210                   (alphabet of size |S|+1)

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)
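A Move-to-Front sketch (list handling as on the slides: output the position, then move the symbol to the front); runs of equal characters in L become runs of zeros, which the RLE stage then squeezes:

def mtf_encode(s, alphabet):
    lst = list(alphabet)
    out = []
    for c in s:
        i = lst.index(c)       # current position of the symbol (0-based)
        out.append(i)
        lst.pop(i)
        lst.insert(0, c)       # move it to the front
    return out

print(mtf_encode("aaabbbbccc", "abcd"))   # [0, 0, 0, 1, 0, 0, 0, 2, 0, 0]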

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics

Size
  1 trillion pages available (Google, 7/08)
  5-40K per page => hundreds of terabytes
  Size grows every day!!

Change
  8% new pages, 25% new links change weekly
  Life time of about 10 days

The Bow Tie

Some definitions

Weakly connected components (WCC)
  Set of nodes such that from any node you can reach any other node via an undirected path.

Strongly connected components (SCC)
  Set of nodes such that from any node you can reach any other node via a directed path.

Observing the Web Graph

  We do not know which percentage of it we know
  The only way to discover the graph structure of the web as hypertext is via large-scale crawls
  Warning: the picture might be distorted by
    Size limitation of the crawl
    Crawling rules
    Perturbations of the "natural" process of birth and death of nodes and links

Why is it interesting?

  It is the largest artifact ever conceived by humankind
  Exploit the structure of the Web for
    Crawl strategies
    Search
    Spam detection
    Discovering communities on the web
    Classification/organization
  Predict the evolution of the Web
    Sociological understanding

Many other large graphs…

  Physical network graph
    V = Routers,  E = communication links
  The “cosine” graph (undirected, weighted)
    V = static web pages,  E = semantic distance between pages
  Query-Log graph (bipartite, weighted)
    V = queries and URLs,  E = (q,u) if u is a result for q and has been clicked by some user who issued q
  Social graph (undirected, unweighted)
    V = users,  E = (x,y) if x knows y (facebook, address book, email, …)

Definition
Directed graph G = (V,E)


V = URLs,  E = (u,v) if u has a hyperlink to v
Isolated URLs are ignored (no IN & no OUT)

Three key properties:
  Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution

Altavista crawl (1999) and WebBase crawl (2001): the indegree follows a power-law distribution

    Pr[ in-degree(u) = k ]  ∝  1 / k^a,    a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)
  V = URLs,  E = (u,v) if u has a hyperlink to v
  Isolated URLs are ignored (no IN, no OUT)

Three key properties:
  Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1
  Locality: usually most of the hyperlinks point to other URLs on the same host (about 80%).
  Similarity: pages that are close in lexicographic order tend to share many outgoing lists

A Picture of the Web Graph

[Figure: adjacency-matrix plot of a crawl with 21 million pages and 150 million links.]

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:
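A small sketch of this gap transformation on a made-up adjacency list (node 15 and its successors below are not the slides' example); the possibly-negative first entry is mapped to a non-negative integer with the usual sign interleaving (2v for v ≥ 0, 2|v|-1 for v < 0), which is consistent with the "(positive)" / "(negative)" examples on the Extra-nodes slide further below:

def gaps(successors, x):
    # First entry relative to the source x (may be negative), then gaps minus 1.
    s = sorted(successors)
    return [s[0] - x] + [s[i] - s[i - 1] - 1 for i in range(1, len(s))]

def to_natural(v):
    # Sign interleaving: v >= 0 -> 2v,  v < 0 -> 2|v| - 1 (assumed convention).
    return 2 * v if v >= 0 else 2 * abs(v) - 1

adj = [15, 16, 17, 22, 23, 24, 315, 316, 317, 3041]
g = gaps(adj, x=15)
print(g)                                   # [0, 0, 0, 4, 0, 0, 290, 0, 0, 2723]
print([to_natural(g[0])] + g[1:])          # ready for a universal code (e.g. gamma)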

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit in the copy-list of y tells whether the corresponding successor of the reference x is also a successor of y;
the reference index is chosen in [0,W] so as to give the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
[Diagram: a sender transmits data over the network to a receiver; the receiver may already hold some knowledge about the data.]

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques

  caching: “avoid sending the same object again”
    done on the basis of whole objects
    only works if objects are completely unchanged
    What about objects that are slightly changed?

  compression: “remove redundancy in transmitted data”
    avoid repeated substrings in data
    can be extended to the history of past transmissions (overhead)
    What if the sender has never seen the data at the receiver?

Types of Techniques

  Common knowledge between sender & receiver
    Unstructured file: delta compression

  “Partial” knowledge
    Unstructured files: file synchronization
    Record-based data: set reconciliation

Formalization

  Delta compression   [diff, zdelta, REBL, …]
    Compress file f deploying file f’
    Compress a group of files
    Speed up web access by sending differences between the requested page and the ones available in cache

  File synchronization   [rsync, zsync]
    Client updates an old file f_old with the f_new available on a server
    Mirroring, Shared Crawling, Content Distribution Networks

  Set reconciliation
    Client updates a structured old file f_old with the f_new available on a server
    Update of contacts or appointments, intersection of inverted lists in a P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd

  Assume that block moves and copies are allowed
  Find an optimal covering set of f_new based on f_known
  The LZ77 scheme provides an efficient, optimal solution:
    f_known plays the role of the “previously encoded text”: compress the concatenation
    f_known f_new, emitting output only from f_new onwards
  zdelta is one of the best implementations
            Emacs size    Emacs time
  uncompr   27Mb          ---
  gzip      8Mb           35 secs
  zdelta    1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

[Diagram: Client ↔ client-side proxy ↔ (slow link, delta-encoded pages) ↔ server-side proxy ↔ (fast link) ↔ Web;
 both proxies hold a reference version of the requested page.]

Use zdelta to reduce traffic:
  The old version is available at both proxies
  Restricted to pages already visited (30% hits), URL-prefix match
  Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



  Apply pairwise zdelta: find for each f ∈ F a good reference

  Reduction to the Min Branching problem on DAGs:
    Build a weighted graph G_F: nodes = files, weights = zdelta-sizes
    Insert a dummy node connected to all files, whose edge weights are the gzip-coded sizes
    Compute the min branching = directed spanning tree of min total cost, covering G’s nodes

  [Figure: example graph with dummy node 0 and files 1, 2, 3, 5; edge weights such as
   20, 123, 220, 620, 2000 are the (z)delta sizes.]

            space    time
  uncompr   30Mb     ---
  tgz       20%      linear
  THIS      8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly: n² edge calculations (zdelta executions)

  We wish to exploit some pruning approach:
    Collection analysis: cluster the files that appear similar and are thus good candidates for
    zdelta-compression; build a sparse weighted graph G’_F containing only the edges between those pairs of files
    Assign weights: estimate appropriate edge weights for G’_F, thus saving zdelta executions.
    Nonetheless, still n² time

            space    time
  uncompr   260Mb    ---
  tgz       12%      2 mins
  THIS      8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem

  [Diagram: the Client holds f_old, the Server holds f_new; the client sends a request and receives an update.]

  the client wants to update an out-dated file
  the server has the new file but does not know the old one
  update without sending the entire f_new (exploit the similarity)
  rsync: file-synch tool, distributed with Linux

  Delta compression is a sort of “local” synch, since the server has both copies of the files

The rsync algorithm

  [Diagram: the Client sends block hashes of f_old to the Server; the Server replies with the encoded file.]

The rsync algorithm (contd)

  simple, widely used, single roundtrip
  optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
  choice of block size problematic (default: max{700, √n} bytes)
  not good in theory: the granularity of changes may disrupt the use of blocks
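A sketch of the rolling weak-checksum idea (Adler-style, simplified to mod 2^16 and without rsync's exact constants): both components can be updated in O(1) when the block slides by one byte.

def weak_hash(block):
    a = sum(block) % 65536
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % 65536
    return a, b

def roll(a, b, old_byte, new_byte, block_len):
    # Slide the window one byte to the right without rescanning it.
    a = (a - old_byte + new_byte) % 65536
    b = (b - block_len * old_byte + a) % 65536
    return a, b

data = b"the quick brown fox jumps over the lazy dog"
B = 8
a, b = weak_hash(data[0:B])
for i in range(1, len(data) - B + 1):
    a, b = roll(a, b, data[i - 1], data[i + B - 1], B)
    assert (a, b) == weak_hash(data[i:i + B])    # rolling update == full recompute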

Rsync: some experiments

            gcc      emacs
  total     27288    27326
  gzip       7563     8577
  zdelta      227     1431
  rsync       964     4452

  Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends the hashes (unlike the client in rsync); the client checks them.
The server deploys the common f_ref to compress the new f_tar (rsync just compresses it on its own).

A multi-round protocol

  k blocks of n/k elements, log(n/k) levels
  If the distance is k, then on each level at most k hashes do not find a match in the other file.
  The communication complexity is O(k lg n lg(n/k)) bits.

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

[Figure: P aligned at position i of T, i.e. P is a prefix of the suffix T[i,N].]

Occurrences of P in T = all suffixes of T having P as a prefix

Example: P = si, T = mississippi  →  occurrences at positions 4 and 7

SUF(T) = sorted set of suffixes of T

Reduction: from substring search to prefix search (on SUF(T))

The Suffix Tree
T# = mississippi#
     1 2 3 4 5 6 7 8 9 10 11 12

[Figure: the suffix tree of T#. Edges carry substring labels such as #, i, s, si, ssi,
 p, pi#, ppi#, i#, mississippi#; each of the 12 leaves stores the starting position
 (1..12) of its suffix.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

Storing SUF(T) explicitly takes Θ(N²) space; the Suffix Array keeps only the starting positions (suffix pointers):

SA     SUF(T)
12     #
11     i#
 8     ippi#
 5     issippi#
 2     ississippi#
 1     mississippi#
10     pi#
 9     ppi#
 7     sippi#
 4     sissippi#
 6     ssippi#
 3     ssissippi#

T = mississippi#

Suffix Array space:
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison

[Figure: binary search for P = si over the SA of T = mississippi#; at each step P is
 compared with the suffix starting at SA[mid] (2 memory accesses per step), and the
 search continues in the half where P is larger / smaller.]

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char comparisons
 overall, O(p log2 N) time

Improvements: O(p + log2 N) [Manber-Myers, ’90],  O(p + log2 |S|) [Cole et al, ’06]
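A sketch of the whole pipeline: naive suffix-array construction (good enough for the example, not for the bounds discussed above) and the O(p log N) binary search for the SA range of suffixes prefixed by P:

def suffix_array(T):
    return sorted(range(len(T)), key=lambda i: T[i:])    # naive, O(N^2 log N)

def sa_range(T, SA, P):
    n, p = len(SA), len(P)
    lo, hi = 0, n
    while lo < hi:                                       # first suffix >= P
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + p] < P: lo = mid + 1
        else: hi = mid
    start = lo
    lo, hi = start, n
    while lo < hi:                                       # first suffix whose prefix > P
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + p] <= P: lo = mid + 1
        else: hi = mid
    return start, lo                                     # occurrences are SA[start:lo]

T = "mississippi#"
SA = suffix_array(T)
s, e = sa_range(T, SA, "si")
print(sorted(x + 1 for x in SA[s:e]))    # [4, 7] (1-based, as on the slides)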

Locating the occurrences

[Figure: the occurrences of P = si in T = mississippi# are the contiguous SA range holding
 the suffixes sippi# and sissippi#, i.e. positions 4 and 7 (occ = 2). The range is delimited
 by binary searching for si# and si$, where # < every char of S < $.]

Suffix Array search
• O(p + log2 N + occ) time

Suffix Trays: O(p + log2 |S| + occ)    [Cole et al., ‘06]
String B-tree                          [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays           [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest common prefix between suffixes adjacent in SA

SA     suffix           Lcp with the next suffix
12     #                0
11     i#               1
 8     ippi#            1
 5     issippi#         4
 2     ississippi#      0
 1     mississippi#     0
10     pi#              1
 9     ppi#             0
 7     sippi#           2
 4     sissippi#        1
 6     ssippi#          3
 3     ssissippi#       -

T = mississippi#     (e.g. Lcp = 4 for the adjacent suffixes issippi# and ississippi#)

• How long is the common prefix between T[i,...] and T[j,...] ?
  • Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  • Search for some Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  • Search for a run Lcp[i,i+C-2] whose entries are all ≥ L
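The Lcp array itself can be computed in O(N) time from SA with Kasai's algorithm; a sketch (0-based indices, Lcp[i] = lcp of the suffixes SA[i] and SA[i+1]):

def lcp_array(T, SA):
    n = len(T)
    rank = [0] * n
    for i, s in enumerate(SA):
        rank[s] = i
    lcp = [0] * (n - 1)
    h = 0
    for i in range(n):                       # i = starting position of a suffix in T
        if rank[i] + 1 < n:
            j = SA[rank[i] + 1]              # the next suffix in SA order
            while i + h < n and j + h < n and T[i + h] == T[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h:
                h -= 1                       # the lcp can drop by at most 1
        else:
            h = 0
    return lcp

T = "mississippi#"
SA = sorted(range(len(T)), key=lambda i: T[i:])
print(lcp_array(T, SA))    # [0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3]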


Slide 225

l 0  0 l i  l i -1  s i -1 * f c i 

s0  1

s i  s i -1 * p c i 

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

n

sn 

 p c 
i

i 1

The interval for a message sequence will be called the
sequence interval

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
1.0

0.7
c = .3

0.7
0.49

b = .5

c = .3

0.55
0.49

0.2

0.0

0.55

b = .5

0.3
a = .2

The message is bbc.

0.2

0.49
0.475

c = .3

b = .5
0.35

a = .2

0.3

a = .2

Representing a real number
Binary fractional representation:
. 75

 . 11

1/ 3

 . 01 01

11 / 16

 . 1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
m in

m ax

interval

.11

.110

.111

[.75 ,1.0 )

.101

.1010

.1011

[.625 , .75 )

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
String = ACCBACCACBA B        (k = 2)

Context   Counts
Empty     A = 4   B = 2   C = 5   $ = 3

Context   Counts
A         C = 3   $ = 1
B         A = 2   $ = 1
C         A = 1   B = 2   C = 2   $ = 3

Context   Counts
AC        B = 1   C = 2   $ = 2
BA        C = 1   $ = 1
CA        C = 1   $ = 1
CB        A = 2   $ = 1
CC        A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6.  At each step: longest match within W, then the next character.
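A small Python sketch of this step (brute-force match search inside the window, an assumption of mine rather than gzip's hashed search); on the example text it reproduces the triples above:

def lz77_encode(text, W=6):
    out, cur, n = [], 0, len(text)
    while cur < n:
        best_d, best_len = 0, 0
        for d in range(1, min(W, cur) + 1):               # candidate copy distances
            length = 0
            while cur + length < n - 1 and text[cur + length - d] == text[cur + length]:
                length += 1                                # matches may overlap the cursor
            if length > best_len:
                best_d, best_len = d, length
        out.append((best_d, best_len, text[cur + best_len]))
        cur += best_len + 1                                # advance by len + 1
    return out

print(lz77_encode("aacaacabcabaaac"))
# -> [(0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), (3, 3, 'a'), (1, 2, 'c')]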

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if len > d? (the copy overlaps the text still to be written)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
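A toy Python encoder for this scheme (the dictionary is initialized with only a, b, c using the slide's ids instead of the full 256 ASCII entries); on the example string it produces the codes of the next slide, plus a final 113 that flushes the pending match:

def lzw_encode(text):
    dic = {'a': 112, 'b': 113, 'c': 114}    # slide's toy ids for the initial alphabet
    next_code, out, S = 256, [], ""
    for c in text:
        if S + c in dic:
            S += c                           # keep extending the current match
        else:
            out.append(dic[S])               # emit the id of the longest match S
            dic[S + c] = next_code           # add Sc to the dictionary (c is not sent)
            next_code += 1
            S = c
    out.append(dic[S])                       # flush the last pending match
    return out

print(lzw_encode("aabaacababacb"))
# -> [112, 112, 113, 256, 114, 257, 261, 114, 113]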

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows   (1994)

F                  L
# mississipp      i
i #mississip      p
i ppi#missis      s
i ssippi#mis      s
i ssissippi#      m
m ississippi      #
p i#mississi      p
p pi#mississ      i
s ippi#missi      s
s issippi#mi      s
s sippi#miss      i
s sissippi#m      i

L is the transform of T.

A famous example

Much
longer...

A useful tool: the L → F mapping

[Same sorted-rotation matrix as above, with the F and L columns highlighted; the middle of the rows is unknown.]

How do we map L's chars onto F's chars?
... we need to distinguish equal chars in F...

Take two equal chars of L: rotating their rows rightward by one position shows that they keep the same relative order in F!

The BWT is invertible

[Same sorted-rotation matrix:  F = # i i i i m p p s s s s,   L = i p s s m # p i s s i i.]

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
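A runnable Python version of InvertBWT (the LF array is built with a stable sort, which preserves the relative order of equal chars exactly as argued above); it assumes the terminator '#' is the smallest character, so row 0 is the rotation starting with '#':

def invert_bwt(L, eos='#'):
    n = len(L)
    F_from_L = sorted(range(n), key=lambda i: L[i])   # stable: equal chars keep order
    LF = [0] * n
    for f_pos, l_pos in enumerate(F_from_L):
        LF[l_pos] = f_pos
    out, r = [], 0                  # row 0 starts with eos (the smallest char)
    for _ in range(n - 1):          # reconstruct T backwards: L[r] precedes F[r] in T
        out.append(L[r])
        r = LF[r]
    return ''.join(reversed(out)) + eos

print(invert_bwt("ipssm#pissii"))   # mississippi#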

How to compute the BWT ?
SA     BWT matrix row     L
12     #mississipp        i
11     i#mississip        p
 8     ippi#missis        s
 5     issippi#mis        s
 2     ississippi#        m
 1     mississippi        #
10     pi#mississi        p
 9     ppi#mississ        i
 7     sippi#missi        s
 4     sissippi#mi        s
 6     ssippi#miss        i
 3     ssissippi#m        i

We said that: L[i] precedes F[i] in T.     e.g.  L[3] = T[7]
Given SA and T, we have L[i] = T[SA[i]-1]
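As a quick check, the relation L[i] = T[SA[i]-1] in Python (0-indexed; the negative index obtained for SA[i] = 1 wraps around to '#', i.e. the cyclic predecessor):

def bwt_from_sa(T, SA):
    return ''.join(T[s - 2] for s in SA)      # T[s-2]: 0-indexed char preceding suffix s

T  = "mississippi#"
SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(bwt_from_sa(T, SA))                     # ipssm#pissii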

How to construct SA from T ?
SA
12   #
11   i#
 8   ippi#
 5   issippi#
 2   ississippi#
 1   mississippi#
10   pi#
 9   ppi#
 7   sippi#
 4   sissippi#
 6   ssippi#
 3   ssissippi#
Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…

  Physical network graph: V = routers, E = communication links

  The “cosine” graph (undirected, weighted): V = static web pages, E = semantic distance between pages

  Query-Log graph (bipartite, weighted): V = queries and URLs, E = (q,u) if u is a result for q and has been clicked by some user who issued q

  Social graph (undirected, unweighted): V = users, E = (x,y) if x knows y (facebook, address book, email, ..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution

[Plots: Altavista crawl 1999 and WebBase crawl 2001; the in-degree follows a power-law distribution]

Pr[ in-degree(u) ≥ k ]  ∝  1 / k^a ,    a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)
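A tiny Python sketch (mine, not the WebGraph API) of the gap encoding of a successor list S(x): the first entry may be negative, which is why a special map for negative values is needed before the final integer coder.

def encode_gaps(x, succs):
    succs = sorted(succs)
    return [succs[0] - x] + [b - a - 1 for a, b in zip(succs, succs[1:])]

def decode_gaps(x, gaps):
    out, cur = [], x + gaps[0]
    out.append(cur)
    for g in gaps[1:]:
        cur += g + 1
        out.append(cur)
    return out

succ = [13, 15, 16, 17, 18, 19, 23, 24, 203]   # hypothetical successor list of node 15
print(encode_gaps(15, succ))                    # [-2, 1, 0, 0, 0, 0, 3, 0, 178]
print(decode_gaps(15, encode_gaps(15, succ)))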

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides an efficient, optimal solution

fknown is the “previously encoded text”: compress the concatenation fknown·fnew, starting from fnew

zdelta is one of the best implementations
           Emacs size   Emacs time
uncompr    27Mb         ---
gzip       8Mb          35 secs
zdelta     1.5Mb        42 secs
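zdelta itself is not shown here, but the same one-to-one idea can be sketched with the standard zlib preset-dictionary support: fnew is compressed using fknown as dictionary, so substrings shared with fknown cost very little.

import zlib

def delta_compress(f_known: bytes, f_new: bytes) -> bytes:
    comp = zlib.compressobj(level=9, zdict=f_known)
    return comp.compress(f_new) + comp.flush()

def delta_decompress(f_known: bytes, delta: bytes) -> bytes:
    decomp = zlib.decompressobj(zdict=f_known)
    return decomp.decompress(delta) + decomp.flush()

old = b"the quick brown fox jumps over the lazy dog" * 100
new = old.replace(b"lazy", b"sleepy")
d = delta_compress(old, new)
assert delta_decompress(old, d) == new
print(len(new), "->", len(d), "bytes")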

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

0

620

2

20

123

2000

20
220

3

1
5

space

time

uncompr

30Mb

---

tgz

20%

linear

THIS

8%

quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: estimate appropriate edge weights for G’F, thus saving zdelta executions. Nonetheless, this still takes strictly n² time.

           space    time
uncompr    260Mb    ---
tgz        12%      2 mins
THIS       8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
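A much-simplified Python sketch of the block-matching idea (Python's built-in hash stands in for the 4-byte rolling hash + 2-byte MD5 pair, and the window is advanced one byte at a time):

def rsync_encode(f_old: bytes, f_new: bytes, B: int = 8):
    block_of = {hash(f_old[i:i + B]): i // B
                for i in range(0, len(f_old) - B + 1, B)}   # client-side block hashes
    out, j = [], 0
    while j < len(f_new):
        h = hash(f_new[j:j + B])
        if j + B <= len(f_new) and h in block_of and \
           f_old[block_of[h] * B: block_of[h] * B + B] == f_new[j:j + B]:
            out.append(("copy", block_of[h]))               # reference to a block of f_old
            j += B
        else:
            out.append(("lit", f_new[j:j + 1]))             # literal (gzip-ed in practice)
            j += 1
    return out

print(rsync_encode(b"abcdefgh12345678", b"XXabcdefgh12345678"))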

Rsync: some experiments

          gcc size   emacs size
total     27288      27326
gzip      7563       8577
zdelta    227        1431
rsync     964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elements each  ⇒  log(n/k) levels

If the distance is k, then on each level at most k hashes fail to find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits.

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elements each  ⇒  log(n/k) levels

If the distance is k, then on each level at most k hashes fail to find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits.

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

[Figure: P aligned at position i of T, matching a prefix of the suffix T[i,N].]

Occurrences of P in T = all suffixes of T having P as a prefix
  e.g.  P = si,  T = mississippi  →  occurrences at positions 4 and 7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
#

0

s

i

1
ssi

ppi#

4

#

1

p

i

3

2

1

ppi#

6

pi#

7

8
5

T# = mississippi#
2 4 6 8 10

si

ppi#

i#

ppi#

11

mississippi#

12

2

1 10

9

4

3

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Storing SUF(T) explicitly would take Θ(N²) space.

SA     SUF(T)
12     #
11     i#
 8     ippi#
 5     issippi#
 2     ississippi#
 1     mississippi#
10     pi#
 9     ppi#
 7     sippi#
 4     sissippi#
 6     ssippi#
 3     ssissippi#

T = mississippi#     (SA entries are suffix pointers;  P = si)

Suffix Array:
• SA: Θ(N log₂ N) bits
• Text T: N chars
⇒ In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
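A compact Python sketch of the indirect binary search (the SA is built here by plainly sorting the suffixes, i.e. the "elegant but inefficient" construction):

def build_sa(T):
    return sorted(range(len(T)), key=lambda i: T[i:])    # Θ(n² log n) worst case

def search(T, SA, P):
    lo, hi = 0, len(SA)
    while lo < hi:                                       # leftmost suffix >= P
        mid = (lo + hi) // 2
        if T[SA[mid]:] < P: lo = mid + 1
        else: hi = mid
    occ = []
    while lo < len(SA) and T[SA[lo]:].startswith(P):
        occ.append(SA[lo] + 1)                           # report 1-based positions
        lo += 1
    return sorted(occ)

T = "mississippi#"
SA = build_sa(T)
print([s + 1 for s in SA])      # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(search(T, SA, "si"))      # [4, 7]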

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest common prefix between suffixes adjacent in SA

Lcp = 0 0 1 4 0 0 1 0 2 1 3
SA  = 12 11 8 5 2 1 10 9 7 4 6 3
T = mississippi#

e.g.  lcp( issippi#, ississippi# ) = 4
• How long is the common prefix between T[i,...] and T[j,...]?
  • It is the min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L?
  • Search for some Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times?
  • Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.


Slide 226

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 10^4 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and the algorithm uses all of it:
(1/B) * (p * f/(1+f) * C)  ≈  30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
       4K    8K    16K   32K    128K   256K   512K   1M
n³     22s   3m    26m   3.5h   28h    --     --     --
n²     0     0     0     1s     26s    106s   7m     28m

An optimal solution
We assume every subsum ≠ 0

[Figure: the optimum subarray is preceded by a stretch of negative sum (< 0) and is made of stretches of positive sum (> 0).]

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm
  sum = 0; max = -1;
  For i = 1,...,n do
    If (sum + A[i] ≤ 0) then sum = 0;
    else sum += A[i];  max = MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT
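The same scan as runnable Python (a variant of Kadane's algorithm; like the slide it assumes the array contains at least one positive element):

def max_subarray_sum(A):
    best, s = -float('inf'), 0
    for x in A:
        if s + x <= 0:
            s = 0                  # a window ending here cannot help: restart
        else:
            s += x
            best = max(best, s)
    return best

print(max_subarray_sum([2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]))   # 12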

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort ⇒ Θ(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01  if (i < j) then
02    m = (i+j)/2;             (Divide)
03    Merge-Sort(A,i,m);       (Conquer)
04    Merge-Sort(A,m+1,j);
05    Merge(A,i,m,j)           (Combine)

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Θ(n log₂ n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

[Figure: the log₂ N levels of the recursion tree; pairs of sorted runs (e.g. 1 2 5 10 and 2 7 9 13, ...) are merged level by level into longer runs.]

If the run-size is larger than B (i.e. after the first step!!), fetching all of it in memory for merging does not help.

How do we deploy the disk/memory features?

With internal memory M: N/M runs, each sorted in internal memory (no I/Os)
— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort

The key is to balance run-size and #runs to merge.
Sort N items with main memory M and disk pages of B items:
  Pass 1: produce N/M sorted runs.
  Pass i: merge X ≤ M/B runs  ⇒  log_{M/B}(N/M) passes.

[Figure: X input buffers (one per run) and one output buffer, each of B items, sit in main memory between the input disk and the output disk.]

Multiway Merging

[Figure: buffers Bf1..Bfx with cursors p1..pX over the current pages of runs 1..X (X = M/B). At each step min(Bf1[p1], Bf2[p2], ..., Bfx[pX]) is appended to the output buffer Bfo; a new page of run i is fetched when pi = B, and Bfo is flushed to the merged output run when full, until EOF.]

Cost of Multi-way Merge-Sort

  Number of passes = log_{M/B}(#runs)  ≈  log_{M/B}(N/M)
  Optimal cost = Θ( (N/B) log_{M/B}(N/M) )  I/Os

In practice
  M/B ≈ 1000  ⇒  #passes = log_{M/B}(N/M) ≈ 1
  One multiway merge ⇒ 2 passes = a few mins   (tuning depends on disk features)

  Large fan-out (M/B) decreases #passes
  Compression would decrease the cost of a pass!
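One merge pass, sketched in Python with a heap over the X current buffer heads (in-memory lists stand in for the on-disk runs and their B-sized buffers):

import heapq

def multiway_merge(runs):
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)                     # current minimum among the X "input buffers"
    out = []
    while heap:
        val, i, j = heapq.heappop(heap)
        out.append(val)                     # in practice: append to Bfo, flush when full
        if j + 1 < len(runs[i]):
            heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
    return out

print(multiway_merge([[1, 2, 5, 10], [2, 7, 9, 13], [3, 4, 8, 19]]))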

Can compression help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables <X, C>
For each item s of the stream:
  if (X == s) then C++
  else { C--;  if (C == 0) { X = s; C = 1; } }
Return X;

Proof
(Problems arise only if the most frequent item occurs ≤ N/2 times.)

If the algorithm ended with X ≠ y, then every one of y's occurrences would have a “negative” mate (a distinct item that cancelled it). Hence these mates would be ≥ #occ(y), i.e. N ≥ 2 · #occ(y), contradicting #occ(y) > N/2...
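The two-variable scan as Python (an equivalent arrangement of the same idea, the Boyer-Moore majority vote); the returned candidate is the mode only when some item really occurs > N/2 times:

def candidate_majority(stream):
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1            # adopt s as the new candidate
        elif X == s:
            C += 1
        else:
            C -= 1                 # s "cancels" one occurrence of the candidate
    return X

A = "bacccdcbaaaccbccc"
print(candidate_majority(A))       # c  (c occurs 9 times out of 17: > N/2)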

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 × 10^9   (size = 6Gb)
n = 10^6 documents
TotT = 10^9   (avg term length is 6 chars)
t = 5 × 10^5 distinct terms

What kind of data structure should we build to support word-based searches?

Solution 1: Term-Doc matrix        (t = 500K terms,  n = 1 million docs)

            Antony&Cleopatra  JuliusCaesar  TheTempest  Hamlet  Othello  Macbeth
Antony            1                1             0          0       0        1
Brutus            1                1             0          1       0        0
Caesar            1                1             0          1       1        1
Calpurnia         0                1             0          0       0        0
Cleopatra         1                0             0          0       0        0
mercy             1                0             1          1       1        1
worser            1                0             1          1       1        0

1 if the play contains the word, 0 otherwise.       Space is 500Gb !

Solution 2: Inverted index

  Brutus    →  2  4  8  16  32  64  128
  Calpurnia →  1  2  3  5  8  13  21  34
  Caesar    →  13 16

We can still do better: i.e. 30–50% of the original text

1. Typically uses about 12 bytes per posting
2. We have 10^9 total terms  ⇒  at least 12Gb space
3. Compressing the 6Gb documents gets 1.5Gb of data
Better index, but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2^n but we have fewer compressed msgs…

  ∑_{i=1..n-1} 2^i  =  2^n − 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La(C) = ∑_{s ∈ S} p(s) · L[s]

We say that a prefix code C is optimal if for all prefix codes C’,  La(C) ≤ La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal uniquely decodable code, there exists a prefix code with the same codeword lengths, and thus the same optimal average length. And vice versa…

Theorem (golden rule). If C is an optimal prefix code for a source with probabilities {p1, …, pn},
then pi < pj  ⇒  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any uniquely decodable code C, we have
  H(S) ≤ La(C)

Theorem (upper bound, Shannon). For any probability distribution, there exists a prefix code C such that
  La(C) ≤ H(S) + 1       (the Shannon code takes ⌈log 1/p⌉ bits)

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
[Huffman tree: a(.1) + b(.2) → (.3);   (.3) + c(.2) → (.5);   (.5) + d(.5) → (1).]

a = 000, b = 001, c = 01, d = 1
There are 2^(n-1) “equivalent” Huffman trees

What about ties (and thus, tree depth) ?
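A compact Python sketch of the greedy construction on the running example; ties are broken by insertion order, which picks one of the 2^(n-1) equivalent trees:

import heapq
from itertools import count

def huffman_codes(probs):
    tick = count()                                  # tie-breaker (dicts are never compared)
    heap = [(p, next(tick), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)             # two least-probable trees...
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, next(tick), merged))   # ...become one node
    return heap[0][2]

print(huffman_codes({"a": .1, "b": .2, "c": .2, "d": .5}))
# -> {'d': '0', 'c': '10', 'a': '110', 'b': '111'}: same codeword lengths as the slide's code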

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
[Same tree as above]

abc…      →  000 001 01 …   =  00000101…
101001…   →  1 · 01 · 001   =  dcb

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:

  The model takes |S|^k · (k · log |S|) + h²   (where h might be |S|)

  It is H0(S^L) ≤ L · Hk(S) + O(k · log |S|), for each k ≤ L

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H(s) = ∑_{i=1..m} 2^(m−i) · s[i]

P = 0101
H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5
s = s’  if and only if  H(s) = H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
  1·2 + 0 (mod 7) = 2
  2·2 + 1 (mod 7) = 5
  5·2 + 1 (mod 7) = 4
  4·2 + 1 (mod 7) = 2
  2·2 + 1 (mod 7) = 5
  5 (mod 7) = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(Tr−1), using
  2^m (mod q) = 2 · (2^(m−1) (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
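A bare-bones Python sketch of the fingerprint scan on a binary text; here q is a fixed prime rather than one picked at random below I, and every fingerprint hit is verified, so it behaves as the deterministic variant:

def karp_rabin(T, P, q=2**31 - 1):
    n, m = len(T), len(P)
    hq = lambda s: sum(int(c) << (len(s) - 1 - i) for i, c in enumerate(s)) % q
    hp, ht = hq(P), hq(T[:m])
    top = pow(2, m - 1, q)                       # 2^(m-1) mod q, used to drop the old bit
    hits = []
    for r in range(n - m + 1):
        if ht == hp and T[r:r + m] == P:         # verify: declare a definite match
            hits.append(r + 1)
        if r + m < n:                            # slide: Hq(T_{r+1}) from Hq(T_r)
            ht = (2 * (ht - int(T[r]) * top) + int(T[r + m])) % q
    return hits

print(karp_rabin("10110101", "0101"))            # [5]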

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

[M is the m × n matrix below]

        c a l i f o r n i a
   j =  1 2 3 4 5 6 7 8 9 10
f  1    0 0 0 0 1 0 0 0 0 0
o  2    0 0 0 0 0 1 0 0 0 0
r  3    0 0 0 0 0 0 1 0 0 0

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:   P = abaac

U(a) = (1,0,1,1,0)ᵀ      U(b) = (0,1,0,0,0)ᵀ      U(c) = (0,0,0,0,1)ᵀ

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, entry M(i,j) = 1 iff
  (1) the first i−1 characters of P match the i−1 characters of T ending at position j−1   ⇔  M(i−1, j−1) = 1
  (2) P[i] = T[j]   ⇔  the i-th bit of U(T[j]) = 1

BitShift moves bit M(i−1, j−1) into the i-th position;
AND-ing with the i-th bit of U(T[j]) establishes whether both conditions hold.
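A word-based Python sketch of the whole scan (bit i of each mask corresponds to P[i+1]; m is assumed to fit in a machine word):

def shift_and(T, P):
    U = {}                                    # U[c]: bit i set iff P[i+1] == c
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    M, full, hits = 0, 1 << (len(P) - 1), []
    for j, c in enumerate(T, start=1):
        M = ((M << 1) | 1) & U.get(c, 0)      # M(j) = BitShift(M(j-1)) & U(T[j])
        if M & full:
            hits.append(j - len(P) + 1)       # an occurrence of P ends at j
    return hits

print(shift_and("xabxabaaca", "abaac"))       # [5]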

An example, j=1      (T = xabxabaaca,  P = abaac)

U(x) = (0,0,0,0,0)ᵀ
M(1) = BitShift(M(0)) & U(T[1]) = (1,0,0,0,0)ᵀ & (0,0,0,0,0)ᵀ = (0,0,0,0,0)ᵀ

An example, j=2

U(a) = (1,0,1,1,0)ᵀ
M(2) = BitShift(M(1)) & U(T[2]) = (1,0,0,0,0)ᵀ & (1,0,1,1,0)ᵀ = (1,0,0,0,0)ᵀ

An example, j=3

U(b) = (0,1,0,0,0)ᵀ
M(3) = BitShift(M(2)) & U(T[3]) = (1,1,0,0,0)ᵀ & (0,1,0,0,0)ᵀ = (0,1,0,0,0)ᵀ

An example, j=9

         x a b x a b a a c
    j =  1 2 3 4 5 6 7 8 9
a   1    0 1 0 0 1 0 1 1 0
b   2    0 0 1 0 0 1 0 0 0
a   3    0 0 0 0 0 0 1 0 0
a   4    0 0 0 0 0 0 0 1 0
c   5    0 0 0 0 0 0 0 0 1

U(c) = (0,0,0,0,1)ᵀ
M(9) = BitShift(M(8)) & U(T[9]) = (1,1,0,0,1)ᵀ & (0,0,0,0,1)ᵀ = (0,0,0,0,1)ᵀ
M(5,9) = 1  ⇒  an occurrence of P ends at position 9.

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac

U(a) = (1,0,1,1,0)ᵀ      U(b) = (1,1,0,0,0)ᵀ      U(c) = (0,0,0,0,1)ᵀ

What about ‘?’, ‘[^…]’ (not)?

Problem 1: Another solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of length m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

[Figure: P[1..i−1] aligned against T ending at j−1 with ≤ l mismatches, followed by the equal pair P[i] = T[j].]

BitShift( Mˡ(j−1) )  &  U(T[j])

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

[Figure: P[1..i−1] aligned against T ending at j−1 with ≤ l−1 mismatches; the pair P[i], T[j] may mismatch.]

BitShift( Mˡ⁻¹(j−1) )

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Mˡ(j), we observe that there is a match iff

Mˡ(j) = [ BitShift( Mˡ(j−1) ) & U(T[j]) ]  OR  BitShift( Mˡ⁻¹(j−1) )

Example M1      (T = xabxabaaca,  P = abaad)

M1 =
         x a b x a b a a c a
    j =  1 2 3 4 5 6 7 8 9 10
a   1    1 1 1 1 1 1 1 1 1 1
b   2    0 0 1 0 0 1 0 1 1 0
a   3    0 0 0 1 0 0 1 0 0 1
a   4    0 0 0 0 1 0 0 1 0 0
d   5    0 0 0 0 0 0 0 0 1 0

M0 =
         x a b x a b a a c a
    j =  1 2 3 4 5 6 7 8 9 10
a   1    0 1 0 0 1 0 1 1 0 1
b   2    0 0 1 0 0 1 0 0 0 0
a   3    0 0 0 0 0 0 1 0 0 0
a   4    0 0 0 0 0 0 0 1 0 0
d   5    0 0 0 0 0 0 0 0 0 0

How much do we pay?





The running time is O( k·n·(1 + m/w) ).
Again, the method is practically efficient for small m.
Still, only O(k) columns of M are needed at any given time; hence the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding
  γ(x) = 0^(Length−1) · (x in binary),   where x > 0 and Length = ⌊log₂ x⌋ + 1
  e.g., 9 is represented as <000, 1001>.

  γ-code for x takes 2⌊log₂ x⌋ + 1 bits   (i.e. a factor of 2 from optimal)
  Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…
Given the following sequence of γ-coded integers, reconstruct the original sequence:

0001000001100110000011101100111    →    8, 6, 3, 59, 7
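A small Python sketch of γ-encoding and decoding, used here only to check the exercise above:

def gamma_encode(x):
    b = bin(x)[2:]                      # x > 0, written in binary
    return '0' * (len(b) - 1) + b       # Length-1 zeros, then x in binary

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == '0':           # count the leading zeros = Length - 1
            z += 1; i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out

print(''.join(gamma_encode(x) for x in [8, 6, 3, 59, 7]))
print(gamma_decode("0001000001100110000011101100111"))     # [8, 6, 3, 59, 7]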

Analysis
Sort the pi in decreasing order, and encode symbol si via the variable-length code γ(i).

Recall that: |γ(i)| ≤ 2·log i + 1
How good is this approach wrt Huffman?   Compression ratio ≤ 2·H0(s) + 1
Key fact:
  1 ≥ ∑_{i=1..x} pi ≥ x · px   ⇒   x ≤ 1/px

How good is it ?
Encode the integers via γ-coding:  |γ(i)| ≤ 2·log i + 1
The cost of the encoding is (recall i ≤ 1/pi):

  ∑_{i=1..|S|} pi · |γ(i)|  ≤  ∑_{i=1..|S|} pi · [ 2·log(1/pi) + 1 ]  =  2·H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte, s·c with 2 bytes, s·c² with 3 bytes, ...

An example
  5000 distinct words
  ETDC encodes 128 + 128² = 16512 words on ≤ 2 bytes
  A (230,26)-dense code encodes 230 + 230·26 = 6210 words on ≤ 2 bytes, hence more on 1 byte, and thus better if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1n 2n 3n… nn  Huff = O(n2 log n), MTF = O(n log n) + n2

No much worse than Huffman
...but it may be far better
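A plain-list Python sketch of the two steps above (the search-tree / hash-table machinery of the later slides is only needed for speed):

def mtf_encode(text, alphabet):
    L, out = list(alphabet), []
    for s in text:
        i = L.index(s)                 # 1) output the position of s in L
        out.append(i)
        L.insert(0, L.pop(i))          # 2) move s to the front of L
    return out

print(mtf_encode("abbbaacccca", "abcd"))   # [0, 1, 0, 0, 1, 0, 2, 0, 0, 0, 1]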

MTF: how good is it ?
Encode the integers via γ-coding:  |γ(i)| ≤ 2·log i + 1
Put S in front of the list and consider the cost of encoding:

  O(|S| log |S|)  +  ∑_{x=1..|S|} ∑_{i=2..nx} |γ( pⁱx − pⁱ⁻¹x )|

By Jensen’s inequality:

  ≤  O(|S| log |S|)  +  ∑_{x=1..|S|} nx · [ 2·log(N/nx) + 1 ]
  =  O(|S| log |S|)  +  N · [ 2·H0(X) + 1 ]

La[mtf]  ≤  2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1^n 2^n 3^n … n^n  ⇒

There is a memory

Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.   p(a) = .2,  p(b) = .5,  p(c) = .3

f(i) = ∑_{j=1..i−1} p(j)         f(a) = .0,  f(b) = .2,  f(c) = .7

[Figure: the interval [0,1) partitioned as  a = [0,.2),  b = [.2,.7),  c = [.7,1).]

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
[Figure: coding b, a, c narrows the interval step by step]
  start      [0, 1)
  after b    [.2, .7)
  after a    [.2, .3)
  after c    [.27, .3)
The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l 0  0 l i  l i -1  s i -1 * f c i 

s0  1

s i  s i -1 * p c i 

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

n

sn 

 p c 
i

i 1

The interval for a message sequence will be called the
sequence interval
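A few lines of Python implementing exactly this (l, s) recurrence, checked on the bac example of the previous slide:

p = {'a': .2, 'b': .5, 'c': .3}
f = {'a': .0, 'b': .2, 'c': .7}           # cumulative prob up to the symbol (excluded)

def sequence_interval(msg):
    l, s = 0.0, 1.0
    for c in msg:
        l = l + s * f[c]                  # l_i = l_{i-1} + s_{i-1} * f[c_i]
        s = s * p[c]                      # s_i = s_{i-1} * p[c_i]
    return l, s

l, s = sequence_interval("bac")
print(l, l + s)                           # ~0.27 ~0.30: the sequence interval [.27, .3)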

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
[Figure: decoding .49]
  .49 ∈ [.2, .7)    = b’s interval        → b
  .49 ∈ [.3, .55)   = b’s sub-interval    → b
  .49 ∈ [.475, .55) = c’s sub-interval    → c

The message is bbc.

Representing a real number
Binary fractional representation:
. 75

 . 11

1/ 3

 . 01 01

11 / 16

 . 1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
m in

m ax

interval

.11

.110

.111

[.75 ,1.0 )

.101

.1010

.1011

[.625 , .75 )

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
Context
Empty

Counts
A
B
C
$

=
=
=
=

4
2
5
3

Context
A
B
C

Counts
C
$
A
$
A
B
C
$

=
=
=
=
=
=
=
=

3
1
2
1
1
2
2
3

Context
AC

BA
CA
CB
CC

String = ACCBACCACBA B

k=2

Counts
B
C
$
C
$
C
$
A
$
A
B
$

=
=
=
=
=
=
=
=
=
=
=
=

1
2
2
1
1
1
1
2
1
1
1
2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves
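A minimal C sketch of the encoding step just described: a naive search for the longest match inside the last W characters, emitting triples (d, len, c). The window size and buffer handling are illustrative assumptions, not the slides' own code.

#include <stdio.h>
#include <string.h>

#define W 6   /* window size, as in the example below */

void lz77_encode(const char *T) {
    int n = strlen(T), cur = 0;
    while (cur < n) {
        int best_len = 0, best_d = 0;
        int start = cur - W < 0 ? 0 : cur - W;
        for (int i = start; i < cur; i++) {        /* candidate copy source   */
            int len = 0;                           /* matches may overlap cur */
            while (cur + len < n - 1 && T[i + len] == T[cur + len]) len++;
            if (len > best_len) { best_len = len; best_d = cur - i; }
        }
        printf("(%d,%d,%c)\n", best_d, best_len, T[cur + best_len]);
        cur += best_len + 1;                       /* advance by len + 1      */
    }
}

int main(void) {
    lz77_encode("aacaacabcabaaac");   /* reproduces the triples of the window example below */
    return 0;
}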

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use a shorter match so that the next match is better
Hash table to speed up searches on triplets
Triples are coded with Huffman's code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
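A sketch in C of LZW decoding that shows how the special case above is handled: when the received code is not yet in the dictionary, the missing entry must be prev + prev[0]. Codes 0..255 stand for the single bytes; table sizes and the example are illustrative assumptions, not the slides' own numbering.

#include <stdio.h>
#include <string.h>

#define MAXD 4096

static char dict[MAXD][256];   /* fixed-size buffers: enough for a sketch */
static int  dsize = 256;       /* 0..255 = single ASCII characters        */

void lzw_decode(const int *code, int ncodes) {
    char prev[256], cur[256];
    for (int i = 0; i < 256; i++) { dict[i][0] = (char)i; dict[i][1] = 0; }
    strcpy(prev, dict[code[0]]);
    printf("%s", prev);
    for (int i = 1; i < ncodes; i++) {
        if (code[i] < dsize)
            strcpy(cur, dict[code[i]]);        /* known code                         */
        else {                                 /* code not yet in dict: cur = prev + prev[0] */
            strcpy(cur, prev);
            strncat(cur, prev, 1);
        }
        printf("%s", cur);
        strcpy(dict[dsize], prev);             /* decoder adds prev + cur[0],        */
        strncat(dict[dsize], cur, 1);          /* one step behind the encoder        */
        dsize++;
        strcpy(prev, cur);
    }
    printf("\n");
}

int main(void) {
    int code[] = {97, 256, 97};   /* LZW codes of "aaaa": 256 arrives before the decoder knows it */
    lzw_decode(code, 3);          /* prints: aaaa */
    return 0;
}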

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us be given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
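A C sketch of the inversion: the LF-mapping is built by counting characters (stably, which is exactly the "same relative order" property above), and then T is rebuilt backward starting from the row of '#'. The 0-based indexing and the '#' sentinel convention are assumptions made for this sketch.

#include <stdio.h>
#include <string.h>

void invert_bwt(const char *L, char *T) {        /* assumes strlen(L) <= 4096 */
    int n = strlen(L), count[256] = {0}, first[256], LF[4096], occ[256] = {0};
    for (int i = 0; i < n; i++) count[(unsigned char)L[i]]++;
    for (int c = 0, sum = 0; c < 256; c++) { first[c] = sum; sum += count[c]; }
    for (int i = 0; i < n; i++) {                /* LF[i] = position of L[i] in F */
        unsigned char c = (unsigned char)L[i];
        LF[i] = first[c] + occ[c]++;
    }
    int r = 0;                                   /* row 0 starts with '#'          */
    T[n - 1] = '#';                              /* the sentinel ends T            */
    for (int i = n - 2; i >= 0; i--) {           /* L[r] precedes F[r] in T        */
        T[i] = L[r];
        r = LF[r];
    }
    T[n] = '\0';
}

int main(void) {
    char T[64];
    invert_bwt("ipssm#pissii", T);   /* L column of the slides' example */
    printf("%s\n", T);               /* prints: mississippi#            */
    return 0;
}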

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]
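In C, with 0-based arrays (the slides use 1-based positions), the same identity reads as follows; the wrap-around handles the suffix starting at position 0.

void bwt_from_sa(const char *T, const int *SA, char *L, int n) {
    for (int i = 0; i < n; i++)
        L[i] = T[(SA[i] + n - 1) % n];   /* char cyclically preceding suffix SA[i] */
}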

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC): set of nodes such that from any node one can reach any other node via an undirected path.

Strongly connected components (SCC): set of nodes such that from any node one can reach any other node via a directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by humans

Exploit the structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has a hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:

Skewed distribution: the probability that a node has x links is 1/x^α, with α ≈ 2.1

The In-degree distribution

Altavista crawl 1999, WebBase crawl 2001:
In-degree follows a power-law distribution:

Pr[ in-degree(u) ≥ k ]  ≈  1 / k^α,   with α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has a hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:

Skewed distribution: the probability that a node has x links is 1/x^α, with α ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 million pages, 150 million links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77 scheme provides an efficient, optimal solution:

f_known is the "previously encoded text": compress the concatenation f_known·f_new, starting from f_new

zdelta is one of the best implementations
          Emacs size   Emacs time
uncompr   27Mb         ---
gzip      8Mb          35 secs
zdelta    1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: example weighted graph GF over a small group of files plus the dummy node; edge weights are the zdelta/gzip sizes (e.g. 20, 123, 220, 620, 2000); the min branching picks the cheapest reference for each file.]

          space   time
uncompr   30Mb    ---
tgz       20%     linear
THIS      8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly: n² edge calculations (zdelta executions)

We wish to exploit some pruning approach

Collection analysis: cluster the files that appear similar, and are thus good candidates for zdelta compression. Build a sparse weighted graph G'_F containing only the edges between those pairs of files.

Assign weights: estimate appropriate edge weights for G'_F, thus saving zdelta executions. Nonetheless, still n² time.
          space   time
uncompr   260Mb   ---
tgz       12%     2 mins
THIS      8%      16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

          gcc size   emacs size
total     27288      27326
gzip      7563       8577
zdelta    227        1431
rsync     964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
#

0

s

i

1
ssi

ppi#

4

#

1

p

i

3

2

1

ppi#

6

pi#

7

8
5

T# = mississippi#
2 4 6 8 10

si

ppi#

i#

ppi#

11

mississippi#

12

2

1 10

9

4

3

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
5

Θ(N²) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
⇒ In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
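A C sketch of the indirect binary search just analyzed: each step compares at most p characters of P against the suffix pointed to by SA, giving O(p log n) time overall. The 0-based SA and the occurrence-reporting loop are assumptions of this sketch.

#include <stdio.h>
#include <string.h>

int sa_lower_bound(const char *T, int n, const int *SA, const char *P) {
    int p = strlen(P), lo = 0, hi = n;            /* first suffix >= P on p chars */
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (strncmp(T + SA[mid], P, p) < 0) lo = mid + 1;
        else hi = mid;
    }
    return lo;
}

void sa_search(const char *T, int n, const int *SA, const char *P) {
    int p = strlen(P);
    for (int i = sa_lower_bound(T, n, SA, P);
         i < n && strncmp(T + SA[i], P, p) == 0; i++)
        printf("occurrence at position %d\n", SA[i]);    /* 0-based positions */
}

int main(void) {
    const char *T = "mississippi#";
    int SA[] = {11,10,7,4,1,0,9,8,6,3,5,2};   /* 0-based suffix array of T */
    sa_search(T, 12, SA, "si");               /* reports positions 6 and 3 */
    return 0;
}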

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
  • Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  • Search for some Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  • Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.


Slide 227

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 10^4 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and the algorithm uses all of it:
(1/B) * (p * f/(1+f) * C) ≈ 30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
        4K    8K    16K   32K   128K  256K  512K  1M
n^3     22s   3m    26m   3.5h  28h   --    --    --
n^2     0     0     0     1s    26s   106s  7m    28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm
 sum = 0; max = -1;
 for i = 1, …, n do
   if (sum + A[i] ≤ 0) then sum = 0;
   else { sum += A[i]; max = MAX{max, sum}; }

Note:
 • Sum < 0 when OPT starts;
 • Sum > 0 within OPT
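A minimal C version of the linear-time scan above, under the same assumption made by the slide (no subsum is exactly zero; an all-negative input would need the usual extra care).

#include <stdio.h>

int max_subarray(const int *A, int n) {
    int sum = 0, best = A[0];            /* best subarray sum seen so far           */
    for (int i = 0; i < n; i++) {
        if (sum + A[i] <= 0) sum = 0;    /* a window dragging us to <= 0 is dropped */
        else { sum += A[i]; if (sum > best) best = sum; }
    }
    return best;
}

int main(void) {
    int A[] = {2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7};
    printf("%d\n", max_subarray(A, 11));   /* prints 12, from the subarray 6 1 -2 4 3 */
    return 0;
}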

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an "array of pointers to objects"
For each object-to-object comparison A[i] vs A[j]:
 2 random accesses to the memory locations pointed to by A[i] and A[j]

MergeSort ⇒ Θ(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 10^9 random I/Os = 10^9 * 5ms ≈ 2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
  if (i < j) then
    m = (i+j)/2;            // Divide
    Merge-Sort(A,i,m);      // Conquer
    Merge-Sort(A,m+1,j);
    Merge(A,i,m,j)          // Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word frequencies:
 n = 10^9 tuples ⇒ few Gbs
 Typical disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:
 It is an indirect sort: Θ(n log2 n) random I/Os
 [5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
[Figure: recursion tree of binary Merge-Sort on a sample array; leaves are single items and each level merges sorted runs of doubled size.]

How do we deploy the disk/memory features?
N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge.
Sort N items with main memory M and disk pages of B items:
 Pass 1: produce N/M sorted runs.
 Pass i: merge X ≤ M/B runs ⇒ log_{M/B} (N/M) merge passes.
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
[Figure: X = M/B sorted runs are merged at once. Each run keeps one current page Bf_i in memory with a pointer p_i; the merger repeatedly moves min(Bf_1[p_1], …, Bf_X[p_X]) to the output buffer Bf_o, fetches the next page of a run when p_i = B, and flushes Bf_o to the merged output run when it is full, until EOF.]
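A C sketch of this X-way merging step: each run keeps one block in memory, the smallest current item is picked (by a linear scan here; a heap or loser tree would make it O(log X) per item), and a run's buffer is refilled from disk when exhausted. The file-based I/O, block size and demo data are illustrative assumptions.

#include <stdio.h>

#define X 4          /* fan-out: #runs merged at once */
#define B 1024       /* block size, in items          */

typedef struct { FILE *f; int buf[B]; int pos, len; } Run;

static int refill(Run *r) {                    /* fetch the next block, if any */
    r->len = (int)fread(r->buf, sizeof(int), B, r->f);
    r->pos = 0;
    return r->len > 0;
}

void multiway_merge(Run run[X], FILE *out) {
    int alive = 0;
    for (int i = 0; i < X; i++) alive += refill(&run[i]);
    while (alive > 0) {
        int best = -1;
        for (int i = 0; i < X; i++)            /* pick the run with the smallest head */
            if (run[i].len > 0 &&
                (best < 0 || run[i].buf[run[i].pos] < run[best].buf[run[best].pos]))
                best = i;
        fwrite(&run[best].buf[run[best].pos++], sizeof(int), 1, out);
        if (run[best].pos == run[best].len && !refill(&run[best]))
            alive--;                           /* run exhausted */
    }
}

int main(void) {
    Run run[X];
    int data[X][3] = {{1,5,9},{2,6,10},{3,7,11},{4,8,12}};
    for (int i = 0; i < X; i++) {
        run[i].f = tmpfile();
        fwrite(data[i], sizeof(int), 3, run[i].f);
        rewind(run[i].f);
    }
    FILE *out = tmpfile();
    multiway_merge(run, out);
    rewind(out);
    int v;
    while (fread(&v, sizeof(int), 1, out) == 1) printf("%d ", v);   /* 1 2 3 ... 12 */
    printf("\n");
    return 0;
}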

Cost of Multi-way Merge-Sort


Number of passes = log_{M/B} #runs ≈ log_{M/B} (N/M)
Optimal cost = Θ( (N/B) · log_{M/B} (N/M) ) I/Os

In practice:
 M/B ≈ 1000 ⇒ #passes = log_{M/B} (N/M) ≈ 1
 One multiway merge ⇒ 2 passes = a few minutes
 Tuning depends on disk features
 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm
 Use a pair of variables X (candidate) and C (counter)
 For each item s of the stream:
   if (X == s) then C++
   else { C--; if (C == 0) { X = s; C = 1; } }
 Return X;

Proof
(Problems arise only if the most frequent item occurs ≤ N/2 times.)
If the returned X ≠ y, then every one of y's occurrences has a distinct "negative" mate, hence the mates are ≥ #occ(y) and 2 * #occ(y) ≤ N: this contradicts #occ(y) > N/2.
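A C version of the one-pass scan above; the example stream is the slide's sequence with the spaces removed, and a second verification pass would be needed when no majority is guaranteed.

#include <stdio.h>

char majority_candidate(const char *stream, int N) {
    char X = stream[0];
    int  C = 0;
    for (int i = 0; i < N; i++) {
        if (stream[i] == X) C++;
        else { C--; if (C == 0) { X = stream[i]; C = 1; } }
    }
    return X;   /* the majority item, if one exists */
}

int main(void) {
    const char *A = "bacccdcbaaaccbccc";          /* the slide's stream, spaces removed */
    printf("%c\n", majority_candidate(A, 17));    /* prints c */
    return 0;
}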

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 10^9 chars ⇒ size = 6Gb
n = 10^6 documents
TotT = 10^9 total term occurrences (avg term length is 6 chars)
t = 5 * 10^5 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million documents, t = 500K terms

             Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony              1                1             0          0       0        1
Brutus              1                1             0          1       0        0
Caesar              1                1             0          1       1        1
Calpurnia           0                1             0          0       0        0
Cleopatra           1                0             0          0       0        0
mercy               1                0             1          1       1        1
worser              1                0             1          1       1        0

(1 if the play contains the word, 0 otherwise)

Space is 500Gb !

Solution 2: Inverted index
Brutus    →  2  4  8  16  32  64  128
Calpurnia →  1  2  3  5  8  13  21  34
Caesar    →  13  16

We can still do better: i.e. 30-50% of the original text.

1. Typically we use about 12 bytes per posting
2. We have 10^9 total terms ⇒ at least 12Gb of space
3. Compressing the 6Gb of documents gets 1.5Gb of data
Better index, but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string is (losslessly) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL of them into fewer bits?
NO: they are 2^n, but there are fewer shorter compressed messages:

∑_{i=1,…,n-1} 2^i = 2^n - 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i(s) = log2 (1/p(s)) = -log2 p(s)

Lower probability ⇒ higher information

Entropy is the weighted average of i(s):

H(S) = ∑_{s∈S} p(s) · log2 (1/p(s))   bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La(C) = ∑_{s∈S} p(s) · L[s]

We say that a prefix code C is optimal if for all prefix codes C', La(C) ≤ La(C').

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal uniquely decodable code, there exists a prefix code with the same codeword lengths, and thus the same optimal average length. And vice versa…

Theorem (golden rule). If C is an optimal prefix code for a source with probabilities {p1, …, pn}, then p_i < p_j ⇒ L[s_i] ≥ L[s_j]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H(S) ≤ La(C)
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La(C) ≤ H(S) + 1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2^(n-1) "equivalent" Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...
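A minimal C sketch of decoding driven by the firstcode[] array above: bits are read until the running value v reaches firstcode[l] for the current length l, and then v - firstcode[l] is the offset inside level l. The nextbit() callback and the Symbol table are illustrative assumptions.

char canonical_decode(const int *firstcode, char *const Symbol[], int (*nextbit)(void)) {
    int l = 1;
    int v = nextbit();
    while (v < firstcode[l]) {        /* no codeword of length l is this small   */
        v = 2 * v + nextbit();        /* extend the candidate code by one bit    */
        l++;
    }
    return Symbol[l][v - firstcode[l]];   /* offset inside level l               */
}

With the firstcode values of the example, the input bits 00010 are extended up to length 5 and decode to the third symbol of level 5.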

Problem with Huffman Coding
Consider a symbol with probability .999. Its self information is
-log2(.999) ≈ .00144
If we were to send 1000 such symbols we might hope to use 1000 * .00144 ≈ 1.44 bits.
Using Huffman, we take at least 1 bit per symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H(s) = ∑_{i=1,…,m} 2^{m-i} · s[i]

P = 0101
H(P) = 2^3·0 + 2^2·1 + 2^1·0 + 2^0·1 = 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H(T_r) = 2·H(T_{r-1}) - 2^m·T(r-1) + T(r+m-1)

T = 10110101
T_1 = 1011,  T_2 = 0110
H(T_1) = H(1011) = 11
H(T_2) = H(0110) = 2·11 - 2^4·1 + 0 = 22 - 16 + 0 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1·2 + 0 = 2 (mod 7)
2·2 + 1 = 5 (mod 7)
5·2 + 1 = 4 (mod 7)
4·2 + 1 = 2 (mod 7)
2·2 + 1 = 5 (mod 7)  ⇒  Hq(P) = 5

We can still compute Hq(T_r) from Hq(T_{r-1}):
2^m (mod q) = 2·(2^{m-1} (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
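A C sketch of the fingerprint search over a binary text: Hq(P) is compared with the rolling fingerprint Hq(T_r), and equal fingerprints are verified to rule out false matches. The prime q = 2^31 - 1 and the character-based I/O are illustrative assumptions.

#include <stdio.h>
#include <string.h>
#include <stdint.h>

void karp_rabin(const char *T, const char *P) {
    const uint64_t q = 2147483647ULL;            /* prime modulus            */
    int n = strlen(T), m = strlen(P);
    if (n < m) return;
    uint64_t hp = 0, ht = 0, pow = 1;            /* pow = 2^{m-1} mod q      */
    for (int i = 0; i < m; i++) {
        hp = (2*hp + (uint64_t)(P[i]-'0')) % q;
        ht = (2*ht + (uint64_t)(T[i]-'0')) % q;
        if (i > 0) pow = (2*pow) % q;
    }
    for (int r = 0; r + m <= n; r++) {
        if (hp == ht && strncmp(T + r, P, m) == 0)    /* verify: no false match */
            printf("occurrence at position %d\n", r + 1);
        if (r + m < n)                                 /* roll: drop T[r], add T[r+m] */
            ht = (2*(ht + q - (uint64_t)(T[r]-'0')*pow % q) + (uint64_t)(T[r+m]-'0')) % q;
    }
}

int main(void) {
    karp_rabin("10110101", "0101");   /* reports position 5, as in the slides */
    return 0;
}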

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

[Example: T = california, P = for, so M is a 3-by-10 matrix. Its only 1-entries are M(1,5), M(2,6) and M(3,7): "f", "fo" and "for" match T ending at characters 5, 6 and 7, respectively.]

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M(j) = BitShift( M(j-1) ) & U( T[j] )


For i > 1, entry M(i,j) = 1 iff
 (1) the first i-1 characters of P match the i-1 characters of T ending at character j-1  ⇔  M(i-1,j-1) = 1
 (2) P[i] = T[j]  ⇔  the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) into the i-th position; AND-ing it with the i-th bit of U(T[j]) establishes whether both conditions hold.
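A C sketch of the resulting scan, assuming m ≤ w so that one column of M fits in a single machine word (bit i-1 of the word is row i of M).

#include <stdio.h>
#include <string.h>
#include <stdint.h>

void shift_and(const char *T, const char *P) {
    int n = strlen(T), m = strlen(P);
    uint64_t U[256] = {0}, M = 0, last = (uint64_t)1 << (m - 1);
    for (int i = 0; i < m; i++)
        U[(unsigned char)P[i]] |= (uint64_t)1 << i;      /* U(x): positions of x in P          */
    for (int j = 0; j < n; j++) {
        M = ((M << 1) | 1) & U[(unsigned char)T[j]];     /* M(j) = BitShift(M(j-1)) & U(T[j])  */
        if (M & last)                                    /* row m set: full match ends at j    */
            printf("occurrence ending at position %d\n", j + 1);
    }
}

int main(void) {
    shift_and("xabxabaaca", "abaac");   /* reports an occurrence ending at position 9 */
    return 0;
}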

An example (T = xabxabaaca, P = abaac)

Applying M(j) = BitShift(M(j-1)) & U(T[j]) column by column gives, for rows i = 1..5 and columns j = 1..9:

        j = 1  2  3  4  5  6  7  8  9
row 1:      0  1  0  0  1  0  1  1  0
row 2:      0  0  1  0  0  1  0  0  0
row 3:      0  0  0  0  0  0  1  0  0
row 4:      0  0  0  0  0  0  0  1  0
row 5:      0  0  0  0  0  0  0  0  1

The 1 in row m = 5 at column j = 9 signals the occurrence of P ending at position 9 of T.

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

BitShift( M^l(j-1) ) & U(T[j])

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

BitShift( M^{l-1}(j-1) )

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1 (k = 1):  T = xabxabaaca,  P = abaad

M0 (exact matches), rows i = 1..5, columns j = 1..10:
row 1:  0 1 0 0 1 0 1 1 0 1
row 2:  0 0 1 0 0 1 0 0 0 0
row 3:  0 0 0 0 0 0 1 0 0 0
row 4:  0 0 0 0 0 0 0 1 0 0
row 5:  0 0 0 0 0 0 0 0 0 0

M1 (at most 1 mismatch):
row 1:  1 1 1 1 1 1 1 1 1 1
row 2:  0 0 1 0 0 1 0 1 1 0
row 3:  0 0 0 1 0 0 1 0 0 1
row 4:  0 0 0 0 1 0 0 1 0 0
row 5:  0 0 0 0 0 0 0 0 1 0

The 1 in row 5 of M1 at column 9 reports an occurrence of P with at most one mismatch ending at position 9 of T.
How much do we pay?





The running time is O(k·n·(1+m/w)).
Again, the method is practically efficient for small m.
Still, only O(k) columns of M are needed at any given time; hence the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is d(p,s) = the minimum number of operations needed to transform p into s via three ops:
 Insertion: insert a symbol in p
 Deletion: delete a symbol from p
 Substitution: change a symbol of p into a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
 g(x) = (Length - 1) zeros, followed by x in binary
 where x > 0 and Length = ⌊log2 x⌋ + 1
 e.g., 9 is represented as <000,1001>.

 The g-code for x takes 2⌊log2 x⌋ + 1 bits
 (i.e. a factor of 2 from optimal)

 Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

Answer: 8, 6, 3, 59, 7
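A C sketch of g-encoding and g-decoding matching the definition above; representing the bit stream as a string of '0'/'1' characters is an illustrative choice.

#include <stdio.h>

void gamma_encode(unsigned x, char *out) {                /* appends the code of x to out */
    char bin[40]; int len = 0;
    for (unsigned v = x; v > 0; v >>= 1) bin[len++] = (char)('0' + (v & 1));
    for (int i = 1; i < len; i++) *out++ = '0';           /* Length-1 zeros               */
    for (int i = len - 1; i >= 0; i--) *out++ = bin[i];   /* x in binary                  */
    *out = '\0';
}

unsigned gamma_decode(const char **s) {                   /* consumes one code from *s    */
    int zeros = 0; unsigned x = 0;
    while (**s == '0') { zeros++; (*s)++; }
    for (int i = 0; i <= zeros; i++) x = (x << 1) | (unsigned)(*(*s)++ - '0');
    return x;
}

int main(void) {
    const char *code = "0001000001100110000011101100111";   /* the exercise above */
    while (*code) printf("%u ", gamma_decode(&code));        /* prints: 8 6 3 59 7 */
    printf("\n");
    return 0;
}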

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
 1 ≥ ∑_{i=1,…,x} p_i ≥ x · p_x  ⇒  x ≤ 1/p_x

How good is it ?
Encode the integers via g-coding: |g(i)| ≤ 2 * log2 i + 1
The cost of the encoding is (recall i ≤ 1/p_i):

 ∑_{i=1,…,|S|} p_i · |g(i)|  ≤  ∑_{i=1,…,|S|} p_i · [2 * log2(1/p_i) + 1]  =  2 * H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1^n 2^n 3^n … n^n ⇒ Huff = O(n² log n), MTF = O(n log n) + n²

No much worse than Huffman
...but it may be far better
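A minimal C sketch of MTF encoding over bytes, following the two steps above: output the current position of the symbol in the list L, then move it to the front.

#include <stdio.h>

void mtf_encode(const unsigned char *in, int n, int *out) {
    unsigned char L[256];
    for (int i = 0; i < 256; i++) L[i] = (unsigned char)i;    /* initial list        */
    for (int i = 0; i < n; i++) {
        int pos = 0;
        while (L[pos] != in[i]) pos++;                        /* 1) position of s in L */
        out[i] = pos;
        for (int j = pos; j > 0; j--) L[j] = L[j-1];          /* 2) move s to the front */
        L[0] = in[i];
    }
}

int main(void) {
    const unsigned char s[] = "aabbbbccc";
    int code[9];
    mtf_encode(s, 9, code);
    for (int i = 0; i < 9; i++) printf("%d ", code[i]);       /* 97 0 98 0 0 0 99 0 0 */
    printf("\n");
    return 0;
}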

MTF: how good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
Put S in the front and consider the cost of encoding:
 O(|S| log |S|) + ∑_{x=1,…,|S|} ∑_{i=2,…,n_x} |g( p_x^i - p_x^{i-1} )|

where p_x^i is the position of the i-th occurrence of symbol x.

By Jensen's inequality this is
 ≤ O(|S| log |S|) + ∑_{x=1,…,|S|} n_x · [ 2 * log2(N/n_x) + 1 ]
 = O(|S| log |S|) + N · [ 2 * H0(X) + 1 ]

Hence La[mtf] ≤ 2 * H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings ⇒ just the run lengths and one bit
Properties:
 There is a memory: it exploits spatial locality, and it is a dynamic code
 X = 1^n 2^n 3^n … n^n ⇒ Huff(X) = Θ(n² log n) > Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g. with p(a) = .2, p(b) = .5, p(c) = .3:

  f(i) = ∑_{j=1,…,i-1} p(j)
  f(a) = .0,  f(b) = .2,  f(c) = .7

  a → [0.0, 0.2)     b → [0.2, 0.7)     c → [0.7, 1.0)

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
  start:            [0.0, 1.0)
  after coding b:   [0.2, 0.7)
  after coding a:   [0.2, 0.3)
  after coding c:   [0.27, 0.3)
The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
  l_0 = 0     l_i = l_{i-1} + s_{i-1} * f[c_i]
  s_0 = 1     s_i = s_{i-1} * p[c_i]

f[c] is the cumulative probability up to symbol c (not included).
Final interval size:  s_n = ∏_{i=1,…,n} p[c_i]

The interval for a message sequence will be called the
sequence interval

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
  .49 ∈ [.2, .7)   →  first symbol is b;   rescale: (.49 - .2)/.5 = .58
  .58 ∈ [.2, .7)   →  second symbol is b;  rescale: (.58 - .2)/.5 = .76
  .76 ∈ [.7, 1.0)  →  third symbol is c

The message is bbc.
Representing a real number
Binary fractional representation:
  .75 = .11      1/3 = .0101…      11/16 = .1011

Algorithm (emit the binary expansion of x ∈ [0,1)):
 1. x = 2*x
 2. If x < 1, output 0
 3. else x = x - 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
e.g. [0,.33) = .01     [.33,.66) = .1     [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
           min      max      interval
  .11      .110     .111…    [.75, 1.0)
  .101     .1010    .1011…   [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
Context
Empty

Counts
A
B
C
$

=
=
=
=

4
2
5
3

Context
A
B
C

Counts
C
$
A
$
A
B
C
$

=
=
=
=
=
=
=
=

3
1
2
1
1
2
2
3

Context
AC

BA
CA
CB
CC

String = ACCBACCACBA B

k=2

Counts
B
C
$
C
$
C
$
A
$
A
B
$

=
=
=
=
=
=
=
=
=
=
=
=

1
2
2
1
1
1
1
2
1
1
1
2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
    out[cursor + i] = out[cursor - d + i];   /* source may overlap the part being written: fine, each char is already there when it is needed */



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash table to speed up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
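A minimal C sketch of an LZW decoder that handles exactly this case (my own toy illustration: codes 0..255 are the raw bytes, so a is 97 here rather than the 112 used as an example id above; fixed limits, no error handling or cleanup). When a received code is not yet in the dictionary, the decoded string must be prev plus the first character of prev.

  #include <stdio.h>
  #include <string.h>
  #include <stdlib.h>

  #define MAXD 4096

  int main(void) {
      char *dict[MAXD]; int next = 256;
      for (int c = 0; c < 256; c++) {                 /* single-char entries   */
          dict[c] = malloc(2); dict[c][0] = (char)c; dict[c][1] = 0;
      }
      /* essentially the code stream of the encoding example above,
         with real ASCII ids and the final code for the last b        */
      int input[] = {97, 97, 98, 256, 99, 257, 261, 99, 98};
      int n = (int)(sizeof(input)/sizeof(*input));
      char prev[256] = "", cur[256];
      for (int i = 0; i < n; i++) {
          if (input[i] < next) strcpy(cur, dict[input[i]]);
          else { strcpy(cur, prev); strncat(cur, prev, 1); }   /* SSc case */
          printf("%s", cur);
          if (prev[0]) {                              /* add prev + cur[0]     */
              dict[next] = malloc(strlen(prev) + 2);
              strcpy(dict[next], prev); strncat(dict[next], cur, 1); next++;
          }
          strcpy(prev, cur);
      }
      printf("\n");                                   /* prints aabaacababacb  */
      return 0;
  }

Here code 261 arrives one step before the decoder would have built entry 261, which is exactly the situation described above.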

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
We are given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows   (1994)

F            L
# mississipp i
i #mississip p
i ppi#missis s
i ssippi#mis s
i ssissippi# m
m ississippi #
p i#mississi p
p pi#mississ i
s ippi#missi s
s issippi#mi s
s sippi#miss i
s sissippi#m i

(each row is a sorted rotation of T: F = its first char, L = its last char)

A famous example (real texts are much longer...)

A useful tool: L → F mapping

(the same matrix of sorted rotations as above: F = # i i i i m p p s s s s and L = i p s s m # p i s s i i are known, while the middle of the rows is unknown)

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
(again F = # i i i i m p p s s s s and L = i p s s m # p i s s i i; the rotations themselves are unknown)

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
  Compute LF[0,n-1];
  r = 0; i = n;          // row 0 of the sorted matrix starts with #; here n = |T| without the final #
  while (i > 0) {
    T[i] = L[r];         // L[r] is the char that precedes F[r] in T
    r = LF[r]; i--;
  }
  // append # to close T
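The slides leave "Compute LF[0,n-1]" abstract; a standard way to fill it (a sketch of mine, using the counting argument behind the "same relative order" property above) is LF[i] = C[L[i]] + rank of L[i] among L[0..i], where C[c] counts the characters of L smaller than c.

  /* Fill LF[] from L alone (0-based, n = |L|). */
  void compute_LF(const unsigned char *L, int n, int *LF) {
      int C[256] = {0}, seen[256] = {0};
      for (int i = 0; i < n; i++) C[L[i]]++;          /* char frequencies      */
      for (int c = 255, sum = n; c >= 0; c--) {       /* prefix sums: C[c] =   */
          sum -= C[c]; C[c] = sum;                    /* #chars smaller than c */
      }
      for (int i = 0; i < n; i++)
          LF[i] = C[L[i]] + seen[L[i]]++;             /* stable rank within c  */
  }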

How to compute the BWT ?
SA    BWT matrix (sorted rotation; last char = L)
12    #mississippi      i
11    i#mississipp      p
 8    ippi#mississ      s
 5    issippi#miss      s
 2    ississippi#m      m
 1    mississippi#      #
10    pi#mississip      p
 9    ppi#mississi      i
 7    sippi#missis      s
 4    sissippi#mis      s
 6    ssippi#missi      i
 3    ssissippi#mi      i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]
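In code (a one-scan sketch of mine, 0-based with T[n-1] = #, so the formula above becomes the wrap-around below):

  /* Build L from SA in one scan; SA[i] = 0 wraps to the last char of T. */
  void bwt_from_sa(const char *T, const int *SA, int n, char *L) {
      for (int i = 0; i < n; i++)
          L[i] = (SA[i] == 0) ? T[n-1] : T[SA[i] - 1];
  }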

How to construct SA from T ?
SA    sorted suffixes
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !
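As a small illustration of the first stage, here is a Move-to-Front encoder sketch in C (my own; bzip restricts the list to the symbols actually occurring in L, as the example below does with [i,m,p,s], while this sketch uses the full byte alphabet).

  #include <stdio.h>
  #include <string.h>

  /* Emit, for each input symbol, its current rank in the list, then move it to the front. */
  void mtf_encode(const unsigned char *in, int n, unsigned char *out) {
      unsigned char list[256];
      for (int i = 0; i < 256; i++) list[i] = (unsigned char)i;
      for (int i = 0; i < n; i++) {
          int j = 0;
          while (list[j] != in[i]) j++;        /* position of symbol in list */
          out[i] = (unsigned char)j;           /* emit its current rank      */
          memmove(list + 1, list, j);          /* move it to the front       */
          list[0] = in[i];
      }
  }

  int main(void) {
      unsigned char out[16];
      mtf_encode((const unsigned char *)"aabbbba", 7, out);
      for (int i = 0; i < 7; i++) printf("%d ", out[i]);   /* 97 0 98 0 0 0 1 */
      return 0;
  }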

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion pages available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Lifetime of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node you can reach any other node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node you can reach any other node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by humans



Exploit the structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph
  V = Routers, E = communication links

The “cosine” graph (undirected, weighted)
  V = static web pages, E = semantic distance between pages

Query-Log graph (bipartite, weighted)
  V = queries and URLs, E = (q,u) if u is a result for q and has been clicked by some
  user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has a hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pr that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in-degree(u) ≥ k ] ≈ 1/k^a,   a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has a hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pr that a node has x links is 1/x^a, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many of their outgoing links

A Picture of the Web Graph
(adjacency-matrix picture over URL-sorted ids i, j)
21 million pages, 150 million links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1 - x, s2 - s1 - 1, ..., sk - s(k-1) - 1}

For negative entries:
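A small sketch of this gap encoding (my own illustration; only the first entry can be negative, and it is mapped to a non-negative code with the 2x / 2|x|-1 rule used in the worked example later in this lecture).

  /* Gaps as in the list above: the first successor is stored relative to
     the source x, the others relative to the previous successor (minus 1). */
  void encode_successors(int x, const int *succ, int k, int *out) {
      for (int i = 0; i < k; i++)
          out[i] = (i == 0) ? succ[0] - x : succ[i] - succ[i-1] - 1;
  }

  /* Map a possibly negative value to a non-negative code before coding it. */
  int vcode(int v) { return v >= 0 ? 2 * v : 2 * (-v) - 1; }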

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of a node's copy list says whether the corresponding successor of its
reference is also a successor of the node;
The reference index is chosen in a window [0,W] so as to give the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides an efficient, optimal solution




fknown is the “previously encoded text”: compress the concatenation fknown·fnew starting from fnew

zdelta is one of the best implementations
              Emacs size   Emacs time
uncompr       27Mb         ---
gzip          8Mb          35 secs
zdelta        1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

(figure: Client — Proxy — slow link — Proxy — fast link — web; the request carries a reference, and the Page travels delta-encoded over the slow link)

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f ∈ F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

(figure: a small weighted graph over a few files plus the dummy node 0; the edge weights, e.g. 20, 123, 220, 620, 2000, are the zdelta/gzip sizes)

              space   time
uncompr       30Mb    ---
tgz           20%     linear
THIS          8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta executions. Nonetheless, strictly n² time
              space   time
uncompr       260Mb   ---
tgz           12%     2 mins
THIS          8%      16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
(figure: the Client, holding f_old, sends block hashes; the Server, holding f_new, answers with the encoded file)

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
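A sketch of the weak rolling checksum behind the “4-byte rolling hash” above (a simplified Adler-style sum of my own, not the exact checksum rsync uses): it can be updated in O(1) when the block window slides by one byte, which is what makes checking every offset of a file affordable.

  #include <stdint.h>

  typedef struct { uint32_t a, b; } rollsum;

  /* checksum of buf[0..len) */
  rollsum roll_init(const unsigned char *buf, int len) {
      rollsum r = {0, 0};
      for (int i = 0; i < len; i++) { r.a += buf[i]; r.b += (uint32_t)(len - i) * buf[i]; }
      return r;
  }

  /* slide the window one byte: drop 'out', append 'in' */
  rollsum roll_slide(rollsum r, unsigned char out, unsigned char in, int len) {
      r.a = r.a - out + in;
      r.b = r.b - (uint32_t)len * out + r.a;
      return r;
  }

  int main(void) {                                  /* sanity check of the update */
      const unsigned char s[] = "abcdefgh"; int B = 4;
      rollsum r = roll_init(s, B);
      r = roll_slide(r, s[0], s[4], B);             /* window now "bcde"          */
      rollsum chk = roll_init(s + 1, B);
      return (r.a == chk.a && r.b == chk.b) ? 0 : 1;
  }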

Rsync: some experiments

          gcc size   emacs size
total     27288      27326
gzip      7563       8577
zdelta    227        1431
rsync     964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends the hashes (unlike rsync, where the client does), and the client checks them
The server deploys the common f_ref to compress the new f_tar (rsync just compresses it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

(picture: P aligned at position i of T, i.e. at the start of the suffix T[i,N])
Occurrences of P in T = All suffixes of T having P as a prefix

Example: P = si, T = mississippi → occurrences at positions 4 and 7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
(figure: the suffix tree of T# = mississippi#, with edge labels such as i, s, p, si, ssi, i#, pi#, ppi#, #, mississippi#; each leaf stores the starting position of its suffix: 12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3)

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Θ(N²) space (if the suffixes of SUF(T) were stored explicitly)

SA    SUF(T)
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#

T = mississippi#   (SA stores one suffix pointer per suffix)

P=si
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
⇒ In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison
SA = [12 11 8 5 2 1 10 9 7 4 6 3],  T = mississippi#,  P = si
(2 accesses per step, one to SA and one to T; here the comparison says “P is larger”, so the search continues in the right half)

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison
SA = [12 11 8 5 2 1 10 9 7 4 6 3],  T = mississippi#,  P = si
(next step: the comparison says “P is smaller”, so the search continues in the left half)

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
⇒ overall, O(p log2 N) time
Improvable to O(p + log2 N) [Manber-Myers, ’90] and to O(p + log2 |S|) [Cole et al, ’06]
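A minimal C sketch of the indirect binary search just described (my own; it assumes T is NUL-terminated and returns the first SA position whose suffix is ≥ P, which is where the occurrences of P start).

  #include <string.h>

  /* Each comparison costs O(p): it compares P against the suffix T[SA[mid]..]. */
  int sa_lower_bound(const char *T, int n, const int *SA, const char *P, int p) {
      int lo = 0, hi = n;                        /* search in SA[lo,hi)         */
      while (lo < hi) {
          int mid = (lo + hi) / 2;
          if (strncmp(T + SA[mid], P, p) < 0)    /* suffix < P: go right        */
              lo = mid + 1;
          else
              hi = mid;                          /* suffix >= P: go left        */
      }
      return lo;
  }

Counting and locating the occurrences then just needs the symmetric upper-bound search, as the next slide illustrates.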

Locating the occurrences
SA = [12 11 8 5 2 1 10 9 7 4 6 3],  T = mississippi#,  P = si
The two searches for si# and si$ delimit the SA range of suffixes prefixed by si: it contains SA entries 7 (sippi...) and 4 (sissippi...), so occ = 2 and the occurrences are at positions 4 and 7.

Suffix Array search
• O(p + log2 N + occ) time
(convention: # < every char of S < $)
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp = [0 0 1 4 0 0 1 0 2 1 3]
SA  = [12 11 8 5 2 1 10 9 7 4 6 3],   T = mississippi#
(e.g. the adjacent suffixes issippi... and ississippi... share a prefix of length 4)
• How long is the common prefix between T[i,...] and T[j,...] ?
• It is the min of the subarray Lcp[h,k-1], where SA[h]=i and SA[k]=j.
• Is there a repeated substring of length ≥ L ?
• Search for some Lcp[i] ≥ L
• Is there a substring of length ≥ L occurring ≥ C times ?
• Search for a window Lcp[i,i+C-2] whose entries are all ≥ L


Slide 228

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
Context
Empty

Counts
A
B
C
$

=
=
=
=

4
2
5
3

Context
A
B
C

Counts
C
$
A
$
A
B
C
$

=
=
=
=
=
=
=
=

3
1
2
1
1
2
2
3

Context
AC

BA
CA
CB
CC

String = ACCBACCACBA B

k=2

Counts
B
C
$
C
$
C
$
A
$
A
B
$

=
=
=
=
=
=
=
=
=
=
=
=

1
2
2
1
1
1
1
2
1
1
1
2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb
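
A compact Python sketch of this coding loop (a plain dict of strings stands in for the trie); it reproduces the (id, char) pairs of the example:

def lz78_encode(T):
    dic, out, i = {}, [], 0                       # dic maps phrase -> id, ids start at 1
    while i < len(T):
        s, j = "", i
        while j < len(T) and s + T[j] in dic:     # longest match S already in the dictionary
            s += T[j]; j += 1
        c = T[j] if j < len(T) else ""            # next char after the match
        out.append((dic.get(s, 0), c))
        dic[s + c] = len(dic) + 1                 # add Sc with the next free id
        i = j + 1
    return out

# lz78_encode("aabaacabcabcb") ->
# [(0,'a'), (1,'b'), (1,'a'), (0,'c'), (2,'c'), (5,'b')]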

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later
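
A Python sketch of the decoder, showing the one-step-behind dictionary and the special case above (the received id may be the entry the encoder created in the very same step); the initial dictionary is restricted to the three codes used in the example:

def lzw_decode(codes, init):
    dic = dict(init)                        # id -> string, e.g. {112:'a', 113:'b', 114:'c'}
    next_id = 256
    prev = dic[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in dic:
            cur = dic[code]
        else:                               # special case: id just created by the encoder
            cur = prev + prev[0]            # it can only be prev + first char of prev
        out.append(cur)
        dic[next_id] = prev + cur[0]        # the entry the encoder added one step earlier
        next_id += 1
        prev = cur
    return "".join(out)

# lzw_decode([112,112,113,256,114,257,261,114], {112:'a',113:'b',114:'c'})
# -> "aabaacababac", rebuilding 256=aa, 257=ab, 258=ba, 259=aac, 260=ca, 261=aba, 262=abac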

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us be given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

(1994)

F                 L
#  mississipp  i
i  #mississip  p
i  ppi#missis  s
i  ssippi#mis  s
i  ssissippi#  m
m  ississippi  #
p  i#mississi  p
p  pi#mississ  i
s  ippi#missi  s
s  issippi#mi  s
s  sippi#miss  i
s  sissippi#m  i

(each row is a rotation of T: first column = F, last column = L)

A famous example

Much
longer...

A useful tool: L → F mapping

(Same sorted matrix as above: the first column F and the last column L are known, the middle of each row is unknown.)

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
(Again the same matrix: only the columns F and L are known.)

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
  Compute LF[0,n-1];
  r = 0; i = n;
  while (i > 0) {
    T[i] = L[r];
    r = LF[r]; i--;
  }
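
The same backward reconstruction as a runnable Python sketch (a slight variant of the pseudocode above: LF is obtained by stably sorting L, which is exactly the "same relative order" property, and the walk starts from the row that ends with the sentinel):

def invert_bwt(L, sentinel='#'):
    n = len(L)
    order = sorted(range(n), key=lambda i: (L[i], i))   # stable sort of L gives F
    LF = [0] * n
    for f_pos, l_pos in enumerate(order):
        LF[l_pos] = f_pos                               # LF maps L's chars onto F's chars
    r = L.index(sentinel)                               # the row ending with '#' is T itself
    out = []
    for _ in range(n):
        out.append(L[r])            # L[r] precedes F[r] in T, so we collect T backward
        r = LF[r]
    return "".join(reversed(out))

# invert_bwt("ipssm#pissii") -> "mississippi#"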

How to compute the BWT ?
SA     sorted rotation    L
12     #mississipp        i
11     i#mississip        p
 8     ippi#missis        s
 5     issippi#mis        s
 2     ississippi#        m
 1     mississippi        #
10     pi#mississi        p
 9     ppi#mississ        i
 7     sippi#missi        s
 4     sissippi#mi        s
 6     ssippi#miss        i
 3     ssissippi#m        i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]
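
Exactly this construction, as a Python sketch (0-indexed, with naive suffix sorting, hence the worst-case costs listed on the next slide):

def bwt_via_sa(T):
    # T must end with a unique smallest sentinel, here '#'
    n = len(T)
    SA = sorted(range(n), key=lambda i: T[i:])       # naive suffix sorting
    L = "".join(T[s - 1] for s in SA)                # L[i] = T[SA[i]-1], wrapping when SA[i] = 0
    return SA, L

# bwt_via_sa("mississippi#") ->
# ([11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2], "ipssm#pissii")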

How to construct SA from T ?
SA     suffix
12     #
11     i#
 8     ippi#
 5     issippi#
 2     ississippi#
 1     mississippi#
10     pi#
 9     ppi#
 7     sippi#
 4     sissippi#
 6     ssippi#
 3     ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet size becomes |S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus γ(16) (the position of #), plus the original Mtf-list (i,m,p,s)
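
A generic Move-to-Front sketch in Python (it illustrates the coding step only; it is not meant to reproduce byte-for-byte the bzip-specific numbers above, which also involve the separate handling of # and the RLE0/Wheeler step):

def mtf_encode(s, alphabet):
    lst, out = list(alphabet), []
    for c in s:
        j = lst.index(c)                  # position of c in the current list
        out.append(j)
        lst.insert(0, lst.pop(j))         # move c to the front
    return out

def mtf_decode(codes, alphabet):
    lst, out = list(alphabet), []
    for j in codes:
        c = lst[j]
        out.append(c)
        lst.insert(0, lst.pop(j))
    return "".join(out)

# Runs of equal chars in L become runs of 0s, which the RLE stage then squeezes.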

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion pages available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node one can reach any other node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node one can reach any other node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by humans



Exploit the structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…

 Physical network graph:
   V = routers, E = communication links

 The “cosine” graph (undirected, weighted):
   V = static web pages, E = semantic distance between pages

 Query-Log graph (bipartite, weighted):
   V = queries and URLs, E = (q,u) if u is a result for q and has been clicked by some user who issued q

 Social graph (undirected, unweighted):
   V = users, E = (x,y) if x knows y (facebook, address book, email, ...)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has a hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:

Skewed distribution: the probability that a node has x links is 1/x^α, with α ≈ 2.1

The In-degree distribution

Altavista crawl, 1999 and WebBase crawl, 2001
Indegree follows a power-law distribution:

  Pr[ in-degree(u) = k ]  ∝  1 / k^α,   with α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has a hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:

Skewed distribution: the probability that a node has x links is 1/x^α, with α ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
[Plot: the adjacency matrix (axes i and j) of a crawl with 21 million pages and 150 million links, after URL-sorting; host clusters such as Berkeley and Stanford appear as dense blocks.]

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1, s2, ..., sk} is encoded by the gaps: s1-x, s2-s1-1, ..., sk-s(k-1)-1

For negative entries:
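
A Python sketch of this gap transformation; the signed-to-unsigned mapping used here for the (possibly negative) first gap is the same v-mapping applied to the residuals in the example a couple of slides below, while the successor list in the comment is just a hypothetical illustration:

def to_unsigned(x):
    # v-mapping for entries that may be negative: 0,-1,1,-2,2,... -> 0,1,2,3,4,...
    return 2 * x if x >= 0 else 2 * abs(x) - 1

def gap_encode(x, succ):
    # succ = increasing successor list of node x
    gaps = [to_unsigned(succ[0] - x)]                       # only the first gap can be negative
    gaps += [succ[i] - succ[i - 1] - 1 for i in range(1, len(succ))]
    return gaps

# gap_encode(15, [13, 15, 19, 23]) -> [3, 1, 3, 3]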

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y's copy-list corresponds to a successor of the reference x, and tells whether that successor is also a successor of y;
The reference index is chosen in [0,W] so as to give the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
[Diagram: a sender transmits data over the network to a receiver, which already holds some knowledge about that data.]

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



What if the sender has never seen the data at the receiver?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides an efficient, optimal solution




fknown is the “previously encoded text”: compress the concatenation fknown·fnew, emitting output only from fnew onwards

zdelta is one of the best implementations
            Emacs size    Emacs time
uncompr     27Mb          ---
gzip        8Mb           35 secs
zdelta      1.5Mb         42 secs
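
Not zdelta itself, but the underlying idea (let the LZ77 window be preloaded with fknown) can be sketched with zlib's preset-dictionary support in Python:

import zlib

def delta_compress(f_known, f_new):
    # only the last 32KB of the preset dictionary can actually be referenced by zlib
    co = zlib.compressobj(level=9, zdict=f_known)
    return co.compress(f_new) + co.flush()

def delta_decompress(f_known, f_delta):
    do = zlib.decompressobj(zdict=f_known)
    return do.decompress(f_delta) + do.flush()

f_old = b"the quick brown fox jumps over the lazy dog\n" * 100
f_new = f_old.replace(b"lazy", b"sleepy")
delta = delta_compress(f_old, f_new)
assert delta_decompress(f_old, delta) == f_new    # the delta is tiny, since f_new ~ f_old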

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

[Diagram: Client ⇄ slow link ⇄ Proxy ⇄ fast link ⇄ Web; requests flow towards the Web, pages flow back, and over the slow link the page is delta-encoded against a reference held on both sides.]

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f ∈ F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: an example weighted graph GF over files 1, 2, 3, 5 plus the dummy node 0, with edge weights such as 20, 123, 220, 620, 2000 given by the (z)delta/gzip sizes.]

            space    time
uncompr     30Mb     ---
tgz         20%      linear
THIS        8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n² edge calculations (zdelta executions)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F, thus saving
zdelta executions. Nonetheless, this still takes n² time.
            space    time
uncompr     260Mb    ---
tgz         12%      2 mins
THIS        8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
[Diagram: the Client holds f_old and sends a request to the Server; the Server holds f_new and sends back an update.]

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
[Diagram: the Client (holding f_old) sends block hashes to the Server (holding f_new); the Server sends back the encoded file.]

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
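
A toy Python version of the block matching (the weak hash is recomputed at every shift for clarity, whereas rsync's 4-byte rolling hash updates it in O(1); block-boundary details are glossed over):

import hashlib

def weak(block):
    # stand-in for the rolling checksum
    return sum(block) & 0xffffffff

def signatures(f_old, B):
    sig = {}
    for k in range(0, len(f_old), B):
        blk = f_old[k:k + B]
        sig.setdefault(weak(blk), []).append((hashlib.md5(blk).digest(), k // B))
    return sig

def rsync_encode(f_new, sig, B):
    # emit ('copy', old_block_index) or ('lit', one byte) instructions
    out, i = [], 0
    while i < len(f_new):
        blk = f_new[i:i + B]
        hit = next((idx for md5, idx in sig.get(weak(blk), ())
                    if hashlib.md5(blk).digest() == md5), None)
        if hit is not None:
            out.append(("copy", hit)); i += len(blk)
        else:
            out.append(("lit", f_new[i:i + 1])); i += 1
    return out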

Rsync: some experiments

            gcc size    emacs size
total       27288       27326
gzip        7563        8577
zdelta      227         1431
rsync       964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends the hashes (unlike rsync, where the client does), and the client checks them.
The server deploys the common f_ref to compress the new f_tar (rsync compresses just f_tar).

A multi-round protocol

Split the file into k blocks of n/k elements each and refine them level by level: log(n/k) levels.

If the two files differ in k blocks, then on each level at most k hashes fail to find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

[Picture: T, with the suffix T[i,N] starting at position i and P aligned to it as a prefix.]
Occurrences of P in T = All suffixes of T having P as a prefix

Example: P = si, T = mississippi  →  P occurs at positions 4 and 7

SUF(T) = Sorted set of suffixes of T

Reduction: from substring search to prefix search

The Suffix Tree
[Figure: the suffix tree of T# = mississippi# (position ruler 2 4 6 8 10): edges carry substring labels such as #, i, i#, p, pi#, ppi#, s, si, ssi, mississippi#, and the 12 leaves store the starting positions 1..12 of the corresponding suffixes.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Storing SUF(T) explicitly would take Θ(N²) space.

SA     SUF(T)
12     #
11     i#
 8     ippi#
 5     issippi#
 2     ississippi#
 1     mississippi#
10     pi#
 9     ppi#
 7     sippi#
 4     sissippi#
 6     ssippi#
 3     ssissippi#

T = mississippi#   (each SA entry is a suffix pointer into T)

P = si

Suffix Array space:
• SA: Θ(N log₂ N) bits
• Text T: N chars
⇒ In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison

SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3],   T = mississippi#,   P = si

Compare P with the suffix pointed to by the middle SA entry (2 accesses per step): here P is larger, so the search continues in the right half.

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison

SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3],   T = mississippi#,   P = si

At a later step P is smaller than the probed suffix, so the search continues in the left half.

Suffix Array search
• O(log₂ N) binary-search steps
• Each step takes O(p) char comparisons
⇒ overall, O(p log₂ N) time

Improvable to O(p + log₂ N) [Manber-Myers, ’90] and to O(p + log₂ |S|) [Cole et al, ’06]
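
A Python sketch of the whole pipeline on the running example: naive SA construction plus the O(p log₂ N) binary search (each step compares at most p characters, as noted above):

def suffix_array(T):
    return sorted(range(len(T)), key=lambda i: T[i:])        # naive: fine for small texts

def search(T, SA, P):
    # return (lo, hi): SA[lo:hi] are the suffixes having P as a prefix
    p = len(P)
    lo, hi = 0, len(SA)
    while lo < hi:                                   # leftmost suffix whose prefix >= P
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + p] < P: lo = mid + 1
        else: hi = mid
    left = lo
    lo, hi = left, len(SA)
    while lo < hi:                                   # leftmost suffix whose prefix > P
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + p] <= P: lo = mid + 1
        else: hi = mid
    return left, lo

T = "mississippi#"
SA = suffix_array(T)
lo, hi = search(T, SA, "si")
print(sorted(SA[i] + 1 for i in range(lo, hi)))      # -> [4, 7]  (1-based, as in the slides)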

Locating the occurrences

SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3],   T = mississippi#

The occurrences of P = si form a contiguous range of SA, delimited by si# and si$ (assuming # < any char of S < $): the range contains the suffixes starting at positions 7 (sippi...) and 4 (sissippi...), hence occ = 2.

Suffix Array search
• O(p + log₂ N + occ) time

Suffix Trays: O(p + log₂ |S| + occ)   [Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp[i] = lcp between the suffixes of ranks i and i+1 in SA:

 i     SA    suffix          Lcp[i]
 1     12    #                0
 2     11    i#               1
 3      8    ippi#            1
 4      5    issippi#         4
 5      2    ississippi#      0
 6      1    mississippi#     0
 7     10    pi#              1
 8      9    ppi#             0
 9      7    sippi#           2
10      4    sissippi#        1
11      6    ssippi#          3
12      3    ssissippi#       -

T = mississippi#   (e.g. the adjacent suffixes issippi# and ississippi# share a prefix of length 4)
• How long is the common prefix between T[i,...] and T[j,...] ?
  • It is the min of the subarray Lcp[h,k-1], where SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  • Search for some Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  • Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.
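
The Lcp array can be built in linear time from SA with Kasai et al.'s algorithm; a Python sketch, together with the "repeated substring of length ≥ L" test (note: this code stores lcp(rank i-1, rank i) at position i, i.e. shifted by one with respect to the table above):

def lcp_array(T, SA):
    n = len(T)
    rank = [0] * n
    for r, s in enumerate(SA):
        rank[s] = r
    lcp, h = [0] * n, 0
    for i in range(n):                         # Kasai et al.: amortized O(n)
        if rank[i] > 0:
            j = SA[rank[i] - 1]                # suffix adjacent to suffix i in SA order
            while i + h < n and j + h < n and T[i + h] == T[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h: h -= 1                       # the lcp can drop by at most 1 at the next i
        else:
            h = 0
    return lcp

def has_repeat_of_length(T, SA, L):
    return any(v >= L for v in lcp_array(T, SA))

# With T = "mississippi#" the maximum Lcp value is 4 ("issi"),
# so a repeated substring of length 4 exists.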


Slide 229

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with i-th bit of U(T[j]) to estabilish if both are true

An example j=1
1 2 3 4 5 6 7 8 9 10

n
m

1

12345

1

0

P=abaac

2

0

3

0

4

0

5

0

T=xabxabaaca

0
 
0
U (x)  0
 
0
 
0

2

3

4

5

6

7

1 
 
0
BitShift ( M ( 0 )) & U (T [1])   0  &
 
0
 
0

8

9

0 0
   
0 0
0  0
   
0 0
   
0 0

1
0

An example j=2
1 2 3 4 5 6 7 8 9 10

n
m

1

2

12345

1

0

1

P=abaac

2

0

0

3

0

0

4

0

0

5

0

0

T=xabxabaaca

1 
 
0
U (a )   1 
 
1 
 
0

3

4

5

6

7

1 
 
0
BitShift ( M (1)) & U (T [ 2 ])   0  &
 
0
 
0

8

9

1  1 
   
0 0
1    0 
   
1   0 
   
0 0

1
0

An example j=3
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

12345

1

0

1

0

P=abaac

2

0

0

1

3

0

0

0

4

0

0

0

5

0

0

0

T=xabxabaaca

0
 
1 
U (b )   0 
 
0
 
0

4

5

6

7

1 
 
1 
BitShift ( M ( 2 )) & U (T [ 3 ])   0  &
 
0
 
0

8

9

0 0
   
1  1 
0  0
   
0 0
   
0 0

1
0

An example j=9
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

4

5

6

7

8

9

12345

1

0

1

0

0

1

0

1

1

0

P=abaac

2

0

0

1

0

0

1

0

0

0

3

0

0

0

0

0

0

1

0

0

4

0

0

0

0

0

0

0

1

0

5

0

0

0

0

0

0

0

0

1

T=xabxabaaca

0
 
0
U (c )   0 
 
0
 
1 

1 
 
1 
BitShift ( M ( 8 )) & U (T [ 9 ])   0  &
 
0
 
1 

0 0
   
0 0
0  0
   
0 0
   
1  1 

1
0

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M ( j - 1))  U (T [ j ])
l

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M

l -1

( j - 1))

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1
1 2 3 4 5 6 7 8 910

M1=

T=xabxabaaca
P=
abaad

1

2

3

4

5

6

7

8

9

1
0

1

1

1

1

1

1

1

1

1

1

1

2

0

0

1

0

0

1

0

1

1

0

3

0

0

0

1

0

0

1

0

0

1

4

0

0

0

0

1

0

0

1

0

0

5

0

0

0

0

0

0

0

0

1

0

M0=

1

2

3

4

5

6

7

8

9

1
0

1

0

1

0

0

1

0

1

1

0

1

2

0

0

1

0

0

1

0

0

0

0

3

0

0

0

0

0

0

1

0

0

0

4

0

0

0

0

0

0

0

1

0

0

5

0

0

0

0

0

0

0

0

0

0

How much do we pay?





The running time is O(kn(1+m/w)
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Delection: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
0 000 ...........0 x in b in ary
Length-1


x > 0 and Length = log2 x +1
e.g., 9 represented as <000,1001>.



g-code for x takes 2 log2 x +1 bits

(ie. factor of 2 from optimal)



Optimal for Pr(x) = 1/2x2, and i.i.d integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8

6

3

59

7

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How much good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
1 ≥Si=1,...,x pi ≥ x * px  x ≤ 1/px

How good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):



p i  | g ( i ) |

i  1 ,..., S

This is:



i  1 ,.., S

p i  [ 2 * log

1

 1]

pi

 2 * H 0 ( X ) 1
No much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1n 2n 3n… nn  Huff = O(n2 log n), MTF = O(n log n) + n2

No much worse than Huffman
...but it may be far better

MTF: how good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
Put S in the front and consider the cost of encoding:
S

O ( S log S ) 



nx

g(p - p )

x 1 i  2

x

x

i

i -1

S

By Jensen’s:

 O ( S log S ) 

n
x 1

x

[ 2 * log

N

 1]

nx

 O ( S log S )  N * [ 2 * H 0 ( X )  1]
L a [ mtf ]  2 * H 0 ( X )  O (1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1n 2n 3n… nn 

There is a memory

Huff(X) = Q(n2 log n) > Rle(X) = Q(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.
1.0

c = .3

0.7

i -1

f (i ) 



p( j)

j 1

b = .5
0.2
0.0

f(a) = .0, f(b) = .2, f(c) = .7
a = .2

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
1.0

0.7
c = .3

0.7

c = .3
0.55

0.0

a = .2

0.3
0.2

c = .3

0.27
b = .5

b = .5
0.2

0.3

a = .2

b = .5
0.22

a = .2

0.2

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l 0  0 l i  l i -1  s i -1 * f c i 

s0  1

s i  s i -1 * p c i 

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

n

sn 

 p c 
i

i 1

The interval for a message sequence will be called the
sequence interval

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
1.0

0.7
c = .3

0.7
0.49

b = .5

c = .3

0.55
0.49

0.2

0.0

0.55

b = .5

0.3
a = .2

The message is bbc.

0.2

0.49
0.475

c = .3

b = .5
0.35

a = .2

0.3

a = .2

Representing a real number
Binary fractional representation:
. 75

 . 11

1/ 3

 . 01 01

11 / 16

 . 1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
m in

m ax

interval

.11

.110

.111

[.75 ,1.0 )

.101

.1010

.1011

[.625 , .75 )

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
Context
Empty

Counts
A
B
C
$

=
=
=
=

4
2
5
3

Context
A
B
C

Counts
C
$
A
$
A
B
C
$

=
=
=
=
=
=
=
=

3
1
2
1
1
2
2
3

Context
AC

BA
CA
CB
CC

String = ACCBACCACBA B

k=2

Counts
B
C
$
C
$
C
$
A
$
A
B
$

=
=
=
=
=
=
=
=
=
=
=
=

1
2
2
1
1
1
1
2
1
1
1
2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
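A sketch of the decoder, including the special case: when the received id is the one the encoder has just created (so it is not yet in the decoder's dictionary), the missing entry must be prev + prev[0]. The 256-entry initialization uses real ascii codes here, so the slide's illustrative id 112 for 'a' becomes 97.

def lzw_decode(codes):
    dictionary = {i: chr(i) for i in range(256)}     # initial ascii entries
    prev = dictionary[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in dictionary:
            cur = dictionary[code]
        else:                                        # special case: id created by the encoder but unknown here
            cur = prev + prev[0]
        out.append(cur)
        dictionary[len(dictionary)] = prev + cur[0]  # add Sc, one step behind the encoder
        prev = cur
    return "".join(out)

print(lzw_decode([97, 97, 98, 256, 99, 257, 261, 99, 98]))   # -> aabaacababacb (the slide's input, with a = 97)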

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
We are given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows    (a famous example, 1994)

F               L
#  mississipp   i
i  #mississip   p
i  ppi#missis   s
i  ssippi#mis   s
i  ssissippi#   m
m  ississippi   #
p  i#mississi   p
p  pi#mississ   i
s  ippi#missi   s
s  issippi#mi   s
s  sippi#miss   i
s  sissippi#m   i

(each row is a cyclic rotation of T; in practice T is much longer)

A useful tool: the L → F mapping

[The sorted matrix again: F and L are known, the middle columns are unknown.]

How do we map L’s chars onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal chars of L and rotate their rows rightward by one position:
they become rows starting with that char, and their relative order in the
sorted matrix is unchanged, hence equal chars appear in L and in F in the
same relative order !!

The BWT is invertible

[The sorted matrix again: only F and L are known, the middle columns are unknown.]

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
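A Python sketch of the same backward reconstruction, where LF is built by stably sorting L (equal chars keep their relative order, as argued above); it assumes a unique smallest sentinel '#'.

def invert_bwt(L):
    n = len(L)
    order = sorted(range(n), key=lambda r: (L[r], r))   # stable sort of L's chars gives F
    LF = [0] * n
    for f_pos, r in enumerate(order):
        LF[r] = f_pos                                   # LF[r] = position in F of the char L[r]
    out, r = [], 0                                      # row 0 is the rotation starting with '#'
    for _ in range(n):
        out.append(L[r])                                # L[r] precedes F[r] in T
        r = LF[r]
    t = "".join(reversed(out))
    return t[1:] + t[0]                                 # rotate the sentinel back to the end

print(invert_bwt("ipssm#pissii"))                       # -> mississippi#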

How to compute the BWT ?

SA    BWT matrix (sorted rotations)    L
12    #mississippi                     i
11    i#mississipp                     p
 8    ippi#mississ                     s
 5    issippi#miss                     s
 2    ississippi#m                     m
 1    mississippi#                     #
10    pi#mississip                     p
 9    ppi#mississi                     i
 7    sippi#missis                     s
 4    sissippi#mis                     s
 6    ssippi#missi                     i
 3    ssissippi#mi                     i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]
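As a sketch, SA can be built by (inefficiently) sorting the suffixes and then L obtained by the rule above (0-based indices, with a wrap-around for the rotation starting at the first position); the following slides discuss why this naive construction is too slow.

def bwt_via_suffix_array(t):                           # t must end with a unique smallest char, e.g. '#'
    sa = sorted(range(len(t)), key=lambda i: t[i:])    # 0-based suffix array (naive construction)
    return "".join(t[i - 1] for i in sa)               # L[i] = T[SA[i]-1]; t[-1] handles the wrap when SA[i] = 0

print(bwt_via_suffix_array("mississippi#"))            # -> ipssm#pissii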

How to construct SA from T ?

Input: T = mississippi#

SA    suffix
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#

Elegant but inefficient. Obvious inefficiencies:
• Θ(n² log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet of size |Σ|+1

Bzip2-output = Arithmetic/Huffman on |Σ|+1 symbols...
... plus γ(16), plus the original Mtf-list (i,m,p,s)
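A minimal Move-to-Front sketch in Python (the run-length stage and the final statistical coder of bzip are omitted); the initial list is simply passed as an argument, whereas the slide's example fixes it to [i,m,p,s] and treats # separately.

def mtf_encode(s, alphabet):
    lst, out = list(alphabet), []
    for c in s:
        r = lst.index(c)
        out.append(r)                     # rank of c in the current list
        lst.insert(0, lst.pop(r))         # move c to the front
    return out

print(mtf_encode("ipssm#pissii", "#imps"))   # MTF of the BWT of mississippi#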

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics

Size
  1 trillion pages available (Google 7/08)
  5-40K per page => hundreds of terabytes
  Size grows every day!!

Change
  8% new pages, 25% new links change weekly
  Life time of about 10 days

The Bow Tie

Some definitions

  Weakly connected components (WCC)
    Set of nodes such that from any node we can go to any other node via an
    undirected path.

  Strongly connected components (SCC)
    Set of nodes such that from any node we can go to any other node via a
    directed path.

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?

  The largest artifact ever conceived by humankind

  Exploit the structure of the Web for
    Crawl strategies
    Search
    Spam detection
    Discovering communities on the web
    Classification/organization

  Predict the evolution of the Web
    Sociological understanding

Many other large graphs…

  Physical network graph
    V = Routers
    E = communication links

  The “cosine” graph (undirected, weighted)
    V = static web pages
    E = semantic distance between pages

  Query-Log graph (bipartite, weighted)
    V = queries and URLs
    E = (q,u) if u is a result for q, and has been clicked by some user who issued q

  Social graph (undirected, unweighted)
    V = users
    E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)
  V = URLs,  E = (u,v) if u has a hyperlink to v
  Isolated URLs are ignored (no IN & no OUT)

Three key properties:
  Skewed distribution: the probability that a node has x links is 1/x^α,  α ≈ 2.1

The In-degree distribution

Altavista crawl (1999) and WebBase crawl (2001):
the indegree follows a power-law distribution

    Pr[ in-degree(u) = k ]  ∝  1 / k^α ,    α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)
  V = URLs,  E = (u,v) if u has a hyperlink to v
  Isolated URLs are ignored (no IN, no OUT)

Three key properties:
  Skewed distribution: the probability that a node has x links is 1/x^α,  α ≈ 2.1
  Locality: usually most of the hyperlinks point to other URLs on the same host (about 80%).
  Similarity: pages close in lexicographic order tend to share many outgoing links

A Picture of the Web Graph

[Plot of the adjacency matrix (i,j) of a crawl with 21 million pages and
150 million links: after URL-sorting, hosts such as Berkeley and Stanford
show up as dense blocks.]

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:
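A sketch of the gap encoding of a successor list as in S(x) above; a real coder (e.g. WebGraph) then feeds these small integers to a variable-length instantaneous code, with a signed mapping for the possibly negative first entry.

def gaps(x, successors):                  # successors sorted increasingly
    out = [successors[0] - x]             # first entry is relative to the source x (may be negative)
    out += [b - a - 1 for a, b in zip(successors, successors[1:])]
    return out

print(gaps(15, [13, 15, 16, 19, 23, 316]))   # hypothetical adjacency list -> [-2, 1, 0, 2, 3, 292]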

Copy-lists

Reference chains, possibly limited in length

Uncompressed adjacency list
Adjacency list with copy lists   (exploits similarity)

Each bit of a node’s copy-list tells whether the corresponding successor of its
reference node is also a successor of the node itself;
the reference index is chosen in a window [0,W] so as to give the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals

Adjacency list with copy blocks.
Consecutivity (runs of consecutive integers) is frequent in the extra-nodes:

  Intervals: represented by their left extreme and length
  Interval length: decremented by Lmin = 2
  Residuals: differences between consecutive residuals, or wrt the source

    0    = (15-15)*2        (positive)
    2    = (23-19)-2        (jump >= 2)
    600  = (316-16)*2
    3    = |13-15|*2-1      (negative)
    3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques

  caching: “avoid sending the same object again”
    done on the basis of objects
    only works if objects completely unchanged
    How about objects that are slightly changed?

  compression: “remove redundancy in transmitted data”
    avoid repeated substrings in data
    can be extended to the history of past transmissions (overhead)
    What if the sender has never seen the data at the receiver ?

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization

  Delta compression   [diff, zdelta, REBL,…]
    Compress file f deploying file f’
    Compress a group of files
    Speed-up web access by sending differences between the requested page
    and the ones available in cache

  File synchronization   [rsync, zsync]
    Client updates old file f_old with f_new available on a server
    Mirroring, Shared Crawling, Content Distr. Net

  Set reconciliation
    Client updates structured old file f_old with f_new available on a server
    Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression    (one-to-one)

Problem: We have two files f_known and f_new, and the goal is to compute a file
f_d of minimum size such that f_new can be derived from f_known and f_d.

  Assume that block moves and copies are allowed
  Find an optimal covering set of f_new based on f_known
  The LZ77-scheme provides an efficient, optimal solution:
    f_known is the “previously encoded text”; compress the concatenation
    f_known·f_new, starting from f_new

zdelta is one of the best implementations
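This is not zdelta itself, but the same idea can be sketched with zlib's preset-dictionary feature: letting the compressor see f_known turns shared substrings of f_new into cheap LZ77 back-references (the dictionary is limited to the 32KB window). The files below are hypothetical.

import zlib

def delta_compress(f_known: bytes, f_new: bytes) -> bytes:
    c = zlib.compressobj(level=9, zdict=f_known)     # f_known acts as a preset dictionary
    return c.compress(f_new) + c.flush()

def delta_decompress(f_known: bytes, f_delta: bytes) -> bytes:
    d = zlib.decompressobj(zdict=f_known)
    return d.decompress(f_delta) + d.flush()

old = b"the quick brown fox jumps over the lazy dog " * 100   # hypothetical f_known
new = old.replace(b"lazy", b"sleepy")                         # hypothetical f_new
fd = delta_compress(old, new)
assert delta_decompress(old, fd) == new
print(len(new), len(zlib.compress(new, 9)), len(fd))          # plain size vs self-compression vs delta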
           Emacs size   Emacs time
uncompr    27Mb         ---
gzip       8Mb          35 secs
zdelta     1.5Mb        42 secs

Efficient Web Access

Dual proxy architecture: a pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link.

[Figure: Client ↔ client-side proxy ←(slow link: requests / delta-encoded pages)→
server-side proxy ←(fast link)→ web; both proxies keep the reference page.]

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression

Problem: We wish to compress a group of files F
  Useful on a dynamic collection of web pages, back-ups, …

Apply pairwise zdelta: find for each f ∈ F a good reference

Reduction to the Min Branching problem on DAGs
  Build a weighted graph G_F: nodes = files, weights = zdelta-sizes
  Insert a dummy node connected to all nodes, whose edge weights are the gzip-coding sizes
  Compute the min branching = directed spanning tree of min total cost, covering G’s nodes.

[Figure: example of the weighted graph G_F with the dummy node and some
zdelta/gzip edge weights, and its minimum branching.]

           space   time
uncompr    30Mb    ---
tgz        20%     linear
THIS       8%      quadratic

Improvement

What about many-to-one compression? (group of files)

Problem: Constructing G is very costly: n² edge calculations (zdelta executions)

We wish to exploit some pruning approach:
  Collection analysis: cluster the files that appear similar and are thus good
  candidates for zdelta-compression; build a sparse weighted graph G’_F
  containing only the edges between those pairs of files.
  Assign weights: estimate appropriate edge weights for G’_F, thus saving
  zdelta executions. Nonetheless, strictly n² time.
           space   time
uncompr    260Mb   ---
tgz        12%     2 mins
THIS       8%      16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm

[Figure: the Client splits f_old into blocks and sends their hashes to the
Server; the Server, which holds f_new, sends back an encoded file made of
literal bytes and references to the Client’s blocks.]

The rsync algorithm    (contd)

  simple, widely used, single roundtrip
  optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
  choice of block size problematic (default: max{700, √n} bytes)
  not good in theory: granularity of changes may disrupt use of blocks
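A sketch of a weak rolling checksum in the spirit of rsync's 4-byte hash (an Adler-style pair of sums); the modulus and the surrounding block-matching logic are illustrative, not rsync's actual code.

M = 1 << 16                                   # sums taken mod 2^16 (illustrative choice)

def block_hash(block: bytes):
    a = sum(block) % M
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % M
    return a, b

def roll(a, b, out_byte, in_byte, blocksize):
    a = (a - out_byte + in_byte) % M          # slide the window one byte to the right in O(1)
    b = (b - blocksize * out_byte + a) % M
    return a, b

data, n = b"abcdefgh", 4
a, b = block_hash(data[0:n])
a, b = roll(a, b, data[0], data[n], n)
assert (a, b) == block_hash(data[1:n + 1])    # rolling forward matches hashing from scratch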

Rsync: some experiments

           gcc size   emacs size
  total    27288      27326
  gzip     7563       8577
  zdelta   227        1431
  rsync    964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The Server sends the hashes (unlike the client in rsync), and the client checks them.
The Server deploys the common f_ref to compress the new f_tar (rsync just compresses it by itself).

A multi-round protocol

k blocks of n/k elems,  log(n/k) levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts

Pattern P occurs at position i of T
   iff   P is a prefix of the i-th suffix of T (i.e. T[i,N])

Occurrences of P in T = all suffixes of T having P as a prefix

Example: P = si, T = mississippi  →  occurrences at positions 4, 7

SUF(T) = sorted set of suffixes of T

Reduction: from substring search to prefix search (over SUF(T))

The Suffix Tree

[Figure: the suffix tree of T# = mississippi# (positions 1..12), with edge
labels such as “ssi”, “ppi#”, “si”, “i#”, “mississippi#” and leaves storing
the starting positions of the corresponding suffixes.]

The Suffix Array

Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

T = mississippi#     (storing SUF(T) explicitly takes Θ(N²) space;
                      SA stores only the suffix pointers)

SA    SUF(T)
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#

For P = si the matching suffixes are contiguous: SA entries 7 and 4.

Suffix Array space:
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern

Indirect binary search on SA: O(p) time per suffix comparison, 2 text accesses per step.

SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3],   T = mississippi#,   P = si

At each step compare P with the suffix pointed to by the middle SA entry:
move right if P is larger, left if P is smaller.

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

Improvable to O(p + log2 N) [Manber-Myers, ’90]
and to O(p + log2 |Σ|) [Cole et al, ’06]

Locating the occurrences

T = mississippi#,   P = si,   occ = 2

Binary search for the SA range of suffixes prefixed by P, e.g. by searching for
the two lexicographic extremes si# and si$ (with # smaller and $ larger than
any char of Σ): the range contains the SA entries 7 and 4, i.e. the two
occurrences of “si”.

Suffix Array search:  O(p + log2 N + occ) time
Suffix Trays:         O(p + log2 |Σ| + occ)    [Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]
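A sketch of the binary search over SA in Python (3.10+ for bisect's key argument), on the running example; build_sa is the same naive construction discussed earlier.

from bisect import bisect_left, bisect_right

def build_sa(t):
    return sorted(range(len(t)), key=lambda i: t[i:])        # naive O(n^2 log n) construction

def occurrences(t, sa, p):
    lo = bisect_left(sa, p, key=lambda i: t[i:i + len(p)])   # first suffix whose prefix is >= p
    hi = bisect_right(sa, p, key=lambda i: t[i:i + len(p)])  # first suffix whose prefix is > p
    return sorted(sa[lo:hi])                                 # 0-based starting positions

t = "mississippi#"
sa = build_sa(t)
print(occurrences(t, sa, "si"))                              # -> [3, 6]  (positions 4 and 7 counting from 1)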

Text mining

Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

T   = mississippi#
SA  = 12 11 8 5 2 1 10 9 7 4 6 3
Lcp =  0  0 1 4 0 0  1 0 2 1 3

(e.g. the entry 4 is the lcp of the adjacent suffixes issippi# and ississippi#)

• How long is the common prefix between T[i,...] and T[j,...] ?
  • It is the min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  • Search for an entry Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  • Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.
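A sketch of the Lcp computation and of the first text-mining query above (a longest repeated substring is given by a maximum Lcp entry); naive O(N²) comparisons, just for illustration.

def lcp_array(t, sa):
    def lcp(a, b):
        k = 0
        while a + k < len(t) and b + k < len(t) and t[a + k] == t[b + k]:
            k += 1
        return k
    return [lcp(sa[i], sa[i + 1]) for i in range(len(sa) - 1)]

t = "mississippi#"
sa = sorted(range(len(t)), key=lambda i: t[i:])
lcp = lcp_array(t, sa)
m = max(lcp)                                          # length of a longest repeated substring
print(m, t[sa[lcp.index(m)]: sa[lcp.index(m)] + m])   # -> 4 issi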


Slide 230

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
Context
Empty

Counts
A
B
C
$

=
=
=
=

4
2
5
3

Context
A
B
C

Counts
C
$
A
$
A
B
C
$

=
=
=
=
=
=
=
=

3
1
2
1
1
2
2
3

Context
AC

BA
CA
CB
CC

String = ACCBACCACBA B

k=2

Counts
B
C
$
C
$
C
$
A
$
A
B
$

=
=
=
=
=
=
=
=
=
=
=
=

1
2
2
1
1
1
1
2
1
1
1
2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
We are given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows                                        (Burrows & Wheeler, 1994)

   F                         L
   # m i s s i s s i p p     i
   i # m i s s i s s i p     p
   i p p i # m i s s i s     s
   i s s i p p i # m i s     s
   i s s i s s i p p i #     m
   m i s s i s s i p p i     #
   p i # m i s s i s s i     p
   p p i # m i s s i s s     i
   s i p p i # m i s s i     s
   s i s s i p p i # m i     s
   s s i p p i # m i s s     i
   s s i s s i p p i # m     i

   L = BWT(T) = ipssm#pissii

A famous example

Much
longer...

A useful tool: the L → F mapping

(The same sorted matrix as above: F and L are its first and last columns; the middle of the matrix is unknown.)

How do we map L’s chars onto F’s chars ?
... we need to distinguish equal chars in F ...

Take two equal chars of L and rotate their rows rightward by one position:
they keep the same relative order in F !!

The BWT is invertible

(Again the sorted matrix: only the column L is known; F is obtained by sorting L’s chars; the rest of the matrix is unknown.)
Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
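A direct Python rendering of InvertBWT above (a sketch, not the original code): LF is obtained by stably sorting the positions of L, which sends every char of L to its row in F.

def invert_bwt(L):
    n = len(L)
    order = sorted(range(n), key=lambda i: (L[i], i))   # stable sort of L gives F
    LF = [0] * n
    for f_row, l_pos in enumerate(order):
        LF[l_pos] = f_row                               # row of F holding the char L[l_pos]
    T = [''] * n
    r = 0                                               # row 0 starts with the smallest char '#'
    for i in range(n - 1, -1, -1):                      # reconstruct T backward
        T[i] = L[r]
        r = LF[r]
    T = ''.join(T)
    return T[1:] + T[0]                                 # rotate: the sentinel '#' belongs at the end

print(invert_bwt("ipssm#pissii"))                       # mississippi#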

How to compute the BWT ?

  SA    BWT matrix row            L
  12    #mississipp               i
  11    i#mississip               p
   8    ippi#missis               s
   5    issippi#mis               s
   2    ississippi#               m
   1    mississippi               #
  10    pi#mississi               p
   9    ppi#mississ               i
   7    sippi#missi               s
   4    sissippi#mi               s
   6    ssippi#miss               i
   3    ssissippi#m               i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?

  SA    suffix
  12    #
  11    i#
   8    ippi#
   5    issippi#
   2    ississippi#
   1    mississippi#
  10    pi#
   9    ppi#
   7    sippi#
   4    sissippi#
   6    ssippi#
   3    ssissippi#

Elegant but inefficient: sort the suffixes of the input T = mississippi# by direct comparison.

Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults
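The "elegant but inefficient" construction, as a Python sketch (not the course's implementation): sort the suffixes of T (which must end with the unique smallest char '#'), then take L[i] = T[SA[i]-1] with positions counted from 1.

def bwt_by_sorting(T):
    n = len(T)
    SA = sorted(range(1, n + 1), key=lambda i: T[i - 1:])   # 1-based suffix array
    L = ''.join(T[i - 2] for i in SA)                        # char preceding each suffix (cyclically)
    return SA, L

SA, L = bwt_by_sorting("mississippi#")
print(SA)    # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(L)     # ipssm#pissii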

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet of size |Σ|+1

Bzip2-output = Arithmetic/Huffman on |Σ|+1 symbols...
... plus γ(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?

 The largest artifact ever conceived by humans
 Exploit the structure of the Web for:
   Crawl strategies
   Search
   Spam detection
   Discovering communities on the web
   Classification / organization
 Predict the evolution of the Web
   Sociological understanding

Many other large graphs…

 Physical network graph
   V = Routers
   E = communication links

 The “cosine” graph (undirected, weighted)
   V = static web pages
   E = semantic distance between pages

 Query-Log graph (bipartite, weighted)
   V = queries and URLs
   E = (q,u) if u is a result for q, and has been clicked by some user who issued q

 Social graph (undirected, unweighted)
   V = users
   E = (x,y) if x knows y (facebook, address book, email, ...)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:

 Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution (Altavista crawl 1999, WebBase crawl 2001):
the indegree follows a power-law distribution

   Pr[ in-degree(u) = k ]  ∝  1 / k^α ,     α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph

[Figure: the (i,j) plot of the links of a crawl with 21 millions of pages and 150 millions of
links, drawn after URL-sorting; the labeled dense blocks correspond to the Berkeley and
Stanford hosts]

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) is gap-encoded as {s₁ − x, s₂ − s₁ − 1, ..., s_k − s_{k−1} − 1}

For negative entries (only the first gap can be negative): map v ≥ 0 to 2v and v < 0 to 2|v| − 1,
as in the residual examples below.

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1
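A toy Python illustration (not the WebGraph library, and without copy-lists or intervals) of the gap encoding of one successor list: the first gap may be negative and is mapped to a non-negative value exactly as in the residual examples above. The node id and successor list are made up for the example.

def encode_successors(x, succ):
    gaps = [succ[0] - x] + [succ[i] - succ[i - 1] - 1 for i in range(1, len(succ))]
    v = gaps[0]
    gaps[0] = 2 * v if v >= 0 else 2 * abs(v) - 1   # signed-to-unsigned mapping of the first gap
    return gaps

print(encode_successors(15, [13, 15, 16, 17, 18, 19, 23, 24, 203]))
# [3, 1, 0, 0, 0, 0, 3, 0, 178]   (small numbers thanks to locality)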

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques

 caching: “avoid sending the same object again”
   done on the basis of objects
   only works if objects are completely unchanged
   How about objects that are slightly changed?

 compression: “remove redundancy in transmitted data”
   avoid repeated substrings in data
   can be extended to the history of past transmissions (overhead)
   What if the sender has never seen the data at the receiver ?

Types of Techniques

 Common knowledge between sender & receiver
   Unstructured file: delta compression

 “Partial” knowledge
   Unstructured files: file synchronization
   Record-based data: set reconciliation

Formalization

 Delta compression                                   [diff, zdelta, REBL, …]
   Compress file f deploying file f’
   Compress a group of files
   Speed up web access by sending the differences between the requested page
   and the ones available in cache

 File synchronization                                [rsync, zsync]
   A client updates its old file f_old with f_new available on a server
   Mirroring, Shared Crawling, Content Distribution Networks

 Set reconciliation
   A client updates a structured old file f_old with f_new available on a server
   Update of contacts or appointments, intersect inverted lists in a P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


 Assume that block moves and copies are allowed

 Find an optimal covering set of f_new based on f_known

 The LZ77 scheme provides an efficient, optimal solution:
   f_known is the “previously encoded text”; compress the concatenation f_known·f_new,
   emitting codewords only from f_new onwards

zdelta is one of the best implementations
            Emacs size    Emacs time
uncompr     27Mb          ---
gzip        8Mb           35 secs
zdelta      1.5Mb         42 secs
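This is not zdelta itself, but the same idea can be tried in a few lines with zlib's preset-dictionary support: compress f_new while letting the compressor copy substrings from f_known. The sample contents are invented for the demo.

import zlib

def delta_compress(f_known: bytes, f_new: bytes) -> bytes:
    c = zlib.compressobj(level=9, zdict=f_known)      # f_known primes the LZ77 window
    return c.compress(f_new) + c.flush()

def delta_decompress(f_known: bytes, delta: bytes) -> bytes:
    d = zlib.decompressobj(zdict=f_known)
    return d.decompress(delta) + d.flush()

old = b"the quick brown fox jumps over the lazy dog " * 100
new = old.replace(b"lazy", b"sleepy")
delta = delta_compress(old, new)
assert delta_decompress(old, delta) == new
print(len(new), len(zlib.compress(new, 9)), len(delta))   # raw vs self-compressed vs delta size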

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: we wish to compress a group of files F
 Useful on a dynamic collection of web pages, back-ups, …

 Apply pairwise zdelta: find for each f ∈ F a good reference

 Reduction to the Min Branching problem on DAGs:
   Build a weighted graph G_F: nodes = files, edge weights = zdelta-sizes
   Insert a dummy node connected to all files, whose edge weights are the gzip-coding sizes
   Compute the min branching = directed spanning tree of minimum total cost, covering G’s nodes

[Figure: a small example with a dummy node 0 and files 1, 2, 3, 5; the edge weights
(e.g. 20, 123, 220, 620, 2000) are the candidate delta sizes]

            space    time
uncompr     30Mb     ---
tgz         20%      linear
THIS        8%       quadratic

Improvement: what about many-to-one compression?             (group of files)

Problem: constructing G is very costly: n² edge calculations (zdelta executions).
We wish to exploit some pruning approach:

 Collection analysis: cluster the files that appear similar and are thus good candidates
 for zdelta-compression; build a sparse weighted graph G’_F containing only the edges
 between those pairs of files.

 Assign weights: estimate appropriate edge weights for G’_F, thus saving zdelta
 executions. Nonetheless, strictly n² time.

            space    time
uncompr     260Mb    ---
tgz         12%      2 mins
THIS        8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
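A sketch of the rsync idea in Python (my own toy, not the real tool): the receiver's blocks are hashed with a weak checksum; the side holding f_new slides over it and emits either a block reference or a literal byte. The real rsync updates the weak hash in O(1) per shift and double-checks candidates with MD5; here the hash is recomputed and the check is an exact compare.

def weak_hash(block):                      # stand-in for rsync's 4-byte rolling hash
    return sum(block) % (1 << 16)

def rsync_encode(f_new: bytes, f_old: bytes, B: int = 4):
    table = {weak_hash(f_old[i:i+B]): i for i in range(0, len(f_old) - B + 1, B)}
    out, j = [], 0
    while j < len(f_new):
        i = table.get(weak_hash(f_new[j:j+B]))
        if i is not None and f_old[i:i+B] == f_new[j:j+B]:   # verify the candidate block
            out.append(("copy", i, B)); j += B
        else:
            out.append(("lit", f_new[j])); j += 1
    return out

print(rsync_encode(b"abcdXYbcdabcd", b"abcdabcdabcd", B=4))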

Rsync: some experiments

          gcc      emacs
total     27288    27326
gzip      7563     8577
zdelta    227      1431
rsync     964      4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts

Pattern P occurs at position i of T
    iff    P is a prefix of the i-th suffix of T (i.e. of T[i,N])

Occurrences of P in T = all suffixes of T having P as a prefix

Example: P = si, T = mississippi  →  occurrences at positions 4, 7

SUF(T) = sorted set of the suffixes of T

Reduction: from substring search (in T) to prefix search (over SUF(T))

The Suffix Tree

T# = mississippi#

[Figure: the compacted trie of all the suffixes of T#; edges carry substring labels such as
“ssi”, “ppi#”, “si”, “mississippi#”, and each leaf stores the starting position (1..12) of
its suffix]

The Suffix Array
Prop 1. All suffixes in SUF(T) having prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

T = mississippi#            (storing SUF(T) explicitly would take Θ(N²) space)

  SA    SUF(T)
  12    #
  11    i#
   8    ippi#
   5    issippi#
   2    ississippi#
   1    mississippi#
  10    pi#
   9    ppi#
   7    sippi#
   4    sissippi#
   6    ssippi#
   3    ssissippi#

Each SA entry is a suffix pointer (e.g. P = si selects the contiguous entries 7, 4).

Suffix Array:
• SA: Θ(N log₂ N) bits
• Text T: N chars
→ In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step.

T = mississippi#, P = si: compare P with the suffix pointed to by the middle SA entry;
move right if P is larger, move left if P is smaller.

Suffix Array search:
• O(log₂ N) binary-search steps
• each step takes O(p) char comparisons
→ overall, O(p log₂ N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
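A minimal Python sketch of the indirect binary search on SA (not the course's code; it uses bisect's key argument, available from Python 3.10, and 0-based positions).

import bisect

def sa_build(T):
    return sorted(range(len(T)), key=lambda i: T[i:])            # 0-based suffix array

def sa_search(T, SA, P):
    key = lambda i: T[i:i + len(P)]                              # suffix truncated to |P| chars
    lo = bisect.bisect_left(SA, P, key=key)
    hi = bisect.bisect_right(SA, P, key=key)
    return sorted(SA[lo:hi])                                     # occurrences (0-based)

T = "mississippi#"
SA = sa_build(T)
print(sa_search(T, SA, "si"))    # [3, 6], i.e. positions 4 and 7 counting from 1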

Locating the occurrences

T = mississippi#, P = si: the binary search isolates the SA range of the suffixes lying
between si# and si$ (here SA[9,10] = 7, 4), so occ = 2 and the occurrences are at positions 4 and 7.

Suffix Array search: O(p + log₂ N + occ) time

Suffix Trays: O(p + log₂ |Σ| + occ)           [Cole et al., ’06]
String B-tree                                  [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays                   [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest common prefix between suffixes adjacent in SA

T = mississippi#
SA  = 12 11 8 5 2 1 10 9 7 4 6 3
Lcp =  0  0 1 4 0 0  1 0 2 1 3
(e.g. the entry 4 comes from the adjacent suffixes issippi# and ississippi#)

• How long is the common prefix between T[i,...] and T[j,...] ?
  It is the min of the subarray Lcp[h,k-1] such that SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  Search for an entry Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.

Slide 231

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
   C · p · f/(1+f)

This is at least 10⁴ · f/(1+f).
If we fetch B ≈ 4Kb in time C, and the algorithm uses all of them:

   (1/B) · (p · f/(1+f) · C)  ≈  30 · f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray

 Goal: Given a stock and its D-performance over time, find the time window in which it
 achieved the best “market performance”.
 Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Running time of the brute-force algorithms on inputs of growing size:

  n      4K    8K   16K   32K    128K   256K   512K   1M
  n³     22s   3m   26m   3.5h   28h    --     --     --
  n²     0     0    0     1s     26s    106s   7m     28m

An optimal solution
We assume every prefix subsum ≠ 0.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7            (optimum = 6+1-2+4+3 = 12)

Algorithm:
  sum = 0; max = -1;
  for i = 1, ..., n do
     if (sum + A[i] ≤ 0) then sum = 0;
     else { sum += A[i]; max = MAX{max, sum}; }

Note:
• sum < 0 right before OPT starts;
• sum > 0 within OPT.
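The same single-pass scan as Python code (a sketch; as on the slide it assumes the optimum is positive, so an all-negative array would just return 0).

def max_subarray(A):
    best = s = 0
    for x in A:
        if s + x <= 0:
            s = 0                      # the sum would drop <= 0: the optimum cannot start earlier
        else:
            s += x
            best = max(best, s)
    return best

print(max_subarray([2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]))   # 12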

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort

Merge-Sort(A,i,j)
01  if (i < j) then
02     m = (i+j)/2;            // Divide
03     Merge-Sort(A,i,m);      // Conquer
04     Merge-Sort(A,m+1,j);
05     Merge(A,i,m,j)          // Combine

Cost of Mergesort on large data

 Take Wikipedia in Italian and compute the word frequencies:
   n = 10⁹ tuples → a few Gbs
 Typical disk (Seagate Cheetah 150Gb): seek time ~5ms

 Analysis of mergesort on disk:
   It is an indirect sort: Θ(n log₂ n) random I/Os
   [5ms] × n log₂ n ≈ 1.5 years

 In practice it is faster, because of caching...              (2 passes (R/W))

Merge-Sort Recursion Tree

[Figure: the log₂ N levels of the mergesort recursion on a small numeric example; at each
level, pairs of sorted runs are merged into runs of double length]

If the run-size is larger than B (i.e. after the first step!!), fetching all of it in
memory for merging does not help.

How do we deploy the disk/memory features?
 With internal memory M we get N/M runs, each sorted in internal memory (no extra I/Os)
 The I/O-cost for merging them pairwise is ≈ 2 (N/B) log₂ (N/M)

Multi-way Merge-Sort

 The key is to balance run-size and #runs to merge
 Sort N items with main memory M and disk pages of B items:
   Pass 1: produce (N/M) sorted runs.
   Pass i: merge X ≤ M/B runs at a time  ⇒  log_{M/B} (N/M) passes

[Figure: X input buffers of B items each and one output buffer in main memory, streaming
the runs from disk and the merged run back to disk]

Multiway Merging

Keep a current page per run (buffer Bf_i with cursor p_i) plus an output buffer Bf_o:
repeatedly move  min( Bf₁[p₁], Bf₂[p₂], …, Bf_X[p_X] )  to Bf_o;
fetch the next page of run i when p_i = B; flush Bf_o to the merged output run when it is
full; stop when all the X = M/B runs reach EOF.
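An in-memory sketch of the X-way merge above, using a min-heap over the current heads of the runs (the standard library's heapq.merge does the same thing); buffering and I/O are omitted.

import heapq

def multiway_merge(runs):
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        val, i, j = heapq.heappop(heap)          # minimum among the current heads
        out.append(val)
        if j + 1 < len(runs[i]):
            heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
    return out

runs = [[1, 2, 5, 9], [2, 7, 8, 13], [3, 4, 11, 19], [6, 8, 15, 17]]
print(multiway_merge(runs) == sorted(sum(runs, [])))   # True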

Cost of Multi-way Merge-Sort

 Number of passes = log_{M/B} #runs  ≤  log_{M/B} (N/M)
 Optimal cost = Θ( (N/B) · log_{M/B} (N/M) ) I/Os

In practice:
 M/B ≈ 1000  ⇒  #passes = log_{M/B} (N/M) ≈ 1
 One multiway merge ⇒ 2 passes (R/W) = a few minutes
 (Tuning depends on the disk features)

 A large fan-out (M/B) decreases the #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements

 Goal: top queries over a stream of N items (Σ large).
 Math Problem: find the item y whose frequency is > N/2, using the smallest space
 (i.e. the mode, provided it occurs > N/2 times).

A = b a c c c d c b a a a c c b c c c

Algorithm:
 Use a pair of variables <X, C>  (start with X = A[1], C = 1)
 For each item s of the stream:
   if (X == s) then C++
   else { C--;  if (C == 0) { X = s; C = 1; } }
 Return X;

Proof
If X ≠ y at the end of the scan, then every one of y’s occurrences has a “negative” mate
(an occurrence that cancelled it), hence the mates are at least #occ(y) and
2 · #occ(y) ≤ N — contradicting #occ(y) > N/2.
(Problems arise only when the most frequent item occurs ≤ N/2 times.)
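The same one-pass, O(1)-space scan as Python code (a sketch; the answer is guaranteed only when some item really occurs more than N/2 times).

def majority_candidate(stream):
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1
        elif X == s:
            C += 1
        else:
            C -= 1
    return X

print(majority_candidate("bacccdcbaaaccbccc"))   # c  (9 occurrences out of 17)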

Toy problem #4: Indexing

 Consider the following TREC collection:
   N = 6 · 10⁹ characters  →  size = 6Gb
   n = 10⁶ documents
   TotT = 10⁹ term occurrences (avg term length is 6 chars)
   t = 5 · 10⁵ distinct terms

What kind of data structure do we build to support word-based searches ?

Solution 1: Term-Doc matrix          (n = 1 million docs, t = 500K terms)

              Antony&Cleop.  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
  Antony            1              1             0          0        0        1
  Brutus            1              1             0          1        0        0
  Caesar            1              1             0          1        1        1
  Calpurnia         0              1             0          0        0        0
  Cleopatra         1              0             0          0        0        0
  mercy             1              0             1          1        1        1
  worser            1              0             1          1        1        0

  (1 if the play contains the word, 0 otherwise)

Space is 500Gb !

Solution 2: Inverted index

  Brutus     →  2 4 8 16 32 64 128
  Calpurnia  →  1 2 3 5 8 13 21 34
  Caesar     →  13 16

1. Typically about 12 bytes per posting
2. We have 10⁹ total terms → at least 12Gb of space
3. Compressing the 6Gb of documents gets 1.5Gb of data
Better index, but it is still >10 times the (compressed) text !!!!
We can still do better: about 30÷50% of the original text.

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO: they are 2ⁿ, but we have fewer compressed messages…

   Σ_{i=1}^{n-1} 2^i  =  2ⁿ − 2

We need to talk about stochastic sources.

Entropy (Shannon, 1948)
For a set of symbols S with probabilities p(s), the self-information of s is:

   i(s) = log₂ (1/p(s)) = − log₂ p(s)

Lower probability → higher information.

Entropy is the weighted average of i(s):

   H(S) = Σ_{s∈S} p(s) · log₂ (1/p(s))     bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable-length code in which no codeword is a prefix of another one
e.g. a = 0, b = 100, c = 101, d = 11
It can be viewed as a binary trie whose leaves are the symbols (left branch = 0, right branch = 1).

Average Length
For a code C with codeword lengths L[s], the average length is defined as

   La(C) = Σ_{s∈S} p(s) · L[s]

We say that a prefix code C is optimal if for all prefix codes C’,  La(C) ≤ La(C’).

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any uniquely
decodable code C, we have  H(S) ≤ La(C).

Theorem (upper bound, Shannon). For any probability distribution, there exists a prefix
code C such that  La(C) ≤ H(S) + 1.
(The Shannon code assigns to s a codeword of ⌈log₂ 1/p(s)⌉ bits.)

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

[Huffman tree: merge a(.1)+b(.2) = (.3); merge (.3)+c(.2) = (.5); merge (.5)+d(.5) = (1)]

a = 000, b = 001, c = 01, d = 1

There are 2^(n-1) “equivalent” Huffman trees (flip the 0/1 labels of the internal nodes).
What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: emit the root-to-leaf path leading to the symbol to be encoded.
Decoding: start at the root and take the branch for each bit received; when at a leaf,
output its symbol and return to the root.

   abc...  →  000 001 01 ...            101001...  →  d c b
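A compact Python sketch of the Huffman construction with a heap (not the course's code); on the running example it yields codeword lengths 1, 2, 3, 3 — the same as d=1, c=01, a=000, b=001 above, up to the arbitrary orientation of ties.

import heapq
from itertools import count

def huffman(probs):
    tick = count()                                   # tie-breaker: the heap never compares dicts
    heap = [(p, next(tick), {s: ""}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)              # two least probable subtrees
        p2, _, c2 = heapq.heappop(heap)
        code = {s: "0" + w for s, w in c1.items()}
        code.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, next(tick), code))
    return heap[0][2]

print(huffman({"a": .1, "b": .2, "c": .2, "d": .5}))
# {'d': '0', 'c': '10', 'a': '110', 'b': '111'}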

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store, for any level L of the canonical codeword tree:
  firstcode[L]   (the smallest codeword of length L, e.g. 00.....0 on the deepest level)
  Symbol[L,i], for each i in level L
This takes ≤ h² + |Σ| log |Σ| bits  (h = tree height).

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. Its self-information is

   − log₂(.999) ≈ .00144 bits

If we were to send 1000 such symbols we might hope to use 1000 · .00144 ≈ 1.44 bits.
Using Huffman, we take at least 1 bit per symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra bits per symbol
 Larger model to be transmitted: it takes |Σ|^k · (k · log |Σ|) + h² bits (where h might be |Σ|^k)

Shannon took infinite sequences, and k → ∞ !!

In practice we have:  H₀(S^L) ≤ L · H_k(S) + O(k · log |Σ|), for each k ≤ L.

Compress + Search ?                                   [Moura et al, 98]

Compressed text derived from a word-based Huffman:
 The symbols of the Huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged: the tag bit of each byte tells whether it starts a
 codeword, and the remaining 7 bits carry the Huffman configuration

[Figure: the 128-ary tagged tree for T = “bzip or not bzip”, with the byte-aligned codewords
of the words “bzip”, “or”, “not” and the space; C(T) is the concatenation of the codewords]

CGrep and other ideas...

P = bzip  →  its codeword is 1a 0b; run GREP for these two bytes directly on C(T)
(T = “bzip or not bzip”): thanks to the tag bits a candidate match cannot start in the
middle of another codeword (the yes/no marks in the figure).

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary = { bzip, not, or, space }              S = “bzip or not bzip”
P = bzip  →  its codeword is 1a 0b

Search for P’s codeword directly in the compressed text C(S) (the tagged, word-based
Huffman encoding shown above); the tag bits tell which byte positions can start a match
(the yes/no marks in the figure).

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the pattern P[1,m] in the text T[1,n].

[Figure: a text T and a pattern P = AB over the alphabet {A,B,C,D}]

 Naïve solution
   For any position i of T, check whether T[i,i+m-1] = P[1,m]
   Complexity: O(nm) time

 (Classical) optimal solutions based on comparisons
   Knuth-Morris-Pratt
   Boyer-Moore
   Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons

 Strings are also numbers. Let H: strings → numbers; for a string s of length m:

   H(s) = Σ_{i=1}^{m} 2^{m−i} · s[i]

 Example: P = 0101  →  H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5
 s = s’ if and only if H(s) = H(s’)

Definition: let T_r denote the m-length substring of T starting at position r
(i.e., T_r = T[r, r+m−1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons

 We can compute H(T_r) from H(T_{r−1}):

   H(T_r) = 2 · H(T_{r−1}) − 2^m · T(r−1) + T(r+m−1)

 Example (T = 10110101, m = 4):
   T₁ = 1011, T₂ = 0110
   H(T₁) = H(1011) = 11
   H(T₂) = 2·11 − 2⁴·1 + 0 = 22 − 16 + 0 = 6 = H(0110)

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P = 101111, q = 7
H(P) = 47,  H_q(P) = 47 mod 7 = 5

H_q(P) can be computed incrementally (Horner’s rule, reducing mod 7 at every step):
   1·2 (mod 7) + 0 = 2
   2·2 (mod 7) + 1 = 5
   5·2 (mod 7) + 1 = 4
   4·2 (mod 7) + 1 = 2
   2·2 (mod 7) + 1 = 5
   5 (mod 7) = 5 = H_q(P)

We can still compute H_q(T_r) from H_q(T_{r−1}), since
   2^m (mod q) = 2 · ( 2^{m−1} (mod q) ) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
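A minimal Python sketch of the Karp-Rabin scan for a binary text and pattern (my own toy, not the course's code), using the incremental update above and an explicit verification so that no false match is ever reported; q = 101 is just a sample prime.

def karp_rabin(T, P, q=101):
    n, m = len(T), len(P)
    hp = ht = 0
    for i in range(m):                            # fingerprints of P and of the first window
        hp = (2 * hp + int(P[i])) % q
        ht = (2 * ht + int(T[i])) % q
    top = pow(2, m, q)                            # 2^m mod q, used to drop the leaving bit
    occ = []
    for r in range(n - m + 1):
        if hp == ht and T[r:r+m] == P:            # explicit check rules out false matches
            occ.append(r + 1)                     # 1-based position
        if r + m < n:
            ht = (2 * ht - top * int(T[r]) + int(T[r + m])) % q
    return occ

print(karp_rabin("10110101", "0101"))             # [5]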

Problem 1: Solution
Dictionary = { bzip, not, or, space }              S = “bzip or not bzip”
P = bzip  →  its codeword is 1a 0b

Scan C(S) and match the two tagged bytes of P’s codeword (same figure as before).

Speed ≈ Compression ratio

The Shift-And method

 Define M to be a binary m × n matrix such that:
   M(i,j) = 1  iff  the first i characters of P exactly match the i characters of T
   ending at character j, i.e.  M(i,j) = 1 iff P[1…i] = T[j−i+1…j]

 Example: T = california, P = for

          c  a  l  i  f  o  r  n  i  a
    f     0  0  0  0  1  0  0  0  0  0
    o     0  0  0  0  0  1  0  0  0  0
    r     0  0  0  0  0  0  1  0  0  0
How does M solve the exact match problem?

How to construct M

 We want to exploit bit-parallelism to compute the j-th column of M from the (j−1)-th one.
 Machines can perform bit and arithmetic operations between two words in constant time.
 Examples:
   And(A,B) is the bit-wise and between A and B.
   BitShift(A) is the value derived by shifting A’s bits down by one position and setting
   the first bit to 1 (e.g. BitShift([0,1,1,0,1]) = [1,0,1,1,0]).

 Let w be the word size (e.g., 32 or 64 bits). We’ll assume m = w.
 NOTICE: any column of M fits in a memory word.

How to construct M

 We define, for each character x of the alphabet, the m-length binary vector U(x):
 U(x) is set to 1 at the positions in P where character x appears.

 Example: P = abaac
   U(a) = [1,0,1,1,0]      U(b) = [0,1,0,0,0]      U(c) = [0,0,0,0,1]

How to construct M

 Initialize column 0 of M to all zeros.
 For j > 0, the j-th column is obtained as

   M(j) = BitShift( M(j−1) )  &  U( T[j] )

 For i > 1, entry M(i,j) = 1 iff
   (1) the first i−1 characters of P match the i−1 characters of T ending at character j−1
       ⇔ M(i−1, j−1) = 1
   (2) P[i] = T[j]  ⇔  the i-th bit of U(T[j]) is 1
 BitShift moves bit M(i−1,j−1) into the i-th position; AND-ing it with the i-th bit of
 U(T[j]) establishes whether both conditions hold.

An example (T = xabxabaaca, P = abaac; columns are written top-to-bottom, bits 1..5)

   j        1      2      3      4      5      6      7      8      9      10
   T[j]     x      a      b      x      a      b      a      a      c      a
   M(j)     00000  10000  01000  00000  10000  01000  10100  10010  00001  10000

At j = 9, M(5,9) = 1: the whole pattern P = abaac occurs in T ending at position 9.

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.
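A word-parallel Shift-And sketch in Python (integers play the role of bit-vectors: bit i−1 of a mask corresponds to row i of the matrix M above); not the course's code.

def shift_and(T, P):
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)          # U(c): 1s at the positions where c occurs in P
    M, last, occ = 0, 1 << (len(P) - 1), []
    for j, c in enumerate(T, 1):
        M = ((M << 1) | 1) & U.get(c, 0)       # M(j) = BitShift(M(j-1)) & U(T[j])
        if M & last:
            occ.append(j - len(P) + 1)         # an occurrence of P ends at position j
    return occ

print(shift_and("xabxabaaca", "abaac"))        # [5]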

Some simple extensions

 We want to allow the pattern to contain special symbols, like the class of chars [a-f].
 Example: P = [a-b]baac

   U(a) = [1,0,1,1,0]      U(b) = [1,1,0,0,0]      U(c) = [0,0,0,0,1]

 What about ‘?’, ‘[^…]’ (not) ?

Problem 1: Another solution
Dictionary = { bzip, not, or, space }              S = “bzip or not bzip”
P = bzip  →  its codeword is 1a 0b
(the same occurrences can be located by running the Shift-And scan just described for the
codeword bytes of P over C(S))

Speed ≈ Compression ratio

Problem 2
Dictionary = { bzip, not, or, space }              S = “bzip or not bzip”
Given a pattern P, find all the occurrences in S of all the dictionary terms containing P
as a substring.   Example: P = o  →  terms “or” (codeword 1g 0a 0b) and “not” (codeword 1g 0g 0a).

Speed ≈ Compression ratio?  No!  Why?  A scan of C(S) is needed for each term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary = { bzip, not, or, space }              S = “bzip or not bzip”
Given a pattern P, find all the occurrences in S of all the dictionary terms containing P
as a substring, allowing at most k mismatches.   Example: P = bot, k = 2.

[Figure: the tagged word-based Huffman codewords of the dictionary, as before]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing M^k

 We compute M^l for all l = 0, …, k; for each j we compute M⁰(j), M¹(j), …, M^k(j).
 For all l, initialize M^l(0) to the zero vector.
 To compute M^l(j), observe that M^l(i,j) = 1 iff one of two cases holds.

Case 1: the first i−1 characters of P match a substring of T ending at j−1 with at most l
mismatches, and the next pair of characters in P and T are equal:

   BitShift( M^l(j−1) ) & U( T[j] )

Case 2: the first i−1 characters of P match a substring of T ending at j−1 with at most
l−1 mismatches (and P[i] is charged as a mismatch against T[j]):

   BitShift( M^{l−1}(j−1) )

Putting the two cases together:

   M^l(j) = [ BitShift( M^l(j−1) ) & U(T[j]) ]  OR  BitShift( M^{l−1}(j−1) )

Example (P = abaad, T = xabxabaaca, k = 1)

[The slide displays the two 5×10 matrices M⁰ and M¹. In M¹ the entry (5,9) equals 1:
the whole pattern abaad matches T[5,9] = abaac with a single mismatch.]
How much do we pay?

 The running time is O( k · n · (1 + m/w) ).
 Again, the method is practically efficient for small m.
 Only O(k) columns of M are needed at any given time; hence the space used by the
 algorithm is O(k) memory words.

Problem 3: Solution
Dictionary = { bzip, not, or, space }              S = “bzip or not bzip”
P = bot, k = 2  →  among the dictionary terms, “not” (codeword 1g 0g 0a) matches P within
2 mismatches.

Speed ≈ Compression ratio

Agrep: more sophisticated operations

 The Shift-And method can solve other operations.
 The edit distance between two strings p and s is d(p,s) = the minimum number of
 operations needed to transform p into s via three ops:
   Insertion: insert a symbol in p
   Deletion: delete a symbol from p
   Substitution: change a symbol of p into a different one
 Example: d(ananas, banane) = 3

 Search by regular expressions
   Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding

   γ(x) = 000...........0  x in binary            (Length−1 zeros, then the binary of x)

 x > 0 and Length = ⌊log₂ x⌋ + 1
 e.g., 9 is represented as <000, 1001>

 The γ-code for x takes 2⌊log₂ x⌋ + 1 bits     (i.e. a factor of 2 from optimal)
 Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…

 Given the following sequence of γ-coded integers, reconstruct the original sequence:

   0001000 00110 011 00000111011 00111    →    8   6   3   59   7
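A small Python sketch of the γ encoder/decoder (bits kept in a string for readability); on the exercise above it reproduces exactly that bit sequence.

def gamma_encode(nums):
    out = []
    for x in nums:
        b = bin(x)[2:]                      # binary representation of x, starts with 1
        out.append("0" * (len(b) - 1) + b)
    return "".join(out)

def gamma_decode(bits):
    nums, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":               # count the leading zeros = Length - 1
            z += 1; i += 1
        nums.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return nums

code = gamma_encode([8, 6, 3, 59, 7])
print(code)                                 # 0001000001100110000011101100111
print(gamma_decode(code))                   # [8, 6, 3, 59, 7]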

Analysis
Sort the p_i in decreasing order, and encode symbol s_i via the variable-length code γ(i).
Recall that |γ(i)| ≤ 2·log i + 1.
How good is this approach wrt Huffman?  Compression ratio ≤ 2·H₀(s) + 1.
Key fact:   1 ≥ Σ_{i=1,…,x} p_i ≥ x · p_x   ⇒   x ≤ 1/p_x

How good is it ?
The cost of the encoding is (recall i ≤ 1/p_i):

   Σ_{i=1,…,|Σ|} p_i · |γ(i)|  ≤  Σ_{i=1,…,|Σ|} p_i · [ 2·log(1/p_i) + 1 ]  =  2·H₀(X) + 1

Not much worse than Huffman, and improvable to H₀(X) + 2 + ...

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
The distribution of words is skewed: 1/i^θ, where 1 < θ < 2

 A new concept: Continuers vs Stoppers — previously we used s = c = 128.
 The main idea is:
   s + c = 256 (we are playing with 8 bits)
   Thus s items are encoded with 1 byte,
   s·c items with 2 bytes, s·c² on 3 bytes, ...

 An example
   5000 distinct words
   ETDC encodes 128 + 128² = 16512 words on at most 2 bytes
   A (230,26)-dense code encodes 230 + 230·26 = 6210 words on at most 2 bytes,
   hence more words on 1 byte — and thus better if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move-to-Front Coding
Transforms a char sequence into an integer sequence, that can then be var-length coded.

 Start with the list of symbols L = [a,b,c,d,…]
 For each input symbol s:
   1) output the position of s in L
   2) move s to the front of L

There is a memory.  Properties:
 It exploits temporal locality, and it is dynamic
 X = 1ⁿ 2ⁿ 3ⁿ … nⁿ  →  Huff = O(n² log n) bits, MTF = O(n log n) + n² bits
 Not much worse than Huffman... but it may be far better
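A tiny Move-to-Front coder/decoder in Python (a sketch, with the symbol list kept as a plain Python list); on a prefix of the BWT string used in the Bzip example it reproduces the first digits of that Mtf stream.

def mtf_encode(s, alphabet):
    L, out = list(alphabet), []
    for c in s:
        i = L.index(c)
        out.append(i)
        L.insert(0, L.pop(i))         # move c to the front of the list
    return out

def mtf_decode(codes, alphabet):
    L, out = list(alphabet), []
    for i in codes:
        c = L[i]
        out.append(c)
        L.insert(0, L.pop(i))
    return "".join(out)

codes = mtf_encode("ipppssssssmmmii", "imps")
print(codes)                           # [0, 2, 0, 0, 3, 0, 0, 0, 0, 0, 3, 0, 0, 3, 0]
print(mtf_decode(codes, "imps") == "ipppssssssmmmii")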

MTF: how good is it ?
Encode the output integers via γ-coding: |γ(i)| ≤ 2·log i + 1.
Put Σ in front of the sequence and consider the cost of encoding (n_x = #occurrences of
symbol x, p_i^x = position of its i-th occurrence):

   O(|Σ| log |Σ|) + Σ_{x=1}^{|Σ|} Σ_{i=2}^{n_x} |γ( p_i^x − p_{i−1}^x )|

By Jensen’s inequality:

   ≤ O(|Σ| log |Σ|) + Σ_{x=1}^{|Σ|} n_x · [ 2·log(N/n_x) + 1 ]
   = O(|Σ| log |Σ|) + N · [ 2·H₀(X) + 1 ]

Hence   La[mtf] ≤ 2·H₀(X) + O(1).

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then abbbaacccca ⇒ (a,1),(b,3),(a,2),(c,4),(a,1).
In the case of binary strings → just the run lengths and one initial bit.
Properties:
 It exploits spatial locality, and it is a dynamic code (there is a memory)
 X = 1ⁿ 2ⁿ 3ⁿ … nⁿ  →  Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive), of width p(s):

   f(i) = Σ_{j=1}^{i−1} p(j)

e.g.  p(a) = .2, p(b) = .5, p(c) = .3   →   f(a) = .0, f(b) = .2, f(c) = .7
      a = [0, .2),   b = [.2, .7),   c = [.7, 1.0)

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7)).

Arithmetic Coding: Encoding Example
Coding the message sequence: bac

   start:       [0, 1)
   after b:     [.2, .7)        (width .5)
   after a:     [.2, .3)        (width .5 · .2 = .1)
   after c:     [.27, .3)       (width .1 · .3 = .03)

The final sequence interval is [.27, .3).

Arithmetic Coding
To code a sequence of symbols c₁ c₂ … c_n with probabilities p[c], use the following:

   l₀ = 0      l_i = l_{i−1} + s_{i−1} · f[c_i]
   s₀ = 1      s_i = s_{i−1} · p[c_i]

f[c] is the cumulative probability up to symbol c (not included).
The final interval size is   s_n = Π_{i=1}^{n} p[c_i].

The interval for a message sequence will be called the sequence interval.
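The two recurrences as a few lines of Python (plain floating point, so only a sketch: a real coder uses the integer version described later); on the message "bac" it yields the interval [.27, .3) of the example.

def sequence_interval(msg, p):
    f, acc = {}, 0.0
    for c in sorted(p):                   # cumulative probabilities f[c]
        f[c], acc = acc, acc + p[c]
    l, s = 0.0, 1.0
    for c in msg:                         # l_i = l_{i-1} + s_{i-1} * f[c_i],  s_i = s_{i-1} * p[c_i]
        l, s = l + s * f[c], s * p[c]
    return l, s

l, s = sequence_interval("bac", {"a": .2, "b": .5, "c": .3})
print(round(l, 3), round(l + s, 3))       # 0.27 0.3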

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:

   .49 ∈ [.2, .7)  = b’s interval   →  output b, rescale: (.49 − .2)/.5 = .58
   .58 ∈ [.2, .7)  = b’s interval   →  output b, rescale: (.58 − .2)/.5 = .76
   .76 ∈ [.7, 1.0) = c’s interval   →  output c

The message is bbc.

Representing a real number
Binary fractional representation:
. 75

 . 11

1/ 3

 . 01 01

11 / 16

 . 1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
m in

m ax

interval

.11

.110

.111

[.75 ,1.0 )

.101

.1010

.1011

[.625 , .75 )

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on the Arithmetic length

Note that −log s + 1 = log(2/s).

Theorem: For a text of length n, the Arithmetic encoder generates at most
   1 + ⌈log (1/s)⌉ = 1 + ⌈log Π_i (1/p_i)⌉
   ≤ 2 + Σ_{j=1,…,n} log (1/p_j)
   = 2 + Σ_{k=1,…,|Σ|} n·p_k · log (1/p_k)
   = 2 + n·H₀   bits

In practice ≈ n·H₀ + 0.02·n bits, because of rounding.

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
Context
Empty

Counts
A
B
C
$

=
=
=
=

4
2
5
3

Context
A
B
C

Counts
C
$
A
$
A
B
C
$

=
=
=
=
=
=
=
=

3
1
2
1
1
2
2
3

Context
AC

BA
CA
CB
CC

String = ACCBACCACBA B

k=2

Counts
B
C
$
C
$
C
$
A
$
A
B
$

=
=
=
=
=
=
=
=
=
=
=
=

1
2
2
1
1
1
1
2
1
1
1
2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s chars onto F’s chars ?
... We need to distinguish equal chars in F...

Take two equal chars of L
Rotate their rows rightward by one position
Their relative order is preserved !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
  Compute LF[0,n-1];      // LF[i] = position in F of the char L[i]
  r = 0; i = n;
  while (i > 0) {
    T[i] = L[r];          // L[r] precedes F[r] in T
    r = LF[r]; i--;       // jump to the row whose F-char is L[r]
  }
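A runnable sketch of the inversion; LF is computed by stable counting (which preserves the relative order of equal chars, the key property above), and the start row is located via the '#' terminator rather than the slide’s r = 0 convention:

#include <string>
#include <vector>

// Invert the BWT: rebuild T (ending with '#') from its last column L.
std::string invert_bwt(const std::string& L) {
    size_t n = L.size();
    std::vector<int> count(256, 0), first(256, 0), seen(256, 0), LF(n);
    for (unsigned char c : L) count[c]++;
    for (int c = 1; c < 256; ++c) first[c] = first[c-1] + count[c-1];  // start of c in F
    for (size_t i = 0; i < n; ++i) {
        unsigned char c = L[i];
        LF[i] = first[c] + seen[c]++;          // i-th occurrence of c in L = i-th in F
    }
    int r = 0;
    while (L[r] != '#') ++r;                   // row of the sorted matrix equal to T itself
    std::string T(n, ' ');
    for (size_t i = n; i > 0; --i) {           // reconstruct T backward
        T[i-1] = L[r];
        r = LF[r];
    }
    return T;                                  // e.g. "ipssm#pissii" -> "mississippi#"
}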

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults
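The “elegant but inefficient” construction, spelled out as a sketch (0-based positions, so the slide’s L[i] = T[SA[i]-1] becomes a cyclic predecessor):

#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Build SA by sorting suffix start positions with plain string comparisons,
// then read off the BWT. Worst-case Θ(n^2 log n) time, as noted above.
std::string bwt_via_sa(const std::string& T) {            // T must end with '#'
    size_t n = T.size();
    std::vector<int> SA(n);
    std::iota(SA.begin(), SA.end(), 0);                    // positions 0..n-1
    std::sort(SA.begin(), SA.end(), [&](int a, int b) {
        return T.compare(a, std::string::npos, T, b, std::string::npos) < 0;
    });
    std::string L(n, ' ');
    for (size_t i = 0; i < n; ++i)
        L[i] = T[(SA[i] + n - 1) % n];                      // char preceding the suffix
    return L;                                               // mississippi# -> ipssm#pissii
}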

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: the probability that a node has x links is ∝ 1/x^α, with α ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in-degree(u) = k ]  ∝  1 / k^α ,   with  α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: the probability that a node has x links is ∝ 1/x^α, with α ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:
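(Only the first gap s1 − x can be negative.) A small sketch of this gap encoding; the folding of that possibly-negative first entry onto the naturals (2g for g ≥ 0, 2|g|−1 otherwise) is an assumption of the sketch, not necessarily the library’s exact rule:

#include <cstdlib>
#include <vector>

// Gap-encoding of a sorted successor list of node x, as in the scheme above.
std::vector<long> gap_encode(long x, const std::vector<long>& succ) {
    std::vector<long> gaps;
    if (succ.empty()) return gaps;
    long g = succ[0] - x;                          // may be negative (locality around x)
    gaps.push_back(g >= 0 ? 2 * g : 2 * std::labs(g) - 1);
    for (size_t i = 1; i < succ.size(); ++i)
        gaps.push_back(succ[i] - succ[i-1] - 1);   // >= 0, since the list is sorted
    return gaps;                                   // each gap then goes to a var-length code
}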

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77-scheme provides an efficient, optimal solution

fknown is the “previously encoded text”: compress the concatenation fknown·fnew, starting the parsing from fnew

zdelta is one of the best implementations
           Emacs size   Emacs time
uncompr    27Mb         ---
gzip       8Mb          35 secs
zdelta     1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: example collection graph GF with a dummy root node; edge weights are the pairwise zdelta sizes, while the edges from the dummy node carry the plain gzip sizes]

           space   time
uncompr    30Mb    ---
tgz        20%     linear
THIS       8%      quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
           space    time
uncompr    260Mb    ---
tgz        12%      2 mins
THIS       8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
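A sketch of an Adler-style rolling checksum in the spirit of the “4-byte rolling hash” mentioned above (illustrative, not rsync’s actual code): sliding the window by one byte costs O(1), which is what lets every alignment of the old blocks be tested inside the new file.

#include <cstdint>
#include <string>

struct RollingHash {
    uint32_t a = 0, b = 0;     // a = sum of bytes, b = sum of prefix sums
    size_t k = 0;              // window (block) size
    void init(const std::string& s, size_t start, size_t k_) {
        a = b = 0; k = k_;
        for (size_t i = 0; i < k; ++i) {
            a += (unsigned char)s[start + i];
            b += (uint32_t)(k - i) * (unsigned char)s[start + i];
        }
    }
    // slide the window by one position: drop byte 'out', append byte 'in'
    void roll(unsigned char out, unsigned char in) {
        a = a - out + in;
        b = b - (uint32_t)k * out + a;
    }
    uint32_t digest() const { return (a & 0xffff) | (b << 16); }
};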

Rsync: some experiments

          gcc size   emacs size
total     27288      27326
gzip      7563       8577
zdelta    227        1431
rsync     964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends the hashes (unlike the client in rsync), and the client checks them
The server exploits the common fref to compress the new ftar (rsync compresses just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Figure: the suffix tree of T# = mississippi# (N = 12); edges are labeled with substrings (e.g. ssi, ppi#, si, i#, pi#, mississippi#) and each leaf stores the starting position of its suffix]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
5

Θ(N²) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
⇒ In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
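A compact sketch of the basic O(p log2 N) search on SA; std::lower_bound and std::partition_point stand in for the two binary searches delimiting the range (0-based positions, an assumption of the sketch):

#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Returns the SA range [lo, hi) of the suffixes having P as a prefix;
// each comparison looks at O(p) chars, O(log N) steps overall.
std::pair<size_t,size_t> sa_search(const std::string& T, const std::vector<int>& SA,
                                   const std::string& P) {
    auto lo = std::lower_bound(SA.begin(), SA.end(), P,
        [&](int suf, const std::string& pat) {
            return T.compare(suf, pat.size(), pat) < 0;     // suffix < pattern ?
        });
    auto hi = std::partition_point(lo, SA.end(), [&](int suf) {
        return T.compare(suf, P.size(), P) == 0;            // still prefixed by P
    });
    return { (size_t)(lo - SA.begin()), (size_t)(hi - SA.begin()) };
}
// occ = hi - lo; the occurrences are T[SA[lo]], ..., T[SA[hi-1]] (Prop. 1 above).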

Locating the occurrences
T = mississippi# ,  P = si   ⇒   occ = 2  (starting positions 4 and 7)
[Figure: SA with the contiguous range of suffixes prefixed by “si” highlighted; the range is delimited by searching for si# and si$, with # < S < $]

Suffix Array search
• O(p + log2 N + occ) time

Suffix Trays: O(p + log2 |S| + occ)     [Cole et al., ’06]
String B-tree                           [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays            [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
  • It is the min of the subarray Lcp[h,k-1], where SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  • Search for an entry Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  • Search for a window Lcp[i,i+C-2] whose entries are all ≥ L


Slide 232

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 10^4 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and the algorithm uses all of them:
(1/B) * C * p * f/(1+f)  ≈  30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
        4K    8K    16K   32K    128K   256K   512K   1M
n^3     22s   3m    26m   3.5h   28h    --     --     --
n^2     0     0     0     1s     26s    106s   7m     28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm
  sum = 0; max = -1;
  for i = 1, ..., n do
    if (sum + A[i] ≤ 0) sum = 0;
    else { sum += A[i]; max = MAX{max, sum}; }

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT
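A runnable rendering of this linear scan; it also reports the window [best_i, best_j] attaining the maximum (1-based positions, an addition of the sketch):

#include <utility>
#include <vector>

std::pair<long, std::pair<int,int>> max_subarray(const std::vector<long>& A) {
    long sum = 0, best = -1;
    int start = 1, best_i = 0, best_j = -1;
    for (int i = 1; i <= (int)A.size(); ++i) {
        if (sum + A[i-1] <= 0) { sum = 0; start = i + 1; }   // restart after a drop
        else {
            sum += A[i-1];
            if (sum > best) { best = sum; best_i = start; best_j = i; }
        }
    }
    return {best, {best_i, best_j}};
}
// For A = 2 -5 6 1 -2 4 3 -13 9 -6 7 the best window is [3,7] with sum 12.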

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01  if (i < j) then
02      m = (i+j)/2;             // Divide
03      Merge-Sort(A,i,m);       // Conquer
04      Merge-Sort(A,m+1,j);     // Conquer
05      Merge(A,i,m,j)           // Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word frequencies:
  n = 10^9 tuples  ⇒  a few GBs

Typical disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:
  It is an indirect sort: Θ(n log2 n) random I/Os
  [5ms] * n log2 n  ≈  1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
[Figure: binary merge-sort recursion tree over an example key sequence, log2 N levels deep; the question raised: how do we deploy the disk/memory features (M, B) in this merging process?]
N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = log_{M/B} #runs  ≈  log_{M/B} (N/M)

Optimal cost = Θ( (N/B) · log_{M/B} (N/M) ) I/Os

In practice
  M/B ≈ 1000  ⇒  #passes = log_{M/B}(N/M) ≈ 1
  One multiway merge  ⇒  2 passes = a few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Can compression help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm
  Use a pair of variables <X, C>  (start with X = first item, C = 1)
  For each following item s of the stream:
    if (X == s) then C++
    else { C--; if (C == 0) { X = s; C = 1; } }
  Return X;
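A runnable sketch in the classic majority-vote formulation, equivalent in spirit to the pseudocode above (the pairing idea of the proof below):

#include <vector>

// Returns the only possible majority candidate; the answer is guaranteed
// correct only if some item occurs more than N/2 times.
template <class T>
T majority_candidate(const std::vector<T>& stream) {
    T X{};                       // current candidate
    long C = 0;                  // unpaired votes for X
    for (const T& s : stream) {
        if (C == 0) { X = s; C = 1; }
        else if (X == s) ++C;    // one more unpaired vote
        else --C;                // pair one vote of X with the distinct item s
    }
    return X;
}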

Proof
  (Problems can arise only if #occ(y) ≤ N/2.)
  If the returned X ≠ y, then every one of y’s occurrences has a “negative” mate,
  i.e. a distinct item it was paired with when the counter decreased.
  Hence these mates are at least #occ(y) in number.
  As a result N ≥ 2 * #occ(y) > N, a contradiction.

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 10^9 chars, size = 6Gb
n = 10^6 documents
TotT = 10^9 term occurrences (avg term length is 6 chars)
t = 5 * 10^5 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million plays (columns), t = 500K terms (rows)

            Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony             1                1             0          0       0        1
Brutus             1                1             0          1       0        0
Caesar             1                1             0          1       1        1
Calpurnia          0                1             0          0       0        0
Cleopatra          1                0             0          0       0        0
mercy              1                0             1          1       1        1
worser             1                0             1          1       1        0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
[Figure: postings lists, i.e. the sorted docIDs in which each term occurs, for Brutus, Caesar and Calpurnia]
We can still do better: i.e. 30–50% of the original text
1. Typically we use about 12 bytes per posting
2. We have 10^9 total terms  ⇒  at least 12Gb of space
3. Compressing the 6Gb of documents gets 1.5Gb of data
Better index, but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO: they are 2^n, but the compressed messages of length < n are at most

Σ_{i=1}^{n-1} 2^i  =  2^n − 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i(s) = log2 ( 1 / p(s) ) = − log2 p(s)

Lower probability  ⇒  higher information

Entropy is the weighted average of i(s):

H(S) = Σ_{s ∈ S} p(s) · log2 ( 1 / p(s) )    bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La(C) = Σ_{s ∈ S} p(s) · L[s]

We say that a prefix code C is optimal if for
all prefix codes C’, La(C) ≤ La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  ⇒  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H(S) ≤ La(C)

Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La(C) ≤ H(S) + 1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?
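A minimal construction sketch driven by a min-heap; it only tracks codeword lengths (enough to build a canonical code afterwards), which is a simplification of this sketch rather than the only way to proceed:

#include <queue>
#include <utility>
#include <vector>

// Repeatedly merge the two least-probable "trees"; every symbol under a
// merged node gets one level deeper.
std::vector<int> huffman_lengths(const std::vector<double>& p) {
    using Node = std::pair<double, std::vector<int>>;      // (prob, symbols below)
    auto cmp = [](const Node& a, const Node& b) { return a.first > b.first; };
    std::priority_queue<Node, std::vector<Node>, decltype(cmp)> pq(cmp);
    for (size_t s = 0; s < p.size(); ++s) pq.push({p[s], {(int)s}});
    std::vector<int> len(p.size(), 0);
    while (pq.size() > 1) {
        Node a = pq.top(); pq.pop();
        Node b = pq.top(); pq.pop();
        for (int s : a.second) len[s]++;                   // one level deeper
        for (int s : b.second) len[s]++;
        a.second.insert(a.second.end(), b.second.begin(), b.second.end());
        pq.push({a.first + b.first, a.second});
    }
    return len;   // for p = {.1,.2,.2,.5} this yields lengths {3,3,2,1}, as in the example
}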

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store, for every level L of the tree:
  firstcode[L]   (the numeric value of the leftmost codeword 00…0 at level L)
  Symbol[L,i], for each i in level L

This takes ≤ h² + |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
− log2(.999) ≈ .00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H(s) = Σ_{i=1}^{m} 2^{m−i} · s[i]

P = 0101
H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5

s = s’ if and only if H(s) = H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1):

H(Tr) = 2·H(Tr-1) − 2^m·T(r−1) + T(r+m−1)

T = 10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H(T1) = H(1011) = 11
H(T2) = 2·11 − 2⁴·1 + 0 = 22 − 16 = 6 = H(0110)

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally, one bit at a time:
1·2 (mod 7) + 0 = 2
2·2 (mod 7) + 1 = 5
5·2 (mod 7) + 1 = 4
4·2 (mod 7) + 1 = 2
2·2 (mod 7) + 1 = 5
5 (mod 7) = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1):
2^m (mod q) = 2·( 2^(m-1) (mod q) ) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
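A sketch of the fingerprint scan over a binary text, in the deterministic variant (every fingerprint hit is verified, so no false match is ever reported); the prime q is a parameter of the sketch:

#include <cstdint>
#include <string>
#include <vector>

std::vector<size_t> karp_rabin(const std::string& T, const std::string& P, uint64_t q) {
    size_t n = T.size(), m = P.size();
    std::vector<size_t> occ;
    if (m == 0 || n < m) return occ;
    auto bit = [](char c) { return (uint64_t)(c - '0'); };
    uint64_t hP = 0, hT = 0, top = 1;                  // top = 2^(m-1) mod q
    for (size_t i = 0; i < m; ++i) {
        hP = (2 * hP + bit(P[i])) % q;
        hT = (2 * hT + bit(T[i])) % q;
        if (i + 1 < m) top = (2 * top) % q;
    }
    for (size_t r = 0; ; ++r) {
        if (hP == hT && T.compare(r, m, P) == 0)       // verify to rule out false matches
            occ.push_back(r);
        if (r + m == n) break;
        // slide the window: drop T[r], append T[r+m], all modulo q
        hT = (2 * (hT + q - (top * bit(T[r])) % q) + bit(T[r + m])) % q;
    }
    return occ;
}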

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

[Figure: the m×n matrix M for T = california and P = for; its only 1-entries are M(1,5), M(2,6), M(3,7), i.e. the match of “for” ending at position 7 of T]

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:  P = abaac

U(a) = (1,0,1,1,0)ᵀ    U(b) = (0,1,0,0,0)ᵀ    U(c) = (0,0,0,0,1)ᵀ

How to construct M



Initialize column 0 of M to all zeros
For j > 0, the j-th column is obtained by

M(j) = BitShift( M(j-1) ) & U( T[j] )

For i > 1, entry M(i,j) = 1 iff
  (1) the first i-1 characters of P match the i-1 characters of T ending at position j-1   ⇔   M(i-1,j-1) = 1
  (2) P[i] = T[j]   ⇔   the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) into the i-th position;
ANDing it with the i-th bit of U(T[j]) establishes whether both conditions hold.
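A sketch of the whole scan for m ≤ w, with one 64-bit word per column of M (bit i−1 of the word plays the role of row i); reported positions are 1-based, an assumption of the sketch:

#include <cstdint>
#include <string>
#include <vector>

std::vector<size_t> shift_and(const std::string& T, const std::string& P) {
    size_t m = P.size();                                   // requires 1 <= m <= 64
    std::vector<uint64_t> U(256, 0);
    for (size_t i = 0; i < m; ++i)
        U[(unsigned char)P[i]] |= 1ULL << i;               // U(x): positions of x in P
    std::vector<size_t> occ;
    uint64_t M = 0;                                        // column 0: all zeros
    for (size_t j = 0; j < T.size(); ++j) {
        M = ((M << 1) | 1ULL) & U[(unsigned char)T[j]];    // BitShift(M) & U(T[j])
        if (M & (1ULL << (m - 1)))                         // row m set: P ends here
            occ.push_back(j + 2 - m);                      // 1-based starting position
    }
    return occ;
}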

An example (P = abaac, T = xabxabaaca)

j=1:  M(1) = BitShift(M(0)) & U(x) = (1,0,0,0,0)ᵀ & (0,0,0,0,0)ᵀ = (0,0,0,0,0)ᵀ
j=2:  M(2) = BitShift(M(1)) & U(a) = (1,0,0,0,0)ᵀ & (1,0,1,1,0)ᵀ = (1,0,0,0,0)ᵀ
j=3:  M(3) = BitShift(M(2)) & U(b) = (1,1,0,0,0)ᵀ & (0,1,0,0,0)ᵀ = (0,1,0,0,0)ᵀ
...
j=9:  M(9) = BitShift(M(8)) & U(c) = (1,1,0,0,1)ᵀ & (0,0,0,0,1)ᵀ = (0,0,0,0,1)ᵀ
      M(5,9) = 1, so P occurs in T ending at position 9.

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of length m:
  R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of the Shift-And method searching for S:
  For any symbol c, U’(c) = U(c) AND R
    U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
  For any step j:
    compute M(j)
    then OR it with U’(T[j]). Why?
      This sets to 1 the first bit of each pattern that starts with T[j]
    Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff the first i characters of P match the i
characters of T ending at character j, with no more than l mismatches.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

[Figure: P[1, i-1] aligned with T ending at position j-1, with ≤ l mismatches, and P[i] = T[j]]

BitShift( M^l(j-1) ) & U( T[j] )

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

[Figure: P[1, i-1] aligned with T ending at position j-1, with ≤ l-1 mismatches; position j may then contribute the l-th mismatch]

BitShift( M^(l-1)(j-1) )

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M^l(j) = [ BitShift( M^l(j-1) ) & U( T[j] ) ]  OR  BitShift( M^(l-1)(j-1) )

Example
[Figure: the matrices M^0 (exact matching) and M^1 (≤ 1 mismatch) for P = abaad over T = xabxabaaca; the entry M^1(5,9) = 1 signals an occurrence of P ending at position 9 of T with at most one mismatch]

How much do we pay?





The running time is O(k · n · (1 + m/w))
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Delection: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding
  γ(x) = 0^(Length−1) followed by x in binary, where x > 0 and Length = ⌊log2 x⌋ + 1
  e.g., 9 is represented as <000, 1001>.

  The γ-code for x takes 2⌊log2 x⌋ + 1 bits   (i.e. a factor of 2 from optimal)

  Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

Answer: 8, 6, 3, 59, 7
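A small encode/decode sketch for the γ-code just defined:

#include <string>

// |γ(x)| = 2⌊log2 x⌋ + 1 bits, as discussed above.
std::string gamma_encode(unsigned long x) {              // requires x > 0
    std::string bin;
    for (unsigned long v = x; v > 0; v >>= 1) bin = char('0' + (v & 1)) + bin;
    return std::string(bin.size() - 1, '0') + bin;        // (Length-1) zeros, then binary
}

// Decode one γ-coded integer starting at 'pos'; pos is advanced past it.
unsigned long gamma_decode(const std::string& bits, size_t& pos) {
    size_t zeros = 0;
    while (bits[pos] == '0') { ++zeros; ++pos; }          // count the unary prefix
    unsigned long x = 0;
    for (size_t i = 0; i <= zeros; ++i) x = (x << 1) | (unsigned long)(bits[pos++] - '0');
    return x;
}
// Decoding "0001000001100110000011101100111" this way yields 8, 6, 3, 59, 7.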

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How much good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
1 ≥ Σ_{i=1,...,x} p_i ≥ x · p_x   ⇒   x ≤ 1/p_x

How good is it ?
Encode the integers via γ-coding:
  |γ(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/p_i):

  Σ_{i=1,...,|S|} p_i · |γ(i)|  ≤  Σ_{i=1,...,|S|} p_i · [ 2 * log(1/p_i) + 1 ]  ≤  2 * H0(X) + 1

Not much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n² log n), MTF = O(n log n) + n²

No much worse than Huffman
...but it may be far better
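A minimal sketch of the MTF transform just described (0-based positions, and the initial symbol list passed in explicitly):

#include <algorithm>
#include <list>
#include <string>
#include <vector>

std::vector<int> mtf_encode(const std::string& text, const std::string& alphabet) {
    std::list<char> L(alphabet.begin(), alphabet.end());    // e.g. [a,b,c,d,...]
    std::vector<int> out;
    for (char s : text) {
        auto it = std::find(L.begin(), L.end(), s);
        out.push_back((int)std::distance(L.begin(), it));    // position of s in L
        L.erase(it);
        L.push_front(s);                                      // move s to the front
    }
    return out;
}

The integer sequence produced this way is then fed to a variable-length coder, as in the analysis below.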

MTF: how good is it ?
Encode the integers via γ-coding:  |γ(i)| ≤ 2 * log i + 1
Put S in front and consider the cost of encoding:

  O(|S| log |S|)  +  Σ_{x=1,...,|S|}  Σ_{i=2,...,n_x}  | γ( p_x^i − p_x^(i-1) ) |

By Jensen’s inequality:

  ≤  O(|S| log |S|)  +  Σ_{x=1,...,|S|}  n_x · [ 2 * log( N / n_x ) + 1 ]
  =  O(|S| log |S|)  +  N · [ 2 * H0(X) + 1 ]

  ⇒   La[mtf] ≤ 2 * H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)

There is a memory

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.  p(a) = .2 ,  p(b) = .5 ,  p(c) = .3

f(i) = Σ_{j=1}^{i-1} p(j)    ⇒    f(a) = .0 ,  f(b) = .2 ,  f(c) = .7

[Figure: the unit interval [0,1) split into a = [0,.2), b = [.2,.7), c = [.7,1)]

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
1.0

0.7
c = .3

0.7

c = .3
0.55

0.0

a = .2

0.3
0.2

c = .3

0.27
b = .5

b = .5
0.2

0.3

a = .2

b = .5
0.22

a = .2

0.2

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l_0 = 0        l_i = l_(i-1) + s_(i-1) * f[c_i]
s_0 = 1        s_i = s_(i-1) * p[c_i]

f[c] is the cumulative prob. up to symbol c (not included)

Final interval size is   s_n = Π_{i=1}^{n} p[c_i]

The interval for a message sequence will be called the
sequence interval
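A direct rendering of these recurrences with plain doubles (the integer, renormalized version sketched later is what one would implement in practice):

#include <map>
#include <string>
#include <utility>

// Returns the sequence interval [l_n, l_n + s_n).
std::pair<double,double> sequence_interval(const std::string& msg,
                                           const std::map<char,double>& p,
                                           const std::map<char,double>& f) {
    double l = 0.0, s = 1.0;                  // l_0 = 0, s_0 = 1
    for (char c : msg) {
        l = l + s * f.at(c);                  // l_i = l_{i-1} + s_{i-1} * f[c_i]
        s = s * p.at(c);                      // s_i = s_{i-1} * p[c_i]
    }
    return {l, l + s};
}
// With p = {a:.2, b:.5, c:.3}, f = {a:0, b:.2, c:.7} and msg = "bac"
// this yields the interval [.27, .3), as in the example above.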

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
1.0

0.7
c = .3

0.7
0.49

b = .5

c = .3

0.55
0.49

0.2

0.0

0.55

b = .5

0.3
a = .2

The message is bbc.

0.2

0.49
0.475

c = .3

b = .5
0.35

a = .2

0.3

a = .2

Representing a real number
Binary fractional representation:
.75   = .11
1/3   = .0101…
11/16 = .1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
code    min      max       interval
.11     .110     .111…     [.75,  1.0)
.101    .1010    .1011…    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + ⌈ log (1/s) ⌉ =
  = 1 + ⌈ log Π_{i=1,n} (1/p_i) ⌉
  ≤ 2 + Σ_{i=1,n} log (1/p_i)
  = 2 + Σ_{k=1,|S|} n·p_k · log (1/p_k)
  = 2 + n·H0   bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
  Output 1 followed by m 0s;  m = 0
  Message interval is expanded by 2
If u < R/2 then (bottom half)
  Output 0 followed by m 1s;  m = 0
  Message interval is expanded by 2
If l ≥ R/4 and u < 3R/4 then (middle half)
  Increment m
  Message interval is expanded by 2
In all other cases, just continue...

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
Context
Empty

Counts
A
B
C
$

=
=
=
=

4
2
5
3

Context
A
B
C

Counts
C
$
A
$
A
B
C
$

=
=
=
=
=
=
=
=

3
1
2
1
1
2
2
3

Context
AC

BA
CA
CB
CC

String = ACCBACCACBA B

k=2

Counts
B
C
$
C
$
C
$
A
$
A
B
$

=
=
=
=
=
=
=
=
=
=
=
=

1
2
2
1
1
1
1
2
1
1
1
2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F = # i i i i m p p s s s s        L = i p s s m # p i s s i i     (rotation matrix unknown)

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
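A runnable C sketch of the inversion above: LF[i] = C[L[i]] + (rank of this occurrence of L[i] among equal chars in L), where C[c] counts the chars of L smaller than c. Assumption: '#' is the unique, smallest end-marker of T, so row 0 is the rotation starting with '#'.

#include <stdio.h>
#include <string.h>

#define MAXN 1024

void invert_bwt(const char *L, int n, char *T) {
    int C[256] = {0}, seen[256] = {0}, LF[MAXN];
    for (int i = 0; i < n; i++) C[(unsigned char)L[i]]++;
    for (int c = 0, sum = 0; c < 256; c++) { int t = C[c]; C[c] = sum; sum += t; }
    for (int i = 0; i < n; i++) {
        unsigned char c = (unsigned char)L[i];
        LF[i] = C[c] + seen[c]++;           /* row in F of this occurrence of L[i] */
    }
    T[n - 1] = '#';                         /* the end-marker is last by construction */
    int r = 0;                              /* row 0 = rotation starting with '#' */
    for (int i = n - 2; i >= 0; i--) { T[i] = L[r]; r = LF[r]; }
    T[n] = '\0';
}

int main(void) {
    const char *L = "ipssm#pissii";         /* BWT of mississippi# from the slides */
    char T[MAXN];
    invert_bwt(L, (int)strlen(L), T);
    printf("%s\n", T);                      /* prints mississippi# */
    return 0;
}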

How to compute the BWT ?
SA and the BWT matrix:

 SA    sorted rotation    L
 12    #mississippi       i
 11    i#mississipp       p
  8    ippi#mississ       s
  5    issippi#miss       s
  2    ississippi#m       m
  1    mississippi#       #
 10    pi#mississip       p
  9    ppi#mississi       i
  7    sippi#missis       s
  4    sissippi#mis       s
  6    ssippi#missi       i
  3    ssissippi#mi       i
We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
 SA    suffix
 12    #
 11    i#
  8    ippi#
  5    issippi#
  2    ississippi#
  1    mississippi#
 10    pi#
  9    ppi#
  7    sippi#
  4    sissippi#
  6    ssippi#
  3    ssissippi#
Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst case
• Θ(n log n) cache misses or I/O faults

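For concreteness, a sketch of that naive construction in C: it sorts suffix pointers with strcmp (hence the Θ(n² log n) worst case) and then derives L = BWT from SA; the slide's 1-based L[i] = T[SA[i]-1] becomes a cyclic "previous character" in 0-based indexing.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static const char *text;

static int cmp_suffix(const void *a, const void *b) {
    int i = *(const int *)a, j = *(const int *)b;
    return strcmp(text + i, text + j);
}

int main(void) {
    const char *T = "mississippi#";
    int n = (int)strlen(T), SA[64];
    char L[64];
    text = T;
    for (int i = 0; i < n; i++) SA[i] = i;
    qsort(SA, n, sizeof(int), cmp_suffix);        /* elegant but inefficient */
    for (int i = 0; i < n; i++)                   /* BWT: char preceding each sorted suffix */
        L[i] = T[(SA[i] + n - 1) % n];
    L[n] = '\0';
    printf("L = %s\n", L);                        /* prints L = ipssm#pissii */
    return 0;
}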
Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet of |Σ|+1 symbols

Bzip2-output = Arithmetic/Huffman on |Σ|+1 symbols...
... plus γ(16), plus the original Mtf-list (i,m,p,s)
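A minimal Move-to-Front coder sketch in C, for the "Move-to-Front coding of L" step above: for each character it emits the character's current position in the list and then moves it to the front. It assumes every input char occurs in the list, and it does not reproduce the full slide strings (which also single out '#' and then apply the RLE0 / Wheeler's code step).

#include <stdio.h>
#include <string.h>

void mtf_encode(const char *s, char *list /* e.g. "imps" */, int *out) {
    int k = (int)strlen(list);
    for (int i = 0; s[i]; i++) {
        int j = 0;
        while (j < k && list[j] != s[i]) j++;        /* position of s[i] in the list */
        out[i] = j;
        for (; j > 0; j--) list[j] = list[j - 1];    /* move s[i] to the front */
        list[0] = s[i];
    }
}

int main(void) {
    char list[] = "imps";                            /* the example's Mtf-list [i,m,p,s] */
    int out[64];
    const char *s = "ipppssssss";                    /* a short prefix of the example's L */
    mtf_encode(s, list, out);
    for (int i = 0; s[i]; i++) printf("%d", out[i]); /* prints 0200300000 */
    printf("\n");
    return 0;
}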

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion pages available (Google, 7/2008)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node one can reach any other node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node one can reach any other node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by humans



Exploit the structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…

  Physical network graph
    V = routers
    E = communication links

  The “cosine” graph (undirected, weighted)
    V = static web pages
    E = semantic distance between pages

  Query-Log graph (bipartite, weighted)
    V = queries and URLs
    E = (q,u) if u is a result for q, and has been clicked by some user who issued q

  Social graph (undirected, unweighted)
    V = users
    E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E):
  V = URLs, E = (u,v) if u has a hyperlink to v
  Isolated URLs are ignored (no IN & no OUT)

Three key properties:
  Skewed distribution: Pr that a node has x links is proportional to 1/x^α, α ≈ 2.1

The In-degree distribution

Altavista crawl (1999) and WebBase crawl (2001): the indegree follows a power-law distribution

    Pr[ in-degree(u) = k ]  ∝  1 / k^α ,    α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E):
  V = URLs, E = (u,v) if u has a hyperlink to v
  Isolated URLs are ignored (no IN, no OUT)

Three key properties:
  Skewed distribution: Pr that a node has x links is proportional to 1/x^α, α ≈ 2.1
  Locality: usually most of the hyperlinks point to other URLs on the same host (about 80%).
  Similarity: pages close in lexicographic order tend to share many outgoing lists

A Picture of the Web Graph
(Figure: adjacency-matrix picture of a crawl of 21 million pages and 150 million links; with URL-sorting, hosts such as Berkeley and Stanford show up as dense blocks.)

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

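A sketch of the gap encoding of a sorted successor list: the first successor is written as s1 − x (which may be negative, so one common mapping folds the sign, as hinted above for negative entries), and each later one as the gap minus one. This is only the spirit of the WebGraph scheme, not the library's exact bit-level format; the node and successor values below are hypothetical.

#include <stdio.h>

static unsigned fold_sign(int v) { return v >= 0 ? 2u * (unsigned)v : 2u * (unsigned)(-v) - 1u; }

void encode_successors(int x, const int *s, int k, unsigned *out) {
    out[0] = fold_sign(s[0] - x);                       /* first entry may be negative */
    for (int i = 1; i < k; i++)
        out[i] = (unsigned)(s[i] - s[i - 1] - 1);       /* remaining gaps, decremented */
}

int main(void) {
    int succ[] = {13, 15, 16, 17, 18, 19, 23, 24, 203}; /* hypothetical list for node x = 15 */
    unsigned enc[16];
    encode_successors(15, succ, 9, enc);
    for (int i = 0; i < 9; i++) printf("%u ", enc[i]);  /* 3 1 0 0 0 0 3 0 178: small ints */
    printf("\n");
    return 0;
}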
Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit in the copy list of y tells whether the corresponding successor of the reference x is
also a successor of y;
the reference index is chosen in [0,W] so as to give the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
the last block is omitted (its length is implicit);
the lengths of all blocks but the first are decremented by one

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background

(Figure: a sender transmits data over the network to a receiver; some knowledge about the data may already be available at the receiver.)

  network links are getting faster and faster, but
  many clients are still connected by fairly slow links (mobile?)
  people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques

  caching: “avoid sending the same object again”
    done on the basis of whole objects
    only works if objects are completely unchanged
    How about objects that are slightly changed?

  compression: “remove redundancy in transmitted data”
    avoid repeated substrings in data
    can be extended to the history of past transmissions (overhead)
    What if the sender has never seen the data at the receiver?

Types of Techniques

  Common knowledge between sender & receiver
    Unstructured file: delta compression

  “Partial” knowledge
    Unstructured files: file synchronization
    Record-based data: set reconciliation

Formalization

  Delta compression   [diff, zdelta, REBL, …]
    Compress file f deploying file f’
    Compress a group of files
    Speed up web access by sending differences between the requested page and the ones available in cache

  File synchronization   [rsync, zsync]
    Client updates an old file f_old with f_new available on a server
    Mirroring, Shared Crawling, Content Distribution Networks

  Set reconciliation
    Client updates a structured old file f_old with f_new available on a server
    Update of contacts or appointments, intersect inverted lists in a P2P search engine

Z-delta compression    (one-to-one)

Problem: We have two files f_known and f_new, and the goal is to compute a file f_d of minimum
size such that f_new can be derived from f_known and f_d

  Assume that block moves and copies are allowed
  Find an optimal covering set of f_new based on f_known
  The LZ77-scheme provides an efficient, optimal solution:
    f_known plays the role of “previously encoded text”: compress the concatenation f_known·f_new,
    emitting output only for the f_new part

zdelta is one of the best implementations
            Emacs size    Emacs time
uncompr     27Mb          ---
gzip        8Mb           35 secs
zdelta      1.5Mb         42 secs
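The "f_known as previously encoded text" idea can be sketched with zlib's preset-dictionary feature (an illustration only, not zdelta itself; note the dictionary is limited to the 32 KB LZ77 window, and the receiver must call inflateSetDictionary when inflate reports Z_NEED_DICT):

#include <stdio.h>
#include <string.h>
#include <zlib.h>

int delta_compress(const unsigned char *f_known, unsigned known_len,
                   const unsigned char *f_new, unsigned new_len,
                   unsigned char *out, unsigned out_cap, unsigned *out_len) {
    z_stream zs;
    memset(&zs, 0, sizeof zs);
    if (deflateInit(&zs, Z_BEST_COMPRESSION) != Z_OK) return -1;
    deflateSetDictionary(&zs, f_known, known_len);   /* install the "previously seen" text */
    zs.next_in  = (unsigned char *)f_new;  zs.avail_in  = new_len;
    zs.next_out = out;                     zs.avail_out = out_cap;
    int rc = deflate(&zs, Z_FINISH);                 /* only the novel parts of f_new cost bits */
    *out_len = out_cap - zs.avail_out;
    deflateEnd(&zs);
    return rc == Z_STREAM_END ? 0 : -1;
}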

Efficient Web Access
Dual proxy architecture: a pair of proxies located on each side of the slow link use a proprietary protocol to increase performance over this link

(Figure: Client — slow link — client-side proxy … server-side proxy — fast link — web; references and requests flow as usual, but pages cross the slow link delta-encoded.)

Use zdelta to reduce traffic:
  Old version available at both proxies
  Restricted to pages already visited (30% hits), URL-prefix match
  Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F
  Useful on a dynamic collection of web pages, back-ups, …
  Apply pairwise zdelta: find for each f ∈ F a good reference

Reduction to the Min Branching problem on DAGs:
  Build a weighted graph G_F: nodes = files, weights = zdelta-sizes
  Insert a dummy node connected to all files, whose edge weights are the gzip-compressed sizes
  Compute the min branching = directed spanning tree of minimum total cost, covering G’s nodes.

(Figure: example weighted digraph on files 1, 2, 3, 5 plus the dummy node 0; edge weights such as 20, 123, 220, 620, 2000 are zdelta/gzip sizes, and the min branching selects the cheapest set of incoming edges.)

            space    time
uncompr     30Mb     ---
tgz         20%      linear
THIS        8%       quadratic

Improvement    (what about many-to-one compression of a group of files?)

Problem: Constructing G is very costly: n² edge calculations (zdelta executions)
  We wish to exploit some pruning approach
  Collection analysis: cluster the files that appear similar and are thus good candidates for
  zdelta-compression. Build a sparse weighted graph G’_F containing only edges between those pairs of files
  Assign weights: estimate appropriate edge weights for G’_F, thus saving zdelta executions.
  Nonetheless, strictly n² time
            space     time
uncompr     260Mb     ---
tgz         12%       2 mins
THIS        8%        16 mins

Algoritmi per IR

File Synchronization

File synch: The problem

(Figure: the Client holds an out-dated file f_old and sends a request; the Server holds f_new and sends back an update.)

  client wants to update an out-dated file
  server has the new file but does not know the old file
  update without sending the entire f_new (using similarity)
  rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synchronization, since the server has both copies of the files

The rsync algorithm

(Figure: the Client, holding f_old, sends per-block hashes to the Server; the Server, holding f_new, sends back an encoded file made of literals and block references.)

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

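A sketch of an rsync-style rolling checksum over blocks of size B: the weak hash of the window [i, i+B) is updated in O(1) when the window slides by one byte, so every alignment of f_new can be tested against the client's block hashes. Constants and the exact mixing differ from the real tool (which pairs this with a strong per-block hash).

#include <stdint.h>
#include <stdio.h>

typedef struct { uint32_t a, b; } roll_t;

static roll_t roll_init(const unsigned char *x, int B) {
    roll_t r = {0, 0};
    for (int i = 0; i < B; i++) { r.a += x[i]; r.b += (uint32_t)(B - i) * x[i]; }
    return r;
}
/* slide the window one position to the right: drop 'out', append 'in' */
static void roll_slide(roll_t *r, unsigned char out, unsigned char in, int B) {
    r->a += in - out;
    r->b += r->a - (uint32_t)B * out;
}
static uint32_t roll_digest(roll_t r) { return (r.b << 16) | (r.a & 0xffff); }

int main(void) {
    const unsigned char *s = (const unsigned char *)"abcdefgh";
    int B = 4;
    roll_t r = roll_init(s, B);
    for (int i = 1; i + B <= 8; i++) {               /* check sliding == recomputing */
        roll_slide(&r, s[i - 1], s[i + B - 1], B);
        roll_t fresh = roll_init(s + i, B);
        printf("%d: %s\n", i, roll_digest(r) == roll_digest(fresh) ? "ok" : "MISMATCH");
    }
    return 0;
}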
Rsync: some experiments

            gcc size    emacs size
total       27288       27326
gzip        7563        8577
zdelta      227         1431
rsync       964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

  The server sends the hashes (unlike rsync, where the client does), and the client checks them
  The server deploys the common f_ref to compress the new f_tar (rsync compresses f_tar without deploying f_ref)

A multi-round protocol
  k blocks of n/k elements, log(n/k) levels
  If the distance is k, then on each level at most k hashes do not find a match in the other file.
  The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
  iff  P is a prefix of the i-th suffix of T (i.e. T[i,N])

Occurrences of P in T = all suffixes of T having P as a prefix
  Example: P = si, T = mississippi  →  occurrences at positions 4, 7

SUF(T) = sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree

(Figure: the suffix tree of T# = mississippi#. Edges carry substrings such as #, i, s, si, ssi, p, pi#, ppi#, i#, mississippi#; the 12 leaves are labeled with the starting positions 1..12 of the corresponding suffixes.)

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position in SA is the lexicographic rank of P.

Storing SUF(T) explicitly would take Θ(N²) space, so we store only the suffix pointers:

T = mississippi#

 SA    SUF(T)
 12    #
 11    i#
  8    ippi#
  5    issippi#
  2    ississippi#
  1    mississippi#
 10    pi#
  9    ppi#
  7    sippi#
  4    sissippi#
  6    ssippi#
  3    ssissippi#

Suffix Array:
• SA: Θ(N log₂ N) bits
• Text T: N chars
→ In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step

T = mississippi#,  P = si
  Compare P against the suffix pointed to by the middle SA entry:
  if P is larger, recurse on the right half; if P is smaller, recurse on the left half.

Suffix Array search
• O(log₂ N) binary-search steps
• Each step takes O(p) char comparisons
→ overall, O(p log₂ N) time
  improvable to O(p + log₂ N) [Manber-Myers, ’90] and to O(p + log₂ |Σ|) [Cole et al, ’06]

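A C sketch of the indirect binary search (SA below is the 0-based version of the slide's array): two binary searches delimit the contiguous SA range of suffixes prefixed by P, in O(p log n) character comparisons.

#include <stdio.h>
#include <string.h>

/* returns in [*lo, *hi) the SA interval of suffixes having P as a prefix */
void sa_range(const char *T, int n, const int *SA, const char *P, int *lo, int *hi) {
    int p = (int)strlen(P);
    int l = 0, r = n;                         /* first suffix whose p-prefix is >= P */
    while (l < r) { int m = (l + r) / 2;
        if (strncmp(T + SA[m], P, p) < 0) l = m + 1; else r = m; }
    *lo = l;
    r = n;                                    /* first suffix whose p-prefix is > P */
    while (l < r) { int m = (l + r) / 2;
        if (strncmp(T + SA[m], P, p) <= 0) l = m + 1; else r = m; }
    *hi = l;
}

int main(void) {
    const char *T = "mississippi#";
    int SA[] = {11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2};    /* 0-based suffix array */
    int lo, hi;
    sa_range(T, 12, SA, "si", &lo, &hi);
    for (int i = lo; i < hi; i++)
        printf("occurrence at %d\n", SA[i] + 1);          /* 1-based: 7 and 4 */
    return 0;
}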
Locating the occurrences
For P = si on T = mississippi#: the suffixes prefixed by si form a contiguous SA range
(sippi#…, sissippi#…), giving the text positions 7 and 4, so occ = 2.
The range boundaries can be found by searching for P extended with # and with $, where
# is smaller and $ is larger than every symbol of Σ.

Suffix Array search
• O(p + log₂ N + occ) time

Suffix Trays: O(p + log₂ |Σ| + occ)   [Cole et al., ‘06]
String B-tree   [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays   [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest common prefix between suffixes adjacent in SA, i.e. Lcp[i] = lcp of the suffix at SA[i] and the next one.

T = mississippi#

 SA    suffix          Lcp
 12    #                0
 11    i#               1
  8    ippi#            1
  5    issippi#         4     (issippi# vs ississippi# share “issi”)
  2    ississippi#      0
  1    mississippi#     0
 10    pi#              1
  9    ppi#             0
  7    sippi#           2
  4    sissippi#        1
  6    ssippi#          3
  3    ssissippi#       -

• How long is the common prefix between T[i,...] and T[j,...] ?
  → Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  → Search for some Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  → Search for a window Lcp[i,i+C-2] whose entries are all ≥ L

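A small C sketch of these Lcp-based queries: it computes Lcp naively from SA and reports the length of the longest repeated substring as the maximum Lcp entry (a window test over Lcp[i, i+C-2] would answer the last query above).

#include <stdio.h>
#include <string.h>

static int lcp(const char *a, const char *b) {
    int k = 0;
    while (a[k] && a[k] == b[k]) k++;
    return k;
}

int main(void) {
    const char *T = "mississippi#";
    int n = 12, SA[] = {11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2};
    int Lcp[16], best = 0;
    for (int i = 0; i + 1 < n; i++) {
        Lcp[i] = lcp(T + SA[i], T + SA[i + 1]);   /* lcp of adjacent suffixes in SA */
        if (Lcp[i] > best) best = Lcp[i];
    }
    printf("longest repeated substring has length %d\n", best);   /* 4, i.e. "issi" */
    return 0;
}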

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}

How to compute the BWT ?

SA    BWT matrix      L
12    #mississipp     i
11    i#mississip     p
 8    ippi#missis     s
 5    issippi#mis     s
 2    ississippi#     m
 1    mississippi     #
10    pi#mississi     p
 9    ppi#mississ     i
 7    sippi#missi     s
 4    sissippi#mi     s
 6    ssippi#miss     i
 3    ssissippi#m     i

We said that: L[i] precedes F[i] in T.   E.g. L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?

SA    sorted suffixes
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#

Elegant but inefficient
Input: T = mississippi#
Obvious inefficiencies:
• Θ(n^2 log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults
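The elegant-but-inefficient construction, sketched in Python (a sketch, not from the slides) with the 1-based positions used above; it sorts actual suffix copies, hence the quadratic behaviour just noted:

def sa_and_bwt(T):
    # T must end with a unique smallest character, e.g. '#'
    n = len(T)
    sa = sorted(range(1, n + 1), key=lambda i: T[i - 1:])   # 1-based starting positions
    L = "".join(T[i - 2] for i in sa)                       # L[i] = T[SA[i]-1], wrapping for SA[i] = 1
    return sa, L

sa_and_bwt("mississippi#") returns
([12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3], "ipssm#pissii").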

Many algorithms, now...

Compressing L seems promising...
Key observation:
 L is locally homogeneous
 L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L
 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii       (# at position 16)
Mtf-list = [i,m,p,s]
Mtf  = 020030000030030200300300000100000
Mtf' = 030040000040040300400400000200000     (Bin(6)=110, Wheeler's code)
RLE0 = 03141041403141410210                   (alphabet of size |Σ|+1)

Bzip2-output = Arithmetic/Huffman on |Σ|+1 symbols...
... plus γ(16), plus the original Mtf-list (i,m,p,s)
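A toy Python sketch of the Move-to-Front step (the point where the local homogeneity of L turns into long runs of zeros) followed by a naive zero-run RLE; the real bzip2 pipeline (the shifted Mtf symbols, Wheeler's code for run lengths, the separate treatment of #) is omitted, and mtf_encode / rle_zeros are illustrative names:

def mtf_encode(s, alphabet):
    mtf = list(alphabet)                 # e.g. ['i','m','p','s'] as in the Mtf-list above
    out = []
    for c in s:
        r = mtf.index(c)                 # rank of c in the current list
        out.append(r)
        mtf.insert(0, mtf.pop(r))        # move c to the front
    return out

def rle_zeros(ranks):
    # replace every maximal run of 0s with the pair (0, run length)
    out, i = [], 0
    while i < len(ranks):
        if ranks[i] == 0:
            j = i
            while j < len(ranks) and ranks[j] == 0:
                j += 1
            out.append((0, j - i))
            i = j
        else:
            out.append(ranks[i])
            i += 1
    return out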

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics
 Size
   1 trillion pages available (Google 7/08)
   5-40K per page => hundreds of terabytes
   Size grows every day!!
 Change
   8% new pages, 25% new links change weekly
   Life time of about 10 days

The Bow Tie

Some definitions
 Weakly connected components (WCC)
   Set of nodes such that from any node one can reach any other node via an undirected path.
 Strongly connected components (SCC)
   Set of nodes such that from any node one can reach any other node via a directed path.

Observing Web Graph
 We do not know which percentage of it we know
 The only way to discover the graph structure of the web as hypertext is via large scale crawls
 Warning: the picture might be distorted by
   Size limitation of the crawl
   Crawling rules
   Perturbations of the "natural" process of birth and death of nodes and links

Why is it interesting?
 It is the largest artifact ever conceived by humans
 Exploit the structure of the Web for
   Crawl strategies
   Search
   Spam detection
   Discovering communities on the web
   Classification/organization
 Predict the evolution of the Web
   Sociological understanding

Many other large graphs…
 Physical network graph
   V = Routers
   E = communication links
 The “cosine” graph (undirected, weighted)
   V = static web pages
   E = semantic distance between pages
 Query-Log graph (bipartite, weighted)
   V = queries and URLs
   E = (q,u): u is a result for q, and has been clicked by some user who issued q
 Social graph (undirected, unweighted)
   V = users
   E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E):
 V = URLs, E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN & no OUT)
Three key properties:
 Skewed distribution: Pb that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution
Altavista crawl (1999) and WebBase crawl (2001): the indegree follows a power law distribution

  Pr[ in-degree(u) = k ]  ∝  1/k^a ,   with a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E):
 V = URLs, E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN, no OUT)
Three key properties:
 Skewed distribution: Pb that a node has x links is 1/x^a, a ≈ 2.1
 Locality: usually most of the hyperlinks point to other URLs on the same host (about 80%).
 Similarity: pages close in lexicographic order tend to share many outgoing lists

A Picture of the Web Graph
(figure: the adjacency matrix, with axes i and j)
21 million pages, 150 million links
URL-sorting (Berkeley, Stanford)
URL compression + Delta encoding

The library WebGraph
Uncompressed adjacency list vs. adjacency list with compressed gaps (locality):
 Successor list S(x) = {s1 - x, s2 - s1 - 1, ..., sk - s(k-1) - 1}
 For negative entries:
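A small Python sketch of the gap transform just described (not the WebGraph implementation; gaps and ungaps are illustrative names): the first successor is stored as a difference from the source node x, the following ones as gaps minus one between consecutive sorted successors; the first entry may be negative and is mapped to a non-negative integer before coding (see the residual examples further below).

def gaps(x, successors):
    s = sorted(successors)                 # adjacency lists are kept sorted
    out = [s[0] - x]                       # may be negative
    out += [s[i] - s[i - 1] - 1 for i in range(1, len(s))]
    return out

def ungaps(x, g):
    s = [x + g[0]]
    for d in g[1:]:
        s.append(s[-1] + d + 1)
    return s

For instance gaps(15, [13, 15, 19, 23]) == [-2, 1, 3, 3], and ungaps inverts it.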

Copy-lists
Reference chains, possibly limited in length.
Uncompressed adjacency list vs. adjacency list with copy lists (similarity):
 The copy list of y has one bit per successor of the reference x, telling whether that successor is also a successor of y;
 The reference index is chosen in [0,W] so as to give the best compression.
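A sketch of how such a copy list (and the leftover residuals of y) could be computed, assuming the convention just stated of one bit per successor of the reference; copy_list is an illustrative name:

def copy_list(ref_succ, y_succ):
    y = set(y_succ)
    bits = [1 if s in y else 0 for s in ref_succ]              # one bit per successor of x
    residuals = [s for s in y_succ if s not in set(ref_succ)]  # successors of y not copied from x
    return bits, residuals

For instance copy_list([13, 15, 16, 17, 18, 19, 23, 24], [13, 15, 16, 17, 19, 23, 24, 315])
returns ([1, 1, 1, 1, 0, 1, 1, 1], [315]).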

Copy-blocks = RLE(Copy-list)
Adjacency list with copy lists vs. adjacency list with copy blocks (RLE on the bit sequences):
 The first copy block is 0 if the copy list starts with 0;
 The last block is omitted (we know the length…);
 The length is decremented by one for all blocks but the first

This is a Java and C++ lib (≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with copy blocks vs. consecutivity in extra-nodes:
 Intervals: use their left extreme and length
 Interval length: decremented by Lmin = 2
 Residuals: differences between consecutive residuals, or wrt the source

Worked figures:
 0 = (15-15)*2          (positive)
 2 = (23-19)-2          (jump >= 2)
 600 = (316-16)*2
 3 = |13-15|*2-1        (negative)
 3018 = 3041-22-1
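The (positive)/(negative) annotations above follow the usual mapping of a signed difference d to a non-negative integer before the final integer coder, namely 2d for d >= 0 and 2|d| - 1 for d < 0; a one-line sketch (signed_to_unsigned is an illustrative name):

def signed_to_unsigned(d):
    # 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, 2 -> 4, ...
    return 2 * d if d >= 0 else 2 * (-d) - 1

This matches the figures above: signed_to_unsigned(15 - 15) == 0,
signed_to_unsigned(316 - 16) == 600, signed_to_unsigned(13 - 15) == 3.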

Algoritmi per IR

Compression of file collections

Background
(figure: a sender transmits data over the network to a receiver; some knowledge about that data is already available at the receiver)
 network links are getting faster and faster, but
 many clients are still connected by fairly slow links (mobile?)
 people wish to send more and more data
How can we make this transparent to the user?

Two standard techniques
 caching: “avoid sending the same object again”
   done on the basis of objects
   only works if objects are completely unchanged
   How about objects that are slightly changed?
 compression: “remove redundancy in transmitted data”
   avoid repeated substrings in data
   can be extended to the history of past transmissions (overhead)
   What if the sender has never seen the data at the receiver ?

Types of Techniques
 Common knowledge between sender & receiver
   Unstructured file: delta compression
 “Partial” knowledge
   Unstructured files: file synchronization
   Record-based data: set reconciliation

Formalization
 Delta compression   [diff, zdelta, REBL,…]
   Compress file f deploying file f'
   Compress a group of files
   Speed up web access by sending the differences between the requested page and the ones available in cache
 File synchronization   [rsynch, zsync]
   Client updates an old file f_old with f_new available on a server
   Mirroring, Shared Crawling, Content Distr. Net
 Set reconciliation
   Client updates a structured old file f_old with f_new available on a server
   Update of contacts or appointments, intersect inverted lists in a P2P search engine

Z-delta compression   (one-to-one)
Problem: we have two files f_known and f_new, and the goal is to compute a file f_d of minimum size such that f_new can be derived from f_known and f_d
 Assume that block moves and copies are allowed
 Find an optimal covering set of f_new based on f_known
 The LZ77-scheme provides an efficient, optimal solution:
   f_known is the “previously encoded text”; compress the concatenation f_known·f_new starting from f_new
 zdelta is one of the best implementations
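The same idea can be tried directly with zlib's preset-dictionary support, a rough stand-in for zdelta (not the real tool): the known file is handed to the compressor as already-seen text, so matches against it cost almost nothing. Note that DEFLATE only looks back 32KB, so only the tail of a large f_known is actually exploited; delta_compress / delta_decompress are illustrative names.

import zlib

def delta_compress(f_known: bytes, f_new: bytes) -> bytes:
    co = zlib.compressobj(level=9, zdict=f_known)   # zdict plays the role of f_known
    return co.compress(f_new) + co.flush()

def delta_decompress(f_known: bytes, f_d: bytes) -> bytes:
    do = zlib.decompressobj(zdict=f_known)
    return do.decompress(f_d) + do.flush()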
           Emacs size   Emacs time
uncompr    27Mb         ---
gzip       8Mb          35 secs
zdelta     1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: a pair of proxies located on the two sides of the slow link use a proprietary protocol to increase performance over this link.

(figure: Client and Proxy sit on the two sides of the slow link; the full Page travels from the web over the fast link, while only the request and a delta-encoding against a reference page available at both sides travel over the slow link)

Use zdelta to reduce traffic:
 Old version available at both proxies
 Restricted to pages already visited (30% hits), URL-prefix match
 Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F
 Useful on a dynamic collection of web pages, back-ups, …
 Apply pairwise zdelta: find for each f ∈ F a good reference
 Reduction to the Min Branching problem on DAGs:
   Build a weighted graph G_F, nodes = files, weights = zdelta-size
   Insert a dummy node connected to all, whose edge weights are the gzip-coding sizes
   Compute the min branching = directed spanning tree of min total cost, covering G's nodes.

(figure: a small example graph with zdelta/gzip edge weights such as 0, 2, 20, 123, 220, 620, 2000)
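A sketch of this reduction in Python, assuming networkx's Edmonds-based minimum_spanning_arborescence and using zlib (with its preset dictionary) as a cheap stand-in for the zdelta and gzip sizes; files, delta_plan and the dummy node name are illustrative:

import zlib
import networkx as nx

def delta_plan(files):                         # files: dict name -> bytes
    G = nx.DiGraph()
    for name, data in files.items():           # dummy root: compress from scratch (gzip-like cost)
        G.add_edge("DUMMY", name, weight=len(zlib.compress(data)))
    for a, da in files.items():                # edge a -> b weighted by the delta-size of b given a
        for b, db in files.items():
            if a != b:
                co = zlib.compressobj(zdict=da)
                G.add_edge(a, b, weight=len(co.compress(db) + co.flush()))
    T = nx.minimum_spanning_arborescence(G)    # min branching rooted at the dummy node
    return sorted(T.edges(data="weight"))

Each returned edge (x, y, w) says: encode y as a delta against x (or from scratch if x is the dummy node), paying about w bytes.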

           space   time
uncompr    30Mb    ---
tgz        20%     linear
THIS       8%      quadratic

Improvement

What about many-to-one compression?   (group of files)

Problem: constructing G is very costly: n^2 edge calculations (zdelta executions)
 We wish to exploit some pruning approach
 Collection analysis: cluster the files that appear similar and are thus good candidates for zdelta-compression. Build a sparse weighted graph G'_F containing only edges between those pairs of files
 Assign weights: estimate appropriate edge weights for G'_F, thus saving zdelta executions. Nonetheless, strictly n^2 time
           space   time
uncompr    260Mb   ---
tgz        12%     2 mins
THIS       8%      16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
(figure: the Client holds f_old, the Server holds f_new; the client sends a request and receives an update)
 the client wants to update an out-dated file
 the server has the new file but does not know the old file
 update without sending the entire f_new (using similarity)
 rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch, since the server has both copies of the files

The rsync algorithm
(figure: the Client sends the block hashes of f_old; the Server answers with the encoded file derived from f_new)

The rsync algorithm (contd)
 simple, widely used, single roundtrip
 optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
 choice of block size problematic (default: max{700, √n} bytes)
 not good in theory: granularity of changes may disrupt use of blocks
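A toy, single-roundtrip sketch of the idea in Python (not the real rsync: the weak rolling hash is omitted and MD5 alone is recomputed at every offset; block_hashes / encode / decode are illustrative names):

import hashlib

def block_hashes(f_old: bytes, B: int):
    # client side: one strong hash per B-byte block of the old file
    return {hashlib.md5(f_old[i:i + B]).digest(): i // B
            for i in range(0, len(f_old), B)}

def encode(f_new: bytes, hashes, B: int):
    # server side: emit ("copy", block index) when a window of f_new matches an old block,
    # otherwise emit ("lit", one literal byte)
    out, i = [], 0
    while i < len(f_new):
        if len(f_new) - i >= B:
            h = hashlib.md5(f_new[i:i + B]).digest()
            if h in hashes:
                out.append(("copy", hashes[h]))
                i += B
                continue
        out.append(("lit", f_new[i:i + 1]))
        i += 1
    return out

def decode(f_old: bytes, ops, B: int) -> bytes:
    parts = []
    for op, val in ops:
        parts.append(f_old[val * B:(val + 1) * B] if op == "copy" else val)
    return b"".join(parts)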

Rsync: some experiments

         gcc size   emacs size
total    27288      27326
gzip      7563       8577
zdelta     227       1431
rsync      964       4452

Compressed sizes in KB (slightly outdated numbers)
Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync
 The server sends the hashes (unlike the client in rsync), and the client checks them
 The server deploys the common f_ref to compress the new f_tar (rsync compresses just it).

A multi-round protocol
 k blocks of n/k elems, log(n/k) levels
 If the distance is k, then on each level at most k hashes do not find a match in the other file.
 The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located on two machines A and B, determine the difference between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size k of the difference, where k may or may not be known in advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]
 Not perfectly true but...
 (recurring minimum for improving the estimate + 2 SBF)


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
   » Inverted files, Signature files, Bitmaps.
 Full-text indexes, no constraint on text and queries !
   » Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?
 A trie !!
 An array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
   iff   P is a prefix of the i-th suffix of T (ie. T[i,N])

Occurrences of P in T = all suffixes of T having P as a prefix

Example: P = si, T = mississippi  ->  occurrences at positions 4 and 7

SUF(T) = sorted set of suffixes of T

Reduction: from substring search to prefix search

The Suffix Tree
(figure: the suffix tree of T# = mississippi#, with edge labels such as i, s, si, ssi, i#, pi#, ppi#, mississippi#, and leaves labelled with the starting positions 1..12 of the suffixes)

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

Storing SUF(T) explicitly would take Θ(N^2) space; the suffix array stores only the (suffix-pointer) starting positions:

SA    SUF(T)
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#

T = mississippi#     (P = si)

Suffix Array space:
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison

SA = 12 11 8 5 2 1 10 9 7 4 6 3,   T = mississippi#,   P = si
Each step compares P against the suffix pointed to by the middle SA entry (2 accesses per step):
if P is larger we recurse on the right half, if P is smaller on the left half.

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char comparisons
 overall, O(p log2 N) time
 improvable to O(p + log2 N)   [Manber-Myers, '90]
 and to O(p + log2 |Σ|)        [Cole et al, '06]
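A minimal Python sketch of this indirect binary search (hand-rolled bounds; SA holds 1-based starting positions as in the slides, and sa_range is an illustrative name):

def sa_range(T, SA, P):
    lo, hi = 0, len(SA)                      # first row whose suffix is >= P
    while lo < hi:
        mid = (lo + hi) // 2
        if T[SA[mid] - 1:] < P:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    lo, hi = first, len(SA)                  # first row whose suffix does not start with P
    while lo < hi:
        mid = (lo + hi) // 2
        if T[SA[mid] - 1:SA[mid] - 1 + len(P)] > P:
            hi = mid
        else:
            lo = mid + 1
    return first, lo                         # rows [first, lo) are the occurrences of P

With T = "mississippi#" and SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3],
sa_range(T, SA, "si") returns (8, 10), i.e. the occurrences SA[8] = 7 and SA[9] = 4.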

Locating the occurrences
(figure: two binary searches, for si# and si$ with the convention # < every char of Σ < $, delimit the contiguous SA rows 7 (sippi...) and 4 (sissippi...), so occ = 2 and P = si occurs at positions 4 and 7 of T = mississippi#)

Suffix Array search
• O(p + log2 N + occ) time

Suffix Trays: O(p + log2 |Σ| + occ)       [Cole et al., '06]
String B-tree                             [Ferragina-Grossi, '95]
Self-adjusting Suffix Arrays              [Ciriani et al., '02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp = 0 0 1 4 0 0 1 0 2 1 3      SA = 12 11 8 5 2 1 10 9 7 4 6 3      T = mississippi#
(e.g. the adjacent suffixes issippi... and ississippi... share a common prefix of length 4)

• How long is the common prefix between T[i,...] and T[j,...] ?
  • It is the min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  • Search for an Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  • Search for a window Lcp[i,i+C-2] whose entries are all ≥ L
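A small Python sketch (naive quadratic comparisons; lcp_array and longest_repeat are illustrative names) of the Lcp array and of the second query above:

def lcp_array(T, SA):
    def lcp(a, b):                     # length of the common prefix of T[a:] and T[b:]
        n = 0
        while a + n < len(T) and b + n < len(T) and T[a + n] == T[b + n]:
            n += 1
        return n
    # one value per pair of suffixes adjacent in SA (1-based positions, as in the slides)
    return [lcp(SA[i] - 1, SA[i + 1] - 1) for i in range(len(SA) - 1)]

def longest_repeat(T, SA):
    # a repeated substring of length >= L exists iff some Lcp entry is >= L
    lcps = lcp_array(T, SA)
    i = max(range(len(lcps)), key=lambda k: lcps[k])
    return T[SA[i] - 1: SA[i] - 1 + lcps[i]]

For T = "mississippi#" (with the SA above), longest_repeat returns "issi", of length 4.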


Slide 234

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with i-th bit of U(T[j]) to estabilish if both are true

An example j=1
1 2 3 4 5 6 7 8 9 10

n
m

1

12345

1

0

P=abaac

2

0

3

0

4

0

5

0

T=xabxabaaca

0
 
0
U (x)  0
 
0
 
0

2

3

4

5

6

7

1 
 
0
BitShift ( M ( 0 )) & U (T [1])   0  &
 
0
 
0

8

9

0 0
   
0 0
0  0
   
0 0
   
0 0

1
0

An example j=2
1 2 3 4 5 6 7 8 9 10

n
m

1

2

12345

1

0

1

P=abaac

2

0

0

3

0

0

4

0

0

5

0

0

T=xabxabaaca

1 
 
0
U (a )   1 
 
1 
 
0

3

4

5

6

7

1 
 
0
BitShift ( M (1)) & U (T [ 2 ])   0  &
 
0
 
0

8

9

1  1 
   
0 0
1    0 
   
1   0 
   
0 0

1
0

An example j=3
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

12345

1

0

1

0

P=abaac

2

0

0

1

3

0

0

0

4

0

0

0

5

0

0

0

T=xabxabaaca

0
 
1 
U (b )   0 
 
0
 
0

4

5

6

7

1 
 
1 
BitShift ( M ( 2 )) & U (T [ 3 ])   0  &
 
0
 
0

8

9

0 0
   
1  1 
0  0
   
0 0
   
0 0

1
0

An example j=9
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

4

5

6

7

8

9

12345

1

0

1

0

0

1

0

1

1

0

P=abaac

2

0

0

1

0

0

1

0

0

0

3

0

0

0

0

0

0

1

0

0

4

0

0

0

0

0

0

0

1

0

5

0

0

0

0

0

0

0

0

1

T=xabxabaaca

0
 
0
U (c )   0 
 
0
 
1 

1 
 
1 
BitShift ( M ( 8 )) & U (T [ 9 ])   0  &
 
0
 
1 

0 0
   
0 0
0  0
   
0 0
   
1  1 

1
0

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M ( j - 1))  U (T [ j ])
l

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M

l -1

( j - 1))

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1
1 2 3 4 5 6 7 8 910

M1=

T=xabxabaaca
P=
abaad

1

2

3

4

5

6

7

8

9

1
0

1

1

1

1

1

1

1

1

1

1

1

2

0

0

1

0

0

1

0

1

1

0

3

0

0

0

1

0

0

1

0

0

1

4

0

0

0

0

1

0

0

1

0

0

5

0

0

0

0

0

0

0

0

1

0

M0=

1

2

3

4

5

6

7

8

9

1
0

1

0

1

0

0

1

0

1

1

0

1

2

0

0

1

0

0

1

0

0

0

0

3

0

0

0

0

0

0

1

0

0

0

4

0

0

0

0

0

0

0

1

0

0

5

0

0

0

0

0

0

0

0

0

0

How much do we pay?





The running time is O(kn(1+m/w)
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Delection: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
0 000 ...........0 x in b in ary
Length-1


x > 0 and Length = log2 x +1
e.g., 9 represented as <000,1001>.



g-code for x takes 2 log2 x +1 bits

(ie. factor of 2 from optimal)



Optimal for Pr(x) = 1/2x2, and i.i.d integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8

6

3

59

7

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How much good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
1 ≥Si=1,...,x pi ≥ x * px  x ≤ 1/px

How good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):



p i  | g ( i ) |

i  1 ,..., S

This is:



i  1 ,.., S

p i  [ 2 * log

1

 1]

pi

 2 * H 0 ( X ) 1
No much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1n 2n 3n… nn  Huff = O(n2 log n), MTF = O(n log n) + n2

No much worse than Huffman
...but it may be far better

MTF: how good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
Put S in the front and consider the cost of encoding:
S

O ( S log S ) 



nx

g(p - p )

x 1 i  2

x

x

i

i -1

S

By Jensen’s:

 O ( S log S ) 

n
x 1

x

[ 2 * log

N

 1]

nx

 O ( S log S )  N * [ 2 * H 0 ( X )  1]
L a [ mtf ]  2 * H 0 ( X )  O (1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1n 2n 3n… nn 

There is a memory

Huff(X) = Q(n2 log n) > Rle(X) = Q(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.
1.0

c = .3

0.7

i -1

f (i ) 



p( j)

j 1

b = .5
0.2
0.0

f(a) = .0, f(b) = .2, f(c) = .7
a = .2

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
1.0

0.7
c = .3

0.7

c = .3
0.55

0.0

a = .2

0.3
0.2

c = .3

0.27
b = .5

b = .5
0.2

0.3

a = .2

b = .5
0.22

a = .2

0.2

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l 0  0 l i  l i -1  s i -1 * f c i 

s0  1

s i  s i -1 * p c i 

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

n

sn 

 p c 
i

i 1

The interval for a message sequence will be called the
sequence interval

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
1.0

0.7
c = .3

0.7
0.49

b = .5

c = .3

0.55
0.49

0.2

0.0

0.55

b = .5

0.3
a = .2

The message is bbc.

0.2

0.49
0.475

c = .3

b = .5
0.35

a = .2

0.3

a = .2

Representing a real number
Binary fractional representation:
. 75

 . 11

1/ 3

 . 01 01

11 / 16

 . 1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
m in

m ax

interval

.11

.110

.111

[.75 ,1.0 )

.101

.1010

.1011

[.625 , .75 )

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
Context
Empty

Counts
A
B
C
$

=
=
=
=

4
2
5
3

Context
A
B
C

Counts
C
$
A
$
A
B
C
$

=
=
=
=
=
=
=
=

3
1
2
1
1
2
2
3

Context
AC

BA
CA
CB
CC

String = ACCBACCACBA B

k=2

Counts
B
C
$
C
$
C
$
A
$
A
B
$

=
=
=
=
=
=
=
=
=
=
=
=

1
2
2
1
1
1
1
2
1
1
1
2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Q(n2 log n) time in the worst-case
• Q(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web's Characteristics

 Size
   1 trillion pages available (Google, 7/08)
   5-40K per page => hundreds of terabytes
   Size grows every day!!

 Change
   8% new pages and 25% new links change weekly
   Life time of about 10 days

The Bow Tie

Some definitions

 Weakly connected components (WCC)
   Set of nodes such that from any node one can reach any other node via an undirected path.

 Strongly connected components (SCC)
   Set of nodes such that from any node one can reach any other node via a directed path.

(figure comparing a WCC and an SCC omitted)

Observing the Web Graph

 We do not know which percentage of it we know
 The only way to discover the graph structure of the web as hypertext is via large-scale crawls
 Warning: the picture might be distorted by
   Size limitation of the crawl
   Crawling rules
   Perturbations of the "natural" process of birth and death of nodes and links

Why is it interesting?

 Largest artifact ever conceived by humans
 Exploit the structure of the Web for
   Crawl strategies
   Search
   Spam detection
   Discovering communities on the web
   Classification/organization
 Predict the evolution of the Web
   Sociological understanding

Many other large graphs…

 Physical network graph
   V = Routers
   E = communication links

 The "cosine" graph (undirected, weighted)
   V = static web pages
   E = semantic distance between pages

 Query-Log graph (bipartite, weighted)
   V = queries and URLs
   E = (q,u) if u is a result for q, and has been clicked by some user who issued q

 Social graph (undirected, unweighted)
   V = users
   E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)
 V = URLs, E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN & no OUT)

Three key properties:
 Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1

The In-degree distribution

Altavista crawl (1999) and WebBase crawl (2001): the indegree follows a power-law distribution

    Pr[ in-degree(u) = k ]  ∝  1 / k^a ,    a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)
 V = URLs, E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN, no OUT)

Three key properties:
 Skewed distribution: the probability that a node has x links is 1/x^a, a ≈ 2.1
 Locality: usually most of the hyperlinks point to other URLs on the same host (about 80%).
 Similarity: pages close in lexicographic order tend to share many outgoing lists

A Picture of the Web Graph
(adjacency-matrix picture omitted: 21 million pages, 150 million links)

URL-sorting (Berkeley, Stanford)
URL compression + Delta encoding

The library WebGraph
From the uncompressed adjacency list to an adjacency list with compressed gaps (locality):

 Successor list S(x) = {s1 - x, s2 - s1 - 1, ..., sk - s(k-1) - 1}

 Only the first gap s1 - x may be negative. For negative entries, a value v is remapped to a non-negative integer: 2v if v >= 0, 2|v| - 1 if v < 0 (cf. the interval example below).
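A small Python sketch of this gap transform for one successor list (an illustration, not the WebGraph code; the node/successor values are a hypothetical toy list):

def encode_gaps(x, successors):
    # successors must be sorted increasingly; x is the source node.
    def nonneg(v):                      # remap a possibly-negative value
        return 2 * v if v >= 0 else 2 * abs(v) - 1
    gaps = [nonneg(successors[0] - x)]  # only the first gap can be negative
    for prev, s in zip(successors, successors[1:]):
        gaps.append(s - prev - 1)
    return gaps

# Hypothetical toy list for node 15:
print(encode_gaps(15, [13, 15, 16, 17, 18, 19, 23, 24]))   # [3, 1, 0, 0, 0, 0, 3, 0]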

Copy-lists (similarity)
 Reference chains, possibly limited in length.
 From the uncompressed adjacency list to an adjacency list with copy lists: the copy-list of y has one bit per successor of the reference x, telling whether that successor is also a successor of y.
 The reference index is chosen in [0,W] so as to give the best compression.

Copy-blocks = RLE(Copy-list)
 From the adjacency list with copy lists to an adjacency list with copy blocks (RLE on the bit sequences):
  The first copy block is 0 if the copy list starts with 0;
  The last block is omitted (we know the length…);
  The length is decremented by one for all blocks.

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
From the adjacency list with copy blocks, exploit consecutivity (runs of length ≥ 3) in the extra-nodes:
 Intervals: encoded by their left extreme and length
 Interval length: decremented by Lmin = 2
 Residuals: differences between consecutive residuals, or between the first residual and the source

Examples of encoded values:
 0 = (15-15)*2          (positive)
 2 = (23-19)-2          (jump >= 2)
 600 = (316-16)*2
 3 = |13-15|*2-1        (negative)
 3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
 Setting: sender and receiver connected by a network link; the receiver may already have some knowledge about the data.
 network links are getting faster and faster, but
 many clients are still connected by fairly slow links (mobile?)
 people wish to send more and more data

How can we make this transparent to the user?

Two standard techniques

 caching: "avoid sending the same object again"
   done on the basis of whole objects
   only works if objects are completely unchanged
   How about objects that are slightly changed?

 compression: "remove redundancy in transmitted data"
   avoid repeated substrings in data
   can be extended to the history of past transmissions (overhead)
   What if the sender has never seen the data at the receiver ?

Types of Techniques

 Common knowledge between sender & receiver
   Unstructured file: delta compression

 "Partial" knowledge
   Unstructured files: file synchronization
   Record-based data: set reconciliation

Formalization

 Delta compression   [diff, zdelta, REBL,…]
   Compress file f deploying file f'
   Compress a group of files
   Speed up web access by sending differences between the requested page and the ones available in cache

 File synchronization   [rsync, zsync]
   Client updates old file f_old with f_new available on a server
   Mirroring, Shared Crawling, Content Distribution Networks

 Set reconciliation
   Client updates a structured old file f_old with f_new available on a server
   Update of contacts or appointments, intersect inverted lists in a P2P search engine

Z-delta compression   (one-to-one)

Problem: We have two files f_known and f_new, and the goal is to compute a file f_d of minimum size such that f_new can be derived from f_known and f_d.

 Assume that block moves and copies are allowed
 Find an optimal covering set of f_new based on f_known
 The LZ77 scheme provides an efficient, optimal solution:
   f_known is the "previously encoded text"; compress the concatenation f_known·f_new starting from f_new
 zdelta is one of the best implementations

           Emacs size   Emacs time
uncompr    27Mb         ---
gzip       8Mb          35 secs
zdelta     1.5Mb        42 secs

Efficient Web Access
Dual-proxy architecture: a pair of proxies located on each side of the slow link uses a proprietary protocol to increase performance over this link.

 Client <--> client-side proxy <-- slow link (request + reference, delta-encoding back) --> server-side proxy <-- fast link (request, page) --> web

Use zdelta to reduce traffic:
 Old version available at both proxies
 Restricted to pages already visited (30% hits), URL-prefix match
 Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F
 Useful on a dynamic collection of web pages, back-ups, …
 Apply pairwise zdelta: find for each f ∈ F a good reference

Reduction to the Min Branching problem on DAGs:
 Build a weighted graph G_F: nodes = files, edge weights = zdelta-size
 Insert a dummy node connected to all files, whose edge weights are the gzip-coding sizes
 Compute the min branching = directed spanning tree of minimum total cost, covering G's nodes.

(example weighted graph omitted)

           space   time
uncompr    30Mb    ---
tgz        20%     linear
THIS       8%      quadratic

Improvement   (what about many-to-one compression of a group of files?)

Problem: Constructing G is very costly: n^2 edge calculations (zdelta executions)
 We wish to exploit some pruning approach
 Collection analysis: cluster the files that appear similar and are thus good candidates for zdelta-compression; build a sparse weighted graph G'_F containing only edges between those pairs of files
 Assign weights: estimate appropriate edge weights for G'_F, thus saving zdelta executions. Nonetheless, still strictly n^2 time.

           space   time
uncompr    260Mb   ---
tgz        12%     2 mins
THIS       8%      16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
 The client sends a request for f_new; the server returns an update. The server holds f_new, the client holds f_old.

 the client wants to update an out-dated file
 the server has the new file but does not know the old file
 update without sending the entire f_new (using similarity)
 rsync: file synch tool, distributed with Linux

Delta compression is a sort of "local" synch, since the server has both copies of the files.

The rsync algorithm
 The client sends the block hashes of f_old to the server; the server replies with the encoded file (copies + literals) built from f_new.

The rsync algorithm (contd)
 simple, widely used, single roundtrip
 optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
 choice of block size is problematic (default: max{700, √n} bytes)
 not good in theory: the granularity of changes may disrupt the use of blocks

Rsync: some experiments

          gcc size   emacs size
total     27288      27326
gzip      7563       8577
zdelta    227        1431
rsync     964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

 The server sends the hashes (unlike the client in rsync); the client checks them.
 The server deploys the common f_ref to compress the new f_tar (rsync just compresses f_tar by itself).

A multi-round protocol
 k blocks of n/k elements, log(n/k) levels
 If the distance is k, then on each level at most k hashes do not find a match in the other file.
 The communication complexity is O(k lg n lg(n/k)) bits.

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff
P is a prefix of the i-th suffix of T (i.e. T[i,N])

Occurrences of P in T = all suffixes of T having P as a prefix

Example: P = si, T = mississippi  →  occurrences at positions 4 and 7

SUF(T) = sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree

(figure omitted: the suffix tree of T# = mississippi#, with edge labels such as "ssi", "ppi#", "si", "pi#", "i#", "mississippi#" and the leaves labelled with the starting positions 1..12 of the corresponding suffixes)

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

Storing SUF(T) explicitly takes Θ(N^2) space; store only the suffix pointers:

T = mississippi#

SA    SUF(T)
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#

Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step.

T = mississippi#,  P = si
Compare P with the suffix pointed to by the middle entry of SA:
 if P is larger, recurse on the right half; if P is smaller, recurse on the left half.

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char comparisons
 overall, O(p log2 N) time

Improvements: O(p + log2 N) [Manber-Myers, '90]; O(p + log2 |S|) [Cole et al., '06]
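A Python sketch of this indirect binary search (an illustration, not the course code), returning the positions of the SA range whose suffixes start with P:

def sa_search(T, SA, P):
    # SA holds 1-based starting positions of the sorted suffixes of T.
    suffix = lambda i: T[SA[i] - 1:]
    lo, hi = 0, len(SA)
    while lo < hi:                      # leftmost suffix >= P
        mid = (lo + hi) // 2
        if suffix(mid) < P:
            lo = mid + 1
        else:
            hi = mid
    left = lo
    lo, hi = left, len(SA)
    while lo < hi:                      # leftmost suffix NOT starting with P
        mid = (lo + hi) // 2
        if suffix(mid).startswith(P):
            lo = mid + 1
        else:
            hi = mid
    return [SA[i] for i in range(left, lo)]   # occurrence positions

T = "mississippi#"
SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(sorted(sa_search(T, SA, "si")))   # [4, 7]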

Locating the occurrences

T = mississippi#, P = si: the entries of SA lying between the lexicographic positions of si# and si$ (where # is smaller and $ larger than every alphabet character) point to sippi... and sissippi..., i.e. occ = 2 occurrences, at positions 4 and 7.

Suffix Array search: O(p + log2 N + occ) time
Suffix Trays: O(p + log2 |S| + occ)   [Cole et al., '06]
String B-tree   [Ferragina-Grossi, '95]
Self-adjusting Suffix Arrays   [Ciriani et al., '02]

Text mining
Lcp[1,N-1] = longest common prefix between suffixes adjacent in SA

T = mississippi#

SA    suffix          Lcp with next
12    #               0
11    i#              1
 8    ippi#           1
 5    issippi#        4
 2    ississippi#     0
 1    mississippi#    0
10    pi#             1
 9    ppi#            0
 7    sippi#          2
 4    sissippi#       1
 6    ssippi#         3
 3    ssissippi#

• How long is the common prefix between T[i,...] and T[j,...] ?
  It is the minimum of the subarray Lcp[h,k-1], where SA[h]=i and SA[k]=j (h < k).
• Does there exist a repeated substring of length ≥ L ?
  Search for some Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.
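A direct Python sketch (illustration only) that builds Lcp from SA and T by comparing adjacent suffixes, and uses it for the repeated-substring question above:

def lcp_array(T, SA):
    # Lcp[i] = longest common prefix of the suffixes starting at SA[i] and SA[i+1] (SA is 1-based).
    lcps = []
    for a, b in zip(SA, SA[1:]):
        sa, sb = T[a - 1:], T[b - 1:]
        k = 0
        while k < min(len(sa), len(sb)) and sa[k] == sb[k]:
            k += 1
        lcps.append(k)
    return lcps

T = "mississippi#"
SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
Lcp = lcp_array(T, SA)
print(Lcp)                        # [0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3]
L = 3
print(any(v >= L for v in Lcp))   # True: a repeated substring of length >= 3 exists ("issi")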


Slide 235

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
    C * p * f/(1+f)

This is at least 10^4 * f/(1+f).
If we fetch B ≈ 4Kb in time C, and the algorithm uses all of them:

    (1/B) * (p * f/(1+f) * C)  ≈  30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
(disk figure omitted: track, read/write head, read/write arm, magnetic surface)

Data arrive continuously, or we wish FEW scans

Streaming algorithms:
 Use few scans
 Handle each element fast
 Use small space

Cache-Oblivious Algorithms
(memory-hierarchy figure omitted: CPU registers, L1, L2, RAM, HD, net; sizes from a few Mbs to many Tbs, access times from nanosecs to secs)

Unknown and/or changing devices
 Block access is important on all levels of the memory hierarchy
 But memory hierarchies are very diverse

Cache-oblivious algorithms:
 Explicitly, algorithms do not assume any model parameters
 Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray

 Goal: Given a stock and its D-performance over time, find the time window in which it achieved the best "market performance".
 Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Running times of the brute-force solutions on growing inputs:

        4K    8K    16K   32K    128K   256K   512K   1M
n^3     22s   3m    26m   3.5h   28h    --     --     --
n^2     0     0     0     1s     26s    106s   7m     28m

An optimal solution
We assume every subsum ≠ 0.

A = [ prefix with sum < 0 ][ part with sum > 0 containing the Optimum ]
A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm
 sum = 0; max = -1;
 For i = 1,...,n do
   If (sum + A[i] ≤ 0) then sum = 0;
   else { sum += A[i]; max = MAX{max, sum}; }

Note:
 • Sum < 0 when OPT starts;
 • Sum > 0 within OPT
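A runnable Python transcription of the scheme above (my wording of the slide's pseudocode, not its literal code):

def max_subarray_sum(A):
    best, running = float('-inf'), 0
    for x in A:
        if running + x <= 0:      # the current window cannot start the optimum
            running = 0
        else:
            running += x
            best = max(best, running)
    return best

A = [2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]
print(max_subarray_sum(A))        # 12, from the subarray 6 1 -2 4 3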

Toy problem #2: sorting
 How to sort tuples (objects) on disk

 Key observation:
   Array A is an "array of pointers to objects"
   For each object-to-object comparison A[i] vs A[j]: 2 random accesses to the memory locations A[i] and A[j]
   MergeSort makes Θ(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB
 n insertions → data get distributed arbitrarily over the B-tree leaves ("tuple pointers"), so listing the tuples in order makes random accesses to the tuples.

Possibly 10^9 random I/Os = 10^9 * 5ms ≈ 2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
  if (i < j) then
     m = (i+j)/2;              // Divide
     Merge-Sort(A,i,m);        // Conquer
     Merge-Sort(A,m+1,j);
     Merge(A,i,m,j)            // Combine

Cost of Mergesort on large data
 Take Wikipedia in Italian, compute word freq:
   n = 10^9 tuples → a few Gbs
 Typical disk (Seagate Cheetah 150Gb): seek time ~5ms

 Analysis of mergesort on disk:
   It is an indirect sort: Θ(n log2 n) random I/Os
   [5ms] * n log2 n  ≈  1.5 years

 In practice it is faster because of caching (2 passes, R/W, once runs fit in memory)...

Merge-Sort Recursion Tree

(figure omitted: the recursion tree of depth log2 N, showing runs such as 1 2 5 7 9 10 and 2 7 8 13 9 19 being merged level by level)

If the run-size is larger than B (i.e. after the first step!!), fetching all of it in memory for merging does not help.

How do we deploy the disk/memory features ?

With internal memory M: N/M runs, each sorted in internal memory (no I/Os)
 I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort

 The key is to balance run-size and #runs to merge
 Sort N items with main memory M and disk pages of B items:
   Pass 1: produce N/M sorted runs.
   Pass i: merge X ≤ M/B runs at a time  →  log_{M/B}(N/M) merge passes

(figure omitted: X input buffers INPUT 1 .. INPUT X and one OUTPUT buffer of B items in main memory, streaming runs from and to disk)

Multiway Merging
(figure omitted) Each of the X = M/B runs streams through its own buffer Bf1..BfX with a cursor p1..pX; at each step min(Bf1[p1], Bf2[p2], …, BfX[pX]) is moved to the output buffer Bfo. A run buffer is refilled from disk when its cursor reaches B (pi = B); the output buffer is flushed to the merged run when full, until EOF.
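A small in-memory Python sketch of X-way merging with a heap (illustration only; on disk each list would be a buffered run):

import heapq

def multiway_merge(runs):
    # Push the head of each sorted run; repeatedly pop the minimum and advance that run.
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        value, i, j = heapq.heappop(heap)
        out.append(value)
        if j + 1 < len(runs[i]):
            heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
    return out

runs = [[1, 2, 5, 7, 9, 10], [2, 7, 8, 13], [3, 4, 11, 12, 15, 17, 19], [6, 8, 11, 12]]
print(multiway_merge(runs))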

Cost of Multi-way Merge-Sort

 Number of passes = log_{M/B}(#runs) ≈ log_{M/B}(N/M)
 Optimal cost = Θ((N/B) log_{M/B}(N/M)) I/Os

In practice
 M/B ≈ 1000  →  #passes = log_{M/B}(N/M) ≈ 1
 One multiway merge → 2 passes = a few minutes (tuning depends on disk features)

 A large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Can compression help?
 Goal: enlarge M and reduce N
   #passes = O(log_{M/B} N/M)
   Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements

 Goal: Top queries over a stream of N items (S large).
 Math Problem: Find the item y whose frequency is > N/2, using the smallest space (i.e. the mode, if it occurs > N/2 times).

A = b a c c c d c b a a a c c b c c c

Algorithm
 Use a pair of variables <X,C>
 For each item s of the stream:
   if (X == s) then C++
   else { C--; if (C == 0) { X = s; C = 1; } }
 Return X;
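A runnable Python version (using the standard initialization C = 0, which the slide leaves implicit):

def majority_candidate(stream):
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1            # restart the counter on the new candidate
        elif X == s:
            C += 1
        else:
            C -= 1                 # s is a "negative mate" of an occurrence of X
    return X                       # the majority item, if one exists (> N/2 occurrences)

A = "bacccdcbaaaccbccc"
print(majority_candidate(A))       # c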

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing

 Consider the following TREC collection:
   N = 6 * 10^9 characters,  size = 6Gb
   n = 10^6 documents
   TotT = 10^9 term occurrences (avg term length is 6 chars)
   t = 5 * 10^5 distinct terms

What kind of data structure do we build to support word-based searches ?

Solution 1: Term-Doc matrix

t = 500K terms (rows), n = 1 million documents (columns); the entry is 1 if the play contains the word, 0 otherwise.

            Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony             1                1             0          0       0        1
Brutus             1                1             0          1       0        0
Caesar             1                1             0          1       1        1
Calpurnia          0                1             0          0       0        0
Cleopatra          1                0             0          0       0        0
mercy              1                0             1          1       1        1
worser             1                0             1          1       1        0

Space is 500Gb !

Solution 2: Inverted index

Brutus    →  2 4 8 16 32 64 128
Calpurnia →  1 2 3 5 8 13 21 34
Caesar    →  13 16

We can still do better: i.e. 30÷50% of the original text.

1. Typically about 12 bytes are used per posting
2. We have 10^9 total term occurrences → at least 12Gb space
3. Compressing the 6Gb of documents gets 1.5Gb of data
Better index, but it is still >10 times the text !!!!

Please !!
Do not underestimate the features of disks in algorithmic design
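A toy Python sketch (not the course code) of how such an index is built: one sorted posting list of document ids per term, and AND queries by list intersection.

from collections import defaultdict

def build_inverted_index(docs):
    # docs: dict doc_id -> text; returns term -> sorted list of doc_ids containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "Antony and Cleopatra", 2: "Julius Caesar", 4: "Brutus killed Caesar"}
idx = build_inverted_index(docs)
print(idx["caesar"])                                       # [2, 4]
print(sorted(set(idx["brutus"]) & set(idx["caesar"])))     # [4]  (AND query = intersection)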

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string is (losslessly) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM into fewer bits ?
NO: they are 2^n, but we have fewer compressed messages:

    sum_{i=1}^{n-1} 2^i  =  2^n - 2

We need to talk about stochastic sources.

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the self-information of s is:

    i(s) = log2 (1 / p(s)) = - log2 p(s)

Lower probability  →  higher information

Entropy is the weighted average of i(s):

    H(S) = sum_{s in S} p(s) * log2 (1 / p(s))    bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable-length code in which no codeword is a prefix of another one
e.g. a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie (figure omitted: left branch = 0, right branch = 1; leaves a, b, c, d).

Average Length
For a code C with codeword lengths L[s], the average length is defined as

    La(C) = sum_{s in S} p(s) * L[s]

We say that a prefix code C is optimal if for all prefix codes C',  La(C) ≤ La(C')

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any uniquely decodable code C, we have

    H(S) ≤ La(C)

Theorem (upper bound, Shannon). For any probability distribution, there exists a prefix code C such that

    La(C) ≤ H(S) + 1

(the Shannon code, which takes log2 1/p bits per symbol)

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

(tree figure omitted: merge a(.1)+b(.2) into (.3), then (.3)+c(.2) into (.5), then (.5)+d(.5) into (1))

Resulting code: a = 000, b = 001, c = 01, d = 1
There are 2^(n-1) "equivalent" Huffman trees.

What about ties (and thus, tree depth) ?
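A compact Huffman construction in Python (a sketch of the standard greedy algorithm, not the slides' code), run on the example above:

import heapq

def huffman_code(probs):
    # probs: dict symbol -> probability. Returns dict symbol -> codeword (bit string).
    heap = [(p, i, [s]) for i, (s, p) in enumerate(sorted(probs.items()))]
    heapq.heapify(heap)
    code = {s: "" for s in probs}
    tie = len(heap)
    while len(heap) > 1:
        p1, _, group1 = heapq.heappop(heap)     # the two least-probable subtrees
        p2, _, group2 = heapq.heappop(heap)
        for s in group1: code[s] = "0" + code[s]
        for s in group2: code[s] = "1" + code[s]
        heapq.heappush(heap, (p1 + p2, tie, group1 + group2))
        tie += 1
    return code

print(huffman_code({'a': .1, 'b': .2, 'c': .2, 'd': .5}))
# {'a': '110', 'b': '111', 'c': '10', 'd': '0'} – same codeword lengths as the slide's
# a=000, b=001, c=01, d=1; the actual bit labels depend on how ties are broken.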

Encoding and Decoding
Encoding: emit the root-to-leaf path leading to the symbol to be encoded.
Decoding: start at the root and take the branch for each bit received; when at a leaf, output its symbol and return to the root.

  abc...  →  00000101
  101001...  →  dcb...

(tree figure omitted: a(.1) and b(.2) under (.3); (.3) and c(.2) under (.5); (.5) and d(.5) under the root (1))

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. Its self-information is

    -log2(.999) ≈ .00144 bits

If we were to send 1000 such symbols we might hope to use 1000 * .00144 ≈ 1.44 bits.
Using Huffman, we take at least 1 bit per symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra bits per symbol
 Larger model to be transmitted
Shannon took infinite sequences, and k → ∞ !!

In practice, we have:
 The model takes |S|^k * (k * log |S|) + h^2 bits (where h might be |S|)
 It is H0(S^L) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

Compress + Search ?    [Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the Huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged: one tagging bit per byte marks the first byte of a codeword, and the remaining 7 bits carry the Huffman code

(figure omitted: byte-aligned tagged codewords for the words {bzip, not, or, space} and the compressed text C(T) of T = "bzip or not bzip")

CGrep and other ideas...
Search P = bzip directly in the compressed text: compare the codeword of P against the byte-aligned codewords of C(T), marking matches (yes) and mismatches (no); the tag bits prevent false matches inside longer codewords.

Speed ≈ Compression ratio

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary = {bzip, not, or, space};  P = bzip, with codeword 1a 0b.

(figure omitted: the tagged byte-aligned codewords of the dictionary and of the compressed text C(S), S = "bzip or not bzip"; the pattern's codeword is matched against C(S), with yes/no marks at each aligned position)

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the pattern P[1,m] in the text T[1,n].

(figure omitted: pattern P = A B slid over the text T = A B C A B D A B)

 Naïve solution
   For any position i of T, check whether T[i,i+m-1] = P[1,m]
   Complexity: O(nm) time

 (Classical) optimal solutions based on comparisons
   Knuth-Morris-Pratt
   Boyer-Moore
   Complexity: O(n + m) time

Semi-numerical pattern matching

 We show methods in which arithmetic and bit-operations replace comparisons
 We will survey two examples of such methods:
   The Random Fingerprint method due to Karp and Rabin
   The Shift-And method due to Baeza-Yates and Gonnet

Rabin-Karp Fingerprint
 We will use a class of functions from strings to integers in order to obtain:
   An efficient randomized algorithm that makes an error with small probability.
   A randomized algorithm that never errs, whose running time is efficient with high probability.
 We will consider a binary alphabet (i.e., T ∈ {0,1}^n).

Arithmetic replaces Comparisons

 Strings are also numbers: H: strings → numbers.
 Let s be a string of length m:

    H(s) = sum_{i=1}^{m} 2^(m-i) * s[i]

 Example: P = 0101,  H(P) = 2^3*0 + 2^2*1 + 2^1*0 + 2^0*1 = 5
 s = s' if and only if H(s) = H(s')

Definition: let Tr denote the m-length substring of T starting at position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons

 We can compute H(Tr) from H(Tr-1):

    H(Tr) = 2 * H(Tr-1) - 2^m * T(r-1) + T(r+m-1)

 T = 10110101:  T1 = 1011, T2 = 0110
    H(T1) = H(1011) = 11
    H(T2) = H(0110) = 2*11 - 2^4*1 + 0 = 22 - 16 + 0 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P = 101111, q = 7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally, one bit at a time:
  start with the first bit: 1
  1*2 (mod 7) + 0 = 2
  2*2 (mod 7) + 1 = 5
  5*2 (mod 7) + 1 = 4
  4*2 (mod 7) + 1 = 2
  2*2 (mod 7) + 1 = 5
  5 (mod 7) = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1), since 2^m (mod q) = 2 * (2^(m-1) (mod q)) (mod q).
Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm

 Choose a positive integer I
 Pick a random prime q ≤ I, and compute P's fingerprint Hq(P).
 For each position r in T, compute Hq(Tr) and test whether it equals Hq(P). If the numbers are equal, either
   declare a probable match (randomized algorithm), or
   check and declare a definite match (deterministic algorithm)

 Running time: excluding verification, O(n+m).
 The randomized algorithm is correct w.h.p.
 The deterministic algorithm has expected running time O(n+m).

Proof on the board
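A Python sketch of the fingerprint scan (illustration only: here q is a fixed prime rather than drawn at random, and every fingerprint hit is verified, i.e. the deterministic variant):

def karp_rabin(T, P, q=2**31 - 1):
    n, m = len(T), len(P)
    if m > n:
        return []
    pow_m = pow(2, m, q)                      # 2^m mod q, used by the rolling update
    hp = ht = 0
    for i in range(m):                        # fingerprints of P and of T[1,m]
        hp = (2 * hp + int(P[i])) % q
        ht = (2 * ht + int(T[i])) % q
    occ = []
    for r in range(n - m + 1):
        if ht == hp and T[r:r + m] == P:      # verify, to rule out false matches
            occ.append(r + 1)                 # 1-based position, as in the slides
        if r + m < n:                         # roll: drop T[r], append T[r+m]
            ht = (2 * ht - pow_m * int(T[r]) + int(T[r + m])) % q
    return occ

print(karp_rabin("10110101", "0101"))         # [5]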

Problem 1: Solution
Dictionary = {bzip, not, or, space};  P = bzip = 1a 0b.

(figure omitted: the pattern's tagged codeword is scanned against the compressed text C(S), S = "bzip or not bzip"; the tag bits allow only byte-aligned matches, so both occurrences of bzip are found)

Speed ≈ Compression ratio

The Shift-And method

 Define M to be a binary m-by-n matrix such that:
   M(i,j) = 1 iff the first i characters of P exactly match the i characters of T ending at character j,
   i.e. M(i,j) = 1 iff P[1 … i] = T[j-i+1 ... j]

 Example: T = california and P = for.
   M has m = 3 rows (f, o, r) and n = 10 columns; M(1,5) = 1 (f matches), M(2,6) = 1 (fo), M(3,7) = 1 (for): column 7 has its last bit set, so an occurrence of P ends at position 7.

How does M solve the exact match problem?

How to construct M

 We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one
 Machines can perform bit and arithmetic operations between two words in constant time.
 Examples:
   And(A,B) is the bit-wise and between A and B.
   BitShift(A) is the value obtained by shifting A's bits down by one position and setting the first bit to 1,
   e.g. BitShift( (0,1,1,0,1)^T ) = (1,0,1,1,0)^T.

 Let w be the word size (e.g., 32 or 64 bits). We'll assume m = w. NOTICE: any column of M fits in a memory word.

How to construct M

 We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one.
 We define the m-length binary vector U(x) for each character x of the alphabet: U(x) is set to 1 in the positions where character x appears in P.
 Example: P = abaac
   U(a) = (1,0,1,1,0)^T
   U(b) = (0,1,0,0,0)^T
   U(c) = (0,0,0,0,1)^T

How to construct M

 Initialize column 0 of M to all zeros
 For j > 0, the j-th column is obtained by

    M(j) = BitShift( M(j-1) )  &  U( T[j] )

 For i > 1, entry M(i,j) = 1 iff
   (1) the first i-1 characters of P match the i-1 characters of T ending at character j-1  ⇔  M(i-1,j-1) = 1
   (2) P[i] = T[j]  ⇔  the i-th bit of U(T[j]) = 1

 BitShift moves bit M(i-1,j-1) into the i-th position; AND-ing this with the i-th bit of U(T[j]) establishes whether both conditions are true.

An example: T = xabxabaaca, P = abaac  (m = 5, n = 10)

(step-by-step matrices omitted)
 j = 1: M(1) = BitShift(M(0)) & U(T[1]) = (1,0,0,0,0)^T & U(x) = (0,0,0,0,0)^T
 j = 2: M(2) = BitShift(M(1)) & U(T[2]) = (1,0,0,0,0)^T & U(a) = (1,0,0,0,0)^T
 j = 3: M(3) = BitShift(M(2)) & U(T[3]) = (1,1,0,0,0)^T & U(b) = (0,1,0,0,0)^T
 ...
 j = 9: M(9) = BitShift(M(8)) & U(T[9]) = (1,1,0,0,1)^T & U(c) = (0,0,0,0,1)^T
        the 5-th (last) bit of column 9 is 1: an occurrence of P = abaac ends at position 9.

Shift-And method: Complexity

 If m ≤ w, any column and any vector U() fit in a memory word: each step requires O(1) time.
 If m > w, any column and any vector U() can be split into m/w memory words: each step requires O(m/w) time.
 Overall: O(n(1+m/w)+m) time.
 Thus it is very fast when the pattern length is close to the word size — very often the case in practice. Recall that w = 64 bits in modern architectures.
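A Python sketch of Shift-And using machine integers as bit columns (bit i-1 of the word plays the role of row i; an illustration, not the course code):

def shift_and(T, P):
    m = len(P)
    # U[c]: bit (i-1) set iff P[i] = c  (the rows of the slides become bits of an integer).
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    occ, col, last_bit = [], 0, 1 << (m - 1)
    for j, c in enumerate(T, start=1):
        # M(j) = BitShift(M(j-1)) & U(T[j]): shift and set the first bit to 1.
        col = ((col << 1) | 1) & U.get(c, 0)
        if col & last_bit:                    # row m is set: an occurrence ends at j
            occ.append(j)
    return occ

print(shift_and("xabxabaaca", "abaac"))       # [9]
print(shift_and("california", "for"))         # [7]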

Some simple extensions

 We want to allow the pattern to contain special symbols, like the class of chars [a-f]
 Example: P = [a-b]baac
   U(a) = (1,0,1,1,0)^T
   U(b) = (1,1,0,0,0)^T
   U(c) = (0,0,0,0,1)^T
   (position 1 is set both in U(a) and in U(b))

 What about '?', '[^…]' (not) ?

Problem 1: Another solution
Dictionary = {bzip, not, or, space};  P = bzip = 1a 0b.

(figure omitted: the same compressed text C(S) of S = "bzip or not bzip"; the pattern is now searched with the bit-parallel machinery instead of byte-wise comparisons)

Speed ≈ Compression ratio

Problem 2
Dictionary = {bzip, not, or, space}

Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring.
Example: P = o  →  it occurs in "not" (= 1g 0g 0a) and in "or" (= 1g 0a 0b).

(figure omitted: the compressed text C(S) of S = "bzip or not bzip", with the matching codewords marked yes/no)

Speed ≈ Compression ratio?  No! Why?
A scan of C(S) is needed for each term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want to find all the occurrences of those patterns in the text T[1,n].

(figure omitted: patterns P1 and P2 slid over the text T)

 Naïve solution
   Use an (optimal) exact-matching algorithm, searching each pattern of P separately
   Complexity: O(nl+m) time — not good with many patterns

 Optimal solution due to Aho and Corasick
   Complexity: O(n + l + m) time

A simple extension of Shift-And

 S is the concatenation of the patterns in P
 R is a bitmap of length m: R[i] = 1 iff S[i] is the first symbol of a pattern
 Use a variant of the Shift-And method searching for S:
   For any symbol c, U'(c) = U(c) AND R, i.e. U'(c)[i] = 1 iff S[i] = c and it is the first symbol of a pattern
   For any step j:
     compute M(j), then M(j) OR U'(T[j]). Why? It sets to 1 the first bit of each pattern that starts with T[j].
     Check if there are occurrences ending in j. How?

Problem 3
Dictionary = {bzip, not, or, space}

Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring, allowing at most k mismatches.
Example: P = bot, k = 2.

(figure omitted: the compressed text C(S) of S = "bzip or not bzip")

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep

 Our current goal: given k, find all the occurrences of P in T with up to k mismatches
 We define the matrix M^l to be an m-by-n binary matrix such that:
   M^l(i,j) = 1 iff there are no more than l mismatches between the first i characters of P and the i characters of T ending at character j.
 What is M^0 ?
 How does M^k solve the k-mismatch problem?

Computing M^k

 We compute M^l for all l = 0, … , k.
 For each j compute M(j), M^1(j), … , M^k(j)
 For all l, initialize M^l(0) to the zero vector.
 In order to compute M^l(j), we observe that there is a match iff one of the two cases below holds.

Computing M^l: case 1
 The first i-1 characters of P match a substring of T ending at j-1 with at most l mismatches, and the next pair of characters in P and T are equal:

    BitShift( M^l(j-1) ) & U( T[j] )

Computing M^l: case 2
 The first i-1 characters of P match a substring of T ending at j-1 with at most l-1 mismatches (the next pair of characters may mismatch):

    BitShift( M^(l-1)(j-1) )

Putting the two cases together:

    M^l(j) = [ BitShift( M^l(j-1) ) & U( T(j) ) ]  OR  BitShift( M^(l-1)(j-1) )

Example M1
T = xabxabaaca, P = abaad, k = 1.

(matrices M^0 and M^1 omitted: M^0 marks the exact prefix matches as before; M^1(5,j) = 1 at the columns where abaad occurs with at most one mismatch, e.g. column 9, where T[5,9] = abaac differs from abaad only in the last character)

How much do we pay?

 The running time is O(k n (1 + m/w))
 Again, the method is practically efficient for small m.
 Still, only O(k) columns of M are needed at any given time. Hence the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary = {bzip, not, or, space};  P = bot, k = 2.

(figure omitted: the compressed text C(S) of S = "bzip or not bzip"; the dictionary terms matching P = bot within 2 mismatches are marked yes, e.g. not = 1g 0g 0a)

Agrep: more sophisticated operations

 The Shift-And method can solve other ops.
 The edit distance between two strings p and s is d(p,s) = the minimum number of operations needed to transform p into s via three ops:
   Insertion: insert a symbol in p
   Deletion: delete a symbol from p
   Substitution: change a symbol in p with a different one
 Example: d(ananas, banane) = 3

 Search by regular expressions
   Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword lengths, and thus build the tree…
This may be extremely time/space costly when you deal with Gbs of textual data.

A simple algorithm
 Sort the pi in decreasing order, and encode si via the variable-length code for the integer i.

g-code for integer encoding
 g(x) = (Length - 1) zeros, followed by x in binary, where x > 0 and Length = floor(log2 x) + 1
 e.g., 9 is represented as <000,1001>.
 The g-code for x takes 2*floor(log2 x) + 1 bits (i.e. a factor of 2 from optimal)
 Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8

6

3

59

7
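A tiny Python encoder/decoder for the g-code (a sketch, not the course's code), checked on the exercise above:

def gamma_encode(x):
    b = bin(x)[2:]                      # binary representation of x > 0
    return "0" * (len(b) - 1) + b       # (Length-1) zeros, then x in binary

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":           # count the unary prefix
            zeros += 1
            i += 1
        out.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return out

print("".join(gamma_encode(x) for x in [8, 6, 3, 59, 7]))
# 0001000001100110000011101100111
print(gamma_decode("0001000001100110000011101100111"))   # [8, 6, 3, 59, 7]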

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How good is this approach wrt Huffman?  Compression ratio ≤ 2 * H0(s) + 1

Key fact:   1 ≥ sum_{i=1,...,x} pi ≥ x * px   ⇒   x ≤ 1/px

How good is it ?
The cost of the encoding is (recall i ≤ 1/pi):

    sum_{i=1,..,|S|} pi * |g(i)|  ≤  sum_{i=1,..,|S|} pi * [ 2 * log(1/pi) + 1 ]  =  2 * H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
The distribution of words is skewed: 1/i^q, where 1 < q < 2

 A new concept: Continuers vs Stoppers
   Previously we used: s = c = 128
 The main idea is:
   s + c = 256 (we are playing with 8 bits)
   Thus s items are encoded with 1 byte
   and s*c with 2 bytes, s*c^2 with 3 bytes, ...

An example
 5000 distinct words
 ETDC encodes 128 + 128^2 = 16512 words on at most 2 bytes
 A (230,26)-dense code encodes 230 + 230*26 = 6210 words on at most 2 bytes, hence more words on 1 byte; thus, if the distribution is skewed, it compresses better.

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move-to-Front Coding
Transforms a char sequence into an integer sequence, that can then be var-length coded:
 Start with the list of symbols L = [a,b,c,d,…]
 For each input symbol s:
   1) output the position of s in L
   2) move s to the front of L

There is a memory.
Properties:
 It exploits temporal locality, and it is dynamic
 X = 1^n 2^n 3^n … n^n  →  Huff = O(n^2 log n), MTF = O(n log n) + n^2
 Not much worse than Huffman ... but it may be far better.
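A short Python sketch of the transform (illustration only; positions are 0-based here, as in the bzip example later):

def mtf_encode(text, alphabet):
    L = list(alphabet)                  # the MTF list, e.g. [i, m, p, s]
    out = []
    for s in text:
        pos = L.index(s)                # 1) output the position of s in L
        out.append(pos)
        L.insert(0, L.pop(pos))         # 2) move s to the front of L
    return out

print(mtf_encode("aaaabbbb", "abcd"))   # [0, 0, 0, 0, 1, 0, 0, 0]  – runs become runs of zeros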

MTF: how good is it ?
Encode the integers via g-coding:  |g(i)| ≤ 2 * log i + 1
Transmit the initial list S up front and consider the cost of encoding:

    O(|S| log |S|)  +  sum_{x=1}^{|S|} sum_{i=2}^{n_x} g( p_{x,i} - p_{x,i-1} )

By Jensen's inequality:

    ≤  O(|S| log |S|)  +  sum_{x=1}^{|S|} n_x * [ 2 * log(N/n_x) + 1 ]
    =  O(|S| log |S|)  +  N * [ 2 * H0(X) + 1 ]

Hence  La[mtf] ≤ 2 * H0(X) + O(1).

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run-Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings → just the run lengths and one starting bit.

Properties:
 There is a memory: it exploits spatial locality, and it is a dynamic code
 X = 1^n 2^n 3^n … n^n  →  Huff(X) = Θ(n^2 log n)  >  Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive):

    f(i) = sum_{j=1}^{i-1} p(j)

e.g.  p(a) = .2, p(b) = .5, p(c) = .3  →  f(a) = .0, f(b) = .2, f(c) = .7
      a = [0, .2),  b = [.2, .7),  c = [.7, 1.0)

The interval for a particular symbol is called the symbol interval (e.g. for b it is [.2,.7)).

Arithmetic Coding: Encoding Example
Coding the message sequence: bac   (p(a) = .2, p(b) = .5, p(c) = .3)

  b: restrict [0,1) to its b-part       → [0.2, 0.7)
  a: restrict [0.2,0.7) to its a-part   → [0.2, 0.3)
  c: restrict [0.2,0.3) to its c-part   → [0.27, 0.3)

The final sequence interval is [.27, .3)

Arithmetic Coding
To code a sequence of symbols c with probabilities p[c] use the following:

    l_0 = 0      l_i = l_(i-1) + s_(i-1) * f[c_i]
    s_0 = 1      s_i = s_(i-1) * p[c_i]

f[c] is the cumulative probability up to symbol c (not included).
The final interval size is

    s_n = prod_{i=1}^{n} p[c_i]

The interval for a message sequence will be called the sequence interval.
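A Python sketch of these two recurrences with plain floats (fine for short messages; real coders use the integer renormalization described later):

def sequence_interval(msg, p):
    # p: dict symbol -> probability; f[c] = cumulative probability of the symbols before c.
    f, acc = {}, 0.0
    for c in sorted(p):
        f[c] = acc
        acc += p[c]
    l, s = 0.0, 1.0                       # l_0 = 0, s_0 = 1
    for c in msg:
        l = l + s * f[c]                  # l_i = l_{i-1} + s_{i-1} * f[c_i]
        s = s * p[c]                      # s_i = s_{i-1} * p[c_i]
    return l, l + s                       # the sequence interval [l, l+s)

print(sequence_interval("bac", {'a': .2, 'b': .5, 'c': .3}))   # ≈ (0.27, 0.3)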

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3 (p(a) = .2, p(b) = .5, p(c) = .3):

  .49 ∈ [.2, .7)   → b ;  rescale: (.49 - .2)/.5 = .58
  .58 ∈ [.2, .7)   → b ;  rescale: (.58 - .2)/.5 = .76
  .76 ∈ [.7, 1.0)  → c

The message is bbc.

Representing a real number
Binary fractional representation:

  .75 = .11        1/3 = .0101...        11/16 = .1011

Algorithm:
 1. x = 2*x
 2. If x < 1, output 0
 3. else x = x - 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
e.g.  [0,.33) = .01      [.33,.66) = .1      [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions:

          min      max      interval
  .11     .110     .111     [.75, 1.0)
  .101    .1010    .1011    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length
Note that  -log s + 1 = log(2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

    1 + log(1/s)  =  1 + ceil( log prod_i (1/p_i) )
                  ≤  2 + sum_{i=1,n} log(1/p_i)
                  =  2 + sum_{k=1,|S|} n p_k log(1/p_k)
                  =  2 + n H0   bits

In practice, n H0 + 0.02 n bits, because of rounding.

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts   (k = 2)
String = ACCBACCACBA, next symbol = B

Order 0             Order 1                   Order 2
Context (empty)     Context A: C=3  $=1       Context AC: B=1  C=2  $=2
  A = 4             Context B: A=2  $=1       Context BA: C=1  $=1
  B = 2             Context C: A=1  B=2       Context CA: C=1  $=1
  C = 5                        C=2  $=3       Context CB: A=2  $=1
  $ = 3                                       Context CC: A=1  B=1  $=2
You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a "dictionary" of recently-seen strings.
The differences among the variants are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit frequency estimation.
LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary = all substrings starting before the cursor (within a window); a step may emit e.g. <2,3,c>.

Algorithm's step:
 Output <d, len, c> where
   d = distance of the copied string wrt the current position
   len = length of the longest match
   c = next char in the text beyond the longest match
 Advance by len + 1

A buffer "window" of fixed length moves over the text.

Example: LZ77 with window of size 6, on T = a a c a a c a b c a b a a a c

  cursor at 1:  longest match within W = ""          → output (0,0,a)
  cursor at 2:  longest match "a"    (d=1, len=1)    → output (1,1,c)
  cursor at 4:  longest match "aaca" (d=3, len=4)    → output (3,4,b)
  cursor at 9:  longest match "cab"  (d=3, len=3)    → output (3,3,a)
  cursor at 13: longest match "aa"   (d=1, len=2)    → output (1,2,c)

LZ77 Decoding
The decoder keeps the same dictionary window as the encoder:
 it finds the substring and inserts a copy of it.

What if len > d? (overlap with the text still being produced)
 E.g. seen = abcd, next codeword is (2,9,e)
 Simply copy starting at the cursor:
   for (i = 0; i < len; i++) out[cursor+i] = out[cursor-d+i]
 Output is correct: abcdcdcdcdcdce
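A runnable Python version of this decoder loop (illustration only), including the overlapping-copy case:

def lz77_decode(triples):
    out = []
    for d, length, c in triples:
        cursor = len(out)
        for i in range(length):                 # copies may overlap the part being written
            out.append(out[cursor - d + i])
        out.append(c)
    return "".join(out)

# The windowed example above:
print(lz77_decode([(0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), (3, 3, 'a'), (1, 2, 'c')]))
# aacaacabcabaaac
# The overlap case from the slide: seen "abcd", then (2,9,e)
print(lz77_decode([(0, 0, 'a'), (0, 0, 'b'), (0, 0, 'c'), (0, 0, 'd'), (2, 9, 'e')]))
# abcdcdcdcdcdce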

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example   (input: a a b a a c a b c a b c b)

Output   Dict.
(0,a)    1 = a
(1,b)    2 = ab
(1,a)    3 = aa
(0,c)    4 = c
(2,c)    5 = abc
(5,b)    6 = abcb

LZ78: Decoding Example

Input    Output so far                 Dict.
(0,a)    a                             1 = a
(1,b)    a a b                         2 = ab
(1,a)    a a b a a                     3 = aa
(0,c)    a a b a a c                   4 = c
(2,c)    a a b a a c a b c             5 = abc
(5,b)    a a b a a c a b c a b c b     6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Q(n2 log n) time in the worst-case
• Q(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics

Size
 1 trillion pages available (Google, 7/2008)
 5-40K per page => hundreds of terabytes
 Size grows every day!!

Change
 8% new pages, 25% new links change weekly
 Life time of about 10 days

The Bow Tie

Some definitions

Weakly connected components (WCC)
 Set of nodes such that from any node one can reach any other node via an undirected path.

Strongly connected components (SCC)
 Set of nodes such that from any node one can reach any other node via a directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?

 Largest artifact ever conceived by humans
 Exploit the structure of the Web for
  Crawl strategies
  Search
  Spam detection
  Discovering communities on the web
  Classification/organization
 Predict the evolution of the Web
  Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: the probability that a node has x links is ≈ 1/x^α, with α ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in-degree(u) = k ]  ≈  1 / k^α ,   with α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: the probability that a node has x links is ≈ 1/x^α, with α ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:
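A small sketch of the gap (delta) encoding of a successor list, exploiting locality. The adjacency list below is made up; the first gap is taken relative to x itself and may be negative, and the zig-zag mapping used here for signed values is an assumption of this sketch (the same |v|·2−1 / v·2 trick appears later in the residual coding of extra-nodes).

def gaps(successors, x):
    # successor list S(x) = {s1-x, s2-s1-1, ..., sk-s(k-1)-1}
    gs = [successors[0] - x]
    for prev, cur in zip(successors, successors[1:]):
        gs.append(cur - prev - 1)
    return gs

def zigzag(v):
    # map a possibly negative first entry to a non-negative code (assumption of this sketch)
    return 2 * v if v >= 0 else 2 * (-v) - 1

succ = [13, 15, 16, 17, 18, 19, 23, 24, 203]     # hypothetical successors of node x = 15
g = gaps(succ, 15)
print(g)                                         # [-2, 1, 0, 0, 0, 0, 3, 0, 178]
print([zigzag(g[0])] + g[1:])                    # [3, 1, 0, 0, 0, 0, 3, 0, 178]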

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides and efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
            Emacs size    Emacs time
uncompr     27Mb          ---
gzip        8Mb           35 secs
zdelta      1.5Mb         42 secs
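To see the idea at work in a few lines, here is a sketch using Python's zlib with a preset dictionary: fknown is passed as zdict, so substrings already present in it cost almost nothing in the output. This only approximates the zdelta scheme described above (no explicit block moves), and the sample data are made up.

import zlib

def delta_compress(f_known: bytes, f_new: bytes) -> bytes:
    # compress f_new letting the compressor copy substrings from f_known
    c = zlib.compressobj(level=9, zdict=f_known)
    return c.compress(f_new) + c.flush()

def delta_decompress(f_known: bytes, f_delta: bytes) -> bytes:
    d = zlib.decompressobj(zdict=f_known)
    return d.decompress(f_delta) + d.flush()

old = b"the quick brown fox jumps over the lazy dog\n" * 100
new = old.replace(b"lazy", b"sleepy")
delta = delta_compress(old, new)
assert delta_decompress(old, delta) == new
print(len(new), len(zlib.compress(new, 9)), len(delta))   # raw vs gzip-like vs delta size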

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

(figure: a small weighted graph on a few files plus the dummy node 0; edge weights such as 620, 2000, 220, 123 towards the dummy and 20, 20, 2, 3 among files are the gzip/zdelta output sizes)

            space    time
uncompr     30Mb     ---
tgz         20%      linear
THIS        8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta executions. Nonetheless, still Θ(n²) time
            space    time
uncompr     260Mb    ---
tgz         12%      2 mins
THIS        8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
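A compact sketch of the single-roundtrip idea described above: the client sends one (weak, strong) hash pair per block of f_old; the server scans f_new and emits block references or literal bytes. The weak checksum below is a simple Adler-like sum recomputed at every position for clarity (rsync updates it in O(1) per shift), and the fixed block size is an arbitrary choice of this sketch.

import hashlib

B = 16   # block size (rsync's default is max{700, sqrt(n)} bytes)

def weak(block: bytes) -> int:
    a = sum(block) % 65521
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % 65521
    return (b << 16) | a

def signatures(f_old: bytes):
    # client side: one (weak, strong) pair per block of f_old
    sig = {}
    for k in range(0, len(f_old), B):
        blk = f_old[k:k + B]
        sig.setdefault(weak(blk), {})[hashlib.md5(blk).hexdigest()] = k
    return sig

def encode(f_new: bytes, sig):
    # server side: emit ('copy', offset-in-f_old) or ('lit', one byte)
    out, i = [], 0
    while i < len(f_new):
        blk = f_new[i:i + B]
        hit = sig.get(weak(blk), {}).get(hashlib.md5(blk).hexdigest())
        if len(blk) == B and hit is not None:
            out.append(("copy", hit)); i += B
        else:
            out.append(("lit", f_new[i:i + 1])); i += 1
    return out

def decode(ops, f_old: bytes) -> bytes:
    return b"".join(f_old[o:o + B] if t == "copy" else o for t, o in ops)

old = b"the cat sat on the mat and looked at the hat " * 4
new = old.replace(b"hat", b"rat")
assert decode(encode(new, signatures(old)), old) == new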

Rsync: some experiments

          gcc size    emacs size
total     27288       27326
gzip      7563        8577
zdelta    227         1431
rsync     964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends the hashes (unlike the client in rsync), and the client checks them.
The server deploys the common fref to compress the new ftar (rsync compresses just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
(figure: the suffix tree of T# = mississippi#; edge labels are substrings such as ssi, ppi#, si, i#, mississippi#, and the 12 leaves store the starting positions 1..12 of the suffixes)

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position in SUF(T) is the lexicographic position of P.
Storing SUF(T) explicitly would take Θ(N²) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Q(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time
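A sketch of the two binary searches (left and right boundary of the suffixes prefixed by P); each comparison inspects at most p characters of one suffix, as stated above. The suffix array is built with the naive construction, only for illustration.

def sa_search(T, SA, P):
    p = len(P)
    lo, hi = 0, len(SA)
    while lo < hi:                                # leftmost suffix >= P
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + p] < P:
            lo = mid + 1
        else:
            hi = mid
    left, hi = lo, len(SA)
    while lo < hi:                                # leftmost suffix not prefixed by P
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + p] <= P:
            lo = mid + 1
        else:
            hi = mid
    return [SA[i] + 1 for i in range(left, lo)]   # 1-based starting positions

T = "mississippi#"
SA = sorted(range(len(T)), key=lambda i: T[i:])
print(sorted(sa_search(T, SA, "si")))             # [4, 7]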

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O(p + log2 N + occ) time
Suffix Trays: O(p + log2 |Σ| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
• Take the min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
• Search for some Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
• Search for a window Lcp[i,i+C-2] whose entries are all ≥ L


Slide 236

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 10^4 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and the algorithm uses all of them:
(1/B) * (p * f/(1+f) * C) ≈ 30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

       4K    8K    16K   32K    128K   256K   512K   1M
n^3    22s   3m    26m   3.5h   28h    --     --     --
n^2    0     0     0     1s     26s    106s   7m     28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm
  sum = 0; max = -1;
  for i = 1, ..., n do
    if (sum + A[i] ≤ 0) then sum = 0;
    else { sum += A[i]; max = MAX{max, sum}; }

Note:
• sum < 0 right before OPT starts;
• sum > 0 within OPT
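The same scan in Python, also keeping track of the window boundaries (a detail the pseudocode leaves implicit); on the slide's array it returns sum 12 for the window 6 1 -2 4 3.

def max_subarray(A):
    best, best_lo, best_hi = float("-inf"), 0, 0
    s, lo = 0, 0
    for i, x in enumerate(A):
        if s + x <= 0:
            s, lo = 0, i + 1              # restart just after position i
        else:
            s += x
            if s > best:
                best, best_lo, best_hi = s, lo, i
    return best, best_lo, best_hi

print(max_subarray([2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]))   # (12, 2, 6), 0-based window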

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort takes Θ(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Θ(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs ≈ logM/B (N/M)

Optimal cost = Θ( (N/B) logM/B (N/M) ) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Can compression help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm
 Use a pair of variables <X, C>, initialized with the first item: X = A[1], C = 1
 For each subsequent item s of the stream:
   if (X == s) then C++
   else { C--; if (C == 0) { X = s; C = 1; } }
 Return X;
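The same scan in Python, run on the stream shown above (c occurs 9 > 17/2 times, and the returned candidate is indeed c):

def majority_candidate(stream):
    it = iter(stream)
    X, C = next(it), 1
    for s in it:
        if X == s:
            C += 1
        else:
            C -= 1
            if C == 0:
                X, C = s, 1
    return X            # the majority item, whenever one exists

A = "bacccdcbaaaccbccc"
print(majority_candidate(A), A.count("c"), len(A))   # c 9 17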

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix    (t = 500K terms, n = 1 million docs)

            Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony             1                1             0          0       0        1
Brutus             1                1             0          1       0        0
Caesar             1                1             0          1       1        1
Calpurnia          0                1             0          0       0        0
Cleopatra          1                0             0          0       0        0
mercy              1                0             1          1       1        1
worser             1                0             1          1       1        0

1 if the play contains the word, 0 otherwise
Space is 500Gb !

Solution 2: Inverted index

Brutus    → 2 4 8 16 32 64 128
Calpurnia → 1 13 16
Caesar    → 2 3 5 8 13 21 34

We can still do better: i.e. 30÷50% of the original text

1. Typically use about 12 bytes per posting
2. We have 10^9 total terms → at least 12Gb space
3. Compressing the 6Gb documents gets 1.5Gb data
Better index, but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the self-information of s is:

   i(s) = log2 ( 1/p(s) ) = − log2 p(s)

Lower probability → higher information

Entropy is the weighted average of i(s):

   H(S) = Σ_{s∈S} p(s) · log2 ( 1/p(s) )   bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the average length is defined as

   La(C) = Σ_{s∈S} p(s) · L[s]

We say that a prefix code C is optimal if for all prefix codes C’, La(C) ≤ La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any uniquely decodable code C, we have

   H(S) ≤ La(C)

Theorem (upper bound, Shannon). For any probability distribution, there exists a prefix code C such that

   La(C) ≤ H(S) + 1

(the Shannon code takes ⌈log2 1/p⌉ bits)

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?
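A minimal Huffman construction in Python with a heap; on the running example it produces codeword lengths 3, 3, 2, 1 (one of the "equivalent" trees, so the actual bits may differ from the slide's codewords).

import heapq
from itertools import count

def huffman_codes(probs):
    # probs: symbol -> probability; returns symbol -> codeword (a bit string)
    tie = count()                                      # breaks ties among equal probabilities
    heap = [(p, next(tie), {s: ""}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, t1 = heapq.heappop(heap)                # the two least probable trees
        p2, _, t2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in t1.items()}
        merged.update({s: "1" + c for s, c in t2.items()})
        heapq.heappush(heap, (p1 + p2, next(tie), merged))
    return heap[0][2]

print(huffman_codes({"a": .1, "b": .2, "c": .2, "d": .5}))
# e.g. {'a': '110', 'b': '111', 'c': '10', 'd': '0'}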

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
− log2(.999) ≈ .00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons

Strings are also numbers, H: strings → numbers.
Let s be a string of length m:

   H(s) = Σ_{i=1..m} 2^(m−i) · s[i]

P = 0101
H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5
s = s’ if and only if H(s) = H(s’)

Definition:
let Tr denote the m-length substring of T starting at position r (i.e., Tr = T[r, r+m−1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons

We can compute H(Tr) from H(Tr−1):

   H(Tr) = 2·H(Tr−1) − 2^m·T(r−1) + T(r+m−1)

T = 10110101
T1 = 1 0 1 1,  T2 = 0 1 1 0
H(T1) = H(1011) = 11
H(T2) = H(0110) = 2·11 − 2⁴·1 + 0 = 22 − 16 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P = 101111, q = 7
H(P) = 47,  Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally, reducing mod 7 at every step:
   1·2 + 0 (mod 7) = 2
   2·2 + 1 (mod 7) = 5
   5·2 + 1 (mod 7) = 4
   4·2 + 1 (mod 7) = 2
   2·2 + 1 (mod 7) = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(Tr−1), since 2^m (mod q) = 2·( 2^(m−1) (mod q) ) (mod q).
Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
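A sketch of the fingerprint scan on a binary text, with the rolling update done mod q and an explicit check on probable matches (the deterministic variant above); q is fixed here instead of being drawn at random, which is a simplification of this sketch.

def karp_rabin(T: str, P: str, q: int = 2_147_483_647):
    n, m = len(T), len(P)
    if m > n:
        return []
    pow_m = pow(2, m, q)                          # 2^m mod q, for the rolling update
    hp = ht = 0
    for i in range(m):                            # fingerprints of P and of T[1,m]
        hp = (2 * hp + int(P[i])) % q
        ht = (2 * ht + int(T[i])) % q
    occ = []
    for r in range(n - m + 1):
        if ht == hp and T[r:r + m] == P:          # verify, to rule out false matches
            occ.append(r + 1)                     # 1-based position, as in the slides
        if r + m < n:                             # H(T_{r+1}) = 2 H(T_r) - 2^m T[r] + T[r+m]
            ht = (2 * ht - pow_m * int(T[r]) + int(T[r + m])) % q
    return occ

print(karp_rabin("10110101", "0101"))             # [5]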

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

Example: T = california, P = for

         c a l i f o r n i a
  f      0 0 0 0 1 0 0 0 0 0
  o      0 0 0 0 0 1 0 0 0 0
  r      0 0 0 0 0 0 1 0 0 0

P occurs in T ending at position j iff M(m, j) = 1 (here at j = 7).

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M

We want to exploit bit-parallelism to compute the j-th column of M from the (j−1)-th one.
We define the m-length binary vector U(x) for each character x in the alphabet: U(x) is set to 1 at the positions in P where character x appears.

Example: P = abaac
   U(a) = 1 0 1 1 0
   U(b) = 0 1 0 0 0
   U(c) = 0 0 0 0 1

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, entry M(i,j) = 1 iff
 (1) the first i−1 characters of P match the i−1 characters of T ending at character j−1   ⇔ M(i−1, j−1) = 1
 (2) P[i] = T[j]   ⇔ the i-th bit of U(T[j]) = 1
BitShift moves bit M(i−1, j−1) into the i-th position; ANDing it with the i-th bit of U(T[j]) establishes whether both conditions hold.

An example: T = xabxabaaca, P = abaac
The columns of M are computed left to right as M(j) = BitShift(M(j−1)) & U(T[j]):

   M(1) = BitShift(00000) & U(x) = 10000 & 00000 = 00000
   M(2) = BitShift(M(1)) & U(a) = 10000 & 10110 = 10000
   M(3) = BitShift(M(2)) & U(b) = 11000 & 01000 = 01000
   ... and so on up to M(10).

The full matrix (rows = prefixes of P, columns = positions 1..10 of T):

           x a b x a b a a c a
  a        0 1 0 0 1 0 1 1 0 1
  ab       0 0 1 0 0 1 0 0 0 0
  aba      0 0 0 0 0 0 1 0 0 0
  abaa     0 0 0 0 0 0 0 1 0 0
  abaac    0 0 0 0 0 0 0 0 1 0

P occurs in T ending at position 9, where the last row holds a 1.
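The whole method fits in a few lines of Python, using an integer as the column of M (bit i−1 of the integer is row i); it reports an occurrence whenever the last row becomes 1, e.g. position 9 for the example above and position 7 for T = california, P = for.

def shift_and(T: str, P: str):
    m = len(P)
    U = {}
    for i, c in enumerate(P):                  # U[c] has bit i set iff P[i+1] = c
        U[c] = U.get(c, 0) | (1 << i)
    occ, M = [], 0
    for j, c in enumerate(T):
        M = ((M << 1) | 1) & U.get(c, 0)       # M(j) = BitShift(M(j-1)) & U(T[j])
        if M & (1 << (m - 1)):                 # last row set: an occurrence ends here
            occ.append(j + 1)                  # 1-based end position
    return occ

print(shift_and("xabxabaaca", "abaac"))        # [9]
print(shift_and("california", "for"))          # [7]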

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions

We want to allow the pattern to contain special symbols, like [a-f] classes of chars.

P = [a-b]baac
   U(a) = 1 0 1 1 0
   U(b) = 1 1 0 0 0
   U(c) = 0 0 0 0 1

What about ‘?’, ‘[^…]’ (not).

Problem 1: Another solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

 S is the concatenation of the patterns in P
 R is a bitmap of length m: R[i] = 1 iff S[i] is the first symbol of a pattern

 Use a variant of the Shift-And method searching for S:
  For any symbol c, U’(c) = U(c) AND R, so U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a pattern
  For any step j:
   compute M(j)
   then OR it with U’(T[j]). Why? This sets to 1 the first bit of each pattern that starts with T[j]
   Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

   BitShift( Ml(j−1) ) & U(T[j])

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

   BitShift( Ml−1(j−1) )

Computing Ml

 We compute Ml for all l = 0, …, k; for each j we compute M0(j), M1(j), …, Mk(j)
 For all l, initialize Ml(0) to the zero vector
 Combining the two cases above, there is a match iff

   Ml(j) = [ BitShift( Ml(j−1) ) & U(T[j]) ]  OR  BitShift( Ml−1(j−1) )

Example (k = 1): T = xabxabaaca, P = abaad

M1 =        x a b x a b a a c a
  a         1 1 1 1 1 1 1 1 1 1
  ab        0 0 1 0 0 1 0 1 1 0
  aba       0 0 0 1 0 0 1 0 0 1
  abaa      0 0 0 0 1 0 0 1 0 0
  abaad     0 0 0 0 0 0 0 0 1 0

M0 =        x a b x a b a a c a
  a         0 1 0 0 1 0 1 1 0 1
  ab        0 0 1 0 0 1 0 0 0 0
  aba       0 0 0 0 0 0 1 0 0 0
  abaa      0 0 0 0 0 0 0 1 0 0
  abaad     0 0 0 0 0 0 0 0 0 0

P occurs with at most 1 mismatch ending at position 9 (last row of M1 is 1 there), while it has no exact occurrence (last row of M0 is all zeros).
How much do we pay?





The running time is O( k n (1 + m/w) )
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = the minimum number of operations needed to transform p into s via three ops:

 Insertion: insert a symbol in p
 Deletion: delete a symbol from p
 Substitution: change a symbol in p with a different one

Example: d(ananas, banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding

   γ(x) = (Length − 1) zeroes, followed by the binary representation of x
   where x > 0 and Length = ⌊log2 x⌋ + 1

e.g., 9 is represented as <000, 1001>

γ-code for x takes 2⌊log2 x⌋ + 1 bits   (i.e. a factor of 2 from optimal)

Optimal for Pr(x) = 1/(2x²), and i.i.d. integers
It is a prefix-free encoding…


Given the following sequence of γ-coded integers, reconstruct the original sequence:

0001000001100110000011101100111

(the answer is 8, 6, 3, 59, 7)
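A short γ encode/decode sketch; decoding the exercise string above gives back 8, 6, 3, 59, 7.

def gamma_encode(x: int) -> str:
    b = bin(x)[2:]                    # binary representation of x, x > 0
    return "0" * (len(b) - 1) + b     # Length-1 zeroes, then x in binary

def gamma_decode(bits: str):
    out, i = [], 0
    while i < len(bits):
        l = 0
        while bits[i] == "0":         # unary part: Length-1 zeroes
            l += 1
            i += 1
        out.append(int(bits[i:i + l + 1], 2))
        i += l + 1
    return out

assert gamma_encode(9) == "0001001"
assert gamma_decode("0001000001100110000011101100111") == [8, 6, 3, 59, 7]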

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).

Recall that: |γ(i)| ≤ 2·log2 i + 1
How good is this approach wrt Huffman? Compression ratio ≤ 2·H0(s) + 1
Key fact:  1 ≥ Σ_{j=1..i} pj ≥ i·pi   ⇒   i ≤ 1/pi

How good is it ?
The cost of the encoding is (recall i ≤ 1/pi):

   Σ_{i=1..|Σ|} pi · |γ(i)|  ≤  Σ_{i=1..|Σ|} pi · [ 2·log2 (1/pi) + 1 ]  =  2·H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L
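The two steps, written directly in Python (the linear-time list update is enough for an illustration; the search-tree or hash-table organization discussed in a later slide brings each step to O(log |Σ|)):

def mtf_encode(text, alphabet):
    L = list(alphabet)
    out = []
    for s in text:
        i = L.index(s)                 # 1) output the position of s in L
        out.append(i)
        L.insert(0, L.pop(i))          # 2) move s to the front of L
    return out

def mtf_decode(codes, alphabet):
    L = list(alphabet)
    out = []
    for i in codes:
        s = L[i]
        out.append(s)
        L.insert(0, L.pop(i))
    return "".join(out)

print(mtf_encode("abbbaacccca", "abcd"))     # [0, 1, 0, 0, 1, 0, 2, 0, 0, 0, 1]
assert mtf_decode(mtf_encode("abbbaacccca", "abcd"), "abcd") == "abbbaacccca"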

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1^n 2^n 3^n … n^n  →  Huff = O(n² log n), MTF = O(n log n) + n²

No much worse than Huffman
...but it may be far better

MTF: how good is it ?
Encode the integers via γ-coding:  |γ(i)| ≤ 2·log2 i + 1
Put Σ at the front of the list and consider the cost of encoding:

   O(|Σ| log |Σ|) + Σ_{x=1..|Σ|} Σ_{i=2..nx} γ( p_{x,i} − p_{x,i−1} )

By Jensen’s inequality:

   ≤ O(|Σ| log |Σ|) + Σ_{x=1..|Σ|} nx · [ 2·log2 (N/nx) + 1 ]
   = O(|Σ| log |Σ|) + N · [ 2·H0(X) + 1 ]

Hence La[mtf] ≤ 2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1^n 2^n 3^n … n^n    (there is a memory)

Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.
1.0

c = .3

0.7

i -1

f (i ) 



p( j)

j 1

b = .5
0.2
0.0

f(a) = .0, f(b) = .2, f(c) = .7
a = .2

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
after ‘b’ the interval is [.2, .7)
after ‘a’ it shrinks to [.2, .3)
after ‘c’ it shrinks to [.27, .3)

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c1…cn with probabilities p[c], use:

   l0 = 0            s0 = 1
   li = li−1 + si−1 · f[ci]
   si = si−1 · p[ci]

f[c] is the cumulative prob. up to symbol c (not included).
The final interval size is

   sn = Π_{i=1..n} p[ci]

The interval for a message sequence will be called the sequence interval
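Using this recurrence on the running distribution (p(a)=.2, p(b)=.5, p(c)=.3, hence f(a)=0, f(b)=.2, f(c)=.7), a few lines reproduce the sequence interval [.27, .3) of the message bac:

p = {"a": 0.2, "b": 0.5, "c": 0.3}
f = {"a": 0.0, "b": 0.2, "c": 0.7}     # cumulative probability, symbol excluded

def sequence_interval(msg):
    l, s = 0.0, 1.0
    for c in msg:
        l = l + s * f[c]               # lower end moves inside the current interval
        s = s * p[c]                   # the interval shrinks by a factor p[c]
    return l, l + s

print(sequence_interval("bac"))        # approximately (0.27, 0.30)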

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
.49 lies in [.2, .7), the interval of b  →  first symbol b
within [.2, .7), .49 lies in [.3, .55), the sub-interval of b  →  second symbol b
within [.3, .55), .49 lies in [.475, .55), the sub-interval of c  →  third symbol c

The message is bbc.

Representing a real number
Binary fractional representation:
. 75

 . 11

1/ 3

 . 01 01

11 / 16

 . 1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
m in

m ax

interval

.11

.110

.111

[.75 ,1.0 )

.101

.1010

.1011

[.625 , .75 )

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that −log2 s + 1 = log2 (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most
   1 + ⌈ log2 (1/s) ⌉ = 1 + ⌈ log2 Π_i (1/pi) ⌉
   ≤ 2 + Σ_{i=1..n} log2 (1/pi)
   = 2 + Σ_{k=1..|Σ|} n·pk · log2 (1/pk)
   = 2 + n·H0   bits

In practice nH0 + 0.02·n bits, because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts     (k = 2)
String = ACCBACCACBA B    (B is the next symbol to be encoded)

Context (empty):  A = 4,  B = 2,  C = 5,  $ = 3

Context A:   C = 3,  $ = 1
Context B:   A = 2,  $ = 1
Context C:   A = 1,  B = 2,  C = 2,  $ = 3

Context AC:  B = 1,  C = 2,  $ = 2
Context BA:  C = 1,  $ = 1
Context CA:  C = 1,  $ = 1
Context CB:  A = 2,  $ = 1
Context CC:  A = 1,  B = 1,  $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
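The copy loop in Python, run on the triples of the window example above and on the overlapping case (d=2, len=9) just discussed:

def lz77_decode(triples):
    out = []
    for d, length, c in triples:
        start = len(out) - d
        for i in range(length):            # works even when length > d (overlap):
            out.append(out[start + i])     # copied chars become available as we go
        out.append(c)
    return "".join(out)

print(lz77_decode([(0, 0, "a"), (1, 1, "c"), (3, 4, "b"), (3, 3, "a"), (1, 2, "c")]))
# aacaacabcabaaac
print(lz77_decode([(0, 0, "a"), (0, 0, "b"), (0, 0, "c"), (0, 0, "d"), (2, 9, "e")]))
# abcdcdcdcdcdce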

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Q(n2 log n) time in the worst-case
• Q(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing the Web Graph

 We do not know which percentage of it we know
 The only way to discover the graph structure of the web as hypertext is via large-scale crawls
 Warning: the picture might be distorted by
  Size limitation of the crawl
  Crawling rules
  Perturbations of the "natural" process of birth and death of nodes and links

Why is it interesting?

 The largest artifact ever conceived by humankind
 Exploit the structure of the Web for
  Crawl strategies
  Search
  Spam detection
  Discovering communities on the web
  Classification/organization
 Predict the evolution of the Web
  Sociological understanding

Many other large graphs…

 Physical network graph
  V = Routers
  E = communication links

 The "cosine" graph (undirected, weighted)
  V = static web pages
  E = semantic distance between pages

 Query-Log graph (bipartite, weighted)
  V = queries and URLs
  E = (q,u) if u is a result for q, and has been clicked by some user who issued q

 Social graph (undirected, unweighted)
  V = users
  E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)
 V = URLs, E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN & no OUT)

Three key properties:
 Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1

The In-degree distribution

Altavista crawl 1999, WebBase crawl 2001: the in-degree follows a power-law distribution

  Pr[ in-degree(u) = k ]  ≈  1 / k^α ,   α ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)
 V = URLs, E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN, no OUT)

Three key properties:
 Skewed distribution: the probability that a node has x links is 1/x^α, α ≈ 2.1
 Locality: usually most of the hyperlinks point to other URLs on the same host (about 80%).
 Similarity: pages close in lexicographic order tend to share many outgoing lists

A Picture of the Web Graph

[Figure: adjacency-matrix plot, axes i and j]
21 million pages, 150 million links

URL-sorting (e.g. Berkeley, Stanford): URL compression + Delta encoding

The library WebGraph

Uncompressed adjacency list  →  Adjacency list with compressed gaps (locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries (only the first gap s1-x can be negative): map v ≥ 0 to 2v and v < 0 to 2|v|-1, as in the residual example further below.
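A small Python sketch of this gap encoding (function names are illustrative; the successor values below echo the residual example further down, and the non-negative mapping 2v / 2|v|-1 is the one used there):

def to_nonneg(v):                      # 0,-1,1,-2,2,... -> 0,1,2,3,4,...
    return 2 * v if v >= 0 else 2 * (-v) - 1

def encode_successors(x, succ):        # succ must be sorted increasingly
    # first gap is s1-x (may be negative), the others are s_{i+1}-s_i-1
    gaps = [to_nonneg(succ[0] - x)]
    gaps += [b - a - 1 for a, b in zip(succ, succ[1:])]
    return gaps

print(encode_successors(15, [13, 15, 16, 17, 18, 19, 23, 24, 203]))
# -> [3, 1, 0, 0, 0, 0, 3, 0, 178]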

Copy-lists

Reference chains, possibly limited in length

Uncompressed adjacency list  →  Adjacency list with copy lists (similarity)

Each bit of y's copy-list tells whether the corresponding successor of the reference x is also a successor of y;
the reference index is chosen in [0,W] so as to give the best compression.

Copy-blocks = RLE(Copy-list)

Adjacency list with copy lists  →  Adjacency list with copy blocks (RLE on bit sequences)

 The first copy block is 0 if the copy list starts with 0;
 The last block is omitted (we know the length…);
 The length is decremented by one for all blocks

This is a Java and C++ lib (≈3 bits/edge)
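A rough Python sketch of the copy-list → copy-blocks step (illustrative names; details such as the decrement-by-one of block lengths are left out, this only shows the run-length idea):

def copy_blocks(bits):
    # Run-length encode the copy-list bit string; by convention the first
    # block counts 1-bits, so it is 0 when the list starts with a 0-run,
    # and the last block is dropped since the total length is known.
    blocks, i, expect = [], 0, "1"
    while i < len(bits):
        run = 0
        while i < len(bits) and bits[i] == expect:
            run += 1
            i += 1
        blocks.append(run)
        expect = "0" if expect == "1" else "1"
    blocks.pop()                       # last block omitted
    return blocks

print(copy_blocks("01110011010"))      # -> [0, 1, 3, 2, 2, 1, 1]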

Extra-nodes: Compressing Intervals

Adjacency list with copy blocks  →  exploit consecutivity in the extra-nodes

 Intervals: encoded with their left extreme and their length
 Interval length: decremented by Lmin = 2
 Residuals: differences between consecutive residuals, or with respect to the source node

Worked residuals (source node 15):
 0 = (15-15)*2  (positive)
 2 = (23-19)-2  (jump >= 2)
 600 = (316-16)*2
 3 = |13-15|*2-1  (negative)
 3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background

[Diagram: sender → receiver over a network link; the receiver already holds some knowledge about the data.]

 network links are getting faster and faster, but
 many clients are still connected by fairly slow links (mobile?)
 people wish to send more and more data

How can we make this transparent to the user?

Two standard techniques

 caching: "avoid sending the same object again"
  done on the basis of objects
  only works if objects are completely unchanged
  How about objects that are slightly changed?

 compression: "remove redundancy in transmitted data"
  avoid repeated substrings in data
  can be extended to the history of past transmissions (overhead)
  What if the sender has never seen the data at the receiver?

Types of Techniques

 Common knowledge between sender & receiver
  Unstructured file: delta compression

 "Partial" knowledge
  Unstructured files: file synchronization
  Record-based data: set reconciliation

Formalization

 Delta compression  [diff, zdelta, REBL,…]
  Compress file f deploying file f'
  Compress a group of files
  Speed up web access by sending differences between the requested page and the ones available in cache

 File synchronization  [rsync, zsync]
  Client updates an old file f_old with f_new available on a server
  Mirroring, Shared Crawling, Content Distribution Networks

 Set reconciliation
  Client updates a structured old file f_old with f_new available on a server
  Update of contacts or appointments, intersection of inverted lists in a P2P search engine

Z-delta compression (one-to-one)

Problem: We have two files f_known and f_new, and the goal is to compute a file f_d of minimum size such that f_new can be derived from f_known and f_d.

 Assume that block moves and copies are allowed
 Find an optimal covering set of f_new based on f_known
 The LZ77 scheme provides an efficient, optimal solution:
  f_known is the "previously encoded text"; compress the concatenation f_known·f_new, emitting output only from f_new onward
 zdelta is one of the best implementations
          Emacs size   Emacs time
uncompr   27Mb         ---
gzip      8Mb          35 secs
zdelta    1.5Mb        42 secs
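This is not zdelta itself, but the same idea can be sketched with Python's zlib and its preset-dictionary (zdict) feature: compress f_new while letting the compressor copy substrings from the already-known file.

import zlib

def delta_compress(f_known: bytes, f_new: bytes) -> bytes:
    # f_known plays the role of the "previously encoded text"
    co = zlib.compressobj(level=9, zdict=f_known)
    return co.compress(f_new) + co.flush()

def delta_decompress(f_known: bytes, delta: bytes) -> bytes:
    do = zlib.decompressobj(zdict=f_known)
    return do.decompress(delta) + do.flush()

old = b"the quick brown fox jumps over the lazy dog\n" * 20
new = old.replace(b"lazy", b"sleepy")
d = delta_compress(old, new)
assert delta_decompress(old, d) == new
print(len(new), "->", len(d))        # the delta is far smaller than new alone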

Efficient Web Access

Dual-proxy architecture: a pair of proxies located on each side of the slow link use a proprietary protocol to increase performance over this link.

[Diagram: Client ↔ client-side proxy, slow link with delta-encoding, server-side proxy ↔ web; requests and reference pages flow between the proxies, the page is fetched over the fast link.]

Use zdelta to reduce traffic:
 Old version available at both proxies
 Restricted to pages already visited (30% hits), URL-prefix match
 Small cache

Cluster-based delta compression

Problem: We wish to compress a group of files F
 Useful on a dynamic collection of web pages, back-ups, …

 Apply pairwise zdelta: find for each f ∈ F a good reference

 Reduction to the Min Branching problem on DAGs (see the sketch after the figure below):
  Build a weighted graph G_F: nodes = files, weights = zdelta-sizes
  Insert a dummy node connected to all nodes, with edge weights equal to the gzip-coding sizes
  Compute the min branching = directed spanning tree of minimum total cost, covering G's nodes.

[Figure: an example weighted graph over a few files plus the dummy node; the edge weights shown include 620, 2000, 220, 123, 20, 20.]
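A hedged Python sketch of this reduction, assuming networkx's Edmonds-based minimum_spanning_arborescence; file names and weights are illustrative stand-ins for gzip/zdelta sizes, not measured values.

import networkx as nx

# dummy->f weighs gzip(f); f->g weighs zdelta(g given f).
G = nx.DiGraph()
G.add_edge("dummy", "f1", weight=620)
G.add_edge("dummy", "f2", weight=2000)
G.add_edge("dummy", "f3", weight=220)
G.add_edge("f1", "f2", weight=123)
G.add_edge("f2", "f1", weight=123)
G.add_edge("f1", "f3", weight=20)
G.add_edge("f3", "f1", weight=20)
G.add_edge("f2", "f3", weight=5)
G.add_edge("f3", "f2", weight=5)

# The min arborescence (rooted at the dummy) tells, for every file, whether
# to gzip it from scratch or to zdelta it against some reference file.
T = nx.minimum_spanning_arborescence(G)
print(sorted(T.edges()))   # -> [('dummy', 'f3'), ('f3', 'f1'), ('f3', 'f2')]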

          space   time
uncompr   30Mb    ---
tgz       20%     linear
THIS      8%      quadratic

Improvement: what about many-to-one compression? (group of files)

Problem: Constructing G is very costly: n² edge calculations (zdelta executions)
 We wish to exploit some pruning approach

 Collection analysis: Cluster the files that appear similar and are thus good candidates for zdelta-compression. Build a sparse weighted graph G'_F containing only edges between those pairs of files.

 Assign weights: Estimate appropriate edge weights for G'_F, thus saving zdelta executions. Nonetheless, strictly n² time.
          space    time
uncompr   260Mb    ---
tgz       12%      2 mins
THIS      8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem

[Diagram: the Client holds f_old, the Server holds f_new; the client sends a request and receives an update.]

 the client wants to update an out-dated file
 the server has the new file but does not know the old file
 update without sending the entire f_new (using similarity)
 rsync: file-synch tool, distributed with Linux

Delta compression is a sort of local synch, since the server has both copies of the files.

The rsync algorithm

[Diagram: the Client (f_old) sends block hashes to the Server (f_new); the Server replies with the encoded file.]

The rsync algorithm (contd)

 simple, widely used, single roundtrip
 optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
 choice of block size is problematic (default: max{700, √n} bytes)
 not good in theory: granularity of changes may disrupt use of blocks

(a toy sketch of the block-matching idea follows)
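A toy Python sketch of the block-matching idea only (no rolling checksum + MD5 pair, no gzip of the literals; helper names are hypothetical):

import hashlib

def block_hashes(f_old: bytes, B: int):
    # client side: hash every aligned B-byte block of the old file
    return {hashlib.md5(f_old[i:i + B]).hexdigest(): i
            for i in range(0, len(f_old), B)}

def encode(f_new: bytes, hashes, B: int):
    # server side: slide over the new file; emit block references or literals
    out, i = [], 0
    while i < len(f_new):
        h = hashlib.md5(f_new[i:i + B]).hexdigest()
        if len(f_new) - i >= B and h in hashes:
            out.append(("copy", hashes[h]))
            i += B
        else:
            out.append(("lit", f_new[i:i + 1]))
            i += 1
    return out

def decode(ops, f_old: bytes, B: int):
    return b"".join(f_old[v:v + B] if k == "copy" else v for k, v in ops)

old = b"abcdefgh" * 100
new = old[:300] + b"XYZ" + old[300:]
ops = encode(new, block_hashes(old, 16), 16)
assert decode(ops, old, 16) == new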

Rsync: some experiments

          gcc size   emacs size
total     27288      27326
gzip      7563       8577
zdelta    227        1431
rsync     964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

 The server sends the hashes (unlike the client in rsync); the client checks them.
 The server deploys the common f_ref to compress the new f_tar (rsync compresses just it).

A multi-round protocol

 k blocks of n/k elements
 log(n/k) levels
 If the distance is k, then on each level at most k hashes do not find a match in the other file.
 The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets S_A and S_B of integer values located on two machines A and B, determine the difference between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size k of the difference, where k may or may not be known in advance to the parties.

Note:
 set reconciliation is "easier" than file sync [it is record-based]
Not perfectly true but...

Recurring minimum for improving the estimate + 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts

Pattern P occurs at position i of T
iff
P is a prefix of the i-th suffix of T (i.e. T[i,N])

[Figure: T with P occurring at position i, as a prefix of T[i,N]]

Occurrences of P in T = all suffixes of T having P as a prefix

Example: P = si, T = mississippi  →  occurrences at positions 4 and 7

SUF(T) = sorted set of suffixes of T

Reduction: from substring search to prefix search

The Suffix Tree

[Figure: the suffix tree of T# = mississippi#; edges carry substring labels (e.g. i, s, si, ssi, ppi#, pi#, i#, #, mississippi#) and the 12 leaves carry the starting positions 1..12 of the corresponding suffixes.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic one of P.

Storing SUF(T) explicitly takes Θ(N²) space, so we keep only the suffix pointers:

SA    SUF(T)
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#

T = mississippi#   (P = si)

Suffix Array:
• SA: Θ(N log₂ N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison

T = mississippi#,  P = si
SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]

P is larger than the probed suffix; 2 accesses per step

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison

T = mississippi#,  P = si
SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]

P is smaller than the probed suffix

Suffix Array search
• O(log₂ N) binary-search steps
• Each step takes O(p) char comparisons
 overall, O(p log₂ N) time

Improved to O(p + log₂ N) [Manber-Myers, '90], and to O(p + log₂ |Σ|) [Cole et al., '06]
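A compact Python sketch of this indirect binary search (illustrative code; needs Python 3.10+ for bisect's key= argument; SA holds the 1-based positions used above):

from bisect import bisect_left, bisect_right

def sa_search(T, SA, P):
    # All suffixes prefixed by P are contiguous in SA; two binary searches
    # on the length-|P| prefixes of the suffixes delimit that range.
    key = lambda pos: T[pos - 1: pos - 1 + len(P)]
    lo = bisect_left(SA, P, key=key)
    hi = bisect_right(SA, P, key=key)
    return [SA[i] for i in range(lo, hi)]        # 1-based occurrence positions

T = "mississippi#"
SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(sa_search(T, SA, "si"))                    # -> [7, 4]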

Locating the occurrences

T = mississippi#,  P = si
SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]

Binary-search for the SA range of suffixes prefixed by P (conceptually, search for si# and si$, where # is smaller and $ larger than every char of Σ): the range contains the entries 7 (sippi…) and 4 (sissippi…), so occ = 2.

Suffix Array search
• O(p + log₂ N + occ) time

Suffix Trays: O(p + log₂ |Σ| + occ)   [Cole et al., '06]
String B-tree   [Ferragina-Grossi, '95]
Self-adjusting Suffix Arrays   [Ciriani et al., '02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

For T = mississippi#:
  SA  = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
  Lcp = [0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3]
(e.g. issippi# and ississippi#, adjacent in SA, share the prefix issi, so their Lcp entry is 4)

• How long is the common prefix between T[i,...] and T[j,...] ?
  • Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  • Search for some Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  • Search for a run Lcp[i,i+C-2] whose entries are all ≥ L
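A small Python sketch of the last two queries (the Lcp array is recomputed here from T and SA for self-containment; function names are illustrative):

def lcp(a, b):
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n

T = "mississippi#"
SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
Lcp = [lcp(T[SA[i] - 1:], T[SA[i + 1] - 1:]) for i in range(len(SA) - 1)]

# Repeated substring of length >= L?  <=>  some Lcp[i] >= L
print(any(v >= 4 for v in Lcp))                          # True ("issi" twice)

# Substring of length >= L occurring >= C times?
# <=> C-1 consecutive Lcp entries all >= L
L, C = 1, 4
print(any(all(v >= L for v in Lcp[i:i + C - 1])
          for i in range(len(Lcp) - (C - 1) + 1)))       # True ('i' occurs 4 times)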


Slide 237

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Q(n2 log n) time in the worst-case
• Q(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)
 V = URLs, E = (u,v) if u has a hyperlink to v
 Isolated URLs are ignored (no IN & no OUT)

Three key properties:
 Skewed distribution: the probability that a node has x links is 1/x^a, with a ≈ 2.1

The In-degree distribution
Altavista crawl (1999) and WebBase crawl (2001): the indegree follows a power-law distribution

  Pr[ in-degree(u) = k ]  ∝  1 / k^a ,   a ≈ 2.1

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed adjacency list vs. adjacency list with compressed gaps (exploits locality)

Successor list S(x) = {s_1 - x, s_2 - s_1 - 1, ..., s_k - s_(k-1) - 1}

For negative entries:
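As a concrete illustration of the gap idea above — this is a hypothetical sketch, not the WebGraph API, and the mapping of a negative first gap to a natural number is left out — a minimal Python version:

    def gaps(x, succ):
        # succ = sorted successor list of node x
        out = [succ[0] - x]                                   # first gap (may be negative)
        out += [succ[i] - succ[i-1] - 1 for i in range(1, len(succ))]
        return out

    def ungaps(x, g):
        succ = [x + g[0]]
        for d in g[1:]:
            succ.append(succ[-1] + d + 1)
        return succ

    # example: node 15 with successors 13,15,16,17,18,19,23,24,203
    print(gaps(15, [13, 15, 16, 17, 18, 19, 23, 24, 203]))
    # -> [-2, 1, 0, 0, 0, 0, 3, 0, 178]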

Copy-lists
Reference chains, possibly limited in length.

Uncompressed adjacency list vs. adjacency list with copy lists (exploits similarity):
each bit of y's copy-list tells whether the corresponding successor of the reference x is also a successor of y;
the reference index is chosen in [0,W] so as to give the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with copy lists vs. adjacency list with copy blocks (RLE on the bit sequences):
 The first copy block is 0 if the copy list starts with 0;
 the last block is omitted (we know the length…);
 the length is decremented by one for all blocks.

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression (one-to-one)

Problem: We have two files f_known and f_new, and the goal is to compute a file f_d of minimum size such that f_new can be derived from f_known and f_d.

 Assume that block moves and copies are allowed
 Find an optimal covering set of f_new based on f_known
 The LZ77-scheme provides an efficient, optimal solution: f_known is the "previously encoded text"; compress the concatenation f_known·f_new, starting from f_new.

zdelta is one of the best implementations
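A minimal way to play with this idea (not zdelta itself): zlib accepts a preset dictionary, so compressing f_new with f_known as dictionary lets the LZ77 parser copy from the known file. The data below are made up for illustration.

    import zlib

    def delta_compress(f_known: bytes, f_new: bytes) -> bytes:
        co = zlib.compressobj(level=9, zdict=f_known)   # f_known acts as "previously encoded text"
        return co.compress(f_new) + co.flush()

    def delta_decompress(f_known: bytes, f_d: bytes) -> bytes:
        do = zlib.decompressobj(zdict=f_known)
        return do.decompress(f_d) + do.flush()

    old = b"the quick brown fox jumps over the lazy dog " * 100
    new = old.replace(b"lazy", b"sleepy")
    d = delta_compress(old, new)
    assert delta_decompress(old, d) == new
    print(len(new), len(zlib.compress(new)), len(d))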
           Emacs size   Emacs time
uncompr    27Mb         ---
gzip       8Mb          35 secs
zdelta     1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

0

620

2

20

123

2000

20
220

3

1
5

space

time

uncompr

30Mb

---

tgz

20%

linear

THIS

8%

quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
space

time

uncompr

260Mb

---

tgz

12%

2 mins

THIS

8%

16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm (contd)

 simple, widely used, single roundtrip
 optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
 choice of block size is problematic (default: max{700, √n} bytes)
 not good in theory: the granularity of changes may disrupt the use of blocks
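A toy sketch of the block-matching core (Python's built-in hash stands in for rsync's rolling checksum + MD5, so this version rehashes each window instead of rolling in O(1); the block size and names are illustrative):

    B = 700                                     # block size

    def rsync_encode(old: bytes, new: bytes):
        table = {hash(old[i:i+B]): i for i in range(0, len(old) - B + 1, B)}
        ops, lit, i = [], bytearray(), 0
        while i + B <= len(new):
            h = hash(new[i:i+B])
            j = table.get(h)
            if j is not None and old[j:j+B] == new[i:i+B]:
                if lit: ops.append(("LIT", bytes(lit))); lit.clear()
                ops.append(("COPY", j))          # block already at the receiver
                i += B
            else:
                lit.append(new[i]); i += 1       # literal byte to be sent
        ops.append(("LIT", bytes(lit) + new[i:]))
        return ops

    def rsync_decode(old: bytes, ops):
        return b"".join(old[x:x+B] if t == "COPY" else x for t, x in ops)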

Rsync: some experiments

        gcc size   emacs size
total   27288      27326
gzip    7563       8577
zdelta  227        1431
rsync   964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends the hashes (unlike the client in rsync), and the client checks them.
The server deploys the common f_ref to compress the new f_tar (rsync just compresses it).

A multi-round protocol
 k blocks of n/k elements each, log(n/k) levels
 If the distance is k, then on each level at most k hashes do not find a match in the other file.
 The communication complexity is O(k log n log(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
(Figure: the compacted trie of all suffixes of T# = mississippi#; edges carry substring labels such as #, i, si, ssi, p, pi#, ppi#, i#, mississippi#, and the 12 leaves store the starting positions 1..12 of the corresponding suffixes.)
The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

Storing SUF(T) explicitly takes Θ(N^2) space; store only the suffix pointers.

T = mississippi#      (P = si)

 SA    SUF(T)
 12    #
 11    i#
  8    ippi#
  5    issippi#
  2    ississippi#
  1    mississippi#
 10    pi#
  9    ppi#
  7    sippi#
  4    sissippi#
  6    ssippi#
  3    ssissippi#

Suffix Array space:
 • SA: Θ(N log2 N) bits
 • Text T: N chars
 In practice, a total of 5N bytes
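A minimal Python sketch of the two ingredients above — naive suffix-array construction (explicit suffix sorting, fine for small texts) and the binary searches delimiting the SA range of suffixes prefixed by P:

    def suffix_array(T):
        return sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])

    def occurrences(T, SA, P):
        def bound(strict):
            lo, hi = 0, len(SA)
            while lo < hi:
                mid = (lo + hi) // 2
                pref = T[SA[mid] - 1: SA[mid] - 1 + len(P)]
                if pref < P or (strict and pref == P): lo = mid + 1
                else: hi = mid
            return lo
        return SA[bound(False): bound(True)]

    T = "mississippi#"
    SA = suffix_array(T)
    print(SA)                                # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
    print(sorted(occurrences(T, SA, "si")))  # [4, 7]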

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step.
(Example: P = si over T = mississippi#; each probe compares P against the suffix pointed to by the middle SA entry and learns whether P is larger or smaller.)

Suffix Array search
 • O(log2 N) binary-search steps
 • Each step takes O(p) char comparisons
 ⇒ overall, O(p log2 N) time
 Improvable to O(p + log2 N) [Manber-Myers, ’90] and to O(p + log2 |S|) [Cole et al, ’06]

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

T = mississippi#

 SA    suffix          Lcp with previous suffix
 12    #               -
 11    i#              0
  8    ippi#           1
  5    issippi#        1
  2    ississippi#     4
  1    mississippi#    0
 10    pi#             0
  9    ppi#            1
  7    sippi#          0
  4    sissippi#       2
  6    ssippi#         1
  3    ssissippi#      3

(e.g. the Lcp between issippi# and ississippi# is 4)

• How long is the common prefix between T[i,...] and T[j,...] ?
  It is the minimum of the subarray Lcp[h,k-1] such that SA[h]=i and SA[k]=j.
• Is there a repeated substring of length ≥ L ?
  Search for an entry Lcp[i] ≥ L.
• Is there a substring of length ≥ L occurring ≥ C times ?
  Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.
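A small sketch computing the Lcp array by direct comparison of adjacent suffixes (Kasai's algorithm would do it in O(N), but this keeps the idea visible) and answering the repeated-substring question:

    def lcp_array(T, SA):
        def lcp(i, j):
            a, b = T[i - 1:], T[j - 1:]
            k = 0
            while k < min(len(a), len(b)) and a[k] == b[k]:
                k += 1
            return k
        return [lcp(SA[i], SA[i + 1]) for i in range(len(SA) - 1)]

    T = "mississippi#"
    SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
    Lcp = lcp_array(T, SA)
    print(Lcp)                        # [0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3]
    L = 3
    print(any(v >= L for v in Lcp))   # True: "issi" repeats, length 4 >= 3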


Slide 238

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 10^4 * f/(1+f)
If we fetch B ≈ 4KB in time C, and the algorithm uses all of them:
(1/B) * (p * f/(1+f) * C)  ≈  30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray

 Goal: Given a stock and its D-performance over time, find the time window in which it achieved the best “market performance”.
 Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Running times of the naive solutions on growing inputs:

        4K    8K    16K    32K    128K   256K   512K   1M
 n^3    22s   3m    26m    3.5h   28h    --     --     --
 n^2    0     0     0      1s     26s    106s   7m     28m

An optimal solution
We assume every subsum ≠ 0.

(Figure: A splits into a prefix with negative running sum followed by the optimum window, inside which the running sum stays positive.)

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm (a runnable sketch follows below)
 sum = 0; max = -1;
 For i = 1,...,n do
   If (sum + A[i] ≤ 0) then sum = 0;
   else { sum += A[i]; max = MAX{max, sum}; }

Note:
 • Sum < 0 when OPT starts;
 • Sum > 0 within OPT
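A runnable version of the scan above, under the same assumption that at least one entry is positive:

    def max_subarray(A):
        best, s = float("-inf"), 0
        for x in A:
            if s + x <= 0:
                s = 0                 # the optimum cannot start inside a non-positive prefix
            else:
                s += x
                best = max(best, s)
        return best

    print(max_subarray([2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]))   # 12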

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort

Merge-Sort(A,i,j)
  if (i < j) then
    m = (i+j)/2;           // Divide
    Merge-Sort(A,i,m);     // Conquer
    Merge-Sort(A,m+1,j);
    Merge(A,i,m,j)         // Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

(Figure: the log2 N levels of the merge-sort recursion tree on a sample input; runs are repeatedly merged pairwise into longer and longer sorted runs.)

If the run-size is larger than B (i.e. after the first step!!), fetching all of it in memory for merging does not help.

How do we deploy the disk/memory features?
 Produce N/M runs, each sorted in internal memory (no I/Os);
 the I/O-cost for merging them pairwise is then ≈ 2 (N/B) log2 (N/M).

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
(Figure: X = M/B input buffers Bf1, …, BfX of B items, one per run, each with a cursor p_i; repeatedly move min(Bf1[p1], Bf2[p2], …, BfX[pX]) to the output buffer Bfo; fetch the next page of run i when p_i = B, and flush Bfo to the merged output run whenever it is full, until EOF.)
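A sketch of the in-memory core of one multiway merge (a heap over the current heads of the X runs; a real external merge would read and write pages of B items, and Python's heapq.merge already packages this pattern):

    import heapq

    def multiway_merge(runs):
        heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
        heapq.heapify(heap)
        out = []
        while heap:
            val, i, j = heapq.heappop(heap)      # min of the current buffer heads
            out.append(val)
            if j + 1 < len(runs[i]):
                heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
        return out

    print(multiway_merge([[1, 2, 5, 10], [2, 7, 8, 13], [3, 4, 11, 12]]))
    # [1, 2, 2, 3, 4, 5, 7, 8, 10, 11, 12, 13]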

Cost of Multi-way Merge-Sort

 Number of passes = log_(M/B) #runs  ≤  log_(M/B) (N/M)
 Optimal cost = Θ( (N/B) log_(M/B) (N/M) ) I/Os

 In practice
  M/B ≈ 1000  ⇒  #passes = log_(M/B) (N/M) ≈ 1
  One multiway merge  ⇒  2 passes = a few mins
  (tuning depends on the disk features)

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements

 Goal: Top queries over a stream of N items (S large).
 Math Problem: Find the item y whose frequency is > N/2, using the smallest space (i.e. the mode, provided it occurs > N/2 times).

A = b a c c c d c b a a a c c b c c c

Algorithm (majority vote; a runnable sketch follows below)
 Use a pair of variables <X,C>, with C initially 0
 For each item s of the stream,
   if (C == 0) then { X = s; C = 1; }
   else if (X == s) then C++;
   else C--;
 Return X;

Proof (and why the frequency must exceed N/2)
 If the returned X ≠ y, then every occurrence of y has been cancelled by a distinct “negative” mate, so the mates are ≥ #occ(y) and hence N ≥ 2 * #occ(y).
 Therefore, if #occ(y) > N/2 the algorithm must return y; with #occ(y) ≤ N/2 it can fail.
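The runnable sketch of the majority-vote scan (if the >N/2 guarantee may not hold, a second pass must verify the returned candidate):

    def majority_candidate(stream):
        X, C = None, 0
        for s in stream:
            if C == 0:
                X, C = s, 1
            elif X == s:
                C += 1
            else:
                C -= 1
        return X          # correct whenever some item occurs > N/2 times

    A = list("bacccdcbaaaccbccc")
    print(majority_candidate(A), A.count("c") > len(A) / 2)   # c True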

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix      (t = 500K terms, n = 1 million documents)

            Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
 Antony            1                1             0          0       0        1
 Brutus            1                1             0          1       0        0
 Caesar            1                1             0          1       1        1
 Calpurnia         0                1             0          0       0        0
 Cleopatra         1                0             0          0       0        0
 mercy             1                0             1          1       1        1
 worser            1                0             1          1       1        0

1 if the play contains the word, 0 otherwise.
Space is 500Gb !

Solution 2: Inverted index

 Brutus     →  2 4 8 16 32 64 128
 Calpurnia  →  1 2 3 5 8 13 21 34
 Caesar     →  13 16

1. Typically a posting uses about 12 bytes
2. We have 10^9 total term occurrences ⇒ at least 12Gb of space
3. Compressing the 6Gb of documents gets 1.5Gb of data
 Better index, but it is still >10 times the (compressed) text !!!!

We can still do better: down to 30÷50% of the original text.

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string is (losslessly) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM into fewer bits ?
NO: they are 2^n, but we have fewer compressed messages:

   sum_(i=1..n-1) 2^i  =  2^n - 2

We need to talk about stochastic sources.

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the self-information of s is:

   i(s) = log2 (1 / p(s)) = - log2 p(s)

Lower probability  ⇒  higher information

Entropy is the weighted average of i(s):

   H(S) = sum_(s in S) p(s) * log2 (1 / p(s))   bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword lengths L[s], the average length is defined as

   La(C) = sum_(s in S) p(s) * L[s]

We say that a prefix code C is optimal if for all prefix codes C’, La(C) ≤ La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any uniquely decodable code C, we have

   H(S) ≤ La(C)

Theorem (upper bound, Shannon). For any probability distribution, there exists a prefix code C such that

   La(C) ≤ H(S) + 1

(Shannon code: symbol s takes ⌈log2 1/p(s)⌉ bits)

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

(Huffman tree: merge a(.1)+b(.2) → (.3); merge (.3)+c(.2) → (.5); merge (.5)+d(.5) → (1).)

Resulting code: a = 000, b = 001, c = 01, d = 1
There are 2^(n-1) “equivalent” Huffman trees

What about ties (and thus, tree depth) ?
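A minimal sketch building a Huffman code with a heap for the running example (tie-breaking, and hence the exact bits, may differ from the slide's tree, but the codeword lengths 3,3,2,1 are the same):

    import heapq
    from itertools import count

    def huffman_code(probs):
        tick = count()                               # tie-breaker for equal probabilities
        heap = [(p, next(tick), {s: ""}) for s, p in probs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            p1, _, c1 = heapq.heappop(heap)          # two least probable trees
            p2, _, c2 = heapq.heappop(heap)
            merged = {s: "0" + w for s, w in c1.items()}
            merged.update({s: "1" + w for s, w in c2.items()})
            heapq.heappush(heap, (p1 + p2, next(tick), merged))
        return heap[0][2]

    print(huffman_code({"a": .1, "b": .2, "c": .2, "d": .5}))
    # {'d': '0', 'c': '10', 'a': '110', 'b': '111'}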

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?      [Moura et al, 98]

Compressed text derived from a word-based Huffman code:
 Symbols of the Huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged: each byte carries 1 tag bit + 7 bits of the code, and the tag marks the first byte of a codeword.

(Figure: T = “bzip or not bzip” encoded with byte-aligned, tagged codewords; e.g. the word “bzip” gets the 2-byte codeword written “1a 0b”, where a and b denote 7-bit configurations.)

CGrep and other ideas...
(Figure: GREP run directly on the compressed text C(T): the pattern P = bzip is encoded as its codeword “1a 0b” and searched byte-aligned in C(T); the tag bits rule out matches starting in the middle of a codeword, yielding yes/no at each candidate position.)

Speed ≈ Compression ratio

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary = { bzip, not, or } plus the space symbol, each word Huffman-coded as above.
P = bzip = 1a 0b

(Figure: the codeword of P is searched byte-aligned over the compressed text C(S) of S = “bzip or not bzip”, answering yes/no at each aligned position.)

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons

 Strings are also numbers: H maps strings → numbers.
 Let s be a string of length m:

   H(s) = sum_(i=1..m) 2^(m-i) * s[i]

 Example: P = 0101  ⇒  H(P) = 2^3*0 + 2^2*1 + 2^1*0 + 2^0*1 = 5
 s = s’ if and only if H(s) = H(s’)

Definition: let Tr denote the m-length substring of T starting at position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons

 We can compute H(Tr) from H(Tr-1):

   H(Tr) = 2 * H(Tr-1) - 2^m * T[r-1] + T[r+m-1]

 Example: T = 10110101, T1 = 1011, T2 = 0110
   H(T1) = H(1011) = 11
   H(T2) = 2*11 - 2^4 * 1 + 0 = 22 - 16 = 6 = H(0110)

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example: P = 101111, q = 7

 H(P) = 47,  Hq(P) = 47 (mod 7) = 5

 Hq(P) can be computed incrementally, one bit at a time:
   1*2 (mod 7) + 0 = 2
   2*2 (mod 7) + 1 = 5
   5*2 (mod 7) + 1 = 11 (mod 7) = 4
   4*2 (mod 7) + 1 = 9 (mod 7) = 2
   2*2 (mod 7) + 1 = 5 = Hq(P)

 We can still compute Hq(Tr) from Hq(Tr-1), using
   2^m (mod q) = 2 * (2^(m-1) (mod q)) (mod q)

 Intermediate values are also small! (< 2q)
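A sketch of the fingerprint scan over a binary text (here q is a fixed prime rather than a randomly chosen one, and every fingerprint match is verified, so no false matches are reported):

    def karp_rabin(T, P, q=(1 << 31) - 1):
        n, m = len(T), len(P)
        if m > n: return []
        pow_m = pow(2, m, q)                          # 2^m (mod q) for the rolling update
        hp = ht = 0
        for i in range(m):
            hp = (2 * hp + int(P[i])) % q
            ht = (2 * ht + int(T[i])) % q
        occ = []
        for r in range(n - m + 1):
            if hp == ht and T[r:r + m] == P:          # verification step
                occ.append(r + 1)                     # 1-based position, as in the slides
            if r + m < n:
                ht = (2 * ht - pow_m * int(T[r]) + int(T[r + m])) % q
        return occ

    print(karp_rabin("10110101", "0101"))   # [5]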

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary = { bzip, not, or } plus the space symbol.
P = bzip = 1a 0b

(Figure: the codeword of P is scanned byte-aligned over the compressed text C(S) of S = “bzip or not bzip”; each aligned comparison answers yes/no.)

Speed ≈ Compression ratio

The Shift-And method

 Define M to be a binary m by n matrix such that:
  M(i,j) = 1 iff the first i characters of P exactly match the i characters of T ending at character j,
  i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1 ... j]

 Example: T = california and P = for
  (M is a 3 x 10 matrix; e.g. M(1,5) = 1 since P[1] = f = T[5], M(2,6) = 1 since P[1,2] = fo = T[5,6], and M(3,7) = 1 since P[1,3] = for = T[5,7].)

 How does M solve the exact match problem? P occurs ending at position j iff M(m,j) = 1.

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M

 We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one.
 We define the m-length binary vector U(x) for each character x of the alphabet: U(x) has a 1 exactly in the positions where x appears in P.
 Example: P = abaac
   U(a) = [1,0,1,1,0]    U(b) = [0,1,0,0,0]    U(c) = [0,0,0,0,1]
How to construct M

 Initialize column 0 of M to all zeros.
 For j > 0, the j-th column is obtained by

   M(j) = BitShift( M(j-1) )  &  U( T[j] )

 For i > 1, entry M(i,j) = 1 iff
  (1) the first i-1 characters of P match the i-1 characters of T ending at character j-1  ⇔  M(i-1,j-1) = 1
  (2) P[i] = T[j]  ⇔  the i-th bit of U(T[j]) = 1
 BitShift moves bit M(i-1,j-1) into the i-th position; AND-ing with the i-th bit of U(T[j]) establishes whether both conditions hold. (A bit-parallel sketch follows below.)
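A sketch of Shift-And using a Python integer as the column (bit i-1 stands for row i), assuming m ≤ w:

    def shift_and(T, P):
        m = len(P)
        U = {}
        for i, c in enumerate(P):                 # U[c] has bit i set iff P[i+1] == c
            U[c] = U.get(c, 0) | (1 << i)
        M, occ = 0, []
        for j, c in enumerate(T, start=1):
            M = ((M << 1) | 1) & U.get(c, 0)      # BitShift, then AND with U(T[j])
            if M & (1 << (m - 1)):                # M(m,j) = 1: an occurrence ends at j
                occ.append(j - m + 1)
        return occ

    print(shift_and("california", "for"))    # [5]
    print(shift_and("xabxabaaca", "abaac"))  # [5]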

Examples (P = abaac, T = xabxabaaca):
 j=1: M(1) = BitShift(M(0)) & U(T[1]) = [1,0,0,0,0] & U(x) = [0,0,0,0,0]
 j=2: M(2) = BitShift(M(1)) & U(T[2]) = [1,0,0,0,0] & U(a) = [1,0,0,0,0]
 j=3: M(3) = BitShift(M(2)) & U(T[3]) = [1,1,0,0,0] & U(b) = [0,1,0,0,0]
 ...
 j=9: M(9) = BitShift(M(8)) & U(T[9]) = [1,1,0,0,1] & U(c) = [0,0,0,0,1], so M(5,9) = 1: an occurrence of P ends at position 9.

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions

 We want to allow the pattern to contain special symbols, like the class of chars [a-f].
 Example: P = [a-b]baac
   U(a) = [1,0,1,1,0]    U(b) = [1,1,0,0,0]    U(c) = [0,0,0,0,1]
 (Both U(a) and U(b) have the first bit set, since position 1 matches either character.)
 What about ‘?’, ‘[^…]’ (not)?

Problem 1: Another solution
Dictionary = { bzip, not, or } plus the space symbol.
P = bzip = 1a 0b

(Figure: the Shift-And scan of the codeword of P over the compressed text C(S) of S = “bzip or not bzip”, answering yes/no at each byte-aligned position.)

Speed ≈ Compression ratio

Problem 2
Dictionary = { bzip, not, or } plus the space symbol.

Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring.
P = o

(Figure: P = o occurs inside “or” = 1g 0a 0b and inside “not” = 1g 0g 0a, so both codewords must be searched for in C(S) of S = “bzip or not bzip”.)

Speed ≈ Compression ratio? No! Why?
A scan of C(S) is needed for each term that contains P.

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary = { bzip, not, or } plus the space symbol.

Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring, allowing at most k mismatches.
P = bot, k = 2

(Figure: the compressed text C(S) of S = “bzip or not bzip”; e.g. the term “not” matches P = bot within the allowed number of mismatches.)

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing M^k

 We compute M^l for all l = 0, …, k.
 For each j compute M(j), M^1(j), …, M^k(j).
 For all l initialize M^l(0) to the zero vector.
 In order to compute M^l(j), we observe that there is a match iff one of the following two cases holds.

Computing M^l: case 1
 The first i-1 characters of P match a substring of T ending at j-1 with at most l mismatches, and the next pair of characters in P and T are equal:

   BitShift( M^l(j-1) )  &  U( T[j] )

Computing M^l: case 2
 The first i-1 characters of P match a substring of T ending at j-1 with at most l-1 mismatches (the next pair of characters may then mismatch):

   BitShift( M^(l-1)(j-1) )

Putting the two cases together:

   M^l(j) = [ BitShift( M^l(j-1) ) & U( T[j] ) ]  OR  BitShift( M^(l-1)(j-1) )

Example
T = xabxabaaca, P = abaad, k = 1

(Figure: the matrices M^0 and M^1 for this instance; in particular M^1(5,9) = 1, i.e. P occurs ending at position 9 of T with at most 1 mismatch.)

How much do we pay?

 The running time is O(k n (1 + m/w))
 Again, the method is practically efficient for small m.
 Only O(k) columns of M are needed at any given time, hence the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary = { bzip, not, or } plus the space symbol.

Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring, allowing k mismatches.
P = bot, k = 2

(Figure: the compressed text C(S) of S = “bzip or not bzip”; the codeword of “not” = 1g 0g 0a is reported, since “not” matches P = bot within 2 mismatches.)

Agrep: more sophisticated operations

 The Shift-And method can solve other ops.
 The edit distance between two strings p and s is d(p,s) = minimum number of operations needed to transform p into s via three ops:
  Insertion: insert a symbol in p
  Deletion: delete a symbol from p
  Substitution: change a symbol in p with a different one
 Example: d(ananas, banane) = 3

 Search by regular expressions
  Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding

 γ(x) = (Length - 1) zeroes followed by x in binary, where x > 0 and Length = ⌊log2 x⌋ + 1
 e.g., 9 is represented as <000,1001>.
 The γ-code for x takes 2⌊log2 x⌋ + 1 bits (i.e. a factor of 2 from optimal).
 Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers.

It is a prefix-free encoding…
 Given the following sequence of γ-coded integers, reconstruct the original sequence:
   0001000001100110000011101100111
 Answer: 8, 6, 3, 59, 7
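A sketch of γ-encoding/decoding as bit strings, matching the exercise above:

    def gamma_encode(x):
        b = bin(x)[2:]                    # x in binary; len(b) = floor(log2 x) + 1
        return "0" * (len(b) - 1) + b     # Length-1 zeroes, then x in binary

    def gamma_decode(bits):
        out, i = [], 0
        while i < len(bits):
            z = 0
            while bits[i] == "0":         # leading zeroes = Length - 1
                z += 1; i += 1
            out.append(int(bits[i:i + z + 1], 2))
            i += z + 1
        return out

    print(gamma_encode(9))                                    # 0001001
    print(gamma_decode("0001000001100110000011101100111"))    # [8, 6, 3, 59, 7]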

Analysis
Sort the p_i in decreasing order, and encode symbol s_i via the variable-length code γ(i).

Recall that |γ(i)| ≤ 2 log i + 1.
How good is this approach wrt Huffman? Compression ratio ≤ 2 H0(s) + 1.
Key fact:  1 ≥ sum_(i=1..x) p_i ≥ x * p_x  ⇒  x ≤ 1/p_x

How good is it ?
The cost of the encoding is (recall i ≤ 1/p_i):

   sum_i p_i * |γ(i)|  ≤  sum_i p_i * [ 2 log (1/p_i) + 1 ]  =  2 H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms…. Can we do everything in one pass ?

 Move-to-Front (MTF): as a freq-sorting approximator, as a caching strategy, as a compressor
 Run-Length-Encoding (RLE): FAX compression

Move to Front Coding
Transforms a char sequence into an integer sequence, that can then be var-length coded
 Start with the list of symbols L = [a,b,c,d,…]
 For each input symbol s
  1) output the position of s in L
  2) move s to the front of L
There is a memory: it exploits temporal locality, and it is dynamic.
Properties:
 X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n^2 log n) bits, MTF = O(n log n) + n^2 bits
 Not much worse than Huffman ... but it may be far better
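A minimal MTF sketch; on the prefix "ipppssssssmmmii" of the BWT string used in the encoding example later on, with initial list [i,m,p,s], it reproduces the integer sequence 0 2 0 0 3 0 0 0 0 0 3 0 0 3 0:

    def mtf_encode(s, alphabet):
        L, out = list(alphabet), []
        for c in s:
            i = L.index(c)                # position of c in the current list
            out.append(i)
            L.pop(i); L.insert(0, c)      # move c to the front
        return out

    def mtf_decode(codes, alphabet):
        L, out = list(alphabet), []
        for i in codes:
            c = L[i]
            out.append(c)
            L.pop(i); L.insert(0, c)
        return "".join(out)

    print(mtf_encode("ipppssssssmmmii", "imps"))
    # [0, 2, 0, 0, 3, 0, 0, 0, 0, 0, 3, 0, 0, 3, 0]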

MTF: how good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2 log i + 1.
Put S at the front (cost O(|S| log |S|)) and encode each occurrence of a symbol as the gap from its previous occurrence:

   O(|S| log |S|) + sum_(x=1..|S|) sum_(i=2..n_x) |γ( p_x^i - p_x^(i-1) )|

By Jensen’s inequality this is

   ≤ O(|S| log |S|) + sum_(x=1..|S|) n_x * [ 2 log (N / n_x) + 1 ]
   = O(|S| log |S|) + N * [ 2 H0(X) + 1 ]

Hence La[mtf] ≤ 2 H0(X) + O(1).

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1n 2n 3n… nn 

There is a memory

Huff(X) = Q(n2 log n) > Rle(X) = Q(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.
1.0

c = .3

0.7

i -1

f (i ) 



p( j)

j 1

b = .5
0.2
0.0

f(a) = .0, f(b) = .2, f(c) = .7
a = .2

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
1.0

0.7
c = .3

0.7

c = .3
0.55

0.0

a = .2

0.3
0.2

c = .3

0.27
b = .5

b = .5
0.2

0.3

a = .2

b = .5
0.22

a = .2

0.2

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c_1 … c_n with probabilities p[c], use the following:

   l_0 = 0          l_i = l_(i-1) + s_(i-1) * f[c_i]
   s_0 = 1          s_i = s_(i-1) * p[c_i]

f[c] is the cumulative probability up to symbol c (not included).
The final interval size is  s_n = prod_(i=1..n) p[c_i].
The interval for a message sequence will be called the sequence interval.

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:
 .49 lies in [.2,.7), the symbol interval of b  →  first symbol b;
 within [.2,.7), .49 lies in [.3,.55), the b sub-interval  →  second symbol b;
 within [.3,.55), .49 lies in [.475,.55), the c sub-interval  →  third symbol c.
The message is bbc.

Representing a real number
Binary fractional representation:
. 75

 . 11

1/ 3

 . 01 01

11 / 16

 . 1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
m in

m ax

interval

.11

.110

.111

[.75 ,1.0 )

.101

.1010

.1011

[.625 , .75 )

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length
Note that -log s + 1 = log (2/s).

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most
   1 + ⌈log (1/s)⌉ = 1 + ⌈log prod_i (1/p_i)⌉
   ≤ 2 + sum_(i=1..n) log (1/p_i)
   = 2 + sum_(k=1..|S|) n p_k log (1/p_k)
   = 2 + n H0  bits

In practice ≈ nH0 + 0.02 n bits, because of rounding.

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts      (String = ACCBACCACBA, now encoding the next B;  k = 2)

Context Empty:   A = 4   B = 2   C = 5   $ = 3
Context A:       C = 3   $ = 1
Context B:       A = 2   $ = 1
Context C:       A = 1   B = 2   C = 2   $ = 3
Context AC:      B = 1   C = 2   $ = 2
Context BA:      C = 1   $ = 1
Context CA:      C = 1   $ = 1
Context CB:      A = 2   $ = 1
Context CC:      A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform   (1994)
Let us be given a text T = mississippi#.
Write down all the rotations of T, sort them, and take the last column L:

   rotations of T        sorted rotations     L
   mississippi#          #mississipp i
   ississippi#m          i#mississip p
   ssissippi#mi          ippi#missis s
   sissippi#mis          issippi#mis s
   issippi#miss          ississippi# m
   ssippi#missi    →     mississippi #
   sippi#missis          pi#mississi p
   ippi#mississ          ppi#mississ i
   ppi#mississi          sippi#missi s
   pi#mississip          sissippi#mi s
   i#mississipp          ssippi#miss i
   #mississippi          ssissippi#m i

   F = first column = # i i i i m p p s s s s
   L = last column  = i p s s m # p i s s i i

A famous example: much longer...

A useful tool: L → F mapping
(Same sorted-rotation matrix as above, with first column F and last column L.)

How do we map L’s chars onto F’s chars ?
... We need to distinguish equal chars in F...

Take two equal chars of L and rotate their rows rightward by one position: the rows keep the same relative order !!

The BWT is invertible
(F = # i i i i m p p s s s s is known; the middle of the matrix is unknown to the decoder; L = i p s s m # p i s s i i.)

Two key properties:
 1. The LF-array maps L’s chars to F’s chars
 2. L[ i ] precedes F[ i ] in T

Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
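A runnable sketch of both directions for small texts (forward BWT by sorting rotations, and InvertBWT via the LF mapping; it assumes # is the unique end-marker):

    def bwt(T):
        rots = sorted(range(len(T)), key=lambda i: T[i:] + T[:i])
        return "".join(T[i - 1] for i in rots)        # last char of each sorted rotation

    def invert_bwt(L, end="#"):
        n = len(L)
        first, tot = {}, 0
        for c in sorted(set(L)):                      # start of c's block in F
            first[c] = tot
            tot += L.count(c)
        seen, LF = {}, [0] * n
        for r, c in enumerate(L):                     # k-th c of L = k-th c of F
            LF[r] = first[c] + seen.get(c, 0)
            seen[c] = seen.get(c, 0) + 1
        T, r = [], L.index(end)                       # start from the row ending with #
        for _ in range(n):
            T.append(L[r])
            r = LF[r]
        return "".join(reversed(T))

    L = bwt("mississippi#")
    print(L)                  # ipssm#pissii
    print(invert_bwt(L))      # mississippi#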

How to compute the BWT ?

   SA    sorted rotation (last char = L)
   12    #mississipp  i
   11    i#mississip  p
    8    ippi#missis  s
    5    issippi#mis  s
    2    ississippi#  m
    1    mississippi  #
   10    pi#mississi  p
    9    ppi#mississ  i
    7    sippi#missi  s
    4    sissippi#mi  s
    6    ssippi#miss  i
    3    ssissippi#m  i

We said that L[i] precedes F[i] in T; for example L[3] = s = T[7], since SA[3] = 8.
Given SA and T, we have L[i] = T[SA[i]-1].

How to construct SA from T ?

   SA    suffix
   12    #
   11    i#
    8    ippi#
    5    issippi#
    2    ississippi#
    1    mississippi#
   10    pi#
    9    ppi#
    7    sippi#
    4    sissippi#
    6    ssippi#
    3    ssissippi#

Elegant but inefficient: sort the suffixes of T = mississippi# by direct comparison.
Obvious inefficiencies:
 • Θ(n^2 log n) time in the worst-case
 • Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in - degree ( u )  k ] 

a  2.1

1

k

a

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 million pages, 150 million links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y’s copy-list tells whether the corresponding successor of the reference x is also a
successor of y;
the reference index is chosen in [0,W] so as to give the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



What if the sender has never seen the data at the receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77 scheme provides an efficient, optimal solution




fknown is the “previously encoded text”: compress the concatenation fknown·fnew, starting from fnew

zdelta is one of the best implementations
            Emacs size    Emacs time
uncompr     27Mb          ---
gzip        8Mb           35 secs
zdelta      1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f ∈ F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

(Figure: example weighted graph over the files, with a dummy node connected to all; edge weights are the zdelta/gzip sizes, e.g. 20, 123, 220, 620, 2000.)

            space    time
uncompr     30Mb     ---
tgz         20%      linear
THIS        8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: constructing G is very costly, n² edge calculations (zdelta executions)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: estimate appropriate edge weights for G’F, thus saving
zdelta executions. Nonetheless, strictly n² time
            space    time
uncompr     260Mb    ---
tgz         12%      2 mins
THIS        8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks

Rsync: some experiments

          gcc size   emacs size
total     27288      27326
gzip      7563       8577
zdelta    227        1431
rsync     964        4452
Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends hashes (unlike the client in rsync), and the client checks them
The server deploys the common fref to compress the new ftar (rsync just compresses it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
#

0

s

i

1
ssi

ppi#

4

#

1

p

i

3

2

1

ppi#

6

pi#

7

8
5

T# = mississippi#
2 4 6 8 10

si

ppi#

i#

ppi#

11

mississippi#

12

2

1 10

9

4

3

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
5

Θ(N²) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Θ(N log₂ N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
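A small Python sketch (mine, not the slides’ code) of the two binary searches over SA; the naive SA construction is only for illustration, and 0-based starting positions are returned.

def sa_search(T, P):
    sa = sorted(range(len(T)), key=lambda i: T[i:])    # naive SA, for illustration only
    lo, hi = 0, len(sa)
    while lo < hi:                                     # leftmost suffix >= P
        mid = (lo + hi) // 2
        if T[sa[mid]:] < P:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:                                     # leftmost suffix whose P-length prefix > P
        mid = (lo + hi) // 2
        if T[sa[mid]:sa[mid] + len(P)] <= P:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])                        # all suffixes prefixed by P (Prop. 1)

print(sa_search("mississippi#", "si"))    # [3, 6] -> positions 4 and 7 in 1-based indexing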

Locating the occurrences
SA

occ=2

T = mississippi#

12
11
8
5
where
2
1
10
si#
9
7 sippi
4 sissippi
6
3
si$

4

7

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
• Search for an entry Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
• Search for a window Lcp[i,i+C-2] whose entries are all ≥ L


Slide 239

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 10⁴ * f/(1+f)
If we fetch B ≈ 4Kb in time C, and the algorithm uses all of them:
(1/B) * (p * f/(1+f) * C)  ≈  30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
input:   4K    8K    16K    32K     128K    256K    512K    1M
n³       22s   3m    26m    3.5h    28h     --      --      --
n²       0     0     0      1s      26s     106s    7m      28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT
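A Python sketch of the scan just described (a variant of Kadane’s algorithm); the function name is illustrative and the initialization follows the slide (max = -1, assuming every subsum ≠ 0).

def max_subarray_sum(A):
    best = -1                  # as on the slide
    s = 0
    for x in A:
        if s + x <= 0:
            s = 0              # a prefix with non-positive sum cannot start the optimum
        else:
            s += x
            best = max(best, s)
    return best

print(max_subarray_sum([2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]))   # 12, i.e. the subarray 6 1 -2 4 3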

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort ⇒ Θ(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 10⁹ random I/Os = 10⁹ × 5ms ≈ 2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
  if (i < j) then
     m = (i+j)/2;             // Divide
     Merge-Sort(A,i,m);       // Conquer
     Merge-Sort(A,m+1,j);
     Merge(A,i,m,j)           // Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n = 10⁹ tuples ⇒ a few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Θ(n log₂ n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
(Figure: Merge-Sort recursion tree on a sample array, log₂ N levels; every level reads and writes the whole array, i.e. 2 passes per level, and runs larger than M no longer fit in memory.)

How do we deploy the disk/memory features ?
N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF
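A toy Python sketch (not the slides’ code) of the merging scheme in the figure: one sorted run per input buffer, repeatedly extracting the minimum among the run heads; a real external-memory implementation would refill B-sized pages from disk and flush the output buffer when full.

import heapq

def multiway_merge(runs):               # runs = the N/M sorted runs (here, small in-memory lists)
    return list(heapq.merge(*runs))     # repeatedly extracts the minimum among the run heads

print(multiway_merge([[1, 2, 5, 10], [2, 7, 9, 13], [3, 4, 8, 19]]))
# [1, 2, 2, 3, 4, 5, 7, 8, 9, 10, 13, 19]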

Cost of Multi-way Merge-Sort


Number of passes = log_{M/B} #runs ≈ log_{M/B} (N/M)

Optimal cost = Θ( (N/B) · log_{M/B} (N/M) ) I/Os

In practice

M/B ≈ 1000  ⇒  #passes = log_{M/B} (N/M) ≈ 1
One multiway merge  ⇒  2 passes = a few minutes
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Can compression help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;
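A Python sketch of the two-variable scan above, with the initialization of the pair (X, C) made explicit; the helper name is mine.

def majority_candidate(stream):
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1        # adopt s as the new candidate
        elif X == s:
            C += 1
        else:
            C -= 1
    return X                   # equals the mode, provided it occurs > N/2 times

print(majority_candidate("bacccdcbaaaccbccc"))    # c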

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 · 10⁹ chars, size = 6Gb
n = 10⁶ documents
TotT = 10⁹ term occurrences (avg term length is 6 chars)
t = 5 · 10⁵ distinct terms

What kind of data structure should we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
t = 500K terms

             Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony              1               1              0          0       0        1
Brutus              1               1              0          1       0        0
Caesar              1               1              0          1       1        1
Calpurnia           0               1              0          0       0        0
Cleopatra           1               0              0          0       0        0
mercy               1               0              1          1       1        1
worser              1               0              1          1       1        0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus     →  2, 4, 8, 16, 32, 64, 128
Calpurnia  →  1, 2, 3, 5, 8, 13, 21, 34
Caesar     →  13, 16

We can still do better: i.e. 30–50% of the original text

1. Typically use about 12 bytes per posting
2. We have 10⁹ total terms ⇒ at least 12Gb space
3. Compressing the 6Gb of documents gets 1.5Gb of data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2^n but we have fewer compressed messages:

Σ_{i=1,...,n−1} 2^i  =  2^n − 2  <  2^n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i(s) = log₂ (1 / p(s)) = − log₂ p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H(S) = Σ_{s∈S} p(s) · log₂ (1 / p(s))   bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La(C) = Σ_{s∈S} p(s) · L[s]

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj ⇒ L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H(S) ≤ La(C)
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La(C) ≤ H(S) + 1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
(Tree: a(.1) and b(.2) merge into (.3); (.3) and c(.2) merge into (.5); (.5) and d(.5) merge into (1).)
a=000, b=001, c=01, d=1
There are 2^(n−1) “equivalent” Huffman trees

What about ties (and thus, tree depth) ?
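A short Python sketch (not the slides’ code) of the greedy construction on the running example; ties may be broken differently from the figure, yielding an equivalent optimal code with the same codeword lengths.

import heapq
from itertools import count

def huffman(probs):
    tick = count()                                  # tie-breaker, avoids comparing dicts
    heap = [(p, next(tick), {s: ""}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)             # the two least probable trees ...
        p2, _, c2 = heapq.heappop(heap)
        code = {s: "0" + w for s, w in c1.items()}
        code.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, next(tick), code))   # ... are merged
    return heap[0][2]

print(huffman({'a': .1, 'b': .2, 'c': .2, 'd': .5}))
# {'d': '0', 'c': '10', 'a': '110', 'b': '111'} -- same codeword lengths as a=000, b=001, c=01, d=1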

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
abc…  →  00000101
101001…  →  dcb
(using the code a=000, b=001, c=01, d=1 from the tree above)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
− log₂(0.999) ≈ 0.00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k → ∞ !!

In practice, we have:


Model takes |S|^k · (k · log |S|) + h² bits   (where h might be |S|)

It is H0(S^L) ≤ L · Hk(S) + O(k · log |S|), for each k ≤ L

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which arithmetic and bit operations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errs, whose running
time is efficient with high probability.

We will consider a binary alphabet (i.e., T ∈ {0,1}^n).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H(s) = Σ_{i=1,...,m} 2^(m−i) · s[i]

P = 0101
H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H(T_r) = 2·H(T_{r−1}) − 2^m · T[r−1] + T[r+m−1]

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H(T₁) = H(1011) = 11
H(T₂) = H(0110) = 2·11 − 2⁴·1 + 0 = 22 − 16 + 0 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1·2 + 0 (mod 7) = 2
2·2 + 1 (mod 7) = 5
5·2 + 1 (mod 7) = 4
4·2 + 1 (mod 7) = 2
2·2 + 1 (mod 7) = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(Tr−1):
2^m (mod q) = 2 · (2^(m−1) (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
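A Python sketch of the fingerprint scan over a binary text; here q is a fixed prime chosen only for illustration (the algorithm above picks a random prime ≤ I), and every fingerprint match is verified, i.e. this is the deterministic variant.

def karp_rabin(T, P, q=2_147_483_647):              # q: a fixed prime, for illustration only
    n, m = len(T), len(P)
    pow_m = pow(2, m, q)
    hp = ht = 0
    for i in range(m):
        hp = (2 * hp + int(P[i])) % q               # Hq(P)
        ht = (2 * ht + int(T[i])) % q               # Hq(T_1)
    occ = []
    for r in range(n - m + 1):
        if hp == ht and T[r:r + m] == P:            # verify, so no false matches are reported
            occ.append(r)
        if r + m < n:                               # roll the window one position to the right
            ht = (2 * ht - pow_m * int(T[r]) + int(T[r + m])) % q
    return occ

print(karp_rabin("10110101", "0101"))               # [4] (0-based; position 5 in 1-based indexing)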

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

The m×n matrix M for this example is:

        c  a  l  i  f  o  r  n  i  a
  f     0  0  0  0  1  0  0  0  0  0
  fo    0  0  0  0  0  1  0  0  0  0
  for   0  0  0  0  0  0  1  0  0  0

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit bit-parallelism to
compute the j-th column of M from the (j−1)-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1 at the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M(j) = BitShift( M(j−1) ) & U(T[j])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i−1,j−1) to the i-th position;
AND-ing this with the i-th bit of U(T[j]) establishes whether both conditions hold
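A compact Python sketch of this construction (mine, not the slides’), using machine integers as columns: bit i set in M(j) means that P[1..i+1] matches T ending at position j (0-based indices).

def shift_and(T, P):
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)           # U(c): positions of c in P
    M, occ = 0, []
    goal = 1 << (len(P) - 1)
    for j, c in enumerate(T):
        M = ((M << 1) | 1) & U.get(c, 0)        # BitShift(M) & U(T[j])
        if M & goal:
            occ.append(j - len(P) + 1)          # an occurrence of P ends at j
    return occ

print(shift_and("xabxabaaca", "abaac"))         # [4] (0-based start of the slides' occurrence)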

An example, j=1.  T = xabxabaaca, P = abaac, T[1] = x, U(x) = (0,0,0,0,0)ᵀ:
M(1) = BitShift(M(0)) & U(T[1]) = (1,0,0,0,0)ᵀ & (0,0,0,0,0)ᵀ = (0,0,0,0,0)ᵀ

An example, j=2.  T[2] = a, U(a) = (1,0,1,1,0)ᵀ:
M(2) = BitShift(M(1)) & U(T[2]) = (1,0,0,0,0)ᵀ & (1,0,1,1,0)ᵀ = (1,0,0,0,0)ᵀ

An example, j=3.  T[3] = b, U(b) = (0,1,0,0,0)ᵀ:
M(3) = BitShift(M(2)) & U(T[3]) = (1,1,0,0,0)ᵀ & (0,1,0,0,0)ᵀ = (0,1,0,0,0)ᵀ

An example, j=9.  T[9] = c, U(c) = (0,0,0,0,1)ᵀ:
M(9) = BitShift(M(8)) & U(T[9]) = (1,1,0,0,1)ᵀ & (0,0,0,0,1)ᵀ = (0,0,0,0,1)ᵀ

The whole matrix M for T = xabxabaaca, P = abaac (columns j = 1..9):

        j:  1  2  3  4  5  6  7  8  9
  a         0  1  0  0  1  0  1  1  0
  ab        0  0  1  0  0  1  0  0  0
  aba       0  0  0  0  0  0  1  0  0
  abaa      0  0  0  0  0  0  0  1  0
  abaac     0  0  0  0  0  0  0  0  1

M(5,9) = 1: an occurrence of P ends at position 9 of T.

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of length m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) AND R
 U’(c)[i] = 1 iff S[i] = c and S[i] is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P and the i characters of T up
through character j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

BitShift( M^l(j−1) ) & U(T[j])

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

BitShift( M^(l−1)(j−1) )

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M^l(j) = [ BitShift( M^l(j−1) ) & U(T[j]) ]  OR  BitShift( M^(l−1)(j−1) )
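A Python sketch (mine) extending the earlier Shift-And sketch to k mismatches via this recurrence; the integer M[l] holds column M^l(j) while scanning T.

def shift_and_k_mismatches(T, P, k):
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)
    goal = 1 << (len(P) - 1)
    M = [0] * (k + 1)
    occ = []
    for j, c in enumerate(T):
        prev = M[:]                                    # columns M^l(j-1)
        for l in range(k + 1):
            M[l] = ((prev[l] << 1) | 1) & U.get(c, 0)  # case 1: next characters match
            if l > 0:
                M[l] |= ((prev[l - 1] << 1) | 1)       # case 2: spend one more mismatch
        if M[k] & goal:
            occ.append(j - len(P) + 1)
    return occ

print(shift_and_k_mismatches("aatatccacaa", "atcgaa", 2))  # [3] (0-based; position 4 in the slides)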

Example M1.  T = xabxabaaca, P = abaad, k = 1.

M0 =            j:  1  2  3  4  5  6  7  8  9 10
      a             0  1  0  0  1  0  1  1  0  1
      ab            0  0  1  0  0  1  0  0  0  0
      aba           0  0  0  0  0  0  1  0  0  0
      abaa          0  0  0  0  0  0  0  1  0  0
      abaad         0  0  0  0  0  0  0  0  0  0

M1 =            j:  1  2  3  4  5  6  7  8  9 10
      a             1  1  1  1  1  1  1  1  1  1
      ab            0  0  1  0  0  1  0  1  1  0
      aba           0  0  0  1  0  0  1  0  0  1
      abaa          0  0  0  0  1  0  0  1  0  0
      abaad         0  0  0  0  0  0  0  0  1  0

M1(5,9) = 1: P occurs at T[5..9] = abaac with one mismatch.

How much do we pay?





The running time is O(k·n·(1 + m/w)).
Again, the method is practically efficient for
small m.
Still, only O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum number of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
γ(x) = 0 0 … 0 (Length−1 zeros), followed by x in binary

x > 0 and Length = ⌊log₂ x⌋ + 1
e.g., 9 is represented as <000, 1001>

The γ-code for x takes 2⌊log₂ x⌋ + 1 bits

(i.e. a factor of 2 from optimal)

Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8

6

3

59

7
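A Python sketch of γ-encoding/decoding over bit strings (the helper names are mine); decoding the sequence above reproduces 8, 6, 3, 59, 7.

def gamma_encode(x):
    assert x > 0
    b = bin(x)[2:]                           # x in binary, starts with 1
    return '0' * (len(b) - 1) + b            # (Length-1) zeros, then x in binary

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == '0':                # count the leading zeros
            z += 1; i += 1
        out.append(int(bits[i:i + z + 1], 2))   # the next z+1 bits are x in binary
        i += z + 1
    return out

print(gamma_encode(9))                                     # 0001001
print(gamma_decode("0001000001100110000011101100111"))     # [8, 6, 3, 59, 7]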

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |γ(i)| ≤ 2·log i + 1
How good is this approach w.r.t. Huffman?
Compression ratio ≤ 2·H0(s) + 1
Key fact:
1 ≥ Σ_{i=1,...,x} p_i ≥ x·p_x   ⇒   x ≤ 1/p_x

How good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1
The cost of the encoding is (recall i ≤ 1/p_i):

Σ_{i=1,...,|S|} p_i · |γ(i)|  ≤  Σ_{i=1,...,|S|} p_i · [ 2·log(1/p_i) + 1 ]  =  2·H0(X) + 1

Not much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/i^q, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s·c with 2 bytes, s·c² on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 128² = 16512 words on 2 bytes
A (230,26)-dense code encodes 230 + 230·26 = 6210 words on 2
bytes, hence more on 1 byte, and thus it wins if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1ⁿ 2ⁿ 3ⁿ … nⁿ  ⇒  Huff = O(n² log n), MTF = O(n log n) + n²

Not much worse than Huffman
...but it may be far better
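A minimal Python sketch of the MTF transform (0-based positions, small toy alphabet; the slides’ bzip example uses the list [i,m,p,s] and handles ‘#’ specially). Note how temporal locality turns repeated symbols into runs of zeros.

def mtf_encode(s, alphabet):
    L = list(alphabet)
    out = []
    for c in s:
        i = L.index(c)                 # 1) output the position of c in L
        out.append(i)
        L.pop(i)
        L.insert(0, c)                 # 2) move c to the front of L
    return out

print(mtf_encode("aaabbbbccc", "abcd"))    # [0, 0, 0, 1, 0, 0, 0, 2, 0, 0]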

MTF: how good is it ?
Encode the integers via γ-coding: |γ(i)| ≤ 2·log i + 1
Put S at the front and consider the cost of encoding:

O(|S| log |S|)  +  Σ_{x=1,...,|S|}  Σ_{i=2,...,n_x}  γ( p_{x,i} − p_{x,i−1} )

By Jensen’s inequality:

≤  O(|S| log |S|)  +  Σ_{x=1,...,|S|}  n_x · [ 2·log(N/n_x) + 1 ]
=  O(|S| log |S|)  +  N · [ 2·H0(X) + 1 ]

Hence  La[mtf] ≤ 2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1ⁿ 2ⁿ 3ⁿ … nⁿ  ⇒

There is a memory

Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.  p(a) = .2, p(b) = .5, p(c) = .3

f(i) = Σ_{j<i} p(j):   f(a) = .0, f(b) = .2, f(c) = .7

Symbol intervals:  a → [0, .2),  b → [.2, .7),  c → [.7, 1.0)

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac

b:  [0, 1)      →  [.2, .7)
a:  [.2, .7)    →  [.2, .3)
c:  [.2, .3)    →  [.27, .3)

The final sequence interval is [.27, .3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l₀ = 0,   s₀ = 1
l_i = l_{i−1} + s_{i−1} · f[c_i]
s_i = s_{i−1} · p[c_i]

f[c] is the cumulative probability up to symbol c (not included)
The final interval size is  s_n = ∏_{i=1,...,n} p[c_i]

The interval for a message sequence will be called the
sequence interval
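A tiny Python sketch of these recurrences on the running example; plain floats are used only for illustration (a real coder uses the integer version discussed later).

p = {'a': 0.2, 'b': 0.5, 'c': 0.3}
f = {'a': 0.0, 'b': 0.2, 'c': 0.7}

def sequence_interval(msg):
    l, s = 0.0, 1.0
    for c in msg:
        l = l + s * f[c]      # l_i = l_{i-1} + s_{i-1} * f[c_i]
        s = s * p[c]          # s_i = s_{i-1} * p[c_i]
    return l, l + s

print(sequence_interval("bac"))   # ≈ (0.27, 0.3): the sequence interval [.27, .3), up to float noise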

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:

.49 ∈ [.2, .7)     →  b,  new interval [.2, .7)
.49 ∈ [.3, .55)    →  b,  new interval [.3, .55)
.49 ∈ [.475, .55)  →  c

The message is bbc.

Representing a real number
Binary fractional representation:
.75 = .11       1/3 = .01 01 01 …       11/16 = .1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
           min      max       interval
 .11       .110     .111…     [.75, 1.0)
 .101      .1010    .1011…    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
The problem is that operations on arbitrary-
precision real numbers are expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l ≥ R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l ≥ R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
String = ACCBACCACBA, next symbol B,   k = 2

Context     Counts
(empty)     A = 4   B = 2   C = 5   $ = 3
A           C = 3   $ = 1
B           A = 2   $ = 1
C           A = 1   B = 2   C = 2   $ = 3
AC          B = 1   C = 2   $ = 2
BA          C = 1   $ = 1
CA          C = 1   $ = 1
CB          A = 2   $ = 1
CC          A = 1   B = 1   $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ algorithms are asymptotically optimal, i.e. their
compression ratio goes to H(S) as n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
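A Python sketch of the decoder for (d, len, c) triples (mine, not gzip’s code), copying one character at a time so that overlapping copies work; decoding the windowed example above reproduces the original string.

def lz77_decode(triples):
    out = []
    for d, length, c in triples:
        start = len(out) - d
        for i in range(length):
            out.append(out[start + i])   # may read what was just written (overlap, len > d)
        out.append(c)
    return ''.join(out)

print(lz77_decode([(0,0,'a'), (1,1,'c'), (3,4,'b'), (3,3,'a'), (1,2,'c')]))
# aacaacabcabaaac  (the windowed example above)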

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!
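A Python sketch of LZW encoding with a plain dict as dictionary; for illustration the initial dictionary holds just a, b, c (codes 0, 1, 2) rather than the 256 ASCII entries, so the emitted codes differ from the slide’s (but the parsing is the same).

def lzw_encode(T, alphabet):
    dic = {c: i for i, c in enumerate(alphabet)}
    out, S = [], ""
    for c in T:
        if S + c in dic:
            S += c                      # extend the current match
        else:
            out.append(dic[S])          # emit the code of S (no extra char, unlike LZ78)
            dic[S + c] = len(dic)       # add Sc to the dictionary
            S = c
    if S:
        out.append(dic[S])
    return out

print(lzw_encode("aabaacababacb", "abc"))   # [0, 0, 1, 3, 2, 4, 8, 2, 1]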

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us be given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
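A Python sketch of the inversion above (mine): LF is built by stably sorting L, T is rebuilt backward, and — assuming the unique terminator ‘#’ — the resulting rotation is realigned so that ‘#’ ends up last.

def invert_bwt(L):
    order = sorted(range(len(L)), key=lambda i: (L[i], i))   # stable sort of L gives F
    LF = [0] * len(L)
    for f_pos, l_pos in enumerate(order):                    # LF maps row i of L to its row in F
        LF[l_pos] = f_pos
    out, r = [], 0
    for _ in range(len(L)):
        out.append(L[r])                                     # L[r] precedes F[r] in T
        r = LF[r]
    rot = ''.join(reversed(out))                             # the rotation of T starting at '#'
    return rot[1:] + rot[0]                                  # put the terminator back at the end

print(invert_bwt("ipssm#pissii"))                            # mississippi#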

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Q(n2 log n) time in the worst-case
• Q(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in - degree ( u )  k ] 

a  2.1

1

k

a

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)
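A minimal sketch, in Python, of the gap-encoding step used for locality (this is not the WebGraph code, just an illustration of the successor-list formula above; the folding of the first, possibly negative, gap follows the 2x / 2|x|-1 rule shown in the worked residual examples below):

    # Toy gap encoding of a successor list S(x), as in the slide:
    # S(x) = { s1-x, s2-s1-1, ..., sk-s(k-1)-1 }, first entry folded to a non-negative int.
    def fold(v: int) -> int:
        return 2 * v if v >= 0 else 2 * abs(v) - 1

    def encode_successors(x: int, succ):
        succ = sorted(succ)
        gaps = [fold(succ[0] - x)]            # only the first gap can be negative
        for prev, s in zip(succ, succ[1:]):
            gaps.append(s - prev - 1)         # later gaps are >= 0 by construction
        return gaps

    # e.g. node 15 with successors {13, 15, 16, 17, 19, 23} -> [3, 1, 0, 0, 1, 3]
    print(encode_successors(15, [13, 15, 16, 17, 19, 23]))

The real format then layers copy-blocks and interval encoding (similarity and consecutivity) on top of these gaps.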

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
[Figure: a sender transmits data to a receiver over a network link; the receiver may already hold some knowledge about the data]

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



What if the sender has never seen the data at the receiver?

(overhead)

Types of Techniques

Common knowledge between sender & receiver
  Unstructured file: delta compression

"Partial" knowledge
  Unstructured files: file synchronization
  Record-based data: set reconciliation

Formalization

Delta compression   [diff, zdelta, REBL, …]
  Compress file f deploying file f'
  Compress a group of files
  Speed-up web access by sending differences between the requested page and the ones available in cache

File synchronization   [rsync, zsync]
  Client updates old file f_old with f_new available on a server
  Mirroring, Shared Crawling, Content Distribution Networks

Set reconciliation
  Client updates structured old file f_old with f_new available on a server
  Update of contacts or appointments, intersect inverted lists in a P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files f_known and f_new, and the goal is to compute a
file f_d of minimum size such that f_new can be derived from f_known and f_d


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77 scheme provides an efficient, optimal solution




f_known plays the role of the "previously encoded text": compress the concatenation f_known · f_new, starting the parsing from f_new

zdelta is one of the best implementations
            Emacs size   Emacs time
uncompr     27Mb         ---
gzip        8Mb          35 secs
zdelta      1.5Mb        42 secs
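A minimal sketch of the same idea in Python (this is not the zdelta tool: it uses zlib's preset-dictionary feature, so that LZ77 copies can reach into f_known; the data below are made-up):

    import os, zlib

    def delta_compress(f_known: bytes, f_new: bytes) -> bytes:
        c = zlib.compressobj(level=9, zdict=f_known)     # f_known = "previously encoded text"
        return c.compress(f_new) + c.flush()

    def delta_decompress(f_known: bytes, f_delta: bytes) -> bytes:
        d = zlib.decompressobj(zdict=f_known)
        return d.decompress(f_delta)

    old = os.urandom(20000)                              # incompressible "old" file
    new = old[:10000] + b"PATCH" + old[10000:]           # slightly changed file
    fd = delta_compress(old, new)
    assert delta_decompress(old, fd) == new
    print(len(new), len(zlib.compress(new, 9)), len(fd)) # the delta is far smaller than plain compression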

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

[Figure: Client ↔ Proxy over the slow link, exchanging the request and the delta-encoding of the page against a shared reference; Proxy ↔ web over the fast link, exchanging the plain request and page]

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all files, whose edge weights are the gzip-compressed sizes



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: example weighted graph over a few files plus the dummy node; edge weights are the zdelta/gzip sizes (e.g. 20, 123, 220, 620, 2000)]

            space   time
uncompr     30Mb    ---
tgz         20%     linear
THIS        8%      quadratic
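A rough sketch of the scheme in Python, under two stated assumptions: networkx is available for the min-branching step (Edmonds' algorithm), and zlib with a preset dictionary stands in for zdelta when weighting the edges:

    import zlib
    import networkx as nx

    def zdelta_size(ref: bytes, target: bytes) -> int:
        c = zlib.compressobj(level=9, zdict=ref)
        return len(c.compress(target) + c.flush())

    def reference_tree(files: dict):
        G = nx.DiGraph()
        G.add_node("dummy")
        for name, data in files.items():
            # dummy -> file edge: cost of compressing the file on its own (gzip-like)
            G.add_edge("dummy", name, weight=len(zlib.compress(data, 9)))
            for other, odata in files.items():
                if other != name:
                    # other -> file edge: cost of delta-compressing file against other
                    G.add_edge(other, name, weight=zdelta_size(odata, data))
        # minimum spanning arborescence = min branching covering all nodes
        return nx.minimum_spanning_arborescence(G, attr="weight")

The returned tree says, for every file, whether to gzip it (parent = dummy) or to delta-encode it against another file (parent = that file).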

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly: n² edge calculations (zdelta executions)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G'_F, thus saving
zdelta executions. Nonetheless, this still takes n² time
            space   time
uncompr     260Mb   ---
tgz         12%     2 mins
THIS        8%      16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
[Figure: the Client holds f_old, the Server holds f_new; the client sends a request and receives an update]

client wants to update an out-dated file
server has the new file but does not know the old file
update without sending the entire f_new (using similarity)
rsync: file-synch tool, distributed with Linux

Delta compression is a sort of "local" synch, since the server has both copies of the files

The rsync algorithm
[Figure: the Client sends the per-block hashes of f_old; the Server sends back the encoded file, built from f_new]

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
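A toy, single-round sketch of the idea in Python (not the real rsync protocol: the weak hash here is a plain byte sum rather than the rolling Adler-style checksum, and it is recomputed from scratch instead of being updated in O(1) per shift):

    import hashlib

    B = 700                                   # block size (rsync default: max(700, sqrt(n)))

    def weak(block: bytes) -> int:            # toy "rolling" hash
        return sum(block) % (1 << 16)

    def client_hashes(f_old: bytes):
        blocks = [f_old[i:i+B] for i in range(0, len(f_old), B)]
        return {(weak(b), hashlib.md5(b).hexdigest()): i for i, b in enumerate(blocks)}

    def server_encode(f_new: bytes, hashes):
        ops, lit, i = [], bytearray(), 0
        while i + B <= len(f_new):
            blk = f_new[i:i+B]
            key = (weak(blk), hashlib.md5(blk).hexdigest())
            if key in hashes:                 # block of f_old found inside f_new
                if lit: ops.append(("lit", bytes(lit))); lit = bytearray()
                ops.append(("copy", hashes[key])); i += B
            else:                             # no match: emit one literal byte and slide by 1
                lit.append(f_new[i]); i += 1
        lit += f_new[i:]
        if lit: ops.append(("lit", bytes(lit)))
        return ops

    def client_decode(f_old: bytes, ops) -> bytes:
        out = bytearray()
        for kind, val in ops:
            out += f_old[val*B:(val+1)*B] if kind == "copy" else val
        return bytes(out)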

Rsync: some experiments

          gcc size   emacs size
total     27288      27326
gzip      7563       8577
zdelta    227        1431
rsync     964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends the hashes (unlike rsync, where the client does), and the client checks them.
The server deploys the common f_ref to compress the new f_tar (rsync just compresses it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree

T# = mississippi#
[Figure: the suffix tree of T#, i.e. the compacted trie of all its suffixes; edges carry substrings such as "ssi", "si", "ppi#", "pi#", "i#", "mississippi#", and each leaf stores the starting position (1..12) of its suffix]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Θ(N²) space if the suffixes in SUF(T) are stored explicitly
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
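A direct sketch of this O(p log2 N) search in Python, for small texts (the naive SA construction below is only for illustration; positions are 0-based, while the slides count from 1):

    def suffix_array(T: str):
        return sorted(range(len(T)), key=lambda i: T[i:])     # naive, fine for toy inputs

    def sa_search(T: str, SA, P: str):
        # binary search for the leftmost suffix whose prefix is >= P ...
        lo, hi = 0, len(SA)
        while lo < hi:
            mid = (lo + hi) // 2
            if T[SA[mid]:SA[mid] + len(P)] < P: lo = mid + 1
            else: hi = mid
        left = lo
        # ... and for the leftmost suffix whose prefix is > P
        hi = len(SA)
        while lo < hi:
            mid = (lo + hi) // 2
            if T[SA[mid]:SA[mid] + len(P)] <= P: lo = mid + 1
            else: hi = mid
        return SA[left:lo]                                    # all occurrences of P

    T = "mississippi#"
    print(sorted(sa_search(T, suffix_array(T), "si")))        # -> [3, 6], i.e. positions 4 and 7 (1-based)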

Locating the occurrences

T = mississippi#,  P = si
The binary search isolates the contiguous range of SA holding the suffixes prefixed by P
(here "sippi#" and "sissippi#"), so occ = 2 and the occurrences start at positions 4 and 7.

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#   (e.g. the adjacent suffixes "issippi#" and "ississippi#" share a prefix of length 4)
• How long is the common prefix between T[i,...] and T[j,...] ?
• Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
• Search for an entry Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
• Search for a window Lcp[i,i+C-2] whose entries are all ≥ L
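A small sketch of these Lcp-based queries in Python; SA and Lcp are built naively here (Kasai's algorithm would build Lcp in O(N), but that is beyond this illustration):

    def lcp_array(T, SA):
        def lcp(i, j):
            k = 0
            while i + k < len(T) and j + k < len(T) and T[i+k] == T[j+k]:
                k += 1
            return k
        return [lcp(SA[r], SA[r+1]) for r in range(len(SA) - 1)]

    T = "mississippi#"
    SA = sorted(range(len(T)), key=lambda i: T[i:])
    Lcp = lcp_array(T, SA)

    L, C = 3, 2
    print(max(Lcp) >= L)                                   # repeated substring of length >= L ?
    print(any(min(Lcp[i:i+C-1]) >= L                       # substring of length >= L occurring >= C times ?
              for i in range(len(Lcp) - C + 2)))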



Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU (registers) → L1 → L2 → RAM → HD → net

Cache: few Mbs, some nanosecs, few words fetched
RAM:   few Gbs, tens of nanosecs, some words fetched
HD:    few Tbs, few millisecs, B = 32K page
Net:   many Tbs, even secs, packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock and its daily performance over time, find the time window
in which it achieved the best "market performance".
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
        4K    8K    16K   32K    128K   256K   512K   1M
n^3     22s   3m    26m   3.5h   28h    --     --     --
n^2     0     0     0     1s     26s    106s   7m     28m

An optimal solution
We assume every subsum ≠ 0

[Figure: the prefix of A preceding the optimum window has sum < 0, while every prefix of the optimum window has sum > 0]

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm

sum = 0; max = -1;
For i = 1, ..., n do
    if (sum + A[i] ≤ 0) then sum = 0;
    else { sum += A[i]; max = MAX{max, sum}; }

Note:
• Sum < 0 when OPT starts;
• Sum > 0 within OPT
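A direct transcription of the scan in Python (under the slide's assumption that no sub-sum is exactly zero and the optimum is positive):

    def max_subarray_sum(A):
        best, cur = -1, 0
        for x in A:
            if cur + x <= 0:
                cur = 0                     # a window ending here cannot help: restart
            else:
                cur += x
                best = max(best, cur)
        return best

    A = [2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]
    print(max_subarray_sum(A))              # 12 = 6 + 1 - 2 + 4 + 3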

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort

Merge-Sort(A, i, j)
  if (i < j) then
      m = (i + j) / 2;              // Divide
      Merge-Sort(A, i, m);          // Conquer
      Merge-Sort(A, m+1, j);
      Merge(A, i, m, j)             // Combine

Cost of Mergesort on large data

Take Wikipedia in Italian, compute word freq:
  n = 10^9 tuples  ⇒  a few Gbs
  Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:
  It is an indirect sort: Θ(n log2 n) random I/Os
  [5ms] × n log2 n  ≈  1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree
[Figure: the recursion tree of binary mergesort, with log2 N levels; each level merges pairs of sorted runs of doubling length]

If the run-size is larger than B (i.e. after the first step!!), fetching all of it in memory for merging does not help.

How do we deploy the disk/memory features?
Sort runs of M items at a time in internal memory: N/M runs, each sorted with no I/Os.
⇒ I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort

The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:
  Pass 1: Produce (N/M) sorted runs.
  Pass i: merge X ≤ M/B runs  ⇒  log_{M/B} (N/M) passes

[Figure: X input buffers (one per run) and one output buffer, each of B items, kept in main memory; runs are read from and written back to disk]

Multiway Merging
[Figure: X = M/B runs, each with a current page Bf_1 ... Bf_X and a cursor p_i; repeatedly output min(Bf_1[p_1], Bf_2[p_2], …, Bf_X[p_X]) into the output buffer Bf_o; fetch the next page of run i when p_i = B, and flush Bf_o to the merged output run when it is full, until EOF]
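A compact sketch of X-way merging in Python, using a heap to pick the minimum head; in-memory lists stand in for the disk runs, and the comments mark where page fetches and flushes would happen:

    import heapq

    def multiway_merge(runs):
        heap = []
        for r, it in enumerate(map(iter, runs)):
            first = next(it, None)
            if first is not None:
                heap.append((first, r, it))
        heapq.heapify(heap)
        out = []
        while heap:
            val, r, it = heapq.heappop(heap)
            out.append(val)                       # external memory: append to Bf_o, flush when full
            nxt = next(it, None)
            if nxt is not None:                   # external memory: fetch the next page when p_i = B
                heapq.heappush(heap, (nxt, r, it))
        return out

    print(multiway_merge([[1, 2, 5, 7, 9, 10], [2, 7, 8, 13, 19], [3, 4, 11, 12, 15, 17]]))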

Cost of Multi-way Merge-Sort

Number of passes = log_{M/B} #runs  ≤  log_{M/B} (N/M)
Optimal cost = Θ((N/B) log_{M/B} (N/M)) I/Os

In practice
  M/B ≈ 1000  ⇒  #passes = log_{M/B} (N/M) ≈ 1
  One multiway merge  ⇒  2 passes = few mins
  (tuning depends on disk features)

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Can compression help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X == s) then C++
else { C--; if (C == 0) { X = s; C = 1; } }

Return X;
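A small sketch of this single-pass voting in Python (written in the standard Boyer–Moore-style form, equivalent to the pseudocode above); it returns the majority item provided one item really occurs more than N/2 times:

    def majority_candidate(stream):
        X, C = None, 0
        for s in stream:
            if C == 0:
                X, C = s, 1
            elif X == s:
                C += 1
            else:
                C -= 1
        return X

    A = list("bacccdcbaaaccbccc")      # the stream of the slide: 'c' occurs 9 times out of 17
    print(majority_candidate(A))       # -> 'c'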

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6·10^9 chars, i.e. size ≈ 6Gb
n = 10^6 documents
TotT = 10^9 total term occurrences (avg term length is 6 chars)
t = 5·10^5 distinct terms

What kind of data structure should we build to support
word-based searches ?

Solution 1: Term-Doc matrix          (1 if the play contains the word, 0 otherwise)

n = 1 million documents, t = 500K terms

             Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony              1               1              0          0       0        1
Brutus              1               1              0          1       0        0
Caesar              1               1              0          1       1        1
Calpurnia           0               1              0          0       0        0
Cleopatra           1               0              0          0       0        0
mercy               1               0              1          1       1        1
worser              1               0              1          1       1        0

Space is 500Gb !

Solution 2: Inverted index

Brutus     →  2 4 8 16 32 64 128
Calpurnia  →  1 2 3 5 8 13 21 34
Caesar     →  13 16

We can still do better: i.e. 30-50% of the original text

1. Typically use about 12 bytes
2. We have 10^9 total terms  ⇒  at least 12Gb space
3. Compressing the 6Gb of documents gets 1.5Gb of data
A better index, but it is still >10 times the text !!!!
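A tiny sketch of Solution 2 in Python: map each term to the sorted list of documents containing it, instead of storing a full term-document matrix (the document contents below are made-up):

    from collections import defaultdict

    docs = {1: "antony and cleopatra", 2: "brutus killed caesar",
            4: "brutus again", 13: "caesar", 16: "caesar"}

    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in set(docs[doc_id].split()):
            index[term].append(doc_id)     # postings stay sorted: doc ids are scanned in order

    print(index["brutus"])                 # -> [2, 4]
    print(index["caesar"])                 # -> [2, 13, 16]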

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM into fewer bits?
NO: they are 2^n, but the compressed messages shorter than n bits are fewer:

    ∑_{i=1}^{n-1} 2^i  =  2^n − 2

We need to talk about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the self-information of s is:

    i(s) = log2 (1/p(s)) = − log2 p(s)

Lower probability  ⇒  higher information

Entropy is the weighted average of i(s):

    H(S) = ∑_{s∈S} p(s) · log2 (1/p(s))    bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
[Figure: the binary trie of the code a = 0, b = 100, c = 101, d = 11]

Average Length
For a code C with codeword lengths L[s], the average length is defined as

    La(C) = ∑_{s∈S} p(s) · L[s]

We say that a prefix code C is optimal if for all prefix codes C’,  La(C) ≤ La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix code for a source with
probabilities {p1, …, pn}, then  pi < pj  ⇒  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any
uniquely decodable code C, we have

    H(S) ≤ La(C)

Theorem (upper bound, Shannon). For any probability distribution, there
exists a prefix code C such that

    La(C) ≤ H(S) + 1

(the Shannon code takes ⌈log2 1/p⌉ bits per symbol)

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

[Figure: Huffman tree built by repeatedly merging the two least-probable nodes: (a .1)+(b .2) → .3, then (.3)+(c .2) → .5, then (.5)+(d .5) → 1]

a = 000, b = 001, c = 01, d = 1
There are 2^(n-1) “equivalent” Huffman trees

What about ties (and thus, tree depth) ?
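A compact Huffman construction in Python for the running example (a heap of (probability, subtree) pairs; ties may be broken differently from the slide, yielding one of the many equivalent optimal trees, but with the same code lengths 3, 3, 2, 1):

    import heapq
    from itertools import count

    def huffman(probs):
        tiebreak = count()                      # keeps heap comparisons well-defined on ties
        heap = [(p, next(tiebreak), sym) for sym, p in probs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            p1, _, left = heapq.heappop(heap)   # two least-probable nodes...
            p2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (p1 + p2, next(tiebreak), (left, right)))  # ...are merged
        codes = {}
        def walk(node, prefix=""):
            if isinstance(node, tuple):
                walk(node[0], prefix + "0")
                walk(node[1], prefix + "1")
            else:
                codes[node] = prefix or "0"
        walk(heap[0][2])
        return codes

    print(huffman({"a": .1, "b": .2, "c": .2, "d": .5}))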

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to the symbol to be encoded.
Decoding: Start at the root and take the branch for each bit received. When at a leaf, output its symbol and return to the root.

abc...  →  000 001 01 ...  =  00000101...
101001...  →  d c b ...

[Figure: the same Huffman tree as above, used for both encoding and decoding]

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h² + |S|·log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. Its self-information is

    − log2(.999)  ≈  .00144 bits

If we were to send 1000 such symbols we might hope to use 1000 · .00144 ≈ 1.44 bits.

Using Huffman, we take at least 1 bit per symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:
  The model takes |S|^k · (k · log |S|) + h² bits (where h might be |S|)
  It is H0(S^L) ≤ L · Hk(S) + O(k · log |S|), for each k ≤ L

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
The Huffman tree is built over the words of T = “bzip or not bzip”

[Figure: the 128-ary tagged Huffman tree and the compressed text C(T); each codeword is a sequence of 7-bit configurations packed into bytes, and the first bit of every byte tags whether it starts a codeword — e.g. the codeword of “bzip” is 1a 0b, of “or” is 1g 0a 0b, of “not” is 1g 0g 0a]
CGrep and other ideas...
P= bzip = 1a 0b

T = “bzip or not bzip”

[Figure: GREP over the compressed text C(T): scan C(T) byte-aligned, compare the tagged codeword 1a 0b of P only at codeword-start bytes, and answer yes/no at each boundary]
Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary = { bzip, not, or, space }          P = bzip = 1a 0b
S = “bzip or not bzip”

[Figure: the tagged codeword of P is searched inside C(S), checking matches only at tagged (codeword-start) bytes: the two occurrences of “bzip” answer yes, the others no]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
[Figure: the pattern P is slid along the text T, e.g. P = AB over T = ...ABCABDAB...]
 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bit operations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet (i.e., T ∈ {0,1}^n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.
Let s be a string of length m:

    H(s) = ∑_{i=1}^{m} 2^{m−i} · s[i]

P = 0101
H(P) = 2³·0 + 2²·1 + 2¹·0 + 2⁰·1 = 5
s = s’ if and only if H(s) = H(s’)

Definition:
let Tr denote the m-length substring of T starting at position r (i.e., Tr = T[r, r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1):

    H(Tr) = 2·H(Tr-1) − 2^m·T[r−1] + T[r+m−1]

T = 10110101
T1 = 1 0 1 1,  T2 = 0 1 1 0
H(T1) = H(1011) = 11
H(T2) = 2·11 − 2⁴·1 + 0 = 22 − 16 = 6 = H(0110)

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P = 101111,  q = 7
H(P) = 47,  Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally (reducing mod 7 at each step):
  1
  1·2 + 0 = 2 (mod 7) = 2
  2·2 + 1 = 5 (mod 7) = 5
  5·2 + 1 = 11 (mod 7) = 4
  4·2 + 1 = 9 (mod 7) = 2
  2·2 + 1 = 5 (mod 7) = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1), since 2^m (mod q) = 2·(2^(m-1) (mod q)) (mod q).
Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
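A sketch of the fingerprint scan in Python for a binary text, with the fingerprint updated in O(1) per shift; here q is a fixed toy modulus rather than a randomly chosen prime, and every hit is verified, so no false matches are reported:

    def karp_rabin(T: str, P: str, q: int = 2**31 - 1):
        n, m = len(T), len(P)
        if m > n:
            return []
        top = pow(2, m - 1, q)                        # 2^(m-1) mod q
        hp = ht = 0
        for i in range(m):
            hp = (2 * hp + int(P[i])) % q
            ht = (2 * ht + int(T[i])) % q
        hits = []
        for r in range(n - m + 1):
            if hp == ht and T[r:r+m] == P:            # verification -> deterministic answer
                hits.append(r)
            if r + m < n:                             # slide the window by one position
                ht = (2 * (ht - int(T[r]) * top) + int(T[r + m])) % q
        return hits

    print(karp_rabin("10110101", "0101"))             # -> [4] (0-based; the slide's T5, 1-based)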

Problem 1: Solution
Dictionary = { bzip, not, or, space }          P = bzip = 1a 0b
S = “bzip or not bzip”

[Figure: scan C(S) byte-aligned and compare the tagged codeword 1a 0b of P at each codeword start: the two occurrences of “bzip” answer yes, the others no]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

[Figure: the 3×10 matrix M; its only 1 in the last row is at column j = 7, since P = for occurs in T ending at position 7]
How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is the bit-wise AND between A and B.
BitShift(A) is the value derived by shifting A’s bits down by one and setting the first bit to 1, e.g. BitShift((0,1,1,0,1)) = (1,0,1,1,0).

Let w be the word size (e.g., 32 or 64 bits). We’ll assume m ≤ w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one.
We define the m-length binary vector U(x) for each character x in the alphabet: U(x) is set to 1 in the positions where character x appears in P.

Example:  P = abaac
U(a) = (1,0,1,1,0),  U(b) = (0,1,0,0,0),  U(c) = (0,0,0,0,1)

How to construct M

Initialize column 0 of M to all zeros.
For j > 0, the j-th column is obtained as

    M(j) = BitShift( M(j-1) )  &  U( T[j] )

For i > 1, entry M(i,j) = 1 iff
 (1) the first i-1 characters of P match the i-1 characters of T ending at character j-1   ⇔   M(i-1, j-1) = 1
 (2) P[i] = T[j]   ⇔   the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1, j-1) into the i-th position; ANDing with the i-th bit of U(T[j]) establishes whether both conditions hold.
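A sketch of the scan in Python, keeping one machine word per column of M (bit i-1 of the word corresponds to row i of M; the worked columns j = 1, 2, 3, 9 below can be checked against it):

    def shift_and(T: str, P: str):
        m = len(P)
        U = {}
        for i, c in enumerate(P):
            U[c] = U.get(c, 0) | (1 << i)          # bit i set iff P[i+1] = c
        M, hits = 0, []
        for j, c in enumerate(T):
            M = ((M << 1) | 1) & U.get(c, 0)        # M(j) = BitShift(M(j-1)) & U(T[j])
            if M & (1 << (m - 1)):                  # last bit set: an occurrence ends at j
                hits.append(j - m + 1)
        return hits

    print(shift_and("xabxabaaca", "abaac"))         # -> [4] (0-based start; ends at position 9, 1-based)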

An example, j = 1:   T = xabxabaaca,  P = abaac
M(1) = BitShift(M(0)) & U(T[1]) = (1,0,0,0,0) & U(x) = (1,0,0,0,0) & (0,0,0,0,0) = (0,0,0,0,0)

An example, j = 2:   T = xabxabaaca,  P = abaac
M(2) = BitShift(M(1)) & U(T[2]) = (1,0,0,0,0) & U(a) = (1,0,0,0,0) & (1,0,1,1,0) = (1,0,0,0,0)

An example, j = 3:   T = xabxabaaca,  P = abaac
M(3) = BitShift(M(2)) & U(T[3]) = (1,1,0,0,0) & U(b) = (1,1,0,0,0) & (0,1,0,0,0) = (0,1,0,0,0)

An example, j = 9:   T = xabxabaaca,  P = abaac
M(9) = BitShift(M(8)) & U(T[9]) = (1,1,0,0,1) & U(c) = (1,1,0,0,1) & (0,0,0,0,1) = (0,0,0,0,1)
The 1 in row 5 (= m) says that an occurrence of P ends at position 9 of T.

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special symbols, like the class of characters [a-f].

Example:  P = [a-b]baac
U(a) = (1,0,1,1,0),  U(b) = (1,1,0,0,0),  U(c) = (0,0,0,0,1)

What about ‘?’, ‘[^…]’ (not) ?

Problem 1: Another solution
Dictionary = { bzip, not, or, space }          P = bzip = 1a 0b
S = “bzip or not bzip”

[Figure: run the bit-parallel scan directly over the bytes of C(S), reporting yes at the two tagged positions where the codeword 1a 0b occurs]

Speed ≈ Compression ratio

Problem 2
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring.

Dictionary = { bzip, not, or, space }          P = o
S = “bzip or not bzip”

[Figure: the codewords of the terms containing “o” are searched in C(S): not = 1g 0g 0a, or = 1g 0a 0b]

Speed ≈ Compression ratio?  No! Why?
A scan of C(S) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
[Figure: the patterns P1, P2, ... are searched simultaneously over the text T]

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

S is the concatenation of the patterns in P
R is a bitmap of length m:
  R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of the Shift-And method searching for S:
  For any symbol c, U’(c) = U(c) AND R
    U’(c)[i] = 1 iff S[i] = c and it is the first symbol of a pattern
  For any step j,
    compute M(j)
    then M(j) OR U’(T[j]). Why?
      it sets to 1 the first bit of each pattern that starts with T[j]
    check if there are occurrences ending in j. How?

Problem 3
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring, allowing at most k mismatches.

Dictionary = { bzip, not, or, space }          P = bot,  k = 2
S = “bzip or not bzip”

[Figure: the search runs over C(S) as before, now allowing up to k mismatches per term]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal: given k, find all the occurrences of P in T with up to k mismatches.
We define the matrix M^l to be an m by n binary matrix such that:

  M^l(i,j) = 1  iff  the first i characters of P match the i characters of T ending at character j, with no more than l mismatches.

What is M^0?
How does M^k solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1

The first i-1 characters of P match a substring of T ending at j-1 with at most l mismatches, and the next pair of characters in P and T are equal:

    BitShift( M^l(j-1) )  &  U( T[j] )

Computing Ml: case 2

The first i-1 characters of P match a substring of T ending at j-1 with at most l-1 mismatches (position i is charged one more mismatch):

    BitShift( M^(l-1)(j-1) )

Computing Ml

We compute M^l for all l = 0, …, k; for each j we compute M(j), M^1(j), …, M^k(j), initializing every M^l(0) to the zero vector.
Combining the two cases, there is a match iff

    M^l(j)  =  [ BitShift( M^l(j-1) ) & U(T[j]) ]   OR   BitShift( M^(l-1)(j-1) )

Example M1
T = xabxabaaca,  P = abaad

[Tables: the 5×10 matrices M0 and M1; e.g. M1(5,9) = 1, so P = abaad occurs ending at position 9 of T with at most one mismatch]

How much do we pay?

The running time is O(k·n·(1 + m/w)).
Again, the method is practically efficient for small m.
Only O(k) columns of M are needed at any given time; hence the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Given a pattern P, find all the occurrences in S of all dictionary terms containing P as a substring, allowing k mismatches.

Dictionary = { bzip, not, or, space }          P = bot,  k = 2
S = “bzip or not bzip”

[Figure: the Agrep-style scan over C(S) reports the term “not” (codeword 1g 0g 0a), which matches “bot” within k mismatches]

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding

  γ(x)  =  (Length−1 zeros)  followed by  x in binary

  x > 0 and Length = ⌊log2 x⌋ + 1
  e.g., 9 is represented as <000, 1001>.

  The γ-code for x takes 2⌊log2 x⌋ + 1 bits (i.e. a factor of 2 from optimal)

  It is optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8

6

3

59

7
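A small γ-encode/decode sketch in Python, matching the definition and the exercise above:

    def gamma_encode(x: int) -> str:
        b = bin(x)[2:]                         # binary representation of x > 0
        return "0" * (len(b) - 1) + b          # Length-1 zeros, then the binary digits

    def gamma_decode_stream(bits: str):
        out, i = [], 0
        while i < len(bits):
            zeros = 0
            while bits[i] == "0":              # count the leading zeros = Length-1
                zeros += 1; i += 1
            out.append(int(bits[i:i + zeros + 1], 2))
            i += zeros + 1
        return out

    print(gamma_encode(9))                                             # 0001001
    print(gamma_decode_stream("0001000001100110000011101100111"))      # [8, 6, 3, 59, 7]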

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).

Recall that |γ(i)| ≤ 2·log i + 1.
How good is this approach wrt Huffman?  Compression ratio ≤ 2·H0(s) + 1
Key fact:   1 ≥ ∑_{i=1,...,x} pi ≥ x·px   ⇒   x ≤ 1/px

How good is it ?
Encode the integers via γ-coding:  |γ(i)| ≤ 2·log i + 1.
The cost of the encoding is (recall i ≤ 1/pi):

    ∑_{i=1,...,|S|} pi · |γ(i)|  ≤  ∑_{i=1,...,|S|} pi · [ 2·log(1/pi) + 1 ]  =  2·H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte,
s·c with 2 bytes, s·c² with 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 128² = 16512 words on at most 2 bytes
A (230,26)-dense code encodes 230 + 230·26 = 6210 words on at most 2 bytes, hence more words on 1 byte, and thus it wins if the distribution is skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1n 2n 3n… nn  Huff = O(n2 log n), MTF = O(n log n) + n2

No much worse than Huffman
...but it may be far better
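A small Move-to-Front sketch in Python (the output positions would then be fed to a variable-length integer code such as the γ-code above); run on a prefix of the bzip example it reproduces the first values 0 2 0 0 3 0 ... of that slide's Mtf string:

    def mtf_encode(s, alphabet):
        L = list(alphabet)
        out = []
        for c in s:
            pos = L.index(c)
            out.append(pos)
            L.pop(pos); L.insert(0, c)     # move c to the front of the list
        return out

    print(mtf_encode("ipppssssssmmmii", "imps"))   # [0, 2, 0, 0, 3, 0, 0, 0, 0, 0, 3, 0, 0, 3, 0]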

MTF: how good is it ?
Encode the integers via γ-coding:  |γ(i)| ≤ 2·log i + 1.
Put the alphabet S in front of the sequence and consider the cost of encoding:

    O(|S| log |S|)  +  ∑_{x=1}^{|S|}  ∑_{i=2}^{n_x}  |γ( p_i^x − p_{i−1}^x )|

where p_1^x < p_2^x < ... are the positions of symbol x. By Jensen’s inequality this is

    ≤  O(|S| log |S|)  +  ∑_{x=1}^{|S|}  n_x · [ 2·log(N/n_x) + 1 ]
    =  O(|S| log |S|)  +  N · [ 2·H0(X) + 1 ]

Hence   La[mtf]  ≤  2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality; it is a dynamic code (there is a memory)

X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0 (inclusive) to 1 (exclusive), of width p(symbol):

    f(i) = ∑_{j=1}^{i-1} p(j)

e.g. with p(a) = .2, p(b) = .5, p(c) = .3:   f(a) = .0, f(b) = .2, f(c) = .7
so  a ↦ [0, .2),   b ↦ [.2, .7),   c ↦ [.7, 1.0)

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2, .7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac

start [0, 1.0)  →  after b: [.2, .7)  →  after a: [.2, .3)  →  after c: [.27, .3)

The final sequence interval is [.27, .3)

Arithmetic Coding
To code a sequence of symbols c with probabilities p[c] use the following:

    l_0 = 0        l_i = l_{i-1} + s_{i-1} · f[c_i]
    s_0 = 1        s_i = s_{i-1} · p[c_i]

f[c] is the cumulative probability up to symbol c (not included).
The final interval size is

    s_n = ∏_{i=1}^{n} p[c_i]

The interval for a message sequence will be called the sequence interval
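A tiny sketch of this interval computation in Python for the running example (p[a]=.2, p[b]=.5, p[c]=.3, so f[a]=0, f[b]=.2, f[c]=.7); the real coder would then emit a dyadic number inside the final interval:

    p = {"a": 0.2, "b": 0.5, "c": 0.3}
    f = {"a": 0.0, "b": 0.2, "c": 0.7}

    def sequence_interval(msg):
        l, s = 0.0, 1.0
        for c in msg:
            l, s = l + s * f[c], s * p[c]      # shrink the interval to c's sub-interval
        return l, s

    l, s = sequence_interval("bac")
    print(l, l + s)                            # ≈ 0.27 0.3  ->  the sequence interval [.27, .3)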

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:

.49 ∈ [.2, .7) = b’s interval   →   .49 ∈ [.3, .55) = b’s sub-interval   →   .49 ∈ [.475, .55) = c’s sub-interval

The message is bbc.

Representing a real number
Binary fractional representation:

    .75 = .11        1/3 = .0101...        11/16 = .1011

Algorithm (to emit the bits of x ∈ [0,1)):
  1. x = 2·x
  2. if x < 1 output 0
  3. else x = x − 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
e.g.  [0, .33) → .01      [.33, .66) → .1      [.66, 1) → .11

Representing a code interval
Binary fractional numbers can be viewed as intervals by considering all their completions:

    code    min      max      interval
    .11     .110     .111     [.75, 1.0)
    .101    .1010    .1011    [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

[Example: sequence interval [.61, .79); the code interval of .101, i.e. [.625, .75), is contained in it]

Can use L + s/2 truncated to 1 + ⌈log (1/s)⌉ bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

    1 + ⌈log (1/s)⌉  =  1 + ⌈log ∏_{i=1,n} (1/p_i)⌉
                      ≤  2 + ∑_{i=1,n} log (1/p_i)
                      =  2 + ∑_{k=1,|S|} n·p_k·log (1/p_k)
                      =  2 + n·H0   bits

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine: the coder keeps the current interval (L, s); given the next symbol c and the distribution (p1, ..., p|S|), it moves to (L', s'), the sub-interval of (L, s) assigned to c.

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts                    String = ACCBACCACBA B,   k = 2

Context: Empty      A = 4   B = 2   C = 5   $ = 3

Context: A          C = 3   $ = 1
Context: B          A = 2   $ = 1
Context: C          A = 1   B = 2   C = 2   $ = 3

Context: AC         B = 1   C = 2   $ = 2
Context: BA         C = 1   $ = 1
Context: CA         C = 1   $ = 1
Context: CB         A = 2   $ = 1
Context: CC         A = 1   B = 1   $ = 2
You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later
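A sketch of the LZW encoder in Python over the alphabet {a, b, c}, using the same toy code assignment as the example (a = 112, b = 113, c = 114) instead of a full 256-entry ASCII table; on the example string it reproduces the output and dictionary entries above:

    def lzw_encode(text: str):
        dictionary = {"a": 112, "b": 113, "c": 114}
        next_code = 256
        out, S = [], ""
        for c in text:
            if S + c in dictionary:
                S += c                               # extend the current match
            else:
                out.append(dictionary[S])            # emit the id of the longest match S
                dictionary[S + c] = next_code        # add Sc to the dictionary (c is NOT sent)
                next_code += 1
                S = c
        out.append(dictionary[S])
        return out

    print(lzw_encode("aabaacababacb"))
    # [112, 112, 113, 256, 114, 257, 261, 114, 113]  with 256=aa, 257=ab, 258=ba, 259=aac, 260=ca, 261=aba, 262=abac, 263=cb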

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]
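A sketch in Python tying the pieces together: L is obtained from a (naively built) suffix array via L[i] = T[SA[i]-1], and T is reconstructed backwards from L through the LF mapping, as in the InvertBWT pseudocode above:

    def bwt(T: str) -> str:
        # since T ends with the unique smallest char '#', sorting suffixes = sorting rotations
        SA = sorted(range(len(T)), key=lambda i: T[i:])
        return "".join(T[i - 1] for i in SA)          # i = 0 wraps to the last char via T[-1]

    def inverse_bwt(L: str) -> str:
        n = len(L)
        # LF[r]: the row of F holding the same character occurrence as L[r] (stable on equal chars)
        order = sorted(range(n), key=lambda r: (L[r], r))
        LF = [0] * n
        for f_row, l_row in enumerate(order):
            LF[l_row] = f_row
        out, r = [], 0                                # row 0 is the rotation starting with '#'
        for _ in range(n):
            out.append(L[r])
            r = LF[r]
        t = "".join(reversed(out))                    # T cyclically shifted so that '#' comes first
        return t[1:] + t[0]                           # put the terminator back at the end

    L = bwt("mississippi#")
    print(L)                    # ipssm#pissii
    print(inverse_bwt(L))       # mississippi#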

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in - degree ( u )  k ] 

a  2.1

1

k

a

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown

 The LZ77 scheme provides an efficient, optimal solution:
  fknown is the “previously encoded text”; compress the concatenation fknown·fnew, starting from fnew

zdelta is one of the best implementations
           Emacs size   Emacs time
uncompr    27Mb         ---
gzip       8Mb          35 secs
zdelta     1.5Mb        42 secs
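zdelta itself is not available in the Python standard library, but the idea of compressing f_new while “knowing” f_known can be sketched with zlib's preset-dictionary feature. This is only an illustration of the principle (names are illustrative), not the zdelta encoder.

import zlib

def delta_compress(f_known: bytes, f_new: bytes) -> bytes:
    # f_known acts as a preset dictionary: substrings of f_new that already
    # occur in f_known are encoded as back-references instead of literals.
    c = zlib.compressobj(level=9, zdict=f_known)
    return c.compress(f_new) + c.flush()

def delta_decompress(f_known: bytes, f_d: bytes) -> bytes:
    d = zlib.decompressobj(zdict=f_known)
    return d.decompress(f_d) + d.flush()

old = b"the quick brown fox jumps over the lazy dog " * 100
new = old.replace(b"lazy", b"sleepy")
fd = delta_compress(old, new)
assert delta_decompress(old, fd) == new
print(len(new), len(zlib.compress(new, 9)), len(fd))   # the delta is the smallest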

Efficient Web Access
Dual proxy architecture: a pair of proxies located on each side of the slow link uses a proprietary protocol to increase performance over this link.

[Figure: Client — client-side proxy —(slow link, delta-encoding)— server-side proxy —(fast link)— web; requests and cached references flow between the proxies, the web returns the page]

Use zdelta to reduce traffic:
 Old version available at both proxies
 Restricted to pages already visited (30% hits), URL-prefix match
 Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

[Figure: example graph GF with a dummy node 0 connected to the files 1, 2, 3, 5; edge weights are zdelta/gzip sizes (e.g. 20, 123, 220, 620, 2000), and the min branching selects the cheapest reference for each file]

           space   time
uncompr    30Mb    ---
tgz        20%     linear
THIS       8%      quadratic
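A toy sketch of the graph-building step follows. It picks, for each file, the cheapest incoming reference (with the dummy node standing for plain compression), using zlib sizes as a stand-in for zdelta sizes. This greedy choice is only an approximation of the true minimum branching: cycles among references must still be broken, which is exactly what Edmonds' branching algorithm guarantees in the exact solution.

import zlib

def delta_size(ref: bytes, f: bytes) -> int:
    c = zlib.compressobj(level=9, zdict=ref)        # zdelta-size proxy
    return len(c.compress(f) + c.flush())

def choose_references(files: dict) -> dict:
    """files: name -> bytes. Returns name -> chosen reference (None = dummy node)."""
    choice = {}
    for name, data in files.items():
        best_ref, best_cost = None, len(zlib.compress(data, 9))   # dummy-node edge
        for other, ref in files.items():
            if other == name:
                continue
            cost = delta_size(ref, data)
            if cost < best_cost:
                best_ref, best_cost = other, cost
        choice[name] = best_ref          # greedy: may create cycles, see note above
    return choice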

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
           space   time
uncompr    260Mb   ---
tgz        12%     2 mins
THIS       8%      16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
[Figure: the Client holds f_old, the Server holds f_new; the client sends a request and receives an update]

 The client wants to update an out-dated file
 The server has the new file but does not know the old file
 Update without sending the entire f_new (exploit the similarity)
 rsync: file-synch tool, distributed with Linux

Delta compression is a sort of local synch, since the server has both copies of the files.

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
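The “4-byte rolling hash” mentioned above is an Adler-style weak checksum that can be slid along f_new one byte at a time in O(1). The sketch below shows such a rolling checksum; it is a hedged illustration of the technique, not rsync's exact code (which also uses a strong per-block hash to confirm matches).

def weak_checksum(block: bytes):
    # a = sum of bytes, b = weighted sum of bytes (both mod 2^16)
    a = sum(block) % 65536
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % 65536
    return a, b

def roll(a, b, out_byte, in_byte, block_len):
    # slide the window one byte to the right in O(1)
    a = (a - out_byte + in_byte) % 65536
    b = (b - block_len * out_byte + a) % 65536
    return a, b

data = b"abcdefgh"
B = 4
a, b = weak_checksum(data[0:B])
for i in range(1, len(data) - B + 1):
    a, b = roll(a, b, data[i - 1], data[i + B - 1], B)
    assert (a, b) == weak_checksum(data[i:i + B])   # rolling = recomputing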

Rsync: some experiments

          gcc size   emacs size
total     27288      27326
gzip      7563       8577
zdelta    227        1431
rsync     964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends the hashes (unlike the client in rsync); the client checks them.
The server deploys the common f_ref to compress the new f_tar (rsync compresses just the literals).

A multi-round protocol

 k blocks of n/k elements, log(n/k) levels
 If the distance is k, then on each level at most k hashes fail to find a match in the other file.
 The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff P is a prefix of the i-th suffix of T (i.e. T[i,N])

[Figure: T with the suffix T[i,N] starting at position i, and P aligned as its prefix]

Occurrences of P in T = all suffixes of T having P as a prefix
Example: T = mississippi, P = si occurs at positions 4, 7

SUF(T) = sorted set of suffixes of T

Reduction: from substring search to prefix search

The Suffix Tree

T# = mississippi#

[Figure: suffix tree of T#. Edge labels are substrings of T# (e.g. i, s, si, ssi, p, pi#, ppi#, i#, mississippi#, #); the 12 leaves store the starting positions 1..12 of the corresponding suffixes]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.
Storing SUF(T) explicitly takes Θ(N²) space.

T = mississippi#

 SA    SUF(T)
 12    #
 11    i#
  8    ippi#
  5    issippi#
  2    ississippi#
  1    mississippi#
 10    pi#
  9    ppi#
  7    sippi#
  4    sissippi#
  6    ssippi#
  3    ssissippi#

Each SA entry is a suffix pointer (e.g. P = si selects the contiguous block of suffixes starting with si).

Suffix Array space
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step.

[Figure: binary search over SA for P = si on T = mississippi#; at each step P is compared with the suffix pointed to by the middle SA entry, and found to be larger or smaller]

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char comparisons
 overall, O(p log2 N) time
The bound improves to O(p + log2 N) [Manber-Myers, '90], and the log2 N term can be reduced to log2 |S| [Cole et al, '06]
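A compact, runnable sketch of the indirect binary search just described (with a naive O(N² log N) suffix-array construction, fine for small texts; positions are reported 1-based as in the slides):

def suffix_array(t):
    # naive construction: sort suffix start positions by the suffix itself
    return sorted(range(len(t)), key=lambda i: t[i:])

def search(t, sa, p):
    lo, hi = 0, len(sa)
    while lo < hi:                              # leftmost suffix whose prefix >= p
        mid = (lo + hi) // 2
        if t[sa[mid]:sa[mid] + len(p)] < p: lo = mid + 1
        else: hi = mid
    first = lo
    lo, hi = first, len(sa)
    while lo < hi:                              # leftmost suffix NOT prefixed by p
        mid = (lo + hi) // 2
        if t[sa[mid]:sa[mid] + len(p)] <= p: lo = mid + 1
        else: hi = mid
    return sorted(sa[first:lo])                 # starting positions of occurrences

t = "mississippi#"
sa = suffix_array(t)
print([i + 1 for i in search(t, sa, "si")])     # -> [4, 7]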

Locating the occurrences
[Figure: the occ = 2 occurrences of P = si in T = mississippi# correspond to the contiguous SA entries 7 (sippi#) and 4 (sissippi#); two binary searches delimit the range of suffixes prefixed by si]

Suffix Array search
• O(p + log2 N + occ) time
Suffix Trays: O(p + log2 |S| + occ)  [Cole et al., '06]
String B-tree  [Ferragina-Grossi, '95]
Self-adjusting Suffix Arrays  [Ciriani et al., '02]

Text mining
Lcp[1,N-1] = longest-common-prefix length between suffixes adjacent in SA

[Figure: SA of T = mississippi# with the Lcp array alongside; e.g. Lcp = 4 between issippi# and ississippi#]

• How long is the common prefix between T[i,...] and T[j,...] ?
  It is the min of the subarray Lcp[h,k-1], where SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  Search for an entry Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.
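A small sketch of the Lcp-based tests above. The Lcp array is computed here by direct comparison of adjacent suffixes (simple but quadratic in the worst case; linear-time constructions exist).

def lcp_array(t, sa):
    def lcp(i, j):
        k, n = 0, len(t)
        while i + k < n and j + k < n and t[i + k] == t[j + k]:
            k += 1
        return k
    return [lcp(sa[r], sa[r + 1]) for r in range(len(sa) - 1)]

def has_repeat_of_length(lcp, L):
    # some substring of length >= L occurs twice iff an adjacent pair shares >= L chars
    return any(v >= L for v in lcp)

def occurs_at_least(lcp, L, C):
    # >= C occurrences of a substring of length >= L iff C-1 consecutive entries are >= L
    run = 0
    for v in lcp:
        run = run + 1 if v >= L else 0
        if run >= C - 1:
            return True
    return False

t = "mississippi#"
sa = sorted(range(len(t)), key=lambda i: t[i:])
print(lcp_array(t, sa))      # max value 4: "issi" is the longest repeated substring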


Slide 241

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 10^4 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and the algorithm uses all of it:
(1/B) * (p * f/(1+f) * C) ≈ 30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
[Figure: memory hierarchy CPU/registers — L1/L2 cache — RAM — HD — net, with typical sizes, access times and block/page granularities]

Unknown and/or changing devices
 Block access is important on all levels of the memory hierarchy
 But memory hierarchies are very diverse

Cache-oblivious algorithms:
 Explicitly, algorithms do not assume any model parameters
 Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
Running times of the Θ(n³) and Θ(n²) algorithms for increasing n:

 n     4K    8K   16K   32K    128K   256K   512K   1M
 n³    22s   3m   26m   3.5h   28h    --     --     --
 n²    0     0    0     1s     26s    106s   7m     28m

An optimal solution
We assume every subsum ≠ 0

[Figure: A is split into a prefix with sum < 0 followed by the optimum window, within which all running sums are > 0]

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm
 sum = 0; max = -1;
 for i = 1,...,n do
   if (sum + A[i] ≤ 0) then sum = 0;
   else { sum += A[i]; max = MAX{max, sum}; }

Note:
• Sum < 0 when OPT starts;
• Sum > 0 within OPT
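A runnable version of the scan above (Kadane-style), under the same assumption that the optimum has positive sum:

def max_subarray_sum(A):
    best, s = 0, 0
    for x in A:
        if s + x <= 0:
            s = 0                     # restart: the current window cannot help
        else:
            s += x
            best = max(best, s)       # keep the best window sum seen so far
    return best

A = [2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]
print(max_subarray_sum(A))            # 12, achieved by the window 6 1 -2 4 3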

Toy problem #2: sorting
 How to sort tuples (objects) on disk?
[Figure: memory containing the tuples, accessed through an array A of pointers]

 Key observation:
  Array A is an “array of pointers to objects”
  Each object-to-object comparison A[i] vs A[j] causes 2 random accesses to the memory locations of A[i] and A[j]
  MergeSort makes Θ(n log n) random memory accesses (I/Os ??)

B-trees for sorting?
Using a well-tuned B-tree library (Berkeley DB):
 n insertions  data get distributed arbitrarily !!!
[Figure: B-tree internal nodes, leaves holding “tuple pointers”, and the tuples scattered on disk]
What about listing the tuples in order?
Possibly 10^9 random I/Os = 10^9 × 5ms ≈ 2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
 if (i < j) then
   m = (i+j)/2;            // Divide
   Merge-Sort(A,i,m);       // Conquer
   Merge-Sort(A,m+1,j);     // Conquer
   Merge(A,i,m,j)           // Combine

Cost of Mergesort on large data
 Take Wikipedia in Italian, compute word frequencies:
  n = 10^9 tuples  a few Gbs
  Typical disk (Seagate Cheetah 150Gb): seek time ~5ms
 Analysis of mergesort on disk:
  It is an indirect sort: Θ(n log2 n) random I/Os
  [5ms] × n log2 n ≈ 1.5 years
 In practice it is faster because of caching (each merge level makes 2 sequential R/W passes)...

Merge-Sort Recursion Tree

[Figure: mergesort recursion tree over log2 N levels; at each level, sorted runs are merged pairwise into longer runs]

If the run-size is larger than B (i.e. after the first step!!), fetching all of it in memory for merging does not help.

How do we deploy the disk/memory features?
With internal memory M: N/M runs, each sorted in internal memory (no extra I/Os)

 I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort

 The key is to balance run-size and #runs to merge
 Sort N items with main-memory M and disk-pages B:
  Pass 1: produce (N/M) sorted runs.
  Pass i: merge X ≤ M/B runs at a time  log_{M/B}(N/M) merge passes

[Figure: X input runs on disk are read through X main-memory buffers of B items each and merged into one output buffer, which is flushed back to disk]

Multiway Merging
[Figure: X = M/B runs, each read through its own buffer Bf1..BfX with a cursor p1..pX; at every step the minimum of Bf1[p1], Bf2[p2], …, BfX[pX] is moved to the output buffer Bfo; a run buffer is refilled from disk when its cursor pi reaches B, and Bfo is flushed to the merged output run when full, until EOF]
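A toy sketch of the multiway merge using a heap over the runs' current heads (this is exactly what heapq.merge does in the standard library). It works on in-memory "runs" and deliberately ignores the disk-buffer management described above.

import heapq

def multiway_merge(runs):
    """Merge X sorted runs into one sorted output, comparing only the current heads."""
    heap = [(run[0], idx, 0) for idx, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        val, idx, pos = heapq.heappop(heap)      # overall minimum among the heads
        out.append(val)
        if pos + 1 < len(runs[idx]):
            heapq.heappush(heap, (runs[idx][pos + 1], idx, pos + 1))
    return out

runs = [[1, 2, 10], [5, 7, 9, 13, 19], [3, 4, 8], [6, 11, 12, 15, 17]]
print(multiway_merge(runs))
print(list(heapq.merge(*runs)))                  # standard-library equivalent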

Cost of Multi-way Merge-Sort

 Number of passes = log_{M/B} #runs ≤ log_{M/B}(N/M)
 Optimal cost = Θ((N/B) log_{M/B}(N/M)) I/Os

In practice
 M/B ≈ 1000  #passes = log_{M/B}(N/M) ≈ 1
 One multiway merge  2 passes = a few minutes (tuning depends on disk features)
 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Can compression help?

 Goal: enlarge M and reduce N
  #passes = O(log_{M/B}(N/M))
  Cost of a pass = O(N/B)

Part of Vitter's paper…
In order to address issues related to:
 Disk Striping: sorting easily on D disks
 Distribution sort: top-down sorting
 Lower Bounds: how far we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm

 Use a pair of variables <X,C>, initialized with X = first item, C = 1
 For each subsequent item s of the stream:
   if (X == s) then C++
   else { C--; if (C == 0) { X = s; C = 1; } }
 Return X;

Proof

If X ≠ y at the end, then every one of y's occurrences has been cancelled by a distinct "negative" mate. Hence the number of mates is ≥ #occ(y), so N ≥ 2 * #occ(y), contradicting #occ(y) > N/2.

(Problems arise if the most frequent item occurs ≤ N/2 times: then the returned X need not be the mode.)
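A runnable version of the one-pass scan (the Boyer-Moore majority-vote formulation); as noted above, the answer is guaranteed only if some item really occurs > N/2 times, otherwise a verification pass is needed.

def majority_candidate(stream):
    X, C = None, 0
    for s in stream:
        if C == 0:
            X, C = s, 1          # adopt a new candidate
        elif X == s:
            C += 1
        else:
            C -= 1
    return X                     # verify with a second pass if a majority is not guaranteed

A = "b a c c c d c b a a a c c b c c c".split()
print(majority_candidate(A))     # c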

Toy problem #4: Indexing

 Consider the following TREC collection:
  N = 6 * 10^9 chars, size = 6Gb
  n = 10^6 documents
  TotT = 10^9 term occurrences (avg term length is 6 chars)
  t = 5 * 10^5 distinct terms

What kind of data structure do we build to support word-based searches?

Solution 1: Term-Doc matrix
n = 1 million documents (columns), t = 500K terms (rows); cell = 1 if the play contains the word, 0 otherwise.

            Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony             1                1             0          0       0        1
Brutus             1                1             0          1       0        0
Caesar             1                1             0          1       1        1
Calpurnia          0                1             0          0       0        0
Cleopatra          1                0             0          0       0        0
mercy              1                0             1          1       1        1
worser             1                0             1          1       1        0

Space is 500Gb !

Solution 2: Inverted index
[Figure: postings lists for Brutus, Calpurnia and Caesar (e.g. 2 4 8 16 32 64 128; 2 3 5 8 13 21 34; 13 16)]

We can still do better: i.e. 30-50% of the original text

1. Typically about 12 bytes are used per posting
2. We have 10^9 total terms  at least 12Gb space
3. Compressing the 6Gb of documents gets 1.5Gb of data
Better index, but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string is (losslessly) compressed, some other must expand.
Take all messages of length n. Is it possible to compress ALL OF THEM in less bits?
NO, they are 2^n but we have fewer compressed messages:

  Σ_{i=1}^{n-1} 2^i = 2^n − 2

We need to talk about stochastic sources.

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the self-information of s is:

  i(s) = log2 (1/p(s)) = −log2 p(s)

Lower probability  higher information

Entropy is the weighted average of i(s):

  H(S) = Σ_{s∈S} p(s) · log2 (1/p(s))   bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword lengths L[s], the average length is defined as

  La(C) = Σ_{s∈S} p(s) · L[s]

We say that a prefix code C is optimal if for all prefix codes C', La(C) ≤ La(C')

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any uniquely decodable code C, we have

  H(S) ≤ La(C)

Theorem (upper bound, Shannon). For any probability distribution, there exists a prefix code C such that

  La(C) ≤ H(S) + 1

(the Shannon code takes ⌈log2 1/p(s)⌉ bits per symbol)

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

[Figure: Huffman tree; a(.1) and b(.2) merge into (.3), which merges with c(.2) into (.5), which merges with d(.5) into the root (1)]

Resulting code: a = 000, b = 001, c = 01, d = 1
There are 2^(n-1) “equivalent” Huffman trees

What about ties (and thus, tree depth)?
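A heapq-based sketch of Huffman's construction. On the running example it reproduces the codeword lengths (a, b: 3 bits, c: 2, d: 1); the concrete 0/1 labels may come out flipped with respect to the slide, which is one of the "equivalent" trees.

import heapq
from itertools import count

def huffman_code(probs):
    """probs: dict symbol -> probability. Returns dict symbol -> codeword."""
    tiebreak = count()                       # avoids comparing symbols/subtrees
    heap = [(p, next(tiebreak), sym) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)    # two least probable nodes
        p2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(tiebreak), (left, right)))
    code = {}
    def walk(node, prefix):
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            code[node] = prefix or "0"
    walk(heap[0][2], "")
    return code

print(huffman_code({"a": .1, "b": .2, "c": .2, "d": .5}))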

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. Its self-information is
  −log2(.999) ≈ .00144 bits
If we were to send 1000 such symbols we might hope to use 1000 × .00144 ≈ 1.44 bits.
Using Huffman, we take at least 1 bit per symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra bits per symbol
 Larger model to be transmitted
Shannon took infinite sequences, and k  ∞ !!
In practice, we have:
 The model takes |S|^k · (k * log |S|) + h^2 bits (where h might be |S|)
 It is H0(S^L) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
[Figure: word-based tagged Huffman for T = “bzip or not bzip”. The tree has fan-out 128, so each node symbol uses 7 bits; the extra tag bit marks the first byte of every codeword, making codewords byte-aligned. E.g. the codeword of bzip is 1a 0b, and C(T) is the concatenation of the codewords of the words and separators of T]

CGrep and other ideas...
P = bzip = 1a 0b
T = “bzip or not bzip”

[Figure: GREP is run directly on the compressed text C(T): the tagged, byte-aligned codewords let the codeword of P be matched against C(T) without decompression; tag bits confirm (yes) or reject (no) each candidate alignment]

Speed ≈ Compression ratio

You find this under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary = {bzip, not, or, space}
P = bzip = 1a 0b
S = “bzip or not bzip”

[Figure: the query codeword 1a 0b is matched against C(S); the tag bits ensure that only byte-aligned candidates are accepted (yes/no per candidate)]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the pattern P[1,m] in the text T[1,n].
[Figure: P slid along T]

 Naïve solution
  For any position i of T, check if T[i,i+m-1] = P[1,m]
  Complexity: O(nm) time

 (Classical) optimal solutions based on comparisons
  Knuth-Morris-Pratt
  Boyer-Moore
  Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons

 Strings are also numbers, H: strings → numbers.
 Let s be a string of length m:

   H(s) = Σ_{i=1}^{m} 2^(m−i) · s[i]

 Example: P = 0101, H(P) = 2^3·0 + 2^2·1 + 2^1·0 + 2^0·1 = 5
 s = s' if and only if H(s) = H(s')

Definition:
let Tr denote the m-length substring of T starting at position r (i.e., Tr = T[r, r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons

 We can compute H(Tr) from H(Tr-1):

   H(Tr) = 2·H(Tr-1) − 2^m·T[r-1] + T[r+m-1]

 Example: T = 10110101, m = 4
   T1 = 1011, T2 = 0110
   H(T1) = H(1011) = 11
   H(T2) = H(0110) = 2·11 − 2^4·1 + 0 = 22 − 16 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P = 101111, q = 7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally:
 (1·2 + 0) mod 7 = 2
 (2·2 + 1) mod 7 = 5
 (5·2 + 1) mod 7 = 4
 (4·2 + 1) mod 7 = 2
 (2·2 + 1) mod 7 = 5
 5 mod 7 = 5 = Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1), since
 2^m (mod q) = 2·(2^(m-1) (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
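A sketch of the fingerprint scan with a fixed prime q (a real implementation picks q at random below the chosen I, as described above); the explicit verification step makes it error-free at the cost of extra time on hash collisions.

def karp_rabin(T: str, P: str, q: int = 2_147_483_647):
    n, m = len(T), len(P)
    if m > n:
        return []
    base = 256
    hp = ht = 0
    for i in range(m):
        hp = (hp * base + ord(P[i])) % q
        ht = (ht * base + ord(T[i])) % q
    top = pow(base, m - 1, q)                  # used to drop the leading char
    occ = []
    for r in range(n - m + 1):
        if ht == hp and T[r:r + m] == P:       # verify, to rule out false matches
            occ.append(r + 1)                  # 1-based positions, as in the slides
        if r + m < n:
            ht = ((ht - ord(T[r]) * top) * base + ord(T[r + m])) % q
    return occ

print(karp_rabin("10110101", "0101"))          # [5]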

Problem 1: Solution
Dictionary = {bzip, not, or, space}
P = bzip = 1a 0b
S = “bzip or not bzip”

[Figure: scan C(S) for the tagged codeword 1a 0b; the tag bits guarantee byte-aligned matches, so every hit is a true occurrence of the term bzip]

Speed ≈ Compression ratio

The Shift-And method

 Define M to be a binary m by n matrix such that:
  M(i,j) = 1 iff the first i characters of P exactly match the i characters of T ending at character j,
  i.e., M(i,j) = 1 iff P[1…i] = T[j-i+1…j]

 Example: T = california and P = for
[Figure: the 3 x 10 matrix M for P = for over T = california; M(1,5)=1 since T[5]=f, M(2,6)=1 since T[5,6]=fo, M(3,7)=1 since T[5,7]=for; all other entries are 0]

How does M solve the exact match problem? An occurrence of P ends at position j iff M(m,j) = 1.

How to construct M

 We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one
 Machines can perform bit and arithmetic operations between two words in constant time.
 Examples:
  And(A,B) is the bit-wise and between A and B.
  BitShift(A) is the value derived by shifting A's bits down by one position and setting the first bit to 1.

 Let w be the word size (e.g., 32 or 64 bits). We'll assume m ≤ w. NOTICE: any column of M fits in a memory word.

How to construct M

 We want to exploit bit-parallelism to compute the j-th column of M from the (j-1)-th one
 We define the m-length binary vector U(x) for each character x of the alphabet: U(x) has a 1 in the positions where x appears in P.
 Example: P = abaac
   U(a) = (1,0,1,1,0)^T,  U(b) = (0,1,0,0,0)^T,  U(c) = (0,0,0,0,1)^T

How to construct M

 Initialize column 0 of M to all zeros
 For j > 0, the j-th column is obtained as

   M(j) = BitShift( M(j-1) ) & U( T[j] )

 For i > 1, entry M(i,j) = 1 iff
  (1) the first i-1 characters of P match the i-1 characters of T ending at character j-1  ⇔ M(i-1,j-1) = 1
  (2) P[i] = T[j]  ⇔ the i-th bit of U(T[j]) = 1
 BitShift moves bit M(i-1,j-1) to the i-th position; AND-ing with the i-th bit of U(T[j]) establishes whether both conditions hold.
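A runnable sketch of the Shift-And scan using Python integers as bit vectors (bit i of the mask plays the role of row i+1 of M):

def shift_and(T: str, P: str):
    m = len(P)
    U = {}
    for i, c in enumerate(P):
        U[c] = U.get(c, 0) | (1 << i)        # U(c): 1 where c appears in P
    M = 0
    last = 1 << (m - 1)                      # bit of row m (a full match)
    occ = []
    for j, c in enumerate(T, start=1):
        M = ((M << 1) | 1) & U.get(c, 0)     # M(j) = BitShift(M(j-1)) & U(T[j])
        if M & last:
            occ.append(j - m + 1)            # occurrence ending at position j
    return occ

print(shift_and("xabxabaaca", "abaac"))      # [5]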

An example (P = abaac, T = xabxabaaca)

[Figure: columns j = 1, 2, 3 and 9 of M computed via M(j) = BitShift(M(j-1)) & U(T[j]); e.g. M(1) = BitShift(M(0)) & U(x) = 0, M(2) = BitShift(M(1)) & U(a) = (1,0,0,0,0)^T, M(3) = BitShift(M(2)) & U(b) = (0,1,0,0,0)^T, and at j = 9 the last bit of M(9) is 1, signalling an occurrence of P ending at position 9]

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: Another solution
Dictionary = {bzip, not, or, space}
P = bzip = 1a 0b
S = “bzip or not bzip”

[Figure: as before, the compressed text C(S) is scanned for the codeword of P]

Speed ≈ Compression ratio

Problem 2
Dictionary = {bzip, not, or, space}
Given a pattern P, find all the occurrences in S of all terms containing P as a substring

P = o   (it occurs within “or” and “not”)
S = “bzip or not bzip”
 not = 1g 0g 0a
 or  = 1g 0a 0b

[Figure: the codewords of all dictionary terms containing P are searched in C(S)]

Speed ≈ Compression ratio? No! Why?
A scan of C(S) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary = {bzip, not, or, space}
Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing at most k mismatches

P = bot, k = 2
S = “bzip or not bzip”

[Figure: the dictionary terms within k mismatches of P (e.g. not) are identified, and their codewords are then searched in C(S)]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep

 Our current goal: given k, find all the occurrences of P in T with up to k mismatches
 We define the matrix Ml to be an m by n binary matrix such that:

  Ml(i,j) = 1 iff there are no more than l mismatches between the first i characters of P and the i characters of T ending at character j.

 What is M0? It is the exact-match matrix M of the Shift-And method.
 How does Mk solve the k-mismatch problem? P occurs with ≤ k mismatches ending at position j iff Mk(m,j) = 1.

Computing Mk

 We compute Ml for all l = 0, … , k.
 For each j compute M(j), M1(j), … , Mk(j)
 For all l, initialize Ml(0) to the zero vector.
 In order to compute Ml(j), we observe that entry (i,j) is 1 iff one of two cases holds.

Computing Ml: case 1
 The first i-1 characters of P match a substring of T ending at j-1 with at most l mismatches, and the next pair of characters in P and T are equal:

   BitShift( Ml(j-1) ) & U(T[j])

Computing Ml: case 2
 The first i-1 characters of P match a substring of T ending at j-1 with at most l-1 mismatches (the j-th pair may mismatch):

   BitShift( Ml-1(j-1) )

Computing Ml
 Putting the two cases together:

   Ml(j) = [ BitShift( Ml(j-1) ) & U(T[j]) ]  OR  BitShift( Ml-1(j-1) )

Example M1
P = abaad, T = xabxabaaca

[Figure: the 5 x 10 matrices M0 and M1; e.g. M1(5,9) = 1, i.e. P = abaad occurs at T[5..9] = abaac with one mismatch]

How much do we pay?

 The running time is O(kn(1+m/w))
 Again, the method is practically efficient for small m.
 Only O(k) columns of M are needed at any given time. Hence the space used by the algorithm is O(k) memory words.

Problem 3: Solution
Dictionary = {bzip, not, or, space}
Given a pattern P, find all the occurrences in S of all terms containing P as a substring, allowing k mismatches

P = bot, k = 2  it matches the term not = 1g 0g 0a
S = “bzip or not bzip”

[Figure: the codeword of each matching term is then searched in C(S)]

Agrep: more sophisticated operations

 The Shift-And method can solve other ops
 The edit distance between two strings p and s is d(p,s) = the minimum number of operations needed to transform p into s via three ops:
  Insertion: insert a symbol in p
  Deletion: delete a symbol from p
  Substitution: change a symbol of p into a different one
 Example: d(ananas, banane) = 3

 Search by regular expressions
  Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword lengths, and thus build the tree…
This may be extremely time/space costly when you deal with Gbs of textual data.
A simple algorithm: sort the pi in decreasing order, and encode si via the variable-length code for the integer i (its rank).

γ-code for integer encoding
 γ(x) = (Length − 1) zeroes, followed by x in binary
 x > 0 and Length = ⌊log2 x⌋ + 1
 e.g., 9 is represented as <000, 1001>
 the γ-code for x takes 2⌊log2 x⌋ + 1 bits (i.e. a factor of 2 from optimal)
 Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers

It is a prefix-free encoding…
Exercise: given the following sequence of γ-coded integers, reconstruct the original sequence:
 0001000001100110000011101100111
Answer: 8, 6, 3, 59, 7
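A small encoder/decoder for the γ-code just described; it reproduces the exercise above.

def gamma_encode(x: int) -> str:
    b = bin(x)[2:]                   # binary representation of x (x > 0)
    return "0" * (len(b) - 1) + b    # Length-1 zeroes, then x in binary

def gamma_decode(bits: str):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":        # count the zeroes = Length - 1
            z += 1; i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out

print(gamma_encode(9))                                   # 0001001
print(gamma_decode("0001000001100110000011101100111"))   # [8, 6, 3, 59, 7]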

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i):
 |γ(i)| ≤ 2 log2 i + 1
How good is this approach w.r.t. Huffman?

Key fact: 1 ≥ Σ_{i=1,...,x} pi ≥ x · px   x ≤ 1/px

The cost of the encoding is (recall i ≤ 1/pi):

 Σ_{i=1,...,|S|} pi · |γ(i)| ≤ Σ_{i=1,...,|S|} pi · [ 2 log2(1/pi) + 1 ] = 2 H0(S) + 1

So the compression ratio is ≤ 2 H0(S) + 1: not much worse than Huffman, and improvable to H0(S) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move-to-Front Coding
Transforms a char sequence into an integer sequence, that can then be var-length coded.

 Start with the list of symbols L = [a,b,c,d,…]
 For each input symbol s:
  1) output the position of s in L
  2) move s to the front of L

There is a memory: it exploits temporal locality, and it is dynamic.
Properties:
 X = 1^n 2^n 3^n … n^n  Huff = Θ(n^2 log n) bits, MTF = O(n log n) + n^2 bits
 Not much worse than Huffman ... but it may be far better
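A direct sketch of the MTF transform over a small alphabet (positions are 0-based, so a symbol already at the front encodes as 0, consistently with the Bzip example later on).

def mtf_encode(s, alphabet):
    L = list(alphabet)
    out = []
    for c in s:
        i = L.index(c)                   # position of c in the current list
        out.append(i)
        L.pop(i); L.insert(0, c)         # move c to the front
    return out

def mtf_decode(codes, alphabet):
    L = list(alphabet)
    out = []
    for i in codes:
        c = L[i]
        out.append(c)
        L.pop(i); L.insert(0, c)
    return "".join(out)

s = "aaabbbbccca"
codes = mtf_encode(s, "abcd")
print(codes)                             # [0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 2]
assert mtf_decode(codes, "abcd") == s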

MTF: how good is it?
Encode the integers via γ-coding: |γ(i)| ≤ 2 log2 i + 1.
Put S (the initial symbol list) in front and consider the cost of encoding: for each symbol x, the MTF positions are bounded by the gaps between its consecutive occurrences p1^x < p2^x < ..., so the cost is at most

 O(|S| log |S|) + Σ_x Σ_{i≥2} γ( p_i^x − p_{i-1}^x )

By Jensen's inequality this is

 ≤ O(|S| log |S|) + Σ_x n_x · [ 2 log2(N/n_x) + 1 ]
 = O(|S| log |S|) + N · [ 2 H0(X) + 1 ]

Hence La[mtf] ≤ 2 H0(X) + O(1) bits per symbol.

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
 abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just the run lengths and one initial bit.
There is a memory. Properties:
 It exploits spatial locality, and it is a dynamic code
 X = 1^n 2^n 3^n … n^n  Huff(X) = Θ(n^2 log n) > Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol an interval range in [0,1):

 f(i) = Σ_{j < i} p(j)

e.g. p(a) = .2, p(b) = .5, p(c) = .3   f(a) = .0, f(b) = .2, f(c) = .7
[Figure: the unit interval split into a = [0,.2), b = [.2,.7), c = [.7,1)]

The interval for a particular symbol will be called the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
[Figure: starting from [0,1), symbol b restricts the interval to [.2,.7); within it, a restricts to [.2,.3); within it, c restricts to [.27,.3)]

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c1...cn with probabilities p[c], use:

 l0 = 0,  li = l(i-1) + s(i-1) · f[ci]
 s0 = 1,  si = s(i-1) · p[ci]

where f[c] is the cumulative probability up to symbol c (not included).
The final interval size is

 sn = Π_{i=1..n} p[ci]

The interval [ln, ln + sn) for a message sequence will be called the sequence interval.

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval
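A tiny sketch that computes the sequence interval with the recurrences above; on the probabilities of the running example it reproduces [.27, .3) for "bac".

def sequence_interval(msg, p, f):
    l, s = 0.0, 1.0
    for c in msg:
        l = l + s * f[c]        # li = l(i-1) + s(i-1) * f[ci]
        s = s * p[c]            # si = s(i-1) * p[ci]
    return l, l + s

p = {"a": .2, "b": .5, "c": .3}
f = {"a": .0, "b": .2, "c": .7}
print(sequence_interval("bac", p, f))   # (0.27, 0.30), up to float rounding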

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:
[Figure: .49 falls in the symbol interval of b ([.2,.7)); within it, .49 falls in the sub-interval of b ([.3,.55)); within that, .49 falls in the sub-interval of c ([.475,.55))]

The message is bbc.

Representing a real number
Binary fractional representation:
 .75 = .11,   1/3 = .010101...,   11/16 = .1011

Algorithm (emit the bits of x in [0,1)):
 1. x = 2 * x
 2. if x < 1 output 0
 3. else x = x - 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
 e.g. [0,.33) → .01,   [.33,.66) → .1,   [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as intervals by considering all completions:

 code   min     max     interval
 .11    .110    .111    [.75, 1.0)
 .101   .1010   .1011   [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number whose code interval is contained in the sequence interval (a dyadic number).
[Figure: the sequence interval [.61,.79) contains the code interval of .101 = [.625,.75)]

Can use L + s/2 truncated to 1 + ⌈log2 (1/s)⌉ bits

Bound on Arithmetic length

Note that −log2 s + 1 = log2 (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most
 1 + ⌈log2 (1/s)⌉ = 1 + ⌈log2 Π_i (1/pi)⌉
 ≤ 2 + Σ_{i=1,n} log2 (1/pi)
 = 2 + Σ_{k=1,|S|} n·pk · log2 (1/pk)
 = 2 + n·H0   bits

In practice it takes nH0 + 0.02·n bits, because of rounding.

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
String = ACCBACCACBA B,   k = 2

Order-0 context (Empty):  A = 4, B = 2, C = 5, $ = 3

Order-1 contexts:
 A:  C = 3, $ = 1
 B:  A = 2, $ = 1
 C:  A = 1, B = 2, C = 2, $ = 3

Order-2 contexts:
 AC:  B = 1, C = 2, $ = 2
 BA:  C = 1, $ = 1
 CA:  C = 1, $ = 1
 CB:  A = 2, $ = 1
 CC:  A = 1, B = 1, $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb
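A compact sketch of the LZ78 coder/decoder just traced (dictionary ids start at 1, id 0 denotes the empty phrase); on the example string it emits exactly the pairs shown above.

def lz78_encode(s):
    dic = {"": 0}
    out, w = [], ""
    for c in s:
        if w + c in dic:
            w += c                              # extend the current match
        else:
            out.append((dic[w], c))             # emit (id of w, next char)
            dic[w + c] = len(dic)               # add the new phrase
            w = ""
    if w:
        out.append((dic[w[:-1]], w[-1]))        # flush a pending phrase, if any
    return out

def lz78_decode(pairs):
    phrases, out = [""], []
    for idx, c in pairs:
        p = phrases[idx] + c
        phrases.append(p)
        out.append(p)
    return "".join(out)

pairs = lz78_encode("aabaacabcabcb")
print(pairs)        # [(0,'a'), (1,'b'), (1,'a'), (0,'c'), (2,'c'), (5,'b')]
assert lz78_decode(pairs) == "aabaacabcabcb"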

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
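A tiny forward/backward BWT sketch matching the example. The inverse uses the LF-mapping described above; unlike the slide's pseudocode, this sketch locates the starting row via the end-marker '#' (the row whose last character is '#' is the rotation equal to T itself), which is a small implementation convenience.

def bwt(t: str) -> str:
    assert t.endswith("#")                        # unique, lexicographically smallest end-marker
    rot = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(r[-1] for r in rot)            # last column L

def ibwt(L: str) -> str:
    n = len(L)
    F = sorted(L)
    count, rank = {}, []                          # rank[i] = # of L[i]'s chars before position i
    for c in L:
        rank.append(count.get(c, 0))
        count[c] = count.get(c, 0) + 1
    first = {}
    for i, c in enumerate(F):
        first.setdefault(c, i)                    # first row whose F-char is c
    LF = [first[c] + rank[i] for i, c in enumerate(L)]
    T = [""] * n
    r = L.index("#")                              # row of the rotation that is T itself
    for i in range(n - 1, -1, -1):                # fill T backward: L[r] precedes F[r]
        T[i] = L[r]
        r = LF[r]
    return "".join(T)

L = bwt("mississippi#")
print(L)                                          # ipssm#pissii
assert ibwt(L) == "mississippi#"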

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Q(n2 log n) time in the worst-case
• Q(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in - degree ( u )  k ] 

a  2.1

1

k

a

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77-scheme provides an efficient, optimal solution




fknown is the "previously encoded text": compress the concatenation fknown·fnew, starting the parsing from fnew

zdelta is one of the best implementations
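A rough sketch of the idea (not zdelta itself): zlib's preset-dictionary support lets the LZ77 parser of fnew copy from fknown, which approximates delta compression for files that fit the 32KB window:

    import zlib

    def delta_compress(f_known: bytes, f_new: bytes) -> bytes:
        # f_known acts as a preset dictionary: copies in f_new may refer back into it
        co = zlib.compressobj(level=9, zdict=f_known)
        return co.compress(f_new) + co.flush()

    def delta_decompress(f_known: bytes, f_delta: bytes) -> bytes:
        do = zlib.decompressobj(zdict=f_known)
        return do.decompress(f_delta) + do.flush()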
            Emacs size    Emacs time
uncompr     27Mb          ---
gzip        8Mb           35 secs
zdelta      1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.
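A hedged sketch of this reduction, assuming the networkx library and reusing the zlib preset-dictionary trick above as a stand-in for zdelta (helper names are illustrative):

    import zlib
    import networkx as nx

    def zdelta_size(ref: bytes, tgt: bytes) -> int:
        # stand-in for zdelta: deflate tgt using ref as preset dictionary
        co = zlib.compressobj(level=9, zdict=ref)
        return len(co.compress(tgt) + co.flush())

    def choose_references(files):
        # files: dict name -> bytes; returns for each file its reference ("dummy" = gzip alone)
        G = nx.DiGraph()
        for name, data in files.items():
            G.add_edge("dummy", name, weight=len(zlib.compress(data, 9)))     # gzip-coding
            for other, odata in files.items():
                if other != name:
                    G.add_edge(other, name, weight=zdelta_size(odata, data))  # delta-coding
        # min branching = directed spanning tree of minimum total cost, rooted at the dummy node
        B = nx.minimum_spanning_arborescence(G)
        return {v: u for u, v in B.edges()}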

[Figure: example weighted graph GF with dummy node 0 and files 1, 2, 3, 5; edge weights such as 620, 2000, 220, 123, 20, 5 give the gzip/zdelta coding costs]

            space    time
uncompr     30Mb     ---
tgz         20%      linear
THIS        8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
            space    time
uncompr     260Mb    ---
tgz         12%      2 mins
THIS        8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
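A toy sketch of the two sides of an rsync-like exchange. Real rsync filters candidate blocks with a cheap rolling hash that slides in O(1) per position and confirms with MD5; here, for brevity, only a strong hash is used and it is recomputed at every position:

    import hashlib

    def block_hashes(f_old: bytes, B: int):
        # client side: one strong hash per full block of f_old, mapped to its offset
        return {hashlib.md5(f_old[i:i + B]).hexdigest(): i
                for i in range(0, len(f_old) - B + 1, B)}

    def encode(f_new: bytes, hashes: dict, B: int):
        # server side: scan f_new; emit either a reference to an old block or a literal byte
        out, j = [], 0
        while j < len(f_new):
            h = hashlib.md5(f_new[j:j + B]).hexdigest()
            if len(f_new) - j >= B and h in hashes:
                out.append(("copy", hashes[h])); j += B
            else:
                out.append(("lit", f_new[j])); j += 1
        return out

    def decode(ops, f_old: bytes, B: int) -> bytes:
        res = bytearray()
        for kind, v in ops:
            res += f_old[v:v + B] if kind == "copy" else bytes([v])
        return bytes(res)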

Rsync: some experiments

          gcc size   emacs size
total     27288      27326
gzip      7563       8577
zdelta    227        1431
rsync     964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends the hashes (unlike rsync, where the client does), and the client checks them.
The server deploys the common fref to compress the new ftar (rsync just compresses ftar on its own).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level ≤ k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
T# = mississippi#

[Figure: the suffix tree of T#; edges are labeled with substrings (e.g. #, i, s, si, ssi, p, pi#, ppi#, i#, mississippi#) and each of the 12 leaves stores the starting position of its suffix]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Storing SUF(T) explicitly takes Θ(N^2) space; the suffix array SA stores only the suffix pointers (starting positions).

T = mississippi#        P = si

 SA    SUF(T)
 12    #
 11    i#
  8    ippi#
  5    issippi#
  2    ississippi#
  1    mississippi#
 10    pi#
  9    ppi#
  7    sippi#
  4    sissippi#
  6    ssippi#
  3    ssissippi#
Suffix Array
• SA: Q(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

Improvable to O(p + log2 N) [Manber-Myers, ’90] and to O(p + log2 |S|) [Cole et al., ’06]
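A small Python sketch of the data structure and of the indirected binary search; the construction below is the naive O(N^2 log N) one, fine only for short texts:

    def suffix_array(T: str):
        # toy construction: sort the suffix starting positions lexicographically
        return sorted(range(len(T)), key=lambda i: T[i:])

    def occurrences(T: str, SA, P: str):
        # indirected binary search: O(|P|) chars compared per step, O(|P| log N) overall
        def lower(target):
            lo, hi = 0, len(SA)
            while lo < hi:
                mid = (lo + hi) // 2
                if T[SA[mid]:SA[mid] + len(P)] < target: lo = mid + 1
                else: hi = mid
            return lo
        l = lower(P)
        r = lower(P + chr(0x10FFFF))   # first suffix whose |P|-length prefix is > P
        return sorted(SA[l:r])         # starting positions of the occurrences

    # T = "mississippi#";  SA = suffix_array(T)
    # occurrences(T, SA, "si")  ->  [3, 6]   (0-based; positions 4 and 7 in the slides' numbering)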

Locating the occurrences
[Figure: the binary search for P = si isolates the contiguous SA range holding sippi# and sissippi#, so occ = 2 and the occurrences start at positions 7 and 4 of T = mississippi#]
Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp = [0, 0, 1, 4, 0, 0, 1, 0, 2, 1, 3]
SA  = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
T = mississippi#

e.g. the adjacent suffixes issippi# and ississippi# share a prefix of length 4.
• How long is the common prefix between T[i,...] and T[j,...] ?
• It is the min of the subarray Lcp[h,k-1], where SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
• Search for an entry Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times ?
• Search for a window Lcp[i,i+C-2] whose entries are all ≥ L.
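A sketch of Kasai's linear-time LCP construction, plus one of the mining queries above (it assumes the suffix_array sketch given earlier):

    def lcp_array(T: str, SA):
        # Kasai's algorithm: LCP[k] = lcp between the suffixes SA[k] and SA[k+1], in O(N) time
        n, rank = len(T), [0] * len(T)
        for k, i in enumerate(SA):
            rank[i] = k
        LCP, h = [0] * (n - 1), 0
        for i in range(n):
            if rank[i] + 1 < n:
                j = SA[rank[i] + 1]
                while i + h < n and j + h < n and T[i + h] == T[j + h]:
                    h += 1
                LCP[rank[i]] = h
                if h:
                    h -= 1
            else:
                h = 0
        return LCP

    def has_repeat_of_length(LCP, L):
        # a repeated substring of length >= L exists iff two adjacent suffixes share >= L chars
        return any(v >= L for v in LCP)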


Slide 242

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
    C * p * f/(1+f),   which is at least 10^4 * f/(1+f).

If we fetch B ≈ 4Kb in time C, and the algorithm uses all of it:
    (1/B) * (p * f/(1+f) * C)  ≈  30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
        4K    8K    16K    32K     128K    256K    512K    1M
n^3     22s   3m    26m    3.5h    28h     --      --      --
n^2     0     0     0      1s      26s     106s    7m      28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm
    sum = 0; max = -1;
    for i = 1, ..., n do
        if (sum + A[i] ≤ 0) then sum = 0;
        else { sum += A[i]; max = MAX{max, sum}; }

Note:
• Sum < 0 when OPT starts;
• Sum > 0 within OPT
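A runnable sketch of this linear-time scan, returning the best sum together with the window that achieves it:

    def max_subarray(A):
        # one pass: reset the running sum when it would drop to <= 0, track the best window seen
        best, best_rng = float("-inf"), None
        s, start = 0, 0
        for i, x in enumerate(A):
            if s + x <= 0:
                s, start = 0, i + 1        # restart just after position i
            else:
                s += x
                if s > best:
                    best, best_rng = s, (start, i)
        return best, best_rng

    # max_subarray([2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7])  ->  (12, (2, 6))   i.e. 6+1-2+4+3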

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 10^9 random I/Os = 10^9 * 5ms ≈ 2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01  if (i < j) then
02      m = (i+j)/2;            // Divide
03      Merge-Sort(A,i,m);      // Conquer
04      Merge-Sort(A,m+1,j);
05      Merge(A,i,m,j)          // Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:
    n = 10^9 tuples  ⇒  a few Gbs
    Typical disk (Seagate Cheetah 150Gb): seek time ≈ 5ms

Analysis of mergesort on disk:
    It is an indirect sort: Θ(n log2 n) random I/Os
    [5ms] * n log2 n  ≈  1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
[Figure: merge-sort recursion tree on 16 keys, with log2 N levels of pairwise merged runs]

How do we deploy the disk/memory features ?
N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
[Figure: X input runs (INPUT 1 ... INPUT X) streamed from disk through main-memory buffers of B items each, merged into one OUTPUT buffer and written back to disk]

Multiway Merging

[Figure: X = M/B input buffers Bf1..BfX, one per sorted run, plus one output buffer Bfo, each holding B items. Repeatedly move min(Bf1[p1], Bf2[p2], …, BfX[pX]) to Bfo; when a pointer pi reaches B, fetch the next page of run i; when Bfo is full, flush it to the merged output run; stop at EOF]
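A sketch of one merge pass using Python's heapq.merge, which performs exactly this buffered X-way merge in memory (file layout and page buffering are simplified; run files are assumed to hold one sorted integer per line):

    import heapq

    def merge_pass(run_files, out_path):
        # merge several sorted runs into a single sorted run
        ins = [open(p) for p in run_files]
        try:
            with open(out_path, "w") as out:
                streams = ((int(line) for line in f) for f in ins)
                for x in heapq.merge(*streams):
                    out.write(f"{x}\n")
        finally:
            for f in ins:
                f.close()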

Cost of Multi-way Merge-Sort


Number of passes = log_{M/B} #runs  ≤  log_{M/B} (N/M)

Optimal cost = Θ( (N/B) · log_{M/B} (N/M) )  I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm
    Use a pair of variables <X, C>, with C initially 0
    For each item s of the stream:
        if (C == 0)       then { X = s; C = 1; }
        else if (X == s)  then C++;
        else C--;
    Return X;
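The same one-pass majority vote as runnable Python (a sketch; the returned candidate is guaranteed correct only when some item really occurs > N/2 times):

    def majority_candidate(stream):
        # Boyer-Moore majority vote: O(1) space
        X, C = None, 0
        for s in stream:
            if C == 0:
                X, C = s, 1
            elif X == s:
                C += 1
            else:
                C -= 1
        return X

    # majority_candidate("bacccdcbaaaccbccc")  ->  'c'   (c occurs 9 times out of 17)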

Proof
Problems arise if the mode occurs ≤ N/2 times.
If X ≠ y at the end, then every one of y's occurrences has a "negative" mate, hence the number of
mates is ≥ #occ(y), and so N ≥ 2 * #occ(y): this contradicts #occ(y) > N/2.

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million documents, t = 500K terms.  Entry is 1 if the play contains the word, 0 otherwise.

             Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony              1               1              0          0       0        1
Brutus              1               1              0          1       0        0
Caesar              1               1              0          1       1        1
Calpurnia           0               1              0          0       0        0
Cleopatra           1               0              0          0       0        0
mercy               1               0              1          1       1        1
worser              1               0              1          1       1        0

Space is 500Gb !

Solution 2: Inverted index
Brutus     →  2, 4, 8, 16, 32, 64, 128
Calpurnia  →  1, 2, 3, 5, 8, 13, 21, 34
Caesar     →  13, 16

We can still do better: i.e. 30÷50% of the original text.

1. Typically we use about 12 bytes per posting
2. We have 10^9 total terms  ⇒  at least 12Gb of space
3. Compressing the 6Gb of documents gets 1.5Gb of data
   A better index, but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2^n but we have fewer compressed messages:

    Σ_{i=1,...,n-1} 2^i  =  2^n − 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
    i(s)  =  log2 (1/p(s))  =  −log2 p(s)

Lower probability  ⇒  higher information

Entropy is the weighted average of i(s):

    H(S)  =  Σ_{s ∈ S} p(s) · log2 (1/p(s))    bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
    La(C)  =  Σ_{s ∈ S} p(s) · L[s]

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  ⇒  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

    H(S)  ≤  La(C)
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

    La(C)  ≤  H(S) + 1

(The Shannon code takes ⌈log2 (1/p)⌉ bits)

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
[Figure: Huffman tree built by repeatedly merging the two least-probable nodes: a(.1)+b(.2) → (.3), (.3)+c(.2) → (.5), (.5)+d(.5) → (1)]
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?
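A compact sketch of the Huffman construction with a heap (ties are broken by insertion order, so the resulting codewords may differ from the slide's while having the same optimal lengths):

    import heapq
    from itertools import count

    def huffman_codes(probs):
        # probs: dict symbol -> probability; returns dict symbol -> codeword ('0'/'1' string)
        tie = count()                                   # tie-breaker so tuples never compare dicts
        heap = [(p, next(tie), {s: ""}) for s, p in probs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            p1, _, c1 = heapq.heappop(heap)             # two least-probable trees
            p2, _, c2 = heapq.heappop(heap)
            merged = {s: "0" + w for s, w in c1.items()}
            merged.update({s: "1" + w for s, w in c2.items()})
            heapq.heappush(heap, (p1 + p2, next(tie), merged))
        return heap[0][2]

    # huffman_codes({'a': .1, 'b': .2, 'c': .2, 'd': .5})
    #   -> an optimal prefix code with lengths |a|=|b|=3, |c|=2, |d|=1
    #      (the actual bits may differ from the slide's 000/001/01/1)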

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
    −log2(0.999)  ≈  0.00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
[Figure: a 128-ary word-based Huffman tree over the words of T = "bzip or not bzip" (bzip, or, not, and the space separator); each codeword is a sequence of bytes carrying 7 bits of Huffman code per byte, and the tag bit of the first byte of every codeword is set, so codewords are byte-aligned and tagged]

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

    H(s)  =  Σ_{i=1,...,m} 2^{m−i} · s[i]

P = 0101
H(P) = 2^3·0 + 2^2·1 + 2^1·0 + 2^0·1 = 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1):

    H(T_r)  =  2·H(T_{r−1})  −  2^m·T(r−1)  +  T(r+m−1)

T = 10110101,  m = 4
T1 = 1011,  T2 = 0110
H(T1) = H(1011) = 11
H(T2) = H(0110) = 2·11 − 2^4·1 + 0 = 22 − 16 + 0 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
    1·2 (mod 7) + 0 = 2
    2·2 (mod 7) + 1 = 5
    5·2 (mod 7) + 1 = 11 ≡ 4  (mod 7)
    4·2 (mod 7) + 1 = 9 ≡ 2   (mod 7)
    2·2 (mod 7) + 1 = 5  =  Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1):
    2^m (mod q)  =  2·(2^{m-1} (mod q))  (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
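A sketch of the whole Karp-Rabin search with verification of candidate positions (so it never reports a false match); the prime q below is just an illustrative choice:

    def karp_rabin(T: str, P: str, q: int = (1 << 31) - 1):
        # rolling fingerprints mod a prime q; candidates are verified before being reported
        n, m = len(T), len(P)
        if m > n:
            return []
        pow_m = pow(2, m - 1, q)                     # 2^(m-1) mod q, to drop the leading char
        hP = hT = 0
        for i in range(m):
            hP = (2 * hP + ord(P[i])) % q
            hT = (2 * hT + ord(T[i])) % q
        occ = []
        for r in range(n - m + 1):
            if hT == hP and T[r:r + m] == P:         # verify to rule out a false match
                occ.append(r)
            if r + m < n:                            # slide the window by one position
                hT = (2 * (hT - ord(T[r]) * pow_m) + ord(T[r + m])) % q
        return occ

    # karp_rabin("10110101", "0101")  ->  [4]   (0-based; position 5 in the slides' numbering)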

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

[Figure: the 3×10 matrix M for P = for and T = california; column j holds the bits M(1,j), M(2,j), M(3,j), and the only 1-entries are M(1,5), M(2,6), M(3,7), i.e. the occurrence of "for" ending at position 7]

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit bit-parallelism to compute the j-th column of M from the (j−1)-th one.
We define the m-length binary vector U(x) for each character x in the alphabet: U(x) is set to 1
in the positions of P where character x appears.

Example: P = abaac
    U(a) = (1,0,1,1,0)      U(b) = (0,1,0,0,0)      U(c) = (0,0,0,0,1)

How to construct M



Initialize column 0 of M to all zeros
For j > 0, the j-th column is obtained as

    M(j)  =  BitShift( M(j−1) )  &  U( T[j] )

For i > 1, entry M(i,j) = 1 iff
  (1) the first i−1 characters of P match the i−1 characters of T ending at character j−1   ⇔ M(i−1,j−1) = 1
  (2) P[i] = T[j]   ⇔ the i-th bit of U(T[j]) = 1

BitShift moves bit M(i−1,j−1) into the i-th position; ANDing it with the i-th bit of U(T[j])
establishes whether both conditions hold.
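A sketch of the Shift-And scan with each column of M packed into a Python integer (bit i of the mask corresponds to row i+1 of M):

    def shift_and(T: str, P: str):
        # U[c] has bit i set iff P[i] == c; M is the current column packed into one integer
        m = len(P)
        U = {}
        for i, c in enumerate(P):
            U[c] = U.get(c, 0) | (1 << i)
        M, occ = 0, []
        for j, c in enumerate(T):
            # BitShift: shift the column down by one, set its first bit to 1, then AND with U(T[j])
            M = ((M << 1) | 1) & U.get(c, 0)
            if M & (1 << (m - 1)):                 # last bit set: an occurrence ends at position j
                occ.append(j - m + 1)
        return occ

    # shift_and("xabxabaaca", "abaac")  ->  [4]   (0-based; the occurrence ends at position 9)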

An example: j = 1, 2, 3, 9

[Figure: the columns M(1), M(2), M(3), ..., M(9) computed via M(j) = BitShift(M(j−1)) & U(T[j]) for P = abaac and T = xabxabaaca; at j = 9 the last bit of the column becomes 1, i.e. an occurrence of P ends at position 9]

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
    U(a) = (1,0,1,1,0)      U(b) = (1,1,0,0,0)      U(c) = (0,0,0,0,1)

What about ‘?’, ‘[^…]’ (not).

Problem 1: Another solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extension of Shift-And

    S is the concatenation of the patterns in P
    R is a bitmap of length m:  R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of the Shift-And method searching for S:
    For any symbol c, define U’(c) = U(c) AND R
        U’(c)[i] = 1 iff S[i] = c and it is the first symbol of a pattern
    For any step j:
        compute M(j), then OR it with U’(T[j]). Why?
        This sets to 1 the first bit of each pattern that starts with T[j]
    Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
there are no more than l mismatches between the
first i characters of P and the i characters of T
ending at character j.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

    BitShift( M^l(j−1) )  &  U( T[j] )

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

    BitShift( M^{l−1}(j−1) )

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1
P = abaad,  T = xabxabaaca

[Figure: the 5×10 matrices M0 (exact matching) and M1 (at most one mismatch); in particular M1(5,9) = 1, i.e. P occurs with one mismatch ending at position 9 of T]

How much do we pay?





The running time is O( k·n·(1 + m/w) ).
Again, the method is practically efficient for small m.
Only O(k) columns of M are needed at any given time, hence the space used by the algorithm is
O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Deletion: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

γ-code for integer encoding

    γ(x)  =  000.....0 (Length−1 zeroes)  followed by  x in binary,
    where x > 0 and Length = ⌊log2 x⌋ + 1

    e.g., 9 is represented as <000, 1001>.

    The γ-code for x takes 2⌊log2 x⌋ + 1 bits   (i.e. a factor of 2 from optimal)

    Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

    Answer: 8, 6, 3, 59, 7
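A small decoder that can be used to check the exercise (a sketch; it assumes the input is a valid concatenation of γ-codewords):

    def gamma_decode(bits: str):
        # read Length-1 zeroes, then Length bits of binary; repeat until the input is consumed
        out, i = [], 0
        while i < len(bits):
            L = 0
            while bits[i] == "0":                  # count the leading zeroes
                L += 1; i += 1
            out.append(int(bits[i:i + L + 1], 2))  # the next L+1 bits are x in binary
            i += L + 1
        return out

    # gamma_decode("0001000001100110000011101100111")  ->  [8, 6, 3, 59, 7]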

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code γ(i).

Recall that: |γ(i)| ≤ 2·log2 i + 1
How good is this approach wrt Huffman?  Compression ratio ≤ 2·H0(S) + 1
Key fact:
    1 ≥ Σ_{i=1,...,x} pi ≥ x·px   ⇒   x ≤ 1/px

How good is it ?
Encode the integers via γ-coding:  |γ(i)| ≤ 2·log2 i + 1
The cost of the encoding is (recall i ≤ 1/pi):

    Σ_{i=1,...,|S|} pi · |γ(i)|  ≤  Σ_{i=1,...,|S|} pi · [ 2·log2 (1/pi) + 1 ]  =  2·H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n^2 log n),  MTF = O(n log n) + n^2

Not much worse than Huffman... but it may be far better.
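A sketch of MTF encoding/decoding over an explicit list (the O(log |S|) tree-based list of the next slide is what one would use at scale):

    def mtf_encode(msg, alphabet):
        # for each symbol output its current position in the list, then move it to the front
        L, out = list(alphabet), []
        for s in msg:
            i = L.index(s)
            out.append(i)
            L.insert(0, L.pop(i))
        return out

    def mtf_decode(codes, alphabet):
        L, out = list(alphabet), []
        for i in codes:
            s = L[i]
            out.append(s)
            L.insert(0, L.pop(i))
        return "".join(out)

    # mtf_encode("aaabbbb", "abc")  ->  [0, 0, 0, 1, 0, 0, 0]   (temporal locality gives small integers)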

MTF: how good is it ?
Encode the integers via γ-coding:  |γ(i)| ≤ 2·log2 i + 1
Put S in front of the list and consider the cost of encoding:

    O(|S| log |S|)  +  Σ_{x=1,...,|S|}  Σ_{i=2,...,n_x}  |γ( p_i^x − p_{i−1}^x )|

By Jensen’s inequality:

    ≤  O(|S| log |S|)  +  Σ_{x=1,...,|S|}  n_x · [ 2·log2 (N/n_x) + 1 ]
    =  O(|S| log |S|)  +  N · [ 2·H0(X) + 1 ]

    ⇒   La[mtf]  ≤  2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1^n 2^n 3^n … n^n   ⇒   Huff(X) = Θ(n^2 log n)  >  Rle(X) = Θ(n log n)

(there is a memory)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.  p(a) = .2, p(b) = .5, p(c) = .3:

    a = [0.0, 0.2)      b = [0.2, 0.7)      c = [0.7, 1.0)

    f(i)  =  Σ_{j=1,...,i−1} p(j)          f(a) = .0,  f(b) = .2,  f(c) = .7

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
[Figure: nested intervals while coding bac — symbol b restricts [0,1) to [0.2,0.7), then a restricts it to [0.2,0.3), then c restricts it to [0.27,0.3)]

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
    l_0 = 0        l_i  =  l_{i−1} + s_{i−1} · f(c_i)
    s_0 = 1        s_i  =  s_{i−1} · p(c_i)

f[c] is the cumulative prob. up to symbol c (not included)

Final interval size is
    s_n  =  Π_{i=1,...,n} p(c_i)

The interval for a message sequence will be called the
sequence interval
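A sketch of the interval computation with exact rational arithmetic (illustrative only; real coders use the integer/scaling version described later):

    from fractions import Fraction

    def sequence_interval(msg, p):
        # p: dict symbol -> probability; f[c] = cumulative probability of the symbols before c
        syms, f, acc = sorted(p), {}, Fraction(0)
        for s in syms:
            f[s] = acc
            acc += Fraction(p[s]).limit_denominator()
        l, size = Fraction(0), Fraction(1)
        for c in msg:
            l = l + size * f[c]
            size = size * Fraction(p[c]).limit_denominator()
        return l, l + size                      # the sequence interval [l, l+size)

    # sequence_interval("bac", {'a': 0.2, 'b': 0.5, 'c': 0.3})
    #   ->  (Fraction(27, 100), Fraction(3, 10))        i.e. [.27, .3), as in the example above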

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
[Figure: 0.49 falls in b's interval [0.2,0.7), then in the sub-interval of bb = [0.3,0.55), then in the sub-interval of bbc = [0.475,0.55)]

The message is bbc.

Representing a real number
Binary fractional representation:
    .75 = .11        1/3 = .01 01 01 ...        11/16 = .1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
    code      min       max       interval
    .11       .110      .111      [.75, 1.0)
    .101      .1010     .1011     [.625, .75)

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts                       String = ACCBACCACBA B,   k = 2

Order 0              Order 1               Order 2
Context   Counts     Context   Counts      Context   Counts
Empty     A = 4      A         C = 3       AC        B = 1
          B = 2                $ = 1                 C = 2
          C = 5      B         A = 2                 $ = 2
          $ = 3                $ = 1       BA        C = 1
                     C         A = 1                 $ = 1
                               B = 2       CA        C = 1
                               C = 2                 $ = 1
                               $ = 3       CB        A = 2
                                                     $ = 1
                                           CC        A = 1
                                                     B = 1
                                                     $ = 2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character
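A sketch of the windowed LZ77 parser; run on the string of the example above it reproduces exactly those triples (W is the window size):

    def lz77_parse(T: str, W: int = 6):
        # emit triples (d, len, c): copy `len` chars from distance d back, then append char c
        out, i = [], 0
        while i < len(T):
            best_len, best_d = 0, 0
            for j in range(max(0, i - W), i):        # candidate starting positions in the window
                l = 0
                while i + l < len(T) - 1 and T[j + l] == T[i + l]:
                    l += 1                           # the match may run past i (self-overlap is fine)
                if l > best_len:
                    best_len, best_d = l, i - j
            out.append((best_d, best_len, T[i + best_len]))
            i += best_len + 1
        return out

    # lz77_parse("aacaacabcabaaac", W=6)
    #   ->  [(0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), (3, 3, 'a'), (1, 2, 'c')]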

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
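A compact Python sketch of the transform and of its inversion via the LF-array, following the two properties above (the quadratic construction is fine only for short strings; at scale one builds L from the suffix array, as shown next):

    def bwt(T: str) -> str:
        # last column of the sorted rotations (T must end with a unique smallest char, e.g. '#')
        rots = sorted(T[i:] + T[:i] for i in range(len(T)))
        return "".join(row[-1] for row in rots)

    def ibwt(L: str) -> str:
        F = sorted(L)
        # LF[i]: the k-th occurrence of c in L corresponds to the k-th occurrence of c in F
        first, seen, LF = {}, {}, [0] * len(L)
        for i, c in enumerate(F):
            first.setdefault(c, i)
        for i, c in enumerate(L):
            LF[i] = first[c] + seen.get(c, 0)
            seen[c] = seen.get(c, 0) + 1
        # walk backward from row 0 (the row starting with '#'): L[r] precedes F[r] in T
        out, r = [], 0
        for _ in range(len(L)):
            out.append(L[r])
            r = LF[r]
        s = "".join(reversed(out))
        return s[1:] + s[0]          # rotate so that the sentinel '#' goes back to the end

    # bwt("mississippi#")  ->  "ipssm#pissii"
    # ibwt("ipssm#pissii") ->  "mississippi#"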

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Q(n2 log n) time in the worst-case
• Q(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in - degree ( u )  k ] 

a  2.1

1

k

a

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



The LZ77 scheme provides an efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
            Emacs size    Emacs time
uncompr     27Mb          ---
gzip        8Mb           35 secs
zdelta      1.5Mb         42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

0

620

2

20

123

2000

20
220

3

1
5

            space    time
uncompr     30Mb     ---
tgz         20%      linear
THIS        8%       quadratic

Improvement

What about
many-to-one compression?

(group of files)

Problem: Constructing G is very costly, n2 edge calculations (zdelta exec)


We wish to exploit some pruning approach


Collection analysis: Cluster the files that appear similar and thus
good candidates for zdelta-compression. Build a sparse weighted graph
G’F containing only edges between those pairs of files



Assign weights: Estimate appropriate edge weights for G’F thus saving
zdelta execution. Nonetheless, strict n2 time
            space    time
uncompr     260Mb    ---
tgz         12%      2 mins
THIS        8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
request
f_new

update

Server





f_old
Client

client wants to update an out-dated file
server has new file but does not know the old file
update without sending entire f_new (using similarity)
rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch
Since the server has both copies of the files

The rsync algorithm
hashes
f_new
Server

encoded file

f_old
Client

The rsync algorithm

(contd)



simple, widely used, single roundtrip



optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals



choice of block size problematic (default: max{700, √n} bytes)



not good in theory: granularity of changes may disrupt use of blocks
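
A sketch of a weak rolling checksum in the spirit of rsync's (the real tool pairs it with a strong per-block hash and its own constants; the struct and field names here are illustrative). The point is that sliding the window by one position costs O(1).

#include <cstdint>

// Weak rolling checksum over a window of length n:
//   a = sum of bytes (mod 2^16), b = sum of partial sums (mod 2^16), h = a + 2^16 * b
struct Rolling {
    uint32_t a = 0, b = 0;
    size_t n = 0;
    void init(const uint8_t* x, size_t len) {
        a = b = 0; n = len;
        for (size_t i = 0; i < len; ++i) { a += x[i]; b += (uint32_t)(len - i) * x[i]; }
        a &= 0xffff; b &= 0xffff;
    }
    // Slide the window one position: drop 'out', append 'in' (O(1) per step).
    void roll(uint8_t out, uint8_t in) {
        a = (a - out + in) & 0xffff;
        b = (b - (uint32_t)n * out + a) & 0xffff;
    }
    uint32_t value() const { return a | (b << 16); }
};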

Rsync: some experiments

          gcc size    emacs size
total     27288       27326
gzip      7563        8577
zdelta    227         1431
rsync     964         4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

Server sends hashes (unlike the client in rsync), clients checks them
Server deploys the common fref to compress the new ftar (rsync compress just it).

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Next lecture

Set reconciliation
Problem: Given two sets SA and SB of integer values located
on two machines A and B, determine the difference

between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

Log n/k levels

If distance k, then on each level  k hashes not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

i P

T

T[i,N]
Occurrences of P in T = All suffixes of T having P as a prefix

P = si

T = mississippi
mississippi

4,7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Figure: suffix tree of T# = mississippi#; edges are labeled with substrings (e.g. "ssi", "ppi#", "si", "mississippi#") and the 12 leaves store the starting positions of the suffixes]

T# = mississippi#

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
5

Q(N2) space
SA

SUF(T)

12
11
8
5
2
1
10
9
7
4
6
3

#
i#
ippi#
issippi#
ississippi#
mississippi#
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

T = mississippi#
suffix pointer

P=si
Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
⇒ In practice, a total of 5N bytes

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
P is larger
2 accesses per step

P = si

Searching a pattern
Indirected binary search on SA: O(p) time per suffix cmp

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#

P is smaller

P = si

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char cmp
 overall, O(p log2 N) time

+ [Manber-Myers, ’90]
|S| [Cole et al, ’06]
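
A compact sketch of the indirect binary search just described, assuming T and its suffix array SA are given; it returns the SA range of the suffixes prefixed by P, and each probe compares at most p = |P| characters.

#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Return [lo, hi) = range of SA entries whose suffixes have prefix P; occ = hi - lo.
std::pair<int,int> sa_search(const std::string& T, const std::vector<int>& SA, const std::string& P) {
    auto lo = std::lower_bound(SA.begin(), SA.end(), P,
        [&](int sa_pos, const std::string& pat) {          // (truncated) suffix < pattern ?
            return T.compare(sa_pos, pat.size(), pat) < 0;
        });
    auto hi = std::upper_bound(SA.begin(), SA.end(), P,
        [&](const std::string& pat, int sa_pos) {          // pattern < (truncated) suffix ?
            return T.compare(sa_pos, pat.size(), pat) > 0;
        });
    return { int(lo - SA.begin()), int(hi - SA.begin()) };
}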

Locating the occurrences
[Figure: the SA of T = mississippi#; the suffixes prefixed by P = si occupy a contiguous range of SA (sippi#, sissippi#), pointing to text positions 7 and 4, hence occ = 2]

Suffix Array search
• O (p + log2 N + occ) time

#<

S<$
Suffix Trays: O (p + log2 |S| + occ)
[Cole et al., ‘06]
String B-tree

[Ferragina-Grossi, ’95]

Self-adjusting Suffix Arrays
[Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

Lcp
0
0
1
4
0
0
1
0
2
1
3

SA
12
11
8
5
2
1
10
9
7
4
6
3

T = mississippi#
4 67 9

issippi
ississippi
• How long is the common prefix between T[i,...] and T[j,...] ?
  • Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L ?
  • Search for an entry Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  • Search for a window Lcp[i,i+C-2] whose entries are all ≥ L
(a small sketch of these checks follows)
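
A small, unoptimized sketch of the checks above (here the LCPs of adjacent suffixes are computed by direct character comparison; function names are illustrative).

#include <algorithm>
#include <string>
#include <vector>

// Lcp[i] = longest common prefix of the suffixes SA[i] and SA[i+1].
std::vector<int> build_lcp(const std::string& T, const std::vector<int>& SA) {
    std::vector<int> lcp(SA.size() ? SA.size() - 1 : 0, 0);
    for (size_t i = 0; i + 1 < SA.size(); ++i) {
        int a = SA[i], b = SA[i + 1], l = 0;
        while (a + l < (int)T.size() && b + l < (int)T.size() && T[a + l] == T[b + l]) ++l;
        lcp[i] = l;
    }
    return lcp;
}

// Does there exist a repeated substring of length >= L ?
bool has_repeat(const std::vector<int>& lcp, int L) {
    return std::any_of(lcp.begin(), lcp.end(), [&](int v) { return v >= L; });
}

// Does there exist a substring of length >= L occurring >= C times ?  (assumes C >= 2)
bool has_frequent(const std::vector<int>& lcp, int L, int C) {
    int run = 0;
    for (int v : lcp) { run = (v >= L) ? run + 1 : 0; if (run >= C - 1) return true; }
    return false;
}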


Slide 243

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
      4K    8K    16K    32K     128K    256K    512K    1M
n³    22s   3m    26m    3.5h    28h     --      --      --
n²    0     0     0      1s      26s     106s    7m      28m

An optimal solution
We assume every subsum ≠ 0

[Figure: A is scanned left to right; the running sum is reset whenever it would drop to ≤ 0, and the optimum window lies inside one of the positive stretches]

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm
  sum = 0; max = -1;
  for i = 1,...,n do
      if (sum + A[i] ≤ 0) then sum = 0;
      else { sum += A[i]; max = MAX{max, sum}; }

Note:
• Sum < 0 when OPT starts;
• Sum > 0 within OPT
(a runnable sketch follows)
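
A runnable version of the one-pass scan above; as in the slide, it assumes the optimal sum is positive (that is why max starts at -1).

#include <vector>

// One-pass maximum-sum subarray: restart the running sum whenever it would drop to <= 0.
long long max_subarray(const std::vector<long long>& A) {
    long long sum = 0, best = -1;
    for (long long x : A) {
        if (sum + x <= 0) sum = 0;
        else { sum += x; if (sum > best) best = sum; }
    }
    return best;
}
// For A = {2,-5,6,1,-2,4,3,-13,9,-6,7} it returns 12 (the window 6 1 -2 4 3).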

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 10^9 random I/Os = 10^9 * 5ms ≈ 2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02    m = (i+j)/2;            // Divide
03    Merge-Sort(A,i,m);      // Conquer
04    Merge-Sort(A,m+1,j);
05    Merge(A,i,m,j)          // Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n = 10^9 tuples ⇒ a few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Θ(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
[Figure: binary merge-sort recursion tree; the input is repeatedly halved into log2 N levels of runs, and only the lowest levels (of size ≤ M) fit in internal memory]

How do we deploy the disk/memory features ?
N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
[Figure: X = M/B input buffers Bf1..BfX, one per run, each with a pointer p1..pX to its current item; repeatedly output min(Bf1[p1], ..., BfX[pX]) into the output buffer Bfo; fetch a new page of a run when its pointer reaches B, and flush Bfo to the merged output run when it is full]
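
A minimal in-memory sketch of the X-way merging step: a min-heap selects the smallest current head among the runs. A disk-based implementation would instead advance B-sized buffers as in the figure above; names here are illustrative.

#include <queue>
#include <utility>
#include <vector>

// Merge X sorted runs into one sorted output using a min-heap of current heads.
std::vector<int> multiway_merge(const std::vector<std::vector<int>>& runs) {
    using Head = std::pair<int, std::pair<size_t, size_t>>;   // (value, (run id, index))
    std::priority_queue<Head, std::vector<Head>, std::greater<Head>> heap;
    for (size_t r = 0; r < runs.size(); ++r)
        if (!runs[r].empty()) heap.push({runs[r][0], {r, 0}});
    std::vector<int> out;
    while (!heap.empty()) {
        auto [v, pos] = heap.top(); heap.pop();
        out.push_back(v);                                     // emit the current minimum
        auto [r, i] = pos;
        if (i + 1 < runs[r].size()) heap.push({runs[r][i + 1], {r, i + 1}});
    }
    return out;
}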

Cost of Multi-way Merge-Sort

Number of passes = log_{M/B} #runs ≈ log_{M/B} (N/M)
Optimal cost = Θ( (N/B) · log_{M/B} (N/M) ) I/Os

In practice
  M/B ≈ 1000 ⇒ #passes = log_{M/B} (N/M) ≈ 1
  ⇒ one multiway merge, 2 passes = few mins
  (tuning depends on disk features)

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Can compression help?

Goal: enlarge M and reduce N
  #passes = O(log_{M/B} (N/M))
  Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm

Use a pair of variables <X,C> (candidate and counter, C initially 0)
For each item s of the stream:
    if (C == 0) { X = s; C = 1; }
    else if (X == s) C++;
    else C--;
Return X;

Proof (sketch)

If the returned X ≠ y, then every occurrence of y has been cancelled by a distinct "negative" mate, so the number of such mates is ≥ #occ(y). This would require 2 * #occ(y) > N items in total, which is impossible.
(Problems arise only if the mode occurs ≤ N/2 times.)
(a runnable sketch follows)
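
A runnable sketch of the two-variable scan (the classic majority-vote idea): the returned candidate is guaranteed to be the mode only when the mode really occurs > N/2 times; otherwise a second verification pass is needed.

#include <vector>

// One pass, two variables: candidate X and counter C.
int majority_candidate(const std::vector<int>& stream) {
    int X = 0, C = 0;
    for (int s : stream) {
        if (C == 0) { X = s; C = 1; }
        else if (X == s) ++C;
        else --C;
    }
    return X;        // verify with a second pass if the majority is not guaranteed
}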

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 10^9 (size = 6Gb)
n = 10^6 documents
TotT = 10^9 total term occurrences (avg term length is 6 chars)
t = 5 * 10^5 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million documents, t = 500K terms

              Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony               1                1             0          0        0        1
Brutus               1                1             0          1        0        0
Caesar               1                1             0          1        1        1
Calpurnia            0                1             0          0        0        0
Cleopatra            1                0             0          0        0        0
mercy                1                0             1          1        1        1
worser               1                0             1          1        1        0

(1 if the play contains the word, 0 otherwise)

Space is 500Gb !

Solution 2: Inverted index
Brutus    →  2 4 8 16 32 64 128
Calpurnia →  1 2 3 5 8 13 21 34
Caesar    →  13 16

We can still do better: 30-50% of the original text

1. Typically about 12 bytes are used per posting
2. We have 10^9 total postings ⇒ at least 12Gb of space
3. Compressing the 6Gb of documents gets 1.5Gb of data
Better index, but it is still >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM into fewer bits ?
NO: they are 2^n, but the strictly shorter bit strings are only

   Σ_{i=1,...,n-1} 2^i  =  2^n − 2

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the self-information of s is:

   i(s) = log_2 (1/p(s)) = − log_2 p(s)

Lower probability ⇒ higher information

Entropy is the weighted average of i(s):

   H(S) = Σ_{s∈S} p(s) · log_2 (1/p(s))    bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the average length is defined as

   La(C) = Σ_{s∈S} p(s) · L[s]

We say that a prefix code C is optimal if for all prefix codes C’, La(C) ≤ La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj ⇒ L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution and any uniquely decodable code C, we have

   H(S) ≤ La(C)

Theorem (upper bound, Shannon). For any probability distribution, there exists a prefix code C such that

   La(C) ≤ H(S) + 1

(the Shannon code takes ⌈log 1/p⌉ bits per symbol)

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

[Figure: Huffman tree; merge a(.1) and b(.2) into (.3), then (.3) and c(.2) into (.5), then (.5) and d(.5) into (1)]

a=000, b=001, c=01, d=1
There are 2^(n-1) “equivalent” Huffman trees

What about ties (and thus, tree depth) ?
(a construction sketch follows)
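
A compact sketch of the greedy construction: repeatedly merge the two least-probable trees; it returns the codeword lengths, from which a canonical code can be derived. Probabilities follow the running example; names are illustrative.

#include <queue>
#include <utility>
#include <vector>

// Huffman codeword lengths; e.g. for p = {.1,.2,.2,.5} it yields {3,3,2,1}.
std::vector<int> huffman_lengths(const std::vector<double>& p) {
    if (p.empty()) return {};
    using Node = std::pair<double, int>;                       // (weight, node id)
    int n = (int)p.size();
    std::vector<int> parent(2 * n - 1, -1), len(n, 0);
    std::priority_queue<Node, std::vector<Node>, std::greater<Node>> pq;
    for (int i = 0; i < n; ++i) pq.push({p[i], i});
    int next = n;
    while (pq.size() > 1) {                                    // n-1 merges of the two smallest
        auto [w1, a] = pq.top(); pq.pop();
        auto [w2, b] = pq.top(); pq.pop();
        parent[a] = parent[b] = next;
        pq.push({w1 + w2, next++});
    }
    for (int s = 0; s < n; ++s)                                // depth of each leaf
        for (int v = s; parent[v] != -1; v = parent[v]) ++len[s];
    return len;
}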

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. Its self-information is

   −log(.999) ≈ .00144

If we were to send 1000 such symbols we might hope to use 1000 * .00144 = 1.44 bits.

Using Huffman, we take at least 1 bit per symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet (i.e., T ∈ {0,1}^n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

   H(s) = Σ_{i=1,...,m} 2^{m−i} · s[i]

P = 0101
H(P) = 2^3·0 + 2^2·1 + 2^1·0 + 2^0·1 = 5

s = s’ if and only if H(s) = H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1):

   H(T_r) = 2·H(T_{r−1}) − 2^m·T[r−1] + T[r+m−1]

T = 10110101
T1 = 1011, T2 = 0110
H(T1) = H(1011) = 11
H(T2) = H(0110) = 2·11 − 2^4·1 + 0 = 22 − 16 + 0 = 6

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P = 101111, q = 7
H(P) = 47,  Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally (Horner plus mod):
   1·2 (mod 7) + 0 = 2
   2·2 (mod 7) + 1 = 5
   5·2 (mod 7) + 1 = 4
   4·2 (mod 7) + 1 = 2
   2·2 (mod 7) + 1 = 5  =  Hq(P)

We can still compute Hq(Tr) from Hq(Tr-1), since
   2^m (mod q) = 2·(2^{m−1} (mod q)) (mod q)

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board
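
A small sketch of the fingerprint scan over a binary text ('0'/'1' characters). Here the modulus q is fixed for readability, whereas the algorithm above draws a random prime ≤ I, and reported matches are only probable unless verified.

#include <cstdint>
#include <string>
#include <vector>

// Report positions r where Hq(T_r) = Hq(P).
std::vector<size_t> kr_matches(const std::string& T, const std::string& P, uint64_t q = 1000000007ULL) {
    size_t n = T.size(), m = P.size();
    std::vector<size_t> res;
    if (m == 0 || n < m) return res;
    uint64_t hp = 0, ht = 0, pow = 1;                   // pow = 2^{m-1} mod q
    for (size_t i = 0; i < m; ++i) {
        hp = (2 * hp + (P[i] & 1)) % q;                 // Horner, as in the example above
        ht = (2 * ht + (T[i] & 1)) % q;
        if (i) pow = (2 * pow) % q;
    }
    for (size_t r = 0; ; ++r) {
        if (ht == hp) res.push_back(r);                 // probable match (verify if needed)
        if (r + m >= n) break;
        // slide the window: drop T[r], append T[r+m]
        ht = (2 * (ht + q - (uint64_t)(T[r] & 1) * pow % q) + (T[r + m] & 1)) % q;
    }
    return res;
}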

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the (j-1)-th one.
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1 for the
positions in P where character x appears.
Example:
P = abaac
U(a) = (1,0,1,1,0)    U(b) = (0,1,0,0,0)    U(c) = (0,0,0,0,1)

How to construct M

Initialize column 0 of M to all zeros.
For j > 0, the j-th column is obtained by

   M(j) = BitShift( M(j-1) ) & U( T[j] )

For i > 1, entry M(i,j) = 1 iff
 (1) the first i-1 characters of P match the i-1 characters of T ending at character j-1
     ⇔ M(i-1,j-1) = 1
 (2) P[i] = T[j]  ⇔  the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) into the i-th position;
ANDing it with the i-th bit of U(T[j]) establishes whether both conditions hold.

An example (j = 1, ..., 9)
T = xabxabaaca,  P = abaac

[Figure: the columns M(1), M(2), ..., M(9) computed with M(j) = BitShift(M(j-1)) & U(T[j]); at j = 9 the 5th (last) bit of the column is set, signalling the occurrence of P ending at position 9]

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.
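
A word-sized sketch of Shift-And for m ≤ w = 64: U[c] marks the positions of c in P, and an occurrence ends at j when the m-th bit of the column is set. Names are illustrative.

#include <cstdint>
#include <string>
#include <vector>

// Shift-And: M(j) = ((M(j-1) << 1) | 1) & U[T[j]]; reports ending positions of P in T.
std::vector<size_t> shift_and(const std::string& T, const std::string& P) {
    size_t m = P.size();
    std::vector<size_t> occ;
    if (m == 0 || m > 64) return occ;
    uint64_t U[256] = {0};
    for (size_t i = 0; i < m; ++i)
        U[(unsigned char)P[i]] |= (uint64_t)1 << i;     // bit i set iff P[i+1] = c
    uint64_t M = 0, last = (uint64_t)1 << (m - 1);
    for (size_t j = 0; j < T.size(); ++j) {
        M = ((M << 1) | 1) & U[(unsigned char)T[j]];    // BitShift + AND, O(1) per character
        if (M & last) occ.push_back(j);                 // P ends at position j
    }
    return occ;
}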

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1

The first i-1 characters of P match a substring of T ending at j-1, with at most l mismatches, and the next pair of characters in P and T are equal:

   BitShift( M^l(j-1) ) & U( T[j] )
Computing Ml: case 2

The first i-1 characters of P match a substring of T ending at j-1, with at most l-1 mismatches (the j-th character of T is charged as one more mismatch):

   BitShift( M^{l-1}(j-1) )

Computing Ml

We compute M^l for all l = 0, ..., k.
For each j compute M(j), M^1(j), ..., M^k(j).
For all l initialize M^l(0) to the zero vector.

In order to compute M^l(j), we observe that there is a match iff case 1 or case 2 holds:

   M^l(j) = [ BitShift( M^l(j-1) ) & U( T[j] ) ]  |  BitShift( M^{l-1}(j-1) )

Example
T = xabxabaaca,  P = abaad

[Figure: the matrices M^0 and M^1 for T and P; M^1(5,9) = 1, so P occurs at positions 5..9 of T with at most one mismatch]

How much do we pay?





The running time is O(kn(1+m/w)
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol into p
Deletion: delete a symbol from p
Substitution: change a symbol of p into a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code (Elias’ γ) for integer encoding

   g(x) = 0^{Length-1} · (x in binary),   with x > 0 and Length = ⌊log2 x⌋ + 1

e.g., 9 is represented as <000,1001>

g-code for x takes 2⌊log2 x⌋ + 1 bits   (i.e. a factor of 2 from optimal)

Optimal for Pr(x) = 1/(2x²), and i.i.d. integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

Answer: 8, 6, 3, 59, 7
(an encode/decode sketch follows)
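
A bit-string sketch of the g-code: Length-1 zeros followed by the binary representation of x; decoding counts the zeros. On the exercise string above it returns 8, 6, 3, 59, 7. Names are illustrative.

#include <cstdint>
#include <string>
#include <vector>

// g(x), x > 0: (Length-1) zeros, then x in binary.
std::string gamma_encode(uint64_t x) {
    int len = 0;
    for (uint64_t t = x; t; t >>= 1) ++len;
    std::string out(len - 1, '0');
    for (int i = len - 1; i >= 0; --i) out += ((x >> i) & 1) ? '1' : '0';
    return out;
}

// Decode a concatenation of g-codes.
std::vector<uint64_t> gamma_decode(const std::string& bits) {
    std::vector<uint64_t> out;
    size_t i = 0;
    while (i < bits.size()) {
        int zeros = 0;
        while (bits[i] == '0') { ++zeros; ++i; }                 // count Length-1 zeros
        uint64_t x = 0;
        for (int k = 0; k <= zeros; ++k) x = (x << 1) | (bits[i++] == '1');
        out.push_back(x);
    }
    return out;
}
// gamma_decode("0001000001100110000011101100111") == {8, 6, 3, 59, 7}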

Analysis
Sort the pi in decreasing order, and encode si via the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How good is this approach wrt Huffman?  Compression ratio ≤ 2 * H0(s) + 1

Key fact:   1 ≥ Σ_{i=1,...,x} pi ≥ x * px   ⇒   x ≤ 1/px

How good is it ?
The cost of the encoding is (recall i ≤ 1/pi):

   Σ_{i=1,...,|S|} pi · |g(i)|  ≤  Σ_{i=1,...,|S|} pi · [ 2·log(1/pi) + 1 ]  =  2·H0(X) + 1

Not much worse than Huffman, and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:
 Exploits temporal locality, and it is dynamic
 X = 1^n 2^n 3^n … n^n  ⇒  Huff = O(n² log n),  MTF = O(n log n) + n²

Not much worse than Huffman
...but it may be far better
(a small MTF sketch follows)
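
A small sketch of the MTF transform over bytes (the list starts as [0..255]; the output positions can then be g-coded or run-length encoded, as in bzip). Names are illustrative.

#include <algorithm>
#include <cstdint>
#include <numeric>
#include <string>
#include <vector>

// Move-to-Front: output the current position of each symbol, then move it to the front.
std::vector<int> mtf_encode(const std::string& text) {
    std::vector<uint8_t> list(256);
    std::iota(list.begin(), list.end(), 0);                    // L = [0, 1, ..., 255]
    std::vector<int> out;
    for (unsigned char c : text) {
        int pos = int(std::find(list.begin(), list.end(), c) - list.begin());
        out.push_back(pos);
        list.erase(list.begin() + pos);                        // move c to the front of L
        list.insert(list.begin(), c);
    }
    return out;
}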

MTF: how good is it ?
Encode the output integers via g-coding:  |g(i)| ≤ 2 * log i + 1
Put the whole alphabet S in front, and consider the cost of encoding:

   O(|S| log |S|)  +  Σ_{x=1,...,|S|}  Σ_{i=2,...,n_x}  |g( p_{x,i} − p_{x,i-1} )|

where p_{x,1} < p_{x,2} < ... are the positions of symbol x. By Jensen’s inequality:

   ≤  O(|S| log |S|)  +  Σ_{x=1,...,|S|}  n_x · [ 2·log(N/n_x) + 1 ]
   =  O(|S| log |S|)  +  N · [ 2·H0(X) + 1 ]

Hence  La[mtf] ≤ 2·H0(X) + O(1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
  abbbaacccca  ⇒  (a,1),(b,3),(a,2),(c,4),(a,1)
In the case of binary strings ⇒ just the run lengths and one bit

Properties:
 Exploits spatial locality, and it is a dynamic code (there is a memory)
 X = 1^n 2^n 3^n … n^n  ⇒  Huff(X) = Θ(n² log n)  >  Rle(X) = Θ(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.  p(a) = .2,  p(b) = .5,  p(c) = .3

   f(i) = Σ_{j=1,...,i-1} p(j)     ⇒   f(a) = .0,  f(b) = .2,  f(c) = .7

[Figure: the unit interval [0,1) partitioned into a = [0,.2), b = [.2,.7), c = [.7,1)]

The interval for a particular symbol will be called
the symbol interval (e.g. for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac

[Figure: nested intervals, start [0,1); after b → [.2,.7); after a → [.2,.3); after c → [.27,.3)]

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c1...cn with probabilities p[c], use the following:

   l_0 = 0        l_i = l_{i-1} + s_{i-1} * f[c_i]
   s_0 = 1        s_i = s_{i-1} * p[c_i]

f[c] is the cumulative probability up to symbol c (not included).
The final interval size is

   s_n = Π_{i=1,...,n} p[c_i]

The interval for a message sequence will be called the sequence interval
(a small coding sketch follows)
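
A floating-point sketch that just evaluates the formulas above to get the sequence interval [l, l+s). A real coder uses the integer/scaling version sketched later, since doubles lose precision on long messages; names are illustrative.

#include <map>
#include <string>
#include <utility>

// Compute the sequence interval [l, l + s) for a message, given p[c] and the cumulative f[c].
std::pair<double,double> sequence_interval(const std::string& msg,
                                           const std::map<char,double>& p,
                                           const std::map<char,double>& f) {
    double l = 0.0, s = 1.0;                  // l_0 = 0, s_0 = 1
    for (char c : msg) {
        l = l + s * f.at(c);                  // l_i = l_{i-1} + s_{i-1} * f[c_i]
        s = s * p.at(c);                      // s_i = s_{i-1} * p[c_i]
    }
    return {l, s};                            // the interval is [l, l+s)
}
// With p = {a:.2, b:.5, c:.3} and f = {a:0, b:.2, c:.7}, msg "bac" gives [.27, .27+.03) = [.27, .3).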

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:

[Figure: .49 falls in b’s interval [.2,.7); rescaling, it falls again in b’s sub-interval [.3,.55), and finally in c’s sub-interval [.475,.55)]

The message is bbc.

Representing a real number
Binary fractional representation:

   .75 = .11        1/3 = .010101...        11/16 = .1011

Algorithm
1. x = 2 * x
2. if x < 1 output 0
3. else x = x - 1; output 1

So how about just using the shortest binary fractional representation in the sequence interval?
   e.g.  [0,.33) → .01      [.33,.66) → .1      [.66,1) → .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
m in

m ax

interval

.11

.110

.111

[.75 ,1.0 )

.101

.1010

.1011

[.625 , .75 )

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the Arithmetic encoder generates at most

   1 + ⌈log (1/s)⌉  =  1 + ⌈log Π_i (1/p_i)⌉
                    ≤  2 + Σ_{i=1,...,n} log (1/p_i)
                    =  2 + Σ_{k=1,...,|S|} n·p_k·log (1/p_k)
                    =  2 + n·H0     bits

In practice it is nH0 + 0.02·n bits, because of rounding.

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
Context
Empty

Counts
A
B
C
$

=
=
=
=

4
2
5
3

Context
A
B
C

Counts
C
$
A
$
A
B
C
$

=
=
=
=
=
=
=
=

3
1
2
1
1
2
2
3

Context
AC

BA
CA
CB
CC

String = ACCBACCACBA B

k=2

Counts
B
C
$
C
$
C
$
A
$
A
B
$

=
=
=
=
=
=
=
=
=
=
=
=

1
2
2
1
1
1
1
2
1
1
1
2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n → ∞ !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce
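
A sketch of the decoder for (d, len, c) triples, including the overlapping-copy case just discussed: characters are copied one at a time from d positions back, so each one is already available when it is needed. Names are illustrative.

#include <string>
#include <vector>

struct Triple { size_t d; size_t len; char c; };    // (distance, length, next char)

std::string lz77_decode(const std::vector<Triple>& code) {
    std::string out;
    for (const Triple& t : code) {
        size_t start = out.size() - t.d;
        for (size_t i = 0; i < t.len; ++i) out += out[start + i];   // works even when len > d
        out += t.c;
    }
    return out;
}
// lz77_decode({{0,0,'a'},{0,0,'b'},{0,0,'c'},{0,0,'d'},{2,9,'e'}}) == "abcdcdcdcdcdce"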

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}
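
A sketch of the inversion: the LF mapping is obtained by counting (the stable rank of each character of L), and the backward walk starts from the row that ends with '#', i.e. the row equal to T. The end-marker '#' is assumed unique and lexicographically smallest; names are illustrative.

#include <array>
#include <string>
#include <vector>

std::string invert_bwt(const std::string& L) {
    size_t n = L.size();
    std::array<size_t,256> count{};                     // will hold: # of chars in L smaller than c
    for (unsigned char c : L) ++count[c];
    size_t sum = 0;
    for (size_t c = 0; c < 256; ++c) { size_t t = count[c]; count[c] = sum; sum += t; }
    std::vector<size_t> LF(n);
    std::array<size_t,256> seen{};                      // occurrences of c seen so far in L
    size_t start = 0;
    for (size_t i = 0; i < n; ++i) {
        unsigned char c = L[i];
        LF[i] = count[c] + seen[c]++;                   // row of F holding this occurrence of c
        if (c == '#') start = i;                        // the row ending with '#' is T itself
    }
    std::string T(n, ' ');
    size_t r = start;
    for (size_t i = n; i > 0; --i) { T[i - 1] = L[r]; r = LF[r]; }   // spell T backward
    return T;
}
// invert_bwt("ipssm#pissii") == "mississippi#"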

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Θ(n² log n) time in the worst-case
• Θ(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in - degree ( u )  k ] 

a  2.1

1

k

a

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides and efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
            Emacs size   Emacs time
 uncompr    27Mb         ---
 gzip       8Mb          35 secs
 zdelta     1.5Mb        42 secs

Efficient Web Access
Dual proxy architecture: a pair of proxies located on each side of the slow link use a proprietary protocol to increase performance over this link.
[diagram: the client-side proxy and the server-side proxy each keep a reference copy of the requested page; requests cross the slow link delta-encoded, while the server-side proxy fetches pages from the web over the fast link]

Use zdelta to reduce traffic:
 • Old version available at both proxies
 • Restricted to pages already visited (30% hits), URL-prefix match
 • Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F
 • Useful on a dynamic collection of web pages, back-ups, …
 • Apply pairwise zdelta: find for each f ∈ F a good reference

Reduction to the Min Branching problem on DAGs (a sketch follows):
 • Build a weighted graph G_F: nodes = files, weights = zdelta-sizes
 • Insert a dummy node connected to all nodes, with the gzip-coding sizes as weights
 • Compute the min branching = directed spanning tree of minimum total cost, covering G's nodes.
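A rough sketch of the reduction, assuming networkx's Edmonds-based minimum_spanning_arborescence is available and using zlib sizes as a cheap stand-in for the gzip/zdelta edge weights (the helper name build_branching is made up):

import zlib
import networkx as nx   # assumed: nx.minimum_spanning_arborescence (Edmonds' algorithm)

def build_branching(files):
    # files: dict name -> bytes. 'ROOT' is the dummy node; edge weights approximate
    # compressed sizes (zlib here instead of gzip/zdelta).
    G = nx.DiGraph()
    for name, data in files.items():
        G.add_edge('ROOT', name, weight=len(zlib.compress(data)))        # gzip-like cost
    for ref_name, ref in files.items():
        for tgt_name, tgt in files.items():
            if ref_name != tgt_name:
                delta = len(zlib.compress(ref + tgt)) - len(zlib.compress(ref))
                G.add_edge(ref_name, tgt_name, weight=max(delta, 0))     # zdelta-like cost
    # the min-cost arborescence tells which file (or 'ROOT') to use as reference for each file
    return nx.minimum_spanning_arborescence(G)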

[figure: example branching over a small set of files; node 0 is the dummy node, and the edge weights (e.g. 620, 2000, 220, 123, 20) are the candidate compressed sizes]

            space   time
 uncompr    30Mb    ---
 tgz        20%     linear
 THIS       8%      quadratic

Improvement: what about many-to-one compression (for a group of files)?

Problem: Constructing G is very costly: n^2 edge calculations (zdelta executions).
We wish to exploit some pruning approach:

 • Collection analysis: cluster the files that appear similar, and are thus good candidates for zdelta-compression; build a sparse weighted graph G'_F containing only edges between those pairs of files.
 • Assign weights: estimate appropriate edge weights for G'_F, thus saving zdelta executions. Nonetheless, still Θ(n^2) time.
            space    time
 uncompr    260Mb    ---
 tgz        12%      2 mins
 THIS       8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
[diagram: the Client holds f_old and sends a request to the Server, which holds f_new and sends back an update]

 • client wants to update an out-dated file
 • server has the new file but does not know the old file
 • update without sending the entire f_new (using similarity)
 • rsync: file-synch tool, distributed with Linux

Delta compression is a sort of "local" synch, since there the server has both copies of the files.

The rsync algorithm
[diagram: the Client, holding f_old, sends per-block hashes to the Server; the Server, holding f_new, replies with an encoded file made of block references and literals]

The rsync algorithm   (contd)

 • simple, widely used, single roundtrip
 • optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals (a rolling-hash sketch follows)
 • choice of block size problematic (default: max{700, √n} bytes)
 • not good in theory: granularity of changes may disrupt use of blocks
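A minimal sketch of the "rolling hash" ingredient above: an Adler-style weak checksum whose window can be slid one byte in O(1), so a window over f_new can be tested against the hashed blocks of f_old at every position. Constants and names are illustrative, not rsync's actual code.

M = 1 << 16          # modulus of the weak checksum

def weak_hash(block):
    # Adler-style checksum (a, b) of a byte block, as used conceptually by rsync
    a = sum(block) % M
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % M
    return a, b

def roll(a, b, out_byte, in_byte, block_len):
    # slide the window one byte to the right in O(1)
    a = (a - out_byte + in_byte) % M
    b = (b - block_len * out_byte + a) % M
    return a, b

# Matching loop (sketch): hash every block of f_old once into a dict; roll a window
# over f_new, look the weak hash up, and confirm candidates with a strong hash
# (MD5 in rsync) before emitting a block reference instead of literals.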

Rsync: some experiments

            gcc size   emacs size
 total      27288      27326
 gzip       7563       8577
 zdelta     227        1431
 rsync      964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends the hashes (unlike rsync, where the client sends them), and the client checks them.
The server deploys the common f_ref to compress the new f_tar (rsync, instead, just compresses it on its own).

A multi-round protocol

 • k blocks of n/k elements per level
 • log(n/k) levels

If the distance is k, then on each level at most k hashes fail to find a match in the other file (a toy simulation follows).
The communication complexity is O(k lg n lg(n/k)) bits.
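A rough single-machine simulation of the multi-round idea (helper names are hypothetical; a real protocol exchanges the hashes over the network): split into k blocks per level, keep only the blocks whose hashes disagree, and recurse into them.

import hashlib

def differing_ranges(a, b, k=4, min_block=16):
    # a, b: bytes. Returns the byte ranges whose blocks differ, refined level by level.
    def h(x):
        return hashlib.md5(x).digest()

    def recurse(lo, hi, out):
        if hi - lo <= min_block:
            out.append((lo, hi))
            return
        step = max((hi - lo + k - 1) // k, 1)
        for start in range(lo, hi, step):
            end = min(start + step, hi)
            if h(a[start:end]) != h(b[start:end]):
                recurse(start, end, out)     # recurse only into mismatching blocks

    out = []
    recurse(0, max(len(a), len(b)), out)
    return out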

Next lecture

Set reconciliation
Problem: Given two sets S_A and S_B of integer values located on two machines A and B, determine the difference between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size k of the difference, where k may or may not be known in advance to the parties.

Note:
 • set reconciliation is "easier" than file sync [it is record-based]
 • Not perfectly true but...

Recurring minimum for improving the estimate + 2 SBF

A multi-round protocol

 • k blocks of n/k elements per level
 • log(n/k) levels

If the distance is k, then on each level at most k hashes fail to find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits.

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T iff P is a prefix of the i-th suffix of T (i.e. T[i,N]).
[figure: P aligned at position i of T, covering a prefix of the suffix T[i,N]]

Occurrences of P in T = all suffixes of T having P as a prefix.
Example: P = si, T = mississippi → occurrences at positions 4 and 7.

SUF(T) = sorted set of suffixes of T.

Reduction: from substring search to prefix search over SUF(T).

The Suffix Tree
[figure: the suffix tree of T# = mississippi#; internal edges carry labels such as "i", "s", "si", "ssi", "p", "i#", "pi#", "ppi#", "mississippi#", and the 12 leaves store the starting positions 1..12 of the corresponding suffixes]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position in SUF(T) is the lexicographic position of P.

Storing SUF(T) explicitly takes Θ(N²) space; the suffix array SA keeps only the suffix pointers.

T = mississippi#        P = si

  SA   SUF(T)
  12   #
  11   i#
   8   ippi#
   5   issippi#
   2   ississippi#
   1   mississippi#
  10   pi#
   9   ppi#
   7   sippi#
   4   sissippi#
   6   ssippi#
   3   ssissippi#

Suffix Array space:
 • SA: Θ(N log2 N) bits
 • Text T: N chars
 → In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, with 2 memory accesses per step (the SA entry plus the text).
[figure: binary-search steps over SA for P = si on T = mississippi#; at one step P is larger than the probed suffix, at the next it is smaller]

Suffix Array search
 • O(log2 N) binary-search steps
 • Each step takes O(p) char comparisons
 → overall, O(p log2 N) time

Improvable to O(p + log2 N) [Manber-Myers, '90], and to O(p + log2 |S|) via Suffix Trays [Cole et al., '06].

Locating the occurrences
[figure: for P = si on T = mississippi#, the binary search delimits the contiguous SA range containing sippi# and sissippi#, i.e. occ = 2 occurrences, at positions 7 and 4]

Suffix Array search
 • O(p + log2 N + occ) time   (# < S < $)
 • Suffix Trays: O(p + log2 |S| + occ)   [Cole et al., '06]
 • String B-tree   [Ferragina-Grossi, '95]
 • Self-adjusting Suffix Arrays   [Ciriani et al., '02]

(a search sketch follows)
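A compact sketch of the indirect binary search described above: it returns the starting positions of all occurrences of P by locating the contiguous SA range of suffixes prefixed by P. Positions are 0-based here, and the toy suffix-array construction is quadratic; helper names are made up.

def suffix_array(T):
    # toy O(N^2 log N) construction; fine for an example
    return sorted(range(len(T)), key=lambda i: T[i:])

def occurrences(T, SA, P):
    # Indirect binary search on SA: O(p) chars compared per step, O(p log N) overall.
    n, p = len(SA), len(P)

    def first_index(strictly_greater):
        # smallest SA index whose p-char suffix prefix is >= P (or > P)
        lo, hi = 0, n
        while lo < hi:
            mid = (lo + hi) // 2
            pref = T[SA[mid]:SA[mid] + p]
            if pref < P or (strictly_greater and pref == P):
                lo = mid + 1
            else:
                hi = mid
        return lo

    left, right = first_index(False), first_index(True)
    return sorted(SA[left:right])       # occ starting positions, 0-based

# T = "mississippi#"; SA = suffix_array(T)
# occurrences(T, SA, "si") -> [3, 6]   (positions 4 and 7 in 1-based numbering)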

Text mining
Lcp[1,N-1] = longest common prefix between suffixes adjacent in SA.

T = mississippi#

  SA   suffix          Lcp with the next suffix
  12   #               0
  11   i#              1
   8   ippi#           1
   5   issippi#        4
   2   ississippi#     0
   1   mississippi#    0
  10   pi#             1
   9   ppi#            0
   7   sippi#          2
   4   sissippi#       1
   6   ssippi#         3
   3   ssissippi#      -

(e.g. Lcp = 4 between the adjacent suffixes issippi# and ississippi#)

 • How long is the common prefix between T[i,...] and T[j,...] ?
   → the minimum of the subarray Lcp[h,k-1] such that SA[h]=i and SA[k]=j.
 • Does there exist a repeated substring of length ≥ L ?
   → search for some Lcp[i] ≥ L.
 • Does there exist a substring of length ≥ L occurring ≥ C times ?
   → search for a window Lcp[i,i+C-2] whose entries are all ≥ L (a sketch follows).
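A naive sketch of the Lcp-based queries above (Kasai's linear-time Lcp construction exists but is not shown); SA is any suffix array of T, e.g. the one from the previous sketch, and the function names are illustrative.

def lcp_array(T, SA):
    # Lcp[i] = length of the longest common prefix of the suffixes at SA[i] and SA[i+1]
    # (naive O(N^2); Kasai's algorithm achieves O(N))
    def lcp(i, j):
        l = 0
        while i + l < len(T) and j + l < len(T) and T[i + l] == T[j + l]:
            l += 1
        return l
    return [lcp(SA[k], SA[k + 1]) for k in range(len(SA) - 1)]

def repeated_substring(T, SA, Lcp, L, C=2):
    # Is there a substring of length >= L occurring >= C times?
    # Equivalent to a window Lcp[i, i+C-2] with all entries >= L.
    for i in range(len(Lcp) - (C - 2)):
        if all(Lcp[i + j] >= L for j in range(C - 1)):
            return T[SA[i]:SA[i] + L]      # one witness
    return None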


Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides and efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
Emacs size

Emacs time

uncompr

27Mb

---

gzip

8Mb

35 secs

zdelta

1.5Mb

42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F

• Useful on a dynamic collection of web pages, back-ups, …

• Apply pairwise zdelta: find for each f ∈ F a good reference

• Reduction to the Min Branching problem on directed graphs
   • Build a weighted graph G_F: nodes = files, weights = zdelta-sizes
   • Insert a dummy node connected to all, whose edge weights are the gzip-coding sizes
   • Compute the min branching = directed spanning tree of min total cost, covering G’s nodes

[Figure: a small example of G_F with a dummy node; edge weights are the zdelta/gzip sizes (values such as 20, 123, 220, 620, 2000)]

           space   time
uncompr    30Mb    ---
tgz        20%     linear
THIS       8%      quadratic
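A minimal sketch of the reduction in Python. It assumes the third-party networkx library and reuses the zlib preset-dictionary trick from above as a stand-in for zdelta; since the dummy node is the only possible root, a minimum spanning arborescence of the graph is the min branching described above (viable only for small collections, as all n² deltas are computed):

    import zlib
    import networkx as nx

    def delta_size(ref, target):
        comp = zlib.compressobj(level=9, zdict=ref)              # zdelta stand-in
        return len(comp.compress(target) + comp.flush())

    def best_references(files):
        # files: dict mapping file name -> bytes
        G = nx.DiGraph()
        for name, data in files.items():
            # dummy -> file edge weighted by compressing the file from scratch
            G.add_edge("DUMMY", name, weight=len(zlib.compress(data, 9)))
        for a, da in files.items():
            for b, db in files.items():
                if a != b:
                    G.add_edge(a, b, weight=delta_size(da, db))  # compress b using a as reference
        T = nx.minimum_spanning_arborescence(G)                  # min branching, rooted at DUMMY
        return {child: parent for parent, child in T.edges()}    # file -> chosen reference

    files = {"a": b"alpha beta gamma " * 50,
             "b": b"alpha beta gamma delta " * 50,
             "c": b"totally unrelated content " * 50}
    print(best_references(files))   # each file's reference ("DUMMY" = compress from scratch)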

Improvement

What about many-to-one compression? (group of files)

Problem: Constructing G is very costly: n² edge calculations (zdelta executions)

• We wish to exploit some pruning approach

• Collection analysis: cluster the files that appear similar, and are thus
  good candidates for zdelta-compression. Build a sparse weighted graph
  G’_F containing only the edges between those pairs of files

• Assign weights: estimate appropriate edge weights for G’_F, thus saving
  zdelta executions. Nonetheless, still strictly n² time
           space    time
uncompr    260Mb    ---
tgz        12%      2 mins
THIS       8%       16 mins

Algoritmi per IR

File Synchronization

File synch: The problem
[Diagram: the Client, holding f_old, sends a request to the Server, which holds f_new and replies with an update]

• client wants to update an out-dated file
• server has the new file but does not know the old file
• update without sending the entire f_new (using similarity)
• rsync: file synch tool, distributed with Linux

Delta compression is a sort of local synch,
since the server has both copies of the files

The rsync algorithm

[Diagram: the Client computes block hashes of f_old and sends them to the Server; the Server, holding f_new, replies with an encoded file built from matched blocks and literals]

The rsync algorithm (contd)

• simple, widely used, single roundtrip
• optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
• choice of block size problematic (default: max{700, √n} bytes)
• not good in theory: granularity of changes may disrupt use of blocks
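The 4-byte rolling hash is what lets every window position of the new file be tested against the block hashes in O(1) per shift. A minimal Python sketch of an rsync-style weak checksum (an Adler-like pair of sums; the real rsync and zsync checksums differ in the details):

    M = 1 << 16

    def weak_checksum(block):
        # a = sum of the bytes, b = sum of the bytes weighted by distance from the window's end
        a = b = 0
        L = len(block)
        for i, x in enumerate(block):
            a = (a + x) % M
            b = (b + (L - i) * x) % M
        return (b << 16) | a

    def roll(cs, old_byte, new_byte, L):
        # slide the window one byte to the right in O(1)
        a, b = cs & 0xFFFF, cs >> 16
        a = (a - old_byte + new_byte) % M
        b = (b - L * old_byte + a) % M
        return (b << 16) | a

    data = b"the quick brown fox jumps over the lazy dog"
    L = 4
    cs = weak_checksum(data[:L])
    for i in range(1, len(data) - L + 1):
        cs = roll(cs, data[i - 1], data[i + L - 1], L)
        assert cs == weak_checksum(data[i:i + L])    # rolling update matches recomputation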

Rsync: some experiments

           gcc size   emacs size
total      27288      27326
gzip       7563       8577
zdelta     227        1431
rsync      964        4452

Compressed size in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

The server sends the hashes (unlike the client in rsync), and the client checks them.
The server deploys the common f_ref to compress the new f_tar (rsync just compresses it).

A multi-round protocol

k blocks of n/k elems

log(n/k) levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits
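A minimal single-machine simulation of the multi-round idea in Python (names are illustrative): each round the two sides exchange one hash per block still under suspicion, and only the blocks whose hashes differ are split in half for the next round, so when the files differ in at most k positions each level costs O(k) hashes:

    import hashlib

    def block_hash(x):
        return hashlib.md5(bytes(x)).digest()

    def differing_positions(A, B, k):
        # A, B: equal-length byte strings; start from k coarse blocks
        n = len(A)
        size = max(1, n // k)
        frontier = [(i, min(i + size, n)) for i in range(0, n, size)]
        diffs, rounds = [], 0
        while frontier:
            rounds += 1
            # one round: compare one hash per block still under suspicion
            mismatched = [(lo, hi) for lo, hi in frontier
                          if block_hash(A[lo:hi]) != block_hash(B[lo:hi])]
            frontier = []
            for lo, hi in mismatched:
                if hi - lo == 1:
                    diffs.append(lo)                      # pinned down a differing position
                else:
                    mid = (lo + hi) // 2
                    frontier += [(lo, mid), (mid, hi)]    # split and re-check in the next round
        return sorted(diffs), rounds

    A = bytes(1000)                                # 1000 zero bytes
    B = bytearray(A); B[17] = 1; B[500] = 2
    print(differing_positions(A, bytes(B), k=4))   # ([17, 500], about 1 + log2(n/k) rounds)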

Next lecture

Set reconciliation
Problem: Given two sets S_A and S_B of integer values located
on two machines A and B, determine the difference
between the two sets at one or both of the machines.

Requirements: The cost should be proportional to the size
k of the difference, where k may or may not be known in
advance to the parties.

Note:
 set reconciliation is “easier” than file sync [it is record-based]

Not perfectly true but...

Recurring minimum for
improving the estimate
+ 2 SBF

A multi-round protocol

k blocks of n/k elems

log(n/k) levels

If the distance is k, then on each level at most k hashes do not find a match in the other file.
The communication complexity is O(k lg n lg(n/k)) bits

Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff

P is a prefix of the i-th suffix of T (ie. T[i,N])

[Figure: P aligned at position i of T, i.e. against a prefix of the suffix T[i,N]]

Occurrences of P in T = all suffixes of T having P as a prefix

Example: P = si, T = mississippi  ⇒  occurrences at positions 4, 7

SUF(T) = Sorted set of suffixes of T

Reduction

From substring search
To prefix search

The Suffix Tree
[Figure: the suffix tree of T# = mississippi#; internal edges carry substring labels (i, s, p, si, ssi, i#, pi#, ppi#, mississippi#, #) and each of the 12 leaves is labeled with the starting position of its suffix]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Starting position is the lexicographic one of P.
Θ(N²) space if SUF(T) is stored explicitly

T = mississippi#

SA    SUF(T)
12    #
11    i#
8     ippi#
5     issippi#
2     ississippi#
1     mississippi#
10    pi#
9     ppi#
7     sippi#
4     sissippi#
6     ssippi#
3     ssissippi#

Each SA entry is a suffix pointer; e.g. P = si prefixes the two contiguous suffixes at SA positions 9-10, which start at text positions 7 and 4.

Suffix Array
• SA: Θ(N log2 N) bits
• Text T: N chars
⇒ In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison

[Figure: binary search over SA = 12 11 8 5 2 1 10 9 7 4 6 3 for P = si in T = mississippi#; each step follows the middle SA entry into T (2 accesses per step) and compares the suffix with P; here P is larger than the probed suffix, so the search continues in the right half]

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison

[Figure: the next step of the same binary search; now P = si is smaller than the probed suffix, so the search continues in the left half]

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char comparisons
⇒ overall, O(p log2 N) time

Improvable to O(p + log2 N) [Manber-Myers, ’90]
and to O(p + log2 |S|) [Cole et al, ’06]
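A minimal Python sketch of the search just described (naive suffix-array construction, fine for short texts; names are illustrative): each binary-search step compares P against at most p characters of the probed suffix, and the occurrences are the contiguous SA range whose suffixes start with P.

    def build_sa(T):
        # naive O(N^2 log N) construction: sort suffix start positions by the suffixes themselves
        return sorted(range(len(T)), key=lambda i: T[i:])

    def sa_range(T, SA, P):
        # two binary searches: leftmost suffix with prefix >= P, leftmost with prefix > P
        def prefix(i):
            return T[SA[i]:SA[i] + len(P)]
        lo, hi = 0, len(SA)
        while lo < hi:
            mid = (lo + hi) // 2
            if prefix(mid) < P:
                lo = mid + 1
            else:
                hi = mid
        left = lo
        hi = len(SA)
        while lo < hi:
            mid = (lo + hi) // 2
            if prefix(mid) <= P:
                lo = mid + 1
            else:
                hi = mid
        return left, lo                          # occurrences are SA[left:lo]

    T = "mississippi#"
    SA = build_sa(T)
    l, r = sa_range(T, SA, "si")
    print(r - l, [SA[i] for i in range(l, r)])   # 2 occurrences, 0-based starts [6, 3] (1-based: 7 and 4)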

Locating the occurrences
[Figure: the occ = 2 occurrences of P = si in T = mississippi# are the contiguous SA entries pointing to the suffixes sippi# and sissippi#, i.e. text positions 7 and 4; the range is delimited by searching where si# and si$ would fall in SA]

Suffix Array search
• O(p + log2 N + occ) time
  (assuming # is smaller and $ larger than every symbol of S)

Suffix Trays: O(p + log2 |S| + occ)   [Cole et al., ’06]
String B-tree   [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays   [Ciriani et al., ’02]

Text mining
Lcp[1,N-1] = longest-common-prefix between suffixes adjacent in SA

T = mississippi#

SA    SUF(T)          Lcp with next
12    #               0
11    i#              1
8     ippi#           1
5     issippi#        4
2     ississippi#     0
1     mississippi#    0
10    pi#             1
9     ppi#            0
7     sippi#          2
4     sissippi#       1
6     ssippi#         3
3     ssissippi#      -

(e.g. the adjacent suffixes issippi# and ississippi# share a prefix of length 4)

• How long is the common prefix between T[i,...] and T[j,...] ?
  • Min of the subarray Lcp[h,k-1] s.t. SA[h]=i and SA[k]=j
• Does there exist a repeated substring of length ≥ L ?
  • Search for an entry Lcp[i] ≥ L
• Does there exist a substring of length ≥ L occurring ≥ C times ?
  • Search for a window Lcp[i,i+C-2] whose entries are all ≥ L
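A minimal Python sketch of these queries, reusing build_sa from the previous sketch (naive O(N²) LCP computation, enough for toy texts): a repeated substring of length >= L exists iff some Lcp entry is >= L, and a substring of length >= L occurring >= C times corresponds to C-1 consecutive Lcp entries all >= L.

    def lcp_of(a, b):
        n = 0
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        return n

    def build_lcp(T, SA):
        # Lcp[i] = longest common prefix of the i-th and (i+1)-th suffix in SA order
        return [lcp_of(T[SA[i]:], T[SA[i + 1]:]) for i in range(len(SA) - 1)]

    def has_repeat_of_length(Lcp, L):
        return any(v >= L for v in Lcp)

    def has_substring_occurring(Lcp, L, C):
        w = C - 1                                    # need C-1 consecutive entries all >= L
        return any(min(Lcp[i:i + w]) >= L for i in range(len(Lcp) - w + 1))

    T = "mississippi#"
    SA = build_sa(T)
    Lcp = build_lcp(T, SA)
    print(Lcp)                                   # [0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3]
    print(has_repeat_of_length(Lcp, 4))          # True: "issi" occurs twice
    print(has_substring_occurring(Lcp, 1, 4))    # True: "i" (length 1) occurs 4 times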


Slide 245

Algoritmi per IR

Prologo

What Google is searching for...
Algorithm Complexity: You need to know Big-O. [….]
Sorting: Know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting
algorithm [….]
Hashtables: Arguably the single most important data structure known to mankind. [….]

Trees: Know about trees; basic tree construction, traversal and manipulation algorithms. Familiarize yourself with
binary trees, n-ary trees, and trie-trees. Be familiar with at least one type of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree, and know how it's implemented. Understand tree traversal algorithms: BFS
and DFS, and know the difference between inorder, postorder and preorder.
Graphs: Graphs are really important at Google. There are 3 basic ways to represent a graph in memory; familiarize
yourself with each representation and its pros & cons. You should know the basic graph traversal algorithms:
breadth-first search and depth-first search. Know their computational complexity, their tradeoffs, and how to
implement them in real code. If you get a chance, try to study up on fancier algorithms, such as Dijkstra and A*.
Other data structures: You should study up on as many other data structures and algorithms as possible. You
should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the
knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. Find out what NPcomplete means.
Mathematics: …
Operating Systems: …
Coding: …

References
Managing gigabytes
A. Moffat, T. Bell e I. Witten, Kaufmann Publisher 1999.

Mining the Web: Discovering Knowledge from...
S. Chakrabarti, Morgan-Kaufmann Publishers, 2003.

 A bunch of scientific papers available on the course site !!

About this course


It is a mix of algorithms for


data compression



data indexing



data streaming (and sketching)



data searching



data mining

Massive data !!

Paradigm shift...

Web 2.0 is about the many

Big


DATA  Big PC ?

We have three types of algorithms:


T1(n) = n, T2(n) = n2, T3(n) = 2n

... and assume that 1 step = 1 time unit


How many input data n each algorithm may process
within t time units?




n1 = t,

n2 = √t,

n3 = log2 t

What about a k-times faster processor?
...or, what is n, when the time units are k*t ?


n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

A new scenario
 Data are more available than even before

n➜∞
... is more than a theoretical assumption
 The RAM model is too simple

Step cost is W(1) time

Not just MIN #steps…

1
CPU

CPU
L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Gbs
Tens of nanosecs
Some words fetched

Few Tbs
Few millisecs
B = 32K page

Many Tbs
Even secs
Packets

You should be “??-aware programmers”

I/O-conscious Algorithms
track

read/write head
read/write arm

magnetic surface

“The difference in speed between modern CPU and disk
technologies is analogous to the difference in speed in
sharpening a pencil using a sharpener on one’s desk or
by taking an airplane to the other side of the world and
using a sharpener on someone else’s desk.” (D. Comer)

Spatial locality vs Temporal locality

The space issue





M = memory size, N = problem size
T(n) = time complexity of an algorithm using linear space
p = fraction of memory accesses [0,3÷0,4 (Hennessy-Patterson)]
C = cost of an I/O
[105 ÷ 106 (Hennessy-Patterson)]

If N=(1+f)M, then the D-avg cost per step is:
C * p * f/(1+f)
This is at least 104 * f/(1+f)
If we fetch B ≈ 4Kb in time C, and algo uses all of them:
(1/B) * (p * f/(1+f) * C) ≈

30 * f/(1+f)

Space-conscious Algorithms

I/Os
search
access

Compressed
data structures

Streaming Algorithms
track

read/write head
read/write arm

magnetic surface

Data arrive continuously or we wish FEW scans


Streaming algorithms:




Use few scans
Handle each element fast
Use small space

Cache-Oblivious Algorithms
CPU

L1

L2

RAM

HD

net

registers

Cache
Few Mbs
Some nanosecs
Few words fetched

Few Tbs
Few Gbs
Few millisecs
Tens of nanosecs
Some words fetched B = 32K page Many Tbs
Even secs
Packets

Unknown and/or changing devices


Block access important on all levels of memory hierarchy
But memory hierarchies are very diverse



Cache-oblivious algorithms:






Explicitly, algorithms do not assume any model parameters
Implicitly, algorithms use blocks efficiently on all memory levels

Toy problem #1: Max Subarray




Goal: Given a stock, and its D-performance over the
time, find the time window in which it achieved the best
“market performance”.
Math Problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
4K

8K

16K

32K

128K

256K

512K

1M

n3

22s

3m

26m

3.5h

28h

--

--

--

n2

0

0

0

1s

26s

106s

7m

28m

An optimal solution
We assume every subsum≠0

A=

<0

>0
Optimum

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm



sum=0; max = -1;
For i=1,...,n do


If (sum + A[i] ≤ 0) sum=0;
else sum +=A[i]; MAX{max, sum};

Note:
•Sum < 0 when OPT starts;
•Sum > 0 within OPT

Toy problem #2 : sorting


How to sort tuples (objects) on disk

Memory containing the tuples

A


Key observation:



Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]:




2 random accesses to memory locations A[i] and A[j]

MergeSort Q(n log n) random memory accesses (I/Os ??)

B-trees for sorting ?
Using a well-tuned B-tree library: Berkeley DB


n insertions  Data get distributed arbitrarily !!!
B-tree internal nodes

B-tree leaves
(“tuple pointers")

What about
listing tuples
in order ?

Tuples

Possibly 109 random I/Os = 109 * 5ms  2 months

Binary Merge-Sort
Merge-Sort(A,i,j)
01 if (i < j) then
02
m = (i+j)/2; Divide
Conquer
03
Merge-Sort(A,i,m);
04
Merge-Sort(A,m+1,j);
05
Merge(A,i,m,j)
Combine

Cost of Mergesort on large data




Take Wikipedia in Italian, compute word freq:


n=109 tuples  few Gbs



Typical Disk (Seagate Cheetah 150Gb): seek time ~5ms

Analysis of mergesort on disk:


It is an indirect sort: Q(n log2 n) random I/Os



[5ms] * n log2 n ≈ 1.5 years

In practice, it is faster because of caching...

2 passes (R/W)

Merge-Sort Recursion Tree

log2 N

If the run-size is larger than B (i.e. after first step!!),
fetching all of it in memory for merging does not help
1

2

3

4

5

6

1

2

5

7

9

10

1

2

5

10

2 10
10

2

7

8

13

9

19

10

3

4

11

12

13

15

17

19

6

8

11

12

15

17

11

12

17

How do we deploy
7the
9 disk/mem
13 19 3 features
4 8 15 ? 6

1

5

13 19

5

1

13 19

7 9
9

7

4 15
15

4

M

3

8

12 17

8

3

12 17

6 11
6

11

N/M runs, each sorted in internal memory (no I/Os)

— I/O-cost for merging is ≈ 2 (N/B) log2 (N/M)

Multi-way Merge-Sort



The key is to balance run-size and #runs to merge
Sort N items with main-memory M and disk-pages B:


Pass 1: Produce (N/M) sorted runs.



Pass i: merge X  M/B runs  logM/B N/M passes
INPUT 1

...

INPUT 2

...

OUTPUT

...

INPUT X

Disk

Disk
Main memory buffers of B items

Multiway Merging
Bf1
p1

Bf2
Fetch, if pi = B

p2

min(Bf1[p1],
Bf2[p2],
…,
Bfx[pX])

Bfo
po

Bfx
pX
Current
page

Run 1

Current
page

Flush, if
Bfo full
Current
page

Run 2

Run X=M/B
Out File:
Merged run

EOF

Cost of Multi-way Merge-Sort


Number of passes = logM/B #runs  logM/B N/M



Optimal cost = Q((N/B) logM/B N/M) I/Os

In practice



M/B ≈ 1000  #passes = logM/B N/M  1
One multiway merge  2 passes = few mins
Tuning depends
on disk features

 Large fan-out (M/B) decreases #passes
 Compression would decrease the cost of a pass!

Does compression may help?


Goal: enlarge M and reduce N



#passes = O(logM/B N/M)
Cost of a pass = O(N/B)

Part of Vitter’s paper…
In order to address issues related to:


Disk Striping: sorting easily on D disks



Distribution sort: top-down sorting



Lower Bounds: how much we can go

Toy problem #3: Top-freq elements



Goal: Top queries over a stream of N items (S large).
Math Problem: Find the item y whose frequency is > N/2,
using the smallest space. (i.e. If mode occurs > N/2)

A=b a c c c d c b a a a c c b c c c
.

Algorithm



Use a pair of variables
For each item s of the stream,




if (X==s) then C++
else { C--; if (C==0) X=s; C=1;}

Return X;

Proof

Problems
if ≤ N/2

If X≠y, then every one of y’s
occurrences has a “negative” mate.
Hence these mates should be ≥#y.

As a result, 2 * #occ(y) > N...

Toy problem #4: Indexing


Consider the following TREC collection:







N = 6 * 109 size = 6Gb
n = 106 documents
TotT= 109 (avg term length is 6 chars)
t = 5 * 105 distinct terms

What kind of data structure we build to support
word-based searches ?

Solution 1: Term-Doc matrix
n = 1 million
Antony and Cleopatra

Julius Caesar The Tempest

Hamlet

Othello

Macbeth

t=500K

Antony

1

1

0

0

0

1

Brutus

1

1

0

1

0

0

Caesar

1

1

0

1

1

1

Calpurnia

0

1

0

0

0

0

Cleopatra

1

0

0

0

0

0

mercy

1

0

1

1

1

1

worser

1

0

1

1

1

0

Space is 500Gb !

1 if play contains
word, 0 otherwise

Solution 2: Inverted index
Brutus

2

Calpurnia

1

Caesar

4
2

8

16
32do64
We can
still128
better:

original
3 i.e.
5 3050%
8 13
21 text
34

13 16

1. Typically use about 12 bytes
2. We have 109 total terms  at least 12Gb space
3. Compressing 6Gb documents gets 1.5Gb data
Better index but yet it is >10 times the text !!!!

Please !!
Do not underestimate
the features of disks
in algorithmic design

Algoritmi per IR

Basics + Huffman coding

How much can we compress?
Assuming all input messages are valid, if even one string
is (lossless) compressed, some other must expand.
Take all messages of length n.
Is it possible to compress ALL OF THEM in less bits ?
NO, they are 2n but we have less compressed msg…

n -1

2
i 1

i

2 -2
n

We need to talk
about stochastic sources

Entropy (Shannon, 1948)
For a set of symbols S with probability p(s), the
self information of s is:
i ( s )  log

1
2

 - log

p(s)

2

p(s)

Lower probability  higher information

Entropy is the weighted average of i(s)
H (S ) 



s S

p ( s )  log

1
2

p(s)

bits

Statistical Coding
How do we use probability p(s) to encode s?




Prefix codes and relationship to Entropy
Huffman codes
Arithmetic codes

Uniquely Decodable Codes
A variable length code assigns a bit string
(codeword) of variable length to every symbol
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence 1011 ?

A uniquely decodable code can always be
uniquely decomposed into their codewords.

Prefix Codes
A prefix code is a variable length code in which
no codeword is a prefix of another one
e.g a = 0, b = 100, c = 101, d = 11
Can be viewed as a binary trie
0

1
0 1

a
0

1

b

c

d

Average Length
For a code C with codeword length L[s], the
average length is defined as
La (C ) 



p ( s )  L[ s ]

s S

We say that a prefix code C is optimal if for
all prefix codes C’, La(C)  La(C’)

A property of optimal codes
Theorem (Kraft-McMillan). For any optimal
uniquely decodable code, it does exist a prefix
code with the same symbol lengths and thus
same average optimal length. And vice versa…

Theorem (golden rule). If C is an optimal prefix
code for a source with probabilities {p1, …, pn}

then pi < pj  L[si] ≥ L[sj]

Relationship to Entropy
Theorem (lower bound, Shannon). For any
probability distribution and any uniquely
decodable code C, we have

H ( S )  La (C )
Theorem (upper bound, Shannon). For any
probability distribution, there exists a prefix
code C such that

La (C )  H ( S )  1

Shannon code
takes log 1/p bits

Huffman Codes
Invented by Huffman as a class assignment in ‘50.
Used in most compression algorithms


gzip, bzip, jpeg (as option), fax compression,…

Properties:




Generates optimal prefix codes
Cheap to encode and decode
La(Huff) = H if probabilities are powers of 2


Otherwise, at most 1 bit more per symbol!!!

Running Example
p(a) = .1, p(b) = .2, p(c ) = .2, p(d) = .5
a(.1)
0
(.3)

b(.2)
1

c(.2)
1

1

0
(.5)

d(.5)

0

(1)
a=000, b=001, c=01, d=1
There are 2n-1 “equivalent” Huffman trees

What about ties (and thus, tree depth) ?

Encoding and Decoding
Encoding: Emit the root-to-leaf path leading to
the symbol to be encoded.
Decoding: Start at root and take branch for
each bit received. When at leaf, output its
symbol and return to root.
0

abc... 00000101

101001...  dcb

0

0
(.3)
1

a(.1)

(.5)
1

b(.2)

1

d(.5)

c(.2)

A property on tree contraction

Something like substituting symbols x,y with one new symbol x+y

...by induction, optimality follows…

Optimum vs. Huffman

Model size may be large
Huffman codes can be made succinct in the representation of
the codeword tree, and fast in (de)coding.
Canonical Huffman tree

We store for any level L:

= 00.....0



firstcode[L]



Symbol[L,i], for each i in level L

This is ≤ h2+ |S| log |S| bits

Canonical Huffman
Encoding

1

2

3

4

5

Canonical Huffman
Decoding
firstcode[1]=2
firstcode[2]=1
firstcode[3]=1
firstcode[4]=2
firstcode[5]=0

T=...00010...

Problem with Huffman Coding
Consider a symbol with probability .999. The
self information is
- log(. 999 )  . 00144

If we were to send 1000 such symbols we
might hope to use 1000*.0014 = 1.44 bits.

Using Huffman, we take at least 1 bit per
symbol, so we would require 1000 bits.

What can we do?
Macro-symbol = block of k symbols
 1 extra bit per macro-symbol = 1/k extra-bits per symbol
 Larger model to be transmitted

Shannon took infinite sequences, and k 

∞ !!

In practice, we have:


Model takes |S|k (k * log |S|) + h2



It is H0(SL) ≤ L * Hk(S) + O(k * log |S|), for each k ≤ L

(where h might be |S|)

Compress + Search ?

[Moura et al, 98]

Compressed text derived from a word-based Huffman:
 Symbols of the huffman tree are the words of T
 The Huffman tree has fan-out 128
 Codewords are byte-aligned and tagged
huffman

“or”

tagging

7 bits

g

a

b

1

Codeword

g

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

b

0

Byte-aligned codeword
a

T = “bzip or not bzip”
1

a

0

b
[]

b

b

0

b
space

bzip
1

a 0 b
[bzip]

g
a
b

g

a
or not

CGrep and other ideas...
P= bzip = 1a 0b

a
b

b
space

bzip

GREP

g
a
b

g

a
or not

T = “bzip or not bzip”

yes
1

C(T)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

You find this at

You find it under my Software projects

Algoritmi per IR

Basic search algorithms:
Single and Multiple-patterns
Mismatches and Edits

Problem 1
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Pattern Matching Problem
Exact match problem: we want to find all the occurrences of the
pattern P[1,m] in the text T[1,n].
T

A

B

P

A

B

C

A

B

D

A

B

 Naïve solution
 For any position i of T, check if T[i,i+m-1]=P[1,m]
 Complexity: O(nm) time

 (Classical) Optimal solutions based on comparisons
 Knuth-Morris-Pratt
 Boyer-Moore
 Complexity: O(n + m) time

Semi-numerical pattern matching




We show methods in which Arithmetic and Bitoperations replace comparisons
We will survey two examples of such methods




The Random Fingerprint method due to Karp and
Rabin
The Shift-And method due to Baeza-Yates and
Gonnet

Rabin-Karp Fingerprint


We will use a class of functions from strings to integers in
order to obtain:






An efficient randomized algorithm that makes an error with
small probability.
A randomized algorithm that never errors whose running
time is efficient with high probability.

We will consider a binary alphabet. (i.e., T={0,1}n)

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers.



Let s be a string of length m

H (s) 







m
i 1

2

m -i

s[ i ]

P=0101
H(P) = 230+221+210+201= 5
s=s’ if and only if H(s)=H(s’)

Definition:
let Tr denote the m length substring of T starting at
position r (i.e., Tr = T[r,r+m-1]).

Arithmetic replaces Comparisons


Strings are also numbers, H: strings → numbers



Exact match = Scan T and compare H(Ti) and H(P)
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

T=10110101
P=0101
H(P) = 5
T=10110101
P= 0101

H(T2) = 6 ≠ H(P)

T=10110101
P=
0101

H(T5) = 5 = H(P)

Match!

Arithmetic replaces Comparisons


We can compute H(Tr) from H(Tr-1)

H (T r )  2 H (T r -1 ) - 2 T ( r - 1)  T ( r  n - 1)
m

T=10110101
T1 = 1 0 1 1
T2 = 0 1 1 0

H (T1 )  H (1011 )  11
H (T 2 )  H ( 0110 )  2  11 - 2  1  0  22 - 16  6  H ( 0110 )
4

Arithmetic replaces Comparisons




A simple efficient algorithm:
Compute H(P) and H(T1)
Run over T




Compute H(Tr) from H(Tr-1) in constant time,
and make the comparisons (i.e., H(P)=H(Tr)).

Total running time O(n+m)?



NO! why?
The problem is that when m is large, it is unreasonable to
assume that each arithmetic operation can be done in O(1)
time.




Values of H() are m-bits long numbers. In general, they are too
BIG to fit in a machine’s word.

IDEA! Let’s use modular arithmetic:
For some prime q, the Karp-Rabin fingerprint of a string s
is defined by Hq(s) = H(s) (mod q)

An example
P=101111
q=7

H(P) = 47
Hq(P) = 47 (mod 7) = 5

Hq(P) can be computed incrementally!
1  2 (mod 7 )  0  2
2  2 (mod 7 )  1  5
5  2 (mod 7 )  1  4
4  2 (mod 7 )  1  2

We can still compute Hq(Tr) from
Hq(Tr-1).
2m (mod q) = 2(2m-1 (mod q)) (mod q)

2  2 (mod 7 )  1  5
5 (mod 7 )  5  H q ( P )

Intermediate values are also small! (< 2q)

Karp-Rabin Fingerprint


How about the comparisons?
Arithmetic:
There is an occurrence of P starting at position r of T if and
only if H(P) = H(Tr)

Modular arithmetic:
If there is an occurrence of P starting at position r of T
then Hq(P) = Hq(Tr)
False match! There are values of q for which the converse is
not true (i.e., P ≠ Tr AND Hq(P) = Hq(Tr))!


Our goal will be to choose a modulus q such that




q is small enough to keep computations efficient. (i.e., Hq()s fit in a
machine word)
q is large enough so that the probability of a false match is kept small

Karp-Rabin fingerprint algorithm




Choose a positive integer I
Pick a random prime q less than or equal to I, and
compute P’s fingerprint – Hq(P).
For each position r in T, compute Hq(Tr) and test to see if
it equals Hq(P). If the numbers are equal either









declare a probable match (randomized algorithm).
or check and declare a definite match (deterministic
algorithm)

Running time: excluding verification O(n+m).
Randomized algorithm is correct w.h.p
Deterministic algorithm whose expected running time is
O(n+m)

Proof on the board

Problem 1: Solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

b
[]

b

0

1

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

The Shift-And method


Define M to be a binary m by n matrix such that:



M(i,j) = 1 iff the first i characters of P exactly match the i
characters of T ending at character j.

i.e., M(i,j) = 1 iff P[1 … i] = T[j-i+1...j]
Example: T = california and P = for

M
n
m

T
P



c

a

l

i

f

o

rj

n

i

a

1

2

3

4

*

5

6

7

*

8

9

1
0

0

0

0

1

0

0

0

0

0

0

0

0

0

0

0

*

*

f

1

o

2

*0

0

r

3

0

0

* 0 *0

0

0

Oi

0
0

*

* 1 0*
0

1

How does M solve the exact match problem?

How to construct M


We want to exploit the bit-parallelism to compute the j-th
column of M from the j-1-th one




Machines can perform bit and arithmetic operations between
two words in constant time.
Examples:



And(A,B) is bit-wise and between A and B.
BitShift(A) is the value derived by shifting the A’s bits down by
one and setting the first bit to 1.

0
1 
 
 
1 
0
BitShift (  1  )   1 
 
 
0
1 
 
 
1
 
0


Let w be the word size. (e.g., 32 or 64 bits). We’ll assume
m=w. NOTICE: any column of M fits in a memory word.

How to construct M






We want to exploit the bit-parallelism to
compute the j-th column of M from the j-th one
We define the m-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1for the
positions in P where character x appears.
Example:
P = abaac
1 
 
0
U (a )   1 
 
1 
 
0

0
 
1 
U (b )   0 
 
0
 
0

0
 
0
U (c )   0 
 
0
 
1 

How to construct M



Initialize column 0 of M to all zeros
For j > 0, j-th column is obtained by

M ( j )  BitShift ( M ( j - 1)) & U (T [ j ])


For i > 1, Entry M(i,j) = 1 iff
(1) The first i-1 characters of P match the i-1characters of T
ending at character j-1
(2) P[i] = T[j]






⇔ M(i-1,j-1) = 1

the i-th bit of U(T[j]) = 1

BitShift moves bit M(i-1,j-1) in i-th position
AND this with i-th bit of U(T[j]) to estabilish if both are true

An example j=1
1 2 3 4 5 6 7 8 9 10

n
m

1

12345

1

0

P=abaac

2

0

3

0

4

0

5

0

T=xabxabaaca

0
 
0
U (x)  0
 
0
 
0

2

3

4

5

6

7

1 
 
0
BitShift ( M ( 0 )) & U (T [1])   0  &
 
0
 
0

8

9

0 0
   
0 0
0  0
   
0 0
   
0 0

1
0

An example j=2
1 2 3 4 5 6 7 8 9 10

n
m

1

2

12345

1

0

1

P=abaac

2

0

0

3

0

0

4

0

0

5

0

0

T=xabxabaaca

1 
 
0
U (a )   1 
 
1 
 
0

3

4

5

6

7

1 
 
0
BitShift ( M (1)) & U (T [ 2 ])   0  &
 
0
 
0

8

9

1  1 
   
0 0
1    0 
   
1   0 
   
0 0

1
0

An example j=3
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

12345

1

0

1

0

P=abaac

2

0

0

1

3

0

0

0

4

0

0

0

5

0

0

0

T=xabxabaaca

0
 
1 
U (b )   0 
 
0
 
0

4

5

6

7

1 
 
1 
BitShift ( M ( 2 )) & U (T [ 3 ])   0  &
 
0
 
0

8

9

0 0
   
1  1 
0  0
   
0 0
   
0 0

1
0

An example j=9
1 2 3 4 5 6 7 8 9 10

n
m

1

2

3

4

5

6

7

8

9

12345

1

0

1

0

0

1

0

1

1

0

P=abaac

2

0

0

1

0

0

1

0

0

0

3

0

0

0

0

0

0

1

0

0

4

0

0

0

0

0

0

0

1

0

5

0

0

0

0

0

0

0

0

1

T=xabxabaaca

0
 
0
U (c )   0 
 
0
 
1 

1 
 
1 
BitShift ( M ( 8 )) & U (T [ 9 ])   0  &
 
0
 
1 

0 0
   
0 0
0  0
   
0 0
   
1  1 

1
0

Shift-And method: Complexity








If m<=w, any column and vector U() fit in a
memory word.
 Any step requires O(1) time.
If m>w, any column and vector U() can be
divided in m/w memory words.
 Any step requires O(m/w) time.
Overall O(n(1+m/w)+m) time.
Thus, it is very fast when pattern length is close
to the word size.
 Very often in practice. Recall that w=64 bits in
modern architectures.

Some simple extensions


We want to allow the pattern to contain special
symbols, like [a-f] classes of chars
P = [a-b]baac
1 
 
0
U (a )   1 
 
1 
 
0



1 
 
1 
U (b )   0 
 
0
 
0

What about ‘?’, ‘[^…]’ (not).

0
 
0
U (c )   0 
 
0
 
1 

Problem 1: An other solution
Dictionary

P = bzip = 1a 0b

a

bzip
not
or

b

b
space

bzip

space

1

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”

yes

g

a
[or]
0

1

no

b
[]

b

0

1

no

yes
a 0 b
[bzip]

Speed ≈ Compression ratio

Problem 2
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring

a

bzip
not
or

P=o

b

b
space

bzip

space

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

g

a
b

a
or not

S = “bzip or not bzip” yes
1

g

a
[or]
0

1

b
[]

b

0

1

not= 1 g 0 g 0 a
or = 1 g 0 a 0 b

a 0 b
[bzip]

Speed ≈ Compression ratio? No! Why?
A scan of C(s) for each term that contains P

Multi-Pattern Matching Problem
Given a set of patterns P = {P1,P2,…,Pl} of total length m, we want
to find all the occurrences of those patterns in the text T[1,n].
T

A

B

P1

C

A

C

A

P2

B
D

D

A

B

A

 Naïve solution
 Use an (optimal) Exact Matching Algorithm searching each
pattern in P
 Complexity: O(nl+m) time, not good with many patterns

 Optimal solution due to Aho and Corasick
 Complexity: O(n + l + m) time

A simple extention of Shift-And



S is the concatenation of the patterns in P
R is a bitmap of lenght m.




R[i] = 1 iff S[i] is the first symbol of a pattern

Use a variant of Shift-And method searching for
S




For any symbol c, U’(c) = U(c) and R
 U’(c)[i] = 1iff S[i]=c and is the first symbol of a
pattern
For any step j,
 compute M(j)
 then M(j) OR U’(T[j]). Why?
 Set to 1 the first bit of each pattern that start with
T[j]
 Check if there are occurrences ending in j. How?

Problem 3
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
at most k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b

g

1

[]

g

0

g

[not]

0

a

a
b

g

a
or not

S = “bzip or not bzip”
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

Agrep: Shift-And method with errors




We extend the Shift-And method for finding inexact
occurrences of a pattern in a text.
Example:
T = aatatccacaa
P = atcgaa
P appears in T with 2 mismatches starting at position
4, it also occurs with 4 mismatches starting at
position 2.
aatatccacaa
atcgaa

aatatccacaa
atcgaa

Agrep




Our current goal given k find all the occurrences of P
in T with up to k mismatches
We define the matrix Ml to be an m by n binary
matrix, such that:

Ml(i,j) = 1 iff
There are no more than l mismatches between the
first i characters of P match the i characters up
through character j of T.



What is M0?
How does Mk solve the k-mismatch problem?

Computing Mk






We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)
For all l initialize Ml(0) to the zero vector.
In order to compute Ml(j), we observe that there is a
match iff

Computing Ml: case 1


The first i-1 characters of P match a substring of T ending
at j-1, with at most l mismatches, and the next pair of
characters in P and T are equal.

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M ( j - 1))  U (T [ j ])
l

Computing Ml: case 2


The first i-1 characters of P match a substring of T ending
at j-1, with at most l-1 mismatches.

j-1

*

*

*

*

*

*

*

*

*

*

i-1

BitShift ( M

l -1

( j - 1))

Computing Ml


We compute Ml for all l=0, … ,k.
For each j compute M(j), M1(j), … , Mk(j)



For all l initialize Ml(0) to the zero vector.





In order to compute Ml(j), we observe that there is a
match iff

M ( j) 
l

[ BitShift ( M ( j - 1))  U (T ( j ))] 
l

BitShift ( M

l -1

( j - 1))

Example M1
1 2 3 4 5 6 7 8 910

M1=

T=xabxabaaca
P=
abaad

1

2

3

4

5

6

7

8

9

1
0

1

1

1

1

1

1

1

1

1

1

1

2

0

0

1

0

0

1

0

1

1

0

3

0

0

0

1

0

0

1

0

0

1

4

0

0

0

0

1

0

0

1

0

0

5

0

0

0

0

0

0

0

0

1

0

M0=

1

2

3

4

5

6

7

8

9

1
0

1

0

1

0

0

1

0

1

1

0

1

2

0

0

1

0

0

1

0

0

0

0

3

0

0

0

0

0

0

1

0

0

0

4

0

0

0

0

0

0

0

1

0

0

5

0

0

0

0

0

0

0

0

0

0

How much do we pay?





The running time is O(kn(1+m/w)
Again, the method is practically efficient for
small m.
Still only a O(k) columns of M are needed at any
given time. Hence, the space used by the
algorithm is O(k) memory words.

Problem 3: Solution
Dictionary

Given a pattern P find
all the occurrences in S
of all terms containing
P as substring allowing
k mismatches

a

bzip
not
or

b

C(S)
1

a 0 b
[bzip]

b
[]

1

1

b
[]

g

0

g

[not]

g

1

yes
0

a

a

g

b

a
or not

S = “bzip or not bzip” yes
1

space

bzip

space

P = bot k=2

b

g

a
[or]
0

1

b
[]

b

0

1

a 0 b
[bzip]

not= 1 g 0 g 0 a

Agrep: more sophisticated operations


The Shift-And method can solve other ops


The edit distance between two strings p and s is
d(p,s) = minimum numbers of operations
needed to transform p into s via three ops:







Insertion: insert a symbol in p
Delection: delete a symbol from p
Substitution: change a symbol in p with a different one

Example: d(ananas,banane) = 3

Search by regular expressions


Example: (a|b)?(abc|a)

Algoritmi per IR

Some thoughts
on
some peculiar compressors

Variations…
Canonical Huffman still needs to know the codeword
lengths, and thus build the tree…
This may be extremely time/space costly when you
deal with Gbs of textual data
A simple algorithm

Sort pi in decreasing order, and encode si via
the variable-length code for the integer i.

g-code for integer encoding
0 000 ...........0 x in b in ary
Length-1


x > 0 and Length = log2 x +1
e.g., 9 represented as <000,1001>.



g-code for x takes 2 log2 x +1 bits

(ie. factor of 2 from optimal)



Optimal for Pr(x) = 1/2x2, and i.i.d integers

It is a prefix-free encoding…


Given the following sequence of g-coded
integers, reconstruct the original sequence:

0001000001100110000011101100111

8

6

3

59

7

Analysis
Sort pi in decreasing order, and encode si via
the variable-length code g(i).

Recall that: |g(i)| ≤ 2 * log i + 1
How much good is this approach wrt Huffman?
Compression ratio ≤ 2 * H0(s) + 1
Key fact:
1 ≥Si=1,...,x pi ≥ x * px  x ≤ 1/px

How good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
The cost of the encoding is (recall i ≤ 1/pi):



p i  | g ( i ) |

i  1 ,..., S

This is:



i  1 ,.., S

p i  [ 2 * log

1

 1]

pi

 2 * H 0 ( X ) 1
No much worse than Huffman,
and improvable to H0(X) + 2 + ....

A better encoding


Byte-aligned and tagged Huffman






128-ary Huffman tree
First bit of the first byte is tagged
Configurations on 7-bits: just those of Huffman

End-tagged dense code


The rank r is mapped to r-th binary sequence on 7*k bits



First bit of the last byte is tagged

A better encoding
Surprising changes



It is a prefix-code
Better compression: it uses all 7-bits configurations

(s,c)-dense codes
Distribution of words is skewed: 1/iq, where 1 < q < 2


A new concept: Continuers vs Stoppers




The main idea is:






Previously we used: s = c = 128

s + c = 256 (we are playing with 8 bits)
Thus s items are encoded with 1 byte
And s*c with 2 bytes, s * c2 on 3 bytes, ...

An example




5000 distinct words
ETDC encodes 128 + 1282 = 16512 words on 2 bytes
(230,26)-dense code encodes 230 + 230*26 = 6210 on 2
bytes, hence more on 1 byte and thus if skewed...

Optimal (s,c)-dense codes
Find the optimal s, by assuming c = 128-s.


Brute-force approach



Binary search:


On real distributions, it seems that one unique minimum

Ks = max codeword length
Fsk = cum. prob. symb. whose |cw| <= k

Experiments: (s,c)-DC much interesting…

Search is 6% faster than
byte-aligned Huffword

Streaming compression
Still you need to determine and sort all terms….
Can we do everything in one pass ?


Move-to-Front (MTF):






As a freq-sorting approximator
As a caching strategy
As a compressor

Run-Length-Encoding (RLE):


FAX compression

Move to Front Coding
Transforms a char sequence into an integer
sequence, that can then be var-length coded




Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

There is a memory
Properties:

Exploit temporal locality, and it is dynamic


X = 1n 2n 3n… nn  Huff = O(n2 log n), MTF = O(n log n) + n2

No much worse than Huffman
...but it may be far better

MTF: how good is it ?
Encode the integers via d-coding:
|g(i)| ≤ 2 * log i + 1
Put S in the front and consider the cost of encoding:
S

O ( S log S ) 



nx

g(p - p )

x 1 i  2

x

x

i

i -1

S

By Jensen’s:

 O ( S log S ) 

n
x 1

x

[ 2 * log

N

 1]

nx

 O ( S log S )  N * [ 2 * H 0 ( X )  1]
L a [ mtf ]  2 * H 0 ( X )  O (1)

MTF: higher compression
To achieve higher compression we consider words (and
separators) as symbols to be encoded
How to keep efficiently the MTF-list:

Search tree





Hash Table







Leaves contain the symbols, ordered as in the MTF-List
Nodes contain the size of their descending subtree
key is a symbol
data is a pointer to the corresponding tree leaves

Each tree operation takes O(log S) time
Total is O(n log S), where n = #symbols to be compressed

Run Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)
In case of binary strings  just numbers and one bit
Properties:




Exploit spatial locality, and it is a dynamic code
X = 1n 2n 3n… nn 

There is a memory

Huff(X) = Q(n2 log n) > Rle(X) = Q(n log n)

Algoritmi per IR

Arithmetic coding

Arithmetic Coding: Introduction
Allows using “fractional” parts of bits!!
Used in PPM, JPEG/MPEG (as option), Bzip
More time costly than Huffman, but integer
implementation is not too bad.

Arithmetic Coding (message intervals)
Assign each symbol to an interval range from 0
(inclusive) to 1 (exclusive).

e.g.
1.0

c = .3

0.7

i -1

f (i ) 



p( j)

j 1

b = .5
0.2
0.0

f(a) = .0, f(b) = .2, f(c) = .7
a = .2

The interval for a particular symbol will be called
the symbol interval (e.g for b it is [.2,.7))

Arithmetic Coding: Encoding Example
Coding the message sequence: bac
1.0

0.7
c = .3

0.7

c = .3
0.55

0.0

a = .2

0.3
0.2

c = .3

0.27
b = .5

b = .5
0.2

0.3

a = .2

b = .5
0.22

a = .2

0.2

The final sequence interval is [.27,.3)

Arithmetic Coding
To code a sequence of symbols c with probabilities

p[c] use the following:
l 0  0 l i  l i -1  s i -1 * f c i 

s0  1

s i  s i -1 * p c i 

f[c] is the cumulative prob. up to symbol c (not included)
Final interval size is

n

sn 

 p c 
i

i 1

The interval for a message sequence will be called the
sequence interval

Uniquely defining an interval
Important property: The intervals for distinct
messages of length n will never overlap
Therefore by specifying any number in the
final interval uniquely determines the msg.
Decoding is similar to encoding, but on each
step need to determine what the message
value is and then reduce interval

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the
message is of length 3:
1.0

0.7
c = .3

0.7
0.49

b = .5

c = .3

0.55
0.49

0.2

0.0

0.55

b = .5

0.3
a = .2

The message is bbc.

0.2

0.49
0.475

c = .3

b = .5
0.35

a = .2

0.3

a = .2

Representing a real number
Binary fractional representation:
. 75

 . 11

1/ 3

 . 01 01

11 / 16

 . 1011

Algorithm
1.
x = 2 *x
2.
If x < 1 output 0
3.
else x = x - 1; output 1

So how about just using the shortest binary
fractional representation in the sequence
interval.
e.g. [0,.33) = .01

[.33,.66) = .1 [.66,1) = .11

Representing a code interval
Can view binary fractional numbers as
intervals by considering all completions.
m in

m ax

interval

.11

.110

.111

[.75 ,1.0 )

.101

.1010

.1011

[.625 , .75 )

We will call this the code interval.

Selecting the code interval
To find a prefix code, find a binary fractional number
whose code interval is contained in the sequence
interval (dyadic number).

Sequence Interval

.79

.75
Code Interval (.101)

.61

.625

Can use L + s/2 truncated to 1 + log (1/s) bits

Bound on Arithmetic length

Note that –log s+1 = log (2/s)

Bound on Length
Theorem: For a text of length n, the
Arithmetic encoder generates at most
1 + log (1/s) =
= 1 +  log ∏ (1/pi) 
≤2+∑

j=1,n

log (1/pi)

= 2 + ∑k=1,|S| npk log (1/pk)
= 2 + n H0 bits

nH0 + 0.02 n bits in practice
because of rounding

Integer Arithmetic Coding
Problem is that operations on arbitrary
precision real numbers is expensive.
Key Ideas of integer version:




Keep integers in range [0..R) where R=2k
Use rounding to generate integer interval
Whenever sequence intervals falls into top,
bottom or middle half, expand the interval
by a factor 2

Integer Arithmetic is an approximation

Integer Arithmetic (scaling)
If l  R/2 then (top half)
Output 1 followed by m 0s
m=0
Message interval is expanded by 2

All other cases,
just continue...

If u < R/2 then (bottom half)
Output 0 followed by m 1s
m=0
Message interval is expanded by 2

If l  R/4 and u < 3R/4 then (middle half)
Increment m
Message interval is expanded by 2

You find this at

Arithmetic ToolBox
As a state machine

L+s

c

s’
L’
L

(p1,....,pS )
c

ATB

ATB

(L,s)

(L’,s’)

Therefore, even the distribution can change over time

K-th order models: PPM
Use previous k characters as the context.



Makes use of conditional probabilities
This is the changing distribution

Base probabilities on counts:
e.g. if seen th 12 times followed by e 7 times, then
the conditional probability p(e|th) = 7/12.

Need to keep k small so that dictionary does
not get too large (typically less than 8).

PPM: Partial Matching
Problem: What do we do if we have not seen context
followed by character before?


Cannot code 0 probabilities!

The key idea of PPM is to reduce context size if
previous match has not been seen.


If character has not been seen before with current context of
size 3, send an escape-msg and then try context of size 2,
and then again an escape-msg and context of size 1, ….

Keep statistics for each context size < k
The escape is a special character with some probability.


Different variants of PPM use different heuristics for
the probability.

PPM + Arithmetic ToolBox
L+s

s s’
L’
L

p[ s|context ]

s = c or esc

ATB

ATB

(L,s)

(L’,s’)

Encoder and Decoder must know the protocol for selecting
the same conditional probability distribution (PPM-variant)

PPM: Example Contexts
Context
Empty

Counts
A
B
C
$

=
=
=
=

4
2
5
3

Context
A
B
C

Counts
C
$
A
$
A
B
C
$

=
=
=
=
=
=
=
=

3
1
2
1
1
2
2
3

Context
AC

BA
CA
CB
CC

String = ACCBACCACBA B

k=2

Counts
B
C
$
C
$
C
$
A
$
A
B
$

=
=
=
=
=
=
=
=
=
=
=
=

1
2
2
1
1
1
1
2
1
1
1
2

You find this at: compression.ru/ds/

Algoritmi per IR

Dictionary-based algorithms

Lempel-Ziv Algorithms
Keep a “dictionary” of recently-seen strings.
The differences are:
 How the dictionary is stored
 How it is extended
 How it is indexed
 How elements are removed

No explicit
frequency estimation

LZ-algos are asymptotically optimal, i.e. their
compression ratio goes to H(S) for n   !!

LZ77
a a c a a c a b c a b a b a c
Dictionary
???
Cursor
(all substrings starting here)

<2,3,c>

Algorithm’s step:


Output
d = distance of copied string wrt current position
len = length of longest match
c = next char in text beyond longest match



Advance by len + 1

A buffer “window” has fixed length and moves

Example: LZ77 with window
a a c a a c a b c a b a a a c

(0,0,a)

a a c a a c a b c a b a a a c

(1,1,c)

a a c a a c a b c a b a a a c

(3,4,b)

a a c a a c a b c a b a a a c

(3,3,a)

a a c a a c a b c a b a a a c

(1,2,c)

Window size = 6
Longest match

within W

Next character

LZ77 Decoding
Decoder keeps same dictionary window as encoder.


Finds substring and inserts a copy of it

What if l > d? (overlap with text to be compressed)


E.g. seen = abcd, next codeword is (2,9,e)



Simply copy starting at the cursor
for (i = 0; i < len; i++)
out[cursor+i] = out[cursor-d+i]



Output is correct: abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following formats
(0, position, length) or (1,char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so
that next match is better
Hash Table for speed-up searches on triplets
Triples are coded with Huffman’s code

LZ78
Dictionary:
 substrings stored in a trie (each has an id).
Coding loop:
 find the longest match S in the dictionary
 Output its id and the next character c after
the match in the input string
 Add the substring Sc to the dictionary
Decoding:
 builds the same dictionary and looks at ids

LZ78: Coding Example
Output

Dict.

a a b a a c a b c a b c b

(0,a) 1 = a

a a b a a c a b c a b c b

(1,b) 2 = ab

a a b a a c a b c a b c b

(1,a) 3 = aa

a a b a a c a b c a b c b

(0,c) 4 = c

a a b a a c a b c a b c b

(2,c) 5 = abc

a a b a a c a b c a b c b

(5,b) 6 = abcb

LZ78: Decoding Example
Dict.

Input
(0,a) a

1 = a

(1,b) a a b

2 = ab

(1,a) a a b a a

3 = aa

(0,c) a a b a a c

4 = c

(2,c) a a b a a c a b c

5 = abc

(5,b) a a b a a c a b c a b c b

6 = abcb

LZW (Lempel-Ziv-Welch)
Don’t send extra character c, but still add Sc to
the dictionary.
Dictionary:
 initialized with 256 ascii entries (e.g. a = 112)
Decoder is one step behind the coder since it
does not know c


There is an issue for strings of the form
SSc where S[0] = c, and these are handled specially!!!

LZW: Encoding Example
Output

Dict.

a a b a a c a b a b a c b

112 256=aa

a a b a a c a b a b a c b

112 257=ab

a a b a a c a b a b a c b

113

258=ba

a a b a a c a b a b a c b

256

259=aac

a a b a a c a b a b a c b

114

260=ca

a a b a a c a b a b a c b

257

261=aba

a a b a a c a b a b a c b

261

262=abac

a a b a a c a b a b a c b

114

263=cb

LZW: Decoding Example
Input
112

Dict
a

112

a a

256=aa

113

a a b

257=ab

256

a a b a a

258=ba

114

a a b a a c

259=aac

257

a a b a a c a b ?

260=ca

261

261

114

a a b a a c a b a b

261=aba

one
step
later

LZ78 and LZW issues
How do we keep the dictionary small?
 Throw the dictionary away when it reaches a
certain size (used in GIF)




Throw the dictionary away when it is no
longer effective at compressing (e.g. compress)
Throw the least-recently-used (LRU) entry
away when it reaches a certain size (used in
BTLZ, the British Telecom standard)

You find this at: www.gzip.org/zlib/

Algoritmi per IR

Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows

F
#
i
i
i
i
m
p
p
s
s
s
s

(1994)

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

T

A famous example

Much
longer...

A useful tool: L
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i



F mapping

How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...

Take two equal L’s chars
Rotate rightward their rows
Same relative order !!

The BWT is invertible
F
#
i
i
i
i
m
p
p
s
s
s
s

unknown

mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m

L
i
p
s
s
m
#
p
i
s
s
i
i

Two key properties:
1. LF-array maps L’s to F’s chars
2. L[ i ] precedes F[ i ] in T
Reconstruct T backward:

T = .... i ppi #

InvertBWT(L)
Compute LF[0,n-1];
r = 0; i = n;
while (i>0) {
T[i] = L[r];
r = LF[r]; i--;
}

How to compute the BWT ?
SA

BWT matrix

12

#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m

11
8
5
2
1
10
9
7
4
6
3

L
i
p
s
s
m
#
p
i
s
s
i
i

We said that: L[i] precedes F[i] in T

L[3] = T[ 7 ]
Given SA and T, we have L[i] = T[SA[i]-1]

How to construct SA from T ?
SA
12 #
11
8
5
2
1
10
9
7
4
6
3

i#
ippi#
issippi#
ississippi#
mississippi
pi#
ppi#
sippi#
sissippi#
ssippi#
ssissippi#

Elegant but inefficient

Input: T = mississippi#

Obvious inefficiencies:
• Q(n2 log n) time in the worst-case
• Q(n log n) cache misses or I/O faults

Many algorithms, now...

Compressing L seems promising...
Key observation:


L is locally homogeneous

L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L

 Run-Length coding
 Statistical coder
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

An encoding example
T = mississippimississippimississippi
L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16
Mtf = [i,m,p,s]

Mtf = 020030000030030200300300000100000

Mtf = 030040000040040300400400000200000
Bin(6)=110, Wheeler’s code

RLE0 = 03141041403141410210

Alphabet
|S|+1

Bzip2-output = Arithmetic/Huffman on |S|+1 symbols...
... plus g(16), plus the original Mtf-list (i,m,p,s)

You find this in your Linux distribution

Algoritmi per IR

Web-graph Compression

The Web’s Characteristics


Size






1 trillion of pages is available (Google 7/08)
5-40K per page => hundreds of terabytes
Size grows every day!!

Change



8% new pages, 25% new links change weekly
Life time of about 10 days

The Bow Tie

Some definitions


Weakly connected components (WCC)




Set of nodes such that from any node can go to any node via an
undirected path.

Strongly connected components (SCC)


Set of nodes such that from any node can go to any node via a
directed path.

WCC

SCC

Observing Web Graph






We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by




Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links

Why is it interesting?


Largest artifact ever conceived by the human



Exploit its structure of the Web for









Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web


Sociological understanding

Many other large graphs…


Physical network graph





The “cosine” graph (undirected, weighted)





V = static web pages
E = semantic distance between pages

Query-Log graph (bipartite, weighted)





V = Routers
E = communication links

V = queries and URL
E = (q,u) u is a result for q, and has been clicked by some
user who issued q

Social graph (undirected, unweighted)



V = users
E = (x,y) if x knows y (facebook, address book, email,..)

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)
Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1

The In-degree distribution

Altavista crawl, 1999
Indegree follows power law distribution

WebBase Crawl 2001
Pr[ in - degree ( u )  k ] 

a  2.1

1

k

a

A Picture of the Web Graph

Definition
Directed graph G = (V,E)


V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:


Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1



Locality: usually most of the hyperlinks point to other URLs on
the same host (about 80%).



Similarity: pages close in lexicographic order tend to share
many outgoing lists

A Picture of the Web Graph
j

i

21 millions of pages, 150millions of links

URL-sorting
Berkeley
Stanford

URL compression + Delta encoding

The library WebGraph
Uncompressed
adjacency list

Adjacency list with
compressed gaps

(locality)

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Copy-lists

Reference chains
possibly limited

Uncompressed
adjacency list

D
Adjacency list with
copy lists
(similarity)

Each bit of y informs whether the corresponding successor of y is also a
successor of the reference x;
The reference index is chosen in [0,W] that gives the best compression.

Copy-blocks = RLE(Copy-list)
Adjacency list with
copy lists.

Adjacency list with
copy blocks
(RLE on bit sequences)

The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the length…);
The length is decremented by one for all blocks

This is a Java and C++ lib
(≈3 bits/edge)

Extra-nodes: Compressing Intervals
Adjacency list with
copy blocks.

Consecutivity

3

in extra-nodes

Intervals: use their left extreme and length
Int. length: decremented by Lmin = 2
Residuals: differences between residuals,
or the source

0 = (15-15)*2 (positive)
2 = (23-19)-2 (jump >= 2)
600 = (316-16)*2
3 = |13-15|*2-1 (negative)
3018 = 3041-22-1

Algoritmi per IR

Compression of file collections

Background
data
knowledge
about data
at receiver

sender

receiver

 network links are getting faster and faster but
 many clients still connected by fairly slow links (mobile?)
 people wish to send more and more data

how can we make this transparent to the user?

Two standard techniques




caching: “avoid sending the same object again”


done on the basis of objects



only works if objects completely unchanged



How about objects that are slightly changed?

compression: “remove redundancy in transmitted data”


avoid repeated substrings in data



can be extended to history of past transmissions



How if the sender has never seen data at receiver ?

(overhead)

Types of Techniques


Common knowledge between sender & receiver




Unstructured file: delta compression

“partial” knowledge



Unstructured files: file synchronization
Record-based data: set reconciliation

Formalization


Delta compression


Compress file f deploying file f’



Compress a group of files





Speed-up web access by sending differences between the requested
page and the ones available in cache

File synchronization





[diff, zdelta, REBL,…]

[rsynch, zsync]

Client updates old file fold with fnew available on a server
Mirroring, Shared Crawling, Content Distr. Net

Set reconciliation


Client updates structured old file fold with fnew available on a server



Update of contacts or appointments, intersect IL in P2P search engine

Z-delta compression

(one-to-one)

Problem: We have two files fknown and fnew and the goal is
to compute a file fd of minimum size such that fnew can
be derived from fknown and fd


Assume that block moves and copies are allowed



Find an optimal covering set of fnew based on fknown



LZ77-scheme provides and efficient, optimal solution




fknown is “previously encoded text”, compress fknownfnew starting from fnew

zdelta is one of the best implementations
Emacs size

Emacs time

uncompr

27Mb

---

gzip

8Mb

35 secs

zdelta

1.5Mb

42 secs

Efficient Web Access
Dual proxy architecture: pair of proxies located on each side of the slow
link use a proprietary protocol to increase performance over this link

Client
reference



request
Slow-link

Delta-encoding

Proxy
reference

request
Fast-link

web

Page

Use zdelta to reduce traffic:


Old version available at both proxies



Restricted to pages already visited (30% hits), URL-prefix match
Small cache

Cluster-based delta compression
Problem: We wish to compress a group of files F


Useful on a dynamic collection of web pages, back-ups, …



Apply pairwise zdelta: find for each f  F a good reference



Reduction to the Min Branching problem on DAGs


Build a weighted graph GF, nodes=files, weights= zdelta-size



Insert a dummy node connected to all, and weights are gzip-coding



Compute the min branching = directed spanning tree of min tot cost, covering
G’s nodes.

0

620

2

20

123

2000

20
220

3

1
5

space

time

uncompr

30Mb

---

tgz

20%

linear

THIS

8%

quadratic

Improvement

What about many-to-one compression? (a group of files)

Problem: constructing G_F is very costly: n² edge calculations (zdelta executions)

 We wish to exploit some pruning approach:

 Collection analysis: cluster the files that appear similar, and are thus good
candidates for zdelta compression; build a sparse weighted graph G’_F containing
only the edges between those pairs of files

 Assign weights: estimate appropriate edge weights for G’_F, thus saving zdelta
executions. Nonetheless, this still takes n² time
          space    time
uncompr   260 MB   ---
tgz       12%      2 mins
THIS      8%       16 mins

Algoritmi per IR

File Synchronization

File synch: the problem

[Diagram: the Client holds f_old and sends a request to the Server, which holds f_new and sends back an update.]

 the client wants to update an out-dated file
 the server has the new file but does not know the old file
 update without sending the entire f_new (exploiting their similarity)
 rsync: file-synch tool, distributed with Linux

Delta compression is a sort of “local” synch,
since the server has both copies of the files.

The rsync algorithm

[Diagram: the Client (holding f_old) sends block hashes to the Server (holding f_new); the Server replies with an encoded file made of block references and literal bytes.]

The rsync algorithm (contd)

 simple, widely used, single roundtrip
 optimizations: 4-byte rolling hash + 2-byte MD5, gzip for the literals
 choice of the block size is problematic (default: max{700, √n} bytes)
 not good in theory: the granularity of changes may disrupt the use of blocks
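
A toy sketch of the block-matching idea, with simplifying assumptions: a naive additive checksum stands in for rsync's real rolling hash (and is recomputed per window instead of being updated in O(1) per shift), full MD5 digests replace the truncated ones, and literals are not gzip-compressed. Names and the tiny block size are illustrative.

    import hashlib

    BLOCK = 8   # toy block size; rsync defaults to max{700, sqrt(n)} bytes

    def client_signatures(f_old: bytes):
        # Client side: hash every block of f_old (weak checksum + strong hash).
        sigs = {}
        for idx in range(len(f_old) // BLOCK):
            block = f_old[idx * BLOCK:(idx + 1) * BLOCK]
            weak = sum(block) & 0xFFFF   # stand-in for the rolling checksum
            sigs.setdefault(weak, []).append((hashlib.md5(block).hexdigest(), idx))
        return sigs

    def server_encode(f_new: bytes, sigs):
        # Server side: slide a window over f_new, emitting block references or literals.
        out, literal, i = [], bytearray(), 0
        while i + BLOCK <= len(f_new):
            window = f_new[i:i + BLOCK]
            weak = sum(window) & 0xFFFF
            match = next((idx for strong, idx in sigs.get(weak, ())
                          if hashlib.md5(window).hexdigest() == strong), None)
            if match is not None:
                if literal:
                    out.append(("LIT", bytes(literal)))
                    literal = bytearray()
                out.append(("COPY", match))   # "copy block #match of f_old"
                i += BLOCK
            else:
                literal.append(f_new[i])      # no match: shift the window by one byte
                i += 1
        literal.extend(f_new[i:])
        if literal:
            out.append(("LIT", bytes(literal)))
        return out

    f_old = b"abcdefghijklmnopqrstuvwxyz" * 3
    f_new = f_old[:32] + b"XYZ" + f_old[32:]   # small insertion in the middle
    print(server_encode(f_new, client_signatures(f_old)))
    # mostly COPY references, plus short literals around the insertion and the tail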

Rsync: some experiments

          gcc size   emacs size
total     27288      27326
gzip       7563       8577
zdelta      227       1431
rsync       964       4452

Compressed sizes in KB (slightly outdated numbers)

Factor 3-5 gap between rsync and zdelta !!

A new framework: zsync

 The server sends the hashes (unlike rsync, where the client sends them); the client checks them
 The server deploys the common f_ref to compress the new f_tar (rsync just compresses it)

A multi-round protocol

 k blocks of n/k elements
 log(n/k) levels
 If the distance is k, then on each level ≤ k hashes fail to find a match in the other file
 The communication complexity is O(k lg n lg(n/k)) bits
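
A minimal sketch of the multi-round idea (a hypothetical recursive protocol, not zsync's wire format), assuming the two files have equal length and differ only by in-place changes: corresponding blocks are hashed on both sides, and only the blocks whose hashes disagree are split further on the next level, so when the files differ in k places at most about k blocks survive per level.

    import hashlib

    def block_hashes(data: bytes, k: int):
        # Split data into (at most) k blocks and return (offset, hash, size) triples.
        size = max(1, (len(data) + k - 1) // k)
        return [(i, hashlib.md5(data[i:i + size]).digest(), size)
                for i in range(0, len(data), size)]

    def sync_rounds(f_server: bytes, f_client: bytes, k: int = 4):
        # Returns the number of rounds and the byte ranges the client must download.
        rounds, to_check, missing = 0, [(0, len(f_server))], []
        while to_check:
            rounds += 1
            nxt = []
            for lo, hi in to_check:
                for off, h, size in block_hashes(f_server[lo:hi], k):
                    start, end = lo + off, min(lo + off + size, hi)
                    if hashlib.md5(f_client[start:end]).digest() == h:
                        continue                      # hashes agree: nothing to send
                    if end - start <= 16:
                        missing.append((start, end))  # small enough: just fetch the bytes
                    else:
                        nxt.append((start, end))      # disagree: split again next round
            to_check = nxt
        return rounds, missing

    f_client = bytes(1000)                               # old copy: all zeros
    f_server = bytes(500) + b"\x01" * 8 + bytes(492)     # new copy: one small change
    print(sync_rounds(f_server, f_client, k=4))
    # (3, [(500, 516)]) : the number of rounds plays the role of the log(n/k) levels above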

Next lecture

Set reconciliation
Problem: given two sets S_A and S_B of integer values, located on two
machines A and B, determine the difference between the two sets at one
or both of the machines.

Requirements: the cost should be proportional to the size k of the
difference, where k may or may not be known in advance to the parties.

Note:
 set reconciliation is “easier” than file synch [it is record-based]

Not perfectly true, but...

Recurring minimum for improving the estimate (+ 2 SBF)


Algoritmi per IR

Text Indexing

What do we mean by “Indexing” ?
 Word-based indexes, here a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.

 Full-text indexes, no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree,...

How do we solve Prefix Search?

Trie !!
Array of string pointers !!
What about Substring Search ?

Basic notation and facts
Pattern P occurs at position i of T
iff
P is a prefix of the i-th suffix of T (i.e. T[i,N])

[Figure: P aligned under T at position i, matching a prefix of the suffix T[i,N].]

Occurrences of P in T = all suffixes of T having P as a prefix

Example: P = si occurs in T = mississippi at positions 4 and 7

SUF(T) = sorted set of the suffixes of T

Reduction: from substring search to prefix search (over SUF(T))
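
A tiny illustrative sketch of this reduction: substring search for P in T becomes prefix search for P over the suffixes of T (positions are 1-based, as in the slides).

    def occurrences(T: str, P: str):
        # P occurs at position i of T  iff  P is a prefix of the suffix T[i..N]
        return [i + 1 for i in range(len(T)) if T[i:].startswith(P)]

    print(occurrences("mississippi", "si"))   # [4, 7]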

The Suffix Tree

T# = mississippi#

[Figure: the suffix tree of T#. Edges are labeled with substrings of T# (e.g. i, s, si, ssi, p, pi#, ppi#, i#, #, mississippi#); each of the 12 leaves stores the starting position of the suffix it spells out.]

The Suffix Array
Prop 1. All suffixes in SUF(T) with prefix P are contiguous.
Prop 2. Their starting position is the lexicographic position of P.

Storing SUF(T) explicitly takes Θ(N²) space  store only the suffix pointers, SA.

T = mississippi#        P = si

SA    SUF(T)
12    #
11    i#
 8    ippi#
 5    issippi#
 2    ississippi#
 1    mississippi#
10    pi#
 9    ppi#
 7    sippi#
 4    sissippi#
 6    ssippi#
 3    ssissippi#

Suffix Array space:
• SA: Θ(N log2 N) bits
• Text T: N chars
 In practice, a total of 5N bytes

Searching a pattern
Indirect binary search on SA: O(p) time per suffix comparison, 2 accesses per step

T = mississippi#        P = si

[Figure: two steps of the binary search over SA = 12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3; at one probe P turns out to be larger than the probed suffix, at the next one smaller, and the range keeps narrowing.]

Suffix Array search
• O(log2 N) binary-search steps
• Each step takes O(p) char comparisons
 overall, O(p log2 N) time
• Improvable to O(p + log2 N) [Manber-Myers, ’90] and to O(p + log2 |Σ|) [Cole et al., ’06]

Locating the occurrences

T = mississippi#,   P = si,   occ = 2

[Figure: binary searches for the two boundary patterns si# and si$ (where # < Σ < $) bracket the contiguous SA range containing sippi# and sissippi#, i.e. the occ = 2 occurrences at positions 7 and 4.]

Suffix Array search
• O(p + log2 N + occ) time

Suffix Trays: O(p + log2 |Σ| + occ)     [Cole et al., ’06]
String B-tree                           [Ferragina-Grossi, ’95]
Self-adjusting Suffix Arrays            [Ciriani et al., ’02]
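
A compact, illustrative sketch of the plain binary-search approach (the suffix array is built naively by sorting, not with a linear-time construction; the '\xff' character plays the role of the '$' sentinel under the assumption that it is larger than every text character):

    import bisect

    def suffix_array(T: str):
        # Naive construction: sort the (1-based) starting positions by their suffixes.
        return sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])

    def search(T: str, SA, P: str):
        # Binary search for the contiguous SA range of suffixes having P as a prefix.
        suffixes = [T[i - 1:] for i in SA]              # materialized only for brevity
        lo = bisect.bisect_left(suffixes, P)            # first suffix >= P
        hi = bisect.bisect_right(suffixes, P + "\xff")  # "\xff" acts like the '$' sentinel
        return sorted(SA[lo:hi])                        # the occ occurrence positions

    T = "mississippi#"
    SA = suffix_array(T)
    print(SA)                    # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
    print(search(T, SA, "si"))   # [4, 7]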

Text mining
Lcp[1,N-1] = length of the longest common prefix between suffixes adjacent in SA

T = mississippi#

SA    Lcp   SUF(T)
12          #
11     0    i#
 8     1    ippi#
 5     1    issippi#
 2     4    ississippi#
 1     0    mississippi#
10     0    pi#
 9     1    ppi#
 7     0    sippi#
 4     2    sissippi#
 6     1    ssippi#
 3     3    ssissippi#

e.g. Lcp = 4 between the adjacent suffixes issippi# and ississippi#

• How long is the common prefix between T[i,...] and T[j,...]?
  It is the min of the subarray Lcp[h,k-1], where SA[h]=i and SA[k]=j.
• Does there exist a repeated substring of length ≥ L?
  Search for an entry Lcp[i] ≥ L.
• Does there exist a substring of length ≥ L occurring ≥ C times?
  Search for a window Lcp[i, i+C-2] whose entries are all ≥ L.
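
A small illustrative sketch tying these queries together, with a naive quadratic LCP computation (a linear-time construction such as Kasai's algorithm would be used in practice); all function names are illustrative.

    def lcp_len(a: str, b: str) -> int:
        n = 0
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        return n

    def lcp_array(T: str, SA):
        # Lcp[i] = lcp between the i-th and the (i+1)-th suffix in SA order.
        return [lcp_len(T[SA[i] - 1:], T[SA[i + 1] - 1:]) for i in range(len(SA) - 1)]

    def has_repeat(Lcp, L: int) -> bool:
        # Is there a repeated substring of length >= L?  <=>  some Lcp entry >= L.
        return any(x >= L for x in Lcp)

    def has_frequent_substring(Lcp, L: int, C: int) -> bool:
        # Is there a substring of length >= L occurring >= C times?
        # <=> some window Lcp[i .. i+C-2] has all entries >= L.
        w = C - 1
        return any(all(x >= L for x in Lcp[i:i + w]) for i in range(len(Lcp) - w + 1))

    T = "mississippi#"
    SA = sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])   # naive SA, as before
    Lcp = lcp_array(T, SA)
    print(Lcp)                                  # [0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3]
    print(has_repeat(Lcp, 4))                   # True: "issi" occurs twice
    print(has_frequent_substring(Lcp, 2, 2))    # True: e.g. "si" occurs twice
    print(has_frequent_substring(Lcp, 2, 3))    # False: no length-2 substring occurs 3 times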